I'm not a professional statistician. In a previous life, as an undergraduate, I did study some statistics but that was a long time ago, in a galaxy far, far away. Since then, I've kept well clear of statistics. Readers should keep that in mind.

Whatever my limitations and foibles, however, I'm a curious bloke and the data set below (scroll to the very bottom) caught my attention when I found it. Have a look at it.

What do those 579 numbers mean, you may ask? Let's leave the precise details out for now. Although obviously important, I think we may profitably proceed for a while with the basic statistics of that data, unencumbered by their detailed interpretation. Trust me, that will come later.

Let me just say that it's supposed to be real data, obtained from an online survey. Think of it as how many "likes" a series of options get: a "10" in that data set, for example, means that option was liked 10 times.

----------

The first step when dealing with data, statisticians advise, is to visualise it.

Sounds good to you?

Okay. A first possibility is a histogram. A professional statistical package would be ideal, but it may be unavailable. Lacking that, one has two options: (1) to use a spreadsheet package, if one already has one installed on one's machine, or (2) to use a free online application.

Let's begin with (1), as I use LibreOffice, which comes with Calc (a kind of free Microsoft Excel clone with most of Excel's functionality). Spreadsheets are convenient: many people already use them.

This is my first attempt.

[Chart: bar chart of the "likes" data]

Not very good, is it?

Before going into the information that chart reveals, it may be worth a general comment on what one needs to do to get that chart, poor as it is. As far as I know, spreadsheet packages do not implement histograms, but bar charts. Although visually similar, they aren't the same thing. That's why I present the same data using a bar chart (above) and a scatter plot (below). Soon you'll understand why.

[Chart: scatter plot of the same data]

The data both charts display took a little previous number crunching. I counted how many 1s, for instance, were present in the data set (97, it turns out). Then I did that for each number up to 3,253. That's what those charts show.
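For readers who would rather script that counting step than do it in a spreadsheet, here is a minimal sketch in Python. It is only an illustration: `sample` holds just the first 20 numbers of the data set at the bottom of this post, and the function name `count_likes` is made up for the example.

```python
from collections import Counter

# First 20 values of the data set (illustration only, not the full 579).
sample = [5, 9, 5, 4, 1, 20, 2, 1, 1, 1, 1, 4, 3, 27, 4, 3253, 3, 1, 2, 1]


def count_likes(values):
    """Return a mapping: number of "likes" -> how many times it occurs."""
    return Counter(values)


counts = count_likes(sample)

# Print the frequency table in ascending order of "likes".
for likes in sorted(counts):
    print(likes, counts[likes])
```

Run on the whole data set, the same function should give the counts the charts display.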

Whatever chart you consider, the data set is arranged in a general "L" shape: most data points to the left, a few sparse ones to the right, along the range of the distribution (i.e. the maximum minus the minimum values). This means that the "likes" distribution is positively skewed. To be sure, this conveys some information about the whole data set and particularly those few data points near its upper end, but it tells nothing about the many data points crammed in the lower reaches of the "likes" scale.

Here is where a key difference between a histogram (and a scatter plot) and a bar chart appears. One can't do this with a bar chart:

[Chart: scatter plot of the same data, with a log horizontal scale]

It looks different, doesn't it? Still, that's the same data, over the same range: only the horizontal scale changed, from linear to log. This "blows up" or "magnifies" the lower values in the "likes" scale, and "shrinks" or "compresses" the higher values. It's still an "L", but now the vertical bit is "fatter"; the horizontal bit, "shorter". It tells a lot more about the "lower reaches". For those interested: James Hamilton (Econbrowser) explains more about logarithms.
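A rough way to see what the log scale does, sketched in Python under the assumption of a base-10 log axis (the chart's actual base isn't stated, and the list of values is picked by hand for illustration):

```python
import math

# A few "likes" values spanning the whole range of the data set.
values = [1, 10, 100, 1000, 3253]

# Position of each value along a base-10 log axis.
positions = [math.log10(v) for v in values]

for v, p in zip(values, positions):
    print(f"{v:>5} likes -> position {p:.2f}")
```

Going from 1 to 10 "likes" takes up as much room on the axis as going from 10 to 100, or from 100 to 1,000, and the maximum of 3,253 sits only about three and a half of those steps from the minimum.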

----------

A digression: You can see another difference between a histogram and a bar chart below.

[Chart: 2013 US household income distribution, by Sarah Donovan (US Congressional Research Service)]

Sarah Donovan of the US Congressional Research Service prepared that chart. That report is meant to instruct Congress people and their staffers about income distribution. Maybe its intended readership won't appreciate her efforts. You should: it does that very well. That chart shows the 2013 US household income distribution, based on US Census data (like our "likes" distribution, income distribution is also very positively skewed).

Look at the horizontal axis. It represents income "bins" (intervals). For example, the second bin includes households with annual incomes between US$5,000 and 9,999 (inclusive): about 4% of American households in 2013. The difference (i.e. the bin's width) is US$4,999: the same for every other bin, except the first and the last two (see the note). The second last, for example, starts at US$200,000 and goes to 249,999: its width is therefore US$49,999, 10 times that of the previous bins. We can't say how wide the last one is: a household goes into that bin if its annual income is at least US$250,000. It also goes there if its income is 2,500,000 or 25,000,000: it has no upper limit.

There are reasons for that. Donovan explains the confidentiality concerns, applying to the last bin (which additionally precludes the use of log scale). Although the data must exist somewhere, that implies that the range of the distribution is unknown to the public.

I'd add that to express the second last bin in the US$4,999 scale would make it 10 times wider: either one extends the chart towards the right or one "shrinks" the whole chart. That's not practical.

That chart behaves like a histogram for those bins represented in the US$4,999 scale. It does not in the other three bins: in them it's a bar chart. That's why it has those otherwise strange-looking tall bars at the right end. The second last bin, for example, condenses in a given location in the chart a relative frequency (over 2%) that should be scattered over an area 10 times wider.
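The difference between a bar's height and what a true histogram would show can be sketched as a one-line calculation: divide the bin's share by its width. The snippet below is an illustration only. The function name `density` is made up, widths are measured in "standard" bins, and the 2.0 for the second-last bin is a stand-in for the "over 2%" read off the chart.

```python
def density(share, width_in_standard_bins):
    """Share of households per standard-width bin (the histogram height)."""
    return share / width_in_standard_bins


# (label, share of households in %, width measured in standard bins)
bins = [
    ("standard-width bin", 4.0, 1),  # e.g. the second bin, about 4%
    ("second-last bin", 2.0, 10),    # "over 2%"; 2.0 is a stand-in value
]

for label, share, width in bins:
    height = density(share, width)
    print(f"{label}: bar height {share}%, histogram height {height:.2f}%")
```

Spread over 10 standard bins, that tall bar would flatten to roughly a tenth of its height.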

(This, incidentally, is where the work of Thomas Piketty and associates fills an important need: it provides a glimpse within those two otherwise undifferentiated tall bars.)

Donovan's approach is obviously different to the "log scale" approach, but they share plenty: they focus analytical attention on the lower part of the scale. A loss of detail in the upper part of the scale is the price paid. Highly unequal distributions can be tricky.

----------

That digression, I hope, was in itself of interest to readers worried about income or wealth inequality, but it's also relevant to the problem at hand: a very unequally distributed variable ("likes").

So far, the three data visualisations employed seem less than satisfactory. The best one, the third [the scatter plot of counts vs log("likes")] ticks two important boxes: it shows the whole range of the distribution and it also displays details about the left end of the "likes" distribution. It, however, presents the log scale which may intimidate some readers (it doesn't look like a histogram at all, either).

Donovan's "hybrid" approach could be useful:

[Chart: the "likes" data in 34 bins of 1 "like" each, plus one bar for the rest]

As in her chart, that bar at the right end of the horizontal axis represents the data not allocated to the first 34 bins. The first bins account for 522 out of the 579 numbers in the data set (90.2% of the total count). The right-most bar represents the other 57 or 9.8%. Paraphrasing the Occupy Wall Street movement: the right-most bar is the "10%", we to the left are the "90%".

Provided readers remember that (1) all bins up to the 34th are equally wide (1 "like") and (2) the last bin is in fact 3,219 times wider than its graphical representation above (and therefore, that bar concentrates data points that should be unevenly scattered) this visualisation may usefully replace the third chart. But it's up to the readers.
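The "hybrid" binning can be sketched in Python as follows. This is my reading of the approach, not necessarily how the chart was produced: the function name `hybrid_bins` is hypothetical, `sample` is just the first 20 numbers of the data set, and the code assumes every value is a whole number of at least 1.

```python
from collections import Counter


def hybrid_bins(values, cutoff):
    """Count each value from 1 to cutoff separately; lump the rest together."""
    counts = Counter(v for v in values if v <= cutoff)
    unit_bins = [counts.get(k, 0) for k in range(1, cutoff + 1)]
    overflow = sum(1 for v in values if v > cutoff)
    return unit_bins, overflow


# First 20 values of the data set (illustration only); the chart uses cutoff=34.
sample = [5, 9, 5, 4, 1, 20, 2, 1, 1, 1, 1, 4, 3, 27, 4, 3253, 3, 1, 2, 1]
unit_bins, overflow = hybrid_bins(sample, cutoff=34)
print(unit_bins[0], overflow)

# Shares quoted in the text: 522 of the 579 values fall in the first 34 bins.
share_first_bins = 522 / 579
share_last_bar = 57 / 579
print(f"{share_first_bins:.1%} {share_last_bar:.1%}")
```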

Whatever the visualisation readers prefer, the picture that emerges is that the further down the "likes" scale, the more "densely populated" it gets. There is a peak at 1 "like" (97 data points); after 5 "likes" the mountain transitions into a piedmont; a plain at "sea level" (0 data points) follows, initially with bumps (2 or more frequently 1 data point high): these become more and more spaced as we walk "eastwards".

Readers are invited to try free online histogram generators. I tried these two: P. Wessa's Histogram, and Shodor's. Shodor's allows for live modification of the histogram and seems more user-friendly; Wessa's can generate a histogram automatically (from default values) and in addition produces a frequency table, from which I extracted this:

Frequency Table (Histogram)
=======================================
               Abs.        Rel.
Bins           Frequency   Frequency
---------------------------------------
[0, 500[       573         0.989637
[500, 1000[    4           0.006908
[1000, 1500[   1           0.001727
[1500, 2000[   0           0
[2000, 2500[   0           0
[2500, 3000[   0           0
[3000, 3500]   1           0.001727
---------------------------------------

Source: Wessa P., (2017), Histogram (v1,.19) in Free Statistics Software (v1.2.1), Office for Research Development and Education, URL
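The binning behind a table like that can be approximated with integer division. The sketch below is not Wessa's code: `bin_index` is a made-up helper, and it only places the six largest values, relying on the fact that every other value in the data set is below 500.

```python
def bin_index(value, width=500):
    """Index of the bin [k*width, (k+1)*width[ that contains value."""
    return value // width


# The six largest values in the data set.
top_six = [3253, 1080, 904, 708, 607, 591]

tail_counts = {}
for v in top_six:
    k = bin_index(v)
    tail_counts[k] = tail_counts.get(k, 0) + 1

# Everything else (579 - 6 values) is below 500 and lands in bin 0.
first_bin = 579 - len(top_six)
print(first_bin, tail_counts)
print(round(first_bin / 579, 6))
```

One difference from the table: its last bin is closed on the right, whereas this helper would send a value of exactly 3,500 to the next bin. No value in this data set is affected.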

Ironically, to do data visualisation often requires some number crunching. Thanks to a lucky set of default values, that table also allows for a better OWS paraphrase: the first 573 data points are the "99%"; the top 6 data points (3,253 "likes", 1,080, 904, 708, 607, 591) are the "1%". Indeed, the top-most data point (3,253 "likes") is roughly the "0.1%". Cool, huh?

Those off the cuff remarks lead to something important. So far, we have focused on the "90%" or the "99%". Let's not forget, however, the top of the distribution in general. Soon you'll see why that's important.

----------

Like Excel, Calc can produce a table with descriptive statistics, too. An extract is below. It shows, among others, the mode (1 "like": the "peak"), the median (4), the mean (29.6): the order of those three measures is typical of positively-skewed distributions (like income and wealth). Usually the larger the mean with respect to the median, the more positively skewed the distribution.

Descriptive Statistics
============================
Mean             29.6
Mode             1
Minimum          1
First Quartile   2
Median           4
Third Quartile   8
Maximum          3,253
Range            3,252
Sum              17,147
Count            579

It also presents other numerical measures of shape, location, and dispersion. Of particular interest are the quartiles: the 1st, the median (the 2nd), the 3rd, and the 4th, which is the maximum (3,253, as we already know).
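The typical ordering of the three measures of location (mode, then median, then mean) can be checked with Python's standard `statistics` module. A small sketch, with two caveats: `sample` is only the first 20 numbers of the data set, so its median and mean differ from those in the table, and the full-sample mean is recomputed from the Sum and Count rows rather than from the raw data.

```python
import statistics

# First 20 values of the data set (illustration only).
sample = [5, 9, 5, 4, 1, 20, 2, 1, 1, 1, 1, 4, 3, 27, 4, 3253, 3, 1, 2, 1]

mode = statistics.mode(sample)
median = statistics.median(sample)
mean = statistics.mean(sample)

# In a positively skewed sample one expects mode < median < mean.
positively_skewed_order = mode < median < mean
print(mode, median, mean, positively_skewed_order)

# Full-sample mean, from the Sum and Count rows of the table.
full_mean = 17147 / 579
print(round(full_mean, 1))
```

Even in this tiny sample the single 3,253 drags the mean far above the median.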

I built the table below, for instance, with the help of those statistics. It shows how sensitive the mean is to the "1%". The baseline mean is 29.6 "likes", calculated from the entire data set. Remove the top data point (3,253 "likes") and the mean falls 5.6 "likes" to 24.0. Remove then both the top and the second data points and the mean falls again: this time to 22.2 "likes"; and so on. Removing the 6 top data points would reduce the mean, from 29.6 "likes" to 17.5: a 41% reduction (without ever changing the median or the mode!).

In other words, 1% of the sample determines 41% of the sample's mean.

Mean and the "1%"
=====================
Mean   Contribution
---------------------
29.6
24.0   5.6
22.2   7.4
20.7   8.9
19.5   10.1
18.5   11.2
17.5   12.2
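The figures in that table can be reproduced from the Sum and Count rows of the descriptive statistics plus the six largest values. A minimal sketch, in which `trimmed_means` is a hypothetical helper name:

```python
# Summary figures taken from the Descriptive Statistics table.
TOTAL_LIKES = 17147
COUNT = 579

# The six largest values in the data set, largest first.
top_six = [3253, 1080, 904, 708, 607, 591]

baseline = TOTAL_LIKES / COUNT


def trimmed_means(total, count, top_values):
    """Mean after removing the top 1, top 2, ... values, in that order."""
    means = []
    for removed, value in enumerate(top_values, start=1):
        total -= value
        means.append(total / (count - removed))
    return means


means = trimmed_means(TOTAL_LIKES, COUNT, top_six)
for m in means:
    # Columns: mean after removal, fall with respect to the baseline.
    print(f"{m:5.1f} {baseline - m:5.1f}")

# Share of the baseline mean that disappears with the "1%".
share = (baseline - means[-1]) / baseline
print(f"{share:.0%}")
```

The contributions are computed from unrounded means, which is why the 11.2 in the table is not exactly 29.6 minus 18.5.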

There's more. Similar calculations would reveal other disproportionate effects. From 3,252 "likes", for example, the range would collapse to 480 with the removal of the "1%": an 85% fall! Recall that the large range was one of the causes behind the difficulties visualising that data. Consider additionally that in a distribution of strictly positive values like this, the range itself is an index of inequality.

Just another set of figures to put this in historical perspective. Vilfredo Pareto was the first close student of income/wealth inequality. The Pareto principle, named after him, would have it that 80% of the income in a community goes to 20% of the members of that community. If one assimilated "likes" with money, one would find that the top 7.4% of the data set accounts for 80% of "likes". Believe it or not.
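A Pareto-style check of this kind can be sketched as follows: sort the values from largest to smallest and count how many are needed to reach a given share of the total. The function name `top_share_needed` is made up, and the lists are toy data chosen so the answer is obvious. Feeding it the data set at the bottom of the post is left to the reader.

```python
def top_share_needed(values, percent=80):
    """Smallest number of top values whose sum reaches `percent`% of the total.

    Returns (k, k / n): the count and the fraction of the data set.
    Assumes a non-empty list of non-negative numbers and 0 < percent <= 100.
    """
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    running = 0
    for k, value in enumerate(ordered, start=1):
        running += value
        # Comparing products avoids rounding surprises from percent / 100.
        if running * 100 >= percent * total:
            return k, k / len(ordered)
    return len(ordered), 1.0


# Toy example: one value holds 80 of the 100 "likes".
toy = [80, 5, 5, 5, 5]
k, fraction = top_share_needed(toy)
print(k, fraction)

# A perfectly equal distribution needs 80% of the members to reach 80%.
print(top_share_needed([1, 1, 1, 1, 1]))
```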

Extreme values are important. Damn important. Fucking important. When faced with highly unequal distributions one had better be really, really, really sure those extreme values reflect something real. Otherwise one might either mislead one's readers, if they are gullible, or make a fool of oneself, if they aren't.

There are more structured methods, of course, but if one looked for outliers, those 6 would be the first suspects, yes? Moreover, the further to the right, the more suspect.

----------

The figure below represents a box plot (aka box-and-whisker plot), familiar to statisticians but less popular among the lay public. Although spreadsheet packages don't implement it, it was made with Calc. It took some extra work and its daily use would be impractical (but it was fun to make and it gives me the chance to show off how to create one). It displays a log vertical axis; because of that, it conveys a view of the whole distribution of "likes":

[Chart: box plot of the "likes" data made with Calc, log vertical axis]

That is the most basic box plot: the upper whisker (the little "T"-shaped top end) represents the 3,253 "likes" data point: that is also the 4th quartile and maximum of the distribution. The lid of the "box" represents the 3rd quartile: in "likes" that is 8 (below it is 75% of the sample). The line dividing the yellow and green areas is the median (the 2nd quartile): 4 "likes". The bottom of the box is the 1st quartile (below it lies 25% of the sample): 2 "likes".

The "box" contains, therefore, 50% of the sample. The upside-down "T" at the bottom represents the minimum of the "likes" scale: 1 "like". As we've seen, it's the most frequent result (i.e. the mode).

Even on a log scale this is clearly a very asymmetric distribution around the median. It gets a whole lot worse when we move back to a linear scale (check against the Descriptive Statistics table, above): in words, 25% of the sample gets between 1 and 2 "likes"; the next quarter gets between 2 and 4; the third gets between 4 and 8; and the last between 8 and 3,253 (yes, that's right: three thousand two hundred fifty three, believe it or not). Welcome to inequality.

Instead of going through all the trouble making that diagram, however, one can instead use a web-based generator. It's a lot easier. The one I employed creates the plot, and, as a bonus, applies by default more structured methods designed to determine outliers (it also explains in detail the construction of a box plot; check it out). This is its output:

[Chart: box plot of the "likes" data from the web-based generator]

Personally, I find it less attractive than mine. Visually, it looks more symmetric, but that's an artefact of the log scale and the removal of outliers (the black dots). The algorithm flags not only the 6 data points in the "1%" but a whole bunch of them (89, to be precise) as possible outliers: 3,253, 1,080, 904, 708, 607, 591, 481, 469, 456, 383, 336, 320, 319, 319, 268, 267, 250, 246, 216, 211, 208, 205, 174, 153, 141, 138, 102, 97, 87, 78, 76, 72, 71, 67, 64, 60, 58, 57, 54, 53, 52, 51, 49, 49, 48, 48, 43, 42, 41, 41, 40, 38, 38, 37, 37, 36, 35, 34, 32, 32, 31, 28, 27, 27, 27, 26, 24, 24, 24, 23, 23, 22, 22, 22, 22, 22, 21, 20, 20, 20, 20, 19, 19, 19, 19, 18, 18, 17, 17.
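A common rule for flagging possible outliers in a box plot is the "1.5 times the interquartile range" rule. I am assuming, not asserting, that the generator uses something like it. A minimal sketch with the quartiles from the Descriptive Statistics table (the helper name `tukey_fences` is made up):

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper fences of the k x IQR rule (k=1.5 is the usual choice)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles from the Descriptive Statistics table: Q1 = 2, Q3 = 8.
lower, upper = tukey_fences(q1=2, q3=8)
print(lower, upper)

# A few values from the data set, checked against the upper fence.
for likes in (8, 16, 17, 18, 3253):
    print(likes, likes > upper)
```

With those quartiles the upper fence falls exactly at 17 "likes". The list above includes the two 17s, so the generator presumably either counts values on the fence as outliers or computes its quartiles slightly differently; a strict "greater than" test, as in this sketch, would leave them out.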

What to do about them outliers? Well, there's no clearcut answer to that question, I think. There's a reasonable suspicion that something is wrong with them. Maybe someone just pressed the wrong keys: fat fingers. Shit happens. Like I said, one must be really sure of those values, or one may end up red-faced.

Even if real, sometimes outliers may be safely ignored: for purposes of provision of public health, for instance, the presence of extremely high-worth individuals may be all but ignored. In other circumstances that would be reckless: certainly Piketty and associates would oppose that in the case of high-worth individuals and taxes.

----------

I ain't no expert, but it seems to me that the presence of extreme -- overblown perhaps would be a better description -- values like the "1%" in this case, makes it surprisingly tricky to get a general view of the whole distribution, doesn't it?

This leads us straight back to the interpretation of the data. See? I didn't forget that. :-)

Bear with me a little longer. We'll deal with that in the next post.

DATA SET

5,9,5,4,1,20,2,1,1,1,1,4,3,27,4,3253,3,1,2,1,6,3,2,2,6,5,1,1,5,43,4,3,11,5,3,2,9,2,4,38,3,3,4,1,102,48,5,4,19,26,5,2,1,4,1,22,3,3,4,3,4,4,2,1,27,3,3,22,3,2,4,16,41,2,1,1,456,36,9,138,21,2,3,3,4,6,3,32,4,5,2,19,2,10,3,3,3,5,1,319,53,5,78,7,2,8,12,11,205,1080,6,1,4,3,10,2,2,32,14,6,5,11,9,4,1,4,5,1,5,4,3,4,2,5,7,3,49,3,3,37,4,141,3,2,57,6,5,250,6,2,2,16,4,2,1,1,2,1,1,1,5,11,9,1,17,2,4,15,15,6,5,3,1,7,3,1,4,3,7,4,3,2,7,1,2,3,1,10,3,10,5,2,9,4,6,87,4,20,1,10,1,1,4,3,1,5,4,35,3,17,5,1,15,2,3,2,4,5,7,58,15,6,3,2,2,153,5,2,3,5,4,1,3,1,42,4,4,2,4,11,5,12,31,3,64,9,4,2,5,3,18,1,19,54,2,1,5,4,4,3,27,1,1,52,3,23,1,2,469,4,3,3,48,208,13,7,2,41,2,9,1,2,4,2,1,3,2,174,4,3,2,2,7,3,2,3,2,4,5,1,5,4,2,3,6,1,6,3,2,2,3,8,8,3,5,5,1,1,8,5,2,5,4,72,5,3,4,3,2,1,40,22,5,1,1,2,1,9,4,6,1,51,4,4,1,37,4,4,2,22,2,4,2,1,10,5,12,3,3,1,1,9,5,6,3,5,5,49,4,1,2,1,23,2,5,383,1,15,5,4,2,5,4,5,1,3,6,3,216,7,1,2,4,319,19,4,2,904,16,1,5,1,2,5,4,1,20,9,4,2,5,591,9,3,7,4,2,5,8,3,5,4,1,24,5,1,2,1,5,3,5,7,2,481,4,1,6,3,1,6,9,2,3,5,2,4,15,12,3,1,10,1,1,7,1,1,1,5,4,11,5,2,1,76,34,4,267,6,1,6,8,4,607,4,4,3,211,4,4,320,16,2,1,4,1,6,38,3,2,4,1,2,5,15,4,2,67,9,268,22,2,97,3,5,10,6,1,24,5,1,4,4,2,1,7,1,2,2,1,4,4,246,8,5,2,4,1,1,5,3,24,708,18,5,2,1,2,1,1,3,1,5,1,9,8,5,1,336,20,6,4,2,3,5,14,71,1,1,10,5,3,2,60,3,28,14,11,2,6,3,2,2,6,3,2,1,4,3,1

Fun post, Magpie. Looking forward to the next installment.

Thanks, Pete!

To me, the next one was a lot more fun to write. It remains to be seen whether all my readers will enjoy it so much. :-)

What Mark Twain said... my head hurts.

C'mon, it wasn't that bad. :-)

Your dataset came from an opinion poll, whose results are subjective and trivial.

When I was studying remote sensing, there was plenty of statistical analysis related to frequency distribution. Would you like to see the reports I prepared as part of my studies? Those were painful, especially the histograms.

"Would you like to see the reports I prepared as part of my studies? Those were painful, especially the histograms."

I can imagine. Still, I think that data set could allow for some cool exercises for students: how to build useful, readable, charts; how to spot outliers; how sensitive the descriptive statistics are to them.

And, if you think about income/wealth distributions, in this "likes" data set we have a piece of the jigsaw that is universally missing for researchers outside government: we do know the precise identities of every single individual in the distribution.

That is beyond cool.