
xkcd'ers: We know where you live (+ more)

Data exploration by The Two Weeker

As longtime xkcd followers, reformed PowerPoint formatters (see: investment banking), and newfound owners of a fledgling webcomic, we decided to apply our vast (used liberally) programming, stats, and machine learning skills, in conjunction with our lack of social lives (used literally), to answer the age-old question: Who reads xkcd? To start, we want to confess our sins and apologize to our fellow xkcd readers, because we've been stalking you for the past month. In the process, we've scraped 20,000+ xkcd tweets and 100,000+ regular tweets, and installed more R / Python packages than there are words in Entish, but we've finally answered:

  1. Are they male or female?
  2. Where do xkcd readers come from?
  3. When do people read xkcd?
  4. Are they popular (follow-able)?
  5. What kind of people are they?

Our results are explained below with nerd (spoiler) tags containing extra math for those who click. Our code will be available soon on GitHub, and please feel free to reach out to us via Twitter/FB/carrier pigeon if you have any comments.

  1. Are they male/female?

Although Twitter doesn't contain gender data for tweeters, we were able to identify the gender of each user by their first name. We applied fuzzy string matching, matching each first name against a dataset (read: large) of boy and girl names with the same first letter. We threw out all the classifications where the ratio between the best match and the name was low (luckily this represented a tiny % of our data), as well as names that were classified as both genders (looking at you, Jesse).
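For the curious, here's a minimal sketch of that matching step in Python (using the standard library's difflib; the NAME_GENDERS table and the cutoff below are toy stand-ins for the real reference dataset and our actual threshold):

```python
# Toy sketch of the gender-matching step; NAME_GENDERS stands in for the
# real reference dataset of boy/girl names, and min_ratio for our cutoff.
from difflib import SequenceMatcher

NAME_GENDERS = {"alice": "F", "robert": "M", "maria": "F", "david": "M"}

def guess_gender(first_name, min_ratio=0.85):
    """Fuzzy-match a first name against reference names sharing its first letter."""
    name = first_name.strip().lower()
    candidates = [n for n in NAME_GENDERS if n[:1] == name[:1]]
    if not candidates:
        return None
    best = max(candidates, key=lambda n: SequenceMatcher(None, name, n).ratio())
    # Throw out weak matches, mirroring the cutoff described above
    if SequenceMatcher(None, name, best).ratio() < min_ratio:
        return None
    return NAME_GENDERS[best]

print(guess_gender("Mariah"))  # -> "F" (close match to "maria")
```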

What happened to the ladies?

  2. Where do xkcd'ers come from?

Just like gender, country/continent isn't a field for Twitter profiles either. We again applied fuzzy string matching to match a reference set of time zones/countries/continents to each tweeter's self-reported time zone. Another field, "location", was ignored entirely because you can write whatever you want there. One reader mixed up his space-time and wrote "The Future", while another claimed to be from "Somewhere with Magic n shit". Thanks team.

Our results showed that xkcd readers were mostly from Europe and North America, but by itself this doesn't prove much...

...so we compared our results against a "generic" tweet sample and found that xkcd readers are more likely to be European, but less likely to be South American or Asian (or to tweet about the World Cup). Mouse over to see for yourself.

  3. When do people read xkcd?

We had to make one critical assumption here - namely, that people will tweet shortly after reading. It's entirely possible that some won't tweet until hours later, while others may fire one off within minutes. So technically, we only know when people tweet about xkcd (shhh). Nevertheless, overlaying our xkcd tweet times against our sample of generic tweets, we concluded that reading xkcd is primarily an early morning and mid-afternoon activity, with very few reading it late at night:

Armed with this knowledge, we combined our "whens" and "wheres" to visualize global trends in xkcd reading below. The scrollbar on the top left represents the hour of day (in EST here instead of user local time) and the light-to-dark color scale represents the relative frequency of tweets between countries during that hour. Mousing over a single country will tell you the total number of tweets (across all hours) for that country:

  4. Are xkcd'ers popular (follow-able)?

Before Marvel Studios made comic heroes awesome and Zuckerberg made geeks "sexy", comic readers often had to contend with some... uhhh... less than stellar stereotypes. So we decided to put these claims to the test and determine (statistically), once and for all: are xkcd readers popular and likable?

We measured this on Twitter by defining: 

Popularity = # of Followers
Likability Ratio = # of Followers / # Following

Why did we define popularity differently from likability? Our logic was that everyone's got that one friend who knows everyone but whose company few enjoy - these people are probably popular but not likable. On the other hand, there are plenty of well-respected folks who maintain small social circles - these people seem more likable than popular. Interestingly, it looks like the likability ratio has actually been explored by TechCrunch before (although they called it the golden ratio).

Looks like likability is a bit of a mixed bag - most xkcd'ers are less likable than their generic Twitter peers, but there are also more very likable xkcd'ers (e.g., ratio > 1.5). Let's segment them into likable and unlikable buckets and see if likability is linked to popularity:
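As a rough sketch of how these metrics and buckets could be computed from the scraped profiles (the column names and the ratio-of-1 cutoff are our own assumptions here):

```python
# Sketch: compute popularity and likability from follower/following counts,
# then split users into "likable" / "unlikable" buckets (cutoff assumed at 1).
import pandas as pd

profiles = pd.DataFrame({
    "followers_count": [120, 4500, 80, 10],   # toy stand-in for scraped data
    "friends_count":   [300,  900, 40, 200],  # "friends" = accounts they follow
})

profiles["popularity"] = profiles["followers_count"]
profiles["likability"] = profiles["followers_count"] / profiles["friends_count"]
profiles["bucket"] = profiles["likability"].map(lambda r: "likable" if r > 1 else "unlikable")

print(profiles[["popularity", "likability", "bucket"]])
```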

From looking at our plots, it seems that xkcd'ers are usually

  1. Slightly less likable :(
  2. Slightly more popular :)
  3. Significantly more popular if they are likable :D
  4. Significantly less popular if they aren't likable :'(

Nerd Talk: Now let's talk hypothesis testing

Beyond just looking at density plots and visualizations, we can also run hypothesis tests for statistical significance. In our case, the null hypotheses we'll be testing are:

  1. xkcd'ers and generic twitter users are equally likable
  2. xkcd'ers and generic twitter users are equally popular

Let's look at likability first. Interestingly, regular Twitter users present a bimodal distribution with a secondary mode near ~1, which the xkcd distribution does not have. Applying the Mann-Whitney U test to compare the rankings of our two samples, we calculate a p-value (if the null hypothesis were true) of < 0.1%. This presents a pretty strong case for rejecting the null hypothesis and suggests that xkcd readers are usually less likable (more have a ratio < 1), but the group also contains more highly likable people, given its wider distribution.

What about popularity? Earlier we compared xkcd and generic tweeters against each other while also bucketing them by likability. Let's test whether each pair is significantly different:

Again applying the Mann-Whitney U test, we calculate a p-value (if the null hypothesis were true) of 5% for xkcd vs Twitter, < 0.1% for xkcd "likables" vs Twitter "likables", and < 0.1% for xkcd "unlikables" vs Twitter "unlikables". This presents a fairly strong case for rejecting the null hypotheses for all 3 pairs: xkcd'ers are slightly more popular than generic tweeters, while xkcd likables are noticeably more popular and xkcd unlikables are considerably less popular.


Nerd Talk: Why did we choose the Mann-Whitney U test?


Non-parametric tests are more appropriate for our data:

xkcd readers have a mean likability of 25x, well above their median of 0.8x and 95th percentile of 6.66x (creepy right?). These are marked in the boxplot below, where the pink line is the mean and the black whisker is the 95th percentile for each data set.

Intuitively, this implies that the mean tells us very little about the actual "average" likability of a user from this sample (due to the presence of outliers). As such, a non-parametric test should give us more insight into whether xkcd'ers are equally likable as other Twitter users. Non-parametric tests have the advantage of making no assumptions about the distributions of the variables being analyzed, making them more widely applicable and robust. In return, however, they are less efficient than parametric tests when applied to normal data.

The Mann-Whitney U test compares the distribution of rankings of each observation from our two sample sets. This is ideal for our purposes because ranking your data negates the impact of large outliers, i.e. even if the highest observation is 500 million and the next highest is 15, the difference in rank between these observations is still just 1.

When compared to other non-parametric tests (e.g. the 2-sample Kolmogorov-Smirnov test), the Mann-Whitney U test more closely examines our actual topic of interest: are xkcd'ers equally popular/likable as other Twitter users? This is because the p-value in the Mann-Whitney U test answers the question: if our two data sets are sampled from populations with identical distributions, what is the probability that random sampling would result in the mean ranks being as far apart (or more so) as observed?

So how does the Mann-Whitney U test work?

Suppose you have 2 samples, A and B, with n1 and n2 observations in each. The Mann-Whitney U test calculates a U statistic based on the rankings of the observations across the union of A and B:

  1. Group the observations from A and B together and assign an integer rank to each observation
  2. Calculate the U statistic (the smaller of U1 and U2):

$${\large{U_1} = n_1n_2 + \frac{n_1(n_1+1)}{2} - \sum\limits_{i=1}^{n_1} Rank(a_i)}$$

$${\large{U_2} = n_1n_2 + \frac{n_2(n_2+1)}{2} - \sum\limits_{j=1}^{n_2} Rank(b_j)}$$

where a_i and b_j are the observations from samples A and B. The U statistic captures the difference in mean rank between the observations from each sample group. If the two samples are mostly equally ranked, the difference between U1 and U2 should be small (and the smaller of the two will be relatively large), while the difference between U1 and U2 will be large for two samples with vastly different rankings (and the smaller of the two will be relatively small).

P-values are calculated assuming the U statistic follows a known distribution under the null hypothesis, which is approximately normal for large sample sizes, so we can just use the standardized value accordingly:

$${\large{z} = \frac{U - m_{U}}{\sigma_{U}}}$$ $${\large m_{U} = \frac{n_1 n_2}{2}}, {\large \sigma_{U} = \sqrt{\frac{n_1 n_2(n_1 + n_2 + 1)}{12}}}$$
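To make the formulas concrete, here's a small sketch that computes U and the normal-approximation p-value exactly as written above (ties and the continuity correction are ignored for simplicity; in practice scipy.stats.mannwhitneyu handles those details, and the samples below are random stand-ins, not our data):

```python
# Sketch of the U statistic and its normal approximation, following the
# formulas above (no tie or continuity corrections).
import numpy as np
from scipy.stats import rankdata, norm

def mann_whitney_u(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    ranks = rankdata(np.concatenate([a, b]))      # ranks over the pooled sample
    r1 = ranks[:n1].sum()                         # rank sum of sample A
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 - u1                             # U1 + U2 = n1 * n2
    u = min(u1, u2)
    m_u = n1 * n2 / 2
    sigma_u = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - m_u) / sigma_u
    return u, 2 * norm.cdf(z)                     # two-sided p-value

xkcd_ratios = np.random.lognormal(0.0, 1.0, 500)     # stand-in samples
generic_ratios = np.random.lognormal(0.3, 1.0, 500)
print(mann_whitney_u(xkcd_ratios, generic_ratios))
```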


  5. What kind of people are they?

In order to answer this question, we look at how xkcd'ers describe themselves in their Twitter bios, of which we scraped 15,000+. Now, we could read through all the bios to see what they say, but then we wouldn't have any time to watch Game of Thrones. Instead, let's transform the bios into bag-of-words vectors and break out some Natural Language Processing to summarize the data for us. To start, we plotted the most frequent terms used; a lot of nerdy terms pop up.
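A minimal sketch of that transformation, assuming the scraped bios live in a list of strings (the three toy bios below are made up):

```python
# Sketch: turn bios into bag-of-words vectors and pull out the top terms.
from sklearn.feature_extraction.text import CountVectorizer

bios = [
    "software engineer, coffee addict, xkcd fan",
    "physics grad student and part-time gamer",
    "writer, dreamer, occasional scientist",
]  # toy stand-in for the 15,000+ scraped bios

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(bios)            # rows = bios, columns = word counts

counts = X.sum(axis=0).A1                     # total count of each word
terms = vectorizer.get_feature_names_out()
top_terms = sorted(zip(terms, counts), key=lambda t: -t[1])[:10]
print(top_terms)
```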

At first glance, it seems that xkcd'ers are a pretty nerdy crowd. However, even the most frequent words occur quite rarely, given that we have more than 15,000 bios. So this gives us some insight, but not much. To get a better idea of what the different kinds of readers are, let's try to partition our reader population into different groups.

Clustering Twitter Bios

Using Spherical K-Means Clustering, we clustered the data into 4 distinct groups. The resulting groups are shown below as word clouds of their bios, along with the data breakdown. Some pretty cool and intuitive groups pop up: the first one seems to represent those who work in the tech industry, the second STEM students (both expected of xkcd readers), and the third is definitely hobby/interest focused. What the fourth group represents is anyone's guess.

Nerd Talk: What/why Spherical Clustering?


K-means clustering is a method of grouping the points of a data set into K groups, so that members of the same group are "more similar" to each other than to points in other groups. But what does it mean to be "similar"? Well, for the standard K-means approach, "similarity" is defined using Euclidean distance, $$d(\boldsymbol{a}, \boldsymbol{b}) = \sqrt{ \|\boldsymbol{a}\|^2 + \|\boldsymbol{b}\|^2 -2 \boldsymbol{a}\cdot\boldsymbol{b} }$$ so that points that are more "similar" have a lower distance to each other. But here is an issue with text data: Euclidean distances incorporate vector norms in determining similarity, which may not make sense contextually. Take a look at the toy example below: contextually, the text "love" is most similar to "love, love, love, love, love". But if we ended up doing clustering with K = 2, "love" ends up getting grouped with "war" and "peace", as it is closer to both; sadly, "love, love, love, love, love" gets no love.
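You can check the toy example yourself with a three-word vocabulary (love, war, peace):

```python
# Toy check of the example above: with Euclidean distance, "love" really is
# closer to "war" and "peace" than to "love x5"; with cosine distance it isn't.
import numpy as np

love    = np.array([1, 0, 0])   # vocabulary order: [love, war, peace]
love_x5 = np.array([5, 0, 0])   # "love, love, love, love, love"
war     = np.array([0, 1, 0])
peace   = np.array([0, 0, 1])

def euclid(a, b):
    return np.linalg.norm(a - b)

def cosine_dist(a, b):
    return 1 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclid(love, love_x5), euclid(love, war))            # 4.00 vs 1.41
print(cosine_dist(love, love_x5), cosine_dist(love, war))  # 0.0  vs 1.0
```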

For a set of documents with a large vocabulary, the vectors become very high dimensional (since there are a LOT of words) and incredibly sparse. For any two bios/documents, the proportion of words that the two share is very low, so the dot product a·b becomes very, very small (nearly zero), and we basically end up measuring the distance between a and b solely by their vector norms.

Why does this matter for our analysis? In a nutshell, the standard K-means algorithm will end up grouping short bios together, just by virtue of them being short. That ain't right.

See for yourself. We trained standard K-means with K = 3 on our dataset, and plotted both the distribution of vector norms for each cluster and the percentage breakdown of the data across clusters below. The majority of the points are grouped into cluster 2, which we can see is composed of vectors with overall smaller norms.

Spherical K-means Clustering instead uses cosine distance, which measures the distance between a and b by $$d(\boldsymbol{a}, \boldsymbol{b}) = 1 - cos (\theta), \theta \text{ being the angle between }\boldsymbol{a} \text{ and } \boldsymbol{b}$$ Using the angle between the vectors gives us several benefits. For one, angles are unaffected by vector norms, so short bios won't get lumped together just for being short. Also, the angle is only affected by the words that a and b share in common. This makes much more sense for contextual similarity.
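One common way to get this behavior in Python (the original clustering may well have used a dedicated spherical K-means package) is to L2-normalize the vectors and run ordinary K-means, since for unit vectors the squared Euclidean distance equals 2(1 - cos θ). A minimal sketch:

```python
# Sketch: approximate spherical K-means by projecting every bio onto the unit
# sphere and running standard K-means (a dedicated implementation would also
# re-normalize the centroids each iteration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

bios = [
    "python developer, data nerd",
    "physics student who loves math",
    "guitar, hiking, craft beer",
    "software engineer at a startup",
]  # toy stand-in for the real bios

X = CountVectorizer(stop_words="english").fit_transform(bios)
X_unit = normalize(X)                               # L2-normalize each bio vector

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_unit)
print(km.labels_)
```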

We did the same clustering again, but this time using cosine distance. The breakdown between clusters is now more even, and none of the clusters have a norm distribution that is significantly smaller than the others.


[Cluster word clouds: 15%, 21%, 29%, and 35% of the data]

Topic Modeling on Twitter Bios

Clustering, however, may be too limiting, since it partitions our reader population into distinct groups without any overlap. Intuitively, each person probably shares some attributes with group 1, some attributes with group 2, some with group 3, etc., instead of belonging 100% to group 2. This is where topic modeling comes in; it allows us to go one level deeper (read: Inception) and represent the readers as a mixture of various groups/types.

More specifically, we will use Latent Dirichlet Allocation (LDA) to model our data. LDA models each document (e.g. a Twitter bio) as a mixture of latent/hidden topics. Under this representation, we can think of the hidden topics as the various groups/types that comprise the readers.

Nerd Talk: Wait what is LDA ?


Latent Dirichlet Allocation is a generative model for the words in a document. As mentioned above, the model assumes that each document is composed of a probabilistic mixture of k topics, and each word in the document is generated from one of the k topics.

But what is a topic? In LDA, topics are represented as distributions over words; each topic has its own distribution, so it assigns different probabilities to generating different words. So if we had a "sports" topic and a "food" topic, the "sports" topic would assign a higher probability to "football" or "player" than the "food" topic would.

Wanna get mathy? Let's say we have a document d with N total words. Assume for now that those words aren't really there, just blank spaces, so we have N blank word positions. Then the words of the document are generated by the following process:

  1. Generate a topic distribution $$\theta_d$$ from $$\boldsymbol{Dirichlet}(\boldsymbol{\alpha})$$
  2. For each word position i of document d:
     i. Generate a topic z from $$\boldsymbol{Multinomial}(\theta_d)$$
     ii. Generate a word w from $$\boldsymbol{Multinomial}(\beta_z)$$, the word distribution for topic z

So for a document composed of words $$(w_1, w_2, ..., w_N)$$, the probability of that document, $$P(d\text{ }|\text{ } \boldsymbol{\alpha},\text{ } \boldsymbol{\beta})$$, is given by:

$$\int_\theta \prod\limits_{i=1}^N \sum\limits_{k=1}^K P(w_i|z_k, \beta_{k})P(z_k|\theta)P(\theta|\boldsymbol{\alpha})\,d\theta$$

The goal of LDA is to find the values of the parameters $$(\boldsymbol{\alpha}, \boldsymbol{\beta})$$ that maximize the probability of all the documents in the data set, and then back out the optimal topic distribution for each document and the word distribution for each of the K topics.
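Here's a toy simulation of that generative process, with made-up values for alpha and beta, K = 2 topics, and a 4-word vocabulary:

```python
# Simulate the LDA generative story for one document of N = 8 words.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["football", "player", "pizza", "recipe"]
alpha = np.array([0.5, 0.5])                   # Dirichlet prior over the 2 topics
beta = np.array([[0.45, 0.45, 0.05, 0.05],     # "sports" topic word distribution
                 [0.05, 0.05, 0.45, 0.45]])    # "food" topic word distribution

theta_d = rng.dirichlet(alpha)                 # 1. topic mixture for this document
words = []
for _ in range(8):                             # 2. for each of the N word positions
    z = rng.choice(2, p=theta_d)               #    i.  draw a topic
    w = rng.choice(4, p=beta[z])               #    ii. draw a word from that topic
    words.append(vocab[w])

print(theta_d, words)
```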

Sounds complicated? It kind of is. We won't get into how the optimal parameters are calculated; that's beyond the scope of this humble webcomic to explain. But if words like Variational Expectation Maximization or Gibbs Sampling get you all hot and bothered, we recommend you check out the actual paper by David Blei, Andrew Ng, and Michael I. Jordan.


We trained an LDA model with 15 topics on our data and plotted the topic word distributions below. For fun, we tagged each topic with what we think it represents; feel free to disagree (silently).
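A sketch of this step using scikit-learn's variational LDA (a stand-in, not necessarily the implementation we used), on a handful of toy bios:

```python
# Sketch: fit an LDA model and read off the top words of each topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

bios = [
    "software engineer building web apps",
    "physics phd student, math and coffee",
    "gamer, guitarist, craft beer enthusiast",
    "dad, husband, developer, xkcd reader",
    "artist and writer, occasional scientist",
    "startup founder, data science, python",
]  # toy stand-in for the real bios

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(bios)

lda = LatentDirichletAllocation(n_components=15, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):          # word distribution per topic
    top_words = [terms[i] for i in topic.argsort()[::-1][:8]]
    print(f"topic {k}: {' '.join(top_words)}")
```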

#alphadog #ficklepickle #hobbyist #negativenancy #STEMstudent #justaguy #pundit #artist #groupie #onedaymore #computerscientist #gamer #existentialist #geekDad #developer

Nerd Talk: Why 15 Topics?


The plot below shows the log-likelihood and perplexity (on held-out data) of LDA models trained on the Twitter bios, for K up to 30.

It looks like once we are past ~15-20 topics, there is diminishing marginal benefit to increasing K. Since we want to avoid capturing excessive noise, we choose K = 15. Of course, we could use some complicated hierarchical Dirichlet techniques to determine the optimal K, but this ain't rocket science. Also, using 15 topics gives us some nice interpretability (see below).
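For reference, the model-selection sweep could look roughly like this (the held-out split, the K grid, and the toy bios are assumptions for illustration):

```python
# Sketch: fit LDA over a range of K and track held-out perplexity (lower is better).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

bios = ["developer and gamer", "physics student", "artist and writer",
        "dad, engineer, xkcd reader", "data scientist", "musician and hiker"]
X = CountVectorizer(stop_words="english").fit_transform(bios)   # toy stand-in

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)
for k in range(5, 31, 5):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    print(k, lda.perplexity(X_test))                # held-out perplexity per K
```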


So what have we found?

Dear reader, we accept the fact that together, we had to sacrifice a full 20 minutes to read whatever we were saying in this analysis. But we've come to the conclusion that it's crazy to try and express xkcd readers in the simplest terms - in the most convenient definitions. Because what we've found is that each xkcd'er is a nerd...

And a scientist. And an artist. They read and listen to music.  They drink beer and watch sports. They are fathers, husbands, and software engineers. They love, and they like. But above all, they read xkcd.

Sincerely yours, The Two Weeker

