Subreddit breakdown: AskScience

I’m a big fan of the reddit community AskScience, where anyone can ask a question and have “scientists” try to answer it. Some of the people answering may not be actual scientists, but there’s still a lot of participation from people with real expertise. You can also find some fascinating questions. Here’s a short list of questions I thought were interesting or amusing (in no particular order):

  1. If two ships travel at faster than half the speed of light away from each other, could light from one ever reach the other?
  2. What exactly is an itch?
  3. Why can’t I list every book I know, but I can tell you if I own it?
  4. Why do airplane windows need to have that hole?
  5. Do people sneeze while they sleep?
  6. If you farted hard enough in space, could you move yourself around?

Clearly, these people are asking the important questions. One thing I’ve noticed on AskScience is how physics-related questions seem to be the most common. I know I’m biased (I always hope to see more neuroscience questions), but I wanted to know for sure. I decided to analyze the subreddit submissions by scraping submission info with PRAW. Specifically, I used this Python script to download information about every submission since the subreddit started (6 years ago). The end result was 155,805 JSON files, which covered a range of 2,239 days. Below are some simple descriptive statistics about this dataset. Eventually, I’d like to do some natural language processing on the questions and answers themselves, but for now, this will be a broad look at the meta-data for submissions to AskScience.
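To give a sense of the aggregation involved, here’s a minimal sketch of how the average submission rate could be computed from the `created_utc` timestamps saved in those JSON files. The function name and the monthly binning are my own choices for illustration, not what the original script does:

```python
from collections import Counter
from datetime import datetime, timezone

def hourly_rate_by_month(timestamps):
    """Average submissions per hour, grouped by calendar month.

    `timestamps` are Unix times (reddit's `created_utc` field);
    keys of the result are (year, month) tuples.
    """
    counts = Counter()
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        counts[(dt.year, dt.month)] += 1
    # Rough average hours per month (30.4 days); fine for a trend plot.
    return {ym: n / (30.4 * 24) for ym, n in counts.items()}
```

Feeding the binned rates into any plotting library then gives the submissions-per-hour curve discussed below.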

First, a bit of history

Below, I’ve plotted the average rate of submissions over time in units of submissions per hour, starting around April 2010. You can see that the rate grew slowly for the first eight months but then spiked around January 2011. I think this was a result of new traffic from AskScience’s nomination for the Best Little Community of 2010. Over the next nine months, the community grew by leaps and bounds until it was eventually added to the list of default subreddits in October 2011. That caused another huge spike in traffic, and you can see major fluctuations in submission rate over the following 12 months or so. Since then, the rate of submissions has continued to fluctuate up and down but has held more or less steady.


I think it’s interesting to look through the early periods to see how the subreddit gained popularity. For example, this post from March 2011 has a moderator saying that the number of subscribers reached 19,000, up from only 4,500 just 6 months earlier. When AskScience was made a default subreddit, this number skyrocketed. As of this writing, there are 8.5 million subscribers. That’s quite a change.

Breaking down submissions by flair type

Sometime in 2012, AskScience started requiring each question to be tagged with one of 12 flairs corresponding to different scientific fields: physics, biology, chemistry, etc. Below, I’ve broken down the 91,014 tagged posts by flair type. Sure enough, physics is by far the most popular. Biology is a close second, but each of the other fields is less popular by a factor of 2 or more. The numbers of questions tagged with the bottom three flairs (psychology, social sciences, and computing) are each around 13 times smaller than the number tagged with physics.
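The breakdown itself is a simple tally over the submission metadata. Here’s a sketch, assuming each submission is a dict with reddit’s `link_flair_text` field (the "Untagged" label is my own placeholder for posts with no flair):

```python
from collections import Counter

def flair_breakdown(submissions):
    """Fraction of submissions per flair, treating missing flair as 'Untagged'."""
    counts = Counter((s.get("link_flair_text") or "Untagged")
                     for s in submissions)
    total = sum(counts.values())
    # most_common() orders the result from most to least popular flair
    return {flair: n / total for flair, n in counts.most_common()}
```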


This confirmed my observation that physics questions were the most common, but I was also curious to see if these percentages changed over time. Below I have a stacked area plot that shows how the relative percentages changed over time. Initially, most questions were not tagged, so you can see the gray area covers almost 100% of the plot. Then in 2012, the colored areas get larger, meaning more and more submissions were getting tagged. You can tell how the percentages changed by comparing the relative sizes of the colored areas.


In this graph, the relative percentages seem fairly stable over time, but something interesting happened to the biology category. It’s hard to tell in the graph above, but the rates of biology and physics questions were actually quite similar in the beginning. Sometime in mid-2014, the rate of biology tags dropped, which you can see as the green biology area getting smaller.

It’s much easier to see this trend when you look at the raw submission rates, which I’ve plotted below. Initially, the curves are highly correlated (they both go up and down as the overall submission rate fluctuates), but then the biology rate drops and never recovers.


I’m not sure what caused this drop. It could be the result of more aggressive moderation of biology-tagged posts, or possibly a genuine decrease in the number of biology questions. Regardless, the end result is an even greater percentage of physics questions over the last two years. The bar graph above shows the percentages of physics and biology questions are around 27% and 20%, respectively. More recently, these values are closer to 30% and 15%.

UPDATE: After talking with a few mods from AskScience, I’ve learned that the drop in Biology posts was caused by the introduction of the “Human Body” tag, which was introduced as a subset of the Medicine category. This resulted in a lot of questions getting the Medicine flair that previously had the Biology flair. Nevertheless, there is still a small decrease in Biology-related posts over time, which you can see in the stacked area plot by comparing the Biology+Medicine areas to the Physics area.

Comparing scores

An obvious follow-up to the previous analyses is how well the posts from each field do once they’re submitted, i.e. how many upvotes they get. Reddit “fuzzes” this information to prevent spam bots from taking advantage, but the numbers I extracted should still be an okay approximation.

I decided to compare three values:

  1. Score – This is the number of upvotes a post receives minus the number of downvotes.
  2. Upvote ratio – The ratio of upvotes to the total number of votes. This number is between 0 and 1.
  3. Number of comments – This is the total number of comments in a post, so it includes all the replies to the top-level answers.

Below I have the mean values for each flair type marked with circles. The lines are 99% bootstrapped confidence intervals, so a lot of overlap between these lines implies the mean values are not significantly different from each other. You shouldn’t place too much stock in comparing individual pairs because of the problem of multiple comparisons, but if there’s a large gap between the confidence intervals, it’s probably safe to say that the difference is statistically significant.


The confidence intervals for the scores have a large degree of overlap, indicating there probably aren’t very many significant differences between the groups. There might be some, e.g. medicine posts seem to have a higher average score than chemistry posts, but not by much. It’s important to note that the score and number of comments have distributions that are heavily skewed towards zero (95% of posts have scores under 100). The mean isn’t a great summary statistic for these types of distributions, but the confidence intervals I have here are bootstrapped, i.e. they don’t make any assumptions about the underlying distribution.
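For readers unfamiliar with the technique, here’s a minimal sketch of a percentile bootstrap confidence interval. This is not the exact code I used, just the standard recipe: resample the data with replacement, recompute the statistic, and read off the quantiles, with no distributional assumptions:

```python
import random

def bootstrap_ci(data, stat=lambda xs: sum(xs) / len(xs),
                 n_boot=2000, alpha=0.01, seed=0):
    """Percentile bootstrap confidence interval for `stat` (default: mean).

    Returns the (alpha/2, 1 - alpha/2) quantiles of the bootstrap
    distribution, e.g. a 99% interval for alpha=0.01.
    """
    rng = random.Random(seed)
    n = len(data)
    boots = sorted(stat([rng.choice(data) for _ in range(n)])
                   for _ in range(n_boot))
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Because each resample is drawn from the observed data, the interval adapts to skewed distributions like the post scores, which is exactly why I used it here instead of a normal approximation.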

The confidence intervals for the upvote ratios and number of comments have less overlap, so you can have more confidence in making comparisons. Take, for example, the opposite trends for mathematics and neuroscience, two categories that receive similar numbers of questions. Math questions seem to get the most comments but have the lowest upvote ratios, whereas neuroscience questions seem to have high upvote ratios but few comments. This makes me wonder if math questions are more controversial, generating a lot of discussion but not a lot of positive interest. Conversely, I think people like seeing neuroscience questions enough to upvote them, but there aren’t as many neuroscientists on reddit answering questions.

Final remarks

This may have seemed like a lot of work just to confirm my suspicion about the popularity of physics questions, but I enjoyed taking a deeper look at such an interesting part of reddit. There’s plenty more to do with this dataset, such as looking at the actual content of the questions and answers. I’m also very interested to see how much higher the scores are for answers that were given by people with “scientist” flairs. Stay tuned.


Listening to data

I enjoy browsing /r/dataisbeautiful to see all the creative data visualizations, which makes me wish for more innovative visualization techniques in scientific literature. Unfortunately, anything besides static images is still largely relegated to the supplementary info. Libraries like D3.js are amazing, but the scientific community has yet to embrace these types of tools. This will hopefully change as journals such as eLife become more popular. Until then, I think we’re stuck with static images being the principal method for displaying data. There’s nothing specifically wrong with that, but different approaches can often lead to new insights.

To show you what I mean, I’ve put together a few videos demonstrating a different way of “looking” at time series data: sound synthesis. There are three examples below, each using a different data source. I’ll be the first to admit that these are essentially gimmicks, but I hope I can convince you that it’s a useful tool during the data exploration phase, which can lead to new insights and analyses.

Example 1: Neuroscience

I’ll start with something from my own research. I’ve taken a 15 s recording from a neuron in the auditory midbrain and simply used the raw waveform from the signal to create a sound. This signal is the voltage from an extracellular electrode that is slowly advanced through the brain until a contact is close enough to a particular neuron that we can see individual action potentials. During an experiment, we have the voltage playing continuously through a speaker, since it’s helpful to listen for the distinctive sounds of an action potential when searching for a neuron. Below is an example of what we hear when we get close to a neuron.

The action potentials are the large “spikes” in the waveform that are shown in more detail on the smaller set of axes. You can hear them as distinctive crackling sounds. The spikes in this example appear to occur randomly, which we call “spontaneous” spiking since no stimulus is being applied. However, there is a limit imposed by the kinetics of action potential generation on how fast any neuron can fire, often referred to as the refractory period. For this reason, the spiking of a neuron is not truly a Poisson process as it is often assumed to be. You can sometimes even hear a distinct periodicity in the spiking, which is the basis for neuronal oscillations that play critical roles in brain function. These periodicities can be quantified with tools like spectral analysis or directional statistics.

Example 2: Physics/astronomy

This next example is obviously not my field of expertise, but I think it’s nevertheless fascinating. The audio is taken from the Space Audio page from a group at the University of Iowa. It’s a recording of “whistlers”, a phenomenon that can occur when lightning causes radio waves to travel along the Earth’s magnetic field lines; the waves disperse because different frequencies propagate at different speeds. Here’s a figure from the same group that explains it better. The result is a distinctive whistling sound where the pitch drops as a result of the dispersion.

The visualization is a simple spectrogram, which shows frequency on the y-axis, time on the x-axis, and intensity as the color. You can see the curved lines in the spectrogram that correspond to the whistling sounds, which is a sign that the frequency is changing over time.
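A spectrogram is just a sequence of short-window frequency analyses. As an illustration, here’s a dependency-free sketch using a naive windowed DFT; in practice you’d use an FFT (e.g. `numpy.fft` or `matplotlib`’s `specgram`), and the window/hop sizes here are arbitrary choices:

```python
import math

def spectrogram(samples, win=256, hop=128):
    """Magnitude spectrogram via a naive windowed DFT.

    frames[t][k] is the intensity at time frame t and frequency bin k;
    plotting it as an image (time on x, bin on y, magnitude as color)
    gives the familiar spectrogram view.
    """
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        seg = samples[start:start + win]
        mags = []
        for k in range(win // 2):  # bins up to the Nyquist frequency
            re = sum(s * math.cos(2 * math.pi * k * n / win)
                     for n, s in enumerate(seg))
            im = -sum(s * math.sin(2 * math.pi * k * n / win)
                      for n, s in enumerate(seg))
            mags.append(math.hypot(re, im))
        frames.append(mags)
    return frames
```

A falling whistler would show up as a ridge that drifts to lower bins in successive frames.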

I love this example for two reasons (besides the fact that it just sounds cool). First, the signal did not need to be shifted at all in terms of frequency before converting to sound. You can find examples of “space sounds” where something like gamma radiation is used to synthesize a sound. However, gamma rays have frequencies that are much, much higher than our audible range, so it needs to be shifted in order to hear it. This is largely an arbitrary process, since the perception of the sound can vary depending on how much frequency shifting/compressing is done.

Second, dispersion is a concept important to the propagation of waves in general, so you can find examples of it in the auditory system as well. The cochlea, for example, acts like a dispersive delay line, meaning that it not only separates out the frequency content of sounds, but it does so at slightly different speeds, resulting in what is referred to as the “traveling wave.” These can look a lot like whistlers.

Example 3: The button

Frequent visitors to reddit will probably recall /r/thebutton, reddit’s most recent April Fool’s Day joke. Explaining it fully would take a while, so check out the wiki on the subreddit linked above or this blog post for more information. Suffice it to say, it’s a very interesting dataset. The official server-side data was released after the button ended, and I used it to synthesize sound.

It’s based on the concept of a vocoder, which works by splitting a sound into different frequency bands and then using the energy in those bands to modulate a sound like white noise or a sawtooth wave. I used the different flair values of the button as my frequency bands, i.e. the 60 s presses were mapped to a high-frequency band and the 0 s presses were mapped to a low-frequency band. I counted the number of presses within each band within 3 hour windows and used these rates to modulate white noise. I also compressed the timescale so that listening to the two-month dataset takes two minutes. I recommend playing the video below with headphones on, although you may want to turn the volume down at the beginning since it starts out loud.
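The mapping itself can be sketched in a few lines. Note this is a simplification of what I did: I modulate pure sine carriers instead of filtered white noise to keep the example dependency-free, and the sample rate, frame duration, and band frequencies are all placeholder values:

```python
import math

def sonify_rates(band_rates, band_freqs, rate=8000, frame_dur=0.05):
    """Vocoder-style sonification of binned event rates.

    band_rates[t][b] is the press count for band b in time window t;
    band_freqs[b] is the carrier frequency assigned to that band.
    Each window's counts set the amplitudes of the carriers.
    """
    n = int(rate * frame_dur)  # audio samples per time window
    peak = max(max(frame) for frame in band_rates) or 1.0
    out = []
    for t, frame in enumerate(band_rates):
        for i in range(n):
            # t * n + i keeps each carrier's phase continuous across frames
            s = sum((amp / peak) *
                    math.sin(2 * math.pi * f * (t * n + i) / rate)
                    for amp, f in zip(frame, band_freqs))
            out.append(s / len(band_freqs))
    return out
```

The result is a waveform like the one in the top plot of the video: louder whenever presses were frequent, brighter whenever the high-value bands were active.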

The top plot is the sound waveform that you’re hearing, and the middle plot is like a spectrogram, showing the log-scaled rate of button presses over time. The bottom plot shows the instantaneous press rate for each flair over the 3 hour window used to compose each frame, so the minimum non-zero value is \log_{10}\left(\frac{1 \text{ press}}{180 \text{ minutes}}\right) \approx -2.3. Also, you can probably tell I cut out the first few days (60 hours to be exact), since the rate of button presses is so high at the beginning that it would’ve swamped out the rest of the audio. That, and it would’ve given you all hearing loss.

You can hear/see some pretty interesting things, like how the overall amplitude ebbs and flows over 24 hour periods, or how the flair values near the borders (e.g. 11 s, 21 s, etc.) stick out, almost like harmonics in a natural sound. You can even hear the hitchhiker exodus, where a bunch of redditors pressed the button at 42 s. It ironically starts 42 s into the video (I swear that was unplanned).

Final Thoughts

I hope you enjoyed the examples and have a new-found appreciation for different ways of “displaying” your data. My examples here were all time-series data, but there’s no reason these techniques can’t be extended to other modalities, like spatial data from geographers or spectroscopy data from chemists. The sky’s the limit! (Get it? Because sound can’t propagate in outer space?)

I’ll see myself out.

Generating gibberish from Harry Potter fanfiction summaries

 This is a continuation of my previous post analyzing character choices in Harry Potter fanfiction.

Generating random/gibberish text is not a new idea. Perhaps you’ve been to /r/SubredditSimulator, a complete subreddit in which content is created by bots generating random text. Or perhaps you’ve heard about gibberish scientific articles actually being accepted into journals and conference proceedings. To my knowledge, though, this is the first time someone applied these tools to fanfiction, or more accurately, fanfiction summaries.

The technique is based on the concept of Markov chains, a way of describing “memoryless” random processes, i.e. a process in which the next state in the process depends only on the current state and not the previous states. It’s an enormously useful concept. It’s even the basis for how Google ranks websites in their search engine results.

But enough introduction, let’s get to the good stuff. Introducing the Harry Potter Fanfiction Summary Generator! Just click the button below to generate a new random fanfiction summary.

Harry Potter Fanfiction Summary Generator

James Potter And Ever Easy
James and Lily get together while she positively hates him? Meanwhile, Sirius lives another lovehate relationship. JL, SBOC, RLOC
Rated: T
– English – Romance/Drama – Chapters: 43 – Words: 129462 – Reviews: 403 – Favs: 313 – Follows: 317
– Sirius B., OC

Note: This generator is not creating summaries on the fly, but rather loading summaries from a previously generated list. There are about 10,000 summaries in the list, so it should take a while before you start seeing repeats.

So how does it work?

Disclaimer: the rest of this post will get somewhat technical, but I’ll try to avoid jargon.

You probably saw in the examples above that sometimes the generator produces perfectly legitimate results, even to the point of containing whole sentences from preexisting summaries (more on that later). Other times, it fails horribly and hilariously. To understand what’s going on, you need to understand the concept of Markov chains.

To construct a Markov chain, you start by analyzing the probabilities of transitions between “states”. In this case, the states are words. For example, if you start with the word ‘Harry’, you can probably guess that it is often followed by the word ‘Potter’ and less often by, say, ‘taradiddles’ (yes that’s a real word; it actually appears once in Harry Potter and the Order of the Phoenix). By analyzing all of the word transitions in a body of text, you can calculate lots of probabilities and create a diagram like the example below.
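The word-transition probabilities described above can be estimated directly from a corpus by counting consecutive word pairs. Here’s a minimal sketch (whitespace tokenization is a simplification; the real summaries need more careful handling of punctuation):

```python
from collections import Counter, defaultdict

def transition_probs(texts):
    """Estimate word-to-word transition probabilities from a corpus.

    Returns a nested dict: probs[a][b] is the probability that
    word b immediately follows word a.
    """
    counts = defaultdict(Counter)
    for text in texts:
        words = text.split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}
```

Feeding in the corpus of summaries yields exactly the kind of arrow weights shown in the diagram: ‘Potter’ gets a big arrow out of ‘Harry’, ‘taradiddles’ a vanishingly small one.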

Diagram for a hypothetical Markov chain. The size of the arrows is proportional to the transition probabilities between words, e.g. ‘Potter’ might follow the word ‘Harry’ 80% of the time, while ‘and’ might follow it the other 20%. From these words, there are other likely continuations, such as ‘Harry Potter is’ or ‘Harry and Draco’, while something like ‘Harry and and’ is much less likely. To construct a random sentence, we pick a starting point and then move between “states” (i.e. words) according to their probabilities. (Note: these are dummy probabilities for the purpose of illustration.)

As you might expect, Markov chains can be much more complicated than this example. To generate the summaries above, I constructed a Markov chain using 25,000 fanfiction summaries. This was a sample of “popular” fanfics, specifically the top 25,000 fics written in English, sorted by number of reviews. This is certainly a biased sample, but hopefully biased in an interesting way. For example, I might speculate that summaries in this sample are more successful (on average) at attracting readers’ attention. I obviously don’t know if that’s true, but I think the sample is large enough to get a good sense of common trends in summaries.

That’s nice, but how does it work?

To actually explain the process, I need to introduce the concept of the Markov chain order, also referred to as the memory. When using a Markov chain to generate random text, this number refers to how many previous words are considered when selecting the next. For example, say we start with the phrase ‘sent back in’. With an order of 1, only the previous word is considered, so the next word is chosen based on which words are most likely to follow ‘in’. For an order of 3, you consider all three words, so the most likely next word is almost certainly ‘time’, which constructs the phrase ‘sent back in time’. As you might expect, this phrase is very common in fanfiction summaries, since a lot of stories involve time travel.
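An order-n generator can be sketched in a dozen lines: map each n-word window to the list of words that followed it in the corpus, then walk the map. This is a simplified version of the idea (the function names and whitespace tokenization are my own):

```python
import random
from collections import defaultdict

def build_chain(texts, order):
    """Map each `order`-word tuple to the words observed to follow it."""
    chain = defaultdict(list)
    for text in texts:
        words = text.split()
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, start, max_words=30, seed=0):
    """Walk the chain from `start` (a tuple of `order` words).

    Stops when the current state was never seen in the corpus
    or the word limit is reached.
    """
    rng = random.Random(seed)
    out = list(start)
    state = tuple(start)
    while state in chain and len(out) < max_words:
        out.append(rng.choice(chain[state]))
        state = tuple(out[-len(start):])
    return " ".join(out)
```

Storing the followers as a plain list (with repeats) means `random.choice` automatically samples them in proportion to their observed frequency.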

One way to analyze the effect of Markov chain order is to generate lots of random summaries and see how often these summaries match one of the input summaries used to construct the chain. By “match”, I mean an exact match, including the capitalization and punctuation. Below, I show the results of this analysis for small subsets of the full dataset. It would be nice to repeat this analysis for the entire thing, but that’s more work than I’m willing to do for a blog post.

Effect of Markov chain order on probability of producing exact matches when randomly generating summaries. A probability of zero implies that every generated summary is completely unique, while a value of one implies every generated summary is just a reproduction of an existing summary.

To calculate these probabilities, I generated lots of random summaries from each Markov chain and calculated the fraction of exact matches from the sample. I repeated this process several times with a different sample of summaries and averaged the results. This is an example of a Monte Carlo method.

There are two trends to describe in the graph:

Effect of order — With an order of 1, nearly every generated summary is unique. With an order of 5, basically all of them are just reproductions of the input data. Something special happens around order 2-3 when we start to get a lot of matches. This value has a lot to do with how long summaries tend to be. If you wanted to reproduce larger sections of text (e.g. an entire fanfiction), you would need a higher chain order.

Effect of sample size — You can see the general effect is to shift the curves to the right as the sample size increases. This implies that you get fewer matches with a larger sample.

From these results, I decided to choose an order of 3 to generate summaries from my full dataset, since I think it’s high enough to create interesting patterns, but low enough to create mostly unique results. I generated 10,000 summaries, and 271 were matches. I decided to remove them from the generator above, since these were usually a result of all the crazy ways people use punctuation to make things stand out, e.g. ::Rewritten & Improved:: or ***COMPLETE***. However, you’ll still see times when the generator reproduces part of a summary, then suddenly switches to a new one. This can create readable, yet hilarious results.

Lastly, I should mention the titles were also constructed with Markov chains, only using an order of 1 since titles are so much shorter. I also removed randomly generated titles with only one word, since these are always exact matches. Despite these precautions, ~18% of the titles are still matches.

You still haven’t told me how it works

Right. This post is already getting pretty long, so I decided to put some of the extra technical information on a separate page here. You can see the actual algorithm I used to generate random summaries, as well as the techniques I used to provide the accompanying information, e.g. genre, rating, reviews, etc.

Phrase analysis

To finish off this post, I decided to look at the most popular phrases used in the summary Markov chains. Recall that for an order n chain, we consider the n previous words to pick the next, so we can look at how often phrases of length n and n + 1 occur. Since I had difficulty deciding between an order of 2 or 3, I created Markov chains for both, so I can analyze popular phrases from 2 to 4 words long. Below I have the top 15 phrases from each group.
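Counting phrase frequencies is a standard n-gram tally; here’s a sketch (again with simple whitespace tokenization, which glosses over the punctuation issues mentioned earlier):

```python
from collections import Counter

def top_ngrams(texts, n, k=15):
    """The k most common n-word phrases across all texts."""
    counts = Counter()
    for text in texts:
        words = text.split()
        # slide an n-word window across each text
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(k)
```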


Most popular phrases used in Harry Potter fanfiction summaries. The three lists correspond to phrases of different lengths: pairs (left, red), triplets (center, blue), and quadruplets (right, green). The font size is proportional to the number of times that phrase occurs in summaries, relative to the top phrase in each list.

There are some interesting things to notice:

  1. For a length of two, 9 of the 15 phrases are prepositional phrases, i.e. not really specific to fanfiction summaries. Also, the only name mentioned is Harry.
  2. For a length of three, you start to see some interesting combinations, like character pairings and other phrases unique to Harry Potter fanfiction. I think the most interesting phrases are ‘What happens when’ and ‘back in time’, since they illustrate the hypothetical nature of Fanfiction stories.
  3.  For a length of four, you see more of the hypothetical phrases, including three variations of ‘what happens when’. I also think it’s very interesting that you see different parts of phrases that are >4 words. For example, there is ‘at the end of’ and ‘the end of the’, so I would probably predict the 5 word phrase ‘at the end of the’ would also be very popular.

Final thoughts

I hope you’re convinced that Markov chains are a neat way of analyzing text, even if it’s only to giggle at the gibberish they can produce. Make sure to check out more involved uses of this technique like /r/SubredditSimulator. Also, if you want to see some additional info like the actual algorithm I used, please visit this page. Thanks for reading!

Character choices in Harry Potter fanfiction

Harry Potter fanfiction is something I find pretty interesting. On the site, there are over 700,000 stories, all set within the Harry Potter universe. Not one of the authors expects any sort of financial gain, although it is possible for popular authors to get published. E.L. James, the author of 50 Shades of Grey, got her start writing fanfiction for the Twilight series. Say what you will about either of those series, but there’s no denying their popularity.

The stories on the site, also called “fics” or “fanfics”, vary wildly in length, from poems to short stories to full-length novels. There are even a few fics that contain more words than all seven Harry Potter books put together. It’s also surprising that the community is still active to this day, considering it’s been eight years since the last book came out and four years since the last movie.

I decided to take a look at the meta-information available on the site for these stories. When browsing for fics, you can get lists of descriptions containing the title, author, genre, length, number of reviews, etc. There’s also a short summary written by the author. I wrote a script that scrapes this information from the search results, using the Python lxml module for HTML parsing. I randomly sampled 200,000 fics, so roughly 28% of the total. Here are a few stats from the dataset:


I think these are some fairly impressive numbers. Of course, these statistics are a poor summary of the actual data, but for my first post with this dataset, I wanted to look at character choices.

Most popular single characters

Each fic can list up to four characters from a list of 375. Harry Potter has a lot of supporting characters. The first thing I did was count how many times each individual character appeared in a fic and sorted the result. Below are the most popular character choices.


Top 25 character choices. Note that the percentages sum to >100% since each fic can have multiple characters


It should come as no surprise that Harry is at the top of the list. I do feel a bit bad for Ron, though. It appears that Draco has taken what could arguably be called his spot, and Ron barely made it higher than “OC”, meaning an original character. I think there are multiple reasons for this. First, a large percentage of Harry Potter fanfics are romances: 55% of fics in my sample carry the “Romance” genre label, with the next highest being “Humor” at 21%. I think Draco has that bad boy appeal that makes him popular in romances. I’ve also found that a lot of fics are “do-overs”, i.e. fan re-imaginings of the original plot. In those stories, Ron can be unpopular. The community even has a term, “bashing”, for treating a character badly.

The other rankings aren’t too surprising. I kinda wish Fred and George were next to each other, but I understand since there’s also a popular type of story that picks up where the books ended. As an example, if anyone reading this is a Harry Potter fan, I recommend the short story Cauterize by Lady Altair.

Pairings and multiple character groupings

The next thing I wanted to look at was how characters are grouped together. As I mentioned, romance is a very popular genre label, so the most popular grouping is obviously two characters. The community refers to this as a “pairing”, and if a fan likes a particular pairing, they “ship” that pairing. It’s short for “relationship” (used as both a noun and a verb for some reason). There can be fervent debates about “ships” on sites like /r/harrypotter, so I think it’s an interesting thing to look at.

As an aside for those familiar with the site, I’ve ignored the use of square brackets [], which are supposed to explicitly denote a romantic pairing. Only 3.2% of fics in my sample use brackets, which goes up slightly to 4.5% if you only consider fics with a Romance genre label. Thus, I found it easier to just ignore them.

To do the analysis, I counted how often each combination of characters occurred and grouped the possible combinations by numbers of characters, i.e. a “double” indicates a fic with two and only two characters. These pairings may be romantic or platonic, but differentiating these cases is impossible from my dataset, since even the summary text may not indicate which pairings are romantic and which aren’t. Regardless, this is such a large dataset that I think the overall trends are still clear.
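Since order within a fic’s character list doesn’t matter, the natural key for counting groupings is a frozenset. Here’s a sketch of the tally (the function name and input format, a list of character-name lists, are my own illustration):

```python
from collections import Counter

def grouping_counts(fics):
    """Count each exact character grouping.

    `fics` is a list of character lists (1-4 names per fic).
    Using frozenset as the key makes ["Draco", "Hermione"] and
    ["Hermione", "Draco"] count as the same pairing.
    """
    return Counter(frozenset(chars) for chars in fics)
```

Grouping the resulting keys by `len(key)` then separates solos, doubles, triples, and quadruples for the pie chart below.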


Pie chart of character percentages from a sample of 200,000 fics. The color indicates the number of characters, which are broken up into the most popular character choices for each.


A few observations:

  1. Pairings are definitely the most popular type of fic, with Draco pairings claiming the top two spots. I was actually surprised that canon pairings are as popular as they are (Ron/Hermione, Harry/Ginny, etc.).
  2. The “other doubles” category is the single biggest chunk of the graph, but these are all pairings that each make up <1% of the total. In my dataset, there are 2,779 unique pairings, and I’m only showing 12. Of course, this is only a subset of the 70,125 possible pairings given 375 available characters.
  3. Of the solo acts, Snape is the second most popular (after Harry, of course). This seems appropriate to me, since Snape strikes me as a lone wolf character.
  4. The “golden trio”  of Harry, Hermione, and Ron is unsurprisingly the most popular choice for three characters. The value 0.1% may seem small, but remember we’re talking about 700,000 fics, so 0.1% is still in the hundreds.

Summary analysis for different pairings

The last thing I wanted to do was see what particular pairings have in common, if anything. The method I chose was comparing the most popular words used in the summaries. I decided to look at just two pairings, specifically the top two that didn’t have a common character (Draco/Hermione and James/Lily). To make the comparison, I took the 100 most popular words in the summaries of each pairing (using the same algorithm as Wordle to count words) and clustered the words by whether they were common to both pairings or not. A word “unique” to one pairing may still appear in the other pairing’s summaries; it’s simply not in that pairing’s top 100. The resulting “Venn diagram” is shown below. Note that I removed explicit mentions of the characters involved since they dominated the counts. For example, I removed words such as Draco, Draco’s, Malfoy, etc. Also, I limited the analysis to fics written in English for obvious reasons.
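The clustering step boils down to set operations on each pairing’s top-word list. Here’s a sketch; the stopword list shown is a tiny placeholder for the full common-words list mentioned below, and the Wordle-style counting is reduced to a plain lowercase tally:

```python
from collections import Counter

# Abbreviated placeholder; the real analysis uses a much longer list.
STOPWORDS = {"the", "a", "an", "and", "of", "to"}

def word_venn(summaries_a, summaries_b, top=100):
    """Split each group's top words into shared and unique sets."""
    def top_words(summaries):
        counts = Counter(w for s in summaries for w in s.lower().split()
                         if w not in STOPWORDS)
        return {w for w, _ in counts.most_common(top)}
    a, b = top_words(summaries_a), top_words(summaries_b)
    return a & b, a - b, b - a  # shared, unique to a, unique to b
```

The three returned sets correspond directly to the gray center, the blue left side, and the red right side of the figure.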

“Venn diagram” word cloud showing the most popular words in the summaries of fics with either a Draco/Hermione or James/Lily pairing. The font size for each word is proportional to the total number of times that word appears in summaries with these pairings. Gray words in the center are the most common words that appear in both pairings, blue words on the left are the most common words that appear in Draco/Hermione fics, and red words on the right are the most common words in James/Lily fics.


The Venn diagram look may not have panned out as I had originally hoped, but the information is still interesting, even if it does look like a Pepsi logo.

A few observations:

  1. The top words in both pairings tend to be shared, i.e. there are more gray words than blue or red. This isn’t unique to Draco/Hermione vs. James/Lily: words like love, Hogwarts, and year are common to many types of pairings. There are also common English words like up and out in the center. I automatically remove the most common English words (Wordle does the same), but these two weren’t on the list of common words I found online.
  2. You can see that the Draco/Hermione pairing is more popular than James/Lily since its unique words are larger overall. To scale the font size of the words in the center, I averaged the counts from both pairings.
  3. The most common category of unique words is shorthand names for the pairing, e.g. DMHG or LJ. The word Dramione is a portmanteau of Draco and Hermione. I’m not sure if there’s one for James/Lily yet, but my vote is for Limes.
  4. I don’t think the list of unique words is enough to make any claims of thematic differences between the pairings. For example, I could speculate that many Draco/Hermione fics are Romeo-and-Juliet-style stories of star-crossed lovers, whereas James/Lily stories could have the “will-they-or-won’t-they” trope. There might be hints of this (secret and past for Draco/Hermione; finally and hate for James/Lily), but this isn’t enough to make any strong conclusions. I might look at popular phrases/groups of words to really get at this question.

Final thoughts

Character choices in Harry Potter fanfiction can be considered both highly variable (6,950 unique character groupings from a series with essentially three main characters) and highly regular (a randomly selected fic has a 25% chance of having Harry, Hermione, or Draco as a character). I hope to do more analyses like these, and I thought character choices were a good place to start because they’re an easy dimension for clustering fics together. Next, I hope to do more with the summaries, perhaps using Markov chains to generate pseudorandom summaries like the posts in /r/SubredditSimulator. Please leave suggestions below, and thanks for reading!

Analysis of abstracts from a paper library

Author’s note: This is an old journal entry from August 15, 2014.

One thing I really respect about my PhD advisor is his effort to stay up-to-date with the recent literature. At conferences, I always see him talking to several different grad students during the poster sessions, whereas it seems like most senior scientists make it to one or two posters before talking to other PIs. He also keeps an extensive library of journal articles that gets updated regularly. I don’t know if he actually reads every new paper, but I’m nevertheless jealous of his ability to find them. So far I’ve been dissatisfied with services designed to notify me of new articles, whether it’s an RSS feed of a major journal or an email search alert from PubMed/Google Scholar. These services are either too strict and miss relevant articles, or too lax and return way too many results. A new service called PubChase holds promise, but I don’t know how well it works. Regardless, I wanted to see if I could figure out a better way to find new, relevant articles. The first step: analyzing my advisor’s library of papers.

Getting the raw data

My advisor’s paper library has over 13,000 files in it, and I certainly did not want to open every file in order to get the raw text from the titles and abstracts. Endnote provided a way to automate this, although it wasn’t successful in extracting the abstracts from every article. I did, however, manage to create a huge text file with the titles and abstracts of 6,687 journal articles. This process was likely biased toward newer papers, since I don’t think PubMed can pull abstracts from scans, but that frankly doesn’t bother me as long as the sampling was unbiased with respect to each article’s topic, which is hopefully the case. To begin, I used my code based on the Wordle algorithm (see previous post) to identify the 500 most common words and their relative usages. This counting ignores common English words as well as a list of special words, which I omitted somewhat arbitrarily after deciding they would be bad at identifying a paper’s unique content. For example, words like “results”, “effects”, “suggest”, “show”, and “significantly” could show up in any abstract regardless of the topic. Also, I counted each word in the title 3 times in order to give the title more weight, a technique used by PubMed in its search algorithm. Shown below is a word cloud of the final set of 500 words used for clustering.
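As a rough sketch of this counting scheme (the stopword and special-word lists below are illustrative stand-ins, not my actual lists), weighting title words three times looks like:

```python
from collections import Counter

STOPWORDS = {"the", "and", "of", "in", "a", "to", "for"}          # common English words
SPECIAL = {"results", "effects", "suggest", "show", "significantly"}  # nonspecific science words

def count_terms(title, abstract, title_weight=3):
    """Count words in one paper, with title words weighted 3x (as PubMed does)."""
    counts = Counter()
    for text, weight in ((title, title_weight), (abstract, 1)):
        for word in text.lower().split():
            word = word.strip('.,;:()"')
            if word and word not in STOPWORDS and word not in SPECIAL:
                counts[word] += weight
    return counts
```

Summing these counters across all 6,687 papers and keeping the 500 most common words gives the vocabulary used for clustering.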

Word cloud of 500 most common words in extracted abstracts and titles. The font size is linearly proportional to the number of occurrences. Words appearing in an abstract were counted once, but three times in a title.


Clearly, some of the words are too small to read, thanks to the enormous number of occurrences of “auditory” (14,431, if you were curious). This illustrates why it is important to scale each word’s weight by its overall usage. Specifically, for document i=1,2,\ldots,N and term j=1,2,\ldots,M, the total weight W_{i,j} was calculated as the product of the global weight G_j of term j and the local weight L_{i,j} of term j in document i. These weights were calculated as follows, described in more detail on the PubMed website:

W_{i,j} = L_{i,j}G_j,
G_j = \sqrt{\ln \left(\frac{N}{n_j}\right)},
L_{i,j} = \frac{10}{1+e^{\alpha\ell_i}\lambda^{k_{i,j}-1}}.

N is the total number of documents (6,687), n_j is the number of documents term j appears in, \ell_i is the total number of words in document i (or 250, whichever is larger), and k_{i,j} is the number of times term j is in document i. The constants \alpha = 0.0044 and \lambda = 0.7 were given by PubMed. The total number of terms M was set at 500, resulting in a set of 6,687 feature vectors of length 500 to be clustered.

Note that the above equations have two changes from the description on the PubMed website. First, the third equation has a factor of 10 in the numerator instead of 1. I added this because the local weights in my dataset were originally much smaller than the global weights, which skewed the clustering process. The second change was adding the square root in the second equation, made for the same reason. I don’t know why PubMed’s global weights are smaller, but perhaps it is because their database of documents is much larger.
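Putting the modified weighting scheme into code, with my two changes included (a sketch; it assumes the term actually appears in the document, i.e. k_{i,j} ≥ 1, since absent terms simply get weight zero):

```python
import math

ALPHA, LAM = 0.0044, 0.7   # constants given by PubMed

def global_weight(N, n_j):
    """G_j = sqrt(ln(N / n_j)), with my added square root."""
    return math.sqrt(math.log(N / n_j))

def local_weight(k_ij, doc_len):
    """L_ij = 10 / (1 + exp(alpha * l_i) * lam**(k_ij - 1)), with l_i floored at 250."""
    l_i = max(doc_len, 250)
    return 10.0 / (1.0 + math.exp(ALPHA * l_i) * LAM ** (k_ij - 1))

def total_weight(k_ij, doc_len, N, n_j):
    """W_ij = L_ij * G_j."""
    return local_weight(k_ij, doc_len) * global_weight(N, n_j)
```

Note that a term appearing in every document gets a global weight of zero, so ubiquitous words contribute nothing to the feature vectors.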

Choosing the number of clusters

With over 6,000 articles, you can imagine the range of topics covered is quite broad. You can find papers in the library on everything from signal processing to basic anatomy to psychology. There are even oddball papers on topics like particle physics. Any attempt to identify every topic in the library will fall victim to overfitting, but I was confident that I could separate a small number of well-represented topics and identify the words that best distinguish them from each other.

I started with principal component analysis to reduce the dimensionality. The figure below shows the percent of explainable variance as a function of the number of components, and you can see how multidimensional this data set is. We need over 100 components to explain just 50% of the variance! This is a testament to how variable the term usages are between papers.
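I did this in Matlab, but the same cumulative-variance curve can be sketched in a few lines of Python via SVD (random stand-in data below, since the real 6,687 × 500 weight matrix obviously isn’t included here):

```python
import numpy as np

def cumulative_explained_variance(X):
    """PCA via SVD of the mean-centered data; returns cumulative variance ratios."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return np.cumsum(var) / var.sum()

# stand-in for the 6687 x 500 term-weight matrix
rng = np.random.default_rng(0)
cumvar = cumulative_explained_variance(rng.random((200, 50)))

# smallest number of components explaining at least 50% of the variance
n_half = int(np.searchsorted(cumvar, 0.5)) + 1
```

Plotting `cumvar` against the component index reproduces the figure below: a slowly rising curve when the term usages are highly variable across papers.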

Cumulative variance explained by principal components


To help decide on the number of components to include, and eventually to perform the clustering, I used Matlab’s implementation of the EM algorithm for Gaussian mixture models. This allowed me to make some judgments on relative model quality using the Akaike information criterion (AIC). This value decreases as the likelihood of the model increases but contains an added penalty for increasing the number of free parameters, so a lower AIC value is desirable. You can see below how the AIC value changes as a function of both the number of components and the number of clusters.
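My code was in Matlab, but the equivalent AIC sweep in Python with scikit-learn’s GaussianMixture (which also fits by EM) would look roughly like this, again with random stand-in data for the PCA scores:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
scores = rng.normal(size=(400, 15))    # stand-in for 15 PCA components per paper

aic = {}
for k in range(1, 9):                  # 1 to 8 clusters
    gmm = GaussianMixture(n_components=k, random_state=0).fit(scores)
    aic[k] = gmm.aic(scores)           # lower is better
best_k = min(aic, key=aic.get)
```

Repeating this loop for each candidate number of components produces one AIC-vs-clusters line per component count, as in the figure below.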

Akaike information criterion (AIC) for Gaussian mixture model as a function of the number of clusters. Each line represents a different number of principal components used.


There are two obvious trends. First, increasing the number of components appears to increase the AIC value in an approximately linear fashion (the lines appear offset from each other), which is due to the increasing number of free parameters. Second, increasing the number of clusters appears to decrease the AIC, which is due to the increasing likelihood of the model. Another interesting trend appears when you normalize these lines by their maximum AIC value, as shown below.

Values of Akaike information criterion (AIC) as a function of the number of clusters. The values are normalized by the maximum value for each line, i.e. when every paper was assigned to a single cluster.


Here you can see that the AIC value decreases at the same relative rate when the number of components is ≥15, implying that adding components beyond this point will not significantly improve the likelihood. Therefore, I chose 15 components for the final clustering. These components represent approximately 19.5% of the variance.

Finally, to choose the number of clusters, I used the admittedly subjective elbow method to select a value of 8. As the above figure demonstrates, adding clusters beyond this point still decreases the AIC value, but not by much.

Determining cluster keywords

The final result is a set of 8 clusters, each with 500-1200 papers. To identify what each cluster’s “topic” might be, I summed the feature vectors for every paper within a cluster and sorted the result in order to obtain the words with the highest weights for each cluster. I then created a word cloud with the top 20 words in each cluster, shown below, where the color indicates cluster identity.
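That summing-and-sorting step is simple; here is a sketch (the variable names are mine, not from my actual code):

```python
import numpy as np

def cluster_top_words(W, labels, vocab, k=20):
    """For each cluster, sum its papers' feature vectors and return the
    k words with the highest summed weights."""
    tops = {}
    for c in np.unique(labels):
        summed = W[labels == c].sum(axis=0)      # total weight per term in cluster c
        order = np.argsort(summed)[::-1][:k]     # indices of the k largest weights
        tops[c] = [vocab[j] for j in order]
    return tops
```

Here `W` is the documents-by-terms weight matrix, `labels` holds each paper’s cluster assignment, and `vocab` maps column indices back to words.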

Word cloud showing the top 20 words within each of the 8 clusters of papers, calculated by summing the feature vectors across members of a cluster. The color indicates cluster identity, and the font size is proportional to the summed weight of the word.


Before making any claims about how the words are related within clusters, there are several general observations I’ve made:

  1. The weights are fairly evenly distributed, i.e. all the words have a somewhat similar font size. This is especially true when compared to the earlier word cloud constructed from the raw word counts.
  2. The top words from the earlier word cloud (auditory, cochlear, and neurons) are still prominent in several of the clusters here, indicating that despite scaling for the overall usage, these words still occur often enough to produce a large summed weight within that cluster.
  3. Different forms of words are commonly grouped together, e.g. cell and cells, implant and implantation, inhibition and inhibitory, etc. This could indicate that these words co-localize in the same or very related documents. The white cluster (middle-left) even has a triplet: neurons, neural, and neuronal.

Describing the clusters

The real question is whether the top twenty words in each cluster actually say something about the papers in that cluster. If the clustering process truly separated the papers by topic, then we would expect that to be the case. You can see in the word cloud that there are certainly words repeated between clusters, but I do think each cluster’s words can be interpreted as an overarching theme. There are bound to be outliers in each cluster (where do the particle physics papers go?), but I think the eight clusters break down into the following groups:
  • Cluster 1 (dark red, top left): Auditory neurophysiology, electric hearing, cochlear implants. This cluster seems to focus on neural responses (response, activity, evoked) in auditory centers (auditory nerve, nucleus, cortex, inferior colliculus) from electric stimulation (electrical, stimulation, cochlear implant). Compare this to Cluster 8, which seems to focus on the speech recognition performance of CI users, or Cluster 3, which seems to involve neural responses due to acoustic stimulation.
  • Cluster 2 (dark blue, top middle): Auditory neurophysiology, binaural hearing. This cluster certainly involves binaural neural processing (both binaural and interaural are present), especially in the inferior colliculus (note the abbreviation IC). It is also the only cluster with inhibitory, which certainly has an important role in binaural processing.
  • Cluster 3 (orange, top right): Auditory processing, neuroimaging. This cluster seems to focus on higher-order auditory processing (speech, pitch, perception, complex), so it likely contains the neuroimaging papers. This is supported by the presence of the anatomical terms: primary, cortex, cortical, but there’s nothing to suggest neurophysiology at the cellular level.
  • Cluster 4 (white, center left): General systems neuroscience. This cluster appears to involve general principles of neuroscience (spike, information, model, synaptic), including the aforementioned trifecta of neurons, neuronal, and neural. With regard to anatomy, this cluster is probably focused more on the neocortex (cortex, cortical), since no brainstem terms are mentioned. This is also the only cluster with a non-auditory word (visual), which is understandable given that a larger percentage of papers on the visual system involve cortical processing than in the auditory system.
  • Cluster 5 (yellow, center left): Cochlear implants, psychophysics. This cluster definitely focuses on cochlear implants (implant, implantation, CI, pulse), but also has many classic psychophysical terms (subjects, listeners, masking, noise). Compare this to Cluster 8, which seems to focus more on CI performance and less on the basic psychophysics.
  • Cluster 6 (blue, bottom left): Inner ear biology, auditory periphery. This cluster seems to focus on the periphery (cochlea, hair cells, cochlear nucleus) with perhaps more of a biological focus than some of the neural coding clusters (synapses, membrane, synaptic, cell). However, there is certainly a neural component (nerve, fibers).
  • Cluster 7 (light brown, bottom middle): Psychophysics. This one is definitely a psychophysics cluster; pretty much every word is indicative of this (masking, cues, thresholds, detection, target, signal, noise). There is also a binaural component (binaural, interaural, localization, time, level).
  • Cluster 8 (light blue, bottom right): Cochlear implant performance. This cluster certainly involves cochlear implants (electrode, stimulation, multichannel), but compared with the other two putative CI clusters (1 and 5), this cluster seems to focus on how well CIs actually work (speech, recognition, perception, performance, scores). It is certainly the most humanized cluster (children, patients, subjects, users). Given both children and age, this cluster likely also contains papers on how CIs affect development.

Final thoughts

While my research involves cochlear implants, I am still somewhat surprised that 3 of the 8 clusters seem to include them, since they’re really only one aspect of my advisor’s interests. I was, however, pleased to see that each CI cluster seems to have a separate focus (neurophysiology, psychophysics, or general performance). This could indicate that the clustering worked well, although the notion of a single topic per cluster is certainly something I imposed arbitrarily. Overall, I think the eight topics cover my advisor’s research interests nicely, and I bet he’s even been an author on a paper in each cluster. In the future, it would be interesting to approach this problem with hierarchical clustering, since it might reveal subtopics within the larger clusters. For example, Cluster 7 might separate into topics on localization and signal detection. It would also be interesting to see where the outliers were assigned, since these 8 topics certainly don’t cover every paper in the library. Regardless, I think this was a fun and informative exercise. I did make several new PubMed email alerts as a result, but I’ll have to wait a few weeks to see how well they work. Who knows, maybe I’ll even read a paper or two, instead of just thinking of ways to find them.

Update: I set up around 5 PubMed alerts as a result of this work, each containing 3-5 search terms. They’ve been pretty effective at alerting me to papers, and I definitely prefer them to reading a bunch of tables of contents from multiple journals. Now if only I had the motivation to read all those papers…


Word Clouds!

Author’s note: This is an old journal entry from July 29, 2014 that I never did anything with, so now it’s going to become my first blog post.

I recently discovered Wordle, a site that allows you to create beautiful word clouds: graphics made of words where the font size is proportional to how often each word appears in a given body of text. I thought it would be fun to use Wordle to analyze some abstract books from scientific conferences, but that may have been a bit overambitious. Who knew you could crash Chrome by trying to copy and paste 15,000+ pages of text?

Undeterred (i.e. with too much time on my hands), I decided to write some Matlab code to make my own. This allowed me to analyze very large bodies of text and also gave me more control over the final graphic. I had to learn some regular expressions as well as use Matlab’s Computer Vision Toolbox in a manner for which it certainly wasn’t designed, but hey, it works. Below are two word clouds showing the 200 most common words in the ARO 2014 and SfN 2013 abstract books, respectively. ARO stands for the Association for Research in Otolaryngology, and it’s definitely the largest conference devoted to hearing research. It pales in size, however, compared to the annual SfN meeting (Society for Neuroscience), which draws about 30,000 people.

Word cloud from ARO 2014


Word cloud from SfN 2013


To make the word clouds, I simply counted word frequencies like Wordle, but in addition to ignoring common English words (“the”, “and”, etc.), I also removed nonspecific scientific words like “abstract” and “methods.” The result certainly isn’t as pretty as those from Wordle, but look, a brain! and sort of a cochlea! The clouds from Wordle have much nicer inter-word spacing due to the way they handle collision detection, but I had to be more flexible since I wanted the words to fit into an arbitrary shape.

Observations so far? Despite claiming the #3 spot in the ARO cloud, the word “auditory” barely made it into the SfN cloud, achieving spot #196 and losing terribly to “visual” at #47. Oh well, at least the sensory neuroscientists can be bitter together, since they were all beaten by other systems, namely #19 “memory” and #21 “motor.” In case you were wondering, the word “neurons” was used 17,452 times in the SfN abstract book, an order of magnitude more than the #1 word in the ARO cloud: “cells,” which was used 1,846 times. This isn’t all that surprising given the huge size of SfN relative to ARO.

Lastly, for a summary of the data, here are some basic counts on the text analysis.