The mathematics of random walks: skill ratings in Overwatch

UPDATE: Added a new condition below

Overwatch is a very fun team-based shooter with an interesting skill rating (SR) system. Every player has an SR, and a matchmaking system tries to create fair games by assembling teams with similar SR. Winners get an SR increase and losers get a decrease, but the amount can differ from player to player. A hidden algorithm picks the actual value based on both the expected outcome of a match and individual performance.

For this post, I wanted to look at the contributions of skill vs. chance in the way SR can change over time. After all, players should theoretically only be able to win if they're more skilled than their opponents, but it's possible for the matchmaking system to place someone with very good teammates, leading to a win even though that person's skill is low. The opposite is possible too, where a person should win a game but doesn't because of their teammates. These two scenarios shouldn't matter in the long run, but they still give rise to terms like "Elo hell" (named for the Elo rating system), which refers to the supposed impossibility of climbing out of the lower ranks to one's "true" SR. But does Elo hell actually exist?

Random walks

The way in which SR climbs and falls can be described by the concept of a random walk. For a simple example, say you’re standing on a sidewalk and flip a coin. For heads, you take a step forwards, and for tails, you take a step back. Where do you end up after 100 coin flips? 1000? This concept of a random walk can be used to describe things like stock prices and the physics of diffusion.

To put an example in the context of Overwatch, say your SR is right in the middle at 2500 out of a maximum 5000. If you have a 50% chance to win every game and play 300 games over the course of the 90 day season, what is your final SR? How likely is it that you climb 500 SR, which is the amount needed for the next rank? (Overwatch has ranks at different tiers called gold, platinum, diamond, etc.). I tried to answer these questions with Monte Carlo simulations, meaning I simulated a large number of season SR paths and calculated probabilities from the distribution of results.

To demonstrate, here are some simulated SR paths where the probability of winning was a constant 50%, starting from 2500 SR and assigning +25 SR for wins and -25 SR for losses. Each path has a different color in this plot, and the maximum season SR over 300 games is marked with a dot.

You can see each of these SR paths peaked at a different time during the season. Since the next rank up (diamond) starts at 3000 SR, only the blue path made it to diamond in this simulation.

If I repeat this process many many times, I can take the maximum SR of each path and get a distribution of season peak SRs. Below is a histogram of this distribution for 100,000 random walks when the win probability was always 50%.

You can see from this distribution that most paths had a peak below 3000 SR, i.e. they did not climb to diamond. But a significant fraction still did (0.25 or 25% in this case). This fraction is what I use below to estimate the probability that a player can climb a complete rank through pure chance. That’s the power of Monte Carlo simulations.
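For the curious, here's a minimal sketch of this kind of Monte Carlo estimate (my own illustration in NumPy, not the actual simulation code; the parameters match the numbers quoted above):

```python
import numpy as np

rng = np.random.default_rng(0)

n_paths, n_games = 100_000, 300   # seasons to simulate, games per season
start_sr, sr_per_game = 2500, 25  # starting SR and SR gained/lost per game

# +1 for a win, -1 for a loss, each with probability 0.5
outcomes = rng.choice([1, -1], size=(n_paths, n_games))

# SR after each game, starting from 2500
paths = start_sr + sr_per_game * np.cumsum(outcomes, axis=1)

# peak SR of each simulated season
peaks = paths.max(axis=1)

# fraction of seasons that touched diamond (3000 SR) through chance alone
print((peaks >= 3000).mean())
```

With these numbers, the fraction comes out near the ~25% quoted above.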

Making the simulations more accurate

One thing to notice in the simulations above is the tendency for the paths to diverge, i.e. if a path shoots up after a string of wins, there’s nothing to force the SR back to 2500. In reality, if a player gets many consecutive wins, they’ll likely find themselves playing with much more skilled players, which should eventually decrease their rank back to the “true” value. The difficult thing to know is whether a climb is due to a true increase in skill or merely chance. In a simulation, though, these are easy factors to separate.

To see the contribution of chance to climbing, I compared three win probability conditions:

  1. Flat – The win probability was always 50%, i.e. a purely random walk.
  2. Shallow – The win probability decreased linearly by 5% for each 500 SR change.
  3. Steep – The win probability decreased linearly by 10% for each 500 SR change.

So for the shallow and steep conditions, there will be a normalizing force that will tend to drive the SR paths back to 2500 once they start to diverge.

Here’s what these win probabilities look like as a function of SR. The lines are centered at 2500, since that was the assumed “true” rank for these simulations.
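In code, all three conditions can be captured by a single linear function of SR. This is just a sketch of my reading of the curves described above (the clipping to [0, 1] is my own assumption):

```python
def win_probability(sr, slope_per_500, true_sr=2500):
    """Win probability that falls off linearly as SR moves above the true SR.

    slope_per_500 is the drop in win probability per 500 SR: 0.0 for Flat,
    0.05 for Shallow, and 0.10 for Steep.
    """
    p = 0.5 - slope_per_500 * (sr - true_sr) / 500
    return min(max(p, 0.0), 1.0)  # keep it a valid probability

win_probability(3000, 0.10)  # -> 0.40 under the Steep condition
```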

Other Technical Details

  • Points per game – In the simulations above I used a constant 25 SR per game, but for the full set below, I made this value another random number. Specifically, it was a normal random number with μ = 23 SR and σ = 4 SR, i.e. 95% of the values ranged from 15 to 31 SR. These parameter estimates come from a personal data set of about 120 games that I recorded during season 2, and I think they're still accurate for more recent seasons. (A sketch combining this with the win probability curves follows this list.)
  • Games per season – Here I used another three conditions: 100, 300, and 500 games per season. Since each season is ~90 days, I think these values span the range from casual to avid Overwatch players. If anything, 500 games might be an underestimate, since I’ve seen streamers with >1000 games in a season, so >11 games per day. That’s a lot of Overwatch.
  • Draws – I decided to leave out draws for the time being, because not only are they rare, but Blizzard is continually making changes in order to make draws as rare as possible. In my season 2 data set, <8% of my games were draws, and this was before Blizzard made their changes.
  • Streaks – Blizzard has a streak system in place to allow players very far away from their true SR to climb quickly. However, the details of this system are unknown, and it was also recently changed to become less significant. Therefore, I decided not to consider streaks.
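Putting these details together, one game of the fuller simulation might look like the sketch below (my own structure, reusing the win_probability helper sketched earlier; the normal draw uses the μ = 23, σ = 4 estimates from my season 2 data):

```python
import numpy as np

rng = np.random.default_rng(1)

def play_one_game(sr, slope_per_500, true_sr=2500):
    """Advance an SR path by one game under the assumptions listed above."""
    delta = rng.normal(23, 4)  # SR gained or lost this game
    if rng.random() < win_probability(sr, slope_per_500, true_sr):
        return sr + delta
    return sr - delta

# one 300-game season under the Steep condition
sr_path = [2500]
for _ in range(300):
    sr_path.append(play_one_game(sr_path[-1], 0.10))
```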

Results

To demonstrate the different win probability conditions, here is a sample of 50 paths from the Flat and Steep conditions over 300 games. It’s impossible to distinguish individual paths with so many on the same plot, but you can get a sense of the distribution. The steep paths rarely go above 3000 SR, whereas a few of the flat paths do manage to climb ranks.

To estimate the probability that a path climbs a full rank (500 SR), I generated 100,000 paths for each condition and measured the fraction of maximum season SRs above 3000 SR. The starting point of 2500 SR was arbitrary; these results apply just as easily to a climb from silver to gold (1500 to 2000 SR) or diamond to master (3000 to 3500 SR). The only place I wouldn't trust these results is at the very edges, e.g. a climb from 4000 to 4500 SR, since there are so few players at these high ranks that matchmaking becomes a bit of a mess.

Anyway, here are the results. The numbers reported are fractions, e.g. a value of 0.21 means there is a 21% chance of climbing a full 500 SR through chance alone.

There are two main trends. (1) Playing more games always increases the chances of climbing. It seems very unlikely to climb by playing only 100 games. (2) As the slope of the win probability curve increases (i.e. a stronger normalizing force), it becomes more and more difficult to climb through chance alone.

I was surprised by how large some of these probabilities were. Considering that tens of millions of people have played Overwatch, it certainly seems possible that some people have managed to climb ranks without any significant increase in actual skill.

But what if skill actually increases?

I next wanted to consider the cases when the “true” SR does increase, which should theoretically happen over time with more games played. The easiest way to mimic this in my simulations was to shift the win probability curves to a higher SR, so the point at which the win rate equals 50% will increase over time.

I repeated the previous set of simulations for two more conditions (a code sketch of the resulting drift follows the list).

  1. 500 SR increase – Over the course of 500 games, I linearly increased the true SR by 500 points. This means that for the 100 game season, the true SR only increased by 100 points.
  2. 1000 SR increase – I linearly increased the true SR by 1000 points over 500 games, so by the end of a 500 game season, the true SR will have climbed two complete ranks.
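A minimal sketch of how this drift can be implemented: the true SR simply becomes a function of how many games have been played (my own naming, matching the two conditions above):

```python
def true_sr_at(game, total_increase, start=2500, games_for_increase=500):
    """True SR after `game` games, rising linearly by `total_increase`
    over `games_for_increase` games (500 or 1000 SR over 500 games)."""
    return start + total_increase * game / games_for_increase

true_sr_at(100, 500)   # -> 2600: a 100-game season only gains 100 true SR
true_sr_at(500, 1000)  # -> 3500: two full ranks by the end of 500 games
```

Feeding this value into the win probability curve as its center reproduces the shift described above.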

Just to see what these SR paths look like, here is a sample using the steep win probability curve, this time with 5 paths for each of the rank increase conditions. There’s still a significant amount of individual variation, but overall there’s a gradual increase in SR over time, with a larger increase in the 1000 SR condition.

I next recalculated the probabilities of climbing a single rank with 100,000 simulated SR paths. Here are the probabilities for the 500 true SR increase.

The flat probabilities remained the same, as they should, since shifting a flat line doesn't change anything. The probabilities for 100 games also did not change much, since 100 games was not enough time for the true SR increase to kick in. However, there are significant increases in the 500 game probabilities, e.g. an increase from 5% to 62% for the steep condition. Interestingly, the steep probability actually surpassed the shallow one (62% vs. 47%). This seems counter-intuitive since the steep case allows less variation, but it actually benefits from a true SR increase, because the force driving the SR toward the true value is stronger.

Lastly, here are the probabilities for a 1000 SR increase. Note that the probabilities are still for climbing a single rank, even though the true SR increased by two ranks.

Again, the chances of climbing in only 100 games are practically zero, even with this relatively quick increase in skill. I think this supports the claim that it's quite difficult to climb as a casual player. On the other hand, the chances of climbing with more games are much higher, and practically certain for 500 games. I think the most extreme examples of this are the Twitch streamers that do a "Bronze to Grandmaster" stream. These players climb very quickly, but their true SR is much, much higher than their opponents', which makes winning almost certain.

Conclusions

So does “Elo hell” actually exist? It’s hard to conclude anything certain with this data since this is all essentially a glorified thought experiment. But there are a few points in which I’m fairly confident.

  • Due to the nature of random walks, it’s possible (but unlikely) to climb a full rank through pure chance.
  • Climbing through chance alone is highly sensitive to how the win probability depends on SR. Even a 10% decrease at a higher rank can make it virtually impossible to climb with pure chance.
  • If your actual skill increases over time, climbing is much easier, but still not guaranteed. It’s only certain after many games or large differences in SR, e.g. it’s easy to climb to platinum if you actually deserve diamond.
  • Playing more games always increases your chances of climbing. Said another way: it is extremely difficult to climb when only playing casually, even if you get better over time.

Now onto more speculative conclusions: does Elo hell exist? I think for some people, it does. Of the millions of people playing Overwatch, some unlucky players might have legitimately increased in skill but not climbed due to chance. However, I think such cases are extremely rare, to the point that nearly every person complaining of Elo hell is actually at their true SR and just venting. For those very few unlucky souls where it's actually true, hopefully they just keep playing, because they will inevitably climb after more games.

None of these conclusions are probably surprising to those that play a lot of competitive Overwatch, but I still think it’s useful to consider how big a role chance can play in a skill-based game. I was certainly surprised at how high some of these probabilities were. For another perspective on this skill vs. chance trade-off, check out this super interesting video from Vox on professional sports. Or just play more Overwatch. After all, you can’t git gud if you don’t play.

UPDATE:

After getting some feedback, I decided to run one additional set of simulations. Specifically, what happens when you start the season with a true SR a full rank above your actual SR, i.e. your true SR is 3000 but you're starting from 2500? I kept the true SR at 3000 for the entire season and recalculated the probabilities of the actual SR reaching 3000.

I also added another win probability condition, called Very Steep, which has twice the slope of the Steep condition, i.e. the win probability decreased linearly by 20% for each 500 SR change. I think the true shape of the win probability curve is more complicated than these simple lines, but they do illustrate a spectrum of normalizing forces. In reality, the curve is likely shallow for nearby SR, steep for larger differences, and then very steep near the edges.

Here are the probabilities for climbing when the true SR starts 500 points above the actual SR and remains constant throughout the season.

Interestingly, even with the very steep condition, the chances of climbing after 100 games are still <50%. I believe this very steep condition overestimates the normalizing force (especially near one's true SR), so keep that in mind.

The story changes quickly with more games: the chances of climbing are much higher after 300 games and especially after 500 games. This supports my earlier conclusion that even if you're getting better at the game but haven't climbed yet, it just takes more games.

The Ship Wars Part II: Another look at the most popular pairings

In my post last week, I took a look at how the most popular pairings in Harry Potter fanfiction changed over time. Specifically, I looked at the rate at which fics were published. This time, I wanted to look more closely at a wider range of pairings, as well as other measures of popularity (i.e. favorites, follows, and reviews).

The data

The dataset is the same as last week's, i.e. the descriptive info from all 724,315 Harry Potter fanfics I could scrape from Fanfiction.net. From these, I looked only at fics with either (a) exactly two characters listed, or (b) more than two characters where pairings were explicitly marked with brackets. This narrows the set to 505,406 fics consisting of 5,414 unique pairings.
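As a rough illustration of the filtering step, bracketed pairings can be pulled out of a fic's character list with a small regex. This sketch assumes the character field is a string like 'Harry P., [Draco M., Hermione G.]' (the exact format in my scraped data may differ):

```python
import re

def extract_pairings(character_field):
    """Return explicitly bracketed pairings from a character list string."""
    pairings = []
    for group in re.findall(r"\[([^\]]+)\]", character_field):
        names = sorted(n.strip() for n in group.split(","))
        # canonical (sorted) order so 'A/B' and 'B/A' count as the same pairing
        pairings.append("/".join(names))
    return pairings

extract_pairings("Harry P., [Draco M., Hermione G.]")
# -> ['Draco M./Hermione G.']
```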

Below are the 20 most popular of these pairings (in terms of number of fics). At the top are Draco/Hermione and Draco/Harry, which have more than 50,000 fics each. To give you a sense of that number, this means that there were, on average, more than 8 fics published per day. For 17 years.

To take a closer look at other popular pairings, I decided to consider the top 100 ranked pairings. These pairings each have more than 500 fics and together make up a large majority of all the fics with pairings. Here are some stats for the top 100 pairings.

Therefore, the bottom 98% of pairings only account for ~15% of fics. That's some serious inequality, but sadly I don't think anybody's going to be protesting for more Harry/Stan Shunpike fics. Even though Starry is an amazing portmanteau for a pairing. #OccupyDiagonAlley #MakeStarryHappen.

Pairings with the most favorites

The next thing I looked at was the average number of favorites, follows, and reviews that each pairing receives. These average values were highly correlated with each other, so for the sake of simplicity I'm only showing the number of favorites. Sorting the pairings by follows or reviews does switch up some of the rankings, but there is a large degree of overlap.

Here are the top 20 pairings with the most favorites per fic, i.e. the highest average values.

I think there are a few interesting things to point out:

  • Daphne/Harry and Fleur/Harry have ~800 and ~600 fics, respectively, but still have averages of >500 favorites. This is impressive, considering that over 98% of all fics have fewer than 500 favorites.
  • Two of the top 10 most popular pairings made it to this list: Harry/Severus (#9) and Harry/Hermione (#4). This is also impressive, since their average values are from ~16,000 and ~25,000 fics, respectively.
  • Every pairing on this list has either Harry or Hermione in it. Also, many of these are with significantly older men, which is, well, interesting. Of course, fics with pairings may not necessarily be romantic in nature, but I’m guessing a fair amount still are.

Case study: Daphne/Harry vs. Bellatrix/Rodolphus

It may seem like high average values are inflated by a small number of fics with thousands of favorites, and it is true that these distributions are highly skewed, e.g. most fics have only a few favorites. To look at this more closely, below I'm showing the distribution of favorites for two pairings with very similar fic counts: Daphne/Harry (#72) and Bellatrix/Rodolphus (#71). Besides their similar fic counts, I picked Bellatrix/Rodolphus because its average is quite low: ~10 favorites per fic.

This plot is a cumulative distribution plot, i.e. for a given x-value (number of favorites), the corresponding y-value is the fraction of fics with that number of favorites or less. So for a value of 10 favorites, Daphne/Harry has a value of 0.04, meaning 4% of Daphne/Harry fics have 10 or fewer favorites. The value for Bellatrix/Rodolphus is 0.8, so a full 80% of fics have 10 or fewer favorites. Also, note the x-axis has a log scale, so the values range from 1 to 10,000.
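For reference, a cumulative distribution like this is easy to compute by hand. Here's a minimal sketch, where favorites would be the array of favorite counts for one pairing's fics:

```python
import numpy as np

def ecdf(values):
    """Empirical CDF: fraction of fics with that many favorites or fewer."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# x, y = ecdf(favorites)
# import matplotlib.pyplot as plt; plt.semilogx(x, y)  # log-scaled x-axis
```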

You can see the distribution of favorites for Daphne/Harry is shifted to the right (i.e. it has more favorites). Around 99% of Bellatrix/Rodolphus fics have <100 favorites, whereas only 40% of Daphne/Harry fics do. In fact, ~20% of Daphne/Harry fics (~170 fics) have more favorites than the highest-rated fic for Bellatrix/Rodolphus (634 favorites). So there are many highly-rated fics for Daphne/Harry that help to increase its average.

Why are certain pairings so highly rated?

There are many valid answers for why one pairing might receive more favorites/follows/reviews than another. There’s always the obvious one: readers probably enjoy some pairings more than others, at least enough to favorite those fics more often (I certainly do). But there are two other aspects I want to explore: length and timing.

Possibility 1: Length

This one is fairly self-explanatory. Across all fics, there is a significant correlation between length (in words) and the number of favorites. This also holds true when considering the average values for these top 100 pairings. So below I’m showing the top 20 pairings by average fic length. There is significant overlap between this list and the one sorted by most favorites.

As a sidenote, I think it’s interesting to point out that half of these pairings are OC pairings (original characters). I wonder if OC fics tend to be longer because they have to include more character development instead of starting with established characters.

Possibility 2: Timing

In my last post, I showed that the popularity of pairings can change over time, sometimes to the point where previously popular pairings seem to disappear. Conversely, there is something interesting when you look at pairings that have been popular only recently. Below I have the top 20 pairings ranked by what I call the date "center of mass", i.e. a weighted average of the publication dates of each pairing's fics. The higher the "center of mass", the more recently popular that pairing has been.
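Here's a minimal sketch of how such a "center of mass" could be computed, assuming a pandas DataFrame with one row per fic and columns named 'pairing' and 'published' (both names are hypothetical):

```python
import pandas as pd

def date_center_of_mass(df):
    """Mean publication date (in fractional years) for each pairing."""
    years = df["published"].dt.year + df["published"].dt.dayofyear / 365.25
    return years.groupby(df["pairing"]).mean().sort_values(ascending=False)

# date_center_of_mass(df).head(20)  # the 20 most recently popular pairings
```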

One thing to notice here is the number of next-generation Harry Potter characters, which makes sense since these characters were only introduced in the epilogue of the final book. However, several pairings appear on both this list and the highest-rated list, namely the top two of Daphne/Harry and Fleur/Harry.

Case study: Daphne/Harry vs. Cho/Harry

To see these timing differences more clearly, below I compare two pairings with similar numbers of total fics but very different trends over time: Daphne/Harry (#72) and Cho/Harry (#73). This graph shows the number of fics published each year, so you can see how their popularity has changed over time.

The Cho/Harry pairing peaked in popularity in 2003, likely due to the release of Harry Potter and the Order of the Phoenix, the first (and last) book in which this pair was together romantically. On the other hand, Daphne/Harry has only become popular very recently, leading to its high "center of mass" value.

It's possible that there is some degree of favorite "inflation," i.e. readers are more likely to favorite stories in recent years than in the early 2000s. On the other hand, older fics have had more time to accumulate favorites if we assume the rate of favoriting is roughly constant.

Overall, there isn’t much of a correlation between publication date and number of favorites (especially when considering the full population of fics), but I still found it interesting that the pairing with the most favorites (Daphne/Harry) also tended to have the newest fics.

Final Thoughts

To briefly summarize, there are a lot of popular pairings in Harry Potter fanfiction, but a popular pairing doesn’t necessarily translate to higher ratings (in terms of favorites/follows/reviews). Two factors that might partially explain higher ratings are length (more likely) and timing (less likely), but they certainly don’t explain everything. The true predictor for an individual fic’s ratings is always going to be how much readers like it, but I hope this discussion and my previous post show that there are always interesting trends to see if you look closely enough. Until next time!

The Ship Wars: How the popularity of pairings in Harry Potter Fanfiction changed over time

In a previous post, I looked at character choices in Harry Potter fanfiction. One thing I focused on was the popularity of various pairings, e.g. I noticed that non-canon pairings like Draco/Harry tended to be more popular than canon pairings like Harry/Ginny. I think this is typical for fanfiction, which loves to explore "what-if" scenarios. I decided to take another look, this time asking the specific question of whether the popularity of the various pairings changed over time.

The data

In my previous post, I looked at a random sample of ~200k fics from the total of ~700k. In order to avoid the sampling issue, I decided to bite the bullet and scrape all the fic info from Fanfiction.net. So the figures below are based on analyzing a total of 724,315 Harry Potter fics.

To make the figures below I looked at all the fics which listed two characters. This doesn’t necessarily mean a romantic pairing, so keep that in mind. If more than two characters were listed, I included pairings that were specifically mentioned with brackets, a feature which was added in 2013. This isn’t used all that often, so the figures below mostly reflect the relative popularity of two-character fics.

Overall popularity

Below are the number of fics for the eight most popular pairings over time. There is one data point for each year, e.g. the 2001 dot represents the number of fics that were submitted between January 1, 2001 and December 31, 2001.

There are a few obvious things to point out, like the extremely high popularity of fics with Draco, or how only three of these eight pairings are canon (James/Lily, Hermione/Ron, and Ginny/Harry). But it's also interesting to notice that many of the curves have a similar shape, which reflects the overall popularity of Harry Potter fanfiction on Fanfiction.net.

I think this time period can be separated (more or less) into the five following stages:

  1. Initial growth: From 2001-2004, all the pairings increased in popularity as the Harry Potter community on Fanfiction.net grew in size.
  2. “Golden age”: From 2004-2007, the overall rates were pretty high. J.K. Rowling was still writing the books, so fanfiction authors got to explore all the wonderful “what-if” scenarios as each book revealed more and more of the story.
  3. Post-book lull:  From 2007-2010, most pairings dipped as fewer people submitted fics. My guess is overall interest in Harry Potter dropped as no new books were released.
  4. Movie spike: There was a big spike in submissions to Fanfiction.net in 2010 and 2011, which I suspect is a result of increased interest as the two Deathly Hallows movies came out.
  5. Post-movie lull: From 2011 to the present, overall popularity has waned but could be stabilizing. It will be interesting to see if the Fantastic Beasts movies get made and whether these will cause a resurgence in popularity.

Popularity rankings

While the overall numbers are neat to look at, I think it's also interesting to see how the relative rankings of pairings have fluctuated over time. So below you can see how the pairings shifted in popularity. If a particular pairing has a dot at, say, rank 4 in 2008, then that pairing was the 4th most popular for that year. Note that some pairings that aren't shown here (e.g. Draco/Ginny) occasionally broke into the top 8 rankings, but I'm just showing how the most popular overall pairings changed relative to each other.

This doesn’t reveal much new information for the top three pairings, which have had relatively stable rankings. But I think there are some interesting things to see in the less popular pairings.

For example, the Harry/Hermione pairing had a very high ranking in the early 2000s, but dropped to rank 6 by 2007. I think this was caused by the release of the Half-Blood Prince and Deathly Hallows books, which cemented the canon pairings of Hermione/Ron and Ginny/Harry. You can see these canon pairings were rank 4 or 5 from 2007 to 2010, but both have dropped in ranking from 2011 to 2016. Conversely, the non-canon pairings of Harry/Hermione and Hermione/Severus seemed to increase in rank.

It's important to note from the raw data above that all of these pairings dropped in overall popularity from 2011 to 2016; the non-canon pairings like Harry/Hermione and Hermione/Severus simply dropped less than the canon pairings. Part of this might be the effect of the movies. I know Harry/Hermione fans like to use the movies to support their choice (e.g. that dance scene in the tent). But it might be that these non-canon pairings hold up better over time simply because they're not canon. Fanfiction allows authors and readers to explore the unknown with their favorite characters, so it's only natural that non-canon pairings are very popular.

Final Thoughts

I think it’s interesting to see that these pairings do shift in popularity over time, presumably as the different books and movies were released. Fans can be very vocal about their favorite pairing, so I think this is a good way to look at the fandom as a whole.

There are other interesting trends I didn’t explore, so hopefully I’ll get to revisit this data set in the future. Draco/Ginny, for example, was actually very popular in the early 2000s (#4 in 2003), but has since dropped out of the top 10 (#16 in 2015). Or there’s Rose/Scorpius, which obviously didn’t exist until 2007 when the characters were introduced, but has stayed firmly in the top 10 since 2010. Stay tuned.

Voter Representation in the United States

In the last month, there has been a lot of discussion in the news about how Hillary Clinton lost the presidential election to Donald Trump, despite winning the popular vote. This quirk of the Electoral College has now occurred five times in U.S. history and twice in the last twenty years. Instead of rehashing other explanations for how this happened, I wanted to take a different look at the Electoral College, and by extension, the U.S. Congress. I’ll try to keep my personal politics out of this post and focus on the data, but discerning readers should keep in mind that I chose my examples carefully to illustrate a point.

Historical look: representation over time

The U.S. Congress is a compromise between the different ways a legislative body can represent a collection of states. The Senate has two members per state regardless of size, while the House of Representatives assigns its members based on each state's population. The intention is to ensure states with small populations are still fairly represented compared to bigger states. The Electoral College assigns its members in the same way, with Washington D.C. being the sole exception: it has three votes in the Electoral College, but only non-voting representation in the U.S. Congress. My graphs below focus on just congressional representation, but the same numbers apply to the Electoral College.

To begin, below I have a plot of the U.S. population over time, specifically in states with congressional representation. The counts are from the U.S. census (according to Wikipedia), which is collected every ten years.

U.S. population in states with congressional representation, i.e. excluding territories before they became states and Washington D.C. Source

The main thing to point out here is simply that the population keeps growing. While the pre-Civil War population grew at an approximately exponential rate, growth has slowed to a roughly linear rate since World War II.

Next I have the sizes of the two houses of Congress over this same period. By law, the House of Representatives has to reapportion its seats after each census, but as you can see, the size of the House has not changed in the last 100 years. Instead, the same number of members is re-apportioned according to relative population changes in each state.

Size of the U.S. Congress over time, specifically voting members. Source

This shows how the U.S. Congress grew quickly in the early years as new states were admitted. After the 1920 census, though, Congress couldn't agree on whether to add new members, supposedly due to size constraints (and political reasons). This led to the Reapportionment Act of 1929, which established a way for reapportionment to occur automatically. Importantly, it did not increase the size of the House, a tradition which continues to this day.

As a result, a growing number of people are represented by a single member of Congress, or put another way, representation per capita is shrinking. Below is a graph that illustrates this, obtained by dividing the values from the two graphs above. Note that this uses the total population, not the number of eligible voters.

Representation per capita in the U.S. Congress. The values are calculated by dividing the total number of congressional members by the total population in states with voting members in Congress.

Despite the increases in the size of Congress, representation per capita has continuously fallen in the U.S., with the largest decreases occurring before the Civil War. These are fairly consistent relative changes, so when plotted on a log scale (not shown), the rate of decrease is fairly constant.

I think the most important takeaway here is simply how low these numbers have fallen in recent years. Since the year 2000, there are fewer than 2 members of Congress per million people (and by extension, members of the Electoral College). This means each member of the Senate represents, on average, ~3 million people, while each member of the House represents ~700,000 people. These values have more than tripled since the last time the House increased in size 100 years ago. And of course, as the population continues to grow, these numbers will not get any better.

Representation by state

While the decrease in representation has a negative effect on every state in the union, it certainly affects some states more than others. Specifically, states with larger populations have fewer members of Congress (and Electoral College members) per capita. As I said earlier, this is intentional under the U.S. Constitution, but I still think it's interesting to look at how it breaks down by state. Below is a choropleth map using data from the 2010 census.

Congress members per capita (House + Senate) according to the 2010 census. This is equivalent to the number of Electoral College members per capita. Note that the coloring is based on a log scale.

Smaller states like Wyoming, Vermont, North Dakota, and Alaska have the highest representation, while many of the larger states have lower representation. Thirty states have fewer than 2 members of Congress per million people. For another look, I have the same values in a bar graph below.

Congress members per capita in each state, ordered by decreasing population size. Click on the graph to see the full size figure with legible text.

Since the states are arranged here by population, the mostly decreasing trend implies smaller states have better representation than larger states. The only thing that keeps this trend from strictly decreasing is the discrete nature of assigning House members. Rhode Island is the smallest state to receive two House members, making its representation appear higher than Montana and some others with only one member.

So what? Warning: opinions ahead

I hope these figures provide some insight into whether the current system of the U.S. Congress and Electoral College provides a "fair" representation of the American people. The logic that led to the compromise between states and individuals certainly still applies, i.e. smaller states still have reasons to fear that they would be "ignored" without this system. However, I want to argue that most small states are already being "ignored," along with many of the larger states.

Specifically, I want to repeat what CGP Grey pointed out in his excellent video on the Electoral College: the system does not guarantee that small states receive attention during a national campaign, since it's still possible to win a presidential election with just the 11 largest states. But even those states don't receive most of the attention. Instead, a huge majority of campaign resources are devoted to the "battleground" states, i.e. states with close races that have a chance of tipping the Electoral College vote one way or another.

To illustrate this, I made maps of the states that had scheduled events with Hillary Clinton and Donald Trump during the official campaign, meaning after their respective party conventions but before the election. The counts for Donald Trump came from this list of rallies on Wikipedia, while the counts for Hillary Clinton came from this archive of speeches. I don’t pretend either source is 100% correct or complete, but they both demonstrate my point.

Number of campaign events attended by Donald Trump (top) and Hillary Clinton (bottom) during the presidential campaign but after their official nominations.

Both candidates spent most of their time in an extremely small number of states: Florida, Ohio, Pennsylvania, and North Carolina. And it's not as if the white states in these maps all received one or two visits: according to the sources, Donald Trump visited only 19 states, while Hillary Clinton visited only 16.

One could argue that this was time well spent, since states like Florida and Pennsylvania did end up having close races and results that were very different from poll predictions. However, I'd argue that these maps do not reflect the intention of the Electoral College. It isn't big states or small states that benefit from this system; it's only the states with contentious and often bitter races. This system encourages polarization. And while I don't know what would happen if the presidential election were decided by a popular vote, I think this is an opportunity to have an honest discussion about what "fair" representation really means.

Subreddit breakdown: Gentlemanboners

UPDATE: Added a new graph below

After my previous look into /r/AskScience, I wanted to do a similar analysis for another interesting reddit community: /r/Gentlemanboners. It’s a place where people submit (mostly) SFW pictures of female celebrities, usually candids from red carpet events or glamour shots from magazines. It’s a fairly popular subreddit that usually gets a post to /r/all each day.

I thought it would be interesting to analyze the submissions to /r/Gentlemanboners for a couple reasons. First, it's very easy to tell who's in each picture because there's a strict rule requiring the full name to be included in the title. Second, certain girls seem to be much more popular than others, e.g. Emma Watson. To see if this was true, I downloaded the JSON files for all submissions between April 5, 2011 and June 22, 2016 using this script, for a total of 89,842 files. Below are some simple analyses, but I might get motivated to look more closely at this dataset in the future.

Popularity rankings

The first thing I looked at was ranking the most popular girls. You can see below the number of submissions for the 15 girls with the most submissions. I ignored incorrect/alternate spellings of names, and a single submission could be counted twice for two different girls if they were both in the picture. Also, this post was my first attempt to try D3.js, so you can see the exact numbers by placing the mouse cursor over each bar.
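The counting itself is straightforward. Here's a rough sketch of the idea, with a hypothetical list of names and an assumed 'title' field in each submission's JSON file (my actual script and field names may differ):

```python
import json, glob
from collections import Counter

names = ["Emma Watson", "Taylor Swift", "Anna Kendrick"]  # hypothetical list

counts = Counter()
for path in glob.glob("submissions/*.json"):   # hypothetical directory
    with open(path) as f:
        title = json.load(f).get("title", "")  # field name assumed
    for name in names:
        if name.lower() in title.lower():      # credit everyone in the title
            counts[name] += 1

print(counts.most_common(15))
```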

Emma Watson is by far the most popular girl on /r/Gentlemanboners, with 40% more submissions than Taylor Swift, Anna Kendrick, or Scarlett Johansson, who are essentially tied for 2nd place. It's interesting that a gap this large doesn't appear between any of the lower ranks, meaning that Emma Watson's popularity surpasses the power law trend that describes the rest of the data fairly well.

Popularity vs. Rating

The number of submissions is probably not the best measure for which girl is the most well-liked, so I also wanted to look at the scores for each submission. Below I have a scatter plot of the number of submissions vs. the average score for those submissions. The plot has the 500 girls with the most submissions. You can mouse over each dot to see the name associated with it.

There’s an overall positive correlation between the average score and the number of submissions, with Emma Watson’s dot located to the far right. However, you can also see that many girls have higher average scores despite fewer submissions.

I think the most interesting thing here is the set of names with very high average scores but few posts. At the top is Milana Vayntrub, with only 62 submissions but an average score of 985, much higher than the average score for anyone else in this group.

So who is Reddit’s favorite?

I can't really say who deserves to be called Reddit's favorite, since it depends on whether you consider the score or the number of submissions more important. Besides Emma Watson and Milana Vayntrub, there are also girls to consider like Alison Brie. She was ranked 13th in the bar graph above, but her average score is higher than those of the top 89 most popular girls. The next most popular girl with a higher average score is Katherine McNamara (who has over 500 fewer submissions). So who is Reddit's favorite? I'll leave that for you to decide.

Update: Total Karma

By request, I made an additional plot of the total karma, i.e. the sum of the scores for each post. Below are the top 15 girls ranked by total karma. Emma Watson is still at the top, probably thanks to her large lead in overall submissions, but the difference between 1st and 2nd is definitely smaller. Taylor Swift was probably the hardest hit, dropping down to 11th. The stars of Black Swan (Natalie Portman and Mila Kunis) were knocked out of the top 15, replaced by Natalie Dormer and Blake Lively.

Subreddit breakdown: AskScience

I’m a big fan of the reddit community AskScience, where anyone can ask a question and have “scientists” try to answer it. Some of the people answering may not be actual scientists, but there’s still a lot of participation from people with real expertise. You can also find some fascinating questions. Here’s a short list of questions I thought were interesting or amusing (in no particular order):

  1. If two ships travel at faster than half the speed of light away from each other, could light from one ever reach the other?
  2. What exactly is an itch?
  3. Why can’t I list every book I know, but I can tell you if I own it?
  4. Why do airplane windows need to have that hole?
  5. Do people sneeze while they sleep?
  6. If you farted hard enough in space, could you move yourself around?

Clearly, these people are asking the important questions. One thing I’ve noticed on AskScience is how physics-related questions seem to be the most common. I know I’m biased (I always hope to see more neuroscience questions), but I wanted to know for sure. I decided to analyze the subreddit submissions by scraping submission info with PRAW. Specifically, I used this Python script to download information about every submission since the subreddit started (6 years ago). The end result was 155,805 JSON files, which covered a range of 2,239 days. Below are some simple descriptive statistics about this dataset. Eventually, I’d like to do some natural language processing on the questions and answers themselves, but for now, this will be a broad look at the meta-data for submissions to AskScience.

First, a bit of history

Below, I've plotted the average rate of submissions over time in units of submissions per hour, starting around April 2010. You can see that the rate grew slowly for the first eight months but then spiked around January 2011. I think this was a result of new traffic from AskScience's nomination for the Best Little Community of 2010. Over the next nine months, the community grew by leaps and bounds until it was eventually added to the list of default subreddits in October 2011. That caused another huge spike in traffic, and you can see major fluctuations in the submission rate over the following 12 months or so. Since then, the rate of submissions has continued to fluctuate up and down but held more or less steady.

[Figure: submission rate over time]

I think it's interesting to look through the early periods to see how the subreddit gained popularity. For example, this post from March 2011 has a moderator saying that the number of subscribers had reached 19,000, up from only 4,500 just six months earlier. When AskScience was made a default subreddit, this number skyrocketed. As of this writing, there are 8.5 million subscribers. That's quite a change.

Breaking down submissions by flair type

Sometime in 2012, AskScience started requiring each question to be tagged with one of 12 flairs corresponding to different scientific fields: physics, biology, chemistry, etc. Below, I've broken down the 91,014 tagged posts by flair type. Sure enough, physics is by far the most popular. Biology is a close second, but each of the other fields is less popular by a factor of 2 or more. The bottom three flairs (psychology, social sciences, and computing) each have around 13 times fewer questions than physics.

[Figure: percentage of tagged posts by flair type]

This confirmed my observation that physics questions were the most common, but I was also curious to see if these percentages changed over time. Below I have a stacked area plot that shows how the relative percentages changed over time. Initially, most questions were not tagged, so you can see the gray area covers almost 100% of the plot. Then in 2012, the colored areas get larger, meaning more and more submissions were getting tagged. You can tell how the percentages changed by comparing the relative sizes of the colored areas.

[Figure: flair percentages over time]

It seems from this graph that the relative percentages are somewhat stable over time, but something interesting happened to the Biology category. It's hard to tell in the above graph, but the rates of biology and physics questions were actually quite similar in the beginning. Sometime in mid-2014, the rate of biology tags dropped, which you can see as the green biology area gets smaller.

It's much easier to see this trend when you look at the raw submission rates, which I've plotted below. Initially, the curves are highly correlated (they both go up and down as the overall submission rate fluctuates), but then the biology rate drops and never recovers.

[Figure: physics vs. biology submission rates]

I’m not sure what caused this drop. It could be the result of more aggressive moderation of biology-tagged posts, or possibly a genuine decrease in the number of biology questions. Regardless, the end result is an even greater percentage of physics questions over the last two years. The bar graph above shows the percentages of physics and biology questions are around 27% and 20%, respectively. More recently, these values are closer to 30% and 15%.

UPDATE: After talking with a few mods from AskScience, I've learned that the drop in Biology posts was caused by the introduction of the "Human Body" tag, which was introduced as a subset of the Medicine category. This resulted in a lot of questions getting the Medicine flair that previously would have had the Biology flair. Nevertheless, there is still a small decrease in Biology-related posts over time, which you can see in the stacked area plot by comparing the Biology+Medicine areas to the Physics area.

Comparing scores

An obvious follow-up to the previous analyses is how well the posts from each field do once they're submitted, i.e. how many upvotes they get. Reddit "fuzzes" this information to prevent spam bots from taking advantage, but the numbers I extracted should still be an okay approximation.

I decided to compare three values:

  1. Score – This is the number of upvotes a post receives minus the number of downvotes.
  2. Upvote ratio – The ratio of upvotes to the total number of votes. This number is between 0 and 1.
  3. Number of comments – This is the total number of comments in a post, so it includes all the replies to the top-level answers.

Below I have the mean values for each flair type marked with circles. The lines are 99% bootstrapped confidence intervals, so a lot of overlap between these lines implies the mean values are not significantly different from each other. You shouldn't place too much stock in comparing individual pairs because of the problem of multiple comparisons, but if there's a large gap between the confidence intervals, it's probably safe to say that the difference is statistically significant.

[Figure: scores, upvote ratios, and comment counts by flair]

The confidence intervals for the scores have a large degree of overlap, indicating there probably aren’t very many significant differences between the groups. There might be some, e.g. medicine posts seem to have a higher average score than chemistry posts, but not by much. It’s important to note that the score and number of comments have distributions that are heavily skewed towards zero (95% of posts have scores under 100). The mean isn’t a great summary statistic for these types of distributions, but the confidence intervals I have here are bootstrapped, i.e. they don’t make any assumptions about the underlying distribution.
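For anyone unfamiliar with bootstrapping, here's a minimal sketch of a percentile bootstrap for the mean (my own illustration; the number of resamples is a placeholder):

```python
import numpy as np

def bootstrap_ci(values, n_boot=10_000, ci=99, seed=0):
    """Percentile bootstrap CI for the mean: resample with replacement,
    take each resample's mean, and report the middle `ci` percent."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    means = rng.choice(values, size=(n_boot, len(values)), replace=True).mean(axis=1)
    lo, hi = np.percentile(means, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return lo, hi
```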

The confidence intervals for the upvote ratios and number of comments have less overlap, so you can have more confidence in making comparisons. Take, for example, the opposite trends for mathematics and neuroscience, two categories that receive similar numbers of questions. Math questions seem to get the most comments but have the lowest upvote ratios, whereas neuroscience questions seem to have high upvote ratios but few comments. This makes me wonder if math questions are more controversial, generating a lot of discussion but not a lot of positive interest. Conversely, I think people like seeing neuroscience questions enough to upvote them, but there aren’t as many neuroscientists on reddit answering questions.

Final remarks

This may have seemed like a lot of work just to confirm my suspicion about the popularity of physics questions, but I enjoyed taking a deeper look at such an interesting part of reddit. There’s plenty more to do with this dataset, such as looking at the actual content of the questions and answers. I’m also very interested to see how much higher the scores are for answers that were given by people with “scientist” flairs. Stay tuned.

 

Listening to data

I enjoy browsing /r/dataisbeautiful to see all the creative data visualizations, which makes me wish for more innovative visualization techniques in scientific literature. Unfortunately, anything besides static images is still largely relegated to the supplementary info. Libraries like D3.js are amazing, but the scientific community has yet to embrace these types of tools. This will hopefully change as journals such as eLife become more popular. Until then, I think we’re stuck with static images being the principal method for displaying data. There’s nothing specifically wrong with that, but different approaches can often lead to new insights.

To show you what I mean, I've put together a few videos demonstrating a different way of "looking" at time series data: sound synthesis. There are three examples below, each using a different data source. I'll be the first to admit that these are essentially gimmicks, but I hope I can convince you that it's a useful tool during the data exploration phase, which can lead to new insights and analyses.

Example 1: Neuroscience

I'll start with something from my own research. I've taken a 15 s recording from a neuron in the auditory midbrain and simply used the raw waveform from the signal to create a sound. This signal is the voltage from an extracellular electrode that is slowly advanced through the brain until a contact is close enough to a particular neuron that we can see individual action potentials. During an experiment, we have the voltage playing continuously through a speaker, since it's helpful to listen for the distinctive sounds of an action potential when searching for a neuron. Below is an example of what we hear when we get close to a neuron.

The action potentials are the large “spikes” in the waveform that are shown in more detail on the smaller set of axes. You can hear them as distinctive crackling sounds. The spikes in this example appear to occur randomly, which we call “spontaneous” spiking since no stimulus is being applied. However, there is a limit imposed by the kinetics of action potential generation on how fast any neuron can fire, often referred to as the refractory period. For this reason, the spiking of a neuron is not truly a Poisson process as it is often assumed to be. You can sometimes even hear a distinct periodicity in the spiking, which is the basis for neuronal oscillations that play critical roles in brain function. These periodicities can be quantified with tools like spectral analysis or directional statistics.
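If you want to try this with your own data, the conversion really is as simple as writing the voltage trace to an audio file. Here's a minimal sketch using SciPy (not my actual pipeline; the output name is a placeholder):

```python
import numpy as np
from scipy.io import wavfile

def waveform_to_wav(voltage, fs, out_path="recording.wav"):
    """Write a raw voltage trace to a 16-bit WAV file so it can be played back.

    voltage: 1-D array of samples; fs: sampling rate in Hz.
    """
    v = np.asarray(voltage, dtype=float)
    v = v / np.max(np.abs(v))                     # normalize to [-1, 1]
    wavfile.write(out_path, int(fs), (v * 32767).astype(np.int16))
```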

Example 2: Physics/astronomy

This next example is obviously not my field of expertise, but I think it’s nevertheless fascinating. The audio is taken from the Space Audio page from a group at the University of Iowa. It’s a recording of “whistlers”, a phenomenon that can occur when lightning causes radio waves to travel along the Earth’s magnetic field lines, which causes dispersion due to the different speeds at which the different frequencies propagate. Here’s a figure from the same group that explains it better. The result is a distinctive whistling sound where the pitch drops as a result of the dispersion.

The visualization is a simple spectrogram, which shows frequency on the y-axis, time on the x-axis, and intensity as the color. You can see the curved lines in the spectrogram that correspond to the whistling sounds, which is a sign that the frequency is changing over time.
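A spectrogram like this takes only a few lines with SciPy and matplotlib. Here's a minimal sketch, assuming a mono WAV file (the file name and FFT length are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile

fs, x = wavfile.read("whistlers.wav")              # hypothetical file
f, t, Sxx = signal.spectrogram(x, fs, nperseg=1024)

plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12))   # intensity (dB) as color
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()
```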

I love this example for two reasons (besides the fact that it just sounds cool). First, the signal did not need to be shifted at all in terms of frequency before converting to sound. You can find examples of “space sounds” where something like gamma radiation is used to synthesize a sound. However, gamma rays have frequencies that are much, much higher than our audible range, so it needs to be shifted in order to hear it. This is largely an arbitrary process, since the perception of the sound can vary depending on how much frequency shifting/compressing is done.

Second, dispersion is a concept important to the propagation of waves in general, so you can find examples of it in the auditory system as well. The cochlea, for example, acts like a dispersive delay line, meaning that it not only separates out the frequency content of sounds, but it does so at slightly different speeds, resulting in what is referred to as the “traveling wave.” These can look a lot like whistlers.

Example 3: The button

Frequent visitors to reddit will probably recall /r/thebutton, reddit’s most recent April Fool’s Day joke. Explaining it fully would take a while, so check out the wiki on the subreddit linked above or this blog post for more information. Suffice it to say, it’s a very interesting dataset. The official server-side data was released after the button ended, and I used it to synthesize sound.

It’s based on the concept of a vocoder, which works by splitting a sound into different frequency bands, and then using the energy in those bands to modulate a sound like white noise or a sawtooth wave. I used the different flair values of the button as my frequency bands, i.e. the 60 s presses were mapped to a high-frequency band and the 0 s were mapped to a low frequency band. I counted the number of presses within each band within 3 hour windows and used these rates to modulate white noise. I also compressed the timescale so listening to the two-month dataset takes two minutes. I recommend playing the video below with headphones on, although you may want to turn the volume down at the beginning since it starts out loud.

The top plot is the sound waveform that you're hearing, and the middle plot is like a spectrogram, showing the log-scaled rate of button presses over time. The bottom plot shows the instantaneous press rate for each flair over the 3 hour window used to compose each frame, so the minimum non-zero value is \log_{10}\left(\frac{1 \text{ press}}{180 \text{ minutes}}\right) \approx -2.3. Also, you can probably tell I cut out the first few days (60 hours to be exact), since the rate of button presses is so high at the beginning that it would've swamped out the rest of the audio. That, and it would've given you all hearing loss.
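For the technically inclined, here's a rough sketch of the vocoder idea: each flair bucket gets a band-passed noise carrier whose amplitude follows the press rate in that bucket (band edges, window length, and so on are placeholders, not the values I actually used):

```python
import numpy as np
from scipy import signal

fs = 44_100  # audio sampling rate

def band_from_rates(rates, low_hz, high_hz, seconds_per_window=2.0):
    """Band-limited noise whose amplitude follows one flair's press rates."""
    rng = np.random.default_rng(0)
    n = int(fs * seconds_per_window)
    env = np.repeat(np.asarray(rates, dtype=float), n)  # step envelope
    sos = signal.butter(4, [low_hz, high_hz], btype="bandpass", fs=fs,
                        output="sos")
    return env * signal.sosfilt(sos, rng.standard_normal(len(env)))

# summing one such band per flair bucket gives the final audio track
```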

You can hear/see some pretty interesting things, like how the overall amplitude ebbs and flows over 24 hour periods, or how the flair values near the borders (e.g. 11 s, 21 s, etc.) stick out, almost like harmonics in a natural sound. You can even hear the hitchhiker exodus, where a bunch of redditors pressed the button at 42 s. It ironically starts 42 s into the video (I swear that was unplanned).

Final Thoughts

I hope you enjoyed the examples and have a new-found appreciation for different ways of “displaying” your data. My examples here were all time-series data, but there’s no reason these techniques can’t be extended to other modalities, like spatial data from geographers or spectroscopy data from chemists. The sky’s the limit! (Get it? Because sound can’t propagate in outer space?)

I’ll see myself out.

Generating gibberish from Harry Potter fanfiction summaries

This is a continuation of my previous post analyzing character choices in Harry Potter fanfiction.

Generating random/gibberish text is not a new idea. Perhaps you've been to /r/SubredditSimulator, an entire subreddit in which content is created by bots generating random text. Or perhaps you've heard about gibberish scientific articles actually being accepted into journals and conference proceedings. To my knowledge, though, this is the first time someone has applied these tools to fanfiction, or more accurately, fanfiction summaries.

The technique is based on the concept of Markov chains, a way of describing “memoryless” random processes, i.e. a process in which the next state in the process depends only on the current state and not the previous states. It’s an enormously useful concept. It’s even the basis for how Google ranks websites in their search engine results.

But enough introduction, let’s get to the good stuff. Introducing the Harry Potter Fanfiction Summary Generator! Just click the button below to generate a new random fanfiction summary.

Harry Potter Fanfiction Summary Generator



The Voyage of Life
Ideas, plot bunnys that won’t let go, possible future stories. feel free to submit questions! Hilarity will ensue. Do not read it unless you read hers first. Posted with permission.
Rated: K+
– English – Romance/Drama – Chapters: 11 – Words: 43228 – Reviews: 483 – Favs: 644 – Follows: 248
– Harry P., Hermione G.

Note: This generator is not creating summaries on the fly, but rather loading summaries from a previously generated list. There are about 10,000 summaries in the list, so it should take a while before you start seeing repeats.

So how does it work?

Disclaimer: the rest of this post will get somewhat technical, but I’ll try to avoid jargon.

You probably saw in the examples above that sometimes the generator produces perfectly legitimate results, even to the point of containing whole sentences from preexisting summaries (more on that later). Other times, it fails horribly and hilariously. To understand what’s going on, you need to understand the concept of Markov chains.

To construct a Markov chain, you start by analyzing the probabilities of transitions between “states”. In this case, the states are words. For example, if you start with the word ‘Harry’, you can probably guess that it is often followed by the word ‘Potter’ and less often by, say, ‘taradiddles’ (yes that’s a real word; it actually appears once in Harry Potter and the Order of the Phoenix). By analyzing all of the word transitions in a body of text, you can calculate lots of probabilities and create a diagram like the example below.

Diagram for a hypothetical Markov chain. The size of the arrows is proportional to the transition probabilities between words, e.g. a value of 0.8 would imply that the word ‘Potter’ follows the word ‘Harry’ 80% of the time, while ‘and’ follows ‘Harry’ the other 20%. From these starting words there are other likely choices, such as ‘Harry Potter is’ or ‘Harry and Draco’, and much less likely ones like ‘Harry and and’. To construct a random sentence, we pick a starting point and then move between “states” (i.e. words) according to their probabilities. (Note: these are dummy probabilities for the purpose of illustration.)

As you might expect, Markov chains can be much more complicated than this example. To generate the summaries above, I constructed a Markov chain using 25,000 fanfiction summaries. This was a sample of “popular” fanfics, specifically the top 25,000 fics written in English, sorted by number of reviews. This is certainly a biased sample, but hopefully biased in an interesting way. For example, I might speculate that summaries in this sample are more successful (on average) in attracting readers’ attention. I obviously don’t know if that’s true, but I think the sample is large enough to get a good sense of common trends in summaries.

That’s nice, but how does it work?

To actually explain the process, I need to introduce the concept of the Markov chain order, also referred to as the memory. When using a Markov chain to generate random text, this number refers to how many previous words are considered when selecting the next. For example, say we start with the phrase ‘sent back in’. With an order of 1, only the previous word is considered, so the next word is chosen based on which words are most likely to follow ‘in’. For an order of 3, all three words are considered, so the most likely next word is almost certainly ‘time’, which constructs the phrase ‘sent back in time’. As you might expect, this phrase is very common in fanfiction summaries, since a lot of stories involve time travel.
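To make that concrete, here is a minimal sketch of an order-n generator in Python. The padding tokens and function names are mine, not the actual code behind the generator above.

```python
import random
from collections import defaultdict

# Build a mapping from each n-word context to the list of words that follow it.
# Storing followers in a list (with repeats) preserves the transition probabilities.
def build_chain(summaries, order=3):
    chain = defaultdict(list)
    for words in summaries:
        padded = ["<START>"] * order + words + ["<END>"]
        for i in range(len(padded) - order):
            context = tuple(padded[i:i + order])
            chain[context].append(padded[i + order])
    return chain

# Walk the chain from the start state until an end token is drawn.
def generate(chain, order=3):
    context = ("<START>",) * order
    out = []
    while True:
        next_word = random.choice(chain[context])
        if next_word == "<END>":
            return " ".join(out)
        out.append(next_word)
        context = context[1:] + (next_word,)

# Toy usage with two made-up summaries:
summaries = [s.split() for s in [
    "Harry is sent back in time to fix everything.",
    "Hermione is sent back in time. What happens when she meets Tom Riddle?",
]]
chain = build_chain(summaries, order=3)
print(generate(chain, order=3))
```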

One way to analyze the effect of Markov chain order is to generate lots of random summaries and see how often these summaries match one of the input summaries used to construct the chain. By “match”, I mean an exact match, including the capitalization and punctuation. Below, I show the results of this analysis for small subsets of the full dataset. It would be nice to repeat this analysis for the entire thing, but that’s more work than I’m willing to do for a blog post.

Effect of Markov chain order on probability of producing exact matches when randomly generating summaries. A probability of zero implies that every generated summary is completely unique, while a value of one implies every generated summary is just a reproduction of an existing summary.

To calculate these probabilities, I generated lots of random summaries from each Markov chain and calculated the fraction of exact matches from the sample. I repeated this process several times with a different sample of summaries and averaged the results. This is an example of a Monte Carlo method.
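In code, this is a tiny loop on top of the generator sketched earlier; the 1,000 draws here are an arbitrary illustration, not the actual sample size I used.

```python
# Builds on build_chain/generate from the earlier sketch. `originals` is the list of raw
# summary strings used to train the chain; the number of draws is illustrative.
def match_fraction(chain, originals, order, n_draws=1000):
    originals = set(originals)
    hits = sum(generate(chain, order) in originals for _ in range(n_draws))
    return hits / n_draws
```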

There are two trends to describe in the graph:

Effect of order — With an order of 1, nearly every generated summary is unique. With an order of 5, basically all of them are just reproductions of the input data. Something special happens around order 2-3 when we start to get a lot of matches. This value has a lot to do with how long summaries tend to be. If you wanted to reproduce larger sections of text (e.g. an entire fanfiction), you would need a higher chain order.

Effect of sample size — You can see the general effect is to shift the curves to the right as the sample size increases. This implies that, for a given order, you get fewer matches with a larger sample.

From these results, I decided to choose an order of 3 to generate summaries from my full dataset, since I think it’s high enough to create interesting patterns, but low enough to create mostly unique results. I generated 10,000 summaries and 271 were matches. I decided to remove them from the generator above, since these matches were usually the result of all the crazy ways people use punctuation to make things stand out, e.g. ::Rewritten & Improved:: or ***COMPLETE***. However, you’ll still see times when it’s reproducing part of a summary, then it suddenly switches to a new one. This can create readable, yet hilarious results.

Lastly, I should mention the titles were also constructed with Markov chains, only using an order of 1 since titles are so much shorter. I also removed randomly generated titles with only one word, since these are always exact matches. Despite these precautions, ~18% of the titles are still matches.

You still haven’t told me how it works

Right. This post is already getting pretty long, so I decided to put some of the extra technical information on a separate page here. You can see the actual algorithm I used to generate random summaries, as well as the techniques I used to provide the accompanying information, e.g. genre, rating, reviews, etc.

Phrase analysis

To finish off this post, I decided to look at the most popular phrases used in the summary Markov chains. Recall that for an order n chain, we consider the n previous words to pick the next, so we can look at how often phrases of length n and n + 1 occur. Since I had difficulty deciding between an order of 2 or 3, I created Markov chains for both and can analyze popular phrases from 2-4 words long. Below I have the top 15 phrases from each group.
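The counting itself is just a sliding window over each summary; here is a minimal sketch in Python, assuming whitespace-tokenized summaries (the function name is mine).

```python
from collections import Counter

# Count phrases (n-grams) of a given length across all summaries and return the top k.
def top_phrases(summaries, n, k=15):
    counts = Counter()
    for words in summaries:
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(k)
```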


Most popular phrases used in Harry Potter fanfiction summaries. The three lists correspond to phrases of different lengths: pairs (left, red), triplets (center, blue), and quadruplets (right, green). The font size is proportional to the number of times that phrase occurs in summaries, relative to the top phrase in each list.

There are some interesting things to notice:

  1. For a length of two, 9 of the 15 phrases are prepositional phrases, i.e. not really specific to fanfiction summaries. Also, the only name mentioned is Harry.
  2. For a length of three, you start to see some interesting combinations, like character pairings and other phrases unique to Harry Potter fanfiction. I think the most interesting phrases are ‘What happens when’ and ‘back in time’, since they illustrate the hypothetical nature of fanfiction stories.
  3. For a length of four, you see more of the hypothetical phrases, including three variations of ‘what happens when’. I also think it’s very interesting that you see different parts of phrases that are >4 words long. For example, there is ‘at the end of’ and ‘the end of the’, so I would predict the 5-word phrase ‘at the end of the’ would also be very popular.

Final thoughts

I hope you’re convinced that Markov chains are a neat way of analyzing text, even if it’s only to giggle at the gibberish they can produce. Make sure to check out more involved uses of this technique like in /r/SubredditSimulator. Also, if you wanted to see some additional info like the actual algorithm I used, please visit this page. Thanks for reading!

Character choices in Harry Potter fanfiction

Harry Potter fanfiction is something I find pretty interesting. On FanFiction.net, there are over 700,000 stories, all set within the Harry Potter universe. Not one of the authors expects any sort of financial gain, although it is possible for popular authors to get published.  E.L. James, the author of 50 Shades of Grey, got her start writing fanfiction for the Twilight series. Say what you will about either of those series, but there’s no denying their popularity.

The stories on FanFiction.net, also called “fics” or “fanfics”, vary wildly in length from poems to short stories to full-length novels. There are even a few fics that contain more words than all seven Harry Potter books put together. It’s also surprising that the community is still active to this day, considering it’s been eight years since the last book came out and four years since the last movie.

I decided to take a look at the meta-information available on FanFiction.net for these stories. When browsing for fics, you can get lists of descriptions containing the title, author, genre, length, number of reviews, etc. There’s also a short summary written by the author. I wrote a script that scrapes this information from the search results, using the Python lxml module for HTML parsing. I randomly sampled 200,000 fics, so roughly 28% of the total. Here are a few stats from the dataset:

 

I think these are some fairly impressive numbers. Of course, these statistics are a poor summary of the actual data, but for my first post with this dataset, I wanted to look at character choices.
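For anyone curious about the scraping step itself, here is a rough sketch with lxml. The URL handling, element classes, and field names are placeholders I made up for illustration; the real script had to match whatever markup the site’s search results actually use.

```python
import requests
from lxml import html

# Sketch of scraping one page of search results. The selectors below ("story-entry",
# "story-title", "story-meta") are placeholders, not FanFiction.net's actual class names.
def scrape_page(url):
    tree = html.fromstring(requests.get(url).content)
    fics = []
    for entry in tree.xpath('//div[contains(@class, "story-entry")]'):
        title = entry.xpath('.//a[contains(@class, "story-title")]/text()')
        meta = entry.xpath('.//div[contains(@class, "story-meta")]//text()')
        fics.append({
            "title": title[0].strip() if title else "",
            "meta": " ".join(" ".join(meta).split()),   # summary plus genre/length/review stats
        })
    return fics
```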

Most popular single characters

Each fic can list up to four characters from a list of 375. Harry Potter has a lot of supporting characters. The first thing I did was count how many times each individual character appeared in a fic and sorted the result. Below are the most popular character choices.


Top 25 character choices. Note that the percentages sum to >100% since each fic can have multiple characters

 

It should come as no surprise that Harry is at the top of the list. I do feel a bit bad for Ron, though. It appears that Draco has taken what could arguably be called his spot, and Ron barely made it higher than “OC”, meaning an original character. I think there are multiple reasons for this. First, a large percentage of Harry Potter fanfics are romances: 55% of fics in my sample contain the “Romance” genre label, with the next highest being “Humor” at 21%. I think Draco has that bad boy appeal that makes him popular in romances. I’ve also found that a lot of fics are “do-overs”, i.e. fan re-imaginings of the original plot. In those stories, Ron can be unpopular. The community even has a term for treating a character badly: “bashing”.

The other rankings aren’t too surprising. I kinda wish Fred and George were next to each other, but I understand since there’s also a popular type of story that picks up where the books ended. As an example, if anyone reading this is a Harry Potter fan, I recommend the short story Cauterize by Lady Altair.

Pairings and multiple character groupings

The next thing I wanted to look at was how characters are grouped together. As I mentioned, romance is a very popular genre label, so the most popular grouping is obviously two characters. The community refers to this as a “pairing”, and if a fan likes a particular pairing, they “ship” that pairing. It’s short for “relationship” (used as both a noun and a verb for some reason). There can be fervent debates about “ships” on sites like /r/harrypotter, so I think it’s an interesting thing to look at.

As an aside for those familiar with FanFiction.net, I’ve ignored the use of square brackets [], which are supposed to explicitly denote a romantic pairing. Only 3.2% of fics in my sample use brackets, which goes up slightly to 4.5% if you just consider fics with a Romance genre label. Thus, I found it easier to just ignore them.

To do the analysis, I counted how often each combination of characters occurred and grouped the possible combinations by numbers of characters, i.e. a “double” indicates a fic with two and only two characters. These pairings may be romantic or platonic, but differentiating these cases is impossible from my dataset, since even the summary text may not indicate which pairings are romantic and which aren’t. Regardless, this is such a large dataset that I think the overall trends are still clear.
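The counting itself is just a tally over character sets; here is a minimal sketch, assuming each scraped record carries its list of (up to four) characters. The field name is made up.

```python
from collections import Counter

# Tally every exact character grouping, then split the tallies by group size
# (singles, pairs/"doubles", triples, quadruples).
def group_counts(fics):
    combos = Counter(tuple(sorted(f["characters"])) for f in fics if f["characters"])
    by_size = {}
    for combo, count in combos.items():
        by_size.setdefault(len(combo), Counter())[combo] = count
    return by_size

# e.g. by_size[2].most_common(5) would give the five most popular pairings
```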


Pie chart of character percentages from a sample of 200,000 fics. The color indicates the number of characters, which are broken up into the most popular character choices for each.

 

A few observations:

  1. Pairings are definitely the most popular type of fic, with Draco pairings claiming the top two spots. I was actually surprised that canon pairings are as popular as they are (Ron/Hermione, Harry/Ginny, etc.).
  2. The “other doubles” category is the single biggest chunk of the graph, but these are all pairings that each make up <1% of the total. In my dataset, there are 2,779 unique pairings, and I’m only showing 12. Of course, this is only a subset of the 70,125 possible pairings given 375 available characters.
  3. Of the solo acts, Snape is the second most popular (after Harry, of course). This seems appropriate to me, since Snape strikes me as a lone wolf character.
  4. The “golden trio”  of Harry, Hermione, and Ron is unsurprisingly the most popular choice for three characters. The value 0.1% may seem small, but remember we’re talking about 700,000 fics, so 0.1% is still in the hundreds.

Summary analysis for different pairings

The last thing I wanted to do was to see what particular pairings have in common, if anything. The method I chose was comparing the most popular words used in the summaries. I decided to look at just two pairings, specifically the top two that didn’t have a common character (Draco/Hermione and James/Lily). To make the comparison, I took the 100 most popular words in the summaries of each pairing (using the same algorithm as Wordle to count words) and clustered the words by whether they were common to both pairings or not. This doesn’t mean a word specific to one pairing never appears in the other pairing; it’s simply not in that pairing’s top 100. The resulting “Venn diagram” is shown below. Note that I removed explicit mentions of the characters involved since they dominated the counts. For example, I removed words such as Draco, Draco’s, Malfoy, etc. Also, I limited the analysis to fics written in English for obvious reasons.
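The grouping of words boils down to set arithmetic on the two top-100 lists. Here is a minimal sketch, using plain whitespace splitting rather than the full Wordle counting; the stop-word list and the placeholder summary lists are stand-ins for the real data.

```python
from collections import Counter

# Return the set of the k most common words in a list of summaries, ignoring stop words.
def top_words(summaries, stopwords, k=100):
    counts = Counter(w for s in summaries for w in s.lower().split() if w not in stopwords)
    return {w for w, _ in counts.most_common(k)}

stopwords = {"the", "a", "and", "of", "to"}        # stand-in for the real stop-word list
dramione_summaries = ["..."]                       # placeholder: Draco/Hermione summaries
james_lily_summaries = ["..."]                     # placeholder: James/Lily summaries

dramione = top_words(dramione_summaries, stopwords)
jily = top_words(james_lily_summaries, stopwords)

shared = dramione & jily          # gray words in the center
dramione_only = dramione - jily   # blue words on the left
jily_only = jily - dramione       # red words on the right
```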


“Venn diagram” word cloud showing the most popular words in the summaries of fics with either a Draco/Hermione or James/Lily pairing. The font size for each word is proportional to the total number of times that word appears in summaries with these pairings. Gray words in the center are the most common words that appear in both pairings, blue words on the left are the most common words that appear in Draco/Hermione fics, and red words on the right are the most common words in James/Lily fics.

 

The Venn diagram look may not have panned out as I had originally hoped, but the information is still interesting. Even if it does look like a Pepsi logo.

A few observations:

  1. The top words in both pairings tend to be shared, i.e. there are more gray words than blue or red. This isn’t unique to Draco/Hermione vs. James/Lily. Words like love, Hogwarts, and year are common to many types of pairings. There are also common English words like up and out in the center. I automatically remove the most common English words (Wordle does the same), but these two aren’t on the list of common words I found online.
  2. You can see that the Draco/Hermione pairing is more popular than James/Lily since its unique words are larger overall. To scale the font size of the words in the center, I averaged the counts from both pairings.
  3. The most common category of unique words is shorthand names for the pairing, e.g. DMHG or LJ. The word Dramione is a portmanteau of Draco and Hermione. I’m not sure if there’s one for James/Lily yet, but my vote is for Limes.
  4. I don’t think the list of unique words is enough to make any claims of thematic differences between the pairings. For example, I could speculate that many Draco/Hermione fics are Romeo-and-Juliet-style stories of star-crossed lovers, whereas James/Lily stories could have the “will-they-or-won’t-they” trope. There might be hints of this (secret and past for Draco/Hermione; finally and hate for James/Lily), but this isn’t enough to make any strong conclusions. I might look at popular phrases/groups of words to really get at this question.

Final thoughts

Character choices in Harry Potter fanfiction can be considered both highly variable (6,950 unique character groupings from a series with essentially three main characters) and highly regular (a randomly selected fic has a 25% chance of having Harry, Hermione, or Draco as a character). I hope to do more analyses like these, and I thought character choices was a good place to start because it’s an easy dimension for clustering fics together. Next, I hope to do more with the summaries. Perhaps use Markov chains to generate pseudorandom summaries like the posts in /r/SubredditSimulator. Please leave suggestions below and thanks for reading!

Analysis of abstracts from a paper library

Author’s note: This is an old journal entry from August 15, 2014.

One thing I really respect about my PhD advisor is his effort to stay up-to-date with the recent literature. At conferences, I always see him talking to several different grad students during the poster sessions, whereas it seems like most senior scientists make it to one or two posters before talking to other PIs. He also keeps an extensive library of journal articles that gets updated regularly. I don’t know if he actually reads every new paper, but I’m nevertheless jealous of his ability to find them. So far I’ve been dissatisfied with services designed to notify me of new articles, whether it’s an RSS feed of a major journal or an email search alert from PubMed/Google Scholar. These services are either too strict and miss relevant articles, or too lax and return way too many results. A new service called PubChase holds promise, but I don’t know how well it works. Regardless, I wanted to see if I could figure out a better way to find new, relevant articles. The first step: analyzing my advisor’s library of papers.

Getting the raw data

My advisor’s paper library has over 13,000 files in it, and I certainly did not want to open every file in order to get the raw text from the titles and abstracts. EndNote provided a way to automate this, although it wasn’t successful in extracting the abstracts from every article. I did, however, manage to create a huge text file with the titles and abstracts of 6,687 journal articles. This process was likely biased for newer papers, since I don’t think PubMed can pull abstracts from scans, but this frankly doesn’t bother me as long as the sampling process was unbiased with respect to the article’s topic, which is hopefully the case.

To begin, I used my code based on the Wordle algorithm (see previous post) to identify the 500 most common words and their relative usages. This counting ignores common English words as well as a list of special words, which I omitted somewhat arbitrarily after deciding they would be bad at identifying a paper’s unique content. For example, words like “results”, “effects”, “suggest”, “show”, and “significantly” could show up in any abstract regardless of the topic. Also, I counted each word in the title 3 times in order to give the title more weight. This technique is used by PubMed in its search algorithm, described here. Shown below is a word cloud of the final set of 500 words used for clustering.


Word cloud of 500 most common words in extracted abstracts and titles. The font size is linearly proportional to the number of occurrences. Words appearing in an abstract were counted once, but three times in a title.

Clearly, some of the words are too small to read due to the enormous number of occurrences of  “auditory” (14,431 if you were curious). This illustrates why it is important to scale each word’s weight by its overall usage. Specifically, for document i=1,2,\ldots,N and term j=1,2,\ldots,M, the total weight W_{i,j} was calculated as the product of the global weight G_j of term j and the local weight L_{i,j} of term j in document i. These weights were calculated as follows, described in more detail on the PubMed website.

W_{i,j} = L_{i,j}G_j,
G_j = \sqrt{\ln \left(\frac{N}{n_j}\right)}
L_{i,j} = \frac{10}{1+e^{\alpha\ell_i}\lambda^{k_{i,j}-1}}

N is the total number of documents (6,687), n_j is the number of documents term j appears in, \ell_i is the total number of words in document i (or 250, whichever is larger), and k_{i,j} is the number of times term j is in document i. The constants \alpha = 0.0044 and \lambda = 0.7 were given by PubMed. The total number of terms M was set at 500, resulting in a set of 6,687 feature vectors of length 500 to be clustered.

Note that the above equations have two changes from the description on the PubMed website. First, the third equation above has a factor of 10 in the numerator, not 1. I added this because the local weights in my data set were originally much smaller than the global weights, skewing the clustering process. The second change was adding the square root in the second equation, which was made for the same reason. I don’t know why PubMed’s global weights are smaller, but perhaps it is because their database of documents is much larger.
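Written out in code, the weighting is just a couple of array operations. Here is a direct transcription of the (modified) equations above into NumPy, assuming a raw count matrix k of shape documents × terms and a vector of document lengths; this is a sketch, not necessarily how my original analysis code was organized.

```python
import numpy as np

# W[i, j] = L[i, j] * G[j], following the modified equations above.
def weight_matrix(k, ell, alpha=0.0044, lam=0.7):
    N, M = k.shape
    n_j = np.count_nonzero(k, axis=0)        # number of documents containing term j
    G = np.sqrt(np.log(N / n_j))             # global weight, with the added square root
    ell = np.maximum(ell, 250)               # document length, floored at 250 words
    L = 10.0 / (1.0 + np.exp(alpha * ell)[:, None] * lam ** (k - 1.0))  # local weight, with the 10x
    W = L * G[None, :]
    W[k == 0] = 0.0    # assumption: terms absent from a document contribute no weight
    return W
```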

Choosing the number of clusters

With over 6,000 articles, you can imagine that the range of topics covered is quite broad. You can find papers in the library on everything from signal processing to basic anatomy to psychology. There are even oddball papers on topics like particle physics. Any attempt to identify every topic in the library will fall victim to overfitting, but I was confident that I could separate a small number of topics that were well-represented and identify the words that best separate topics from each other.

I started with principal component analysis to reduce the dimensionality. The figure below shows the cumulative percent of variance explained as a function of the number of components, and you can see how multidimensional this data set is. We need over 100 components to explain just 50% of the variance! This is a testament to how variable the term usages are between papers.


Cumulative variance explained by principal components

To help decide on the number of components to include and eventually perform the clustering, I used Matlab’s implementation of the EM algorithm for Gaussian mixture models (gmdistribution.fit). This allowed me to make some judgements on relative model quality using the Akaike information criterion (AIC). This value decreases as the likelihood of the model increases but contains an added penalty for increasing the number of free parameters. Therefore, a lower AIC value is desirable. You can see below how the AIC value changes as a function of both the number of components and the number of clusters.


Akaike information criterion (AIC) for Gaussian mixture model as a function of the number of clusters. Each line represents a different number of principal components used.

There are two obvious trends. First, increasing the number of components appears to increase the AIC value in an approximately linear fashion (the lines appear offset from each other), which is due to increasing the number of free parameters. Second, increasing the number of clusters appears to decrease the AIC, which is due to increasing the likelihood of the model. Another interesting trend appears when you normalize these lines by their max AIC value, as shown below.


Values of Akaike information criterion (AIC) as a function of the number of clusters. The values are normalized by the maximum value for each line, i.e. when every paper was assigned to a single cluster.

Here you can see that the AIC value decreases at the same relative rate when the number of components is ≥15, implying that adding additional components to these models will not significantly improve the likelihood. Therefore, I chose 15 components to be used in the final clustering. These components represent approximately 19.5% of the variance.

Finally, to choose the number of clusters, I used the admittedly subjective elbow method to select a value of 8. As the above figure demonstrates, adding more clusters continues to decrease the AIC value, but not by much.
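For reference, the same model-selection sweep can be sketched in Python with scikit-learn (the actual analysis used Matlab’s gmdistribution.fit); the grids of component and cluster counts below are just examples.

```python
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# AIC for a Gaussian mixture as a function of the number of clusters, for several choices
# of how many principal components to keep. `W` is the documents x terms weight matrix.
def aic_grid(W, component_counts=(5, 10, 15, 20, 25), cluster_counts=range(1, 13)):
    results = {}
    for n_comp in component_counts:
        X = PCA(n_components=n_comp).fit_transform(W)
        results[n_comp] = [
            GaussianMixture(n_components=k, covariance_type="full", random_state=0)
            .fit(X).aic(X)
            for k in cluster_counts   # here n_components means number of clusters
        ]
    return results
```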

Determining cluster keywords

The final result is a set of 8 clusters, each with 500-1200 papers. To identify what each cluster’s “topic” might be, I summed the feature vectors for every paper within a cluster and sorted the result in order to obtain the words with the highest weights for each cluster. I then created a word cloud with the top 20 words in each cluster, shown below, where the color indicates cluster identity.
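In code, that step is just a per-cluster sum and sort. A minimal sketch, assuming `labels` holds the cluster assignment for each paper and `terms` lists the 500 term strings in column order:

```python
import numpy as np

# Sum the weight vectors of all papers in each cluster, then take the highest-weighted terms.
def cluster_keywords(W, labels, terms, top_n=20):
    keywords = {}
    for c in np.unique(labels):
        summed = W[labels == c].sum(axis=0)
        top_idx = np.argsort(summed)[::-1][:top_n]
        keywords[c] = [(terms[j], summed[j]) for j in top_idx]
    return keywords
```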


Word cloud showing the top 20 words within each of the 8 clusters of papers, calculated by summing the feature vectors across members of a cluster. The color indicates cluster identity, and the font size is proportional to the summed weight of the word.

Before making any claims about how the words are related within clusters, there are several general observations I’ve made:

  1. The weights are fairly evenly distributed, i.e. all the words have a somewhat similar font size. This is especially true when compared to the earlier word cloud constructed with the raw word counts.
  2. The top words from the earlier word cloud (auditory, cochlear, and neurons) are still prominent in several of the clusters here, indicating that despite scaling for the overall usage, these words still occur often enough to produce a large summed weight within that cluster.
  3. Different forms of words are commonly grouped together, e.g. cell and cells, implant and implantation, inhibition and inhibitory, etc. This could indicate that these words co-localize in the same or very related documents. The white cluster (middle-left) even has a triplet: neurons, neural, and neuronal.

Describing the clusters

The real question is whether the top twenty words in each cluster actually say something about the papers in their cluster. If the clustering process actually separated the papers by topic, then we would assume that’s the case. You can see in the word cloud there are certainly repetitions of words between clusters, but I do think the words can be interpreted into an overarching theme. There are bound to be outliers in each cluster (where do the particle physics papers go?), but I think the eight clusters break down into the following groups:
  • Cluster 1 (dark red, top left): Auditory neurophysiology, electric hearing, cochlear implants. This cluster seems to focus on neural responses (response, activity, evoked) in auditory centers (auditory nerve, nucleus, cortex, inferior colliculus) from electric stimulation (electrical, stimulation, cochlear implant). Compare this to Cluster 8, which seems to focus on the speech recognition performance of CI users, or Cluster 3, which seems to involve neural responses due to acoustic stimulation.
  • Cluster 2 (dark blue, top middle): Auditory neurophysiology, binaural hearing. This cluster certainly involves binaural neural processing (both binaural and interaural are present), especially in the inferior colliculus (note the abbreviation IC). It is also the only cluster with inhibitory, which certainly has an important role in binaural processing.
  • Cluster 3 (orange, top right): Auditory processing, neuroimaging. This cluster seems to focus on higher-order auditory processing (speech, pitch, perception, complex), so it likely contains the neuroimaging papers. This is supported by the presence of the anatomical terms: primary, cortex, cortical, but there’s nothing to suggest neurophysiology at the cellular level.
  • Cluster 4 (white, center left): General systems neuroscience. This cluster appears to involve general principles of neuroscience (spike, information, model, synaptic), including the aforementioned trifecta of neurons, neuronal, and neural. With regard to anatomy, this cluster is probably focused more on the neocortex (cortex, cortical), since no brainstem terms are mentioned. This is also the only cluster with a non-auditory word (visual), which is understandable given that a larger percentage of papers on the visual system involve cortical processing than in the auditory system.
  • Cluster 5 (yellow, center left): Cochlear implants, psychophysics. This cluster definitely focuses on cochlear implants (implant, implantation, CI, pulse), but also has many classic psychophysical terms (subjects, listeners, masking, noise). Compare this to Cluster 8, which seems to focus more on CI performance and less on the basic psychophysics.
  • Cluster 6 (blue, bottom left): Inner ear biology, auditory periphery. This cluster seems to focus on the periphery (cochlea, hair cells, cochlear nucleus) with perhaps more of a biological focus than some of the neural coding clusters (synapses, membrane, synaptic, cell). However, there is certainly a neural component (nerve, fibers).
  • Cluster 7 (light brown, bottom middle): Psychophysics. This one is definitely a psychophysics cluster. Pretty much every word is indicative of this (masking, cues, thresholds, detection, target, signal, noise). There is also a binaural component (binaural, interaural, localization, time, level).
  • Cluster 8 (light blue, bottom right): Cochlear implant performance. This cluster certainly involves cochlear implants (electrode, stimulation, multichannel), but compared with the other two putative CI clusters (1 and 5), this cluster seems to focus on how well CIs actually work (speech, recognition, perception, performance, scores). It is certainly the most humanized cluster (children, patients, subjects, users). Given both children and age, this cluster likely also contains papers on how CIs affect development.

Final thoughts

While my research involves cochlear implants, I am still somewhat surprised that 3 of the 8 clusters seem to include them, since they’re really only one aspect of my advisor’s interests. I was, however, pleased to see that each CI cluster seems to have a separate focus (neurophysiology, psychophysics, or general performance). This could indicate that the clustering worked well, although the notion of a singular topic per cluster is certainly something I arbitrarily imposed. Overall, I think the eight topics do cover my advisor’s research interests nicely, and I bet he’s even been an author on a paper in each cluster. In the future, it would be interesting to approach this problem with hierarchical clustering, since it might reveal subtopics within the larger clusters. For example, Cluster 7 might separate into topics on localization and signal detection. It would also be interesting to see where the outliers were assigned, since these 8 topics certainly don’t cover every paper in the library. Regardless, I think this was a fun and informative exercise. I did make several new PubMed email alerts as a result, but I’ll have to wait a few weeks to see how well they work. Who knows, maybe I’ll even read a paper or two, instead of just thinking of ways to find them.

Update: I set up around 5 PubMed alerts as a result of this work, each containing 3-5 search terms. They’ve been pretty effective at alerting me to papers, and I definitely prefer them to reading a bunch of tables of contents from multiple journals. Now if only I had the motivation to read all those papers…