Harry Potter fanfiction is something I find pretty interesting. On FanFiction.net, there are over 700,000 stories, all set within the Harry Potter universe. Not one of the authors expects any sort of financial gain, although it is possible for popular authors to get published. E.L. James, the author of Fifty Shades of Grey, got her start writing fanfiction for the Twilight series. Say what you will about either of those series, but there’s no denying their popularity.
The stories on FanFiction.net, also called “fics” or “fanfics”, vary wildly in length from poems to short stories to full-length novels. There are even a few fics that contain more words than all seven Harry Potter books put together. It’s also surprising that the community is still active to this day, considering it’s been eight years since the last book came out and four years since the last movie.
I decided to take a look at the meta-information available on FanFiction.net for these stories. When browsing for fics, you can get lists of descriptions containing the title, author, genre, length, number of reviews, etc. There’s also a short summary written by the author. I wrote a script that scrapes this information from the search results, using the Python lxml module for HTML parsing. I randomly sampled 200,000 fics, so roughly 28% of the total. Here are a few stats from the dataset:
I think these are some fairly impressive numbers. Of course, these statistics are a poor summary of the actual data, but for my first post with this dataset, I wanted to look at character choices.
Most popular single characters
Each fic can list up to four characters from a list of 375. Harry Potter has a lot of supporting characters. The first thing I did was count how many times each individual character appeared in a fic and sorted the result. Below are the most popular character choices.
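A minimal sketch of this kind of count, assuming each fic has already been reduced to a list of character names. The `fics` sample and the `count_characters` helper are made up for illustration and are not the script used for this post:

```python
from collections import Counter

# Hypothetical sample: each fic is represented by its list of tagged
# characters (FanFiction.net allows up to four per fic).
fics = [
    ["Harry P.", "Hermione G."],
    ["Draco M.", "Hermione G."],
    ["Harry P."],
    ["Severus S."],
    ["Harry P.", "Ron W.", "Hermione G."],
]


def count_characters(fics):
    """Count how many fics each character is tagged in."""
    counts = Counter()
    for characters in fics:
        # set() guards against a character being listed twice in one fic
        counts.update(set(characters))
    return counts


character_counts = count_characters(fics)

# Print the most frequent characters with their share of all fics
for name, n in character_counts.most_common(3):
    print(f"{name}: {n} fics ({n / len(fics):.0%})")
```

`Counter.most_common` does the sorting, so the ranking falls out of the count directly.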
It should come as no surprise that Harry is at the top of the list. I do feel a bit bad for Ron, though. It appears that Draco has taken what could arguably be called his spot, and Ron barely made it higher than “OC”, meaning an original character. I think there are multiple reasons for this. First, a large percentage of Harry Potter fanfics are romances: 55% of fics in my sample contain the “Romance” genre label, with the next highest being “Humor” at 21%. I think Draco has that bad boy appeal that makes him popular in romances. I’ve also found that a lot of fics are “do-overs”, i.e. fan re-imaginings of the original plot. In those stories, Ron can be unpopular. The community even has a term for treating a character badly: “bashing”.
The other rankings aren’t too surprising. I kinda wish Fred and George were next to each other, but I understand since there’s also a popular type of story that picks up where the books ended. As an example, if anyone reading this is a Harry Potter fan, I recommend the short story Cauterize by Lady Altair.
Pairings and multiple character groupings
The next thing I wanted to look at was how characters are grouped together. As I mentioned, romance is a very popular genre label, so the most popular grouping is obviously two characters. The community refers to this as a “pairing”, and if a fan likes a particular pairing, they “ship” that pairing. It’s short for “relationship” (used as both a noun and a verb for some reason). There can be fervent debates about “ships” on sites like /r/harrypotter, so I think it’s an interesting thing to look at.
As an aside for those familiar with FanFiction.net, I’ve ignored the use of square brackets, which are supposed to explicitly denote a romantic pairing. Only 3.2% of fics in my sample use brackets, which goes up slightly to 4.5% if you just consider fics with a Romance genre label. Thus, I found it easier to just ignore them.
To do the analysis, I counted how often each combination of characters occurred and grouped the possible combinations by numbers of characters, i.e. a “double” indicates a fic with two and only two characters. These pairings may be romantic or platonic, but differentiating these cases is impossible from my dataset, since even the summary text may not indicate which pairings are romantic and which aren’t. Regardless, this is such a large dataset that I think the overall trends are still clear.
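The grouping step could look roughly like the sketch below. It assumes the same list-of-names representation per fic. The `count_groupings` helper and the sample data are hypothetical, and sorting the names is just one way to make "Draco/Hermione" and "Hermione/Draco" count as the same grouping:

```python
from collections import Counter, defaultdict

# Hypothetical sample of per-fic character lists
fics = [
    ["Draco M.", "Hermione G."],
    ["Hermione G.", "Draco M."],   # same pairing, listed in a different order
    ["James P.", "Lily Evans P."],
    ["Severus S."],
    ["Harry P.", "Hermione G.", "Ron W."],
]


def count_groupings(fics):
    """Count each exact character combination, bucketed by group size.

    Returns a dict mapping group size (1, 2, 3, 4) to a Counter keyed by
    a sorted tuple of character names.
    """
    by_size = defaultdict(Counter)
    for characters in fics:
        if not characters:
            continue  # fics with no listed characters are skipped
        grouping = tuple(sorted(set(characters)))
        by_size[len(grouping)][grouping] += 1
    return by_size


groupings = count_groupings(fics)
doubles = groupings[2]  # fics with two and only two characters
print(doubles.most_common())
```

Because the key is the full combination, a fic tagged with three characters only counts toward the size-3 bucket and never toward any of the pairings inside it, which matches the "two and only two" definition of a double.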
A few observations:
- Pairings are definitely the most popular type of fic, with Draco pairings claiming the top two spots. I was actually surprised that canon pairings are as popular as they are (Ron/Hermione, Harry/Ginny, etc.).
- The “other doubles” category is the single biggest chunk of the graph, but it is made up of pairings that each account for <1% of the total. In my dataset, there are 2,779 unique pairings, and I’m only showing 12. Of course, this is only a small subset of the 70,125 possible pairings given 375 available characters.
- Of the solo acts, Snape is the second most popular (after Harry, of course). This seems appropriate to me, since Snape strikes me as a lone wolf character.
- The “golden trio” of Harry, Hermione, and Ron is unsurprisingly the most popular choice for three characters. The value 0.1% may seem small, but remember we’re talking about 700,000 fics, so 0.1% is still in the hundreds.
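The numbers in these observations are easy to sanity-check. The sketch below only redoes the arithmetic from the figures quoted above, and the variable names are made up for illustration:

```python
from math import comb

available_characters = 375
observed_pairings = 2779      # unique pairings seen in the sample
total_fics = 700_000          # approximate size of the whole archive
sample_fics = 200_000         # size of the random sample

# Number of ways to choose 2 characters out of 375, ignoring order
possible_pairings = comb(available_characters, 2)

# Fraction of all possible pairings that actually show up in the sample
coverage = observed_pairings / possible_pairings

# What a 0.1% share means in absolute numbers of fics
share = 0.001
fics_in_archive = round(share * total_fics)
fics_in_sample = round(share * sample_fics)

print(f"possible pairings: {possible_pairings:,}")
print(f"observed share of possible pairings: {coverage:.1%}")
print(f"0.1% of the archive: {fics_in_archive} fics")
print(f"0.1% of the sample: {fics_in_sample} fics")
```

So only about 4% of the possible pairings occur at all in the sample, and a 0.1% share works out to hundreds of fics whether you measure against the sample or the full archive.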
Summary analysis for different pairings
The last thing I wanted to do was see what particular pairings have in common, if anything. The method I chose was comparing the most popular words used in the summaries. I decided to look at just two pairings, specifically the top two that didn’t have a common character (Draco/Hermione and James/Lily). To make the comparison, I took the 100 most popular words in the summaries of each pairing (using the same algorithm as Wordle to count words) and clustered the words by whether they were common to both pairings or not. This doesn’t mean a word specific to one pairing never appears in the other pairing; it’s simply not in that pairing’s top 100. The resulting “Venn diagram” is shown below. Note that I removed explicit mentions of the characters involved since they dominated the counts. For example, I removed words such as Draco, Draco’s, Malfoy, etc. Also, I limited the analysis to fics written in English for obvious reasons.
The Venn diagram look may not have panned out as I had originally hoped, but the information is still interesting. Even if it does look like a Pepsi logo.
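For the curious, the word comparison can be sketched as follows. This is a simplified stand-in and not Wordle's actual counting algorithm: the regex tokenizer, the tiny `STOP_WORDS` set, the helper names, and the sample summaries are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a much longer one
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
              "for", "with", "her", "his", "at", "their"}


def top_words(summaries, n=100, exclude=()):
    """Return the n most frequent words as {word: count}.

    Stop words and any words in `exclude` (e.g. character names) are dropped.
    """
    skip = STOP_WORDS | {w.lower() for w in exclude}
    counts = Counter()
    for text in summaries:
        # Normalize curly apostrophes so "Draco’s" becomes "draco's"
        normalized = text.lower().replace("\u2019", "'")
        for word in re.findall(r"[a-z']+", normalized):
            word = word.strip("'")
            if word and word not in skip:
                counts[word] += 1
    return dict(counts.most_common(n))


def venn_split(top_a, top_b):
    """Split two top-word dicts into (only in A, shared, only in B).

    Shared words get the average of the two counts, which can then be
    used to scale their font size.
    """
    shared = {w: (top_a[w] + top_b[w]) / 2 for w in top_a.keys() & top_b.keys()}
    only_a = {w: c for w, c in top_a.items() if w not in top_b}
    only_b = {w: c for w, c in top_b.items() if w not in top_a}
    return only_a, shared, only_b


# Made-up summaries, just to exercise the functions
draco_hermione = ["Draco and Hermione share a secret at Hogwarts",
                  "A secret love in their eighth year"]
james_lily = ["Lily finally admits her love for James",
              "Seventh year at Hogwarts"]

top_dh = top_words(draco_hermione, exclude={"draco", "hermione"})
top_jl = top_words(james_lily, exclude={"james", "lily"})
only_dh, shared, only_jl = venn_split(top_dh, top_jl)

print("Draco/Hermione only:", sorted(only_dh))
print("shared:", sorted(shared))
print("James/Lily only:", sorted(only_jl))
```

The three dictionaries correspond to the three regions of the diagram: the two sets of pairing-specific words and the shared words in the middle.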
A few observations:
- The top words in both pairings tend to be shared, i.e. there are more gray words than blue or red. This isn’t unique to Draco/Hermione vs. James/Lily. Words like love, Hogwarts, and year are common to many types of pairings. There are also common English words like up and out in the center. I automatically remove the most common English words (Wordle does the same), but these two aren’t on the list of common words I found online.
- You can see that the Draco/Hermione pairing is more popular than James/Lily since its unique words are larger overall. To scale the font size of the words in the center, I averaged the counts from both pairings.
- The most common category of unique words is shorthand names for the pairing, e.g. DMHG or LJ. The word Dramione is a portmanteau of Draco and Hermione. I’m not sure if there’s one for James/Lily yet, but my vote is for Limes.
- I don’t think the list of unique words is enough to make any claims of thematic differences between the pairings. For example, I could speculate that many Draco/Hermione fics are Romeo-and-Juliet-style stories of star-crossed lovers, whereas James/Lily stories could have the “will-they-or-won’t-they” trope. There might be hints of this (secret and past for Draco/Hermione; finally and hate for James/Lily), but this isn’t enough to make any strong conclusions. I might look at popular phrases/groups of words to really get at this question.
Character choices in Harry Potter fanfiction can be considered both highly variable (6,950 unique character groupings from a series with essentially three main characters) and highly regular (a randomly selected fic has a 25% chance of having Harry, Hermione, or Draco as a character). I hope to do more analyses like these, and I thought character choices were a good place to start because they’re an easy dimension for clustering fics together. Next, I hope to do more with the summaries. Perhaps I’ll use Markov chains to generate pseudorandom summaries like the posts in /r/SubredditSimulator. Please leave suggestions below and thanks for reading!