Generating gibberish from Harry Potter fanfiction summaries

 This is a continuation of my previous post analyzing character choices in Harry Potter fanfiction.

Generating random/gibberish text is not a new idea. Perhaps you’ve been to /r/SubredditSimulator, a complete subreddit in which content is created by bots generating random text. Or perhaps you’ve heard about gibberish scientific articles actually being accepted into journals and conference proceedings. To my knowledge, though, this is the first time someone applied these tools to fanfiction, or more accurately, fanfiction summaries.

The technique is based on the concept of Markov chains, a way of describing “memoryless” random processes, i.e. a process in which the next state in the process depends only on the current state and not the previous states. It’s an enormously useful concept. It’s even the basis for how Google ranks websites in their search engine results.

But enough introduction, let’s get to the good stuff. Introducing the Harry Potter Fanfiction Summary Generator! Just click the button below to generate a new random fanfiction summary.

Harry Potter Fanfiction Summary Generator



Harry 5
The fifty moments that shaped the relationship of Jacob and Renesmee, absolutely NO pedophilia.
Rated: M
– English – Romance – Chapters: 66 – Words: 258193 – Reviews: 290 – Favs: 91 – Follows: 34
– Hermione G., Draco M.

Note: This generator is not creating summaries on the fly, but rather loading summaries from a previously generated list. There are about 10,000 summaries in the list, so it should take a while before you start seeing repeats.

So how does it work?

Disclaimer: the rest of this post will get somewhat technical, but I’ll try to avoid jargon.

You probably saw in the examples above that sometimes the generator produces perfectly legitimate results, even to the point of containing whole sentences from preexisting summaries (more on that later). Other times, it fails horribly and hilariously. To understand what’s going on, you need to understand the concept of Markov chains.

To construct a Markov chain, you start by analyzing the probabilities of transitions between “states”. In this case, the states are words. For example, if you start with the word ‘Harry’, you can probably guess that it is often followed by the word ‘Potter’ and less often by, say, ‘taradiddles’ (yes that’s a real word; it actually appears once in Harry Potter and the Order of the Phoenix). By analyzing all of the word transitions in a body of text, you can calculate lots of probabilities and create a diagram like the example below.

Diagram for a hypothetical Markov chain. The size of each arrow is proportional to the transition probability between words, e.g. a value of 0.8 implies that the word ‘Potter’ follows the word ‘Harry’ 80% of the time, while ‘and’ might follow ‘Harry’ the other 20%. From these words, there are other likely choices, such as ‘Harry Potter is’ or ‘Harry and Draco’. It is much less likely to see something like ‘Harry and and’. (Note: these are dummy probabilities for the purpose of illustration.) To construct a random sentence, we pick a starting point and then move between “states” (i.e. words) according to their probabilities.
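
To make the idea of transition probabilities concrete, here is a minimal Python sketch (not the actual code behind the generator). It assumes words are simply split on whitespace, and the toy summaries are invented so that the numbers match the dummy 80%/20% example in the diagram.

```python
from collections import Counter, defaultdict

def transition_probabilities(texts):
    """Estimate P(next word | current word) from a list of strings."""
    counts = defaultdict(Counter)
    for text in texts:
        words = text.split()  # assumption: whitespace tokenization
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {w: n / total for w, n in followers.items()}
    return probs

# Toy corpus, invented purely for illustration
toy_summaries = [
    "Harry Potter is back",
    "Harry Potter and Draco",
    "Harry Potter is late",
    "Harry Potter and Hermione",
    "Harry and Draco",
]

probs = transition_probabilities(toy_summaries)
print(probs["Harry"])  # {'Potter': 0.8, 'and': 0.2}
```

Each inner dictionary plays the role of the arrows leaving one word in the diagram.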

As you might expect, Markov chains can be much more complicated than this example. To generate the summaries above, I constructed a Markov chain using 25,000 fanfiction summaries. This was a sample of “popular” fanfics, specifically the top 25,000 fics written in English, sorted by number of reviews. This is certainly a biased sample, but hopefully biased in an interesting way. For example, I might speculate that summaries in this sample are more successful (on average) in attracting readers’ attentions. I obviously don’t know if that’s true, but I think the sample is large enough to get a good sense of common trends in summaries.

That’s nice, but how does it work?

To actually explain the process, I need to introduce the concept of the Markov chain order, also referred to as the memory. When using a Markov chain to generate random text, this number refers to how many previous words are considered when selecting the next. For example, say we start with the phrase ‘sent back in’. With an order of 1, only the previous word is considered, so the next word is chosen based on which words are most likely to follow ‘in’. With an order of 3, you consider all three words, so the most likely next word is definitely ‘time’, which constructs the phrase ‘sent back in time’. As you might expect, this phrase is very common in fanfiction summaries, since a lot of stories involve time travel.
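
The effect of the order can be sketched in a few lines of Python. This is an illustrative sketch under my own assumptions, not the algorithm used for the generator: the `<s>` and `</s>` boundary markers, the function names, and the toy summaries are all hypothetical.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary markers

def build_chain(texts, order):
    """Map each tuple of `order` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    for text in texts:
        words = [START] * order + text.split() + [END]
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            chain[state].append(words[i + order])
    return chain

def generate(chain, order, rng=random, max_words=100):
    """Walk the chain from the start state until the end marker is drawn."""
    state = (START,) * order
    output = []
    while len(output) < max_words:
        next_word = rng.choice(chain[state])  # frequent followers are drawn more often
        if next_word == END:
            break
        output.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(output)

# Toy corpus, invented purely for illustration
toy = [
    "Harry is sent back in time",
    "Draco is sent back in time",
    "Hermione falls in love",
]

chain1 = build_chain(toy, order=1)
chain3 = build_chain(toy, order=3)
print(chain1[("in",)])                  # ['time', 'time', 'love']
print(chain3[("sent", "back", "in")])   # ['time', 'time']
print(generate(chain1, 1, random.Random(0)))
```

With an order of 1, ‘in’ can be followed by either ‘time’ or ‘love’, so mash-ups like ‘Hermione falls in time’ are possible. With an order of 3, the state ‘sent back in’ only ever leads to ‘time’.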

One way to analyze the effect of Markov chain order is to generate lots of random summaries and see how often these summaries match one of the input summaries used to construct the chain. By “match”, I mean an exact match, including the capitalization and punctuation. Below, I show the results of this analysis for small subsets of the full dataset. It would be nice to repeat this analysis for the entire thing, but that’s more work than I’m willing to do for a blog post.

Effect of Markov chain order on probability of producing exact matches when randomly generating summaries. A probability of zero implies that every generated summary is completely unique, while a value of one implies every generated summary is just a reproduction of an existing summary.

To calculate these probabilities, I generated lots of random summaries from each Markov chain and calculated the fraction of exact matches from the sample. I repeated this process several times with a different sample of summaries and averaged the results. This is an example of a Monte Carlo method.
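
A rough sketch of this Monte Carlo estimate is below. It is a simplified stand-in under stated assumptions rather than the code behind the graph: the chain builder, the boundary markers, the whitespace normalization of the originals, and all parameter names are hypothetical.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary markers

def build_chain(texts, order):
    """Map each tuple of `order` words to the words observed after it."""
    chain = defaultdict(list)
    for text in texts:
        words = [START] * order + text.split() + [END]
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, order, rng, max_words=200):
    """Generate one random text by walking the chain."""
    state, output = (START,) * order, []
    while len(output) < max_words:
        word = rng.choice(chain[state])
        if word == END:
            break
        output.append(word)
        state = state[1:] + (word,)
    return " ".join(output)

def match_fraction(texts, order, n_generated, rng):
    """Fraction of generated texts that exactly reproduce an input text."""
    chain = build_chain(texts, order)
    # Assumption: whitespace is normalized; case and punctuation still count.
    originals = {" ".join(t.split()) for t in texts}
    hits = sum(generate(chain, order, rng) in originals for _ in range(n_generated))
    return hits / n_generated

def average_match_fraction(all_texts, order, sample_size, n_repeats, n_generated, seed=0):
    """Repeat the estimate on several random subsets and average the results."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_repeats):
        sample = rng.sample(all_texts, sample_size)
        fractions.append(match_fraction(sample, order, n_generated, rng))
    return sum(fractions) / len(fractions)

# Toy corpus, invented purely for illustration
toy = [
    "Harry is sent back in time",
    "Draco is sent back in time",
    "Hermione falls in love",
]
for order in (1, 3, 5):
    print(order, average_match_fraction(toy, order, sample_size=3, n_repeats=5, n_generated=200))
```

On a corpus this tiny the numbers mean little, but the shape of the calculation is the same: more generated summaries and more repeats give a less noisy estimate.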

There are two trends to describe in the graph:

Effect of order — With an order of 1, nearly every generated summary is unique. With an order of 5, basically all of them are just reproductions of the input data. Something special happens around order 2-3 when we start to get a lot of matches. This value has a lot to do with how long summaries tend to be. If you wanted to reproduce larger sections of text (e.g. an entire fanfiction), you would need a higher chain order.

Effect of sample size — You can see the general effect is to shift the curves to the right as the sample size increases. This implies that you get fewer matches with a larger sample.

From these results, I decided to choose an order of 3 to generate summaries from my full dataset, since I think it’s high enough to create interesting patterns, but low enough to create mostly unique results. I generated 10,000 summaries and 271 were matches. I decided to remove them from the generator above, since these were usually the result of all the crazy ways people use punctuation to make things stand out, e.g. ::Rewritten & Improved:: or ***COMPLETE***. However, you’ll still see times when it’s reproducing part of a summary, then it suddenly switches to a new one. This can create readable, yet hilarious results.

Lastly, I should mention the titles were also constructed with Markov chains, only using an order of 1 since titles are so much shorter. I also removed randomly generated titles with only one word, since these are always exact matches. Despite these precautions, ~18% of the titles are still matches.

You still haven’t told me how it works

Right. This post is already getting pretty long, so I decided to put some of the extra technical information on a separate page here. You can see the actual algorithm I used to generate random summaries, as well as the techniques I used to provide the accompanying information, e.g. genre, rating, reviews, etc.

Phrase analysis

To finish off this post, I decided to look at the most popular phrases used in the summary Markov chains. Recall that for an order n chain, we consider the n previous words to pick the next, so we can look at how often phrases of length n and n + 1 occur. Since I had difficulty deciding between an order of 2 or 3, I created Markov chains for both and can analyze popular phrases from 2-4 words long. Below I have the top 15 phrases from each group.


Most popular phrases used in Harry Potter fanfiction summaries. The three lists correspond to phrases of different lengths: pairs (left, red), triplets (center, blue), and quadruplets (right, green). The font size is proportional to the number of times that phrase occurs in summaries, relative to the top phrase in each list.
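
Counting phrases of a fixed length can be sketched as below. This is a hedged illustration that counts phrases directly from the text rather than reading them out of a Markov chain; the function name, the whitespace tokenization, the case-sensitive counting, and the toy summaries are my assumptions.

```python
from collections import Counter

def top_phrases(texts, length, k=15):
    """Return the k most common phrases made of `length` consecutive words."""
    counts = Counter()
    for text in texts:
        words = text.split()  # assumption: case and punctuation are kept as-is
        for i in range(len(words) - length + 1):
            counts[" ".join(words[i:i + length])] += 1
    return counts.most_common(k)

# Toy corpus, invented purely for illustration
toy = [
    "at the end of the year",
    "at the end of the war",
    "sent back in time",
]

for length in (2, 3, 4):
    print(length, top_phrases(toy, length, k=3))
```

Summaries shorter than the requested phrase length simply contribute nothing to the counts.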

There are some interesting things to notice:

  1. For a length of two, 9 of the 15 phrases are prepositional phrases, i.e. not really specific to fanfiction summaries. Also, the only name mentioned is Harry.
  2. For a length of three, you start to see some interesting combinations, like character pairings and other phrases unique to Harry Potter fanfiction. I think the most interesting phrases are ‘What happens when’ and ‘back in time’, since they illustrate the hypothetical nature of fanfiction stories.
  3. For a length of four, you see more of the hypothetical phrases, including three variations of ‘what happens when’. I also think it’s very interesting that you see different parts of phrases that are >4 words. For example, there is ‘at the end of’ and ‘the end of the’, so I would probably predict the 5 word phrase ‘at the end of the’ would also be very popular.

Final thoughts

I hope you’re convinced that Markov chains are a neat way of analyzing text, even if it’s only to giggle at the gibberish they can produce. Make sure to check out more involved uses of this technique like in /r/SubredditSimulator. Also, if you wanted to see some additional info like the actual algorithm I used, please visit this page. Thanks for reading!

Character choices in Harry Potter fanfiction

Harry Potter fanfiction is something I find pretty interesting. On FanFiction.net, there are over 700,000 stories, all set within the Harry Potter universe. Not one of the authors expects any sort of financial gain, although it is possible for popular authors to get published. E.L. James, the author of Fifty Shades of Grey, got her start writing fanfiction for the Twilight series. Say what you will about either of those series, but there’s no denying their popularity.

The stories on FanFiction.net, also called “fics” or “fanfics”, vary wildly in length from poems to short stories to full-length novels. There are even a few fics that contain more words than all seven Harry Potter books put together. It’s also surprising that the community is still active to this day, considering it’s been eight years since the last book came out and four years since the last movie.

I decided to take a look at the meta-information available on FanFiction.net for these stories. When browsing for fics, you can get lists of descriptions containing the title, author, genre, length, number of reviews, etc. There’s also a short summary written by the author. I wrote a script that scrapes this information from the search results, using the Python lxml module for HTML parsing. I randomly sampled 200,000 fics, so roughly 28% of the total. Here are a few stats from the dataset:

 

I think these are some fairly impressive numbers. Of course, these statistics are a poor summary of the actual data, but for my first post with this dataset, I wanted to look at character choices.

Most popular single characters

Each fic can list up to four characters from a list of 375. Harry Potter has a lot of supporting characters. The first thing I did was count how many times each individual character appeared in a fic and sorted the result. Below are the most popular character choices.


Top 25 character choices. Note that the percentages sum to >100% since each fic can have multiple characters
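
The counting step is simple enough to sketch in Python. The data layout (one list of character names per fic) and the toy records are assumptions for illustration, not the actual scraped dataset.

```python
from collections import Counter

def character_percentages(fics):
    """Percentage of fics listing each character, most popular first."""
    counts = Counter()
    for characters in fics:
        counts.update(set(characters))  # count each character once per fic
    total = len(fics)
    return [(name, 100 * n / total) for name, n in counts.most_common()]

# Toy records, invented purely for illustration
toy_fics = [
    ["Harry P.", "Hermione G."],
    ["Harry P."],
    ["Draco M.", "Hermione G."],
    ["Harry P."],
]

ranking = character_percentages(toy_fics)
print(ranking)  # [('Harry P.', 75.0), ('Hermione G.', 50.0), ('Draco M.', 25.0)]
```

Because a fic can list several characters, the percentages here add up to 150%, which is the same effect noted in the caption.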

 

It should come as no surprise that Harry is at the top of the list. I do feel a bit bad for Ron, though. It appears that Draco has taken what could arguably be called his spot, and Ron barely made it higher than “OC”, meaning an original character. I think there are multiple reasons for this. First, a large percentage of Harry Potter fanfics are romances: 55% of fics in my sample contain the “Romance” genre label, with the next highest being “Humor” at 21%. I think Draco has that bad boy appeal that makes him popular in romances. I’ve also found that a lot of fics are “do-overs”, i.e. fan re-imaginings of the original plot. In those stories, Ron can be unpopular. The community even has a term for treating a character badly called “bashing”.

The other rankings aren’t too surprising. I kinda wish Fred and George were next to each other, but I understand since there’s also a popular type of story that picks up where the books ended. As an example, if anyone reading this is a Harry Potter fan, I recommend the short story Cauterize by Lady Altair.

Pairings and multiple character groupings

The next thing I wanted to look at was how characters are grouped together. As I mentioned, romance is a very popular genre label, so the most popular grouping is obviously two characters. The community refers to this as a “pairing”, and if a fan likes a particular pairing, they “ship” that pairing. It’s short for “relationship” (used as both a noun and a verb for some reason). There can be fervent debates about “ships” on sites like /r/harrypotter, so I think it’s an interesting thing to look at.

As an aside for those familiar with FanFiction.net, I’ve ignored the use of square brackets [], which are supposed to explicitly denote a romantic pairing. Only 3.2% of fics in my sample use brackets, which goes up slightly to 4.5% if you just consider fics with a Romance genre label. Thus, I found it easier to just ignore them.

To do the analysis, I counted how often each combination of characters occurred and grouped the possible combinations by numbers of characters, i.e. a “double” indicates a fic with two and only two characters. These pairings may be romantic or platonic, but differentiating these cases is impossible from my dataset, since even the summary text may not indicate which pairings are romantic and which aren’t. Regardless, this is such a large dataset that I think the overall trends are still clear.
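
As a rough sketch of that grouping step (under the assumption that each fic is represented as a list of character names, with hypothetical toy records), the order of the characters is ignored by sorting them before counting:

```python
from collections import Counter, defaultdict

def count_groupings(fics):
    """Count each unordered character combination, grouped by its size."""
    by_size = defaultdict(Counter)
    for characters in fics:
        group = tuple(sorted(set(characters)))  # order-independent key
        if group:  # skip fics that list no characters
            by_size[len(group)][group] += 1
    return by_size

# Toy records, invented purely for illustration
toy_fics = [
    ["Hermione G.", "Draco M."],
    ["Draco M.", "Hermione G."],
    ["Harry P."],
    ["Harry P.", "Ron W.", "Hermione G."],
    [],
]

groupings = count_groupings(toy_fics)
for size in sorted(groupings):
    print(size, groupings[size].most_common(3))
```

Here `groupings[2]` holds the “doubles”, `groupings[1]` the solo acts, and so on.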


Pie chart of character percentages from a sample of 200,000 fics. The color indicates the number of characters, which are broken up into the most popular character choices for each.

 

A few observations:

  1. Pairings are definitely the most popular type of fic, with Draco pairings claiming the top two spots. I was actually surprised that canon pairings are as popular as they are (Ron/Hermione, Harry/Ginny, etc.).
  2. The “other doubles” category is the single biggest chunk of the graph, but these are all pairings that each make up <1% of the total. In my dataset, there are 2,779 unique pairings, and I’m only showing 12. Of course, this is only a small subset of the 70,125 possible pairings given 375 available characters.
  3. Of the solo acts, Snape is the second most popular (after Harry, of course). This seems appropriate to me, since Snape strikes me as a lone wolf character.
  4. The “golden trio”  of Harry, Hermione, and Ron is unsurprisingly the most popular choice for three characters. The value 0.1% may seem small, but remember we’re talking about 700,000 fics, so 0.1% is still in the hundreds.
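
The arithmetic behind a couple of the numbers above can be checked quickly. The variable names are mine; the inputs (375 characters, 2,779 observed pairings, 700,000 fics) come from the text.

```python
n_characters = 375

# Number of unordered pairs: n choose 2 = n * (n - 1) / 2
possible_pairings = n_characters * (n_characters - 1) // 2
print(possible_pairings)  # 70125

# Share of possible pairings that actually show up in the sample
observed_pairings = 2779
print(f"{observed_pairings / possible_pairings:.1%}")  # 4.0%

# 0.1% of roughly 700,000 fics
total_fics = 700_000
fics_at_point_one_percent = total_fics // 1000
print(fics_at_point_one_percent)  # 700
```

So only about 4% of the possible pairings appear at all, and a 0.1% slice of the whole site would still be about 700 stories.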

Summary analysis for different pairings

The last thing I wanted to do was see what particular pairings have in common, if anything. The method I chose was comparing the most popular words used in the summaries. I decided to look at just two pairings, specifically the top two that didn’t have a common character (Draco/Hermione and James/Lily). To make the comparison, I took the 100 most popular words in the summaries of each pairing (using the same algorithm as Wordle to count words) and clustered the words by whether they were common to both pairings or not. This doesn’t mean a word specific to a pairing never appears in the opposite pairing, it’s simply not in the top 100. The resulting “Venn diagram” is shown below. Note that I removed explicit mentions of the characters involved since they dominated the counts. For example, I removed words such as Draco, Draco’s, Malfoy, etc. Also, I limited the analysis to fics written in English for obvious reasons.


“Venn diagram” word cloud showing the most popular words in the summaries of fics with either a Draco/Hermione or James/Lily pairing. The font size for each word is proportional to the total number of times that word appears in summaries with these pairings. Gray words in the center are the most common words that appear in both pairings, blue words on the left are the most common words that appear in Draco/Hermione fics, and red words on the right are the most common words in James/Lily fics.
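
The split into shared and pairing-specific words can be sketched with Python sets. This is only an approximation under my own assumptions: it uses a simple lowercase regex tokenizer rather than Wordle’s word counting, a tiny hypothetical stop-word list, and invented toy summaries.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; a real one would be much longer
STOP_WORDS = {"the", "a", "and", "of", "to", "in", "is", "at"}

def top_words(summaries, excluded, k=100):
    """Set of the k most common words, ignoring stop words and excluded names."""
    counts = Counter()
    for summary in summaries:
        for word in re.findall(r"[a-z']+", summary.lower()):
            if word not in STOP_WORDS and word not in excluded:
                counts[word] += 1
    return {word for word, _ in counts.most_common(k)}

def venn_split(summaries_a, summaries_b, excluded, k=100):
    """Return (only in A's top k, in both top k lists, only in B's top k)."""
    top_a = top_words(summaries_a, excluded, k)
    top_b = top_words(summaries_b, excluded, k)
    return top_a - top_b, top_a & top_b, top_b - top_a

# Toy summaries, invented purely for illustration
dramione = ["Draco keeps a secret at Hogwarts", "A secret love"]
jily = ["Lily finally falls in love at Hogwarts"]
names = {"draco", "hermione", "james", "lily"}

only_a, both, only_b = venn_split(dramione, jily, names)
print(sorted(only_a), sorted(both), sorted(only_b))
```

As in the diagram, a word landing in one of the outer sets only means it missed the other pairing’s top k, not that it never appears there.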

 

The Venn diagram look may not have panned out as I had originally hoped, but the information is still interesting. Even if it does look like a Pepsi logo.

A few observations:

  1. The top words in both pairings tend to be shared, i.e. there are more gray words than blue or red. This isn’t unique to Draco/Hermione vs. James/Lily. Words like love, Hogwarts, and year are common to many types of pairings. There are also common English words like up and out in the center. I automatically remove the most common English words (Wordle does the same), but these two aren’t on the list of common words I found online.
  2. You can see that the Draco/Hermione pairing is more popular than James/Lily since its unique words are larger overall. To scale the font size of the words in the center, I averaged the counts from both pairings.
  3. The most common category of unique words is shorthand names for the pairing, e.g. DMHG or LJ. The word Dramione is a portmanteau of Draco and Hermione. I’m not sure if there’s one for James/Lily yet, but my vote is for Limes.
  4. I don’t think the list of unique words is enough to make any claims of thematic differences between the pairings. For example, I could speculate that many Draco/Hermione fics are Romeo-and-Juliet-style stories of star-crossed lovers, whereas James/Lily stories could have the “will-they-or-won’t-they” trope. There might be hints of this (secret and past for Draco/Hermione; finally and hate for James/Lily), but this isn’t enough to make any strong conclusions. I might look at popular phrases/groups of words to really get at this question.

Final thoughts

Character choices in Harry Potter fanfiction can be considered both highly variable (6,950 unique character groupings from a series with essentially three main characters) and highly regular (a randomly selected fic has a 25% chance of having Harry, Hermione, or Draco as a character). I hope to do more analyses like these, and I thought character choices were a good place to start because they’re an easy dimension for clustering fics together. Next, I hope to do more with the summaries. Perhaps I’ll use Markov chains to generate pseudorandom summaries like the posts in /r/SubredditSimulator. Please leave suggestions below and thanks for reading!