Generating gibberish from Harry Potter fanfiction summaries

 This is a continuation of my previous post analyzing character choices in Harry Potter fanfiction.

Generating random/gibberish text is not a new idea. Perhaps you’ve been to /r/SubredditSimulator, a complete subreddit in which content is created by bots generating random text. Or perhaps you’ve heard about gibberish scientific articles actually being accepted into journals and conference proceedings. To my knowledge, though, this is the first time someone applied these tools to fanfiction, or more accurately, fanfiction summaries.

The technique is based on the concept of Markov chains, a way of describing “memoryless” random processes, i.e. a process in which the next state in the process depends only on the current state and not the previous states. It’s an enormously useful concept. It’s even the basis for how Google ranks websites in their search engine results.

But enough introduction, let’s get to the good stuff. Introducing the Harry Potter Fanfiction Summary Generator! Just click the button below to generate a new random fanfiction summary.

Harry Potter Fanfiction Summary Generator

Falling Star Bright
Everyone in Hogwarts seems to expect a little four eyed push over…heh, they are NOT correct.
Rated: M
– English – Romance/Angst – Chapters: 35 – Words: 70882 – Reviews: 478 – Favs: 816 – Follows: 1049
– Draco M., Ginny W.

Note: This generator is not creating summaries on the fly, but rather loading summaries from a previously generated list. There are about 10,000 summaries in the list, so it should take a while before you start seeing repeats.

So how does it work?

Disclaimer: the rest of this post will get somewhat technical, but I’ll try to avoid jargon.

You probably saw in the examples above that sometimes the generator produces perfectly legitimate results, even to the point of containing whole sentences from preexisting summaries (more on that later). Other times, it fails horribly and hilariously. To understand what’s going on, you need to understand the concept of Markov chains.

To construct a Markov chain, you start by analyzing the probabilities of transitions between “states”. In this case, the states are words. For example, if you start with the word ‘Harry’, you can probably guess that it is often followed by the word ‘Potter’ and less often by, say, ‘taradiddles’ (yes that’s a real word; it actually appears once in Harry Potter and the Order of the Phoenix). By analyzing all of the word transitions in a body of text, you can calculate lots of probabilities and create a diagram like the example below.

Diagram for simple Markov chain. The size of the arrows is proportional to the probability of the second word following the first, e.g. 'Potter' might follow the word 'Harry' 80% of the time, while 'and' might follow 'Harry' the other 20%. From these words, there are other likely choices,such as 'Harry Potter is' or 'Harry and Draco'. It is much less likely to see something like 'Harry and and'. (Note: these are dummy probabilities for the purpose of illustration).

Diagram for a hypothetical Markov chain. The size of the arrows is proportional to the transition probabilities between words, so a value of 0.8 would imply that the word ‘Potter’ follows the word ‘Harry’ 80% of the time. To construct a random sentence, we pick a starting point and then move between “states” (i.e. words) according to their probabilities.

As you might expect, Markov chains can be much more complicated than this example. To generate the summaries above, I constructed a Markov chain using 25,000 fanfiction summaries. This was a sample of “popular” fanfics, specifically the top 25,000 fics written in English, sorted by number of reviews. This is certainly a biased sample, but hopefully biased in an interesting way. For example, I might speculate that summaries in this sample are more successful (on average) in attracting readers’ attentions. I obviously don’t know if that’s true, but I think the sample is large enough to get a good sense of common trends in summaries.

That’s nice, but how does it work?

To actually explain the process, I need to introduce the concept of the Markov chain order, also referred to as the memory. When using a Markov chain to generate random text, this number refers to how many previous words are considered when selecting the next. For example, say we start with the phrase ‘sent back in’. With an an order of 1, only the previous word is considered, so the next word is chosen based on which words are most likely to follow ‘in’. For an order of of 3, you consider all three words, so the next most likely word is definitely ‘time’, which constructs the phrase ‘sent back in time’. As you might expect, this phrase is very common in fanfiction summaries, since a lot of stories involve time travel.

One way to analyze the effect of Markov chain order is to generate lots of random summaries and see how often these summaries match one of the input summaries used to construct the chain. By “match”, I mean an exact match, including the capitalization and punctuation. Below, I show the results of this analysis for small subsets of the full dataset. It would be nice to repeat this analysis for the entire thing, but that’s more work than I’m willing to do for a blog post.

Effect of Markov chain order on probability of producing exact matches when randomly generating summaries. A probability of zero implies that every generated summary is completely unique, while a value of one implies every generated summary is just a reproduction of an existing summary.

To calculate these probabilities, I generated lots of random summaries from each Markov chain and calculated the fraction of exact matches from the sample. I repeated this process several times with a different sample of summaries and averaged the results. This is an example of a Monte Carlo method.

There are two trends to describe in the graph:

Effect of order — With an order of 1, nearly every generated summary is unique. With an order of 5, basically all of them are just reproductions of the input data. Something special happens around order 2-3 when we start to get a lot of matches. This value has a lot to do with how long summaries tend to be. If you wanted to reproduce larger sections of text (e.g. an entire fanfiction), you would need a higher chain order.

Effect of sample size — You can see the general effect is to shift the curves to the right as the sample size increases. This implies that you get fewer matches with a larger sample.

From these results, I decided to choose an order of 3 to generate summaries from my full dataset, since I think it’s high enough to create interesting patterns, but low enough to create mostly unique results. I generated 10,000 summaries and 271 were matches. I decided to remove them from the generator above, since these were usually of result of all the crazy ways people use punctuation to make things stand out, e.g. ::Rewritten & Improved:: or ***COMPLETE***. However, you’ll still see times when it’s reproducing part of a summary, then it suddenly switches to a new one. This can create readable, yet hilarious results.

Lastly, I should mention the titles were also constructed with Markov chains, only using an order of 1 since titles are so much shorter. I also removed randomly generated titles with only one word, since these are always exact matches. Despite these precautions, ~18% of the titles are still matches.

You still haven’t told me how it works

Right. This post is already getting pretty long, so I decided to put some of the extra technical information on a separate page here. You can see the actual algorithm I used to generate random summaries, as well as the techniques I used to provide the accompanying information, e.g. genre, rating, reviews, etc.

Phrase analysis

To finish off this post, I decided to look at the most popular phrases used in the summary Markov chains. Recall that for an order n chain, we consider the n previous words to pick the next, so we can look at how often phrases of length n and + 1 occur. Since I had difficulty deciding between an order of 2 or 3, I created Markov chains for both and can analyze popular phrases from 2-4 words long. Below I have the top 15 phrases from each group.

Most popular word pairs (left, red), triplets (center, blue), and quadruplets (right, green). The font size is proportional to the number of time that phrase occurs in summaries, relative to the top word in each list

Most popular phrases used in Harry Potter fanfiction summaries. The three lists corresponds to phrases of different lengths: pairs (left, red), triplets (center, blue), and quadruplets (right, green). The font size is proportional to the number of time that phrase occurs in summaries, relative to the top word in each list

There are some interesting things to notice:

  1. For a length of two, 9 of the 15 phrases are prepositional phrases, i.e. not really specific to fanfiction summaries. Also, the only name mentioned is Harry.
  2. For a length of three, you start to see some interesting combinations, like character pairings and other phrases unique to Harry Potter fanfiction. I think the most interesting phrases are ‘What happens when’ and ‘back in time’, since they illustrate the hypothetical nature of Fanfiction stories.
  3.  For a length of four, you see more of the hypothetical phrases, including three variations of ‘what happens when’. I also think it’s very interesting that you see different parts of phrases that are >4 words. For example, there is ‘at the end of’ and ‘the end of the’, so I would probably predict the 5 word phrase ‘at the end of the’ would also be very popular.

Final thoughts

I hope you’re convinced that Markov chains are a neat way of analyzing text, even if it’s only to giggle at the the gibberish they can produce. Make sure to check out more involved uses of this technique like in /r/SubredditSimulator. Also, if you wanted to see some additional info like the actual algorithm I used, please visit this page. Thanks for reading!