This page discusses some of the technical details relevant to this post. Please read the information there before continuing on.
Markov chain algorithm
The algorithm is pretty simple, since I didn’t actually calculate transition probabilities, merely created huge lists of words. It’s essentially a large hash table, consisting of keys and entries. A key will be a series of words, each of length n, where n is the chain order. Entries will be lists of words that immediately follow a particular key. To see different description of the algorithm including code snippets check out these two blog posts: 1 2
Markov Chain Generation, assuming order n
- Initialize an empty list of keys and entries.
- Extract a single text sample from the list of text samples (e.g. a summary from the list of summaries).
- Split sample into individual words, keeping capitalization and punctuation intact. Make sure the last “word” is some sort of special terminating character (like the newline character or an empty space).
- Select a starting position in the current text sample, i.e. before the first word.
- Select a key based on next n words from the current position. Select an entry as the first word after the key.
- If the key already exist in the list of keys, add the current entry to the list of entries that correspond to that particular key. Otherwise, append the key to list of keys and add its first entry to the list of entries.
- If the entry is the terminating character, go to 8. If not, advance the current position forward one word and go to 5.
- If there are more text samples to process, go to 2. If not, you’re done.
The result is a large list of keys, each with a corresponding list of entries. It’s important to make sure the terminating characters included in the entries, since they will be the trigger to stop generating words. It also helps to keep track of which keys begin a text sample so you can pick a logical starting point.
- Start with an empty text sample. We will add randomly generated words to this.
- Choose a random starting key and add it to the text sample. Now the text sample should have n words (This is why it helps to keep track of which keys are at the beginning of text samples, otherwise your random text will start in the middle of a sentence.)
- Of the list of entries for the current key, choose a random entry. (Since we added entries regardless of whether they already existed, there will be multiple copies of some entries. By sampling randomly from the entries, you are choosing words based on their probability of occurring.)
- Append the entry to the text sample.
- If the entry is a word, go to 6. If instead the entry is a terminating character, then you’re done.
- Construct a new key based on the last n words of the text sample, i.e. the final n – 1 words of the previous key plus the entry. Then go to 3 with this new key.
Picking a genre and character list
In the generated summaries, I included information such as genre, characters, reviews, favorites, etc. in the summary generator. This was done for fun to mimic the look of browsing results on FanFiction.net. Since the primary focus of this post was the summary text generation, I didn’t focus too much on this aspect.
For characters and genres, I simply chose a random entry from my database of 25,000 summaries, so sometimes there will be no character or genre listed, since authors sometimes choose to leave these out. This selection process implies more popular choices will pop up more often, e.g. you’ll see more ‘Draco M., Hermione G.’ than ‘Albus D., Giant Squid’. This isn’t a great method, since it often doesn’t agree with the text of the generated summaries. On the other hand, generated summaries themselves can start out discussing Harry and then finish with a ‘DM/HG’ tag, so I decided mismatches weren’t all that bad. It also adds to the ‘gibberish’ aspect.
Picking the number of chapters, words, reviews, etc.
For the numeric values, I got a little more fancy. I combined the five values (chapters, words, reviews, favs, and follows) together and fit these data points with a multivariate lognormal distribution and then randomly sampled from this distribution to create values. I chose lognormal since all five of these values are heavily skewed towards small values, but even so, the fits tend to under-represents values near zero. I also had to round the generated numbers since the lognormal distribution is continuous and I needed discrete values.
Therefore, the lognormal distribution is not a great choice, but it’s a quick and dirty method that at least captures some degree of how these five values covary. For example, words and chapters are positively correlated, i.e. more chapters generally implies more words. This trend (and others) comes through in the randomly generated values.
Here’s an example, showing just chapters vs. words
The ellipsoid from the Gaussian fit illustrates that a random number generated from this fit will very rarely chose a value of 1 for the number of chapters, despite this being a very popular choice (e.g. for one-shots). Nevertheless, it captures some of the behavior and was good enough for my purposes.