Word Clouds!

Author’s note: This is an old journal entry from July 29, 2014 that I never did anything with, so now it’s going to become my first blog post.

I recently discovered Wordle, a site that lets you create beautiful word clouds: graphics made of words, where each word's font size is proportional to how often it appears in a given body of text. I thought it would be fun to use Wordle to analyze some abstract books from scientific conferences, but that may have been a bit overambitious. Who knew you could crash Chrome by trying to copy and paste 15,000+ pages of text?

Undeterred (i.e., too much time on my hands), I decided to write some Matlab code to make my own. This allowed me to analyze very large bodies of text and gave me more control over the final graphic. I had to learn some regular expressions and use Matlab’s Computer Vision Toolbox in a way it certainly wasn’t designed for, but hey, it works. Below are two word clouds showing the 200 most common words in the ARO 2014 and SfN 2013 abstract books, respectively. ARO stands for the Association for Research in Otolaryngology, and its meeting is definitely the largest conference devoted to hearing research. It pales in size, however, compared to the annual SfN meeting (Society for Neuroscience), which draws in about 30,000 people.

Word cloud from ARO 2014

Word cloud from SfN 2013

To make the word clouds, I simply counted word frequencies, as Wordle does. In addition to ignoring common English words (“the”, “and”, etc.), I also removed nonspecific scientific words like “abstract” and “methods.” The results certainly aren’t as pretty as Wordle’s, but look, a brain! And sort of a cochlea! The clouds from Wordle have much nicer inter-word spacing because of the way Wordle handles collision detection, but I had to be more flexible since I wanted the words to fit into an arbitrary shape.
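For anyone curious about the counting step, here is a minimal sketch in Python rather than Matlab. It is not the code used for the clouds above. The two stop lists, the function name `top_words`, and the tokenizing regular expression are all assumptions made for illustration, since the actual lists and patterns aren't given here.

```python
import re
from collections import Counter

# Hypothetical stop lists for illustration only; the real lists are not given in the post.
COMMON_WORDS = {"the", "and", "of", "in", "to", "a", "is", "was", "were", "that", "for", "with"}
GENERIC_SCIENCE_WORDS = {"abstract", "methods", "results", "conclusions"}


def top_words(text, n=200, stop_words=COMMON_WORDS | GENERIC_SCIENCE_WORDS):
    """Return the n most frequent words in text as (word, count) pairs, skipping stop words."""
    # Lowercase everything, then pull out runs of letters (apostrophes allowed inside a word).
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


sample = "Abstract: Hair cells and neurons. Methods: the cells were imaged. Cells!"
print(top_words(sample, n=1))
```

The font size of each word can then be scaled from its count. Placing the words inside a shape is a separate (and much fiddlier) problem that this sketch doesn't attempt.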

Observations so far? Despite taking the #3 spot in the ARO cloud, the word “auditory” barely made it into the SfN cloud, landing at #196 and losing terribly to “visual” at #47. Oh well, at least the sensory neuroscientists can be bitter together, since they were beaten by other systems, namely #19 “memory” and #21 “motor.” In case you were wondering, the word “neurons” was used 17,452 times in the SfN abstract book, roughly an order of magnitude more than the #1 word in the ARO cloud: “cells,” which was used 1,846 times. This isn’t all that surprising given the huge size of SfN relative to ARO.

Lastly, for a summary of the data, here are some basic counts on the text analysis.
