# Analysis of abstracts from a paper library

Author’s note: This is an old journal entry from August 15, 2014.

One thing I really respect about my PhD advisor is his effort to stay up-to-date with the recent literature. At conferences, I always see him talking to several different grad students during the poster sessions, whereas it seems like most senior scientists make it to one or two posters before talking to other PIs. He also keeps an extensive library of journal articles that gets updated regularly. I don’t know if he actually reads every new paper, but I’m nevertheless jealous of his ability to find them. So far I’ve been dissatisfied with services designed to notify me of new articles, whether it’s an RSS feed of a major journal or an email search alert from PubMed/Google Scholar. These services are either too strict and miss relevant articles, or too lax and return way too many results. A new service called PubChase holds promise, but I don’t know how well it works. Regardless, I wanted to see if I could figure out a better way to find new, relevant articles. The first step: analyzing my advisor’s library of papers.

### Getting the raw data

My advisor’s paper library has over 13,000 files in it, and I certainly did not want to open every file to get the raw text from the titles and abstracts. EndNote provided a way to automate this, although it wasn’t successful in extracting the abstracts from every article. I did, however, manage to create a huge text file with the titles and abstracts of 6,687 journal articles. This process was likely biased toward newer papers, since I don’t think PubMed can pull abstracts from scans, but this frankly doesn’t bother me as long as the sampling process was unbiased with respect to the article’s topic, which is hopefully the case.

To begin, I used my code based on the Wordle algorithm (see previous post) to identify the 500 most common words and their relative usages. This counting ignores common English words as well as a list of special words, which I omitted somewhat arbitrarily after deciding they would be bad at identifying a paper’s unique content. For example, words like “results”, “effects”, “suggest”, “show”, and “significantly” could show up in any abstract regardless of the topic. Also, I counted each word in the title 3 times in order to give the title more weight. This technique is used by PubMed in its search algorithm, described here. Shown below is a word cloud of the final set of 500 words used for clustering.
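As a rough illustration, the counting step could look something like the Python sketch below. This is not the original Matlab code: the stop lists are short hypothetical stand-ins (only the five "special" words quoted above come from the post), and the function names and the `(title, abstract)` input format are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop lists; the actual lists used in the post are not given.
COMMON_WORDS = {"the", "and", "of", "in", "a", "to", "is", "that", "we", "were"}
SPECIAL_WORDS = {"results", "effects", "suggest", "show", "significantly"}

TITLE_WEIGHT = 3  # each title word counts three times, as described above


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def count_terms(papers, top_n=500):
    """Return the top_n (word, count) pairs over (title, abstract) pairs."""
    ignored = COMMON_WORDS | SPECIAL_WORDS
    counts = Counter()
    for title, abstract in papers:
        for word in tokenize(title):
            if word not in ignored:
                counts[word] += TITLE_WEIGHT
        for word in tokenize(abstract):
            if word not in ignored:
                counts[word] += 1
    return counts.most_common(top_n)
```

Calling `count_terms(papers)` on a list of `(title, abstract)` tuples returns the most frequent words first, which is the ordering a word cloud needs.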

Word cloud of 500 most common words in extracted abstracts and titles. The font size is linearly proportional to the number of occurrences. Words appearing in an abstract were counted once, but three times in a title.

Clearly, some of the words are too small to read due to the enormous number of occurrences of “auditory” (14,431, if you were curious). This illustrates why it is important to scale each word’s weight by its overall usage. Specifically, for document $i=1,2,\ldots,N$ and term $j=1,2,\ldots,M$, the total weight $W_{i,j}$ was calculated as the product of the global weight $G_j$ of term $j$ and the local weight $L_{i,j}$ of term $j$ in document $i$. These weights were calculated as follows, and are described in more detail on the PubMed website.

$W_{i,j} = L_{i,j}G_j,$
$G_j = \sqrt{\ln \left(\frac{N}{n_j}\right)}$
$L_{i,j} = \frac{10}{1+e^{\alpha\ell_i}\lambda^{k_{i,j}-1}}$

$N$ is the total number of documents (6,687), $n_j$ is the number of documents term $j$ appears in, $\ell_i$ is the total number of words in document $i$ (or 250, whichever is larger), and $k_{i,j}$ is the number of times term $j$ is in document $i$. The constants $\alpha = 0.0044$ and $\lambda = 0.7$ were given by PubMed. The total number of terms $M$ was set at 500, resulting in a set of 6,687 feature vectors of length 500 to be clustered.

Note that the above equations have two changes from the description on the PubMed website. First, the third equation above has a factor of 10 in the numerator, not 1. I added this because the local weights in my data set were originally much smaller than the global weights, skewing the clustering process. The second change was adding the square root in the second equation, which was made for the same reason. I don’t know why PubMed’s global weights are smaller, but perhaps it is because their database of documents is much larger.
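For concreteness, here is a minimal Python sketch of the three weighting equations, including the factor of 10 and the square root. The function names are made up for the example. Giving a weight of zero to a term that doesn't appear in a document is an assumption, since the equations above don't say how that case is handled.

```python
import math

ALPHA = 0.0044    # constants quoted in the text
LAMBDA = 0.7
MIN_LENGTH = 250  # documents shorter than this are treated as 250 words long


def global_weight(n_docs, n_docs_with_term):
    """G_j = sqrt(ln(N / n_j)), with the square root added in this post."""
    return math.sqrt(math.log(n_docs / n_docs_with_term))


def local_weight(term_count, doc_length):
    """L_ij = 10 / (1 + exp(alpha * l_i) * lambda ** (k_ij - 1))."""
    length = max(doc_length, MIN_LENGTH)
    return 10.0 / (1.0 + math.exp(ALPHA * length) * LAMBDA ** (term_count - 1))


def total_weight(term_count, doc_length, n_docs, n_docs_with_term):
    """W_ij = L_ij * G_j; absent terms get zero weight (an assumption)."""
    if term_count == 0:
        return 0.0
    return local_weight(term_count, doc_length) * global_weight(n_docs, n_docs_with_term)
```

Two properties are easy to check from the sketch: a term that appears in every document gets a global weight of zero, and the local weight grows with the number of occurrences but shrinks as the document gets longer.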

### Choosing the number of clusters

With over 6,000 articles, you can imagine that the range of topics covered is quite broad. You can find papers in the library on everything from signal processing to basic anatomy to psychology. There are even oddball papers on topics like particle physics. Any attempt to identify every topic in the library will fall victim to overfitting, but I was confident that I could separate a small number of topics that were well-represented and identify the words that best separate topics from each other.

I started with principal component analysis to reduce the dimensionality. The figure below shows the cumulative percentage of variance explained as a function of the number of components, and you can see how multidimensional this data set is. We need over 100 components to explain just 50% of the variance! This is a testament to how variable the term usages are between papers.

Cumulative variance explained by principal components
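The curve in this kind of plot comes straight from the eigenvalues of the covariance matrix: each component's share of the variance is its eigenvalue divided by the sum of all eigenvalues. The sketch below shows that bookkeeping in Python, assuming the eigenvalues have already been computed elsewhere. The function names and the toy eigenvalues in the usage note are hypothetical.

```python
from itertools import accumulate


def cumulative_variance_explained(eigenvalues):
    """Cumulative fraction of variance explained by the leading components."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]


def components_needed(eigenvalues, target=0.5):
    """Smallest number of leading components reaching the target fraction."""
    fractions = cumulative_variance_explained(eigenvalues)
    for n, fraction in enumerate(fractions, start=1):
        if fraction >= target:
            return n
    return len(fractions)
```

With toy eigenvalues `[4, 3, 2, 1]`, the first two components explain 70% of the variance, so `components_needed([4, 3, 2, 1], target=0.5)` returns 2.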

To help decide on the number of components to include and eventually perform the clustering, I used Matlab’s implementation of the EM algorithm for Gaussian mixture models (gmdistribution.fit). This allowed me to make some judgements on relative model quality using the Akaike information criterion (AIC). This value decreases as the likelihood of the model increases but contains an added penalty for increasing the number of free parameters. Therefore, a lower AIC value is desirable. You can see how the AIC value changes as a function of both the number of components and the number of clusters below.

Akaike information criterion (AIC) for Gaussian mixture model as a function of the number of clusters. Each line represents a different number of principal components used.
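As a reminder of what goes into this number, the AIC is $2k - 2\ln\hat{L}$, where $k$ is the number of free parameters and $\hat{L}$ is the maximized likelihood. The Python sketch below counts the free parameters of a Gaussian mixture, assuming each cluster has its own full covariance matrix. The post doesn't state which covariance options were used, so treat that as an assumption, and the function names are made up for the example.

```python
def gmm_free_parameters(n_clusters, n_dims):
    """Free parameters of a mixture with one full covariance per cluster."""
    means = n_clusters * n_dims
    # A symmetric n_dims x n_dims covariance has n_dims * (n_dims + 1) / 2 entries.
    covariances = n_clusters * n_dims * (n_dims + 1) // 2
    mixing = n_clusters - 1  # mixing proportions must sum to one
    return means + covariances + mixing


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood
```

Under that assumption, a model with 8 clusters in 15 dimensions has `gmm_free_parameters(8, 15) == 1087` free parameters, so the penalty term alone contributes 2174 to the AIC.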

There are two obvious trends. First, increasing the number of components appears to increase the AIC value in an approximately linear fashion (lines appear offset from each other), which is due to increasing the number of free parameters. Second, increasing the number of clusters appears to decrease the AIC, which is due to increasing the likelihood of the model. Another interesting trend appears when you normalize these lines by their max AIC value, as shown below.

Values of Akaike information criterion (AIC) as a function of the number of clusters. The values are normalized by the maximum value for each line, i.e. when every paper was assigned to a single cluster.

Here you can see that the AIC value decreases at the same relative rate when the number of components is ≥15, implying that adding more components to these models will not significantly improve the likelihood. Therefore, I chose 15 components for the final clustering. These components represent approximately 19.5% of the variance.

Finally, to choose the number of clusters, I used the fairly arbitrary elbow method and selected a value of 8. This method is very subjective, but as the above figure demonstrates, adding clusters beyond that point still decreases the AIC value, just not by much.
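I picked the elbow by eye, but the idea can be written down. The Python sketch below normalizes an AIC curve by its maximum, as in the figure above, and returns the first cluster count after which one more cluster lowers the normalized AIC by less than a threshold. This is a hypothetical heuristic for illustration, not the procedure used in this post. It assumes positive AIC values, and the `min_gain` threshold is arbitrary.

```python
def normalize_by_max(values):
    """Scale a curve by its maximum value (assumes the maximum is positive)."""
    peak = max(values)
    return [v / peak for v in values]


def pick_elbow(aic_values, min_gain=0.01):
    """aic_values[i] is the AIC for i + 1 clusters.

    Returns the smallest cluster count after which adding one more cluster
    lowers the normalized AIC by less than min_gain.
    """
    scaled = normalize_by_max(aic_values)
    for i in range(len(scaled) - 1):
        if scaled[i] - scaled[i + 1] < min_gain:
            return i + 1
    return len(scaled)
```

Different thresholds give different answers, which is exactly why the elbow method is subjective.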

### Determining cluster keywords

The final result is a set of 8 clusters, each with 500-1200 papers. To identify what each cluster’s “topic” might be, I summed the feature vectors for every paper within a cluster and sorted the result in order to obtain the words with the highest weights for each cluster. I then created a word cloud with the top 20 words in each cluster, shown below, where the color indicates cluster identity.

Word cloud showing the top 20 words within each of the 8 clusters of papers, calculated by summing the feature vectors across members of a cluster. The color indicates cluster identity, and the font size is proportional to the summed weight of the word.
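The keyword step is simple enough to sketch in a few lines of Python. The version below assumes the feature vectors are plain lists of weights, the cluster labels are one integer per paper, and `vocabulary` lists the words in the same order as the vector entries. These names and the input format are assumptions for the example, not the original Matlab code.

```python
def top_words_per_cluster(vectors, labels, vocabulary, top_n=20):
    """Sum feature vectors within each cluster and rank words by summed weight."""
    sums = {}
    for vector, label in zip(vectors, labels):
        total = sums.setdefault(label, [0.0] * len(vocabulary))
        for j, weight in enumerate(vector):
            total[j] += weight

    ranked = {}
    for label, total in sums.items():
        pairs = sorted(zip(vocabulary, total), key=lambda pair: pair[1], reverse=True)
        ranked[label] = pairs[:top_n]
    return ranked
```

The result maps each cluster label to a list of `(word, summed_weight)` pairs, highest weight first, which is what sets the font sizes in the word cloud.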

Before making any claims about how the words are related within clusters, there are several general observations I’ve made:

1. The weights are fairly evenly distributed, i.e. all the words have a somewhat similar font size. This is especially true when compared to the earlier word cloud constructed with the raw word counts.
2. The top words from the earlier word cloud (auditory, cochlear, and neurons) are still prominent in several of the clusters here, indicating that despite scaling for the overall usage, these words still occur often enough to produce a large summed weight within that cluster.
3. Different forms of words are commonly grouped together, e.g. cell and cells, implant and implantation, inhibition and inhibitory, etc. This could indicate that these words co-localize in the same or very related documents. The white cluster (middle-left) even has a triplet: neurons, neural, and neuronal.

### Describing the clusters

The real question is whether the top twenty words in each cluster actually say something about the papers in their cluster. If the clustering process actually separated the papers by topic, then we would expect that to be the case. You can see in the word cloud that there are certainly repetitions of words between clusters, but I do think the words in each cluster can be interpreted as an overarching theme. There have to be outliers in each cluster (where do the particle physics papers go?), but I think the eight clusters break down into the following groups:
• Cluster 1 (dark red, top left): Auditory neurophysiology, electric hearing, cochlear implants. This cluster seems to focus on neural responses (response, activity, evoked) in auditory centers (auditory nerve, nucleus, cortex, inferior colliculus) from electric stimulation (electrical, stimulation, cochlear implant). Compare this to Cluster 8, which seems to focus on the speech recognition performance of CI users, or Cluster 3, which seems to involve neural responses due to acoustic stimulation.
• Cluster 2 (dark blue, top middle): Auditory neurophysiology, binaural hearing. This cluster certainly involves binaural neural processing (both binaural and interaural are present), especially in the inferior colliculus (note the abbreviation IC). It is also the only cluster with inhibitory, which certainly has an important role in binaural processing.
• Cluster 3 (orange, top right): Auditory processing, neuroimaging. This cluster seems to focus on higher-order auditory processing (speech, pitch, perception, complex), so it likely contains the neuroimaging papers. This is supported by the presence of the anatomical terms: primary, cortex, cortical, but there’s nothing to suggest neurophysiology at the cellular level.
• Cluster 4 (white, center left): General systems neuroscience. This cluster appears to involve general principles of neuroscience (spike, information, model, synaptic), including the aforementioned trifecta of neurons, neuronal, and neural. With regard to anatomy, this cluster is probably focused more on the neocortex (cortex, cortical), since no brainstem terms are mentioned. This is also the only cluster with a non-auditory word (visual), which is understandable given that a larger percentage of papers on the visual system involve cortical processing than the auditory system.
• Cluster 5 (yellow, center left): Cochlear implants, psychophysics. This cluster definitely focuses on cochlear implants (implant, implantation, CI, pulse), but also has many classic psychophysical terms (subjects, listeners, masking, noise). Compare this to Cluster 8, which seems to focus more on CI performance and less on the basic psychophysics.
• Cluster 6 (blue, bottom left): Inner ear biology, auditory periphery. This cluster seems to focus on the periphery (cochlea, hair cells, cochlear nucleus) with perhaps more of a biological focus than some of the neural coding clusters (synapses, membrane, synaptic, cell). However, there is certainly a neural component (nerve, fibers).
• Cluster 7 (light brown, bottom middle): Psychophysics. This one is definitely a psychophysics cluster. Pretty much every word is indicative of this (masking, cues, thresholds, detection, target, signal, noise). There is also a binaural component (binaural, interaural, localization, time, level).
• Cluster 8 (light blue, bottom right): Cochlear implant performance. This cluster certainly involves cochlear implants (electrode, stimulation, multichannel), but compared with the other two putative CI clusters (1 and 5), this cluster seems to focus on how well CIs actually work (speech, recognition, perception, performance, scores). It is certainly the most humanized cluster (children, patients, subjects, users). Given both children and age, this cluster likely also contains papers on how CIs affect development.

### Final thoughts

While my research involves cochlear implants, I am still somewhat surprised that 3 of the 8 clusters seem to include them, since they’re really only one aspect of my advisor’s interests. I was, however, pleased to see that each CI cluster seems to have a separate focus (neurophysiology, psychophysics, or general performance). This could indicate that the clustering worked well, although the notion of a singular topic per cluster is certainly something I arbitrarily imposed. Overall, I think the eight topics do cover my advisor’s research interests nicely, and I bet he’s even been an author on a paper in each cluster. In the future, it would be interesting to approach this problem with hierarchical clustering, since it might reveal subtopics within the larger clusters. For example, Cluster 7 might separate into topics on localization and signal detection. It would also be interesting to see where the outliers were assigned, since these 8 topics certainly don’t cover every paper in the library. Regardless, I think this was a fun and informative exercise. I did make several new PubMed email alerts as a result, but I’ll have to wait a few weeks to see how well they work. Who knows, maybe I’ll even read a paper or two, instead of just thinking of ways to find them.

Update: I set up around 5 PubMed alerts as a result of this work, each containing 3-5 search terms. They’ve been pretty effective at alerting me to papers, and I definitely prefer them to reading a bunch of tables of contents from multiple journals. Now if only I had the motivation to read all those papers…

# Word Clouds!

Author’s note: This is an old journal entry from July 29, 2014 that I never did anything with, so now it’s going to become my first blog post.

I recently discovered Wordle, a site which allows you to create beautiful word clouds, a type of graphic consisting of words where the font size is proportional to how often the word appears in a given body of text. I thought it would be fun trying to use Wordle to analyze some abstract books from scientific conferences, but that may have been a bit overambitious. Who knew you could crash Chrome by trying to copy and paste 15,000+ pages of text?

Undeterred (i.e. too much time on my hands), I decided to write some Matlab code to make my own. This allowed me to analyze very large bodies of text and also have more control over the final graphic creation. I had to learn some regular expressions as well as use Matlab’s Computer Vision Toolbox in a manner for which it certainly wasn’t designed, but hey, it works. Below are two word clouds, showing the most common 200 words in the ARO 2014 and SfN 2013 abstract books, respectively. ARO stands for the Association for Research in Otolaryngology, and it’s definitely the largest conference devoted to hearing research. It pales in size, however, compared to the annual SfN meeting (Society for Neuroscience), which draws in about 30,000 people.

Word cloud from ARO 2014

Word cloud from SfN 2013

To make the word clouds, I simply counted word frequencies like Wordle, but in addition to ignoring common English words (“the”, “and”, etc.), I also removed nonspecific scientific words like “abstract” and “methods.” The result certainly isn’t as pretty as those from Wordle, but look, a brain! and sort of a cochlea! The clouds from Wordle have much nicer inter-word spacing due to the way they handle collision detection, but I had to be more flexible since I wanted the words to fit into an arbitrary shape.
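To give a flavor of the counting and sizing steps (not the layout, which is the hard part), here is a small Python sketch rather than the Matlab code I actually used. It tokenizes with a regular expression, drops ignored words, keeps the most common ones, and scales font size in proportion to the count. The ignore list holds only the four words quoted above, and the function name and the 72-point maximum are arbitrary choices for the example.

```python
import re
from collections import Counter

# Only the words quoted in the post; a real ignore list would be much longer.
IGNORED = {"the", "and", "abstract", "methods"}


def word_cloud_sizes(text, top_n=200, max_size=72.0):
    """Map the top_n most common words to font sizes proportional to their counts."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in IGNORED]
    top = Counter(words).most_common(top_n)
    if not top:
        return {}
    peak = top[0][1]  # count of the most common word
    return {word: max_size * count / peak for word, count in top}
```

The most common word gets `max_size`, and a word used two-thirds as often gets two-thirds of that size. Placing the words without overlap inside a brain-shaped outline is a separate problem that this sketch doesn't touch.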

Observations so far? Despite obtaining the #3 spot on the ARO cloud, the word “auditory” barely made it into the SfN cloud, achieving spot #196 and losing terribly to “visual” at #47. Oh well, at least the sensory neuroscientists can be bitter together, since they were beaten by other systems, namely #19 “memory” and #21 “motor.” In case you were wondering, the word “neurons” was used 17,452 times in the SfN abstract book, an order of magnitude increase from the #1 word in the ARO cloud: “cells,” which was used 1,846 times. This isn’t all that surprising given the huge size of SfN relative to ARO.

Lastly, for a summary of the data, here are some basic counts on the text analysis.