I’m a big fan of the reddit community AskScience, where anyone can ask a question and have “scientists” try to answer it. Some of the people answering may not be actual scientists, but there’s still a lot of participation from people with real expertise. You can also find some fascinating questions. Here’s a short list of questions I thought were interesting or amusing (in no particular order):
- If two ships travel at faster than half the speed of light away from each other, could light from one ever reach the other?
- What exactly is an itch?
- Why can’t I list every book I know, but I can tell you if I own it?
- Why do airplane windows need to have that hole?
- Do people sneeze while they sleep?
- If you farted hard enough in space, could you move yourself around?
Clearly, these people are asking the important questions. One thing I’ve noticed on AskScience is how physics-related questions seem to be the most common. I know I’m biased (I always hope to see more neuroscience questions), but I wanted to know for sure. I decided to analyze the subreddit submissions by scraping submission info with PRAW. Specifically, I used this Python script to download information about every submission since the subreddit started (6 years ago). The end result was 155,805 JSON files covering a range of 2,239 days. Below are some simple descriptive statistics about this dataset. Eventually, I’d like to do some natural language processing on the questions and answers themselves, but for now, this will be a broad look at the metadata for submissions to AskScience.
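As a rough sketch of the loading step (the directory layout and helper name here are my own assumptions; the field names follow reddit’s submission schema), the scraped JSON files can be pulled into a single pandas DataFrame like this:

```python
import json
import pathlib

import pandas as pd

def load_submissions(directory):
    """Load one-JSON-file-per-submission scraper output into a DataFrame.

    Hypothetical layout: each file holds one submission's fields
    (created_utc, score, link_flair_text, ...) as saved by the scraper.
    """
    records = []
    for path in pathlib.Path(directory).glob("*.json"):
        with open(path) as f:
            records.append(json.load(f))
    df = pd.DataFrame(records)
    # created_utc is a Unix timestamp in seconds
    df["created"] = pd.to_datetime(df["created_utc"], unit="s")
    return df.set_index("created").sort_index()
```

Indexing by submission time makes the time-based aggregations below one-liners.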
First, a bit of history
Below, I’ve plotted the average rate of submissions over time in units of submissions per hour, starting around April 2010. You can see that the rate grew slowly for the first eight months but then spiked around January 2011. I think this was a result of new traffic from AskScience’s nomination for the Best Little Community of 2010. Over the next nine months, the community grew by leaps and bounds until it was eventually added to the list of default subreddits in October 2011. That caused another huge spike in traffic, and you can see major fluctuations in submission rate over the following 12 months or so. Since then, the rate of submissions has continued to fluctuate up and down but has held more or less steady.
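A sketch of how a rate like this could be computed (assuming a DataFrame indexed by submission timestamp; the smoothing window is an arbitrary choice, not necessarily what I used for the plot):

```python
import pandas as pd

def hourly_rate(df, window="30D"):
    """Average submissions per hour, smoothed over a rolling time window.

    Assumes `df` is a DataFrame indexed by submission timestamp.
    """
    per_day = df.resample("D").size()           # submissions per calendar day
    return per_day.rolling(window).mean() / 24  # convert days to hours
```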
I think it’s interesting to look back at the early periods to see how the subreddit gained popularity. For example, this post from March 2011 has a moderator saying that the number of subscribers had reached 19,000, up from only 4,500 just six months earlier. When AskScience was made a default subreddit, this number skyrocketed. As of this writing, there are 8.5 million subscribers. That’s quite a change.
Breaking down submissions by flair type
Sometime in 2012, AskScience started requiring each question to be tagged with one of 12 flairs corresponding to different scientific fields: physics, biology, chemistry, etc. Below, I’ve broken down the 91,014 tagged posts by flair type. Sure enough, physics is by far the most popular. Biology is a close second, but each of the other fields is less popular by a factor of 2 or more. The numbers of questions tagged with the bottom three flairs (psychology, social sciences, and computing) are each around 13 times smaller than the number tagged with physics.
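For reference, a breakdown like this takes only a couple of lines (assuming a DataFrame with reddit’s `link_flair_text` column, where untagged posts show up as missing values; the helper names are my own):

```python
import pandas as pd

def flair_counts(df):
    """Count submissions per flair, ignoring untagged posts."""
    return df["link_flair_text"].dropna().value_counts()

def flair_percentages(df):
    """Percentage of tagged submissions per flair."""
    counts = flair_counts(df)
    return 100 * counts / counts.sum()
```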
This confirmed my observation that physics questions were the most common, but I was also curious to see whether these percentages changed over time. Below I have a stacked area plot showing how the relative percentages evolved. Initially, most questions were not tagged, so the gray area covers almost 100% of the plot. Then, in 2012, the colored areas get larger, meaning more and more submissions were getting tagged. You can tell how the percentages changed by comparing the relative sizes of the colored areas.
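The underlying table for a plot like this can be sketched as follows (again assuming a timestamp-indexed DataFrame with a `link_flair_text` column; the function name and monthly binning are my own choices):

```python
import pandas as pd

def flair_share_over_time(df, freq="M"):
    """Percentage of submissions per flair in each time bin.

    Untagged posts are kept as their own 'none' category, matching
    the gray area in the stacked plot.
    """
    flair = df["link_flair_text"].fillna("none")
    counts = (
        flair.groupby([pd.Grouper(freq=freq), flair])
        .size()
        .unstack(fill_value=0)
    )
    # Normalize each row (time bin) so the categories sum to 100%
    return counts.div(counts.sum(axis=1), axis=0) * 100
```

The result is ready to pass to `DataFrame.plot.area` for the stacked view.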
The relative percentages look fairly stable over time in this graph, but something interesting happened to the Biology category. It’s hard to tell from the plot above, but the rates of biology and physics questions were actually quite similar in the beginning. Sometime in mid-2014, the rate of biology tags dropped, which you can see as the green biology area getting smaller.
It’s much easier to see this trend when you look at the raw submission rates, which I’ve plotted below. Initially, the curves are highly correlated (they both go up and down as the overall submission rate fluctuates), but then the biology rate drops and never recovers.
I’m not sure what caused this drop. It could be the result of more aggressive moderation of biology-tagged posts, or possibly a genuine decrease in the number of biology questions. Regardless, the end result is an even greater percentage of physics questions over the last two years. The bar graph above shows the percentages of physics and biology questions are around 27% and 20%, respectively. More recently, these values are closer to 30% and 15%.
UPDATE: After talking with a few mods from AskScience, I’ve learned that the drop in Biology posts was caused by the introduction of the “Human Body” tag, a subset of the Medicine category. This resulted in a lot of questions getting the Medicine flair that would previously have gotten the Biology flair. Nevertheless, there is still a small decrease in Biology-related posts over time, which you can see in the stacked area plot by comparing the combined Biology and Medicine areas to the Physics area.
An obvious follow-up to the previous analyses is to ask how well the posts from each field do once they’re submitted, i.e., how many upvotes they get. Reddit “fuzzes” this information to prevent spam bots from taking advantage, but the numbers I extracted should still be a decent approximation.
I decided to compare three values:
- Score – This is the number of upvotes a post receives minus the number of downvotes.
- Upvote ratio – The ratio of upvotes to the total number of votes. This number is between 0 and 1.
- Number of comments – This is the total number of comments in a post, so it includes all the replies to the top-level answers.
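Assuming a DataFrame with reddit’s `score`, `upvote_ratio`, and `num_comments` columns, the per-flair means behind the comparison below reduce to a simple groupby (the helper name is my own):

```python
import pandas as pd

def metrics_by_flair(df):
    """Mean score, upvote ratio, and comment count per flair.

    Untagged posts are dropped, mirroring the flair breakdown above.
    """
    tagged = df.dropna(subset=["link_flair_text"])
    return tagged.groupby("link_flair_text")[
        ["score", "upvote_ratio", "num_comments"]
    ].mean()
```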
Below I have the mean values for each flair type marked with circles. The lines are 99% bootstrapped confidence intervals, so a lot of overlap between these lines implies the mean values are not significantly different from each other. You shouldn’t place too much stock in comparing individual pairs because of the problem of multiple comparisons, but if there’s a large gap between the confidence intervals, it’s probably safe to say that the difference is statistically significant.
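For the curious, a percentile bootstrap of the mean can be sketched as follows (this is a generic implementation, not necessarily my exact procedure; the number of resamples and the seed are arbitrary):

```python
import numpy as np

def bootstrap_ci(values, n_boot=10_000, alpha=0.01, seed=0):
    """99% percentile-bootstrap confidence interval for the mean.

    Resamples with replacement, so it makes no assumption about the
    underlying (heavily skewed) distribution.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```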
The confidence intervals for the scores have a large degree of overlap, indicating there probably aren’t very many significant differences between the groups. There might be some, e.g. medicine posts seem to have a higher average score than chemistry posts, but not by much. It’s important to note that the score and number of comments have distributions that are heavily skewed towards zero (95% of posts have scores under 100). The mean isn’t a great summary statistic for these types of distributions, but the confidence intervals I have here are bootstrapped, i.e. they don’t make any assumptions about the underlying distribution.
The confidence intervals for the upvote ratios and number of comments have less overlap, so you can have more confidence in making comparisons. Take, for example, the opposite trends for mathematics and neuroscience, two categories that receive similar numbers of questions. Math questions seem to get the most comments but have the lowest upvote ratios, whereas neuroscience questions seem to have high upvote ratios but few comments. This makes me wonder if math questions are more controversial, generating a lot of discussion but not a lot of positive interest. Conversely, I think people like seeing neuroscience questions enough to upvote them, but there aren’t as many neuroscientists on reddit answering questions.
This may have seemed like a lot of work just to confirm my suspicion about the popularity of physics questions, but I enjoyed taking a deeper look at such an interesting part of reddit. There’s plenty more to do with this dataset, such as looking at the actual content of the questions and answers. I’m also very interested to see how much higher the scores are for answers that were given by people with “scientist” flairs. Stay tuned.