Text mining, unlike other fields of data mining, requires expertise in many disciplines. To draw meaningful conclusions, we need psychology, sociology, and marketing, just to name a few. It can't, and shouldn't, be done by data scientists alone. However, text mining is not accessible to most people. Take web text mining for example; there are three main difficulties:
1. Collecting data is hard
Though data are open and free online, they are often too big and messy to be downloaded by hand. Collecting them requires some programming and data management skills.
2. Data is dirty and messy
Text data is made by people, and people produce messy data with strange spelling, grammar, formats, and encodings. Cleaning text data is a big challenge.
3. Fitting model & visualization is complicated
Text data is unstructured, which makes modeling and visualization complicated. Knowledge of both math and programming is required.
About this project
In this project, I made a graphical interface for text mining posts from Reddit.com. It is an App with a considerable amount of text data that everyone can download and run. It provides options for searching and extracting data, and outputs statistics and plots. Note that since post contents and comments vary across boards (some boards have only titles and no contents, and many posts have no comments), for consistency we focus on analyzing title text.
In this article and the following posts, I will show how to use the App and interpret its outputs.
When the App opens, you will see an interface like this:
There are two parts: the Control panel and the Output panel.
The first thing to do is to select a Reddit board (or subreddit) that you want to analyze. The list shows all available data sets.
Once selected, five panels show up in the output panel.
General Info:
Shows a basic description of the data set.
Author, Keyword, Exclude and
Date range relate to the filter, which is described later in this article.
Stat. of points and comment number:
Shows the minimum, maximum, 1st and 3rd quartiles, mean, and median of points and comment numbers.
Boxplot of points and comment number:
Visualizes the statistics of points and comment numbers. For how to interpret boxplots, see
Boxplot. Note that outliers are excluded from the plot for scaling reasons.
Histogram of post number over time:
Shows the number of posts over the selected time range.
Sample of data:
Lists the first 10 posts from the data set, not sorted by time.
Also, once a data set is selected, a
filter option shows up.
Click on it to expand the filter options.
Date range:
Selects only posts within a range of dates.
Author ID:
Filters posts by an author's ID.
Include keywords:
Selects posts containing given keywords. For multiple keywords, use a semicolon (";") to separate them. The
Union option gives all posts containing AT LEAST one of the keywords. The
Intersect option selects only posts containing ALL of the keywords.
Exclude keywords:
Excludes posts containing the given keywords from the selection. For multiple inputs, use a semicolon (";") to separate them.
Points & comment number:
Selects only posts within the given ranges of points and comment numbers.
Note that the outputs change dynamically while filtering.
This is an example of a filter. Here, we select the date range from 2017-01-01 to 2017-05-01 and search the keywords "trump" and "clinton" with the Union option, so posts containing either "trump" or "clinton" are selected. Then we exclude "email" and "wall", so posts containing either of those two words are dropped. We set the points range to greater than 100 and the comment number range to greater than 10. This leaves 189 posts, mostly posted in January 2017 according to the histogram.
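The filter logic above can be sketched in a few lines. This is an illustrative Python version, not the App's actual R code, and the function name and inputs are assumptions for the example:

```python
# Hypothetical sketch of the filter described above: include keywords with
# Union/Intersect modes, then drop posts matching any exclude keyword.
def filter_posts(posts, include, mode="union", exclude=()):
    """posts: list of title strings; include/exclude: lowercase keyword lists."""
    selected = []
    for title in posts:
        text = title.lower()
        hits = [kw for kw in include if kw in text]
        if mode == "union" and not hits:
            continue  # Union: need at least one keyword
        if mode == "intersect" and len(hits) < len(include):
            continue  # Intersect: need all keywords
        if any(kw in text for kw in exclude):
            continue  # drop posts matching any exclude keyword
        selected.append(title)
    return selected

titles = ["Trump signs order", "Clinton email probe",
          "Trump wall plan", "Weather today"]
print(filter_posts(titles, ["trump", "clinton"], "union",
                   exclude=["email", "wall"]))
# -> ['Trump signs order']
```

Only the first title survives: the second and third match an exclude keyword, and the last contains neither include keyword.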
Now you can close the filter. The plot type panel provides plotting options.
Plot type:
Selects which plot to draw.
Remove frequent terms:
This option removes terms that appear in too many posts from further mining and plotting. Although English
stop words are eliminated automatically, some other words are meaningless yet common. For example, in the LifeProTips board, every post starts with "LPT:". The term "LPT" then dominates the analysis and becomes noise. This option removes terms that appear in more than a given portion of posts (say, 0.01 of all posts). Don't worry about removing important terms: every removed term is listed at the top of the output panel, so nothing is lost silently.
Remove keywords:
You can of course remove specific terms. For multiple terms, use semi-colon to separate.
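The "remove frequent terms" idea can be sketched as follows. This is an assumption about how such a filter works in general, not the App's actual R implementation:

```python
# Find terms whose document frequency exceeds a given share of posts.
from collections import Counter

def frequent_terms(posts, max_share=0.5):
    """Return the set of terms appearing in more than max_share of posts."""
    n = len(posts)
    doc_freq = Counter()
    for title in posts:
        # use set() so a term counts once per post (document frequency)
        doc_freq.update(set(title.lower().split()))
    return {t for t, c in doc_freq.items() if c / n > max_share}

posts = ["LPT: save money", "LPT: sleep early",
         "LPT: drink water", "random title"]
print(frequent_terms(posts, max_share=0.5))  # {'lpt:'}
```

Here "lpt:" appears in 3 of 4 posts (0.75 > 0.5) and is flagged; every other term appears only once.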
We will introduce the output plot types and how to interpret them. But before that, there is some basic knowledge to keep in mind.
1. Term frequency (tf) & Inverse document frequency (idf)
Term frequency is the count of each term in a document, just like a homework assignment for elementary school kids: count how many times "cat" appears in the storybook. It is a basic and straightforward measure of the role and importance of a term in a document.
However, a term that appears many times is not necessarily important. Consider words like "a", "at", and "to": they are essential to English grammar but carry little meaning for analysis. Inverse document frequency (idf) is a weighting technique that addresses this. It is the log of the ratio of the total number of documents to the number of documents containing the term, so the more universal a term is (like "to" and "at"), the smaller its idf.
Now we can present the importance of a term by multiplying its tf and idf (tf-idf). For more details, see
tf-idf.
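The tf-idf computation above can be written in a few lines. This is a minimal Python sketch of the standard definitions, not the App's R code:

```python
# tf(t, d) = count of t in d; idf(t) = log(N / df(t)); tf-idf = tf * idf.
import math
from collections import Counter

docs = [["the", "cat", "sat"],
        ["the", "dog", "ran"],
        ["the", "cat", "ran", "far"]]
N = len(docs)
# document frequency: number of documents containing each term
df = Counter(t for doc in docs for t in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term)
    idf = math.log(N / df[term])
    return tf * idf

print(round(tf_idf("the", docs[0]), 3))  # 0.0 -- "the" is in every document
print(round(tf_idf("cat", docs[0]), 3))  # 0.405 -- log(3/2)
```

Note how "the", which appears in all three documents, gets idf = log(3/3) = 0, so its tf-idf vanishes no matter how often it occurs.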
2. n-gram
Sometimes counting single terms is not enough, because terms are paired or combined into phrases to express meaning. For example, by counting single terms we would never see a phrase like "ice cream", though it is easy to recognize when reading. An n-gram counts combinations of n consecutive words. Thus in a cookbook you are likely to see "chocolate cake" and its frequency rather than just "chocolate" and "cake". In this App, we focus on bigrams. For more details, see
n-gram.
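Bigram counting is simply sliding a two-word window over the tokens. A minimal Python sketch (again, only illustrative):

```python
# Count bigrams (pairs of adjacent words) in a tokenized text.
from collections import Counter

def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))

title = "i love ice cream and ice cream loves me".split()
counts = Counter(bigrams(title))
print(counts[("ice", "cream")])  # 2
```

Counting single terms here would give "ice" and "cream" separately; the bigram count recovers the phrase "ice cream" with frequency 2.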
3. Word pairwise correlation
Though a bigram treats terms as pairs, it only recognizes pairs that appear next to each other. For terms that often appear in the same post but are not necessarily adjacent, we need other measures. Word pairwise correlation computes a correlation coefficient ranging between -1 and 1. The greater it is, the more likely the two words are to appear in the same document. For more details, see
pairwise correlation.
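A common choice for this correlation is the phi coefficient (used, for example, by the widyr package's `pairwise_cor` in R). Here is an illustrative Python sketch; the function and data are invented for the example:

```python
# Phi coefficient between two words over a set of documents, built from
# the 2x2 table of co-occurrence counts.
import math

def phi(docs, a, b):
    """docs: list of sets of words; returns correlation in [-1, 1]."""
    n = len(docs)
    n11 = sum(1 for d in docs if a in d and b in d)       # both present
    n10 = sum(1 for d in docs if a in d and b not in d)   # only a
    n01 = sum(1 for d in docs if a not in d and b in d)   # only b
    n00 = n - n11 - n10 - n01                             # neither
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom

titles = [{"korea", "north", "missile"}, {"korea", "south"},
          {"weather", "report"}, {"korea", "north"}]
print(round(phi(titles, "korea", "north"), 2))  # 0.58
```

A positive value means the two words co-occur in the same document more often than chance would predict.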
4. Sentiment analysis
Human beings have emotions, so our language carries sentiment. Analyzing the sentiment in words can show meaningful results. A simple approach is based on a dictionary: we already know the word "happy" reflects joy and positive emotion, so we put a label on it. Labeling each term with a sentiment yields counts of each sentiment. For example, if 70% of the words in a novel reflect sadness, we might guess it is a tragedy rather than a happy story.
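The dictionary-based approach can be sketched in a few lines. The tiny lexicon below is made up for illustration; real analyses use lexicons such as NRC or Bing:

```python
# Tally sentiment labels over a tokenized text using a toy lexicon.
from collections import Counter

lexicon = {"happy": "joy", "love": "joy",
           "sad": "sadness", "cry": "sadness",
           "fear": "fear"}

def sentiment_counts(tokens):
    """Count each sentiment label among the tokens found in the lexicon."""
    return Counter(lexicon[t] for t in tokens if t in lexicon)

text = "she was happy then sad then sad again".split()
print(sentiment_counts(text))  # two sadness words, one joy word
```

Words not in the lexicon are simply ignored, which is also how dictionary-based sentiment analysis behaves in practice.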
5. Latent Dirichlet allocation (LDA)
Latent Dirichlet allocation (LDA) is a popular topic model. The goal is to recognize topics in documents through unsupervised learning (training with no labels and no human guidance). For example, given 1,000 news titles, we may want to know what topics they cover. LDA computes the probability of each word belonging to each topic, so each word can be assigned to a topic. It doesn't name the topics, but by looking at the word pools we can guess what they are. One limitation is that the clustering requires us to assume the number of topics. If there are actually three topics but we assume five, we will see topics split; if we assume too few, we may see topics merged. For more details, see
LDA.
Now we can move on to plot types.
1. Term frequency barchart
This barchart shows the top 10 to 100 words with the highest tf-idf.
2. Author-post barchart
This barchart shows the author IDs with the most posts.
3. Bigrams barchart
This barchart shows the bigrams with highest frequencies.
4. Word cloud (tf) and word cloud (tf-idf)
These word clouds visualize the words with the highest tf or tf-idf.
5. Sentiment word cloud
This word cloud groups words by their sentiment labels.
6. Bigram cloud
This cloud shows the frequencies of bigrams. The darker an arrow is, the more frequent the pair.
7. Topic word cloud
This word cloud shows the words in each topic (numbered 1 to 6).
8. Word pairwise correlation
This table lists pairwise correlation coefficients for all word pairs. In this example, we search for the word "korea" and find that "north", "south", and "missile" are the words most likely to appear with it.
In this App, all modeling and plotting are done in R, and the graphical interface is implemented with Shiny. There are two ways to run the App:
1. Run on Shiny server
However, running on the Shiny server is slower and less stable. When the data size is too large (in my experience, more than about 5,000 posts in total), plotting becomes very slow and often times out.
2. Run locally on your computer
If you want to use this App for bigger projects, I strongly recommend downloading it. To run it, you'll need R and RStudio. See
download R and
download RStudio.
Once you have them installed, open a new R file in RStudio, then copy and paste the following code:
if(!require(shiny))
install.packages("shiny")
library(shiny)
runGitHub("Reddit_shiny", "daviden1013")
Then select all the code and press Run at the top right.
You should see the console load the required packages. This step takes a while. Once it finishes, a window will pop up with the App.
If you're familiar with GitHub, you can also download this project as a zip from my
GitHub.