Text mining, unlike other fields of data mining, requires expertise in many disciplines. To draw meaningful conclusions, we need psychology, sociology, and marketing, just to name a few. It can't, and shouldn't, be done by data scientists alone. However, text mining is not accessible to most people. Take web text mining for example; there are three main difficulties:
1. Collecting data is hard
Though data are open and free online, they are often too big and messy to be downloaded by hand. Collecting them requires some programming and data management skills.
2. Data is dirty and messy
Text data is made by people, and people produce messy data with strange spelling, grammar, formats, and encodings. Cleaning text data is a big challenge.
3. Fitting model & visualization is complicated
Text data is unstructured, which makes modeling and visualization complicated. Knowledge of both math and programming is required.
About this project
In this project, I made a graphical interface for text mining posts from Reddit.com. It is an App with a considerable amount of text data that everyone can download and run. It provides options for searching and extracting data, and outputs statistics and plots. Note that since post contents and comments vary across boards (some boards have only titles and no contents, and many posts have no comments), for consistency we focus on analyzing title text.
In this article and the following posts, I will show how to use the App and interpret its outputs.
When the App opens, you will see an interface like this:
There are two parts: the Control panel and the Output panel.
The first thing to do is to select a Reddit board (or subreddit) that you want to analyze. The list shows all available data sets.
Once selected, five panels show up in the output panel.
General Info:
Shows a basic description of the data set.
Author, Keyword, Exclude and
Date range relate to the filter, which is described later in this article.
Stat. of points and comment number:
Shows the minimum, maximum, 1st and 3rd quartiles, mean, and median of points and comment numbers.
Boxplot of points and comment number:
Visualizes the statistics of points and comment numbers. For how to interpret boxplots, see
Boxplot. Note that outliers are excluded from the plot for scaling reasons.
Histogram of post number over time:
Shows the number of posts over the selected time range.
Sample of data:
Lists the first 10 posts from the data set, not sorted by time.
Also, once a data set is selected, a
filter option shows up.
Click on it to expand the filter options.
Date range:
Selects only posts within a range of dates.
Author ID:
Filters posts by an author's ID.
Include keywords:
Selects posts containing given keywords. For multiple keywords, use a semicolon (";") to separate them. The
Union option gives all posts containing AT LEAST one of the keywords. The
Intersect option selects only posts containing ALL of the keywords.
Exclude keywords:
Excludes posts containing the given keywords from the selection. For multiple inputs, use a semicolon (";") to separate them.
Points & comment number:
Selects only posts within the given ranges of points and comment numbers.
Note that the outputs change dynamically while filtering.
This is an example of a filter. Here, we select the date range from 2017-01-01 to 2017-05-01 and search the keywords "trump" and "clinton" with the Union option, so posts containing either "trump" or "clinton" are selected. Then we exclude "email" and "wall", so posts containing either of those two words are dropped. We set the points range to greater than 100 and the comment number range to greater than 10. This leaves 189 posts, mostly posted in January 2017 according to the histogram.
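The filter logic above can be sketched in a few lines. This is an illustrative Python version, not the App's actual R code, and the function name and inputs are assumptions for the example:

```python
# Hypothetical sketch of the filter described above: include keywords with
# Union/Intersect modes, then drop posts matching any exclude keyword.
def filter_posts(posts, include, mode="union", exclude=()):
    """posts: list of title strings; include/exclude: lowercase keyword lists."""
    selected = []
    for title in posts:
        text = title.lower()
        hits = [kw for kw in include if kw in text]
        if mode == "union" and not hits:
            continue  # Union: need at least one keyword
        if mode == "intersect" and len(hits) < len(include):
            continue  # Intersect: need all keywords
        if any(kw in text for kw in exclude):
            continue  # drop posts matching any exclude keyword
        selected.append(title)
    return selected

titles = ["Trump signs order", "Clinton email probe",
          "Trump wall plan", "Weather today"]
print(filter_posts(titles, ["trump", "clinton"], "union",
                   exclude=["email", "wall"]))
# -> ['Trump signs order']
```

Only the first title survives: the second and third match an exclude keyword, and the last contains neither include keyword.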
Now you can close the filter. The plot type panel provides plotting options.
Plot type:
Selects which plot to draw.
Remove frequent terms:
This option removes terms that appear in too many posts from further mining and plotting. Although English
stop words are eliminated automatically, some other words are meaningless yet common. For example, in the LifeProTips board, every post starts with "LPT:". The term "LPT" then dominates the analysis and becomes noise. This option removes terms that appear in more than a given portion of posts (say, 0.01 of all posts). Don't worry about removing important terms: every removed term is listed at the top of the output panel, so nothing is lost silently.
Remove keywords:
You can of course remove specific terms. For multiple terms, use semi-colon to separate.
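The "remove frequent terms" idea can be sketched as follows. This is an assumption about how such a filter works in general, not the App's actual R implementation:

```python
# Find terms whose document frequency exceeds a given share of posts.
from collections import Counter

def frequent_terms(posts, max_share=0.5):
    """Return the set of terms appearing in more than max_share of posts."""
    n = len(posts)
    doc_freq = Counter()
    for title in posts:
        # use set() so a term counts once per post (document frequency)
        doc_freq.update(set(title.lower().split()))
    return {t for t, c in doc_freq.items() if c / n > max_share}

posts = ["LPT: save money", "LPT: sleep early",
         "LPT: drink water", "random title"]
print(frequent_terms(posts, max_share=0.5))  # {'lpt:'}
```

Here "lpt:" appears in 3 of 4 posts (0.75 > 0.5) and is flagged; every other term appears only once.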
We will introduce the output plot types and how to interpret them. But before that, there is some basic knowledge to keep in mind.
1. Term frequency (tf) & Inverse document frequency (idf)
Term frequency is the count of each term in a document, just like a homework assignment for elementary school kids: count how many times "cat" appears in the storybook. It is a basic and straightforward measure of the role and importance of a term in a document.
However, a term that appears many times is not necessarily important. Consider words like "a", "at", and "to": they are essential to English grammar but carry little meaning for analysis. Inverse document frequency (idf) is a weighting technique that addresses this. It is the log of the ratio of the total number of documents to the number of documents containing the term, so the more universal a term is (like "to" and "at"), the smaller its idf.
Now we can present the importance of a term by multiplying its tf and idf (tf-idf). For more details, see
tf-idf.
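The tf-idf computation above can be written in a few lines. This is a minimal Python sketch of the standard definitions, not the App's R code:

```python
# tf(t, d) = count of t in d; idf(t) = log(N / df(t)); tf-idf = tf * idf.
import math
from collections import Counter

docs = [["the", "cat", "sat"],
        ["the", "dog", "ran"],
        ["the", "cat", "ran", "far"]]
N = len(docs)
# document frequency: number of documents containing each term
df = Counter(t for doc in docs for t in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term)
    idf = math.log(N / df[term])
    return tf * idf

print(round(tf_idf("the", docs[0]), 3))  # 0.0 -- "the" is in every document
print(round(tf_idf("cat", docs[0]), 3))  # 0.405 -- log(3/2)
```

Note how "the", which appears in all three documents, gets idf = log(3/3) = 0, so its tf-idf vanishes no matter how often it occurs.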
2. n-gram
Sometimes counting single terms is not enough, because terms are paired or combined into phrases to express meaning. For example, by counting single terms we would never see a phrase like "ice cream", though it is easy to recognize when reading. An n-gram counts combinations of n consecutive words. Thus in a cookbook you are likely to see "chocolate cake" and its frequency rather than just "chocolate" and "cake". In this App, we focus on bigrams. For more details, see
n-gram.
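Bigram counting is simply sliding a two-word window over the tokens. A minimal Python sketch (again, only illustrative):

```python
# Count bigrams (pairs of adjacent words) in a tokenized text.
from collections import Counter

def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))

title = "i love ice cream and ice cream loves me".split()
counts = Counter(bigrams(title))
print(counts[("ice", "cream")])  # 2
```

Counting single terms here would give "ice" and "cream" separately; the bigram count recovers the phrase "ice cream" with frequency 2.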
3. Word pairwise correlation
Though a bigram treats terms as pairs, it only recognizes pairs that appear next to each other. For terms that often appear in the same post but are not necessarily adjacent, we need other measures. Word pairwise correlation computes a correlation coefficient ranging between -1 and 1. The greater it is, the more likely the two words are to appear in the same document. For more details, see
pairwise correlation.
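A common choice for this correlation is the phi coefficient (used, for example, by the widyr package's `pairwise_cor` in R). Here is an illustrative Python sketch; the function and data are invented for the example:

```python
# Phi coefficient between two words over a set of documents, built from
# the 2x2 table of co-occurrence counts.
import math

def phi(docs, a, b):
    """docs: list of sets of words; returns correlation in [-1, 1]."""
    n = len(docs)
    n11 = sum(1 for d in docs if a in d and b in d)       # both present
    n10 = sum(1 for d in docs if a in d and b not in d)   # only a
    n01 = sum(1 for d in docs if a not in d and b in d)   # only b
    n00 = n - n11 - n10 - n01                             # neither
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom

titles = [{"korea", "north", "missile"}, {"korea", "south"},
          {"weather", "report"}, {"korea", "north"}]
print(round(phi(titles, "korea", "north"), 2))  # 0.58
```

A positive value means the two words co-occur in the same document more often than chance would predict.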
4. Sentiment analysis
Human beings have emotions, so our language carries sentiment. Analyzing the sentiment in words can show meaningful results. A simple approach is based on a dictionary: we already know the word "happy" reflects joy and positive emotion, so we put a label on it. Labeling each term with a sentiment yields counts of each sentiment. For example, if 70% of the words in a novel reflect sadness, we might guess it is a tragedy rather than a happy story.
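The dictionary-based approach can be sketched in a few lines. The tiny lexicon below is made up for illustration; real analyses use lexicons such as NRC or Bing:

```python
# Tally sentiment labels over a tokenized text using a toy lexicon.
from collections import Counter

lexicon = {"happy": "joy", "love": "joy",
           "sad": "sadness", "cry": "sadness",
           "fear": "fear"}

def sentiment_counts(tokens):
    """Count each sentiment label among the tokens found in the lexicon."""
    return Counter(lexicon[t] for t in tokens if t in lexicon)

text = "she was happy then sad then sad again".split()
print(sentiment_counts(text))  # two sadness words, one joy word
```

Words not in the lexicon are simply ignored, which is also how dictionary-based sentiment analysis behaves in practice.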
5. Latent Dirichlet allocation (LDA)
Latent Dirichlet allocation (LDA) is a popular topic model. The goal is to recognize topics in documents through unsupervised learning (training with no labels and no human guidance). For example, given 1,000 news titles, we may want to know what topics they cover. LDA computes the probability of each word belonging to each topic, so each word can be assigned to a topic. It doesn't name the topics, but by looking at the word pools we can guess what they are. One limitation is that the clustering requires us to assume the number of topics. If there are actually three topics but we assume five, we will see topics split; if we assume too few, we may see topics merged. For more details, see
LDA.
Now we can move on to plot types.
1. Term frequency barchart
This barchart shows the top 10 to 100 words with the highest tf-idf.
2. Author-post barchart
This barchart shows the author IDs with the most posts.
3. Bigrams barchart
This barchart shows the bigrams with highest frequencies.
4. Word cloud (tf) and word cloud (tf-idf)
These word clouds visualize the words with the highest tf or tf-idf.
5. Sentiment word cloud
This word cloud groups words by their sentiment labels.
6. Bigram cloud
This cloud shows the frequencies of bigrams. The darker an arrow is, the more frequent the pair.
7. Topic word cloud
This word cloud shows the words in each topic (numbered 1 to 6).
8. Word pairwise correlation
This table lists pairwise correlation coefficients for all word pairs. In this example, we search for the word "korea" and find that "north", "south", and "missile" are the words most likely to appear with it.
In this App, all modeling and plotting are done in R, and the graphical interface is implemented with Shiny. There are two ways to run the App:
1. Run on Shiny server
However, running on the Shiny server is slower and less stable. When the data size is too large (in my experience, more than about 5,000 posts in total), plotting becomes very slow and often times out.
2. Run locally on your computer
If you want to use this App for bigger projects, I strongly recommend downloading it. To run it, you'll need R and RStudio. See
download R and
download RStudio.
Once you have them installed, open a new R file in RStudio, then copy and paste the following code:
if(!require(shiny))
install.packages("shiny")
library(shiny)
runGitHub("Reddit_shiny", "daviden1013")
Then select all the code and press Run at the top right.
You should see the console load the required packages. This step takes a while. Once it finishes, a window will pop up with the App.
If you're familiar with GitHub, you can also download this project as a zip from my
GitHub.