Example: What's the most popular music in 2016?

This article is a continuation of the previous post, Reddit text mining and visualization with R Shiny.
In this article, we explore the most popular music styles of 2016 based on Reddit's Music board.

Take a look at the data

We first load the data. Select the Music board, set the date range to 2016-01-01 to 2016-12-31, and we get the following outputs.
In 2016, there were 23,774 posts in total. The summary statistics and boxplot show that the median post has 1 point and 0 comments, while the 3rd quartile (3rd Qu.) is 3 points and 2 comments. The histogram shows that the number of posts stays consistent over the year.
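The quartile figures above are the kind of summary that can be reproduced in a few lines. Here is a minimal sketch in Python (not the code behind the R Shiny app); the point values are made up for illustration, and the "inclusive" interpolation method is an assumption about how the quartiles are computed.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, 1st Qu., median, 3rd Qu., max) for a list of numbers."""
    # n=4 splits the data into quarters, giving the three quartile cut points.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical points for a handful of posts (not the real Reddit data).
points = [0, 0, 1, 1, 1, 2, 3, 5, 120]
summary = five_number_summary(points)
print(summary)
```

A single very popular post (120 points here) stretches the maximum but leaves the median and quartiles almost untouched, which is why we read those instead of the mean.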

We draw the author-post bar chart to see who makes these posts.
The vertical axis lists author IDs (too many to label), and the horizontal axis is the number of posts. As circled in red, a few people made a lot of posts. As circled in orange, most authors made only one post during the year.
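The counts behind a chart like this come from tallying posts per author. A minimal Python sketch, assuming we have one author ID per post (the IDs below are hypothetical):

```python
from collections import Counter

# Hypothetical author IDs, one entry per post.
authors = ["u1", "u2", "u1", "u3", "u1", "u4", "u2"]

posts_per_author = Counter(authors)

# most_common() orders authors from most to fewest posts, like the bar chart.
ranked = posts_per_author.most_common()

# Authors who posted exactly once during the period.
single_post_authors = sum(1 for n in posts_per_author.values() if n == 1)

print(ranked[0], single_post_authors)
```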

Identify frequent music types

Now let's move on to identifying popular music types. The term frequency bar chart shows words with high tf-idf.
Here we list the top 20. Rock is at the top. Looking at the plot, we can write down: Rock, Pop, Hip-Hop, Metal, and Rap.
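For readers unfamiliar with tf-idf, here is a rough Python sketch of one common variant. Everything in it is an assumption for illustration: the whitespace tokenizer, the choice to sum each term's weight over all posts, and the sample titles. The app may compute its scores differently.

```python
import math
from collections import Counter

def tfidf_scores(documents):
    """Sum each term's tf-idf weight over all documents."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = Counter()
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)                   # share of this document
            idf = math.log(n_docs / doc_freq[term])    # rarity across documents
            scores[term] += tf * idf
    return scores

# Hypothetical post titles (not the real Reddit data).
titles = [
    "indie rock", "punk rock", "rock ballad", "smooth jazz",
    "folk duo", "pop hit", "metal riff", "soul train",
]
scores = tfidf_scores(titles)
top_terms = scores.most_common(3)
print(top_terms)
```

Note that a term appearing in every document gets an idf of zero, so tf-idf rewards words that are frequent but not universal.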

So far, we have been dealing with the whole year's data. Now let's extract posts with high points (points >= 100) and plot again.
In this subset, there are 657 posts. Rock is still at the top, but Metal moves up to second and Punk is third. Rap drops out of the top 5.

Next we focus on the most-discussed posts (comment number >= 10).
In this subset, we have 1,546 posts. The top 5 are the same as in the high-point group, but Punk ranks below Pop.
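The two subsets are plain threshold filters. A minimal Python sketch, assuming each post is a record with "points" and "comments" fields (the field names and sample posts are hypothetical; only the thresholds come from the text):

```python
# Hypothetical posts; the field names "points" and "comments" are assumptions.
posts = [
    {"title": "Classic rock playlist", "points": 250, "comments": 40},
    {"title": "New punk single", "points": 120, "comments": 3},
    {"title": "Looking for folk recommendations", "points": 2, "comments": 15},
    {"title": "My first rap demo", "points": 1, "comments": 0},
]

MIN_POINTS = 100    # high-point threshold used in the article
MIN_COMMENTS = 10   # high-discussion threshold used in the article

high_point = [p for p in posts if p["points"] >= MIN_POINTS]
high_discussion = [p for p in posts if p["comments"] >= MIN_COMMENTS]

print(len(high_point), len(high_discussion))
```

A post can fall into both groups, so the two subsets overlap rather than partition the data.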

Brief summary

1. In all three groups: total, high-point, and high-discussion, most posts are about Rock. 
2. Punk is not in the top 5 overall, but it is in the top 5 of the high-point and high-discussion groups.
3. Taking the union of the three groups' top 20 keywords, we can build a pocket list for further analysis:
Rock, Rap, Pop, Hip-Hop, Metal, Punk, Folk, Soul, Electronic.

Look into each music style

Now we search for posts related to each music style in our pocket list.
Looking at the boxplot and statistics, we write down the following table.


            points                     comments                   posts
            1st Qu.  median   3rd Qu.  1st Qu.  median   3rd Qu.
All         0        1        3        0        0        2        23774
Rock        1        2        5        0        0        1        5948
Pop         1        1        3        0        0        1        2182
Hip-Hop     1        1        2        0        0        1        1542
Metal       1        2        6        0        0        2        1066
Punk        1        2        6        0        0        2        853
Rap         1        1        2        0        0        1        1138
Folk        1        2        3        0        0        1        965
Soul        1        1.5      3        0        0        1        484
Electronic  1        1        2        0        0        1        957

For comment numbers, all music styles have a median of 0. Note that the range between the 1st and 3rd Qu. shows the variation: the smaller the range, the more representative the median is, so we should consider it when reading medians. Metal and Punk have a 3rd Qu. of 6 for points and 2 for comments, which are the highest among all styles, while their 1st Qu. of points and comments are the same as the others. This means their distributions are widely spread. Folk has a high median in points (2) and a small quartile range (1-3), which indicates a distribution concentrated around the median.
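The range between the 1st and 3rd Qu. can be computed directly from the table. A small Python sketch, using the points quartiles copied from the table above (the variable names are only illustrative):

```python
# (1st Qu., median, 3rd Qu.) of points, copied from the table above.
points_quartiles = {
    "All": (0, 1, 3),
    "Rock": (1, 2, 5),
    "Pop": (1, 1, 3),
    "Hip-Hop": (1, 1, 2),
    "Metal": (1, 2, 6),
    "Punk": (1, 2, 6),
    "Rap": (1, 1, 2),
    "Folk": (1, 2, 3),
    "Soul": (1, 1.5, 3),
    "Electronic": (1, 1, 2),
}

# Interquartile range: distance between the 3rd and 1st quartiles.
iqr = {style: q3 - q1 for style, (q1, _, q3) in points_quartiles.items()}

# Styles whose points are spread most widely.
widest = sorted(style for style, r in iqr.items() if r == max(iqr.values()))

print(widest, iqr["Folk"])
```

Metal and Punk share the widest range (5), while Folk combines a median of 2 with a range of only 2.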

Brief summary

1. Rock has most posts and a high median (2) in points. 
2. Metal and Punk have widely spread distributions of points and comment numbers.
3. Folk has a centered distribution on points.

More on music styles

In the previous analyses, we focused on individual styles. But look closer at the data and you'll see that the music styles are not exclusive! There are "Electronic Rock" and "Rock/ Country". Treating the styles separately loses the information about their co-occurrence. In the following analysis, we focus on pairs of music styles.

Here, we plot bigrams that occur more than 30 times in the year.
Looking at the bigram cloud, we can see centers and spokes. Surrounding Rock, there are "alternative", "country", "punk", "electronic"... Rap and Hip-Hop also form a circle. Metal is also a center, with "death", "heavy", and "power" as members. There are also bigrams outside these groups, like "lo fi" and "jazz fusion".
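Counting bigrams means tallying every pair of adjacent words and keeping the pairs that pass a frequency cutoff. A minimal Python sketch, assuming whitespace tokenization and using made-up titles; the demo cutoff is lowered to 1 so the tiny sample produces a result, whereas the analysis above uses 30.

```python
from collections import Counter

def bigram_counts(titles):
    """Count adjacent word pairs across all titles."""
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()
        # zip pairs each token with its successor: (w0, w1), (w1, w2), ...
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Hypothetical post titles (not the real Reddit data).
titles = ["Alternative Rock", "alternative rock live", "Death Metal", "lo fi beats"]
counts = bigram_counts(titles)

MIN_COUNT = 1  # demo cutoff; the analysis above keeps bigrams occurring over 30 times
frequent = {pair: n for pair, n in counts.items() if n > MIN_COUNT}

print(frequent)
```

Each frequent pair becomes an edge in the bigram cloud, which is how a word like "rock" ends up at the center of many connections.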

Brief summary

1. Music styles are families with a center style and member styles.
2. Rock is the biggest family center.

Conclusion

1. Rock is the most liked (high points) and most discussed music style.
2. Rock, Hip-Hop, Metal, and Pop are the most popular families.
