Posts

Sentiment Analysis with Deep Learning and Traditional Approaches: An Ensemble Modeling Example

Image downloaded from: https://beyondphilosophy.com/a-sentiment-analysis-of-sentiment-analysis-some-gobbledygook-some-bright-spots-and-a-nice-looking-dashboard/ In this article, we will use a simple text classification dataset to demonstrate how sentiment analysis can be done with both traditional text mining approaches and deep learning approaches. We will also compare the performance of the two modeling strategies and develop an ensemble model that maximizes prediction accuracy. The data is cited from de Freitas, Nando, and Misha Denil. "From Group to Individual Labels using Deep Features." (2015). We will cover: developing an LSTM deep learning model, sentiment analysis with polarity scores, and comparison and ensemble modeling. Before we start, let's take a look at the data. The data contains 3,000 reviews labeled with positive and negative sentiments, extracted from Amazon, IMDb, and Yelp. The head of the data looks like this: So there is no way for m...
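The ensemble idea from this post can be sketched in a few lines: blend a lexicon-based polarity score with the probability output of a trained model, then threshold. This is a minimal illustration, assuming a toy word lexicon and made-up model probabilities; the article's actual lexicon, LSTM, and weighting are not shown here.

```python
# Minimal sketch of an ensemble of a polarity-score approach and a
# trained model. The tiny lexicon and the example model probabilities
# are illustrative, not the ones used in the article.

POS_WORDS = {"great", "good", "excellent", "love"}
NEG_WORDS = {"bad", "terrible", "awful", "hate"}

def polarity_score(text):
    """Map a review to [0, 1]: 1 = fully positive, 0 = fully negative."""
    words = text.lower().split()
    pos = sum(w in POS_WORDS for w in words)
    neg = sum(w in NEG_WORDS for w in words)
    if pos + neg == 0:
        return 0.5            # neutral when no lexicon word is found
    return pos / (pos + neg)

def ensemble_predict(text, model_prob, w=0.5):
    """Weighted average of a model's probability and the polarity score."""
    score = w * model_prob + (1 - w) * polarity_score(text)
    return 1 if score >= 0.5 else 0  # 1 = positive, 0 = negative

print(ensemble_predict("great phone , love it", model_prob=0.8))    # -> 1
print(ensemble_predict("terrible battery , awful", model_prob=0.4)) # -> 0
```

The weight `w` is the knob to tune on a validation set: it decides how much the final prediction trusts the learned model versus the hand-built lexicon.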

Text Generator with an LSTM Recurrent Neural Network in Python Keras

Image downloaded from https://blog.4tests.com/21-study-habits-reddit-community/ In this article, we will build a text generator with an LSTM recurrent neural network in Python Keras. We train the network on post titles from the LifeProTips board on Reddit.com, and in the end it generates brilliant life tips in fluent English. My Python code can be found on my GitHub. What we are going to cover: loading and processing text data, training a naive LSTM neural network, training a modified LSTM neural network, and a summary comparing the two models. Loading and Processing Text Data The text data were crawled from the LifeProTips board on Reddit.com with a web crawler I made with Python scrapy; check my previous post for details. We first load the data from the database with some SQL. Here I use Microsoft Access. Since the data is large, to reduce training time and memory usage, we only use posts from January 2017. Here is the code. Load data import pyodbc conn_str = ( ...
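The "processing" half of this step can be sketched without Keras itself: a character-level LSTM is typically fed fixed-length integer-encoded windows of text, each labeled with the next character. This is a minimal sketch under that assumption; the sample sentence and window length are illustrative, not the post's actual data or hyperparameters.

```python
# Minimal sketch of preprocessing text for a character-level LSTM:
# map characters to integers, then slice the text into fixed-length
# input windows, each paired with the next character as the label.

text = "always proofread your post before submitting"
seq_len = 10  # illustrative window length

chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}

X, y = [], []
for i in range(len(text) - seq_len):
    window = text[i:i + seq_len]   # input sequence
    target = text[i + seq_len]     # character the network should predict
    X.append([char_to_int[c] for c in window])
    y.append(char_to_int[target])

print(len(X), len(X[0]))  # number of windows, window length
```

For Keras, `X` would then be reshaped to `(samples, seq_len, 1)` (or one-hot encoded) before being fed to an `LSTM` layer with a softmax output over the character vocabulary.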

Example: Hot topics on world news, 2017

This article is a continuation of the previous post, Reddit text mining and visualization with R Shiny. In this article, we will introduce techniques for exploring topics on the worldnews board of Reddit. Take a look at the data First, we load the data and set the date range to 2017-01-01 through 2017-05-01, the most recent data I have collected. We have 8,785 posts, with medians of 9 points and 3 comments per post. Posting over time is stable at about 500 posts per week. We see how the posts are made by plotting the author-post bar chart: one user made over 300 posts while the others contributed fewer than 100, and most made no more than 10. Find keywords Now we plot the keywords in a bar chart based on their tf-idf. There are important keywords like "trump", "china", and "korea", but we also see adjectives like "north" and "south" that should be connected with a noun. We plot the bigrams with high frequencies. Combini...
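The tf-idf scoring behind the keyword bar chart can be sketched directly: a term's weight in a document is its count there times the log of how rare it is across the corpus. This is a minimal illustration over toy titles, not the actual worldnews data or the R implementation the post uses.

```python
# Minimal sketch of tf-idf: tf = term count in one document,
# idf = log(N / number of documents containing the term).
# The three toy "posts" below are illustrative.

import math
from collections import Counter

docs = [
    "trump meets china leaders",
    "north korea tests missile",
    "china responds to north korea",
]

N = len(docs)
df = Counter()                  # document frequency per term
for d in docs:
    df.update(set(d.split()))

def tf_idf(term, doc):
    tf = doc.split().count(term)
    return tf * math.log(N / df[term])

# "korea" appears in 2 of 3 docs, "missile" in only 1,
# so "missile" scores higher within the second document.
print(round(tf_idf("missile", docs[1]), 3))
print(round(tf_idf("korea", docs[1]), 3))
```

This is exactly why tf-idf surfaces distinctive terms like "trump" or "korea" over words that appear in nearly every post.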

Example: What's the most popular music in 2016?

This article is a continuation of the previous post, Reddit text mining and visualization with R Shiny. In this article, we will explore the most popular music styles in 2016 based on the Music board of Reddit. Take a look at the data We first load the data: select the Music board, set the date range to 2016-01-01 through 2016-12-31, and we get the following outputs. In 2016 there were 23,774 posts in total. The stats and boxplot show that most posts have 1 point and 0 comments (the medians), while the most popular quartile (3rd Qu.) has 3 points and 2 comments. The histogram shows that post counts were consistent over the year. We draw the author-post bar chart to see how those posts are made. The vertical axis is author IDs (too many to list), and the horizontal axis is post counts. As circled in red, a few people made a lot of posts; as circled in orange, most made only one post through the year. Identify frequent music types Now let's move on to identifying popular music types. The Term frequency barc...
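The term-frequency step behind that chart amounts to counting genre keywords across post titles. Here is a minimal sketch with made-up titles and a hypothetical genre list; the actual Music-board data and the R code behind the post's chart are not reproduced here.

```python
# Minimal sketch of the term-frequency step: count how often each
# genre keyword appears across post titles. Titles and the GENRES
# set are illustrative, not the actual Music-board data.

from collections import Counter

titles = [
    "best rock songs of 2016",
    "new hip hop mixtape",
    "indie rock playlist for studying",
    "top pop hits this year",
]

GENRES = {"rock", "pop", "indie", "hip", "jazz"}

counts = Counter(
    w for t in titles for w in t.lower().split() if w in GENRES
)
print(counts.most_common(3))  # -> [('rock', 2), ...]
```

Plotting `counts.most_common()` as a bar chart gives the same kind of term-frequency view the post describes.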

Crawl Reddit.com with Python scrapy

Nowadays there is a lot of information on the internet, yet when it is scattered across web pages, little analysis can be done. In this article, we will visit Reddit.com, download posts, parse the information, and store it in a database with Python scrapy. The steps are: 1. Visiting the URL This includes finding the specific URL that holds your desired information and interacting with the website's server via GET or POST. In practice, this can mean sending usernames and passwords, setting cookies, and so on. At the end of this step, you'll get an HTML file with all the contents. 2. Parsing the information From the previous step we have an HTML file, which we could simply store as a text file. But that is not very helpful for further analysis, since it contains irrelevant content; we want to extract the information. To do this, we need some background knowledge of HTML and CSS: identify which HTML element contains the information and parse it out! 3. Store information in da...
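The parsing step (step 2) can be sketched with only the standard library; in scrapy itself you would use its CSS or XPath selectors instead. The HTML snippet and the `class="title"` convention below are illustrative, not Reddit's real markup.

```python
# Minimal sketch of step 2 (parsing): pull post titles out of an HTML
# page by walking its elements. Hypothetical markup for illustration;
# scrapy's CSS/XPath selectors do this job in the actual crawler.

from html.parser import HTMLParser

html = """
<div class="post"><a class="title" href="/r/news/1">First post</a></div>
<div class="post"><a class="title" href="/r/news/2">Second post</a></div>
"""

class TitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # Only <a class="title"> elements hold the post titles we want
        if tag == "a" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)

parser = TitleExtractor()
parser.feed(html)
print(parser.titles)  # -> ['First post', 'Second post']
```

The key idea is the same either way: locate the element that wraps the data (here, anchors with a `title` class) and keep only its text, discarding the rest of the page.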