Sentiment Analysis with Deep Learning and Traditional Approaches: An Ensemble Modeling Example


In this article, we will use a simple text classification dataset to demonstrate how sentiment analysis can be done with both traditional text mining approaches and deep learning approaches. We will also compare the performance of the two modeling strategies and develop an ensemble model that maximizes prediction accuracy. The data comes from de Freitas, Nando, and Misha Denil, "From Group to Individual Labels using Deep Features" (2015).


We will cover:

  1. Develop an LSTM deep learning model
  2. Sentiment analysis with polarity scores 
  3. Comparison and ensemble modeling
Before we start, let's take a look at the data. The data contains 3,000 reviews extracted from Amazon, IMDb, and Yelp, each labeled as positive or negative.
The head of the data looks like this:

So there is no way for me to plug it in here in the US unless I go by a converter. 0
Good case, Excellent value. 1
Great for the jawbone. 1
Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!! 0

Label 1 means positive and 0 means negative.
We load the dataset and store the texts and labels in two lists, sentence and label.

import numpy as np
import nltk
import nltk.sentiment
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.layers.embeddings import Embedding
from keras.callbacks import ModelCheckpoint

np.random.seed(1)
# load data
amazon = open("amazon_cells_labelled.txt", 'r', encoding = 'utf8').readlines()
imdb = open("imdb_labelled.txt", 'r', encoding = 'utf8').readlines()
yelp = open("yelp_labelled.txt", 'r', encoding = 'utf8').readlines()
text = amazon + imdb + yelp
# make sentence, label lists
sentence = []
label = []
for t in text:
    t = t.replace('\n', '')
    s, l = t.split('\t')
    sentence.append(s)
    label.append(int(l))
# n_posts
n_posts = len(text)

To evaluate model performance on new observations, we split the data into a training set (80%) and a test set (20%). The following code only samples the indices; the actual split is done later for each model.

# sample 20% of the indices for testing
test_id = np.random.choice(n_posts, int(n_posts * 0.2), replace = False)
train_id = np.array([i for i in range(n_posts) if i not in test_id])


Develop an LSTM deep learning model

There are many ways to convert text data into numerical form. Here we use word embedding. Review sentences are converted to integer lists, where each integer represents a word in the vocabulary. The integer sequences are then embedded into shorter vectors by an embedding layer in the neural network.

Find unique words

So first, we need to find all unique words. The following code uses the term frequency counter in the NLTK module. fdict is a dictionary of {word : freq}.

# connect sentence to string
sentence_str = ' '.join(sentence)
# term frequency dictionary
fdict = nltk.FreqDist(sentence_str.split())

With the frequencies, we can decide which words to include, for example keeping only words above some minimum frequency, as sketched below. In this example, there are only 8,015 unique words, so I decide to include them all.
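A minimal sketch of such frequency filtering (the min_count threshold is only illustrative and not used elsewhere in this article):

# keep only words appearing at least min_count times (illustrative threshold)
min_count = 2
frequent_words = [w for w, f in fdict.items() if f >= min_count]
print(len(fdict), "unique words;", len(frequent_words), "appear at least", min_count, "times")
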
We make word-integer mapping with this code:

words = []
for term, f in fdict.items():
    words.append(term)
words = sorted(words)
# make word-int mapping
word2int = {}
for i in range(len(words)):
    word2int[words[i]] = i + 1

Convert sentences into integer list

With the word2int dictionary, we can easily convert the words in each review into an integer list.

# convert sentence to int sequence
sentence_index = []
for t in sentence:
    sentence_index.append([word2int[w] if w in word2int.keys() else 0 for w in t.split()])
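As a quick sanity check, we can print one review next to its integer sequence:

# e.g. the second review and its integer representation
print(sentence[1])
print(sentence_index[1])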

Unify the sequence lengths

Now, words are converted to integers, and review sentences are integer lists. We still have a problem: the sentences have different lengths. For RNN training, unifying sequence lengths is not strictly necessary (if you train on one sample at a time), but it is required for batch training. We will find the length of the longest sentence and pad the other sentences with empty words (integer 0), so that all sentences have the same length.

# find the longest sentence length
maxLength = max(len(t) for t in sentence_index)
# pre-pad every sentence with 0 so all sentences have equal length
sentence_index_filled = []
for t in sentence_index:
    sentence_index_filled.append([0] * (maxLength - len(t)) + t)
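The same pre-padding can also be done with the Keras helper pad_sequences; this is just an equivalent alternative to the loop above:

from keras.preprocessing.sequence import pad_sequences

# equivalent to the manual padding above: pre-pad every sequence with 0
sentence_index_filled = pad_sequences(sentence_index, maxlen=maxLength, padding='pre', value=0)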

Prepare data

We reshape the sentences into an array of shape [n_posts, maxLength] and split it into training and test sets. train_id and test_id were sampled earlier.

# make data
x = np.reshape(sentence_index_filled, (n_posts, maxLength))
y = np.reshape(label, (n_posts, 1))
train_x = x[train_id,:]
train_y = y[train_id,:]
test_x = x[test_id,:]
test_y = y[test_id,:]

Define neural network

As mentioned before, we define an embedding layer that takes the integer inputs and converts each integer (word) into a vector. Note that this layer is also made of neurons and adjusts its weights through training. The first parameter of the embedding layer, the input dimension, is the number of words plus one (word indices start at 1, with 0 reserved for padding). The second parameter, the output dimension, is the length of the output vectors.
We have 8,015 unique words, and I choose output vectors of length 200. We also define one layer of LSTM cells with 100 units and a 10% dropout rate.

# define model structure
model = Sequential()
# input dimension is len(words) + 1 because word indices start at 1 and 0 is reserved for padding
model.add(Embedding(len(words) + 1, 200, input_length=maxLength))
model.add(LSTM(100))
model.add(Dropout(0.1))
model.add(Dense(1, activation = "sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# define the checkpoint
filepath="sentiment-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(train_x, train_y, epochs=20, batch_size=50, callbacks = callbacks_list)

Result

We plot prediction accuracy vs. epochs. By the third epoch, the model already has very high accuracy on the training set, close to 1. On the test set, accuracy peaks at the 6th epoch, at 77.5%.
From the loss vs. epochs plot, we see the training loss keeps dropping across epochs, while the test loss follows a concave curve between the 1st and 4th epochs. This shows obvious over-fitting after the 4th epoch.
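For reference, here is a minimal sketch of how such curves can be produced, assuming matplotlib is available and the test set is passed as validation data (in newer Keras versions the history keys are 'accuracy' and 'val_accuracy' instead of 'acc' and 'val_acc'):

import matplotlib.pyplot as plt

# fit while tracking test-set metrics each epoch
# (assumes a freshly built model; calling fit again on a trained model continues from its current weights)
history = model.fit(train_x, train_y, validation_data=(test_x, test_y),
                    epochs=20, batch_size=50)

# accuracy vs. epochs
plt.plot(history.history['acc'], label='train')
plt.plot(history.history['val_acc'], label='test')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()

# loss vs. epochs
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()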

Sentiment analysis with polarity scores 

In the following, we will use a dictionary-based traditional text mining approach: the Valence Aware Dictionary and sEntiment Reasoner (VADER). It is a bag-of-words model; the sentiment of a text is determined from the positive and negative words it contains, and a summary score, called the polarity score, is reported. The higher the polarity score, the more positive the text.
We load modules and declare a Sentiment Intensity Analyzer object.

import nltk
import nltk.sentiment
obj = nltk.sentiment.vader.SentimentIntensityAnalyzer()
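As a quick check, we can look at the scores VADER reports for one review. polarity_scores returns a dictionary with 'neg', 'neu', 'pos', and a normalized summary 'compound' score (the lexicon may need to be downloaded once with nltk.download('vader_lexicon')):

# the VADER lexicon must be available: nltk.download('vader_lexicon')
print(obj.polarity_scores("Good case, Excellent value."))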

Find best cut-off 

Though the analyzer reports polarity scores, it doesn't tell us how to classify a sentence. So we will try several cut-offs and pick the one that gives the highest prediction accuracy.

# find good cut-off
for i in np.arange(0, 1, 0.1):
    pre = []
    for s in np.array(sentence)[test_id]:
        temp = obj.polarity_scores(s)['compound']
        temp = 0 if temp < i else 1
        pre.append(temp)
    
    print("cut:", i, "acc:", np.sum(np.array(pre) == test_y.reshape(-1))/test_y.shape[0])

Result

With a cut-off of 0.1, accuracy reaches its highest value, 80.1%, so we use 0.1 as our cut-off.

Comparison and ensemble modeling

The accuracies of the deep learning model and the traditional model are 77.5% and 80.1%. In this example, the traditional model performs better, which might be explained by some characteristics of this data.

1. Sample size is small

Unlike the traditional, dictionary-based model, the deep learning model has no information beyond the training data. When the sample size is limited, it can hardly learn the sentiment of words or sequences, and some words in the test set have likely never been seen by the model.
The traditional model, on the other hand, predicts from a pre-built dictionary (database) and is unaffected by sample size. For comparison, in Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras, the author trains a deep learning model on 25,000 sentences for a similar sentiment prediction task and reports a significantly higher accuracy of 86%.

2. Sentences are short

An RNN is powerful on sequence data because it has memory and can take word order into consideration, which most traditional bag-of-words models cannot. However, when sentences are short and the grammar is simple, order doesn't carry much information, so traditional models can do quite well.

How about mixing them: Ensemble model

We already know that the traditional model does well in this example, and that the deep learning model can make use of order information. What if we mix the two models, combine their advantages, and maximize prediction accuracy?

Ensemble modeling

Ensemble modeling is a popular way of doing this. We build the models separately, tune each to its best state, and develop a voting rule to make the final prediction.
Here, I use these rules:
  1. When the two models agree, we take their shared prediction.
  2. When they conflict, if the deep learning model is at least 90% certain (predicted probability above 0.9 or below 0.1), we take its prediction. Otherwise, we take the traditional model's prediction.
The following code implements these rules. rnn_pre is the deep learning model's output as probabilities, and vader_pre is the traditional model's output as class labels.
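For reference, here is a minimal sketch of how these two arrays can be obtained from the models defined above, using the 0.1 cut-off found earlier:

# deep learning model: predicted probabilities on the test set
rnn_pre = model.predict(test_x)
# traditional model: class labels from VADER compound scores with the 0.1 cut-off
vader_pre = [0 if obj.polarity_scores(s)['compound'] < 0.1 else 1
             for s in np.array(sentence)[test_id]]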

rnn_pre = rnn_pre.reshape(-1)
predict = []
for i in range(test_y.shape[0]):
    if rnn_pre[i] > 0.5 and vader_pre[i] == 1:
        predict.append(1)
    elif rnn_pre[i] < 0.5 and vader_pre[i] == 0:
        predict.append(0)
    else:
        if rnn_pre[i] > 0.9:
            predict.append(1)
        elif rnn_pre[i] < 0.1:
            predict.append(0)
        else:
            predict.append(vader_pre[i])

print(np.sum(np.array(predict) == test_y.reshape(-1))/test_y.shape[0])

We get 82.5% accuracy, better than either model alone!

References

  1. de Freitas, Nando, and Misha Denil. "From Group to Individual Labels using Deep Features." (2015). 
  2. nltk.sentiment package
  3. Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras


