In this article, we will build a text generator with an LSTM recurrent neural network using Python Keras. We train the network on post titles from the LifeProTips board on Reddit.com. By the end, it generates brilliant life tips in fluent English.
My Python code can be found on my GitHub.
What we are going to cover:
- Loading and processing text data
- Training a naive LSTM neural network
- Training a modified LSTM neural network
- Summary: Comparing the two models
Loading and Processing Text Data
The text data were crawled from the LifeProTips board on Reddit.com. I made a web crawler with Python scrapy; check my previous post for details.
We first load the data from the database with some SQL. Here I use Microsoft Access. Since the full dataset is large, to reduce training time and memory use we only take posts from January 2017. Here is the code.
Load data
import pyodbc

# connection string for the Access database holding the crawled titles
conn_str = (
    r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
    r'DBQ=C:\Users\username\reddit.accdb;'
)
cnxn = pyodbc.connect(conn_str)
cursor = cnxn.cursor()

# keep only posts from January 2017
SQL = "SELECT title FROM LPT WHERE datetime LIKE '%201701%'"
cursor.execute(SQL)

title = []
for row in cursor.fetchall():
    title.append(row[0])

cursor.close()
cnxn.close()
Clean data
After running the code above, we have a list title containing 1,904 titles, with about 225,000 characters in total. Let's take a look at how many unique characters there are:
title_str = ''.join(title)
unique_char = sorted(list(set(title_str)))
len(unique_char)
There are 73 unique characters, but many of them are rare, like '#', '$' and '%'. As text output, they don't carry much information. Upper and lower case also don't add much.
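If you want to see which characters those are, a quick way is to count character frequencies in the title_str built above (a small check, not part of the original pipeline):

from collections import Counter

# count how often each character occurs in the joined titles
char_counts = Counter(title_str)
# the ten least frequent characters are good candidates for removal
print(sorted(char_counts.items(), key=lambda kv: kv[1])[:10])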
So the following code converts all titles to lower case, builds a white list of the characters we are interested in generating, and removes the others.
# join title strings
title_str = '\n'.join(title)
# make white list
white_list = "\n abcdefghijklmnopqrstuvwxyz,.:;?'’*-" + '"'
# convert title strings to lower case
title_str = title_str.lower()
# remove chars not in white list
title_str_reduced = ''.join([ch for ch in title_str if ch in white_list])
# get unique chars
chars = sorted(list(set(title_str_reduced)))
Note that the last line

chars = sorted(list(set(title_str_reduced)))

converts the string to a set of unique characters, then converts the set back to a list. Sets don't guarantee any ordering, so the order of the resulting list can change between runs. That is why we sort after the conversion: it makes sure we get a consistent character ordering every time.
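A tiny toy example of why the sort matters (the exact unsorted order may differ on your machine):

s = "banana"
print(list(set(s)))           # order is not guaranteed, e.g. ['b', 'n', 'a']
print(sorted(list(set(s))))   # always ['a', 'b', 'n']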
The next step is creating mappings among characters, integers and one-hot vectors. This is the most important step in data processing. Whatever the type of deep learning algorithm, as a mathematical model it only takes numbers, vectors or matrices, so we need to represent the text in one of those forms. In Python, the dictionary is a very convenient hash table implementation for this.
The first dictionary we make is a character-to-integer mapping. This will be used for the input training set.
# make char to int index mapping
char2int = {}
for i in range(len(chars)):
    char2int[chars[i]] = i
Then the integer-to-character mapping. This will be used to convert the output back to characters (so humans can read it!).
# make int index to char mapping
int2char = {}
for i in range(len(chars)):
    int2char[i] = chars[i]
The last dictionary maps characters to one-hot vectors. This is used for the training data's outputs, and it's an important concept: since we treat each character as a category and use cross-entropy as the loss function, the output has to be categorical.
import numpy as np

# make char to one hot vector mapping
char2vec = {}
for ch, i in char2int.items():
    char2vec[ch] = np.eye(len(chars))[i]
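To see what the three mappings look like, here is a quick check for one character (assuming 'a' survived the cleaning, which it does since it is in the white list):

i = char2int['a']
print(i)                    # the integer index assigned to 'a'
print(int2char[i])          # 'a' again, mapping back from the index
print(char2vec['a'])        # vector of length len(chars) with a single 1.0 at position i
print(char2vec['a'].shape)  # (len(chars),)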
Train a Naive LSTM Neural Network
At this point we have the text data cleaned and the dictionaries ready, so it's time to start training. We first train a very basic LSTM model with 2 layers of 500 cells each, dropout rate 0.2, and a dense layer to take care of the output shape. This network takes integers as input: for each batch of inputs, shape = [batch_size, sequence_length, 1], where "1" stands for the integer index of that character. The output is a single character represented as a one-hot vector, with shape [batch_size, num_unique_char].
Make training set
The following code makes the training set. We choose a reasonable sequence length of 100 characters; this will be the input length. During training, we use a sequence to predict the next character after it, then move forward one character. For example, we take the first 100 characters and ask the network to predict the 101st character, then take the 2nd through 101st characters and ask it to predict the 102nd.
# make data
seq_length = 100
x = []
y = []
for i in range(len(title_str_reduced) - seq_length):
    x.append([char2int[index] for index in title_str_reduced[i:i+seq_length]])
    y.append(char2vec[title_str_reduced[i+seq_length]])
x = np.reshape(x, (len(x), seq_length, 1))
# scale the integer indices to the range [0, 1)
x = x/float(len(chars))
y = np.reshape(y, (len(y), len(chars)))
Define network
Then we define the network.
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

# define model structure
model = Sequential()
model.add(LSTM(500, input_shape = (None, 1), return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(500))
model.add(Dropout(0.2))
model.add(Dense(len(chars), activation = "softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam')
Note that the input shape is [None, 1]. None gives flexibility in the input sequence length, so the network can accept sequences of any length. This is important when generating text, because we will ask the network to read the previously generated text and produce new characters, so the input sequence length will keep growing.
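As a quick sanity check (with dummy all-zero inputs; the lengths 10 and 200 are arbitrary), the same model accepts sequences of different lengths and always returns one character distribution per batch item:

dummy_short = np.zeros((1, 10, 1))       # batch of 1, sequence length 10
dummy_long = np.zeros((1, 200, 1))       # batch of 1, sequence length 200
print(model.predict(dummy_short).shape)  # (1, len(chars))
print(model.predict(dummy_long).shape)   # (1, len(chars))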
The return_sequences option specifies whether to return the output for every character in the input sequence. For the first layer, we want it fully connected to the second layer, so we set it to True. But the second (last) LSTM layer should only return its last output; otherwise it won't match the dense layer's input shape, since the dense layer is not an LSTM and can't handle a sequence. Another option is using a TimeDistributed() wrapper on the dense layer.
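For reference, here is a sketch of that alternative structure (not the model trained in this post): the last LSTM also returns the full sequence, and TimeDistributed applies the same dense layer to every time step, so the targets would then need shape [batch_size, seq_length, num_unique_char].

from keras.layers import TimeDistributed

# alternative structure (sketch only, not used below)
alt_model = Sequential()
alt_model.add(LSTM(500, input_shape=(None, 1), return_sequences=True))
alt_model.add(Dropout(0.2))
alt_model.add(LSTM(500, return_sequences=True))   # keep the full sequence
alt_model.add(Dropout(0.2))
alt_model.add(TimeDistributed(Dense(len(chars), activation="softmax")))
alt_model.compile(loss='categorical_crossentropy', optimizer='adam')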
Train neural network
Now we define checkpoints and start training.
from keras.callbacks import ModelCheckpoint

# define the checkpoint
filepath = "LPT-naive-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(x, y, epochs=60, batch_size=50, callbacks=callbacks_list)
When save_best_only is True, the checkpoint compares the current loss with previous epochs; if the loss doesn't improve, the model isn't saved. We train with a batch size of 50 for up to 60 epochs, but stop once the loss stops dropping.
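Note that ModelCheckpoint by itself only saves models; it does not halt training. If you want Keras to stop automatically once the loss plateaus, one option (a sketch, not necessarily what I ran) is the EarlyStopping callback:

from keras.callbacks import EarlyStopping

# stop training when the training loss has not improved for 3 epochs
early_stop = EarlyStopping(monitor='loss', patience=3, mode='min')
model.fit(x, y, epochs=60, batch_size=50, callbacks=[checkpoint, early_stop])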
Generate Life Tips
I trained on my GeForce GTX 960 GPU. It took about 6 hours and stopped at epoch 18, with loss 1.1391. Now we can load the best model from the checkpoints and use it to generate title text.
import operator
from keras.models import load_model

# load the best model
best_model = load_model("LPT-naive-17-1.1391.hdf5")

# predict LifeProTips
out = "lpt:"
for i in range(1000):
    # encode everything generated so far as scaled integer indices
    pattern = [char2int[ch] for ch in out]
    pattern = np.reshape(pattern, (1, len(pattern), 1))
    pattern = pattern/float(len(chars))
    # predict the probability distribution of the next character
    pre = best_model.predict(pattern).reshape(-1)
    # pick the character with the highest probability
    index, value = max(enumerate(pre), key=operator.itemgetter(1))
    next_char = int2char[index]
    out += next_char
output = out.split('\n')
for t in output:
    print(t)
We only give the initial seed "lpt:". All the rest, including punctuation, new lines and the other "lpt:" headers, is generated by the neural network itself.
(I doubled the new line spacing so it's easier to read)
The Outputs
"
lpt: if you are srarting a shopt for the ooen that you don't have to contact the top of your phone on a hood bootain before you account for a complate bround the car with the best or the ooes and then ualk you to use the bottom of your car with the same to search the person in the commant soutine.
they will toletimes want to hear you forgot, in the comtect number of the winle contact numbers on the shower will help you for a foun better photo. they will toletimes want to hear you fat them.
lpt: if you are srarting a shopt to you don't want to het your car with a conplered sine, take a picture of the shower will help you for a foun better photo. they will toletimes want to hear you fat them.
lpt: if you are srarting a shopt to you don't want to het your car with a conplered sine, take a picture of the shower will help you for a foun better photo. they will toletimes want to hear you fat them.
lpt: if you are srarting a shopt to you don't want to het your car with a conplered sine, take"
The output is surprising to me. Not only does it produce correct headers, punctuation and abbreviations, it even learns to end a sentence with a period followed by a new line. The grammar is correct in most cases, although some spellings are wrong.
In the next section, we will build on this foundation and modify the training process.
Train a Modified LSTM Neural Network
In this section, data cleaning and pre-processing are the same as before. We only change the following:
- Increase the number of cells in each layer from 500 to 700.
- Use one-hot vector as input instead of single integers.
The bigger a neural network is, the more memory it holds and the more information it can process. That is why bigger models, though slower to train, can perform better than smaller ones. The first change, increasing the cell count, helps in this way.
Taking single integers as the character representation has a cost: integers are ordinal, while characters are in fact categorical, and the model has to figure out this difference on its own. That is, the neural network needs to adjust itself internally so that ordinal inputs have the same effect as categorical inputs. If we instead bake this background knowledge into the input data and make it vectors (a categorical representation), the network can skip that feature-engineering effort and focus on analyzing the language's structure, like spelling and grammar. That is how the second change helps.
Here, instead of inputting integer indices, we input characters as one-hot vectors. The input shape is [batch_size, sequence_length, num_unique_char]. The output shape remains [batch_size, num_unique_char].
Make training set
The training set is made in the same way as before; note that we now use one-hot vectors for the characters.
# make data
seq_length = 100
x = []
y = []
for i in range(len(title_str_reduced) - seq_length):
    x.append([char2vec[index] for index in title_str_reduced[i:i+seq_length]])
    y.append(char2vec[title_str_reduced[i+seq_length]])
x = np.reshape(x, (len(x), seq_length, len(chars)))
y = np.reshape(y, (len(y), len(chars)))
Define network
We set the number of cells in each layer to 700. Note that the input shape has changed since we now use one-hot vectors as input.
# define model structure
model = Sequential()
model.add(LSTM(700, input_shape = (None, len(chars)), return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(700))
model.add(Dropout(0.2))
model.add(Dense(len(chars), activation = "softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam')
Train neural network
Training is the same as before. We use batch size 50, run at most 60 epochs.
Generate Life Tips
This is similar to before; just notice that the input is now converted with char2vec instead of char2int.
# load the best model
best_model = load_model("LPT-oneHot-23-0.6763.hdf5")

# predict LifeProTips
out = "lpt:"
for i in range(1000):
    # encode everything generated so far as one-hot vectors
    pattern = [char2vec[ch] for ch in out]
    pattern = np.reshape(pattern, (1, len(pattern), len(chars)))
    # predict the next character and pick the most likely one
    pre = best_model.predict(pattern).reshape(-1)
    index, value = max(enumerate(pre), key=operator.itemgetter(1))
    next_char = int2char[index]
    out += next_char
output = out.split('\n')
for t in output:
    print(t)
I trained on my GPU for about 20 hours, but the loss dropped significantly lower than the naive network's. I didn't finish the full 60 epochs since the performance was already good enough; I stopped at epoch 24, with loss 0.6763.
The Outputs
"lpt: if you have a habit of support for the bottom of your current information that you want to do it.
lpt: if you have to copy a late of all the same time to get information in a third port, ask them to accomplish something that you can accumulate on your phone call with "this" to stay in the same and then read the money to replace the back of your car for one second."
lpt: if you have a soap on water to a habit of drinking and looking at a computer adjust the stain out of your clothes.
lpt: if you have a soap of water in a new setting inside out. after a shower with the stuffy side up and the other way the plugs in your car again.
lpt: if you have a soap on water to a habit of drinking and looking at a computer adjust the stain out of your clothes.
lpt: if you have a soap of water in a new setting inside out. after a shower with the stuffy side up and the other way the plugs in your car again.
lpt: if you have a soap on water to a habit of drinking and looking at a computer adjust the st"
The last sentence is not finished since I only allowed it to generate 1,000 characters.
In the generated titles, I don't find any wrong spellings. The grammar is surprising: it began to use phrases, and it even uses quotation marks to emphasize the word "this". Ignoring the meanings, I guess it would be hard to tell these life tips were actually generated by a robot.
Summary: Comparing the two models
Loss vs. Epoch
The two neural networks have significant differences in performance and convergence. Let's look at the plot of loss vs. epoch.
Loss is a measure of how well a model has converged; as training goes on, the loss decreases.
We see that the modified model's loss drops faster in the early epochs. (It starts at a lower value because of the first epoch's training: the loss before any training is about 5.0 for both models, and epoch 0 is recorded after the first epoch has been trained.)
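The original plot is not reproduced here, but it can be recreated from the History objects returned by model.fit(); the variable names history_naive and history_onehot below are assumptions, not from the original code.

import matplotlib.pyplot as plt

# history_naive and history_onehot are the return values of model.fit()
# for the two runs (hypothetical names)
plt.plot(history_naive.history['loss'], label='naive (integer input)')
plt.plot(history_onehot.history['loss'], label='modified (one-hot input)')
plt.xlabel('epoch')
plt.ylabel('training loss')
plt.legend()
plt.show()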
Take Home Advice
- Use bigger (deeper, wider) neural networks.
- Monitor the loss. Stop when it stops dropping for a few epochs.
- Spend time and effort on pre-processing the data. The input representation can matter a lot for neural network training.