In one of our earlier posts we created a Text Summarization algorithm. The main problem with such Natural Language Generation algorithms is Language Modelling.
The algorithm needs enough training to become confident about the use of words. Not every part of a text is useful for summarization or other NLP problems, so for these tasks the model needs to be trained on language itself before it can generate text efficiently.
By text generation we mean the process of generating natural language that makes sense. The model needs a seed sentence, based on which it is able to generate a complete piece of text.
For building this text generation model we will be using Tensorflow, the Keras library, Deep Learning, NLP and LSTM. So first we should know about each of them in detail.
What is Deep Learning?
Deep Learning is a subset of Machine Learning in which models are built from neural networks with many layers. The models are trained on data such as text, voice and images until they can make decisions or predictions on new inputs.
In today's article we will be training our model on pieces of text. This way it will learn Text Generation using LSTM from the order and frequency in which words occur. After this, our model will be able to generate text on its own when given just a seed sentence.
Natural Language Processing
It is the processing and analysis of natural languages by computer models. Machines need Natural Language Processing for various tasks such as Text Summarization, Sentiment Analysis, Speech to Text conversion, etc.
What is Tensorflow?
Tensorflow is one of the most famous open source Deep Learning libraries. It was made public by Google in 2015, and developers around the world use it to build AI and Machine Learning projects.
The library is mainly used for numerical computation, which makes model training fast and smooth. It helps developers throughout the process of getting data, training models, forecasting results and adapting to model changes.
We will be using Tensorflow to train our model for the Text Generation process. It can also be used to train models for Digit Recognition, Image Recognition and other data oriented NLP tasks.
What Does Keras Library Do?
Keras is an open source neural network library written mainly in Python. It is generally described as a high-level API that can run on top of Tensorflow, PlaidML, Theano, etc. It was developed to make experimentation with Deep Learning easier and more efficient.
Keras makes Deep Learning models easier to develop. After Tensorflow added support for it, it made a real difference to how much low-level detail developers have to handle when building Deep Learning models.
What About LSTM?
LSTM, short for Long Short-Term Memory, is a recurrent neural network architecture. It is often used to build deep learning models for sequential data, because it can remember patterns across long sequences of data elements.
The model we are going to build will use the LSTM architecture to remember the order in which words occur. This also helps keep the generated text related to the seed sentence that we provide.
By now you should have a basic understanding of each of these elements. We will now move on to building a Deep Learning model for Text Generation using LSTM.
Note: Deep Learning algorithms train much faster on a GPU, so we are using Google Colab. If your own system has a GPU available, you can use that instead.
Note 2: Indentation matters in Python. If a code block on this page appears with collapsed indentation because of the WordPress plugin, check the respective output screen for the correct indentation.
Building Text Generation Model with LSTM
Installation of Tensorflow and Keras
pip install --upgrade tensorflow
sudo pip3 install keras (For Ubuntu)
pip install keras (For Windows)
1. Importing Important Libraries
import string
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from nltk.tokenize import word_tokenize
word_tokenize is used to split the text into a list of words, Tokenizer is used to map each word to a unique integer id, and to_categorical is used to convert a vector of class ids into a one-hot matrix.
2. Upload Files
This step is only for those who are working on Google Colab. If you are working on your local machine, you can skip directly to the third step.
from google.colab import files
uploaded = files.upload()
3. Opening a File
We can use the below code to open a file in the environment.
# replace 'File name' with the name of your text file
with open('File name', 'r', encoding="ISO-8859-1") as file:
    text = file.read()
4. Cleaning the Text
For this step we need to download 'punkt' from the nltk library. Use the code below to do so:
import nltk
nltk.download('punkt')
After downloading punkt, we use the code below to clean our text file.
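def clean_doc(doc):
    clean_words = []
    # split the document into word tokens
    words = word_tokenize(doc)
    for word in words:
        # strip leading and trailing punctuation
        word = word.strip(string.punctuation)
        # keep non-empty tokens that are not purely digits, in lower case
        if len(word) >= 1 and not word.isdigit():
            clean_words.append(word.lower())
    return clean_words

token = clean_doc(text)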
In the above step punctuation and purely numeric tokens are removed, and the remaining words are converted to lower case so that the text is in a usable form.
5. Creating Sequence
In this step we generate sequences of text, each 21 words long. In every sequence the first 20 words will later serve as the input and the last word as the target. The right length depends on the data you are working with. Since our data is of medium size, we are using a length of 21.
sequence_len = 21
seq = []
# slide a window of sequence_len words over the token list
for i in range(0, len(token) - sequence_len):
    seq.append(token[i:i + sequence_len])
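To see what the sliding window produces, here is a minimal sketch on a toy token list. The helper name make_sequences and the toy sentence are assumptions made for illustration; the logic mirrors the loop above, with a window of 4 instead of 21.

```python
def make_sequences(tokens, sequence_len):
    """Hypothetical helper: collect runs of `sequence_len` consecutive tokens,
    using the same range() bounds as the loop in the article."""
    return [tokens[i:i + sequence_len] for i in range(len(tokens) - sequence_len)]


toy_tokens = "the quick brown fox jumps over the lazy dog".split()
toy_seq = make_sequences(toy_tokens, 4)

# Show each window as "input words -> target word"
for window in toy_seq:
    print(window[:-1], "->", window[-1])
```

Each window overlaps the previous one by all but one word, so a corpus of N tokens yields roughly N training sequences. With these range() bounds the very last window is left out, which makes no practical difference on a large text.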
6. Unique Identifier
tokenizer = Tokenizer()
tokenizer.fit_on_texts(seq)
sequence = tokenizer.texts_to_sequences(seq)
Here sequence is a list of sublists in which every word is represented by its unique id. You can print sequence and check the result for a better understanding.
7. Index Number
You can check the unique id or index number of each word using tokenizer.word_index. It returns a dictionary with each word as key and the corresponding id as value.
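As a rough picture of what such a word index looks like, the sketch below builds one by hand in plain Python. The function build_word_index and the toy sentences are assumptions for illustration only; in the article's pipeline the real mapping comes from the Keras Tokenizer, which numbers words from 1 with the most frequent word first.

```python
from collections import Counter


def build_word_index(sequences):
    """Hypothetical stand-in for a word index: ids start at 1 and
    more frequent words receive smaller ids."""
    counts = Counter(word for seq in sequences for word in seq)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}


toy = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
word_index = build_word_index(toy)

# Replace every word by its id, as texts_to_sequences does in the article
encoded = [[word_index[w] for w in s] for s in toy]

print(word_index)
print(encoded)
```

Note that id 0 is never assigned to a word. It is kept free for padding, which matters when computing the vocabulary size.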
8. Vocab Size
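# word ids start at 1, so add 1 to leave room for the unused index 0
vocab_size = len(tokenizer.word_index) + 1
Here vocab size is the total number of unique words present in the source text, plus one. The extra slot is needed because word ids start at 1, while the Embedding layer and to_categorical() count classes from 0.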
9. Numpy Array
import numpy as np
arr = np.array(sequence)
In this step the sequence list is converted into a two dimensional NumPy array, because our model expects its input in that form.
10. Sequence and Target Division
In this step we divide our sequence array into input values and the corresponding output values so that we can train our model on them. The first 20 ids of each row form the input X, and the last id of each row forms the target Y.
Now Y needs to be converted into categorical form. The to_categorical() function can be used for this purpose.
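The code for this step is not shown above, so the following is only a minimal sketch of the idea in plain Python, under the assumption that the last id in each row is the target. The helper names and toy values are hypothetical. With the NumPy array from step 9, the same split is usually written as X = arr[:, :-1] and Y = arr[:, -1], followed by Y = to_categorical(Y, num_classes=vocab_size).

```python
def split_inputs_targets(rows):
    """Hypothetical helper: all ids but the last are input, the last id is the target."""
    inputs = [row[:-1] for row in rows]
    targets = [row[-1] for row in rows]
    return inputs, targets


def one_hot(index, num_classes):
    """Hypothetical helper: a vector of zeros with a single 1 at position `index`."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec


toy_sequences = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 1]]
toy_vocab_size = 5 + 1  # five distinct ids plus the unused index 0

toy_X, toy_y = split_inputs_targets(toy_sequences)
toy_Y = [one_hot(i, toy_vocab_size) for i in toy_y]

print(toy_X)
print(toy_Y)
```

Every target row has exactly one 1, at the position of the word id the model should predict. This is the format the softmax output layer and the categorical cross-entropy loss expect.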
11. Text Generation Model Using LSTM
Here comes the most awaited step of our article. In this step we build the actual model that generates text, using the results of the steps above.
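from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

# number of input words per training sample
seq_length = X.shape[1]

model = Sequential()
# learn a 150-dimensional vector for every word id
model.add(Embedding(vocab_size, 150, input_length=seq_length))
# first LSTM passes its full output sequence on to the second one
model.add(LSTM(1024, return_sequences=True))
model.add(LSTM(1024))
model.add(Dense(1024, activation='relu'))
# one output probability per word in the vocabulary
model.add(Dense(vocab_size, activation='softmax'))
model.summary()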
In this process we are using the following elements:
Sequential Model: This type of model lets you build a network by adding one layer after another, as you can see in the code itself.
Embedding Layer: It learns an embedding vector for every word in the vocabulary and can only be used as the first layer of a model.
LSTM Layer: This is the basic layer on which we are building our model. We stack two LSTM layers, each with 1024 units.
Dense Layer: It is a fully connected neural network layer in which each input node is connected to each output node. We use two of them. The last one is the output layer, which uses the softmax activation function to produce a probability for every word in the vocabulary.
Finally we can print the summary of the model built so far. It gives you an idea of the output shape and the number of parameters of each layer.
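To make sense of the parameter counts in the summary, the sketch below recomputes them by hand. The formulas are the usual ones for an embedding table, an LSTM layer with bias, and a fully connected layer. The function names are hypothetical, and the vocabulary size of 10,000 is a made-up value, so the embedding and output layer counts will differ for your data.

```python
def embedding_params(vocab_size, dim):
    # one vector of length `dim` per word id
    return vocab_size * dim


def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units


assumed_vocab_size = 10_000  # assumption for illustration only

layer_params = {
    "embedding": embedding_params(assumed_vocab_size, 150),
    "lstm_1": lstm_params(150, 1024),
    "lstm_2": lstm_params(1024, 1024),
    "dense_relu": dense_params(1024, 1024),
    "dense_softmax": dense_params(1024, assumed_vocab_size),
}

for name, count in layer_params.items():
    print(f"{name:14s} {count:>12,d}")
print(f"{'total':14s} {sum(layer_params.values()):>12,d}")
```

The two LSTM layers and the first Dense layer do not depend on the vocabulary, so their counts should match your own summary exactly. The output layer grows linearly with the vocabulary size, which is why large vocabularies make this kind of model expensive to train.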
Screen – 2
12. Compiling the Model
We can use the compile() function to compile the model we created in the last step, and then train it with fit().
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y, batch_size=80, epochs=150)
Parameters Used in the Compilation and Training Process:
Loss: The quantity that training tries to minimise. We use categorical cross-entropy, which measures how far the predicted word probabilities are from the actual next word.
Optimizer: The algorithm that updates the model weights to reduce the loss. We use adam.
Metrics: The values we want reported during training. In our case we have used accuracy.
fit(): It is used to train the model on the input and output values.
Batch Size: It is the number of training samples processed before the model weights are updated.
epochs: The number of complete passes over the training data.
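Two of these parameters are easy to check by hand. The sketch below computes the categorical cross-entropy for a single prediction and the number of weight updates per epoch for a given batch size. The function names, the toy probabilities and the sample count of 1,000 are assumptions for illustration.

```python
import math


def categorical_crossentropy(one_hot_target, predicted_probs):
    """Loss for one sample: minus the log of the probability given to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, predicted_probs) if t)


def steps_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the data (last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)


target = [0, 0, 1, 0]           # the true next word has id 2
probs = [0.1, 0.2, 0.6, 0.1]    # toy softmax output

loss = categorical_crossentropy(target, probs)
updates = steps_per_epoch(1000, 80)

print(round(loss, 4))
print(updates)
```

The loss shrinks towards 0 as the probability assigned to the correct word approaches 1. With batch_size=80, a hypothetical dataset of 1,000 samples is processed in 13 updates per epoch.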
13. Final Output of Our Text Generation Model
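import numpy as np
from keras.preprocessing.sequence import pad_sequences

seed_text = ("well prince so genoa and lucca are now just "
             "family estates of the buonapartes but i warn you if you")

def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = []
    in_text = seed_text
    for _ in range(n_words):
        # encode the running text as a list of word ids
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # keep only the last seq_length ids, padding at the front if needed
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict_classes() is not available in recent Tensorflow releases,
        # so take the most probable word id from predict() instead
        yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        # map the predicted id back to its word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append the new word to the input and to the result
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

generated = generate_seq(model, tokenizer, seq_length, seed_text, 30)
print(generated)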
In the above function generate_seq() we first convert the current text into word ids with texts_to_sequences() and then apply pad_sequences() to it. This ensures that every input has the same length, by cutting longer inputs down to the last seq_length ids and adding padding to shorter ones. The model then predicts the id of the next word, which we convert back into a readable word and append to the text before repeating the process.
We have generated a text output of length 30. You can even create a whole story of about 100 to 200 words.
This brings us to the end of our article on Building a Text Generation Model Using LSTM with Deep Learning. You can use the code above to create your own model and test it on more text segments.
Check out the GitHub Link for Code and Project files: Varsha Saini’s GitHub
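The effect of padding and truncating at the front can be reproduced in a few lines of plain Python. The function pad_pre below is a hypothetical stand-in written for illustration, assuming zeros are used as the padding value and that both padding and truncation happen at the start of the sequence.

```python
def pad_pre(ids, maxlen):
    """Hypothetical stand-in: force a list of ids to length `maxlen`,
    dropping or padding at the front."""
    ids = ids[-maxlen:]                      # too long: keep only the last maxlen ids
    return [0] * (maxlen - len(ids)) + ids   # too short: add zeros at the front


short_input = pad_pre([5, 6, 7], 5)
long_input = pad_pre([1, 2, 3, 4, 5, 6, 7], 5)

print(short_input)
print(long_input)
```

Truncating at the front is what lets the generation loop keep going: as words are appended to the text, the oldest words fall out of the window and the model always sees the most recent seq_length words.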