In one of our earlier posts, we built a Text Summarization algorithm. The main challenge with such Natural Language Generation algorithms is Language Modelling.
The algorithm needs enough training to become confident about how words are used. Not every part of a text is useful for summarization or other NLP problems, so for these tasks the model first has to be trained on the language before it can generate text well.
By text generation, we mean the process of generating natural language that carries meaning. The model needs a seed sentence, from which it can generate a complete passage such as a summary.
To build this text generation model we will be using TensorFlow, the Keras library, Deep Learning, NLP and LSTM. So first, let us look at each of them one by one.
What is Deep Learning?
Deep Learning is a subset of Machine Learning in which the models are neural networks with many layers. These models can be trained with supervised or unsupervised learning. They are fed data such as text, voice, and images, and are trained until they can make decisions on new inputs.
In today's article, we will train our model on text. It will learn text generation with an LSTM by learning which words tend to follow which, and how often. After training, the model will be able to generate text on its own from a seed sentence.
Natural language Processing
It is the processing and analysis of natural (human) languages by computer models. Machines rely on Natural Language Processing for tasks such as Text Summarization, Sentiment Analysis, Speech-to-Text conversion, etc.
What is TensorFlow?
TensorFlow is one of the most popular open-source Deep Learning libraries. It was made public by Google in 2015, and developers around the world use it to build AI and Machine Learning projects.
The library is mainly used for numerical computation, which makes training models fast and smooth. It helps developers throughout the workflow: loading data, training models, making predictions, and refining the model.
We will be using TensorFlow to train our model for text generation. It can also be used to train models for digit recognition, image recognition and other data-oriented tasks.
What Does Keras Library Do?
Keras is an open-source neural network library written mainly in Python. It is generally described as a high-level API that can run on top of TensorFlow, PlaidML, Theano, etc. It was developed to make experimentation with Deep Learning easier and faster.
Keras makes Deep Learning models easier to develop. Since TensorFlow added support for it, building Deep Learning models at a higher level of abstraction has become much simpler.
What About LSTM?
LSTM, short for Long Short-Term Memory, is a type of recurrent neural network (RNN) architecture. It is often used to build deep learning models for sequential data, because it can remember patterns across long sequences of elements.
The model we are going to build will use the LSTM architecture to remember the order in which words occur. This also helps keep the generated text related to the seed sentence that we provide.
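For readers who want to see what "remembering" means here, the update rules of a single LSTM cell are commonly written as below. This is the standard textbook formulation, not something taken from this article's code. The symbols are notation chosen for this sketch: x_t is the input at step t, h_t the hidden state, c_t the cell state, σ the sigmoid function and ⊙ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forget gate f_t decides how much of the old cell state to keep, and the input gate i_t decides how much new information to write. The output gate o_t decides how much of the cell state is exposed as the hidden state, which is what lets the cell carry information across many words.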
By now you should have a basic understanding of each of these elements, so let us move on to building the deep learning model that will generate text using LSTM.
Note: Deep Learning algorithms train much faster on a GPU, so we are using Google Colab. If your own system has a GPU, you can of course use that instead.
Note 2: WordPress plugins can break the indentation of code blocks. Python is sensitive to indentation, so check that each block is indented correctly before running it.
Building Text Generation Model with LSTM
Installation of TensorFlow and Keras
pip install --upgrade tensorflow
sudo pip3 install keras    # for Ubuntu
pip install keras          # for Windows
1. Importing Important Libraries
import string
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from nltk.tokenize import word_tokenize
Tokenizer builds the vocabulary and maps each word to an integer id, to_categorical converts a vector of class ids into a one-hot matrix, and word_tokenize splits raw text into a list of words.
2. Upload Files
This step is only for those working on Google Colab. If you are working on your local machine, you can skip straight to the third step.
from google.colab import files
uploaded = files.upload()
3. Opening a File
We can use the code below to open the file and read its contents.
# 'File name' is a placeholder for the path of your text file
with open('File name', 'r', encoding="ISO-8859-1") as file:
    text = file.read()
4. Cleaning the Text
For this step we need to download the 'punkt' tokenizer data from the NLTK library. Use the code below:
import nltk
nltk.download('punkt')
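After downloading punkt, we can use the code below to clean our text.
def clean_doc(doc):
    clean_words = []
    # Tokenize the document passed in (not the global text variable)
    words = word_tokenize(doc)
    for word in words:
        # Strip leading and trailing punctuation from each token
        word = word.strip(string.punctuation)
        # Keep non-empty tokens that are not pure digits, in lowercase
        if len(word) >= 1 and not word.isdigit():
            word = word.lower()
            clean_words.append(word)
    return clean_words

token = clean_doc(text)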
In the above step, punctuation and purely numeric tokens are removed and the remaining words are lowercased, which leaves the text in a usable form.
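To make the cleaning rules concrete, here is a minimal sketch that applies the same three rules (strip punctuation, drop digits, lowercase) to a short string. It is only an illustration: it swaps NLTK's word_tokenize for plain str.split() so that it runs without downloading punkt, and the names clean_tokens and sample are made up for this example.

```python
import string

def clean_tokens(doc):
    """Hypothetical stand-in for clean_doc: whitespace split instead of word_tokenize."""
    cleaned = []
    for word in doc.split():
        # Remove punctuation attached to the start or end of the token
        word = word.strip(string.punctuation)
        # Skip tokens that became empty or consist only of digits
        if word and not word.isdigit():
            cleaned.append(word.lower())
    return cleaned

sample = "Well, Prince, so Genoa and Lucca are now just family estates in 1805!"
print(clean_tokens(sample))
```

Because str.split() only splits on whitespace, the real clean_doc may tokenize some inputs differently, for example contractions.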
5. Creating Sequence
In this step, we generate sequences of 21 words each. The right length depends on the data you are working with. Since our data is of medium size, we use a length of 21.
sequence_len = 21
seq = []
for i in range(0, len(token) - sequence_len):
    seq.append(token[i:i + sequence_len])
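The loop above slides a fixed-size window over the token list one word at a time. A small sketch on a toy sentence, with a window of 4 instead of 21, shows what the windows look like. The helper name make_windows and the toy sentence are assumptions made for this illustration.

```python
def make_windows(tokens, window_len):
    """Return runs of `window_len` consecutive tokens, mirroring the loop above."""
    return [tokens[i:i + window_len] for i in range(0, len(tokens) - window_len)]

tokens = "the quick brown fox jumps over the lazy dog".split()
windows = make_windows(tokens, 4)

# Show each window as "context words -> last word"
for w in windows:
    print(w[:-1], "->", w[-1])
```

Note that range(0, len(tokens) - window_len) leaves out the very last possible window. With a large corpus that makes no practical difference.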
6. Unique Identifier
tokenizer = Tokenizer()
tokenizer.fit_on_texts(seq)
sequence = tokenizer.texts_to_sequences(seq)
Here, sequence is a list of sublists in which each word is represented by its unique id. You can print sequence and check the result for a better understanding.
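If Keras is not at hand, the idea behind this step can be sketched in plain Python. The helpers build_word_index and encode are hypothetical names for this illustration. They imitate the general behaviour of the Tokenizer (ids start at 1 and more frequent words get smaller ids) but are not its actual implementation.

```python
from collections import Counter

def build_word_index(sequences):
    """Assign ids starting at 1, most frequent word first (0 is left unused)."""
    counts = Counter(word for seq in sequences for word in seq)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(sequences, word_index):
    """Replace every word with its integer id."""
    return [[word_index[w] for w in seq] for seq in sequences]

toy_seq = [["the", "cat", "sat"], ["cat", "sat", "on"], ["sat", "on", "the"]]
toy_index = build_word_index(toy_seq)
toy_encoded = encode(toy_seq, toy_index)

print(toy_index)
print(toy_encoded)
```

The word "sat" appears most often in the toy data, so it receives id 1. No word ever receives id 0.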
7. Index Number
You can check the unique id (index number) of each word using word_index. It returns a dictionary with each word as the key and its id as the value.
tokenizer.word_index
8. Vocab Size
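vocab_size = len(tokenizer.word_index) + 1
Here len(tokenizer.word_index) is the total number of unique words in the source text. We add 1 because the Tokenizer assigns ids starting from 1 and reserves 0, so the Embedding and output layers need room for every id from 0 up to the largest word id.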
9. Numpy Array
import numpy as np
arr = np.array(sequence)
In this step, the list of sequences is converted into a two-dimensional NumPy array, because that is the form our model expects.
10. Sequence and Targeted Division
In this step, we divide each sequence into input values (X) and the corresponding output value (Y) so that we can train our model on them.
Y then needs to be converted into categorical (one-hot) form. The to_categorical() function can be used for this purpose.
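The code for this step is not shown above, so the following is only a sketch of one plausible way to do it. It assumes that the last id of every row is the word to predict and everything before it is the input. It uses plain lists so that it runs on its own, and the helper names split_inputs_targets and one_hot are made up for this example.

```python
def split_inputs_targets(rows):
    """All but the last id of each row is the input; the last id is the target."""
    inputs = [row[:-1] for row in rows]
    targets = [row[-1] for row in rows]
    return inputs, targets

def one_hot(labels, num_classes):
    """One row per label, with a 1.0 at the label's position and 0.0 elsewhere."""
    return [[1.0 if i == label else 0.0 for i in range(num_classes)] for label in labels]

rows = [[2, 3, 1], [3, 1, 4], [1, 4, 2]]      # toy encoded sequences, ids start at 1
X_toy, y_toy = split_inputs_targets(rows)
num_classes = max(max(r) for r in rows) + 1   # +1 because id 0 is never assigned
Y_toy = one_hot(y_toy, num_classes)

print(X_toy)
print(Y_toy)

# With the NumPy array from step 9, the equivalent would presumably be:
#   X, Y = arr[:, :-1], arr[:, -1]
#   Y = to_categorical(Y, num_classes=...)  # must exceed the largest word id
```

With sequences of length 21, this split gives 20 input ids and 1 target id per row, which matches the 20-word seed sentence used later in the article.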
11. Text Generation Model Using LSTM
Here comes the most awaited step of our article. In this step we build the actual model that generates text, using the data prepared in the steps above.
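from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

# Number of input word ids per training example
seq_length = X.shape[1]

model = Sequential()
# Map each word id to a 150-dimensional vector
model.add(Embedding(vocab_size, 150, input_length=seq_length))
# First LSTM returns the whole sequence so that it can feed the second LSTM
model.add(LSTM(1024, return_sequences=True))
model.add(LSTM(1024))
model.add(Dense(1024, activation='relu'))
# One probability per word in the vocabulary
model.add(Dense(vocab_size, activation='softmax'))
model.summary()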
In this process we are using the following elements:
Sequential Model: This type of model lets you build the network by adding layers one after the other, as you can see in the code itself.
Embedding Layer: It learns a vector representation (embedding) for every word in the vocabulary and can only be used as the first layer of the model.
LSTM Layer: This is the core layer of our model. We stack two LSTM layers with 1024 units each. The first one sets return_sequences=True so that it passes its full output sequence on to the second.
Dense Layer: It is a fully connected neural network layer in which each input node is connected to each output node. We use two of them: a hidden layer with ReLU activation and the final output layer, which uses the softmax activation function to produce a probability for every word in the vocabulary.
Finally, we can print the summary of our model generated so far. It will give you an idea about the functionality and parameters used by each layer.
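To get a feel for the parameter counts that the summary reports, they can be estimated by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers and assumes Keras defaults (bias enabled, no extra options). The function names are made up for this example, and the vocabulary size of 10,000 is an arbitrary illustration, not the value from this article's data.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per word id
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(inputs, units):
    # Weight matrix plus one bias per output unit
    return inputs * units + units

def total_params(vocab_size):
    """Sum over the five layers of the model defined above."""
    return (embedding_params(vocab_size, 150)
            + lstm_params(150, 1024)
            + lstm_params(1024, 1024)
            + dense_params(1024, 1024)
            + dense_params(1024, vocab_size))

print(total_params(10000))   # vocabulary of 10,000 words is a made-up example
```

Only the embedding layer and the final dense layer depend on the vocabulary size, so a larger vocabulary mostly grows the first and last layers.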
12. Compiling the Model
We can use the compile() function to compile the model we created in the last step, and fit() to train it.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y, batch_size=80, epochs=150)
Parameters Used in the Compilation Process:
Loss: The quantity that training tries to minimize. We use categorical cross-entropy, which measures how far the predicted word probabilities are from the actual next word.
Optimizer: The algorithm that updates the model's weights to reduce the loss. We use Adam.
Metrics: The values reported during training. In our case, we have used accuracy.
fit(): It is used to train the model on the input and output values.
Batch Size: The number of training samples processed before the model's weights are updated.
Epochs: The number of complete passes over the training data.
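The loss and batch size are easier to grasp with numbers. The following sketch computes softmax probabilities and the categorical cross-entropy for a single made-up prediction over three words. It also shows how many weight updates one epoch takes for a given batch size. The helper names and all input values are assumptions for this illustration, and Keras computes these quantities internally in its own way.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs):
    """Loss for one sample: minus the log of the probability given to the true word."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)

def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)

probs = softmax([2.0, 1.0, 0.1])
loss = categorical_crossentropy([1.0, 0.0, 0.0], probs)
print(probs, loss)
print(updates_per_epoch(1000, 80))          # 1000 samples is a made-up figure
```

The more probability the model assigns to the correct word, the closer the loss gets to zero.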
13. Final Output of Our Text Generation Model
import numpy as np
from keras.preprocessing.sequence import pad_sequences

seed_text = ("well prince so genoa and lucca are now just "
             "family estates of the buonapartes but i warn you if you")

def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = []
    in_text = seed_text
    for _ in range(n_words):
        # Encode the running text as word ids and keep only the last seq_length of them
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict_classes() was removed from recent Keras versions; take the argmax of predict() instead
        yhat = np.argmax(model.predict(encoded, verbose=0), axis=-1)[0]
        # Map the predicted id back to its word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        in_text += ' ' + out_word
        # Append the predicted word (not the loop variable)
        result.append(out_word)
    return ' '.join(result)

generated = generate_seq(model, tokenizer, seq_length, seed_text, 30)
print(generated)
In the generate_seq() function above, the running text is converted to word ids with texts_to_sequences() and then passed through pad_sequences(), which ensures that every input has the same length by padding shorter sequences and truncating longer ones. The model then predicts the id of the next word, which is looked up in word_index and appended to the text.
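The two helper operations in that loop, fixing the input length and turning a predicted id back into a word, can be sketched without Keras. The names pad_pre, argmax and index_word, the toy vocabulary and the probabilities below are all made up for this illustration. pad_pre imitates 'pre' padding and truncation and is not the library function itself.

```python
def pad_pre(ids, maxlen, value=0):
    """Keep the last `maxlen` ids; left-pad with `value` if there are fewer."""
    ids = ids[-maxlen:]
    return [value] * (maxlen - len(ids)) + ids

def argmax(probs):
    """Index of the largest probability, i.e. the predicted word id."""
    return max(range(len(probs)), key=probs.__getitem__)

toy_index = {"sat": 1, "the": 2, "cat": 3, "on": 4}
index_word = {i: w for w, i in toy_index.items()}   # reverse lookup built once

short = pad_pre([2, 3], 4)             # shorter than maxlen: zeros added in front
long_ = pad_pre([2, 3, 1, 4, 2], 4)    # longer than maxlen: oldest id dropped
print(short, long_)

fake_probs = [0.05, 0.10, 0.15, 0.60, 0.10]   # made-up model output, one value per id
predicted_word = index_word[argmax(fake_probs)]
print(predicted_word)
```

Building a reverse dictionary once, as index_word does here, avoids scanning word_index for every generated word.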
We have generated a text output of 30 words. You can even generate a whole story of about 100 to 200 words by increasing n_words.
This brings us to the end of our article on Building a Text Generation Model Using LSTM with Deep Learning. You can use the code above to create your own model and try it out on other texts.
Check out the GitHub Link for Code and Project files: Varsha Saini’s GitHub