In the 21st century, data is growing at an exponential rate, and it comes in many forms, including video, music, text, and images. The internet is the major source of this data, and social media sites like Instagram, Facebook, and Twitter have played a massive role in the growth of text data in particular. The increasing use of these sites produces enormous volumes of text to be analyzed with NLP (Natural Language Processing) for tasks such as information retrieval and sentiment analysis. Most of this data is large and noisy, so the raw form is unsuitable for analysis. Text processing is therefore a necessary step before modelling and analysis.
In this tutorial, we discuss how to deal with text data using machine learning. The process of preparing text data is called text processing, and we will use NLP libraries to carry it out.
Technical terms of NLP
Text processing comprises two phases, namely tokenization and normalization.
- Tokenization: This is the method of splitting a text into smaller parts called tokens. Tokens are the building blocks of natural language and can be words, subwords, or characters.
For example, let’s take a sentence “Next month, we’re going to U.S.!”.
Applying tokenization produces tokens like the following:
" | Next | month | , | we | 're | going | to | U.S. | ! | "
This example shows how tokenization separates the "," from the rest of the text, treating the comma as its own token. The word "we're" is split into two tokens, "we" and "'re", because the tokenizer recognizes that they come from different base words, "we" and "are". Finally, "U.S." is kept intact despite its full stops, because the tokenizer recognizes it as an abbreviation that should not be broken apart. A short code sketch after this list reproduces this example with NLTK.
- Normalization: This is the process of converting a token to its base form for further processing. It helps remove variations, punctuation, stop words, and noise from the text, and it reduces the number of unique tokens. We will use two methods for normalization: stemming and lemmatization.
- Stemming: This process chops off affixes, i.e., letters at the start or end of a word, to reach a base form. It often fails to produce a valid word and returns only a similar-looking stem, but it is faster than lemmatization. Two widely used stemmers are:
- Porter stemmer: introduced by Martin F. Porter in 1980, it is not the most accurate stemmer but it is very fast, which is why it remains popular.
- Snowball stemmer: an improved version of the Porter stemmer, also developed by Martin Porter. It is more accurate than the Porter stemmer but somewhat slower, so it serves well when accuracy is the priority.
- Lemmatization: This process removes affixes systematically and returns the correct base form, or lemma. It makes use of parts of speech, vocabulary, word structure, and grammatical relations. Because it uses all this information, it is slower than stemming.
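As an illustration of tokenization, here is a minimal sketch using NLTK's word_tokenize; it assumes the punkt tokenizer data has been downloaded, and the exact splits can vary slightly between tokenizers and NLTK versions.

import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize (named 'punkt_tab' in some newer NLTK versions)
from nltk.tokenize import word_tokenize

sentence = "Next month, we're going to U.S.!"
print(word_tokenize(sentence))
# typically something like: ['Next', 'month', ',', 'we', "'re", 'going', 'to', 'U.S.', '!']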
The table below compares the results obtained with stemming and lemmatization; a code sketch after the table reproduces the comparison with NLTK:
| Word | Stemming | Lemmatization |
| --- | --- | --- |
| Change | chang | change |
| Changing | chang | change |
| Changes | chang | change |
| Changed | chang | change |
| Changer | chang | change |
| is | is | be |
| remembered | rememb | remember |
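The comparison in the table can be reproduced with NLTK. The following minimal sketch runs the Porter stemmer, the Snowball stemmer, and the WordNet lemmatizer over the same words; it assumes the wordnet data has been downloaded, and exact stems can differ slightly between stemmer implementations and versions.

import nltk
nltk.download('wordnet')  # lexical database used by the lemmatizer
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

for word in ['change', 'changing', 'changes', 'changed', 'changer', 'is', 'remembered']:
    # print the word, its two stems, and its lemma (treated as a verb)
    print(word, porter.stem(word), snowball.stem(word), lemmatizer.lemmatize(word, pos='v'))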
Text processing using Python
To perform text processing in Python, we are going to use two NLP libraries, namely NLTK (Natural Language Toolkit) and spaCy. We use these two because they are the most widely used and therefore more popular than other libraries. However, there are several other text-processing libraries, such as CoreNLP, Gensim, PyNLPl, Pattern, Polyglot, and TextBlob.
In order to perform text processing, we first need to store the text in a variable; this works the same way for both NLTK and spaCy.
In this tutorial, we have simply taken a block of text to explain text processing. We could instead use another kind of document, such as a PDF or a Word file, which would first have to be loaded. Note that some PDFs (for example, scanned images) contain no extractable text and cannot be used for text processing without OCR.
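As a rough sketch of how such documents could be loaded, assuming the third-party PyPDF2 and python-docx packages are installed and that the file names used here are placeholders:

# minimal sketch: extract raw text from a PDF and a Word document
# (assumes `pip install PyPDF2 python-docx`; "sample.pdf" and "sample.docx" are placeholder file names)
from PyPDF2 import PdfReader
from docx import Document

pdf_text = " ".join(page.extract_text() or "" for page in PdfReader("sample.pdf").pages)
docx_text = " ".join(paragraph.text for paragraph in Document("sample.docx").paragraphs)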
text = "The internet has been the major source of these data, and the introduction of social media sites like Instagram, Facebook and twitter has played a massive role in the increment of text data. Increase in the usage of these social media sites has led to a massive increase in text data to be analyzed by NLP (Natural language Processing) to perform information retrieval and sentiment analysis. Most of these data are humongous and noisy, hence raw data is inapplicable for analysis. Therefore, text processing is necessary for modelling and analysis of the data."
The first step is to import all the needed libraries.
import re
import nltk
import pandas as pd
from nltk.tokenize import WordPunctTokenizer #imports the tokenizer
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
#downloads wordnet for lemmatizer
from nltk.stem import WordNetLemmatizer
The first line, import re, imports regular expressions, popularly known as regex. Regex is used to extract, search, or manipulate string patterns within a larger piece of text.
The second line imports the NLTK library.
The third line imports pandas, a data analysis toolkit, which will be used at the end once the clean tokens are obtained.
The fourth line imports WordPunctTokenizer, which will convert the given text into tokens.
The fifth line downloads the stop word list that ships with NLTK.
The sixth line imports the stop words that were just downloaded.
The seventh line downloads the POS tagger needed later for lemmatization.
The eighth line downloads WordNet from the NLTK library.
The last line of the block imports WordNetLemmatizer from nltk.stem, which will be used later for lemmatization.
The second step is to apply tokenization
word_punct_token = WordPunctTokenizer().tokenize(text)
Here we use tokenization to split the text into tokens. The variable word_punct_token collects the tokens produced by WordPunctTokenizer. There are several other tokenizers in the NLTK library that could be used instead of this one, as the sketch below shows.
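For example, a minimal sketch of two alternatives, the Treebank tokenizer and a plain whitespace tokenizer, applied to the same text:

from nltk.tokenize import TreebankWordTokenizer, WhitespaceTokenizer

# the Treebank tokenizer splits contractions and punctuation; the whitespace tokenizer only splits on spaces
print(TreebankWordTokenizer().tokenize(text)[:10])
print(WhitespaceTokenizer().tokenize(text)[:10])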
The third step is to apply normalization
After performing tokenization, we normalize the separated tokens to bring the words to a consistent base form and to remove punctuation, stop words, and noise so that the text can be processed more easily.
clean_token = []
for token in word_punct_token:
    # convert the token to lower case
    token = token.lower()
    # remove any characters that are not alphabetic
    new_token = re.sub(r'[^a-zA-Z]+', '', token)
    # keep only non-empty tokens of at least two characters
    if new_token != "" and len(new_token) >= 2:
        vowels = len([vowel for vowel in new_token if vowel in "aeiou"])
        if vowels != 0:  # discard tokens that contain no vowels
            clean_token.append(new_token)
The above block of code removes additional noise from the text data.
The line token = token.lower() converts each token to lower case so that we only have to deal with lower-case letters.
The re.sub call removes punctuation, numbers, special characters, and anything else that is not an alphabetic character.
The condition new_token != "" and len(new_token) >= 2 drops empty strings and single-character tokens such as "a".
The last two lines also drop tokens that contain only consonants and no vowels.
The fourth step is to remove stop words
# Get the list of all stop words in NLTK
stop_words = stopwords.words('english')
# To add new stopwords to the list
stop_words.extend(["would","so","token","data","text",'word'])
print(stop_words)
# Remove all the stopwords from the list of tokens
tokens = [x for x in clean_token if x not in stop_words]
Stop words are the words most commonly used in a language. They add little meaning to a given text, so they are removed before processing to reduce the amount of noise in the data. Different libraries use different sets of stop words; NLTK also lets us add words to or remove words from its list. For example, if we are dealing with business text, we might add words like product, employee, and work to the list of stop words.
In this code block, the first line fetches all the English stop words present in the NLTK library.
The second line adds a few custom stop words to the original list.
The third line prints the updated stop word list.
The last line removes every stop word from the list of clean tokens.
The fifth step is POS (part-of-speech) tagging
data_tagset = nltk.pos_tag(tokens)
df_tagset = pd.DataFrame(data_tagset, columns=['Word', 'Tag'])
In POS tagging, each word in the list of tokens is labelled with a tag, producing a list of (word, tag) tuples that classify the words as nouns, verbs, adjectives, and so on.
The nltk.pos_tag function returns the tokens as a list of tuples, which is then converted into a DataFrame.
The sixth step is lemmatization
# Create the lemmatizer object
lemmatizer = WordNetLemmatizer()

lemmatize_text = []
for word in tokens:
    # lemmatize each token as a noun, an adjective, and a verb
    output = [word, lemmatizer.lemmatize(word, pos='n'), lemmatizer.lemmatize(word, pos='a'), lemmatizer.lemmatize(word, pos='v')]
    lemmatize_text.append(output)

# Now create a DataFrame using the original words and their lemmas
df = pd.DataFrame(lemmatize_text, columns=['Word', 'Lemmatized Noun', 'Lemmatized Adjective', 'Lemmatized Verb'])
df['Tag'] = df_tagset['Tag']
Inside the for loop, pos='n' lemmatizes the word as a noun, pos='a' lemmatizes it as an adjective, and pos='v' lemmatizes it as a verb. The word together with its noun, adjective, and verb lemmas is appended to a single list.
We then create a DataFrame to store the lemmatized words in an orderly manner, and attach the POS tags from the tagset DataFrame.
We could use stemming here as well, but stemming does not give promising results, as it only strips certain letters from a word and often produces stems that carry no meaning in the sentence.
Lemmatization, on the other hand, trims prefixes and suffixes while also using POS information to reach the base word, so it reliably produces valid base forms.
After performing lemmatization, we process the obtained lemmas further so that each word is matched with the lemma of its actual part of speech. The block of code below first simplifies the POS tags to make this possible.
# simplify the tagset: map the detailed POS tags to a single character
df = df.replace(['NN','NNS','NNP','NNPS'],'n')
df = df.replace(['JJ','JJR','JJS'],'a')
df = df.replace(['VBG','VBP','VB','VBD','VBN','VBZ'],'v')
Now we select, for each word, the lemma that matches its tag: the lemmatized noun when the tag is a noun, the lemmatized adjective when the tag is an adjective, and the lemmatized verb when the tag is a verb.
df_lemmatized = df.copy()
df_lemmatized['Tempt Lemmatized Word'] = df_lemmatized['Lemmatized Noun'] + ' | ' + df_lemmatized['Lemmatized Adjective'] + ' | ' + df_lemmatized['Lemmatized Verb']
df_lemmatized.head(5)

lemma_word = df_lemmatized['Tempt Lemmatized Word']
tag = df_lemmatized['Tag']
i = 0
new_word = []
while i < len(tag):
    # split the combined string back into the noun, adjective, and verb lemmas
    words = lemma_word[i].split(' | ')
    if tag[i] == 'n':
        word = words[0]
    elif tag[i] == 'a':
        word = words[1]
    elif tag[i] == 'v':
        word = words[2]
    else:
        word = words[0]  # fall back to the noun lemma for any other tag
    new_word.append(word)
    i += 1
df_lemmatized['Lemmatized Word'] = new_word
By the end of this code, each word has been matched with its noun, adjective, or verb lemma, and the chosen lemma is stored in the Lemmatized Word column.
Get the final processed tokens and make a word cloud
Lemma_word = [str(x) for x in df_lemmatized['Lemmatized Word']]
# Now calculate the frequency distribution of the tokens
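One simple way to compute that frequency distribution is NLTK's FreqDist; a minimal sketch:

from nltk import FreqDist

freq_dist = FreqDist(Lemma_word)
print(freq_dist.most_common(10))  # the ten most frequent lemmas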
After the final lemmas are obtained, they are ready to be plugged into any text analytics model. One of the most popular is the word cloud, a text-processing visualization that shows the most frequent words in a text. Each word is displayed in a colour and size that depend on how often it appears: smaller words are less frequent, and larger words are more frequent. The image below shows an example of what a word cloud looks like.
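As a rough sketch of how such a word cloud could be generated from the processed lemmas, assuming the third-party wordcloud and matplotlib packages are installed:

# minimal sketch (assumes `pip install wordcloud matplotlib`)
from wordcloud import WordCloud
import matplotlib.pyplot as plt

cloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(Lemma_word))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()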
As further steps, we can also perform sentiment analysis and text classification. Alternatively, we can convert the text to numerical form and use scikit-learn to perform predictive analysis on the obtained set of lemmas, training a model on those features to obtain the desired results.
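As an illustration of that conversion to numerical form, here is a minimal sketch using scikit-learn's TfidfVectorizer, assuming scikit-learn is installed; the resulting matrix could then be fed into a classifier or clustering model.

# minimal sketch (assumes `pip install scikit-learn`)
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [' '.join(Lemma_word)]  # here a single processed document; in practice, one string per document
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
print(tfidf_matrix.shape)
print(vectorizer.get_feature_names_out()[:10])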