In this article, we will discuss and implement nearly all the major techniques that you can use to understand your text data, and take a complete(ish) tour of the Python tools that get the job done.
Before we start: Dataset and Dependencies
If you want to follow the analysis step-by-step you may want to install the following libraries:
pandas matplotlib numpy
nltk seaborn sklearn gensim pyldavis
wordcloud textblob spacy textstat
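The exact install command isn't shown here, but all of these libraries are available from PyPI; note that on PyPI sklearn is published as scikit-learn and pyldavis as pyLDAvis:

pip install pandas matplotlib numpy nltk seaborn scikit-learn gensim pyLDAvis wordcloud textblob spacy textstat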
Now, we can take a look at the data.
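The loading step isn't shown in the original, so here is a minimal sketch of how the data could be read in. It assumes the headlines come as a CSV with publish_date and headline_text columns (for example, the "A Million News Headlines" dataset from ABC News); the file name is just a placeholder, so point it to your copy of the data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# the file name is an assumption; adjust it to wherever your copy lives
news = pd.read_csv('abcnews-date-text.csv')
# keep only the first 10,000 rows, as discussed below
news = news.head(10000)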
news.head(3)
For simplicity, I will be exploring the first 10,000 rows of this dataset. Since the headlines are sorted by publish_date, this actually covers roughly two months, from February 19, 2003 until April 7, 2003.
Ok, I think we are ready to start our data exploration!
Analyzing text statistics
Text statistics visualizations are simple but very insightful techniques. They include:
- word frequency analysis,
- sentence length analysis,
- average word length analysis,
- etc.
Those really help explore the fundamental characteristics of the text data.
To do so, we will be mostly using histograms (continuous data) and bar charts (categorical data).
First, I’ll take a look at the number of characters present in each sentence. This can give us a rough idea about the news headline length.
news['headline_text'].str.len().hist()
The histogram shows that news headlines range from 10 to 70 characters and generally fall between 25 and 55 characters.
Next, let's move on to the word level and plot the number of words appearing in each headline:

news['headline_text'].str.split().\
    map(lambda x: len(x)).\
    hist()
It is clear that the number of words in news headlines ranges from 2 to 12 and mostly falls between 5 and 7 words.
Next, let's check the average word length in each headline:

news['headline_text'].str.split().\
    apply(lambda x: [len(i) for i in x]).\
    map(lambda x: np.mean(x)).hist()
Does this mean that people use really short words in news headlines? Let's find out.
One reason why this may not be true is stopwords. Stopwords are the words that are most commonly used in any language, such as "the", "a", "an", etc. As these words are probably short, they may have caused the above graph to be left-skewed.
Analyzing the amount and the types of stopwords can give us some good insights into the data.
To get the corpus containing stopwords you can use the nltk library. NLTK contains stopwords from many languages. Since we are only dealing with English news, I will filter the English stopwords from the corpus.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = set(stopwords.words('english'))
Now, we’ll create the corpus.
new = news['headline_text'].str.split()
new = new.values.tolist()
corpus = [word for i in new for word in i]
from collections import defaultdict

# count how often each stopword appears in the corpus
dic = defaultdict(int)
for word in corpus:
    if word in stop:
        dic[word] += 1
Now, we can plot the top stopwords.
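The original chart code isn't shown, so below is a minimal sketch of one way to plot the ten most frequent stopwords from the dic counts built above (the helper variable top and the use of seaborn are my assumptions):

import seaborn as sns

# ten most frequent stopwords, sorted by count
top = sorted(dic.items(), key=lambda x: x[1], reverse=True)[:10]
x, y = zip(*top)
sns.barplot(x=list(y), y=list(x))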
We can clearly see that stopwords such as "to", "in" and "for" dominate the news headlines.
Next, let's look at the most frequent words other than stopwords. We will use the Counter class from the collections module to count and store the occurrences of each word in a list of tuples. This is a very useful helper when we deal with word-level analysis in natural language processing.
from collections import Counter

counter = Counter(corpus)
most = counter.most_common()
x, y = [], []
for word, count in most[:40]:
    if word not in stop:
        x.append(word)
        y.append(count)
sns.barplot(x=y, y=x)
Wow! "us", "Iraq" and "war" dominate the headlines in this sample.
Ngram exploration
An n-gram is a contiguous sequence of n words. If the number of words is two, it is called a bigram; for three words, it is called a trigram, and so on.
Looking at most frequent n-grams can give you a better understanding of the context in which the word was used.
To implement n-grams we will use the ngrams function from nltk.util. For example:

from nltk.util import ngrams

list(ngrams(['I', 'went', 'to', 'the', 'river', 'bank'], 2))
# [('I', 'went'), ('went', 'to'), ('to', 'the'), ('the', 'river'), ('river', 'bank')]
To build a representation of our vocabulary we will use CountVectorizer. CountVectorizer is a simple way to tokenize, vectorize and represent the corpus in an appropriate form. It is available in sklearn.feature_extraction.text.
So with all this, we will analyze the top bigrams in our news headlines.
from sklearn.feature_extraction.text import CountVectorizer

# helper that returns the 10 most frequent n-grams in the given texts
def get_top_ngram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:10]
top_n_bigrams = get_top_ngram(news['headline_text'], 2)
x, y = map(list, zip(*top_n_bigrams))
sns.barplot(x=y, y=x)
How about trigrams?
top_tri_grams = get_top_ngram(news['headline_text'], 3)
x, y = map(list, zip(*top_tri_grams))
sns.barplot(x=y, y=x)
Topic Modeling exploration with pyLDAvis
Latent Dirichlet Allocation (LDA) is an easy to use and efficient model for topic modeling. Each document is represented by the distribution of topics and each topic is represented by the distribution of words.
Once we categorize our documents in topics we can dig into further data exploration for each topic or topic group.
But before getting into topic modeling we have to pre-process our data a little. We will:
- tokenize: the process by which sentences are converted to a list of tokens or words.
- remove stopwords
- lemmatize: reduces the inflectional forms of each word into a common base or root.
- convert to bag of words: a bag of words is a dictionary where the keys are words (or n-grams/tokens) and the values are the number of times each word occurs in the corpus.
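To make the bag-of-words idea concrete, here is a tiny, purely illustrative sketch built with collections.Counter (the sentence and variable names are made up for this example):

from collections import Counter

toy_tokens = ['war', 'protest', 'war', 'rally', 'war']
bag = Counter(toy_tokens)
print(bag)  # Counter({'war': 3, 'protest': 1, 'rally': 1})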
With NLTK you can tokenize and lemmatize easily:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')

def preprocess_news(df):
    corpus = []
    lem = WordNetLemmatizer()
    for headline in df['headline_text']:
        # tokenize, drop stopwords, and lemmatize each headline
        words = [w for w in word_tokenize(headline) if w not in stop]
        words = [lem.lemmatize(w) for w in words]
        corpus.append(words)
    return corpus

corpus = preprocess_news(news)
Now, let’s create the bag of words model using gensim
import gensim

dic = gensim.corpora.Dictionary(corpus)
bow_corpus = [dic.doc2bow(doc) for doc in corpus]
and we can finally create the LDA model:
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=4,
                                       id2word=dic,
                                       passes=10,
                                       workers=2)
lda_model.show_topics()
You can print all the topics and try to make sense of them but there are tools that can help you run this data exploration more efficiently. One such tool is pyLDAvis which visualizes the results of LDA interactively.
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dic)
vis
On the left side, the area of each circle represents the importance of the topic relative to the corpus. As there are four topics, we have four circles.
- The distance between the centers of the circles indicates the similarity between the topics. Here you can see that topic 3 and topic 4 overlap, which indicates that these topics are more similar.
- On the right side, the histogram of each topic shows the top 30 relevant words. For example, in topic 1 the most relevant words are police, new, may, war, etc.
So in our case, we can see a lot of words and topics associated with war in the news headlines.
Wordcloud
Word clouds are a great way to represent text data: the size of each word indicates its frequency or importance. Creating a word cloud in Python is easy, but we need the data in the form of a corpus. Luckily, I prepared it in the previous section.
from wordcloud import WordCloud, STOPWORDS

stopwords = set(STOPWORDS)

def show_wordcloud(data):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=100,
        max_font_size=30,
        scale=3,
        random_state=1)
    wordcloud = wordcloud.generate(str(data))

    plt.figure(figsize=(12, 12))
    plt.axis('off')
    plt.imshow(wordcloud)
    plt.show()

show_wordcloud(corpus)
There are many parameters that can be adjusted. Some of the most prominent ones are:
- stopwords: The set of words that are blocked from appearing in the image.
- max_words: Indicates the maximum number of words to be displayed.
- max_font_size: maximum font size.
There are many more options to create beautiful word clouds. For more details, you can refer to the wordcloud library's documentation.
Sentiment analysis
There are many projects that will help you do sentiment analysis in Python. I personally like TextBlob and VADER Sentiment.
Textblob
The sentiment function of TextBlob returns two properties:
- polarity: a floating-point number that lies in the range of [-1, 1], where 1 means a positive statement and -1 a negative one.
- subjectivity: refers to how someone’s judgment is shaped by personal opinions and feelings. Subjectivity is represented as a floating-point value which lies in the range of [0,1].
I will run this function on our news headlines.
from textblob import TextBlob

TextBlob('100 people killed in Iraq').sentiment
Now that we know how to calculate those sentiment scores we can visualize them using a histogram and explore data even further.
def polarity(text):
    return TextBlob(text).sentiment.polarity

news['polarity_score'] = news['headline_text'].apply(lambda x: polarity(x))
news['polarity_score'].hist()
Let’s dig a bit deeper by classifying the news as negative, positive and neutral based on the scores.
def sentiment(x):
    if x < 0:
        return 'neg'
    elif x == 0:
        return 'neu'
    else:
        return 'pos'

news['polarity'] = news['polarity_score'].map(lambda x: sentiment(x))

plt.bar(news.polarity.value_counts().index,
        news.polarity.value_counts())
Yep, 70% of the headlines are neutral, with only 18% positive and 11% negative.
Let's take a look at a few of the positive headlines:

news[news['polarity']=='pos']['headline_text'].head()

and a few of the negative ones:

news[news['polarity']=='neg']['headline_text'].head()
Vader Sentiment Analysis
VADER, or Valence Aware Dictionary and Sentiment Reasoner, is a rule- and lexicon-based, open-source sentiment analysis library distributed under the MIT license.
The VADER sentiment analyzer returns a dictionary that contains scores for how negative, neutral and positive the text is. We can then pick the sentiment with the highest score.
We will do the same analysis using VADER and check if there is much difference.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

def get_vader_score(sent):
    # polarity_scores returns a dictionary with 'neg', 'neu', 'pos' and 'compound' keys
    ss = sid.polarity_scores(sent)
    # pick the index of the highest of the neg/neu/pos scores (0, 1 or 2)
    return np.argmax(list(ss.values())[:-1])

news['polarity'] = news['headline_text'].map(lambda x: get_vader_score(x))
polarity = news['polarity'].replace({0: 'neg', 1: 'neu', 2: 'pos'})

plt.bar(polarity.value_counts().index,
        polarity.value_counts())
Yep, there is a slight difference in the distribution: even more headlines are classified as neutral (85%) and the number of negative news headlines has increased (to 13%).
Named Entity Recognition
Named entity recognition is an information extraction method in which entities present in the text are classified into predefined types such as person, place or organization. For example, in a news article about the Reserve Bank of India, a good NER model should be able to identify entities such as RBI as an organization, and Mumbai and India as places.
There are a few standard libraries for Named Entity Recognition, such as NLTK, spaCy and Stanford NER.
In this tutorial, I will use spaCy, an open-source library for advanced natural language processing. It is written in Cython and is known for its industrial applications. Besides NER, spaCy provides many other functionalities like POS tagging, word-to-vector transformations, and more.
SpaCy's named entity recognition models have been trained on the OntoNotes 5 corpus and support entity types such as PERSON, ORG, GPE (countries, cities, states), DATE, EVENT, and more.
To use it we have to download it first:
python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('India and Iran have agreed to boost the economic viability of the strategic Chabahar port '
          'through various measures, including larger subsidies to merchant shipping firms using the facility, '
          'people familiar with the development said on Thursday.')
[(x.text, x.label_) for x in doc.ents]
We can also visualize the output using the displacy module in spaCy.
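The exact rendering call isn't shown in the original, but a minimal sketch for highlighting the entities in the doc from the previous snippet could look like this:

from spacy import displacy

# highlight the named entities inline (works in a Jupyter notebook)
displacy.render(doc, style='ent', jupyter=True)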
Now that we know how to perform NER we can explore the data even further by doing a variety of visualizations on the named entities extracted from our dataset.
First, we will run the named entity recognition on our news headlines and store the entity types.
def ner(text):
    doc = nlp(text)
    return [X.label_ for X in doc.ents]

ent = news['headline_text'].apply(lambda x: ner(x))
ent = [x for sub in ent for x in sub]

counter = Counter(ent)
count = counter.most_common()
Now, we can visualize the entity frequencies:
x, y = map(list, zip(*count))
sns.barplot(x=y, y=x)
We can also visualize the most common tokens per entity. Let’s check which places appear the most in news headlines.
def ner(text, ent='GPE'):
    doc = nlp(text)
    return [X.text for X in doc.ents if X.label_ == ent]

gpe = news['headline_text'].apply(lambda x: ner(x, 'GPE'))
gpe = [i for x in gpe for i in x]
counter = Counter(gpe)

x, y = map(list, zip(*counter.most_common(10)))
sns.barplot(x=y, y=x)
Similarly, we can check which persons appear most often in the headlines:

per = news['headline_text'].apply(lambda x: ner(x, 'PERSON'))
per = [i for x in per for i in x]
counter = Counter(per)

x, y = map(list, zip(*counter.most_common(10)))
sns.barplot(x=y, y=x)
Exploration through Part of Speech Tagging in Python

Part of speech (POS) tagging is a method that assigns part of speech labels to the words in a sentence. There are eight main parts of speech:
- Noun (NN)- Joseph, London, table, cat, teacher, pen, city
- Verb (VB)- read, speak, run, eat, play, live, walk, have, like, are, is
- Adjective(JJ)- beautiful, happy, sad, young, fun, three
- Adverb(RB)- slowly, quietly, very, always, never, too, well, tomorrow
- Preposition (IN)- at, on, in, from, with, near, between, about, under
- Conjunction (CC)- and, or, but, because, so, yet, unless, since, if
- Pronoun(PRP)- I, you, we, they, he, she, it, me, us, them, him, her, this
- Interjection (INT)- Ouch! Wow! Great! Help! Oh! Hey! Hi!
This is not a straightforward task, as the same word may be used in different sentences in different contexts. However, once you do it, there are a lot of helpful visualizations that you can create that can give you additional insights into your dataset.
I will use NLTK to do the part of speech tagging, but there are other libraries that do a good job too (spaCy, TextBlob).
Let’s look at an example.
nltk.download('averaged_perceptron_tagger')

sentence = "The greatest comeback stories in 2019"
tokens = word_tokenize(sentence)
nltk.pos_tag(tokens)
You can also visualize the sentence parts of speech and its dependency graph with spacy.displacy module.
doc = nlp('The greatest comeback stories in 2019')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
We can observe various dependency tags here. For example, the det tag denotes the relationship between the determiner "The" and the noun "stories".
You can check the list of dependency tags and their meanings in the spaCy documentation.
def pos(text):
    pos = nltk.pos_tag(word_tokenize(text))
    pos = list(map(list, zip(*pos)))[1]
    return pos

tags = news['headline_text'].apply(lambda x: pos(x))
tags = [x for l in tags for x in l]

counter = Counter(tags)
x, y = list(map(list, zip(*counter.most_common(7))))
sns.barplot(x=y, y=x)
You can dig deeper into this by investigating which singular nouns (NN) occur most commonly in news headlines. Let's find out.
def get_nouns(text):
    nouns = []
    pos = nltk.pos_tag(word_tokenize(text))
    for word, tag in pos:
        if tag == 'NN':
            nouns.append(word)
    return nouns

words = news['headline_text'].apply(lambda x: get_nouns(x))
words = [x for l in words for x in l]

counter = Counter(words)
x, y = list(map(list, zip(*counter.most_common(10))))
sns.barplot(x=y, y=x)
Exploring through text complexity
You can actually put a number called readability index on a document or text. Readability index is a numeric value that indicates how difficult (or easy) it is to read and understand a text.
There are many readability score formulas available for the English language. Some of the most prominent ones are:
| Readability Test | Interpretation | Formula |
|---|---|---|
| Automated Readability Index (ARI) | The output approximates the U.S. grade level needed to comprehend the text. | ARI = 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43 |
| Flesch Reading Ease (FRE) | Higher scores indicate material that is easier to read, lower numbers mark harder-to-read passages: 0-30 college, 50-60 high school, 60+ fourth grade. | FRE = 206.835 - 1.015 * (total words/total sentences) - 84.6 * (total syllables/total words) |
| Flesch-Kincaid Grade Level (FKGL) | The result is a number that corresponds to a U.S. grade level. | FKGL = 0.39 * (total words/total sentences) + 11.8 * (total syllables/total words) - 15.59 |
| Gunning Fog Index (GFI) | The result is a number that corresponds to a U.S. grade level. | GFI = 0.4 * ((words/sentences) + 100 * (complex words/words)) |
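All of these scores (and more) are implemented in the textstat library. As a quick, illustrative sketch (the sample sentence is made up):

import textstat

text = 'The quick brown fox jumps over the lazy dog.'
print(textstat.flesch_reading_ease(text))
print(textstat.automated_readability_index(text))
print(textstat.flesch_kincaid_grade(text))
print(textstat.gunning_fog(text))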
Let's compute the Flesch Reading Ease score for every headline and plot its histogram:

reading = news['headline_text'].apply(lambda x: textstat.flesch_reading_ease(x))
reading.hist()
We can also take a closer look at the headlines with very low readability scores (the threshold of 5 is just an example):

x = [i for i in range(len(reading)) if reading[i] < 5]
news.iloc[x]['headline_text'].head()
Final Thoughts
Hopefully, you will find some of these techniques useful in your current and future projects.
To make data exploration even easier, I have created an "Exploratory Data Analysis for Natural Language Processing Template" that you can use for your work.
Happy exploring!