Before we start: Dataset and Dependencies
pip install pandas matplotlib numpy nltk seaborn scikit-learn gensim pyLDAvis wordcloud textblob spacy textstat
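The snippets below also assume a handful of standard imports; a minimal set that covers the data loading, counting, and plotting code used throughout looks like this:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter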
news = pd.read_csv('data/abcnews-date-text.csv', nrows=10000)
news.head(3)
Ok, I think we are ready to start our data exploration!
Analyzing text statistics
- word frequency analysis,
- sentence length analysis,
- average word length analysis,
- etc.
To do so, we will be mostly using histograms (continuous data) and bar charts (categorical data).
First, I’ll take a look at the number of characters present in each sentence. This can give us a rough idea about the news headline length.
news['headline_text'].str.len().hist()
Next, let's look at the number of words in each headline:

news['headline_text'].str.split().map(lambda x: len(x)).hist()
Finally, the average word length per headline:

news['headline_text'].str.split().apply(lambda x: [len(i) for i in x]).map(lambda x: np.mean(x)).hist()
Analyzing the amount and the types of stopwords can give us some good insights into the data.
To get a corpus of stopwords you can use the NLTK library, which ships stopword lists for many languages. Since we are only dealing with English news, I will load the English stopwords from the corpus.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop = set(stopwords.words('english'))

new = news['headline_text'].str.split().values.tolist()
corpus = [word for headline in new for word in headline]
from collections import defaultdict

dic = defaultdict(int)
for word in corpus:
    if word in stop:
        dic[word] += 1
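To see which stopwords dominate, a quick bar chart of the ten most frequent ones works well; this is a minimal matplotlib sketch:

top = sorted(dic.items(), key=lambda item: item[1], reverse=True)[:10]
x, y = zip(*top)
plt.bar(x, y)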
Next, let's look at the most frequent words once the stopwords are excluded:

counter = Counter(corpus)
most = counter.most_common()

x, y = [], []
for word, count in most[:40]:
    if word not in stop:
        x.append(word)
        y.append(count)

sns.barplot(x=y, y=x)
N-gram exploration
To implement n-grams we will use the ngrams function from nltk.util. For example:
from nltk.util import ngrams
list(ngrams(['I', 'went', 'to', 'the', 'river', 'bank'], 2))
So with all this, we will analyze the top bigrams in our news headlines.
from sklearn.feature_extraction.text import CountVectorizer

def get_top_ngram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:10]
top_n_bigrams = get_top_ngram(news['headline_text'], 2)[:10]
x, y = map(list, zip(*top_n_bigrams))
sns.barplot(x=y, y=x)
top_tri_grams = get_top_ngram(news['headline_text'], n=3)
x, y = map(list, zip(*top_tri_grams))
sns.barplot(x=y, y=x)
Topic Modeling exploration with pyLDAvis
Once we categorize our documents into topics, we can dig deeper into the data for each topic or topic group.
But before getting into topic modeling we have to pre-process our data a little. We will:
- tokenize: the process by which sentences are converted to a list of tokens or words.
- remove stopwords
- lemmatize: reduces the inflectional forms of each word into a common base or root.
- convert to a bag of words: a bag of words is a dictionary where the keys are words (or n-grams/tokens) and the values are the number of times each word occurs.
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')

def preprocess_news(df):
    corpus = []
    lem = WordNetLemmatizer()
    for headline in df['headline_text']:
        words = [w for w in word_tokenize(headline) if w not in stop]
        words = [lem.lemmatize(w) for w in words if len(w) > 2]
        corpus.append(words)
    return corpus
import gensim

corpus = preprocess_news(news)
dic = gensim.corpora.Dictionary(corpus)
bow_corpus = [dic.doc2bow(doc) for doc in corpus]
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=4,
                                       id2word=dic,
                                       passes=10,
                                       workers=2)
lda_model.show_topics()
import pyLDAvis
import pyLDAvis.gensim  # pyLDAvis.gensim_models in newer pyLDAvis releases

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dic)
vis
- The distance between the centers of the circles indicates how similar the topics are. Here, topic 3 and topic 4 overlap, which suggests they are closely related.
- On the right side, the histogram of each topic shows the top 30 relevant words. For example, in topic 1 the most relevant words are police, new, may, war, etc.
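If you want to keep the interactive view outside of a notebook, pyLDAvis can also export it to a standalone HTML file (the file name here is just an example):

pyLDAvis.save_html(vis, 'lda_topics.html')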
Wordcloud
from wordcloud import WordCloud, STOPWORDS

stopwords = set(STOPWORDS)

def show_wordcloud(data):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=100,
        max_font_size=30,
        scale=3,
        random_state=1)
    wordcloud = wordcloud.generate(str(data))
    plt.figure(1, figsize=(12, 12))
    plt.axis('off')
    plt.imshow(wordcloud)
    plt.show()

show_wordcloud(corpus)
- stopwords: The set of words that are blocked from appearing in the image.
- max_words: Indicates the maximum number of words to be displayed.
- max_font_size: maximum font size.
Sentiment analysis
TextBlob
- polarity: a floating-point number in the range [-1, 1], where 1 means a positive statement and -1 a negative one.
- subjectivity: refers to how someone’s judgment is shaped by personal opinions and feelings. Subjectivity is represented as a floating-point value which lies in the range of [0,1].
from textblob import TextBlob
TextBlob('100 people killed in Iraq').sentiment
def polarity(text):
    return TextBlob(text).sentiment.polarity
news['polarity_score'] = news['headline_text'].apply(lambda x: polarity(x))
news['polarity_score'].hist()
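The bullets above also describe subjectivity; the same pattern used for polarity works for the subjectivity score if you want to explore it as well (an optional extra, not required for the rest of the analysis):

news['subjectivity_score'] = news['headline_text'].apply(lambda x: TextBlob(x).sentiment.subjectivity)
news['subjectivity_score'].hist()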
def sentiment(x):
    if x < 0:
        return 'neg'
    elif x == 0:
        return 'neu'
    else:
        return 'pos'
news['polarity'] = news['polarity_score'].map(lambda x: sentiment(x))
plt.bar(news.polarity.value_counts().index,
news.polarity.value_counts())
news[news['polarity']=='pos']['headline_text'].head()
news[news['polarity']=='neg']['headline_text'].head()
VADER Sentiment Analysis
The VADER sentiment analyzer returns a dictionary with scores for how negative, neutral, and positive the text is, plus a compound score. We can then pick the sentiment with the highest score.
We will do the same analysis using VADER and check if there is much difference.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

def get_vader_score(sent):
    # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' keys
    ss = sid.polarity_scores(sent)
    # index of the largest of the first three scores: 0 = neg, 1 = neu, 2 = pos
    return np.argmax(list(ss.values())[:-1])
news['polarity'] = news['headline_text'].map(lambda x: get_vader_score(x))
polarity = news['polarity'].replace({0: 'neg', 1: 'neu', 2: 'pos'})
plt.bar(polarity.value_counts().index,
polarity.value_counts())
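As an aside, a common alternative convention with VADER (not used in this walkthrough) is to threshold the compound score instead of comparing the three class scores; a minimal sketch:

def vader_label(sent, threshold=0.05):
    # compound is a normalized score in [-1, 1]; +/-0.05 is the commonly used cut-off
    compound = sid.polarity_scores(sent)['compound']
    if compound >= threshold:
        return 'pos'
    elif compound <= -threshold:
        return 'neg'
    return 'neu'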
Named Entity Recognition
In this tutorial, I will use spaCy, an open-source library for advanced natural language processing. It is written in Cython and is known for its industrial applications. Besides NER, spaCy provides many other functionalities like POS tagging, word-to-vector transformation, and more.
spaCy's named entity recognition has been trained on the OntoNotes 5 corpus and supports entity types such as PERSON, NORP, ORG, GPE, LOC, DATE, TIME, MONEY, PERCENT, and CARDINAL, among others.
python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('India and Iran have agreed to boost the economic viability '
          'of the strategic Chabahar port through various measures, '
          'including larger subsidies to merchant shipping firms using the facility, '
          'people familiar with the development said on Thursday.')
[(x.text,x.label_) for x in doc.ents]
from spacy import displacy
displacy.render(doc, style='ent')
First, we will run the named entity recognition on our news headlines and store the entity types.
def ner(text):
    doc = nlp(text)
    return [X.label_ for X in doc.ents]

ent = news['headline_text'].apply(lambda x: ner(x))
ent = [x for sub in ent for x in sub]

counter = Counter(ent)
count = counter.most_common()

x, y = map(list, zip(*count))
sns.barplot(x=y, y=x)
def ner(text, ent="GPE"):
    doc = nlp(text)
    return [X.text for X in doc.ents if X.label_ == ent]
gpe = news['headline_text'].apply(lambda x: ner(x))
gpe = [i for x in gpe for i in x]
counter = Counter(gpe)
x, y = map(list, zip(*counter.most_common(10)))
sns.barplot(x=y, y=x)
per = news['headline_text'].apply(lambda x: ner(x, "PERSON"))
per = [i for x in per for i in x]
counter = Counter(per)
x, y = map(list, zip(*counter.most_common(10)))
sns.barplot(x=y, y=x)
Exploration through Part of Speech Tagging in Python
- Noun (NN): Joseph, London, table, cat, teacher, pen, city
- Verb (VB): read, speak, run, eat, play, live, walk, have, like, are, is
- Adjective (JJ): beautiful, happy, sad, young, fun, three
- Adverb (RB): slowly, quietly, very, always, never, too, well, tomorrow
- Preposition (IN): at, on, in, from, with, near, between, about, under
- Conjunction (CC): and, or, but, because, so, yet, unless, since, if
- Pronoun (PRP): I, you, we, they, he, she, it, me, us, them, him, her, this
- Interjection (UH): Ouch! Wow! Great! Help! Oh! Hey! Hi!
I will use NLTK to do the part of speech tagging, but there are other libraries that do a good job too (spaCy, TextBlob).
Let’s look at an example.
import nltk
nltk.download('averaged_perceptron_tagger')

sentence = "The greatest comeback stories in 2019"
tokens = word_tokenize(sentence)
nltk.pos_tag(tokens)
doc = nlp('The greatest comeback stories in 2019')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
We can observe various dependency tags here. For example, the det tag denotes the relationship between the determiner "the" and the noun "stories".
You can check the list of dependency tags and their meanings here.
def pos(text):
    pos = nltk.pos_tag(word_tokenize(text))
    pos = list(map(list, zip(*pos)))[1]
    return pos
tags = news['headline_text'].apply(lambda x: pos(x))
tags = [x for l in tags for x in l]
counter = Counter(tags)
x,y=list(map(list,zip(*counter.most_common(7))))
sns.barplot(x=y,y=x)
def get_nouns(text):
    # collect singular nouns (Penn Treebank tag 'NN') from the text
    nouns = []
    pos = nltk.pos_tag(word_tokenize(text))
    for word, tag in pos:
        if tag == 'NN':
            nouns.append(word)
    return nouns
words = news['headline_text'].apply(lambda x: get_nouns(x))
words = [x for l in words for x in l]
counter = Counter(words)
x,y=list(map(list,zip(*counter.most_common(7))))
sns.barplot(x=y,y=x)
Exploring through text complexity
There are many readability score formulas available for the English language. Some of the most prominent ones are:
| Readability Test | Interpretation | Formula |
| --- | --- | --- |
| Automated Readability Index (ARI) | The output is an approximate representation of the U.S. grade level needed to comprehend a text. | ARI = 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43 |
| Flesch Reading Ease (FRE) | Higher scores indicate material that is easier to read, lower numbers mark harder-to-read passages: 0-30 college, 50-60 high school, 60+ fourth grade. | FRE = 206.835 - 1.015 * (total words/total sentences) - 84.6 * (total syllables/total words) |
| Flesch-Kincaid Grade Level (FKGL) | The result is a number that corresponds with a U.S. grade level. | FKGL = 0.39 * (total words/total sentences) + 11.8 * (total syllables/total words) - 15.59 |
| Gunning Fog Index (GFI) | The result is a number that corresponds with a U.S. grade level. | GFI = 0.4 * ((words/sentence) + 100 * (complex words/words)) |
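To make the formulas above a little more concrete, here is a small hand-rolled sketch of the ARI formula from the table (the character, word, and sentence counts are deliberately naive; a library like textstat is more careful):

def automated_readability_index(text):
    # naive counts: non-space characters, whitespace-separated words, '.'-terminated sentences
    words = text.split()
    characters = sum(len(w) for w in words)
    sentences = max(text.count('.'), 1)
    return 4.71 * (characters / len(words)) + 0.5 * (len(words) / sentences) - 21.43

automated_readability_index('The greatest comeback stories in 2019.')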
from textstat import flesch_reading_ease

reading = news['headline_text'].apply(lambda x: flesch_reading_ease(x))
reading.hist()
Let's look at some of the headlines with very low readability scores:

x = [i for i in range(len(reading)) if reading[i] < 5]
news.iloc[x]['headline_text'].head()
Final Thoughts
To make data exploration even easier, I have created an "Exploratory Data Analysis for Natural Language Processing Template" that you can use for your work.
Get Exploratory Data Analysis for Natural Language Processing Template