As soon as you start working on a data science task, you realize how much your results depend on data quality. Data preparation, the initial step of any data science project, lays the groundwork for any sophisticated algorithm to perform well.
In textual data science tasks, this means that any raw text needs to be carefully preprocessed before the algorithm can digest it. In the most general terms, we take some predetermined body of text and perform upon it some basic analysis and transformations, in order to be left with artefacts which will be much more useful for a more meaningful analytic task afterward.
The preprocessing usually consists of several steps that depend on a given task and the text, but can be roughly categorized into segmentation, cleaning, normalization, annotation and analysis.
- Segmentation, lexical analysis, or tokenization, is the process that splits longer strings of text into smaller pieces, or tokens. Chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc.
- Cleaning consists of getting rid of the less useful parts of text through stop-word removal, handling capitalization, special characters and other details.
- Normalization consists of the translation (mapping) of terms to a single scheme, or of linguistic reductions through stemming, lemmatization and other forms of standardization.
- Annotation consists of the application of a scheme to texts. Annotations may include labeling, adding markups, or part-of-speech tagging.
- Analysis means statistically probing, manipulating and generalizing from the dataset in order to analyze features and extract relationships between words.
Segmentation
Sometimes segmentation is used to refer to the breakdown of a text into pieces larger than words, such as paragraphs and sentences, while tokenization is reserved for the breakdown process which results exclusively in words.
This may sound like a straightforward process, but in reality it is anything but. Do you need a sentence or a phrase? And what is a phrase, then? How are sentences identified within larger bodies of text? School grammar suggests that sentences end with “sentence-ending punctuation”. But to a machine, a period looks the same whether it closes an abbreviation or a sentence.
“Shall we call Mr. Brown?” can easily be split into two sentences if abbreviations are not handled.
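As a minimal sketch, NLTK's pre-trained Punkt sentence tokenizer handles common abbreviations like “Mr.” (spaCy and other libraries offer similar functionality; the example text is just an illustration):

```python
import nltk
nltk.download("punkt", quiet=True)  # pre-trained sentence tokenizer models (downloaded once)

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Shall we call Mr. Brown? He should be home by now."

sentences = sent_tokenize(text)       # the abbreviation "Mr." does not trigger a split
tokens = word_tokenize(sentences[0])  # the first sentence broken into word-level tokens

print(sentences)  # ['Shall we call Mr. Brown?', 'He should be home by now.']
print(tokens)     # ['Shall', 'we', 'call', 'Mr.', 'Brown', '?']
```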
Cleaning
The process of cleaning helps put all text on equal footing, involving relatively simple ideas of substitution or removal:
- setting all characters to lowercase
- noise removal, including removing numbers and punctuation (it is a part of tokenization, but still worth keeping in mind at this stage)
- stop words removal (language-specific)
Lowercasing
Text often has a variety of capitalization reflecting the beginning of sentences, proper nouns or emphasis. The common approach is to reduce everything to lowercase for simplicity. Lowercasing is applicable to most text mining and NLP tasks and significantly helps with consistency of the output. However, it is important to remember that some words, like “US” and “us”, can change meaning when reduced to lowercase.
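For illustration, lowercasing is a single call in Python, but note how a meaningful distinction can disappear:

```python
print("The US government wrote to us.".lower())
# 'the us government wrote to us.' -- "US" and "us" are now indistinguishable
```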
Noise Removal
Noise removal refers to removing characters, digits and pieces of text that can interfere with the text analysis. There are various ways to remove noise, including punctuation removal, special character removal, number removal, HTML formatting removal, domain-specific keyword removal, source code removal, and more. Noise removal is highly domain dependent. For example, in tweets, noise could be all special characters except hashtags, as they signify concepts that can characterize a tweet. We should also remember that strategies may vary depending on the specific task: for example, numbers can be either removed or converted to textual representations.
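A rough sketch of regex-based noise removal is shown below; the patterns are assumptions and should be adapted to the domain and task:

```python
import re

def remove_noise(text: str) -> str:
    """Illustrative cleanup: strip HTML tags, numbers and punctuation.
    What counts as noise is domain dependent, so treat these patterns
    as placeholders rather than a universal recipe."""
    text = re.sub(r"<[^>]+>", " ", text)      # HTML formatting
    text = re.sub(r"\d+", " ", text)          # numbers (alternatively, convert them to words)
    text = re.sub(r"[^\w\s#]", " ", text)     # punctuation and special characters, keeping hashtags
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

print(remove_noise("<p>Meeting at 10am!!! #nlp</p>"))  # -> 'Meeting at am #nlp'
```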
Stop-word removal
Stop words are a set of commonly used words in a language, such as “a”, “the”, “is” and “are” in English. These words do not carry important meaning and are removed from texts in many data science tasks. The intuition behind this approach is that, by removing low-information words from text, we can focus on the important words instead. Besides, it reduces the number of features in consideration, which helps keep models smaller. Stop-word removal is commonly applied in search systems, text classification applications, topic modeling, topic extraction and others. Stop-word lists can come from pre-established sets, or you can create a custom one for your domain.
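A minimal sketch using NLTK's pre-built English stop-word list (the sample sentence is only for illustration):

```python
import nltk
nltk.download("stopwords", quiet=True)  # pre-established stop-word lists for several languages

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))  # can be extended with domain-specific terms
tokens = "this is a sample sentence showing off stop word filtration".split()
filtered = [t for t in tokens if t not in stop_words]

print(filtered)  # e.g. ['sample', 'sentence', 'showing', 'stop', 'word', 'filtration']
```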
Normalization
Normalization puts all words on equal footing and allows processing to proceed uniformly. It is closely related to cleaning, but takes the process a step further by stemming and lemmatizing words.
Stemming
Stemming is the process of eliminating affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain a word stem. The results can be used to identify relationships and commonalities across large datasets. There are several stemming models, including Porter and Snowball. The danger here lies in the possibility of overstemming where words like “universe” and “university” are reduced to the same root of “univers”.
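A sketch with the Porter and Snowball implementations shipped in NLTK illustrates both useful stemming and the overstemming problem:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # a later refinement of the Porter algorithm

for word in ["connected", "connection", "universe", "university"]:
    print(f"{word}: porter={porter.stem(word)}, snowball={snowball.stem(word)}")

# "connected" and "connection" usefully share the stem "connect",
# but "universe" and "university" are both reduced to "univers" (overstemming).
```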
Lemmatization
Lemmatization is related to stemming, but it is able to capture canonical forms based on a word’s lemma. By determining the part of speech and utilizing special tools, like WordNet’s lexical database of English, lemmatization can get better results:
The stemmed form of leafs is: leaf
The stemmed form of leaves is: leav
The lemmatized form of leafs is: leaf
The lemmatized form of leaves is: leaf
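Those outputs can be reproduced, for instance, with NLTK's Porter stemmer and WordNet lemmatizer (the WordNet data needs to be downloaded once):

```python
import nltk
nltk.download("wordnet", quiet=True)  # WordNet lexical database used by the lemmatizer

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["leafs", "leaves"]:
    print(f"The stemmed form of {word} is: {stemmer.stem(word)}")
for word in ["leafs", "leaves"]:
    print(f"The lemmatized form of {word} is: {lemmatizer.lemmatize(word)}")
```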
Stemming may be more useful in queries for databases, whereas lemmatization may work much better when trying to determine text sentiment.
Annotation
Text annotation is a sophisticated and task-specific process of providing text with relevant markups. The most common and general practice is to add part-of-speech (POS) tags to the words.
Part-of-speech tagging
Understanding parts of speech can make a difference in determining the meaning of a sentence, as it provides more granular information about the words. For example, in a document classification problem, the appearance of the word book as a noun could result in a different classification than book as a verb. Part-of-speech tagging tries to assign a part of speech (such as noun, verb, adjective, and others) to each word of a given text based on its definition and the context. It often requires looking at the preceding and following words, combined with either a rule-based or a stochastic method.
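A small sketch with NLTK's default (stochastic) perceptron tagger; resource names can vary slightly between NLTK versions, and the example sentences are assumptions:

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)  # NLTK's default pre-trained tagger

from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("I want to book a flight")))
print(pos_tag(word_tokenize("She wrote a book about NLP")))
# "book" is typically tagged as a verb (VB) in the first sentence
# and as a noun (NN) in the second, thanks to the surrounding context.
```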
Analysis
Finally, before actual model training, we can explore the data to extract features that might be used in model building.
Count
This is perhaps one of the more basic tools for feature engineering. Adding such statistical information as word count, sentence count, punctuation counts and industry-specific word counts can greatly help in prediction or classification.
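For illustration, a few such features computed with plain Python; the feature set itself is just an example and should be tailored to the task:

```python
import string

def count_features(text: str) -> dict:
    """Toy statistical features; industry-specific word counts and other
    task-dependent statistics could be added in the same way."""
    words = text.split()
    return {
        "word_count": len(words),
        "char_count": len(text),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "sentence_count": text.count(".") + text.count("!") + text.count("?"),
    }

print(count_features("Prices rose 5% in March. Analysts were surprised!"))
```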
Chunking (shallow parsing)
Chunking is a process that identifies constituent parts of sentences, such as nouns, verbs and adjectives, and links them to higher-order units that have discrete grammatical meanings, for example, noun groups or phrases, verb groups, etc.
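A minimal noun-phrase chunking sketch with NLTK's regular-expression chunker; the grammar is a toy assumption covering only simple noun groups:

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

from nltk import pos_tag, word_tokenize, RegexpParser

# Toy grammar: a noun phrase is an optional determiner, any adjectives, then one or more nouns.
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

tagged = pos_tag(word_tokenize("The quick brown fox jumped over the lazy dog"))
tree = chunker.parse(tagged)

print(tree)  # noun groups such as "The quick brown fox" appear as NP subtrees
```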
Collocation extraction
Collocations are more or less stable word combinations, such as “break the rules,” “free time,” “draw a conclusion,” “keep in mind,” “get ready,” and so on. As they usually convey a specific established meaning it is worthwhile to extract them before the analysis.
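As a sketch, NLTK's collocation finders rank word pairs by association measures such as pointwise mutual information (the toy text below is only for illustration):

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = ("keep in mind that you should draw a conclusion only after you "
        "keep in mind all the facts and draw a conclusion carefully")
tokens = text.split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep only pairs that occur at least twice

print(finder.nbest(BigramAssocMeasures().pmi, 5))  # e.g. [('draw', 'a'), ('in', 'mind'), ...]
```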
Word Embedding/Text Vectors
Word embedding is the modern way of representing words as vectors, mapping high-dimensional word features into low-dimensional feature vectors. In other words, each word is placed at a coordinate in a vector space where related words, based on a corpus of relationships, end up closer together.
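A brief sketch with gensim's Word2Vec (parameter names assume gensim 4.x, and the tiny corpus is a placeholder; real embeddings need far more data):

```python
from gensim.models import Word2Vec

# Placeholder corpus: each sentence is a list of tokens produced by the preprocessing steps above.
sentences = [
    ["data", "science", "needs", "clean", "text"],
    ["text", "preprocessing", "helps", "data", "science"],
    ["clean", "text", "improves", "models"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["text"][:5])           # first five dimensions of the vector for "text"
print(model.wv.most_similar("text"))  # words placed closest in the embedding space
```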
Preparing a text for analysis is a complicated art which requires choosing optimal tools depending on the text properties and the task. There are multiple pre-built libraries and services for the most popular languages used in data science that help automate text pre-processing; however, certain steps will still require manually mapping terms, rules and words.