From simple to elaborate, with use cases, examples & code snippets
In NLP, a language model is a probability distribution over strings on an alphabet. In formal language theory, a language is a set of strings on an alphabet. The NLP version is a soft variant of the one in formal language theory.
The NLP version is better suited to modeling natural languages such as English or French. No hard rules dictate exactly which strings are in the language and which not. Rather we have observations to work with. People write. People talk. Their utterances characterize the language.
Importantly, the NLP statistical version is good for learning languages over strings from examples. Consider learning a model to recognize product names. The training set may contain iPhone 4 and iPhone 5 but not iPhone 12. It should recognize iPhone 12 as a product name.
In this post, we cover statistical language models from simple to elaborate. The covered models include: the independent model, first-order Markov model, Kth-order Markov model, Hidden Markov Model, position-specific independent model (chain HMM), Conditional Markov Model, and Conditional Random Fields. Each is illustrated with realistic examples and use cases.
Next Token Probabilities
We’ve defined a language as a probability distribution over strings. In many use cases, what we really want is the probability of the next symbol given the current string. It turns out these two are equivalent, as noted in [1].
Consider a string on an alphabet. (The alphabet may be composed of characters or words or other tokens.) Denote this string x1, x2, …, xn. We can write P(x1,x2,…,xn) as P(x1)*P(x2|x1)*…*P(xn|xn-1,xn-2,…,x1).
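In the other direction, the next-symbol probability follows from string probabilities by taking a ratio: P(xn | x1, x2, …, xn-1) = P(x1, x2, …, xn) / P(x1, x2, …, xn-1). So either view can be derived from the other.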
Use Cases
What can we do with a language model? Quite a lot.
Suggest auto-completes. Smartphones offer auto-complete suggestions as we type. Web search engines offer auto-complete suggestions as we start typing a query. Under-the-hood these are powered by language models.
Recognize handwriting. Imagine pointing your smartphone at your handwritten notes and asking them to be recognized. Perhaps to be digitized and searchable. Or at least to make them legible!
Handwriting recognition is challenging. Even people often get it wrong. Sometimes even their own writing!
Consider trying to recognize the words in a poorly-written text. A pure image processing approach can make a lot of errors. Adding a language model can cut these down a lot. The language model provides a useful context.
Example 1: Let’s glimpse this in a simple example. Imagine the OCR thought the next word was databasc. The e was misrecognized as a c. They do look similar.
Now let’s add a language model trained on English words. Specifically to assign high probabilities to strings that look like English words and low probabilities to those that don’t. bath would get a high probability, but not axbs.
This language model will know that database is much more likely than databasc. So adding it will help detect and correct the OCR’s error.
By adding a multi-word language model, we can improve the error-detection and correction accuracy further. This model’s alphabet is the lexicon of words. Its high-probability strings model high-probability word sequences. The alphabet becomes way larger. Every distinct word is in the alphabet. So care needs to be exercised to reap its benefits while avoiding the cost (the model getting overly complex). More on this later.
A multi-word language model can also help fill in missing words in the written text.
Detect and correct spelling errors. We’ll reinterpret the same example databasc. The last character, c, is now a spelling error.
Recognize speech. Similar reasoning holds here. Except that the modality is different. The underlying language being expressed is the same. This is not to say that accurate handwriting or speech recognition is easy. Just that adding a language model helps.
Recognize multi-token named entities. Multi-token named entities often have a language structure. As an example, in a person’s name, the first name word(s) typically appear before the last name words. Sandwiched between the two might be middle name word(s). As another example, in a US street address, the street number typically appears before the street name. As we will see later, such entities are often recognized via latent language models such as Hidden Markov Models.
Now that we have seen some use cases, let’s dive into
Models
We’ll denote the string as x1, x2, …, xn.
Independent. This is the simplest approach. It assumes that all characters in the string are generated independently.
P(x1, x2, …, xn) = Q(x1)*Q(x2)*…*Q(xn).
Here Q is a probability distribution over the alphabet.
This approach is often effective on long strings on a large alphabet. Such as documents with many words. The text is seen as a sequence of words. The alphabet is the lexicon of words.
In this setting, more elaborate approaches get complex very quickly so they need to work convincingly and significantly better to justify their added complexity.
This approach would be effective in detecting the document’s language. For each language, we would train a separate Q, a distribution over the words seen in the language’s training set. For a new document W expressed as the sequence of words W1 W2 … Wn, we would calculate PL(W) = QL[W1]*QL[W2]*…*QL[Wn] for every language L we have modeled. We would deem L* = argmax_L PL(W) as the language of W.
Example I1: Train on the sequence of words the dog is wagging the cat’s tail. We get Q(the)=2/7, Q(is) = 1/7, etc.
Python Code I1: Not tested, or even run. Plus, needs a runner.
from collections import Counter
import math  # math.prod requires Python 3.8+

class IndependentModel:
    def __init__(self):
        self.ctr = Counter()

    def train(self, words):
        for word in words:
            self.ctr[word] += 1

    def Q(self, word):
        # Probability of a single word: its count over the total count.
        return float(self.ctr[word]) / sum(self.ctr.values())

    def P(self, words):
        # Probability of the string: the product of the word probabilities.
        return math.prod(self.Q(word) for word in words)
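A quick usage sketch, on the text of Example I1:

model = IndependentModel()
model.train("the dog is wagging the cat's tail".split())
print(model.Q('the'))           # 2/7
print(model.P(['the', 'dog']))  # (2/7) * (1/7) = 2/49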
Exercise I1: Evolve this code snippet into a language recognizer. Say English vs French.
First-order Markov Model. Here, a symbol is allowed to depend on the previous one.
P(x1, x2, …, xn) = Q(x1)*Q(x2|x1)*…*Q(xn|xn-1)
A first-order Markov model is described by a state-transition matrix P, where pij is the probability of going from state i to state j. Denoting xt as i and xt+1 as j, Q(xt+1 | xt) equals pij.
This Markov model may also be expressed as a directed graph. There is an arc i → j if the value i can be followed by a value j. This arc has a probability p(j|i) on it. The graph representation merely makes explicit which state transition probabilities are 0.
Example 1M1: A first-order Markov model is effective in modeling short phrases. Say we want to discover salient short phrases (of 2-to-4 words each) from a large corpus of English documents. A first-order Markov model strikes a good balance: rich enough to do a reasonable job on this task, without becoming too complex.
Consider the text “data mining is a phrase. data mining is a method of extracting meaningful information from a large data set”.
From this text, we will build the Markov graph. This graph’s nodes are distinct words that appear in the text. There is an arc from node u to node v if word u is followed by word v at least once in the text. A few arcs are shown below.
data → mining
mining → is
method → of
extracting → meaningful
We see that P(data, mining) = Q(data)*P(mining | data) = (3/19)*(2/3) = 2/19. By contrast, P(a, method) = Q(a)*P(method | a) = (3/19)*(1/3) = 1/19 < P(data, mining). Unfortunately, P(is, a) = (2/19)*1 = 2/19 = P(data, mining). So (is, a) would be a false positive because it is not a salient phrase.
So the first-order Markov model on its own, while somewhat useful, also has glaring false positives. Adding grammatical information, specifically the part-of-speech tags of words, significantly improves its quality. That is, it removes most of the false positives while missing hardly any of the true ones. The intuition is that salient phrases are composed of words favoring certain parts of speech, especially nouns. Consider (data, mining). Both words are nouns. Consider (is, a). ‘is’ is a verb, ‘a’ an article.
We won’t go into detail on how to combine the two approaches. The main point is that the Markov model is still useful.
Python Code 1M1: Not tested, or even run. Plus, needs a runner.
from collections import defaultdict, Counter

class FirstOrderMarkovModel:
    def __init__(self):
        self.ctr2 = defaultdict(lambda: defaultdict(int))
        self.ctr1 = Counter()

    def train(self, words):
        # '_B_' is a sentinel marking the beginning of the string.
        words_with_sentinel = ['_B_'] + words
        for i in range(len(words_with_sentinel) - 1):
            self.ctr2[words_with_sentinel[i]][words_with_sentinel[i + 1]] += 1
            self.ctr1[words_with_sentinel[i]] += 1

    def Q(self, word, previous_word):
        # P(word | previous_word)
        return float(self.ctr2[previous_word][word]) / self.ctr1[previous_word]

    def P(self, words):
        p = self.Q(words[0], '_B_')
        for i in range(len(words) - 1):
            p *= self.Q(words[i + 1], words[i])
        return p
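A quick usage sketch, on the text of Example 1M1. (We split on whitespace, so the period stays glued to phrase.; this doesn’t affect the counts below.)

model = FirstOrderMarkovModel()
model.train('data mining is a phrase. data mining is a method of '
            'extracting meaningful information from a large data set'.split())
print(model.Q('mining', 'data'))  # 2/3: data is followed by mining twice, set once
print(model.Q('a', 'is'))         # 1.0: is is always followed by a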
Kth-order Markov Model. In this model, a character is allowed to depend on the previous K characters.
P(x1, x2, …, xn) = Q(x1)*Q(x2|x1)*…*Q(xn|xn-1,xn-2,…,xn-K)
With K set to 2 or 3, this approach can discover better salient phrases than a first-order Markov model without incurring a huge increase in complexity. What do we mean by “better”? Consider a phrase of length 3. Call it x1, x2, x3. The first-order Markov model loses information as it assumes x3 is independent of x1 given x2. The second-order model doesn’t. P(x1)*P(x2|x1)*P(x3|x2,x1) equals P(x1,x2,x3).
As before, the need for combining the statistical approach with the grammatical approach remains.
Example KM1: We’d like to build a second-order Markov model on three-word product names. Why second-order? Consider these made-up examples: 4k tv samsung, 3k tv sony, … Imagine that sony does not offer a 4k option. If we used a first-order Markov model, we would lose the interaction between the first word (4k or 3k or …) and the brand name. The second-order model captures this interaction. In other words, the brand name (the third word) is influenced not only by the second word (tv in our example) but also by the first one (4k or 3k or …).
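Here is a minimal sketch of a second-order model, under the same hedges as the earlier snippets (untested). The counting mirrors the first-order code, with two-word contexts:

from collections import defaultdict

class SecondOrderMarkovModel:
    def __init__(self):
        self.ctr3 = defaultdict(int)  # counts of (w1, w2, w3) triples
        self.ctr2 = defaultdict(int)  # counts of (w1, w2) context pairs

    def train(self, words):
        # Two begin sentinels so the first two words also get full contexts.
        padded = ['_B_', '_B_'] + words
        for i in range(len(padded) - 2):
            self.ctr3[(padded[i], padded[i + 1], padded[i + 2])] += 1
            self.ctr2[(padded[i], padded[i + 1])] += 1

    def Q(self, word, prev2, prev1):
        # P(word | prev2, prev1); assumes the context occurred in training.
        return self.ctr3[(prev2, prev1, word)] / self.ctr2[(prev2, prev1)]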
Hidden Markov Model
Here, strings are probabilistically generated from a model with latent variables, i.e., hidden states.
Notationally, let X = x1, …, xn denote a string of length n and S = s1, …, sn denote a string of states associated with X. Symbol xi is generated from state si.
X is called the observation sequence; S the (hidden) state sequence from which X was generated.
Structurally the HMM is a first-order Markov model over the set of states. Some states are connected to other states by arcs. In addition, each state has a probability distribution over the alphabet of observables.
This model generates a pair (X, S) as follows. First, it generates s1. Then it emits x1 from s1 with a certain probability. Next, it moves to state s2 with a certain probability. This probability depends only on where it came from, i.e. s1. What was emitted, i.e. x1, is immaterial. Next, it generates x2 from s2 with a certain probability. And so it goes.
In summary, the HMM generator alternates between generating states and generating observables from those states. The state transition probabilities depend only on the previous state. The emission transition probabilities depend only on the current state, i.e. the state from which the observable is emitted.
Example HMM1: Say we want to generate sentences in English that resemble real ones. Consider specifically sentences having one of two structures
Article Noun Verb Adverb
Pronoun Verb
An example sentence of the first structure is The boy is fast. An example sentence of the second structure is She sings.
A realistic sentence generator would accommodate a lot more structures. We chose two because the maximum insight comes in going from one to two.
What does an HMM that models these two structures look like? First, it helps to add a begin state and an end state. The HMM always starts from the begin state and stops at the end state. The begin and end states are called silent. They don’t emit any token.
Okay, now to the structure, i.e. the arcs connecting the states.
begin → article → noun → verb → adverb → end
begin → pronoun → verb → end
Note that some states are shown twice. This is just due to the presentation constraint. Take an example. Since the HMM has arcs begin → article and begin → pronoun what this really means is that from the begin state the HMM can move to the article state with a certain probability and to the pronoun state with another probability. The two probabilities sum to 1. From the state begin the HMM has to move somewhere.
Similarly, from the state verb we can either move to the state adverb or to the state end. Note that the HMM does not track how we arrived at the state verb, which means the state sequence begin ⇒ article ⇒ noun ⇒ verb ⇒ end is also possible, even though we didn’t ask for it. This type of generalization is good if we want the HMM to be able to generate sentences such as The boy ate and bad if not.
Training: We can train the emission parameters of this model easily. We can take advantage of the fact that our states are named entities, for which training sets are readily available. It’s easy to collect words that form articles, those that form nouns, etc.
Next, we look into training the state transition probabilities. For example, we need to estimate the probability of going from the begin state to the article state. This is less than 1 as from the state begin we can also go to the state pronoun instead.
For training the transitions, we assume we have access to sentences whose words have POS-tags attached to them. A rich training set of this type is easy to assemble. It is easy to get lots of sentences. It is also easy to get the POS-tag sequences of these sentences by running a suitable POS-tagger on each.
The transition probabilities of the HMM are easy to train from such a training set. In fact, we only need the POS-tag sequence for each sentence. Let’s illustrate this. The probability on the arc begin → article is estimated as the number of POS-tag sequences whose first token is article divided by the number of POS-tag sequences whose first token is either article or pronoun.
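A sketch of this transition training; the tag sequences below are made-up stand-ins for POS-tagger output:

from collections import defaultdict

tag_sequences = [
    ['article', 'noun', 'verb', 'adverb'],
    ['pronoun', 'verb'],
]

counts = defaultdict(lambda: defaultdict(int))
for tags in tag_sequences:
    path = ['begin'] + tags + ['end']
    for a, b in zip(path, path[1:]):
        counts[a][b] += 1

def transition_prob(a, b):
    # Estimated probability of moving from state a to state b.
    return counts[a][b] / sum(counts[a].values())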
Two generated sentences
Let’s see them. The first one is below.
The boy is fast
^ ^ ^ ^
| | | |
begin → article → noun → verb → adverb → end
We start from the state begin, walk to article, emit The from it, walk to noun, emit boy from it, and so on.
The second one is below.
She sings
^ ^
| |
begin → pronoun → verb → end
We start from begin, walk to pronoun, emit She from it, move to verb, emit sings from it, and finally walk to end and stay put there.
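Putting the pieces together, here is a minimal sketch of a generator for this toy HMM. All probabilities below are made-up for illustration:

import random

transitions = {
    'begin':   {'article': 0.6, 'pronoun': 0.4},
    'article': {'noun': 1.0},
    'noun':    {'verb': 1.0},
    'verb':    {'adverb': 0.7, 'end': 0.3},
    'adverb':  {'end': 1.0},
    'pronoun': {'verb': 1.0},
}
emissions = {
    'article': {'The': 0.5, 'A': 0.5},
    'noun':    {'boy': 0.5, 'girl': 0.5},
    'verb':    {'is': 0.5, 'sings': 0.5},
    'adverb':  {'fast': 1.0},
    'pronoun': {'She': 0.5, 'He': 0.5},
}

def sample(dist):
    # Draw a key from a {value: probability} dict.
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate():
    state, words = 'begin', []
    while True:
        state = sample(transitions[state])
        if state == 'end':
            return ' '.join(words)
        words.append(sample(emissions[state]))

Note that this generator can also produce sentences such as The boy sings, via the shared verb state, reflecting the generalization discussed above.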
Versus Independent: Imagine, by contrast, the sentences generated by the independent model. The words would be spewed out based on their probabilities with no regard for the desired structure.
Position-specific Independent Model aka chain HMM
In some use cases, the generated symbols are significantly influenced by their positions in the string. As one example, consider product names such as iPhone 11, canon camera, and sony tv. There is clearly some sequential structure. In these examples, the brand name (canon, sony) precedes the product type (camera, tv).
The position-specific generalization of the independent model is suited for such modeling.
P(x1, x2, …, xn) = Q1(x1)*Q2(x2)*…*Qn(xn).
Here Qi is a position-specific probability distribution over the alphabet. So with n positions we have n distributions Q1, …, Qn.
A position-specific independent model may be viewed as an HMM whose states 1, 2, …, n represent token positions 1 through n respectively. Such an HMM’s graph is a single directed path 1 → 2 → 3 → … → n. Consequently, the transition probabilities on all the arcs are 1. The HMM’s parameters are the position-specific emission probabilities. We will call such a model a chain HMM.
A chain HMM is hardly an HMM as there is nothing Markovian about it. That said, calling it out as an HMM is useful. It surfaces paths to enhancing the model as needed.
An example of such an enhanced version is commonly used in biomolecular sequence analysis. It goes by the name profile HMM [3]. A profile HMM is a chain HMM with a few judiciously chosen states and transitions added to detour off the chain at certain positions. Detour paths merge back to the chain at certain other positions. Detour paths model position-specific mutations. In biomolecular sequences, such mutations often happen. A model that captures them recognizes membership of biomolecular sequences in the modeled family more accurately.
A different enhancement path from a position-specific chain HMM is into a so-called Conditional Markov Model (CMM). We will cover CMMs later in this post. We will also illustrate, in the example below, what benefits we get by moving from a chain HMM to a CMM.
Example CHMM-Park-Names: Say we want to model national or state park names in the US. Such as Yellowstone National Park, Yosemite National Park, Castle Rock State Park, … A chain HMM is a good way to go.
This HMM will be trained off a list of national plus state park names. We will set n to the maximum number of words in an entry in this list. The position-specific emission probabilities of the chain HMM are easy to estimate. The tokenized version of a park name reveals the position of each word in it. For example, in yosemite national park, yosemite is the first word, national the second, and park the third. So from a token in a park name we know which state’s emission probability to train.
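A sketch of this training; the park-name list below is a tiny made-up stand-in:

from collections import Counter, defaultdict

park_names = [
    'yellowstone national park',
    'yosemite national park',
    'castle rock state park',
]

position_counts = defaultdict(Counter)  # position -> word counts at that position
for name in park_names:
    for i, word in enumerate(name.split()):
        position_counts[i][word] += 1

def Q(i, word):
    # Position-specific emission probability Q_i(word).
    return position_counts[i][word] / sum(position_counts[i].values())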
The model above is attractive for its simplicity and flexibility. Just train it on a list of park names. As the list gets richer, the model will improve on its own.
That said, it seems a bit unnatural. A natural model for the type of park names we have in mind is
word+ (state | national) park

This just means that a park name has one or more words, followed by national or state, followed by park.
The model below is closer to the natural model we seek.
Example HMM-Park-Names: Here we will use three states: prefix_word, regional_word, and park. The state regional_word will emit state and national with equal probability (for now). The state prefix_word will emit a word in a park name that appears before regional_word. The state park will (for now) emit the word park with probability 1.
A training set of state sequences is easy to construct. We just need a few:
prefix_word regional_word park
prefix_word prefix_word regional_word park
…
prefix_word prefix_word prefix_word regional_word park
Why not just use a regular expression then? The HMM is more accurate. Consider the phrase “the national park” in some text. This phrase is not an actual national park name. The HMM can in principle model this by assigning a very low probability to emitting the from the state prefix_word.
In fact, this takes us to the topic of how to train the emission probabilities of prefix_word. (The emission probabilities of the other states have already been settled, for now.) Here is a simple approach. Take a list of national and state park names. Strip off the last two words in each park name. The words that remain are prefix words.
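As a sketch, reusing the tiny park_names list from before:

from collections import Counter

prefix_counts = Counter()
for name in park_names:
    for word in name.split()[:-2]:  # strip the trailing '(state|national) park'
        prefix_counts[word] += 1
# prefix_counts: yellowstone, yosemite, castle, rock -> 1 each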
The state park could be extended to emit beach, forest, monument, etc. (In California, state beaches are often classified under state parks.)
To get a better feel for the trade-offs between a chain HMM and an entity-based HMM, let’s see another example involving the two.
Example CHMM-Product: Consider a chain HMM to model product names. (By ‘product name’ we mean both specific products and product categories.) We could use a trained version to recognize product names in unstructured text, some that it hasn’t even been trained on.
This HMM will be trained off a list of product names. The training is similar to that of the park chain HMM. Remember, we just need to learn (i) the number of states and (ii) the emission probabilities from the various states.
This HMM has a simple structure and is easy to train. That said, the entity-based version we describe below will likely be more accurate.
Example HMM-Product
We might choose as our entities brand_token, product_type_token, product_version_token, and product_base_name_token. Example values are brand_token = canon, product_type_token = camera, product_version_token = 12, product_base_name_token = iphone. Our HMM’s states would be these entities.
Why does each entity’s name end with the word token? Because the entity applies to a single token. So the state sequence associated with “smart phone” would be [product_type_token, product_type_token].
To train this HMM we would need training sets for the various entities. These training sets could be derived from lists of brands, product types, etc. We say “derived” because we can’t use the lists as-is if they contain multi-word entries. Consider smart phone listed as a product type. From this, we would derive two examples: smart → product_type_token and phone → product_type_token.
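A one-liner sketch of this derivation, with a made-up list:

product_types = ['smart phone', 'camera', 'tv']
examples = [(word, 'product_type_token')
            for entry in product_types for word in entry.split()]
# [('smart', 'product_type_token'), ('phone', 'product_type_token'), ...]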
We also need state sequences that capture the sequences of entities in tokenized product names. We could construct such a training set manually. This isn’t hard because product names tend to follow some common conventions. For instance, in many two-word product names, the first name is a brand, and the second a product type. We already saw an example: canon camera. Consequently [brand_token, product_type_token] should be in the training set of state sequences.
Automatically Deriving State Sequences From Product Names
Constructing state sequences manually will get us quite far. A few state sequences cover a lot of product names. That said, the manual approach risks not scaling to a robust industrial-strength model, one covering millions of products. The product names may span a long tail of state sequences. Consider the state sequence [product_type_token, brand_token]. Some product names follow this convention.
As it turns out, we can automate the construction of state sequences from the tokenized product names. This way as we add more and more products to our training set, state sequences automatically get extracted from them.
Basic version: The basic idea is to take the tokenized product name and, for each word in it, find its most-likely entity. We can do this because we have training sets of (word, entity) pairs for the various entities.
Let’s see a simple example.
canon camera ⇒ [canon,camera] ⇒ [brand_token,product_type_token]
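A sketch of this basic version; the (word → entity count) table is a made-up stand-in for what the entity training sets would give us:

word_entity_counts = {
    'canon':  {'brand_token': 98, 'product_base_name_token': 2},
    'camera': {'product_type_token': 100},
}

def most_likely_entity(word):
    counts = word_entity_counts[word]
    return max(counts, key=counts.get)

print([most_likely_entity(w) for w in 'canon camera'.split()])
# ['brand_token', 'product_type_token']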
Refined version: We can refine this basic idea as follows. As before, we tokenize the product name. We then run the sequence of tokens through the HMM to find a state sequence that fits it best.
This refinement uses information from both the entity training sets and from the HMM’s state transitions. Consequently, it can be more accurate.
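Finding the best-fitting state sequence is typically done with the Viterbi algorithm. Below is a compact log-space sketch, assuming trans and emit are probability dicts like those in the generator sketch earlier; the 1e-12 floor stands in for proper smoothing:

import math

def viterbi(tokens, states, trans, emit):
    # best maps each state to (log-probability, best path ending in that state).
    best = {s: (math.log(trans['begin'].get(s, 1e-12)) +
                math.log(emit[s].get(tokens[0], 1e-12)), [s])
            for s in states}
    for w in tokens[1:]:
        best = {s: max(((p + math.log(trans[a].get(s, 1e-12)) +
                         math.log(emit[s].get(w, 1e-12)), path + [s])
                        for a, (p, path) in best.items()),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])[1]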
I like to call this approach “bootstrap training”. We initialize the HMM manually. This includes seeding it with whatever state sequences we choose to. We can then run tokenized product names through it in the hope of discovering more state sequences.
To enable the discovery of new state sequences, when initializing the HMM, we should allow transitions from all states to all states. (Other than from the state begin directly to the state end, and any transitions going out of the state end.) We can initialize our training set for these transitions automatically, so that the initial probabilities on these transitions are very low, albeit not zero. We call this pseudo-training.
Curation: It’s possible that some of the new state sequences we discover via “bootstrap training” have errors in them. So these should be reviewed by humans. Nonetheless, we have probably benefited despite the human in the loop. Manually discovering new state sequences is a lot more time consuming than curating automatically discovered ones.
Conditional Markov Model
Just like an HMM, a conditional Markov model (CMM) operates on (token sequence, state sequence) pairs. Unlike an HMM, a CMM is optimized for finding the best state sequence for a given token sequence. What exactly does this mean? This will become clear gradually in the examples below.
That said, in this post, we will not discuss how to find the best state sequence for a given token sequence. We will discuss how to score a particular state sequence for a given token sequence. This scoring will surface the “conditional” part in CMM.
Example CMM1 (First Part): Imagine training a language model on (tokenized full person name, token entities) pairs, with the aim of using it to parse person names. Parsing means to break up a person’s name into its constituents. Such as first name and last name.
Consider the name John K Smith. The model should be able to infer that K is a middle name word. Even if the training set does not contain a K emitted from middle_name_word.
The CMM directly models this as
P(state[2] = middle_name_word | state[1] = first_name_word, token[2] = K)

whereas the HMM factors it as

P(state[2] = middle_name_word | state[1] = first_name_word)*P(token[2] = K | state[2] = middle_name_word)

How is this beneficial, compared to how the HMM models it?
Towards answering this question, first let’s abstract out the specifics of this example. We have P(S[i+1] | S[i], X[i+1]) for the CMM versus P(S[i+1] | S[i])*P(X[i+1] | S[i+1]) for the HMM. Here X[i] is the ith token and S[i] the state from which it was emitted.
The CMM approach may be viewed as a multinomial classifier whose input is the pair (S[i], X[i+1]) and whose output is a probability distribution over the various values S[i+1] can take. This formulation lends itself to extracting any features we see fit from the input. Possibly overlapping. Possibly taking into account interactions between S[i] and X[i+1].
We can now use any multinomial classifier algorithm on this problem. Such as (multinomial) logistic regression, decision trees, or random forest.
By contrast, the HMM cannot capture interactions between S[i] and X[i+1]. Nor can it accommodate arbitrary features. Nor can it leverage sophisticated machine learning classifier algorithms. The CMM on the other hand cannot leverage the entity training sets. We no longer learn emission probabilities.
Example CMM 1 (completed): Let’s complete this example. Starting with the features. From input (S[i], X[i+1]) we will extract the features state = S[i], token = X[i+1], token length = number of characters in X[i+1]. That is, we have three predictors: state, token, and token length. Each is categorical. The response is the value of S[i+1]. Also categorical.
Why these features? The feature state lets us model the influence of the current state on the next state. The feature token lets us use the actual value of the token as a predictor of its state. This is useful. Certain tokens favor certain entities. For example, John is much more likely to be a first name word than a last name word. As an added bonus, we get the interaction between state and token for free, so long as our learning algorithm can exploit it. (If there is an interaction, that is.) The feature token length is useful because a token’s length favors certain entities over others. First or middle names are often one character long. Last names hardly ever are.
Let’s see a few training examples. We’ll start with (token sequence, state sequence) pairs.
John Smith
first last

John K Smith
first middle last
From these let’s derive a few training instances for the CMM formulation. (We don’t show all.) The first three columns list the predictors. The last one lists the response.
(previous) state | token | token's length | state
first            | K     | 1              | middle
begin            | John  | 4              | first
first            | Smith | 5              | last
…                | …     | …              | …
The first training instance above reads as “(previous state equaling first, token equaling K, token’s length equaling 1) leads to next state being middle”.
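A sketch of deriving such instances mechanically from the tagged names:

tagged = [
    ('John Smith'.split(), ['first', 'last']),
    ('John K Smith'.split(), ['first', 'middle', 'last']),
]

instances = []  # ((previous state, token, token length), next state)
for tokens, states in tagged:
    prev = 'begin'
    for token, state in zip(tokens, states):
        instances.append(((prev, token, len(token)), state))
        prev = state

These instances can then be fed to whichever multinomial classifier we picked.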
Example CMM 2 (Sketched): Consider parsing US street addresses. For simplicity assume they have the form
street_num_word (street_name_word)+ street_suffix [unit_prefix unit_num_word]
Below are some pairs of tokenized US street addresses together with their entities. The entity names are abbreviated so the examples can fit in the space allocated.
123 Marine Dr
street_num_word street_name_word street_suffix

203 Lake Forest Ave
street_num_word street_name_word street_name_word street_suffix

123 Main St Apt 22
street_num_word street_name_word street_suffix unit_prefix unit_num
For the design of our features, let’s start by focusing on those extracted from a token. Which features of a token discriminate among the various entities? The actual value of the token can predict whether it is a street_name_word or a street_suffix or a unit_prefix. The presence of digits predicts that it is a street_num_word or a unit_num_word.
In view of this, from input (S[i], X[i+1]) we will extract the features state = S[i], token = X[i+1], proportion_of_digits_in_token = fraction of characters in X[i+1] that are digits.
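As a sketch, the feature extractor might look like:

def features(prev_state, token):
    # Predictors for the CMM: previous state, token, digit fraction.
    digits = sum(ch.isdigit() for ch in token)
    return {
        'state': prev_state,
        'token': token.lower(),
        'proportion_of_digits_in_token': digits / len(token),
    }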
Conditional Random Fields
Just like a CMM, a CRF operates on (S, X) pairs where X denotes a token sequence and S its state sequence. Both model P(S|X), to optimize for the use case of finding a high-probability state sequence S for a given X.
The CMM factors this as
P(S|X) = product_i P(S[i] | S[i-1], X[i]) (CMM 1)
That is, the probability of state S[i] is conditioned on the previous state S[i-1] and the current token X[i].
The CRF is less restrictive. That is, it accommodates more general P(S|X).
The CRF operates on an undirected graph whose nodes model states and whose edges model direct influences among pairs of states. These influences are symmetric.
In this post, we will limit ourselves to a linear-chain CRF. It has the structure
S[1] — S[2] — S[3] — … — S[n]
This just models that a state is influenced by the previous state and the next state. That is, S[i] is directly influenced by S[i-1] and by S[i+1].
In the CRF world, the linear CRF is the closest analog to a CMM. So let’s look into it in more detail.
The linear CRF models P(S|X) as
P(S|X) is proportional to product_i product_f e^(w(f)*f(S[i-1], S[i], X)) (CRF 1)
The intuition is a bit easier to convey if we define a score C(S, X) from this by taking logs.
C(S, X) = sum_i sum_f w(f)*f(S[i-1],S[i],X) (CRF 2)
We call it C because we think of it as a compatibility function. As a function of S, C(S, X) is monotone in P(S|X). So for the purposes of finding high-scoring state sequences, C(S, X) works equally well.
What is f? It’s a feature function that scores the compatibility of the triple (S[i-1], S[i], X) along some dimension. More positive values are more compatible. w(f) is f’s weight. It controls the compatibility’s strength and direction.
We can have as many feature functions as we choose.
Okay, so in a linear CRF, we have a set of possibly overlapping feature functions. Each is applied to every edge. That said, they can use any information from any subset of the tokens in X. By contrast, CMMs only score compatibility of (S[i-1], S[i], X[i]) triples.
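Here is a minimal sketch of the compatibility score C(S, X) of (CRF 2). We also pass the position i to each feature function, a common convention in linear-chain CRF implementations that lets position-specific features look up X[i]:

def C(S, X, feature_fns, weights):
    # C(S, X) = sum over positions i and feature functions f
    #           of w(f) * f(S[i-1], S[i], X).
    total = 0.0
    for i in range(1, len(S)):
        total += sum(w * f(S[i - 1], S[i], X, i)
                     for f, w in zip(feature_fns, weights))
    return total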
At this point, it’s best to see a concrete example with specific feature functions.
Example CRF 1: Consider a generalized version of an example we saw earlier: parsing street addresses, this time from across the globe.
Street addresses across the globe come in many different formats. In some, street numbers come before the street name, in some after. Some start with units, others end with them. (An example of a unit is “Apt 30”.) These are just a few of the variations.
Global street addresses also have varying quality, in terms of how well they adhere to the format for their locale. We want to be able to parse poor-quality addresses as well as possible.
Let’s see two examples
123 St Francis St
Rue du Vivier 15
The first one is a US address, the second a Belgian one. In addition to the positions of the street numbers and names, the positions of the street keywords also vary. (In these examples, ‘St’ and ‘Rue’ are street keywords.)
We’ll use a linear-chain CRF.
Are there any global features we should extract from the tokenized street address? Two come to mind: country and language.
Both the language and the country influence address formats. These factors overlap, but they also have complementary influences. Such is the case with Belgian vs French addresses, both written in French. In some aspects they are similar. In other aspects, they are not. That is, the country matters.
In view of this, we should include both.
We’ll build two specific feature functions around these.
F1: country-edge compatibilities: (S[i-1], S[i], country(X))
F2: language-edge compatibilities: (S[i-1], S[i], language(X))
We’ll add a third feature function to prefer tokens that are compatible with their tagged states.
F3: state-token compatibilities: (S[i], X[i])
Training: F1 and F2 may be trained from (tokenized street address, state sequence) pairs tagged with their country and language. Transitions from S[i-1] to S[i] which are compatible with the country and language score high, those that are not don’t. F3 may be trained from the (state, token) pairs in the training set.
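To make this concrete, here is a sketch of F1 and F3 as simple indicator features, with the signature used in the C sketch above. The country() helper and the compatibility tables are made-up placeholders:

US_EDGES = {('begin', 'street_num_word'),
            ('street_name_word', 'street_suffix')}
STREET_KEYWORDS = {'st', 'ave', 'dr', 'rue'}

def country(X):
    # Placeholder: in practice inferred from, or supplied with, the address.
    return 'US'

def f1(prev_state, state, X, i):
    # F1: is this state transition compatible with the address's country?
    return 1.0 if country(X) == 'US' and (prev_state, state) in US_EDGES else 0.0

def f3(prev_state, state, X, i):
    # F3: street keywords favor the state street_suffix.
    return 1.0 if state == 'street_suffix' and X[i].lower() in STREET_KEYWORDS else 0.0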
Further Reading
1. https://www.cl.cam.ac.uk/teaching/1718/R228/lectures/lec9.pdf
2. Maximum-entropy Markov model
3. https://www.ebi.ac.uk/training-beta/online/courses/pfam-creating-protein-families/what-are-profile-hidden-markov-models-hmms/
4. CS838-1 Advanced NLP: Conditional Random Fields
5. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data