Language Modeling (ASR Part 3)
06 Dec 2018
Link to the start of the ASR series: Automatic Speech Recognition (ASR Part 0)
Language Modeling is the task of building a model that assigns probabilities to sequences of words. Language Models (LMs) model valid sequences of words in a language, i.e. sequences that are semantically and syntactically correct, and are used to differentiate valid sentences from random strings of words. This is achieved by modeling the probability of a word occurring in a sentence given its predecessors.
Given that we are building an ASR system, how is the LM relevant here? An acoustic model (AM) gives us the probable word given its acoustic representation (MFCCs). Given a series of continuous acoustic vectors, the AM spits out a distribution over all the words in the dictionary at each time step. Simply taking the most probable word at each time step will not yield correct results, for the basic fact that each word depends on the surrounding words. For example, the acoustic representations of the words \(\text{"ball"}\) and \(\text{"wall"}\) are similar to each other, so the AM will assign high scores to both. But choosing the correct word among them is subject to the context of the sentence: if the surrounding words are related to sports (say), then we should choose \(\text{"ball"}\) even if it has a lower AM score. This decision of choosing the correct word given its context is made by the LM.
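Concretely, the decoder combines the two scores and picks the word sequence \(W\) that maximizes the product of the AM and LM probabilities for the audio \(X\),
\[\hat{W} = \underset{W}{\arg\max}\; P(X \mid W)\, P(W)\]where \(P(X \mid W)\) comes from the AM and \(P(W)\) comes from the LM.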
There are many different LMs in active use today; some of the bleeding-edge ones are deep neural networks (deep bi-LSTM networks, specifically).
We will be exploring one of the most widely used and well-tested LMs in this blog post.
N-Gram Language Model
An N-gram LM estimates the probability of a word given its preceding words from counts over a training corpus. For a bigram,
\[P(\text{bites} \mid \text{dog}) = \frac{c(\text{dog bites})}{c(\text{dog})}\]For an N-gram,
\[P(w_{k}|w_{1},w_{2}...w_{k-1}) = \frac{c(w_{1}w_{2}...w_{k})}{c(w_{1}w_{2}...w_{k-1})}\]In this case, unseen N-grams get 0 probability, hence we need to do some smoothing. Similar to Add-One Smoothing (also called Laplace Smoothing), we have Witten-Bell Smoothing. Here, while calculating the probabilities of seen N-grams, we keep aside some probability mass for unseen events.
For example, in the case of unigrams, we usually compute,
\[P(word) = \frac{c(word)}{c(\text{all words})}\]For an unseen word, this is 0. So instead of the above we compute,
\(P(word) = \frac{c(word)}{c(\text{all words})+V}\), where \(V\) is the vocabulary size.
Similarly, for an N-gram, we compute,
\[P(w_{k}|w_{1},w_{2}...w_{k-1}) = \frac{c(w_{1}w_{2}...w_{k})}{c(w_{1}w_{2}...w_{k-1}) + V(w_{1}w_{2}...w_{k-1})}\]where \(V(w_{1}w_{2}...w_{k-1})\) is the number of unique words that follow these words.
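To make this concrete, here is a minimal Python sketch of this counting scheme (the toy corpus and the function names are mine, purely for illustration, and not from any particular toolkit):

```python
from collections import Counter

def ngram_counts(corpus, n):
    """Count n-grams, their (n-1)-gram prefixes, and the unique words
    seen after each prefix, from a list of tokenized sentences."""
    counts, prefix_counts, continuations = Counter(), Counter(), {}
    for sent in corpus:
        tokens = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            prefix = ngram[:-1]
            counts[ngram] += 1
            prefix_counts[prefix] += 1
            continuations.setdefault(prefix, set()).add(ngram[-1])
    return counts, prefix_counts, continuations

def smoothed_prob(ngram, counts, prefix_counts, continuations):
    """P(w_k | w_1..w_k-1) = c(w_1..w_k) / (c(w_1..w_k-1) + V(w_1..w_k-1)),
    where V(.) is the number of unique words that follow the prefix."""
    ngram = tuple(ngram)
    prefix = ngram[:-1]
    denom = prefix_counts[prefix] + len(continuations.get(prefix, ()))
    return counts[ngram] / denom if denom else 0.0

# Toy usage (hypothetical two-sentence corpus):
corpus = [["the", "dog", "bites"], ["the", "dog", "barks"]]
counts, prefix_counts, continuations = ngram_counts(corpus, 2)
print(smoothed_prob(("dog", "bites"), counts, prefix_counts, continuations))  # 1 / (2 + 2) = 0.25
```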
Back Off Strategy
Let us take a look at this scenario. We have a 4-gram \(\text{"tiger killed wild boar"}\) and we need to calculate,
\[P(boar \mid tiger,killed,wild)\]Assume that this 4-gram does not occur in the training corpus, i.e.
\[c(\text{"tiger, killed, wild, boar"})=0\]Then, we split the above into,
\[P(boar \mid killed,wild)\, \alpha(\text{killed, wild, boar})\]where \(\alpha(\cdot)\) is a back-off weight, a penalty applied for falling back to the shorter context (chosen so that the probabilities still sum to one). If this trigram is also not present in the corpus, i.e.
\[c(\text{killed, wild, boar})=0\]Then we again split into,
\[P(boar \mid wild) \alpha(\text{wild, boar})\]and so on until we get a count that is not 0.
Therefore, we get,
\[P(boar \mid tiger,killed,wild) = \alpha(\text{killed, wild, boar})\, \alpha(\text{wild, boar})\, \alpha(boar)\, P(boar)\]This technique is called the back-off strategy.
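A rough Python sketch of the recursion, assuming `counts` and `prefix_counts` are Counter-style tables holding n-grams of every order (unigrams up to 4-grams); note that the constant back-off weight used here is the "stupid back-off" simplification rather than properly computed \(\alpha\) values:

```python
def backoff_prob(ngram, counts, prefix_counts, alpha=0.4):
    """Recursively back off to shorter contexts until a non-zero count is found."""
    ngram = tuple(ngram)
    if len(ngram) == 1:
        # Base case: unigram maximum-likelihood estimate.
        total = sum(c for ng, c in counts.items() if len(ng) == 1)
        return counts.get(ngram, 0) / total if total else 0.0
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / prefix_counts[ngram[:-1]]
    # Unseen n-gram: drop the oldest context word and pay the back-off penalty.
    return alpha * backoff_prob(ngram[1:], counts, prefix_counts, alpha)

# e.g. backoff_prob(("tiger", "killed", "wild", "boar"), counts, prefix_counts)
```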
Language Model Evaluation
Likelihood
For a bigram model, the probability of each sentence in the test set is computed as,
\[P(w_{1}w_{2}...w_{n}) = P(w_{1}|<s>)P(w_{2}|w_{1})...P(</s>|w_{n})\]where \(<s>\) is the sentence-start token and \(</s>\) is the sentence-end token.
If we have two models A and B, whichever assigns the higher probability to the sentences in the test set is the better model.
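For instance, a sketch of how that product would be computed, where `bigram_prob(word, prev)` stands in for whatever (smoothed) bigram estimator the model uses:

```python
def sentence_prob(sentence, bigram_prob):
    """P(w1 w2 ... wn) = P(w1|<s>) * P(w2|w1) * ... * P(</s>|wn)."""
    tokens = ["<s>"] + sentence + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(word, prev)  # P(word | prev)
    return prob

# Whichever model (A or B) assigns the higher probability to the test set wins.
```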
Log Likelihood
In the above equation, since we are multiplying numbers less than one, the product becomes vanishingly small very quickly. Hence, we take the log on both sides.
\[\log P(w_{1}w_{2}...w_{n}) = \log P(w_{1}|<s>) + \log P(w_{2}|w_{1}) + ... + \log P(</s>|w_{n})\]
Entropy
If we multiply the above equation by \(-\frac{1}{n}\), we get the entropy.
\[entropy = -\frac{1}{n}\log P(w_{1}w_{2}...w_{n})\]
Perplexity
Perplexity is the inverse probability of the test set, normalized by the number of words (the geometric mean of the reciprocal word probabilities). So, if words are on average assigned probability 1/100, then the perplexity would be 100.
Perplexity is just the anti-logarithm (exponential) of the entropy.
\[P(w_{1}w_{2}...w_{n})^{-\frac{1}{n}}\]To explain perplexity further,
\[entropy = -\frac{1}{n}\log P(w_{1}w_{2}...w_{n})\] \[entropy = \log P(w_{1}w_{2}...w_{n})^{-\frac{1}{n}}\]Removing the \(\log\) (taking it to be base 2),
\[P(w_{1}w_{2}...w_{n})^{-\frac{1}{n}} = 2^{entropy}\]Hence,
\[perplexity = 2^{entropy}\]To summarize LM evaluations,
HIGH likelihood ↔ LOW entropy ↔ LOW perplexity ↔ GOOD model
LOW likelihood ↔ HIGH entropy ↔ HIGH perplexity ↔ BAD model
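Putting these evaluation quantities together, here is a small sketch that reuses the hypothetical `bigram_prob` estimator from above and takes the log base 2:

```python
import math

def evaluate(test_sentences, bigram_prob):
    """Per-word entropy and perplexity over a test set, using log base 2.
    Assumes a smoothed model, so no bigram probability is exactly zero."""
    log_prob, n = 0.0, 0
    for sentence in test_sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_prob += math.log2(bigram_prob(word, prev))
        n += len(tokens) - 1        # number of predicted tokens
    entropy = -log_prob / n         # -1/n * log P(test set)
    perplexity = 2 ** entropy       # perplexity = 2^entropy
    return entropy, perplexity
```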