Conditional Random Fields
16 May 2018

The Conditional Random Field (CRF) is the go-to algorithm for sequence labeling problems. Early approaches to sequence labeling tasks such as POS tagging and Named Entity Recognition used Hidden Markov Models (HMMs). Although HMMs gave promising results, they suffer from the same drawback as Naive Bayes models: the conditional independence assumption. Both Naive Bayes (and, by extension, HMMs) fit a model that maximizes the joint probability distribution \( P(X, Y) \). This is flawed on two levels. First, since \( X \) is already given, fitting a joint distribution over both \( X \) and \( Y \) is computationally expensive and wasteful; in the end, we do not care what distribution the input variables follow. We only need to know what distribution the target variable follows given an input. Second, because the input variables are assumed to be conditionally independent, if two of them are in fact correlated, the model assigns high probability whenever they occur together instead of generalizing over the input distribution. Any perturbation in the input distribution therefore has a significant effect on the output. Ideally, since we only care about the distribution of the target variable \( Y \), the input distribution should not matter. We should evaluate \( P(Y | X) \) instead of \( P(X, Y) \).
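To make the correlated-feature problem concrete, here is a small sketch (not from the original post) using scikit-learn. One informative binary feature is duplicated ten times; the numbers are made up for illustration. Naive Bayes multiplies the same evidence ten times and becomes overconfident, while a discriminative model that fits \( P(Y|X) \) directly learns to share the weight across the copies.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20000

# One binary feature; P(y=1 | x=1) = 0.7 and P(y=1 | x=0) = 0.3.
x = rng.integers(0, 2, size=n)
y = (rng.random(n) < np.where(x == 1, 0.7, 0.3)).astype(int)

# Duplicate the feature ten times: the copies are perfectly correlated,
# violating the conditional-independence assumption of Naive Bayes.
X = np.tile(x[:, None], (1, 10))

nb = BernoulliNB().fit(X, y)
lr = LogisticRegression().fit(X, y)

probe = np.ones((1, 10), dtype=int)  # an input with the feature "on"
print("true        P(y=1|x) = 0.70")
print("Naive Bayes:", nb.predict_proba(probe)[0, 1])   # close to 1, overconfident
print("LogisticReg:", lr.predict_proba(probe)[0, 1])   # close to the true 0.70
```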
In hindsight, this might seem trivial, the obvious choice for any classification problem; indeed, the majority of classifiers in use today optimize this conditional objective. Nevertheless, it was an important breakthrough when it was first introduced.
A Conditional Random Field (CRF) does exactly this: it directly models \( P(Y | X) \). The purpose of this post is to delve deeper into the mathematics of CRFs and into why the CRF is considered a starting point for many current classification algorithms.
Let \( Q(\cdot) \) be a real-valued function. By the Hammersley–Clifford theorem, a Markov random field factorizes over the cliques of its graph: $$P(X,Y) \propto \exp\left(\sum_{c \in C} Q(c, X, Y)\right)$$ where \( C \) is the set of cliques. This equation gives the un-normalized joint probability distribution over \( X \) and \( Y \). Since we are interested in the conditional probability, we normalize over all labelings \( Y \): $$P(Y|X) = \frac{\exp\left(\sum_{c \in C} Q(c, X, Y)\right)}{\sum_{Y'} \exp\left(\sum_{c \in C} Q(c, X, Y')\right)}$$ This has the form of a Gibbs distribution (or a softmax, depending on the \( Q(\cdot) \) function). The above equation can be rewritten as $$P(Y|X) = \frac{\prod_{c \in C} \phi_c(X, Y)}{Z}$$ where \( \phi_c(X, Y) = \exp(Q(c, X, Y)) \) and \( Z \) is called the partition function: $$Z = \sum_{Y'} \exp\left(\sum_{c \in C} Q(c, X, Y')\right)$$

For a CRF used for sequence labeling such as POS tagging or NER, the graph forms a linear chain and the probability can be written as $$P(Y_1, Y_2, \ldots, Y_M | X) = \frac{1}{Z} \prod_i \phi(Y_i, X)\, \phi^1(Y_{i-1}, Y_i, X)$$ where \( \phi(\cdot) = \exp\left(\sum_k \theta_k s_k(y_i, x_i, i)\right) \) and \( \phi^1(\cdot) = \exp\left(\sum_j \lambda_j t_j(y_{i-1}, y_i, x_i, i)\right) \). Here \( s_k \) is a state feature function and \( t_j \) is a transition feature function (analogous to the emission and transition probabilities of an HMM).
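As a sanity check on the linear-chain equation, here is a minimal sketch (not from the original post) that scores a toy chain by brute force. The tables `unary` and `pairwise` are made-up stand-ins for the log-potentials \( \log \phi \) and \( \log \phi^1 \); a real implementation would compute \( Z \) with the forward algorithm rather than enumerating all \( K^M \) labelings.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
M, K = 4, 3                         # sequence length, number of labels

unary = rng.normal(size=(M, K))     # log phi(Y_i, X) for each position
pairwise = rng.normal(size=(K, K))  # log phi^1(Y_{i-1}, Y_i, X)

def log_score(labels):
    """Unnormalized log-score: sum of state and transition log-potentials."""
    s = sum(unary[i, labels[i]] for i in range(M))
    s += sum(pairwise[labels[i - 1], labels[i]] for i in range(1, M))
    return s

# Partition function Z: sum of exp(score) over all K^M labelings.
Z = sum(np.exp(log_score(l)) for l in itertools.product(range(K), repeat=M))

labels = (0, 1, 2, 1)
print("P(Y|X) =", np.exp(log_score(labels)) / Z)
```

Summing the probability over all \( K^M \) labelings returns 1, which is a quick way to confirm that \( Z \) normalizes the distribution correctly.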
The CRF is a general method of modeling \( P(Y | X) \) and is not restricted to sequence labeling. Its flexibility stems from the freedom to choose any function as \( Q(X, Y) \). In the following section, we will see how the logistic function can also be derived from a CRF.
Consider the feature function $$\phi(X_i, Y) = \exp(w_i \mathbb{1}(X_i, Y))$$ where \( \mathbb{1} \) is an indicator function; for binary inputs it can also be written as $$\phi(X_i, Y) = \exp(w_i \cdot \mathrm{AND}(X_i, Y))$$ Evaluating the factors for each label, $$\phi(X_i, Y=1) = \exp(w_i X_i)$$ $$\phi(X_i, Y=0) = \exp(0) = 1$$ Substituting these results into the CRF equation, we get $$P(Y=1|X) = \frac{\prod_i \phi(X_i, Y=1)}{\prod_i \phi(X_i, Y=0) + \prod_i \phi(X_i, Y=1)} = \frac{\exp\left(\sum_i w_i X_i\right)}{1 + \exp\left(\sum_i w_i X_i\right)}$$ The above equation is nothing but the logistic equation. Hence logistic regression is a special case of a CRF, and the CRF framework generalizes a broad family of classification models.
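The algebra above can be checked numerically. The sketch below (not from the original post) uses hypothetical weights \( w \) and a binary input \( x \), and confirms that the CRF factors reproduce the sigmoid.

```python
import numpy as np

# Hypothetical weights and a binary input vector, just to check the algebra.
w = np.array([0.5, -1.2, 2.0])
x = np.array([1, 0, 1])

# Factors phi(X_i, Y): exp(w_i * x_i) when Y = 1, exp(0) = 1 when Y = 0.
phi_y1 = np.exp(w * x)
phi_y0 = np.ones_like(w)

# CRF conditional: product of Y=1 factors over the sum across both labels.
p_crf = phi_y1.prod() / (phi_y0.prod() + phi_y1.prod())

# Logistic (sigmoid) of the linear score, for comparison.
p_logistic = 1.0 / (1.0 + np.exp(-(w @ x)))

print(p_crf, p_logistic)  # identical: the CRF factors recover the sigmoid
```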