# Word Mover's Distance: Mathematical Background

We are fully aware of the marked influence of the introduction of Word2Vec method of word embedding on the Natural Language Processing domain. It was a huge leap forward from the hitherto constricting method of word embeddings namely, Term Frequency (TF) and Inverse Document Frequency (IDF). Neither of these methods were anywhere close to preserving the semantics of the words in their representations. With the introduction of Word2Vec and the possibility of semantic embedding in the vectors, various avenues were thrown open where this representation could be exploited for various NLP applications.

One of the logical extensions to Word2Vec was document similarity. Now that you could find out similarity between words, what was the best way of classifying or clustering documents semantically? There are two metrics by which we can quantify the similarity between two words, cosine similarity and Euclidean distance.

Word Mover's Distance (WMD) was introduced in 2015 which leveraged Euclidean distance in order to quantify similarity between documents. WMD accomplishes this by quantifying the amount of distance that the embedded vectors in a document needs to travel in order to match the embedded vectors in another document, thereby offering us a proxy which quantifies dissimilarity (or similarity) between two documents. You can find a link to the original paper here.

For example, let us consider three documents,

D1: "Obama speaks to the media in Illinois"
D2: "The President greets the press in Chicago"
D3: "Obama speaks in Illinois"

WMD algorithm can be described in the following steps.

• Each word in a sentence can be represented as a point in the $$(n-1)$$ dimensional simplex of word distribution. The magnitude of the vector is represented by it's nBOW (n Bag Of Words) representation. Mathematically, if the word $$i$$ appears $$c_{i}$$ times it can represented as, $$d_{i} = \frac{c_{i}}{\sum_{j=1}^{n}c_{j}}$$ where $$i$$ is a vector in $$n$$ dimensional space specified by vocabulary size. It is perfectly clear from the above description that each word pair, if they are dissimilar, lie in a completely different plane irrespective of their semantic distance.
• In the example considered above, after stop words removal, the nBOW representations of documents D1 and D2 have zero common non zero elements and hence lie in the max distance possible from each other even though semantically these sentences are close.
• The distance metric used between any word pair is given by the Euclidean distance between the Word2Vec vectors. We will denote this distance between the word $$i$$ and word $$j$$ as $$c(i , j) = || x_{i} - x{j} ||$$, where $$c(i , j)$$ is the traveling cost from word $$i$$ to word $$j$$.
• The objective of the WMD algorithm is to calculate the flow matrix $$T$$. Flow matrix $$T$$ can be defined as a sparse matrix of dimension $$n X n$$. $$T \in R^{n X n} \hspace{0.2cm}where \hspace{0.2cm} T_{i,j}\geq 0$$ In the flow matrix $$T$$, $$T_{i,j}$$ denotes how much of word $$i$$ in $$d$$ flows into word $$j$$ in $$d'$$. Naturally, the sum of the outgoing flow from word $$i$$ should be equal to $$d_{i}$$ and the sum of the incoming flow to word $$j$$ should be equal to $$d'_{j}$$. Mathematically, $$\sum_{j} T_{i,j} = d_{i}$$ $$\sum_{i} T_{i,j} = d'_{j}$$
• Finally, we can calculate the distance between the two documents $$d$$ and $$d'$$ by calculating the minimum cumulative cost required to move all words in $$d$$ to $$d'$$. $$\sum_{i,j} T_{i,j}c(i,j)$$
• The minimum cumulative cost can be formalized with constraints as follows, $$\underset{T\geq 0}{min}\sum_{i,j} T_{i,j}c(i,j)$$ subject to $$\sum_{j} T_{i,j} = d_{i}$$ and $$\sum_{i} T_{i,j} = d'_{j}$$
• The solution to the above problem is a variation of the Transportation Problem called Earth Mover's Distance which can be found here.
Although the algorithm seems simple enough when the number of words in the 2 documents are equal, complications arise when the number of words are different, as in the case between D2 and D3. In that case, the computational complexity increases to $$O(P^{4})$$. Hence, this algorithm is computationally very expensive.