NLP Summary - Jacob Eisenstein

This is a running summary of the NLP textbook by Jacob Eisenstein, along with a bit of my personal take on his ideas.

Three main themes in NLP:

1. Learning and Knowledge

2. Search and Learning

3. Relational, compositional, and distributional perspectives

Learning - Linear Text Classification

1. Naive Bayes with Bag-of-Words

2. Perceptron

3. Loss Function

4. Logistic Regression

Learning - Non-Linear Text Classification

Why neural networks for NLP?

Neural network hyper-parameter selection

Convolutional Neural Networks

Linguistic Applications of Classification

  1. Sentiment Classification
  2. Word Sense Disambiguation - WordNet provides sense annotations (a quick look at WordNet senses through NLTK follows this list)
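
Since WordNet's sense inventory comes up here, a minimal sketch of listing the senses of a word through NLTK's WordNet interface; the example word "bank" is my own choice, not from the book:

```python
from nltk.corpus import wordnet as wn
# Requires the WordNet data: nltk.download('wordnet')

# Each synset is one sense of the word; word sense disambiguation
# is the task of picking the right one for a word in context.
for synset in wn.synsets("bank"):
    print(synset.name(), "-", synset.definition())
```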

Things to note while designing a classifier

  1. Tokenization: Most NLP constructs treat a word as a single entity (word embeddings, for example). Tokenization might seem straightforward, but each tokenization technique has a significant effect on the performance of the model. We cannot just rely on splitting on whitespace: there are hyphenated phrases (prize-winning, out-of-the-box, etc.) and abbreviations with internal punctuation (Ph.D., U.S., etc.). This is also highly language-specific. For example, Malayalam, like German, can coalesce multiple words into a single word that cannot be reliably split into independent words. In such cases, we might have to train a classifier to predict whether each character is a word boundary or not. (A quick comparison of whitespace splitting and NLTK tokenization follows this list.)
  2. Text Normalization: One normalization technique is to lowercase every word (in English, at least). But this can erase distinctions that matter for classification: apple and Apple mean different things depending on the case. This consideration is application-dependent; if lowercasing everything makes sense for your application, then it should be done. There is also the question of stemming and lemmatization in English. NLTK does a pretty good job of stemming and lemmatizing most English words (see the second sketch below).
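
To make the tokenization point concrete, here is a minimal sketch comparing naive whitespace splitting with NLTK's word_tokenize; the example sentence is my own, and the Punkt models have to be downloaded first:

```python
from nltk.tokenize import word_tokenize
# Requires the Punkt tokenizer models: nltk.download('punkt')

text = "Her prize-winning, out-of-the-box idea earned her a Ph.D. in the U.S."

# Naive whitespace tokenization: punctuation stays glued to neighbouring words.
print(text.split())

# NLTK's default word tokenizer: punctuation is split out separately, and the
# treatment of abbreviations and hyphenated phrases differs from a plain split.
print(word_tokenize(text))
```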
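And for the normalization point, a minimal sketch of lowercasing, stemming, and lemmatization with NLTK; the word list is my own, and the WordNet lemmatizer treats words as nouns unless told otherwise:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# The lemmatizer needs the WordNet data: nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["Apple", "studies", "cooking", "geese"]:
    lowered = word.lower()                # case folding loses the apple/Apple distinction
    print(lowered,
          stemmer.stem(lowered),          # crude suffix stripping, e.g. "studies" -> "studi"
          lemmatizer.lemmatize(lowered))  # dictionary form, e.g. "geese" -> "goose"
```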