Common Sense Is Not So Common in NLP

One of the main drawbacks of any neural NLU model is its lack of generalization. This topic was explored extensively in the previous post, Empirical Evaluation of Current Natural Language Understanding (NLU). To what can we attribute this lack of common-sense?

The main reason for this brittleness is that there is a fundamental limit to the understanding a model can gain from processing text alone, irrespective of the amount of text it sees. For example, it is common-sense that leaving a closet door open is fine while leaving a refrigerator door open is not. As humans, we can reason that an open refrigerator spoils the food inside because food is perishable, while clothes are not, so an open closet does no harm. But this information is rarely written down anywhere, and hence it is difficult for a model to learn this reasoning.

Also, text is just one modality of information that humans interact with. Humans interact with the world through sight, smell and touch, and this information is inherent in understanding. NLU models lack this crucial exposure to other modalities (which form the root of common-sense). For example, we do not write down obvious things like the color of an elephant. Written text usually talks about elephants in terms of size, as in “a big elephant with huge tusks charged at the man”, while “a dark grey elephant was spotted in Africa” is practically non-existent. Nevertheless, since we are exposed to pictures of elephants, we know that elephants are usually dark grey. If a pre-trained model is asked to predict the color of an elephant, it may fail, or might even say white, since “white elephant” is a valid phrase that is used as a metaphor.

Another reason for lacking common-sense is that common-sense is just not written down. Consider the following example:

S: The toy did not fit in the bag because it was too big.

Q: What does “it” refer to?

A: Toy

S: The toy did not fit in the bag because it was too small.

Q: What does “it” refer to?

A: Bag

The two sentences differ only in the final word, yet the referent of “it” flips. Resolving it requires knowing that an object must be smaller than its container to fit inside, a fact that is rarely stated in text.

In brief, why do models lack common-sense?

  1. There is inherent bias when humans write things down. We do not tend to write down obvious things.
  2. Common-sense knowledge itself is rarely written down.
  3. The models are not exposed to other modalities (like images, audio or video).

Incorporating common-sense into neural models:

  1. Build a knowledge base, similar to WordNet. We can store specific common-sense information in this database. For example, an Elephant object can have the attribute color: grey.
  2. Multi-modal learning: Sun et al. VideoBERT
  3. Human-in-the-loop training.
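
As a rough illustration of the first option, a knowledge base can be as simple as a mapping from concepts to attributes. The sketch below is a toy under stated assumptions: the names `COMMONSENSE_KB`, `lookup` and `ok_to_leave_open` are made up for this post, and the entries are hand-written rather than taken from WordNet or any real resource.

```python
# Hypothetical mini knowledge base: concept -> attribute -> value.
# The entries are illustrative and hand-written, not from a real resource.
COMMONSENSE_KB = {
    "elephant": {"color": "grey", "size": "large"},
    "refrigerator": {"contents_perishable": True},
    "closet": {"contents_perishable": False},
}


def lookup(concept, attribute, default=None):
    """Return the stored value for (concept, attribute), or `default` if unknown."""
    return COMMONSENSE_KB.get(concept.lower(), {}).get(attribute, default)


def ok_to_leave_open(container):
    """Toy rule: leaving a door open is fine only if the contents do not spoil."""
    perishable = lookup(container, "contents_perishable")
    if perishable is None:
        return None  # the knowledge base has no opinion on this container
    return not perishable
```

The obvious weakness of this approach is coverage: every fact has to be entered by someone, which is exactly the kind of writing-down that humans tend to skip.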

Resources for common-sense reasoning

  1. Yejin Choi: Key researcher in the field of commonsense reasoning. Talk at NeurIPS 2019 LIRE workshop
  2. Maarten Sap et al., ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning
  3. Antoine Bosselut et al., COMET: Commonsense Transformers for Automatic Knowledge Graph Construction
  4. Keisuke Sakaguchi et al., WinoGrande: An Adversarial Winograd Schema Challenge at Scale


  1. Winograd
  2. WinoGrande (AAAI 2020)
  3. Physical IQA (AAAI 2020)
  4. Social IQA (EMNLP 2019)
  5. Cosmos QA (EMNLP 2019)
  6. VCR: Visual Commonsense Reasoning (CVPR 2019)
  7. Abductive Commonsense Reasoning (ICLR 2020)
  8. TimeTravel: Counterfactual Reasoning (EMNLP 2019)
  9. HellaSwag: Commonsense NLI (ACL 2019)

Empirical Evaluation of Current Natural Language Understanding (NLU)

Evaluating language understanding is as difficult as elucidating the meaning of the word “understanding” itself. Before evaluating computer models on language understanding, let’s explore how we evaluate human “understanding” of natural language. How do we evaluate whether a person understands a particular language? Do we emphasize the person’s memory of the meanings of different words? Or do we emphasize the person’s ability to construct sequences of words/tokens that make sense to another human? Or is a person’s ability to read a passage and answer questions on it a good indicator of their “understanding” of a language?

It is apparent that quantifying a person’s language understanding is a non-trivial pursuit. But we have tried to quantify it nonetheless, with multiple language proficiency tests across the world. The tests generally evaluate multiple proxies (such as sentence completion, fill-in-the-missing-word, passage question answering and essay writing), and a person’s proficiency is judged by how well they perform on ALL of these tasks. An important point to note is that excelling in one section (task) while failing in another points to a failure in language understanding.

We can draw a parallel between language proficiency tests and the evaluation of natural language understanding in NLP. There are multiple datasets and benchmarks that serve as proxies for the sections of such tests. For example, we have SQuAD (for passage question answering) and GLUE (a collection of sentence-level classification tasks such as sentiment analysis and natural language inference).

Similar to the proficiency tests, we can consider a model to “understand” a language when it performs reasonably well across all the benchmarks without special training. We have seen multiple models achieve SOTA on individual benchmarks (surpassing human-level scores). Does this mean we have achieved language understanding in computers? There are multiple ways to evaluate the generalizing capacity of a model. One is to use multiple datasets for the same task but from different domains, such as question answering datasets spanning multiple domains and languages; a model is expected to perform well on all of them to prove understanding. While recent language models, trained on huge datasets, generalize well over multiple tasks, they still require significant fine-tuning to perform well on these tasks. Questions have been raised as to whether this fine-tuning just overfits to the quirks of the individual benchmarks. Another approach is to train a model on ALL of these tasks simultaneously, as proposed by McCann et al. in The Natural Language Decathlon: Multitask Learning as Question Answering. They propose converting all tasks (summarization, translation, sentiment analysis etc.) into a question answering problem.
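
To make the "everything is question answering" idea concrete, here is a minimal sketch of how examples from different tasks could be cast into one (question, context, answer) format. The function name, the task keys and the exact question wording are my own assumptions for illustration; they are not taken verbatim from the paper or its code.

```python
def as_question_answering(task, text, target):
    """Cast one example from `task` as a (question, context, answer) triple."""
    # Illustrative question templates; the wording is an assumption.
    questions = {
        "summarization": "What is the summary?",
        "translation_en_de": "What is the translation from English to German?",
        "sentiment": "Is this review negative or positive?",
    }
    if task not in questions:
        raise ValueError(f"unsupported task: {task}")
    return {"question": questions[task], "context": text, "answer": target}


# Two very different tasks end up in the same shape, so a single
# question answering model can, in principle, be trained on both.
sentiment_example = as_question_answering("sentiment", "A great film.", "positive")
translation_example = as_question_answering("translation_en_de", "Good morning.", "Guten Morgen.")
```

The design choice worth noticing is that the task identity lives in the question text rather than in a task-specific output layer.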

This blog post covers the current state of natural language understanding and explores where, and whether, we are lacking. It is a review of Dani Yogatama et al., Learning and Evaluating General Linguistic Intelligence.

In order to claim language understanding, a model must be evaluated on its ability to

  1. Deal with the full complexity of natural language across a variety of tasks.
  2. Effectively store and reuse representations, combinatorial modules (e.g., which compose words into representations of phrases, sentences, and documents), and previously acquired linguistic knowledge to avoid catastrophic forgetting.
  3. Adapt to new linguistic tasks in new environments with little experience (i.e., robustness to domain shifts)

Dani Yogatama et al. evaluate BERT (based on the Transformer) and ELMo (based on recurrent networks) on their general linguistic understanding. The main categories of tasks evaluated are:

  1. Reading Comprehension: This is a question answering task. There are three datasets: SQuAD (constructed from Wikipedia), TriviaQA (written by trivia enthusiasts) and QuAC (where a student asks questions about a Wikipedia article and a teacher answers with a short excerpt from the article). These are the same task but with different domains and distributions.

  2. Natural Language Inference: This is a sentence pair classification problem. Given two sentences, we need to predict whether the first ENTAILS, CONTRADICTS or is NEUTRAL towards the second. There are two variants of this task: MNLI (Multi-Genre Natural Language Inference) and SNLI (Stanford Natural Language Inference).

Results from Dani Yogatama et al.:

  1. On both SQuAD and MNLI, both models are able to approach their asymptotic errors after seeing approximately 40,000 training examples, a surprisingly high number of examples for models that rely on pretrained modules: We still need a significant number of training examples to perform well on a task, even if the model is pretrained.

  2. Jointly training BERT on SQuAD and TriviaQA slightly improves final performance. The results show that pretraining on other datasets and tasks slightly improves performance in terms of final exact match and F1 score: If you want your model to perform well on multiple domains for the same task, it is better to jointly train the model on both datasets.

  3. Our next set of experiments is to analyze generalization properties of existing models. We investigate whether our models overfit to a specific dataset (i.e., solving the dataset) it is trained on or whether they are able to learn general representations and modules (i.e., solving the task). We see that high-performing SQuAD models do not perform well on other datasets without being provided with training examples from these datasets. These results are not surprising since the examples come from different distributions, but they do highlight the fact that there is a substantial gap between learning a task and learning a dataset: Just training these models on one dataset and expecting them to perform well on another dataset will not work. However, jointly training on both datasets provides good results, as mentioned in the previous point.

  4. Catastrophic Forgetting: An important characteristic of general linguistic intelligence models is the ability to store and reuse linguistic knowledge throughout their lifetime and avoid catastrophic forgetting. First, we consider a continual learning setup, where we train our best SQuAD-trained BERT and ELMo on a new task, TriviaQA or MNLI. The performance of both models on the first supervised task they are trained on (SQuAD) rapidly degrades: Taking a SOTA model from one task and training it on another task will degrade its performance on the first task. This indicates that the parameter updates from the second task are not compatible with the parameters learned for the first task. The model tends to forget what it was trained for in the first place. This indicates that the model hasn’t achieved generalized language understanding and has merely fit to the task (and dataset) it was trained on.

tl;dr We still have a long way to reach generalized language understanding in computers even though we are achieving SOTA in each task.

Reducing Model Size - Language Models

The size of SOTA neural networks is growing every day. Most of the SOTA models have in excess of 1 billion parameters.


The above image is taken from Huggingface’s DistilBERT

This trend has multiple effects on NLP research:

1. NLP leaderboards are dominated by results from industry, which has access to the vast compute and data resources required to train these humongous models. Companies are incentivized to keep this research proprietary, at least in part, to maintain an intellectual edge over competitors. Open research from academia is unable to keep up.

2. Larger model sizes preclude deploying on mobile devices.

3. Large compute resources are required to even run inferences on these models. This compute requires energy, which in turn leaves a bigger carbon footprint.

4. The notion (and the evidence) that a bigger model always outperforms a smaller one has inspired researchers to build bigger and bigger models instead of pursuing other, paradigm-shifting ventures. As Francois Chollet puts it, “Training ever bigger convnets and LSTMs on ever bigger datasets gets us closer to strong AI – in the same sense that building taller towers brings us closer to the moon”.

Given these issues, this blog post summarizes the current efforts and research in the direction of reducing model footprints.

1. Neural Networks are over-parameterized

Ramanujan, Vivek et al., in “What’s Hidden in a Randomly Weighted Neural Network?”, propose that even before we train a neural network, a randomly initialized network already contains sub-networks that can perform almost as well as a fully trained model. They also propose an algorithm to identify these sub-networks.

2. Distillation

This is a teacher-student architecture, where the outputs/hidden activations of a bigger model (the teacher) are used to train a smaller model (the student). This has been applied generally to many neural models. For BERT specifically, Huggingface’s DistilBERT is a good resource with an implementation. The output of the bigger BERT model and the output of the smaller model are used to calculate the cross-entropy loss (with or without temperature).
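
To make the loss concrete, here is a minimal sketch of a temperature-scaled distillation loss for a single example, in plain Python. This is a generic soft-target cross-entropy, not DistilBERT's actual training code; the function names and the default temperature of 2.0 are my own choices for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
```

The loss is smallest when the student reproduces the teacher's distribution exactly, and it grows as the two distributions diverge. In practice this term is usually combined with the ordinary loss on the true labels.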

SOTA in distillation is TinyBERT. The authors propose a novel distillation procedure optimised for transformer-based networks (which BERT is). Unlike DistilBERT, where the student is trained mainly on the teacher’s softmax output, TinyBERT also uses the hidden activations, including the attention weight matrices, to train the student.

3. Pruning

Here, we focus on reducing the model size by removing transformer attention heads, individual weights, or whole layers.

Elena Voita et al., in Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned, propose removing attention heads from the transformer. A similar effort is proposed by Paul Michel et al. in Are Sixteen Heads Really Better than One?. They propose a Head Importance Score (the derivative of the loss w.r.t. the head activation) as a measure for pruning heads. Results suggest that we can remove almost 60-80% of the heads without much impact on performance.

Ziheng Wang et al., in Structured Pruning of Large Language Models, propose a way to remove model weights.

Angela Fan et al., in Reducing Transformer Depth on Demand with Structured Dropout, propose a way to remove model layers using a method similar to dropout. Where standard dropout randomly zeroes individual activations during training, they randomly drop entire layers during training. This makes the model robust to missing layers, so some layers can be removed at inference time without affecting performance significantly.
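
The layer-dropping idea can be sketched in a few lines. This is a toy under stated assumptions: `forward` and `prune_every_other` are names I made up, layers are plain callables, and a skipped layer simply passes its input through (which is what a residual connection gives you). It is not the authors' implementation.

```python
import random


def forward(x, layers, drop_prob=0.0, training=False, rng=random):
    """Apply `layers` in order, skipping each with probability `drop_prob` while training."""
    for layer in layers:
        if training and rng.random() < drop_prob:
            continue  # the whole layer is skipped; x passes through unchanged
        x = layer(x)
    return x


def prune_every_other(layers):
    """One possible inference-time pruning rule: keep layers 0, 2, 4, ..."""
    return layers[::2]


# Toy "layers" that each add 1, so the output counts how many layers ran.
toy_layers = [lambda v: v + 1 for _ in range(4)]
```

Because the model has already seen random subsets of its layers during training, running `forward` on `prune_every_other(toy_layers)` at inference time is a situation it has been trained for.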

4. Quantization

Another approach to reducing model size is to convert the floating point numbers representing the model parameters into lower-precision integers, for example 8-bit integers. Intel’s Q8BERT is one implementation of this approach.
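
As a rough sketch of what float-to-integer conversion looks like, here is symmetric linear quantization to the 8-bit range in plain Python. This is a generic textbook scheme chosen for illustration, not Q8BERT's actual procedure; the function names are my own, and the input list is assumed to be non-empty.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [int(round(w / scale)) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating point values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most half the scale per weight.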

NLP Summary - Jacob Eisenstein

This is a running summary of the NLP textbook by Jacob Eisenstein. This blog post adds a bit of my personal take on his ideas.

Three main themes in NLP:

1. Learning and Knowledge

  • On one extreme, some advocate for free-form ML modeling: feed raw textual inputs to an end-to-end system and train it to produce the desired outputs. The view is that, given enough data, we can sufficiently train any network, with no need for feature engineering or domain knowledge.
  • On the other extreme, we have advocates of transforming text into hierarchical structures (sub-word units called morphemes, followed by POS tags, followed by tree-like parses).
  • End-to-end systems found success with the advent of deep learning innovations (including in speech recognition), reinforced by advances in CV.
  • We can leverage linguistic structure in model building. We know that sentences are compositional: the meaning of larger units is gradually constructed from the meanings of smaller constituents (Dyer et al. 2016).

2. Search and Learning

  • Most NLP problems can be written as:

    $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}(x)} \Psi(x, y; \theta)$

    • $x$ is the input
    • $y \in \mathcal{Y}(x)$ is the output
    • $\Psi$ is the scoring function
    • $\theta$ is the parameters of the model
    • $\hat{y}$ is the predicted output selected after maximising the scoring function
  • Given this formulation, we can divide the problem into two:
    • Search: This module is for determining the $\operatorname{argmax}$ of the function $\Psi$ and finding $\hat{y}$. Example: bottom-up dynamic programming and beam search.
    • Learning: This module is for training the model parameters $\theta$. This is done through optimization algorithms.
  • Why do we divide the problem this way? At its root, every such problem requires solving two sub-problems, search and learning, each of which already has well-studied solutions. Therefore, identify the relevant algorithms for search and for learning (optimization), and combine them to solve your problem.
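
To see the search half in isolation, here is a small sketch for a classification problem, where the output space is just a handful of labels and search reduces to trying each one. The linear bag-of-words scoring function and the hand-set weights in `theta` are assumptions made up for this example; in practice the learning step would fit them.

```python
def score(x, y, theta):
    """Scoring function: sum of weights for the (word, label) features present in x."""
    return sum(theta.get((word, y), 0.0) for word in x)


def predict(x, labels, theta):
    """Search step: return the label that maximises the scoring function."""
    return max(labels, key=lambda y: score(x, y, theta))


# Hypothetical hand-set parameters, for illustration only.
theta = {("great", "pos"): 1.5, ("boring", "neg"): 2.0, ("film", "pos"): 0.1}
labels = ["pos", "neg"]
```

For structured outputs such as parse trees or tag sequences the label set is exponentially large, which is where dynamic programming and beam search replace the brute-force `max`.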

3. Relational, compositional, and distributional perspectives

  • Any language element (word, phrase or a sentence) has multiple semantic relationships to other elements. This is captured in semantic ontologies such as WordNet.

Learning - Linear Text Classification

1. Naive Bayes with Bag-of-Words

  • Maximum Likelihood Estimation
  • Joint probability learning (generative model)
  • Smoothing: Add 1 smoothing, Laplace smoothing
  • Batch Learning
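
The pieces above (maximum likelihood counts, a generative joint model, add-one smoothing, batch estimation) fit together in a few lines. The sketch below is a minimal bag-of-words Naive Bayes in plain Python; the function names and the tiny dataset are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, alpha=1.0):
    """documents: list of (tokens, label). Returns log-priors, log-likelihoods, vocabulary."""
    label_counts = Counter(label for _, label in documents)
    word_counts = defaultdict(Counter)
    for tokens, label in documents:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in documents for w in tokens}

    log_prior = {y: math.log(n / len(documents)) for y, n in label_counts.items()}
    log_likelihood = {}
    for y in label_counts:
        # Add-alpha smoothing: every vocabulary word gets alpha pseudo-counts.
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        log_likelihood[y] = {w: math.log((word_counts[y][w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood, vocab


def classify(tokens, log_prior, log_likelihood, vocab):
    """Pick the label with the highest joint log-probability; unseen words are ignored."""
    def joint(y):
        return log_prior[y] + sum(log_likelihood[y][w] for w in tokens if w in vocab)
    return max(log_prior, key=joint)


docs = [(["good", "film"], "pos"), (["great", "fun"], "pos"),
        (["bad", "film"], "neg"), (["boring", "plot"], "neg")]
model = train_naive_bayes(docs)
```

Note that all counts are collected in one pass over the whole dataset before any probability is computed, which is what makes this batch learning.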

2. Perceptron

  • Discriminative model
  • Online model (weights are updated after each example), unlike Naive Bayes.
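
For contrast with the batch estimate of Naive Bayes, here is a minimal sketch of the online perceptron update over bag-of-words features. The function names, the feature scheme of (word, label) pairs and the toy data are my own assumptions for illustration.

```python
def perceptron_predict(tokens, labels, theta):
    """Return the label with the highest total weight for the given tokens."""
    return max(labels, key=lambda y: sum(theta.get((w, y), 0.0) for w in tokens))


def perceptron_train(examples, labels, epochs=5):
    """Online multiclass perceptron: weights change right after each mistake."""
    theta = {}
    for _ in range(epochs):
        for tokens, gold in examples:
            predicted = perceptron_predict(tokens, labels, theta)
            if predicted != gold:
                for w in tokens:
                    # Reward the gold label's features, penalise the wrong guess.
                    theta[(w, gold)] = theta.get((w, gold), 0.0) + 1.0
                    theta[(w, predicted)] = theta.get((w, predicted), 0.0) - 1.0
    return theta


examples = [(["good", "film"], "pos"), (["bad", "film"], "neg")]
labels = ["pos", "neg"]
theta = perceptron_train(examples, labels)
```

There are no probabilities anywhere in this code: the model only learns to separate the labels, which is what makes it discriminative.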

3. Loss Function

  • A function that takes the model weights, the input and the true output, and returns a non-negative number indicating how badly the model does on that particular instance. The objective is to minimise the sum of losses across all training instances.
  • The negative log-likelihood can be used as a loss function.

4. Logistic Regression

Learning - Non-Linear Text Classification

Why neural networks for NLP?

  • Rapid advances in Deep Learning.
  • Word Embeddings
  • GPU speeds >> CPU speeds

Neural network hyper-parameter selection

  • Activation Functions: sigmoid, tanh, ReLU and leaky ReLU
  • Architecture: You can either make the network wide or deep. This tradeoff is not well understood.
  • Since squashing activations such as sigmoid and tanh bound their outputs to a fixed range ($(0, 1)$ and $(-1, 1)$ respectively) and saturate for large inputs, the gradients propagating to lower layers might be insignificant, and the lower layers stop learning (vanishing gradients). You can add connections that bypass layers, from a lower layer to a higher layer (residual networks and highway networks).
  • Regularization and dropout to prevent over-fitting.
  • Model initialization: tanh -> uniform distribution (Glorot and Bengio, 2010), ReLU -> zero-mean Gaussian distribution.
  • Gradient clipping
  • Batch normalization.
  • Other optimization algorithms to improve stochastic gradient descent: Adagrad
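
The vanishing gradient point in the list above is easy to check numerically. The sketch below multiplies the local sigmoid derivatives along one path through a stack of layers; treating every weight as 1 is a simplifying assumption of mine so that only the effect of the activation function is visible.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_through_layers(pre_activations):
    """Product of local sigmoid derivatives along one path (weights taken as 1)."""
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_grad(z)
    return product


best_case = gradient_through_layers([0.0] * 10)   # every unit at its steepest point
saturated = gradient_through_layers([3.0] * 10)   # every unit pushed towards saturation
```

Even in the best case the factor shrinks by at least 4x per layer, so ten sigmoid layers scale the gradient by 0.25 to the power 10, which is below one in a million. A residual connection adds an identity path whose local derivative is 1, which is why it helps.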

Convolutional Neural Networks

  • Bag-of-Words model disregards word order.
  • A 1-D convolutional filter can be used to extract word order semantics.
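
Here is a minimal sketch of a 1-D convolution over a sequence of word vectors, in plain Python. The two-dimensional "embeddings" for not and good and the hand-built kernel are toy assumptions, chosen so that the filter responds to the bigram "not good" but not to "good not".

```python
def conv1d(embeddings, kernel):
    """Slide `kernel` over the token sequence and return one activation per window.

    embeddings: list of word vectors, one per token, in sentence order.
    kernel: list of k weight vectors with the same dimension as the embeddings.
    """
    k = len(kernel)
    outputs = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        outputs.append(sum(w * x
                           for weights, vector in zip(kernel, window)
                           for w, x in zip(weights, vector)))
    return outputs


# Toy 2-dimensional embeddings and a width-2 filter tuned to "not good".
NOT, GOOD = [1.0, 0.0], [0.0, 1.0]
not_good_filter = [[1.0, 0.0], [0.0, 1.0]]
```

A bag-of-words model would see the same two words in both orders and could not tell them apart, whereas `conv1d([NOT, GOOD], not_good_filter)` and `conv1d([GOOD, NOT], not_good_filter)` give different activations.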

Linguistic Applications of Classification

  1. Sentiment Classification
  2. Word Sense Disambiguation - WordNet has sense annotations

Things to note while designing a classifier

  1. Tokenization: Most NLP constructs (like word embeddings) treat a word as a single entity. Tokenization might seem straightforward, but the choice of tokenization technique has a significant effect on the performance of the model. We cannot just rely on splitting on white-space: there are hyphenated phrases (prize-winning, out-of-the-box etc.) and punctuation inside tokens (Ph.D., U.S. etc.). This is also highly language specific. For example, Malayalam, like German, can coalesce multiple words into a single word that cannot be reliably split into independent words. In such cases, we might have to train a classifier to predict whether each character is a word boundary or not.
  2. Text Normalization: One normalization technique is to lowercase every word (in English, at least). But this might discard subtleties important for classification: apple and Apple have different meanings depending on the case of the first letter. This consideration is application dependent; if it makes sense to lowercase everything for your application, then it should be done. There are also the cases of lemmas and stems in English. NLTK does a pretty good job of stemming and lemmatising most English words.
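
To see why white-space splitting is not enough, here is a small regex-based tokenizer sketch using only the standard library. The pattern is one possible choice that I wrote for this example, not a standard; it handles dotted abbreviations and hyphenated words in English and nothing more.

```python
import re

# Assumed pattern, tried in this order: dotted abbreviations such as "U.S." or
# "Ph.D.", then words with internal hyphens or apostrophes, then any single
# non-space symbol.
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]+\.){2,}|\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into tokens according to TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)


def normalize(tokens, lowercase=True):
    """Optionally lowercase tokens; note this merges "Apple" and "apple"."""
    return [t.lower() for t in tokens] if lowercase else list(tokens)


sentence = "The U.S. team gave a prize-winning talk."
tokens = tokenize(sentence)
```

Plain `sentence.split()` would leave the final full stop glued to "talk", while the pattern above separates it and still keeps "U.S." and "prize-winning" intact. Whether to call `normalize` afterwards is the application-dependent decision discussed above.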

Interpreting Machine Learning Models - Part 5

Link to start of the blog series: Interpretable ML

Type 2 : Model Agnostic Interpretation (Continued)

                    Interpretability    Accuracy
  Complex Models    No                  Yes
  Simple Models     Yes                 No

  1. Shapley Values

    Lloyd Shapley, a Nobel Prize winner in 2012, introduced a method from cooperative game theory for distributing profits fairly. The basic premise: if a group of people come together to play a cooperative game and win some money, how do we distribute the winnings fairly among the players? The individual players interact differently to achieve the outcome, so how do we fairly decide the contribution of each member to the collective objective?

    A prediction can be explained by assuming that each feature value of the instance is a “player” in a game where the prediction is the payout. Shapley values, a method from coalitional game theory, tell us how to fairly distribute the “payout” among the features.

    Let us take the following example: We have a trained machine learning model to predict apartment prices. For a certain apartment it predicts €300,000 and you need to explain this prediction. The apartment has a size of 50 m^2, is located on the 2nd floor, has a park nearby and cats are banned. The average prediction for all apartments is €310,000. How much has each feature value contributed to the prediction compared to the average prediction?

    The feature values park-nearby, cat-banned, size-50 and floor-2nd worked together to achieve the prediction of €300,000. Our goal is to explain the difference between the actual prediction (€300,000) and the average prediction (€310,000): a difference of -€10,000.

    How do we calculate the Shapley value for one feature? The Shapley value is the average marginal contribution of a feature value across all possible coalitions. The following list shows all coalitions of feature values that are needed to determine the Shapley value for cat-banned. The first row shows the coalition without any feature values. The other rows show different coalitions with increasing coalition size, separated by “+”. All in all, the following coalitions are possible:

    • No feature values
    • park-nearby
    • size-50
    • floor-2nd
    • park-nearby + size-50
    • park-nearby + floor-2nd
    • size-50 + floor-2nd
    • park-nearby + size-50 + floor-2nd

    For each of these coalitions we compute the predicted apartment price with and without the feature value cat-banned and take the difference to get the marginal contribution. The Shapley value is the (weighted) average of marginal contributions. We replace the feature values of features that are not in a coalition with random feature values from the apartment dataset to get a prediction from the machine learning model.
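
    With only four features, the exact computation described above is small enough to write out. The sketch below enumerates the eight coalitions of the other three features and takes the weighted average of the marginal contributions. The value function is a stand-in: `EFFECTS` holds hypothetical additive contributions that I chose only so that they add up to the -€10,000 gap in the example, whereas a real explanation would query the trained model for each coalition.

```python
from itertools import combinations
from math import factorial


def shapley_value(feature, features, value):
    """Exact Shapley value of `feature` under the coalition value function `value`."""
    others = [f for f in features if f != feature]
    n = len(features)
    total = 0.0
    for size in range(len(others) + 1):
        for coalition in combinations(others, size):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            with_feature = value(frozenset(coalition) | {feature})
            without_feature = value(frozenset(coalition))
            total += weight * (with_feature - without_feature)
    return total


# Hypothetical additive effects in euros, for illustration only.
EFFECTS = {"park-nearby": 30_000, "size-50": 10_000, "floor-2nd": 0, "cat-banned": -50_000}
AVERAGE_PREDICTION = 310_000


def toy_value(coalition):
    """Stand-in for the model's expected prediction given the coalition's feature values."""
    return AVERAGE_PREDICTION + sum(EFFECTS[f] for f in coalition)


features = list(EFFECTS)
contributions = {f: shapley_value(f, features, toy_value) for f in features}
```

    The contributions always sum to the difference between the prediction and the average prediction. Because this toy value function is additive, each Shapley value simply equals the feature's own effect; with a real model the features interact, and the coalition-by-coalition averaging is what untangles them.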

    Exact calculation of Shapley Values is an NP-hard problem, since the number of coalitions grows exponentially with the number of features. Hence, we need a way of estimating them, and one such way is SHAP.

  2. SHAP (SHapley Additive exPlanations)



    Here, $\phi$ is the feature attribution function that estimates feature importance.

    • Base Rate: Expectation of the model, i.e. the mean of all predictions over the training data, $E[f(x)]$.
    • Change in rate given a feature $x_1$: $\phi_1 = E[f(x) \mid x_1] - E[f(x)]$
    • Change in rate given features $x_1, x_2$: $\phi_2 = E[f(x) \mid x_1, x_2] - E[f(x) \mid x_1]$

    The main assumption is that the features are conditionally independent.

    • Using LIME principles to estimate Shapley Values:
      • LIME fits a local surrogate model by minimising a loss function weighted by a local kernel (an exponentially smoothed kernel), with regularization. We have to pick a loss function, a regularizer and a local kernel. If you remember, in the previous post on LIME I stated that this is a point of weakness of LIME, as these were chosen heuristically.
      • LIME trains a linear model with these heuristic parameters. SHAP proposes that, instead of using these heuristics, we can use Shapley Values (estimated through linear models), which have a theoretical basis and obey the Shapley principles.


    An important thing to note here is that SHAP is not causal; there is scope for future work here. But explainability is not causality, i.e. we are just explaining why a model gave a particular result, without asserting that these features cause these outputs. This is a very important distinction.