Interpreting Machine Learning Models - Part 4

Link to start of the blog series: Interpretable ML

Type 2 : Model Agnostic Interpretation (Continued)

  1. Feature Importance

    We measure the relative importance of a feature by permuting it’s values and observing it’s effect on the prediction. If the feature is “important” to the prediction, then the prediction changes drastically when the feature value is changed. Conversely, if the feature is relatively “unimportant”, then permuting the value of the feature will have negligible effect on the predicted value.

    NOTE: We still assume that the features are not correlated.

  2. Global Surrogate

    Here, we solve a machine learning problem with more machine learning! If the black-box model is too complex to be interpreted, then we train a simple, interpretable model to mimick the bigger, complex model.

    This is an area of active research in machine learning, engendered not just by a need for interpretability, but also by a need to reduce model sizes. As the models get more and more complex, they grow in size too and contain millions of parameters. This makes it harder to deploy these models on memory-constrained devices such as phones and IoT devices. Therefore, we develop small ML models which can probe the complex model infinitely. Therefore, the smaller model trains to mimick the bigger model by observing how the prediction changes when the input is changed. These research endeavours have been surprisingly successful. The same approach is used in this case, where we train a smaller, interpretable model to mimick the bigger model and hence, we can interpret the outputs.

    This smaller model is called as a “surrogate” of the bigger model and more accurately, it is called as “global surrogate” as it mimicks the entire feature space of the bigger model. This is in contrast to “local surrogates” which is explored in the next section, where the surrogate is trained only on a local sub space of the bigger model and is used to interpret a single prediction.

  3. Local Surrogate (LIME)

    Local interpretable model-agnostic explanations (LIME) focuses on training local surrogate models to interpret individual predictions instead of the entire model. This follows a similar principle of Feature Importance, where we generate a new dataset by perturbing the given input. The exact steps are outlined below:

    • Select the instance for which you want to have an explanation of its black box prediction.
    • Perturb your dataset and get the black box predictions for these new points. Similar to permuting only a single feature in Feature Importance, here we perturb the given vector by changing all the features.
    • Weight the new samples according to their proximity to the instance of interest. This is to give higher importance to generated instances which are closer to the instance of interest. This can be done by any similarity or distance metric. LIME uses an exponential smoothing kernel. A smoothing kernel is a function that takes two data instances and returns a proximity measure. The kernel width determines how large the neighborhood is: A small kernel width means that an instance must be very close to influence the local model, a larger kernel width means that instances that are farther away also influence the model.
    • Train a weighted, interpretable model on the dataset with the variations.
    • Explain the prediction by interpreting the local model.

    How do you get the variations of the data? This depends on the type of data, which can be either text, image or tabular data. For text and images, the solution is to turn single words or super-pixels on or off. In the case of tabular data, LIME creates new samples by perturbing each feature individually, drawing from a normal distribution with mean and standard deviation taken from the feature.

    • LIME for Tabular Data:
      • Tabular data is when the training data is in the form a table where each row is a training instance and each column is a feature.
      • The problem here, is how do we generate data close to the instance that we are interested in? Even though LIME uses exponential smoothing function with a kernel width of 0.75 times the square root of the number of columns of the training data, there is no explanation why.
    • LIME for Text:
      • Variations of the data are generated differently: Starting from the original text, new texts are created by randomly removing words from the original text. The dataset is represented with binary features for each word. A feature is 1 if the corresponding word is included and 0 if it has been removed.
    • LIME for Images:
      • LIME for images works differently than LIME for tabular data and text. Intuitively, it would not make much sense to perturb individual pixels, since many more than one pixel contribute to one class. Randomly changing individual pixels would probably not change the predictions by much. Therefore, variations of the images are created by segmenting the image into “superpixels” and turning superpixels off or on. Superpixels are interconnected pixels with similar colors and can be turned off by replacing each pixel with a user-defined color such as gray. The user can also specify a probability for turning off a superpixel in each permutation.

Interpreting Machine Learning Models - Part 3

Link to start of the blog series: Interpretable ML

Type 2 : Model Agnostic Interpretation

In the previous blog post, we explored various inherently interpretable machine learning models. In this blog post, we will explore various methods of interpretation without any dependency on the type of ML model.

Given an opportunity, we can stick with only inherently ML models. Unfortunately, we have access to innumerable other ML models which are much better than the inherently interpretable ML models. We cannot abandon the former in favor of the latter. Also, having methods to induce interpretability of ML models without relying on the type of model allows us, as developers, to experiment with any number of variations of models without sacrificing interpretability.

  1. Partial Dependency Plot (PDP)

    In layman terms, this plot illustrates the correlation between a feature and target. It illustrates how the target variable changes with change in feature variable value.

    This requires us to know something called as marginalisation. Assume we have 4 variables and we have a function . This function can be represented as


    If was a discrete variable, then integration is replaced by the summation symbol. By integrating (or summing) over all values of , we have marginalised the function over and now we get a relation between , and (i.e ) only without any dependency on .

    This concept is utilised in PDP, where is the set of all features that we are interested in and is the set of all features that we are not interested in.


    By marginalising over the features in , we get the relation between and the ML model.

    To illustrate, let us assume that the features are , , , and the ML model is .

    The output of the ML model is given by,

    Now, we would like to plot a PDP between and the ML model (i.e) we would like to know the how affects the model output.

    Therefore, marginalising over all the other features,

    Now, we have the relation between and . This is nothing but the PDP plot.

    This works for all numerical features. When it comes to categorical features, it becomes simpler because we just need to expand on all the combinations of the categorical features. For example, is an ML model relies on “temperature” and “weather” to predict water sales, we can just set the “weather” variable to “summer”, “spring”, “autumn” and “winter” and record the output of the ML model. Here, we have effectively marginalised over the “weather” variable.

    In PDP, we are assuming that there is no correlation between the features. If there is, this will lead to incorrect results.

  2. Individual Conditional Expectation (ICE)

    PDP is a global method. It does not focus on single, individual instances. It takes all the instances and then plots the correlation. In ICE, we do the same thing for each individual instance. We take each instance and keep , , same and vary (by sampling from a grid or drawing from a distribution) and see how the output () changes. The average of ICE of all instances gives us PDP.

  3. Accumulated Local Effects (ALE)

    ALEs are a better alternative to PDPs. We already know that PDPs have a serious flaw which manifests when the features are correlated. ALEs do not suffer from any of them. How does ALE do that? We know that in PDP, we marginalise over ALL the values of the unwanted features. If the features are correlated, we will end up with feature vectors that are unlikely to ever occur in real life. For example, in house price prediction, if we have number of rooms and square footage area as features and we want to find out how number of rooms affect the house price, we keep the number of rooms constant and vary the square footage. It can go from 20 sqft to 200 sqft. Having 1 room and 200 sqft is highly unlikely to occur and so is 10 rooms and 20 sqft. In ALE, we take a small window to marginalise over instead of ALL the values that the variable can take. For eg, if one example has 3 rooms and 30 sqft, we keep 3 rooms as constant and vary square footage to 29 - 31 sqft (and not 20 - 200 sqft).

Interpreting Machine Learning Models - Part 2

Link to start of the blog series: Interpretable ML

Type 1 : Interpretable Machine Learning Models

In this post, we will be going over some of the machine learning models that can be interpreted intrinsically. This will not be an in-depth review of the models themselves, rather an exploration of how these models lend themselves to interpretability.

  1. Linear Regression

    A linear regression is one of the simpler (and widely used) ML models for regression. Let’s explore how we can interpret a linear regression model and justify whether it is indeed an intrinsically interpretable ML model.

    Linear regression is accomplished with a hyperplane that splits the vector space into two and can be expressed using the following equation.

    As can be seen from the above equation, each feature is assigned a learned parameter which estimates the relative importance given to that particular feature. Since it is also a linear equation, humans can easily comprehend the degree to which a feature affects the output compared to others.

    Depending on the type of feature , we can interpret the corresponding weight as follows:

    1. If is a numerical feature, then every unit change in results in change in the output , given all other features remain constant.

    2. If is a categorical feature, depending on the encoding method used, changing from the reference category to the other category results in change in the output , given all other features remain constant. Determining this reference category is a very tricky business and hence this type of interpretation is tricky.

    3. If is a binary feature, presence of results in change in the output , given all other features remain constant.

    If you have noticed, every interpretation has a condition associated i.e given all other features remain constant. Encountering this situation where only a certain feature changes while all other features remain constant is highly unlikely. This is one of the disadvantages of using these models for interpretability (along with inherent drawbacks of linear regression itself like features should be independent and follow normal distribution).

  2. Logistic Regression

    Logistic regression is the most commonly used model for classification. Let’s explore how logistic regression can be considered an intrinsically interpretable ML model.

    The logical jump from a linear regression to logistic regression is pretty straight-forward. Here, we pass the output of the linear regression through a non-linear function to get the probabilities.

    The linear regression equation is,

    The logistic regression equation is,

    Now that the simple linear equation has been passed through a non-linear function, it becomes a bit difficult for us to interpret the learned weights of logistic regression. So, let us play around with the equation till it is more palatable.

    Let us get the linear term on the right hand side,

    On the left hand side (LHS), we have the ratio of probability of the event happening to the probability of the event not happening (we can call this “the odds”). “log()” of this can be called as the “log odds”.

    Applying exp() on both sides, we get,

    Although this equation makes more sense than the previous ones, it is still not that interpretable. So, let us think about it in this way. What effect would changing by have on the prediction probability?

    Taking the ratio,

    Since, , we can simplify further to get,

    From the above equation, it becomes pretty clear that a unit change in a feature changes the odds ratio by a factor of .

  3. Decision Trees

    Now, decision trees are one of the most understandable machine learning models out there. This is partly because we, as humans, tend to follow this structure when making decisions.

    In simpler terms, a decision tree can be explained as follows: Starting from the root node, you go to the next nodes and the edges tell you which subsets you are looking at. Once you reach the leaf node, the node tells you the predicted outcome. All the edges are connected by “AND”. If feature is [smaller/bigger] than threshold AND … then the predicted outcome is the mean value of of the instances in that node.

    • Feature Importance: The feature that gives us the most reduction in entropy (or variance) is the most important feature. It is beautiful how this can be expressed both mathematically and intuitively.

    • Interpreting a single prediction: A single prediction can be interpreted by visualising exactly the decision path taken to arrive at the output. We can observe each node it went through, the thresholds of these nodes as well as the ultimate leaf node it was assigned to. Since a particular feature can be found any number of times in the tree, we can also estimate how important a feature was in predicting the outcome of this particular prediction.

Interpreting Machine Learning Models - Part 1

Link to start of the blog series: Interpretable ML

This post explores the different types of interpretability, relationships, consequences and evaluation of machine learning interpretability.

Types of interpretability

  1. Intrinsic interpretability: This type of interpretability involves machine learning models that can inherently be interpreted. For example, a short decision tree can express visually the thresholds of splits at every level. A simple linear regression can also show the importance given to each feature. In this scenario, we do not need to resort to any other methods to interpret the models other than to inspect the learned parameters themselves.

  2. Post hoc interpretability: This type of interpretability involves machine learning models that are difficult or impossible to interpret by human standards. For example, just looking at the neural weights of neural networks offer no explanation whatsoever regarding the interpretability. In this scenario, we try to explain the behavior of a model after it is trained by observing how it behaves in myriad situations. This type of interpretability can be applied to interpretable machine learning models too, like a complex decision tree or a linear regressor.

Relationship between algorithm transparency and interpretability

Machine learning algorithms with a high level of algorithm transparency usually tend to have a high interpretability. Algorithm transparency is a measure of how well the learning algorithm is studied and how well we can correlate the learning algorithm with the learned features. For example, in a k-means clustering algorithm, we use a distance metric to classify the points. We know exactly the vector space in which the distance is calculated, the distances between the cluster center and how we decide which cluster a point belongs to. Hence, we can say that k-means has a high level of algorithm transparency. Contrast this with a convolutional neural network and the difference becomes obvious. Although we do understand on a high level that the lower layers differentiates on lower pixel level like contrasts/edges while the higher layers learn more semantic features of the image, we do not yet understand how the gradient updates (irrespective of the algorithm used) in the higher layers, which trickles down to lower layers, correlate to identifying specific features of the image. This is an extremely exciting area of research that I am personally interested in.

Evaluating Machine Learning Interpretability

Before we go further into “how” to achieve interpretability, we need to first understand “what” we are trying to achieve. How do we evaluate different interpretability models? How do we know which method is superior than the other?

Doshi-Velez and Kim (2017) proposed a three level evaluation metric:

  1. Application grounded evaluation
  2. Human grounded evaluation
  3. Functionally grounded evaluation


  1. Application-grounded evaluation (Real humans, real tasks):

    This involves conducting human experiments within a real application. Domain experts are involved to verify the correctness and usefulness of the interpretation offered by the model. For example, a model which predicts whether a tumour is malignant or benign can produce a prediction along with an interpretation report which a doctor can verify.

    This can also involve not making a prediction and only offering supporting evidence to the domain expert in order to make his task easier and faster to accomplish. For example, in the previous example, a model can mark regions of X-ray images which it might flag as malignant/benign which the doctor can incorporate in his decision making.

  2. Human-grounded metrics (Real humans, simplified tasks):

    What happens when we do not have access to domain experts or if the model does not necessarily replicate a domain expert’s task? In this type of evaluation, we make use of lay humans who do not possess any prior knowledge of the task or the underlying model. This can be accomplished through the following 3 ways:

    • Binary forced choice: Humans are presented with pairs of explanations, and must choose theone that they find of higher quality (basic face-validity test made quantitative).

    • Forward simulation/prediction: Humans are presented with an explanation and an input, andmust correctly simulate the model’s output (regardless of the true output).

    • Counterfactual simulation: Humans are presented with an explanation, an input, and an output, and are asked what must be changed to change the method’s prediction to a desiredoutput (and related variants).

  3. Functionally-grounded evaluation (No humans, proxy tasks):

    In situations where we cannot leverage humans for testing (for cost, time or ethical reasons), we can use a proxy for evaluation. This seems a bit counter-intuitive since interpretability requires human comprehension. This type of evaluation, hence, is applicable to models whose counterparts are already subjected to some form of human evaluation. This type of evaluation requires further research.

Unintended consequence

One very interesting consequence that will arise if we manage to build/train a very good interpretation model for existing models is that we can ultimately use the explanations provided by the interpretation model to make the prediction itself. If the interpretation model is actually good, we can as well eliminate the complex underlying machine learning model itself. There would be no need to have the deep neural networks with millions of parameters. Of course, this can spiral into a recursive problem where the interpretation model itself becomes complex enough to require another interpretation model. That would be a very interesting situation to be in :P

Interpreting Machine Learning Models - Introduction

One of the main impediments to the wide adoption of machine learning, especially deep learning, in critical (and commercial) applications is the apparent lack of trust accorded to these machine learning applications. This distrust mainly stems from the inability to reason about the outputs spit out by these models. This phenomenon is not just relegated to those who are outside the machine learning domain either. Even seasoned machine learning practitioners are flummoxed by the apparent failings of machine learning models. In fact, I would go so far as to say that longer you tinker around machine learning, more skeptical you become of its’ abilities.

Extensive research and development has been done in machine learning domain in the past decade and almost all of it has concentrated on achieving that elusive 100% accuracy (or whatever other metrics are used) and most of it were still in the research labs, with nary a thought given to the issue of how exactly this would be applied in the real world. With billions of dollars being poured into this domain and after years in research labs, it was time to bring this technology out in the open for commercial use. This was when the true drawback of integrating machine learning into our everyday lives became apparent. It is not so easy to trust a machine learning model.

If we delve deeper into why it is hard for us to trust a machine learning model, it helps to look into how we handle decisions taken by others that affect us. We, as humans, tend to give more importance to how a decision was arrived at rather than the decision itself. Taking an advisor-investor relationship as an example, if the advisor recommends a particular stock to the investor, he needs to explain why he came to that decision. Even if the investor is not as savvy as the advisor and might not grasp all the decision making factors of the advisor, he will be wary of investing without any type of explanation. And if the advisor insists on not providing any explanation, it would erode the trust between them. Similarly, all the machine learning models are “decision-making” models and we need to be able to “see” how a model arrived at an output even if it doesn’t make complete sense to us.

Another reason why we have so little trust in machine learning models is because of the paucity of understanding in how exactly these models interpret the real world. Even if we have a state-of-the-art model that achieves 99.9% accuracy, we do not know why it fails in the other 0.1%. If we knew, we would fix it and achieve 100%. We cannot trust any machine learning output completely because we do not know when it is going to fail. This drawback became apparent with the introduction of adversarial networks and their runaway success in breaking almost all the state-of-the-art models at the time. Turns out, we do not exactly know how an object detector model exactly detects an object even though we know enough to push it to 99.9%.

Now that we are faced with the reality of deploying these models in the real world and asking people to use them and consequently trust them (sometimes with our lives, as is the case in self driving automobiles), we have to invest more in interpreting the outputs of these models. We are seeing more and more research being done on interpretability of machine learning models in recent times. This is not cure-all for widespread adoption of machine learning, but it is definitely a good starting point.

The purpose of this blog series is to explore some of the ways we can achieve machine learning interpretability and keep this in mind before developing any new machine learning applications.

Links to the blog series:

  1. Part 1: Evaluation and Consequences
  2. Part 2: Interpretable Machine Learning Models