Demystifying Post-hoc Explainability for ML models

1. Introduction

As machine learning becomes more widely used in high-stakes situations that affect people's livelihoods, there have been increasing calls to "open the black box" and make machine learning algorithms more understandable. The needs of end-users must be carefully considered when providing useful explanations. This has led to the emergence of a subfield of AI called eXplainable AI (XAI). Rather than trying to create models that are inherently interpretable, there has been a recent explosion of work on "Explainable ML", where a second (post-hoc) model is created to explain the first black-box model. One of the main criteria for categorizing interpretability methods is intrinsic vs. post-hoc: whether interpretability is achieved through constraints imposed on the complexity of the ML model (intrinsic) or by applying methods that analyze the model after training (post-hoc) (Carvalho et al., 2019). If you can build an interpretable model that is also adequately accurate for your setting, do it! Otherwise, post-hoc explanations come to the rescue, and they are the topic of discussion here.
In this review, we study several approaches towards explainable AI systems and provide a taxonomy of how one can think about diverse approaches towards post-hoc explainability from a data modalities perspective. We present the most widely adopted quantitative evaluation measures to validate explanations. Finally, we conclude with a discussion addressing open questions and recommend a path to the development and adoption of explainable methods for safety-critical or mission-critical applications. The objective is to give the reader a guide for mapping a black-box model to a set of compatible post-hoc explanation methods.¹
Figure 1: A pipeline for Post-hoc Explanation. The goal is to make the user understand the predictions of ML models, which is achieved through explanations. For this, we make use of an explanation method, which is nothing more than an algorithm that generates explanations.

2. Motivation

For commercial benefits, ethics concerns, or regulatory considerations, XAI is essential if users are to understand, appropriately trust, and effectively manage AI results. When we talk about an explanation for a decision, we generally mean the need for reasons or justifications for that particular outcome, rather than a description of the inner workings or the logic of reasoning behind the decision-making process in general. Using XAI systems provides the required information to justify results, particularly when unexpected decisions are made. It also ensures that there is an auditable and provable way to defend algorithmic decisions as being fair and ethical, which leads to building trust. Explainability is not just important for justifying decisions. It can also help prevent things from going wrong. Indeed, understanding more about system behavior provides greater visibility over unknown vulnerabilities and flaws, and helps to rapidly identify and correct errors in low criticality situations (debugging), thus enabling enhanced control (Adadi et al., 2018).

3. Approaches for Post-hoc Explanation

Broadly speaking, there are two kinds of post-hoc explanations. Local explanations focus on the data and explain individual predictions, which helps build trust in model outcomes. Global explanations focus on the model and provide an understanding of the overall decision process, conveying some sense of the mechanism by which the model works. Many post-hoc explanation methods are model-agnostic: they are not tied to a particular type of ML model and separate prediction from explanation.
Figure 2: List of popular approaches for Post-hoc Explanation.
Popular approaches with examples for post-hoc explainability:
  • Feature Importance Based Explanations
    • LIME (Ribeiro et al., 2016) - explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.
    • SHAP (Lundberg et al., 2017) - assigns each feature an importance value for a particular prediction and fairly attributes the prediction to all the features.
  • Rule-Based Explanations
    • Anchors (Ribeiro et al., 2018) - explains the behavior of complex models with high-precision rules, representing local, "sufficient" conditions for predictions.
    • LORE (Guidotti et al., 2018) - learns a local interpretable predictor on a synthetic neighborhood generated by a genetic algorithm. Then, it derives from the logic of the local interpretable predictor a meaningful explanation consisting of a decision rule and a set of counterfactual rules.
  • Saliency Maps
    • Layer-Wise Relevance Propagation (Bach et al., 2015) - assumes that the classifier can be decomposed into several layers of computation. Such layers can be parts of the feature extraction from the image or parts of a classification algorithm run on the computed features.
    • Integrated Gradients (Sundararajan et al., 2017) - they do not need any instrumentation of the network, and can be computed easily using a few calls to the gradient operation, allowing even novice practitioners to easily apply the technique.
  • Prototype-Based Explanations
    • Prototype Selection (Bien et al., 2011) - a good set of prototypes for a class should capture the full structure of the training examples of that class while taking into consideration the structure of other classes.
    • TracIn (Pruthi et al., 2020) - computes the influence of a training example on a prediction made by the model by tracing how the loss on the test point changes during the training process whenever the training example of interest was utilized.
  • Counterfactual Explanations
    • DiCE (Mothilal et al., 2020) - generating and evaluating a diverse set of counterfactual explanations based on determinantal point processes.
    • FACE (Poyiadzi et al., 2020) - generates counterfactuals that are coherent with the underlying data distribution and supported by the "feasible paths" of change, which are achievable and can be tailored to the problem at hand.
  • Representation-Based Explanations
    • Network Dissection (Bau et al., 2017) - quantifying the interpretability of latent representations of CNNs by evaluating the alignment between individual hidden units and a set of semantic concepts.
    • Compositional Explanation (Mu et al., 2020) - automatically explains logical and perceptual abstractions encoded by individual neurons in deep networks, generating explanations by searching for logical forms defined by a set of composition operators over primitive concepts.
  • Model Distillation
    • LGAE (Tan et al., 2019) - leverage model distillation to learn global additive explanations that describe the relationship between input features and model predictions. These global explanations take the form of feature shapes, which are more expressive than feature attributions.
    • Decision Trees as global explanations (Bastani et al., 2017) - approximate the complex model with a decision tree, chosen because decision trees are nonparametric, and generate new training data for it by actively sampling new inputs and labeling them using the complex model.
  • Summaries of Counterfactuals
    • AReS (Rawal et al., 2020) - construct global counterfactual explanations which provide an interpretable and accurate summary of recourses for the entire population.
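To make the feature-importance family in the list above more concrete, the following is a minimal sketch of exact Shapley values, the quantity that SHAP approximates. It is not the SHAP library's algorithm: it enumerates every feature subset (exponential cost), and it assumes that a "missing" feature is replaced by a fixed baseline value. The names `exact_shapley` and `toy_model` are hypothetical.

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shapley(predict, x, baseline):
    """Exact Shapley values of predict at x; absent features take baseline values."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                idx = np.array(subset, dtype=int)
                without_i = baseline.copy()
                without_i[idx] = x[idx]
                with_i = without_i.copy()
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_model(v):
    # Hypothetical black box with an interaction between features 0 and 1.
    return 2.0 * v[0] + v[0] * v[1] - 3.0 * v[2]


x = np.array([1.0, 2.0, 0.5])
baseline = np.zeros(3)
phi = exact_shapley(toy_model, x, baseline)
print("attributions:", phi)
print("sum of attributions:", phi.sum())
print("f(x) - f(baseline):", toy_model(x) - toy_model(baseline))
```

The attributions sum to the difference between the prediction at `x` and at the baseline, which is the "fair attribution" property referred to above; the interaction term is split equally between the two features involved.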

4. Explanations in different data modalities

In many scientific problems, we have access to different modalities of data. Specifically, for this review, we consider the modalities -- tabular, text, image, and time-series data, to understand the need for explainability and recognize possibly unique challenges presented by them when constructing explanations in each of these cases. Figure [3] provides a visual walkthrough of post-hoc explanations for different modalities.
Figure 3: Examples of Post-hoc explanations for different modalities.

How do we provide explanations for Tabular / Structured Data?

The majority of data in day-to-day life is in fact structured, tabular data, such as recommender systems, fraud detection, relational databases, etc. Depending on the task/domain, it could be either low or high dimensional and the gradients may not always be meaningful. In this section, we present select approaches used by explainers for explaining the decisions on tabular data.

Feature Importance Based Explanations

(Agarwal et al., 2020) impose constraints on the structure of neural networks, resulting in a family of models known as Neural Additive Models (NAMs), which are inherently interpretable while suffering little loss in prediction accuracy when applied to tabular data. Methodologically, NAMs belong to a larger model family called Generalized Additive Models (GAMs) (Hastie et al., 1990). GAMs have the form:
\begin{align}
\underbrace{g(\mathbb{E}[y])}_{\color{magenta}\text{link function}}=\beta+\underbrace{f_{1}\left(x_{1}\right)+f_{2}\left(x_{2}\right)+\cdots+f_{K}\left(x_{K}\right)}_{\begin{subarray}{l}\color{magenta}\text{univariate shape function with $\mathbb{E}[f_i]= 0$}\\\color{magenta}\text{applied to the input with $K$ features}\end{subarray}}
\end{align}
NAMs learn a linear combination of networks that each attend to a single input feature: each $f_i$ in Eq. (1) is parametrized by a neural network. These networks are trained jointly using backpropagation and can learn arbitrarily complex shape functions. Interpreting NAMs is easy as the impact of a feature on the prediction does not rely on the other features and can be understood by visualizing its corresponding shape function (e.g., plotting $f_i(x_i)$ vs. $x_i$).
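The additive structure of Eq. (1) can be sketched in a few lines. This is only an illustration of the model form, not the authors' implementation: the per-feature networks are randomly initialised and untrained, the link function is assumed to be the identity, the zero-mean constraint on each shape function is not enforced, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K, hidden = 3, 8  # number of features and hidden width (illustrative values)

# One tiny network per feature: (input weights, hidden bias, output weights).
params = [
    (rng.normal(size=(1, hidden)), rng.normal(size=hidden), rng.normal(size=(hidden, 1)))
    for _ in range(K)
]
beta = 0.5


def shape_function(i, xi):
    """f_i: maps a 1-D array of values of feature i to its additive contribution."""
    w1, b1, w2 = params[i]
    h = np.tanh(xi[:, None] @ w1 + b1)
    return (h @ w2).ravel()


def nam_predict(X):
    """beta + sum_i f_i(x_i), i.e. Eq. (1) with an identity link."""
    return beta + sum(shape_function(i, X[:, i]) for i in range(K))


X = rng.normal(size=(5, K))
X_changed = X.copy()
X_changed[:, 1] += 1.0  # perturb feature 1 only

delta_pred = nam_predict(X_changed) - nam_predict(X)
delta_f1 = shape_function(1, X_changed[:, 1]) - shape_function(1, X[:, 1])
print(np.allclose(delta_pred, delta_f1))
```

Because the prediction is a sum of univariate terms, the change in the output equals the change in the single shape function that was perturbed, which is why plotting each shape function on its own is a faithful explanation of the model.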

Rule-Based Explanations

(Ribeiro et al., 2018) introduce model-agnostic explanations based on if-then rules, called anchors. An anchor explanation is a rule that sufficiently anchors the prediction locally, such that changes to the rest of the feature values of the instance do not matter. For instance, it is observed that anchors for a few predictions of 400 gradient boosted trees trained on balanced versions of the adult dataset give the following insight -- Marital status appears in many different anchors for predicting whether a person makes > $50K annually. It is to be noted that these are not exhaustive, since the models are complex, and these anchors explain their behavior on part of the input space but not all of it. (Letham et al., 2015) propose a model, Bayesian Rule Lists (BRL), that produces a posterior distribution over permutations of if ... then ... rules, starting from a large, pre-mined set of possible rules. The decision lists with high posterior probability tend to be both accurate and interpretable, where the interpretability comes from a hierarchical prior over permutations of rules.
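The idea that an anchor "sufficiently anchors the prediction" can be illustrated by estimating the precision of a candidate rule: hold the anchored features fixed, resample everything else, and count how often the prediction is unchanged. The sketch below covers only this precision estimate, under the assumption that perturbations are drawn by resampling rows of a background dataset; it does not implement the search procedure of the Anchors method, and the function names and the toy classifier are hypothetical.

```python
import numpy as np


def anchor_precision(predict, x, anchor_idx, background, n_samples=1000, seed=0):
    """Fraction of perturbed samples that keep x's prediction when anchor features are fixed."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, len(background), size=n_samples)
    samples = background[rows].copy()        # resample all features from the data
    samples[:, anchor_idx] = x[anchor_idx]   # then pin the anchored features to x
    target = predict(x[None, :])[0]
    return float(np.mean(predict(samples) == target))


def toy_classifier(X):
    # Hypothetical model: the label depends on feature 0 only.
    return (X[:, 0] > 0).astype(int)


rng = np.random.default_rng(1)
background = rng.normal(size=(500, 3))
x = np.array([1.2, -0.3, 0.8])

precision_good = anchor_precision(toy_classifier, x, [0], background)
precision_bad = anchor_precision(toy_classifier, x, [1], background)
print(precision_good, precision_bad)
```

Fixing the feature the model actually uses gives a rule whose precision is perfect, while fixing an irrelevant feature leaves the prediction at the mercy of the resampled values.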

Prototype-Based Explanations

Rather than explaining which features led to a certain class being predicted, (Tan et al., 2020) propose to explain a prediction by presenting similar points that represent that class. Since these prototypes will be identified using distance functions derived from the tree ensemble, they call them tree space prototypes. They introduce two new prototype selection methods that exploit approximation guarantees for submodular objective functions, and one that tries to directly optimize for accuracy. To formalize the impact of a training point on a prediction, (Koh et al., 2017) ask the counterfactual: what would happen if we did not have this training point, or if the values of this training point were changed slightly? Answering this question by perturbing the data and retraining the model can be prohibitively expensive. To overcome this problem, they use influence functions, a classic technique from robust statistics that tells us how the model parameters change as we upweight a training point by an infinitesimal amount. This allows us to differentiate through the training to estimate in closed-form the effect of a variety of training perturbations.

Counterfactual Explanations

(Albini et al., 2020) provide counterfactual explanations for generic Bayesian Classifiers (BCs), based on the causal reasoning underpinning them. The explanation method relies upon mapping the influences between a BC’s variables, e.g. between observations and classifications. From these influences, they extract two relations between variables deemed to be relevant to the Counterfactual explanation. These relations amount, respectively, to critical and potential influences, indicating (potentially) pivotal factors, whose absence would give rise to a different classification. (Ustun et al., 2019) frame these issues in terms of recourse, which is defined as the ability of a person to change the decision of a model by altering actionable input variables (e.g., income vs. age or marital status). They present integer programming tools to ensure recourse in linear classification problems without interfering in model development.
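As a rough illustration of recourse for a linear classifier, the sketch below computes the smallest Euclidean change to the actionable features that moves the score $w^\top x + b$ to the decision boundary. This is a deliberate simplification and not the integer programming formulation of (Ustun et al., 2019): it ignores feature bounds, discreteness, and action costs. The function name, the weights, and the feature roles are hypothetical.

```python
import numpy as np


def linear_recourse(w, b, x, actionable, margin=1e-6):
    """Smallest L2 change to actionable features that moves w.x + b to >= 0."""
    score = w @ x + b
    if score >= 0:
        return np.zeros_like(x)             # already on the favourable side
    w_act = np.where(actionable, w, 0.0)    # immutable features may not move
    norm_sq = w_act @ w_act
    if norm_sq == 0:
        raise ValueError("no recourse: no actionable feature affects the score")
    return (-score + margin) * w_act / norm_sq


w = np.array([0.8, -0.5, 0.3])              # e.g. income, age, marital status
b = -1.0
x = np.array([0.5, 0.4, 1.0])
actionable = np.array([True, False, False]) # only the first feature can be changed

delta = linear_recourse(w, b, x, actionable)
print("change:", delta)
print("new score:", w @ (x + delta) + b)
```

If every feature with a non-zero weight is immutable, no action can change the decision, which is exactly the situation a recourse audit is meant to detect.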

How do we provide explanations for textual data?

Traditionally, Natural Language Processing (NLP) systems have been mostly based on techniques that are inherently explainable. Examples of such approaches, often referred to as white-box techniques, include rules, decision trees, hidden Markov models, logistic regressions, and others. Recent years, though, have brought the advent and popularity of black-box techniques, such as deep learning models and the use of language embeddings as features. While these methods in many cases substantially advance model quality, they come at the expense of models becoming less interpretable. A major challenge working with text is the difficulty in writing similarity/perturbation functions. This section covers some of the well-known explanation methods for handling textual data.

Saliency Map Visualization

To interpret why the model made its prediction (not the ground-truth answer), (Wallace et al., 2019) use the model’s own output in the loss calculation. For each gradient-based saliency method they consider, they reduce each token’s gradient (which has the same dimension as the token embedding) to a single value by taking the L2 norm. To better understand Neural Machine Translation (NMT) behavior, (Lee et al., 2017) propose a tool that provides several methods to understand beam search and the attention mechanism in an interactive way, by visualizing the search tree and attention, expanding the search tree manually, and changing attention weights either manually or automatically.
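The reduction from a per-token gradient vector to a single saliency value can be sketched as follows. The gradients here are made-up numbers rather than the output of a real model, and normalising the scores so that they sum to one is an assumption added for readability; `token_saliency` is a hypothetical helper name.

```python
import numpy as np


def token_saliency(embedding_grads):
    """Collapse a (num_tokens, embedding_dim) gradient matrix to one score per token."""
    norms = np.linalg.norm(embedding_grads, axis=1)  # L2 norm of each token's gradient
    total = norms.sum()
    return norms / total if total > 0 else norms


tokens = ["the", "movie", "was", "great"]
grads = np.array([[0.0, 0.1], [0.3, 0.4], [0.0, 0.0], [0.6, 0.8]])  # made-up gradients
scores = token_saliency(grads)
for token, score in zip(tokens, scores):
    print(f"{token:>6s}  {score:.4f}")
```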

Input Reduction

In existing interpretation methods for NLP, a word’s importance is determined either by input perturbation (measuring the decrease in model confidence when that word is removed) or by the gradient with respect to that word. To understand the limitations of these methods, (Feng et al., 2018) use input reduction, which iteratively removes the least important word from the input. For each word in an input sentence, they measure its importance by the change in the confidence of the original prediction when that word is removed from the sentence:
\begin{align}
y&=\operatorname{argmax}_{y^{\prime}} f\left(y^{\prime} \mid \mathbf{x}\right)\\[10pt]
\underbrace{g\left(x_{i} \mid \mathbf{x}\right)}_{\color{magenta}\text{importance}}&=\underbrace{f(y \mid \mathbf{x})}_{\begin{subarray}{l}\color{magenta}\text{predicted probability}\\\color{magenta}\text{of label $y$ given input $\mathbf{x}$}\end{subarray}}-f\left(y \mid \mathbf{x}_{-i}\right)
\end{align}
In Figure [3], input reduction shows that the words “named”, “at”, and “in downtown” are sufficient to predict the People, Organization, and Location tags, respectively.
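The leave-one-out importance $g(x_i \mid \mathbf{x})$ and the reduction loop can be sketched on a toy classifier. Everything below is an assumption made for illustration: the bag-of-words model and its weights are invented, and the loop is a greedy variant that stops as soon as removing a word would change the predicted label, which is simpler than the procedure used in the paper.

```python
import numpy as np

WEIGHTS = {"great": 2.5, "fun": 1.0, "boring": -2.0}  # hypothetical bag-of-words model


def predict_proba(tokens):
    """Toy classifier f(. | x): returns [P(negative), P(positive)]."""
    logit = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    p_pos = 1.0 / (1.0 + np.exp(-logit))
    return np.array([1.0 - p_pos, p_pos])


def importances(tokens, label):
    """g(x_i | x) = f(y | x) - f(y | x_{-i}) for every position i."""
    full = predict_proba(tokens)[label]
    return [full - predict_proba(tokens[:i] + tokens[i + 1:])[label]
            for i in range(len(tokens))]


def input_reduction(tokens):
    """Repeatedly drop the least important word while the predicted label is unchanged."""
    label = int(np.argmax(predict_proba(tokens)))
    while len(tokens) > 1:
        i = int(np.argmin(importances(tokens, label)))
        candidate = tokens[:i] + tokens[i + 1:]
        if int(np.argmax(predict_proba(candidate))) != label:
            break
        tokens = candidate
    return tokens, label


sentence = ["a", "great", "and", "fun", "movie"]
reduced, label = input_reduction(sentence)
print(reduced, label)
```

On this toy model the reduced input keeps only the word that carries most of the evidence for the predicted label, mirroring the observation that a handful of words can be sufficient for a prediction.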

Prototype-Based Explanations

(Han et al., 2020) explore the application of influence functions as a mechanism to reveal artifacts (or confounds) in training data that might be exploited by models and investigate the degree to which results from the influence function are consistent with insights gleaned from gradient-based saliency scores for representative NLP tasks.
How would the model’s predictions change if a training input were modified?
A classic result (Cook et al., 1982) tells us that the influence of upweighting $z$ on the parameters $\hat{\theta}$ is given by:
\begin{align}
\left.\mathcal{I}_{\text{up,params}}(z) = \frac{d \hat{\theta}_{\epsilon, z}}{d \epsilon}\right|_{\epsilon=0}&=-\underbrace{H_{\hat{\theta}}^{-1}}_{\color{magenta}\text{Hessian}} \underbrace{\nabla_{\theta}}_{\color{magenta}\text{gradient}}\underbrace{L(z, \hat{\theta})}_{\color{magenta}\text{loss}}
\end{align}
Now, define $z = (x, y)$ and $z_\delta = (x + \delta, y)$.
\begin{align}
\underbrace{\hat{\theta}_{\epsilon, z_{\delta},-z}}_{\begin{subarray}{l}\color{magenta}\text{empirical risk minimizer}\\\color{magenta}\text{approximation from moving}\\\color{magenta}\text{$\epsilon$ mass from $z$ onto $z_\delta$}\end{subarray}} &= \arg \min _{\theta \in \Theta} \underbrace{\frac{1}{n} \sum_{i=1}^{n} L\left(z_{i}, \theta\right)}_{\color{magenta}\text{empirical risk}}+\epsilon L\left(z_{\delta}, \theta\right)-\epsilon L(z, \theta)
\end{align}
An analogous calculation to Eq. (4) yields:
\begin{align}
\left.\frac{d \hat{\theta}_{\epsilon, z_{\delta},-z}}{d \epsilon}\right|_{\epsilon=0} &=\mathcal{I}_{\text{up,params}}\underbrace{\left(z_{\delta}\right)}_{\color{magenta}\text{perturbation}}-\mathcal{I}_{\text{up,params}}\underbrace{(z)}_{\color{magenta}\text{training point}} \\[10pt]
&=-H_{\hat{\theta}}^{-1}\left(\nabla_{\theta} L\left(z_{\delta}, \hat{\theta}\right)-\nabla_{\theta} L(z, \hat{\theta})\right)
\end{align}
While influence functions might appear to only work for infinitesimal (therefore continuous) perturbations, it is important to note that this approximation holds for arbitrary $\delta$: the $\epsilon$-upweighting scheme allows us to smoothly interpolate between $z$ and $z_\delta$. This is particularly useful for working with discrete data (e.g., in NLP) or with discrete label changes.
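The formulas above can be checked numerically on a model where everything is available in closed form. The sketch below assumes ridge regression with per-example loss $L(z, \theta) = \tfrac{1}{2}(x^\top\theta - y)^2 + \tfrac{\lambda}{2}\lVert\theta\rVert^2$ on synthetic data; this choice, and all function names, are illustrative assumptions rather than the setup used in the cited papers. It compares the influence-function prediction for removing one training point (upweighting by $\epsilon = -1/n$) with actual retraining.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 3, 0.1
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)


def fit(X_, y_, n_terms):
    """Minimiser of a sum of n_terms per-example ridge losses (closed form)."""
    return np.linalg.solve(X_.T @ X_ + n_terms * lam * np.eye(d), X_.T @ y_)


theta_hat = fit(X, y, n)
H = X.T @ X / n + lam * np.eye(d)  # Hessian of the empirical risk at theta_hat


def grad_loss(x, y_):
    """Gradient of the per-example loss at theta_hat."""
    return x * (x @ theta_hat - y_) + lam * theta_hat


def influence_up_params(x, y_):
    """Eq. (4): -H^{-1} grad L(z, theta_hat)."""
    return -np.linalg.solve(H, grad_loss(x, y_))


def influence_perturbation(x, y_, delta):
    """Eq. (7): -H^{-1} (grad L(z_delta, theta_hat) - grad L(z, theta_hat))."""
    return -np.linalg.solve(H, grad_loss(x + delta, y_) - grad_loss(x, y_))


# Removing z_0 corresponds to upweighting it by eps = -1/n.
predicted_change = -influence_up_params(X[0], y[0]) / n
actual_change = fit(X[1:], y[1:], n - 1) - theta_hat
print("predicted:", predicted_change)
print("actual:   ", actual_change)
```

The two vectors agree up to a small relative error that shrinks as $n$ grows, which is the sense in which influence functions replace retraining with a single linear solve.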

How do we provide explanations for images?

When we describe how we classify images, we might focus on parts of the image and compare them with prototypical parts of images from a given class. This method of reasoning is commonly used in difficult identification tasks: e.g., radiologists compare suspected tumors in X-ray scans with prototypical tumor images for diagnosis of cancer. The question is whether we can ask a machine learning model to imitate this way of thinking, and to explain its reasoning process in a human-understandable way. This section lists important explanation techniques commonly used in computer vision.

Saliency Map Visualization

(Simonyan et al., 2014) propose a method for computing the spatial support of a given class in a given image (image-specific class saliency map) using a single back-propagation pass through a classification ConvNet. Such saliency maps can be used for weakly supervised object localization.
Consider the linear score model for the class $c$:
\begin{align}
\underbrace{S_{c}(I)}_{\begin{subarray}{l}\color{magenta}\text{class score}\\\color{magenta}\text{function}\end{subarray}} = \underbrace{{w_c}^{T}}_{\begin{subarray}{l}\color{magenta}\text{weight}\\\color{magenta}\text{vector}\end{subarray}}\underbrace{I}_{\color{magenta}\text{image}} + \underbrace{b_c}_{\begin{subarray}{l}\color{magenta}\text{model}\\\color{magenta}\text{bias}\end{subarray}}
\end{align}
Given an image $I_0$, we can approximate $S_{c}(I)$ with a linear function in the neighbourhood of $I_0$ by computing the first-order Taylor expansion:
\begin{align}
S_{c}(I) \approx w^{T}I + b
\end{align}
where $w$ is the derivative of $S_c$ with respect to the image $I$ at the point (image) $I_0$:
\begin{align}
w=\left.\frac{\partial S_{c}}{\partial I}\right|_{I_{0}}
\end{align}
Given an image $I_{0}$ (with $m$ rows and $n$ columns) and a class $c$, the class saliency map $M \in \mathcal{R}^{m \times n}$ is computed as follows. First, the derivative $w$ of Eq. (10) is found by back-propagation. After that, the saliency map is obtained by rearranging the elements of the vector $w$.
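The computation of $w$ and its rearrangement into a map can be sketched with a tiny network whose backward pass is written by hand. The one-hidden-layer tanh model stands in for a ConvNet purely for illustration, its weights are random, and taking the absolute value of the gradient for a single-channel image is stated here as an assumption about how the map is formed; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, hidden, n_classes = 4, 5, 6, 3  # tiny illustrative sizes
W1 = rng.normal(size=(hidden, m * n))
b1 = rng.normal(size=hidden)
W2 = rng.normal(size=(n_classes, hidden))
b2 = rng.normal(size=n_classes)


def class_scores(image):
    """S(I): unnormalised class scores of a one-hidden-layer tanh network."""
    h = np.tanh(W1 @ image.ravel() + b1)
    return W2 @ h + b2


def score_gradient(image, c):
    """w = dS_c/dI at the given image, via one backward pass, in image shape."""
    h = np.tanh(W1 @ image.ravel() + b1)
    grad_pre = W2[c] * (1.0 - h ** 2)  # back through the output layer and tanh
    w = W1.T @ grad_pre                # back through the first layer
    return w.reshape(image.shape)      # rearrange the vector into m x n


def saliency_map(image, c):
    """Magnitude of the gradient at every pixel."""
    return np.abs(score_gradient(image, c))


I0 = rng.normal(size=(m, n))
c = 1
M = saliency_map(I0, c)
print(M.shape)
```

A single backward pass yields the gradient for every pixel at once, so the cost of the map is comparable to one forward and one backward evaluation of the network.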

Shapley Value Importance

To quantify the contribution of each neuron to an arbitrary performance aspect of a network, (Ghorbani et al., 2020) develop the Neuron Shapley framework. They introduce a new multi-armed bandit based algorithm that efficiently estimates Neuron Shapley values in large networks. Their systematic experiments yield several interesting findings, including the phenomenon that a small number of neurons are critical to different aspects of a network’s performance, e.g. accuracy, fairness, robustness. This facilitates both model interpretation and repair. In Figure [3], removing filters with the highest class-specific Shapley values (blue dash) reduces the class prediction accuracy more effectively than removing filters identified by other approaches.

Concept Attribution

The hallmark of concept attribution is that clinically relevant measures are directly used to produce explanations that match the semantics of the end-users. CNN decisions are explained by directly relating to well-known prognostic factors and clinical guidelines. This promotes a more intuitive interaction between the physicians and black-box systems aiding the diagnosis, with a consequent increase of confidence in automated support tools. Besides, concept attribution is a complementary technique to saliency heatmaps giving concept-based explanations rather than pixel-based ones (Graziani et al., 2020).

How do we provide explanations for time-series data?

Multivariate time series data are being generated at an ever-increasing pace due to the ubiquity of sensors and the advancement of IoT technologies. Classifying these multivariate time series is crucial for utilizing these data effectively, and is an important research topic in the machine learning community. In this section, we consolidate some of the important explanation techniques used in time-series data.

Relevance Heatmaps

Similar to the saliency masks on images, a heatmap can be created based on the relevance produced by XAI methods. This heatmap can then be used to enrich a line plot of the original time series. In Figure [3], the XAI methods shown with their relevance heatmaps are Saliency Maps, LRP, DeepLIFT, LIME, and SHAP. The blue rectangles highlight parts of the time series that are controversial among the XAI methods. Red marks high importance for the classification; such points are, e.g., set to zero by verification methods (Schlegel et al., 2019).

Prototype-Based Explanations

(Gee et al., 2019) introduce a prototype diversity penalty that explicitly accounts for prototype clustering and encourages the model to learn more diverse prototypes. These diverse prototypes will help focus on areas of the latent space where class separation is most difficult and least defined to improve classification accuracies. They show the utility of this approach on three tasks in two-dimensional time-series classification: (1) bradycardia from ECG; (2) apnea from respiration; and (3) spoken digits from audio waveforms.

Patch-Based Classification

(Mercier et al., 2021) propose a data transformation approach for time-series that creates patches of different sizes and transforms them into a shape usable by every neural network. The fine-grained patches can then be interpreted individually to understand their influence on the overall classification.
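A minimal sketch of the patching idea is shown below: sliding windows of several lengths are cut from a series and each patch is resampled to a common length, so that any model with a fixed input size can consume them. The window lengths, the stride rule, and the use of linear interpolation for resampling are assumptions made for this example and are not taken from the cited paper; the function names are hypothetical.

```python
import numpy as np


def extract_patches(series, patch_len, stride):
    """Return (start_indices, patches) for a 1-D series using a sliding window."""
    starts = np.arange(0, len(series) - patch_len + 1, stride)
    patches = np.stack([series[s:s + patch_len] for s in starts])
    return starts, patches


def multi_scale_patches(series, patch_lens=(4, 8), stride_fraction=0.5, out_len=8):
    """Patches of several sizes, each linearly resampled to out_len samples."""
    rows = []
    for length in patch_lens:
        stride = max(1, int(length * stride_fraction))
        starts, patches = extract_patches(series, length, stride)
        grid = np.linspace(0, length - 1, out_len)
        for s, p in zip(starts, patches):
            resampled = np.interp(grid, np.arange(length), p)
            rows.append((int(s), length, resampled))  # keep position and scale
    return rows


series = np.sin(np.linspace(0, 4 * np.pi, 16))
rows = multi_scale_patches(series)
print(len(rows), "patches")
```

Keeping the start index and original length with every patch is what allows a per-patch prediction to be mapped back onto the time axis when building an explanation.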

5. Evaluation Strategies

Next, we dive into how to effectively evaluate the generated explanations. The first evaluation approach is application-grounded, involving real humans on real tasks. This evaluation measures how well human-generated explanations can aid other humans in particular tasks, with explanation quality assessed in the true context of the explanation’s end tasks. The second evaluation approach is human-grounded, using human evaluation metrics on simplified tasks. The third approach is functionally grounded evaluation, which uses no human subjects. We refer the reader to Figure [4] for an overview of different evaluation strategies.
Figure 4: Evaluation measures for different explanations. These evaluations can be grouped under the broader Processing, Representation, and Explanation-Producing types.
Most of the work surveyed conducts one of the following types of evaluation of their explanations (Gilpin et al., 2018).
  • Completeness compared to the original model: A proxy model can be evaluated directly according to how closely it approximates the original model being explained (Ribeiro et al., 2016).
  • Disabling irrelevant hidden features: (Lertvittayakumjorn et al., 2020) propose a technique to disable the learned features which are irrelevant or harmful to the classification task so as to improve the classifier. This technique and the word clouds form the human-debugging framework.
  • Human evaluation: Humans can evaluate explanations for reasonableness, that is how well an explanation matches human expectations. Human evaluation can also evaluate completeness or substitute-task completeness from the point of view of enabling a person to predict behavior of the original model; or according to helpfulness in revealing model biases to a person (Lai et al., 2019).
  • Completeness as measured on a substitute task: Some explanations do not directly explain a model’s decisions, but rather some other attribute that can be evaluated.
  • Ability to detect models with biases: An explanation that reveals sensitivity to a specific phenomenon (such as a presence of a specific pattern in the input) can be tested for its ability to reveal models with the presence or absence of a relevant bias (such as reliance or ignorance of the specific pattern) (Tong et al., 2020).
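The first criterion in the list above, how closely a proxy approximates the original model, is often summarised as a fidelity score. The sketch below computes the simplest version, label agreement on a sample of inputs; the two models and the data are invented for illustration, and `fidelity` is a hypothetical helper name.

```python
import numpy as np


def fidelity(black_box, surrogate, X):
    """Share of inputs on which the surrogate reproduces the black-box label."""
    return float(np.mean(black_box(X) == surrogate(X)))


def black_box(X):
    # Hypothetical complex model with a non-linear term.
    return (X[:, 0] + 0.3 * X[:, 1] ** 2 > 0).astype(int)


def surrogate(X):
    # Hypothetical interpretable proxy: a single threshold rule.
    return (X[:, 0] > 0).astype(int)


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
score = fidelity(black_box, surrogate, X)
print(f"fidelity: {score:.3f}")
```

The score depends on the distribution from which `X` is drawn, so a proxy that looks faithful on typical inputs can still disagree with the model in regions that matter for a particular user.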

6. Current Limitations and Future Directions

This section gives an overview of the current limitations and future roadmap for XAI research.


We address some of the limitations arising from post-hoc explanations:
  • Robustness: Model-agnostic perturbation-based methods are (unsurprisingly) more prone to instability than their gradient-based counterparts (Alvarez et al., 2018). As a concrete example, consider an image classification model that places importance on both salient aspects of the input, i.e., those actually related to the ground-truth class and on background noise. Suppose, in addition, that those artifacts are not uniformly relevant for different inputs, while the salient aspects are. Should the explanation include the noisy pixels?
  • Adversarial attacks: By being able to differentiate between data points coming from the input distribution and instances generated via perturbation, an adversary can create an adversarial classifier (scaffolding) that behaves like the original classifier (perhaps being extremely discriminatory) on the input data points, but behaves arbitrarily differently (looking unbiased and fair) on the perturbed instances, thus effectively fooling LIME or SHAP into generating innocuous explanations (Slack et al., 2020).
  • Manipulable: Explanation maps can be changed to an arbitrary target map (Dombrowski et al., 2019). This is done by applying a visually hardly perceptible perturbation to the input. This perturbation does not change the output of the neural network, i.e. in addition to the classification result also the vector of all class probabilities is (approximately) the same. This finding is clearly problematic if a user, say a medical doctor, is expecting a robustly interpretable explanation map to rely on in the clinical decision-making process.
  • Unjustified Counterfactual Explanations: (Laugel et al., 2019) propose an intuitive desideratum for more relevant counterfactual explanations, based on ground-truth labelled data, that helps generating better explanations. They design a test to highlight the risk of having undesirable counterfactual examples disturb the generation of counterfactual explanations and apply this test to several datasets and classifiers and show that the risk of generating undesirable counterfactual examples is high.
  • Faithfulness: Explainable ML methods provide explanations that are not faithful to what the original model computes. Explanations must be wrong. They cannot have perfect fidelity with respect to the original model. If the explanation was completely faithful to what the original model computes, the explanation would equal the original model, and one would not need the original model in the first place, only the explanation. An inaccurate (low-fidelity) explanation model limits trust in the explanation, and by extension, trust in the black box that it is trying to explain (Rudin et al., 2019).

Future Directions

XAI is a burgeoning field with many open challenges. We discuss some potential research directions:
  • It is tempting to use model explainability to gain insights into model fairness; however, existing explainability tools do not reliably indicate whether a model is indeed fair (Begley et al., 2020).
  • Estimating the causal effect of the presence or absence of a human-interpretable concept on a deep neural net's predictions (Goyal et al., 2019). Identifying vulnerabilities in existing post-hoc explanations and proposing approaches to address them is also a critical research direction going forward.
  • Rigorous user studies and evaluations that ascertain the utility of different post-hoc explanation methods in various contexts are critical for the progress of the field (Bansal et al., 2020).
  • Exploring post-hoc explanations for complex ML tasks beyond the usual classification settings can be a good starting point for researchers.
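The causal concept effect mentioned in the second bullet above can be written in the familiar average-treatment-effect form. The notation here is our own sketch rather than a quotation from Goyal et al. (2019): f denotes the classifier output, C a binary human-interpretable concept, and do(·) an intervention on the data-generating process.

```latex
\mathrm{CaCE}(C, f) \;=\; \mathbb{E}\big[\, f(x) \mid do(C = 1) \,\big] \;-\; \mathbb{E}\big[\, f(x) \mid do(C = 0) \,\big]
```

The intervention is what separates this quantity from the observational difference E[f(x) | C = 1] - E[f(x) | C = 0], which can be distorted when the concept is correlated with other factors in the data.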

7. Conclusion

In this review, we have presented a comprehensive overview of methods proposed in the literature for explaining decision systems, categorized by data modality and explanation strategy. We stress that the limitations and vulnerabilities of any proposed explanation method must be carefully analyzed and exposed. Wherever applicable, use quantitative metrics (and not anecdotal evidence) to support claims, and have a system in place to receive constant feedback from end-users to improve post-hoc explainability.


1 A couple of recent (up to 2021) interesting talks on explainability:


[1] Rishabh Agarwal and Nicholas Frosst and Xuezhou Zhang and Rich Caruana and Geoffrey E. Hinton (2020). Neural Additive Models: Interpretable Machine Learning with Neural Nets. CoRR, abs/2004.13912.

[2] Trevor Hastie and Robert Tibshirani (1990). Generalized Additive Models. Chapman and Hall/CRC.

[3] Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin (2018). Anchors: High-Precision Model-Agnostic Explanations. In AAAI (pp. 1527–1535). AAAI Press.

[4] Sarah Tan and Matvey Soloviev and Giles Hooker and Martin T. Wells (2020). Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable. In FODS (pp. 23–34). ACM.

[5] Emanuele Albini and Antonio Rago and Pietro Baroni and Francesca Toni (2020). Relation-Based Counterfactual Explanations for Bayesian Network Classifiers. In IJCAI (pp. 451–457).

[6] Pang Wei Koh and Percy Liang (2017). Understanding Black-box Predictions via Influence Functions. In ICML (pp. 1885–1894). PMLR.

[7] Benjamin Letham and Cynthia Rudin and Tyler H. McCormick and David Madigan (2015). Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. CoRR, abs/1511.01644.

[8] Berk Ustun and Alexander Spangher and Yang Liu (2019). Actionable Recourse in Linear Classification. In FAT (pp. 10–19). ACM.

[9] Eric Wallace and Jens Tuyls and Junlin Wang and Sanjay Subramanian and Matt Gardner and Sameer Singh (2019). AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models. In EMNLP/IJCNLP (3) (pp. 7–12). Association for Computational Linguistics.

[10] Jaesong Lee and Joong-Hwi Shin and Jun-Seok Kim (2017). Interactive Visualization and Manipulation of Attention-based Neural Machine Translation. In EMNLP (System Demonstrations) (pp. 121–126). Association for Computational Linguistics.

[11] Shi Feng and Eric Wallace and Alvin Grissom II and Mohit Iyyer and Pedro Rodriguez and Jordan Boyd-Graber (2018). Pathologies of Neural Models Make Interpretations Difficult. In EMNLP (pp. 3719–3728). Association for Computational Linguistics.

[12] Xiaochuang Han and Byron C. Wallace and Yulia Tsvetkov (2020). Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions. In ACL (pp. 5553–5563). Association for Computational Linguistics.

[13] R. Dennis Cook and Sanford Weisberg (1982). Residuals and Influence in Regression. New York: Chapman and Hall.

[14] Karen Simonyan and Andrea Vedaldi and Andrew Zisserman (2014). Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In ICLR (Workshop Poster).

[15] Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In KDD (pp. 1135–1144). ACM.

[16] Amirata Ghorbani and James Y. Zou (2020). Neuron Shapley: Discovering the Responsible Neurons. In NeurIPS.

[17] Dominique Mercier and Andreas Dengel and Sheraz Ahmed (2021). PatchX: Explaining Deep Models by Intelligible Pattern Patches for Time-series Classification. CoRR, abs/2102.05917.

[18] Alan H. Gee and Diego Garcia-Olano and Joydeep Ghosh and David Paydarfar (2019). Explaining Deep Classification of Time-Series Data with Learned Prototypes. CoRR, abs/1904.08935.

[19] Udo Schlegel and Hiba Arnout and Mennatallah El-Assady and Daniela Oelke and Daniel A. Keim (2019). Towards a Rigorous Evaluation of XAI Methods on Time Series. CoRR, abs/1909.07082.

[20] Mara Graziani and Vincent Andrearczyk and Stéphane Marchand-Maillet and Henning Müller (2020). Concept attribution: Explaining CNN decisions to physicians. Comput. Biol. Medicine, 123, 103865.

[21] Oscar Li and Hao Liu and Chaofan Chen and Cynthia Rudin (2017). Deep Learning for Case-based Reasoning through Prototypes: A Neural Network that Explains its Predictions. CoRR, abs/1710.04806.

[22] David Alvarez-Melis and Tommi S. Jaakkola (2018). On the Robustness of Interpretability Methods. CoRR, abs/1806.08049.

[23] Dylan Slack and Sophie Hilgard and Emily Jia and Sameer Singh and Himabindu Lakkaraju (2020). Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. In AIES (pp. 180–186). ACM.

[24] Ann-Kathrin Dombrowski and Maximilian Alber and Christopher J. Anders and Marcel Ackermann and Klaus-Robert Müller and Pan Kessel (2019). Explanations can be manipulated and geometry is to blame. In NeurIPS (pp. 13567–13578).

[25] Leilani H. Gilpin and David Bau and Ben Z. Yuan and Ayesha Bajwa and Michael Specter and Lalana Kagal (2018). Explaining Explanations: An Approach to Evaluating Interpretability of Machine Learning. CoRR, abs/1806.00069.

[26] Vivian Lai and Chenhao Tan (2019). On Human Predictions with Explanations and Predictions of Machine Learning Models: A Case Study on Deception Detection. In FAT (pp. 29–38). ACM.

[27] Schrasing Tong and Lalana Kagal (2020). Investigating Bias in Image Classification using Model Explanations. CoRR, abs/2012.05463.

[28] Piyawat Lertvittayakumjorn and Lucia Specia and Francesca Toni (2020). FIND: Human-in-the-Loop Debugging Deep Text Classifiers. In EMNLP (1) (pp. 332–348). Association for Computational Linguistics.

[29] Thibault Laugel and Marie-Jeanne Lesot and Christophe Marsala and Xavier Renard and Marcin Detyniecki (2019). The Dangers of Post-hoc Interpretability: Unjustified Counterfactual Explanations. In IJCAI (pp. 2801–2807).

[30] Cynthia Rudin (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1, 206–215.

[31] Amina Adadi and Mohammed Berrada (2018). Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access, 6, 52138–52160.

[32] Diogo V. Carvalho and Eduardo M. Pereira and Jaime S. Cardoso (2019). Machine Learning Interpretability: A Survey on Methods and Metrics. Electronics, 8, 832.

[33] Scott M. Lundberg and Su-In Lee (2017). A Unified Approach to Interpreting Model Predictions. In NIPS (pp. 4765–4774).

[34] Riccardo Guidotti and Anna Monreale and Salvatore Ruggieri and Dino Pedreschi and Franco Turini and Fosca Giannotti (2018). Local Rule-Based Explanations of Black Box Decision Systems. CoRR, abs/1805.10820.

[35] Tom Begley and Tobias Schwedes and Christopher Frye and Ilya Feige (2020). Explainability for fair machine learning. CoRR, abs/2010.07389.

[36] Yash Goyal and Uri Shalit and Been Kim (2019). Explaining Classifiers with Causal Concept Effect (CaCE). CoRR, abs/1907.07165.

[37] Sebastian Bach and Alexander Binder and Grégoire Montavon and Frederick Klauschen and Klaus-Robert Müller and Wojciech Samek (2015). On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLOS ONE, 10(7), 1–46.

[38] Mukund Sundararajan and Ankur Taly and Qiqi Yan (2017). Axiomatic Attribution for Deep Networks. In ICML (pp. 3319–3328). PMLR.

[39] Ramaravind Kommiya Mothilal and Amit Sharma and Chenhao Tan (2020). Explaining machine learning classifiers through diverse counterfactual explanations. In FAT* (pp. 607–617). ACM.

[40] David Bau and Bolei Zhou and Aditya Khosla and Aude Oliva and Antonio Torralba (2017). Network Dissection: Quantifying Interpretability of Deep Visual Representations. In CVPR (pp. 3319–3327). IEEE Computer Society.

[41] Sarah Tan and Rich Caruana and Giles Hooker and Paul Koch and Albert Gordo (2019). Learning Global Additive Explanations for Neural Nets Using Model Distillation.

[42] Kaivalya Rawal and Himabindu Lakkaraju (2020). Beyond Individualized Recourse: Interpretable and Interactive Summaries of Actionable Recourses. In NeurIPS.

[43] Rafael Poyiadzi and Kacper Sokol and Raul Santos-Rodriguez and Tijl De Bie and Peter A. Flach (2020). FACE: Feasible and Actionable Counterfactual Explanations. In AIES (pp. 344–350). ACM.

[44] Jacob Bien and Robert Tibshirani (2011). Prototype selection for interpretable classification. The Annals of Applied Statistics, 5(4), 2403–2424.

[45] Garima Pruthi and Frederick Liu and Satyen Kale and Mukund Sundararajan (2020). Estimating Training Data Influence by Tracing Gradient Descent. In NeurIPS.

[46] Jesse Mu and Jacob Andreas (2020). Compositional Explanations of Neurons. In NeurIPS.

[47] Osbert Bastani and Carolyn Kim and Hamsa Bastani (2017). Interpreting Blackbox Models via Model Extraction. CoRR, abs/1705.08504.

[48] Gagan Bansal and Tongshuang Wu and Joyce Zhou and Raymond Fok and Besmira Nushi and Ece Kamar and Marco Tulio Ribeiro and Daniel S. Weld (2020). Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. CoRR, abs/2006.14779.
