MODEL VALIDATION & RISK · CHAPTER 06 / 07

Explainability — SHAP, LIME & Partial Dependence

A model that predicts well is not the same as a model you can account for. When a loan is denied, a tumour flagged, or a transaction blocked, "the gradient-boosted ensemble said so" will not satisfy a customer, an engineer, or a regulator. Shapley values attribute each prediction to its input features, and the attributions, added to the model's average output, sum exactly to the score the model produced.

LEVEL: CORE · READING TIME: ≈ 28 MIN · BUILDS ON: ML 13, STATS 04 · INSTRUMENTS: FORCE PLOT, PDP/ICE, LIME
6.1

Why explainability — trust, debugging, regulation

A high cross-validated score (Chapter 01) tells you a model is accurate on data that looks like your test set. It tells you nothing about why a particular prediction came out the way it did, whether the model leans on a feature it should never have seen, or whether it will hold up when the world shifts under it. Explainability — also called interpretability — is the discipline of answering "why this output?" in terms a human can check. It serves three distinct masters.

Driver | The question it asks | What an explanation buys
Trust | Should a clinician, underwriter, or operator act on this? | A reason the human can sanity-check against domain knowledge before deferring to the model.
Debugging | Why is this prediction wrong / surprising? | Exposes leakage, spurious correlations, and shortcut features (the snow-in-the-background-means-husky failures).
Regulation | Can you justify an adverse decision to the subject and an auditor? | A per-decision record that satisfies a legal right to an explanation.

The regulatory pressure is no longer hypothetical. In the United States, the Equal Credit Opportunity Act and its Regulation B have for decades required lenders to give applicants the specific principal reasons for an adverse credit action; CFPB circulars issued in 2022 and 2023 confirmed that this duty applies to opaque machine-learning models too, so "the algorithm did it" is not a lawful reason. In the EU, the GDPR gives data subjects a right to meaningful information about the logic of automated decisions, and the AI Act (in force from 2024, with high-risk obligations phasing in through 2026–2027) mandates transparency and human oversight for high-risk systems such as credit scoring and medical devices. Explanations are now a compliance artifact, not a research nicety.

A LOAD-BEARING CAVEAT

An explanation is a model of a model, and models can lie. Every method in this chapter is a post-hoc approximation of an opaque function — it tells you what the model appears to do near a point, not the ground truth of the world. Post-hoc explanations can be unstable (small input changes flip them), unfaithful (they describe a surrogate, not the model), and even adversarially manipulable. The honest position, argued forcefully by Rudin (2019), is that for genuinely high-stakes decisions an inherently interpretable model (a sparse linear model, a short rule list, a small tree) is often preferable to a black box with an explanation bolted on. Use post-hoc tools, but never confuse them with understanding.

6.2

Global vs local explanations

Explanations split along one axis above all others: scope. A global explanation describes the model's behaviour over the whole input distribution — "income is the most important feature on average." A local explanation describes one prediction — "this applicant was denied chiefly because of three recent late payments." The two answer different questions and must not be substituted for one another.

Scope | Answers | Methods in this chapter | Typical consumer
Global | What does the model do overall? | permutation importance, PDP | model owner, validator
Local | Why this single prediction? | ICE, LIME, SHAP | end user, regulator, debugger

A feature can be globally unimportant yet decisive for one row, and globally important yet irrelevant for another. Averaging local explanations recovers a global one — this is exactly how SHAP unifies the two scopes (§6.5) — but you cannot run the inference backwards: a single global importance bar does not tell any individual applicant why they were refused. The right-to-explanation laws of §6.1 are fundamentally demands for local explanations.
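The gap between the two scopes can be made concrete with a small sketch. It assumes a hypothetical linear score with made-up feature names (`income`, `debt`, `fraud_flag`) and weights; for a linear model, the local attribution `w_j * (x_j - mean_j)` is the contribution of feature `j` relative to the average row, and averaging its absolute value over rows is one common way to get a global ranking.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
names = ["income", "debt", "fraud_flag"]

# Hypothetical data: two ordinary features plus a rare binary flag (~1% of rows).
X = np.column_stack([
    rng.normal(0.0, 1.0, N),
    rng.normal(0.0, 1.0, N),
    (rng.random(N) < 0.01).astype(float),
])
X[0] = [0.5, 0.2, 1.0]                  # row 0 is a flagged applicant

w = np.array([2.0, -1.5, -6.0])         # hypothetical linear score f(x) = w . x

# Local attribution per row and feature: w_j * (x_j - mean_j).
local = w * (X - X.mean(axis=0))

# One common global summary: the mean absolute local attribution.
global_imp = np.abs(local).mean(axis=0)

print("global ranking (mean |attribution|):")
for j in np.argsort(global_imp)[::-1]:
    print(f"  {names[j]:<10} {global_imp[j]:.3f}")

print("\nlocal attributions for row 0:")
for j in np.argsort(np.abs(local[0]))[::-1]:
    print(f"  {names[j]:<10} {local[0, j]:+.3f}")
```

The flag ranks last globally because it is almost always zero, yet it dominates the explanation for the one applicant who carries it. The global bar chart alone would never reveal that.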

A second, orthogonal axis is model access. Model-agnostic methods (LIME, permutation importance, PDP, KernelSHAP) treat the model as a black box and only call its predict function, so one implementation works for any model. Model-specific methods exploit internal structure for speed or fidelity — TreeSHAP reads the splits of a tree ensemble to compute exact Shapley values in polynomial time; integrated gradients use a neural network's backward pass. Agnostic methods are universal but slow; specific methods are fast but tied to an architecture.

A useful sanity rule: choose the explanation scope to match the decision being made. A board reviewing whether to deploy a fraud model wants a global picture; a customer disputing a blocked card wants a local one. Reporting the wrong scope is a more common error than computing either one incorrectly.

6.3

Permutation importance & PDP/ICE

The cheapest global tool needs nothing but the trained model and a held-out set. Permutation importance asks a blunt question: if I destroy a feature's information by shuffling its column, how much worse does the model get? A feature the model relies on will see its score collapse when scrambled; a feature it ignores will not move the needle.

EQ V6.1 — PERMUTATION IMPORTANCE $$ \mathrm{Imp}_j \;=\; s\big(\hat{f},\, X,\, y\big) \;-\; \frac{1}{K}\sum_{k=1}^{K} s\big(\hat{f},\, X^{(\pi_k, j)},\, y\big) $$
\(s\) is any score where higher is better (\(R^2\), accuracy, AUC); \(X^{(\pi_k, j)}\) is the data with column \(j\) randomly permuted under permutation \(\pi_k\), leaving every other feature and the labels untouched. Importance is the drop in score caused by breaking the link between feature \(j\) and the target, averaged over \(K\) shuffles to tame the randomness. Because it only calls \(\hat{f}\), it is fully model-agnostic and uses the same predict-and-score loop for any estimator.

Two warnings come with it. First, the result depends on the data you score against, so prefer a held-out set: computed on the training set, permutation importance rewards features the model has overfit to. Second, and the one experts always raise, correlated features hide each other's importance. If two columns carry nearly the same information, shuffling one leaves the model propped up by the other, so each looks less important than the pair really is. With strong collinearity, permutation importance under-reports; cluster correlated features and permute the cluster, or reach for Shapley values, which share credit more fairly.
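The collinearity warning can be illustrated with a minimal sketch. Everything here is an assumption made for the demonstration: the data are synthetic, `x1` is a near-copy of `x0`, and the "black box" is a hand-written function that splits its weight evenly across the two copies (roughly what a regularised fit would do). The helper `perm_importance` is hypothetical and permutes a group of columns with one shared row shuffle.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
x0 = rng.normal(0.0, 1.0, N)
x1 = x0 + rng.normal(0.0, 0.05, N)      # near-duplicate of x0
x2 = rng.normal(0.0, 1.0, N)            # independent feature
X = np.column_stack([x0, x1, x2])
y = 3.0 * x0 + 1.0 * x2 + rng.normal(0.0, 0.5, N)

def predict(Xp):
    # Hypothetical black box: splits the x0 signal across the two copies.
    return 1.5 * Xp[:, 0] + 1.5 * Xp[:, 1] + 1.0 * Xp[:, 2]

def r2(Xp):
    resid = y - predict(Xp)
    return 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

def perm_importance(cols, K=20):
    """Mean R^2 drop when the columns in `cols` are shuffled together."""
    base = r2(X)
    drops = []
    for _ in range(K):
        idx = rng.permutation(N)        # one row shuffle shared by the group
        Xs = X.copy()
        Xs[:, cols] = X[idx][:, cols]
        drops.append(base - r2(Xs))
    return float(np.mean(drops))

imp_x0 = perm_importance([0])
imp_x1 = perm_importance([1])
imp_pair = perm_importance([0, 1])

print(f"x0 alone      : {imp_x0:.3f}")
print(f"x1 alone      : {imp_x1:.3f}")
print(f"x0+x1 together: {imp_pair:.3f}")
```

In this setup the two single-column drops add up to less than the drop from permuting the pair, which is the under-reporting described above. Permuting the cluster as a unit recovers the pair's real weight.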

Permutation importance measures the drop in model score when a single feature's column is randomly shuffled (breaking its link to the target) while all other features and the labels are left intact. True or false?
That is exactly EQ V6.1: \(\mathrm{Imp}_j = s(\hat f, X, y) - \tfrac1K\sum_k s(\hat f, X^{(\pi_k,j)}, y)\). The first term is the score on intact data; the second is the score after column \(j\) is permuted. A feature the model leans on causes a large score drop when scrambled; an ignored feature causes none. The statement is true.
```python
# Permutation importance from scratch: rank features by the R^2 drop on shuffle.
import numpy as np
rng = np.random.default_rng(0)

# A model that truly uses x0 strongly, x1 mildly, and ignores x2, x3.
N, d = 400, 4
X = rng.normal(0, 1, (N, d))
w_true = np.array([3.0, 1.0, 0.0, 0.0])
y = X @ w_true + rng.normal(0, 0.5, N)

beta = np.linalg.lstsq(X, y, rcond=None)[0]      # the fitted "black box"
def r2(Xp):
    pred = Xp @ beta
    return 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

base = r2(X)                                      # score on intact data (EQ V6.1, term 1)
print(f"baseline R^2 = {base:.4f}\n")

names, imp = ["x0", "x1", "x2", "x3"], []
for j in range(d):
    drops = []
    for _ in range(10):                           # K = 10 shuffles, average them
        Xs = X.copy()
        Xs[:, j] = rng.permutation(Xs[:, j])      # break feature j <-> target only
        drops.append(base - r2(Xs))               # the score drop = importance
    imp.append(np.mean(drops))

for j in np.argsort(imp)[::-1]:                   # rank: most important first
    print(f"{names[j]}: importance {imp[j]:+.4f}")
print("\nx0 dominates, x1 is mild, x2/x3 ~ 0 -- the model's true reliance, recovered.")

# plot_xy is the in-browser plotting helper; skip the plot where it is not defined.
if "plot_xy" in globals():
    plot_xy(range(d), sorted(imp, reverse=True))
```

Permutation importance ranks features but says nothing about shape: is the effect of income linear, threshold-like, or U-shaped? The partial dependence plot (PDP), introduced by Friedman alongside gradient boosting, answers that. Pick a value \(v\), overwrite feature \(j\) with \(v\) in every row of the data while leaving the other features as they are, average the predictions, and sweep \(v\) across the feature's range:

EQ V6.2 — PARTIAL DEPENDENCE $$ \mathrm{PD}_j(v) \;=\; \mathbb{E}_{X_{-j}}\!\big[\,\hat{f}(v,\, X_{-j})\,\big] \;\approx\; \frac{1}{N}\sum_{i=1}^{N} \hat{f}\big(v,\, x^{(i)}_{-j}\big) $$
\(X_{-j}\) is every feature except \(j\); the expectation marginalizes them out, leaving the average effect of feature \(j\) alone as a curve. The Monte-Carlo estimate just averages the model over the actual dataset with column \(j\) overwritten by \(v\). Its blind spot is the same as permutation importance: by overwriting \(j\) for all rows it can create off-manifold inputs (a pregnant 80-year-old) when \(j\) is correlated with the others, and by averaging it hides heterogeneity — opposite effects on two subgroups cancel to a flat line.

Individual conditional expectation (ICE) curves fix that second flaw by not averaging: plot one line per row, each showing how that single prediction would move as \(j\) sweeps. The PDP is exactly the average of all the ICE lines. When the ICE lines are parallel, the PDP tells the whole story; when they fan out or cross, the feature interacts with others and the average is a lie of omission. PDP for the headline, ICE to check it is honest.
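A minimal sketch of EQ V6.2 and its ICE decomposition follows. The model and data are assumptions chosen to make the failure obvious: a hypothetical `predict` whose slope on `x0` flips sign between two equally sized subgroups. The helper name `ice_curves` is also made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x0 = rng.uniform(-1.0, 1.0, N)
group = np.repeat([-1.0, 1.0], N // 2)   # two balanced subgroups
X = np.column_stack([x0, group])

def predict(Xp):
    # Hypothetical model: the effect of x0 has opposite sign in each subgroup.
    return Xp[:, 0] * Xp[:, 1]

def ice_curves(predict, X, j, grid):
    """One row per instance: its prediction as feature j is overwritten by each grid value."""
    curves = np.empty((X.shape[0], grid.size))
    for k, v in enumerate(grid):
        Xv = X.copy()
        Xv[:, j] = v                     # overwrite column j for every row
        curves[:, k] = predict(Xv)
    return curves

grid = np.linspace(-1.0, 1.0, 21)
ice = ice_curves(predict, X, 0, grid)
pdp = ice.mean(axis=0)                   # the PDP is the average of the ICE lines

pdp_range = pdp.max() - pdp.min()
ice_slopes = (ice[:, -1] - ice[:, 0]) / (grid[-1] - grid[0])

print(f"PDP range (max - min): {pdp_range:.3f}")
print(f"ICE slopes present   : {sorted(set(np.round(ice_slopes, 3)))}")
```

The PDP comes out flat even though every individual prediction depends strongly on `x0`. Only the ICE slopes, half positive and half negative, show that the feature matters.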

INSTRUMENT V6.1: PDP / ICE EXPLORER · AVERAGE EFFECT vs PER-ROW LINES · EQ V6.2 (interactive; readouts: PDP range (max − min), ICE spread at mid, PDP trustworthy?)
The bold mint curve is the PDP — the model's average response as the feature sweeps left → right. The faint grey lines are ICE curves, one per row, and the PDP is literally their average. Set INTERACTION STRENGTH to 0 and the ICE lines stay parallel: the average tells the whole story. Crank it up and the lines fan out and cross — now the flat-looking average is hiding subgroups that move in opposite directions, and the readout flips to "MISLEADING". This is precisely why you never trust a PDP without its ICE.
6.4

LIME — local surrogate models

Global tools blur the individual case. LIME — Local Interpretable Model-agnostic Explanations, Ribeiro et al. (2016) — takes the opposite stance: forget the global function, just explain one prediction by approximating the black box with a simple, interpretable model in a small neighbourhood around that point. The intuition is that any wiggly decision surface looks roughly linear if you zoom in far enough.

The recipe for explaining a single instance \(x\): (1) generate a cloud of perturbed samples around \(x\); (2) ask the black box \(\hat{f}\) for its prediction on each; (3) weight each sample by how close it is to \(x\) with a kernel \(\pi_x\); (4) fit a sparse linear model \(g\) to that weighted, labelled cloud. The coefficients of \(g\) are the explanation: a signed weight per feature, valid only near \(x\).

EQ V6.3 — THE LIME OBJECTIVE $$ \xi(x) \;=\; \underset{g \in G}{\arg\min}\; \underbrace{\mathcal{L}\big(\hat{f},\, g,\, \pi_x\big)}_{\text{local fidelity}} \;+\; \underbrace{\Omega(g)}_{\text{simplicity}}, \qquad \pi_x(z) = \exp\!\left(\frac{-D(x,z)^2}{\sigma^2}\right) $$
\(G\) is a family of interpretable models (sparse linear, short trees). \(\mathcal{L}\) penalizes \(g\) for disagreeing with \(\hat{f}\) on samples \(z\), each weighted by proximity \(\pi_x(z)\); \(\Omega\) penalizes complexity (e.g. number of nonzero weights). The result is the simplest surrogate that is faithful to the black box right around \(x\) — explicitly trading global accuracy for local interpretability. The neighbourhood width \(\sigma\) is a free knob, and that is exactly LIME's weak spot: the explanation can swing with the kernel width and with the random sample, so two runs can disagree.
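The four-step recipe and the kernel in EQ V6.3 can be sketched in a few lines. This is a simplified illustration, not the reference LIME implementation: the black box is a hypothetical curved function, the perturbations are Gaussian with a spread tied to \(\sigma\) (an assumption made here), and the sparsity penalty \(\Omega\) is left out, so the surrogate is a plain weighted least-squares line.

```python
import numpy as np

def black_box(Z):
    # Hypothetical opaque model, curved in both features.
    return np.sin(Z[:, 0]) + Z[:, 1] ** 2

def lime_explain(f, x, sigma, n_samples=5000, seed=0):
    """Local linear surrogate around x; returns one slope per feature."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(0.0, sigma, (n_samples, x.size))   # (1) perturb around x
    y = f(Z)                                              # (2) query the black box
    D2 = ((Z - x) ** 2).sum(axis=1)
    w = np.exp(-D2 / sigma ** 2)                          # (3) proximity kernel pi_x
    A = np.column_stack([np.ones(n_samples), Z - x])      # intercept + centred features
    sw = np.sqrt(w)                                       # (4) weighted least squares
    coef = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)[0]
    return coef[1:]

x = np.array([0.0, 1.0])                 # the instance to explain
narrow = lime_explain(black_box, x, sigma=0.3)
wide = lime_explain(black_box, x, sigma=2.0)

print(f"sigma = 0.3 -> weights {np.round(narrow, 2)}")
print(f"sigma = 2.0 -> weights {np.round(wide, 2)}")
```

With the narrow kernel the weights land close to the true local gradient at \(x\), which is \((\cos 0,\; 2 \cdot 1) = (1, 2)\). With the wide kernel the weight on the first feature shrinks noticeably, because the line now has to span several bends of the sine: the same instance, a different explanation.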

LIME's appeal is that it is genuinely model-agnostic and produces a human-readable handful of "because feature X was high and feature Y was low" reasons. Its documented failure modes are equally real: the explanations can be unstable (re-running with a new random seed or a different bandwidth perturbs the weights), the linear surrogate can be a poor fit where the surface is sharply curved, and the choice of neighbourhood is more art than science. SHAP can be seen as the principled answer to "how should I have weighted those samples?" — which is the bridge to §6.5.

INSTRUMENT V6.2: LIME LOCAL SURROGATE · BLACK-BOX BOUNDARY → LOCAL LINEAR FIT · EQ V6.3 (interactive; readouts: local surrogate, local fit (weighted R²), stability over reseeds)
The curved blue line is the black box's true decision boundary; the white dot is the instance we want to explain. Each RESEED draws a fresh cloud of perturbations (sized by σ), weighted by how close they sit to the dot, and fits a straight mint surrogate — LIME's local linear explanation. Shrink σ and the surrogate hugs the curve tightly (high local fidelity); widen it and the line tries to span a curved region and fits badly. Press RESEED a few times at a wide σ and watch the surrogate slope wander: that wobble is LIME's notorious instability, made visible.
6.5

SHAP — Shapley values for features

SHAP (SHapley Additive exPlanations, Lundberg & Lee, 2017) is among the most widely used methods in the field, in large part because it rests on a result the other methods in this chapter lack: a uniqueness theorem. Borrow the Shapley value from cooperative game theory, where it is the provably unique fair way to split a coalition's payout among its players. Cast the prediction as the payout and the features as the players, and you get the only feature-attribution method satisfying a set of common-sense axioms simultaneously.

The Shapley value of feature \(j\) is its average marginal contribution across every possible order in which features could be added to the prediction. "Marginal contribution" means: how much does the model's output change when \(j\) joins a coalition \(S\) of features that are already "present" (set to their instance value) versus "absent" (marginalized to the background)?

EQ V6.4 — THE SHAPLEY VALUE $$ \phi_j \;=\; \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,\big(|F| - |S| - 1\big)!}{|F|!}\;\Big[\, v\big(S \cup \{j\}\big) - v(S) \,\Big] $$
\(F\) is the full feature set, \(S\) any coalition not containing \(j\), and \(v(S)\) the model's expected output when only the features in \(S\) are known. The bracket is \(j\)'s marginal contribution when it joins \(S\); the combinatorial weight is the fraction of orderings in which exactly that coalition precedes \(j\), so \(\phi_j\) is the average marginal contribution over all orderings. It is the unique attribution satisfying efficiency, symmetry, dummy (a feature that never changes \(v\) gets 0), and additivity.
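As a worked instance of EQ V6.4, take just two features, \(F = \{1, 2\}\). Each feature can join either the empty coalition or the coalition holding the other feature, and both weights are \(\tfrac{0!\,1!}{2!} = \tfrac{1!\,0!}{2!} = \tfrac12\), matching the two possible orderings:

$$ \phi_1 \;=\; \tfrac12\big[\,v(\{1\}) - v(\varnothing)\,\big] \;+\; \tfrac12\big[\,v(\{1,2\}) - v(\{2\})\,\big], \qquad \phi_2 \;=\; \tfrac12\big[\,v(\{2\}) - v(\varnothing)\,\big] \;+\; \tfrac12\big[\,v(\{1,2\}) - v(\{1\})\,\big] $$

Adding the two, the \(v(\{1\})\) and \(v(\{2\})\) terms cancel and \(\phi_1 + \phi_2 = v(\{1,2\}) - v(\varnothing)\): the whole gap between the full prediction and the empty-coalition value is distributed, with nothing left over.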

The axiom that matters most for an audit is efficiency (also called local accuracy): the attributions and the base value must add up to exactly the prediction. Nothing is invented, nothing is lost — every unit of "why this number and not the average" is assigned to some feature.

EQ V6.5 — EFFICIENCY: THE EXPLANATION ADDS UP $$ \hat{f}(x) \;=\; \underbrace{\phi_0}_{\text{base value } \mathbb{E}[\hat f]} \;+\; \sum_{j=1}^{|F|} \phi_j \qquad\Longleftrightarrow\qquad \sum_{j=1}^{|F|} \phi_j \;=\; \hat{f}(x) - \mathbb{E}[\hat{f}(X)] $$
\(\phi_0\) is the base value — the average prediction over the background, what you would guess knowing nothing about this row. The SHAP values are the signed pushes from that baseline to the actual prediction, and they must sum to the gap \(\hat f(x) - \mathbb{E}[\hat f]\) exactly. This is what turns a SHAP explanation into a literal audit trail: a regulator can check that the reasons sum to the decision. The force plot in Instrument V6.3 is this equation drawn as arrows.

Computing EQ V6.4 exactly costs \(2^{|F|}\) coalition evaluations — fine for a handful of features, hopeless for hundreds. SHAP's practical contribution is fast estimators: KernelSHAP recovers the Shapley values as the solution of a specially weighted linear regression (the principled cousin of LIME), and TreeSHAP computes them exactly for tree ensembles in time polynomial in the tree size — which is why SHAP and gradient boosting (Chapter on boosting) are the default explainability pairing in production. A persistent subtlety experts flag: how you define "feature absent" — marginalizing with the marginal distribution (interventional) versus the conditional (observational) — changes the values when features are correlated, and the two are answering subtly different causal questions.
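To show what an estimator looks like when \(2^{|F|}\) is out of reach, here is a minimal sketch of the simplest one: Monte Carlo sampling over feature orderings. It is neither KernelSHAP nor TreeSHAP. It assumes the interventional definition of "absent" (absent features take their values from a background sample), and the model, weights, and the function name `sampled_shapley` are hypothetical.

```python
import numpy as np

def model(Z):
    # Hypothetical model: linear terms plus one x0*x1 interaction; x3 is unused.
    w = np.array([1.0, -2.0, 0.5, 0.0, 3.0, -1.0])
    return Z @ w + 2.0 * Z[:, 0] * Z[:, 1]

def sampled_shapley(f, x, background, n_perms=200, seed=0):
    """Average marginal contributions over randomly sampled feature orderings."""
    rng = np.random.default_rng(seed)
    d = x.size
    phi = np.zeros(d)
    for _ in range(n_perms):
        Z = background.copy()            # empty coalition: every feature "absent"
        prev = f(Z).mean()               # v(empty) = mean prediction on the background
        for j in rng.permutation(d):
            Z[:, j] = x[j]               # feature j joins the coalition
            cur = f(Z).mean()            # v(S + {j}), averaged over the background
            phi[j] += cur - prev         # marginal contribution in this ordering
            prev = cur
    return phi / n_perms

rng = np.random.default_rng(1)
background = rng.normal(0.0, 1.0, (100, 6))
x = np.ones(6)                           # the instance to explain

phi = sampled_shapley(model, x, background)
base_value = model(background).mean()
prediction = model(x[None, :])[0]

print("phi              :", np.round(phi, 3))
print("sum(phi)         :", round(float(phi.sum()), 6))
print("prediction - base:", round(float(prediction - base_value), 6))
```

Each sampled ordering telescopes from the base value to the prediction, so efficiency holds for the estimate no matter how few orderings are drawn; only the split between interacting features (`x0` and `x1` here) carries sampling noise. The cost is `n_perms * d` batched model calls instead of \(2^{|F|}\) coalition evaluations.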

A model's base (mean) value is \( \mathbb{E}[\hat f] = 0.30 \) and its prediction for one row is \( \hat f(x) = 0.82 \). By the efficiency axiom (EQ V6.5), what must that row's SHAP values sum to?
Efficiency forces the attributions plus the base value to reconstruct the prediction, so the SHAP values sum to \( \hat f(x) - \mathbb{E}[\hat f] = 0.82 - 0.30 = \) 0.52. Whatever the individual feature pushes are, positive and negative, they must total exactly +0.52 — that is the property that makes the explanation an audit trail.
```python
# Exact Shapley values for a tiny 3-feature model -- and the efficiency check.
import numpy as np
from itertools import permutations

# Model: linear part + one pairwise interaction between x0 and x1.
def f(x):
    return 3*x[0] + 2*x[1] - 1*x[2] + 4*x[0]*x[1]

x        = np.array([1.0, 1.0, 1.0])    # the instance we explain
baseline = np.array([0.0, 0.0, 0.0])    # "feature absent" = baseline value

def v(S):                                # coalition value: S use x, rest use baseline
    z = baseline.copy()
    for i in S: z[i] = x[i]
    return f(z)

base_value = v([])                       # phi_0 = f(baseline)
pred       = v([0, 1, 2])                # f(instance)

# Shapley = average marginal contribution over ALL feature orderings (EQ V6.4).
phi = np.zeros(3)
orders = list(permutations(range(3)))
for order in orders:
    seen = []
    for i in order:
        before = v(seen); seen = seen + [i]
        phi[i] += v(seen) - before       # marginal contribution of i in this order
phi /= len(orders)

print(f"base value phi_0      : {base_value:.1f}")
print(f"shapley values phi    : {phi}")     # -> [ 5.  4. -1.]
print(f"sum of shapley values : {phi.sum():.1f}")
print(f"prediction - base     : {pred - base_value:.1f}")
print(f"efficiency holds?     : {np.isclose(phi.sum(), pred - base_value)}")
print("the 4*x0*x1 interaction is split evenly: +2 to x0, +2 to x1 (symmetry).")
```
INSTRUMENT V6.3: SHAP FORCE PLOT · FEATURE PUSHES FROM BASE → PREDICTION · EQ V6.5 (interactive; readouts: base value E[f], predicted approval, Σ φ = pred − base?)
A toy loan-approval score (additive log-odds, so contributions are exact Shapley values). The plot is EQ V6.5 drawn as forces: every prediction starts at the base value — the average approval probability — and each feature pushes it right (mint, toward approval) or left (red, toward denial) by its SHAP value. The arrows always land exactly on the prediction, and the bottom readout confirms Σφ = pred − base to the decimal. Drag RECENT LATE PAYMENTS up and watch a single red arrow grow until it alone flips the decision — that red bar is the principal adverse-action reason §6.1's regulations demand.
NEXT

Explanations make a model legible; governance makes it accountable. Knowing why a prediction happened is one pillar of model risk — but a deployed model also needs versioning, reproducible pipelines, monitoring against the drift of Chapter 05, audit logs, and a human chain of responsibility. Chapter 07 assembles those pieces into MLOps and governance: how to ship, watch, and answer for a model in production once the math is done.

6.R

References

  1. Lundberg, S. M. & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS 2017 — the SHAP framework and its uniqueness theorem (§6.5, EQ V6.4–V6.5).
  2. Ribeiro, M. T., Singh, S. & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. KDD 2016 — LIME, local surrogate explanations (§6.4, EQ V6.3).
  3. Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29(5) — partial dependence plots (§6.3, EQ V6.2).
  4. Lundberg, S. M., Erion, G. G. & Lee, S.-I. (2018). Consistent Individualized Feature Attribution for Tree Ensembles. arXiv — TreeSHAP, exact polynomial-time Shapley values for trees (§6.5).
  5. Goldstein, A., Kapelner, A., Bleich, J. & Pitkin, E. (2015). Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation. J. Computational and Graphical Statistics 24(1) — ICE curves (§6.3).
  6. Shapley, L. S. (1953). A Value for n-Person Games. Contributions to the Theory of Games II — the original Shapley value from cooperative game theory.
  7. Rudin, C. (2019). Stop Explaining Black Box Machine Learning Models for High-Stakes Decisions and Use Interpretable Models Instead. Nature Machine Intelligence 1 — the case for inherently interpretable models (§6.1 caveat).
  8. Molnar, C. (2022). Interpretable Machine Learning (2nd ed.). Open textbook — the standard practical reference covering every method in this chapter.