Why explainability — trust, debugging, regulation
A high cross-validated score (Chapter 01) tells you a model is accurate on data that looks like your test set. It tells you nothing about why a particular prediction came out the way it did, whether the model leans on a feature it should never have seen, or whether it will hold up when the world shifts under it. Explainability — also called interpretability — is the discipline of answering "why this output?" in terms a human can check. It serves three distinct masters.
| Driver | The question it asks | What an explanation buys |
|---|---|---|
| Trust | Should a clinician, underwriter, or operator act on this? | A reason the human can sanity-check against domain knowledge before deferring to the model. |
| Debugging | Why is this prediction wrong / surprising? | Exposes leakage, spurious correlations, and shortcut features — the snow-in-the-background-means-husky failures. |
| Regulation | Can you justify an adverse decision to the subject and an auditor? | A per-decision record that satisfies a legal right to an explanation. |
The regulatory pressure is no longer hypothetical. In the United States, the Equal Credit Opportunity Act and its Regulation B have for decades required lenders to give applicants the specific principal reasons for an adverse credit action; CFPB circulars issued in 2022 and 2023 confirmed that this duty applies to opaque machine-learning models too: "the algorithm did it" is not a lawful reason. In the EU, the GDPR gives data subjects a right to meaningful information about the logic involved in automated decisions, and the AI Act (in force from August 2024, with high-risk obligations phasing in through 2026–2027) mandates transparency and human oversight for high-risk systems such as credit scoring and medical devices. Explanations are now a compliance artifact, not a research nicety.
An explanation is a model of a model, and models can lie. Every method in this chapter is a post-hoc approximation of an opaque function: it tells you what the model appears to do, near one point or on average over the data, not the ground truth of the world. Post-hoc explanations can be unstable (small input changes flip them), unfaithful (they describe a surrogate, not the model), and even adversarially manipulable. The honest position, argued forcefully by Rudin (2019), is that for genuinely high-stakes decisions an inherently interpretable model (a sparse linear model, a short rule list, a small tree) is often preferable to a black box with an explanation bolted on. Use post-hoc tools, but never confuse them with understanding.
Global vs local explanations
Explanations split along one axis above all others: scope. A global explanation describes the model's behaviour over the whole input distribution — "income is the most important feature on average." A local explanation describes one prediction — "this applicant was denied chiefly because of three recent late payments." The two answer different questions and must not be substituted for one another.
| Scope | Answers | Methods in this chapter | Typical consumer |
|---|---|---|---|
| Global | What does the model do overall? | permutation importance, PDP | model owner, validator |
| Local | Why this single prediction? | ICE, LIME, SHAP | end user, regulator, debugger |
A feature can be globally unimportant yet decisive for one row, and globally important yet irrelevant for another. Averaging the magnitudes of local explanations recovers a global one (signed values would cancel each other out), which is how SHAP unifies the two scopes (§6.5). But you cannot run the inference backwards: a single global importance bar does not tell any individual applicant why they were refused. The right-to-explanation laws of §6.1 are fundamentally demands for local explanations.
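To make the scope gap concrete, here is a minimal sketch. Everything in it is an illustrative assumption: the two feature names, the weights of the linear score, and the simple local attribution `w_j * (x_j - mean_j)`, which is a convenient stand-in for the local methods discussed later rather than any one of them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical features: income varies for everyone, rare_flag is set for ~0.1% of rows.
income = rng.normal(0.0, 1.0, n)
rare_flag = (rng.random(n) < 0.001).astype(float)
rare_flag[0] = 1.0                      # make sure row 0 carries the flag
X = np.column_stack([income, rare_flag])

w = np.array([1.0, -8.0])               # assumed weights of a linear score

# Local attribution per row and feature: weight times deviation from the feature mean.
local = w * (X - X.mean(axis=0))

# Global importance: average magnitude of the local attributions.
global_imp = np.abs(local).mean(axis=0)

print("global importance [income, rare_flag]:", np.round(global_imp, 3))
print("local attribution for row 0          :", np.round(local[0], 3))
```

Averaged over all rows the flag barely registers, because it is almost always zero, yet for row 0 it outweighs income by a wide margin. A global chart alone would never surface that reason.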
A second, orthogonal axis is model access. Model-agnostic methods (LIME, permutation importance, PDP, KernelSHAP) treat the model as a black box and only call its predict function, so one implementation works for any model. Model-specific methods exploit internal structure for speed or fidelity — TreeSHAP reads the splits of a tree ensemble to compute exact Shapley values in polynomial time; integrated gradients use a neural network's backward pass. Agnostic methods are universal but slow; specific methods are fast but tied to an architecture.
A useful sanity rule: choose the explanation scope to match the decision being made. A board reviewing whether to deploy a fraud model wants a global picture; a customer disputing a blocked card wants a local one. Reporting the wrong scope is a more common error than computing either one incorrectly.
Permutation importance & PDP/ICE
The cheapest global tool needs nothing but the trained model and a held-out set. Permutation importance asks a blunt question: if I destroy a feature's information by shuffling its column, how much worse does the model get? A feature the model relies on will see its score collapse when scrambled; a feature it ignores will not move the needle.
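In symbols, one standard way to write this is the following, where the notation is assumed here: \(s\) is the scoring function (such as \(R^2\)), \(K\) is the number of shuffles, and \(X^{(j,k)}\) is the held-out matrix \(X\) with column \(j\) shuffled on repetition \(k\).

```latex
\[
I_j \;=\; s(X, y) \;-\; \frac{1}{K} \sum_{k=1}^{K} s\bigl(X^{(j,k)}, y\bigr)
\]
```

The first term is the score on intact data; the second is the average score after shuffling. A value of \(I_j\) near zero means the model did not need feature \(j\) on this data.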
Two warnings come with it. First, the result depends on the data you score against, so prefer a held-out set: on the training set, permutation importance gives credit to features the model has merely overfit to. Second, and the one experts always raise, correlated features split and hide each other's importance. If two columns carry nearly the same information, shuffling one leaves the model partly propped up by the other, so each looks less important than the pair really is. With strong collinearity, permutation importance under-reports; cluster correlated features and permute the cluster, or reach for Shapley values, which share credit among correlated features more evenly.
# Permutation importance from scratch: rank features by the R^2 drop on shuffle.
import numpy as np

rng = np.random.default_rng(0)

# A model that truly uses x0 strongly, x1 mildly, and ignores x2, x3.
N, d = 400, 4
X = rng.normal(0, 1, (N, d))
w_true = np.array([3.0, 1.0, 0.0, 0.0])
y = X @ w_true + rng.normal(0, 0.5, N)
beta = np.linalg.lstsq(X, y, rcond=None)[0]  # the fitted "black box"

def r2(Xp):
    """R^2 of the fitted model on feature matrix Xp against the fixed target y."""
    pred = Xp @ beta
    return 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Scored on the training data to keep the demo short; prefer a held-out set in practice.
base = r2(X)  # score on intact data (EQ V6.1, term 1)
print(f"baseline R^2 = {base:.4f}\n")

names, imp = ["x0", "x1", "x2", "x3"], []
for j in range(d):
    drops = []
    for _ in range(10):  # K = 10 shuffles, average them
        Xs = X.copy()
        Xs[:, j] = rng.permutation(Xs[:, j])  # break feature j <-> target only
        drops.append(base - r2(Xs))  # the score drop = importance
    imp.append(np.mean(drops))

for j in np.argsort(imp)[::-1]:  # rank: most important first
    print(f"{names[j]}: importance {imp[j]:+.4f}")
print("\nx0 dominates, x1 is mild, x2/x3 ~ 0 -- the model's true reliance, recovered.")
# The original page ended with plot_xy(range(d), sorted(imp, reverse=True)), a plotting
# helper that is not defined in this listing; a bar chart of imp serves the same purpose.
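The collinearity warning can be checked with a small experiment. The sketch below is an illustration under assumed settings: the data-generating process, the noise levels, and the helper `perm_drop` are all hypothetical. It builds two near-duplicate columns, fits the same kind of least-squares "black box", and compares shuffling each twin alone with shuffling the pair as a cluster, all scored on a held-out half.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000

# x0 and x1 are noisy copies of one hidden signal z; x2 is an independent weak feature.
z = rng.normal(0, 1, N)
X = np.column_stack([z + rng.normal(0, 0.1, N),
                     z + rng.normal(0, 0.1, N),
                     rng.normal(0, 1, N)])
y = 2.0 * z + 1.0 * X[:, 2] + rng.normal(0, 0.3, N)

# Fit on the first half, score on the held-out second half.
X_tr, y_tr, X_te, y_te = X[:1000], y[:1000], X[1000:], y[1000:]
beta = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]

def r2(y_true, pred):
    return 1 - ((y_true - pred) ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

base = r2(y_te, X_te @ beta)

def perm_drop(cols, K=20):
    """Mean R^2 drop when the listed columns are shuffled together (same row order)."""
    drops = []
    for _ in range(K):
        idx = rng.permutation(len(X_te))
        Xs = X_te.copy()
        Xs[:, cols] = X_te[idx][:, cols]
        drops.append(base - r2(y_te, Xs @ beta))
    return float(np.mean(drops))

d0, d1, d2 = perm_drop([0]), perm_drop([1]), perm_drop([2])
joint = perm_drop([0, 1])

print(f"x0 alone      : {d0:.3f}")
print(f"x1 alone      : {d1:.3f}")
print(f"x2 alone      : {d2:.3f}")
print(f"x0+x1 cluster : {joint:.3f}")
```

In this setup each twin on its own scores in the same range as the weak feature x2, while the cluster drop is larger than the two individual drops combined. That gap is the under-reporting the text warns about.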
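Permutation importance ranks features but says nothing about shape: is the effect of income linear, threshold-like, or U-shaped? The partial dependence plot (PDP), introduced by Friedman with gradient boosting, answers that. Pick a value \(v\) for feature \(j\), set the feature to \(v\) in every row of the data while leaving the other features as they are, average the predictions, and sweep \(v\) across its range:
\[
\mathrm{PD}_j(v) \;=\; \frac{1}{N} \sum_{i=1}^{N} \hat{f}\bigl(v,\, x_{i,-j}\bigr)
\]
Here \(\hat{f}\) is the trained model, \(N\) is the number of rows, and \(x_{i,-j}\) denotes row \(i\)'s values for every feature except \(j\).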
Because the PDP is an average, it can hide effects that differ from row to row. Individual conditional expectation (ICE) curves fix that flaw by not averaging: plot one line per row, each showing how that single prediction would move as \(j\) sweeps. The PDP is exactly the average of all the ICE lines. When the ICE lines are parallel, the PDP tells the whole story; when they fan out or cross, the feature interacts with others and the average is a lie of omission. PDP for the headline, ICE to check it is honest.
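The relationship between the two plots is easy to see numerically. The following is a minimal sketch with a hypothetical black box, `predict`, whose only effect is an interaction between two features; the grid and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = rng.uniform(-1, 1, (N, 2))

def predict(X):
    # Hypothetical model: pure interaction, so the slope in x0 depends on x1.
    return 2.0 * X[:, 0] * X[:, 1]

grid = np.linspace(-1, 1, 21)           # values v swept for feature j = 0

# ICE: one curve per row, obtained by overwriting column 0 with each grid value.
ice = np.empty((N, grid.size))
for g, v in enumerate(grid):
    Xv = X.copy()
    Xv[:, 0] = v
    ice[:, g] = predict(Xv)

# PDP: the average of the ICE curves at each grid value.
pdp = ice.mean(axis=0)

ice_swing = np.abs(ice[:, -1] - ice[:, 0])   # how far each row's curve moves
pdp_swing = abs(pdp[-1] - pdp[0])            # how far the average moves

print(f"PDP swing over the grid     : {pdp_swing:.3f}")
print(f"mean ICE swing over the grid: {ice_swing.mean():.3f}")
print(f"rows with rising curves     : {(ice[:, -1] > ice[:, 0]).sum()} of {N}")
```

The PDP is nearly flat even though almost every individual curve moves substantially, some upward and some downward. This is the crossing pattern that signals an interaction.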
LIME — local surrogate models
Global tools blur the individual case. LIME — Local Interpretable Model-agnostic Explanations, Ribeiro et al. (2016) — takes the opposite stance: forget the global function, just explain one prediction by approximating the black box with a simple, interpretable model in a small neighbourhood around that point. The intuition is that any wiggly decision surface looks roughly linear if you zoom in far enough.
The recipe for explaining a single instance \(x\): (1) generate a cloud of perturbed samples around \(x\); (2) ask the black box \(\hat{f}\) for its prediction on each; (3) weight each sample by how close it is to \(x\) with a kernel \(\pi_x\); (4) fit a sparse linear model \(g\) to that weighted, labelled cloud. The coefficients of \(g\) are the explanation: a signed weight per feature, valid only near \(x\).
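The four steps can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: the black box, the perturbation scale, and the kernel width are assumptions, and the surrogate is fitted with plain weighted least squares rather than the sparse fit the recipe calls for.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    # Hypothetical nonlinear model standing in for f-hat.
    return np.sin(X[:, 0]) + X[:, 1] ** 2

x = np.array([0.0, 1.0])                 # the instance to explain
n_samples, scale, width = 5000, 0.3, 0.5 # assumed settings

# (1) Perturb: sample a cloud of points around x.
Z = x + rng.normal(0.0, scale, (n_samples, x.size))

# (2) Query the black box on every perturbed sample.
fz = black_box(Z)

# (3) Weight each sample by proximity to x with an exponential kernel pi_x.
pi = np.exp(-((Z - x) ** 2).sum(axis=1) / width ** 2)

# (4) Fit a weighted linear surrogate g (intercept plus one slope per feature).
A = np.column_stack([np.ones(n_samples), Z - x])
sw = np.sqrt(pi)
coef = np.linalg.lstsq(A * sw[:, None], fz * sw, rcond=None)[0]
intercept, weights = coef[0], coef[1:]

print("surrogate weights   :", np.round(weights, 3))
print("true local gradient :", np.array([np.cos(x[0]), 2 * x[1]]))
```

Near this point the surrogate's slopes land close to the model's true local gradient, which is the sense in which the coefficients are valid only near \(x\).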
LIME's appeal is that it is genuinely model-agnostic and produces a human-readable handful of "because feature X was high and feature Y was low" reasons. Its documented failure modes are equally real: the explanations can be unstable (re-running with a new random seed or a different bandwidth perturbs the weights), the linear surrogate can be a poor fit where the surface is sharply curved, and the choice of neighbourhood is more art than science. SHAP can be seen as the principled answer to "how should I have weighted those samples?" — which is the bridge to §6.5.
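The dependence on the neighbourhood is easy to reproduce. The sketch below reuses the same simplified weighted least-squares surrogate, under assumed settings: `black_box`, `lime_weights`, the sampling scale, and the two kernel widths are all hypothetical. It explains one point twice, once with a narrow kernel and once with a wide one.

```python
import numpy as np

def black_box(X):
    # Hypothetical model: strongly curved in x0, gently curved in x1.
    return np.sin(3 * X[:, 0]) + X[:, 1] ** 2

def lime_weights(x, width, seed, n_samples=20_000):
    """Slopes of a kernel-weighted linear surrogate fitted around x."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(0.0, 1.0, (n_samples, x.size))       # broad perturbation cloud
    pi = np.exp(-((Z - x) ** 2).sum(axis=1) / width ** 2)   # proximity kernel
    A = np.column_stack([np.ones(n_samples), Z - x])
    sw = np.sqrt(pi)
    coef = np.linalg.lstsq(A * sw[:, None], black_box(Z) * sw, rcond=None)[0]
    return coef[1:]

x = np.array([1.0, 0.0])
narrow = lime_weights(x, width=0.1, seed=0)
wide = lime_weights(x, width=5.0, seed=0)

print("narrow kernel weights:", np.round(narrow, 3))
print("wide kernel weights  :", np.round(wide, 3))
print("true slope in x0     :", round(3 * np.cos(3 * x[0]), 3))
```

With the narrow kernel the weight on x0 tracks the steep local slope; with the wide kernel the oscillation averages out and the same feature looks almost irrelevant. Neither answer is wrong for its neighbourhood, which is why the bandwidth has to be reported alongside the explanation.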
SHAP — Shapley values for features
SHAP (SHapley Additive exPlanations, Lundberg & Lee, 2017) is among the most widely used methods in the field, largely because it rests on a result the other methods in this chapter lack: a uniqueness theorem. Borrow the Shapley value from cooperative game theory, where it is the provably unique way to split a coalition's payout among its players that satisfies a small set of fairness axioms. Cast the prediction as the payout and the features as the players, and you get the only additive feature-attribution method that satisfies local accuracy, missingness, and consistency simultaneously.
The Shapley value of feature \(j\) is its average marginal contribution across every possible order in which features could be added to the prediction. "Marginal contribution" means: how much does the model's output change when \(j\) joins a coalition \(S\) of features that are already "present" (set to their instance values), while every feature outside the coalition is "absent" (marginalized over a background distribution)?
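A standard way to write this definition is the subset form below. The notation is assumed here: \(F\) is the full feature set and \(v(S)\) is the model output when only the features in \(S\) are present.

```latex
\[
\phi_j \;=\; \sum_{S \subseteq F \setminus \{j\}}
\frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}\,
\bigl[\, v(S \cup \{j\}) - v(S) \,\bigr]
\]
```

The factorial weight is the fraction of all \(|F|!\) orderings in which exactly the features of \(S\) arrive before \(j\), so the sum over subsets equals the average over orderings described above.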
The axiom that matters most for an audit is efficiency (also called local accuracy): the attributions and the base value must add up to exactly the prediction. Nothing is invented, nothing is lost — every unit of "why this number and not the average" is assigned to some feature.
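Written out, with \(\phi_0\) denoting the base value (the model output when every feature is absent, a notation assumed here), the efficiency property reads:

```latex
\[
\hat{f}(x) \;=\; \phi_0 \;+\; \sum_{j \in F} \phi_j,
\qquad \phi_0 = v(\emptyset)
\]
```

For an audit this is a checkable identity: sum the reported attributions, add the base value, and the result should reproduce the logged prediction up to floating-point error.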
Computing EQ V6.4 exactly costs \(2^{|F|}\) coalition evaluations — fine for a handful of features, hopeless for hundreds. SHAP's practical contribution is fast estimators: KernelSHAP recovers the Shapley values as the solution of a specially weighted linear regression (the principled cousin of LIME), and TreeSHAP computes them exactly for tree ensembles in time polynomial in the tree size — which is why SHAP and gradient boosting (Chapter on boosting) are the default explainability pairing in production. A persistent subtlety experts flag: how you define "feature absent" — marginalizing with the marginal distribution (interventional) versus the conditional (observational) — changes the values when features are correlated, and the two are answering subtly different causal questions.
# Exact Shapley values for a tiny 3-feature model -- and the efficiency check.
import numpy as np
from itertools import permutations

# Model: linear part + one pairwise interaction between x0 and x1.
def f(x):
    return 3*x[0] + 2*x[1] - 1*x[2] + 4*x[0]*x[1]

x = np.array([1.0, 1.0, 1.0])         # the instance we explain
baseline = np.array([0.0, 0.0, 0.0])  # "feature absent" = baseline value

def v(S):  # coalition value: features in S use x, the rest use the baseline
    z = baseline.copy()
    for i in S:
        z[i] = x[i]
    return f(z)

base_value = v([])      # phi_0 = f(baseline)
pred = v([0, 1, 2])     # f(instance)

# Shapley = average marginal contribution over ALL feature orderings (EQ V6.4).
phi = np.zeros(3)
orders = list(permutations(range(3)))
for order in orders:
    seen = []
    for i in order:
        before = v(seen)
        seen = seen + [i]
        phi[i] += v(seen) - before  # marginal contribution of i in this order
phi /= len(orders)

print(f"base value phi_0 : {base_value:.1f}")
print(f"shapley values phi : {phi}")  # -> [ 5.  4. -1.]
print(f"sum of shapley values : {phi.sum():.1f}")
print(f"prediction - base : {pred - base_value:.1f}")
print(f"efficiency holds? : {np.isclose(phi.sum(), pred - base_value)}")
print("the 4*x0*x1 interaction is split evenly: +2 to x0, +2 to x1 (symmetry).")
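The weighted-regression route mentioned for KernelSHAP can be sketched on the same toy model. This is an illustration under stated assumptions, not the library's implementation: it enumerates every coalition instead of sampling them, uses the Shapley kernel weight \((M-1) / \bigl[\binom{M}{|z|}\,|z|\,(M-|z|)\bigr]\), and enforces the efficiency constraint by eliminating the last attribution.

```python
import numpy as np
from itertools import product
from math import comb

def f(x):
    return 3*x[0] + 2*x[1] - 1*x[2] + 4*x[0]*x[1]

x = np.array([1.0, 1.0, 1.0])    # the instance we explain
baseline = np.zeros(3)           # "feature absent" = baseline value
M = len(x)

def v(z):
    # Coalition value: features with z_j = 1 take the instance value, others the baseline.
    return f(np.where(np.array(z) == 1, x, baseline))

phi0 = v((0,) * M)               # value of the empty coalition
delta = v((1,) * M) - phi0       # total to distribute (efficiency constraint)

rows, targets, kernel = [], [], []
for z in product([0, 1], repeat=M):
    s = sum(z)
    if s == 0 or s == M:
        continue                 # empty and full coalitions are handled by the constraint
    kernel.append((M - 1) / (comb(M, s) * s * (M - s)))
    # Substitute phi_last = delta - sum(other phis) into the linear model.
    rows.append([z[j] - z[-1] for j in range(M - 1)])
    targets.append(v(z) - phi0 - z[-1] * delta)

A = np.array(rows, dtype=float)
t = np.array(targets, dtype=float)
sw = np.sqrt(np.array(kernel))
head = np.linalg.lstsq(A * sw[:, None], t * sw, rcond=None)[0]
phi = np.append(head, delta - head.sum())

print("regression-based attributions:", np.round(phi, 6))
```

With all coalitions enumerated, the regression returns the same attributions as the ordering-based computation (5, 4 and -1). In practice the coalitions are sampled, so the result becomes an estimate whose accuracy depends on the sampling budget.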
Explanations make a model legible; governance makes it accountable. Knowing why a prediction happened is one pillar of model risk — but a deployed model also needs versioning, reproducible pipelines, monitoring against the drift of Chapter 05, audit logs, and a human chain of responsibility. Chapter 07 assembles those pieces into MLOps and governance: how to ship, watch, and answer for a model in production once the math is done.
References
- Lundberg, S. M. & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions.
- Ribeiro, M. T., Singh, S. & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier.
- Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine.
- Lundberg, S. M., Erion, G. G. & Lee, S.-I. (2018). Consistent Individualized Feature Attribution for Tree Ensembles.
- Goldstein, A., Kapelner, A., Bleich, J. & Pitkin, E. (2015). Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation.
- Shapley, L. S. (1953). A Value for n-Person Games.
- Rudin, C. (2019). Stop Explaining Black Box Machine Learning Models for High-Stakes Decisions and Use Interpretable Models Instead.
- Molnar, C. (2022). Interpretable Machine Learning (2nd ed.).