Bibliography — AI Encyclopedia

All sources

Abadi, M. et al. (2016). TensorFlow: A System for Large-Scale Machine Learning. OSDI 2016 — the computation-graph runtime and distributed execution model behind §2.1. FRAME
Abadi, M. et al. (2016). TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv — the original TensorFlow white paper describing the dataflow-graph design. FRAME
Aggarwal, C. C., Hinneburg, A. & Keim, D. A. (2001). On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT 2001, LNCS 1973, 420–434. Shows that lower-order (fractional, Manhattan) norms concentrate more slowly than Euclidean (§11.5). VOL I
Aghajanyan, A., Zettlemoyer, L. & Gupta, S. (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL 2021 — the empirical low-rank hypothesis that motivates LoRA. OPEN
Ainsworth, S. K., Hayase, J. & Srinivasa, S. (2023). Git Re-Basin: Merging Models modulo Permutation Symmetries. ICLR 2023 — permutation alignment that removes the loss barrier between independently trained models (§6.4). OPEN
Akaike, H. (1974). A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control 19(6) — the Akaike Information Criterion behind automated order selection (EQ T2.9). TIME
Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. KDD 2019 — the TPE sampler plus pruning combination that is the practical default in 2026. MLOPS
Alayrac, J.-B. et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022 — gated cross-attention into a frozen LLM; the canonical cross-attention design (§2.4). MM
Ambroise, C. & McLachlan, G. J. (2002). Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data. PNAS 99(10) — the definitive demonstration of feature-selection bias and why selection must sit inside cross-validation (§4.5). DATA
Anil, C., Durmus, E., Sharma, M. et al. (2024). Many-shot Jailbreaking. Anthropic — long-context in-context attacks that scale with the number of faux-compliant examples. OPEN
Ansel, J. et al. (2024). PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. ASPLOS 2024 — torch.compile and TorchDynamo. FRAME
Arjovsky, M., Chintala, S. & Bottou, L. (2017). Wasserstein GAN. ICML 2017 — replacing JSD with the Wasserstein distance (EQ N6.4–N6.5) and the critic. DL · GAME
Arlot, S. & Celisse, A. (2010). A Survey of Cross-Validation Procedures for Model Selection. Statistics Surveys 4 — the comprehensive modern reference on CV variants and their bias/variance. MLOPS
Artzner, P., Delbaen, F., Eber, J.-M. & Heath, D. (1999). Coherent Measures of Risk. Mathematical Finance 9(3) — the four coherence axioms; why VaR fails subadditivity and ES does not. QUANT
Ashkboos, S. et al. (2024). QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. NeurIPS 2024 — Hadamard rotations that spread weight/activation outliers (§11.4). VOL II
Auer, P., Cesa-Bianchi, N. & Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47 — UCB and the regret framework for principled exploration beyond ε-greedy (§1.5). RL
Axelrod, R. (1984). The Evolution of Cooperation. Basic Books — the round-robin tournaments and the four properties of tit-for-tat (EQ G2.4, §2.3); the founding popular text on repeated cooperation. GAME
Axelrod, R. & Hamilton, W. D. (1981). The Evolution of Cooperation. Science 211(4489) — the peer-reviewed account of the iterated-PD tournaments and the evolutionary stability of tit-for-tat. GAME
Bachelier, L. (1900). Théorie de la spéculation. Ann. Sci. ÉNS 17 — the founding thesis modelling prices as Brownian motion, five years before Einstein. QUANT
Baevski, A., Zhou, H., Mohamed, A. & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Meta — self-supervised audio encoder; the discriminative alternative to generative ASR. MM
Bahdanau, D., Cho, K. & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015 — additive attention (EQ N4.2/N4.3); the birth of the mechanism and the length-decay figure. DL
Bai, J., Lu, F., Zhang, K. et al. (2019). ONNX: Open Neural Network Exchange. onnx.ai — the framework-agnostic graph format and standard operator set at the heart of §3.2. FRAME
Baldi, P. & Hornik, K. (1989). Neural Networks and Principal Component Analysis: Learning from Examples without Local Minima. Neural Networks 2(1) — proves the linear autoencoder optimum spans the top-k PCA subspace (§5.1). DL
Basel Committee on Banking Supervision (2019). Minimum Capital Requirements for Market Risk (FRTB, finalized). BIS d457 — the switch to 97.5% stressed Expected Shortfall with liquidity-horizon scaling (EQ Q6.8). QUANT
Baydin, A. G., Pearlmutter, B. A., Radul, A. A. & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: a Survey. JMLR 18(153) — forward vs reverse mode, the theory behind EQ F1.3. FRAME
Bayes, T. & Price, R. (1763). An Essay towards solving a Problem in the Doctrine of Chances. Phil. Trans. R. Soc. — the original statement of EQ S1.7. STATS
Bellman, R. (1957). A Markovian Decision Process. Journal of Mathematics and Mechanics 6(5) — the formal origin of the MDP (EQ R1.1) and the recursive value relation behind EQ R1.4. RL
Bellman, R. (1957). Dynamic Programming. Princeton University Press — the founding text; the principle of optimality and the recursive value relation behind EQ R2.1–R2.3. RL
Bengio, Y., Louradour, J., Collobert, R. & Weston, J. (2009). Curriculum Learning. ICML 2009 — easy-to-hard example ordering (§4.4). OPEN
Bengio, Y., Simard, P. & Frasconi, P. (1994). Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Transactions on Neural Networks 5(2) — the formal analysis of vanishing/exploding gradients behind EQ N3.2–N3.3. DL
Benjamini, Y. & Hochberg, Y. (1995). Controlling the False Discovery Rate. J. R. Stat. Soc. B 57(1) — FDR control for the many-tests regime of EQ S4.13. STATS
Bergmeir, C. & Benítez, J. M. (2012). On the Use of Cross-Validation for Time Series Predictor Evaluation. Information Sciences 191 — forward-chaining validation for temporal data (§1.4). MLOPS
Bergstra, J. & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. JMLR 13 — the result that random search matches or beats grid search under low effective dimensionality (§2.3). MLOPS
Bertsekas, D. P. (2017). Dynamic Programming and Optimal Control (4th ed.). Athena Scientific — the contraction-mapping convergence analysis (EQ R2.6) in full rigor. RL
Beyer, K., Goldstein, J., Ramakrishnan, R. & Shaft, U. (1999). When Is "Nearest Neighbor" Meaningful? ICDT 1999, LNCS 1540, 217–235. The foundational distance-concentration result behind EQ M11.6. VOL I
Bickel, P. J., Hammel, E. A. & O'Connell, J. W. (1975). Sex Bias in Graduate Admissions: Data from Berkeley. Science 187(4175) — the canonical Simpson's paradox case study. STATS
Bifet, A. & Gavaldà, R. (2007). Learning from Time-Changing Data with Adaptive Windowing (ADWIN). SIAM SDM 2007 — an adaptive-window detector with a formal false-positive bound. MLOPS
Black, F. & Scholes, M. (1973). The Pricing of Options and Corporate Liabilities. Journal of Political Economy 81(3) — the continuous-time closed form the binomial tree converges to as N → ∞ (Quant 03); the lattice is its discrete, fully constructive counterpart. QUANT
Black, F. & Scholes, M. (1973). The Pricing of Options and Corporate Liabilities. Journal of Political Economy 81(3) — the original no-arbitrage derivation and the closed-form formula. QUANT
Black, K., Brown, N., Driess, D., Esmail, A., et al. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence — flow-matching continuous action chunks at high frequency (§6.2, EQ MM6.3). MM
Blitzstein, J. K. & Hwang, J. (2019). Introduction to Probability (2nd ed.). Chapman & Hall / CRC. Harvard Stat 110 — conditioning, Bayes, expectation, LLN; free course materials. STATS
Board of Governors of the Federal Reserve System & OCC (2011). SR 11-7: Guidance on Model Risk Management. Supervisory letter — the model-risk framework and three pillars of §7.5 (EQ V7.5). MLOPS
Bollerslev, T. (1986). Generalized Autoregressive Conditional Heteroskedasticity. Journal of Econometrics 31(3) — adds the lagged-variance term, giving GARCH(1,1) (EQ T4.4), the field's workhorse. TIME
Borsos, Z. et al. (2022). AudioLM: A Language Modeling Approach to Audio Generation. Google — next-audio-token prediction over codec tokens (EQ MM4.8); the audio-LM of §4.4. MM
Boser, B. E., Guyon, I. M. & Vapnik, V. N. (1992). A Training Algorithm for Optimal Margin Classifiers. Proceedings of COLT '92, 144–152. Where the maximum-margin hyperplane and the kernel trick (EQ M10.1, M10.7) were first combined — the true origin of the kernel SVM. VOL I
Box, G. E. P. & Cox, D. R. (1964). An Analysis of Transformations. Journal of the Royal Statistical Society B 26(2) — the Box-Cox power-transform family, EQ D3.7. DATA · TIME
Box, G. E. P., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley — the canonical text; the ACF/PACF identification method (§1.3) and integration order \(d\) (§1.5) are its core. TIME
Boyle, P. (1977). Options: A Monte Carlo Approach. Journal of Financial Economics 4(3) — the paper that brought simulation to option pricing. QUANT
Bradbury, J., Frostig, R., Hawkins, P. et al. (2018). JAX: composable transformations of Python+NumPy programs. github.com/google/jax — jit/grad/vmap/pmap over XLA (EQ F3.5). FRAME
Breck, E., Cai, S., Nielsen, E., Salib, M. & Sculley, D. (2017). The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. IEEE Big Data 2017 — the data/model/infra test layers of §7.3. MLOPS
Breiman, L. (1996). Bagging Predictors. Machine Learning 24(2) — bootstrap aggregating and its variance-reduction argument (EQ M14.2, M14.3). VOL I
Breiman, L. (2001). Random Forests. Machine Learning 45(1) — feature subsampling as the second decorrelation lever; OOB error (§14.2). VOL I
Bricken, T. et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Anthropic Transformer Circuits — the SAE recipe of EQ 13.4–13.5 on a one-layer model. VOL II
Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review 78(1) — the original proper scoring rule for probabilistic forecasts (§3.5). MLOPS
Brigo, D. & Mercurio, F. (2006). Interest Rate Models — Theory and Practice (2nd ed.). Springer Finance — the standard practitioner reference for Vasicek, CIR, Hull–White, G2++, and calibration. QUANT
Broder, A. Z. (1997/1998). On the Resemblance and Containment of Documents & Min-wise Independent Permutations. SEQUENCES / STOC. MinHash estimation of Jaccard similarity at scale (§11.4). VOL I
Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Google DeepMind — action tokenization over a VLM vocabulary and web/robot co-training (§6.2, EQ MM6.2). MM
Brooks, T. et al. (2024). Video Generation Models as World Simulators. OpenAI (Sora) — spacetime latent patches and a diffusion transformer over video, the basis of §3.5 and EQ MM3.6. MM
Brown, B. et al. (2024). Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. Coverage / pass@N scaling with repeated sampling (EQ 12.1, §12.1–12.2). VOL II
Brown, N. & Sandholm, T. (2019). Superhuman AI for multiplayer poker. Science 365 — Pluribus, self-play in imperfect information. GAME
Bruce, J. et al. (2024). Genie: Generative Interactive Environments. ICML 2024 — latent-action world model learned from unlabelled gameplay video; playable worlds from one prompt (EQ MM5.4). MM
Burges, C. J. C. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2(2), 121–167. The most-cited tutorial — derives the dual (EQ M10.3) and the KKT support-vector conditions step by step. VOL I
Campello, R. J. G. B., Moulavi, D., Zimek, A. & Sander, J. (2015). Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM TKDD 10(1) — HDBSCAN, the variable-density successor that removes DBSCAN's ε knob (§12.3 note). VOL I
Carion, N. et al. (2020). End-to-End Object Detection with Transformers (DETR). ECCV 2020 — detection as set prediction; no anchors or non-max suppression. MM
Carpenter, B. et al. (2017). Stan: A Probabilistic Programming Language. J. Stat. Softw. — Hamiltonian Monte Carlo for the hierarchical models of §5.5. STATS
Casella, G. & Berger, R. L. (1987). Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem. JASA — a careful account of where the two frameworks agree and diverge. STATS
Cawley, G. C. & Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. JMLR 11 — why tuning and reporting on the same folds inflates scores. MLOPS
Cawley, G. C. & Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. JMLR 11 — why tuning on the test set inflates results, and nested cross-validation as the fix (§1.2). DATA
Chang, C.-C. & Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM TIST 2(3), 1–27. The standard implementation behind scikit-learn's SVC, and the reference for practical \((C, \gamma)\) selection (§10.5). VOL I
Chang, H., Zhang, H., Jiang, L., Liu, C. & Freeman, W. T. (2022). MaskGIT: Masked Generative Image Transformer. CVPR 2022 — parallel masked-token decoding, the few-round alternative to autoregression in §3.4. MM
Chaudhry, A., Ranzato, M., Rohrbach, M. & Elhoseiny, M. (2019). Efficient Lifelong Learning with A-GEM. ICLR 2019 — gradient-episodic-memory replay for continual learning (§4.4). OPEN
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 — the interpolation method of EQ D5.3. DATA
Chen, T. & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD 2016 — regularized second-order objective and split gain (EQ M15.6–M15.7). VOL I
Chicco, D. & Jurman, G. (2020). The Advantages of the Matthews Correlation Coefficient (MCC) over F1 and Accuracy. BMC Genomics 21 — the case for MCC on imbalanced binary problems (§5.5). DATA · MLOPS
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP 2014 — introduces the GRU (EQ N3.6) and the encoder–decoder framing carried into Chapter 04. DL
Chollet, F. (2015). Keras. Official documentation — the high-level layers/Sequential/Functional API of §2.2 and the current Keras 3 multi-backend design. FRAME
Chollet, F. (2021). Deep Learning with Python (2nd ed.). Manning — the canonical Keras text by its author; covers layers, fit, callbacks, and custom training loops. FRAME
Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S. & Amodei, D. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS — the original preference-to-reward pipeline and the Bradley–Terry reward model (EQ R6.2–R6.3) that RLHF scaled to language. RL
Christoffersen, P. F. (1998). Evaluating Interval Forecasts. International Economic Review 39(4) — the conditional-coverage / independence test that complements Kupiec. QUANT
Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. NIPS 2014 Deep Learning Workshop — the LSTM-vs-GRU comparison underpinning the "no universal winner" claim (§3.4). DL
Clauset, A., Shalizi, C. R. & Newman, M. E. J. (2009). Power-Law Distributions in Empirical Data. SIAM Review 51(4), 661–703. The methodological reference on fitting and — crucially — testing power-law tails against alternatives. STATS
Cobbe, K. et al. (2021). Training Verifiers to Solve Math Word Problems (GSM8K). OpenAI — outcome verifiers + best-of-N selection on math (§12.2–12.3). VOL II
Cont, R. (2001). Empirical Properties of Asset Returns: Stylized Facts and Statistical Issues. Quantitative Finance 1(2), 223–236. A careful survey of the fat-tail evidence and why pinning down the tail exponent is genuinely hard. STATS
Cortes, C. & Vapnik, V. (1995). Support-Vector Networks. Machine Learning 20(3), 273–297. The paper that introduced the soft margin and slack variables (EQ M10.5) and gave the method its modern form — the canonical primary source for this chapter. VOL I
Cover, T. & Hart, P. (1967). Nearest Neighbor Pattern Classification. IEEE Trans. Information Theory 13(1), 21–27. Why the distance choice is the model: the founding analysis of k-NN. VOL I
Cover, T. M. & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley — the standard graduate text; entropy, mutual information, source coding, and their inequalities. STATS
Cox, J. C., Ingersoll, J. E. & Ross, S. A. (1985). A Theory of the Term Structure of Interest Rates. Econometrica 53(2) — the square-root diffusion, the Feller condition, and non-negative rates (EQ Q4.7–Q4.9). QUANT
Cox, J. C., Ross, S. A. & Rubinstein, M. (1979). Option Pricing: A Simplified Approach. Journal of Financial Economics 7(3) — the original recombining binomial lattice, the CRR parameters of EQ Q2.4, and the convergence to Black–Scholes. QUANT
Dao, T. & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML 2024 — Mamba-2 and the SSD duality of EQ 11.4. VOL II
Davis, J. & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. ICML 2006 — why PR-AUC, not ROC-AUC, is the metric to trust under imbalance (§5.5). DATA
DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437 — a frontier-class MoE trained at a fraction of typical cost, with unusually open methodology. OPEN
DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv — RLVR with GRPO at scale; reasoning behavior emerging from verifiable rewards (§6.5). RL · VOL II
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. & Harshman, R. (1990). Indexing by Latent Semantic Analysis. JASIS 41(6) — LSA: truncated SVD of a term–document matrix, the §13.5 connection to embeddings. VOL I
Défossez, A. et al. (2024). Moshi: A Speech-Text Foundation Model for Real-Time Dialogue. Kyutai — full-duplex spoken dialogue; the streaming / low-latency direction of §4.5. MM
Défossez, A., Copet, J., Synnaeve, G. & Adi, Y. (2022). High Fidelity Neural Audio Compression. Meta — EnCodec; residual-vector-quantized neural codec (EQ MM4.4) that turns audio into discrete tokens. MM
Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B 39(1) — the Expectation–Maximization algorithm behind GMM fitting (§12.4, EQ M12.5). VOL I
Dettmers, T., Lewis, M., Belkada, Y. & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022 — outlier-aware 8-bit quantization; why a few features must be preserved. OPEN
Dettmers, T., Pagnoni, A., Holtzman, A. & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023 — 4-bit NF4 base weights plus LoRA adapters; fine-tuning a 65B model on a single 48 GB GPU. OPEN
Dhariwal, P. & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. NeurIPS 2021 — the result marking diffusion's displacement of GANs for large-scale image generation (§6.5). DL
Dickey, D. A. & Fuller, W. A. (1979). Distribution of the Estimators for Autoregressive Time Series with a Unit Root. JASA 74(366) — the Dickey-Fuller unit-root test, the basis of the later augmented (ADF) variant, that decides how many differences \(d\) a series needs. TIME
Dickey, D. A. & Fuller, W. A. (1979). Distribution of the Estimators for Autoregressive Time Series with a Unit Root. JASA 74(366) — the unit-root test that decides random walk vs stationary (§1.4). TIME
Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. Multiple Classifier Systems, LNCS 1857 — the canonical survey of why and when ensembles help (§14.1). VOL I
Domingos, P. & Pazzani, M. (1997). On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning 29 — explains why a model with violated independence assumptions still classifies well (the paradox of §9.5). VOL I
Dosovitskiy, A. et al. (2021). An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR 2021 — the Vision Transformer, the chief modern challenger to the convolutional prior. DL · MM
Easley, D. & Kleinberg, J. (2010). Networks, Crowds, and Markets. Cambridge University Press (Ch. 6) — an accessible, freely available treatment of best response, dominant strategies, and equilibrium used to frame this chapter. GAME
Eckart, C. & Young, G. (1936). The Approximation of One Matrix by Another of Lower Rank. Psychometrika 1(3), 211–218. The optimal low-rank approximation theorem (EQ S6.7). STATS · VOL I
Efron, B. & Morris, C. (1975). Data Analysis Using Stein's Estimator and Its Generalizations. JASA / Ann. Statist. — shrinkage and partial pooling as empirical Bayes. STATS
Einstein, A. (1905). Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung…. Ann. Phys. 322 — the physical derivation that variance grows linearly in time. QUANT
Elhage, N. et al. (2022). Toy Models of Superposition. Anthropic / Transformer Circuits — the superposition hypothesis and feature geometry behind EQ 13.2–13.3. VOL II
Engle, R. F. (1982). Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation. Econometrica 50(4) — the original ARCH model (EQ T4.2); the work cited for Engle's 2003 Nobel Prize. TIME
Engle, R. F. (2002). Dynamic Conditional Correlation: A Simple Class of Multivariate GARCH Models. Journal of Business & Economic Statistics 20(3) — the DCC model bridging to the multivariate chapter. TIME
Engle, R. F. & Granger, C. W. J. (1987). Co-integration and Error Correction: Representation, Estimation, and Testing. Econometrica 55(2) — defines cointegration and the Granger representation theorem linking it to the VECM (EQ T5.7–T5.8). TIME
Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. KDD-96 — the original DBSCAN: core/border/noise points and density-reachability (§12.3, EQ M12.3). VOL I
European Union (2024). Regulation (EU) 2024/1689 — the Artificial Intelligence Act. Official Journal — risk-tiered obligations (risk management, data governance, logging, human oversight) phasing in through 2026–2027. MLOPS
Fawcett, T. (2006). An Introduction to ROC Analysis. Pattern Recognition Letters 27(8) — the canonical tutorial on ROC curves, AUC, and the pair-counting identity (§4.1). MLOPS
Fischer, H. (2011). A History of the Central Limit Theorem: From Classical to Modern Probability Theory. Springer — the full lineage of EQ S2.6, from de Moivre and Laplace to Lindeberg and Lévy. STATS
Frantar, E., Ashkboos, S., Hoefler, T. & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023 — second-order error-aware rounding (§11.4). VOL II · OPEN
Freund, Y. & Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55(1) — AdaBoost and its training-error bound (EQ M14.5). VOL I
Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29(5) — boosting as functional gradient descent (EQ M14.4). VOL I · MLOPS
Friedman, J. W. (1971). A Non-cooperative Equilibrium for Supergames. Review of Economic Studies 38(1) — Grim-Trigger equilibria and an early form of the Folk Theorem behind EQ G2.3 (§2.1). GAME
Friedman, J., Hastie, T. & Tibshirani, R. (2000). Additive Logistic Regression: A Statistical View of Boosting. Annals of Statistics 28(2) — AdaBoost as stagewise exponential-loss minimization (EQ M15.5). VOL I
Fujimoto, S., van Hoof, H. & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. arXiv:1802.09477 — TD3; twin critics, delayed updates, and target smoothing. RL
Gama, J., Medas, P., Castillo, G. & Rodrigues, P. (2004). Learning with Drift Detection (DDM). SBIA 2004, LNCS 3171 — the error-rate drift detector behind EQ V5.4. MLOPS
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M. & Bouchachia, A. (2014). A Survey on Concept Drift Adaptation. ACM Computing Surveys 46(4) — the canonical taxonomy of drift types and adaptation strategies (§5.1). MLOPS
Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997. The naive / advanced / modular RAG taxonomy used here. VOL IV
Gardner, E. S. & McKenzie, E. (1985). Forecasting trends in time series. Management Science 31(10) — the damped-trend method (EQ T3.5), a perennial competition benchmark. TIME
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press — the standard modern reference for conjugacy, hierarchy, and computation. STATS
Geman, S. & Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE TPAMI 6(6) — introduced Gibbs sampling (EQ S7.9) to statistics and image analysis. STATS
Gerganov, G. et al. (2023). llama.cpp — LLM inference in C/C++. The reference local inference engine and the home of the GGUF format. OPEN
Giles, M. & Glasserman, P. (2006). Smoking Adjoints: Fast Monte Carlo Greeks. RISK Magazine — adjoint algorithmic differentiation for computing all Greeks in one reverse pass. QUANT
Glasserman, P. (2003). Monte Carlo Methods in Financial Engineering. Springer — the definitive reference for variance reduction, path simulation, and Greeks. QUANT
Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS — the variance-preserving (Xavier/Glorot) initialization of §1.2. DL
Glosten, L. R., Jagannathan, R. & Runkle, D. E. (1993). On the Relation between the Expected Value and the Volatility of the Nominal Excess Return on Stocks. Journal of Finance 48(5) — the GJR-GARCH asymmetric extension (EQ T4.6). TIME
Gneiting, T. & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. J. American Statistical Association 102(477) — the theory of why log loss and Brier reward honest probabilities (§3.5). MLOPS
Goldstein, A., Kapelner, A., Bleich, J. & Pitkin, E. (2015). Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation. J. Computational and Graphical Statistics 24(1) — ICE curves (§6.3). MLOPS
Golub, G. H. & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press. The canonical numerical reference for the SVD, power iteration, and conditioning. STATS
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press — Ch. 8 (optimization) and Ch. 7 (regularization), the standard textbook treatment. DL
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. (2014). Generative Adversarial Networks. NeurIPS 2014 — the original adversarial game, the optimal discriminator (EQ N6.2), and the JSD reduction (EQ N6.3). DL · GAME
Google. TensorFlow Lite / LiteRT — On-Device Machine Learning. tensorflow.org/lite — the SavedModel→FlatBuffer converter and edge runtime of §3.3. FRAME
Google. TensorFlow Serving. tensorflow.org/tfx — SavedModel hosting with versioned hot-swap, the TF-native serving path. FRAME
Granger, C. W. J. (1969). Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica 37(3) — the original definition of Granger causality (EQ T5.9). TIME
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. IEEE TNNLS 28(10) — systematic ablation of LSTM components, including the value of the forget gate and forget-bias initialization (§3.3). DL
Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B. & Smola, A. (2012). A Kernel Two-Sample Test (MMD). JMLR 13 — the maximum-mean-discrepancy test for multivariate covariate-shift detection (§5.3). MLOPS
Grinsztajn, L., Oyallon, E. & Varoquaux, G. (2022). Why Do Tree-Based Models Still Outperform Deep Learning on Typical Tabular Data? NeurIPS 2022 Datasets & Benchmarks — the contested empirical case behind §15.5's honest comparison. VOL I
Groeneveld, D., Beltagy, I., Walsh, P. et al. (2024). OLMo: Accelerating the Science of Language Models. arXiv:2402.00838 — a genuinely open-source model: weights, data, and training code all released. OPEN
Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752 — the modern selective state-space model reviving gated linear recurrence at scale (§3.4 footnote). DL · VOL II
Gu, A., Goel, K. & Ré, C. (2021). Efficiently Modeling Long Sequences with Structured State Spaces (S4). ICLR 2022 — the structured SSM that started the line (§11.2). VOL II
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. (2017). Improved Training of Wasserstein GANs. NeurIPS 2017 — WGAN-GP: the gradient penalty that replaced weight clipping. DL
Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. ICML 2017 — modern deep networks are systematically over-confident; temperature scaling as a fix (§4.4). MLOPS
Gururangan, S., Marasović, A., Swayamdipta, S. et al. (2020). Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. ACL 2020 — domain- and task-adaptive continued pre-training (DAPT / TAPT). OPEN
Guyon, I. & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3 — the canonical survey of filter, wrapper and embedded selection (§4.3). DATA
Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. (2002). Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning 46 — recursive feature elimination, EQ D4.5. DATA
Ha, D. & Schmidhuber, J. (2018). World Models. NeurIPS 2018 — the foundational demonstration: train an agent inside its own learned dream and transfer to the real environment. MM
Haarnoja, T., Zhou, A., Abbeel, P. & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor. arXiv:1801.01290 — the maximum-entropy objective (EQ R5.5) and the continuous-control default of §5.4. RL
Assran, M. et al. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning — see also Assran et al., I-JEPA (arXiv:2301.08243). Embedding-prediction self-supervision scaled to video as a world model for planning. MM
Hafner, D., Pasukonis, J., Ba, J. & Lillicrap, T. (2023). Mastering Diverse Domains through World Models (DreamerV3). arXiv — one fixed hyperparameter set across 150+ tasks; first to mine diamonds in Minecraft from scratch (EQ MM5.2). MM
Halko, N., Martinsson, P.-G. & Tropp, J. A. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review 53(2) — randomized SVD, how truncated factorizations are actually computed at scale. VOL I
Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press — the graduate-level reference for stationarity, unit roots, and the econometric theory behind §1.2 and §1.4. TIME
Han, H., Wang, W.-Y. & Mao, B.-H. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. ICIC 2005 — synthesizing only near the decision boundary (§5.3). DATA
Hand, D. J. (2009). Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve. Machine Learning 77(1) — the influential critique of AUC and the proposed H-measure. MLOPS
Hanley, J. A. & McNeil, B. J. (1982). The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve. Radiology 143(1) — the AUC = Wilcoxon–Mann–Whitney equivalence (EQ V4.2). MLOPS
Hansen, P. R. & Lunde, A. (2005). A Forecast Comparison of Volatility Models: Does Anything Beat a GARCH(1,1)?. Journal of Applied Econometrics 20(7) — the large horse race finding GARCH(1,1) hard to beat for equities. TIME
Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer — free online; cross-validation, the train/test contract, and the right vs wrong way to cross-validate (§1.2, §1.5). DATA · VOL I · MLOPS
Hastings, W. K. (1970). Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 57(1) — the generalization to asymmetric proposals, completing Metropolis–Hastings. STATS
He, H. & Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21(9) — the canonical survey of resampling, cost-sensitive learning, and evaluation. DATA
He, H., Bai, Y., Garcia, E. A. & Li, S. (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. IJCNN 2008 — density-adaptive synthetic generation (§5.3). DATA
He, K., Chen, X., Xie, S., Li, Y., Dollár, P. & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022 — masked autoencoding as a strong self-supervised pretext for vision transformers (§5.5). DL
He, K., Zhang, X., Ren, S. & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV — He/Kaiming initialization for ReLU networks (EQ N1.5). DL
He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR — residual connections / ResNet (§1.4, EQ N1.7–N1.8). DL · MM
Heath, D., Jarrow, R. & Morton, A. (1992). Bond Pricing and the Term Structure of Interest Rates: A New Methodology. Econometrica 60(1) — the HJM no-arbitrage framework that generalizes all short-rate models to the whole forward curve. QUANT
Henderson, P. et al. (2018). Deep Reinforcement Learning that Matters. AAAI 2018 (arXiv:1709.06560) — the reproducibility and seed-variance study behind §5.5 and EQ R5.6. RL
Heston, S. L. (1993). A Closed-Form Solution for Options with Stochastic Volatility. Review of Financial Studies 6(2) — stochastic-variance model that generates the smile endogenously (EQ Q3.7). QUANT
Higgins, I. et al. (2017). β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2017 — weighting the KL term to encourage disentangled latent factors (§5.4). DL
Hinton, G. E. & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science 313(5786) — deep autoencoders, trained layer-wise, beat PCA at nonlinear dimensionality reduction (§5.1). DL
Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance. NeurIPS 2021 Workshop — the conditional/unconditional extrapolation of EQ MM3.4. MM
Ho, J., Jain, A. & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020 — the noise-prediction objective and forward reparameterization behind EQ MM3.1–MM3.2. MM
Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation 9(8) — introduces the LSTM cell, the constant error carousel, and the gating scheme of EQ N3.4–N3.5. DL
Hoffman, M. D. & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. JMLR 15 — NUTS, the adaptive HMC sampler behind Stan / PyMC / NumPyro (§7.5). STATS
Holt, C. C. (2004, orig. 1957). Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting 20(1) — reprint of the 1957 ONR memorandum that introduced double smoothing (EQ T3.4). TIME
Howard, J. & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification (ULMFiT). ACL 2018 — gradual unfreezing and discriminative fine-tuning (§4.2). OPEN
Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press — introduced policy iteration (EQ R2.7) and the policy improvement theorem. RL
Hu, E. J., Shen, Y., Wallis, P. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022 — the low-rank weight update (EQ OM3.2) at the heart of practical open-model fine-tuning. OPEN
Hubert, L. & Arabie, P. (1985). Comparing Partitions. Journal of Classification 2(1) — the Adjusted Rand Index for chance-corrected external validation (§12.5). VOL I
Huffman, D. A. (1952). A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE 40(9) — the optimal prefix code of §8.4 (EQ S8.6, Instrument S8.3). STATS
Hugging Face. PEFT: Parameter-Efficient Fine-Tuning — Documentation. Official library docs for LoRA/QLoRA/DoRA training and adapter management. OPEN
Hull, J. & White, A. (1990). Pricing Interest-Rate-Derivative Securities. Review of Financial Studies 3(4) — time-dependent drift that fits the initial curve exactly (EQ Q4.10–Q4.11). QUANT
Hull, J. C. (2021). Options, Futures, and Other Derivatives (11th ed.). Pearson — Ch. 13–21: the standard practitioner treatment of binomial trees, risk-neutral valuation, and American-option pricing. QUANT
Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.), Ch. 8. OTexts — the freely available standard textbook treatment of SES, Holt-Winters, and ETS. TIME
Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts (free online) — the modern practitioner's guide; decomposition (§1.1), STL, and Box–Cox (§1.5). TIME
Hyndman, R. J. & Khandakar, Y. (2008). Automatic Time Series Forecasting: The forecast Package for R. Journal of Statistical Software 27(3) — the algorithm behind auto.arima and its AIC-driven stepwise order search (§2.5). TIME
Hyndman, R. J. & Koehler, A. B. (2006). Another Look at Measures of Forecast Accuracy. International Journal of Forecasting 22(4) — the canonical critique of MAPE and the case for scaled error measures (§3.2). MLOPS · TIME
Hyndman, R. J., Koehler, A. B., Ord, J. K. & Snyder, R. D. (2008). Forecasting with Exponential Smoothing: The State Space Approach. Springer — the definitive treatment of the ETS innovations state-space framework (EQ T3.8). TIME
Hyndman, R. J., Koehler, A. B., Snyder, R. D. & Grose, S. (2002). A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting 18(3) — the taxonomy of 30 ETS models and automatic AIC selection (§3.4). TIME
Ilharco, G., Ribeiro, M. T., Wortsman, M. et al. (2022). Editing Models with Task Arithmetic. ICLR 2023 — defines the task vector \(\tau = \theta_{ft} - \theta_{pre}\) and add/negate/analogy arithmetic (§6.2, EQ OM6.2). OPEN
Inan, H., Upasani, K., Chi, J. et al. (2023). Llama Guard: LLM-based Input-Output Safeguarding for Human-AI Conversations. Meta — the open guard-model approach behind the input/output filters of §5.4. OPEN
Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine 2(8) — the conditional-probability argument behind §4.6. STATS
Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML — batch normalization (§1.3, EQ N1.6). DL
Itô, K. (1944). Stochastic Integral. Proc. Imperial Acad. Tokyo 20(8) — the original construction of the Itô integral and the lemma of §1.4. QUANT
J.P. Morgan / Reuters (1996). RiskMetrics — Technical Document (4th ed.). The document that made parametric VaR an industry standard: EWMA covariance, Gaussian quantiles, √-time scaling. QUANT
Jamieson, K. & Talwalkar, A. (2016). Non-stochastic Best Arm Identification and Hyperparameter Optimization. AISTATS 2016 — the successive-halving subroutine behind EQ V2.6. MLOPS
Jamshidian, F. (1989). An Exact Bond Option Formula. Journal of Finance 44(1) — decomposes a swaption into a portfolio of bond options, making Gaussian models swaption-closed-form. QUANT
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press — the objective-Bayesian case for probability as extended logic. STATS
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press — probability as extended logic; the Bayesian reading of §1.1 and §1.3. STATS
Jiang, A. Q., Sablayrolles, A., Mensch, A. et al. (2023). Mistral 7B. arXiv:2310.06825 — an Apache-2.0 dense model that set the efficiency bar for small open models. OPEN
Johansen, S. (1991). Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian Vector Autoregressive Models. Econometrica 59(6) — the maximum-likelihood (trace and max-eigenvalue) tests for cointegration rank (§5.4). TIME
Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft), Ch. 4: Naive Bayes and Sentiment Classification. Stanford — a modern, worked treatment of multinomial NB for text with smoothing. VOL I
Kaelbling, L. P., Littman, M. L. & Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4 — an early, lucid survey of the problem formulation used throughout this chapter. RL
Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. arXiv:2004.04906 — the dual-encoder dense retriever behind modern semantic search. VOL IV
Karras, T., Laine, S. & Aila, T. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks. CVPR 2019 — StyleGAN: the mapping network, AdaIN style modulation (EQ N6.6), and style mixing. DL
Katharopoulos, A., Vyas, A., Pappas, N. & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020 — the reassociation trick of EQ 11.5. VOL II
Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. (2012). Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM TKDD 6(4) — the field's working definition and taxonomy of leakage (§1.3, EQ D1.3). DATA
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. & Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NeurIPS 2017 — histogram binning, leaf-wise growth, GOSS & EFB (EQ M15.8–M15.9). VOL I
Kendall, M. G. (1938). A New Measure of Rank Correlation. Biometrika 30(1–2) — Kendall's τ, EQ S3.5. STATS
Keskar, N. S. et al. (2016). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR 2017 — batch size, flat vs. sharp minima, and the generalization debate. DL
Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T. et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. CoRL 2024 — an open-weight tokenized VLA, the reproducible counterpart to RT-2 (§6.2). MM
Kingma, D. P. & Ba, J. (2014). Adam: A Method for Stochastic Optimization. ICLR 2015 — the first/second-moment adaptive optimizer with bias correction (EQ N7.3). DL
Kingma, D. P. & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR — the variational autoencoder and the ELBO objective (EQ S8.9). STATS · DL
Kirillov, A. et al. (2023). Segment Anything. ICCV 2023 — SAM; promptable, open-vocabulary segmentation (§1.4 frontier). MM
Kirkpatrick, J., Pascanu, R., Rabinowitz, N. et al. (2017). Overcoming Catastrophic Forgetting in Neural Networks. PNAS 114(13) — Elastic Weight Consolidation (EQ OM4.6). OPEN
Kloeden, P. & Platen, E. (1992). Numerical Solution of Stochastic Differential Equations. Springer — Euler–Maruyama and higher-order path-discretization schemes. QUANT
Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. IJCAI 1995 — the empirical case for stratified 10-fold CV (§1.2). MLOPS
Kolmogorov, A. N. (1933). Foundations of the Theory of Probability (Grundbegriffe der Wahrscheinlichkeitsrechnung). The axiomatic foundation of EQ S1.1; measure-theoretic probability. STATS
Koren, Y., Bell, R. & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. IEEE Computer 42(8) — the canonical write-up of the latent-factor model and biases (EQ M13.4–M13.5), from the Netflix-Prize winners. VOL I
Kraskov, A., Stögbauer, H. & Grassberger, P. (2004). Estimating Mutual Information. Physical Review E 69(6) — the k-nearest-neighbour estimator behind practical MI feature scores (EQ D4.7). DATA
Kreuzberger, D., Kühl, N. & Hirschl, S. (2022). Machine Learning Operations (MLOps): Overview, Definition, and Architecture. arXiv:2205.02302 — a current reference architecture for pipelines, CI/CD, and CT. MLOPS
Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 25 — AlexNet; ReLU, dropout, and GPU training that ignited the deep-learning era. DL · MM
Kuhn, M. & Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press (full text online) — encoding, scaling, transforms and leakage-safe resampling. DATA
Kullback, S. & Leibler, R. A. (1951). On Information and Sufficiency. Annals of Mathematical Statistics 22(1) — relative entropy / KL divergence (EQ S8.4). STATS
Kupiec, P. H. (1995). Techniques for Verifying the Accuracy of Risk Measurement Models. Journal of Derivatives 3(2) — the proportion-of-failures likelihood-ratio backtest (EQ Q6.7). QUANT
Kwon, W., Li, Z., Zhuang, S. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023 — the vLLM paper; paged KV cache and continuous batching. OPEN
Kwon, W., Li, Z., Zhuang, S. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention (vLLM). github.com/vllm-project/vllm — paged-attention KV cache and continuous batching for LLM throughput. FRAME
LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview — the JEPA position paper; predict in representation space, not pixel space (EQ MM5.3). MM
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86(11) — LeNet-5; the conv → pool → dense template trained end-to-end by backprop. DL
Lee, D. D. & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401 — NMF and the parts-based decomposition of §13.4 (EQ M13.6). VOL I
Lee, D. D. & Seung, H. S. (2001). Algorithms for Non-negative Matrix Factorization. NIPS 13 — the multiplicative update rules of EQ M13.7 and their convergence guarantee. VOL I
Levy, O. & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. NIPS 27 — proof that word2vec's skip-gram with negative sampling implicitly factorizes a shifted PMI matrix. VOL I
Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401 — the paper that named RAG and trained retriever + generator jointly. VOL IV
Li, J., Li, D., Savarese, S. & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and LLMs. ICML 2023 — the Q-Former, a query-based bridge between frozen vision and frozen language. MM
Li, L. et al. (2020). A System for Massively Parallel Hyperparameter Tuning. MLSys 2020 — ASHA, the asynchronous successive halving used in production tuners. MLOPS
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A. & Talwalkar, A. (2017). Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. JMLR 18 — successive halving across brackets, the bandit view of early stopping (§2.5). MLOPS
Li, Y. et al. (2023). Evaluating Object Hallucination in Large Vision-Language Models (POPE). EMNLP 2023 — the object-hallucination probe behind the §2.5 evaluation caveats. MM
Lieber, O. et al. (2024). Jamba: A Hybrid Transformer-Mamba Language Model. AI21 — a production-scale interleaved Mamba/attention/MoE hybrid (§11.3). VOL II
Lightman, H. et al. (2023). Let's Verify Step by Step. OpenAI — process reward models (PRM) vs outcome supervision (EQ 12.3, §12.3). VOL II
Lillicrap, T. P. et al. (2016). Continuous control with deep reinforcement learning. arXiv:1509.02971 — DDPG, the deterministic actor–critic for continuous actions (§5.4). RL
Lim, B., Arık, S. Ö., Loeff, N. & Pfister, T. (2021). Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting 37(4) — attention, variable selection and quantile outputs (§6.4). TIME
Lin, J., Tang, J., Tang, H. et al. (2024). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024 — protects salient weight channels; widely used for 4-bit serving. OPEN
Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV 2017 (RetinaNet) — focal loss, EQ D5.5. DATA
Little, R. J. A. & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley — the canonical textbook on missingness mechanisms, likelihood-based methods, and multiple imputation. DATA
Liu, H., Hussain, F., Tan, C. L. & Dash, M. (2002). Discretization: An Enabling Technique. Data Mining and Knowledge Discovery 6 — a survey of binning / discretization methods (EQ D3.9). DATA
Liu, H., Li, C., Wu, Q. & Lee, Y. J. (2023). Visual Instruction Tuning (LLaVA). NeurIPS 2023 — projected patches as input tokens (EQ MM2.1) plus LLM-bootstrapped instruction data; the dominant early-fusion recipe. MM
Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172 — the U-shaped position effect that governs context assembly. VOL IV
Liu, S.-Y., Wang, C.-Y., Yin, H. et al. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. ICML 2024 — decouples magnitude from direction for a quality gain over vanilla LoRA. OPEN
Ljung, G. M. & Box, G. E. P. (1978). On a Measure of Lack of Fit in Time Series Models. Biometrika 65(2) — the portmanteau test for "are these residuals white noise?" (§1.4). TIME
Longstaff, F. & Schwartz, E. (2001). Valuing American Options by Simulation: A Simple Least-Squares Approach. Review of Financial Studies 14(1) — least-squares Monte Carlo for early-exercise payoffs. QUANT
Loshchilov, I. & Hutter, F. (2016). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017 — cosine annealing and cyclical warm restarts (EQ N7.5). DL
Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR — AdamW, decoupling weight decay from the adaptive step (EQ N1.10). DL
Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P. & Mordatch, I. (2017). Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NeurIPS — MADDPG and the CTDE gradient of EQ G3.6. GAME
Lundberg, S. M. & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS 2017 — the SHAP framework and its uniqueness theorem (§6.5, EQ V6.4–V6.5). MLOPS
Lundberg, S. M. & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS 30 (SHAP) — the Shapley value applied to machine-learning feature attribution, the §2.5 bridge into Chapter 03. GAME
Lundberg, S. M., Erion, G. G. & Lee, S.-I. (2018). Consistent Individualized Feature Attribution for Tree Ensembles. arXiv — TreeSHAP, exact polynomial-time Shapley values for trees (§6.5). MLOPS
Luo, Y., Yang, Z., Meng, F. et al. (2023). An Empirical Study of Catastrophic Forgetting in LLMs During Continual Fine-tuning. Measures forgetting of general ability across instruction fine-tunes (§4.5). OPEN
Luong, M.-T., Pham, H. & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015 — multiplicative (dot/general) and concat scoring functions, plus global-vs-local attention (EQ N4.4). DL
Lütkepohl, H. (2005). New Introduction to Multiple Time Series Analysis. Springer — the standard graduate reference for VAR estimation, IRFs, FEVD, and the companion form (EQ T5.2–T5.6). TIME
Ma, S. et al. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet b1.58). Microsoft Research — ternary {-1,0,+1} weights trained from scratch (§11.4). VOL II
MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press — free online; the canonical bridge from Shannon to machine learning. STATS
Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR — adversarial training as the robust min-max of EQ G3.8. GAME
Mahalanobis, P. C. (1936, repr. 2018). On the generalised distance in statistics. Sankhyā A, 80(S1), 1–7 — the original covariance-corrected distance (EQ M11.3), reprinted with commentary. VOL I
Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2020). The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting 36(1) — the modern evidence that exponential smoothing remains a top baseline (§3.4). TIME
Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2022). The M5 competition: Background, organization, and implementation. International Journal of Forecasting 38(4) — why gradient-boosted trees, not transformers, won (§6.4 caveat). TIME
Mandelbrot, B. (1963). The Variation of Certain Speculative Prices. Journal of Business 36(4), 394–419. The founding argument that financial returns are heavy-tailed and possibly infinite-variance — the contested claim of §2.5. STATS · TIME
Maron, M. E. (1961). Automatic Indexing: An Experimental Inquiry. Journal of the ACM 8(3) — arguably the first application of naive-Bayes-style probabilistic classification to text. VOL I
Matena, M. & Raffel, C. (2022). Merging Models with Fisher-Weighted Averaging. NeurIPS 2022 — per-parameter precision-weighted merge (§6.3, EQ OM6.5). OPEN
Maynard Smith, J. (1982). Evolution and the Theory of Games. Cambridge University Press — the book-length development of ESS and the replicator perspective (EQ G2.6). GAME
Maynard Smith, J. & Price, G. R. (1973). The Logic of Animal Conflict. Nature 246 — introduces the Evolutionarily Stable Strategy (EQ G2.5) and the Hawk–Dove game (§2.4). GAME
McCallum, A. & Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification. AAAI-98 Workshop — the canonical multinomial vs. Bernoulli comparison (EQ M9.5 and §9.4). VOL I
McCloskey, M. & Cohen, N. J. (1989). Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation — the original diagnosis of forgetting. OPEN
Meng, K., Bau, D., Andonian, A. & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT (ROME). NeurIPS 2022 — causal tracing / activation patching of EQ 13.6, then weight editing. VOL II
Merton, R. C. (1973). Theory of Rational Option Pricing. Bell Journal of Economics and Management Science 4(1) — the no-arbitrage bounds and the proof that an American call on a non-dividend stock is never exercised early (§2.5). QUANT
Merton, R. C. (1973). Theory of Rational Option Pricing. Bell Journal of Economics and Management Science 4(1) — the rigorous continuous-time treatment; later extended with jump-diffusion. QUANT
Meta (2024). Llama 3 Community License Agreement. Primary source — the commercial terms, acceptable-use policy, and 700M-MAU scale clause discussed in §1.4. OPEN
Meta FAIR Diplomacy Team et al. (2022). Human-level play in the game of Diplomacy by combining language models with strategic reasoning (CICERO). Science 378 — mixed-motive multi-agent play with negotiation. GAME
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 21(6) — the original Metropolis acceptance rule (symmetric-proposal special case of EQ S7.8). STATS
Micci-Barreca, D. (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. ACM SIGKDD Explorations 3(1) — the smoothed target/mean encoding of EQ D3.2. DATA
Micikevicius, P. et al. (2017). Mixed Precision Training. ICLR 2018 — fp16 training, the fp32 master copy, and loss scaling (EQ N7.7). DL
Microsoft. ONNX Runtime. onnxruntime.ai — the cross-platform inference engine with pluggable execution providers (CUDA, TensorRT, OpenVINO, CoreML, WebGPU). FRAME
Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature 518 — deep Q-networks; the modern demonstration that ε-greedy exploration (EQ R1.7) scales to high-dimensional state spaces. RL
Mnih, V. et al. (2016). Asynchronous methods for deep reinforcement learning. ICML 2016 — A3C: parallel actor-learners, the n-step advantage and entropy bonus of EQ R4.8 (and the synchronous A2C that followed). RL
Molnar, C. (2022). Interpretable Machine Learning (2nd ed.). Open textbook — the standard practical reference covering every method in this chapter. MLOPS
Muennighoff, N. et al. (2025). s1: Simple Test-Time Scaling. Budget forcing as a decode-time control on thinking length (§12.5). VOL II
Nash, J. F. (1950). Equilibrium Points in n-Person Games. PNAS 36(1), 48–49 — the existence theorem for the equilibrium concept of EQ G1.4 in general finite games. GAME
Nash, J. F. (1951). Non-Cooperative Games. Annals of Mathematics 54(2), 286–295 — the full development of non-cooperative equilibrium, dominance, and a proof via Brouwer's fixed-point theorem. GAME
National Institute of Standards and Technology (2023). AI Risk Management Framework (AI RMF 1.0). NIST AI 100-1 — the Govern/Map/Measure/Manage scaffolding generalizing §7.5. MLOPS
Nelson, D. B. (1991). Conditional Heteroskedasticity in Asset Returns: A New Approach. Econometrica 59(2) — the EGARCH model (EQ T4.7), capturing the leverage effect in log-variance. TIME
Neyman, J. & Pearson, E. S. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. Phil. Trans. R. Soc. A 231 — Type I/II error, power, and the framework of EQ S4.8. STATS
Ng, A. Y. & Jordan, M. I. (2002). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. NeurIPS 14 — the generative/discriminative trade-off and small-data advantage of §9.1. VOL I
Niculescu-Mizil, A. & Caruana, R. (2005). Predicting Good Probabilities With Supervised Learning. ICML 2005 — calibration behavior across model families and the Platt / isotonic fixes (§4.4). MLOPS
Norris, J. R. (1997). Markov Chains. Cambridge University Press — the standard rigorous treatment of transition matrices, stationarity, ergodicity, and reversibility. STATS
Northcutt, C. G., Athalye, A. & Mueller, J. (2021). Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. NeurIPS Datasets & Benchmarks — measured label-error rates in ImageNet and nine other canonical test sets (§1.1). DATA
NVIDIA. Triton Inference Server. developer.nvidia.com — multi-framework serving with concurrent execution and dynamic batching (§3.4). FRAME
Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. (2019). Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations. Science 366(6464) — a real-world label / proxy that leaked the wrong target into a deployed model (§1.1, §1.4). DATA
Øksendal, B. (2003). Stochastic Differential Equations: An Introduction with Applications (6th ed.). Springer — the standard graduate text on Itô calculus and SDEs. QUANT
Olsson, C. et al. (2022). In-context Learning and Induction Heads. Anthropic / Transformer Circuits — the induction-head circuit and its link to in-context learning (§13.5). VOL II
Open Science Collaboration (2015). Estimating the Reproducibility of Psychological Science. Science 349(6251) — the large-scale replication study that crystallized the crisis. STATS
Open Source Initiative (2024). The Open Source AI Definition 1.0. Official text — the bar separating open-source AI from merely open-weight releases. OPEN
Open X-Embodiment Collaboration (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. ICRA 2024 — pooling ~1M trajectories across 22 embodiments; the cross-embodiment data effort (§6.5). MM
Osborne, M. J. & Rubinstein, A. (1994). A Course in Game Theory. MIT Press — standard graduate reference for dominance, IESDS, Nash equilibrium, and mixed strategies as presented in §§1.2–1.5. GAME
Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS — the three-stage SFT → reward model → PPO RLHF recipe behind ChatGPT; source of the KL-regularized objective (EQ R6.5). RL
OWASP Foundation (2025). OWASP Top 10 for LLM Applications. The canonical defender's checklist — prompt injection (LLM01) and the system-level controls of §5.4–5.5. OPEN
Page, L., Brin, S., Motwani, R. & Winograd, T. (1999). The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab — the damped random-walk Markov chain whose stationary distribution is PageRank (§7.2). STATS
Pascanu, R., Mikolov, T. & Bengio, Y. (2013). On the Difficulty of Training Recurrent Neural Networks. ICML 2013 — the spectral-norm view of the gradient product and the gradient-clipping remedy for explosion (§3.2). DL
Paszke, A., Gross, S., Massa, F. et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 32 — the system paper for define-by-run autograd. FRAME
Pearl, J. (1995). Causal Diagrams for Empirical Research. Biometrika 82(4) — the foundational presentation of the backdoor criterion. STATS
Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press — DAGs, the do-operator and the backdoor criterion (EQ S3.7). STATS
Pearson, K. (1895). Note on Regression and Inheritance in the Case of Two Parents. Proceedings of the Royal Society of London 58 — the product-moment correlation coefficient (EQ S3.3). STATS
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572 — the origin of principal-component analysis as projection onto a best-fit subspace (§6.2, §6.5). STATS
Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12 — the reference implementations of StandardScaler, MinMaxScaler, RobustScaler and PowerTransformer. DATA
Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023 — DiT, the transformer backbone that replaced the U-Net and scaled to SD3 and Sora. MM
Perez, E., Huang, S., Song, F. et al. (2022). Red Teaming Language Models with Language Models. EMNLP 2022 — using one LM to automatically generate test cases that surface harms in another, at scale. OPEN
Powers, D. M. W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies 2(1) — the definitive survey of confusion-matrix metrics, their biases, and what each one really measures (§3.3–3.4). MLOPS
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. (2018). CatBoost: Unbiased Boosting with Categorical Features. NeurIPS 2018 — ordered boosting and ordered target statistics (EQ M15.10). VOL I
Puterman, M. L. & Shin, M. C. (1978). Modified Policy Iteration Algorithms for Discounted Markov Decision Problems. Management Science 24(11) — the m-sweep interpolation between value and policy iteration cited in §2.4. RL
PyTorch Team. torch.export & ExecuTorch. docs.pytorch.org — ahead-of-time graph capture (ExportedProgram) and the edge runtime succeeding TorchScript. FRAME
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A. & Lawrence, N. (eds.) (2009). Dataset Shift in Machine Learning. MIT Press — the reference volume formalizing covariate, prior, and concept shift (EQ V5.1). MLOPS
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A. & Lawrence, N. D. (2009). Dataset Shift in Machine Learning. MIT Press — covariate shift, label shift, and concept drift formalized (§1.4, EQ D1.4). DATA
Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE 77(2) — the canonical reference for the forward, Viterbi, and Baum–Welch algorithms (§7.3). STATS
Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021 — CLIP; contrastive image–text pre-training and zero-shot transfer (EQ MM1.4). MM
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C. & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. OpenAI — Whisper; an encoder–decoder transformer trained on 680k hours, the model behind §4.2. MM
Radford, A., Metz, L. & Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR 2016 — DCGAN: the stable convolutional architecture and latent-space vector arithmetic. DL
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D. & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS — DPO; the closed-form optimal policy (EQ R6.6) and the supervised preference loss (EQ R6.7) that skip the reward model and RL loop. RL
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. DALL·E 2 — the CLIP-latent prior plus diffusion decoder ("unCLIP") of §3.3. MM
Ren, Y. et al. (2021). FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. ICLR 2021 — non-autoregressive TTS with explicit duration prediction; the alignment fix for Tacotron. MM
Ribeiro, M. T., Singh, S. & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. KDD 2016 — LIME, local surrogate explanations (§6.4, EQ V6.3). MLOPS
Roberts, G. O., Gelman, A. & Gilks, W. R. (1997). Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms. Ann. Appl. Probab. 7(1) — the ~0.234 optimal acceptance-rate result (§7.4). STATS
Robertson, S. & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. DOI:10.1561/1500000019. The canonical derivation of the BM25 scorer. VOL IV
Rockafellar, R. T. & Uryasev, S. (2000). Optimization of Conditional Value-at-Risk. Journal of Risk 2(3) — CVaR/ES as a convex, optimizable risk measure. QUANT
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022 — the VAE-compressed latent space that makes Stable Diffusion affordable (§5.5). DL · MM
Ross, S., Gordon, G. & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS 2011 — DAgger and the formal account of covariate shift in behavior cloning (§6.4, EQ MM6.5). MM
Rousseeuw, P. J. (1987). Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics 20 — the silhouette score for choosing and validating k (§12.5, EQ M12.6). VOL I
Rubin, D. B. (1976). Inference and Missing Data. Biometrika 63(3):581–592 — the paper that defined MCAR, MAR, and MNAR. DATA
Rubinstein, M. (1994). Implied Binomial Trees. Journal of Finance 49(3). An early reconstruction of the post-1987 volatility smile from market prices. QUANT
Rudin, C. (2019). Stop Explaining Black Box Machine Learning Models for High-Stakes Decisions and Use Interpretable Models Instead. Nature Machine Intelligence 1 — the case for inherently interpretable models (§6.1 caveat). MLOPS
Russakovsky, O. et al. (2015). ImageNet Large Scale Visual Recognition Challenge. IJCV 115(3) — the ImageNet/ILSVRC benchmark that drove the whole progression. MM
Saharia, C. et al. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Imagen — a large frozen T5 text encoder plus a pixel-space super-resolution cascade. MM
Saito, T. & Rehmsmeier, M. (2015). The Precision–Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE 10(3) — the empirical case for PR over ROC under class imbalance (§4.2). MLOPS
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A. & Chen, X. (2016). Improved Techniques for Training GANs. NeurIPS 2016 — minibatch discrimination and feature matching, the classic anti-collapse fixes behind Instrument N6.2. DL
Salinas, D., Flunkert, V., Gasthaus, J. & Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36(3) — the global probabilistic RNN of §6.4. TIME
Santurkar, S., Tsipras, D., Ilyas, A. & Mądry, A. (2018). How Does Batch Normalization Help Optimization? NeurIPS — the loss-smoothing reinterpretation of BatchNorm cited in §1.3. DL
Schölkopf, B. & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press — the definitive textbook treatment of kernels, Mercer's theorem, and the optimization behind EQ M10.3, M10.8. VOL I
Schrittwieser, J. et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero). Nature 2020 — decision-time planning with a learned latent model and MCTS, without being given the rules (EQ MM5.5). MM
Schulman, J., Levine, S., Abbeel, P., Jordan, M. & Moritz, P. (2015). Trust Region Policy Optimization. arXiv:1502.05477 — the KL-constrained trust region PPO approximates with a first-order clip. RL
Schulman, J., Moritz, P., Levine, S., Jordan, M. & Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. ICLR 2016 — GAE, the λ-blended advantage estimator that sets the bias–variance dial between TD and Monte-Carlo (§4.4). RL
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv — PPO's clipped surrogate objective, the step-size fix that made policy gradients robust and the workhorse of RLHF (the §4.5 sequel). RL
Schwarz, G. (1978). Estimating the Dimension of a Model. Annals of Statistics 6(2) — the Bayesian Information Criterion for model (and component-count) selection (§12.5, EQ M12.7). VOL I
scikit-learn developers. Imputation of missing values (User Guide). Official docs — SimpleImputer, KNNImputer, and IterativeImputer (MICE). DATA
Sculley, D. et al. (2015). Hidden Technical Debt in Machine Learning Systems. NeurIPS 2015 — the "ML code is a small box" argument behind §7.1. MLOPS
Shafer, G. & Vovk, V. (2008). A tutorial on conformal prediction. JMLR 9 — distribution-free prediction intervals with finite-sample coverage (§6.3, EQ T6.6). TIME
Shalev-Shwartz, S., Singer, Y., Srebro, N. & Cotter, A. (2011). Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming 127(1), 3–30. The hinge-loss sub-gradient method used in this chapter's first Python cell — how to train a linear SVM at scale. VOL I
Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal 27 — the founding paper: entropy, source coding, and channel capacity (EQ S8.2, S8.6). STATS
Shao, Z. et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv — introduces GRPO; the group-relative advantage (EQ R6.8) that removes PPO's value network. RL
Shapley, L. S. (1953). A Value for n-Person Games. Contributions to the Theory of Games II — the original Shapley value from cooperative game theory. MLOPS
Shapley, L. S. (1953). A Value for n-Person Games. In Contributions to the Theory of Games II, Princeton University Press — defines the Shapley value (EQ G2.7) and its axiomatic characterization (§2.5). GAME
Shen, J. et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. Google — Tacotron 2; the two-stage acoustic-model + vocoder template of §4.3 (EQ MM4.6). MM
Sheng, Y., Cao, S., Li, D. et al. (2023). S-LoRA: Serving Thousands of Concurrent LoRA Adapters. MLSys 2024 — multi-tenant serving of many adapters over one shared base model (§3.5). OPEN
Shreve, S. E. (2004). Stochastic Calculus for Finance I: The Binomial Asset Pricing Model. Springer — a rigorous, self-contained development of replication, the risk-neutral measure, and market completeness (§2.1–§2.2). QUANT
Silver, D. et al. (2017). Mastering the game of Go without human knowledge. Nature 550 — AlphaGo Zero / AlphaZero self-play (EQ G3.4). GAME
Simonyan, K. & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015 — VGG; depth via stacks of \(3\times 3\) convolutions. DL
Simpson, E. H. (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society B 13(2) — the paradox's namesake paper. STATS
Sims, C. A. (1980). Macroeconomics and Reality. Econometrica 48(1) — introduced the VAR as an atheoretical alternative to large structural macro models (EQ T5.1). TIME
Singh, S., Jaakkola, T., Littman, M. L. & Szepesvári, C. (2000). Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. Machine Learning 38 — convergence of SARSA (EQ R3.5) and the GLIE exploration conditions of §3.5. RL
Snell, C., Lee, J., Xu, K. & Kumar, A. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. The compute-optimal allocation and difficulty-dependent strategy of EQ 12.5 / §12.6. VOL II
Snoek, J., Larochelle, H. & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. NeurIPS 2012 — Gaussian-process surrogates and acquisition functions for tuning (§2.4). MLOPS
Spearman, C. (1904). The Proof and Measurement of Association between Two Things. American Journal of Psychology 15(1) — rank correlation, EQ S3.4. STATS
Srivastava, N. et al. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15 — the dropout regularizer and inverted-dropout scaling (EQ N7.6). DL
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15 — dropout regularization (EQ N1.9). DL
Stewart, G. W. (1993). On the early history of the singular value decomposition. SIAM Review, 35(4), 551–566. Historical context tracing the SVD from Beltrami and Jordan to its modern role. STATS
Stiennon, N. et al. (2020). Learning to Summarize from Human Feedback. NeurIPS. The reward-hacking dynamics of over-optimizing a learned reward model (Instrument R6.3, §6.5). RL
Stone, M. (1974). Cross-Validatory Choice and Assessment of Statistical Predictions. J. R. Stat. Soc. B 36(2) — the foundational formalization of cross-validation for model assessment. MLOPS
Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. The standard intuition-first text for the column-space, rank, eigenvalue, and SVD material here. STATS
Student [Gosset, W. S.] (1908). The Probable Error of a Mean. Biometrika 6(1), 1–25. The original derivation of the Student-t distribution (EQ S2.7), written at the Guinness brewery. STATS
Sutskever, I., Vinyals, O. & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. NeurIPS 2014 — the encoder-decoder LSTM framework (EQ N4.1) and the source-reversal trick. DL
Sutton, R. S. (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning 3 — the original TD(0) and TD(λ) prediction methods (EQ R3.2, EQ R3.3) and the bootstrapping idea at the heart of this chapter. RL
Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press — the canonical text; the agent–environment loop, MDPs, returns, value functions, and exploration as framed in this chapter. RL
Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NeurIPS 1999 — the policy gradient theorem (EQ R4.3) and its compatibility with a learned value function, the formal basis of actor-critic. RL
Szegedy, C. et al. (2015). Going Deeper with Convolutions. CVPR 2015 — GoogLeNet / Inception; multi-scale blocks and \(1\times 1\) bottlenecks. DL
Taylor, S. J. & Letham, B. (2018). Forecasting at Scale (Prophet). The American Statistician 72(1) — the decomposable trend + seasonality + holidays model of §6.4 (EQ T6.7). TIME
Templeton, A. et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic Transformer Circuits — production-scale SAEs and feature steering (§13.4, §13.6). VOL II
TensorFlow Team. tf.data: Build TensorFlow input pipelines. Official guide — map / shuffle / batch / prefetch / cache and the overlap of §2.3 (EQ F2.3). FRAME
The PyTorch Team. Automatic Differentiation with torch.autograd. Tutorial on the dynamic graph, .grad accumulation, and zero_grad. FRAME
The PyTorch Team. PyTorch Documentation (stable). Official reference for tensors, autograd, nn, and optim. FRAME
The vLLM Team. vLLM Documentation. Official guide to deployment, quantization, and the OpenAI-compatible server. OPEN
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B 58(1) — the L1 penalty that performs embedded selection (EQ D4.6). DATA
Tishby, N., Pereira, F. C. & Bialek, W. (2000). The Information Bottleneck Method. arXiv. Mutual information as a principle for representation learning (§8.3). STATS
Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W. & Abbeel, P. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS 2017 — the foundational sim-to-real randomization technique (§6.3, EQ MM6.4). MM
Toda, H. Y. & Yamamoto, T. (1995). Statistical Inference in Vector Autoregressions with Possibly Integrated Processes. Journal of Econometrics 66(1–2) — the lag-augmented Granger test valid under unit roots and cointegration (§5.5 caveat). TIME
Touvron, H., Martin, L., Stone, K. et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 — the release (and custom community license) that catalyzed the open-weight ecosystem. OPEN
Troyanskaya, O. et al. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525 — the kNN-impute (KNNimpute) paper. DATA
Tseng, A., Chee, J., Sun, Q., Kuleshov, V. & De Sa, C. (2024). QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. ICML 2024 — incoherence processing + E8 lattice codebooks toward ~2 bits. VOL II
Tversky, A. & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review 90(4) — the Linda problem (§1.5). STATS
Uhlenbeck, G. E. & Ornstein, L. S. (1930). On the Theory of the Brownian Motion. Phys. Rev. 36 — the mean-reverting process of §1.5. QUANT
Valevski, D. et al. (2024). Diffusion Models Are Real-Time Game Engines (GameNGen). arXiv — a neural network simulating DOOM interactively; a video predictor used as a playable environment. MM
van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). CRC Press — the practical, freely-readable reference for MICE / chained equations. DATA
van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. DeepMind — dilated causal convolutions for autoregressive audio (EQ MM4.7); the vocoder breakthrough. MM
van den Oord, A., Vinyals, O. & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning (VQ-VAE). NeurIPS 2017 — discrete codebook latents that let autoregressive models generate over autoencoder tokens (§5.5). DL
van Hasselt, H. (2010). Double Q-learning. NeurIPS 23 — diagnoses and corrects the maximization bias of the \(\max\) operator in EQ R3.4. RL
van Hasselt, H., Guez, A. & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI 2016 (arXiv:1509.06461) — double-DQN, decoupling action selection from evaluation to curb over-estimation. RL
Varma, S. & Simon, R. (2006). Bias in Error Estimation When Using Cross-Validation for Model Selection. BMC Bioinformatics 7:91 — the selection-bias result behind nested CV (EQ V1.6). MLOPS
Vasicek, O. (1977). An Equilibrium Characterization of the Term Structure. Journal of Financial Economics 5(2) — the Ornstein–Uhlenbeck short rate and its closed-form bond (EQ Q4.4–Q4.6). QUANT
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017 — drops recurrence for pure self-attention; the destination of EQ N4.5. DL
Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A. (2008). Extracting and Composing Robust Features with Denoising Autoencoders. ICML 2008 — the denoising autoencoder (EQ N5.2) and the manifold-projection view that seeds diffusion. DL
von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele. Mathematische Annalen 100 — the original minimax theorem for two-player zero-sum games (§1.4). GAME
von Neumann, J. & Morgenstern, O. (1944). Theory of Games and Economic Behavior. Princeton University Press — the founding text; normal-form games (EQ G1.1), the minimax theorem (EQ G1.6), and expected-utility theory. GAME
Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023 — majority / weighted vote over sampled chains (EQ 12.2, §12.2). VOL II
Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association 58(301) — Ward's minimum-variance linkage for agglomerative clustering (§12.2, EQ M12.2). VOL I
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer — the standard modern reference for the distributions, moments, and convergence results in this chapter. STATS
Wasserstein, R. L. & Lazar, N. A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician 70(2) — the profession's own caution on what a p-value is not. STATS
Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning. Machine Learning 8 — the action-value function \(Q^\pi\) (EQ R1.6) and the convergence result that grounds model-free control. RL
Webb, G. I., Hyde, R., Cao, H., Nguyen, H. L. & Petitjean, F. (2016). Characterizing Concept Drift. Data Mining and Knowledge Discovery 30 — a quantitative framework for describing how concepts drift over time. MLOPS
Wei, A., Haghtalab, N. & Steinhardt, J. (2023). Jailbroken: How Does LLM Safety Training Fail? NeurIPS 2023. The competing-objectives and mismatched-generalization framing used throughout §5.1–5.2. OPEN
Welch, B. L. (1947). The Generalization of Student's Problem When Several Different Population Variances Are Involved. Biometrika 34 — the unequal-variance two-sample test of EQ S4.10. STATS
White, I. R., Royston, P. & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine 30(4):377–399 — practical guidance on running and pooling MICE. DATA
Wiener, N. (1923). Differential Space. J. Math. Phys. 2 — the rigorous existence proof of the process that bears his name. QUANT
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 — the original REINFORCE estimator (EQ R4.4) and the log-derivative / score-function trick behind every policy gradient. RL
Winters, P. R. (1960). Forecasting sales by exponentially weighted moving averages. Management Science 6(3) — adds the seasonal component, completing Holt-Winters (EQ T3.6/T3.7). TIME
Wolpert, D. H. (1992). Stacked Generalization. Neural Networks 5(2) — learning the combiner with out-of-fold base predictions (EQ M14.6). VOL I
Wortsman, M., Ilharco, G., Gadre, S. Y. et al. (2022). Model Soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. ICML 2022 — weight averaging of fine-tunes (§6.1, EQ OM6.1). OPEN
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 — soft/hard attention beyond translation; shows the mechanism generalizes (§4.1). DL
Yadav, P., Tam, D., Choshen, L., Raffel, C. & Bansal, M. (2023). TIES-Merging: Resolving Interference When Merging Models. NeurIPS 2023 — trim, elect-sign, disjoint-mean (§6.3, EQ OM6.3). OPEN
Yang, A., Yang, B., Hui, B. et al. (2024). Qwen2 Technical Report. arXiv:2407.10671 — the Qwen family's wide size ladder and (mostly) Apache-2.0 licensing. OPEN
Yao, S. et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023 — explicit search over reasoning trees (§12.4). VOL II
Yeo, I.-K. & Johnson, R. A. (2000). A New Family of Power Transformations to Improve Normality or Symmetry. Biometrika 87(4) — the Yeo-Johnson extension to real-valued data, EQ D3.8. DATA
Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. (2014). How Transferable Are Features in Deep Neural Networks? NeurIPS 27. The empirical basis for transfer learning; transferability falls with depth. DL
Yu, L., Yu, B., Yu, H., Huang, F. & Li, Y. (2023). Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch (DARE). ICML 2024 — drop-and-rescale sparsification of task vectors (§6.3, EQ OM6.4). OPEN
Yue, X. et al. (2024). MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark. CVPR 2024 — college-level multimodal reasoning, a current frontier evaluation. MM
Yule, G. U. (1927). On a Method of Investigating Periodicities in Disturbed Series. Phil. Trans. R. Soc. A 226 — the paper that introduced the autoregressive model (§1.3). TIME
Zhao, T. Z., Kumar, V., Levine, S. & Finn, C. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. RSS 2023 — ACT (action chunking with transformers) and the ALOHA teleoperation platform (§6.4). MM
Zhou, C., Liu, P., Xu, P. et al. (2023). LIMA: Less Is More for Alignment. NeurIPS 2023 — 1,000 curated examples beat far larger noisy sets; the evidence behind "quality over quantity." OPEN
Zou, A., Wang, Z., Carlini, N. et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. The GCG attack (EQ OM5.2): gradient-based adversarial suffixes that transfer across models. OPEN
Zou, H. & Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society B 67(2) — the L1+L2 fix for Lasso's instability under collinearity (EQ D4.6 note). DATA