
Bibliography

Every primary source cited across the encyclopedia — 415 references, linked to where you can read them.


All sources

  1. Abadi, M. et al. (2016). TensorFlow: A System for Large-Scale Machine Learning. OSDI 2016 — the computation-graph runtime and distributed execution model behind §2.1. FRAME
  2. Abadi, M. et al. (2016). TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv — the original TensorFlow white paper describing the dataflow-graph design. FRAME
  3. Aggarwal, C. C., Hinneburg, A. & Keim, D. A. (2001). On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT 2001, LNCS 1973, 420–434. Shows lower-order (fractional, Manhattan) norms concentrate more slowly than Euclidean (§11.5). VOL I
  4. Aghajanyan, A., Zettlemoyer, L. & Gupta, S. (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL 2021 — the empirical low-rank hypothesis that motivates LoRA. OPEN
  5. Akaike, H. (1974). A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control 19(6) — the Akaike Information Criterion behind automated order selection (EQ T2.9). TIME
  6. Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. KDD 2019 — the TPE sampler plus pruning combination that is the practical default in 2026. MLOPS
  7. Alayrac, J.-B. et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022 — gated cross-attention into a frozen LLM; the canonical cross-attention design (§2.4). MM
  8. Ambroise, C. & McLachlan, G. J. (2002). Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data. PNAS 99(10) — the definitive demonstration of feature-selection bias and why selection must sit inside cross-validation (§4.5). DATA
  9. Anil, C., Durmus, E., Sharma, M. et al. (2024). Many-shot Jailbreaking. Anthropic — long-context in-context attacks that scale with the number of faux-compliant examples. OPEN
  10. Ansel, J. et al. (2024). PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. ASPLOS 2024 — torch.compile and TorchDynamo. FRAME
  11. Arjovsky, M., Chintala, S. & Bottou, L. (2017). Wasserstein GAN. ICML 2017 — replacing JSD with the Wasserstein distance (EQ N6.4–N6.5) and the critic. DL · GAME
  12. Arlot, S. & Celisse, A. (2010). A Survey of Cross-Validation Procedures for Model Selection. Statistics Surveys 4 — the comprehensive modern reference on CV variants and their bias/variance. MLOPS
  13. Artzner, P., Delbaen, F., Eber, J.-M. & Heath, D. (1999). Coherent Measures of Risk. Mathematical Finance 9(3) — the four coherence axioms; why VaR fails subadditivity and ES does not. QUANT
  14. Ashkboos, S. et al. (2024). QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. NeurIPS 2024 — Hadamard rotations that spread weight/activation outliers (§11.4). VOL II
  15. Auer, P., Cesa-Bianchi, N. & Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47 — UCB and the regret framework for principled exploration beyond ε-greedy (§1.5). RL
  16. Axelrod, R. (1984). The Evolution of Cooperation. Basic Books — the round-robin tournaments and the four properties of tit-for-tat (EQ G2.4, §2.3); the founding popular text on repeated cooperation. GAME
  17. Axelrod, R. & Hamilton, W. D. (1981). The Evolution of Cooperation. Science 211(4489) — the peer-reviewed account of the iterated-PD tournaments and the evolutionary stability of tit-for-tat. GAME
  18. Bachelier, L. (1900). Théorie de la spéculation. Ann. Sci. ÉNS 17 — the founding thesis modelling prices as Brownian motion, five years before Einstein. QUANT
  19. Baevski, A., Zhou, H., Mohamed, A. & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Meta — self-supervised audio encoder; the discriminative alternative to generative ASR. MM
  20. Bahdanau, D., Cho, K. & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015 — additive attention (EQ N4.2/N4.3); the birth of the mechanism and the length-decay figure. DL
  21. Bai, J., Lu, F., Zhang, K. et al. (2019). ONNX: Open Neural Network Exchange. onnx.ai — the framework-agnostic graph format and standard operator set at the heart of §3.2. FRAME
  22. Baldi, P. & Hornik, K. (1989). Neural Networks and Principal Component Analysis: Learning from Examples without Local Minima. Neural Networks 2(1) — proves the linear autoencoder optimum spans the top-k PCA subspace (§5.1). DL
  23. Basel Committee on Banking Supervision (2019). Minimum Capital Requirements for Market Risk (FRTB, finalized). BIS d457 — the switch to 97.5% stressed Expected Shortfall with liquidity-horizon scaling (EQ Q6.8). QUANT
  24. Baydin, A. G., Pearlmutter, B. A., Radul, A. A. & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: a Survey. JMLR 18(153) — forward vs reverse mode, the theory behind EQ F1.3. FRAME
  25. Bayes, T. & Price, R. (1763). An Essay towards solving a Problem in the Doctrine of Chances. Phil. Trans. R. Soc. — the original statement of EQ S1.7. STATS
  26. Bellman, R. (1957). A Markovian Decision Process. Journal of Mathematics and Mechanics 6(5) — the formal origin of the MDP (EQ R1.1) and the recursive value relation behind EQ R1.4. RL
  27. Bellman, R. (1957). Dynamic Programming. Princeton University Press — the founding text; the principle of optimality and the recursive value relation behind EQ R2.1–R2.3. RL
  28. Bengio, Y., Louradour, J., Collobert, R. & Weston, J. (2009). Curriculum Learning. ICML 2009 — easy-to-hard example ordering (§4.4). OPEN
  29. Bengio, Y., Simard, P. & Frasconi, P. (1994). Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Transactions on Neural Networks 5(2) — the formal analysis of vanishing/exploding gradients behind EQ N3.2–N3.3. DL
  30. Benjamini, Y. & Hochberg, Y. (1995). Controlling the False Discovery Rate. J. R. Stat. Soc. B 57(1) — FDR control for the many-tests regime of EQ S4.13. STATS
  31. Bergmeir, C. & Benítez, J. M. (2012). On the Use of Cross-Validation for Time Series Predictor Evaluation. Information Sciences 191 — forward-chaining validation for temporal data (§1.4). MLOPS
  32. Bergstra, J. & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. JMLR 13 — the result that random search matches or beats grid search under low effective dimensionality (§2.3). MLOPS
  33. Bertsekas, D. P. (2017). Dynamic Programming and Optimal Control (4th ed.). Athena Scientific — the contraction-mapping convergence analysis (EQ R2.6) in full rigor. RL
  34. Beyer, K., Goldstein, J., Ramakrishnan, R. & Shaft, U. (1999). When Is "Nearest Neighbor" Meaningful? ICDT 1999, LNCS 1540, 217–235. The foundational distance-concentration result behind EQ M11.6. VOL I
  35. Bickel, P. J., Hammel, E. A. & O'Connell, J. W. (1975). Sex Bias in Graduate Admissions: Data from Berkeley. Science 187(4175) — the canonical Simpson's paradox case study. STATS
  36. Bifet, A. & Gavaldà, R. (2007). Learning from Time-Changing Data with Adaptive Windowing (ADWIN). SIAM SDM 2007 — an adaptive-window detector with a formal false-positive bound. MLOPS
  37. Black, F. & Scholes, M. (1973). The Pricing of Options and Corporate Liabilities. Journal of Political Economy 81(3) — the continuous-time closed form the binomial tree converges to as N → ∞ (Quant 03); the lattice is its discrete, fully constructive counterpart. QUANT
  38. Black, F. & Scholes, M. (1973). The Pricing of Options and Corporate Liabilities. Journal of Political Economy 81(3) — the original no-arbitrage derivation and the closed-form formula. QUANT
  39. Black, K., Brown, N., Driess, D., Esmail, A., et al. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence — flow-matching continuous action chunks at high frequency (§6.2, EQ MM6.3). MM
  40. Blitzstein, J. K. & Hwang, J. (2019). Introduction to Probability (2nd ed.). Chapman & Hall / CRC. Harvard Stat 110 — conditioning, Bayes, expectation, LLN; free course materials. STATS
  41. Board of Governors of the Federal Reserve System & OCC (2011). SR 11-7: Guidance on Model Risk Management. Supervisory letter — the model-risk framework and three pillars of §7.5 (EQ V7.5). MLOPS
  42. Bollerslev, T. (1986). Generalized Autoregressive Conditional Heteroskedasticity. Journal of Econometrics 31(3) — adds the lagged-variance term, giving GARCH(1,1) (EQ T4.4), the field's workhorse. TIME
  43. Borsos, Z. et al. (2022). AudioLM: A Language Modeling Approach to Audio Generation. Google — next-audio-token prediction over codec tokens (EQ MM4.8); the audio-LM of §4.4. MM
  44. Boser, B. E., Guyon, I. M. & Vapnik, V. N. (1992). A Training Algorithm for Optimal Margin Classifiers. Proceedings of COLT '92, 144–152. Where the maximum-margin hyperplane and the kernel trick (EQ M10.1, M10.7) were first combined — the true origin of the kernel SVM. VOL I
  45. Box, G. E. P. & Cox, D. R. (1964). An Analysis of Transformations. Journal of the Royal Statistical Society B 26(2) — the Box-Cox power-transform family, EQ D3.7. DATA · TIME
  46. Box, G. E. P., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley — the canonical text; the ACF/PACF identification method (§1.3) and integration order \(d\) (§1.5) are its core. TIME
  47. Boyle, P. (1977). Options: A Monte Carlo Approach. Journal of Financial Economics 4(3) — the paper that brought simulation to option pricing. QUANT
  48. Bradbury, J., Frostig, R., Hawkins, P. et al. (2018). JAX: composable transformations of Python+NumPy programs. github.com/google/jax — jit/grad/vmap/pmap over XLA (EQ F3.5). FRAME
  49. Breck, E., Cai, S., Nielsen, E., Salib, M. & Sculley, D. (2017). The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. IEEE Big Data 2017 — the data/model/infra test layers of §7.3. MLOPS
  50. Breiman, L. (1996). Bagging Predictors. Machine Learning 24(2) — bootstrap aggregating and its variance-reduction argument (EQ M14.2, M14.3). VOL I
  51. Breiman, L. (2001). Random Forests. Machine Learning 45(1) — feature subsampling as the second decorrelation lever; OOB error (§14.2). VOL I
  52. Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review 78(1) — the original proper scoring rule for probabilistic forecasts (§3.5). MLOPS
  53. Brigo, D. & Mercurio, F. (2006). Interest Rate Models — Theory and Practice (2nd ed.). Springer Finance — the standard practitioner reference for Vasicek, CIR, Hull–White, G2++, and calibration. QUANT
  54. Broder, A. Z. (1997/1998). On the resemblance and containment of documents & Min-wise independent permutations. SEQUENCES / STOC. MinHash estimation of Jaccard similarity at scale (§11.4). VOL I
  55. Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Google DeepMind — action tokenization over a VLM vocabulary and web/robot co-training (§6.2, EQ MM6.2). MM
  56. Brooks, T. et al. (2024). Video Generation Models as World Simulators. OpenAI (Sora) — spacetime latent patches and a diffusion transformer over video, the basis of §3.5 and EQ MM3.6. MM
  57. Brown, N. & Sandholm, T. (2019). Superhuman AI for multiplayer poker. Science 365 — Pluribus, self-play in imperfect information. GAME
  58. Bruce, J. et al. (2024). Genie: Generative Interactive Environments. ICML 2024 — latent-action world model learned from unlabelled gameplay video; playable worlds from one prompt (EQ MM5.4). MM
  59. Burges, C. J. C. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2(2), 121–167. The most-cited tutorial — derives the dual (EQ M10.3) and the KKT support-vector conditions step by step. VOL I
  60. Campello, R. J. G. B., Moulavi, D., Zimek, A. & Sander, J. (2015). Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM TKDD 10(1) — HDBSCAN, the variable-density successor that removes DBSCAN's ε knob (§12.3 note). VOL I
  61. Carion, N. et al. (2020). End-to-End Object Detection with Transformers (DETR). ECCV 2020 — detection as set prediction; no anchors or non-max suppression. MM
  62. Carpenter, B. et al. (2017). Stan: A Probabilistic Programming Language. J. Stat. Softw. — Hamiltonian Monte Carlo for the hierarchical models of §5.5. STATS
  63. Casella, G. & Berger, R. L. (1987). Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem. JASA — a careful account of where the two frameworks agree and diverge. STATS
  64. Cawley, G. C. & Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. JMLR 11 — why tuning and reporting on the same folds inflates scores. MLOPS
  65. Cawley, G. C. & Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. JMLR 11 — why tuning on the test set inflates results, and nested cross-validation as the fix (§1.2). DATA
  66. Chang, C.-C. & Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM TIST 2(3), 1–27. The standard implementation behind scikit-learn's SVC, and the reference for practical \((C, \gamma)\) selection (§10.5). VOL I
  67. Chang, H., Zhang, H., Jiang, L., Liu, C. & Freeman, W. T. (2022). MaskGIT: Masked Generative Image Transformer. CVPR 2022 — parallel masked-token decoding, the few-round alternative to autoregression in §3.4. MM
  68. Chaudhry, A., Ranzato, M., Rohrbach, M. & Elhoseiny, M. (2019). Efficient Lifelong Learning with A-GEM. ICLR 2019 — gradient-episodic-memory replay for continual learning (§4.4). OPEN
  69. Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 — the interpolation method of EQ D5.3. DATA
  70. Chen, T. & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD 2016 — regularized second-order objective and split gain (EQ M15.6–M15.7). VOL I
  71. Chicco, D. & Jurman, G. (2020). The Advantages of the Matthews Correlation Coefficient (MCC) over F1 and Accuracy. BMC Genomics 21 — the case for MCC on imbalanced binary problems (§5.5). DATA · MLOPS
  72. Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP 2014 — introduces the GRU (EQ N3.6) and the encoder–decoder framing carried into Chapter 04. DL
  73. Chollet, F. (2015). Keras. Official documentation — the high-level layers/Sequential/Functional API of §2.2 and the current Keras 3 multi-backend design. FRAME
  74. Chollet, F. (2021). Deep Learning with Python (2nd ed.). Manning — the canonical Keras text by its author; covers layers, fit, callbacks, and custom training loops. FRAME
  75. Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S. & Amodei, D. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS — the original preference-to-reward pipeline and the Bradley–Terry reward model (EQ R6.2–R6.3) that RLHF scaled to language. RL
  76. Christoffersen, P. F. (1998). Evaluating Interval Forecasts. International Economic Review 39(4) — the conditional-coverage / independence test that complements Kupiec. QUANT
  77. Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. NIPS 2014 Deep Learning Workshop — the LSTM-vs-GRU comparison underpinning the "no universal winner" claim (§3.4). DL
  78. Clauset, A., Shalizi, C. R. & Newman, M. E. J. (2009). Power-Law Distributions in Empirical Data. SIAM Review 51(4), 661–703. The methodological reference on fitting and — crucially — testing power-law tails against alternatives. STATS
  79. Cont, R. (2001). Empirical Properties of Asset Returns: Stylized Facts and Statistical Issues. Quantitative Finance 1(2), 223–236. A careful survey of the fat-tail evidence and why pinning down the tail exponent is genuinely hard. STATS
  80. Cortes, C. & Vapnik, V. (1995). Support-Vector Networks. Machine Learning 20(3), 273–297. The paper that introduced the soft margin and slack variables (EQ M10.5) and gave the method its modern form — the canonical primary source for this chapter. VOL I
  81. Cover, T. & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Information Theory, 13(1), 21–27. Why the distance choice is the model: the founding analysis of k-NN. VOL I
  82. Cover, T. M. & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley — the standard graduate text; entropy, mutual information, source coding, and their inequalities. STATS
  83. Cox, J. C., Ingersoll, J. E. & Ross, S. A. (1985). A Theory of the Term Structure of Interest Rates. Econometrica 53(2) — the square-root diffusion, the Feller condition, and non-negative rates (EQ Q4.7–Q4.9). QUANT
  84. Cox, J. C., Ross, S. A. & Rubinstein, M. (1979). Option Pricing: A Simplified Approach. Journal of Financial Economics 7(3) — the original recombining binomial lattice, the CRR parameters of EQ Q2.4, and the convergence to Black–Scholes. QUANT
  85. Dao, T. & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML 2024 — Mamba-2 and the SSD duality of EQ 11.4. VOL II
  86. Davis, J. & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. ICML 2006 — why PR-AUC, not ROC-AUC, is the metric to trust under imbalance (§5.5). DATA
  87. DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437 — a frontier-class MoE trained at a fraction of typical cost, with unusually open methodology. OPEN
  88. DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv — RLVR with GRPO at scale; reasoning behavior emerging from verifiable rewards (§6.5). RL
  89. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. & Harshman, R. (1990). Indexing by Latent Semantic Analysis. JASIS 41(6) — LSA: truncated SVD of a term–document matrix, the §13.5 connection to embeddings. VOL I
  90. Défossez, A. et al. (2024). Moshi: A Speech-Text Foundation Model for Real-Time Dialogue. Kyutai — full-duplex spoken dialogue; the streaming / low-latency direction of §4.5. MM
  91. Défossez, A., Copet, J., Synnaeve, G. & Adi, Y. (2022). High Fidelity Neural Audio Compression. Meta — EnCodec; residual-vector-quantized neural codec (EQ MM4.4) that turns audio into discrete tokens. MM
  92. Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B 39(1) — the Expectation–Maximization algorithm behind GMM fitting (§12.4, EQ M12.5). VOL I
  93. Dettmers, T., Lewis, M., Belkada, Y. & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022 — outlier-aware 8-bit quantization; why a few features must be preserved. OPEN
  94. Dettmers, T., Pagnoni, A., Holtzman, A. & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023 — 4-bit NF4 base weights plus LoRA adapters; fine-tuning a 65B model on a single 48 GB GPU. OPEN
  95. Dhariwal, P. & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. NeurIPS 2021 — the result marking diffusion's displacement of GANs for large-scale image generation (§6.5). DL
  96. Dickey, D. A. & Fuller, W. A. (1979). Distribution of the Estimators for Autoregressive Time Series with a Unit Root. JASA 74(366) — the Dickey-Fuller unit-root test (the basis of the augmented ADF variant) that decides how many differences \(d\) a series needs. TIME
  97. Dickey, D. A. & Fuller, W. A. (1979). Distribution of the Estimators for Autoregressive Time Series with a Unit Root. JASA 74(366) — the unit-root test that decides random walk vs stationary (§1.4). TIME
  98. Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. Multiple Classifier Systems, LNCS 1857 — the canonical survey of why and when ensembles help (§14.1). VOL I
  99. Domingos, P. & Pazzani, M. (1997). On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning 29 — explains why a model with violated independence assumptions still classifies well (the paradox of §9.5). VOL I
  100. Dosovitskiy, A. et al. (2021). An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR 2021 — the Vision Transformer, the chief modern challenger to the convolutional prior. DL · MM
  101. Easley, D. & Kleinberg, J. (2010). Networks, Crowds, and Markets. Cambridge University Press (Ch. 6) — an accessible, freely available treatment of best response, dominant strategies, and equilibrium used to frame this chapter. GAME
  102. Eckart, C. & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3), 211–218. The optimal low-rank approximation theorem (EQ S6.7). STATS · VOL I
  103. Efron, B. & Morris, C. (1975). Data Analysis Using Stein's Estimator and Its Generalizations. JASA / Ann. Statist. — shrinkage and partial pooling as empirical Bayes. STATS
  104. Einstein, A. (1905). Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung…. Ann. Phys. 322 — the physical derivation that variance grows linearly in time. QUANT
  105. Engle, R. F. (1982). Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation. Econometrica 50(4) — the original ARCH model (EQ T4.2); the work cited for Engle's 2003 Nobel Prize. TIME
  106. Engle, R. F. (2002). Dynamic Conditional Correlation: A Simple Class of Multivariate GARCH Models. Journal of Business & Economic Statistics 20(3) — the DCC model bridging to the multivariate chapter. TIME
  107. Engle, R. F. & Granger, C. W. J. (1987). Co-integration and Error Correction: Representation, Estimation, and Testing. Econometrica 55(2) — defines cointegration and the Granger representation theorem linking it to the VECM (EQ T5.7–T5.8). TIME
  108. Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. KDD-96 — the original DBSCAN: core/border/noise points and density-reachability (§12.3, EQ M12.3). VOL I
  109. European Union (2024). Regulation (EU) 2024/1689 — the Artificial Intelligence Act. Official Journal — risk-tiered obligations (risk management, data governance, logging, human oversight) phasing in through 2026–2027. MLOPS
  110. Fawcett, T. (2006). An Introduction to ROC Analysis. Pattern Recognition Letters 27(8) — the canonical tutorial on ROC curves, AUC, and the pair-counting identity (§4.1). MLOPS
  111. Fischer, H. (2011). A History of the Central Limit Theorem: From Classical to Modern Probability Theory. Springer — the full lineage of EQ S2.6, from de Moivre and Laplace to Lindeberg and Lévy. STATS
  112. Frantar, E., Ashkboos, S., Hoefler, T. & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023 — second-order error-aware rounding (§11.4). VOL II · OPEN
  113. Freund, Y. & Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55(1) — AdaBoost and its training-error bound (EQ M14.5). VOL I
  114. Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29(5) — boosting as functional gradient descent (EQ M14.4). VOL I · MLOPS
  115. Friedman, J. W. (1971). A Non-cooperative Equilibrium for Supergames. Review of Economic Studies 38(1) — Grim-Trigger equilibria and an early form of the Folk Theorem behind EQ G2.3 (§2.1). GAME
  116. Friedman, J., Hastie, T. & Tibshirani, R. (2000). Additive Logistic Regression: A Statistical View of Boosting. Annals of Statistics 28(2) — AdaBoost as stagewise exponential-loss minimization (EQ M15.5). VOL I
  117. Fujimoto, S., van Hoof, H. & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. arXiv:1802.09477 — TD3; twin critics, delayed updates, and target smoothing. RL
  118. Gama, J., Medas, P., Castillo, G. & Rodrigues, P. (2004). Learning with Drift Detection (DDM). SBIA 2004, LNCS 3171 — the error-rate drift detector behind EQ V5.4. MLOPS
  119. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M. & Bouchachia, A. (2014). A Survey on Concept Drift Adaptation. ACM Computing Surveys 46(4) — the canonical taxonomy of drift types and adaptation strategies (§5.1). MLOPS
  120. Gardner, E. S. & McKenzie, E. (1985). Forecasting trends in time series. Management Science 31(10) — the damped-trend method (EQ T3.5), a perennial competition benchmark. TIME
  121. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press — the standard modern reference for conjugacy, hierarchy, and computation. STATS
  122. Geman, S. & Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE TPAMI 6(6) — introduced Gibbs sampling (EQ S7.9) to statistics and image analysis. STATS
  123. Gerganov, G. et al. (2023). llama.cpp — LLM inference in C/C++. The reference local inference engine and the home of the GGUF format. OPEN
  124. Giles, M. & Glasserman, P. (2006). Smoking Adjoints: Fast Monte Carlo Greeks. RISK Magazine — adjoint algorithmic differentiation for computing all Greeks in one reverse pass. QUANT
  125. Glasserman, P. (2003). Monte Carlo Methods in Financial Engineering. Springer — the definitive reference for variance reduction, path simulation, and Greeks. QUANT
  126. Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS — the variance-preserving (Xavier/Glorot) initialization of §1.2. DL
  127. Glosten, L. R., Jagannathan, R. & Runkle, D. E. (1993). On the Relation between the Expected Value and the Volatility of the Nominal Excess Return on Stocks. Journal of Finance 48(5) — the GJR-GARCH asymmetric extension (EQ T4.6). TIME
  128. Gneiting, T. & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. J. American Statistical Association 102(477) — the theory of why log loss and Brier reward honest probabilities (§3.5). MLOPS
  129. Goldstein, A., Kapelner, A., Bleich, J. & Pitkin, E. (2015). Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation. J. Computational and Graphical Statistics 24(1) — ICE curves (§6.3). MLOPS
  130. Golub, G. H. & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press. The canonical numerical reference for the SVD, power iteration, and conditioning. STATS
  131. Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press — Ch. 8 (optimization) and Ch. 7 (regularization), the standard textbook treatment. DL
  132. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. (2014). Generative Adversarial Networks. NeurIPS 2014 — the original adversarial game, the optimal discriminator (EQ N6.2), and the JSD reduction (EQ N6.3). DL · GAME
  133. Google. TensorFlow Lite / LiteRT — On-Device Machine Learning. tensorflow.org/lite — the SavedModel→FlatBuffer converter and edge runtime of §3.3. FRAME
  134. Google. TensorFlow Serving. tensorflow.org/tfx — SavedModel hosting with versioned hot-swap, the TF-native serving path. FRAME
  135. Granger, C. W. J. (1969). Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica 37(3) — the original definition of Granger causality (EQ T5.9). TIME
  136. Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. IEEE TNNLS 28(10) — systematic ablation of LSTM components, including the value of the forget gate and forget-bias initialization (§3.3). DL
  137. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B. & Smola, A. (2012). A Kernel Two-Sample Test (MMD). JMLR 13 — the maximum-mean-discrepancy test for multivariate covariate-shift detection (§5.3). MLOPS
  138. Grinsztajn, L., Oyallon, E. & Varoquaux, G. (2022). Why Do Tree-Based Models Still Outperform Deep Learning on Typical Tabular Data? NeurIPS 2022 Datasets & Benchmarks — the contested empirical case behind §15.5's honest comparison. VOL I
  139. Groeneveld, D., Beltagy, I., Walsh, P. et al. (2024). OLMo: Accelerating the Science of Language Models. arXiv:2402.00838 — a genuinely open-source model: weights, data, and training code all released. OPEN
  140. Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752 — the modern selective state-space model reviving gated linear recurrence at scale (§3.4 footnote). DL · VOL II
  141. Gu, A., Goel, K. & Ré, C. (2021). Efficiently Modeling Long Sequences with Structured State Spaces (S4). ICLR 2022 — the structured SSM that started the line (§11.2). VOL II
  142. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. (2017). Improved Training of Wasserstein GANs. NeurIPS 2017 — WGAN-GP: the gradient penalty that replaced weight clipping. DL
  143. Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. ICML 2017 — modern deep networks are systematically over-confident; temperature scaling as a fix (§4.4). MLOPS
  144. Gururangan, S., Marasović, A., Swayamdipta, S. et al. (2020). Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. ACL 2020 — domain- and task-adaptive continued pre-training (DAPT / TAPT). OPEN
  145. Guyon, I. & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3 — the canonical survey of filter, wrapper and embedded selection (§4.3). DATA
  146. Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. (2002). Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning 46 — recursive feature elimination, EQ D4.5. DATA
  147. Ha, D. & Schmidhuber, J. (2018). World Models. NeurIPS 2018 — the foundational demonstration: train an agent inside its own learned dream and transfer to the real environment. MM
  148. Haarnoja, T., Zhou, A., Abbeel, P. & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor. arXiv:1801.01290 — the maximum-entropy objective (EQ R5.5) and the continuous-control default of §5.4. RL
  149. Hafner, D. et al. (2025). V-JEPA 2: Self-Supervised Video World Models — see also Assran et al., I-JEPA (arXiv:2301.08243). Embedding-prediction self-supervision scaled to video as a world model for planning. MM
  150. Hafner, D., Pasukonis, J., Ba, J. & Lillicrap, T. (2023). Mastering Diverse Domains through World Models (DreamerV3). arXiv — one fixed hyperparameter set across 150+ tasks; first to mine diamonds in Minecraft from scratch (EQ MM5.2). MM
  151. Halko, N., Martinsson, P.-G. & Tropp, J. A. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review 53(2) — randomized SVD, how truncated factorizations are actually computed at scale. VOL I
  152. Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press — the graduate-level reference for stationarity, unit roots, and the econometric theory behind §1.2 and §1.4. TIME
  153. Han, H., Wang, W.-Y. & Mao, B.-H. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. ICIC 2005 — synthesizing only near the decision boundary (§5.3). DATA
  154. Hand, D. J. (2009). Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve. Machine Learning 77(1) — the influential critique of AUC and the proposed H-measure. MLOPS
  155. Hanley, J. A. & McNeil, B. J. (1982). The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve. Radiology 143(1) — the AUC = Wilcoxon–Mann–Whitney equivalence (EQ V4.2). MLOPS
  156. Hansen, P. R. & Lunde, A. (2005). A Forecast Comparison of Volatility Models: Does Anything Beat a GARCH(1,1)?. Journal of Applied Econometrics 20(7) — the large horse race finding GARCH(1,1) hard to beat on exchange rates, though models with a leverage effect beat it on IBM equity returns. TIME
  157. Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer — free online; cross-validation, the train/test contract, and the right vs wrong way to cross-validate (§1.2, §1.5). DATA · VOL I · MLOPS
  158. Hastings, W. K. (1970). Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 57(1) — the generalization to asymmetric proposals, completing Metropolis–Hastings. STATS
  159. He, H. & Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21(9) — the canonical survey of resampling, cost-sensitive learning, and evaluation. DATA
  160. He, H., Bai, Y., Garcia, E. A. & Li, S. (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. IJCNN 2008 — density-adaptive synthetic generation (§5.3). DATA
  161. He, K., Chen, X., Xie, S., Li, Y., Dollár, P. & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022 — masked autoencoding as a strong self-supervised pretext for vision transformers (§5.5). DL
  162. He, K., Zhang, X., Ren, S. & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV — He/Kaiming initialization for ReLU networks (EQ N1.5). DL
  163. He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR — residual connections / ResNet (§1.4, EQ N1.7–N1.8). DL · MM
  164. Heath, D., Jarrow, R. & Morton, A. (1992). Bond Pricing and the Term Structure of Interest Rates: A New Methodology. Econometrica 60(1) — the HJM no-arbitrage framework that generalizes all short-rate models to the whole forward curve. QUANT
  165. Henderson, P. et al. (2018). Deep Reinforcement Learning that Matters. AAAI 2018 (arXiv:1709.06560) — the reproducibility and seed-variance study behind §5.5 and EQ R5.6. RL
  166. Heston, S. L. (1993). A Closed-Form Solution for Options with Stochastic Volatility. Review of Financial Studies 6(2) — stochastic-variance model that generates the smile endogenously (EQ Q3.7). QUANT
  167. Higgins, I. et al. (2017). β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2017 — weighting the KL term to encourage disentangled latent factors (§5.4). DL
  168. Hinton, G. E. & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science 313(5786) — deep autoencoders, trained layer-wise, beat PCA at nonlinear dimensionality reduction (§5.1). DL
  169. Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance. NeurIPS 2021 Workshop — the conditional/unconditional extrapolation of EQ MM3.4. MM
  170. Ho, J., Jain, A. & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020 — the noise-prediction objective and forward reparameterization behind EQ MM3.1–MM3.2. MM
  171. Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation 9(8) — introduces the LSTM cell, the constant error carousel, and the gating scheme of EQ N3.4–N3.5. DL
  172. Hoffman, M. D. & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. JMLR 15 — NUTS, the adaptive HMC sampler behind Stan / PyMC / NumPyro (§7.5). STATS
  173. Holt, C. C. (2004, orig. 1957). Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting 20(1) — reprint of the 1957 ONR memorandum that introduced double smoothing (EQ T3.4). TIME
  174. Howard, J. & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification (ULMFiT). ACL 2018 — gradual unfreezing and discriminative fine-tuning (§4.2). OPEN
  175. Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press — introduced policy iteration (EQ R2.7) and the policy improvement theorem. RL
  176. Hu, E. J., Shen, Y., Wallis, P. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022 — the low-rank weight update (EQ OM3.2) at the heart of practical open-model fine-tuning. OPEN
  177. Hubert, L. & Arabie, P. (1985). Comparing Partitions. Journal of Classification 2(1) — the Adjusted Rand Index for chance-corrected external validation (§12.5). VOL I
  178. Huffman, D. A. (1952). A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE 40(9) — the optimal prefix code of §8.4 (EQ S8.6, Instrument S8.3). STATS
  179. Hugging Face. PEFT: Parameter-Efficient Fine-Tuning — Documentation. Official library docs for LoRA/QLoRA/DoRA training and adapter management. OPEN
  180. Hull, J. & White, A. (1990). Pricing Interest-Rate-Derivative Securities. Review of Financial Studies 3(4) — time-dependent drift that fits the initial curve exactly (EQ Q4.10–Q4.11). QUANT
  181. Hull, J. C. (2021). Options, Futures, and Other Derivatives (11th ed.). Pearson — Ch. 13–21: the standard practitioner treatment of binomial trees, risk-neutral valuation, and American-option pricing. QUANT
  182. Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.), Ch. 8. OTexts — the freely available standard textbook treatment of SES, Holt-Winters, and ETS. TIME
  183. Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts (free online) — the modern practitioner's guide; decomposition (§1.1), STL, and Box–Cox (§1.5). TIME
  184. Hyndman, R. J. & Khandakar, Y. (2008). Automatic Time Series Forecasting: The forecast Package for R. Journal of Statistical Software 27(3) — the algorithm behind auto.arima and its AIC-driven stepwise order search (§2.5). TIME
  185. Hyndman, R. J. & Koehler, A. B. (2006). Another Look at Measures of Forecast Accuracy. International Journal of Forecasting 22(4) — the canonical critique of MAPE and the case for scaled error measures (§3.2). MLOPS · TIME
  186. Hyndman, R. J., Koehler, A. B., Ord, J. K. & Snyder, R. D. (2008). Forecasting with Exponential Smoothing: The State Space Approach. Springer — the definitive treatment of the ETS innovations state-space framework (EQ T3.8). TIME
  187. Hyndman, R. J., Koehler, A. B., Snyder, R. D. & Grose, S. (2002). A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting 18(3) — the taxonomy of 30 ETS models and automatic AIC selection (§3.4). TIME
  188. Inan, H., Upasani, K., Chi, J. et al. (2023). Llama Guard: LLM-based Input-Output Safeguarding for Human-AI Conversations. Meta — the open guard-model approach behind the input/output filters of §5.4. OPEN
  189. Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine 2(8) — the conditional-probability argument behind §4.6. STATS
  190. Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML — batch normalization (§1.3, EQ N1.6). DL
  191. Itô, K. (1944). Stochastic Integral. Proc. Imperial Acad. Tokyo 20(8) — the original construction of the Itô integral and the lemma of §1.4. QUANT
  192. J.P. Morgan / Reuters (1996). RiskMetrics — Technical Document (4th ed.). The document that made parametric VaR an industry standard: EWMA covariance, Gaussian quantiles, √-time scaling. QUANT
  193. Jamieson, K. & Talwalkar, A. (2016). Non-stochastic Best Arm Identification and Hyperparameter Optimization. AISTATS 2016 — the successive-halving subroutine behind EQ V2.6. MLOPS
  194. Jamshidian, F. (1989). An Exact Bond Option Formula. Journal of Finance 44(1) — decomposes a swaption into a portfolio of bond options, making Gaussian models swaption-closed-form. QUANT
  195. Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press — the objective-Bayesian case for probability as extended logic. STATS
  196. Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. Probability as extended logic — the Bayesian reading of §1.1 and §1.3. STATS
  197. Jiang, A. Q., Sablayrolles, A., Mensch, A. et al. (2023). Mistral 7B. arXiv:2310.06825 — an Apache-2.0 dense model that set the efficiency bar for small open models. OPEN
  198. Johansen, S. (1991). Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian Vector Autoregressive Models. Econometrica 59(6) — the maximum-likelihood (trace and max-eigenvalue) tests for cointegration rank (§5.4). TIME
  199. Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft), Ch. 4: Naive Bayes and Sentiment Classification. Stanford — a modern, worked treatment of multinomial NB for text with smoothing. VOL I
  200. Kaelbling, L. P., Littman, M. L. & Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4 — an early, lucid survey of the problem formulation used throughout this chapter. RL
  201. Karras, T., Laine, S. & Aila, T. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks. CVPR 2019 — StyleGAN: the mapping network, AdaIN style modulation (EQ N6.6), and style mixing. DL
  202. Katharopoulos, A., Vyas, A., Pappas, N. & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020 — the reassociation trick of EQ 11.5. VOL II
  203. Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. (2012). Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM TKDD 6(4) — the field's working definition and taxonomy of leakage (§1.3, EQ D1.3). DATA
  204. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. & Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NeurIPS 2017 — histogram binning, leaf-wise growth, GOSS & EFB (EQ M15.8–M15.9). VOL I
  205. Kendall, M. G. (1938). A New Measure of Rank Correlation. Biometrika 30(1–2) — Kendall's τ, EQ S3.5. STATS
  206. Keskar, N. S. et al. (2016). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR 2017 — batch size, flat vs. sharp minima, and the generalization debate. DL
  207. Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T. et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. CoRL 2024 — an open-weight tokenized VLA, the reproducible counterpart to RT-2 (§6.2). MM
  208. Kingma, D. P. & Ba, J. (2014). Adam: A Method for Stochastic Optimization. ICLR 2015 — the first/second-moment adaptive optimizer with bias correction (EQ N7.3). DL
  209. Kingma, D. P. & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR — the variational autoencoder and the ELBO objective (EQ S8.9). STATS · DL
  210. Kirillov, A. et al. (2023). Segment Anything. ICCV 2023 — SAM; promptable, open-vocabulary segmentation (§1.4 frontier). MM
  211. Kirkpatrick, J., Pascanu, R., Rabinowitz, N. et al. (2017). Overcoming Catastrophic Forgetting in Neural Networks. PNAS 114(13) — Elastic Weight Consolidation (EQ OM4.6). OPEN
  212. Kloeden, P. & Platen, E. (1992). Numerical Solution of Stochastic Differential Equations. Springer — Euler–Maruyama and higher-order path-discretization schemes. QUANT
  213. Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. IJCAI 1995 — the empirical case for stratified 10-fold CV (§1.2). MLOPS
  214. Kolmogorov, A. N. (1933). Foundations of the Theory of Probability (Grundbegriffe der Wahrscheinlichkeitsrechnung). The axiomatic foundation of EQ S1.1; measure-theoretic probability. STATS
  215. Koren, Y., Bell, R. & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. IEEE Computer 42(8) — the canonical write-up of the latent-factor model and biases (EQ M13.4–M13.5), from the Netflix-Prize winners. VOL I
  216. Kraskov, A., Stögbauer, H. & Grassberger, P. (2004). Estimating Mutual Information. Physical Review E 69(6) — the k-nearest-neighbour estimator behind practical MI feature scores (EQ D4.7). DATA
  217. Kreuzberger, D., Kühl, N. & Hirschl, S. (2022). Machine Learning Operations (MLOps): Overview, Definition, and Architecture. arXiv:2205.02302 — a current reference architecture for pipelines, CI/CD, and CT. MLOPS
  218. Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 25 — AlexNet; ReLU, dropout, and GPU training that ignited the deep-learning era. DL · MM
  219. Kuhn, M. & Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press (full text online) — encoding, scaling, transforms and leakage-safe resampling. DATA
  220. Kullback, S. & Leibler, R. A. (1951). On Information and Sufficiency. Annals of Mathematical Statistics 22(1) — relative entropy / KL divergence (EQ S8.4). STATS
  221. Kupiec, P. H. (1995). Techniques for Verifying the Accuracy of Risk Measurement Models. Journal of Derivatives 3(2) — the proportion-of-failures likelihood-ratio backtest (EQ Q6.7). QUANT
  222. Kwon, W., Li, Z., Zhuang, S. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023 — the vLLM paper; paged KV cache and continuous batching. OPEN
  223. Kwon, W., Li, Z., Zhuang, S. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention (vLLM). github.com/vllm-project/vllm — paged-attention KV cache and continuous batching for LLM throughput. FRAME
  224. LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview — the JEPA position paper; predict in representation space, not pixel space (EQ MM5.3). MM
  225. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86(11) — LeNet-5; the conv → pool → dense template trained end-to-end by backprop. DL
  226. Lee, D. D. & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401 — NMF and the parts-based decomposition of §13.4 (EQ M13.6). VOL I
  227. Lee, D. D. & Seung, H. S. (2001). Algorithms for Non-negative Matrix Factorization. NIPS 13 — the multiplicative update rules of EQ M13.7 and their convergence guarantee. VOL I
  228. Levy, O. & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. NIPS 27 — proof that word2vec's skip-gram is implicitly factorizing a shifted PMI matrix. VOL I
  229. Li, J., Li, D., Savarese, S. & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and LLMs. ICML 2023 — the Q-Former, a query-based bridge between frozen vision and frozen language. MM
  230. Li, L. et al. (2020). A System for Massively Parallel Hyperparameter Tuning. MLSys 2020 — ASHA, the asynchronous successive halving used in production tuners. MLOPS
  231. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A. & Talwalkar, A. (2017). Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. JMLR 18 — successive halving across brackets, the bandit view of early stopping (§2.5). MLOPS
  232. Li, Y. et al. (2023). Evaluating Object Hallucination in Large Vision-Language Models (POPE). EMNLP 2023 — the object-hallucination probe behind the §2.5 evaluation caveats. MM
  233. Lieber, O. et al. (2024). Jamba: A Hybrid Transformer-Mamba Language Model. AI21 — a production-scale interleaved Mamba/attention/MoE hybrid (§11.3). VOL II
  234. Lillicrap, T. P. et al. (2016). Continuous control with deep reinforcement learning. arXiv:1509.02971 — DDPG, the deterministic actor–critic for continuous actions (§5.4). RL
  235. Lim, B., Arık, S. Ö., Loeff, N. & Pfister, T. (2021). Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting 37(4) — attention, variable selection and quantile outputs (§6.4). TIME
  236. Lin, J., Tang, J., Tang, H. et al. (2024). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024 — protects salient weight channels; widely used for 4-bit serving. OPEN
  237. Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV 2017 (RetinaNet) — focal loss, EQ D5.5. DATA
  238. Little, R. J. A. & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley — the canonical textbook on missingness mechanisms, likelihood-based methods, and multiple imputation. DATA
  239. Liu, H., Hussain, F., Tan, C. L. & Dash, M. (2002). Discretization: An Enabling Technique. Data Mining and Knowledge Discovery 6 — a survey of binning / discretization methods (EQ D3.9). DATA
  240. Liu, H., Li, C., Wu, Q. & Lee, Y. J. (2023). Visual Instruction Tuning (LLaVA). NeurIPS 2023 — projected patches as input tokens (EQ MM2.1) plus LLM-bootstrapped instruction data; the dominant early-fusion recipe. MM
  241. Liu, S.-Y., Wang, C.-Y., Yin, H. et al. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. ICML 2024 — decouples magnitude from direction for a quality gain over vanilla LoRA. OPEN
  242. Ljung, G. M. & Box, G. E. P. (1978). On a Measure of Lack of Fit in Time Series Models. Biometrika 65(2) — the portmanteau test for "are these residuals white noise?" (§1.4). TIME
  243. Longstaff, F. & Schwartz, E. (2001). Valuing American Options by Simulation: A Simple Least-Squares Approach. Review of Financial Studies 14(1) — least-squares Monte Carlo for early-exercise payoffs. QUANT
  244. Loshchilov, I. & Hutter, F. (2016). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017 — cosine annealing and cyclical warm restarts (EQ N7.5). DL
  245. Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR — AdamW, decoupling weight decay from the adaptive step (EQ N1.10). DL
  246. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P. & Mordatch, I. (2017). Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NeurIPS — MADDPG and the CTDE gradient of EQ G3.6. GAME
  247. Lundberg, S. M. & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS 2017 — the SHAP framework and its uniqueness theorem (§6.5, EQ V6.4–V6.5). MLOPS
  248. Lundberg, S. M. & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS 30 (SHAP) — the Shapley value applied to machine-learning feature attribution, the §2.5 bridge into Chapter 03. GAME
  249. Lundberg, S. M., Erion, G. G. & Lee, S.-I. (2018). Consistent Individualized Feature Attribution for Tree Ensembles. arXiv — TreeSHAP, exact polynomial-time Shapley values for trees (§6.5). MLOPS
  250. Luo, Y., Yang, Z., Meng, F. et al. (2023). An Empirical Study of Catastrophic Forgetting in LLMs During Continual Fine-tuning. Measures forgetting of general ability across instruction fine-tunes (§4.5). OPEN
  251. Luong, M.-T., Pham, H. & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015 — the dot, general and concat score functions and global-vs-local attention (EQ N4.4). DL
  252. Lütkepohl, H. (2005). New Introduction to Multiple Time Series Analysis. Springer — the standard graduate reference for VAR estimation, IRFs, FEVD, and the companion form (EQ T5.2–T5.6). TIME
  253. Ma, S. et al. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet b1.58). Microsoft Research — ternary {-1,0,+1} weights trained from scratch (§11.4). VOL II
  254. MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press — free online; the canonical bridge from Shannon to machine learning. STATS
  255. Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR — adversarial training as the robust min-max of EQ G3.8. GAME
  256. Mahalanobis, P. C. (1936, repr. 2018). On the generalised distance in statistics. Sankhyā A, 80(S1), 1–7. The original covariance-corrected distance (EQ M11.3), reprinted with commentary. VOL I
  257. Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2020). The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting 36(1) — the modern evidence that exponential smoothing remains a top baseline (§3.4). TIME
  258. Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2022). The M5 competition: Background, organization, and implementation. International Journal of Forecasting 38(4) — why gradient-boosted trees, not transformers, won (§6.4 caveat). TIME
  259. Mandelbrot, B. (1963). The Variation of Certain Speculative Prices. Journal of Business 36(4), 394–419. The founding argument that financial returns are heavy-tailed and possibly infinite-variance — the contested claim of §2.5. STATS · TIME
  260. Maron, M. E. (1961). Automatic Indexing: An Experimental Inquiry. Journal of the ACM 8(3) — arguably the first application of naive-Bayes-style probabilistic classification to text. VOL I
  261. Maynard Smith, J. (1982). Evolution and the Theory of Games. Cambridge University Press — the book-length development of ESS and the replicator perspective (EQ G2.6). GAME
  262. Maynard Smith, J. & Price, G. R. (1973). The Logic of Animal Conflict. Nature 246 — introduces the Evolutionarily Stable Strategy (EQ G2.5) and the Hawk–Dove game (§2.4). GAME
  263. McCallum, A. & Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification. AAAI-98 Workshop — the canonical multinomial vs. Bernoulli comparison (EQ M9.5 and §9.4). VOL I
  264. McCloskey, M. & Cohen, N. J. (1989). Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation — the original diagnosis of forgetting. OPEN
  265. Merton, R. C. (1973). Theory of Rational Option Pricing. Bell Journal of Economics and Management Science 4(1) — the no-arbitrage bounds and the proof that an American call on a non-dividend stock is never exercised early (§2.5). QUANT
  266. Merton, R. C. (1973). Theory of Rational Option Pricing. Bell Journal of Economics and Management Science 4(1) — the rigorous continuous-time treatment; later extended with jump-diffusion. QUANT
  267. Meta (2024). Llama 3 Community License Agreement. Primary source — the commercial terms, acceptable-use policy, and 700M-MAU scale clause discussed in §1.4. OPEN
  268. Meta FAIR Diplomacy Team et al. (2022). Human-level play in the game of Diplomacy by combining language models with strategic reasoning (CICERO). Science 378 — mixed-motive multi-agent play with negotiation. GAME
  269. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 21(6) — the original Metropolis acceptance rule (symmetric-proposal special case of EQ S7.8). STATS
  270. Micci-Barreca, D. (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. ACM SIGKDD Explorations 3(1) — the smoothed target/mean encoding of EQ D3.2. DATA
  271. Micikevicius, P. et al. (2017). Mixed Precision Training. ICLR 2018 — fp16 training, the fp32 master copy, and loss scaling (EQ N7.7). DL
  272. Microsoft. ONNX Runtime. onnxruntime.ai — the cross-platform inference engine with pluggable execution providers (CUDA, TensorRT, OpenVINO, CoreML, WebGPU). FRAME
  273. Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature 518 — deep Q-networks; the modern demonstration that ε-greedy exploration (EQ R1.7) scales to high-dimensional state spaces. RL
  274. Mnih, V. et al. (2016). Asynchronous methods for deep reinforcement learning. ICML 2016 — A3C: parallel actor-learners, the n-step advantage and entropy bonus of EQ R4.8 (and the synchronous A2C that followed). RL
  275. Molnar, C. (2022). Interpretable Machine Learning (2nd ed.). Open textbook — the standard practical reference covering every method in this chapter. MLOPS
  276. Nash, J. F. (1950). Equilibrium Points in n-Person Games. PNAS 36(1), 48–49 — the existence theorem for the equilibrium concept of EQ G1.4 in general finite games. GAME
  277. Nash, J. F. (1951). Non-Cooperative Games. Annals of Mathematics 54(2), 286–295 — the full development of non-cooperative equilibrium, dominance, and the existence proof via Brouwer's fixed-point theorem. GAME
  278. National Institute of Standards and Technology (2023). AI Risk Management Framework (AI RMF 1.0). NIST AI 100-1 — the Govern/Map/Measure/Manage scaffolding generalizing §7.5. MLOPS
  279. Nelson, D. B. (1991). Conditional Heteroskedasticity in Asset Returns: A New Approach. Econometrica 59(2) — the EGARCH model (EQ T4.7), capturing the leverage effect in log-variance. TIME
  280. Neyman, J. & Pearson, E. S. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. Phil. Trans. R. Soc. A 231 — Type I/II error, power, and the framework of EQ S4.8. STATS
  281. Ng, A. Y. & Jordan, M. I. (2002). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. NeurIPS 14 — the generative/discriminative trade-off and small-data advantage of §9.1. VOL I
  282. Niculescu-Mizil, A. & Caruana, R. (2005). Predicting Good Probabilities With Supervised Learning. ICML 2005 — calibration behavior across model families and the Platt / isotonic fixes (§4.4). MLOPS
  283. Norris, J. R. (1997). Markov Chains. Cambridge University Press — the standard rigorous treatment of transition matrices, stationarity, ergodicity, and reversibility. STATS
  284. Northcutt, C. G., Athalye, A. & Mueller, J. (2021). Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. NeurIPS Datasets & Benchmarks — measured label-error rates in ImageNet and nine other canonical test sets (§1.1). DATA
  285. NVIDIA. Triton Inference Server. developer.nvidia.com — multi-framework serving with concurrent execution and dynamic batching (§3.4). FRAME
  286. Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. (2019). Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations. Science 366(6464) — a real-world label / proxy that leaked the wrong target into a deployed model (§1.1, §1.4). DATA
  287. Øksendal, B. (2003). Stochastic Differential Equations: An Introduction with Applications (6th ed.). Springer — the standard graduate text on Itô calculus and SDEs. QUANT
  288. Open Science Collaboration (2015). Estimating the Reproducibility of Psychological Science. Science 349(6251) — the large-scale replication study that crystallized the crisis. STATS
  289. Open Source Initiative (2024). The Open Source AI Definition 1.0. Official text — the bar separating open-source AI from merely open-weight releases. OPEN
  290. Open X-Embodiment Collaboration (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. ICRA 2024 — pooling ~1M trajectories across 22 embodiments; the cross-embodiment data effort (§6.5). MM
  291. Osborne, M. J. & Rubinstein, A. (1994). A Course in Game Theory. MIT Press — standard graduate reference for dominance, IESDS, Nash equilibrium, and mixed strategies as presented in §§1.2–1.5. GAME
  292. Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS — the three-stage SFT → reward model → PPO RLHF recipe behind ChatGPT; source of the KL-regularized objective (EQ R6.5). RL
  293. OWASP Foundation (2025). OWASP Top 10 for LLM Applications. The canonical defender's checklist — prompt injection (LLM01) and the system-level controls of §5.4–5.5. OPEN
  294. Page, L., Brin, S., Motwani, R. & Winograd, T. (1999). The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab — the damped random-walk Markov chain whose stationary distribution is PageRank (§7.2). STATS
  295. Pascanu, R., Mikolov, T. & Bengio, Y. (2013). On the Difficulty of Training Recurrent Neural Networks. ICML 2013 — the spectral-norm view of the gradient product and the gradient-clipping remedy for explosion (§3.2). DL
  296. Paszke, A., Gross, S., Massa, F. et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 32 — the system paper for define-by-run autograd. FRAME
  297. Pearl, J. (1995). Causal Diagrams for Empirical Research. Biometrika 82(4) — the foundational presentation of the backdoor criterion. STATS
  298. Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press — DAGs, the do-operator and the backdoor criterion (EQ S3.7). STATS
  299. Pearson, K. (1895). Note on Regression and Inheritance in the Case of Two Parents. Proceedings of the Royal Society of London 58 — the product-moment correlation coefficient (EQ S3.3). STATS
  300. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572. The origin of principal-component analysis as projection onto a best-fit subspace (§6.2, §6.5). STATS
  301. Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12 — the reference implementations of StandardScaler, MinMaxScaler, RobustScaler and PowerTransformer. DATA
  302. Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023 — DiT, the transformer backbone that replaced the U-Net and scaled to SD3 and Sora. MM
  303. Perez, E., Huang, S., Song, F. et al. (2022). Red Teaming Language Models with Language Models. EMNLP 2022 — using one LM to automatically generate test cases that surface harms in another, at scale. OPEN
  304. Powers, D. M. W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies 2(1) — the definitive survey of confusion-matrix metrics, their biases, and what each one really measures (§3.3–3.4). MLOPS
  305. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. (2018). CatBoost: Unbiased Boosting with Categorical Features. NeurIPS 2018 — ordered boosting and ordered target statistics (EQ M15.10). VOL I
  306. Puterman, M. L. & Shin, M. C. (1978). Modified Policy Iteration Algorithms for Discounted Markov Decision Problems. Management Science 24(11) — the m-sweep interpolation between value and policy iteration cited in §2.4. RL
  307. PyTorch Team. torch.export & ExecuTorch. docs.pytorch.org — ahead-of-time graph capture (ExportedProgram) and the edge runtime succeeding TorchScript. FRAME
  308. Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A. & Lawrence, N. (eds.) (2009). Dataset Shift in Machine Learning. MIT Press — the reference volume formalizing covariate, prior, and concept shift (EQ V5.1). MLOPS
  309. Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A. & Lawrence, N. D. (2009). Dataset Shift in Machine Learning. MIT Press — covariate shift, label shift, and concept drift formalized (§1.4, EQ D1.4). DATA
  310. Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77(2) — the canonical reference for the forward, Viterbi, and Baum–Welch algorithms (§7.3). STATS
  311. Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021 — CLIP; contrastive image–text pre-training and zero-shot transfer (EQ MM1.4). MM
  312. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C. & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. OpenAI — Whisper; an encoder–decoder transformer trained on 680k hours, the model behind §4.2. MM
  313. Radford, A., Metz, L. & Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR 2016 — DCGAN: the stable convolutional architecture and latent-space vector arithmetic. DL
  314. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D. & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS — DPO; the closed-form optimal policy (EQ R6.6) and the supervised preference loss (EQ R6.7) that skip the reward model and RL loop. RL
  315. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. DALL·E 2 — the CLIP-latent prior plus diffusion decoder ("unCLIP") of §3.3. MM
  316. Ren, Y. et al. (2021). FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. ICLR 2021 — non-autoregressive TTS with explicit duration prediction; the alignment fix for Tacotron. MM
  317. Ribeiro, M. T., Singh, S. & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. KDD 2016 — LIME, local surrogate explanations (§6.4, EQ V6.3). MLOPS
  318. Roberts, G. O., Gelman, A. & Gilks, W. R. (1997). Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms. Ann. Appl. Probab. 7(1) — the ~0.234 optimal acceptance-rate result (§7.4). STATS
  319. Rockafellar, R. T. & Uryasev, S. (2000). Optimization of Conditional Value-at-Risk. Journal of Risk 2(3) — CVaR/ES as a convex, optimizable risk measure. QUANT
  320. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022 — the VAE-compressed latent space that makes Stable Diffusion affordable (§5.5). DL · MM
  321. Ross, S., Gordon, G. & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS 2011 — DAgger and the formal account of covariate shift in behavior cloning (§6.4, EQ MM6.5). MM
  322. Rousseeuw, P. J. (1987). Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics 20 — the silhouette score for choosing and validating k (§12.5, EQ M12.6). VOL I
  323. Rubin, D. B. (1976). Inference and Missing Data. Biometrika 63(3):581–592 — the paper that defined MCAR, MAR, and MNAR. DATA
  324. Rubinstein, M. (1994). Implied Binomial Trees. Journal of Finance 49(3) — an early reconstruction of the post-1987 volatility smile from market prices. QUANT
  325. Rudin, C. (2019). Stop Explaining Black Box Machine Learning Models for High-Stakes Decisions and Use Interpretable Models Instead. Nature Machine Intelligence 1 — the case for inherently interpretable models (§6.1 caveat). MLOPS
  326. Russakovsky, O. et al. (2015). ImageNet Large Scale Visual Recognition Challenge. IJCV 115(3) — the ImageNet/ILSVRC benchmark that drove the whole progression. MM
  327. Saharia, C. et al. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Imagen — a large frozen T5 text encoder plus a pixel-space super-resolution cascade. MM
  328. Saito, T. & Rehmsmeier, M. (2015). The Precision–Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE 10(3) — the empirical case for PR over ROC under class imbalance (§4.2). MLOPS
  329. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A. & Chen, X. (2016). Improved Techniques for Training GANs. NeurIPS 2016 — minibatch discrimination and feature matching, the classic anti-collapse fixes behind Instrument N6.2. DL
  330. Salinas, D., Flunkert, V., Gasthaus, J. & Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36(3) — the global probabilistic RNN of §6.4. TIME
  331. Santurkar, S., Tsipras, D., Ilyas, A. & Mądry, A. (2018). How Does Batch Normalization Help Optimization? NeurIPS — the loss-smoothing reinterpretation of BatchNorm cited in §1.3. DL
  332. Schölkopf, B. & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press — the definitive textbook treatment of kernels, Mercer's theorem, and the optimization behind EQ M10.3, M10.8. VOL I
  333. Schrittwieser, J. et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero). Nature 588 — decision-time planning with a learned latent model and MCTS, without being given the rules (EQ MM5.5). MM
  334. Schulman, J., Levine, S., Moritz, P., Jordan, M. & Abbeel, P. (2015). Trust Region Policy Optimization. arXiv:1502.05477 — the KL-constrained trust region PPO approximates with a first-order clip. RL
  335. Schulman, J., Moritz, P., Levine, S., Jordan, M. & Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. ICLR 2016 — GAE, the λ-blended advantage estimator that sets the bias–variance dial between TD and Monte-Carlo (§4.4). RL
  336. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv — PPO's clipped surrogate objective, the step-size fix that made policy gradients robust and the workhorse of RLHF (the §4.5 sequel). RL
  337. Schwarz, G. (1978). Estimating the Dimension of a Model. Annals of Statistics 6(2) — the Bayesian Information Criterion for model (and component-count) selection (§12.5, EQ M12.7). VOL I
  338. scikit-learn developers. Imputation of missing values (User Guide). Official docs — SimpleImputer, KNNImputer, and IterativeImputer (MICE). DATA
  339. Sculley, D. et al. (2015). Hidden Technical Debt in Machine Learning Systems. NeurIPS 2015 — the "ML code is a small box" argument behind §7.1. MLOPS
  340. Shafer, G. & Vovk, V. (2008). A tutorial on conformal prediction. JMLR 9 — distribution-free prediction intervals with finite-sample coverage (§6.3, EQ T6.6). TIME
  341. Shalev-Shwartz, S., Singer, Y., Srebro, N. & Cotter, A. (2011). Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming 127(1), 3–30. The hinge-loss sub-gradient method used in this chapter's first Python cell — how to train a linear SVM at scale. VOL I
  342. Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal 27 — the founding paper: entropy, source coding, and channel capacity (EQ S8.2, S8.6). STATS
  343. Shao, Z. et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv — introduces GRPO; the group-relative advantage (EQ R6.8) that removes PPO's value network. RL
  344. Shapley, L. S. (1953). A Value for n-Person Games. Contributions to the Theory of Games II — the original Shapley value from cooperative game theory. MLOPS
  345. Shapley, L. S. (1953). A Value for n-Person Games. In Contributions to the Theory of Games II, Princeton University Press — defines the Shapley value (EQ G2.7) and its axiomatic characterization (§2.5). GAME
  346. Shen, J. et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. Google — Tacotron 2; the two-stage acoustic-model + vocoder template of §4.3 (EQ MM4.6). MM
  347. Sheng, Y., Cao, S., Li, D. et al. (2023). S-LoRA: Serving Thousands of Concurrent LoRA Adapters. MLSys 2024 — multi-tenant serving of many adapters over one shared base model (§3.5). OPEN
  348. Shreve, S. E. (2004). Stochastic Calculus for Finance I: The Binomial Asset Pricing Model. Springer — a rigorous, self-contained development of replication, the risk-neutral measure, and market completeness (§2.1–§2.2). QUANT
  349. Silver, D. et al. (2017). Mastering the game of Go without human knowledge. Nature 550 — AlphaGo Zero / AlphaZero self-play (EQ G3.4). GAME
  350. Simonyan, K. & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015 — VGG; depth via stacks of \(3\times 3\) convolutions. DL
  351. Simpson, E. H. (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society B 13(2) — the paradox's namesake paper. STATS
  352. Sims, C. A. (1980). Macroeconomics and Reality. Econometrica 48(1) — introduced the VAR as an atheoretical alternative to large structural macro models (EQ T5.1). TIME
  353. Singh, S., Jaakkola, T., Littman, M. L. & Szepesvári, C. (2000). Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. Machine Learning 38 — convergence of SARSA (EQ R3.5) and the GLIE exploration conditions of §3.5. RL
  354. Snoek, J., Larochelle, H. & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. NeurIPS 2012 — Gaussian-process surrogates and acquisition functions for tuning (§2.4). MLOPS
  355. Spearman, C. (1904). The Proof and Measurement of Association between Two Things. American Journal of Psychology 15(1) — rank correlation, EQ S3.4. STATS
  356. Srivastava, N. et al. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15 — the dropout regularizer and inverted-dropout scaling (EQ N7.6). DL
  357. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15 — dropout regularization (EQ N1.9). DL
  358. Stewart, G. W. (1993). On the early history of the singular value decomposition. SIAM Review, 35(4), 551–566. Historical context tracing the SVD from Beltrami and Jordan to its modern role. STATS
  359. Stiennon, N. et al. (2020). Learning to Summarize from Human Feedback. NeurIPS — the reward-hacking dynamics of over-optimizing a learned reward model (Instrument R6.3, §6.5). RL
  360. Stone, M. (1974). Cross-Validatory Choice and Assessment of Statistical Predictions. J. R. Stat. Soc. B 36(2) — the foundational formalization of cross-validation for model assessment. MLOPS
  361. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. The standard intuition-first text for the column-space, rank, eigenvalue, and SVD material here. STATS
  362. Student [Gosset, W. S.] (1908). The Probable Error of a Mean. Biometrika 6(1), 1–25. The original derivation of the Student-t distribution (EQ S2.7), written at the Guinness brewery. STATS
  363. Sutskever, I., Vinyals, O. & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. NeurIPS 2014 — the encoder-decoder LSTM framework (EQ N4.1) and the source-reversal trick. DL
  364. Sutton, R. S. (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning 3 — the original TD(0) and TD(λ) prediction methods (EQ R3.2, EQ R3.3) and the bootstrapping idea at the heart of this chapter. RL
  365. Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press — the canonical text; the agent–environment loop, MDPs, returns, value functions, and exploration as framed in this chapter. RL
  366. Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NeurIPS 1999 — the policy gradient theorem (EQ R4.3) and its compatibility with a learned value function, the formal basis of actor-critic. RL
  367. Szegedy, C. et al. (2015). Going Deeper with Convolutions. CVPR 2015 — GoogLeNet / Inception; multi-scale blocks and \(1\times 1\) bottlenecks. DL
  368. Taylor, S. J. & Letham, B. (2018). Forecasting at Scale (Prophet). The American Statistician 72(1) — the decomposable trend + seasonality + holidays model of §6.4 (EQ T6.7). TIME
  369. TensorFlow Team. tf.data: Build TensorFlow input pipelines. Official guide — map / shuffle / batch / prefetch / cache and the overlap of §2.3 (EQ F2.3). FRAME
  370. The PyTorch Team. Automatic Differentiation with torch.autograd. Tutorial — the dynamic graph, .grad accumulation, and zero_grad. FRAME
  371. The PyTorch Team. PyTorch Documentation (stable). Official reference for tensors, autograd, nn, and optim. FRAME
  372. The vLLM Team. vLLM Documentation. Official guide to deployment, quantization, and the OpenAI-compatible server. OPEN
  373. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B 58(1) — the L1 penalty that performs embedded selection (EQ D4.6). DATA
  374. Tishby, N., Pereira, F. C. & Bialek, W. (2000). The Information Bottleneck Method. arXiv — mutual information as a principle for representation learning (§8.3). STATS
  375. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W. & Abbeel, P. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS 2017 — the foundational sim-to-real randomization technique (§6.3, EQ MM6.4). MM
  376. Toda, H. Y. & Yamamoto, T. (1995). Statistical Inference in Vector Autoregressions with Possibly Integrated Processes. Journal of Econometrics 66(1–2) — the lag-augmented Granger test valid under unit roots and cointegration (§5.5 caveat). TIME
  377. Touvron, H., Martin, L., Stone, K. et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 — the release (and custom community license) that catalyzed the open-weight ecosystem. OPEN
  378. Troyanskaya, O. et al. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525 — the kNN-impute (KNNimpute) paper. DATA
  379. Tseng, A., Chee, J., Sun, Q., Kuleshov, V. & De Sa, C. (2024). QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. ICML 2024 — incoherence processing + E8 lattice codebooks toward ~2 bits. VOL II
  380. Tversky, A. & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review 90(4) — the Linda problem (§1.5). STATS
  381. Uhlenbeck, G. E. & Ornstein, L. S. (1930). On the Theory of the Brownian Motion. Phys. Rev. 36 — the mean-reverting process of §1.5. QUANT
  382. Valevski, D. et al. (2024). Diffusion Models Are Real-Time Game Engines (GameNGen). arXiv — a neural network simulating DOOM interactively; a video predictor used as a playable environment. MM
  383. van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). CRC Press — the practical, freely-readable reference for MICE / chained equations. DATA
  384. van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. DeepMind — dilated causal convolutions for autoregressive audio (EQ MM4.7); the vocoder breakthrough. MM
  385. van den Oord, A., Vinyals, O. & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning (VQ-VAE). NeurIPS 2017 — discrete codebook latents that let autoregressive models generate over autoencoder tokens (§5.5). DL
  386. van Hasselt, H. (2010). Double Q-learning. NeurIPS 23 — diagnoses and corrects the maximization bias of the \(\max\) operator in EQ R3.4. RL
  387. van Hasselt, H., Guez, A. & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI 2016 (arXiv:1509.06461) — double-DQN, decoupling action selection from evaluation to curb over-estimation. RL
  388. Varma, S. & Simon, R. (2006). Bias in Error Estimation When Using Cross-Validation for Model Selection. BMC Bioinformatics 7:91 — the selection-bias result behind nested CV (EQ V1.6). MLOPS
  389. Vasicek, O. (1977). An Equilibrium Characterization of the Term Structure. Journal of Financial Economics 5(2) — the Ornstein–Uhlenbeck short rate and its closed-form bond price (EQ Q4.4–Q4.6). QUANT
  390. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017 — drops recurrence for pure self-attention; the destination of EQ N4.5. DL
  391. Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A. (2008). Extracting and Composing Robust Features with Denoising Autoencoders. ICML 2008 — the denoising autoencoder (EQ N5.2) and the manifold-projection view that seeds diffusion. DL
  392. von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele. Mathematische Annalen 100 — the original minimax theorem for two-player zero-sum games (§1.4). GAME
  393. von Neumann, J. & Morgenstern, O. (1944). Theory of Games and Economic Behavior. Princeton University Press — the founding text; normal-form games (EQ G1.1), the minimax theorem (EQ G1.6), and expected-utility theory. GAME
  394. Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association 58(301) — Ward's minimum-variance linkage for agglomerative clustering (§12.2, EQ M12.2). VOL I
  395. Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer — the standard modern reference for the distributions, moments, and convergence results in this chapter. STATS
  396. Wasserstein, R. L. & Lazar, N. A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician 70(2) — the profession's own caution on what a p-value is not. STATS
  397. Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning. Machine Learning 8 — the action-value function \(Q^\pi\) (EQ R1.6) and the convergence result that grounds model-free control. RL
  398. Webb, G. I., Hyde, R., Cao, H., Nguyen, H. L. & Petitjean, F. (2016). Characterizing Concept Drift. Data Mining and Knowledge Discovery 30 — a quantitative framework for describing how concepts drift over time. MLOPS
  399. Wei, A., Haghtalab, N. & Steinhardt, J. (2023). Jailbroken: How Does LLM Safety Training Fail? NeurIPS 2023 — the competing-objectives and mismatched-generalization framing used throughout §5.1–5.2. OPEN
  400. Welch, B. L. (1947). The Generalization of Student's Problem When Several Different Population Variances Are Involved. Biometrika 34 — the unequal-variance two-sample test of EQ S4.10. STATS
  401. White, I. R., Royston, P. & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine 30(4):377–399 — practical guidance on running and pooling MICE. DATA
  402. Wiener, N. (1923). Differential Space. J. Math. Phys. 2 — the rigorous existence proof of the process that bears his name. QUANT
  403. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 — the original REINFORCE estimator (EQ R4.4) and the log-derivative / score-function trick behind every policy gradient. RL
  404. Winters, P. R. (1960). Forecasting sales by exponentially weighted moving averages. Management Science 6(3) — adds the seasonal component, completing Holt-Winters (EQ T3.6/T3.7). TIME
  405. Wolpert, D. H. (1992). Stacked Generalization. Neural Networks 5(2) — learning the combiner with out-of-fold base predictions (EQ M14.6). VOL I
  406. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 — soft/hard attention beyond translation; shows the mechanism generalizes (§4.1). DL
  407. Yang, A., Yang, B., Hui, B. et al. (2024). Qwen2 Technical Report. arXiv:2407.10671 — the Qwen family's wide size ladder and (mostly) Apache-2.0 licensing. OPEN
  408. Yeo, I.-K. & Johnson, R. A. (2000). A New Family of Power Transformations to Improve Normality or Symmetry. Biometrika 87(4) — the Yeo-Johnson extension to real-valued data, EQ D3.8. DATA
  409. Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. (2014). How Transferable Are Features in Deep Neural Networks? NeurIPS 27 — the empirical basis for transfer learning; transferability falls with depth. DL
  410. Yue, X. et al. (2024). MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark. CVPR 2024 — college-level multimodal reasoning, a current frontier evaluation. MM
  411. Yule, G. U. (1927). On a Method of Investigating Periodicities in Disturbed Series. Phil. Trans. R. Soc. A 226 — the paper that introduced the autoregressive model (§1.3). TIME
  412. Zhao, T. Z., Kumar, V., Levine, S. & Finn, C. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. RSS 2023 — ACT (action chunking with transformers) and the ALOHA teleoperation platform (§6.4). MM
  413. Zhou, C., Liu, P., Xu, P. et al. (2023). LIMA: Less Is More for Alignment. NeurIPS 2023 — 1,000 curated examples beat far larger noisy sets; the evidence behind "quality over quantity." OPEN
  414. Zou, A., Wang, Z., Carlini, N. et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. The GCG attack (EQ OM5.2): gradient-based adversarial suffixes that transfer across models. OPEN
  415. Zou, H. & Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society B 67(2) — the L1+L2 fix for Lasso's instability under collinearity (EQ D4.6 note). DATA