All sources
- Abadi, M. et al. (2016). TensorFlow: A System for Large-Scale Machine Learning. OSDI 2016 — the computation-graph runtime and distributed execution model behind §2.1.
- Abadi, M. et al. (2016). TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv — the original TensorFlow white paper describing the dataflow-graph design.
- Aggarwal, C. C., Hinneburg, A. & Keim, D. A. (2001). On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT 2001, LNCS 1973, 420–434. Shows that lower-order (fractional, Manhattan) norms concentrate more slowly than Euclidean (§11.5).
- Aghajanyan, A., Zettlemoyer, L. & Gupta, S. (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL 2021 — the empirical low-rank hypothesis that motivates LoRA.
- Akaike, H. (1974). A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control 19(6) — the Akaike Information Criterion behind automated order selection (EQ T2.9).
- Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. KDD 2019 — the TPE sampler plus pruning combination that is the practical default in 2026.
- Alayrac, J.-B. et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022 — gated cross-attention into a frozen LLM; the canonical cross-attention design (§2.4).
- Ambroise, C. & McLachlan, G. J. (2002). Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data. PNAS 99(10) — the definitive demonstration of feature-selection bias and why selection must sit inside cross-validation (§4.5).
- Anil, C., Durmus, E., Sharma, M. et al. (2024). Many-shot Jailbreaking. Anthropic — long-context in-context attacks that scale with the number of faux-compliant examples.
- Ansel, J. et al. (2024). PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. ASPLOS 2024 — torch.compile and TorchDynamo.
- Arjovsky, M., Chintala, S. & Bottou, L. (2017). Wasserstein GAN. ICML 2017 — replacing JSD with the Wasserstein distance (EQ N6.4–N6.5) and the critic.
- Arlot, S. & Celisse, A. (2010). A Survey of Cross-Validation Procedures for Model Selection. Statistics Surveys 4 — the comprehensive modern reference on CV variants and their bias/variance.
- Artzner, P., Delbaen, F., Eber, J.-M. & Heath, D. (1999). Coherent Measures of Risk. Mathematical Finance 9(3) — the four coherence axioms; why VaR fails subadditivity and ES does not.
- Ashkboos, S. et al. (2024). QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. NeurIPS 2024 — Hadamard rotations that spread weight/activation outliers (§11.4).
- Auer, P., Cesa-Bianchi, N. & Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47 — UCB and the regret framework for principled exploration beyond ε-greedy (§1.5).
- Axelrod, R. (1984). The Evolution of Cooperation. Basic Books — the round-robin tournaments and the four properties of tit-for-tat (EQ G2.4, §2.3); the founding popular text on repeated cooperation.
- Axelrod, R. & Hamilton, W. D. (1981). The Evolution of Cooperation. Science 211(4489) — the peer-reviewed account of the iterated-PD tournaments and the evolutionary stability of tit-for-tat.
- Bachelier, L. (1900). Théorie de la spéculation. Ann. Sci. ÉNS 17 — the founding thesis modelling prices as Brownian motion, five years before Einstein.
- Baevski, A., Zhou, H., Mohamed, A. & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Meta — self-supervised audio encoder; the discriminative alternative to generative ASR.
- Bahdanau, D., Cho, K. & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015 — additive attention (EQ N4.2/N4.3); the birth of the mechanism and the length-decay figure.
- Bai, J., Lu, F., Zhang, K. et al. (2019). ONNX: Open Neural Network Exchange. onnx.ai — the framework-agnostic graph format and standard operator set at the heart of §3.2.
- Baldi, P. & Hornik, K. (1989). Neural Networks and Principal Component Analysis: Learning from Examples without Local Minima. Neural Networks 2(1) — proves the linear autoencoder optimum spans the top-k PCA subspace (§5.1).
- Basel Committee on Banking Supervision (2019). Minimum Capital Requirements for Market Risk (FRTB, finalized). BIS d457 — the switch to 97.5% stressed Expected Shortfall with liquidity-horizon scaling (EQ Q6.8).
- Baydin, A. G., Pearlmutter, B. A., Radul, A. A. & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: a Survey. JMLR 18(153) — forward vs reverse mode, the theory behind EQ F1.3.
- Bayes, T. & Price, R. (1763). An Essay towards solving a Problem in the Doctrine of Chances. Phil. Trans. R. Soc. — the original statement of EQ S1.7.
- Bellman, R. (1957). A Markovian Decision Process. Journal of Mathematics and Mechanics 6(5) — the formal origin of the MDP (EQ R1.1) and the recursive value relation behind EQ R1.4.
- Bellman, R. (1957). Dynamic Programming. Princeton University Press — the founding text; the principle of optimality and the recursive value relation behind EQ R2.1–R2.3.
- Bengio, Y., Louradour, J., Collobert, R. & Weston, J. (2009). Curriculum Learning. ICML 2009 — easy-to-hard example ordering (§4.4).
- Bengio, Y., Simard, P. & Frasconi, P. (1994). Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Transactions on Neural Networks 5(2) — the formal analysis of vanishing/exploding gradients behind EQ N3.2–N3.3.
- Benjamini, Y. & Hochberg, Y. (1995). Controlling the False Discovery Rate. J. R. Stat. Soc. B 57(1) — FDR control for the many-tests regime of EQ S4.13.
- Bergmeir, C. & Benítez, J. M. (2012). On the Use of Cross-Validation for Time Series Predictor Evaluation. Information Sciences 191 — forward-chaining validation for temporal data (§1.4).
- Bergstra, J. & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. JMLR 13 — the result that random search matches or beats grid search under low effective dimensionality (§2.3).
- Bertsekas, D. P. (2017). Dynamic Programming and Optimal Control (4th ed.). Athena Scientific — the contraction-mapping convergence analysis (EQ R2.6) in full rigor.
- Beyer, K., Goldstein, J., Ramakrishnan, R. & Shaft, U. (1999). When Is "Nearest Neighbor" Meaningful? ICDT 1999, LNCS 1540, 217–235. The foundational distance-concentration result behind EQ M11.6.
- Bickel, P. J., Hammel, E. A. & O'Connell, J. W. (1975). Sex Bias in Graduate Admissions: Data from Berkeley. Science 187(4175) — the canonical Simpson's paradox case study.
- Bifet, A. & Gavaldà, R. (2007). Learning from Time-Changing Data with Adaptive Windowing (ADWIN). SIAM SDM 2007 — an adaptive-window detector with a formal false-positive bound.
- Black, F. & Scholes, M. (1973). The Pricing of Options and Corporate Liabilities. Journal of Political Economy 81(3) — the original no-arbitrage derivation and the closed-form formula; the continuous-time closed form the binomial tree converges to as N → ∞ (Quant 03), with the lattice as its discrete, fully constructive counterpart.
- Black, K., Brown, N., Driess, D., Esmail, A., et al. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence — flow-matching continuous action chunks at high frequency (§6.2, EQ MM6.3).
- Blitzstein, J. K. & Hwang, J. (2019). Introduction to Probability (2nd ed.). Chapman & Hall / CRC. Harvard Stat 110 — conditioning, Bayes, expectation, LLN; free course materials.
- Board of Governors of the Federal Reserve System & OCC (2011). SR 11-7: Guidance on Model Risk Management. Supervisory letter — the model-risk framework and three pillars of §7.5 (EQ V7.5).
- Bollerslev, T. (1986). Generalized Autoregressive Conditional Heteroskedasticity. Journal of Econometrics 31(3) — adds the lagged-variance term, giving GARCH(1,1) (EQ T4.4), the field's workhorse.
- Borsos, Z. et al. (2022). AudioLM: A Language Modeling Approach to Audio Generation. Google — next-audio-token prediction over codec tokens (EQ MM4.8); the audio-LM of §4.4.
- Boser, B. E., Guyon, I. M. & Vapnik, V. N. (1992). A Training Algorithm for Optimal Margin Classifiers. Proceedings of COLT '92, 144–152. Where the maximum-margin hyperplane and the kernel trick (EQ M10.1, M10.7) were first combined — the true origin of the kernel SVM.
- Box, G. E. P. & Cox, D. R. (1964). An Analysis of Transformations. Journal of the Royal Statistical Society B 26(2) — the Box-Cox power-transform family, EQ D3.7.
- Box, G. E. P., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley — the canonical text; the ACF/PACF identification method (§1.3) and integration order \(d\) (§1.5) are its core.
- Boyle, P. (1977). Options: A Monte Carlo Approach. Journal of Financial Economics 4(3) — the paper that brought simulation to option pricing.
- Bradbury, J., Frostig, R., Hawkins, P. et al. (2018). JAX: composable transformations of Python+NumPy programs. github.com/google/jax — jit/grad/vmap/pmap over XLA (EQ F3.5).
- Breck, E., Cai, S., Nielsen, E., Salib, M. & Sculley, D. (2017). The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. IEEE Big Data 2017 — the data/model/infra test layers of §7.3.
- Breiman, L. (1996). Bagging Predictors. Machine Learning 24(2) — bootstrap aggregating and its variance-reduction argument (EQ M14.2, M14.3).
- Breiman, L. (2001). Random Forests. Machine Learning 45(1) — feature subsampling as the second decorrelation lever; OOB error (§14.2).
- Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review 78(1) — the original proper scoring rule for probabilistic forecasts (§3.5).
- Brigo, D. & Mercurio, F. (2006). Interest Rate Models — Theory and Practice (2nd ed.). Springer Finance — the standard practitioner reference for Vasicek, CIR, Hull–White, G2++, and calibration.
- Broder, A. Z. (1997/1998). On the resemblance and containment of documents & Min-wise independent permutations. SEQUENCES / STOC. MinHash estimation of Jaccard similarity at scale (§11.4).
- Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Google DeepMind — action tokenization over a VLM vocabulary and web/robot co-training (§6.2, EQ MM6.2).
- Brooks, T. et al. (2024). Video Generation Models as World Simulators. OpenAI (Sora) — spacetime latent patches and a diffusion transformer over video, the basis of §3.5 and EQ MM3.6.
- Brown, N. & Sandholm, T. (2019). Superhuman AI for multiplayer poker. Science 365 — Pluribus, self-play in imperfect information.
- Bruce, J. et al. (2024). Genie: Generative Interactive Environments. ICML 2024 — latent-action world model learned from unlabelled gameplay video; playable worlds from one prompt (EQ MM5.4).
- Burges, C. J. C. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2(2), 121–167. The most-cited tutorial — derives the dual (EQ M10.3) and the KKT support-vector conditions step by step.
- Campello, R. J. G. B., Moulavi, D., Zimek, A. & Sander, J. (2015). Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM TKDD 10(1) — HDBSCAN, the variable-density successor that removes DBSCAN's ε knob (§12.3 note).
- Carion, N. et al. (2020). End-to-End Object Detection with Transformers (DETR). ECCV 2020 — detection as set prediction; no anchors or non-max suppression.
- Carpenter, B. et al. (2017). Stan: A Probabilistic Programming Language. J. Stat. Softw. — Hamiltonian Monte Carlo for the hierarchical models of §5.5.
- Casella, G. & Berger, R. L. (1987). Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem. JASA — a careful account of where the two frameworks agree and diverge.
- Cawley, G. C. & Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. JMLR 11 — why tuning and reporting on the same folds (or tuning on the test set) inflates scores, and nested cross-validation as the fix (§1.2).
- Chang, C.-C. & Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM TIST 2(3), 1–27. The standard implementation behind scikit-learn's SVC, and the reference for practical \((C, \gamma)\) selection (§10.5).
- Chang, H., Zhang, H., Jiang, L., Liu, C. & Freeman, W. T. (2022). MaskGIT: Masked Generative Image Transformer. CVPR 2022 — parallel masked-token decoding, the few-round alternative to autoregression in §3.4.
- Chaudhry, A., Ranzato, M., Rohrbach, M. & Elhoseiny, M. (2019). Efficient Lifelong Learning with A-GEM. ICLR 2019 — gradient-episodic-memory replay for continual learning (§4.4).
- Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 — the interpolation method of EQ D5.3.
- Chen, T. & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD 2016 — regularized second-order objective and split gain (EQ M15.6–M15.7).
- Chicco, D. & Jurman, G. (2020). The Advantages of the Matthews Correlation Coefficient (MCC) over F1 and Accuracy. BMC Genomics 21 — the case for MCC on imbalanced binary problems (§5.5).
- Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP 2014 — introduces the GRU (EQ N3.6) and the encoder–decoder framing carried into Chapter 04.
- Chollet, F. (2015). Keras. Official documentation — the high-level layers/Sequential/Functional API of §2.2 and the current Keras 3 multi-backend design.
- Chollet, F. (2021). Deep Learning with Python (2nd ed.). Manning — the canonical Keras text by its author; covers layers, fit, callbacks, and custom training loops.
- Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S. & Amodei, D. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS — the original preference-to-reward pipeline and the Bradley–Terry reward model (EQ R6.2–R6.3) that RLHF scaled to language.
- Christoffersen, P. F. (1998). Evaluating Interval Forecasts. International Economic Review 39(4) — the conditional-coverage / independence test that complements Kupiec.
- Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. NIPS 2014 Deep Learning Workshop — the LSTM-vs-GRU comparison underpinning the "no universal winner" claim (§3.4).
- Clauset, A., Shalizi, C. R. & Newman, M. E. J. (2009). Power-Law Distributions in Empirical Data. SIAM Review 51(4), 661–703. The methodological reference on fitting and — crucially — testing power-law tails against alternatives.
- Cont, R. (2001). Empirical Properties of Asset Returns: Stylized Facts and Statistical Issues. Quantitative Finance 1(2), 223–236. A careful survey of the fat-tail evidence and why pinning down the tail exponent is genuinely hard.
- Cortes, C. & Vapnik, V. (1995). Support-Vector Networks. Machine Learning 20(3), 273–297. The paper that introduced the soft margin and slack variables (EQ M10.5) and gave the method its modern form — the canonical primary source for this chapter.
- Cover, T. & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Information Theory, 13(1), 21–27. Why the distance choice is the model: the founding analysis of k-NN.
- Cover, T. M. & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley — the standard graduate text; entropy, mutual information, source coding, and their inequalities.
- Cox, J. C., Ingersoll, J. E. & Ross, S. A. (1985). A Theory of the Term Structure of Interest Rates. Econometrica 53(2) — the square-root diffusion, the Feller condition, and non-negative rates (EQ Q4.7–Q4.9).
- Cox, J. C., Ross, S. A. & Rubinstein, M. (1979). Option Pricing: A Simplified Approach. Journal of Financial Economics 7(3) — the original recombining binomial lattice, the CRR parameters of EQ Q2.4, and the convergence to Black–Scholes.
- Dao, T. & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML 2024 — Mamba-2 and the SSD duality of EQ 11.4.
- Davis, J. & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. ICML 2006 — why PR-AUC, not ROC-AUC, is the metric to trust under imbalance (§5.5).
- DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437 — a frontier-class MoE trained at a fraction of typical cost, with unusually open methodology.
- DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv — RLVR with GRPO at scale; reasoning behavior emerging from verifiable rewards (§6.5).
- Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. & Harshman, R. (1990). Indexing by Latent Semantic Analysis. JASIS 41(6) — LSA: truncated SVD of a term–document matrix, the §13.5 connection to embeddings.
- Défossez, A. et al. (2024). Moshi: A Speech-Text Foundation Model for Real-Time Dialogue. Kyutai — full-duplex spoken dialogue; the streaming / low-latency direction of §4.5.
- Défossez, A., Copet, J., Synnaeve, G. & Adi, Y. (2022). High Fidelity Neural Audio Compression. Meta — EnCodec; residual-vector-quantized neural codec (EQ MM4.4) that turns audio into discrete tokens.
- Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B 39(1) — the Expectation–Maximization algorithm behind GMM fitting (§12.4, EQ M12.5).
- Dettmers, T., Lewis, M., Belkada, Y. & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022 — outlier-aware 8-bit quantization; why a few features must be preserved.
- Dettmers, T., Pagnoni, A., Holtzman, A. & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023 — 4-bit NF4 base weights plus LoRA adapters; fine-tuning a 65B model on a single 48 GB GPU.
- Dhariwal, P. & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. NeurIPS 2021 — the result marking diffusion's displacement of GANs for large-scale image generation (§6.5).
- Dickey, D. A. & Fuller, W. A. (1979). Distribution of the Estimators for Autoregressive Time Series with a Unit Root. JASA 74(366) — the unit-root test that decides random walk vs stationary (§1.4) and, in its augmented (ADF) form, how many differences \(d\) a series needs.
- Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. Multiple Classifier Systems, LNCS 1857 — the canonical survey of why and when ensembles help (§14.1).
- Domingos, P. & Pazzani, M. (1997). On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning 29 — explains why a model with violated independence assumptions still classifies well (the paradox of §9.5).
- Dosovitskiy, A. et al. (2021). An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR 2021 — the Vision Transformer, the chief modern challenger to the convolutional prior.
- Easley, D. & Kleinberg, J. (2010). Networks, Crowds, and Markets. Cambridge University Press (Ch. 6) — an accessible, freely available treatment of best response, dominant strategies, and equilibrium used to frame this chapter.
- Eckart, C. & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3), 211–218. The optimal low-rank approximation theorem (EQ S6.7).
- Efron, B. & Morris, C. (1975). Data Analysis Using Stein's Estimator and Its Generalizations. JASA / Ann. Statist. — shrinkage and partial pooling as empirical Bayes.
- Einstein, A. (1905). Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung…. Ann. Phys. 322 — the physical derivation that variance grows linearly in time.
- Engle, R. F. (1982). Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation. Econometrica 50(4) — the original ARCH model (EQ T4.2); the work cited for Engle's 2003 Nobel Prize.
- Engle, R. F. (2002). Dynamic Conditional Correlation: A Simple Class of Multivariate GARCH Models. Journal of Business & Economic Statistics 20(3) — the DCC model bridging to the multivariate chapter.
- Engle, R. F. & Granger, C. W. J. (1987). Co-integration and Error Correction: Representation, Estimation, and Testing. Econometrica 55(2) — defines cointegration and the Granger representation theorem linking it to the VECM (EQ T5.7–T5.8).
- Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. KDD-96 — the original DBSCAN: core/border/noise points and density-reachability (§12.3, EQ M12.3).
- European Union (2024). Regulation (EU) 2024/1689 — the Artificial Intelligence Act. Official Journal — risk-tiered obligations (risk management, data governance, logging, human oversight) phasing in through 2026–2027.
- Fawcett, T. (2006). An Introduction to ROC Analysis. Pattern Recognition Letters 27(8) — the canonical tutorial on ROC curves, AUC, and the pair-counting identity (§4.1).
- Fischer, H. (2011). A History of the Central Limit Theorem: From Classical to Modern Probability Theory. Springer — the full lineage of EQ S2.6, from de Moivre and Laplace to Lindeberg and Lévy.
- Frantar, E., Ashkboos, S., Hoefler, T. & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023 — second-order error-aware rounding (§11.4).
- Freund, Y. & Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55(1) — AdaBoost and its training-error bound (EQ M14.5).
- Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29(5) — boosting as functional gradient descent (EQ M14.4).
- Friedman, J. W. (1971). A Non-cooperative Equilibrium for Supergames. Review of Economic Studies 38(1) — Grim-Trigger equilibria and an early form of the Folk Theorem behind EQ G2.3 (§2.1).
- Friedman, J., Hastie, T. & Tibshirani, R. (2000). Additive Logistic Regression: A Statistical View of Boosting. Annals of Statistics 28(2) — AdaBoost as stagewise exponential-loss minimization (EQ M15.5).
- Fujimoto, S., van Hoof, H. & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. arXiv:1802.09477 — TD3; twin critics, delayed updates, and target smoothing.
- Gama, J., Medas, P., Castillo, G. & Rodrigues, P. (2004). Learning with Drift Detection (DDM). SBIA 2004, LNCS 3171 — the error-rate drift detector behind EQ V5.4.
- Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M. & Bouchachia, A. (2014). A Survey on Concept Drift Adaptation. ACM Computing Surveys 46(4) — the canonical taxonomy of drift types and adaptation strategies (§5.1).
- Gardner, E. S. & McKenzie, E. (1985). Forecasting trends in time series. Management Science 31(10) — the damped-trend method (EQ T3.5), a perennial competition benchmark.
- Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press — the standard modern reference for conjugacy, hierarchy, and computation.
- Geman, S. & Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE TPAMI 6(6) — introduced Gibbs sampling (EQ S7.9) to statistics and image analysis.
- Gerganov, G. et al. (2023). llama.cpp — LLM inference in C/C++. The reference local inference engine and the home of the GGUF format.
- Giles, M. & Glasserman, P. (2006). Smoking Adjoints: Fast Monte Carlo Greeks. RISK Magazine — adjoint algorithmic differentiation for computing all Greeks in one reverse pass.
- Glasserman, P. (2003). Monte Carlo Methods in Financial Engineering. Springer — the definitive reference for variance reduction, path simulation, and Greeks.
- Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS — the variance-preserving (Xavier/Glorot) initialization of §1.2.
- Glosten, L. R., Jagannathan, R. & Runkle, D. E. (1993). On the Relation between the Expected Value and the Volatility of the Nominal Excess Return on Stocks. Journal of Finance 48(5) — the GJR-GARCH asymmetric extension (EQ T4.6).
- Gneiting, T. & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. J. American Statistical Association 102(477) — the theory of why log loss and Brier reward honest probabilities (§3.5).
- Goldstein, A., Kapelner, A., Bleich, J. & Pitkin, E. (2015). Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation. J. Computational and Graphical Statistics 24(1) — ICE curves (§6.3).
- Golub, G. H. & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press. The canonical numerical reference for the SVD, power iteration, and conditioning.
- Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press — Ch. 8 (optimization) and Ch. 7 (regularization), the standard textbook treatment.
- Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. (2014). Generative Adversarial Networks. NeurIPS 2014 — the original adversarial game, the optimal discriminator (EQ N6.2), and the JSD reduction (EQ N6.3).
- Google. TensorFlow Lite / LiteRT — On-Device Machine Learning. tensorflow.org/lite — the SavedModel→FlatBuffer converter and edge runtime of §3.3.
- Google. TensorFlow Serving. tensorflow.org/tfx — SavedModel hosting with versioned hot-swap, the TF-native serving path.
- Granger, C. W. J. (1969). Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica 37(3) — the original definition of Granger causality (EQ T5.9).
- Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. IEEE TNNLS 28(10) — systematic ablation of LSTM components, including the value of the forget gate and forget-bias initialization (§3.3).
- Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B. & Smola, A. (2012). A Kernel Two-Sample Test (MMD). JMLR 13 — the maximum-mean-discrepancy test for multivariate covariate-shift detection (§5.3).
- Grinsztajn, L., Oyallon, E. & Varoquaux, G. (2022). Why Do Tree-Based Models Still Outperform Deep Learning on Typical Tabular Data? NeurIPS 2022 Datasets & Benchmarks — the contested empirical case behind §15.5's honest comparison.
- Groeneveld, D., Beltagy, I., Walsh, P. et al. (2024). OLMo: Accelerating the Science of Language Models. arXiv:2402.00838 — a genuinely open-source model: weights, data, and training code all released.
- Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752 — the modern selective state-space model reviving gated linear recurrence at scale (§3.4 footnote).
- Gu, A., Goel, K. & Ré, C. (2021). Efficiently Modeling Long Sequences with Structured State Spaces (S4). ICLR 2022 — the structured SSM that started the line (§11.2).
- Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. (2017). Improved Training of Wasserstein GANs. NeurIPS 2017 — WGAN-GP: the gradient penalty that replaced weight clipping.
- Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. ICML 2017 — modern deep networks are systematically over-confident; temperature scaling as a fix (§4.4).
- Gururangan, S., Marasović, A., Swayamdipta, S. et al. (2020). Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. ACL 2020 — domain- and task-adaptive continued pre-training (DAPT / TAPT).
- Guyon, I. & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3 — the canonical survey of filter, wrapper and embedded selection (§4.3).
- Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. (2002). Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning 46 — recursive feature elimination, EQ D4.5.
- Ha, D. & Schmidhuber, J. (2018). World Models. NeurIPS 2018 — the foundational demonstration: train an agent inside its own learned dream and transfer to the real environment.
- Haarnoja, T., Zhou, A., Abbeel, P. & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor. arXiv:1801.01290 — the maximum-entropy objective (EQ R5.5) and the continuous-control default of §5.4.
- Assran, M. et al. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. See also Assran et al., I-JEPA (arXiv:2301.08243). Embedding-prediction self-supervision scaled to video as a world model for planning.
- Hafner, D., Pasukonis, J., Ba, J. & Lillicrap, T. (2023). Mastering Diverse Domains through World Models (DreamerV3). arXiv — one fixed hyperparameter set across 150+ tasks; first to mine diamonds in Minecraft from scratch (EQ MM5.2).
- Halko, N., Martinsson, P.-G. & Tropp, J. A. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review 53(2) — randomized SVD, how truncated factorizations are actually computed at scale.
- Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press — the graduate-level reference for stationarity, unit roots, and the econometric theory behind §1.2 and §1.4.
- Han, H., Wang, W.-Y. & Mao, B.-H. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. ICIC 2005 — synthesizing only near the decision boundary (§5.3).
- Hand, D. J. (2009). Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve. Machine Learning 77(1) — the influential critique of AUC and the proposed H-measure.
- Hanley, J. A. & McNeil, B. J. (1982). The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve. Radiology 143(1) — the AUC = Wilcoxon–Mann–Whitney equivalence (EQ V4.2).
- Hansen, P. R. & Lunde, A. (2005). A Forecast Comparison of Volatility Models: Does Anything Beat a GARCH(1,1)? Journal of Applied Econometrics 20(7) — the large horse race finding GARCH(1,1) hard to beat for exchange rates, though models that capture a leverage effect outperform it on equity (IBM) returns.
- Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer — free online; cross-validation, the train/test contract, and the right vs wrong way to cross-validate (§1.2, §1.5).
- Hastings, W. K. (1970). Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 57(1) — the generalization to asymmetric proposals, completing Metropolis–Hastings.
- He, H. & Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21(9) — the canonical survey of resampling, cost-sensitive learning, and evaluation.
- He, H., Bai, Y., Garcia, E. A. & Li, S. (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. IJCNN 2008 — density-adaptive synthetic generation (§5.3).
- He, K., Chen, X., Xie, S., Li, Y., Dollár, P. & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022 — masked autoencoding as a strong self-supervised pretext for vision transformers (§5.5).
- He, K., Zhang, X., Ren, S. & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV — He/Kaiming initialization for ReLU networks (EQ N1.5).
- He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR — residual connections / ResNet (§1.4, EQ N1.7–N1.8).
- Heath, D., Jarrow, R. & Morton, A. (1992). Bond Pricing and the Term Structure of Interest Rates: A New Methodology. Econometrica 60(1) — the HJM no-arbitrage framework that generalizes all short-rate models to the whole forward curve.
- Henderson, P. et al. (2018). Deep Reinforcement Learning that Matters. AAAI 2018 (arXiv:1709.06560) — the reproducibility and seed-variance study behind §5.5 and EQ R5.6.
- Heston, S. L. (1993). A Closed-Form Solution for Options with Stochastic Volatility. Review of Financial Studies 6(2) — stochastic-variance model that generates the smile endogenously (EQ Q3.7).
- Higgins, I. et al. (2017). β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2017 — weighting the KL term to encourage disentangled latent factors (§5.4).
- Hinton, G. E. & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science 313(5786) — deep autoencoders, trained layer-wise, beat PCA at nonlinear dimensionality reduction (§5.1).
- Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance. NeurIPS 2021 Workshop — the conditional/unconditional extrapolation of EQ MM3.4.
- Ho, J., Jain, A. & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020 — the noise-prediction objective and forward reparameterization behind EQ MM3.1–MM3.2.
- Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation 9(8) — introduces the LSTM cell, the constant error carousel, and the gating scheme of EQ N3.4–N3.5.
- Hoffman, M. D. & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. JMLR 15 — NUTS, the adaptive HMC sampler behind Stan / PyMC / NumPyro (§7.5).
- Holt, C. C. (2004, orig. 1957). Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting 20(1) — reprint of the 1957 ONR memorandum that introduced double smoothing (EQ T3.4).
- Howard, J. & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification (ULMFiT). ACL 2018 — gradual unfreezing and discriminative fine-tuning (§4.2).
- Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press — introduced policy iteration (EQ R2.7) and the policy improvement theorem.
- Hu, E. J., Shen, Y., Wallis, P. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022 — the low-rank weight update (EQ OM3.2) at the heart of practical open-model fine-tuning.
- Hubert, L. & Arabie, P. (1985). Comparing Partitions. Journal of Classification 2(1) — the Adjusted Rand Index for chance-corrected external validation (§12.5).
- Huffman, D. A. (1952). A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE 40(9) — the optimal prefix code of §8.4 (EQ S8.6, Instrument S8.3).
- Hugging Face. PEFT: Parameter-Efficient Fine-Tuning — Documentation. Official library docs for LoRA/QLoRA/DoRA training and adapter management.
- Hull, J. & White, A. (1990). Pricing Interest-Rate-Derivative Securities. Review of Financial Studies 3(4) — time-dependent drift that fits the initial curve exactly (EQ Q4.10–Q4.11).
- Hull, J. C. (2021). Options, Futures, and Other Derivatives (11th ed.). Pearson — Ch. 13–21: the standard practitioner treatment of binomial trees, risk-neutral valuation, and American-option pricing.
- Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.), Ch. 8. OTexts — the freely available standard textbook treatment of SES, Holt-Winters, and ETS.
- Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts (free online) — the modern practitioner's guide; decomposition (§1.1), STL, and Box–Cox (§1.5).
- Hyndman, R. J. & Khandakar, Y. (2008). Automatic Time Series Forecasting: The forecast Package for R. Journal of Statistical Software 27(3) — the algorithm behind auto.arima and its AIC-driven stepwise order search (§2.5).
- Hyndman, R. J. & Koehler, A. B. (2006). Another Look at Measures of Forecast Accuracy. International Journal of Forecasting 22(4) — the canonical critique of MAPE and the case for scaled error measures (§3.2).
- Hyndman, R. J., Koehler, A. B., Ord, J. K. & Snyder, R. D. (2008). Forecasting with Exponential Smoothing: The State Space Approach. Springer — the definitive treatment of the ETS innovations state-space framework (EQ T3.8).
- Hyndman, R. J., Koehler, A. B., Snyder, R. D. & Grose, S. (2002). A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting 18(3) — the taxonomy of 30 ETS models and automatic AIC selection (§3.4).
- Inan, H., Upasani, K., Chi, J. et al. (2023). Llama Guard: LLM-based Input-Output Safeguarding for Human-AI Conversations. Meta — the open guard-model approach behind the input/output filters of §5.4.
- Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine 2(8) — the conditional-probability argument behind §4.6.
- Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML — batch normalization (§1.3, EQ N1.6).
- Itô, K. (1944). Stochastic Integral. Proc. Imperial Acad. Tokyo 20(8) — the original construction of the Itô integral and the lemma of §1.4.
- J.P. Morgan / Reuters (1996). RiskMetrics — Technical Document (4th ed.). The document that made parametric VaR an industry standard: EWMA covariance, Gaussian quantiles, √-time scaling.
- Jamieson, K. & Talwalkar, A. (2016). Non-stochastic Best Arm Identification and Hyperparameter Optimization. AISTATS 2016 — the successive-halving subroutine behind EQ V2.6.
- Jamshidian, F. (1989). An Exact Bond Option Formula. Journal of Finance 44(1) — decomposes a swaption into a portfolio of bond options, making Gaussian models swaption-closed-form.
- Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press — the objective-Bayesian case for probability as extended logic.
- Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. Probability as extended logic — the Bayesian reading of §1.1 and §1.3.
- Jiang, A. Q., Sablayrolles, A., Mensch, A. et al. (2023). Mistral 7B. arXiv:2310.06825 — an Apache-2.0 dense model that set the efficiency bar for small open models.
- Johansen, S. (1991). Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian Vector Autoregressive Models. Econometrica 59(6) — the maximum-likelihood (trace and max-eigenvalue) tests for cointegration rank (§5.4).
- Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft), Ch. 4: Naive Bayes and Sentiment Classification. Stanford — a modern, worked treatment of multinomial NB for text with smoothing.
- Kaelbling, L. P., Littman, M. L. & Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4 — an early, lucid survey of the problem formulation used throughout this chapter.
- Karras, T., Laine, S. & Aila, T. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks. CVPR 2019 — StyleGAN: the mapping network, AdaIN style modulation (EQ N6.6), and style mixing.
- Katharopoulos, A., Vyas, A., Pappas, N. & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020 — the reassociation trick of EQ 11.5.
- Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. (2012). Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM TKDD 6(4) — the field's working definition and taxonomy of leakage (§1.3, EQ D1.3).
- Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. & Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NeurIPS 2017 — histogram binning, leaf-wise growth, GOSS & EFB (EQ M15.8–M15.9).
- Kendall, M. G. (1938). A New Measure of Rank Correlation. Biometrika 30(1–2) — Kendall's τ, EQ S3.5.
- Keskar, N. S. et al. (2016). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR 2017 — batch size, flat vs. sharp minima, and the generalization debate.
- Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T. et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. CoRL 2024 — an open-weight tokenized VLA, the reproducible counterpart to RT-2 (§6.2).
- Kingma, D. P. & Ba, J. (2014). Adam: A Method for Stochastic Optimization. ICLR 2015 — the first/second-moment adaptive optimizer with bias correction (EQ N7.3).
- Kingma, D. P. & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR — the variational autoencoder and the ELBO objective (EQ S8.9).
- Kirillov, A. et al. (2023). Segment Anything. ICCV 2023 — SAM; promptable, open-vocabulary segmentation (§1.4 frontier).
- Kirkpatrick, J., Pascanu, R., Rabinowitz, N. et al. (2017). Overcoming Catastrophic Forgetting in Neural Networks. PNAS 2017 — Elastic Weight Consolidation (EQ OM4.6).
- Kloeden, P. & Platen, E. (1992). Numerical Solution of Stochastic Differential Equations. Springer — Euler–Maruyama and higher-order path-discretization schemes.
- Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. IJCAI 1995 — the empirical case for stratified 10-fold CV (§1.2).
- Kolmogorov, A. N. (1933). Foundations of the Theory of Probability (Grundbegriffe der Wahrscheinlichkeitsrechnung). The axiomatic foundation of EQ S1.1; measure-theoretic probability.
- Koren, Y., Bell, R. & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. IEEE Computer 42(8) — the canonical write-up of the latent-factor model and biases (EQ M13.4–M13.5), from the Netflix-Prize winners.
- Kraskov, A., Stögbauer, H. & Grassberger, P. (2004). Estimating Mutual Information. Physical Review E 69(6) — the k-nearest-neighbour estimator behind practical MI feature scores (EQ D4.7).
- Kreuzberger, D., Kühl, N. & Hirschl, S. (2022). Machine Learning Operations (MLOps): Overview, Definition, and Architecture. arXiv:2205.02302 — a current reference architecture for pipelines, CI/CD, and CT.
- Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 25 — AlexNet; ReLU, dropout, and GPU training that ignited the deep-learning era.
- Kuhn, M. & Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press (full text online) — encoding, scaling, transforms and leakage-safe resampling.
- Kullback, S. & Leibler, R. A. (1951). On Information and Sufficiency. Annals of Mathematical Statistics 22(1) — relative entropy / KL divergence (EQ S8.4).
- Kupiec, P. H. (1995). Techniques for Verifying the Accuracy of Risk Measurement Models. Journal of Derivatives 3(2) — the proportion-of-failures likelihood-ratio backtest (EQ Q6.7).
- Kwon, W., Li, Z., Zhuang, S. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023 — the vLLM paper; paged KV cache and continuous batching.
- Kwon, W., Li, Z., Zhuang, S. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention (vLLM). github.com/vllm-project/vllm — paged-attention KV cache and continuous batching for LLM throughput.
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview — the JEPA position paper; predict in representation space, not pixel space (EQ MM5.3).
- LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86(11) — LeNet-5; the conv → pool → dense template trained end-to-end by backprop.
- Lee, D. D. & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401 — NMF and the parts-based decomposition of §13.4 (EQ M13.6).
- Lee, D. D. & Seung, H. S. (2001). Algorithms for Non-negative Matrix Factorization. NIPS 13 — the multiplicative update rules of EQ M13.7 and their convergence guarantee.
- Levy, O. & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. NIPS 27 — proof that word2vec's skip-gram is implicitly factorizing a shifted PMI matrix.
- Li, J., Li, D., Savarese, S. & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and LLMs. ICML 2023 — the Q-Former, a query-based bridge between frozen vision and frozen language.
- Li, L. et al. (2020). A System for Massively Parallel Hyperparameter Tuning. MLSys 2020 — ASHA, the asynchronous successive halving used in production tuners.
- Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A. & Talwalkar, A. (2017). Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. JMLR 18 — successive halving across brackets, the bandit view of early stopping (§2.5).
- Li, Y. et al. (2023). Evaluating Object Hallucination in Large Vision-Language Models (POPE). EMNLP 2023 — the object-hallucination probe behind the §2.5 evaluation caveats.
- Lieber, O. et al. (2024). Jamba: A Hybrid Transformer-Mamba Language Model. AI21 — a production-scale interleaved Mamba/attention/MoE hybrid (§11.3).
- Lillicrap, T. P. et al. (2016). Continuous control with deep reinforcement learning. arXiv:1509.02971 — DDPG, the deterministic actor–critic for continuous actions (§5.4).
- Lim, B., Arık, S. Ö., Loeff, N. & Pfister, T. (2021). Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting 37(4) — attention, variable selection and quantile outputs (§6.4).
- Lin, J., Tang, J., Tang, H. et al. (2024). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024 — protects salient weight channels; widely used for 4-bit serving.
- Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV 2017 (RetinaNet) — focal loss, EQ D5.5.
- Little, R. J. A. & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley — the canonical textbook on missingness mechanisms, likelihood-based methods, and multiple imputation.
- Liu, H., Hussain, F., Tan, C. L. & Dash, M. (2002). Discretization: An Enabling Technique. Data Mining and Knowledge Discovery 6 — a survey of binning / discretization methods (EQ D3.9).
- Liu, H., Li, C., Wu, Q. & Lee, Y. J. (2023). Visual Instruction Tuning (LLaVA). NeurIPS 2023 — projected patches as input tokens (EQ MM2.1) plus LLM-bootstrapped instruction data; the dominant early-fusion recipe.
- Liu, S.-Y., Wang, C.-Y., Yin, H. et al. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. ICML 2024 — decouples magnitude from direction for a quality gain over vanilla LoRA.
- Ljung, G. M. & Box, G. E. P. (1978). On a Measure of Lack of Fit in Time Series Models. Biometrika 65(2) — the portmanteau test for "are these residuals white noise?" (§1.4).
- Longstaff, F. & Schwartz, E. (2001). Valuing American Options by Simulation: A Simple Least-Squares Approach. Review of Financial Studies 14(1) — least-squares Monte Carlo for early-exercise payoffs.
- Loshchilov, I. & Hutter, F. (2016). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017 — cosine annealing and cyclical warm restarts (EQ N7.5).
- Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR — AdamW, decoupling weight decay from the adaptive step (EQ N1.10).
- Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P. & Mordatch, I. (2017). Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NeurIPS — MADDPG and the CTDE gradient of EQ G3.6.
- Lundberg, S. M. & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS 2017 — the SHAP framework and its uniqueness theorem (§6.5, EQ V6.4–V6.5).
- Lundberg, S. M. & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS 30 (SHAP) — the Shapley value applied to machine-learning feature attribution, the §2.5 bridge into Chapter 03.
- Lundberg, S. M., Erion, G. G. & Lee, S.-I. (2018). Consistent Individualized Feature Attribution for Tree Ensembles. arXiv — TreeSHAP, exact polynomial-time Shapley values for trees (§6.5).
- Luo, Y., Yang, Z., Meng, F. et al. (2023). An Empirical Study of Catastrophic Forgetting in LLMs During Continual Fine-tuning. Measures forgetting of general ability across instruction fine-tunes (§4.5).
- Luong, M.-T., Pham, H. & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015 — the dot, general, and concat score functions and global-vs-local attention (EQ N4.4).
- Lütkepohl, H. (2005). New Introduction to Multiple Time Series Analysis. Springer — the standard graduate reference for VAR estimation, IRFs, FEVD, and the companion form (EQ T5.2–T5.6).
- Ma, S. et al. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet b1.58). Microsoft Research — ternary {-1,0,+1} weights trained from scratch (§11.4).
- MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press — free online; the canonical bridge from Shannon to machine learning.
- Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR — adversarial training as the robust min-max of EQ G3.8.
- Mahalanobis, P. C. (1936, repr. 2018). On the generalised distance in statistics. Sankhyā A, 80(S1), 1–7. The original covariance-corrected distance (EQ M11.3), reprinted with commentary.
- Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2020). The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting 36(1) — the modern evidence that exponential smoothing remains a top baseline (§3.4).
- Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2022). The M5 competition: Background, organization, and implementation. International Journal of Forecasting 38(4) — why gradient-boosted trees, not transformers, won (§6.4 caveat).
- Mandelbrot, B. (1963). The Variation of Certain Speculative Prices. Journal of Business 36(4), 394–419. The founding argument that financial returns are heavy-tailed and possibly infinite-variance — the contested claim of §2.5.
- Maron, M. E. (1961). Automatic Indexing: An Experimental Inquiry. Journal of the ACM 8(3) — arguably the first application of naive-Bayes-style probabilistic classification to text.
- Maynard Smith, J. (1982). Evolution and the Theory of Games. Cambridge University Press — the book-length development of ESS and the replicator perspective (EQ G2.6).
- Maynard Smith, J. & Price, G. R. (1973). The Logic of Animal Conflict. Nature 246 — introduces the Evolutionarily Stable Strategy (EQ G2.5) and the Hawk–Dove game (§2.4).
- McCallum, A. & Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification. AAAI-98 Workshop — the canonical multinomial vs. Bernoulli comparison (EQ M9.5 and §9.4).
- McCloskey, M. & Cohen, N. J. (1989). Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation — the original diagnosis of forgetting.
- Merton, R. C. (1973). Theory of Rational Option Pricing. Bell Journal of Economics and Management Science 4(1) — the no-arbitrage bounds and the proof that an American call on a non-dividend stock is never exercised early (§2.5).
- Merton, R. C. (1973). Theory of Rational Option Pricing. Bell Journal of Economics and Management Science 4(1) — the rigorous continuous-time treatment; later extended with jump-diffusion.
- Meta (2024). Llama 3 Community License Agreement. Primary source — the commercial terms, acceptable-use policy, and 700M-MAU scale clause discussed in §1.4.
- Meta FAIR Diplomacy Team et al. (2022). Human-level play in the game of Diplomacy by combining language models with strategic reasoning (CICERO). Science 378 — mixed-motive multi-agent play with negotiation.
- Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 21(6) — the original Metropolis acceptance rule (symmetric-proposal special case of EQ S7.8).
- Micci-Barreca, D. (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. ACM SIGKDD Explorations 3(1) — the smoothed target/mean encoding of EQ D3.2.
- Micikevicius, P. et al. (2017). Mixed Precision Training. ICLR 2018 — fp16 training, the fp32 master copy, and loss scaling (EQ N7.7).
- Microsoft. ONNX Runtime. onnxruntime.ai — the cross-platform inference engine with pluggable execution providers (CUDA, TensorRT, OpenVINO, CoreML, WebGPU).
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature 518 — deep Q-networks; the modern demonstration that ε-greedy exploration (EQ R1.7) scales to high-dimensional state spaces.
- Mnih, V. et al. (2016). Asynchronous methods for deep reinforcement learning. ICML 2016 — A3C: parallel actor-learners, the n-step advantage and entropy bonus of EQ R4.8 (and the synchronous A2C that followed).
- Molnar, C. (2022). Interpretable Machine Learning (2nd ed.). Open textbook — the standard practical reference covering every method in this chapter.
- Nash, J. F. (1950). Equilibrium Points in n-Person Games. PNAS 36(1), 48–49 — the existence theorem for the equilibrium concept of EQ G1.4 in general finite games.
- Nash, J. F. (1951). Non-Cooperative Games. Annals of Mathematics 54(2), 286–295 — the full development of non-cooperative equilibrium, dominance, and the proof via Brouwer's fixed-point theorem.
- National Institute of Standards and Technology (2023). AI Risk Management Framework (AI RMF 1.0). NIST AI 100-1 — the Govern/Map/Measure/Manage scaffolding generalizing §7.5.
- Nelson, D. B. (1991). Conditional Heteroskedasticity in Asset Returns: A New Approach. Econometrica 59(2) — the EGARCH model (EQ T4.7), capturing the leverage effect in log-variance.
- Neyman, J. & Pearson, E. S. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. Phil. Trans. R. Soc. A 231 — Type I/II error, power, and the framework of EQ S4.8.
- Ng, A. Y. & Jordan, M. I. (2002). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. NeurIPS 14 — the generative/discriminative trade-off and small-data advantage of §9.1.
- Niculescu-Mizil, A. & Caruana, R. (2005). Predicting Good Probabilities With Supervised Learning. ICML 2005 — calibration behavior across model families and the Platt / isotonic fixes (§4.4).
- Norris, J. R. (1997). Markov Chains. Cambridge University Press — the standard rigorous treatment of transition matrices, stationarity, ergodicity, and reversibility.
- Northcutt, C. G., Athalye, A. & Mueller, J. (2021). Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. NeurIPS Datasets & Benchmarks — measured label-error rates in ImageNet and nine other canonical test sets (§1.1).
- NVIDIA. Triton Inference Server. developer.nvidia.com — multi-framework serving with concurrent execution and dynamic batching (§3.4).
- Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. (2019). Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations. Science 366(6464) — a real-world label / proxy that leaked the wrong target into a deployed model (§1.1, §1.4).
- Øksendal, B. (2003). Stochastic Differential Equations: An Introduction with Applications (6th ed.). Springer — the standard graduate text on Itô calculus and SDEs.
- Open Science Collaboration (2015). Estimating the Reproducibility of Psychological Science. Science 349(6251) — the large-scale replication study that crystallized the crisis.
- Open Source Initiative (2024). The Open Source AI Definition 1.0. Official text — the bar separating open-source AI from merely open-weight releases.
- Open X-Embodiment Collaboration (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. ICRA 2024 — pooling ~1M trajectories across 22 embodiments; the cross-embodiment data effort (§6.5).
- Osborne, M. J. & Rubinstein, A. (1994). A Course in Game Theory. MIT Press — standard graduate reference for dominance, IESDS, Nash equilibrium, and mixed strategies as presented in §§1.2–1.5.
- Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS — the three-stage SFT → reward model → PPO RLHF recipe behind ChatGPT; source of the KL-regularized objective (EQ R6.5).
- OWASP Foundation (2025). OWASP Top 10 for LLM Applications. The canonical defender's checklist — prompt injection (LLM01) and the system-level controls of §5.4–5.5.
- Page, L., Brin, S., Motwani, R. & Winograd, T. (1999). The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab — the damped random-walk Markov chain whose stationary distribution is PageRank (§7.2).
- Pascanu, R., Mikolov, T. & Bengio, Y. (2013). On the Difficulty of Training Recurrent Neural Networks. ICML 2013 — the spectral-norm view of the gradient product and the gradient-clipping remedy for explosion (§3.2).
- Paszke, A., Gross, S., Massa, F. et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 32 — the system paper for define-by-run autograd.
- Pearl, J. (1995). Causal Diagrams for Empirical Research. Biometrika 82(4) — the foundational presentation of the backdoor criterion.
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press — DAGs, the do-operator and the backdoor criterion (EQ S3.7).
- Pearson, K. (1895). Note on Regression and Inheritance in the Case of Two Parents. Proceedings of the Royal Society of London 58 — the product-moment correlation coefficient (EQ S3.3).
- Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572. The origin of principal-component analysis as projection onto a best-fit subspace (§6.2, §6.5).
- Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12 — the reference implementations of StandardScaler, MinMaxScaler, RobustScaler and PowerTransformer.
- Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023 — DiT, the transformer backbone that replaced the U-Net and scaled to SD3 and Sora.
- Perez, E., Huang, S., Song, F. et al. (2022). Red Teaming Language Models with Language Models. EMNLP 2022 — using one LM to automatically generate test cases that surface harms in another, at scale.
- Powers, D. M. W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. J. Machine Learning Technologies 2(1) — the definitive survey of confusion-matrix metrics, their biases, and what each one really measures (§3.3–3.4).
- Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. (2018). CatBoost: Unbiased Boosting with Categorical Features. NeurIPS 2018 — ordered boosting and ordered target statistics (EQ M15.10).
- Puterman, M. L. & Shin, M. C. (1978). Modified Policy Iteration Algorithms for Discounted Markov Decision Problems. Management Science 24(11) — the m-sweep interpolation between value and policy iteration cited in §2.4.
- PyTorch Team. torch.export & ExecuTorch. docs.pytorch.org — ahead-of-time graph capture (ExportedProgram) and the edge runtime succeeding TorchScript.
- Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A. & Lawrence, N. (eds.) (2009). Dataset Shift in Machine Learning. MIT Press — the reference volume formalizing covariate, prior, and concept shift (EQ V5.1).
- Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A. & Lawrence, N. D. (2009). Dataset Shift in Machine Learning. MIT Press — covariate shift, label shift, and concept drift formalized (§1.4, EQ D1.4).
- Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE 77(2) — the canonical reference for the forward, Viterbi, and Baum–Welch algorithms (§7.3).
- Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021 — CLIP; contrastive image–text pre-training and zero-shot transfer (EQ MM1.4).
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C. & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. OpenAI — Whisper; an encoder–decoder transformer trained on 680k hours, the model behind §4.2.
- Radford, A., Metz, L. & Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR 2016 — DCGAN: the stable convolutional architecture and latent-space vector arithmetic.
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D. & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS — DPO; the closed-form optimal policy (EQ R6.6) and the supervised preference loss (EQ R6.7) that skip the reward model and RL loop.
- Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. DALL·E 2 — the CLIP-latent prior plus diffusion decoder ("unCLIP") of §3.3.
- Ren, Y. et al. (2021). FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. ICLR 2021 — non-autoregressive TTS with explicit duration prediction; the alignment fix for Tacotron.
- Ribeiro, M. T., Singh, S. & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. KDD 2016 — LIME, local surrogate explanations (§6.4, EQ V6.3).
- Roberts, G. O., Gelman, A. & Gilks, W. R. (1997). Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms. Ann. Appl. Probab. 7(1) — the ~0.234 optimal acceptance-rate result (§7.4).
- Rockafellar, R. T. & Uryasev, S. (2000). Optimization of Conditional Value-at-Risk. Journal of Risk 2(3) — CVaR/ES as a convex, optimizable risk measure.
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022 — the VAE-compressed latent space that makes Stable Diffusion affordable (§5.5).
- Ross, S., Gordon, G. & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS 2011 — DAgger and the formal account of covariate shift in behavior cloning (§6.4, EQ MM6.5).
- Rousseeuw, P. J. (1987). Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics 20 — the silhouette score for choosing and validating k (§12.5, EQ M12.6).
- Rubin, D. B. (1976). Inference and Missing Data. Biometrika 63(3):581–592 — the paper that defined MCAR, MAR, and MNAR.
- Rubinstein, M. (1994). Implied Binomial Trees. Journal of Finance 49(3) — an early reconstruction of the post-1987 volatility smile from market prices.
- Rudin, C. (2019). Stop Explaining Black Box Machine Learning Models for High-Stakes Decisions and Use Interpretable Models Instead. Nature Machine Intelligence 1 — the case for inherently interpretable models (§6.1 caveat).
- Russakovsky, O. et al. (2015). ImageNet Large Scale Visual Recognition Challenge. IJCV 115(3) — the ImageNet/ILSVRC benchmark that drove the whole progression.
- Saharia, C. et al. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Imagen — a large frozen T5 text encoder plus a pixel-space super-resolution cascade.
- Saito, T. & Rehmsmeier, M. (2015). The Precision–Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE 10(3) — the empirical case for PR over ROC under class imbalance (§4.2).
- Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A. & Chen, X. (2016). Improved Techniques for Training GANs. NeurIPS 2016 — minibatch discrimination and feature matching, the classic anti-collapse fixes behind Instrument N6.2.
- Salinas, D., Flunkert, V., Gasthaus, J. & Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36(3) — the global probabilistic RNN of §6.4.
- Santurkar, S., Tsipras, D., Ilyas, A. & Mądry, A. (2018). How Does Batch Normalization Help Optimization? NeurIPS — the loss-smoothing reinterpretation of BatchNorm cited in §1.3.
- Schölkopf, B. & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press — the definitive textbook treatment of kernels, Mercer's theorem, and the optimization behind EQ M10.3, M10.8.
- Schrittwieser, J. et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero). Nature 588 — decision-time planning with a learned latent model and MCTS, without being given the rules (EQ MM5.5).
- Schulman, J., Levine, S., Abbeel, P., Jordan, M. & Moritz, P. (2015). Trust Region Policy Optimization. arXiv:1502.05477 — the KL-constrained trust region that PPO approximates with a first-order clip.
- Schulman, J., Moritz, P., Levine, S., Jordan, M. & Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. ICLR 2016 — GAE, the λ-blended advantage estimator that sets the bias–variance dial between TD and Monte-Carlo (§4.4).
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv — PPO's clipped surrogate objective, the step-size fix that made policy gradients robust and the workhorse of RLHF (the §4.5 sequel).
- Schwarz, G. (1978). Estimating the Dimension of a Model. Annals of Statistics 6(2) — the Bayesian Information Criterion for model (and component-count) selection (§12.5, EQ M12.7).
- scikit-learn developers. Imputation of missing values (User Guide). Official docs — SimpleImputer, KNNImputer, and IterativeImputer (MICE).
- Sculley, D. et al. (2015). Hidden Technical Debt in Machine Learning Systems. NeurIPS 2015 — the "ML code is a small box" argument behind §7.1.
- Shafer, G. & Vovk, V. (2008). A tutorial on conformal prediction. JMLR 9 — distribution-free prediction intervals with finite-sample coverage (§6.3, EQ T6.6).
- Shalev-Shwartz, S., Singer, Y., Srebro, N. & Cotter, A. (2011). Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming 127(1), 3–30. The hinge-loss sub-gradient method used in this chapter's first Python cell — how to train a linear SVM at scale.
- Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal 27 — the founding paper: entropy, source coding, and channel capacity (EQ S8.2, S8.6).
- Shao, Z. et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv — introduces GRPO; the group-relative advantage (EQ R6.8) that removes PPO's value network.
- Shapley, L. S. (1953). A Value for n-Person Games. In Contributions to the Theory of Games II, Princeton University Press — the original Shapley value from cooperative game theory (EQ G2.7) and its axiomatic characterization (§2.5).
- Shen, J. et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. Google — Tacotron 2; the two-stage acoustic-model + vocoder template of §4.3 (EQ MM4.6).
- Sheng, Y., Cao, S., Li, D. et al. (2023). S-LoRA: Serving Thousands of Concurrent LoRA Adapters. MLSys 2024 — multi-tenant serving of many adapters over one shared base model (§3.5).
- Shreve, S. E. (2004). Stochastic Calculus for Finance I: The Binomial Asset Pricing Model. Springer — a rigorous, self-contained development of replication, the risk-neutral measure, and market completeness (§2.1–§2.2).
- Silver, D. et al. (2017). Mastering the game of Go without human knowledge. Nature 550 — AlphaGo Zero / AlphaZero self-play (EQ G3.4).
- Simonyan, K. & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015 — VGG; depth via stacks of \(3\times 3\) convolutions.
- Simpson, E. H. (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society B 13(2) — the paradox's namesake paper.
- Sims, C. A. (1980). Macroeconomics and Reality. Econometrica 48(1) — introduced the VAR as an atheoretical alternative to large structural macro models (EQ T5.1).
- Singh, S., Jaakkola, T., Littman, M. L. & Szepesvári, C. (2000). Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. Machine Learning 38 — convergence of SARSA (EQ R3.5) and the GLIE exploration conditions of §3.5.
- Snoek, J., Larochelle, H. & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. NeurIPS 2012 — Gaussian-process surrogates and acquisition functions for tuning (§2.4).
- Spearman, C. (1904). The Proof and Measurement of Association between Two Things. American Journal of Psychology 15(1) — rank correlation, EQ S3.4.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15 — dropout regularization (EQ N1.9) and inverted-dropout scaling (EQ N7.6).
- Stewart, G. W. (1993). On the early history of the singular value decomposition. SIAM Review, 35(4), 551–566. Historical context tracing the SVD from Beltrami and Jordan to its modern role.
- Stiennon, N. et al. (2020). Learning to Summarize from Human Feedback. NeurIPS — the reward-hacking dynamics of over-optimizing a learned reward model (Instrument R6.3, §6.5).
- Stone, M. (1974). Cross-Validatory Choice and Assessment of Statistical Predictions. J. R. Stat. Soc. B 36(2) — the foundational formalization of cross-validation for model assessment.
- Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. The standard intuition-first text for the column-space, rank, eigenvalue, and SVD material here.
- Student [Gosset, W. S.] (1908). The Probable Error of a Mean. Biometrika 6(1), 1–25. The original derivation of the Student-t distribution (EQ S2.7), written at the Guinness brewery.
- Sutskever, I., Vinyals, O. & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. NeurIPS 2014 — the encoder-decoder LSTM framework (EQ N4.1) and the source-reversal trick.
- Sutton, R. S. (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning 3 — the original TD(0) and TD(λ) prediction methods (EQ R3.2, EQ R3.3) and the bootstrapping idea at the heart of this chapter.
- Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press — the canonical text; the agent–environment loop, MDPs, returns, value functions, and exploration as framed in this chapter.
- Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NeurIPS 1999 — the policy gradient theorem (EQ R4.3) and its compatibility with a learned value function, the formal basis of actor-critic.
- Szegedy, C. et al. (2015). Going Deeper with Convolutions. CVPR 2015 — GoogLeNet / Inception; multi-scale blocks and \(1\times 1\) bottlenecks.
- Taylor, S. J. & Letham, B. (2018). Forecasting at Scale (Prophet). The American Statistician 72(1) — the decomposable trend + seasonality + holidays model of §6.4 (EQ T6.7).
- TensorFlow Team. tf.data: Build TensorFlow input pipelines. Official guide — map / shuffle / batch / prefetch / cache and the overlap of §2.3 (EQ F2.3).
- The PyTorch Team. Automatic Differentiation with torch.autograd. Tutorial — the dynamic graph, .grad accumulation, and zero_grad.
- The PyTorch Team. PyTorch Documentation (stable). Official reference for tensors, autograd, nn, and optim.
- The vLLM Team. vLLM Documentation. Official guide to deployment, quantization, and the OpenAI-compatible server.
- Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B 58(1) — the L1 penalty that performs embedded selection (EQ D4.6).
- Tishby, N., Pereira, F. C. & Bialek, W. (2000). The Information Bottleneck Method. arXiv — mutual information as a principle for representation learning (§8.3).
- Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W. & Abbeel, P. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS 2017 — the foundational sim-to-real randomization technique (§6.3, EQ MM6.4).
- Toda, H. Y. & Yamamoto, T. (1995). Statistical Inference in Vector Autoregressions with Possibly Integrated Processes. Journal of Econometrics 66(1–2) — the lag-augmented Granger test valid under unit roots and cointegration (§5.5 caveat).
- Touvron, H., Martin, L., Stone, K. et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 — the release (and custom community license) that catalyzed the open-weight ecosystem.
- Troyanskaya, O. et al. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525 — the kNN-impute (KNNimpute) paper.
- Tseng, A., Chee, J., Sun, Q., Kuleshov, V. & De Sa, C. (2024). QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. ICML 2024 — incoherence processing + E8 lattice codebooks toward ~2 bits.
- Tversky, A. & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review 90(4) — the Linda problem (§1.5).
- Uhlenbeck, G. E. & Ornstein, L. S. (1930). On the Theory of the Brownian Motion. Phys. Rev. 36 — the mean-reverting process of §1.5.
- Valevski, D. et al. (2024). Diffusion Models Are Real-Time Game Engines (GameNGen). arXiv — a neural network simulating DOOM interactively; a video predictor used as a playable environment.
- van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). CRC Press — the practical, freely-readable reference for MICE / chained equations.
- van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. DeepMind — dilated causal convolutions for autoregressive audio (EQ MM4.7); the vocoder breakthrough.
- van den Oord, A., Vinyals, O. & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning (VQ-VAE). NeurIPS 2017 — discrete codebook latents that let autoregressive models generate over autoencoder tokens (§5.5).
- van Hasselt, H. (2010). Double Q-learning. NeurIPS 23 — diagnoses and corrects the maximization bias of the \(\max\) operator in EQ R3.4.
- van Hasselt, H., Guez, A. & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI 2016 (arXiv:1509.06461) — double-DQN, decoupling action selection from evaluation to curb over-estimation.
- Varma, S. & Simon, R. (2006). Bias in Error Estimation When Using Cross-Validation for Model Selection. BMC Bioinformatics 7:91 — the selection-bias result behind nested CV (EQ V1.6).
- Vasicek, O. (1977). An Equilibrium Characterization of the Term Structure. Journal of Financial Economics 5(2) — the Ornstein–Uhlenbeck short rate and its closed-form bond (EQ Q4.4–Q4.6).
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017 — drops recurrence for pure self-attention; the destination of EQ N4.5.
- Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A. (2008). Extracting and Composing Robust Features with Denoising Autoencoders. ICML 2008 — the denoising autoencoder (EQ N5.2) and the manifold-projection view that seeds diffusion.
- von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele. Mathematische Annalen 100 — the original minimax theorem for two-player zero-sum games (§1.4).
- von Neumann, J. & Morgenstern, O. (1944). Theory of Games and Economic Behavior. Princeton University Press — the founding text; normal-form games (EQ G1.1), the minimax theorem (EQ G1.6), and expected-utility theory.
- Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association 58(301) — Ward's minimum-variance linkage for agglomerative clustering (§12.2, EQ M12.2).
- Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer — the standard modern reference for the distributions, moments, and convergence results in this chapter.
- Wasserstein, R. L. & Lazar, N. A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician 70(2) — the profession's own caution on what a p-value is not.
- Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning. Machine Learning 8 — the action-value function \(Q^\pi\) (EQ R1.6) and the convergence result that grounds model-free control.
- Webb, G. I., Hyde, R., Cao, H., Nguyen, H. L. & Petitjean, F. (2016). Characterizing Concept Drift. Data Mining and Knowledge Discovery 30 — a quantitative framework for describing how concepts drift over time.
- Wei, A., Haghtalab, N. & Steinhardt, J. (2023). Jailbroken: How Does LLM Safety Training Fail? NeurIPS 2023 — the competing-objectives and mismatched-generalization framing used throughout §5.1–5.2.
- Welch, B. L. (1947). The Generalization of Student's Problem When Several Different Population Variances Are Involved. Biometrika 34 — the unequal-variance two-sample test of EQ S4.10.
- White, I. R., Royston, P. & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine 30(4):377–399 — practical guidance on running and pooling MICE.
- Wiener, N. (1923). Differential Space. J. Math. Phys. 2 — the rigorous existence proof of the process that bears his name.
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 — the original REINFORCE estimator (EQ R4.4) and the log-derivative / score-function trick behind every policy gradient.
- Winters, P. R. (1960). Forecasting sales by exponentially weighted moving averages. Management Science 6(3) — adds the seasonal component, completing Holt-Winters (EQ T3.6/T3.7).
- Wolpert, D. H. (1992). Stacked Generalization. Neural Networks 5(2) — learning the combiner with out-of-fold base predictions (EQ M14.6).
- Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 — soft/hard attention beyond translation; shows the mechanism generalizes (§4.1).
- Yang, A., Yang, B., Hui, B. et al. (2024). Qwen2 Technical Report. arXiv:2407.10671 — the Qwen family's wide size ladder and (mostly) Apache-2.0 licensing.
- Yeo, I.-K. & Johnson, R. A. (2000). A New Family of Power Transformations to Improve Normality or Symmetry. Biometrika 87(4) — the Yeo-Johnson extension to real-valued data, EQ D3.8.
- Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. (2014). How Transferable Are Features in Deep Neural Networks? NeurIPS 27 — the empirical basis for transfer learning; transferability falls with depth.
- Yue, X. et al. (2024). MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark. CVPR 2024 — college-level multimodal reasoning, a current frontier evaluation.
- Yule, G. U. (1927). On a Method of Investigating Periodicities in Disturbed Series. Phil. Trans. R. Soc. A 226 — the paper that introduced the autoregressive model (§1.3).
- Zhao, T. Z., Kumar, V., Levine, S. & Finn, C. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. RSS 2023 — ACT (action chunking with transformers) and the ALOHA teleoperation platform (§6.4).
- Zhou, C., Liu, P., Xu, P. et al. (2023). LIMA: Less Is More for Alignment. NeurIPS 2023 — 1,000 curated examples beat far larger noisy sets; the evidence behind "quality over quantity."
- Zou, A., Wang, Z., Carlini, N. et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. The GCG attack (EQ OM5.2): gradient-based adversarial suffixes that transfer across models.
- Zou, H. & Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society B 67(2) — the L1+L2 fix for Lasso's instability under collinearity (EQ D4.6 note).