Training Deep Networks in Practice

7.1

Optimizers — SGD, momentum, Adam, AdamW

Every optimizer answers one question: given the gradient $g_t = \nabla_\theta \mathcal{L}$ at the current parameters, how far and in what direction do we step? The answers form a short, important lineage. Stochastic gradient descent is the bare minimum — step downhill by a fixed multiple of the gradient on a mini-batch:

EQ N7.1 — SGD UPDATE $$ \theta_{t+1} \;=\; \theta_t - \eta\, g_t, \qquad g_t = \nabla_\theta\, \mathcal{L}\!\left(\theta_t;\, \mathcal{B}_t\right) $$

$\eta$ is the learning rate; $\mathcal{B}_t$ a random mini-batch. The mini-batch makes $g_t$ a noisy estimate of the true gradient — the "stochastic" in SGD — and that noise is not purely a nuisance: it helps the optimizer escape sharp, brittle minima. SGD's flaw is that one scalar $\eta$ must serve every parameter and every direction of curvature, so it crawls along flat directions and oscillates across steep ones.

The first fix is momentum: accumulate an exponentially decaying running sum of past gradients (a velocity $v_t$) and step along that instead. Consistent directions reinforce; oscillating ones cancel.

EQ N7.2 — SGD WITH MOMENTUM $$ v_{t} = \mu\, v_{t-1} + g_t, \qquad \theta_{t+1} = \theta_t - \eta\, v_{t}, \qquad 0 \le \mu < 1 $$

$\mu$ (typically $0.9$) is the momentum coefficient. For a steady gradient $g$, the velocity converges to the sum of a geometric series, $v_\infty = g\,(1 + \mu + \mu^2 + \cdots) = g/(1-\mu)$, so the effective step grows by $1/(1-\mu)$. At $\mu = 0.9$ that is a 10× amplification along persistent directions, which is both the source of momentum's speed and the reason it can overshoot. Nesterov's variant evaluates the gradient at the look-ahead point $\theta_t - \eta\mu v_{t-1}$ for a slightly better-anticipated correction.

SGD with momentum $\mu = 0.9$ is fed the same gradient $g$ every step. At steady state, by what factor is the effective step size larger than plain SGD's $\eta g$? (Use $v_\infty = g/(1-\mu)$.)

The steady-state velocity is $v_\infty = \dfrac{g}{1-\mu} = \dfrac{g}{1-0.9} = \dfrac{g}{0.1} = 10\,g$. The effective step is $\eta v_\infty = 10\,\eta g$, so the amplification factor is 10. This is exactly why a momentum run often needs a smaller $\eta$ than a plain-SGD run at the same stability.
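The geometric-series limit can be checked numerically. This is a minimal sketch, assuming a constant scalar gradient; the helper name `steady_state_velocity` is illustrative and not part of the text.

```python
# Iterate the velocity update of EQ N7.2 with a constant gradient and compare
# the result against the closed form v_inf = g / (1 - mu).
def steady_state_velocity(g, mu, steps=500):
    v = 0.0
    for _ in range(steps):
        v = mu * v + g          # v_t = mu * v_{t-1} + g_t, with g_t = g
    return v

g, mu = 1.0, 0.9
v_numeric = steady_state_velocity(g, mu)
v_closed = g / (1.0 - mu)
amplification = v_numeric / g

print(f"numeric v = {v_numeric:.6f}, closed form = {v_closed:.6f}")
print(f"amplification over plain SGD: {amplification:.2f}x")
```

Changing `mu` to 0.99 in the same loop gives a 100× amplification, which is one way to see why high momentum values usually call for a smaller $\eta$.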

The second fix is adaptivity: give each parameter its own effective learning rate, scaled down where gradients have been large. Adam combines this with momentum. It maintains a first moment $m_t$ (the momentum-like mean of gradients) and a second moment $v_t$ (a mean of squared gradients), bias-corrects both, and divides the step by the root of the second moment:

EQ N7.3 — ADAM $$ m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 $$ $$ \hat m_t = \frac{m_t}{1-\beta_1^{\,t}}, \quad \hat v_t = \frac{v_t}{1-\beta_2^{\,t}}, \qquad \theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} $$

Defaults: $\beta_1 = 0.9,\ \beta_2 = 0.999,\ \epsilon = 10^{-8}$. The bias correction matters most at the start: with $m_0 = v_0 = 0$, the raw $m_t$ is biased toward zero, and dividing by $1 - \beta_1^{\,t}$ undoes it. The $\hat m_t / \sqrt{\hat v_t}$ ratio makes each coordinate's step roughly scale-invariant — large, noisy gradients are damped, tiny consistent ones are amplified — which is why Adam "just works" across wildly different layers and is the default for transformers and most modern deep nets.

Adam with $\beta_1 = 0.9$, starting from $m_0 = 0$, takes one step with gradient $g = 1$. What is the bias-corrected first moment $\hat m_1$? (Compute $m_1 = (1-\beta_1)g$, then $\hat m_1 = m_1 / (1 - \beta_1^{\,1})$.)

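$m_1 = (1 - 0.9)\times 1 = 0.1$. Bias correction: $\hat m_1 = \dfrac{m_1}{1 - 0.9^{1}} = \dfrac{0.1}{0.1} = $ 1. The correction exactly cancels the cold-start shrinkage, so the very first gradient estimate equals $g$; without it, $m_t$ would be shrunk by the factor $1 - \beta_1^{\,t}$ (0.1 at step 1, about 0.65 at step 10). The second moment needs the same fix even more: at $\beta_2 = 0.999$ the raw $v_t$ starts far closer to zero than $m_t$, so dropping both corrections makes Adam's earliest steps too large, not too small.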
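The same arithmetic can be extended over the first few steps. The sketch below assumes a constant gradient $g = 1$ and omits $\epsilon$ for clarity; it prints the raw and bias-corrected first moment, and for comparison the step ratio $m_t/\sqrt{v_t}$ with neither correction applied.

```python
import math

beta1, beta2, g = 0.9, 0.999, 1.0
m = v = 0.0
rows = []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g            # raw first moment (EQ N7.3)
    v = beta2 * v + (1 - beta2) * g * g        # raw second moment
    m_hat = m / (1 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)               # bias-corrected second moment
    raw_ratio = m / math.sqrt(v)               # step ratio with no correction at all
    corrected_ratio = m_hat / math.sqrt(v_hat)
    rows.append((t, m, m_hat, raw_ratio, corrected_ratio))
    print(f"t={t}  m={m:.4f}  m_hat={m_hat:.4f}  "
          f"raw m/sqrt(v)={raw_ratio:.3f}  corrected={corrected_ratio:.3f}")
```

With a constant gradient the corrected ratio stays at 1 on every step, while the uncorrected ratio starts above 3 because the raw $v_t$ is shrunk much more strongly than the raw $m_t$.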

AdamW is the variant you should actually reach for. The issue it fixes is subtle: classical weight decay was implemented as an L2 penalty added to the loss, so its gradient $\lambda\theta$ flows through Adam's adaptive denominator and gets rescaled per-parameter — coupling the regularization strength to each coordinate's gradient history. Loshchilov & Hutter showed that decoupling the decay — applying it directly to the weights, outside the adaptive step — restores the intended behavior and consistently generalizes better:

EQ N7.4 — ADAMW: DECOUPLED WEIGHT DECAY $$ \theta_{t+1} = \theta_t - \eta\left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} \;+\; \lambda\, \theta_t \right) $$

The decay term $\lambda\theta_t$ is added after the adaptive rescaling, so it shrinks every weight by the same relative amount $\eta\lambda$ each step — true weight decay, not an adaptive-gradient L2 term. This decoupling is now the default in essentially every transformer training recipe (typical $\lambda \approx 0.01\!-\!0.1$). Bias and normalization-scale parameters are conventionally excluded from decay.
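To make the coupling concrete, the sketch below compares how much decay two equal-valued weights receive per step under each scheme. It rests on simplifying assumptions: the weight changes slowly over the averaging window (so the L2 term contributes roughly $\lambda\theta$ to $\hat m_t$, which is linear in the gradient), and the second-moment estimate is dominated by the loss gradient. The values in `sqrt_v_hat` are made up for illustration.

```python
import numpy as np

eta, lam, eps = 1e-3, 0.1, 1e-8
theta = np.array([1.0, 1.0])            # two weights with the same value
sqrt_v_hat = np.array([10.0, 0.01])     # assumed gradient RMS history per weight

# Coupled L2: lam * theta enters the gradient, so its share of the step is
# divided by the adaptive denominator along with everything else.
shrink_l2 = eta * lam * theta / (sqrt_v_hat + eps)

# Decoupled decay (EQ N7.4): the same term is applied outside the rescaling.
shrink_adamw = eta * lam * theta

print("L2-through-Adam shrink per step:", shrink_l2)
print("AdamW shrink per step          :", shrink_adamw)
```

Under these assumptions the coupled form decays the quiet-gradient weight about 1000 times harder than the loud-gradient one, whereas the decoupled form shrinks both by the same $\eta\lambda\theta$.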

Optimizer	State per parameter	Strength	Weakness
SGD	none	Cheapest; flat minima; strong final accuracy on vision with a good schedule	Slow on ill-conditioned loss surfaces; very LR-sensitive
SGD + momentum	1 (velocity)	Accelerates persistent directions, damps oscillation; the CNN workhorse	Can overshoot; still one global $\eta$
Adam	2 ($m$, $v$)	Per-parameter adaptive; robust across layer types; fast early progress	2× optimizer memory; L2 decay misbehaves
AdamW	2 ($m$, $v$)	Adam with correct weight decay; default for transformers	Same memory cost; still needs a schedule

Adam's two extra moments cost real memory: at fp32 they add 8 bytes per parameter on top of the 4-byte weight and 4-byte gradient — the "16 bytes/param" rule that sizes training clusters (and the reason 8-bit optimizers and ZeRO sharding exist). The contested point worth flagging: on some vision benchmarks well-tuned SGD+momentum still generalizes slightly better than Adam, so "Adam always wins" is folklore, not law — it wins on convenience and on transformers, where SGD struggles.
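The 16 bytes/param figure is simple bookkeeping. A minimal sketch, assuming everything is stored in fp32 and ignoring activations and temporary buffers; the function name and the 7-billion-parameter model size are hypothetical.

```python
def adam_training_bytes(n_params, bytes_per_value=4):
    """Bytes for weights, gradients, and Adam's two moments, all at one precision."""
    weights = n_params * bytes_per_value
    grads = n_params * bytes_per_value
    moments = 2 * n_params * bytes_per_value   # m and v
    return weights + grads + moments

n = 7_000_000_000                               # hypothetical model size
total = adam_training_bytes(n)
print(f"{total // n} bytes/param, {total / 1e9:.0f} GB before activations")
```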

PYTHON · RUNNABLE IN-BROWSER

# SGD vs momentum vs Adam on an ill-conditioned 2D quadratic
# Loss = 0.5*(a*x^2 + b*y^2); steep in x (a=20), flat in y (b=1).
import numpy as np

a, b = 20.0, 1.0
def grad(p): return np.array([a*p[0], b*p[1]])    # gradient of the quadratic
def loss(p): return 0.5*(a*p[0]**2 + b*p[1]**2)

def run(kind, lr, steps=300):
    p = np.array([1.0, 1.0]); m = np.zeros(2); v = np.zeros(2)
    for t in range(1, steps+1):
        g = grad(p)
        if kind == "sgd":
            p = p - lr*g
        elif kind == "mom":
            m = 0.9*m + g;  p = p - lr*m
        else:  # adam
            m = 0.9*m + 0.1*g;  v = 0.999*v + 0.001*g*g
            mh = m/(1-0.9**t); vh = v/(1-0.999**t)
            p = p - lr*mh/(np.sqrt(vh)+1e-8)
    return loss(p)

# Each optimizer gets its own hand-picked stable lr (illustrative values, not tuned optima)
results = {}
for kind, lr in [("sgd", 0.04), ("mom", 0.02), ("adam", 0.20)]:
    results[kind] = run(kind, lr)
    print(f"{kind:5s} (lr={lr:.2f}) final loss after 300 steps: {results[kind]:.2e}")

# Report the winner from the numbers instead of hard-coding the conclusion
best = min(results, key=results.get)
print(f"\nLowest final loss: {best}")
print("Adam rescales x and y independently, so the steep x-direction and the")
print("flat y-direction are stepped at comparable rates; SGD's single global")
print("step size is capped by the steep axis, so it crawls along the flat one.")


INSTRUMENT N7.1 — OPTIMIZER RACE: SGD vs MOMENTUM vs ADAM ON A LOSS SURFACE · EQ N7.1–N7.3

Controls: condition number (steepness ratio; default 20) and learning rate η (default 0.030). Readouts: final loss for SGD, momentum, and Adam.

Elliptical contours of an ill-conditioned quadratic — the canonical hard case. Three trajectories race from the same start: SGD zig-zags across the steep axis, momentum rolls through it faster, and Adam rescales each axis and heads almost straight for the minimum. Crank the condition number up and watch SGD stall while Adam barely notices; push the learning rate too high and momentum overshoots into divergence first.

7.2

Learning-rate schedules — warmup, cosine, cyclical

The single learning rate $\eta$ is the most consequential hyperparameter in deep learning, and the best value is not constant over a run. Two facts shape the schedule: early on, weights are random and gradients are large and chaotic, so a big step can blow up; late on, you want small steps to settle into a minimum. The modern default answers both with a warmup followed by a cosine decay.

EQ N7.5 — WARMUP + COSINE SCHEDULE $$ \eta(t) = \begin{cases} \eta_{\max}\,\dfrac{t}{T_w} & t \le T_w \quad\text{(linear warmup)} \\[1.2em] \eta_{\min} + \tfrac{1}{2}\!\left(\eta_{\max} - \eta_{\min}\right)\!\left(1 + \cos\!\dfrac{\pi\,(t - T_w)}{T - T_w}\right) & t > T_w \quad\text{(cosine decay)} \end{cases} $$

$T_w$ is the warmup length (commonly 1–5% of total steps $T$); $\eta_{\min}$ is often $0$ or a small floor. Warmup ramps the rate linearly from $0$ to $\eta_{\max}$, giving the optimizer's adaptive statistics (and a transformer's fragile early layers) time to stabilize before full-size steps. Cosine decay then eases the rate down a half-cosine: gentle at first, steepest in the middle, flattening to $\eta_{\min}$ at the end. At $t = T_w$, $\cos 0 = 1 \Rightarrow \eta = \eta_{\max}$; at $t = T$, $\cos \pi = -1 \Rightarrow \eta = \eta_{\min}$ — the curve joins the two phases continuously.

Why a cosine rather than a straight line or exponential? Empirically the cosine's slow start (it lingers near $\eta_{\max}$) buys more exploration before annealing, and its slow finish lets the model fine-settle; in practice it tends to match or beat step decay on large language and vision models. The cyclical / warm-restart family (SGDR) takes the idea further, resetting the schedule periodically so the rate jumps back up; each restart can knock the model out of a mediocre basin into a better one, and the snapshots make a cheap ensemble. The contested part: with a good cosine, restarts rarely help large single-run pretraining, so they have fallen out of fashion for frontier models while remaining useful for smaller budgets.
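The warm-restart idea can be sketched in a few lines. This version assumes equal-length cycles and no warmup, which is a simplification (SGDR also allows cycles that grow over time); `sgdr_lr` and `cycle_len` are illustrative names.

```python
import math

def sgdr_lr(t, cycle_len, eta_max, eta_min=0.0):
    """Cosine decay that restarts at eta_max every cycle_len steps."""
    t_cur = t % cycle_len                      # position inside the current cycle
    cosine = math.cos(math.pi * t_cur / cycle_len)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + cosine)

for t in [0, 50, 99, 100, 150]:
    print(f"step {t:3d}: lr = {sgdr_lr(t, cycle_len=100, eta_max=1e-3):.2e}")
```

The rate falls along a half-cosine through step 99 and jumps back to `eta_max` at step 100, which is the restart.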

True or false: after warmup, a cosine schedule decays the learning rate along a cosine curve — from $\eta_{\max}$ down to $\eta_{\min}$ as $t$ goes from $T_w$ to $T$. (Answer true or false.)

By EQ N7.5, the decay phase is $\eta_{\min} + \tfrac12(\eta_{\max}-\eta_{\min})(1+\cos\frac{\pi(t-T_w)}{T-T_w})$ — exactly a half-period of a cosine, starting at $\eta_{\max}$ (where $\cos 0 = 1$) and ending at $\eta_{\min}$ (where $\cos\pi = -1$). The statement is true.

You train for $T = 10{,}000$ steps and set warmup to 3% of the run. How many steps $T_w$ does the linear warmup phase last?

$T_w = 0.03 \times 10{,}000 = $ 300 steps. Over those 300 steps the rate climbs linearly from $0$ to $\eta_{\max}$; the remaining 9,700 steps follow the cosine decay down toward $\eta_{\min}$.

PYTHON · RUNNABLE IN-BROWSER

# Warmup + cosine learning-rate schedule (EQ N7.5): build and inspect it
import numpy as np

T, Tw = 1000, 50          # total steps, warmup steps (5%)
eta_max, eta_min = 1e-3, 0.0

def lr_at(t):
    if t < Tw:                                   # linear warmup
        return eta_max * t / Tw
    prog = (t - Tw) / (T - Tw)                   # 0..1 through the decay
    return eta_min + 0.5*(eta_max - eta_min)*(1 + np.cos(np.pi*prog))

ts  = np.arange(T)
eta = np.array([lr_at(t) for t in ts])

print("step    0:", f"{lr_at(0):.2e}   (warmup starts at 0)")
print("step   50:", f"{lr_at(50):.2e}   (peak = eta_max at end of warmup)")
print("step  525:", f"{lr_at(525):.2e}   (~midpoint of decay, steepest part)")
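print("step  999:", f"{lr_at(999):.2e}   (last step: just above eta_min)")
print(f"\npeak step is {ts[eta.argmax()]} -> rate peaks exactly at warmup's end")
# plot_xy is assumed to be the plotting helper supplied by the interactive page;
# the guard skips the plot when the script runs anywhere else.
if "plot_xy" in globals():
    plot_xy(ts, eta)      # the classic ramp-then-cosine shape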


INSTRUMENT N7.2 — LR-SCHEDULE DESIGNER: WARMUP + COSINE · EQ N7.5

Controls: peak rate η_max (default 1.0e-3), warmup as a percentage of the run (default 5%), and minimum-rate floor as a fraction of the peak (default 0%). Readouts: warmup steps, peak rate, final rate.

The full schedule over a 10,000-step run: a linear warmup ramp into a cosine descent. Drag warmup to 0% and the curve starts at full rate — fine for a fine-tune, often unstable for from-scratch transformer pretraining. Raise the min-rate floor and the decay flattens above zero, which keeps the model learning if you plan to train longer than $T$. The peak always lands exactly at the end of warmup.

7.3

Regularization & early stopping

A network with millions of parameters can memorize its training set outright. Regularization is the set of pressures that push it to generalize instead — to fit the signal, not the noise. The deep-learning toolkit is small and well-understood.

Weight decay (the $\lambda\theta$ term of EQ N7.4). Shrinks weights toward zero each step, favoring simpler, smaller-norm solutions. Use the decoupled form via AdamW; exclude biases and norm scales.
Dropout. During training, zero each activation independently with probability $p$ and rescale the survivors by $1/(1-p)$ (so the expected activation is unchanged). This prevents co-adaptation — no neuron can rely on any specific other — and approximates training an ensemble of subnetworks. At inference, dropout is off. Transformers use light dropout ($p \approx 0.0\!-\!0.1$); large-data pretraining often sets it to zero.
Data augmentation. The cheapest regularizer: expand the effective dataset with label-preserving transforms (crops and flips for vision, token masking for text) or label-mixing ones such as mixup/cutmix. More data usually beats every other trick.
Label smoothing. Replace one-hot targets with $(1-\varepsilon) + \varepsilon/K$ on the true class and $\varepsilon/K$ elsewhere (so the targets still sum to 1), discouraging the model from becoming over-confident and improving calibration.
Early stopping. Track a held-out validation loss; keep the checkpoint at its minimum and stop once it has stopped improving for a patience window. It is regularization by when you quit.
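Of the items above, label smoothing is the easiest to get subtly wrong, so here is a minimal sketch. It uses the common convention of mixing the one-hot target with a uniform distribution over the $K$ classes, which keeps each row summing to 1; the function name `smooth_labels` is illustrative.

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """Mix one-hot targets with a uniform distribution over the classes."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

targets = smooth_labels(np.array([2, 0]), num_classes=4, eps=0.1)
print(targets)
print("row sums:", targets.sum(axis=1))
```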

EQ N7.6 — DROPOUT (TRAIN-TIME, INVERTED) $$ \tilde a_i = \frac{r_i}{1-p}\, a_i, \qquad r_i \sim \mathrm{Bernoulli}(1-p), \qquad \mathbb{E}[\tilde a_i] = a_i $$

Each activation $a_i$ survives with probability $1-p$ and is scaled up by $1/(1-p)$. The expectation $\mathbb{E}[\tilde a_i] = (1-p)\cdot \frac{a_i}{1-p} = a_i$ is preserved, so inference needs no rescaling — you simply disable dropout. The randomness forces redundant, robust representations; the rescaling keeps the forward pass's scale honest between train and test.
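The expectation claim in EQ N7.6 is easy to check empirically. A minimal sketch, assuming all-ones activations and a fixed random seed so the run is repeatable; `inverted_dropout` is an illustrative name.

```python
import numpy as np

def inverted_dropout(a, p, rng):
    """Zero each entry with probability p and scale survivors by 1/(1-p)."""
    keep = rng.random(a.shape) >= p        # r_i ~ Bernoulli(1 - p)
    return a * keep / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(200_000)
dropped = inverted_dropout(a, p=0.2, rng=rng)

kept_fraction = np.mean(dropped > 0)
print(f"kept fraction  : {kept_fraction:.3f}   (target 0.8)")
print(f"survivor value : {dropped.max():.2f}   (1 / 0.8)")
print(f"mean activation: {dropped.mean():.3f}   (target 1.0)")
```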

The signature of overfitting is a validation loss that bottoms out and then rises while the training loss keeps falling: the model is now learning the training set's idiosyncrasies. Underfitting is the opposite: both losses sit high and flat because the model lacks the capacity, the right features, or enough training. Early stopping catches the first; more capacity, better features, or longer training fixes the second.
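Early stopping with a patience window can be sketched as a scan over the validation curve. The validation losses below are made-up numbers with the overfitting shape described above, and `early_stop_index` is an illustrative helper rather than a library function.

```python
def early_stop_index(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch      # new best: keep this checkpoint
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch                 # no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1           # ran out of epochs first

val = [1.00, 0.80, 0.65, 0.60, 0.62, 0.66, 0.71, 0.80]
best, stopped = early_stop_index(val, patience=3)
print(f"keep checkpoint from epoch {best}, stop at epoch {stopped}")
```

A longer patience tolerates noisier validation curves at the cost of extra epochs spent after the minimum.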

A dropout layer scales the surviving activations by $1/(1-p) = 1.25$ at train time (EQ N7.6). What is the keep probability $1-p$?

The scale factor is $1/(1-p) = 1.25$, so the keep probability is $1-p = 1/1.25 = $ 0.8. That means $p = 0.2$: one activation in five is dropped each step, and the rest are boosted by 25% to keep the expected signal unchanged.

INSTRUMENT N7.3 — LOSS-CURVE DIAGNOSER: TRAIN vs VALIDATION · UNDERFIT / OVERFIT / LR-TOO-HIGH

Control: failure mode. Readouts: diagnosis, final train / val loss, fix.

Each button paints the canonical shape of a real training pathology — train loss and validation loss over epochs, with an early-stopping marker where validation bottoms out. OVERFIT: train keeps dropping, val turns back up (the classic divergence). UNDERFIT: both stay high and flat. LR TOO HIGH: loss spikes and oscillates, often blowing up. Learn the silhouettes here and you will diagnose a run from across the room.

7.4

Mixed precision & numerical stability

Modern GPUs run dramatically faster in 16-bit than in 32-bit, and 16-bit tensors halve memory. Mixed-precision training captures both wins while keeping fp32 where precision is non-negotiable. The catch is dynamic range: the older float16 format has only 5 exponent bits, so its largest representable value is $65{,}504$ and small gradients underflow to zero (its smallest positive value is about $6\times10^{-8}$). The fix for the underflow is loss scaling.

EQ N7.7 — LOSS SCALING $$ \mathcal{L}_{\text{scaled}} = S \cdot \mathcal{L} \;\Rightarrow\; g_{\text{scaled}} = S \cdot g, \qquad g \;=\; \frac{1}{S}\, g_{\text{scaled}} \;\text{(unscale before the optimizer step)} $$

Multiply the loss by a large factor $S$ (e.g. $2^{15}$) before backprop. By the chain rule every gradient is multiplied by the same $S$, lifting tiny values out of fp16's underflow region. The gradients are then divided by $S$ before the weight update, so the math is unchanged — only the representable range was borrowed. Dynamic loss scaling automates $S$: raise it while gradients stay finite, and halve it (skipping that step) whenever an inf/NaN appears.
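The raise-and-halve rule for dynamic loss scaling can be sketched as a small state machine. Everything here is illustrative: the class name, the doubling factor, and especially the growth interval of 2, which is kept tiny so the demo is short (real training loops typically wait far longer before raising $S$).

```python
import numpy as np

class DynamicLossScaler:
    """Toy dynamic loss scaler; names and defaults are illustrative."""

    def __init__(self, init_scale=2.0**15, growth_interval=2):
        self.scale = init_scale
        self.growth_interval = growth_interval   # finite steps required before raising S
        self.good_steps = 0

    def unscale(self, scaled_grads):
        """Return g = g_scaled / S, or None when the step must be skipped."""
        if not np.all(np.isfinite(scaled_grads)):
            self.scale /= 2.0                     # inf/NaN seen: halve S, skip the step
            self.good_steps = 0
            return None
        grads = scaled_grads / self.scale         # unscale before the optimizer step
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            self.scale *= 2.0                     # gradients stayed finite: raise S
            self.good_steps = 0
        return grads

scaler = DynamicLossScaler()
true_grad = np.array([1e-3, -2e-3])

first = scaler.unscale(true_grad * scaler.scale)     # finite
second = scaler.unscale(true_grad * scaler.scale)    # finite, so S doubles afterwards
scale_after_growth = scaler.scale
skipped = scaler.unscale(np.array([np.inf, 1.0]))    # overflow: S halves, step skipped

print("unscaled gradient :", first)
print("scale after growth:", scale_after_growth)
print("overflow step     :", skipped, "-> scale now", scaler.scale)
```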

Three practices keep mixed precision numerically safe:

Keep an fp32 master copy of the weights. Updates are tiny relative to the weights; adding a small fp16 step to an fp16 weight rounds to nothing. The optimizer updates the fp32 master, then casts to fp16 for the next forward pass.
Run reductions in fp32. Softmax, layer-norm statistics, and loss accumulation sum many terms; do them in fp32 to avoid catastrophic cancellation, even when the matmuls run in 16-bit.
Prefer bfloat16 when the hardware has it. bf16 keeps fp32's 8 exponent bits (same ~$10^{38}$ range) at the cost of mantissa precision, so it almost never overflows and usually needs no loss scaling — the reason it is the default for large-model training on recent accelerators. fp8 pushes further still and is now used for the heaviest matmuls, with per-tensor scaling.
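The first practice, the fp32 master copy, is the least intuitive of the three, so here is a small demonstration. It assumes a weight of 1.0 and a per-step update of 1e-4 (a made-up but plausible learning-rate-times-gradient product) and applies the same 100 updates in fp16 and in fp32.

```python
import numpy as np

update = 1e-4                 # assumed size of one lr * gradient step
w16 = np.float16(1.0)         # weight stored and updated in fp16
w32 = np.float32(1.0)         # fp32 master copy of the same weight

for _ in range(100):
    w16 = np.float16(w16 - np.float16(update))   # rounds back to 1.0 every time
    w32 = np.float32(w32 - np.float32(update))   # accumulates as expected

print(f"fp16 weight after 100 updates: {float(w16):.6f}")
print(f"fp32 weight after 100 updates: {float(w32):.6f}")
```

Just below 1.0 the spacing between adjacent fp16 values is about 4.9e-4, so a 1e-4 step is less than half a spacing and is rounded away, while the fp32 copy moves to roughly 0.99.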

THE NUMERICS THAT BITE

Most "my loss went to NaN" failures are numeric, not algorithmic. The usual suspects: fp16 gradient overflow (use dynamic loss scaling, which backs the scale off when an inf appears, or switch to bf16); a learning rate high enough to send weights to inf in a few steps; $\log(0)$ or $0/0$ in a hand-written loss (add an $\epsilon$, use the log-sum-exp trick); and un-clipped gradients on a spiky batch. Gradient clipping, which rescales the gradient so that $\lVert g\rVert \le c$ (typically $c = 1.0$), is cheap insurance against the last one and is standard in transformer recipes.
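Global-norm clipping treats all parameter gradients as one long vector and rescales them together. A minimal sketch with two made-up gradient tensors; `clip_by_global_norm` is an illustrative name, and the small constant in the denominator is only there to avoid dividing by zero.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale every gradient by the same factor so the joint L2 norm is <= max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]   # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g * g)) for g in clipped))

print(f"norm before: {norm_before:.2f}, norm after: {norm_after:.2f}")
```

Because every tensor is scaled by the same factor, the direction of the overall gradient is preserved; only its length changes.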

The float16 format's largest representable finite value, the overflow ceiling that limits how large a loss scale can safely be, is which number? (It is $(2 - 2^{-10})\times 2^{15}$.)

$(2 - 2^{-10})\times 2^{15} = (2 - 0.0009765625)\times 32768 = 1.9990234375 \times 32768 = $ 65504. A gradient (or activation) that exceeds this ceiling overflows to inf in fp16. Loss scaling exists to fix underflow at the other end of the range, so this ceiling is what caps the scale $S$, and it is why bf16's fp32-sized exponent is the more robust fix.

PYTHON · RUNNABLE IN-BROWSER

# Why loss scaling exists: fp16 underflow, and how scaling rescues gradients
import numpy as np

FP16_MAX = 65504.0     # largest finite fp16; above this -> inf (overflow)

# A batch of tiny gradients, the kind deep nets produce late in training.
# fp16's smallest positive value is ~6e-8, so anything well below that vanishes.
g = np.array([1e-3, 2e-5, 5e-7, 4e-8, 9e-9])

# Cast to fp16 with NO scaling -> the smallest entries flush to zero (underflow)
g_fp16 = g.astype(np.float16)
lost = int(np.sum((g != 0) & (g_fp16 == 0)))
print("raw gradients          :", g)
print("naive fp16             :", g_fp16.astype(np.float32))
print(f"-> {lost} of {g.size} gradients underflowed to exactly 0\n")

# Loss scaling: multiply by S before fp16, divide back after (EQ N7.7)
S = 2**15
scaled    = g * S
g_scaled  = scaled.astype(np.float16).astype(np.float32) / S
recovered = int(np.sum((g_fp16 == 0) & (g_scaled != 0)))
overflow  = bool(np.any(np.abs(scaled) > FP16_MAX))
print(f"with loss scale S={S}:", g_scaled)
print(f"-> {recovered} previously-lost gradient(s) recovered; overflow? {overflow}")
print("\nScaling lifts tiny gradients above fp16's underflow floor, then")
print("unscales them after backprop -- same math, full dynamic range recovered.")


7.5

A practical recipe & debugging

Theory converges; in practice the failures are mundane and repetitive. Here is a default that survives contact with reality for most supervised deep-learning tasks, followed by the debugging loop that finds the bug when it does not.

# Defaults that work for most from-scratch deep-net training
optimizer:   AdamW · β1=0.9 · β2=0.999 (0.95 for big transformers) · ε=1e-8
weight_decay: 0.1 on weights · 0.0 on biases & norm/scale params
lr:          tune η_max first (it dominates); 3e-4 is a sane transformer start
schedule:    linear warmup 1–5% of steps → cosine decay to ~0
batch:       as large as memory allows; raise lr with batch (lin/sqrt rule)
precision:   bf16 if available (no loss scaling); else fp16 + dynamic scaling
grad_clip:   global-norm clip at 1.0 — cheap insurance against spikes
regularize:  dropout 0.0–0.1 · augmentation · early-stop on val loss
init:        scaled init (He/Xavier or per-arch); verify activations don't explode

When a run misbehaves, work the ladder from cheapest check to most expensive — most bugs are caught in the first three rungs:

Overfit one batch. Before anything else, train on a single mini-batch until the loss hits (near) zero. If it cannot, the bug is in the model, the loss, or the data pipeline — not the hyperparameters. This one test catches a remarkable fraction of failures.
Sanity-check the initial loss. For $K$-class classification with random weights, cross-entropy should start near $\ln K$. If it starts far off, your labels, logits, or loss are wired wrong.
Read the loss curve (Instrument N7.3). NaN/spike → lower LR, clip gradients, check for fp16 overflow. Flat-and-high → underfit: more capacity/LR/steps. Val turns up → overfit: regularize or early-stop.
Do an LR sweep. The learning rate dominates every other knob. Sweep it over a few orders of magnitude (or use an LR-range test) before touching architecture.
Watch gradient and activation norms. Exploding norms → clip, lower LR, check init/normalization. Vanishing norms → check residual connections, normalization placement, and activation functions.

A 10-class classifier with random initial weights predicts roughly uniform probabilities. What initial cross-entropy loss should you expect — the sanity-check value $\ln K$ for $K = 10$?

A uniform prediction assigns probability $1/K$ to the true class, so the loss is $-\ln(1/K) = \ln K = \ln 10 \approx $ 2.303. If your run starts at, say, 6.0 instead, something is wrong with the labels, the logit scale, or the loss reduction; fix that before tuning anything else.
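The initial-loss check is quick to script. The sketch below stands in for a freshly initialized classifier with small random logits (an assumption; a real network's initial logits depend on its init scheme) and computes cross-entropy with the log-sum-exp trick so it stays finite for any logit scale.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy from raw logits, computed via a stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)     # log-sum-exp trick
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

K = 10
rng = np.random.default_rng(0)
labels = rng.integers(0, K, size=512)
logits = 0.01 * rng.standard_normal((512, K))   # near-uniform predictions

initial = cross_entropy(logits, labels)
print(f"initial loss {initial:.3f} vs ln K = {np.log(K):.3f}")
```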

You can now train a network that fits a fixed dataset; the next volume removes the dataset. Reinforcement learning replaces "minimize a loss on labeled examples" with "maximize a reward signal an agent must discover by acting" — a setting where the data is generated by the very policy you are optimizing. RL · 01 opens with the formalism that makes that tractable: the Markov decision process, states, actions, rewards, and the discounting that ties a future payoff to a present choice.

7.R

References

Kingma, D. P. & Ba, J. (2014). Adam: A Method for Stochastic Optimization. ICLR 2015 — the first/second-moment adaptive optimizer with bias correction (EQ N7.3).
Loshchilov, I. & Hutter, F. (2017). Decoupled Weight Decay Regularization. ICLR 2019 — AdamW; why decoupling weight decay from the adaptive step generalizes better (EQ N7.4).
Micikevicius, P. et al. (2017). Mixed Precision Training. ICLR 2018 — fp16 training, the fp32 master copy, and loss scaling (EQ N7.7).
Loshchilov, I. & Hutter, F. (2016). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017 — cosine annealing and cyclical warm restarts (EQ N7.5).
Srivastava, N. et al. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15 — the dropout regularizer and inverted-dropout scaling (EQ N7.6).
Keskar, N. S. et al. (2016). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR 2017 — batch size, flat vs. sharp minima, and the generalization debate.
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press — Ch. 8 (optimization) and Ch. 7 (regularization), the standard textbook treatment.