Lecture 4: Boosting, Blackwell Approachability, and Calibration
Date: February 6, 2026
Topic: Game Theory and Boosting, Blackwell’s Theorem, and Calibrated Forecasting
1. Boosting and Game Theory
Boosting (e.g., AdaBoost) is one of the most successful ideas in machine learning. While often introduced as a way to combine “weak” learners into a “strong” one, its theoretical foundation is deeply rooted in game theory—specifically, in finding an approximate Nash equilibrium in a zero-sum game.
The Boosting Game
We can frame boosting as a zero-sum game between two players:
- The Weighting Player (Adversary): Chooses a distribution $p \in \Delta_n$ over $n$ training samples $(x_i, y_i)$.
- The Learner Player: Chooses a hypothesis $h$ from a class $\mathcal{H}$.
The payoff for a pair $(p, h)$ is the expected “edge” over random guessing:
\[M(p, h) = \sum_{i=1}^n p_i y_i h(x_i)\]
Weak vs. Strong Learning
The fundamental theorem of boosting establishes the equivalence between two assumptions:
- Weak Learning Assumption (WLA): For every distribution $p$ over the data, there exists a weak hypothesis $h \in \mathcal{H}$ that performs slightly better than random:
\[M(p, h) = \sum_{i=1}^n p_i y_i h(x_i) \ge \gamma\]
for some fixed $\gamma > 0$.
- Strong Learning Assumption (SLA): There exists a fixed convex combination of hypotheses $q \in \Delta_{\mathcal{H}}$ (an ensemble) that classifies every sample correctly with margin $\gamma$:
\[y_i \, \mathbb{E}_{h \sim q}[h(x_i)] \ge \gamma \quad \text{for all } i = 1, \dots, n\]
The Duality Proof: By von Neumann’s Minimax Theorem:
\[\min_{p \in \Delta_n} \max_{q \in \Delta_{\mathcal{H}}} \mathbb{E}_{i \sim p, \, h \sim q} [y_i h(x_i)] = \max_{q \in \Delta_{\mathcal{H}}} \min_{p \in \Delta_n} \mathbb{E}_{i \sim p, \, h \sim q} [y_i h(x_i)]\]
If the WLA holds, the LHS is at least $\gamma$. Therefore, the RHS must also be at least $\gamma$, which implies the SLA. This “trick” shows that if you can always beat a distribution, there must be an ensemble that beats all points simultaneously.
Boosting via Hedge
AdaBoost essentially runs the Hedge algorithm where the “experts” are the training samples. At each step $t$:
- The Weighting Player uses Hedge to maintain a distribution over the samples:
\[p_{t,i} \propto \exp\Big(-\eta \sum_{s=1}^{t-1} y_i h_s(x_i)\Big)\]
- The Learner Player finds a hypothesis $h_t$ that satisfies the WLA for $p_t$.
- The weights are updated multiplicatively: $p_{t+1, i} \propto p_{t, i} \, e^{-\eta \, y_i h_t(x_i)}$, so misclassified samples gain weight.
- The final classifier is the average hypothesis $\bar{h} = \frac{1}{T} \sum_{t=1}^T h_t$, which converges to the maximin strategy.
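The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not the full AdaBoost algorithm: `weak_learner(p)` is an assumed oracle that returns a hypothesis with positive edge under $p$, and the toy decision-stump learner used to exercise it is hypothetical.

```python
import numpy as np

def boost_via_hedge(X, y, weak_learner, T=20, eta=0.5):
    """Boosting as Hedge over the n training samples (labels in {-1, +1}).

    `weak_learner(p)` is an assumed oracle returning a hypothesis h with
    edge sum_i p[i] * y[i] * h(X[i]) >= gamma under the distribution p.
    """
    n = len(y)
    margins = np.zeros(n)            # cumulative margin y_i * sum_s h_s(x_i)
    hyps = []
    for _ in range(T):
        w = np.exp(-eta * margins)   # Hedge: upweight poorly classified samples
        p = w / w.sum()
        h = weak_learner(p)
        hyps.append(h)
        margins += y * np.array([h(xi) for xi in X])
    # final classifier: sign of the average hypothesis
    return lambda x: int(np.sign(np.mean([h(x) for h in hyps])))
```

Note that the weak learner is called with a fresh distribution each round; the Hedge weights concentrate on the hardest examples exactly as the Weighting Player's strategy prescribes.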
2. Blackwell Approachability
Blackwell’s Approachability Theorem (1956) is a landmark generalization of the Minimax Theorem to games with vector-valued payoffs. It provides the geometric foundation for almost all of modern online learning.
The Setup
Consider a repeated game where:
- The Learner chooses $x_t \in \mathcal{X}$.
- The Adversary chooses $y_t \in \mathcal{Y}$.
- The payoff is a vector $r(x_t, y_t) \in \mathbb{R}^d$.
Let $\bar{r}_T$ be the average payoff vector:
\[\bar{r}_T = \frac{1}{T} \sum_{t=1}^T r(x_t, y_t)\]
The learner’s goal is to ensure that $\bar{r}_T$ approaches a target closed convex set $S \subseteq \mathbb{R}^d$:
\[\text{dist}(\bar{r}_T, S) \to 0 \quad \text{as } T \to \infty\]
Satisfiability and Approachability
Blackwell proved that $S$ is approachable if and only if it satisfies one of these two equivalent conditions:
- Response-Satisfiability (RS): For every mixed strategy $q \in \Delta_{\mathcal{Y}}$ of the adversary, the learner has a response $p \in \Delta_{\mathcal{X}}$ that “lands” in $S$ in expectation:
\[\mathbb{E}_{x \sim p, \, y \sim q} [r(x, y)] \in S\]
- Halfspace-Satisfiability (HS): For every halfspace $H \supseteq S$, the learner has a fixed response $p \in \Delta_{\mathcal{X}}$ that keeps the expected payoff in $H$ regardless of what the adversary does:
\[\mathbb{E}_{x \sim p} [r(x, y)] \in H \quad \text{for all } y \in \mathcal{Y}\]
Blackwell’s Algorithm and the Proof of Approachability
While the satisfiability conditions (RS and HS) tell us when a set is approachable, Blackwell also provided a constructive algorithm to achieve it.
The Strategy: Let $\bar{r}_t$ be the average payoff vector after $t$ rounds.
Now we define how the learner should pick the distribution $p_{t+1}$.
(1) If $\bar{r}_t \in S$, the learner can pick any distribution.
(2) If $\bar{r}_t \notin S$, let $\pi_t = \Pi_S(\bar{r}_t)$ be the Euclidean projection of $\bar{r}_t$ onto the set $S$.
(3) Let $a_t = \bar{r}_t - \pi_t$ be the “error vector”. This vector defines a halfspace $H_t = \{ z : \langle a_t, z - \pi_t \rangle \le 0 \}$ such that $S \subseteq H_t$.
(4) By the Halfspace-Satisfiability (HS) condition, there exists a strategy $p_{t+1}$ such that for all opponent plays $y$:
\[\mathbb{E}_{x \sim p_{t+1}} [\langle \bar{r}_t - \pi_t, r(x, y) - \pi_t \rangle] \le 0\]
The Proof (Potential Function Analysis): Define the potential function $\Phi_t = \text{dist}(\bar{r}_t, S)^2 = \lVert \bar{r}_t - \pi_t \rVert^2$. We want to show $\Phi_t \to 0$.
- Iterate Update: The running average evolves as
\[\bar{r}_{t+1} = \frac{t}{t+1} \bar{r}_t + \frac{1}{t+1} r_{t+1}, \quad \text{where } r_{t+1} = r(x_{t+1}, y_{t+1})\]
- Distance Bound: By the properties of projection onto a convex set, since $\pi_t \in S$:
\[\Phi_{t+1} = \lVert \bar{r}_{t+1} - \pi_{t+1} \rVert^2 \le \lVert \bar{r}_{t+1} - \pi_t \rVert^2\]
- Expansion: Substituting the update rule and expanding the square:
\[\lVert \bar{r}_{t+1} - \pi_t \rVert^2 = \left(\frac{t}{t+1}\right)^2 \Phi_t + \frac{2t}{(t+1)^2} \langle \bar{r}_t - \pi_t, r_{t+1} - \pi_t \rangle + \frac{1}{(t+1)^2} \lVert r_{t+1} - \pi_t \rVert^2\]
- Expectation: We now take the expectation over the learner’s randomized choice $x_{t+1} \sim p_{t+1}$. The goal is to show the middle term is non-positive.
By the properties of projection onto a convex set, $S$ is contained in the halfspace $H_t$ defined by:
\[H_t = \{ z : \langle \bar{r}_t - \pi_t, z - \pi_t \rangle \le 0 \}\]
By the Halfspace-Satisfiability (HS) condition, there exists a mixed strategy $p_{t+1}$ such that for any adversary action $y_{t+1}$, the expected payoff lands in $H_t$. That is:
\[\mathbb{E}_{x \sim p_{t+1}} [r(x, y_{t+1})] \in H_t \implies \langle \bar{r}_t - \pi_t, \mathbb{E}[r_{t+1}] - \pi_t \rangle \le 0\]
Applying this to our potential expansion:
\[\mathbb{E}[\Phi_{t+1}] \le \left(\frac{t}{t+1}\right)^2 \Phi_t + \frac{R^2}{(t+1)^2}\]
where $R$ is a bound on the maximum distance $\lVert r(x, y) - \pi \rVert$.
- Convergence: We can show by induction that $\mathbb{E}[\Phi_T] \le \frac{R^2}{T}$.
- Base Case ($T=1$): $\mathbb{E}[\Phi_1] \le R^2$.
- Inductive Step: Assume $\mathbb{E}[\Phi_t] \le \frac{R^2}{t}$. Then:
\[\mathbb{E}[\Phi_{t+1}] \le \left(\frac{t}{t+1}\right)^2 \frac{R^2}{t} + \frac{R^2}{(t+1)^2} = \frac{t R^2 + R^2}{(t+1)^2} = \frac{R^2}{t+1}\]
- Conclusion: Since $\mathbb{E}[\text{dist}(\bar{r}_T, S)^2] \le R^2/T$, Jensen’s inequality gives $\mathbb{E}[\text{dist}(\bar{r}_T, S)] \le \sqrt{\mathbb{E}[\text{dist}(\bar{r}_T, S)^2]} \le \frac{R}{\sqrt{T}}$. This is why the approachability rate is $O(1/\sqrt{T})$.
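The projection strategy is easy to simulate on a toy instance. The following sketch is my own illustrative example (not from the lecture): payoffs are $r(x, y) = x - y$ with both players choosing points in $[-1, 1]^2$, and the target set is $S = \{0\}$, so the projection $\pi_t$ is always the origin and the halfspace oracle is simply $x = -\text{sign}(\bar{r}_t)$. Because the halfspace condition holds pointwise here, the $R/\sqrt{T}$ bound holds deterministically.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 2000, 2
rbar = np.zeros(d)                      # running average payoff vector
for t in range(1, T + 1):
    # Projection onto S = {0} is pi_t = 0, so the error vector is a_t = rbar.
    # Halfspace oracle: x = -sign(a_t) ensures <a_t, x - y> <= 0 for all y in [-1,1]^d.
    x = -np.sign(rbar)
    y = rng.uniform(-1, 1, d)           # the adversary may play anything
    rbar += ((x - y) - rbar) / t        # incremental average update
# The potential argument guarantees dist(rbar, S) <= R / sqrt(T) with R = 2*sqrt(2).
```

After $T = 2000$ rounds the average payoff sits within $2\sqrt{2}/\sqrt{T} \approx 0.063$ of the origin, no matter what sequence the adversary produced.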
Reduction: No-Regret Learning as Approachability
Blackwell’s Theorem is a “master theorem” for online learning. We can derive the existence of no-regret algorithms (like Hedge) directly by framing the “Expert Advice” problem as a vector-valued game.
1. The Setup: Consider $N$ experts. In each round, the learner chooses a distribution $p_t \in \Delta_N$ and the environment chooses a loss vector $\ell_t \in [0, 1]^N$.
Vector Payoff: Define $r(p, \ell) \in \mathbb{R}^N$ such that each component $i$ is: $r(p, \ell)_i = \langle p, \ell \rangle - \ell_i$
Target Set: Let $S = \mathbb{R}_-^N$ (the negative orthant).
2. Why Approachability implies No-Regret: If the learner can force the average payoff vector $\bar{r}_T$ to approach $S$, it means that for every expert $i$:
\[\frac{1}{T} \sum_{t=1}^T (\langle p_t, \ell_t \rangle - \ell_{i,t}) \le \text{dist}(\bar{r}_T, S) \to 0\]
This is exactly the definition of external regret (averaged over time).
3. Proving RS is satisfied (The Minimax Connection): To prove that $S$ is approachable, we must show that for any fixed distribution over losses $q$, there exists a distribution over actions $p$ such that the expected payoff vector is in $S$.
Let $\bar{\ell} = \mathbb{E}_{\ell \sim q}[\ell]$ be the expected loss vector. We need to find $p$ such that for all $i$:
\[\langle p, \bar{\ell} \rangle - \bar{\ell}_i \le 0 \iff \langle p, \bar{\ell} \rangle \le \min_i \bar{\ell}_i\]
But this is always possible! The learner can simply choose $p$ to be a point mass on the expert $i^\star$ with the minimum expected loss.
The “Aha!” Moment: This shows that no-regret learning is possible because, in a single-shot game, you can always perform as well as the best expert if you know the environment’s distribution. Blackwell’s Theorem then guarantees that you can perform just as well in the long run without knowing that distribution.
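Instantiated on the negative orthant, Blackwell's projection strategy becomes very concrete: the error vector $a_t$ is the positive part of the average regret vector, and playing $p_{t+1} \propto a_t$ satisfies the halfspace condition. This is the regret-matching algorithm of Hart and Mas-Colell. A minimal sketch (random losses stand in for an arbitrary adversary; the guarantee holds for any loss sequence):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 5, 3000
cum = np.zeros(N)                        # cumulative regret vector = T * rbar
for t in range(T):
    plus = np.maximum(cum, 0)            # error vector: positive part of regrets
    # Blackwell's strategy: play proportional to positive regrets
    p = plus / plus.sum() if plus.any() else np.full(N, 1.0 / N)
    l = rng.uniform(0, 1, N)             # loss vector chosen by the environment
    cum += p @ l - l                     # payoff component i: <p, l> - l_i
avg_regret = cum.max() / T               # average regret against the best expert
```

Since each payoff component lies in $[-1, 1]$, the approachability bound gives $\text{avg\_regret} \le 2\sqrt{N}/\sqrt{T}$, here about $0.08$.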
3. Calibrated Forecasting
Calibration is a fundamental property of probabilistic forecasts. We say a forecaster is calibrated if their predicted probabilities match the long-run empirical frequencies of the events they are predicting.
The “Weather Forecast” Intuition
Consider your local meteorologist. If they say there is a “30% chance of rain” today, what does that actually mean? It doesn’t mean it will rain in 30% of the city. Rather, it is a claim about the reliability of the forecast over time: If we look at all the days where the meteorologist predicted exactly 30% rain, it should have actually rained on roughly 30% of those days.
If it only rained on 10% of those days, the forecaster is over-confident. If it rained on 50%, they are under-confident. In both cases, the forecast is uncalibrated, even if it happens to be “correct” on any single given day.
Why is this important?
Calibration is essential for downstream decision-making. If you are a farmer deciding whether to protect your crops, or an investor deciding whether to hedge a position, you need to trust that a “10% risk” truly represents a 1-in-10 event. Without calibration, probabilities are just arbitrary numbers.
In statistics, this is often captured by the Brier Score:
\[BS = \frac{1}{T} \sum_{t=1}^T (p_t - y_t)^2\]which can be decomposed into three components:
- Calibration: How well the probabilities match the frequencies.
- Refinement (or Sharpness): How close the probabilities are to 0 or 1.
- Irreducible Uncertainty: The inherent variance of the process.
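The decomposition (often attributed to Murphy) can be checked numerically. The sketch below is illustrative: the binning scheme and function name are my own, and the identity $BS = \text{reliability} - \text{resolution} + \text{uncertainty}$ holds exactly only when all forecasts falling in a bin are equal.

```python
import numpy as np

def brier_decomposition(p, y, bins=10):
    """Split the Brier score into reliability (calibration), resolution
    (sharpness relative to the base rate), and irreducible uncertainty."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    n, ybar = len(y), y.mean()
    idx = np.minimum((p * bins).astype(int), bins - 1)   # bin each forecast
    rel = res = 0.0
    for b in range(bins):
        mask = idx == b
        if mask.any():
            nk = mask.sum()
            pk, ok = p[mask].mean(), y[mask].mean()      # bin forecast / frequency
            rel += nk * (pk - ok) ** 2                   # calibration error
            res += nk * (ok - ybar) ** 2                 # resolution
    return rel / n, res / n, ybar * (1 - ybar)           # + uncertainty
```

Lower reliability is better, while higher resolution is better; a forecaster who always predicts the base rate has zero reliability error but also zero resolution.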
The $(\ell_1, \epsilon)$-Calibration Rate
To formalize this, let’s discretize the probability space. Let $\epsilon = 1/m$ for some integer $m$. We define a grid of possible forecasts $V = \{v_0, v_1, \dots, v_m\}$ where $v_i = i/m$.
Let $n_i(T)$ be the number of times the forecaster predicted $v_i$ up to time $T$, and let $\bar{y}_i(T)$ be the average outcome on those rounds. The $(\ell_1, \epsilon)$-calibration rate is:
\[C^\epsilon_T = \sum_{i=0}^m \frac{n_i(T)}{T} \lvert v_i - \bar{y}_i(T) \rvert - \frac{\epsilon}{2}\]
A forecaster is $(\ell_1, \epsilon)$-calibrated if $\limsup_{T \to \infty} C^\epsilon_T \le 0$. This definition sums the absolute “reliability error” across all buckets, allowing for a small discretization error of $\epsilon/2$.
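The rate is straightforward to compute from a forecast history. A minimal sketch (the function name is my own; forecasts are assumed to already lie on the grid $\{0, 1/m, \dots, 1\}$):

```python
def l1_calibration_rate(forecasts, outcomes, m):
    """C^eps_T = sum_i (n_i / T) * |v_i - ybar_i| - eps / 2, with eps = 1 / m.

    `forecasts` are grid values i/m; `outcomes` are 0/1 labels.
    """
    T = len(outcomes)
    total = 0.0
    for i in range(m + 1):
        rounds = [y for f, y in zip(forecasts, outcomes)
                  if abs(f - i / m) < 1e-12]      # rounds where v_i was predicted
        n_i = len(rounds)
        if n_i:
            ybar_i = sum(rounds) / n_i            # empirical frequency in bucket i
            total += (n_i / T) * abs(i / m - ybar_i)
    return total - 1 / (2 * m)
```

For example, a forecaster who predicts $0.5$ on four days, two of which are rainy, has zero reliability error and rate $-\epsilon/2$.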
The Approachability Framework
We can achieve calibration by reducing it to a Blackwell approachability problem.
Learner Strategy: In each round $t$, the learner chooses a distribution $w_t \in \Delta_{m+1}$ over the grid $V$ and samples a prediction $p_t \sim w_t$.
Vector Payoff: Define $r(w, y) \in \mathbb{R}^{m+1}$ such that each component $i$ is: \(r(w, y)_i = w(i) (v_i - y)\)
Target Set: $S$ is the $\ell_1$ ball of radius $\epsilon/2$: \(S = \{ z \in \mathbb{R}^{m+1} : \sum_{i=0}^m \lvert z_i \rvert \le \frac{\epsilon}{2} \}\)
Why the average payoff relates to calibration: The average payoff vector $\bar{r}_T$ has components $\frac{1}{T} \sum_t w_t(i) (v_i - y_t)$. By concentration, this is close to the empirical error $\frac{n_i(T)}{T} (v_i - \bar{y}_i(T))$. The distance to the set $S$ then corresponds to the $(\ell_1, \epsilon)$-calibration rate.
Deep Dive: Why is RS satisfied?
Recall the Response-Satisfiability (RS) condition: For any mixed strategy $y \sim \text{Bernoulli}(\bar{y})$ of the adversary, there must exist a learner distribution $w$ such that the expected payoff vector $\mathbb{E}[r(w, y)]$ lands in $S$.
The Mathematical Logic:
(1) Expected Payoff: The $i$-th component is $\mathbb{E}[r(w, y)]_i = w(i) (v_i - \bar{y})$.
(2) Pick the Best Response: The learner can choose $w$ to be a point mass on the grid point $v_{i^\star}$ that is closest to $\bar{y}$ (i.e., $\lvert v_{i^\star}-\bar{y} \rvert \le \epsilon/2$).
(3) Analyze the Norm: With this $w$, only one component is non-zero, and the $\ell_1$ norm is simply $\lvert v_{i^\star}-\bar{y} \rvert \le \epsilon/2$.
Conclusion: The expected payoff vector is inside the $\ell_1$ ball $S$. Thus, the set is approachable!
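The best-response argument above can be verified numerically by sweeping over every possible adversary mean $\bar{y}$ (a small self-contained check; the grid sweep is illustrative):

```python
import numpy as np

m = 10                                        # grid resolution, eps = 1/m
V = np.arange(m + 1) / m                      # forecast grid {0, 1/m, ..., 1}
worst = 0.0
for ybar in np.linspace(0, 1, 101):           # sweep adversary means
    i_star = int(np.argmin(np.abs(V - ybar))) # closest grid point to ybar
    w = np.zeros(m + 1)
    w[i_star] = 1.0                           # point-mass best response
    payoff = w * (V - ybar)                   # E[r(w, y)]_i = w(i) * (v_i - ybar)
    worst = max(worst, np.abs(payoff).sum())  # l1 norm of expected payoff
# `worst` never exceeds eps/2, so the expected payoff always lies in S
```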
The “Magic” of the Result
Blackwell’s Theorem provides an online strategy (the Foster-Vohra algorithm) that achieves this $(\ell_1, \epsilon)$-calibration guarantee without ever knowing the adversary’s distribution. Even if an adversary tries to move the outcomes to make you look uncalibrated, your randomized strategy guarantees long-term reliability.
4. Historical Context
- Minimax & Duality: Freund and Schapire (1996) showed that AdaBoost is a game-playing algorithm.
- Blackwell’s Breakthrough: David Blackwell (1956) generalized von Neumann’s Minimax Theorem to vector-valued payoffs.
- The Calibration Problem: Foster and Vohra (1997) used Blackwell’s Theorem to prove that calibrated forecasting is possible even against an adversary.
- Equivalence: Abernethy, Bartlett, and Hazan (2011) showed that calibration, no-regret learning, and Blackwell approachability are all equivalent.