class: center, middle, inverse # Lecture 4: Boosting, Blackwell Approachability, and Calibration ## CS 8803: Sequence Prediction ### February 6, 2026 --- # The Big Picture We've covered: * **Online Learning:** Regret minimization, Experts, Hedge, EXP3. * **Game Theory:** Minimax Theorem, Zero-sum games. Today, we connect these ideas to solve three major problems: 1. **Boosting:** How to turn weak learners into strong ones (AdaBoost). 2. **Vector Games:** What if payoffs aren't just scalars? (Blackwell Approachability). 3. **Calibration:** How to make probabilistic forecasts trustworthy. --- class: center, middle, inverse # Part 1: Boosting and Game Theory ### Can a committee of idiots be a genius? --- # Weak vs. Strong Learning **The Question (Kearns & Valiant, 1988):** Does **Weak Learnability** imply **Strong Learnability**? **1. Weak Learning:** Can we always find a hypothesis $h$ that performs *slightly* better than random guessing? * Example: A classifier that is 51% accurate. * "I can always find a rule of thumb." **2. Strong Learning:** Can we generate a hypothesis (or ensemble) with arbitrarily low error? * Example: A classifier that is 99.9% accurate. * "I can solve the problem perfectly." **The Answer (Schapire, 1990):** Yes! **Boosting** is the constructive proof: we can combine many weak hypotheses into a single strong one. --- # Weak vs. Strong Assumptions (Formal) Let's stick to the definitions: **Weak Learning Assumption (WLA):** For *every* distribution $p$ over the data, *there exists* a hypothesis $h \in \mathcal{H}$ that has a strictly positive edge $\gamma$. \\[ \forall p \in \Delta\_n, \exists h \in \mathcal{H} \text{ s.t. } \mathbb{E}\_{i \sim p} [y\_i h(x\_i)] \ge \gamma \\] **Strong Learning Assumption (SLA):** *There exists* an ensemble (distribution over hypotheses) $q$ such that for *every* data point $i$, the margin is at least $\gamma$. \\[ \exists q \in \Delta\_{\mathcal{H}} \text{ s.t. 
} \forall i \in \\{1, \dots, n\\}, \sum\_{h} q(h) y\_i h(x\_i) \ge \gamma \\] *Note: The surprising result is that these two conditions are equivalent!* --- # The Boosting Game We can frame this as a **zero-sum game**: * **Player 1 (Adversary/Weighting):** Chooses a distribution $p \in \Delta\_n$ over training data (trying to find hard examples). * **Player 2 (Learner):** Chooses a hypothesis $h \in \mathcal{H}$ (trying to classify well). **The Payoff (Edge):** Let $M$ be the matrix with entries $M(i, h) = y\_i h(x\_i)$. The expected payoff for a distribution $p$ is: \\[ M(p, h) = \sum\_{i=1}^n p\_i M(i, h) = \mathbb{E}\_{i \sim p} [y\_i h(x\_i)] \\] * WLA says: $\min\_{p} \max\_{h} M(p, h) \ge \gamma$. * SLA says: $\max\_{q} \min\_{i} \text{Margin}(q, i) \ge \gamma$. --- # The Duality Proof Recall $M(p, h) = \mathbb{E}\_{i \sim p} [y\_i h(x\_i)]$. Why are WLA and SLA equivalent? **Von Neumann's Minimax Theorem!** \\[ \min\_{p \in \Delta\_n} \max\_{q \in \Delta\_{\mathcal{H}}} M(p, q) = \max\_{q \in \Delta\_{\mathcal{H}}} \min\_{p \in \Delta\_n} M(p, q) \\] * **Left Side (WLA):** The adversary goes first. They pick a distribution $p$. We can always find an $h$ (and thus a trivial $q$) to get margin $\ge \gamma$. * **Right Side (SLA):** We go first. We pick an ensemble $q$. The adversary picks the hardest *single* point $i$ (a pure strategy). The value is still $\ge \gamma$. **Conclusion:** If you can always beat a distribution, you can build an ensemble that beats every point. --- # AdaBoost as Hedge The **AdaBoost** algorithm is simply the **Hedge algorithm** applied to this game. **Algorithm:** 1. **Initialize:** Weights $p\_1(i) = 1/n$ for all $i$. 2. **For $t = 1 \dots T$:** * **Learner:** Finds weak hypothesis $h\_t$ maximizing edge w.r.t $p\_t$. (Best Response) * **Adversary:** Updates weights using Hedge (Multiplicative Weights). 
\\[ p\_{t+1}(i) \propto p\_t(i) \exp(-\eta y\_i h\_t(x\_i)) \\] * *Note: If point $i$ is correctly classified ($y\_i h\_t(x\_i) > 0$), weight decreases. If wrong, weight increases.* 3. **Output:** Final ensemble $\text{sign}(\sum \alpha\_t h\_t)$. *This forces the weak learner to focus on the "hard" examples (the ones with high weights).* --- # Visualizing AdaBoost
*(Figure: Round 1, the first weak learner (with a mistake); Round 2, re-weighting of the misclassified points; the final strong learner combines the rounds.)*
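---
# AdaBoost as Hedge: A Sketch

The loop from the previous slides can be sketched in a few lines of Python. This is a minimal illustration of the game dynamics, not classical AdaBoost: the ensemble uses uniform weights rather than adaptive $\alpha\_t$, and the toy 1-D dataset, stump class, $T$, and $\eta$ are illustrative choices.

```python
import numpy as np

# Toy 1-D dataset: the positives form an interval, so no single
# stump separates the data, but a weighted vote of stumps does.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([-1, 1, 1, -1])
n = len(x)

# Hypothesis class: decision stumps h(x) = s * sign(x - theta).
stumps = [(s, th) for s in (+1, -1)
          for th in (-0.5, 0.5, 1.5, 2.5, 3.5)]

def predict(stump, x):
    s, th = stump
    return s * np.sign(x - th)

T, eta = 500, 0.1
p = np.ones(n) / n        # adversary's distribution over examples
votes = np.zeros(n)       # running sum of the h_t(x_i)

for t in range(T):
    # Learner: best response, the stump with maximum edge under p.
    h = max(stumps, key=lambda st: np.sum(p * y * predict(st, x)))
    votes += predict(h, x)
    # Adversary: Hedge update, upweighting misclassified examples.
    p = p * np.exp(-eta * y * predict(h, x))
    p /= p.sum()

ensemble = np.sign(votes)  # uniform-weight majority vote
print("train accuracy:", np.mean(ensemble == y))
```

Because the best-response stump has edge at least $\gamma$ every round, the Hedge regret bound forces every point's average margin to become positive, exactly as the duality argument predicts.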
--- class: center, middle, inverse # Part 2: Blackwell Approachability ### Optimizing multiple objectives simultaneously --- # The Setup * **Learner:** Chooses action $x \in \mathcal{X}$. * **Adversary:** Chooses action $y \in \mathcal{Y}$. (Assume $\mathcal{X}, \mathcal{Y}$ finite). * **Payoff:** A vector $r(x, y) \in \mathbb{R}^d$. * **Goal:** We want the average payoff $\bar{r}\_T = \frac{1}{T} \sum\_{t=1}^T r(x\_t, y\_t)$ to **approach** a target set $S$. **Target Set $S$:** * Must be **Closed** and **Convex**. * Metric: Euclidean distance $\text{dist}(z, S) = \inf\_{s \in S} \\lVert z - s \\rVert\_2$. * *Note: $L\_2$ is convenient, but results hold for other norms.* --- # Blackwell's Theorem (1956) This is the "Fundamental Theorem" of vector-valued games. We wish we had a simple Minimax theorem: "If for every distribution of the adversary, I can hit $S$, then I can always hit $S$." * In one-shot games, this isn't true for vector payoffs! * But Blackwell showed it **is** true for the long-run average. **Theorem:** If a set $S$ satisfies the **Response-Satisfiability (RS)** condition, then there exists a learning algorithm such that: \\[ \text{dist}(\bar{r}\_T, S) \to 0 \quad \text{as } T \to \infty \\] Specifically, the rate is $O(1/\sqrt{T})$. --- # Satisfiability Conditions When is a set $S$ approachable? Blackwell gave two equivalent conditions. *Notation:* Let $r(p, q) = \mathbb{E}\_{x \sim p, y \sim q} [r(x, y)]$. **1. Response-Satisfiability (RS):** (The "Offense" View) * For any adversary strategy $q$, there exists a response $p$ that lands in $S$. * $\forall q \in \Delta\_{\mathcal{Y}}, \exists p \in \Delta\_{\mathcal{X}} \text{ s.t. } r(p, q) \in S$. **2. Halfspace-Satisfiability (HS):** (The "Defense" View) * For any halfspace $H$ containing $S$, there is a strategy $p$ that forces the payoff into $H$, regardless of $y$. * $\forall H \supseteq S, \exists p \in \Delta\_{\mathcal{X}} \text{ s.t. } \forall y \in \mathcal{Y}, r(p, y) \in H$. 
*Homework Problem: Prove that RS $\iff$ HS!* --- # Blackwell's Algorithm How do we construct the strategy $p\_{t+1}$? 1. **Check:** Calculate average payoff $\bar{r}\_t$. 2. **Project:** If $\bar{r}\_t \notin S$, compute projection $\pi\_t = \Pi\_S(\bar{r}\_t)$. 3. **Error Vector:** Let $a\_t = \bar{r}\_t - \pi\_t$. * This defines a halfspace $H\_t = \{ z : \langle a\_t, z - \pi\_t \rangle \le 0 \}$ containing $S$. 4. **Play:** Choose $p\_{t+1}$ to satisfy **HS** for $H\_t$. * Find $p\_{t+1}$ such that $\forall y, \langle a\_t, r(p\_{t+1}, y) - \pi\_t \rangle \le 0$. * *Why possible?* Because $S$ is approachable (satisfies HS). --- # Geometric Intuition
*(Figure: the average payoff $\bar{r}\_t$ lies outside $S$; its projection onto $S$ is $\pi\_t$, and $H\_t$ is the safe halfspace we steer into.)*
We "steer" the average $\bar{r}\_t$ back towards $S$. --- # Proof Sketch (Potential Function) Define potential $\Phi\_t = \text{dist}(\bar{r}\_t, S)^2 = \\lVert \bar{r}\_t - \pi\_t \\rVert^2$. 1. **Update:** $\bar{r}\_{t+1} = \frac{t}{t+1}\bar{r}\_t + \frac{1}{t+1}r\_{t+1}$. 2. **Bound:** $\Phi\_{t+1} \le \\lVert \bar{r}\_{t+1} - \pi\_t \\rVert^2$ (projecting onto the true $\pi\_{t+1}$ can only decrease the distance). 3. **Expansion:** \\[ \\lVert \bar{r}\_{t+1} - \pi\_t \\rVert^2 \approx \Phi\_t + \frac{2}{t} \langle \bar{r}\_t - \pi\_t, r\_{t+1} - \pi\_t \rangle + O(1/t^2) \\] 4. **Expectation:** The middle term is $\le 0$ in expectation because we chose $p\_{t+1}$ to satisfy HS! \\[ \mathbb{E}[\Phi\_{t+1}] \le \Phi\_t \left( \frac{t}{t+1} \right)^2 + \frac{R^2}{(t+1)^2} \\] 5. **Induction:** Solving the recurrence gives $\mathbb{E}[\Phi\_T] \le \frac{R^2}{T}$, so by Jensen's inequality the expected distance satisfies $\mathbb{E}[\text{dist}(\bar{r}\_T, S)] \le \sqrt{\mathbb{E}[\Phi\_T]} = O(1/\sqrt{T})$. --- # Reduction: No-Regret as Approachability We can re-derive **Hedge** / **Experts** using this framework. * **Payoff Vector:** $r(i, \ell) \in \mathbb{R}^N$. * $i \in \{1 \dots N\}$ is the expert we choose. $\ell \in [0,1]^N$ is the loss vector. * For a mixed strategy $p$, the $j$-th component of the vector is: \\[ [r(p, \ell)]\_j = \underbrace{\langle p, \ell \rangle}\_{\text{My Loss}} - \underbrace{\ell(j)}\_{\text{Expert } j \text{ Loss}} \\] * **Target Set:** $S = \mathbb{R}^N\_{\le 0}$ (The negative orthant). **Why this works:** 1. $\bar{r}\_T \in S \implies [\bar{r}\_T]\_j \le 0$ for all $j$. 2. $\implies \frac{1}{T} \sum (\langle p\_t, \ell\_t \rangle - \ell\_t(j)) \le 0$. 3. $\implies \sum \text{Loss}(\text{Alg}) \le \min\_j \sum \text{Loss}(\text{Expert } j)$. *Check RS:* If the adversary plays $\bar{\ell}$, we play the point mass $p$ on $\arg\min\_j \bar{\ell}(j)$. Then my loss equals the best expert's loss, so every component of $r(p, \bar{\ell})$ is $\le 0$, i.e. $r(p, \bar{\ell}) \in S$. --- class: center, middle, inverse # Part 3: Calibrated Forecasting ### "It's 30% likely to rain." 
--- # The Problem of Calibration A meteorologist says "30% chance of rain" every day for a year. * **Scenario A:** It never rains. (Bad forecast). * **Scenario B:** It rains every day. (Bad forecast). * **Scenario C:** It rains exactly 30% of the days. (Calibrated!) **Definition:** A forecaster is **calibrated** if, among all times they predicted probability $p$, the event occurred with frequency $p$. --- # Reliability Diagrams
*(Reliability diagram: observed frequency plotted against predicted probability, with curves for perfect calibration (the diagonal), an overconfident forecaster, and an underconfident forecaster.)*
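---
# Computing a Reliability Diagram

The points in a reliability diagram are just per-bucket counts and observed frequencies. A minimal sketch (assuming `numpy`), using a synthetic forecaster that is calibrated by construction; the grid, horizon, and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Forecasts on the grid 0.0, 0.1, ..., 1.0; outcomes are drawn so
# the true rain probability equals the forecast (calibrated).
buckets = np.round(np.arange(0.0, 1.01, 0.1), 1)
T = 100_000
preds = rng.choice(buckets, size=T)
outcomes = (rng.random(T) < preds).astype(float)

# Per-bucket count N_j and observed frequency rho_j.
for v in buckets:
    mask = preds == v
    if mask.any():
        print(f"v = {v:.1f}   N = {mask.sum():6d}   "
              f"rho = {outcomes[mask].mean():.3f}")

# Weighted L1 calibration error: sum_j (N_j / T) * |rho_j - v_j|.
err = sum((preds == v).mean() * abs(outcomes[preds == v].mean() - v)
          for v in buckets if (preds == v).any())
print("L1 calibration error:", round(err, 4))
```

For a calibrated forecaster this error is driven only by sampling noise and shrinks as $T$ grows; the question on the following slides is whether we can force it to vanish against adversarially chosen outcomes.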
--- # Defining Calibration Let's discretize predictions into **buckets** $V = \{v\_1, \dots, v\_K\}$ (e.g., $0.0, 0.1, \dots, 1.0$). **Key Quantities at time $T$:** \\[ N\_j^T = \sum\_{t=1}^T \mathbb{I}(p\_t = v\_j), \quad \rho\_j^T = \frac{1}{N\_j^T} \sum\_{t: p\_t = v\_j} y\_t \\] * $N\_j^T$: The number of times we predicted value $v\_j$ up to time $T$. * $\rho\_j^T$: The actual frequency of outcomes when we predicted $v\_j$. **$\epsilon$-Calibration:** We want $\\lvert \rho\_j^T - v\_j \\rvert < \epsilon$ for all buckets where $N\_j^T$ is large. **$L\_1$-Calibration:** We want the weighted average error to vanish: \\[ \sum\_{j=1}^K \frac{N\_j^T}{T} \\lvert \rho\_j^T - v\_j \\rvert \to 0 \\] *(These are roughly equivalent for fine grids).* --- # The Adversary Strikes Back Why is this hard? Imagine a simple deterministic strategy: **Strategy:** "If it rained a lot recently, predict high. Else predict low." * Buckets: Low (0.2), High (0.8). **Adversary Strategy:** 1. Wait for you to predict **Low** (0.2) $\to$ Make it **Rain** ($y=1$). * Error: $\\lvert 1 - 0.2 \\rvert = 0.8$. 2. Wait for you to predict **High** (0.8) $\to$ Make it **Sunny** ($y=0$). * Error: $\\lvert 0 - 0.8 \\rvert = 0.8$. **Result:** You are 100% wrong. Your calibration score is terrible. **Solution:** You must randomize your forecasts! --- # Calibration as a Game We can use Blackwell's Theorem to solve this! 1. **Learner:** Choose a bucket index $i \in \{1, \dots, K\}$. 2. **Payoff Vector:** $r(i, y) \in \mathbb{R}^K$ where $j$-th component is: \\[ [r(i, y)]\_j = \mathbb{I}(i = j) (v\_j - y) \\] * *Note: In each round, only one component is non-zero.* * *The average payoff $\bar{r}\_T$ has $j$-th component $\frac{1}{T} N\_j^T (v\_j - \rho\_j^T)$.* 3. **Target Set $S$:** The $L\_1$ ball of radius $\epsilon$. 
\\[ S = \\{ z \in \mathbb{R}^K : \sum\_{j=1}^K \\lvert z\_j \\rvert \le \epsilon \\} \\] If $\bar{r}\_T \to S$, then $\sum \frac{1}{T} \\lvert N\_j^T (v\_j - \rho\_j^T) \\rvert \le \epsilon$. **Calibrated!** --- # Proof of Approachability Does this game satisfy the **RS Condition**? * Adversary plays $y \in [0,1]$ (prob of rain). * Can we find a distribution $p \in \Delta\_K$ s.t. $\mathbb{E}\_{i \sim p}[r(i, y)] \in S$? **Strategy:** 1. Look at adversary's $y$. 2. Find the bucket $v\_{i^\star}$ closest to $y$. (Distance $\le \epsilon$). 3. Choose $p$ to be a point mass on bucket $i^\star$. **Check:** The expected payoff $\mathbb{E}[r]$ has only one non-zero component: \\[ [\mathbb{E}[r]]\_{i^\star} = 1 \cdot (v\_{i^\star} - y) \\] \\[ \\lVert \\mathbb{E}[r] \\rVert\_1 = \\lvert v\_{i^\star} - y \\rvert \le \epsilon \\] Yes! The expected vector is inside the $L\_1$ ball $S$. --- # Summary **1. Boosting:** * Weak Learning $\iff$ Strong Learning. * Proven via Minimax Duality. * AdaBoost $\approx$ Hedge. **2. Blackwell Approachability:** * Vector-valued Minimax. * Target set $S$ is approachable if we can "steer" towards it. * RS $\iff$ HS. **3. Calibration:** * Can be cast as an approachability game. * **Foster-Vohra Algorithm:** A randomized forecasting strategy that guarantees calibration against any adversary. (See Abernethy, Bartlett, Hazan '11 for the full reduction). --- class: center, middle, inverse # Thanks!