class: center, middle, inverse # Lecture 4: Boosting, Blackwell Approachability, and Calibration ## CS 8803: Sequence Prediction ### February 6, 2026 --- # The Big Picture We've covered: * **Online Learning:** Regret minimization, Experts, Hedge, EXP3. * **Game Theory:** Minimax Theorem, Zero-sum games. Today, we connect these ideas to solve three major problems: 1. **Boosting:** How to turn weak learners into strong ones (AdaBoost). 2. **Vector Games:** What if payoffs aren't just scalars? (Blackwell Approachability). 3. **Calibration:** How to make probabilistic forecasts trustworthy. --- class: center, middle, inverse # Part 1: Boosting and Game Theory ### Can a committee of idiots be a genius? --- # Weak vs. Strong Learning **The Question (Kearns & Valiant, 1988):** Does **Weak Learnability** imply **Strong Learnability**? **1. Weak Learning:** Can we always find a hypothesis $h$ that performs *slightly* better than random guessing? * Example: A classifier that is 51% accurate. * "I can always find a rule of thumb." **2. Strong Learning:** Can we generate a hypothesis (or ensemble) with arbitrarily low error? * Example: A classifier that is 99.9% accurate. * "I can solve the problem perfectly." **The Answer (Schapire, 1990):** Yes! **Boosting** is the constructive proof: we can combine many weak hypotheses into a single strong one. --- # Weak vs. Strong Assumptions (Formal) Let's stick to the definitions: **Weak Learning Assumption (WLA):** For *every* distribution $p$ over the data, *there exists* a hypothesis $h \in \mathcal{H}$ that has a strictly positive edge $\gamma$. \\[ \forall p \in \Delta\_n, \exists h \in \mathcal{H} \text{ s.t. } \mathbb{E}\_{i \sim p} [y\_i h(x\_i)] \ge \gamma \\] **Strong Learning Assumption (SLA):** *There exists* an ensemble (distribution over hypotheses) $q$ such that for *every* data point $i$, the margin is at least $\gamma$. \\[ \exists q \in \Delta\_{\mathcal{H}} \text{ s.t. 
} \forall i \in \\{1, \dots, n\\}, \sum\_{h} q(h) y\_i h(x\_i) \ge \gamma \\] *Note: The surprising result is that these two conditions are equivalent!* --- # The Boosting Game We can frame this as a **zero-sum game**: * **Player 1 (Adversary/Weighting):** Chooses a distribution $p \in \Delta\_n$ over training data (trying to find hard examples). * **Player 2 (Learner):** Chooses a hypothesis $h \in \mathcal{H}$ (trying to classify well). **The Payoff (Edge):** Let $M$ be the matrix with entries $M(i, h) = y\_i h(x\_i)$. The expected payoff for a distribution $p$ is: \\[ M(p, h) = \sum\_{i=1}^n p\_i M(i, h) = \mathbb{E}\_{i \sim p} [y\_i h(x\_i)] \\] * WLA says: $\min\_{p} \max\_{h} M(p, h) \ge \gamma$. * SLA says: $\max\_{q} \min\_{i} \text{Margin}(q, i) \ge \gamma$. --- # The Duality Proof Recall $M(p, h) = \mathbb{E}\_{i \sim p} [y\_i h(x\_i)]$. Why are WLA and SLA equivalent? **Von Neumann's Minimax Theorem!** \\[ \min\_{p \in \Delta\_n} \max\_{q \in \Delta\_{\mathcal{H}}} M(p, q) = \max\_{q \in \Delta\_{\mathcal{H}}} \min\_{p \in \Delta\_n} M(p, q) \\] * **Left Side (WLA):** The adversary goes first. They pick a distribution $p$. We can always find an $h$ (and thus a trivial $q$) to get margin $\ge \gamma$. * **Right Side (SLA):** We go first. We pick an ensemble $q$. The adversary picks the hardest *single* point $i$ (a pure strategy). The value is still $\ge \gamma$. **Conclusion:** If you can always beat a distribution, you can build an ensemble that beats every point. --- # AdaBoost as Hedge The **AdaBoost** algorithm is simply the **Hedge algorithm** applied to this game. **Algorithm:** 1. **Initialize:** Weights $p\_1(i) = 1/n$ for all $i$. 2. **For $t = 1 \dots T$:** * **Learner:** Finds weak hypothesis $h\_t$ maximizing edge w.r.t $p\_t$. (Best Response) * **Adversary:** Updates weights using Hedge (Multiplicative Weights). 
\\[ p\_{t+1}(i) \propto p\_t(i) \exp(-\eta y\_i h\_t(x\_i)) \\] * *Note: If point $i$ is correctly classified ($y\_i h\_t(x\_i) > 0$), weight decreases. If wrong, weight increases.* 3. **Output:** Final ensemble $\text{sign}(\sum \alpha\_t h\_t)$. *This forces the weak learner to focus on the "hard" examples (the ones with high weights).* --- # Visualizing AdaBoost
*(Figure: Round 1, the first weak learner (with a mistake); Round 2, re-weighting of the misclassified points; the final strong learner combines the rounds.)*
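---
# AdaBoost as Hedge: A Sketch

The loop from the previous slides can be sketched in a few lines of Python. This is a minimal illustration of the game dynamics, not classical AdaBoost: the ensemble uses uniform weights rather than adaptive $\alpha\_t$, and the toy 1-D dataset, stump class, $T$, and $\eta$ are illustrative choices.

```python
import numpy as np

# Toy 1-D dataset: the positives form an interval, so no single
# stump separates the data, but a weighted vote of stumps does.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([-1, 1, 1, -1])
n = len(x)

# Hypothesis class: decision stumps h(x) = s * sign(x - theta).
stumps = [(s, th) for s in (+1, -1)
          for th in (-0.5, 0.5, 1.5, 2.5, 3.5)]

def predict(stump, x):
    s, th = stump
    return s * np.sign(x - th)

T, eta = 500, 0.1
p = np.ones(n) / n        # adversary's distribution over examples
votes = np.zeros(n)       # running sum of the h_t(x_i)

for t in range(T):
    # Learner: best response, the stump with maximum edge under p.
    h = max(stumps, key=lambda st: np.sum(p * y * predict(st, x)))
    votes += predict(h, x)
    # Adversary: Hedge update, upweighting misclassified examples.
    p = p * np.exp(-eta * y * predict(h, x))
    p /= p.sum()

ensemble = np.sign(votes)  # uniform-weight majority vote
print("train accuracy:", np.mean(ensemble == y))
```

Because the best-response stump has edge at least $\gamma$ every round, the Hedge regret bound forces every point's average margin to become positive, exactly as the duality argument predicts.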
--- class: center, middle, inverse # Part 2: Blackwell Approachability ### Optimizing multiple objectives simultaneously --- # The Setup * **Learner:** Chooses action $x \in \mathcal{X}$. * **Adversary:** Chooses action $y \in \mathcal{Y}$. (Assume $\mathcal{X}, \mathcal{Y}$ finite). * **Payoff:** A vector $r(x, y) \in \mathbb{R}^d$. * **Goal:** We want the average payoff $\bar{r}\_T = \frac{1}{T} \sum\_{t=1}^T r(x\_t, y\_t)$ to **approach** a target set $S$. **Target Set $S$:** * Must be **Closed** and **Convex**. * Metric: Euclidean distance $\text{dist}(z, S) = \inf\_{s \in S} \\lVert z - s \\rVert\_2$. * *Note: $L\_2$ is convenient, but results hold for other norms.* --- # Blackwell's Theorem (1956) This is the "Fundamental Theorem" of vector-valued games. We wish we had a simple Minimax theorem: "If for every distribution of the adversary, I can hit $S$, then I can always hit $S$." * In one-shot games, this isn't true for vector payoffs! * But Blackwell showed it **is** true for the long-run average. **Theorem:** If a set $S$ satisfies the **Response-Satisfiability (RS)** condition, then there exists a learning algorithm such that: \\[ \text{dist}(\bar{r}\_T, S) \to 0 \quad \text{as } T \to \infty \\] Specifically, the rate is $O(1/\sqrt{T})$. --- # Satisfiability Conditions When is a set $S$ approachable? Blackwell gave two equivalent conditions. *Notation:* Let $r(p, q) = \mathbb{E}\_{x \sim p, y \sim q} [r(x, y)]$. **1. Response-Satisfiability (RS):** (The "Offense" View) * For any adversary strategy $q$, there exists a response $p$ that lands in $S$. * $\forall q \in \Delta\_{\mathcal{Y}}, \exists p \in \Delta\_{\mathcal{X}} \text{ s.t. } r(p, q) \in S$. **2. Halfspace-Satisfiability (HS):** (The "Defense" View) * For any halfspace $H$ containing $S$, there is a strategy $p$ that forces the payoff into $H$, regardless of $y$. * $\forall H \supseteq S, \exists p \in \Delta\_{\mathcal{X}} \text{ s.t. } \forall y \in \mathcal{Y}, r(p, y) \in H$. 
*Homework Problem: Prove that RS $\iff$ HS!* --- # Blackwell's Algorithm How do we construct the strategy $p\_{t+1}$? 1. **Check:** Calculate average payoff $\bar{r}\_t$. 2. **Project:** If $\bar{r}\_t \notin S$, compute projection $\pi\_t = \Pi\_S(\bar{r}\_t)$. 3. **Error Vector:** Let $a\_t = \bar{r}\_t - \pi\_t$. * This defines a halfspace $H\_t = \{ z : \langle a\_t, z - \pi\_t \rangle \le 0 \}$ containing $S$. 4. **Play:** Choose $p\_{t+1}$ to satisfy **HS** for $H\_t$. * Find $p\_{t+1}$ such that $\forall y, \langle a\_t, r(p\_{t+1}, y) - \pi\_t \rangle \le 0$. * *Why possible?* Because $S$ is approachable (satisfies HS). --- # Geometric Intuition
*(Figure: the average payoff $\bar{r}\_t$ lies outside $S$; its projection onto $S$ is $\pi\_t$, and $H\_t$ is the safe halfspace we steer into.)*
We "steer" the average $\bar{r}\_t$ back towards $S$. --- # Proof Sketch (Potential Function) Define potential $\Phi\_t = \text{dist}(\bar{r}\_t, S)^2 = \\lVert \bar{r}\_t - \pi\_t \\rVert^2$. 1. **Update:** $\bar{r}\_{t+1} = \frac{t}{t+1}\bar{r}\_t + \frac{1}{t+1}r\_{t+1}$. 2. **Bound:** $\Phi\_{t+1} \le \\lVert \bar{r}\_{t+1} - \pi\_t \\rVert^2$ (projecting onto the true $\pi\_{t+1}$ can only decrease the distance). 3. **Expansion:** \\[ \\lVert \bar{r}\_{t+1} - \pi\_t \\rVert^2 \approx \Phi\_t + \frac{2}{t} \langle \bar{r}\_t - \pi\_t, r\_{t+1} - \pi\_t \rangle + O(1/t^2) \\] 4. **Expectation:** The middle term is $\le 0$ in expectation because we chose $p\_{t+1}$ to satisfy HS! \\[ \mathbb{E}[\Phi\_{t+1}] \le \Phi\_t \left( \frac{t}{t+1} \right)^2 + \frac{R^2}{(t+1)^2} \\] 5. **Induction:** Solving the recurrence gives $\mathbb{E}[\Phi\_T] \le \frac{R^2}{T}$, so by Jensen's inequality the expected distance satisfies $\mathbb{E}[\text{dist}(\bar{r}\_T, S)] \le \sqrt{\mathbb{E}[\Phi\_T]} = O(1/\sqrt{T})$. --- # Reduction: No-Regret as Approachability We can re-derive **Hedge** / **Experts** using this framework. * **Payoff Vector:** $r(i, \ell) \in \mathbb{R}^N$. * $i \in \{1 \dots N\}$ is the expert we choose. $\ell \in [0,1]^N$ is the loss vector. * For a mixed strategy $p$, the $j$-th component of the vector is: \\[ [r(p, \ell)]\_j = \underbrace{\langle p, \ell \rangle}\_{\text{My Loss}} - \underbrace{\ell(j)}\_{\text{Expert } j \text{ Loss}} \\] * **Target Set:** $S = \mathbb{R}^N\_{\le 0}$ (The negative orthant). **Why this works:** 1. $\bar{r}\_T \in S \implies [\bar{r}\_T]\_j \le 0$ for all $j$. 2. $\implies \frac{1}{T} \sum (\langle p\_t, \ell\_t \rangle - \ell\_t(j)) \le 0$. 3. $\implies \sum \text{Loss}(\text{Alg}) \le \min\_j \sum \text{Loss}(\text{Expert } j)$. *Check RS:* If the adversary plays $\bar{\ell}$, we play the point mass $p$ on $\arg\min\_j \bar{\ell}(j)$. Then my loss equals the best expert's loss, so every component of $r(p, \bar{\ell})$ is $\le 0$, i.e. $r(p, \bar{\ell}) \in S$. --- class: center, middle, inverse # Part 3: Calibrated Forecasting ### "It's 30% likely to rain." 
--- # The Problem of Calibration A meteorologist says "30% chance of rain" every day for a year. * **Scenario A:** It never rains. (Bad forecast). * **Scenario B:** It rains every day. (Bad forecast). * **Scenario C:** It rains exactly 30% of the days. (Calibrated!) **Definition:** A forecaster is **calibrated** if, among all times they predicted probability $p$, the event occurred with frequency $p$. --- # Reliability Diagrams
*(Reliability diagram: observed frequency plotted against predicted probability, with curves for perfect calibration (the diagonal), an overconfident forecaster, and an underconfident forecaster.)*
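---
# Computing a Reliability Diagram

The points in a reliability diagram are just per-bucket counts and observed frequencies. A minimal sketch (assuming `numpy`), using a synthetic forecaster that is calibrated by construction; the grid, horizon, and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Forecasts on the grid 0.0, 0.1, ..., 1.0; outcomes are drawn so
# the true rain probability equals the forecast (calibrated).
buckets = np.round(np.arange(0.0, 1.01, 0.1), 1)
T = 100_000
preds = rng.choice(buckets, size=T)
outcomes = (rng.random(T) < preds).astype(float)

# Per-bucket count N_j and observed frequency rho_j.
for v in buckets:
    mask = preds == v
    if mask.any():
        print(f"v = {v:.1f}   N = {mask.sum():6d}   "
              f"rho = {outcomes[mask].mean():.3f}")

# Weighted L1 calibration error: sum_j (N_j / T) * |rho_j - v_j|.
err = sum((preds == v).mean() * abs(outcomes[preds == v].mean() - v)
          for v in buckets if (preds == v).any())
print("L1 calibration error:", round(err, 4))
```

For a calibrated forecaster this error is driven only by sampling noise and shrinks as $T$ grows; the question on the following slides is whether we can force it to vanish against adversarially chosen outcomes.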
--- # Defining Calibration Let's discretize predictions into **buckets** $V = \{v\_1, \dots, v\_K\}$ (e.g., $0.0, 0.1, \dots, 1.0$). **Key Quantities at time $T$:** \\[ N\_j^T = \sum\_{t=1}^T \mathbb{I}(p\_t = v\_j), \quad \rho\_j^T = \frac{1}{N\_j^T} \sum\_{t: p\_t = v\_j} y\_t \\] * $N\_j^T$: The number of times we predicted value $v\_j$ up to time $T$. * $\rho\_j^T$: The actual frequency of outcomes when we predicted $v\_j$. **$\epsilon$-Calibration:** We want $\\lvert \rho\_j^T - v\_j \\rvert < \epsilon$ for all buckets where $N\_j^T$ is large. **$L\_1$-Calibration:** We want the weighted average error to vanish: \\[ \sum\_{j=1}^K \frac{N\_j^T}{T} \\lvert \rho\_j^T - v\_j \\rvert \to 0 \\] *(These are roughly equivalent for fine grids).* --- # The Adversary Strikes Back Why is this hard? Imagine a simple deterministic strategy: **Strategy:** "If it rained a lot recently, predict high. Else predict low." * Buckets: Low (0.2), High (0.8). **Adversary Strategy:** 1. Wait for you to predict **Low** (0.2) $\to$ Make it **Rain** ($y=1$). * Error: $\\lvert 1 - 0.2 \\rvert = 0.8$. 2. Wait for you to predict **High** (0.8) $\to$ Make it **Sunny** ($y=0$). * Error: $\\lvert 0 - 0.8 \\rvert = 0.8$. **Result:** You are 100% wrong. Your calibration score is terrible. **Solution:** You must randomize your forecasts! --- # Calibration as a Game We can use Blackwell's Theorem to solve this! 1. **Learner:** Choose a bucket index $i \in \{1, \dots, K\}$. 2. **Payoff Vector:** $r(i, y) \in \mathbb{R}^K$ where $j$-th component is: \\[ [r(i, y)]\_j = \mathbb{I}(i = j) (v\_j - y) \\] * *Note: In each round, only one component is non-zero.* * *The average payoff $\bar{r}\_T$ has $j$-th component $\frac{1}{T} N\_j^T (v\_j - \rho\_j^T)$.* 3. **Target Set $S$:** The $L\_1$ ball of radius $\epsilon$. 
\\[ S = \\{ z \in \mathbb{R}^K : \sum\_{j=1}^K \\lvert z\_j \\rvert \le \epsilon \\} \\] If $\bar{r}\_T \to S$, then $\sum \frac{1}{T} \\lvert N\_j^T (v\_j - \rho\_j^T) \\rvert \le \epsilon$. **Calibrated!** --- # Proof of Approachability Does this game satisfy the **RS Condition**? * Adversary plays $y \in [0,1]$ (prob of rain). * Can we find a distribution $p \in \Delta\_K$ s.t. $\mathbb{E}\_{i \sim p}[r(i, y)] \in S$? **Strategy:** 1. Look at adversary's $y$. 2. Find the bucket $v\_{i^\star}$ closest to $y$. (Distance $\le \epsilon$). 3. Choose $p$ to be a point mass on bucket $i^\star$. **Check:** The expected payoff $\mathbb{E}[r]$ has only one non-zero component: \\[ [\mathbb{E}[r]]\_{i^\star} = 1 \cdot (v\_{i^\star} - y) \\] \\[ \\lVert \\mathbb{E}[r] \\rVert\_1 = \\lvert v\_{i^\star} - y \\rvert \le \epsilon \\] Yes! The expected vector is inside the $L\_1$ ball $S$. --- # Summary **1. Boosting:** * Weak Learning $\iff$ Strong Learning. * Proven via Minimax Duality. * AdaBoost $\approx$ Hedge. **2. Blackwell Approachability:** * Vector-valued Minimax. * Target set $S$ is approachable if we can "steer" towards it. * RS $\iff$ HS. **3. Calibration:** * Can be cast as an approachability game. * **Foster-Vohra Algorithm:** A randomized forecasting strategy that guarantees calibration against any adversary. (See Abernethy, Bartlett, Hazan '11 for the full reduction). --- class: center, middle, inverse # Thanks!