Lecture 1: Introduction to Sequence Prediction
Date: January 16, 2026
Topic: Course Intro, Online Learning Basics, Exponential Weights, and the Perceptron
1. Course Overview and Policies
Welcome to CS 8803: Sequence Prediction
This course explores the theoretical foundations of sequential prediction and decision making. We bridge the gap between classical information theory (Kolmogorov Complexity, Solomonoff Induction) and modern sequence models (Transformers, Diffusion).
Logistics
- Lectures: Fridays (Two sessions per day).
- Grading Split:
- Problem Sets (20%): 2 major problem sets (one in Part I, one in Part II).
- Exams (40%): Two midterms (Feb 20 and April 3).
- Final Project (40%): Report and presentation in the final 3 weeks.
- Key Dates:
- Exam 1: Feb 20
- Exam 2: April 3
- Student Presentations: April 10, 17, 24.
Course Philosophy
We focus on Online Learning and Algorithmic Information Theory. The central theme is Compression as Prediction: any regularity in data that allows for compression can be exploited for better prediction.
Historical Context & Motivation
The study of sequence prediction sits at the intersection of several profound intellectual movements of the 20th century:
- The Information Age (1940s): Claude Shannon’s 1948 work established that information can be quantified and that there are fundamental limits to compression (Entropy).
- The Gambling Connection (1950s): J.L. Kelly Jr. (1956) showed that the rate of information is directly proportional to the growth of wealth in sequential betting, linking information theory to sequential decision-making.
- Algorithmic Foundations (1960s): Solomonoff (1960), Kolmogorov (1965), and Chaitin (1969) independently formalized the idea of “Universal Prediction”—the notion that the best predictor is the one that finds the shortest program to explain the sequence (Occam’s Razor).
- Sequential Learning (1980s-90s): The emergence of the “Agnostic” and “Adversarial” frameworks. We moved from assuming data comes from a fixed distribution (Batch Learning) to competing with the best strategy in hindsight (No-Regret Learning). Key milestones include Littlestone & Warmuth’s Weighted Majority (1994) and Freund & Schapire’s Hedge (1997).
- Modern State & Context (1990s-2010s): The transition from fixed-state models like HMMs (Baum et al. late 60s) to long-range memory with LSTMs (Hochreiter & Schmidhuber 1997) and eventually the dynamic “Attention” mechanisms in Transformers (Vaswani et al. 2017).
Today, we see these ideas merge in Large Language Models, which essentially act as practical approximations to the universal predictors envisioned by Solomonoff and Kolmogorov.
2. Introduction to Online Learning
In the classical “Batch” learning setting, we assume data is sampled i.i.d. from some distribution. In Online Learning, we make no such assumptions. The data can be adversarial, and we evaluate our performance sequentially.
The Protocol
For $t = 1, 2, \dots, T$:
- Learner receives input $x_t \in \mathcal{X}$.
- Learner predicts $\hat{y}_t \in \mathcal{Y}$.
- Environment reveals true label $y_t \in \mathcal{Y}$.
- Learner suffers loss $\ell(\hat{y}_t, y_t)$.
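The protocol above can be sketched as a short loop. This is a minimal illustration, not from the lecture: the `predict`/`update` callables and the toy constant learner are hypothetical stand-ins for an actual learning algorithm.

```python
def run_protocol(predict, update, inputs, labels):
    """Run the online protocol under 0-1 loss; returns total mistakes."""
    mistakes = 0
    for x_t, y_t in zip(inputs, labels):
        y_hat = predict(x_t)           # learner commits before seeing y_t
        mistakes += int(y_hat != y_t)  # environment reveals y_t; 0-1 loss
        update(x_t, y_t)               # learner may adapt for later rounds
    return mistakes

# A learner that always predicts 1, against an alternating label sequence.
m = run_protocol(lambda x: 1, lambda x, y: None, range(4), [1, 0, 1, 0])
# m == 2: wrong exactly on the two 0-labels
```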
Sequential Prediction with 0-1 Loss
Consider a binary setting where $\mathcal{Y} = \{0, 1\}$. The loss is the 0-1 loss: $\ell(\hat{y}, y) = \mathbb{I}(\hat{y} \neq y)$.
The Halving Algorithm (Realizable Case)
Suppose we have a set of $N$ experts $\{f_1, \dots, f_N\}$, and we assume there exists at least one “perfect” expert $i^\star$ such that $f_{i^\star, t} = y_t$ for all $t$.
Algorithm:
- Maintain a set of “active” experts $S_t$, initialized to $S_1 = \{1, \dots, N\}$.
- At each step $t$:
- Predict $\hat{y}_t$ to be the majority vote of experts in $S_t$.
- After observing $y_t$, set $S_{t+1} = \{i \in S_t : f_{i,t} = y_t\}$.
Theorem: The total number of mistakes $M$ made by the Halving Algorithm is at most $\log_2 N$.
Proof: On each round where the learner makes a mistake, the majority vote was wrong, so at least half of the experts in $S_t$ must have been wrong and are eliminated. Thus, $\lvert S_{t+1} \rvert \le \frac{1}{2} \lvert S_t \rvert$ on every mistake round. Since the perfect expert $i^\star$ is never eliminated, after $M$ mistakes we have $1 \le N(1/2)^M$. Taking the log gives $M \le \log_2 N$.
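The Halving Algorithm is only a few lines of code. A minimal sketch, assuming binary predictions indexed as `expert_preds[t][i]`; breaking majority ties toward 1 is an arbitrary choice.

```python
def halving(expert_preds, labels):
    """Mistake count of the Halving Algorithm in the realizable case.
    expert_preds[t][i] is expert i's binary prediction at round t."""
    active = set(range(len(expert_preds[0])))
    mistakes = 0
    for preds, y in zip(expert_preds, labels):
        votes_for_1 = sum(preds[i] for i in active)
        y_hat = 1 if 2 * votes_for_1 >= len(active) else 0   # majority vote
        mistakes += int(y_hat != y)
        active = {i for i in active if preds[i] == y}        # drop wrong experts
    return mistakes

# Four experts, one of them (expert 0) perfect: mistakes <= log2(4) = 2.
labels = [1, 0, 1]
preds = [[1, 0, 1, 0], [0, 0, 1, 1], [1, 1, 0, 0]]
m = halving(preds, labels)
```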
Regret
When no expert is perfect, we move from “mistake bounds” to Regret, defined as: \(R_T = \sum_{t=1}^T \ell(\hat{y}_t, y_t) - \min_{i \in \{1, \dots, N\}} \sum_{t=1}^T \ell(f_{i,t}, y_t)\)
Here we’re comparing the loss of the algorithm to the loss of the best expert in hindsight.
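A toy computation of the regret, where the per-round losses are made-up numbers purely for illustration:

```python
# Learner's 0-1 losses over four rounds, and two experts' losses per round.
alg_losses = [1, 0, 1, 1]
expert_losses = [[0, 1], [1, 0], [0, 1], [0, 0]]  # expert_losses[t][i]

# Best expert in hindsight: smallest cumulative loss over the rounds.
best_in_hindsight = min(sum(col) for col in zip(*expert_losses))
regret = sum(alg_losses) - best_in_hindsight
# Expert 0 suffers total loss 1, so the regret is 3 - 1 = 2.
```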
3. The Exponential Weights Algorithm (EWA)
To handle the non-realizable case, we use a randomized strategy that weights experts based on their historical performance.
The Algorithm
Given $N$ experts and a learning rate $\eta > 0$:
- Initialize weights $w_{i,1} = 1$ for all $i = 1, \dots, N$.
- For each round $t = 1, \dots, T$:
- Maintain a distribution $p_t$ where $p_{i,t} = \frac{w_{i,t}}{\Phi_t}$ and $\Phi_t = \sum_{j=1}^N w_{j,t}$.
- Sample an expert $I_t \sim p_t$ and predict $\hat{y}_t = f_{I_t, t}$.
- Observe expert losses $\ell_{i,t} \in [0, 1]$.
- Update weights: $w_{i,t+1} = w_{i,t} e^{-\eta \ell_{i,t}}$.
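The update rule translates directly into code. A minimal sketch that tracks the cumulative expected loss $\sum_t \langle p_t, \ell_t \rangle$ instead of sampling an expert; the two-expert loss sequence in the usage example is illustrative.

```python
import math

def exp_weights(expert_losses, eta):
    """Exponential Weights over expert_losses[t][i] in [0, 1].
    Returns the learner's cumulative expected loss sum_t <p_t, l_t>."""
    w = [1.0] * len(expert_losses[0])
    total = 0.0
    for losses in expert_losses:
        Phi = sum(w)
        p = [wi / Phi for wi in w]                          # p_{i,t} = w_{i,t} / Phi_t
        total += sum(pi * li for pi, li in zip(p, losses))  # expected loss this round
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]  # weight update
    return total

# Two experts: expert 0 loses on odd rounds only, expert 1 always loses.
T = 100
losses = [[t % 2, 1.0] for t in range(T)]
eta = math.sqrt(8 * math.log(2) / T)
alg_loss = exp_weights(losses, eta)
best = min(sum(l[i] for l in losses) for i in range(2))  # best expert's loss: 50
# The gap alg_loss - best stays below sqrt((T/2) ln 2), about 5.9 here.
```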
Regret Bound for EWA
Theorem: For any sequence of losses $\ell_{i,t} \in [0, 1]$, the expected regret of EWA with $\eta = \sqrt{\frac{8 \ln N}{T}}$ satisfies: \(E[R_T] \le \sqrt{\frac{T}{2} \ln N}\)
Proof Sketch:
- Consider the potential $\Phi_t = \sum_i w_{i,t}$.
- Analyze the progress: \(\frac{\Phi_{t+1}}{\Phi_t} = \frac{\sum_i w_{i,t} e^{-\eta \ell_{i,t}}}{\Phi_t} = \sum_i p_{i,t} e^{-\eta \ell_{i,t}}\)
- Using Hoeffding’s Lemma ($E[e^{-\eta X}] \le e^{-\eta E[X] + \eta^2/8}$ for $X \in [0, 1]$): \(\ln \left( \frac{\Phi_{t+1}}{\Phi_t} \right) \le -\eta \sum_{i=1}^N p_{i,t} \ell_{i,t} + \frac{\eta^2}{8} = -\eta \langle p_t, \ell_t \rangle + \frac{\eta^2}{8}\)
- Summing from $t=1$ to $T$: \(\ln \Phi_{T+1} - \ln \Phi_1 \le -\eta \sum_{t=1}^T \langle p_t, \ell_t \rangle + \frac{\eta^2 T}{8}\)
- Since $\ln \Phi_{T+1} \ge \ln w_{i,T+1} = -\eta \sum_{t=1}^T \ell_{i,t}$ for any $i$, and $\ln \Phi_1 = \ln N$: \(-\eta \sum_{t=1}^T \ell_{i,t} - \ln N \le -\eta \sum_{t=1}^T \langle p_t, \ell_t \rangle + \frac{\eta^2 T}{8}\)
- Rearranging and dividing by $\eta$ gives, for every expert $i$: \(\sum_{t=1}^T \langle p_t, \ell_t \rangle - \sum_{t=1}^T \ell_{i,t} \le \frac{\ln N}{\eta} + \frac{\eta T}{8}\)
- The left side, maximized over $i$, is exactly the expected regret $E[R_T] = \sum_t \langle p_t, \ell_t \rangle - \min_i \sum_t \ell_{i,t}$. Plugging in $\eta = \sqrt{\frac{8 \ln N}{T}}$ balances the two terms and yields $E[R_T] \le \sqrt{\frac{T}{2} \ln N}$.
Digression: Hoeffding’s Lemma
In the proof above, we used a critical inequality to bound the potential ratio $\sum p_i e^{-\eta \ell_i}$.
Lemma (Hoeffding): Let $X \in [0, 1]$ be a random variable with $E[X] = p$. Then for any $\lambda \in \mathbb{R}$: \(E[e^{\lambda X}] \le e^{\lambda p + \frac{\lambda^2}{8}}\)
Proof Sketch:
- Convexity: Since $f(x) = e^{\lambda x}$ is convex, we can bound it by the line segment connecting $(0, 1)$ and $(1, e^\lambda)$. For any $x \in [0, 1]$: \(e^{\lambda x} \le (1-x) e^0 + x e^\lambda = 1 - x + x e^\lambda\)
- Expectation: Taking the expectation of both sides: \(E[e^{\lambda X}] \le 1 - p + p e^\lambda\)
- Logarithmic Bound: To complete the proof, we use the following inequality for $p \in [0, 1]$ and $\lambda \in \mathbb{R}$: \(\ln(1 - p + p e^\lambda) - \lambda p \le \frac{\lambda^2}{8}\)
- The “Work”: This log-bound can be proven by letting $L(\lambda) = \ln(1 - p + p e^\lambda) - \lambda p$. Note that $L(0) = 0$ and $L'(0) = 0$. By Taylor’s Theorem, $L(\lambda) = \frac{\lambda^2}{2} L''(\xi)$ for some $\xi$ between $0$ and $\lambda$. One can show that $L''(\xi) = \frac{p(1-p)e^\xi}{(1-p+pe^\xi)^2} \le \frac{1}{4}$, leading to the $\lambda^2/8$ bound.
- Conclusion: $E[e^{\lambda X}] \le e^{\ln(1 - p + p e^\lambda)} \le e^{\lambda p + \lambda^2/8}$.
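For a Bernoulli variable with mean $p$, the moment generating function is exactly $E[e^{\lambda X}] = 1 - p + p e^\lambda$, so the lemma can be sanity-checked numerically on a grid (the grid resolution below is arbitrary):

```python
import math

# Check that 1 - p + p*e^lambda never exceeds exp(lambda*p + lambda^2/8)
# over a grid of p in [0, 1] and lambda in [-5, 5].
worst_gap = 0.0
for p in [i / 20 for i in range(21)]:
    for lam in [l / 10 for l in range(-50, 51)]:
        mgf = 1 - p + p * math.exp(lam)                 # exact Bernoulli MGF
        bound = math.exp(lam * p + lam * lam / 8)       # Hoeffding bound
        worst_gap = max(worst_gap, mgf - bound)
# worst_gap <= 0 up to floating-point noise: the bound holds on the whole grid
```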
4. Historical Context
The algorithms discussed today form the bedrock of Online Learning theory.
The Perceptron (1957-1962)
- Origin: Frank Rosenblatt (1957) introduced the Perceptron as a model for biological learning at the Cornell Aeronautical Laboratory.
- Mistake Bound: Albert Novikoff (1962) provided the first formal proof of the mistake bound (Novikoff’s Theorem), establishing that the algorithm converges in a finite number of steps for linearly separable data.
- Impact: While Marvin Minsky and Seymour Papert (1969) later highlighted its limitations with XOR and non-linear data, the Perceptron remains the simplest example of an online “kernelizable” algorithm.
The Halving Algorithm (1972-1988)
- Origin: The idea was developed across several decades. Early precursors appeared in the work of Barzdin and Freivald (1972).
- Formalization: It was independently formalized and popularized in the late 1980s by Dana Angluin (1988) and Nick Littlestone (1988). Littlestone’s work on the Winnow algorithm and mistake-bound learning models was particularly influential.
Exponential Weights (1990-1994)
- Origin: The transition from the realizable case (Halving) to the agnostic/noisy case led to the development of “soft” weighting schemes.
- Key Results:
- Vladimir Vovk (1990): Introduced the Aggregating Algorithm, laying the groundwork for competitive online prediction.
- Littlestone & Warmuth (1994): Published the “Weighted Majority Algorithm” paper, which popularized the multiplicative weight update method.
- Hedge (1997): Yoav Freund and Robert Schapire introduced the “Hedge” algorithm as a generalization of weighted majority for multi-outcome games, which is essentially the version of EWA/Hedge used today.
- The Workhorse: EWA is now recognized as a special case of Follow-the-Regularized-Leader (FTRL) and Online Mirror Descent (OMD) with entropic regularization.
Multi-Armed Bandits (1933-2002)
The bandit problem introduces the fundamental “Exploration vs. Exploitation” tradeoff, which we will see throughout this course.
- Origin: William R. Thompson (1933) first described the problem in the context of clinical trials (now known as “Thompson Sampling”).
- The “Bandit” Name: In the 1950s, the problem was popularized as the “one-armed bandit” (slot machine) by researchers like Frederick Mosteller and Robert Bush.
- Gittins Index (1974): John Gittins solved the Bayesian version of the problem for discounted rewards, showing that an optimal index strategy exists.
- Regret Bounds (1985): Lai and Robbins provided the first lower bounds on regret, showing that it must grow at least logarithmically ($\ln T$) for most distributions.
- The Adversarial Turn (1995-2002): Auer, Cesa-Bianchi, Freund, and Schapire introduced the “adversarial bandit” framework (EXP3), which directly connects the expert advice algorithms like EWA/Hedge to the bandit setting by only observing the loss of the chosen arm.