Lecture 1: Introduction to Sequence Prediction
Date: January 16, 2026
Topic: Course Intro, Online Learning Basics, Exponential Weights, and the Perceptron
1. Course Overview and Policies
Welcome to CS 8803: Sequence Prediction
This course explores the theoretical foundations of sequential prediction and decision making. We bridge the gap between classical information theory (Kolmogorov Complexity, Solomonoff Induction) and modern sequence models (Transformers, Diffusion).
Logistics
- Lectures: Fridays (Two sessions per day).
- Grading Split:
- Problem Sets (20%): 2 major problem sets (one in Part I, one in Part II).
- Exams (40%): Two midterms (Feb 20 and April 3).
- Final Project (40%): Report and presentation in the final 3 weeks.
- Key Dates:
- Exam 1: Feb 20
- Exam 2: April 3
- Student Presentations: April 10, 17, 24.
Course Philosophy
We focus on Online Learning and Algorithmic Information Theory. The central theme is Compression as Prediction: any regularity in data that allows for compression can be exploited for better prediction.
Historical Context & Motivation
The study of sequence prediction sits at the intersection of several profound intellectual movements of the 20th century:
- The Information Age (1940s): Claude Shannon’s 1948 work established that information can be quantified and that there are fundamental limits to compression (Entropy).
- The Gambling Connection (1950s): J.L. Kelly Jr. (1956) showed that the rate of information is directly proportional to the growth of wealth in sequential betting, linking information theory to sequential decision-making.
- Algorithmic Foundations (1960s): Solomonoff (1960), Kolmogorov (1965), and Chaitin (1969) independently formalized the idea of “Universal Prediction”—the notion that the best predictor is the one that finds the shortest program to explain the sequence (Occam’s Razor).
- Sequential Learning (1980s-90s): The emergence of the “Agnostic” and “Adversarial” frameworks. We moved from assuming data comes from a fixed distribution (Batch Learning) to competing with the best strategy in hindsight (No-Regret Learning). Key milestones include Littlestone & Warmuth’s Weighted Majority (1994) and Freund & Schapire’s Hedge (1997).
- Modern State & Context (1990s-2010s): The transition from fixed-state models like HMMs (Baum et al. late 60s) to long-range memory with LSTMs (Hochreiter & Schmidhuber 1997) and eventually the dynamic “Attention” mechanisms in Transformers (Vaswani et al. 2017).
Today, we see these ideas merge in Large Language Models, which essentially act as practical approximations to the universal predictors envisioned by Solomonoff and Kolmogorov.
2. Introduction to Online Learning
In the classical “Batch” learning setting, we assume data is sampled i.i.d. from some distribution. In Online Learning, we make no such assumptions. The data can be adversarial, and we evaluate our performance sequentially.
The Protocol
For $t = 1, 2, \dots, T$:
- Learner receives input $x_t \in \mathcal{X}$.
- Learner predicts $\hat{y}_t \in \mathcal{Y}$.
- Environment reveals true label $y_t \in \mathcal{Y}$.
- Learner suffers loss $\ell(\hat{y}_t, y_t)$.
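The protocol above can be sketched as a short loop. This is a minimal illustration, not from the lecture: the `predict`/`update` callables and the toy constant learner are hypothetical stand-ins for an actual learning algorithm.

```python
def run_protocol(predict, update, inputs, labels):
    """Run the online protocol under 0-1 loss; returns total mistakes."""
    mistakes = 0
    for x_t, y_t in zip(inputs, labels):
        y_hat = predict(x_t)           # learner commits before seeing y_t
        mistakes += int(y_hat != y_t)  # environment reveals y_t; 0-1 loss
        update(x_t, y_t)               # learner may adapt for later rounds
    return mistakes

# A learner that always predicts 1, against an alternating label sequence.
m = run_protocol(lambda x: 1, lambda x, y: None, range(4), [1, 0, 1, 0])
# m == 2: wrong exactly on the two 0-labels
```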
Sequential Prediction with 0-1 Loss
Consider a binary setting where $\mathcal{Y} = \{0, 1\}$. The loss is the 0-1 loss: $\ell(\hat{y}, y) = \mathbb{I}(\hat{y} \neq y)$.
The Halving Algorithm (Realizable Case)
Suppose we have a set of $N$ experts $\{f_1, \dots, f_N\}$, and we assume there exists at least one “perfect” expert $i^\star$ such that $f_{i^\star, t} = y_t$ for all $t$.
Algorithm:
- Maintain a set of “active” experts $S_t$, initialized to $S_1 = \{1, \dots, N\}$.
- At each step $t$:
- Predict $\hat{y}_t$ to be the majority vote of experts in $S_t$.
- After observing $y_t$, set $S_{t+1} = \{i \in S_t : f_{i,t} = y_t\}$.
Theorem: The total number of mistakes $M$ made by the Halving Algorithm is at most $\log_2 N$.
Proof: On each round where the learner makes a mistake, the majority vote was wrong, so at least half of the experts in $S_t$ must have been wrong and are eliminated. Thus, $\lvert S_{t+1} \rvert \le \frac{1}{2} \lvert S_t \rvert$ on every mistake round. Since the perfect expert $i^\star$ is never eliminated, after $M$ mistakes we have $1 \le N(1/2)^M$. Taking the log gives $M \le \log_2 N$.
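The Halving Algorithm is only a few lines of code. A minimal sketch, assuming binary predictions indexed as `expert_preds[t][i]`; breaking majority ties toward 1 is an arbitrary choice.

```python
def halving(expert_preds, labels):
    """Mistake count of the Halving Algorithm in the realizable case.
    expert_preds[t][i] is expert i's binary prediction at round t."""
    active = set(range(len(expert_preds[0])))
    mistakes = 0
    for preds, y in zip(expert_preds, labels):
        votes_for_1 = sum(preds[i] for i in active)
        y_hat = 1 if 2 * votes_for_1 >= len(active) else 0   # majority vote
        mistakes += int(y_hat != y)
        active = {i for i in active if preds[i] == y}        # drop wrong experts
    return mistakes

# Four experts, one of them (expert 0) perfect: mistakes <= log2(4) = 2.
labels = [1, 0, 1]
preds = [[1, 0, 1, 0], [0, 0, 1, 1], [1, 1, 0, 0]]
m = halving(preds, labels)
```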
Regret
When no expert is perfect, we move from “mistake bounds” to Regret, defined as: \(R_T = \sum_{t=1}^T \ell(\hat{y}_t, y_t) - \min_{i \in \{1, \dots, N\}} \sum_{t=1}^T \ell(f_{i,t}, y_t)\)
Here we’re comparing the loss of the algorithm to the loss of the best expert in hindsight.
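A toy computation of the regret, where the per-round losses are made-up numbers purely for illustration:

```python
# Learner's 0-1 losses over four rounds, and two experts' losses per round.
alg_losses = [1, 0, 1, 1]
expert_losses = [[0, 1], [1, 0], [0, 1], [0, 0]]  # expert_losses[t][i]

# Best expert in hindsight: smallest cumulative loss over the rounds.
best_in_hindsight = min(sum(col) for col in zip(*expert_losses))
regret = sum(alg_losses) - best_in_hindsight
# Expert 0 suffers total loss 1, so the regret is 3 - 1 = 2.
```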
3. The Exponential Weights Algorithm (EWA)
To handle the non-realizable case, we use a randomized strategy that weights experts based on their historical performance.
The Algorithm
Given $N$ experts and a learning rate $\eta > 0$:
- Initialize weights $w_{i,1} = 1$ for all $i = 1, \dots, N$.
- For each round $t = 1, \dots, T$:
- Maintain a distribution $p_t$ where $p_{i,t} = \frac{w_{i,t}}{\Phi_t}$ and $\Phi_t = \sum_{j=1}^N w_{j,t}$.
- Sample an expert $I_t \sim p_t$ and predict $\hat{y}_t = f_{I_t, t}$.
- Observe expert losses $\ell_{i,t} \in [0, 1]$.
- Update weights: $w_{i,t+1} = w_{i,t} e^{-\eta \ell_{i,t}}$.
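The update rule translates directly into code. A minimal sketch that tracks the cumulative expected loss $\sum_t \langle p_t, \ell_t \rangle$ instead of sampling an expert; the two-expert loss sequence in the usage example is illustrative.

```python
import math

def exp_weights(expert_losses, eta):
    """Exponential Weights over expert_losses[t][i] in [0, 1].
    Returns the learner's cumulative expected loss sum_t <p_t, l_t>."""
    w = [1.0] * len(expert_losses[0])
    total = 0.0
    for losses in expert_losses:
        Phi = sum(w)
        p = [wi / Phi for wi in w]                          # p_{i,t} = w_{i,t} / Phi_t
        total += sum(pi * li for pi, li in zip(p, losses))  # expected loss this round
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]  # weight update
    return total

# Two experts: expert 0 loses on odd rounds only, expert 1 always loses.
T = 100
losses = [[t % 2, 1.0] for t in range(T)]
eta = math.sqrt(8 * math.log(2) / T)
alg_loss = exp_weights(losses, eta)
best = min(sum(l[i] for l in losses) for i in range(2))  # best expert's loss: 50
# The gap alg_loss - best stays below sqrt((T/2) ln 2), about 5.9 here.
```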
Regret Bound for EWA
Theorem: For any sequence of losses $\ell_{i,t} \in [0, 1]$, the expected regret of EWA with $\eta = \sqrt{\frac{8 \ln N}{T}}$ satisfies: \(E[R_T] \le \sqrt{\frac{T}{2} \ln N}\)
Proof Sketch:
- Consider the potential $\Phi_t = \sum_i w_{i,t}$.
- Analyze the progress: \(\frac{\Phi_{t+1}}{\Phi_t} = \frac{\sum_i w_{i,t} e^{-\eta \ell_{i,t}}}{\Phi_t} = \sum_i p_{i,t} e^{-\eta \ell_{i,t}}\)
- Using Hoeffding’s Lemma ($E[e^{-\eta X}] \le e^{-\eta E[X] + \eta^2/8}$ for $X \in [0, 1]$): \(\ln \left( \frac{\Phi_{t+1}}{\Phi_t} \right) \le -\eta \sum_{i=1}^N p_{i,t} \ell_{i,t} + \frac{\eta^2}{8} = -\eta \langle p_t, \ell_t \rangle + \frac{\eta^2}{8}\)
- Summing from $t=1$ to $T$: \(\ln \Phi_{T+1} - \ln \Phi_1 \le -\eta \sum_{t=1}^T \langle p_t, \ell_t \rangle + \frac{\eta^2 T}{8}\)
- Since $\ln \Phi_{T+1} \ge \ln w_{i,T+1} = -\eta \sum_{t=1}^T \ell_{i,t}$ for any $i$, and $\ln \Phi_1 = \ln N$: \(-\eta \sum_{t=1}^T \ell_{i,t} - \ln N \le -\eta \sum_{t=1}^T \langle p_t, \ell_t \rangle + \frac{\eta^2 T}{8}\)
- Rearranging and dividing by $\eta$ gives, for every expert $i$: \(\sum_{t=1}^T \langle p_t, \ell_t \rangle - \sum_{t=1}^T \ell_{i,t} \le \frac{\ln N}{\eta} + \frac{\eta T}{8}\)
- The left side, maximized over $i$, is exactly the expected regret $E[R_T] = \sum_t \langle p_t, \ell_t \rangle - \min_i \sum_t \ell_{i,t}$. Plugging in $\eta = \sqrt{\frac{8 \ln N}{T}}$ balances the two terms and yields $E[R_T] \le \sqrt{\frac{T}{2} \ln N}$.
Digression: Hoeffding’s Lemma
In the proof above, we used a critical inequality to bound the potential ratio $\sum p_i e^{-\eta \ell_i}$.
Lemma (Hoeffding): Let $X \in [0, 1]$ be a random variable with $E[X] = p$. Then for any $\lambda \in \mathbb{R}$: \(E[e^{\lambda X}] \le e^{\lambda p + \frac{\lambda^2}{8}}\)
Proof Sketch:
- Convexity: Since $f(x) = e^{\lambda x}$ is convex, we can bound it by the line segment connecting $(0, 1)$ and $(1, e^\lambda)$. For any $x \in [0, 1]$: \(e^{\lambda x} \le (1-x) e^0 + x e^\lambda = 1 - x + x e^\lambda\)
- Expectation: Taking the expectation of both sides: \(E[e^{\lambda X}] \le 1 - p + p e^\lambda\)
- Logarithmic Bound: To complete the proof, we use the following inequality for $p \in [0, 1]$ and $\lambda \in \mathbb{R}$: \(\ln(1 - p + p e^\lambda) - \lambda p \le \frac{\lambda^2}{8}\)
- The “Work”: This log-bound can be proven by letting $L(\lambda) = \ln(1 - p + p e^\lambda) - \lambda p$. Note that $L(0) = 0$ and $L'(0) = 0$. By Taylor’s Theorem, $L(\lambda) = \frac{\lambda^2}{2} L''(\xi)$ for some $\xi$ between $0$ and $\lambda$. One can show that $L''(\xi) = \frac{p(1-p)e^\xi}{(1-p+pe^\xi)^2} \le \frac{1}{4}$, leading to the $\lambda^2/8$ bound.
- Conclusion: $E[e^{\lambda X}] \le e^{\ln(1 - p + p e^\lambda)} \le e^{\lambda p + \lambda^2/8}$.
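For a Bernoulli variable with mean $p$, the moment generating function is exactly $E[e^{\lambda X}] = 1 - p + p e^\lambda$, so the lemma can be sanity-checked numerically on a grid (the grid resolution below is arbitrary):

```python
import math

# Check that 1 - p + p*e^lambda never exceeds exp(lambda*p + lambda^2/8)
# over a grid of p in [0, 1] and lambda in [-5, 5].
worst_gap = 0.0
for p in [i / 20 for i in range(21)]:
    for lam in [l / 10 for l in range(-50, 51)]:
        mgf = 1 - p + p * math.exp(lam)                 # exact Bernoulli MGF
        bound = math.exp(lam * p + lam * lam / 8)       # Hoeffding bound
        worst_gap = max(worst_gap, mgf - bound)
# worst_gap <= 0 up to floating-point noise: the bound holds on the whole grid
```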
4. Historical Context
The algorithms discussed today form the bedrock of Online Learning theory.
The Perceptron (1957-1962)
- Origin: Frank Rosenblatt (1957) introduced the Perceptron as a model for biological learning at the Cornell Aeronautical Laboratory.
- Mistake Bound: Albert Novikoff (1962) provided the first formal proof of the mistake bound (Novikoff’s Theorem), establishing that the algorithm converges in a finite number of steps for linearly separable data.
- Impact: While Marvin Minsky and Seymour Papert (1969) later highlighted its limitations with XOR and non-linear data, the Perceptron remains the simplest example of an online “kernelizable” algorithm.
The Halving Algorithm (1972-1988)
- Origin: The idea was developed across several decades. Early precursors appeared in the work of Barzdin and Freivald (1972).
- Formalization: It was independently formalized and popularized in the late 1980s by Dana Angluin (1988) and Nick Littlestone (1988). Littlestone’s work on the Winnow algorithm and mistake-bound learning models was particularly influential.
Exponential Weights (1990-1994)
- Origin: The transition from the realizable case (Halving) to the agnostic/noisy case led to the development of “soft” weighting schemes.
- Key Results:
- Vladimir Vovk (1990): Introduced the Aggregating Algorithm, laying the groundwork for competitive online prediction.
- Littlestone & Warmuth (1994): Published the “Weighted Majority Algorithm” paper, which popularized the multiplicative weight update method.
- Hedge (1997): Yoav Freund and Robert Schapire introduced the “Hedge” algorithm as a generalization of weighted majority for multi-outcome games, which is essentially the version of EWA/Hedge used today.
- The Workhorse: EWA is now recognized as a special case of Follow-the-Regularized-Leader (FTRL) and Online Mirror Descent (OMD) with entropic regularization.
Multi-Armed Bandits (1933-2002)
The bandit problem introduces the fundamental “Exploration vs. Exploitation” tradeoff, which we will see throughout this course.
- Origin: William R. Thompson (1933) first described the problem in the context of clinical trials (now known as “Thompson Sampling”).
- The “Bandit” Name: In the 1950s, the problem was popularized as the “one-armed bandit” (slot machine) by researchers like Frederick Mosteller and Robert Bush.
- Gittins Index (1974): John Gittins solved the Bayesian version of the problem for discounted rewards, showing that an optimal index strategy exists.
- Regret Bounds (1985): Lai and Robbins provided the first lower bounds on regret, showing that it must grow at least logarithmically ($\ln T$) for most distributions.
- The Adversarial Turn (1995-2002): Auer, Cesa-Bianchi, Freund, and Schapire introduced the “adversarial bandit” framework (EXP3), which directly connects the expert advice algorithms like EWA/Hedge to the bandit setting by only observing the loss of the chosen arm.