CS 8803 Sequence Prediction

Course materials for CS 8803 Sequence Prediction, Spring 2026 at Georgia Tech.

Lecture 7: The Grand Unification & Limits of Predictability

1. Introduction

In previous lectures, we explored distinct frameworks: online learning with expert advice (Adversarial), Kelly gambling (Finance), Shannon compression (Information Theory), and Solomonoff induction (Computability).

Today, we establish a “Grand Unification.” We will show that a probability distribution is fundamentally a coding strategy, that adversarial online learning is mathematically identical to Bayesian inference, and that the “cost” of not knowing the true environment is always measured by the KL Divergence. Finally, we will formalize the ultimate limits of sequence prediction via the Entropy Rate, setting the stage for how practical models (from Markov Chains to Transformers) attempt to reach this theoretical limit.

2. The Epistemology of Probability and Coding

What is a Probability Distribution?

In the information-theoretic view, a probability distribution is not a metaphysical property of the universe; it is simply a coding strategy or a betting strategy. This shift is grounded in data compression.

The Kraft Inequality

To understand why a probability distribution is a code, we must understand the limits of binary encoding.

Prefix-Free Codes: A code is instantaneous if no codeword is a prefix of any other.

Theorem (Kraft’s Inequality): For any prefix-free code over a binary alphabet with codeword lengths $\ell_1, \ell_2, \dots, \ell_m$, the following inequality holds:

\[\sum_{i=1}^m 2^{-\ell_i} \le 1\]

Conversely, if a set of lengths satisfies this inequality, there exists a prefix-free code with those lengths.

Proof: Consider a complete binary tree where each left branch represents a 0 and each right branch represents a 1. Every node at depth $d$ corresponds to a unique binary string of length $d$.

Assigning a codeword of length $\ell_i$ means selecting a specific node at depth $\ell_i$. Because the code is prefix-free, if we choose a node as a codeword, we cannot choose any of its descendants (otherwise the chosen node’s string would be a prefix of its descendant’s string).

Let $L_{\max} = \max_i \ell_i$ be the maximum depth of the tree. A node chosen at depth $\ell_i$ “casts a shadow” over exactly $2^{L_{\max} - \ell_i}$ leaf nodes at the maximum depth. Since no codeword is a prefix of another, these shadows are disjoint.

The total number of leaves at depth $L_{\max}$ is exactly $2^{L_{\max}}$. Therefore, the sum of all shadowed leaves cannot exceed the total number of leaves:

\[\sum_{i=1}^m 2^{L_{\max} - \ell_i} \le 2^{L_{\max}}\]

Dividing both sides by $2^{L_{\max}}$ gives exactly $\sum_{i=1}^m 2^{-\ell_i} \le 1$. $\square$

Example of a Prefix-Free Code: Let our alphabet be $\{A, B, C, D\}$ with desired codeword lengths $\ell_A = 1, \ell_B = 2, \ell_C = 3, \ell_D = 3$. We check Kraft’s inequality:

\[2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 0.5 + 0.25 + 0.125 + 0.125 = 1 \le 1\]

We can assign the codes as follows along the binary tree:

  • $A \to 0$
  • $B \to 10$
  • $C \to 110$
  • $D \to 111$

Notice that no codeword is a prefix of another, meaning a string like 100110 can be unambiguously decoded from left to right as 10, 0, 110 $\to B, A, C$.
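The example above can be checked in a few lines of Python (a minimal sketch; the `decode` helper and symbol names are ours, not any standard API):

```python
# Verify Kraft's inequality for the example code and decode a bit string
# greedily from left to right.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}

# Kraft sum: sum of 2^{-len(codeword)} must be <= 1 for a prefix-free code.
kraft_sum = sum(2 ** -len(w) for w in code.values())
assert kraft_sum <= 1  # here it equals exactly 1 (a "complete" code)

def decode(bits: str) -> list[str]:
    """Greedy left-to-right decoding; unambiguous because the code is prefix-free."""
    inverse = {w: sym for sym, w in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:          # a full codeword has been read
            out.append(inverse[buf])
            buf = ""
    assert buf == "", "trailing bits did not form a codeword"
    return out

print(decode("100110"))  # ['B', 'A', 'C']
```

The greedy decoder never backtracks: the prefix-free property guarantees that the first codeword match is the only possible one.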

Shannon Coding and Equivalence

Kraft’s inequality looks exactly like the normalization condition on a probability distribution. Given a valid prefix code with lengths $\ell(x)$, we can define a probability distribution:

\[P(x) = 2^{-\ell(x)}\]

Conversely, Shannon Coding guarantees that given a true probability distribution $P(x)$, we can explicitly construct a prefix-free code where the codeword length is $\ell(x) = \lceil -\log_2 P(x) \rceil$.

Implementation of Shannon Coding: How do we construct this code?

  1. Sort the outcomes such that $P(x_1) \ge P(x_2) \ge \dots \ge P(x_m)$.
  2. Compute the cumulative probabilities: $F(x_i) = \sum_{j < i} P(x_j)$, with $F(x_1) = 0$.
  3. Write $F(x_i)$ as a binary fraction (e.g., $0.101101\dots_2$).
  4. The codeword for $x_i$ is simply the first $\ell(x_i) = \lceil -\log_2 P(x_i) \rceil$ bits of the fractional part of $F(x_i)$.

This works because the interval allocated to $x_i$ has width $P(x_i)$. By taking $\lceil -\log_2 P(x_i) \rceil$ bits, we specify a binary sub-interval of width $2^{-\ell(x_i)} \le P(x_i)$ that fits entirely within the cumulative probability interval $[F(x_i), F(x_{i+1}))$. This ensures the intervals for different codewords are disjoint, and thus the binary expansions do not form prefixes of one another.
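The four numbered steps above can be sketched directly (an illustrative implementation; `shannon_code` is our own helper name):

```python
from math import ceil, log2

def shannon_code(probs: dict[str, float]) -> dict[str, str]:
    """Shannon code: the codeword for x is the first ceil(-log2 P(x)) bits
    of the binary expansion of the cumulative probability F(x)."""
    # Step 1: sort outcomes by decreasing probability.
    items = sorted(probs.items(), key=lambda kv: -kv[1])
    code, F = {}, 0.0
    for sym, p in items:
        # Step 2 happens implicitly: F accumulates probabilities in sorted order.
        ell = ceil(-log2(p))                  # codeword length
        # Steps 3-4: take the first ell bits of the binary fraction of F.
        bits, frac = "", F
        for _ in range(ell):
            frac *= 2
            bits += "1" if frac >= 1 else "0"
            frac -= int(frac)
        code[sym] = bits
        F += p                                # advance the cumulative interval
    return code

print(shannon_code({"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}))
```

On the dyadic distribution from the earlier example this recovers exactly the code $A \to 0$, $B \to 10$, $C \to 110$, $D \to 111$.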

Because of this equivalence, three domains are mathematically identical under the log-loss penalty:

  1. Compression: Minimizing the expected code length $\mathbb{E}[-\log_2 Q(x)]$.
  2. Prediction: Minimizing log-loss penalty $-\log_2 Q(x)$.
  3. Gambling: Maximizing Kelly wealth growth $\mathbb{E}[\log b(x)]$ using betting fractions $b(x)$.

The Universal Distance: The penalty for having the “wrong” distribution $Q$ instead of the true distribution $P$ is exactly the KL Divergence:

\[D_{\text{KL}}(P \parallel Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}\]
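A quick numerical check of this claim (an illustrative sketch with distributions of our own choosing): the expected extra code length paid for coding with $Q$ when the data follow $P$ is exactly $D_{\text{KL}}(P \parallel Q)$.

```python
from math import log2

P = {"a": 0.5, "b": 0.5}      # true distribution
Q = {"a": 0.25, "b": 0.75}    # our (wrong) coding distribution

# Cross-entropy H(P, Q): expected code length when coding with Q.
expected_len_Q = sum(p * -log2(Q[x]) for x, p in P.items())
# H(P): the best achievable expected code length.
entropy_P = sum(p * -log2(p) for x, p in P.items())
# KL divergence: the penalty for using Q instead of P.
kl = sum(p * log2(p / Q[x]) for x, p in P.items())

assert abs((expected_len_Q - entropy_P) - kl) < 1e-12  # penalty = KL divergence
print(kl)  # about 0.2075 extra bits per symbol
```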

3. Sequential Prediction with Log Loss

We previously studied Adversarial Online Learning, using the Exponential Weights algorithm to aggregate expert advice without assuming any true underlying distribution.

Let’s rigorously examine sequential prediction under logarithmic loss. Suppose we have a set of $N$ experts $\{f_1, \dots, f_N\}$. At time $t$, each expert $i$ predicts a probability distribution over the outcomes, which we denote $f_{i,t}$.

We use the Bayesian Algorithm (which is mathematically identical to Exponential Weights with learning rate $\eta=1$). We maintain a weight $w_{i,t}$ for each expert, initialized to $w_{i,0} = 1$.

At step $t$:

  1. Predict: We predict the distribution $\hat{p}_t$ as a weighted average of the experts:
\[\hat{p}_t = \frac{\sum_{i=1}^N w_{i,t-1} f_{i,t}}{\sum_{j=1}^N w_{j,t-1}}\]
  2. Update: After observing the true label $y_t$, we update the weights using the likelihood of the observed label:
\[w_{i,t} = w_{i,t-1} \cdot f_{i,t}[y_t]\]

A Regret Bound Independent of T

A remarkable property of the log-loss is that we can guarantee a regret bound that does not grow with the time horizon $T$.

Let $W_t = \sum_{i=1}^N w_{i,t}$ be the sum of weights. Notice the ratio of successive weights:

\[\frac{W_t}{W_{t-1}} = \frac{\sum_{i=1}^N w_{i,t-1} f_{i,t}[y_t]}{\sum_{j=1}^N w_{j,t-1}} = \hat{p}_t[y_t]\]

Taking the negative logarithm gives us the learner’s loss at time $t$, which we denote $\ell_t = -\log \hat{p}_t[y_t]$. The expert’s loss is $\ell_{i,t} = -\log f_{i,t}[y_t]$.

\[\ell_t = -\log \hat{p}_t[y_t] = \log \frac{W_{t-1}}{W_t}\]

Summing this over all $T$ steps gives a telescoping sum for the total learner loss:

\[\sum_{t=1}^T \ell_t = \sum_{t=1}^T \log \frac{W_{t-1}}{W_t} = \log \frac{W_0}{W_T}\]

Since $W_T$ is a sum of non-negative weights, it is bounded from below by the weight of the best individual expert $i^\star$:

\[W_T \ge w_{i^\star,T} = \prod_{t=1}^T f_{i^\star,t}[y_t]\]

Substituting this back, we get:

\[\sum_{t=1}^T \ell_t \le \log W_0 - \log \left( \prod_{t=1}^T f_{i^\star,t}[y_t] \right)\]

Since $W_0 = N$ (we started with $N$ experts with weight 1), and the second term is exactly the cumulative loss of the best expert $i^\star$:

\[\sum_{t=1}^T \ell_t \le \sum_{t=1}^T \ell_{i^\star,t} + \log N\]

The regret is bounded by exactly $\log N$, completely independent of $T$! This constant regret is possible because logarithmic loss is mixable, allowing the learner to aggregate experts so that the total excess loss never exceeds the initial prior uncertainty.
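A minimal simulation of the algorithm from this section (the expert probabilities, environment, and variable names are illustrative choices) confirms that the realized regret never exceeds $\log N$:

```python
import random
from math import log

# Experts are fixed Bernoulli predictors: expert i predicts P(y=1) = experts[i].
experts = [0.2, 0.5, 0.9]
w = [1.0] * len(experts)            # w_{i,0} = 1

random.seed(0)
learner_loss, expert_loss = 0.0, [0.0] * len(experts)
for _ in range(200):
    # Predict: weighted average of the experts' distributions.
    p1 = sum(wi * fi for wi, fi in zip(w, experts)) / sum(w)
    y = 1 if random.random() < 0.9 else 0   # true environment: Bernoulli(0.9)
    learner_loss += -log(p1 if y == 1 else 1 - p1)
    # Update: multiply each weight by the likelihood it assigned to y_t.
    for i, fi in enumerate(experts):
        lik = fi if y == 1 else 1 - fi
        expert_loss[i] += -log(lik)
        w[i] *= lik

regret = learner_loss - min(expert_loss)
print(regret, "<=", log(len(experts)))
assert regret <= log(len(experts))  # guaranteed by the telescoping argument
```

Note that the regret can even be negative: the mixture is allowed to beat the single best expert, but it can never trail it by more than $\log N$.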

Corollary (Non-Uniform Prior): What if we don’t initialize the weights uniformly? Suppose we assign each expert $i$ an initial weight $w_{i,0} = \pi_i$ such that $\sum_{i=1}^N \pi_i = 1$. In this case, $W_0 = 1$. The lower bound on $W_T$ becomes $W_T \ge \pi_{i^\star} \prod_{t=1}^T f_{i^\star,t}[y_t]$. Plugging this into our telescoping sum yields:

\[\sum_{t=1}^T \ell_t \le \sum_{t=1}^T \ell_{i^\star,t} + \log \frac{1}{\pi_{i^\star}} = \sum_{t=1}^T \ell_{i^\star,t} - \log \pi_{i^\star}\]

The regret against expert $i^\star$ is exactly the negative log-prior we assigned to that expert. This elegantly connects to coding theory: our regret is simply the code length of the expert in our prior distribution!

The Regret of Solomonoff Induction

What if our “experts” are not a finite set, but the infinite set of all computable probability measures? Instead of initializing with uniform weights $w_{i,0} = 1$, we can initialize each computable expert $\mu$ with a prior weight equal to its algorithmic probability: $w_{\mu,0} = 2^{-K(\mu)}$, where $K(\mu)$ is its Kolmogorov complexity. This is exactly the Solomonoff universal prior from Lecture 6.

By the exact same telescoping sum logic, if the true sequence is generated by a computable measure $\mu^\star$, the final total weight $W_T$ is bounded below by the final weight of the true environment $w_{\mu^\star,T}$.

But wait, why does Kraft’s inequality apply here? Recall from Lecture 6 that Kolmogorov complexity $K(\mu)$ is defined using a prefix-free universal Turing machine. This means the set of all valid, halting programs forms a prefix-free code (no valid program is a prefix of another). Because these programs form a prefix code, Kraft’s inequality guarantees that the sum of $2^{-\lvert p \rvert}$ over all halting programs—and thus over all computable measures they generate—is at most 1.

Because of this, the initial sum of weights $\sum_\mu w_{\mu,0} \le 1$, meaning we have $W_0 \le 1$.

Therefore, substituting into our telescoping sum equation:

\[\sum_{t=1}^T \ell_t \le \sum_{t=1}^T \ell_{\mu^\star,t} + \log \frac{W_0}{w_{\mu^\star,0}} \le \sum_{t=1}^T \ell_{\mu^\star,t} - \log(2^{-K(\mu^\star)})\]

If we are using natural logarithms for our loss, $-\log(2^{-K(\mu^\star)}) = K(\mu^\star) \ln 2$.

The cumulative log-loss regret of Solomonoff Induction against the true computable environment is bounded by exactly $K(\mu^\star) \ln 2$! The algorithmic prior perfectly bounds the sequential log-loss regret, giving us a universal sequence predictor whose regret depends only on the algorithmic complexity of the true environment.

The Bayesian Connection

Unrolling the weight update recursion from $t=0$ to $n$ for any specific hypothesis $\theta \in \Theta$:

\[w_{n, \theta} = w_{0, \theta} \prod_{t=1}^n P_\theta(x_t \mid x_{<t}) = w_{0, \theta} \cdot P_\theta(x_1, \dots, x_n)\]

If we normalize the weights to form a distribution, $w_{0, \theta}$ acts as our Prior $\pi(\theta)$, and $P_\theta(x_{1:n})$ is our Likelihood. The normalized weight $w_{n, \theta}$ is exactly the Bayesian Posterior $P(\theta \mid x_{1:n})$.

Conclusion: Bayesian updating is mathematically equivalent to running adversarial Exponential Weights with $\eta=1$ on log-loss. Bayes is a minimax-optimal strategy against an adversary!
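We can verify this equivalence numerically (a small sketch with a two-hypothesis class of our own choosing): unrolling the multiplicative updates reproduces prior times likelihood, i.e., the unnormalized Bayesian posterior.

```python
from math import prod

thetas = [0.3, 0.7]                 # hypotheses: P(x=1) = theta
prior = [0.5, 0.5]                  # w_{0,theta} = pi(theta)
data = [1, 1, 0, 1]

# Sequential exponential-weights update (eta = 1, log-loss).
w = prior[:]
for x in data:
    w = [wi * (th if x == 1 else 1 - th) for wi, th in zip(w, thetas)]

# Closed form: w_{n,theta} = pi(theta) * P_theta(x_1, ..., x_n).
for i, th in enumerate(thetas):
    lik = prod(th if x == 1 else 1 - th for x in data)
    assert abs(w[i] - prior[i] * lik) < 1e-12

# Normalizing the weights yields exactly the Bayesian posterior P(theta | x_{1:n}).
posterior = [wi / sum(w) for wi in w]
print(posterior)
```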

4. Universal Coding and the Shtarkov Sum

In the previous section, we bounded our log-loss regret against a finite set of $N$ experts. But what if we are predicting using a continuous family of models, such as all possible biased coins, or all possible neural network weights?

The Setup

Suppose we have a sequence of observations $x^n = (x_1, \dots, x_n)$. We also have a hypothesis class (or model class) $\mathcal{M} = \{ P_\theta : \theta \in \Theta \}$, where each $\theta$ parameterizes a probability distribution over sequences of length $n$. For example, if $\mathcal{M}$ is the class of all biased coins, $\theta \in [0, 1]$ represents the probability of heads.

We are still in the sequential prediction (or compression) setting. Our goal is to choose a single universal distribution $Q(x^n)$ to predict or compress the data sequence.

However, we want to measure our performance against a much stronger baseline than a fixed expert. After seeing the entire sequence $x^n$, we can look back in hindsight and find the single best parameter $\theta$ for that specific sequence. This is precisely the Maximum Likelihood Estimator (MLE):

\[\hat{\theta}(x^n) = \arg\max_{\theta \in \Theta} P_\theta(x^n)\]

Regret against the MLE

The optimal hindsight code length (or minimum possible log-loss within the class) for the sequence $x^n$ is $-\log P_{\hat{\theta}(x^n)}(x^n)$.

The regret (also called redundancy in compression) of our chosen universal distribution $Q$ for a specific sequence $x^n$ is the extra penalty we pay compared to this hindsight optimal model:

\[\text{Regret}(Q, x^n) = \underbrace{-\log Q(x^n)}_{\text{Our Loss}} - \underbrace{\left( -\log P_{\hat{\theta}(x^n)}(x^n) \right)}_{\text{Best Hindsight Loss}} = \log \frac{P_{\hat{\theta}(x^n)}(x^n)}{Q(x^n)}\]

Our goal in universal coding is to find a single distribution $Q$ that minimizes the worst-case regret over all possible sequences of length $n$:

\[\min_Q \max_{x^n} \text{Regret}(Q, x^n)\]

The Shtarkov Solution (Normalized Maximum Likelihood)

To minimize the maximum regret, $Q$ should be designed such that the regret is a constant, no matter what sequence $x^n$ occurs. If it varied, an adversary would just pick the sequence where our regret is highest.

For the ratio $\frac{P_{\hat{\theta}(x^n)}(x^n)}{Q(x^n)}$ to be constant, $Q(x^n)$ must be proportional to the maximum likelihood for that sequence. But $Q$ must be a valid probability distribution, meaning it must sum to 1 over all possible sequences $y^n \in \mathcal{X}^n$.

Normalizing it gives us the Normalized Maximum Likelihood (NML) distribution:

\[Q^\star(x^n) = \frac{P_{\hat{\theta}(x^n)}(x^n)}{\sum_{y^n} P_{\hat{\theta}(y^n)}(y^n)}\]

The Shtarkov Sum and Model Complexity

The denominator in the NML distribution is famous in information theory. It is called the Shtarkov Sum:

\[S(n) = \sum_{y^n} P_{\hat{\theta}(y^n)}(y^n)\]

If we plug our NML distribution $Q^\star$ back into the regret equation, the minimax regret is exactly the log of the Shtarkov sum:

\[\text{Minimax Regret} = \log S(n)\]
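For the biased-coin class, the Shtarkov sum can be computed exactly: every sequence with $k$ ones has maximum likelihood $(k/n)^k ((n-k)/n)^{n-k}$, and there are $\binom{n}{k}$ such sequences (a minimal sketch; `shtarkov_sum` is our own helper name):

```python
from math import comb, log2

def shtarkov_sum(n: int) -> float:
    """Shtarkov sum S(n) for the Bernoulli model class over {0,1}^n."""
    total = 0.0
    for k in range(n + 1):
        # MLE for a sequence with k ones is theta_hat = k/n.
        p = (k / n) ** k * ((n - k) / n) ** (n - k) if 0 < k < n else 1.0
        total += comb(n, k) * p   # all C(n,k) such sequences share this value
    return total

for n in (10, 100, 1000):
    # Minimax regret log2 S(n); for a 1-parameter class it grows like (1/2) log2 n.
    print(n, log2(shtarkov_sum(n)))
```

The slow, logarithmic growth of $\log S(n)$ is the complexity penalty in action: one free parameter buys the class only about half a bit of extra hindsight advantage per doubling of $n$.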

Why this is profound: The Shtarkov sum mathematically measures the capacity or complexity of a model class $\mathcal{M}$.

  • If a model is simple, it can only assign high likelihoods to a few specific sequences. The sum $S(n)$ will be small, and the regret $\log S(n)$ will be small.
  • If a model is highly complex (overfitting), it has enough parameters to assign a high likelihood to any random sequence $y^n$. The maximum likelihoods will all be large, causing the sum $S(n)$ to explode, resulting in massive regret.

This provides a non-Bayesian, purely objective mathematical derivation of Occam’s Razor. The penalty for model complexity naturally emerges from the goal of minimizing worst-case regret against an adversary in a continuous hypothesis space!

(Side Note on Bayesian connections: While NML is derived from minimax game theory rather than Bayes’ rule, there is a deep asymptotic connection. As $n \to \infty$, the NML distribution converges to a Bayesian predictor that uses the Jeffreys prior $\pi_J(\theta) \propto \sqrt{\det(I(\theta))}$, perfectly linking Fisher information geometry to minimax sequence prediction.)

Connection to Solomonoff Induction

Recall from Lecture 6 that Solomonoff Induction provides the ultimate universal predictor by using Bayesian updating with the universal prior $M(x) = \sum_{U(p)=x} 2^{-\lvert p \rvert}$, which mixes over all computable probability measures.

How does the NML framework relate to Solomonoff Induction?

  1. The Hypothesis Class: NML typically operates over a specific, restricted continuous family $\mathcal{M}$ (like all Markov chains of order $k$, or all neural networks of a specific architecture). Solomonoff Induction operates over the ultimate, unrestricted class: all Turing-computable distributions.
  2. The Regret Penalty: In our finite expert setting (Section 3), the regret was exactly bounded by the prior uncertainty $\log N$. In the NML setting, the minimax regret is bounded by $\log S(n)$, the log of the Shtarkov sum, which scales with the degrees of freedom of the model class. In Solomonoff Induction, the total expected regret (KL divergence) against a true computable measure $\mu$ is bounded by $K(\mu) \ln 2$. The Kolmogorov complexity $K(\mu)$ plays the exact same mathematical role as $\log N$ or $\log S(n)$!
  3. Two Paths to Occam’s Razor: Both frameworks independently discover Occam’s Razor. Solomonoff arrives at it via Computability Theory (shorter programs naturally have higher prior probability mass). NML arrives at it via Minimax Game Theory (complex models spread their maximum-likelihood flexibility too thin across the $2^n$ possible sequences, incurring a massive worst-case regret penalty).

In essence, NML provides the optimal adversarial strategy for a known, restricted model class, while Solomonoff Induction provides the optimal Bayesian strategy for the unrestricted class of all computable environments.

5. Tractable Models and Entropy Rates

Solomonoff Induction and NML are theoretically beautiful but generally incomputable. We must restrict our hypothesis class to tractable structures. But how do we measure the ultimate predictability of a generic sequence?

Stochastic Processes and Stationarity

A stochastic process $\{X_i\}$ is stationary if its joint distribution is invariant to time shifts:

\[\mathbb{P}\{X_{1:n} = x_{1:n}\} = \mathbb{P}\{X_{1+\ell : n+\ell} = x_{1:n}\}\]

The Entropy Rate

The fundamental limit of predictability for a stationary process is the Entropy Rate $H(\mathcal{X})$, which measures the average uncertainty per symbol as the sequence grows infinitely long:

\[H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_{1:n}) = \lim_{n \to \infty} H(X_n \mid X_{1:n-1})\]

This is the “ultimate optimal perplexity” that any language model strives to achieve.
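For a concrete case we can compute this exactly (an illustrative sketch with a made-up transition matrix): for a stationary Markov chain the conditioning collapses to a single step, so $H(\mathcal{X}) = \sum_i \pi_i H(X_{t+1} \mid X_t = i)$.

```python
from math import log2

# Transition matrix: P[i][j] = P(next = j | current = i).
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution pi solves pi = pi P; closed form for 2 states.
a, b = P[0][1], P[1][0]             # probabilities of leaving each state
pi = [b / (a + b), a / (a + b)]

def H(row):                          # entropy (bits) of one transition distribution
    return -sum(p * log2(p) for p in row if p > 0)

# Entropy rate: stationary-weighted average of per-state transition entropies.
rate = sum(pi_i * H(row) for pi_i, row in zip(pi, P))
print(rate)   # bits per symbol; below 1 bit, the i.i.d. fair-coin rate
```

Here the chain’s persistence (state 0 stays put 90% of the time) drives the entropy rate well below the 1 bit per symbol of an unpredictable fair coin.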

The Need for Tractable Approximations

While the Entropy Rate defines the fundamental limit of predictability, computing it directly requires conditioning on an infinitely long history. Practical sequence models must find a way to approximate this without exploding the state space or requiring infinite compute.

In our next lecture, we will explore the foundational architectures—from Markov Chains to Recurrent Neural Networks and finally Transformers—that attempt to solve this problem, examining their tradeoffs between sequence context, parallelizability, and computational expressivity.