CS 8803 Sequence Prediction

Course materials for CS 8803 Sequence Prediction, Spring 2026 at Georgia Tech.


Lecture 3: Online Convex Optimization

Date: January 30, 2026 Topic: Online Convex Optimization (OCO), Online Gradient Descent, and Reductions


1. Online Convex Optimization (OCO)

Online Convex Optimization (Zinkevich, 2003) is a general framework that unifies many online learning problems.

The OCO Protocol

For $t = 1, \dots, T$:

  1. Learner chooses $x_t \in \mathcal{K}$ (a convex, compact set).
  2. Environment reveals a convex loss function $f_t: \mathcal{K} \to \mathbb{R}$.
  3. Learner incurs loss $f_t(x_t)$.

Regret:

\[\text{Regret}_T = \sum_{t=1}^T f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^T f_t(x)\]

Online Gradient Descent (OGD)

OGD is the online analog of classical gradient descent.

Update Rule:

\[x_{t+1} = \Pi_{\mathcal{K}} (x_t - \eta \nabla f_t(x_t))\]

where $\Pi_{\mathcal{K}}(y) = \arg\min_{x \in \mathcal{K}} \lVert x - y \rVert$ is the Euclidean projection.
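The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it assumes $\mathcal{K}$ is a Euclidean ball (so the projection has a closed form), and `grad(t, x)` is a hypothetical callback returning $\nabla f_t(x)$.

```python
import numpy as np

def project_ball(y, radius):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def ogd(grad, x0, eta, T, radius):
    """Online gradient descent: grad(t, x) returns the gradient of f_t at x."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for t in range(T):
        # Gradient step on the just-revealed loss, then project back onto K.
        x = project_ball(x - eta * grad(t, x), radius)
        iterates.append(x.copy())
    return iterates
```

For example, with the (stationary) losses $f_t(x) = \frac{1}{2}\lVert x - 1 \rVert^2$, the iterates contract geometrically toward the minimizer.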

Regret Bound (Theorem 3.1 in Hazan)

Theorem: Suppose $\mathcal{K}$ has diameter $D$ (i.e., $\max_{x,y \in \mathcal{K}} \lVert x-y \rVert \le D$) and the functions $f_t$ are $G$-Lipschitz ($\lVert \nabla f_t(x) \rVert \le G$). With learning rate $\eta = \frac{D}{G\sqrt{T}}$, the regret of OGD is:

\[\text{Regret}_T \le GD\sqrt{T}\]

(Hazan's Theorem 3.1 is stated with the time-varying step sizes $\eta_t = \frac{D}{G\sqrt{t}}$, which do not require knowing $T$ in advance, and gives the slightly larger bound $\frac{3}{2} GD\sqrt{T}$.)

Proof Sketch:

  1. Let $x^\star = \arg\min_{x \in \mathcal{K}} \sum f_t(x)$.
  2. Analyze distance to $x^\star$: $\lVert x_{t+1} - x^\star \rVert^2 = \lVert \Pi_{\mathcal{K}}(x_t - \eta \nabla f_t(x_t)) - \Pi_{\mathcal{K}}(x^\star) \rVert^2$.
  3. By projection property: $\lVert x_{t+1} - x^\star \rVert^2 \le \lVert x_t - \eta \nabla f_t(x_t) - x^\star \rVert^2$.
  4. Expand: $\lVert x_{t+1} - x^\star \rVert^2 \le \lVert x_t - x^\star \rVert^2 - 2\eta \langle \nabla f_t(x_t), x_t - x^\star \rangle + \eta^2 \lVert \nabla f_t(x_t) \rVert^2$.
  5. By convexity: $f_t(x_t) - f_t(x^\star) \le \langle \nabla f_t(x_t), x_t - x^\star \rangle$.
  6. Summing and telescoping: $\sum (f_t(x_t) - f_t(x^\star)) \le \frac{\lVert x_1 - x^\star \rVert^2}{2\eta} + \frac{\eta}{2} \sum \lVert \nabla f_t(x_t) \rVert^2 \le \frac{D^2}{2\eta} + \frac{\eta T G^2}{2}$.
  7. Choosing $\eta = \frac{D}{G\sqrt{T}}$ balances the two terms, yielding the $O(GD\sqrt{T})$ bound.

2. Examples of OCO

We can frame many learning problems as instances of OCO by properly defining the convex set $\mathcal{K}$ and the loss functions $f_t$.

Online Density Estimation

Consider a family of distributions $\{P_\theta : \theta \in \Theta \subseteq \mathbb{R}^d\}$ from an exponential family, where $\Theta$ is a convex set. The density is given by $P_\theta(x) = \exp(\langle \theta, \phi(x) \rangle - A(\theta))$.

Protocol: For $t = 1, \dots, T$:

  1. Learner chooses parameter $\theta_t \in \Theta$.
  2. Nature reveals sample $x_t \sim \mathcal{D}$.
  3. Learner incurs loss $f_t(\theta_t) = -\log P_{\theta_t}(x_t)$ (negative log likelihood).

Convexity: The negative log likelihood is convex in $\theta$:

\[f_t(\theta) = -\langle \theta, \phi(x_t) \rangle + A(\theta)\]

Since the log-partition function $A(\theta)$ is convex, $f_t(\theta)$ is convex.
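As a concrete sketch of this protocol, take the Bernoulli family in natural-parameter form: $\phi(x) = x$ and $A(\theta) = \log(1 + e^\theta)$, so $f_t(\theta) = -\theta x_t + \log(1 + e^\theta)$ and $\nabla f_t(\theta) = -x_t + \sigma(\theta)$. The step size, the box $\Theta = [-5, 5]$, and the true success probability $0.8$ below are illustrative choices, not values from the lecture.

```python
import numpy as np

def nll_grad(theta, x):
    """Gradient of f_t(theta) = -theta * x + log(1 + exp(theta)) for Bernoulli."""
    return -x + 1.0 / (1.0 + np.exp(-theta))  # -phi(x_t) + A'(theta)

rng = np.random.default_rng(0)
theta, eta = 0.0, 0.1
theta_sum = 0.0
samples = rng.binomial(1, 0.8, size=3000)  # x_t ~ Bernoulli(0.8)
for x in samples:
    theta_sum += theta
    # OGD step on the NLL, then project onto Theta = [-5, 5].
    theta = np.clip(theta - eta * nll_grad(theta, x), -5.0, 5.0)

theta_bar = theta_sum / len(samples)
p_hat = 1.0 / (1.0 + np.exp(-theta_bar))  # mean parameter implied by the averaged iterate
```

The averaged iterate recovers a mean parameter close to the true $0.8$, consistent with the online-to-batch conversion discussed later in these notes.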

Online Linear Regression

We aim to predict a real-valued label $y_t$ given features $x_t \in \mathbb{R}^d$.

Protocol: For $t = 1, \dots, T$:

  1. Learner chooses predictor $w_t \in \mathbb{R}^d$.
  2. Nature reveals pair $(x_t, y_t) \in \mathbb{R}^d \times \mathbb{R}$.
  3. Learner incurs squared loss $f_t(w_t) = \frac{1}{2} (\langle w_t, x_t \rangle - y_t)^2$.

Convexity: The function $f_t(w) = \frac{1}{2}(\langle w, x_t \rangle - y_t)^2$ is convex in $w$ because it is the composition of the convex quadratic $u \mapsto \frac{1}{2}(u - y_t)^2$ with the linear map $w \mapsto \langle w, x_t \rangle$, and precomposing a convex function with an affine map preserves convexity.
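The online linear regression protocol can be sketched as an OGD loop on the squared loss; note that $\nabla f_t(w) = (\langle w, x_t \rangle - y_t)\, x_t$. The ground-truth vector, noise level, and step size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, eta = 3, 5000, 0.01
w_true = np.array([1.0, -2.0, 0.5])  # hypothetical ground-truth predictor

w = np.zeros(d)
for t in range(T):
    x = rng.normal(size=d)                # nature reveals (x_t, y_t)
    y = w_true @ x + 0.01 * rng.normal()  # noisy linear label
    residual = w @ x - y
    w -= eta * residual * x               # gradient of f_t(w) = 0.5*(w.x - y)^2
```

Here no projection is applied (i.e., $\mathcal{K} = \mathbb{R}^d$); in the OCO analysis one would typically restrict $w$ to a bounded set to control the diameter $D$.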


3. Applications and Reductions

OCO is powerful because it lets us solve offline optimization problems and convert online regret guarantees into statistical (generalization) guarantees.

Convex Optimization to OCO

We can use an OCO algorithm (like OGD) to solve a standard offline convex optimization problem:

\[\min_{x \in \mathcal{K}} f(x)\]

Procedure: Run the OCO algorithm for $T$ steps by setting the online loss functions to be the static objective function, i.e., $f_t(x) = f(x)$ for all $t$. Let $\bar{x}_T = \frac{1}{T} \sum_{t=1}^T x_t$ be the average iterate.

Theorem: If the OCO algorithm has regret $\text{Regret}_T$, then:

\[f(\bar{x}_T) - \min_{x \in \mathcal{K}} f(x) \le \frac{\text{Regret}_T}{T}\]

Proof: Let $x^\star \in \arg\min_{x \in \mathcal{K}} f(x)$. By Jensen’s inequality (since $f$ is convex):

\[f(\bar{x}_T) = f\left(\frac{1}{T} \sum_{t=1}^T x_t\right) \le \frac{1}{T} \sum_{t=1}^T f(x_t)\]

Since $f_t(x) = f(x)$, the sum on the RHS is exactly the cumulative loss of the algorithm. By the definition of regret:

\[\sum_{t=1}^T f(x_t) = \min_{x \in \mathcal{K}} \sum_{t=1}^T f_t(x) + \text{Regret}_T = T f(x^\star) + \text{Regret}_T\]

Dividing by $T$:

\[\frac{1}{T} \sum_{t=1}^T f(x_t) = f(x^\star) + \frac{\text{Regret}_T}{T}\]

Combining with Jensen’s inequality:

\[f(\bar{x}_T) \le f(x^\star) + \frac{\text{Regret}_T}{T} \implies f(\bar{x}_T) - f(x^\star) \le \frac{\text{Regret}_T}{T}\]
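This reduction can be checked numerically. The sketch below (with an assumed unit-ball $\mathcal{K}$, quadratic objective, and arbitrary step size) runs OGD on the static loss $f_t = f$ and averages the iterates; the suboptimality gap of $\bar{x}_T$ is small, as the theorem predicts.

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    n = np.linalg.norm(y)
    return y if n <= radius else y * (radius / n)

# Static objective f(x) = 0.5 * ||x - c||^2 with c inside K, so min f = 0 at x = c.
c = np.array([0.6, 0.3])
T, eta = 1000, 0.05
x = np.zeros(2)
x_sum = np.zeros(2)
for t in range(T):
    x_sum += x                           # accumulate x_1, ..., x_T
    x = project_ball(x - eta * (x - c))  # OGD step with f_t = f
x_bar = x_sum / T                        # average iterate

gap = 0.5 * np.sum((x_bar - c) ** 2)     # f(x_bar) - min_x f(x)
```

With OGD's $O(\sqrt{T})$ regret, the theorem guarantees a gap of $O(1/\sqrt{T})$; on this smooth instance the observed gap is far smaller.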

Online to Batch Conversion

We can convert an online learner into a batch learner for stochastic problems. Setting: Samples $z_t = (x_t, y_t)$ are drawn i.i.d. from a distribution $\mathcal{D}$. We want to minimize the expected risk $\mathcal{R}(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$.

Procedure:

  1. Run OCO with loss functions $f_t(w) = \ell(w, z_t)$.
  2. Output the average iterate $\bar{w}_T = \frac{1}{T} \sum_{t=1}^T w_t$.

Theorem: \(\mathbb{E}[\mathcal{R}(\bar{w}_T)] - \mathcal{R}(w^\star) \le \frac{\mathbb{E}[\text{Regret}_T]}{T}\)

Proof: Since $w_t$ depends only on $z_1, \dots, z_{t-1}$, it is independent of $z_t$, so $f_t(w)$ is an unbiased estimate of the risk at $w = w_t$ (i.e., $\mathbb{E}_{z_t}[f_t(w) \mid w] = \mathcal{R}(w)$). Hence:

\[\mathbb{E}\left[\sum_{t=1}^T f_t(w_t)\right] = \sum_{t=1}^T \mathbb{E}[\mathcal{R}(w_t)] \ge T \mathbb{E}[\mathcal{R}(\bar{w}_T)]\]

The last inequality follows from Jensen’s inequality applied to $\mathcal{R}$, which is convex because it is an expectation of the convex functions $\ell(\cdot, z)$: $\frac{1}{T}\sum_{t=1}^T \mathcal{R}(w_t) \ge \mathcal{R}(\bar{w}_T)$. Also, for any fixed $w^\star$:

\[\mathbb{E}\left[\sum_{t=1}^T f_t(w^\star)\right] = T \mathcal{R}(w^\star)\]

Taking the expectation of the regret definition:

\[\mathbb{E}[\text{Regret}_T] = \mathbb{E}\left[\sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(w^\star)\right] \ge T \mathbb{E}[\mathcal{R}(\bar{w}_T)] - T \mathcal{R}(w^\star)\]

Rearranging gives the result.
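The conversion above can be sketched on a stochastic linear regression problem. Everything below is an illustrative setup (the risk minimizer, noise level, and step size are assumed): for $x \sim \mathcal{N}(0, I)$ with squared loss, the excess risk of any $w$ is exactly $\frac{1}{2}\lVert w - w^\star \rVert^2$, which lets us check the guarantee directly.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, eta = 2, 2000, 0.05
w_star = np.array([0.5, -0.5])  # hypothetical risk minimizer

w = np.zeros(d)
w_sum = np.zeros(d)
for t in range(T):
    x = rng.normal(size=d)               # z_t = (x_t, y_t) drawn i.i.d.
    y = w_star @ x + 0.1 * rng.normal()
    w_sum += w                           # accumulate iterates before updating
    w -= eta * (w @ x - y) * x           # OGD on f_t(w) = 0.5*(w.x - y)^2
w_bar = w_sum / T                        # averaged predictor

# For x ~ N(0, I): R(w) - R(w_star) = 0.5 * ||w - w_star||^2.
excess_risk = 0.5 * np.sum((w_bar - w_star) ** 2)
```

The averaged predictor's excess risk is on the order of $\mathbb{E}[\text{Regret}_T]/T$, as the theorem states.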


4. Historical Context

  • No-Regret & OCO: The idea of no-regret learning emerged in the 1950s (Hannan, Blackwell), but it was Martin Zinkevich’s 2003 paper that unified the field through Online Convex Optimization and OGD. Elad Hazan’s work (2006-Present) further refined these tools, particularly for logarithmic regret.
  • Online to Batch: The technique of converting online algorithms to batch learners (“online-to-batch conversion”) was formalized for general convex loss functions by Cesa-Bianchi, Conconi, and Gentile (2004). This generalized earlier work by Littlestone (1989), who explored similar ideas in the mistake-bound model.
  • Iterate Averaging: The power of averaging iterates (as seen in the OCO-to-convex-optimization reduction) has deep roots. Polyak and Juditsky (1992) famously showed that averaging iterates in stochastic gradient descent (Polyak-Ruppert averaging) leads to statistically optimal asymptotic convergence rates. Zinkevich (2003) later provided a non-asymptotic view, showing that low regret implies the average iterate converges to the optimum.