In the last lecture, we saw some algorithms that, while simple and appealing, were somewhat unmotivated. We now try to derive them from general principles, and in a setting that will allow us to attack other problems in competitive analysis.
Gradient descent: The proximal view
Let us first recall the upper bound we derived for the regret in the last lecture:
$$R_T \le \sum_{t=1}^{T} \Big[ \|p_t - p_{t+1}\|_1 + \langle p_{t+1}, \ell_t \rangle - \ell_t(x_T^*) \Big]. \tag{1}$$
Trying to minimize this expression leads to the question of how we should update our probability distribution $p_t \to p_{t+1}$ to simultaneously be stable (controlling the first term) and competitive (the second term).
A very natural algorithm in this setting is gradient descent. Indeed, suppose that $\ell : \mathbb{R}^n \to \mathbb{R}$ is differentiable, and consider the optimization
$$\min \Big\{ \tfrac{1}{2} \|x - x_0\|^2 + \eta\, \ell(x) : x \in \mathbb{R}^n \Big\},$$
where $\eta > 0$ is a small constant and $\|\cdot\|$ denotes the Euclidean norm. Then first-order optimality conditions dictate that the optimizer satisfies $x^* = x_0 - \eta \nabla \ell(x_0) + O(\eta^2)$.
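To see this concretely, here is a minimal numerical sketch (not from the lecture; the loss $\ell(x) = \log \sum_i e^{x_i}$ is chosen only for illustration) checking that the proximal minimizer agrees with the explicit gradient step up to $O(\eta^2)$:

```python
# Sanity check: the proximal step argmin_x { 1/2 ||x - x0||^2 + eta*l(x) }
# matches the gradient step x0 - eta * grad_l(x0) up to O(eta^2).
import numpy as np
from scipy.optimize import minimize

def loss(x):                      # l(x) = log sum_i exp(x_i), smooth and convex
    return np.log(np.sum(np.exp(x)))

def grad_loss(x):                 # its gradient: the softmax of x
    e = np.exp(x)
    return e / np.sum(e)

x0, eta = np.array([0.3, -0.1, 0.5]), 1e-2
prox = minimize(lambda x: 0.5 * np.sum((x - x0)**2) + eta * loss(x), x0).x
step = x0 - eta * grad_loss(x0)
print(np.linalg.norm(prox - step))   # tiny: on the order of eta^2
```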
Two questions immediately arise: Why do we use the Euclidean norm when our reference problem (1) refers to the $\ell_1$ norm? And if $x$ is meant to encode a probability distribution, how do we maintain this constraint for $x^*$?
Projected gradient descent
Let’s address the feasibility problem first. Suppose $K \subseteq \mathbb{R}^n$ is a closed convex set and $F : \mathbb{R}^n \to \mathbb{R}^n$ is a sufficiently smooth vector field (think of $F = -\nabla \ell$). How should we move in the direction of $F$ while simultaneously remaining inside $K$?
The unconstrained flow along $F$ can be described as a trajectory $x : [0,\infty) \to \mathbb{R}^n$ given by $x'(t) = F(x(t))$.
To keep the trajectory inside $K$, we define the Euclidean projection onto $K$,
$$P_K(y) := \operatorname{argmin} \{ \|y - z\|^2 : z \in K \},$$
and the result of taking an infinitesimal step in direction $v$ and then projecting:
$$\Pi_K(x, v) := \lim_{\varepsilon \to 0} \frac{P_K(x + \varepsilon v) - x}{\varepsilon}.$$
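For concreteness, here is a short sketch (illustrative code, assuming $K$ is the probability simplex, the case we care about below) of the Euclidean projection $P_K$ and a finite-difference approximation of $\Pi_K$:

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection P_K(y) onto the probability simplex
    {z >= 0, sum(z) = 1}, via the standard sort-and-threshold method."""
    u = np.sort(y)[::-1]
    cumsum = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(y) + 1) > cumsum - 1)[0][-1]
    theta = (cumsum[rho] - 1) / (rho + 1.0)
    return np.maximum(y - theta, 0.0)

def projected_direction(x, v, eps=1e-7):
    """Finite-difference approximation of Pi_K(x, v)."""
    return (project_simplex(x + eps * v) - x) / eps
```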
A Riemannian version
One can view $\Pi_K(x, \cdot)$ as a function on the tangent space at $x$. To specify such a projection, we only need a local Euclidean structure. An inner product $\langle \cdot, \cdot \rangle_x$ that varies smoothly over $x \in K$ is precisely a Riemannian metric.
Equivalently, we specify at every point $x \in K$ a smoothly varying positive-definite matrix $M(x)$ so that
$$\langle u, v \rangle_{M,x} = \langle u, M(x) v \rangle, \qquad \|u\|_{M,x}^2 = \langle u, u \rangle_{M,x}.$$
The associated projection operators are then given by
$$P_K^M(y; x) := \operatorname{argmin} \{ \|y - z\|_{M,x}^2 : z \in K \},$$
$$\Pi_K^M(x, v) := \lim_{\varepsilon \to 0} \frac{P_K^M(x + \varepsilon v; x) - x}{\varepsilon}.$$
This leads to the dynamical system:
$$x'(t) = \Pi_K^M\big(x(t), F(x(t))\big), \qquad x(0) = x_0 \in K.$$
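As a simple special case to keep in mind (spelled out here for later use), when $M(x)$ is diagonal the metric is just a coordinate-wise reweighting of the Euclidean one:
$$\|u\|_{M,x}^2 = \sum_{i=1}^n m_i(x)\, u_i^2, \qquad M(x) = \operatorname{diag}\big(m_1(x), \ldots, m_n(x)\big),$$
so movement is cheap in coordinates where $m_i(x)$ is small. In the entropic example at the end of this lecture, $m_i(x) = \frac{1}{x_i + \delta}$, making it cheaper to move coordinates that already carry substantial probability mass.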
Lyapunov functions
The problem with stating things at this level of generality is that even when $F = -\nabla \ell$ for a convex function $\ell : \mathbb{R}^n \to \mathbb{R}$, we don't have a global way of controlling the convergence of $\ell(x(t))$ to $\min \{ \ell(x) : x \in K \}$. In the Euclidean setting ($M(x) \equiv \mathrm{Id}$), there is a natural Lyapunov function: If $\ell$ is convex and $\ell(x^*) = \min \{ \ell(x) : x \in K \}$, then for every $x \in K$:
$$\langle -\nabla \ell(x), x^* - x \rangle \ge 0.$$
In other words, gradient descent always makes progress toward $x^*$.
If $x'(t) = \Pi_K\big(x(t), -\nabla \ell(x(t))\big)$, then in the language of competitive analysis, the quantity $\frac{1}{2} \|x(t) - x^*\|^2$ acts as a potential function (a global measure of progress).
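To spell out the one-line computation behind this claim (using also that the projection $\Pi_K$ only subtracts an outward normal component, which has nonnegative inner product with $x(t) - x^*$ when $x^* \in K$):
$$\partial_t \tfrac{1}{2} \|x(t) - x^*\|^2 = \langle x'(t), x(t) - x^* \rangle \le \langle -\nabla \ell(x(t)), x(t) - x^* \rangle \le 0,$$
where the final inequality is exactly the convexity statement above.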
We will consider geometries that come equipped with such a Lyapunov function. In a sense that can be formalized in various ways, these are the Hessian structures on $\mathbb{R}^n$, i.e., those arising when $M(x) = \nabla^2 \Phi(x)$ for some strictly convex function $\Phi : K \to \mathbb{R}$.
Mirror descent dynamics
Consider now a compact, convex set $K \subseteq \mathbb{R}^n$, a strictly convex function $\Phi : K \to \mathbb{R}$, and a continuous time-varying vector field $F : [0,\infty) \times K \to \mathbb{R}^n$. We will refer to continuous-time mirror descent as the dynamics specified by
$$x'(t) = \Pi_K^{\nabla^2 \Phi}\big(x(t), F(t, x(t))\big), \qquad x(0) = x_0 \in K.$$
We will sometimes refer to $\Phi$ as the mirror map.
As one might expect, we can decompose $x'(t)$ into two components: one flowing in the direction $F(t, x(t))$, and the other arising from the normal forces that keep $x(t)$ inside $K$. Recall that the normal cone to $K$ at $x$ is given by
$$N_K(x) = \{ p \in \mathbb{R}^n : \langle p, y - x \rangle \le 0 \text{ for all } y \in K \}.$$
This is the set of directions that point out of the body $K$.
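For example (a standard special case, not spelled out in the original notes), for a single halfspace $K = \{ x : \langle a, x \rangle \le b \}$, the normal cone is the ray spanned by the constraint normal on the boundary, and trivial in the interior:
$$N_K(x) = \begin{cases} \{ s a : s \ge 0 \}, & \langle a, x \rangle = b, \\ \{ 0 \}, & \langle a, x \rangle < b. \end{cases}$$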
The next theorem is proved in the paper “k-server via multiscale entropic regularization.”
Theorem (MD). If $\nabla^2 \Phi(x)^{-1}$ is continuous on $K$, then for any $x_0 \in K$, there is an absolutely continuous trajectory $x : [0,\infty) \to K$ satisfying
$$\nabla^2 \Phi(x(t))\, x'(t) \in F(t, x(t)) - N_K(x(t)), \qquad x(0) = x_0. \tag{2}$$
Note that (2) is a differential inclusion: We only require that the derivative lies in the specified set.
Lagrange multipliers
If $K$ is a polyhedron, then one can write
$$K = \{ x \in \mathbb{R}^n : Ax \le b \}, \qquad A \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^m. \tag{3}$$
In this case, the normal cone at $x$ is the cone spanned by the normals of the tight constraints at $x$:
$$N_K(x) = \{ A^{\mathsf{T}} y : y \ge 0 \text{ and } y^{\mathsf{T}} (b - Ax) = 0 \}.$$
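For instance, writing the probability simplex (used at the end of this lecture) as $\{ x : -x \le 0,\ \sum_i x_i \le 1,\ -\sum_i x_i \le -1 \}$, this description specializes to
$$N_K(x) = \big\{ \mu \mathbf{1} - \lambda : \mu \in \mathbb{R},\ \lambda \in \mathbb{R}^n_+,\ \lambda_i x_i = 0 \text{ for all } i \big\}:$$
the multiplier $\mu$ of the sum constraint is free because that constraint is always tight, while $\lambda_i$ can be nonzero only on coordinates with $x_i = 0$.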
Consider now the application of Theorem MD to a polyhedron $K$ and a solution $x : [0,\infty) \to K$, $\lambda : [0,\infty) \to \mathbb{R}^n$ such that
$$\nabla^2 \Phi(x(t))\, x'(t) = F(t, x(t)) - \lambda(t), \tag{5}$$
and $\lambda(t) \in N_K(x(t))$ for all $t \ge 0$.
Let us consider the dual variables to the constraints (3): We can fix a measurable $\hat{\lambda} : [0,\infty) \to \mathbb{R}^m_+$ such that
$$A^{\mathsf{T}} \hat{\lambda}(t) = \lambda(t), \qquad t \ge 0.$$
The Bregman divergence as a Lyapunov function
We promised earlier the existence of a functional to control the dynamics, and this is provided by the Bregman divergence associated to $\Phi$:
$$D_\Phi(y; x) := \Phi(y) - \Phi(x) - \langle \nabla \Phi(x), y - x \rangle.$$
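As a sanity check (an illustrative sketch; the code and the choice of mirror map here are ours), for the unshifted negative entropy $\Phi(x) = \sum_i x_i \log x_i$, the Bregman divergence between points of the simplex is exactly the Kullback-Leibler divergence:

```python
import numpy as np

def bregman(phi, grad_phi, y, x):
    """D_Phi(y; x) = Phi(y) - Phi(x) - <grad Phi(x), y - x>."""
    return phi(y) - phi(x) - np.dot(grad_phi(x), y - x)

phi = lambda x: np.sum(x * np.log(x))   # negative entropy
grad_phi = lambda x: np.log(x) + 1.0    # its gradient

x = np.array([0.5, 0.3, 0.2])
y = np.array([0.2, 0.2, 0.6])
print(bregman(phi, grad_phi, y, x))     # matches the KL divergence:
print(np.sum(y * np.log(y / x)))
```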
Let $x(t)$ be a trajectory satisfying (5). Then for any $y \in K$:
$$\begin{aligned}
\partial_t D_\Phi(y; x(t)) &= -\langle \nabla \Phi(x(t)), x'(t) \rangle + \langle \nabla \Phi(x(t)), x'(t) \rangle - \langle \partial_t \nabla \Phi(x(t)), y - x(t) \rangle \\
&= -\langle \nabla^2 \Phi(x(t))\, x'(t), y - x(t) \rangle \\
&= -\langle F(t, x(t)) - \lambda(t), y - x(t) \rangle \\
&\le -\langle F(t, x(t)), y - x(t) \rangle,
\end{aligned} \tag{6}$$
where the last inequality used that $y \in K$ and $\lambda(t) \in N_K(x(t))$.
If $F(t, x(t)) = -c(t)$ for a time-varying cost vector $c(t)$, say, then (6) reads $\partial_t D_\Phi(y; x(t)) \le \langle c(t), y \rangle - \langle c(t), x(t) \rangle$, and this inequality aligns with a goal stated at the beginning of the first lecture: As long as the algorithm $x(t)$ is suffering more cost than some feasible point $y \in K$, we would like to be “learning” about $y$; here, the divergence to $y$ decreases at least at the rate of the cost difference.
The algorithm from last time
In the next lecture, we will use this framework to derive and analyze algorithms for metrical task systems (MTS) and the k-server problem. For now, let us show that the algorithm and analysis from last time (for MTS on uniform metrics) fit precisely into our framework.
Suppose that $K = \{ x \in \mathbb{R}^n_+ : \sum_{i=1}^n x_i = 1 \}$ is the probability simplex and
$$\Phi(x) = \sum_{i=1}^n (x_i + \delta) \log(x_i + \delta)$$
is the negative entropy, shifted by some $\delta > 0$. In the next lecture, we will see why the negative entropy arises naturally as a mirror map.
Then $\nabla^2 \Phi(x)$ is a diagonal matrix with $(\nabla^2 \Phi(x))_{ii} = \frac{1}{x_i + \delta}$. Let $F(t, \cdot) = -c(t)$ be a time-varying cost vector with $c(t) \ge 0$.
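Spelling out the Hessian computation: $\partial_{x_i} \Phi(x) = \log(x_i + \delta) + 1$, so
$$\partial_{x_i} \partial_{x_j} \Phi(x) = \frac{\mathbf{1}_{\{i=j\}}}{x_i + \delta}.$$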
Therefore (5) gives
$$x_i'(t) = (x_i(t) + \delta) \big( -c_i(t) + \hat{\mu}(t) - \hat{\lambda}_i(t) \big),$$
where $\hat{\mu}(t) \in \mathbb{R}$ and $\hat{\lambda}(t) \in \mathbb{R}^n_+$ are the dual variables corresponding to the constraints $\sum_{i=1}^n x_i = 1$ and $x \ge 0$, respectively (cf. the description of $N_K$ above).
This is precisely the algorithm described before (as an exercise, one might try rewriting it to match exactly), and (6) constitutes half of the analysis. In the next lecture, we will discuss some general methods for the other half: Tracking the movement cost.
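To make the dynamics concrete, here is a discrete-time sketch (our own illustrative discretization with step size $\eta$, not the exact algorithm from the lecture): each step performs the mirror update $\nabla \Phi(\hat{x}) = \nabla \Phi(x) - \eta c$ and then projects back onto the simplex in Bregman divergence, where the multiplier $\mu$ and the clipping at $0$ play the roles of $\hat{\mu}$ and $\hat{\lambda}$.

```python
import numpy as np

def mirror_step(x, c, eta=0.1, delta=0.01):
    """One discrete mirror-descent step on the probability simplex with
    mirror map Phi(x) = sum_i (x_i + delta) log(x_i + delta).

    The unconstrained mirror update grad Phi(xhat) = grad Phi(x) - eta*c
    gives xhat_i + delta = (x_i + delta) * exp(-eta * c_i); the Bregman
    projection back onto the simplex multiplies by exp(mu) (the dual of
    the sum constraint) and clips at 0 (the duals of x >= 0)."""
    shifted = (x + delta) * np.exp(-eta * c)

    def mass(mu):  # total probability after projecting with multiplier mu
        return np.sum(np.maximum(shifted * np.exp(mu) - delta, 0.0))

    lo, hi = -50.0, 50.0          # bisect for the mu with mass(mu) = 1
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mass(mid) < 1.0 else (lo, mid)
    return np.maximum(shifted * np.exp(0.5 * (lo + hi)) - delta, 0.0)

# Example: cost concentrated on coordinate 0 pushes mass onto the others.
x = np.full(4, 0.25)
for _ in range(30):
    x = mirror_step(x, np.array([1.0, 0.0, 0.0, 0.0]))
print(x)   # x[0] has decayed; the remaining coordinates share its mass
```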