Navigating a convex body online

 

In the last lecture, we saw some algorithms that, while simple and appealing, were somewhat unmotivated. We now try to derive them from general principles, and in a setting that will allow us to attack other problems in competitive analysis.

Gradient descent: The proximal view

Let us first recall the upper bound we derived for the regret in the last lecture:

$$\mathrm{R}_T \le \sum_{t=1}^{T} \left[ \|p_t - p_{t+1}\|_1 + \langle p_{t+1}, \ell_t \rangle - \ell_t(x_T^*) \right]. \tag{1}$$

Trying to minimize this expression leads to the question of how we should update our probability distribution $p_t \mapsto p_{t+1}$ to simultaneously be stable (control the first term) and competitive (the second term).

A very natural algorithm in this setting is gradient descent. Indeed, suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable, and consider the optimization

$$\min \left\{ \tfrac{1}{2} \|x - x_0\|^2 + \eta f(x) : x \in \mathbb{R}^n \right\},$$

where $\eta > 0$ is a small constant and $\|\cdot\|$ denotes the Euclidean norm. Then first-order optimality conditions dictate that the optimizer satisfies $x = x_0 - \eta \nabla f(x_0) + O(\eta^2)$.
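As a quick sanity check, here is a minimal numerical sketch (assuming numpy and scipy, which are not part of the lecture) comparing the proximal point to the explicit gradient step; the gap should shrink like $O(\eta^2)$:

```python
# A minimal sketch (illustrative, not from the lecture): compare the proximal
# point argmin { (1/2)||x - x0||^2 + eta*f(x) } with the explicit gradient
# step x0 - eta*grad_f(x0). The gap should scale like O(eta^2).
import numpy as np
from scipy.optimize import minimize

def f(x):
    return np.log(np.sum(np.exp(x)))      # a smooth convex test function

def grad_f(x):
    e = np.exp(x - x.max())               # its gradient is the softmax map
    return e / e.sum()

x0 = np.array([0.3, -1.0, 0.7])
for eta in [1e-1, 1e-2, 1e-3]:
    prox = minimize(lambda x: 0.5 * np.sum((x - x0) ** 2) + eta * f(x), x0).x
    step = x0 - eta * grad_f(x0)
    print(f"eta={eta:.0e}  gap={np.linalg.norm(prox - step):.2e}")
```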

Two questions immediately arise: Why do we use the Euclidean norm when our reference problem (1) refers to the $\ell_1$ norm, and if $x$ is meant to encode a probability distribution, how do we maintain this constraint for $x$?

Projected gradient descent

Let’s address the feasibility problem first. Suppose $K \subseteq \mathbb{R}^n$ is a closed convex set and $F : \mathbb{R}^n \to \mathbb{R}^n$ is a sufficiently smooth vector field (think of $F = -\nabla f$). How should we move in the direction of $F$ while simultaneously remaining inside $K$?

The unconstrained flow along $F$ can be described as a trajectory $x : [0,\infty) \to \mathbb{R}^n$ given by $x'(t) = F(x(t))$.

The most natural way to keep this flow inside K is to project back into the body whenever we leave. Define the Euclidean projection

$$P_K(y) := \operatorname{argmin} \left\{ \|y - z\|^2 : z \in K \right\},$$

and the result of taking an infinitesimal step in direction $v$ and then projecting:
$$\Pi_K(x, v) := \lim_{\varepsilon \to 0} \frac{P_K(x + \varepsilon v) - x}{\varepsilon}.$$

Then the projected dynamics looks like $x'(t) = \Pi_K(x(t), F(x(t)))$.
This is an example of a projected dynamical system. Having now addressed feasibility, we are left to consider the role of the Euclidean norm.
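Before moving on, here is a concrete sketch of the simplest discretization of a projected dynamical system: take a small Euler step along $F$, then project back into $K$. The choice of $K$ as the probability simplex (anticipating the application later), the linear cost flow, the step size, and the horizon are all illustrative assumptions; the projection routine is the standard sort-and-threshold method.

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection P_K(y) onto {z >= 0 : sum(z) = 1},
    via the standard sort-and-threshold method."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(y) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(y + theta, 0.0)

def projected_flow(F, x0, h=1e-2, steps=2000):
    """Euler discretization of x'(t) = Pi_K(x(t), F(x(t))):
    step along F, then project back into K."""
    x = x0.copy()
    for _ in range(steps):
        x = project_simplex(x + h * F(x))
    return x

c = np.array([0.5, 0.2, 0.9])                    # a fixed linear cost
x = projected_flow(lambda x: -c, np.ones(3) / 3)
print(x)  # the flow pushes all mass onto the cheapest coordinate
```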

A Riemannian version

One can view $\Pi_K(x, \cdot)$ as a function on the tangent space at $x$. To specify such a projection, we only need a local Euclidean structure. An inner product $\langle \cdot, \cdot \rangle_x$ that varies smoothly over $x \in K$ is precisely a Riemannian metric.

Equivalently, we specify, at every point $x \in K$, a smoothly varying positive-definite matrix $M(x)$ so that

$$\langle u, v \rangle_{M,x} = \langle u, M(x)\, v \rangle, \qquad \|u\|_{M,x}^2 = \langle u, u \rangle_{M,x}.$$

The associated projection operator and its infinitesimal version are then given by

$$P_K^M(y; x) := \operatorname{argmin} \left\{ \|y - z\|_{M,x}^2 : z \in K \right\}, \qquad \Pi_K^M(x, v) := \lim_{\varepsilon \to 0} \frac{P_K^M(x + \varepsilon v; x) - x}{\varepsilon}.$$

This leads to the dynamical system:

$$x'(t) = \Pi_K^M(x(t), F(x(t))), \qquad x(0) = x_0 \in K.$$
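To see what the metric changes, here is a sketch computing $P_K^M$ with a generic constrained solver (scipy's SLSQP); the diagonal metric and the simplex as $K$ are illustrative assumptions. Reweighting coordinates changes where the same point $y$ gets projected:

```python
import numpy as np
from scipy.optimize import minimize

def weighted_projection(y, M_diag):
    """P^M_K(y; x) for K the simplex and a diagonal metric M(x) = diag(M_diag):
    minimize ||y - z||^2_{M,x} = sum_i M_diag[i] * (y_i - z_i)^2 over z in K."""
    n = len(y)
    res = minimize(
        lambda z: np.dot(M_diag, (y - z) ** 2),
        np.ones(n) / n,
        method="SLSQP",
        bounds=[(0.0, None)] * n,
        constraints=[{"type": "eq", "fun": lambda z: z.sum() - 1.0}],
    )
    return res.x

y = np.array([0.6, 0.6, -0.2])
print(weighted_projection(y, np.ones(3)))              # Euclidean projection
print(weighted_projection(y, np.array([10.0, 1, 1])))  # coordinate 0 is "stiff"
```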

Lyapunov functions

The problem with stating things at this level of generality is that even when $F = -\nabla f$ is the negative gradient of a convex function $f : \mathbb{R}^n \to \mathbb{R}$, we don’t have a global way of controlling the convergence of $f(x(t))$ to $\min \{ f(x) : x \in K \}$. In the Euclidean setting ($M(x) \equiv \mathrm{Id}$), there is a natural Lyapunov function: If $f$ is convex and $f(x^*) = \min \{ f(x) : x \in K \}$, then for every $x \in K$:

$$\langle \nabla f(x), x - x^* \rangle \ge 0.$$

In other words, gradient descent always makes progress toward $x^*$.

If $x'(t) = \Pi_K(x(t), -\nabla f(x(t)))$, then in the language of competitive analysis, the quantity $\frac{1}{2} \|x(t) - x^*\|^2$ acts as a potential function (a global measure of progress).
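Indeed, writing $x'(t) = -\nabla f(x(t)) - p(t)$, where $p(t) \in N_K(x(t))$ is the outward normal component removed by the projection (so that $\langle p(t), x(t) - x^* \rangle \ge 0$), one checks directly:

$$\frac{d}{dt}\, \frac{1}{2} \|x(t) - x^*\|^2 = \langle x'(t), x(t) - x^* \rangle \le \langle -\nabla f(x(t)), x(t) - x^* \rangle \le 0.$$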

We will consider geometries that come equipped with such a Lyapunov function. In a sense that can be formalized in various ways, these are the Hessian structures on $\mathbb{R}^n$, i.e., those arising when $M(x) = \nabla^2 \Phi(x)$ for some strictly convex function $\Phi : K \to \mathbb{R}$.

Mirror descent dynamics

Consider now a compact, convex set $K \subseteq \mathbb{R}^n$, a strictly convex function $\Phi : K \to \mathbb{R}$, and a continuous time-varying vector field $F : [0,\infty) \times K \to \mathbb{R}^n$. We will refer to continuous-time mirror descent as the dynamics specified by

$$x'(t) = \Pi_K^{\nabla^2 \Phi}(x(t), F(t, x(t))), \qquad x(0) = x_0 \in K.$$

We will sometimes refer to Φ as the mirror map.

As one might expect, we can decompose $x'(t)$ into two components: one flowing in the direction $F(t, x(t))$, and the other arising from the normal forces that keep $x(t)$ inside $K$. Recall that the normal cone to $K$ at $x$ is given by

$$N_K(x) = \left\{ p \in \mathbb{R}^n : \langle p, y - x \rangle \le 0 \ \text{ for all } y \in K \right\}.$$

This is the set of directions that point out of the body $K$. The next theorem is proved in the paper k-server via multiscale entropic regularization.

Theorem (MD): If $\nabla^2 \Phi(x)^{-1}$ is continuous on $K$, then for any $x_0 \in K$, there is an absolutely continuous trajectory $x : [0,\infty) \to K$ satisfying

$$\nabla^2 \Phi(x(t))\, x'(t) \in F(t, x(t)) - N_K(x(t)), \qquad x(0) = x_0. \tag{2}$$

Moreover, if $\nabla^2 \Phi(x)$ is Lipschitz on $K$ and $F$ is locally Lipschitz, then the solution is unique.

Note that (2) is a differential inclusion: We only require that the derivative lies in the specified set.
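Although these notes work in continuous time, it may help to see the standard discrete-time counterpart of (2) for the entropy mirror map on the simplex, namely the multiplicative-weights update (a sketch; the step size and costs below are illustrative choices, not from the lecture):

```python
import numpy as np

def entropic_mirror_descent(costs, eta=0.2):
    """Discrete-time analogue of the dynamics (2) for the entropy mirror map on
    the simplex: a multiplicative-weights step, with normalization playing the
    role of the normal-cone correction."""
    n = costs.shape[1]
    x = np.ones(n) / n
    for c in costs:
        x = x * np.exp(-eta * c)   # gradient step in the mirror (dual) variables
        x = x / x.sum()            # map back to the simplex
    return x

costs = np.tile([[1.0, 0.0, 0.5]], (50, 1))
print(entropic_mirror_descent(costs))  # mass flows to the cheapest coordinate
```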

Lagrangian multipliers

If $K$ is a polyhedron, then one can write

$$K = \left\{ x \in \mathbb{R}^n : Ax \le b \right\}, \qquad A \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^m. \tag{3}$$

In this case, the normal cone at $x$ is the cone spanned by the normals of the tight constraints at $x$:

$$N_K(x) = \left\{ A^T y : y \ge 0 \text{ and } y^T (b - Ax) = 0 \right\}.$$
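For example, for the probability simplex $\Delta_n = \{ x \ge 0 : \sum_i x_i = 1 \}$ (the case we return to at the end of the lecture), the equality constraint contributes a free multiplier and each tight constraint $x_i = 0$ contributes a nonnegative one:

$$N_{\Delta_n}(x) = \Big\{ \mu \mathbf{1} - \sum_{i \,:\, x_i = 0} \hat{\lambda}_i e_i \;:\; \mu \in \mathbb{R},\ \hat{\lambda}_i \ge 0 \Big\}.$$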

Consider now the application of Theorem (MD) to a polyhedron, with a solution $x : [0,\infty) \to K$ and $\lambda : [0,\infty) \to \mathbb{R}^n$ such that

$$\nabla^2 \Phi(x(t))\, x'(t) = F(t, x(t)) - \lambda(t), \tag{4}$$

and $\lambda(t) \in N_K(x(t))$ for all $t \ge 0$.

Let us consider the dual variables to the constraints (3): We can fix a measurable $\hat{\lambda} : [0,\infty) \to \mathbb{R}^m_+$ such that

$$A^T \hat{\lambda}(t) = \lambda(t), \qquad t \ge 0. \tag{5}$$

Now (5) and $\lambda(t) \in N_K(x(t))$ yield the complementary-slackness conditions: For all $i = 1, 2, \ldots, m$ and $t \ge 0$:

$$\hat{\lambda}_i(t) > 0 \implies \langle A_i, x(t) \rangle = b_i,$$

where $A_i$ is the $i$th row of $A$.

The Bregman divergence as a Lyapunov function

We promised earlier the existence of a functional to control the dynamics, and this is provided by the Bregman divergence associated to Φ:

$$D_\Phi(y; x) := \Phi(y) - \Phi(x) - \langle \nabla \Phi(x), y - x \rangle.$$
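For the negative entropy $\Phi(x) = \sum_i x_i \log x_i$, the Bregman divergence on the simplex is exactly the KL divergence; here is a quick numerical check (the helper below is illustrative, not from the lecture):

```python
import numpy as np

def bregman(phi, grad_phi, y, x):
    """D_Phi(y; x) = Phi(y) - Phi(x) - <grad Phi(x), y - x>."""
    return phi(y) - phi(x) - np.dot(grad_phi(x), y - x)

phi = lambda x: np.sum(x * np.log(x))        # negative entropy
grad_phi = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.1, 0.6, 0.3])
print(bregman(phi, grad_phi, y, x))          # Bregman divergence of entropy...
print(np.sum(y * np.log(y / x)))             # ...equals KL(y || x) on the simplex
```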

Let $x(t)$ be a trajectory satisfying (4). Then for any $y \in K$:

$$\begin{aligned}
\partial_t D_\Phi(y; x(t)) &= -\langle \nabla \Phi(x(t)), x'(t) \rangle + \langle \nabla \Phi(x(t)), x'(t) \rangle - \langle \partial_t \nabla \Phi(x(t)), y - x(t) \rangle \\
&= -\langle \nabla^2 \Phi(x(t))\, x'(t), y - x(t) \rangle \\
&= -\langle F(t, x(t)) - \lambda(t), y - x(t) \rangle \\
&\le -\langle F(t, x(t)), y - x(t) \rangle,
\end{aligned} \tag{6}$$

where the last inequality used that $y \in K$ and $\lambda(t) \in N_K(x(t))$.

If $F(t, x(t)) = -c(t)$ for a cost vector $c(t)$, say, then (6) reads $\partial_t D_\Phi(y; x(t)) \le \langle c(t), y - x(t) \rangle$, and this inequality aligns with a goal stated at the beginning of the first lecture: As long as the algorithm $x(t)$ is suffering more cost than some feasible point $y \in K$, we would like to be “learning” about $y$.

The algorithm from last time

In the next lecture, we will use this framework to derive and analyze algorithms for metrical task systems (MTS) and the k-server problem. For now, let us show that the algorithm and analysis from last time (for MTS on uniform metrics) fit precisely into our framework.

Suppose that $K = \left\{ x \in \mathbb{R}^n_+ : \sum_{i=1}^n x_i = 1 \right\}$ is the probability simplex and

$$\Phi(x) = \sum_{i=1}^n (x_i + \delta) \log (x_i + \delta)$$

is the negative entropy, shifted by some $\delta > 0$. In the next lecture, we will see why the negative entropy arises naturally as a mirror map.

Then $\nabla^2 \Phi(x)$ is the diagonal matrix with $(\nabla^2 \Phi(x))_{ii} = \frac{1}{x_i + \delta}$. Let $F(t, \cdot) = -c(t)$ for a time-varying cost vector $c(t) \ge 0$.

Therefore (4) gives

$$x_i'(t) = \left( x_i(t) + \delta \right) \left( -c_i(t) + \hat{\mu}(t) + \hat{\lambda}_i(t) \right).$$

Here, $\hat{\lambda}_i(t)$ is the Lagrangian multiplier corresponding to the constraint $x_i \ge 0$, and $\hat{\mu}(t)$ is the multiplier corresponding to the constraint $\sum_{i=1}^n x_i = 1$.
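For intuition, here is a rough Euler discretization of this dynamics (a sketch, not the exact algorithm from the lecture): $\hat{\mu}$ is chosen to conserve total mass, and a greedy clip-and-renormalize step stands in for the multipliers $\hat{\lambda}_i$ that keep the coordinates nonnegative.

```python
import numpy as np

def simplex_dynamics(costs, delta=0.05, h=0.1):
    """Rough Euler discretization of x_i' = (x_i + delta)(-c_i + mu + lambda_i).
    mu conserves sum(x); clipping and renormalizing is a crude stand-in for the
    multipliers lambda_i >= 0 enforcing x_i >= 0."""
    n = costs.shape[1]
    x = np.ones(n) / n
    for c in costs:
        w = x + delta
        mu = np.dot(w, c) / w.sum()   # makes sum of (x_i + delta)(mu - c_i) zero
        x = x + h * w * (mu - c)
        x = np.maximum(x, 0.0)        # "lambda" step: clamp at the boundary
        x = x / x.sum()               # restore total mass 1
    return x

costs = np.tile([[1.0, 0.0, 0.0]], (200, 1))
print(simplex_dynamics(costs))  # the coordinate incurring cost loses its mass
```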

This is precisely the algorithm described before (as an exercise, one might try rewriting it to match exactly), and (6) constitutes half of the analysis. In the next lecture, we will discuss some general methods for the other half: Tracking the movement cost.