Natural gradient descent and mirror descent
16 Feb 2018In this post, we discuss the natural gradient, which is the direction of steepest descent in a Riemannian manifold [1], and present the main result of Raskutti and Mukherjee (2014) [2], which shows that the mirror descent algorithm is equivalent to natural gradient descent in the dual Riemannian manifold.
Motivation
Steepest descent line search
Suppose we want to minimize some convex differentiable function \(f: \Theta \rightarrow {\mathbb{R}}\) . A common approach is to do a line search using the steepest descent algorithm where \(\alpha_t\) is the stepsize and the direction we step in is the gradient of the function.
However, in applications where the parameters lie on a nonEuclidean Riemannian manifold^{1} with metric tensor \(H = (h_{ij})\) (an inner product on the tangent space of the manifold at a point), the direction of steepest descent is actually the natural gradient:
where \(H\) is the Fisher information matrix for a statistical manifold. This update equation is called natural gradient descent.
Natural Gradients
Recall that in an Euclidean space (with a orthonormal coordinate system) \(\Theta={\mathbb{R}}^n\) , the squared length of a small incremental vector \(d\theta\) that connects \(\theta\) and \(\theta + d\theta\) is given by
But for a Riemannian manifold, the squared length is given by the quadratic form
where \(H = (h_{ij})\) is an \(n \times n\) matrix called the Riemannian metric tensor (which depends on \(\theta\). In the Euclidean case, we just get that \(H = I_n\).
In this Riemannian manifold, the direction of steepest descent of a function \(f(\theta)\) at \(\theta \in \Theta\) is defined by the vector \(d\theta\) that minimizes \(f(\theta + d\theta)\), where \(d\theta\) has a fixed length, i.e., under the constraint
Theorem (Amari, 1998) [1]:
The steepest descent direction of \(f(\theta)\) in a Riemannian space is given by
where \(H^{1} = (h^{ij})\) is the inverse of the metric \(H = (h_{ij})\), and \(\nabla f\) is the conventional gradient:
Proof: Write \(d\theta = \epsilon v\), i.e., we decompose the vector into its magnitude \(\epsilon\) and direction \(v\), and search for the vector \(v\) that decreases the value of the function the most.
Writing the linearized function, we have
subject to the constraint that \(v^2 = v^\top H v =1\), i.e., \(v\) is a unit vector representating just the direction.
Setting the derivative with respect to \(v\) of the Lagrangian to 0, i.e.,
so we have that
which is the direction that causes the function’s value to decrease the most. Lastly, plugging \(v\) above into the constraint \(v^\top H v = 1\), we can solve for \(\lambda\).
Thus, the direction of steepest descent is \(H^{1} \nabla f(\theta)\), i.e., the natural gradient.
Lastly, note that in a Euclidean space, the metric tensor is just the identity matrix, so we get back standard gradient descent.
Mirror descent and Bregman divergence
Mirror descent is a generalization of gradient descent that takes into account nonEuclidean geometry. The mirror descent update equation is
Here \(\Psi: {\mathbb{R}}^p \times {\mathbb{R}}^p \rightarrow {\mathbb{R}}_+\), and when
we get back the standard gradient descent update^{2}.
The proximity function \(\Psi\) is commonly chosen to be the Bregman divergence. Let \(G: \Theta \rightarrow {\mathbb{R}}\) be a strictly convex, twicedifferentiable function, then the Bregman divergence \(B_G: \Theta \times \Theta \rightarrow {\mathbb{R}}^+\) is:
Equivalence of natural gradient and mirror descent
Bregman divergences and convex duality
The convex conjugate^{3} of a function \(G\) is defined to be
Now, if \(G\) is a lower semicontinuous function, then we have that \(G\) is the convex conjugate of \(H\). This implies that \(G\) and \(H\) have a dual relationship. If \(G\) is strictly convex and twice differentiable, so is \(H\). Let \(g = \nabla G\) and \(h = \nabla H\). Then \(g = h^{1}\).
Now let \(\mu = g(\theta) \in \Phi\) be the point at which the supremum for the dual function is attained be the dual coordinate system to \(\theta\). The dual Bregman divergence \(B_H: \Phi \times \Phi \rightarrow {\mathbb{R}}^+\) induced by the strictly convex differentiable function \(H\)
The dual coordinate relationship gives us

\(B_H(\mu,\mu’) = B_G(h(\mu’), h(\mu))\)

\(B_G(\theta,\theta’) = B_H(g(\theta’), g(\theta))\)
For exponential families, the function \(G\) is the log partition function^{4}. We’ll now work through some examples.
Example: [Normal]
Consider the family \(N(\theta, I_p)\). The log partition function is
and the conjugate function of \(G\) is
since the expression in the supremum is maximized when \(\theta = \mu\) (so we plug this in for \(\theta\) to get the second line). Then we can easily write the Bregman divergence induced by \(G,H\) as
Example: [Poisson]
Now suppose we have the family \(\text{Poisson}(\exp(\theta))\). We have
where we plugged in \(\theta = \log \mu\) above. So the Bregman divergence induced by \(G,H\) is
Bregman divergences and Riemannian manifolds
Now that we have introduced what the Bregman divergence of an exponential family looks like (with respect to the log partition function \(G\) and its dual Bregman divergence with respect to \(H\), where \(H\) is the conjugate of \(G\), we are ready to discuss the respective primal and dual Riemannian manifolds induced by these divergences.

Primal Riemannian manifold: \((\Theta, \nabla^2 G)\), where \(\nabla^2 G \succ 0\) (since \(G\) is strictly convex and twice differentiable).

Dual Riemmanian manifold: \(B_H(\theta,\theta’)\) induces the Riemannian manifold \((\Phi, \nabla^2 H)\), where \(\Phi = {\mu: g(\theta) = \mu, \theta \in \Theta}\), i.e., the image of \(\Theta\) under the continuous map \(g = \nabla G\).
Theorem (Raskutti and Mukherjee, 2014) [2]:
The mirror descent step with Bregman divergence defined by \(G\) applied to the function \(f\) in the space \(\Theta\) is equivalent to the natural gradient step along the dual Riemannian manifold \((\Phi, \nabla^2 H)\).
(Proof Sketch):
Step 1: Rewrite the mirror descent update in terms of the dual Riemannian manifold coordinates \(\mu \in \Phi\):
since minimizing by differentiation and the dual coordinates relationship gives us
Step 2: Apply the chain rule:
and plugging this in to the update above, we get
which corresponds to the natural gradient step.
Highlevel summary:
The main result of [2] is that the mirror descent update with Bregman divergence is equivalent to the natural gradient step along the dual Riemannian manifold.
What does this mean? For natural gradient descent we know that

the natural gradient is the direction of steepest descent along a Riemannian manifold, and

it is Fisher efficient for parameter estimation (achieves the CRLB, i.e., the variance of any unbiased estimator is at least as high as the inverse Fisher information),
but neither of these things are known for mirror descent.
The paper also outlines potential algorithmic benefits: natural gradient descent is a secondorder method, since it requires the computation of an inverse Hessian, while the mirror descent algorithm is a firstorder method and only involves gradients.
References

Amari. Natural gradient works efficiently in learning. 1998

Raskutti and Mukherjee. The information geometry of mirror descent. 2014.
Footnotes

a smooth manifold equipped with an inner product on the tangent space \(T_p \Theta\) ↩

this can be seen by writing down the projected gradient descent update and rearranging terms ↩

aka Fenchel dual, Legendre transform. Also e.g. for a convex optimization function, we can essentially write the dual problem (i.e., the inf of the Lagrangian), as the negative conjugate function plus linear functions of the Langrange multipliers. ↩

see the table listing the log partition function for exponential family distributions ↩