# Why use iterative methods?

One of the most fundamental math problems is solving a linear system: $Ax = b,$ where $$A \in \mathbb{R}^{n \times m}$$ is a matrix, and $$x \in \mathbb{R}^n$$ and $$b \in \mathbb{R}^m$$ are vectors; we are interested in the solution, $\tilde x = A^{-1} b,$ given by the matrix-vector product of the pseudoinverse of the matrix $$A$$ and the vector $$b$$.

Now, for the rest of this post, we’ll consider the case where the number of rows and columns are equal, i.e., $$n=m$$ and further assume that $$A$$ is a symmetric, positive definite (s.p.d.) matrix. This guarantees that an inverse exists.

Typically (without making assumptions on the structure or sparsity of the matrix), matrix inversion via Gaussian elimination (a.k.a. LU factorization) requires $$O(n^3)$$ computation. There are many other types of matrix factorization algorithms, including the Cholesky decomposition, which applies to s.p.d. matrices that can be used to solve the problem. However, when the dimension of the problem is very large—for instance, on the order of millions or even larger—matrix factorization becomes too computationally expensive. Trefethen and Bau [3, pp. 243] give the following dates and sizes for what was considered “large” in terms of so-called direct, dense matrix inversion methods:

• 1950: $$n = 20$$ (Wilkinson)
• 1965: $$n = 200$$ (Forsythe and Moler)
• 1980: $$n = 2000$$ (LINPACK)
• 1995: $$n = 20000$$ (LAPACK) - LAPACK remains a popular for linear algebra computations today.

Some problems where large linear systems that come up in practice include Kalman filters, problems large covariance matrices (e.g., in genetics), large kernel matrices the size of the square of the number of data points, large regression problems, spectral methods for large graphs.

In these cases, iterative methods, such as conjugate gradient, are popular, especially when the matrix $$A$$ is sparse. In direct matrix inversion methods, there are typically $$O(n)$$ steps, each requiring $$O(n^2)$$ computation; iterative methods aim to cut down on the running time of each of these numbers, and the performance typically depends on the spectral properties of the matrix.

In this post, we’ll give an overview of the linear conjugate gradient method, and discuss some scenarios where it does very well, and others where it falls short.

The basic idea of conjugate gradient is to find a set of $$n$$ conjugate direction vectors, i.e., a set of vectors $$p_1,\ldots,p_n$$ satisfying

$p_i \, A \, p_j = 0, \quad \forall i \neq j.$

This set of vectors has the property of being linearly independent and spanning $$\mathbb{R}^n$$.

Before discussing the conjugate gradient algorithm, we’ll first present the conjugate directions algorithm, which assumes that we are given a set of $$n$$ conjugate vectors. Then we’ll see that conjugate gradient is just a special type of conjugate directions algorithm that can generate the next conjugate vector using only the current vector, a property which gives rise to its computational efficiency.

## Conjugate directions algorithm

Suppose we are given the set of conjugate vectors $$p_1,\ldots,p_n$$. The conjugate directions algorithm constructs iterates stepping in these directions:

$x_{k+1} = x_k + \alpha_k p_k,$

where $$p_k$$ is the $$k$$th conjugate direction, and the step size $$\alpha_k$$ is obtained as the minimizer of the convex quadratic equation along $$x_k + \alpha p_k$$:

$\phi(x) = \frac{1}{2} x^\top A x - b x.$

Note that the gradient of this equation recovers the solution of original equation of interest

$\nabla \phi(x) = A x - b =: r(x).$

The above equation, which we define as $$r(x)$$ is called the residual. Furthermore, define the residual at iterate $$x_k$$ as $$r_k := r(x_k)$$.

The explicit step-size is given by

$\alpha_k = -\frac{ r_k^\top \, p_k }{p_k \, A \, p_k}.$

### Properties of the conjugate directions algorithm

Below, we list two properties that hold for any conjugate directions algorithm. The proofs are in Theorem 5.1 and Theorem 5.2 of Nocedal and Wright [2].

#### Conjugate directions converges in at most $$n$$ steps

An important result is the following: any sequence of vectors generated by the conjugate direction algorithm converges to the solution $$\tilde x$$ in at most $$n$$ steps.

#### Expanding subspace minimization

Another important property is that the residual at the current iterate, $$r_k$$, is orthogonal to all previous conjugate vectors: i.e.,

$r_k^\top p_i = 0, \quad \forall i=0,\ldots,k-1.$

Furthermore, each iterate $$x_k$$ is the minimizer of the quadratic function $$\phi$$ over the set defined by vectors in the span of the previous search directions plus the initial point:

$\{ x : x = x_0 + \mathrm{span}(p_0, \ldots, p_{k-1})\} .$

How do generate a set of conjugate directions? Methods such as computing the eigenvectors (which are conjugate) of $$A$$ are expensive. The conjugate gradient algorithm generates a new $$p_{k+1}$$ using the vector $$p_k$$ with the property that it is conjugate to all existing vectors. As a result, little computation or storage is needed for this algorithm.

The initial conjugate vector $$p_0$$ is set to the negative residual at the initial point, i.e., $$p_0 = - r_0$$; recall that the residual is defined to be the gradient of the quadratic objective function, and so, this is the direction of steepest descent.

Each subsequent point is a linear combination of the negative residual $$-r_k$$ and the previous direction $$p_{k-1}$$:

$p_k = -r_k + \beta_k \, p_{k-1},$

where the scalar

$\beta_k = \frac{r_k^\top A p_{k-1}}{p_{k-1}^\top A p_{k-1}}$

is such that the vector $$p_k$$ along with the previous search directions are conjugate. The above expression for $$\beta_k$$ is obtained by premultiplying $$p_k$$ by $$p_{k-1} A$$ and the condition that $$p_{k-1}^\top A p_k = 0$$.

The final algorithm for conjugate gradient involves a few more simplifications. First, the step size can be simplified by recognizing that $$r_k^\top p_i = 0$$ (from the expanding subspace minimization theorem), and substituting $$p_k = -r_{k} + \beta_{k} p_{k-1}$$ to get

$\alpha_k = \frac{r_k^\top r_k}{p_k^\top A p_k}.$

Similarly, we can rewrite the scalar for the linear combination as

$\beta_k = \frac{r_{k}^\top r_{k}}{r_{k-1}^\top r_{k-1}}.$

To summarize, the algorithm is:

1. Input: initial point $$x_0$$
2. Initialize $$r_0 = A x_0 - b, \,\, p_0 = - r_0, \,\, k=0$$
3. While $$r_k \neq 0$$:
1. Construct the step size $\alpha_k = \frac{r_k^\top r_k}{p_k^\top \, A \, p_k}$
2. Construct next iterate by stepping in direction $$p_k$$ $x_{k+1} = x_k + \alpha_k \, p_k$
3. Construct new residual $r_{k+1} = r_k + \alpha_k \, A \, p_k$
4. Construct scalar for linear combination for next direction $\beta_{k+1} = \frac{r_{k+1}^\top \, r_{k+1}}{r_k^\top \, r_k}$
5. Construct next conjugate vector $p_{k + 1} = -r_{k+1} + \beta_{k+1} \, p_k$
6. Increment $$k = k + 1$$

## Properties of the conjugate gradient algorithm

### Convergence in at most $$r$$ steps

The directions produced by conjugate gradient are conjugate vectors and therefore converge to the solution in at most $$n$$ steps [2, Theorem 5.3]. However, we can say a few more things.

If the matrix has $$r$$ unique eigenvalues, then the algorithm converges in at most $$r$$ steps [2, Theorem 5.4].

### Clustered eigenvalues

If some eigenvalues are clustered around a value, it’s almost as if those were 1 unique eigenvalue, and thus, the convergence will generally be better. For instance, if there are 5 large eigenvalues and the remaining eigenvalues are eigenvalues clustered around 1, then it is almost as if there were only 6 distinct eigenvalues, and therefore we’d expect the error to be small after about 6 iterations.

### Numerical stability vs computational tradeoffs

Conjugate gradient is only recommended for large problems, as it is less numerically stable than direct methods, such as Gaussian elimination. On the other hand, direct matrix factorization methods require more storage and time.

### Aside: Krylov subspace methods

As an aside, conjugate gradient is a member of the class of Krylov subspace methods, which generate elements from the $$k$$th Krylov subspace generated by the vector $$b$$ defined by $\mathcal{K}_k = \mathrm{span} \{ b, Ab, \ldots, A^{k-1} b \}.$

These methods have the property of converging in the a number of steps equal to the number of unique eigenvalues. Other popular Krylov subspace methods include GMRES, Lanczos, and Arnoldi iterations. More details can be found in Trefethen and Bau [3].

## Preconditioning

The goal of preconditioning is to construct a new matrix / linear system, such that the problem is better-conditioned, i.e., the condition number (the ratio of largest to smallest singular values with respect to Euclidean norm) is decreased or the eigenvalues are better clustered (for the case of conjugate gradient). Choosing an appropriate preconditioned leads to better convergence of the algorithm.

Let $$C$$ be a nonsingular matrix. Then applying the change of variables

$\hat x = Cx \implies x = C^{-1} \hat x,$

to the original quadratic objective $$\phi$$ gives

$\hat \phi(\hat x) = \frac{1}{2} \hat x^\top (C^{-\top} A C^{-1}) \, \hat x - (C^{-\top} b)^\top \, \hat x,$

thereby solving the linear system

$(C^{-\top} A C^{-1}) \, \hat x = C^{-\top} b.$

Thus, the goal of the preconditioner is to pick a $$C$$ such that the matrix $$(C^{-\top} A C^{-1})$$ has good spectral properties (e.g., well-conditioned or clustered eigenvalues).

The modified conjugate gradient algorithm does not directly involve the matrix $$(C^{-\top} A C^{-1})$$. Instead it inolves the the matrix $$M = C^\top C$$ and solving 1 additional linear system with $$M$$.

One type of preconditioner used in practice is the Incomplete Cholesky Procedure, which involves constructing a sparse, approximate Cholesky decomposition that is cheap to compute: that is,

$A \approx \tilde L \tilde L^\top =: M,$

where we’ve chosen $$C = \tilde L^\top$$. The idea is with this choice of preconditioner, we’ll have

$C^{-\top} A C^{-1} \approx \tilde L^{-1} \tilde L \tilde L^\top \tilde L^{-\top} = I.$

We won’t explore preconditioners further in this post, but additional details can be found in Nocedal and Wright [2, pp. 118-120].

# Implementation and empirical study

We implement the conjugate gradient algorithm and study a few example problems. The full code for this post is linked at the bottom of this post.

## Julia code snippet

Below is a Julia code snippet implementing a basic conjugate gradient algorithm.

The inner loop iteration is:

function iterate_CG(xk, rk, pk, A)
"""
Basic iteration of the conjugate gradient algorithm

Parameters:
xk: current iterate
rk: current residual
pk: current direction
A: matrix of interest
"""

# construct step size
αk = (rk' * rk) / (pk' * A * pk)

# take a step in current conjugate direction
xk_new = xk + αk * pk

# construct new residual
rk_new = rk + αk * A * pk

# construct new linear combination
betak_new = (rk_new' * rk_new) / (rk' * rk)

# generate new conjugate vector
pk_new = -rk_new + betak_new * pk

return xk_new, rk_new, pk_new

end


The full algorithm is below:

function run_conjugate_gradient(x0, A, b; max_iter = 2000)
"""

Parameters:
x0: initial point
A: matrix of interest
b: vector in linear system (Ax = b)
max_iter: max number of iterations to run CG
"""

# initial iteration
xk = x0
rk = A * xk - b
pk = -rk

for i in 1:max_iter

xk, rk, pk = iterate_CG(xk, rk, pk, A)

# compute absolute error and break if converged
err = sum(abs.(rk)); push!(errors, err)
err < 10e-6 && break

end

println("Terminated in $i iterations") return xk end  ## Examples We first look at two small toy examples. The first is a diagonal matrix, where we set the eigenvalues. The second is a randomly generated s.p.d. matrix. Next we look at a larger-scale setting, where were randomly generate two matrices, one that is better-conditioned, and one that is worse-conditioned (i.e., smaller vs larger condition numebrs). We see that the latter takes many more iterations (than the first problem, but also than the dimension of the problem) due to numerical errors. ### Example 1: Small diagonal matrix In our first example, we construct a 5x5 diagonal matrix with 3 eigenvalues around 10, 1 eigenvalue equal to 2, and 1 equal to 1: A = zeros(5,5) A[1,1] = 10 A[2,2] = 10.1 A[3,3] = 10.2 A[4,4] = 2 A[5,5] = 1  Running the conjugate gradient, the algorithm terminates in 5 steps. We examine the error in the residual by computing the sum of the absolute residual components. Note that this is not the error in the solution but rather the error in the system. We see that the error drops sharply at 4 iterations, and then again in the 5th iteration. Example 1. Small diagonal matrix. ### Example 2: Small random s.p.d. matrix In our next example, we construct a 5x5 random s.p.d. matrix: A = rand(5,5) A = 0.5*(A+A') A = A + 5*I  with, in this particular instance, eigenvalues > eigvals(A) 5-element Array{Float64,1}: 4.46066817714894 4.979395596044116 5.2165679905921785 5.56276039133931 7.909981699390226  Here the error drops sharply at the 5th iteration, reaching the desired convergence tolerance. Example 2. Small random s.p.d. matrix. ### Example 3: Large random s.p.d. matrix (smaller condition number) Next, we examine a matrix of dimension 1000, and generate a random s.p.d matrix. A = rand(N,N) A = 0.5*(A+A') A = A + 50*I  Note that in the above snippet, the last line adds a large term to the diagonal, making the matrix “more nonsingular.” The condition number is: > cond(A) 14.832223763483976  The general rule of thumb is that we lose about 1 digit of accuracy with condition numbers on this order. The absolute error of the residual is plotted against iteration. Here the algorithm terminated in 11 iterations. Example 3. Large random s.p.d. matrix, better conditioned. ### Example 4: Large random s.p.d. matrix (larger condition number) Now we examine a matrix again of size 1000 but this time with a smaller multiplier on the diagonal (here we picked the smallest integer value such that the eigenvalues of the resulting matrix were all nonnegative): A = rand(N,N) A = 0.5*(A+A') A = A + 13*I  The condition number is much larger than the previous case, almost at $$10^3$$: > cond(A) 2054.8953570922145  The error plot is below: Example 4. Large random s.p.d. matrix, worse conditioned. Here, the algorithm converged after 80 iterations. A quick check of the “conjugate” vectors shows that the set of vectors is indeed not conjugate. ## Code on Github A Jupyter notebook with all of the code (running Julia v1.0) is available on Github. ## References 1. Shewchuk. An Introduction to Conjugate Gradient without the Agonizing Pain. 2. Nocedal and Wright. Numerical Optimization, Chapter 5. 3. Trefethen and Bau. Numerical Linear Algebra, Chapter 6. # Julia v1.0 First Impressions The Julia Programming Language has finally released version 1.0! I was still using version 0.6 when pretty much suddenly both v0.7 and v1.0 came out within a day apart (thanks JuliaCon). As a result, I didn’t get to actually try out a lot of the new features until now. Here I’ll document the installation process and some first impressions. On the Julia website is a nice blog post on new features of Julia 1.0. They recommend installing v0.7 first, which contains deprecation warnings, fixing those warnings, then upgrading to 1.0. Why is 1.0 so exciting? From the blog post linked above, it says The single most significant new feature in Julia 1.0, of course, is a commitment to language API stability: code you write for Julia 1.0 will continue to work in Julia 1.1, 1.2, etc. The language is “fully baked.” The core language devs and community alike can focus on packages, tools, and new features built upon this solid foundation. ## Installing Julia I haven’t been using Homebrew to install Julia since v0.5, and so I decided to go with that again this time. The first thing I did was download and install Julia from the website, and then create the symlink: sudo ln -s /Applications/Julia-1.0.app/Contents/Resources/julia/bin/julia /usr/local/bin/julia  Now typing in julia into the command prompt brings up the 1.0 REPL. The new REPL now starts up noticeably faster than before. I compared the startup time: for version 1.0, we have ~$ time julia -e ""

real    0m0.360s
user    0m0.114s
sys     0m0.153s


and for version 0.6, we have

~ \$ time /Applications/Julia-0.6.app/Contents/Resources/julia/bin/julia -e ""

real    0m0.829s
user    0m0.306s
sys     0m0.237s


## Installing Packages in the REPL

Next up is installing packages that I need for work. The first thing I always do is install IJulia, and so after trying Pkg.add("IJulia"), I was immediately reminded that there is now a new package manager.

To access Pkg, the new package manager, press the ] key in the REPL. Then we just type the package name at the prompt, e.g.,

(v1.0) pkg> add IJulia


Multiple packages can be installed in each line by specifying add Pack1 Pack2 Pack3. Typing help will display a list of commands. To return to the julia> prompt, press backspace on an empty prompt or Ctrl+C.

Now IJulia is installed. After starting up jupyter notebook and confirming that the v1.0.0 kernel was present, I then proceeded to install my usual packages:

1. Distributions, StatsBase - for basic probabilistic modeling and statistics
2. Seaborn (which installs all the Python dependencies) - plotting built on PyPlot (and PyCall to Matplotlib)
3. Optim - optimization package
4. Calculus, ForwardDiff - computing derivatives

The installation process was quite fast for each package. Overall, I like the simplicity of the interface for the new package manager.

While I was able to successfully install all of the packages above, I wasn’t able to successful install a few packages that either hadn’t yet upgraded to v0.7 or had some dependency that hadn’t yet upgraded.

Thus, if your work depends heavily on certain packages that have yet to be updated, it may be worth sticking with v0.6 for a little bit longer.

### Installing packages from Github

However, it’s worth checking the package online to see if you can directly download the latest code from Github. E.g., if you go to the package’s Github page and see that there have been indeed updates for v0.7, then try installing directly from Github.

After trying to import a few packages (PyPlot and Seaborn), I found that they had import errors during precompilation (e.g., ERROR: LoadError: Failed to precompile LaTeXStrings), which were usually due to some issue with a dependency of that package. For example, for PyPlot, I found that the package LaTeXStrings had indeed made fixes for 0.7, but this seemed to not be in the latest release, and so I removed the package and added the code on the master branch on Github:

(v1.0) pkg> add LaTeXStrings#master


This fixed the issue with both PyPlot and Seaborn immediately in that I could now import the packages.

However, I wasn’t able to get the plot function to actually work initially. After adding PyCall#master, I was able to get PyPlot and Seaborn both working in both the REPL and Jupyter notebooks.

As in the old package manager, we can still install unregistered packages by adding the URL directly, which is really nice for releasing your own research code for others to use.

## Final Thoughts

While the Julia v1.0 upgrade was very exciting, because many packages that I rely on aren’t yet 1.0-ready, I’m going to hold off on transitioning my research work and code to v1.0, and stick with v0.6 for a little longer. Supposedly the Julia developers will release a list of 1.0-ready packages soon.

From perusing the Julia channel, it seems like many packages are become almost ready. I plan to check back again in a month or so to see if packages I rely on are ready for use. Overall, I’m thrilled that v1.0 is finally released, and look forward to using it soon.

# My LaTeX Setup - vim, Skim, and latexmk

I’ll describe my current LaTeX setup – vim, skim, and latexmk. I currently write notes and papers using LaTeX on a Mac, though in a previous life, my setup on Linux was very similar (i.e., s/macvim/gvim/g and s/skim/zathura/g). I like being able to use my own customized text editor (with my favorite keyboard shortcuts) to edit a raw text file that I can then put under version control. Though I describe things with respect to vim, most of this setup can also achieved with some other text editor, e.g., emacs or sublime.

## The basic setup

### Text editor: vim

I use the text editor, vim. For writing documents, I often also like using macvim, which is basically vim but with a few extra GUI features, such as scrolling and text selection features.

For my pdf reader, I use the Skim app. The nice thing about Skim is that it will autoupdate the pdf after you compile the tex and keep the document in the same place. By contrast, Preview will refresh the document and reload it from the beginning of the document, which is quite cumbersome if you’d just edited text on page 5.

### Compilation: latexmk + vim keybindings

In vim, I have several keybindings set up in my .vimrc file. Currently, I compile the .tex file using latexmk. Files can be compiled to pdf with the command

latexmk -pdf [file]


In latexmk, you can also automatically recompile the tex when needed, but I prefer to manually compile the tex.

I have the following key bindings in vim:

• control-T compiles the tex to pdf using latexmk
• shift-T opens the pdf in Skim
• shift-C cleans the directory

For instance, the compile keybinding above is done by adding the following line to the .vimrc file

autocmd FileType tex nmap <buffer> <C-T> :!latexmk -pdf %<CR>


These types of key bindings can usually be setup with other text editors as well.

### Putting it all together

When I’m writing a document, I’ll often have my text editor full screen on the left and the Skim on the right with the pdf open. After making edits, I’ll press control-T to compile and the pdf will auto-refresh in Skim.

Example of macvim next to Skim setup. Monitor recommended :)

## Tools for writing LaTeX documents

There are a few other tools that I use for writing papers (or extensive notes on work) that I’ll list below that save me a lot of time.

### Plugins: snippets

In vim, I have several plugins I have in my .vimrc file. For writing papers, I use ()[snippets] extensively. My most used command is typing begin <tab>, which allows me to allows me to autocomplete the parts of LaTeX that I don’t want to spend time typing but use often. For instance, in the TeX below, snippets means I only have to type align* once.

\begin{align*}
\pi(\theta | x) = \frac{\pi(\theta) p(x | \theta)}
{\int \pi(\theta) p(x | \theta) d\theta}
\end{align*}


And should I need to change align* to just align, I only have to edit this in one of the parens instead of in both lines.

To use vim plugins (after installing), you just need to add the github page to your .vimrc file, e.g., for Ultisnips, add the line

Plugin 'SirVer/ultisnips'


and execute :PluginInstall while in command mode in vim.

### TeX skeleton

For writing quick notes, I have a TeX skeleton file that always loads when I start a new vim document, i.e., when executing vim notes.tex in the command line, a fully-functional tex skeleton with all the packages I normally use and font styles are automatically loaded. E.g., a version of the following:

\documentclass[11pt]{article}

\usepackage{amsmath, amssymb, amsthm, bbm}
\usepackage{booktabs, fancyhdr, verbatim, graphicx}

\title{title}
\author{Diana Cai}

\begin{document}

\maketitle

\end{document}



### subfiles

For longer documents, I use the subfiles package extensively. This allows me to split a document into smaller files, which can then be individually compiled into its own document. Subfiles inherit the packages, etc., that are defined in the main file in which the subfiles are inserted.

The advantage of using subfiles is that you can just edit the subfile you’re currently focusing on, which is much faster than compiling the entire document (especially one with many images and references). This also helps with reducing the number of git conflicts one encounters when collaborating on a document.

### macros

I have a set of macros that I commonly use that are imported when I load a .tex document. For writing papers, I’ll include a separate macros.tex file with all of my macro definitions that I reuse often. Some example macros I’ve defined:

• \reals for \mathbb{R}
• \E for \mathbb{E}
• \iid for \stackrel{\mathrm{iid}}{\sim}
• \convas for \stackrel{\mathrm{a.s.}}{\longrightarrow}

Also super convenient if you’re defining a variable that is always bold if, for instance, that variable is a vector.

### commenting

My collaborators and I often have a commenting file we use to add various comments (in different colors too!) in the margins, in the text, or highlighting certain regions of the text. An alternative is to use a LaTeX commenting package and then define custom commands controlling how the comments appear in the text.

## .vimrc

You can check out my .vimrc file on github to see how I set up keybindings, plugins, and my .tex skeleton file.

# Natural gradient descent and mirror descent

In this post, we discuss the natural gradient, which is the direction of steepest descent in a Riemannian manifold [1], and present the main result of Raskutti and Mukherjee (2014) [2], which shows that the mirror descent algorithm is equivalent to natural gradient descent in the dual Riemannian manifold.

# Motivation

Suppose we want to minimize some convex differentiable function $$f: \Theta \rightarrow {\mathbb{R}}$$ . A common approach is to do a line search using the steepest descent algorithm \begin{aligned} \theta_{t+1} = \theta_t - \alpha_t \nabla f(\theta),\end{aligned} where $$\alpha_t$$ is the step-size and the direction we step in is the gradient of the function.

However, in applications where the parameters lie on a non-Euclidean Riemannian manifold1 with metric tensor $$H = (h_{ij})$$ (an inner product on the tangent space of the manifold at a point), the direction of steepest descent is actually the natural gradient:

where $$H$$ is the Fisher information matrix for a statistical manifold. This update equation is called natural gradient descent.

Recall that in an Euclidean space (with a orthonormal coordinate system) $$\Theta={\mathbb{R}}^n$$ , the squared length of a small incremental vector $$d\theta$$ that connects $$\theta$$ and $$\theta + d\theta$$ is given by

But for a Riemannian manifold, the squared length is given by the quadratic form

where $$H = (h_{ij})$$ is an $$n \times n$$ matrix called the Riemannian metric tensor (which depends on $$\theta$$. In the Euclidean case, we just get that $$H = I_n$$.

In this Riemannian manifold, the direction of steepest descent of a function $$f(\theta)$$ at $$\theta \in \Theta$$ is defined by the vector $$d\theta$$ that minimizes $$f(\theta + d\theta)$$, where $$||d\theta||$$ has a fixed length, i.e., under the constraint

#### Theorem (Amari, 1998) [1]:

The steepest descent direction of $$f(\theta)$$ in a Riemannian space is given by

where $$H^{-1} = (h^{ij})$$ is the inverse of the metric $$H = (h_{ij})$$, and $$\nabla f$$ is the conventional gradient:

Proof: Write $$d\theta = \epsilon v$$, i.e., we decompose the vector into its magnitude $$\epsilon$$ and direction $$v$$, and search for the vector $$v$$ that decreases the value of the function the most.

Writing the linearized function, we have

subject to the constraint that $$||v||^2 = v^\top H v =1$$, i.e., $$v$$ is a unit vector representating just the direction.

Setting the derivative with respect to $$v$$ of the Lagrangian to 0, i.e.,

so we have that

which is the direction that causes the function’s value to decrease the most. Lastly, plugging $$v$$ above into the constraint $$v^\top H v = 1$$, we can solve for $$\lambda$$.

Thus, the direction of steepest descent is $$H^{-1} \nabla f(\theta)$$, i.e., the natural gradient.

Lastly, note that in a Euclidean space, the metric tensor is just the identity matrix, so we get back standard gradient descent.

# Mirror descent and Bregman divergence

Mirror descent is a generalization of gradient descent that takes into account non-Euclidean geometry. The mirror descent update equation is

Here $$\Psi: {\mathbb{R}}^p \times {\mathbb{R}}^p \rightarrow {\mathbb{R}}_+$$, and when

we get back the standard gradient descent update2.

The proximity function $$\Psi$$ is commonly chosen to be the Bregman divergence. Let $$G: \Theta \rightarrow {\mathbb{R}}$$ be a strictly convex, twice-differentiable function, then the Bregman divergence $$B_G: \Theta \times \Theta \rightarrow {\mathbb{R}}^+$$ is:

# Equivalence of natural gradient and mirror descent

#### Bregman divergences and convex duality

The convex conjugate3 of a function $$G$$ is defined to be

Now, if $$G$$ is a lower semi-continuous function, then we have that $$G$$ is the convex conjugate of $$H$$. This implies that $$G$$ and $$H$$ have a dual relationship. If $$G$$ is strictly convex and twice differentiable, so is $$H$$. Let $$g = \nabla G$$ and $$h = \nabla H$$. Then $$g = h^{-1}$$.

Now let $$\mu = g(\theta) \in \Phi$$ be the point at which the supremum for the dual function is attained be the dual coordinate system to $$\theta$$. The dual Bregman divergence $$B_H: \Phi \times \Phi \rightarrow {\mathbb{R}}^+$$ induced by the strictly convex differentiable function $$H$$

The dual coordinate relationship gives us

• $$B_H(\mu,\mu’) = B_G(h(\mu’), h(\mu))$$

• $$B_G(\theta,\theta’) = B_H(g(\theta’), g(\theta))$$

For exponential families, the function $$G$$ is the log partition function4. We’ll now work through some examples.

#### Example: [Normal]

Consider the family $$N(\theta, I_p)$$. The log partition function is

and the conjugate function of $$G$$ is

since the expression in the supremum is maximized when $$\theta = \mu$$ (so we plug this in for $$\theta$$ to get the second line). Then we can easily write the Bregman divergence induced by $$G,H$$ as

#### Example: [Poisson]

Now suppose we have the family $$\text{Poisson}(\exp(\theta))$$. We have

where we plugged in $$\theta = \log \mu$$ above. So the Bregman divergence induced by $$G,H$$ is

#### Bregman divergences and Riemannian manifolds

Now that we have introduced what the Bregman divergence of an exponential family looks like (with respect to the log partition function $$G$$ and its dual Bregman divergence with respect to $$H$$, where $$H$$ is the conjugate of $$G$$, we are ready to discuss the respective primal and dual Riemannian manifolds induced by these divergences.

• Primal Riemannian manifold: $$(\Theta, \nabla^2 G)$$, where $$\nabla^2 G \succ 0$$ (since $$G$$ is strictly convex and twice differentiable).

• Dual Riemmanian manifold: $$B_H(\theta,\theta’)$$ induces the Riemannian manifold $$(\Phi, \nabla^2 H)$$, where $$\Phi = {\mu: g(\theta) = \mu, \theta \in \Theta}$$, i.e., the image of $$\Theta$$ under the continuous map $$g = \nabla G$$.

#### Theorem (Raskutti and Mukherjee, 2014) [2]:

The mirror descent step with Bregman divergence defined by $$G$$ applied to the function $$f$$ in the space $$\Theta$$ is equivalent to the natural gradient step along the dual Riemannian manifold $$(\Phi, \nabla^2 H)$$.

(Proof Sketch):

Step 1: Rewrite the mirror descent update in terms of the dual Riemannian manifold coordinates $$\mu \in \Phi$$:

since minimizing by differentiation and the dual coordinates relationship gives us

Step 2: Apply the chain rule:

and plugging this in to the update above, we get

which corresponds to the natural gradient step.

## High-level summary:

The main result of [2] is that the mirror descent update with Bregman divergence is equivalent to the natural gradient step along the dual Riemannian manifold.

What does this mean? For natural gradient descent we know that

• the natural gradient is the direction of steepest descent along a Riemannian manifold, and

• it is Fisher efficient for parameter estimation (achieves the CRLB, i.e., the variance of any unbiased estimator is at least as high as the inverse Fisher information),

but neither of these things are known for mirror descent.

The paper also outlines potential algorithmic benefits: natural gradient descent is a second-order method, since it requires the computation of an inverse Hessian, while the mirror descent algorithm is a first-order method and only involves gradients.

### References

1. Amari. Natural gradient works efficiently in learning. 1998

2. Raskutti and Mukherjee. The information geometry of mirror descent. 2014.

### Footnotes

1. a smooth manifold equipped with an inner product on the tangent space $$T_p \Theta$$

2. this can be seen by writing down the projected gradient descent update and rearranging terms

3. aka Fenchel dual, Legendre transform. Also e.g. for a convex optimization function, we can essentially write the dual problem (i.e., the inf of the Lagrangian), as the negative conjugate function plus linear functions of the Langrange multipliers.

# The Johnson-Lindenstrauss Lemma

The so-called “curse of dimensionality” reflects the idea that many methods are more difficult in higher-dimensions. This difficulty may be due a number of issues that become more complicated in higher-dimensions: e.g., NP-hardness, sample complexity, or algorithmic efficiency. Thus, for such problems, the data are often first transformed using some dimensionality reduction technique before applying the method of choice. In this post, we discuss a result in high-dimensional geometry regarding how much one can reduce the dimension while still preserving $\ell_2$ distances.

## Johnson-Lindenstrauss Lemma

Given $N$ points $$z_1,z_2,\ldots,z_N \in \mathbb{R}^d$$, we want to find $N$ points $$u_1,\ldots,u_N \in \mathbb{R}^k$$, where $k \ll d$, such that the distance between points is approximately preserved, i.e., for all $i,j$, $||z_i - z_j||_2 \leq ||u_i - u_j||_2 \leq (1+\epsilon) ||z_i-z_j||_2,$ where $||z||_2 := \sqrt{\sum_l |z_{l}|^2}$ is the $\ell_2$ norm. Thus, we’d like to find some mapping $f$, where $u_i = f(z_i)$, that maps the data to a much lower dimension while satisfying the above inequalities.

The Johnson-Lindenstrauss Lemma (JL lemma) tells us that we need dimension $k = O\left(\frac{\log N}{\epsilon^2}\right)$, and that the mapping $f$ is a (random) linear mapping. The proof of this lemma is essentially given by constructing $u_i$ via a random projection, and then showing that for all $i,j$, the random variable $||u_i - u_j||$ concentrates around its expectation.

This argument is an example of the probabilistic method, which is a nonconstructive method of proving existence of entities with particular properties: if the probability of getting an entity with the desired property is positive, then this entity must be an element of the sample space and therefore exists.

### Proof of the JL lemma

We can randomly choose $k$ vectors $(x_n)_{n=1}^k$, where each $x_n \in \mathbb{R}^d$, by choosing each coordinate $x_{nl}$ of the vector $x_n$ randomly from the set $\left\{\left(\frac{1+\epsilon}{k}\right)^{\frac{1}{2}},-\left(\frac{1+\epsilon}{k}\right)^{\frac{1}{2}}\right\}.$

Now consider the mapping from $\mathbb{R}^d \rightarrow \mathbb{R}^k$ defined by the inner products of $z \in \mathbb{R}^d$ with the $k$ random vectors: $z \mapsto (z^\top x_1, \ldots, z^\top x_k)$ So, each vector $u_i = (z_i^\top x_1,\ldots, z_i^\top x_k)$ is obtained via a random projection. (Alternatively, we can think of the mapping as a random linear transformation given by a random matrix $A \in \mathbb{R}^{k \times d}$, where the $k$ vectors form the rows of the matrix, and $Az_i = u_i$.) The goal is to show that there exists some $u_1,\ldots,u_k$ that satisfies the above inequalities.

Fixing $i,j$, define $u := u_i - u_j$ and $z := z_i - z_j$, and $Y_n := \left(\sum_{l=1}^d z_l x_{nl} \right)^2$. Thus, we can write the squared $\ell_2$ norm of $u$ as % where $x_{nl}$ refers to the $l$th component of the $n$th vector.

Now we consider the random variable $Y_n$ in the sum. The expectation is $\mathbb{E}(Y_n) = \frac{1+\epsilon}{k} ||z||_2^2.$ So, the expectation of $$||u||_2^2$$ is

It remains to show that $||u||_2^2$ concentrates around its mean $\mu$, which we can do using a Chernoff bound. In particular, consider the two cases of $||u||_2^2 > (1 + \delta) \mu$ and $% $. Via a Chernoff bound, the probability of at least one of these two “bad” events occurring is upper bounded by $\Pr[\{ ||u||^2 > (1+\delta) \mu)\} \lor \{||u||^2 > (1-\delta) \mu \}] < \exp(-c \delta^2 k),$ for some $c > 0$.

Recall that $||u|| := ||u_i - u_j||$, and so there are $\binom{N}{2}$ such random variables. Now choosing $\delta = \frac{\epsilon}{2}$, the probability that any of these random variables is outside of $(1 \pm \frac{\epsilon}{2})$ of their expected value is bounded by

which follows from a union bound.

Choosing $k > \frac{8(\log N + \log c)}{\epsilon^2}$ ensures that with all $\binom{N}{2}$ variables are within $(1 \pm \frac{\epsilon}{2})$ of their expectations, i.e., $(1+\epsilon) ||z||_2^2$. Thus, rewriting this, we have that for all $i,j$,

which then implies the desired result:

### References

1. Arora and Kothari. High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction.
2. Dasgupta and Gupta. An Elementary Proof of Theorem of Johnson and Lindenstrauss.
3. Nelson. Dimensionality Reduction Notes Part 1 Part 2