16 Feb 2018 by Diana Cai
In this post, we discuss the natural gradient, which is the direction of
steepest descent in a Riemannian manifold [1], and present the main result of
Raskutti and Mukherjee (2014) [2], which shows that the mirror descent algorithm is equivalent to natural
gradient descent in the dual Riemannian manifold.
Motivation
Steepest descent line search
Suppose we want to minimize some convex differentiable function
\(f: \Theta \rightarrow {\mathbb{R}}\) . A common approach is to do a line
search using the steepest descent algorithm
where \(\alpha_t\) is the stepsize and the direction we step in is the
gradient of the function.
However, in applications where the parameters lie on a nonEuclidean
Riemannian manifold^{1} with metric tensor \(H = (h_{ij})\) (an inner
product on the tangent space of the manifold at a point), the direction
of steepest descent is actually the natural gradient:
where \(H\) is the Fisher information matrix for a statistical manifold. This update equation is called
natural gradient descent.
Natural Gradients
Recall that in an Euclidean space (with a orthonormal coordinate system)
\(\Theta={\mathbb{R}}^n\) , the squared length of a small incremental
vector \(d\theta\) that connects \(\theta\) and \(\theta + d\theta\) is given by
But for a Riemannian manifold, the squared length is given by the quadratic form
where \(H = (h_{ij})\) is an \(n \times n\) matrix called the Riemannian
metric tensor (which depends on \(\theta\).
In the Euclidean case, we just get that \(H = I_n\).
In this Riemannian manifold, the direction of steepest descent of a
function \(f(\theta)\) at \(\theta \in \Theta\) is defined by the vector
\(d\theta\) that minimizes \(f(\theta + d\theta)\), where \(d\theta\) has
a fixed length, i.e., under the constraint
Theorem (Amari, 1998) [1]:
The steepest descent direction of \(f(\theta)\) in a Riemannian space is
given by
where \(H^{1} = (h^{ij})\) is the inverse of the metric \(H = (h_{ij})\),
and \(\nabla f\) is the conventional gradient:
Proof:
Write \(d\theta = \epsilon v\), i.e., we decompose the vector into its
magnitude \(\epsilon\) and direction \(v\), and search for the vector
\(v\) that decreases the value of the function the most.
Writing the linearized function, we have
subject to the constraint that \(v^2 = v^\top H v =1\), i.e., \(v\) is a
unit vector representating just the direction.
Setting the derivative with respect to \(v\) of the Lagrangian to 0, i.e.,
so we have that
which is the direction that causes the function’s value to decrease the
most. Lastly, plugging \(v\) above into the constraint \(v^\top H v = 1\),
we can solve for \(\lambda\).
Thus, the direction of steepest descent is \(H^{1} \nabla f(\theta)\),
i.e., the natural gradient.
Lastly, note that in a Euclidean space, the metric tensor is just the
identity matrix, so we get back standard gradient descent.
Mirror descent and Bregman divergence
Mirror descent is a generalization of gradient descent that takes into
account nonEuclidean geometry. The mirror descent update equation is
Here \(\Psi: {\mathbb{R}}^p \times {\mathbb{R}}^p \rightarrow {\mathbb{R}}_+\),
and when
we get back the standard gradient descent update^{2}.
The proximity function \(\Psi\) is commonly chosen to be the Bregman divergence. Let
\(G: \Theta \rightarrow {\mathbb{R}}\) be a strictly convex,
twicedifferentiable function, then the Bregman divergence
\(B_G: \Theta \times \Theta \rightarrow {\mathbb{R}}^+\) is:
Equivalence of natural gradient and mirror descent
Bregman divergences and convex duality
The convex conjugate^{3} of a function \(G\) is defined to be
Now, if \(G\) is a lower semicontinuous
function, then we have that \(G\) is the convex conjugate of \(H\). This
implies that \(G\) and \(H\) have a dual relationship. If \(G\) is strictly
convex and twice differentiable, so is \(H\). Let \(g = \nabla G\) and
\(h = \nabla H\). Then \(g = h^{1}\).
Now let \(\mu = g(\theta) \in \Phi\) be the point at which the supremum
for the dual function is attained be the dual coordinate system to
\(\theta\). The dual Bregman divergence
\(B_H: \Phi \times \Phi \rightarrow {\mathbb{R}}^+\) induced by the
strictly convex differentiable function \(H\)
The dual coordinate relationship gives us

\(B_H(\mu,\mu’) = B_G(h(\mu’), h(\mu))\)

\(B_G(\theta,\theta’) = B_H(g(\theta’), g(\theta))\)
For exponential families, the function \(G\) is the log partition
function^{4}. We’ll now work through some examples.
Example: [Normal]
Consider the family \(N(\theta, I_p)\). The log partition function is
and the conjugate function of \(G\) is
since the expression in the supremum is maximized when \(\theta = \mu\)
(so we plug this in for \(\theta\) to get the second line). Then we can
easily write the Bregman divergence induced by \(G,H\) as
Example: [Poisson]
Now suppose we have the family \(\text{Poisson}(\exp(\theta))\). We have
where we plugged in \(\theta = \log \mu\) above. So the Bregman divergence induced by \(G,H\) is
Bregman divergences and Riemannian manifolds
Now that we have introduced what the Bregman divergence of an
exponential family looks like (with respect to the log partition
function \(G\) and its dual Bregman divergence with respect to \(H\), where
\(H\) is the conjugate of \(G\), we are ready to discuss the respective
primal and dual Riemannian manifolds induced by these divergences.

Primal Riemannian manifold: \((\Theta, \nabla^2 G)\), where
\(\nabla^2 G \succ 0\) (since \(G\) is strictly convex and
twice differentiable).

Dual Riemmanian manifold: \(B_H(\theta,\theta’)\) induces the
Riemannian manifold \((\Phi, \nabla^2 H)\), where
\(\Phi = {\mu: g(\theta) = \mu, \theta \in \Theta}\), i.e., the
image of \(\Theta\) under the continuous map \(g = \nabla G\).
Theorem (Raskutti and Mukherjee, 2014) [2]:
The mirror descent step with Bregman divergence defined by \(G\) applied
to the function \(f\) in the space \(\Theta\) is equivalent to the natural
gradient step along the dual Riemannian manifold \((\Phi, \nabla^2 H)\).
(Proof Sketch):
Step 1: Rewrite the mirror descent update in terms of the dual
Riemannian manifold coordinates \(\mu \in \Phi\):
since minimizing by differentiation and the dual coordinates
relationship gives us
Step 2:
Apply the chain rule:
and plugging this in to the update above, we get
which corresponds to the natural gradient step.
Highlevel summary:
The main result of [2] is that the mirror descent update with Bregman
divergence is equivalent to the natural gradient step along the dual
Riemannian manifold.
What does this mean? For natural gradient descent we know that

the natural gradient is the direction of steepest descent along a
Riemannian manifold, and

it is Fisher efficient for parameter estimation (achieves the CRLB,
i.e., the variance of any unbiased estimator is at least as high as
the inverse Fisher information),
but neither of these things are known for mirror descent.
The paper also outlines potential algorithmic benefits: natural gradient
descent is a secondorder method, since it requires the computation of
an inverse Hessian, while the mirror descent algorithm is a firstorder
method and only involves gradients.
References

Amari. Natural gradient works efficiently in learning. 1998

Raskutti and Mukherjee. The information geometry of mirror
descent. 2014.
21 Nov 2017 by Diana Cai
The socalled “curse of dimensionality” reflects the idea that many methods are more difficult in higherdimensions.
This difficulty may be due a number of issues that become more complicated in
higherdimensions: e.g., NPhardness, sample complexity, or algorithmic efficiency.
Thus, for such problems, the data are often first transformed using some
dimensionality reduction technique before applying the method of choice.
In this post, we discuss a result in highdimensional geometry regarding
how much one can reduce the dimension while still preserving distances.
JohnsonLindenstrauss Lemma
Given points \(z_1,z_2,\ldots,z_N \in \mathbb{R}^d\),
we want to find points \(u_1,\ldots,u_N \in \mathbb{R}^k\),
where , such that the distance between points is approximately preserved, i.e.,
for all ,
\[
z_i  z_j_2 \leq u_i  u_j_2 \leq (1+\epsilon) z_iz_j_2,
\]
where is the norm.
Thus, we’d like to find some mapping , where , that maps
the data to a much lower dimension while satisfying the above inequalities.
The JohnsonLindenstrauss Lemma (JL lemma) tells us that we need dimension ,
and that the mapping is a (random) linear mapping. The proof of this lemma
is essentially given by constructing via a random projection, and then
showing that for all , the random variable concentrates around its expectation.
This argument is an example of the probabilistic method, which is a nonconstructive method of
proving existence of entities with particular properties: if the probability of
getting an entity with the desired property is positive, then this entity must be
an element of the sample space and therefore exists.
Proof of the JL lemma
We can randomly choose vectors , where each , by
choosing each coordinate of the vector randomly from the set
\[
\left\{\left(\frac{1+\epsilon}{k}\right)^{\frac{1}{2}},\left(\frac{1+\epsilon}{k}\right)^{\frac{1}{2}}\right\}.
\]
Now consider the mapping from defined by the inner products of with the
random vectors:
\[z \mapsto (z^\top x_1, \ldots, z^\top x_k)\]
So, each vector is obtained via a
random projection. (Alternatively, we can think of the mapping as a random linear transformation
given by a random matrix , where the vectors form the rows of the matrix, and .)
The goal is to show that there exists some that satisfies the
above inequalities.
Fixing , define and , and .
Thus, we can write the squared norm of as
where refers to the th component of the th vector.
Now we consider the random variable in the sum. The expectation is
\[ \mathbb{E}(Y_n) = \frac{1+\epsilon}{k} z_2^2. \]
So, the expectation of \(u_2^2\) is
It remains to show that concentrates around its mean , which we can
do using a Chernoff bound. In particular, consider the two cases of
and .
Via a Chernoff bound, the probability of at least one of these two “bad” events occurring is upper bounded by
\[
\Pr[\{ u^2 > (1+\delta) \mu)\} \lor \{u^2 > (1\delta) \mu \}] < \exp(c \delta^2 k),
\]
for some .
Recall that , and so there are such random variables.
Now choosing , the probability that any of these random variables is outside of
of their expected value is bounded by
which follows from a union bound.
Choosing ensures that with all
variables are within of their
expectations, i.e., . Thus, rewriting this, we have that for all ,
which then implies the desired result:
References
 Arora and Kothari. High Dimensional Geometry, Curse of Dimensionality,
Dimension Reduction.
 Dasgupta and Gupta. An Elementary Proof of Theorem of Johnson and
Lindenstrauss.
 Nelson. Dimensionality Reduction Notes
Part 1
Part 2
15 Nov 2017 by Diana Cai
Last updated: 11/15/17
This post is a collection of references for Bayesian nonparametrics that I’ve found helpful or wish that I had known about earlier.
For many of these topics, some of the best references are lecture notes, tutorials, and lecture videos.
For general references, I’ve prioritized listing newer references with a more updated or comprehensive treatment of the topics.
Many references are missing, and so I’ll continue to update this over time.
Nonparametric Bayes: an introduction
These are references for people who have little to no exposure to the area, and
want to learn more. Tutorials and lecture videos offer a great way to get up to
speed on some of the moreestablished models, methods, and results.

Teh (and others). Nonparametric Bayes tutorials.
This webpage lists a number of review articles and lecture videos that as an
introduction to the topic. The main focus is on Dirichlet processes, but a few
tutorials cover other topics: e.g., Indian buffet processes, fragmentationcoagulation processes.

Broderick. Nonparametric Bayesian Methods: Models, Algorithms, and
Applications. 2017.
See also joint tutorial with Michael Jordan at the Simon’s Institute.
An introduction to Dirichlet processes and the Chinese restaurant process,
and nonparametric mixture models. Also comes with R code for generating from and
sampling (inference) in these models.

Wasserman and Lafferty. Nonparametric Bayesian Methods. 2010.
This gives an overview of Dirichlet processes and Gaussian processes and places
these methods in the context of popular frequentist nonparametric estimation methods for, e.g., CDF and
density estimation, regression function estimation.
Theory and foundations
It turns out there’s a lot of theory and foundational topics.

Orbanz. Lecture notes on Bayesian nonparametrics.
2014.
A great introduction to foundations of Bayesian nonparametrics, and provides
many references for those who want a more indepth understanding of topics.
E.g.: random measures, clustering and feature modeling, Gaussian processes,
exchangeability, posterior distributions.

Ghosal and van der Vaart. Fundamentals of Bayesian Nonparametric Inference.
Cambridge Series in Statistical and Probabilistic Mathematics, 2017.
contents
The most recent textbook on Bayesian nonparametrics, focusing on topics
such as random measures, consistency, contraction rates, and also covers
topics such as Gaussian processes, Dirichlet processes, beta processes.

Kleijn, van der Vaart, van Zanten. Lectures on Nonparametric Bayesian
Statistics.
2012.
Lecture notes with some similar topics as (and partly based on) the Ghosal and van der Vaart textbook, including a comprehensive treatment of posterior consistency.
Specific topics

Kingman. Poisson processes. Oxford Studies in Probability, 1993.
Everything you wanted to know about Poisson processes. (See Orbanz lecture
notes above for more references on even more topics on general point process theory.)

Pitman. Combinatorial stochastic processes. 2002.

Aldous. Exchangeability and related topics. 1985.

Orbanz and Roy. Bayesian Models of Graphs, Arrays and
Other Exchangeable Random Structures. IEEE TPAMI, 2015.

Jordan. Hierarchical models, nested models, and completely random
measures. 2013.

Broderick, Jordan, Pitman. Cluster and feature modeling from combinatorial
stochastic processes. 2013.
Background: probability
Having basic familiarity with measuretheoretic probability is fairly important for understanding many of the ideas in this section. Many of the introductory references aim to avoid measure
theory (especially for the discrete models), but even this is not always the case, so it is helpful to have as much exposure as possible.

Hoff. Measure and probability. Lecture notes. 2013. pdf
Gives an overview of the basics of measuretheoretic probability, which are often assumed in many of the above/below references.

Williams. Probability with martingales. Cambridge Mathematical Textbooks, 1991.

Cinlar. Probability and stochastics. Springer Graduate Texts in Mathematics. 2011.
Models and inference algorithms
There are too many papers on nonparametric Bayesian models and inference methods. Below, I’ll list a few “classic” ones, and continue to add more over time.
The above tutorials contain many more references.
Dirichlet processes: mixture models and admixtures
Dirichlet process mixture model

Neal. Markov Chain sampling methods for Dirichlet process mixture models.
Journal of Computational and Graphical Statistics, 2000.
pdf

Blei and Jordan. Variational inference for Dirichlet process mixtures.
Bayesian Analysis, 2006.
pdf
Hierarchical Dirichlet process

Teh, Jordan, Beal, Blei. Hierarchical Dirichlet processes.
Journal of the American Statistical Association, 2006.
pdf

Hoffman, Blei, Wang, Paisley. Stochastic variational inference.
Journal of Machine Learning Research, 2013.
pdf
Nested Dirichlet process & Chinese restaurant process

Rodriguez, Dunson, Gelfand. The Nested Dirichlet process.
Journal of the American Statistical Association, 2008.
pdf

Blei, Griffiths, Jordan. The nested Chinese restaurant process and Bayesian
nonparametric inference of topic hierarchies.
Journal of the ACM, 2010.
pdf

Paisley, Wang, Blei, Jordan. Nested hierarchical Dirichlet processes.
IEEE TPAMI, 2015.
pdf
Mixtures of Dirichlet processes
 Antoniak. Mixture of Dirichlet processes with applications to Bayesian nonparametrics.
Annals of Statistics, 1974.
pdf
Additional inference methods

Ishwaran and James. Gibbs sampling methods for stickbreaking priors.
Journal of the American Statistical Association, 2001.
pdf

MacEachern. Estimating normal means with a conjugate style dirichlet process prior.
Communications in Statistics  Simulation and Computation, 1994.
pdf

Escobar and West. Bayesian density estimation and inference using mixtures.
Journal of the American Statististical Association, 1995.
pdf
Indian buffet processes and latent feature models

Griffiths and Ghahramani. The Indian buffet process: an introduction and review.
Journal of Machine Learning Research, 2011.
pdf

Thibaux and Jordan. Hierarchical beta processes and the Indian Buffet
process. AISTATS, 2007.
pdf

longer
05 Nov 2017 by Diana Cai
In this post, we will discuss a few simple algorithms for online decision making with expert
advice. In particular, this setting assumes no prior distribution on the set of
outcomes, but we use hindsight to improve future decisions. The algorithms discussed include a simple deterministic and randomized majority weighted decision algorithm.
Lastly, we discuss a randomized algorithm called the multiplicative weights algorithm. This
algorithm has been discovered in a number of fields, and is the basis of many
popular algorithms, such as the Ada Boost algorithm in machine learning and
gameplaying algorithms in economics.
This post mostly follows [1] and [2], which contain much more detail on this subject.
In particular, we will omit proofs and refer to these references for the details.
Overview
Consider a setting with rounds. During each round, you get a finite set of actions you can take, e.g.,
, and associated with each action is some cost
associated with it, that is revealed after taking the action. We would like to
design a policy that minimizes our cost (or maximizes the reward).
For example: consider the scenario of predicting whether or not a single stock’s price
will go up or down. Thus, each round is a day, and the action we take is binary,
corresponding to up/down. At the end of the day, we observe the final price of
the stock: if we make a correct prediction, we lose nothing, but if we make an
incorrect prediction, we lose 1 dollar.
We will consider the setting of total uncertainty, where we a priori have
no knowledge of the distribution on the set of outcomes, e.g., due to lack of
computational resources or data.
We will consider a few algorithms based on knowledge of experts.
Predictions with expert advice
Consider again the example of predicting a stock’s price, whose movement can be arbitrary
or adversarial (which comes up, in practice, in a variety of other settings).
But, we get to view the predictions of experts (who may or may not be good
at predicting and could even be correlated in some manner).
We would like to design an algorithm that limiting the total cost – i.e., bad predictions – by
limiting it to be about the same as the best expert. Because we do not know who
the best expert is until the end, we need some way of maintaining and updating
our belief of the best expert so that we can make some prediction in each round.
Predicting with the majority
Deterministic algorithm
The simplest algorithm is to just predict according to the majority prediction
of the experts: if most experts predict the price will go up, we will also
predict the price will go up. But what happens if the majority is wrong every
single day? Then, we will lose money every day.
Instead, we can maintain a weight for each expert , that is initially 1
for all experts, but that we decay every time the expert makes a mistake in the
prediction. Then, our action is to predict according to the weighted majority,
which will downweight the predictions of the bad experts. Thus, the algorithm
will predict, at each round, according to the decision with the highest total
weight of the experts.
Let be a parameter such that if the expert makes a mistake,
we will decay their weight by , i.e., for the th expert, we have
for the th round
Then, after steps, if is the number of mistakes from expert
, we have following bound on the number of mistakes of our algorithm :
for all , we have
Note that the best expert will have the fewest number of mistakes ,
and that the bound holds for all experts. Thus, the number of mistakes the
algorithm makes is roughly a little less than twice the number of mistakes of
the best expert (only the first term depends on ).
Randomized algorithm
It turns out we can do even better if we convert the above algorithm to a
randomized algorithm. Here, instead of predicting with the weighted majority,
we will predict with the weighted majority with probability proportional to the
weight. For instance, if the total weight of the experts predicting “up” is
, then instead of predicting up, our algorithm will instead
predict up with probability proportional to .
For this algorithm, we instead have the bound
which is a factor of 2 less (in the first term) than the above deterministic
algorithm. Thus, this algorithm will perform roughly on the same order as the
best expert.
Multiplicative weights
Now, we consider a more general setting. Here we will choose one decision in
each round out of possible decisions, and each decision will incur a
cost, which is revealed after making the decision. Above, we studied the
special case where each decision corresponds to a choice of an expert, and
the cost is 1 for a mistake, and 0 otherwise. Here we will instead
consister costs that can be in .
A naive strategy would be to pick a decision randomly; the expected penalty is
that of the average decision. But, if a few decisions are better, we can observe
this as the costs are revealed, and upweight those better decisions so that we
can pick them in the future. This motivates the multiplicative weight algorithm,
which has been discovered independently in many fields, e.g., machine learning, optimization, and game theory.
The goal is to design an algorithm that, in the long run, has total expected cost roughly on the
order of the best decision, i.e., .
Again, we will maintain a weighting of the decisions , where all weights initially are set to 1.
At each round , we have a distribution over a
set of decisions
\[
p^{(t)} = \left\{\frac{w_1^{(t)}}{\Phi^{(t)}}, \ldots, \frac{w_n^{(t)}}{\Phi^{(t)}} \right\},
\]
where is the normalization term.
For each round , we iterate the following:

Randomly select a decision from (thus, each decision
is chosen with probability proportional to its weight ).

The decision is made, and the cost vector is revealed, where
the th component correponds to the cost of decision (with cost ). The costs could be chosen arbitrarily by nature.

Update the weights of the costly decisions: for each decision , set
\[
w_i^{(t+1)} = w_i^{(t)} (1  \eta \, m_i^{(t)}),
\]
where is fixed in advance.
Here, the multiplicative term is less than 1 (thus, a
decay) if there is a larger cost, but if the cost is negative, this will increase the weights. A cost of 0 would not change the weight at all.
The expected cost of the algorithm from sampling decision is
\[
E_{i \in p^{(t)}}(m_i^{(t)}) = \langle m^{(t)}, p^{(t)} \rangle,
\]
i.e., the sum of the costs weighted by the probability of sampling each respective decision.
The total expected cost is the sum of the expected cost for each round:
\[
\sum_{t=1}^T \langle m^{(t)}, p^{(t)}\rangle.
\]
We can now consider a bound for this value.
Bound for the expected total cost
Assuming all costs lie in , and that ,
then we have the following bound after rounds: for any decision
,
\[
\sum_{t=1}^T \langle m^{(t)}, p^{(t)}\rangle
\leq
\sum_{t=1}^T m_i^{(t)} + \eta \sum_{t=1}^T m_i^{(t)} + \frac{\log n}{\eta}.
\]
References
 Arora. Advanced Algorithms notes.
 Arora, Hazan, Kale. The multiplicative weights update method: a meta
algorithm and its applications.
 Borodin and El Yaniv. Online computation and competitive analysis.
21 Oct 2017 by Diana Cai
For data that have a high signaltonoise ratio, a nonparametric, adaptive method
might be appropriate. In particular, we may want to fit the data to functions
that are spatially imhomogenous, i.e., the smoothness of the function varies a lot with .
In this post, we will discuss wavelets, which can be used an adaptive nonparametric estimation method.
First, we will introduce some background on function spaces and Fourier transforms, and then we will discuss Haar wavelets, a specific type of wavelet, and how to construct wavelets in general.
This presentation follows Wasserman [2], but I’ve included some additional code and images.
Preliminaries
Function spaces
Let \(L_2(a,b)\) denote the set of functions \(f : [a,b] \rightarrow \mathbb{R}\)
such that
\[ \int_a^b f^2(x) dx < \infty.\]
For our purposes, we will assume \(a=0,b=1\).
The inner product of \(f,g \in L_2(a,b)\) is defined as
\[
\langle f,g \rangle :=
\int_a^b f(x) g(x) dx,
\]
and the norm of \(f\) is defined as
\[
 f  = \left( \int_a^b f^2(x) dx \right)^{\frac{1}{2}}.
\]
A sequence of functions \(\phi_1,\phi_2,\ldots\) is orthonormal if
\(\phi_j = 1 \) for all \(j\) (i.e., has norm 1), and
\( \int_a^b \phi_i(x) \phi_j(x) dx = 0, \,\, i \neq j\) (i.e., orthogonal).
A complete (i.e., the only function orthogonal to each \(\phi_j\) is the 0
function) and orthonormal set of functions form a basis, i.e., if \(f \in L_2(a,b)\),
then \(f\) can be expanded in the basis in the following way:
\[
f(x) = \sum_{j=1}^\infty \theta_j \phi_j(x),
\]
where \(\theta_j = \int_a^b f(x) \phi_j(x) dx \).
A function \(f = \sum_j \beta_j \phi_j \) is sparse in a basis
\(\phi_1,\phi_2,\ldots\) if most of the \(\beta_j\)’s are 0 or close to 0.
Sparseness can be seen as a generalization of smoothness, i.e., a smooth
function is sparse, but there are also nonsmooth functions that are sparse.
Let \(f^{*}\) denote the Fourier transform of a function \(f\):
\[ f^{*}(t) = \int_{\infty}^\infty \exp(ixt) \, f(x) \,dx.\]
We can recover \(f\) at almost all \(x\) using the inverse Fourier
transform:
\[
f(x) = \frac{1}{2\pi} \int_{\infty}^\infty \exp(ixt) \, f^*(t) \,dt,
\]
assuming that \(f^{*}\) is absolutely integrable.
Throughout our discussion of wavelets, we will use the following notation: given
any function \(f\) and \(j,k \in \mathbb{Z}\), define
\[
f_{jk}(x) = 2^{\frac{j}{2}} \, f(2^j x  k).
\]
Wavelets
We now turn our attention to wavelets, beginning with the simplest type of
wavelet, the Haar wavelet.
Haar wavelet
Haar wavelets are a simple type of wavelet given in terms of stepfunctions.
Specifically, these wavelets are expressed in terms of the the Father and Mother Haar wavelets.
Our goal is to define an orthonormal basis for —to do so,
we need to introduce the Father and Mother wavelets and their shifted and
rescaled sets.
The Father wavelet is defined as:
and looks like:
The Mother wavelet is defined as:
and looks like:
Now we define the wavelets as shifted and rescaled versions of the
Father and Mother wavelets, as above:
and
Below we plot some examples.
The shifted/rescaled father wavelet looks like:
The shifted/rescaled mother wavelet looks like:
Now we define
the set of rescaled and shifted mother wavelets at resolution $j$
is defined as:
We plot an example where :
The next theorem defines gives an orthonormal basis for in terms of
the introduced sets, which allows us to write any function in this space as a
linear combination of the basis elements.
Theorem: The set of functions
is an orthonormal basis for , i.e., the set of realvalued functions on where .
As a result, we can expand any function in this basis:
where is the scaling coefficient, and
are the detail coefficients.
So to approximate a function , we can take the finite sum
This is called the resolution approximation, and has terms.
We consider an example below. Suppose we are interested in approximating
the Doppler function:
We can approximate this function by considering several resolutions (i.e.,
finite truncations of the wavelet expansion sum).
Below, we plot the original function along with the resolution
approximations:
Here, the coefficients were computed using numerical quadrature.
Below, we plot the resolution approximation along with the coefficients.
The axis represents which resolution or level the coefficient comes from.
The height of the bars are proportional to the size of the coefficients, and the
direction of the bar corresponds to the sign of the coefficient.
For instance, for certain applications, the axis could be represent time, and the resolutions
could then be interpreted as subintervals of time.
Constructing smooth wavelets
Haar wavelets are simple to describe and are localized, i.e., the mass is
concentrated in one area. We can express these same ideas for more
general functions, which can give us approximations that are smooth and localized.
Intuitively, it is useful to consider these specific concepts in terms of Haar
wavelets, and to know that we can use these ideas for more general functions in the following way.
Given any function , we can define the subspaces of as follows:
Definition: We say that generates a multiresolution analysis (MRA) of if
and
i.e., for any function , there exists a sequence of functions
such that each and as .
In other words, (…)
Lemma: If is an MRA generated by , then
is an orthonormal basis for .
As an example, we consider the father Haar wavelet as the function .
The the MRA generated by is given by , where
each is the set of functions that are
piecewise constant on the interval
for .
Code
Code (Jupyter/iPython notebook) for generating these plots is available on Github.
References
 W. Härdle, G. Kerkyacharian, D. Picard, A. Tsybakov. Wavelets, Approximation, and Statistical Applications.
 L. Wasserman. All of Nonparametric Statistics. Chapter 9.