This post is a collection of references for Bayesian nonparametrics that I’ve found helpful or wish that I had known about earlier.
For many of these topics, some of the best references are lecture notes, tutorials, and lecture videos.
For general references, I’ve prioritized listing newer references with a more updated or comprehensive treatment of the topics.
Many references are missing, and so I’ll continue to update this over time.
Nonparametric Bayes: an introduction
These are references for people who have little to no exposure to the area, and
want to learn more. Tutorials and lecture videos offer a great way to get up to
speed on some of the more established models, methods, and results.
This webpage lists a number of review articles and lecture videos that serve as an
introduction to the topic. The main focus is on Dirichlet processes, but a few
tutorials cover other topics, e.g., Indian buffet processes and fragmentation-coagulation processes.
An introduction to Dirichlet processes, the Chinese restaurant process,
and nonparametric mixture models. Also comes with R code for generating from, and
sampling (inference) in, these models.
This gives an overview of Dirichlet processes and Gaussian processes, and places
these methods in the context of popular frequentist nonparametric estimation methods
for, e.g., CDF estimation, density estimation, and regression function estimation.
Theory and foundations
It turns out there is a great deal of theory and many foundational topics.
A great introduction to the foundations of Bayesian nonparametrics that provides
many references for those who want a more in-depth understanding of topics,
e.g., random measures, clustering and feature modeling, Gaussian processes,
exchangeability, and posterior distributions.
Ghosal and van der Vaart. Fundamentals of Bayesian Nonparametric Inference.
Cambridge Series in Statistical and Probabilistic Mathematics, 2017.
contents
The most recent textbook on Bayesian nonparametrics, focusing on topics
such as random measures, consistency, and contraction rates; it also covers
topics such as Gaussian processes, Dirichlet processes, and beta processes.
Lecture notes covering some of the same topics as (and partly based on) the Ghosal and van der Vaart textbook, including a comprehensive treatment of posterior consistency.
Specific topics
Kingman. Poisson processes. Oxford Studies in Probability, 1993.
Everything you wanted to know about Poisson processes. (See the Orbanz lecture
notes above for references on even more topics in general point process theory.)
Having basic familiarity with measure-theoretic probability is fairly important for understanding many of the ideas in this section. Many of the introductory references aim to avoid measure
theory (especially for the discrete models), but this is not always the case, so it is helpful to have as much exposure as possible.
Hoff. Measure and probability. Lecture notes. 2013. pdf
Gives an overview of the basics of measure-theoretic probability, which are often assumed in many of the above/below references.
Williams. Probability with martingales. Cambridge Mathematical Textbooks, 1991.
Cinlar. Probability and stochastics. Springer Graduate Texts in Mathematics. 2011.
Models and inference algorithms
There are far too many papers on nonparametric Bayesian models and inference methods to list them all. Below, I'll list a few "classic" ones, and continue to add more over time.
The above tutorials contain many more references.
Dirichlet processes: mixture models and admixtures
Dirichlet process mixture model
Neal. Markov chain sampling methods for Dirichlet process mixture models.
Journal of Computational and Graphical Statistics, 2000.
pdf
Blei and Jordan. Variational inference for Dirichlet process mixtures.
Bayesian Analysis, 2006.
pdf
Hierarchical Dirichlet process
Teh, Jordan, Beal, Blei. Hierarchical Dirichlet processes.
Journal of the American Statistical Association, 2006.
pdf
Hoffman, Blei, Wang, Paisley. Stochastic variational inference.
Journal of Machine Learning Research, 2013.
pdf
Nested Dirichlet process & Chinese restaurant process
Rodriguez, Dunson, Gelfand. The Nested Dirichlet process.
Journal of the American Statistical Association, 2008.
pdf
Blei, Griffiths, Jordan. The nested Chinese restaurant process and Bayesian
nonparametric inference of topic hierarchies.
Journal of the ACM, 2010.
pdf
Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems.
Annals of Statistics, 1974.
pdf
Additional inference methods
Ishwaran and James. Gibbs sampling methods for stick-breaking priors.
Journal of the American Statistical Association, 2001.
pdf
MacEachern. Estimating normal means with a conjugate style Dirichlet process prior.
Communications in Statistics - Simulation and Computation, 1994.
pdf
Escobar and West. Bayesian density estimation and inference using mixtures.
Journal of the American Statistical Association, 1995.
pdf
Indian buffet processes and latent feature models
Griffiths and Ghahramani. The Indian buffet process: an introduction and review.
Journal of Machine Learning Research, 2011.
pdf
Thibaux and Jordan. Hierarchical beta processes and the Indian buffet
process. AISTATS, 2007.
pdf

In this post, we will discuss a few simple algorithms for online decision making with expert
advice. This setting assumes no prior distribution on the set of
outcomes; instead, we use hindsight to improve future decisions. The algorithms discussed include simple deterministic and randomized weighted majority algorithms.
Lastly, we discuss a randomized algorithm called the multiplicative weights algorithm. This
algorithm has been discovered independently in a number of fields, and it is the basis of many
popular algorithms, such as the AdaBoost algorithm in machine learning and
game-playing algorithms in economics.
This post mostly follows [1] and [2], which contain much more detail on this subject.
In particular, we will omit proofs and refer to these references for the details.
Overview
Consider a setting with \(T\) rounds. During each round, you choose from a finite set of actions,
and associated with each action is some cost
that is revealed only after taking the action. We would like to
design a policy that minimizes our cost (or, equivalently, maximizes the reward).
For example: consider the scenario of predicting whether or not a single stock’s price
will go up or down. Thus, each round is a day, and the action we take is binary,
corresponding to up/down. At the end of the day, we observe the final price of
the stock: if we make a correct prediction, we lose nothing, but if we make an
incorrect prediction, we lose 1 dollar.
We will consider the setting of total uncertainty, where we a priori have
no knowledge of the distribution on the set of outcomes, e.g., due to lack of
computational resources or data.
We will consider a few algorithms based on knowledge of experts.
Predictions with expert advice
Consider again the example of predicting a stock's price, whose movement can be arbitrary
or even adversarial (which comes up, in practice, in a variety of other settings).
But we get to view the predictions of \(n\) experts (who may or may not be good
at predicting, and who could even be correlated in some manner).
We would like to design an algorithm that limits the total cost (i.e., bad predictions) by
keeping it about the same as that of the best expert. Because we do not know who
the best expert is until the end, we need some way of maintaining and updating
our belief about the best expert, so that we can make a good prediction in each round.
Predicting with the majority
Deterministic algorithm
The simplest algorithm is to just predict according to the majority prediction
of the experts: if most experts predict the price will go up, we will also
predict the price will go up. But what happens if the majority is wrong every
single day? Then, we will lose money every day.
Instead, we can maintain a weight \(w_i\) for each expert \(i\), initially 1
for all experts, which we decay every time the expert makes a mistake in their
prediction. Then, our action is to predict according to the weighted majority,
which downweights the predictions of the bad experts. Thus, the algorithm
will predict, at each round, according to the decision with the highest total
weight of the experts.
Let \(\epsilon \in (0, 1/2]\) be a parameter such that, if an expert makes a mistake,
we decay their weight by a factor of \((1 - \epsilon)\); i.e., for the \(i\)th expert at
the \(t\)th round, we set
\[
w_i^{(t+1)} = w_i^{(t)} (1 - \epsilon).
\]
Then, after \(T\) steps, if \(m_i^{(T)}\) is the number of mistakes from expert
\(i\), we have the following bound on the number of mistakes \(M^{(T)}\) of our algorithm:
for all experts \(i\), we have
\[
M^{(T)} \leq 2 (1 + \epsilon)\, m_i^{(T)} + \frac{2 \ln n}{\epsilon}.
\]
Note that the best expert will have the fewest mistakes \(\min_i m_i^{(T)}\),
and that the bound holds simultaneously for all experts \(i\). Thus, the number of mistakes the
algorithm makes is roughly twice the number of mistakes of
the best expert (only the first term depends on \(i\)).
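As a concrete illustration, here is a minimal Python sketch of the deterministic weighted majority algorithm (the function and variable names are my own, not from the references):

```python
import numpy as np

def weighted_majority(expert_preds, outcomes, eps=0.1):
    """expert_preds: (T, n) array of 0/1 predictions; outcomes: length-T 0/1 array."""
    T, n = expert_preds.shape
    w = np.ones(n)                       # one weight per expert, initially 1
    mistakes = 0
    for t in range(T):
        # Predict the outcome carrying the larger total expert weight.
        up_weight = w[expert_preds[t] == 1].sum()
        pred = 1 if up_weight >= w.sum() / 2 else 0
        mistakes += int(pred != outcomes[t])
        # Decay the weights of the experts who erred this round.
        w[expert_preds[t] != outcomes[t]] *= (1 - eps)
    return mistakes
```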
Randomized algorithm
It turns out we can do even better if we convert the above algorithm to a
randomized algorithm. Here, instead of predicting with the weighted majority,
we predict each outcome with probability proportional to its total weight.
For instance, if the total weight of the experts predicting "up" is
\(W_{\text{up}}\) out of a total weight \(\Phi\), then instead of deterministically
predicting up, our algorithm will predict up with probability \(W_{\text{up}} / \Phi\).
For this algorithm, we instead have the bound
\[
M^{(T)} \leq (1 + \epsilon)\, m_i^{(T)} + \frac{\ln n}{\epsilon},
\]
which is a factor of 2 smaller (in the first term) than the bound for the deterministic
algorithm. Thus, this algorithm will perform roughly on the same order as the
best expert.
Multiplicative weights
Now, we consider a more general setting. Here we will choose one decision in
each round out of \(n\) possible decisions, and each decision will incur a
cost, which is revealed after making the decision. Above, we studied the
special case where each decision corresponds to a choice of an expert, and
the cost is 1 for a mistake and 0 otherwise. Here we will instead
consider costs that can be in \([-1, 1]\).
A naive strategy would be to pick a decision randomly; the expected penalty is
that of the average decision. But, if a few decisions are better, we can observe
this as the costs are revealed, and upweight those better decisions so that we
can pick them in the future. This motivates the multiplicative weight algorithm,
which has been discovered independently in many fields, e.g., machine learning, optimization, and game theory.
The goal is to design an algorithm that, in the long run, has total expected cost roughly on the
order of that of the best decision, i.e., \(\min_i \sum_{t=1}^T m_i^{(t)}\).
Again, we will maintain a weight \(w_i^{(t)}\) for each decision \(i\), where all weights are initially set to 1.
At each round \(t\), we have a distribution \(p^{(t)}\) over the
set of decisions
\[
p^{(t)} = \left\{\frac{w_1^{(t)}}{\Phi^{(t)}}, \ldots, \frac{w_n^{(t)}}{\Phi^{(t)}} \right\},
\]
where \(\Phi^{(t)} = \sum_{i=1}^n w_i^{(t)}\) is the normalization term.
For each round \(t\), we iterate the following:
Randomly select a decision \(i\) from \(p^{(t)}\) (thus, each decision
is chosen with probability proportional to its weight \(w_i^{(t)}\)).
The decision is made, and the cost vector \(m^{(t)}\) is revealed, where
the \(i\)th component \(m_i^{(t)}\) corresponds to the cost of decision \(i\) (with cost \(m_i^{(t)} \in [-1, 1]\)). The costs may be chosen arbitrarily by nature.
Update the weights of the costly decisions: for each decision \(i\), set
\[
w_i^{(t+1)} = w_i^{(t)} (1 - \eta \, m_i^{(t)}),
\]
where \(\eta \leq 1/2\) is fixed in advance.
Here, the multiplicative term \((1 - \eta \, m_i^{(t)})\) is less than 1 (thus, a
decay) if the cost is positive, but if the cost is negative, the update increases the weight. A cost of 0 does not change the weight at all.
The expected cost to the algorithm of sampling a decision \(i \sim p^{(t)}\) at round \(t\) is
\[
E_{i \in p^{(t)}}(m_i^{(t)}) = \langle m^{(t)}, p^{(t)} \rangle,
\]
i.e., the sum of the costs weighted by the probability of sampling each respective decision.
The total expected cost is the sum of the expected cost for each round:
\[
\sum_{t=1}^T \langle m^{(t)}, p^{(t)}\rangle.
\]
We can now consider a bound for this value.
Bound for the expected total cost
Assuming all costs lie in \([-1, 1]\), and that \(\eta \leq 1/2\),
we have the following bound after \(T\) rounds: for any decision \(i\),
\[
\sum_{t=1}^T \langle m^{(t)}, p^{(t)} \rangle \;\leq\; \sum_{t=1}^T m_i^{(t)} + \eta \sum_{t=1}^T \left| m_i^{(t)} \right| + \frac{\ln n}{\eta}.
\]
That is, the total expected cost is not much worse than the total cost of any fixed decision, including the best one.
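To make the scheme above concrete, here is a minimal Python sketch of the multiplicative weights algorithm (names are mine; for simplicity, the cost vectors are fixed in advance rather than chosen adaptively by nature):

```python
import numpy as np

rng = np.random.default_rng(0)

def multiplicative_weights(cost_vectors, eta=0.1):
    """cost_vectors: (T, n) array, where row t is the cost vector m^(t) in [-1, 1]^n."""
    T, n = cost_vectors.shape
    w = np.ones(n)                        # decision weights, initially 1
    total_cost = 0.0
    for t in range(T):
        p = w / w.sum()                   # the distribution p^(t)
        i = rng.choice(n, p=p)            # sample a decision
        total_cost += cost_vectors[t, i]  # pay the revealed cost
        w *= 1 - eta * cost_vectors[t]    # multiplicative update on all decisions
    return total_cost
```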
For data that have a high signal-to-noise ratio, a nonparametric, adaptive method
might be appropriate. In particular, we may want to fit the data with functions
that are spatially inhomogeneous, i.e., where the smoothness of the function varies a lot with \(x\).
In this post, we will discuss wavelets, which can be used as an adaptive nonparametric estimation method.
First, we will introduce some background on function spaces and Fourier transforms, and then we will discuss Haar wavelets, a specific type of wavelet, and how to construct wavelets in general.
This presentation follows Wasserman [2], but I've included some additional code and images.
Preliminaries
Function spaces
Let \(L_2(a,b)\) denote the set of functions \(f : [a,b] \rightarrow \mathbb{R}\)
such that
\[ \int_a^b f^2(x) dx < \infty.\]
For our purposes, we will assume \(a=0,b=1\).
The inner product of \(f,g \in L_2(a,b)\) is defined as
\[
\langle f,g \rangle :=
\int_a^b f(x) g(x) dx,
\]
and the norm of \(f\) is defined as
\[
\| f \| = \left( \int_a^b f^2(x)\, dx \right)^{1/2}.
\]
A sequence of functions \(\phi_1, \phi_2, \ldots\) is orthonormal if
\(\| \phi_j \| = 1\) for all \(j\) (i.e., each has norm 1), and
\( \int_a^b \phi_i(x) \phi_j(x)\, dx = 0, \,\, i \neq j\) (i.e., they are orthogonal).
A complete (i.e., the only function orthogonal to each \(\phi_j\) is the 0
function) and orthonormal set of functions form a basis, i.e., if \(f \in L_2(a,b)\),
then \(f\) can be expanded in the basis in the following way:
\[
f(x) = \sum_{j=1}^\infty \theta_j \phi_j(x),
\]
where \(\theta_j = \int_a^b f(x) \phi_j(x) dx \).
Sparse functions and Fourier transforms
A function \(f = \sum_j \beta_j \phi_j \) is sparse in a basis
\(\phi_1,\phi_2,\ldots\) if most of the \(\beta_j\)’s are 0 or close to 0.
Sparseness can be seen as a generalization of smoothness, i.e., a smooth
function is sparse, but there are also nonsmooth functions that are sparse.
Let \(f^{*}\) denote the Fourier transform of a function \(f\):
\[ f^{*}(t) = \int_{-\infty}^\infty \exp(-ixt) \, f(x) \, dx.\]
We can recover \(f\) at almost all \(x\) using the inverse Fourier
transform:
\[
f(x) = \frac{1}{2\pi} \int_{-\infty}^\infty \exp(ixt) \, f^*(t) \, dt,
\]
assuming that \(f^{*}\) is absolutely integrable.
Throughout our discussion of wavelets, we will use the following notation: given
any function \(f\) and \(j,k \in \mathbb{Z}\), define
\[
f_{jk}(x) = 2^{j/2} \, f(2^j x - k).
\]
Wavelets
We now turn our attention to wavelets, beginning with the simplest type of
wavelet, the Haar wavelet.
Haar wavelet
Haar wavelets are a simple type of wavelet given in terms of step functions.
Specifically, these wavelets are expressed in terms of the Father and Mother Haar wavelets.
Our goal is to define an orthonormal basis for \(L_2(0,1)\); to do so,
we need to introduce the Father and Mother wavelets and their shifted and
rescaled sets.
The Father wavelet is defined as
\[
\phi(x) =
\begin{cases}
1 & \text{if } x \in [0, 1] \\
0 & \text{otherwise,}
\end{cases}
\]
i.e., the indicator of the unit interval. The Mother wavelet is defined as
\[
\psi(x) =
\begin{cases}
-1 & \text{if } x \in [0, \tfrac{1}{2}] \\
1 & \text{if } x \in (\tfrac{1}{2}, 1] \\
0 & \text{otherwise,}
\end{cases}
\]
i.e., a step function taking the values \(-1\) and \(1\) on the two halves of the unit interval.
Now we define the wavelets as shifted and rescaled versions of the
Father and Mother wavelets, using the notation introduced above:
\[
\phi_{jk}(x) = 2^{j/2} \, \phi(2^j x - k)
\quad \text{and} \quad
\psi_{jk}(x) = 2^{j/2} \, \psi(2^j x - k).
\]
The set of rescaled and shifted Mother wavelets at resolution \(j\)
is defined as
\[
W_j = \left\{ \psi_{jk}, \;\; k = 0, 1, \ldots, 2^j - 1 \right\}.
\]
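As a quick illustration, here is a small Python sketch of these definitions (my own code, not from the post's original notebook):

```python
import numpy as np

def father(x):
    x = np.asarray(x)
    return np.where((x >= 0) & (x <= 1), 1.0, 0.0)

def mother(x):
    x = np.asarray(x)
    return np.where((x >= 0) & (x <= 0.5), -1.0,
                    np.where((x > 0.5) & (x <= 1), 1.0, 0.0))

def shift_rescale(f, j, k):
    # f_{jk}(x) = 2^{j/2} f(2^j x - k)
    return lambda x: 2 ** (j / 2) * f(2 ** j * np.asarray(x) - k)

psi_21 = shift_rescale(mother, j=2, k=1)   # e.g., the wavelet psi_{2,1}
```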
The next theorem gives an orthonormal basis for \(L_2(0,1)\) in terms of
the sets just introduced, which allows us to write any function in this space as a
linear combination of the basis elements.
Theorem: The set of functions
\[
\{\phi\} \cup W_0 \cup W_1 \cup W_2 \cup \cdots
\]
is an orthonormal basis for \(L_2(0,1)\), i.e., the set of real-valued functions \(f\) on \([0,1]\) where \(\int_0^1 f^2(x)\, dx < \infty\).
As a result, we can expand any function \(f \in L_2(0,1)\) in this basis:
\[
f(x) = \alpha \, \phi(x) + \sum_{j=0}^\infty \sum_{k=0}^{2^j - 1} \beta_{jk} \, \psi_{jk}(x),
\]
where \(\alpha = \int_0^1 f(x)\, \phi(x)\, dx\) is the scaling coefficient, and the
\(\beta_{jk} = \int_0^1 f(x)\, \psi_{jk}(x)\, dx\) are the detail coefficients.
So, to approximate a function \(f\), we can take the finite sum
\[
f_J(x) = \alpha \, \phi(x) + \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} \beta_{jk} \, \psi_{jk}(x).
\]
This is called the resolution-\(J\) approximation, and it has \(2^J\) terms.
We consider an example below. Suppose we are interested in approximating
the Doppler function
\[
f(x) = \sqrt{x (1 - x)} \, \sin\left( \frac{2.1 \pi}{x + 0.05} \right).
\]
We can approximate this function by considering several resolutions (i.e.,
finite truncations of the wavelet expansion sum).
Below, we plot the original function along with its resolution-\(J\)
approximations for several values of \(J\) (figures in the linked notebook).
Here, the coefficients were computed using numerical quadrature.
We also plot the resolution-\(J\) approximation along with its coefficients.
The \(y\)-axis represents which resolution or level each coefficient comes from.
The heights of the bars are proportional to the sizes of the coefficients, and the
direction of each bar corresponds to the sign of the coefficient.
For instance, in certain applications the \(x\)-axis could represent time, and the resolutions
could then be interpreted as subintervals of time.
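A simplified Python sketch of this computation (my own; the full version with plots is in the notebook linked below):

```python
import numpy as np
from scipy.integrate import quad

def doppler(x):
    return np.sqrt(x * (1 - x)) * np.sin(2.1 * np.pi / (x + 0.05))

def psi_jk(x, j, k):
    # Shifted/rescaled Haar mother wavelet psi_{jk}.
    u = 2 ** j * np.asarray(x) - k
    s = np.where((u >= 0) & (u <= 0.5), -1.0,
                 np.where((u > 0.5) & (u <= 1), 1.0, 0.0))
    return 2 ** (j / 2) * s

def resolution_approx(x, J):
    # Scaling coefficient alpha = int_0^1 f(x) phi(x) dx, with phi = 1 on [0, 1].
    alpha, _ = quad(doppler, 0, 1, limit=200)
    fJ = alpha * np.ones_like(x)
    # Detail coefficients beta_{jk} via numerical quadrature.
    for j in range(J):
        for k in range(2 ** j):
            beta, _ = quad(lambda t: doppler(t) * psi_jk(t, j, k), 0, 1, limit=200)
            fJ += beta * psi_jk(x, j, k)
    return fJ

x = np.linspace(0, 1, 1024)
f5 = resolution_approx(x, J=5)   # the resolution-5 approximation
```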
Constructing smooth wavelets
Haar wavelets are simple to describe and are localized, i.e., their mass is
concentrated in one area. We can express these same ideas for more
general functions, which can give us approximations that are both smooth and localized.
Intuitively, it is useful to think of these concepts in terms of Haar
wavelets, knowing that they extend to more general functions in the following way.
Given any function \(\phi \in L_2(\mathbb{R})\), we can define subspaces of \(L_2(\mathbb{R})\) as follows:
\[
V_0 = \left\{ f : f(x) = \sum_{k \in \mathbb{Z}} c_k \, \phi(x - k), \;\; \sum_{k \in \mathbb{Z}} c_k^2 < \infty \right\},
\]
and, for each \(j\), \(V_j = \left\{ f : f(x) = g(2^j x) \text{ for some } g \in V_0 \right\}\).
Definition: We say that \(\phi\) generates a multiresolution analysis (MRA) of \(L_2(\mathbb{R})\) if
\[
V_j \subset V_{j+1} \text{ for all } j,
\]
and
\[
\bigcup_j V_j \text{ is dense in } L_2(\mathbb{R}),
\]
i.e., for any function \(f \in L_2(\mathbb{R})\), there exists a sequence of functions
\(f_1, f_2, \ldots\) such that each \(f_j \in V_j\) and \(\| f - f_j \| \rightarrow 0\) as \(j \rightarrow \infty\).
In other words, any function in \(L_2(\mathbb{R})\) can be approximated arbitrarily well by elements of the increasing subspaces \(V_j\).
Lemma: If \(\{V_j\}\) is an MRA generated by \(\phi\), then for each \(j\),
\[
\{\phi_{jk}, \; k \in \mathbb{Z}\}
\]
is an orthonormal basis for \(V_j\).
As an example, we can take the father Haar wavelet as the function \(\phi\).
Then the MRA generated by \(\phi\) is given by the subspaces \(V_0, V_1, V_2, \ldots\), where
each \(V_j\) is the set of functions in \(L_2(\mathbb{R})\) that are
piecewise constant on the intervals
\[
\left[ \frac{k}{2^j}, \frac{k+1}{2^j} \right), \quad k \in \mathbb{Z}.
\]
Code
Code (a Jupyter/IPython notebook) for generating these plots is available on GitHub.
References
[1] W. Härdle, G. Kerkyacharian, D. Picard, and A. Tsybakov. Wavelets, Approximation, and Statistical Applications. Springer, 1998.
[2] L. Wasserman. All of Nonparametric Statistics. Springer, 2006. Chapter 9.
In this post, we'll review linear systems and linear programming. We'll then focus on how to use LP relaxations to provide approximate solutions to other (binary integer) problems
that are NP-hard. Much of this post follows these randomized algorithms course notes [1].
Linear programming
One of the most basic ways of thinking of a problem is using linear equations.
It turns out that linear problems are also easy to solve, whereas nonlinear
problems are often hard to solve.
We can express a system of linear equations in matrix form as follows.
Let \(A \in \mathbb{R}^{m \times n}\) be the matrix of coefficients,
\(x \in \mathbb{R}^n \) a vector of variables (that we are interested in), and
\(b \in \mathbb{R}^m \) a vector of real numbers.
We can then express a system of \(m\) equations in \(n\) variables as the matrix
equation
\[
Ax = b.
\]
We can also consider a system of linear inequalities by replacing some of the
equalities with inequalities, i.e., \(Ax \geq b\), where \(\geq\) is a componentwise inequality.
(Throughout this post, we will overload the symbol \(\geq\) and other inequalities to denote both componentwise and scalar inequalities.)
The set of solutions is called the feasible region and is a convex polytope.
The faces of this convex polytope correspond to the linear inequalities, and any point in the feasible region satisfies the whole set of inequalities.
We are often interested in optimizing (e.g., minimizing or maximizing) some
linear function subject to a set of linear inequalities (i.e., constraints).
This is called a linear program (LP) and in its most general form can be expressed
as:
\[ \text{minimize } c^\top x \]
\[ \text{subject to } Ax \geq b, x\geq 0. \]
Here \(c \in \mathbb{R}^n\) and \(b \in \mathbb{R}^m\) are known vectors of coefficients, \(A \in \mathbb{R}^{m \times n}\) is a known matrix of
coefficients, and \(x \in \mathbb{R}^n\) is the unknown vector of the variables of interest.
(Fun history fact: linear programming was invented in 1939 by the Russian mathematician Leonid
Kantorovich to optimize the organization of industrial production and allocation
of a society’s resources.)
The optimal solution \(x^*\) of the objective function (subject to the constraints)
will be attained at some vertex/corner/extreme point of the convex feasible region. The simplex algorithm is a
method that enumerates the vertices one by one, decreasing the objective function at
every step.
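As a quick aside (not from the course notes), here is how one might solve a tiny LP of this form with SciPy; linprog expects upper-bound constraints, so we negate \(A\) and \(b\):

```python
import numpy as np
from scipy.optimize import linprog

# minimize x1 + 2*x2  subject to  x1 + x2 >= 1, x >= 0
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

res = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None)] * 2)
print(res.x)   # the optimal vertex: [1. 0.]
```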
LP relaxations and approximation algorithms
In a number of problems, the unknown variables are required to be integers,
which defines an integer program. Unlike linear programs, which can be solved
efficiently, integer programs, while often the more natural formulation in practice, are NP-hard.
However, what we can do is relax these integer programs so that they become
linear programs, by turning the integer constraints into linear inequalities.
Here we will look at three examples of LP relaxations. The first is an example
where, even though we solve the relaxed LP, we still end up with an integer
solution. The other examples require some approximation achieved by rounding.
The assignment problem
We are interested in assigning \(n\) jobs to \(n\) factories, each assignment having
some cost (raw materials, etc.). Let \(c_{ij}\) denote the cost of assigning job
\(i\) to factory \(j\), and let \(x_{ij} \in \{0, 1\}\) denote whether job \(i\) is assigned to
factory \(j\). Thus \(x_{ij}\) is a binary-valued variable, and the problem corresponds to a
binary integer program.
To write the relaxed LP program, we turn this binary constraint
into an inequality:
\[ x_{ij} \geq 0 \text{ and } x_{ij} \leq 1,\]
for all \(i, j \in [n]\). But we also want each job assigned to one
factory, and each factory to receive one job.
So, we add the constraint \(\sum_{j=1}^n x_{ij} = 1\) for each job \(i\),
to guarantee that each job is only assigned to one factory,
and the constraint \(\sum_{i=1}^n x_{ij} = 1\) for each factory \(j\), to ensure that each factory only gets a single job.
This results in the relaxed LP program
\[\text{minimize } \sum_{i,j \in [n]} c_{ij} x_{ij}\]
\[\text{subject to } 0 \leq x_{ij} \leq 1, \quad i,j \in [n] \]
\[\sum_{j=1}^n x_{ij}=1, \quad i \in [n]\]
\[\sum_{i=1}^n x_{ij}=1, \quad j \in [n]\]
It turns out that this LP has an optimal solution in which the variables are all binary, and so it actually solves the
assignment problem (see the sketch below).
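Here is a small SciPy sketch of the relaxed assignment LP on a random instance (illustrative code, not from the notes); for generic costs the LP optimum is already a permutation matrix:

```python
import numpy as np
from scipy.optimize import linprog

n = 4
rng = np.random.default_rng(1)
c = rng.random((n, n)).ravel()        # costs c_ij, flattened row-major

# Each job i is assigned once (row sums), and each factory j gets one job (column sums).
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1    # sum_j x_ij = 1
    A_eq[n + i, i::n] = 1             # sum_i x_ij = 1
b_eq = np.ones(2 * n)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * (n * n))
print(res.x.reshape(n, n).round(3))   # a 0/1 assignment (permutation) matrix
```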
However, in general, the solution of an LP relaxation is fractional rather than integral. Thus, to solve the original integer program, we
can round the fractional solution to get an integer solution.
We now discuss several examples where the LP relaxation gives us a fractional
solution and how we might round these solutions (e.g., deterministically with a
threshold, or with randomized rounding).
Approximate rounding solutions
Deterministic rounding
The simplest way to round a fractional solution to a binary one is by picking some threshold, e.g., 1/2, and rounding each variable up or down depending on whether it
is greater than or less than the threshold. We now consider the weighted vertex
cover problem, which is NP-hard, and show that this deterministic rounding
procedure gives us a ``good enough'' approximate solution; i.e., one whose cost is
within a constant factor of the optimal value.
LP relaxation for the weighted vertex cover problem
In the weighted vertex cover problem, we are given a graph \(G = (V, E)\) and
some weight \(w_i \geq 0\) for each vertex \(i \in V\). A vertex
cover is a subset \(S \subseteq V\) such that every edge \(\{i, j\} \in E\) has at least one endpoint in \(S\).
The optimization problem is: find a vertex cover with minimum total weight.
The relaxed LP program is:
\[\text{minimize } f(x) = w^\top x \]
\[\text{subject to } 0 \leq x \leq 1 \]
\[ \quad x_i + x_j \geq 1, \quad\forall \{i,j\} \in E \]
The first line is the objective, i.e., the total weight, where \(x_i\) is a
variable that represents whether vertex \(i\) is in the cover \(S\). The second line
gives the relaxed inequality, and the last line guarantees that we have a vertex
cover, i.e., that for every edge \(\{i,j\} \in E\), at least one of the two endpoints is picked (either \(i\) and/or \(j\)).
How good is deterministic rounding?
Suppose \(O\) is the optimal value of this program, and let \(M\) be the minimum
weight of a vertex cover. We have that \(O \leq M\), since every
binary integer solution is also a fractional solution.
Now apply deterministic rounding to get a new set \(S\), where every vertex \(i\)
with \(x_i \geq 1/2\) is in the set, and otherwise not in \(S\).
This produces a vertex cover, because for every edge \(\{i,j\} \in E\) we
know that we must satisfy the constraint \(x_i + x_j \geq 1\), which implies
that \(x_i \geq 1/2\) and/or \(x_j \geq 1/2\).
Since the optimal value of the LP is
\[O = \sum_{i=1}^n w_i x_i,\]
and the rounding only picks the vertices with \(x_i \geq 1/2\) (so each picked weight
\(w_i\) is at most \(2 w_i x_i\)), the resulting weight of \(S\) is at most twice the optimal value
of the LP, i.e., \(\sum_{i \in S} w_i \leq 2O \leq 2M\). Thus, we have a resulting vertex cover with cost within
a factor of 2 of the optimum cost.
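A minimal Python sketch of the LP relaxation plus deterministic rounding, on an illustrative weighted 4-cycle (my own example):

```python
import numpy as np
from scipy.optimize import linprog

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # a 4-cycle
w = np.array([1.0, 2.0, 1.0, 2.0])                # vertex weights
n = len(w)

# Each edge {i,j} requires x_i + x_j >= 1, written as -(x_i + x_j) <= -1.
A_ub = np.zeros((len(edges), n))
for row, (i, j) in enumerate(edges):
    A_ub[row, i] = A_ub[row, j] = -1
b_ub = -np.ones(len(edges))

res = linprog(w, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n)
cover = np.flatnonzero(res.x >= 0.5)              # deterministic threshold at 1/2
print(cover, w[cover].sum())                      # cost at most twice the LP optimum
```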
Randomized rounding
Instead of the deterministic thresholding procedure, we could consider a
randomized rounding procedure. One simple randomized rounding procedure is to round
the fractional variable \(x_i\) up to 1 with probability \(x_i\) (and down to 0 with
probability \(1 - x_i\)). The rounded variable therefore has expectation \(x_i\),
and, by linearity of expectation, any linear constraint that the fractional solution
satisfies is satisfied in expectation by the rounded variables.
We will consider using this randomized rounding procedure for the MAX-2SAT
problem. In this problem, we have boolean variables \(x_1, \ldots, x_n\) and a set of clauses of at most
two literals each, where each literal is either some variable \(x_i\) or its negation \(\bar{x}_i\).
The goal is to find an assignment that maximizes the number of satisfied
clauses (i.e., clauses that evaluate to true).
LP relaxation for MAX2SAT
For this problem, we will form the following LP relaxation. Let \(C\) denote the
set of clauses. For each clause \(j \in C\), we express its literals in terms of the
variables: a literal contributes the term \(x_r\) if it is the variable \(x_r\) itself, and
\(1 - x_r\) if it is the negation \(\bar{x}_r\).
We will define a variable \(z_j\) for each clause \(j \in C\), where \(z_j\) represents whether clause \(j\) is satisfied.
Thus, taking the sum over all \(j \in C\) gives us the total number of satisfied clauses,
and this defines our objective function.
So, we can express this as the relaxed LP
\[\text{maximize } \sum_{j \in C} z_j\]
\[\text{subject to } 0 \leq x_i \leq 1, \quad 0 \leq z_j \leq 1,\]
\[z_j \leq (\text{sum of the literal terms in clause } j), \quad j \in C,\]
where, e.g., the last constraint reads \(z_j \leq x_r + x_s\) for a clause \(x_r \vee x_s\), and \(z_j \leq (1 - x_r) + x_s\) for a clause \(\bar{x}_r \vee x_s\).
Here the \(x_i\) and \(z_j\) are both relaxed to be fractional, and the last line
ensures that the logic of the clauses holds.
How good is randomized rounding?
Now let \(O\) be the optimal value of this LP, and let \(M\) denote the number
of clauses satisfied by the best assignment in MAX-2SAT. We have that
\( M \leq O\).
Now apply randomized rounding to the fractional solution to get the 0/1
assignment. It turns out that the expected number of clauses satisfied is at
least \(\frac{3}{4} O\), i.e., at least \(3/4\) of the optimal value of the LP.
To see this, note that clauses have either size 1 or size 2. If clause \(j\)
is of size 1, say the single literal \(x_r\), then the probability it is satisfied is \(x_r\). But we
must have \(z_j \leq x_r\) from the constraints, so the probability this clause
is satisfied is at least \(z_j \geq \frac{3}{4} z_j\).
Now suppose we have a clause of size 2, e.g., \(x_r \vee x_s\). Note that the
probability of success after randomized rounding is 1 minus the probability that both variables are rounded to 0, i.e.,
\[1 - (1 - x_r)(1 - x_s) = x_r + x_s - x_r x_s,\]
but since we know from the constraint that
\[z_j \leq x_r + x_s,\]
and since \(x_r x_s \leq \left(\frac{x_r + x_s}{2}\right)^2\), we can write that the probability of success is at least
\[1 - \left(1 - \frac{z_j}{2}\right)^2 \geq \frac{3}{4} z_j.\]
So, since each clause \(j\) is individually satisfied with probability at least
\(\frac{3}{4} z_j\), by linearity of expectation, the expected number of clauses
satisfied is at least \(\frac{3}{4} \sum_{j \in C} z_j = \frac{3}{4} O \geq \frac{3}{4} M\).
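A sketch of the LP relaxation and randomized rounding on a toy MAX-2SAT instance (the literal encoding, positive integers for variables and negative integers for negations, is my own):

```python
import numpy as np
from scipy.optimize import linprog

n = 3
clauses = [(1, 2), (-1, 3), (-2, -3), (1, -3)]   # literal k means x_k; -k its negation
m = len(clauses)

# LP variables: x_1..x_n, then z_1..z_m; maximize sum z_j = minimize -sum z_j.
c = np.concatenate([np.zeros(n), -np.ones(m)])
A_ub = np.zeros((m, n + m))
b_ub = np.zeros(m)
for j, clause in enumerate(clauses):
    A_ub[j, n + j] = 1                           # z_j <= (sum of literal terms)
    for lit in clause:
        i = abs(lit) - 1
        if lit > 0:
            A_ub[j, i] -= 1                      # literal term x_i
        else:
            A_ub[j, i] += 1                      # literal term 1 - x_i
            b_ub[j] += 1

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * (n + m))
x_frac = res.x[:n]

rng = np.random.default_rng(0)
assignment = rng.random(n) < x_frac              # round x_i to 1 with probability x_i
satisfied = sum(any((lit > 0) == assignment[abs(lit) - 1] for lit in cl)
                for cl in clauses)
print(satisfied, -res.fun)                       # satisfied clauses vs. the LP optimum
```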
Summary
Here we have discussed approximate solutions to integer programs via LP relaxations and two simple rounding schemes. Other problems likely
require more complicated rounding schemes. Here we discussed two problems where
simple deterministic and randomized rounding give easy-to-analyze approximate solutions that are
within a constant factor of the optimal solution. Many more rounding schemes and algorithms
can be found in Williamson and Shmoys (2010) [2].
I recently switched my old Wordpress.com blog over to using Jekyll. The main reasons were
better math display and easier parsing, but it was also
easier to customize the layout (the layout uses HTML, CSS, and Liquid
templating code, and posts are written in Markdown).
I used Hyde, a theme for Jekyll (a static website generator) based on Poole. To get this set up, fork
the repository and follow the instructions in the Hyde and
Poole READMEs
(basically, you install Ruby and Jekyll).
I changed the content to fit my website in _config.yml – essentially the
sidebar text. Then, to build the site (with the standard Jekyll CLI, something like):
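```bash
jekyll build
```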
And to run the project locally:
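```bash
jekyll serve
```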
I then copied over my blog posts into the _posts directory, where each post was
prefaced by the YAML frontmatter:

---
layout: post
title: blah
summary: blah blah blah
---
and then I made some changes, e.g., putting headings in Markdown (see
this page for a great Markdown cheatsheet).
Unfortunately, the math rendering was a mess, so I had to make a few changes to
fix it.
MathJax
Then I added MathJax – basically this meant including MathJax in the
_includes/head.html page, plus a few other tweaks. In head.html, I included a script block like the following in the head of the file (the standard MathJax CDN include at the time; the exact configuration may have differed):
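```html
<script type="text/javascript"
  src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
```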
Previously, the syntax for entering LaTeX inline was "$latex 2+2$", whereas
now I can type "\\( 2+2 \\)" (or define my own macros), and "\\[ 2+2 \\]"
for block math rendering.
To get MathJax to play nice, I modified the _config.yml to use kramdown for
the markdown (see this post.)
For more details, see this blog post.
With a few find and replaces in vim, I changed the old syntax to the new syntax
pretty easily (kramdown recognizes $$ syntax).
Note: upon writing this blog post, I realized that using kramdown in Jekyll has
issues with syntax highlighting. A fix is to put this repo
in a folder called _plugins and then edit the _config.yml file
to use KramdownPygments for the markdown.
Also note that code blocks
are different from traditional Markdown.
A nice alternative to using a plugin
is to use Liquid highlighting, where you include code inside highlight tags, e.g.:
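{% highlight python %}
print("hello world")
{% endhighlight %}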
Adding tags
My old blog had tabs corresponding to a few tags. I wanted to recreate that in
the sidebar. I mostly followed this blog post.
First, I added the tags I wanted in the frontmatter of each post:

tags:
- machine learning
- statistics
Now, to add things to the sidebar, I created a few files in the top-level
directory: archive.md, statml.md, compsci.md, and other.md.
Each file looked something like this:

---
layout: page
title: compsci
---

## computer science posts

{% for post in site.posts %}
{% for tag in post.tags %}
{% if tag == "computer science" %}
* {{ post.date | date_to_string }} » [ {{ post.title }} ]({{ site.baseurl }}{{ post.url }})
{{ post.summary }}...
{% break %}
{% endif %}
{% endfor %}
{% endfor %}
This is basically a modification of this blog post but looping over the tags as well.
Here site.baseurl is set in _config.yml.
I also modified _layouts/page.html to stop displaying post.title at the
top.
Other tweaks
I added Twitter widgets to each post by putting in _layouts/post.html a block like the following (the standard Twitter share-button snippet of the time; the exact markup may have differed):
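```html
<a href="https://twitter.com/share" class="twitter-share-button" data-count="none">Tweet</a>
<script src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
```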
To add Font Awesome icons, add a stylesheet link like this in _includes/head.html (the CDN path and version may vary):
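```html
<link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css">
```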