The Extended Schwartz Theorem and strong posterior consistency
18 Feb 2021

Schwartz’s theorem is a classical tool for deriving posterior consistency and is the foundation of many modern results establishing frequentist consistency of Bayesian methods. As we discuss in our post on the original Schwartz result, the theorem itself has somewhat limited applicability: it is straightforward to use to establish weak consistency, provided the KL support condition on the prior holds, but it is generally not enough to guarantee posterior consistency with respect to strong metrics, such as the L1 distance, because uniformly consistent tests do not always exist for those metrics.
But an extension of Schwartz’s original theorem makes it much more broadly applicable. In this post, we review the extended Schwartz theorem, show how it can be used to recover the classical Schwartz theorem, and how it can be used to establish “strong” posterior consistency. For an introduction to posterior consistency and basic definitions, see our post on the original Schwartz result.
Preliminaries and notation
We consider the same i.i.d. modeling setting as in our discussion of classical Schwartz. Let P denote our model class, a space of densities with respect to a σ-finite measure μ: each p∈P is a measurable, nonnegative function p:X→R that integrates to 1. We denote the distribution corresponding to a density p∈P by P, i.e., p=dP/dμ, and the joint distribution of n∈N∪{∞} samples by P(n).
Let Π be a prior distribution on our space of models P, and consider the following Bayesian model:
$$p \sim \Pi, \qquad X_1, \dots, X_n \mid p \overset{\text{i.i.d.}}{\sim} p,$$

where each Xi∈X. In this post, we assume the model is well-specified and that there is some true density p0∈P from which the data are generated.
Under certain regularity conditions on P, (a version of) the posterior distribution, Π(⋅|X1,…,Xn), can be expressed via Bayes’s formula: for all measurable subsets A⊆P,
$$\Pi(A \mid X_1, \dots, X_n) = \frac{\int_A \prod_{i=1}^n p(X_i) \, d\Pi(p)}{\int_{\mathcal{P}} \prod_{i=1}^n p(X_i) \, d\Pi(p)}.$$

The Extended Schwartz Theorem
The statement of the theorem in Ghosal and van der Vaart has two parts; here we state the first part below as a theorem – which provides tools for more general results – and the second part as a corollary – which leads to insights directly about posterior consistency. Lastly, we’ll discuss how to prove the classical Schwartz’s theorem using the extended version, and how to get results on strong posterior consistency.
Theorem (Extended Schwartz). Suppose there exists a set P0⊆P and a number c with Π(P0)>0 and K(p0;P0)≤c. Let Pn⊆P be a set that satisfies either condition (a) or condition (b) for some constant C>c:
(a) There exist tests ϕn such that
$$\phi_n \xrightarrow{\,n\to\infty\,} 0, \ P_0^{(\infty)}\text{-a.s.}, \qquad \text{and} \qquad \int_{\mathcal{P}_n} P^{(n)}(1 - \phi_n) \, d\Pi(p) \le e^{-Cn}.$$

(b) The prior satisfies Π(Pn)≤e−Cn.
Then
$$\Pi(\mathcal{P}_n \mid X_{1:n}) \xrightarrow{\,n\to\infty\,} 0, \quad P_0^{(\infty)}\text{-a.s.}$$

Remarks on theorem statement
The theorem above gives conditions under which the posterior probability of the set Pn goes to 0 with probability 1. The price of this greater generality is that the theorem can appear more mysterious in application.
First recall that P is the space of densities of our model class. The statement describes behaviors for two general subsets of densities P0 and Pn:
- What are the sets P0? These sets appear in the conditions on the prior and the KL divergence bound, and serve as a prior support condition. E.g., in the classical Schwartz theorem, this role is played by KL neighborhoods of p0.
- What are the sets Pn? These sets appear in conditions (a) and (b) as well as in the asymptotic result. The theorem lets us establish when the posterior probability of a general subset Pn⊆P goes to 0 a.s., rather than only the specific subset needed for consistency (e.g., the complement of a neighborhood of p0). As we’ll see below, we’ll prove posterior consistency by decomposing the complement of a neighborhood, Uc, into a union of two subsets and applying this general theorem to each.
Regarding conditions (a) and (b), note that either one suffices to imply the result. Condition (b) is a special case of condition (a) with ϕn=0, but it serves as a useful restatement of (a), as we’ll see in the proof of the corollary. Thus, to prove Π(Pn|X1:n)→0 (P(∞)0-a.s.), it is enough to verify condition (a).
Condition (a). The two parts of this condition should look familiar from the classical Schwartz testing conditions. A condition commonly used to verify the classical Schwartz theorem is that the sequence of tests has exponentially small error probabilities (Ghosh and Ramamoorthi [2], Proposition 4.4.1): i.e., for some C>0,
$$P_0^{(n)}(\phi_n) \le e^{-Cn}, \qquad \sup_{p \in \mathcal{P}_n} P^{(n)}(1 - \phi_n) \le e^{-Cn},$$

where in the classical Schwartz theorem, Pn=Uc.
Here P(n)0(ϕn)≤e−Cn implies, via the Borel–Cantelli lemma, the first part of condition (a) in the extended Schwartz theorem, i.e., that ϕn→0 a.s. The second part of condition (a) replaces the supremum bound over p∈Uc with an integral of P(n)(1−ϕn) against the prior Π over the set Pn.
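For concreteness, consider testing p0=N(0,1) against the point alternative p=N(1,1) with the tests ϕn=1{X̄n>1/2}. Both error probabilities are Gaussian tail probabilities and decay exponentially. A minimal check in plain Python; the densities, threshold, and rate C=1/8 are illustrative choices, and the bound used is the standard Gaussian tail inequality P(Z>t)≤e^{−t²/2}:

```python
import math

def normal_tail(t):
    # P(Z > t) for a standard normal Z, via the complementary error function
    return 0.5 * math.erfc(t / math.sqrt(2))

def type1_error(n):
    # phi_n = 1{ mean(X_1..X_n) > 1/2 }; under p0 = N(0,1), the mean is N(0, 1/n)
    return normal_tail(0.5 * math.sqrt(n))

def type2_error(n):
    # under the alternative p = N(1,1), the mean is N(1, 1/n); error = P(mean <= 1/2)
    return normal_tail(0.5 * math.sqrt(n))

# Both errors sit below e^{-Cn} with C = 1/8, by P(Z > t) <= exp(-t^2 / 2)
for n in (10, 50, 100):
    bound = math.exp(-n / 8)
    assert type1_error(n) <= bound and type2_error(n) <= bound
```

The same exponential decay is what Borel–Cantelli converts into the a.s. statement ϕn→0.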
Condition (b). This condition essentially says that if Pn is a set that the prior assigns a small amount of mass to, then asymptotically, the posterior will also assign a small amount of mass to that set.
Proof of theorem
The proof itself follows similar steps to that of the proof of the classical Schwartz theorem; see our post on the classical Schwartz for additional details. We include a sketch below for completeness.
Since the test functions satisfy ϕn∈[0,1], we can bound the posterior from above as

$$\Pi(\mathcal{P}_n \mid X_1, \dots, X_n) \le \phi_n + (1 - \phi_n) \, \frac{\int_{\mathcal{P}_n} \prod_{i=1}^n \frac{p(X_i)}{p_0(X_i)} \, d\Pi(p)}{\int_{\mathcal{P}} \prod_{i=1}^n \frac{p(X_i)}{p_0(X_i)} \, d\Pi(p)}.$$

The first term ϕn→0, P(∞)0-a.s., by the first part of assumption (a).
Using very similar steps as in the classical Schwartz proof, the denominator is bounded below as follows: for any c′>c, eventually P(∞)0-a.s.,

$$\int_{\mathcal{P}} \prod_{i=1}^n \frac{p(X_i)}{p_0(X_i)} \, d\Pi(p) \ge \Pi(\mathcal{P}_0) e^{-c'n}.$$

Applying Fubini’s theorem, we can bound the expectation of the numerator as follows:
$$\begin{aligned}
P_0^{(n)}\left( (1 - \phi_n) \int_{\mathcal{P}_n} \prod_{i=1}^n \frac{p(X_i)}{p_0(X_i)} \, d\Pi(p) \right)
&= \int (1 - \phi_n) \int_{\mathcal{P}_n} \prod_{i=1}^n \frac{p(X_i)}{p_0(X_i)} \, d\Pi(p) \, dP_0^{(n)} \\
&= \int_{\mathcal{P}_n} \int (1 - \phi_n) \, dP^{(n)} \, d\Pi(p) \\
&= \int_{\mathcal{P}_n} P^{(n)}(1 - \phi_n) \, d\Pi(p) \le e^{-Cn},
\end{aligned}$$

where in the last line, we applied the second part of assumption (a).
Finally, pick c′′∈(c′,C). By Markov’s inequality, P(n)0(numerator ≥ e−c′′n) ≤ ec′′n e−Cn = e−(C−c′′)n, and ∑n≥1 e−n(C−c′′)<∞, so by the Borel–Cantelli lemma the numerator is eventually at most e−c′′n, P(∞)0-a.s. Combined with the denominator lower bound, the second term is eventually at most e−(c′′−c′)n/Π(P0), which goes to 0 almost surely.
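To see the statement in action, here is a toy simulation, far simpler than the nonparametric setting: a finite model class of Bernoulli densities with a uniform prior, so the prior puts positive mass on the truth and the KL support condition holds trivially. The posterior mass on the false densities (playing the role of Pn) vanishes as n grows; all names below are illustrative:

```python
import math, random

def posterior_over_grid(data, thetas, prior):
    # Bayes' formula over a finite set of Bernoulli(theta) densities,
    # computed stably in log space
    k, n = sum(data), len(data)
    logliks = [k * math.log(t) + (n - k) * math.log(1 - t) for t in thetas]
    m = max(logliks)
    weights = [p * math.exp(ll - m) for p, ll in zip(prior, logliks)]
    total = sum(weights)
    return [w / total for w in weights]

random.seed(0)
thetas = [0.2, 0.5, 0.8]       # toy model class P
prior = [1 / 3, 1 / 3, 1 / 3]  # uniform prior: positive mass at the truth
theta0 = 0.5                   # true density p0
data = [1 if random.random() < theta0 else 0 for _ in range(500)]
post = posterior_over_grid(data, thetas, prior)
# post concentrates on thetas[1] = theta0 as n grows
```

Here the "tests" are unnecessary: with finitely many densities, the prior-mass argument alone (condition (b) applied to each false density) already drives the posterior to the truth.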
Corollary: posterior consistency
The theorem above implies the following result on posterior consistency, which is more readily compared to the statement of the classical Schwartz theorem.
Corollary (Extended Schwartz). Consider the model
$$p \sim \Pi, \qquad X_{1:n} \mid p \overset{\text{i.i.d.}}{\sim} p.$$

Suppose that p0 is in the KL support of the prior, i.e., p0∈KL(Π), and that for every neighborhood U of p0, there exist a constant C>0, measurable partitions P=Pn,1⊔Pn,2, and tests ϕn such that the following conditions hold:
(i) The test functions satisfy
$$P_0^{(n)} \phi_n \le e^{-Cn} \qquad \text{and} \qquad \sup_{p \in \mathcal{P}_{n,1} \cap U^c} P^{(n)}(1 - \phi_n) \le e^{-Cn}.$$

(ii) The prior satisfies Π(Pn,2)≤e−Cn.
Then the posterior is consistent at p0: i.e., for every neighborhood U of p0,
$$\Pi(U^c \mid X_{1:n}) \xrightarrow{\,n\to\infty\,} 0, \quad P_0^{(\infty)}\text{-a.s.}$$

Remarks on corollary statement
The conditions are essentially the same as in the classical Schwartz theorem, but now we split the space of densities P into (1) a part Pn,1 that goes into the testing condition, and (2) a part Pn,2 that has small prior mass and therefore small posterior mass (and so does not need to be tested). In the Bayesian asymptotics literature, the sequence of sets {Pn,1} is often called a sieve.
If we take Pn,1=P and Pn,2=∅, then the extended Schwartz conditions (i) and (ii) reduce to the same testing conditions as classical Schwartz, i.e., the test sequence is uniformly consistent.
Proof of the corollary
The result follows from applying parts (a) and (b) of the theorem above separately.
Fix a neighborhood U of p0 and a constant ϵ with 0<ϵ<C, and let P0=Kϵ(p0) be the ϵ-KL-neighborhood of p0. Since p0∈KL(Π), we have Π(P0)=Π(Kϵ(p0))>0, and by construction K(p0;P0)≤ϵ<C.
Apply part (a) with this P0 and Pn=Uc∩Pn,1. Condition (i) implies, via the Borel–Cantelli lemma, that ϕn→0, P(∞)0-a.s., and since Π is a probability measure,

$$\int_{U^c \cap \mathcal{P}_{n,1}} P^{(n)}(1 - \phi_n) \, d\Pi(p) \le \sup_{p \in \mathcal{P}_{n,1} \cap U^c} P^{(n)}(1 - \phi_n) \le e^{-Cn}.$$

Then, we have that Π(Uc∩Pn,1|X1:n)→0 a.s.
Now apply part (b) with the same P0 and Pn=Pn,2. Then, we have that Π(Pn,2|X1:n)→0 a.s.
Putting these together, we have Π(Uc|X1:n)≤Π(Uc∩Pn,1|X1:n)+Π(Pn,2|X1:n)→0,P(∞)0-a.s.
Strong posterior consistency
In this section, we state a theorem that establishes strong posterior consistency, i.e., posterior consistency with respect to the strong topology; this result is a consequence of the extended Schwartz theorem above. Note: Ghosal and van der Vaart (2017) [1] use the terminology “strongly consistent” to refer to P(∞)0-a.s. consistency, rather than to convergence in the strong topology.
Preliminaries: covering and metric entropy
In order to establish a more direct strong posterior consistency result, the (extended Schwartz) testing condition is replaced with a bound on the complexity of the space, measured via the covering number N(ϵ,Pn,1,d) of the set Pn,1⊆P. To define the covering number, we first need another definition.
A set S is called an ϵ-net for T if for every t∈T, there exists s∈S such that d(s,t)<ϵ. This means that the set T is covered by a collection of balls of radius ϵ around the points in the set S. The ϵ-covering number, denoted by N(ϵ,T,d), is the minimum cardinality of an ϵ-net.
The quantity logN(ϵ,T,d) is called the metric entropy, and a bound on this quantity appears as a condition in many posterior concentration results.
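To make these definitions concrete, the following sketch builds an ϵ-net greedily over a finite point set; the point set and metric here are toy choices. The greedy net is ϵ-separated and covers T by construction, so its size upper-bounds the covering number N(ϵ,T,d):

```python
def greedy_eps_net(points, eps, d):
    # Greedily add a point whenever it is at distance >= eps from the current net.
    # The result covers `points` with eps-balls, so N(eps, T, d) <= len(net).
    net = []
    for t in points:
        if all(d(s, t) >= eps for s in net):
            net.append(t)
    return net

# Toy example: T is a grid in [0, 1] with the absolute-value metric
T = [i / 100 for i in range(101)]
net = greedy_eps_net(T, 0.1, lambda a, b: abs(a - b))
# every t in T is within eps of some net point; log(len(net)) bounds the metric entropy
```

In the theorem below, it is exactly log of such a covering number, for the sieve Pn,1 under the distance d, that must stay below nϵ².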
Strong posterior consistency via extended Schwartz
Again we consider the model X1:n|p i.i.d.∼ p and p∼Π. The following theorem gives posterior consistency with respect to any distance d that is bounded above by the Hellinger distance, e.g., the L1 distance.
Theorem (strong consistency). Let d be a distance that generates convex balls and satisfies d(p0,p)≤dH(p0,p) for every p. Suppose that for every ϵ>0, there exist partitions P=Pn,1⊔Pn,2 and a constant C>0 such that, for sufficiently large n,
- logN(ϵ,Pn,1,d)≤nϵ2,

- Π(Pn,2)≤e−Cn.
Suppose p0∈KL(Π). Then the posterior distribution is consistent at p0 with respect to d.
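As a sanity check on the entropy condition, consider a toy parametric subcase (an illustration, not part of the theorem): suppose P is indexed by θ in a bounded set Θ⊆R^k and θ↦p_θ is L-Lipschitz with respect to d. Then covering Θ covers P:

```latex
% Covering numbers of a Lipschitz-indexed class inherit those of the index set:
N(\epsilon, \mathcal{P}, d) \;\le\; N(\epsilon / L, \Theta, \|\cdot\|)
  \;\le\; \Big(\tfrac{C L}{\epsilon}\Big)^{k},
\qquad\text{so}\qquad
\log N(\epsilon, \mathcal{P}, d) \;\lesssim\; k \log \tfrac{L}{\epsilon}.
% For any fixed eps > 0 this is eventually below n * eps^2, so the entropy
% condition holds with P_{n,1} = P and P_{n,2} = empty, making (ii) vacuous.
```

In other words, for well-behaved parametric families no sieve is needed; the partition into Pn,1 and Pn,2 earns its keep for genuinely nonparametric classes whose global metric entropy is infinite.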
References
- Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press.

- Ghosh, J.K. and Ramamoorthi, R.V. (2003). Bayesian Nonparametrics. New York: Springer.

- Kleijn, B., van der Vaart, A., and van Zanten, H. Lectures on Nonparametric Bayesian Statistics.