Finite mixture models are typically inconsistent for the number of components

Diana Cai,
Trevor Campbell,
Tamara Broderick

Abstract

Scientists and engineers are often interested in learning the number of subpopulations (or clusters) present in a data set. It is common to use a Dirichlet process mixture model (DPMM) for this purpose. But Miller and Harrison (2013) warn that the DPMM posterior is severely inconsistent for the number of clusters when the data are truly generated from a finite mixture; that is, the posterior probability of the true number of clusters goes to zero in the limit of infinite data. A potential alternative is to use a finite mixture model (FMM) with a prior on the number of clusters. Past work has shown the resulting posterior in this case is consistent. But these results crucially depend on the assumption that the cluster likelihoods are perfectly specified. In practice, this assumption is unrealistic, and empirical evidence (Miller and Dunson, 2018) suggests that the posterior on the number of clusters is sensitive to the likelihood choice. In this paper, we prove that under even the slightest model misspecification, the FMM posterior on the number of components is also severely inconsistent. We support our theory with empirical results on simulated and real data sets.

Preliminary versions appeared in the NeurIPS 2019 Workshop on Machine Learning with Guarantees and in the Symposium on Advances in Approximate Bayesian Inference 2017, co-located with NeurIPS 2017.