By simplifying complex real-world phenomena, probabilistic methods have proven able to accelerate applications in discovery and design. However, classical theory often evaluates models under the assumption that they are perfect representations of the observed data. There remains the danger that these simplifications might sometimes lead to failure under real-world conditions. This dissertation identifies popular data analyses that can yield unreliable conclusions (and in some cases ones that are arbitrarily unreliable) under such “misspecification.” But we also show how to practically navigate misspecification.
To begin, we consider clustering, a mainstay of modern unsupervised data analysis, using Bayesian finite mixture models. Some scientists are interested not only in finding meaningful groups of data but also in learning the number of such clusters. We provide novel theoretical results and empirical studies showing that, no matter how small the misspecification, some common approaches, including Bayesian robustness procedures, for learning the number of clusters give increasingly wrong answers as one receives more data.
But using imperfect models need not be hopeless. For instance, we consider a popular Bayesian modeling framework for graphs based on the assumption of vertex exchangeability. A consequence of this assumption is that the resulting graph models generate dense graphs with probability 1 and are therefore misspecified for sparse graphs, a common property of many real-world graphs. To address this undesirable scaling behavior, we introduce an alternative generative modeling framework and prove that it generates a range of sparse and dense scaling behaviors; we also show empirically that it can generate graphs with sparse power law scaling behavior.
Finally, we consider the case where a researcher has access to a sequence of approximate models that become arbitrarily more complex at the cost of more computation, which is common in applications with simulators of physical dynamics or models requiring numerical approximations of some fidelity. In this case, we show how to obtain estimates as though one had access to the most complex model. In particular, we propose a framework for constructing Markov chain Monte Carlo algorithms that asymptotically simulate from the most complex model while only ever evaluating models from the sequence of approximate models.

Type

Publication

PhD Dissertation, Princeton University