Abstract: The Sketched Wasserstein Distance ($W^S$) is a new probability distance specifically tailored to finite mixture distributions. Given any metric $d$ defined on a set $A$ of probability distributions, $W^S$ is defined to be the most discriminative convex extension of this metric to the space $\mathcal{S} = \mathrm{conv}(A)$ of mixtures of elements of $A$. Our representation theorem shows that the space $(\mathcal{S}, W^S)$ constructed in this way is isomorphic to a Wasserstein space over $X = (A, d)$. This result establishes a universality property for the Wasserstein distances, revealing them to be uniquely characterized by their discriminative power for finite mixtures. We exploit this representation theorem to propose an estimation methodology based on Kantorovich--Rubinstein duality, and prove a general theorem showing that its estimation error can be bounded by the sum of the errors of estimating the mixture weights and the mixture components, for any estimators of these quantities. $W^S$ can be used for both discrete and continuous mixtures, and its general properties are valid in either case.

We derive sharp statistical properties for the estimated $W^S$ in the case of $p$-dimensional discrete $K$-mixtures, which we show can be estimated at a rate proportional to $\sqrt{K/N}$, up to logarithmic factors. The quantity $K$ does not need to be known prior to estimation, and is allowed to grow with $N$. We complement these bounds with a minimax lower bound on the risk of estimating the Wasserstein distance between distributions on a $K$-point metric space, which matches our upper bound up to logarithmic factors. This result is the first nearly tight minimax lower bound for estimating the Wasserstein distance between discrete distributions. Furthermore, we construct $\sqrt{N}$-asymptotically normal estimators of the mixture weights, and derive a $\sqrt{N}$ distributional limit of our estimator of $W^S$ as a consequence. An extensive simulation study and a data analysis provide strong support for the applicability of the new Sketched Wasserstein Distance between mixtures.
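
As an illustrative sketch of what the representation theorem and the duality-based estimator might look like in the finite case (the notation here is an assumption introduced for illustration, not taken from the abstract): suppose $A = \{A_1, \dots, A_K\}$, and write two mixtures as $r = \sum_k \alpha_k A_k$ and $s = \sum_k \beta_k A_k$ with weight vectors $\alpha, \beta$ in the probability simplex, assumed to be identifiable from $r$ and $s$. Under these assumptions, $W^S(r, s)$ would coincide with an ordinary Wasserstein distance between $\alpha$ and $\beta$ on the $K$-point metric space $(A, d)$, and its Kantorovich--Rubinstein dual suggests a plug-in estimator built from any estimates $\hat\alpha$, $\hat\beta$, $\hat A_k$:

```latex
% Primal (coupling) form on the K-point space (A, d); \Gamma(\alpha,\beta) is the
% set of K x K nonnegative matrices with row sums \alpha and column sums \beta.
W^S(r, s) \;=\; \min_{\gamma \in \Gamma(\alpha, \beta)} \sum_{k, l = 1}^{K} \gamma_{kl}\, d(A_k, A_l)

% Kantorovich--Rubinstein dual form: a supremum over 1-Lipschitz functions on (A, d).
W^S(r, s) \;=\; \sup_{f \in \mathcal{F}} \sum_{k=1}^{K} f_k\, (\alpha_k - \beta_k),
\qquad
\mathcal{F} \;=\; \bigl\{ f \in \mathbb{R}^K : f_k - f_l \le d(A_k, A_l) \ \text{for all } k, l \bigr\}

% Hypothetical plug-in estimator: replace weights and components by their estimates.
\widehat{W} \;=\; \sup_{f \in \widehat{\mathcal{F}}} \sum_{k=1}^{K} f_k\, (\hat\alpha_k - \hat\beta_k),
\qquad
\widehat{\mathcal{F}} \;=\; \bigl\{ f \in \mathbb{R}^K : f_k - f_l \le d(\hat A_k, \hat A_l) \ \text{for all } k, l \bigr\}
```

Written this way, the error of $\widehat{W}$ depends on the data only through $\hat\alpha - \alpha$, $\hat\beta - \beta$, and the estimated distances $d(\hat A_k, \hat A_l)$, which is consistent with the abstract's statement that the estimation error splits into a mixture-weight term and a mixture-component term.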

Short Bio: “Florentina Bunea is a Professor of Statistics in the Department of Statistics and Data Science at Cornell. She is affiliated with the Cornell graduate fields of Applied Mathematics and Computer Science. She obtained her Ph.D. at the University of Washington and was elected Fellow of the Institute of Mathematical Statistics for her earlier work on foundations for model selection and aggregation in parametric and non-parametric settings. Her current research interests lie at the intersection of Statistics and Machine Learning Theory, with a focus on the development of methods and theory for estimating a wide range of high dimensional models. Some of her latest works are on latent space clustering, inference in and prediction with interpretable regression models with latent factors, topic models and sparse discrete mixture estimation, as well as the development of new distances and notions of optimal transport for high dimensional mixtures. Her work is partially funded by NSF-DMS. She is an active conference organizer nationally and internationally, and she is on the editorial board of premier journals in Statistics. She is fully engaged with the creation of an intellectually diverse community in the foundations of Data Science at Cornell and beyond.”