Kernel methods offer a mathematically elegant toolkit to tackle machine learning problems ranging from probabilistic inference to deep learning. Recently, a subfield of kernel methods known as Hilbert space embedding of distributions has grown in popularity [ ], thanks to foundational work done in our department over the last 10+ years. For a probability distribution $\mathbb{P}$ over a measurable space $\mathcal{X}$, the kernel mean embedding of $\mathbb{P}$ is defined as the mapping $\mu: \mathbb{P} \mapsto \int k(x,\cdot) \,\mathrm{d}\mathbb{P}(x)$, where $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is a positive definite kernel function. Its applications include, but are not limited to, comparing real-world distributions, independence testing, differentially private learning, and assessing the goodness-of-fit of a model. We have a history of contributions to the state of the art in this area; some of our recent work is summarized below.
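As a concrete illustration (our own sketch, not taken from the cited works), the empirical mean embedding of a sample is simply the average of kernel features, and the squared RKHS distance between two such embeddings yields the maximum mean discrepancy (MMD). The Gaussian kernel, bandwidth, and function names below are illustrative choices.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * bandwidth**2))

def mmd_squared(X, Y, bandwidth=1.0):
    """Biased estimate of MMD^2 = ||mu_P - mu_Q||^2 in the RKHS,
    where mu_P and mu_Q are the empirical mean embeddings of the two samples."""
    K_xx = gaussian_kernel(X, X, bandwidth)
    K_yy = gaussian_kernel(Y, Y, bandwidth)
    K_xy = gaussian_kernel(X, Y, bandwidth)
    return K_xx.mean() + K_yy.mean() - 2 * K_xy.mean()

# Example: two Gaussian samples whose means differ
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(200, 2))
print(mmd_squared(X, Y))
```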
Based on kernel mean embeddings, we developed privacy-preserving algorithms that allow a dataset to be released in a differentially private manner [ ]. In applications such as probabilistic programming, transforming a base random variable $X$ with a function $f$ is a basic building block for manipulating a probabilistic model, and it then becomes necessary to characterize the distribution of $f(X)$. In [ ], we showed that for any continuous function $f$, consistent estimators of the mean embedding of $X$ lead to consistent estimators of the mean embedding of $f(X)$. For Mat\'ern kernels and sufficiently smooth functions, we provided rates of convergence.
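A minimal sketch of the pushforward construction underlying this result: pushing the samples through $f$ and averaging kernel features gives the plug-in estimator of the mean embedding of $f(X)$, here evaluated at query points. The Gaussian kernel, the specific $f$, and the function names are our own illustrative choices, not those of [ ].

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * bandwidth**2))

def pushforward_embedding_eval(samples, f, query, bandwidth=1.0):
    """Evaluate the empirical mean embedding of f(X) at query points:
    mu_{f(X)}(t) ~= (1/n) * sum_i k(f(x_i), t)."""
    fx = np.stack([f(x) for x in samples])  # push each sample through f
    return gaussian_kernel(query, fx, bandwidth).mean(axis=1)

# Illustrative: X ~ N(0, I), f a smooth nonlinear map
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
f = lambda x: np.array([np.sin(x[0]), x[0] * x[1]])
t = np.zeros((1, 2))
print(pushforward_embedding_eval(X, f, t))
```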
In [ ], we generalized the two-variable Hilbert-Schmidt independence criterion (HSIC) to a $d$-variable counterpart (dHSIC) that allows for testing the joint independence of $d$ multivariate random variables. In [ ], we addressed the problem of measuring the relative goodness-of-fit of two models using kernel mean embeddings. Given two candidate models and a set of target observations, the proposed test produces a set of interpretable examples that indicate the regions in the data domain where one model fits better than the other, and its runtime complexity is linear in the sample size. Inspired by the selective inference framework, we also proposed kernel-based two-sample tests that enable hyperparameter learning and testing on the full sample without data splitting [ ].
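For concreteness, the following sketch computes the standard V-statistic estimator of dHSIC for $d$ jointly observed sample blocks; the Gaussian kernels, fixed bandwidth, and function names are illustrative assumptions on our part, not implementation details taken from [ ].

```python
import numpy as np

def gaussian_gram(X, bandwidth=1.0):
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / (2 * bandwidth**2))

def dhsic(blocks, bandwidth=1.0):
    """V-statistic estimate of dHSIC for a list of d sample matrices of shape (n, dim_j),
    where row i of every block corresponds to the same joint observation."""
    grams = [gaussian_gram(S, bandwidth) for S in blocks]
    term1 = np.mean(np.prod(grams, axis=0))                      # (1/n^2) sum_ab prod_j K^j_ab
    term2 = np.prod([K.mean() for K in grams])                   # prod_j (1/n^2) sum_ab K^j_ab
    term3 = 2.0 * np.mean(np.prod([K.mean(axis=1) for K in grams], axis=0))
    return term1 + term2 - term3

# Example: three jointly independent blocks should give a value close to zero
rng = np.random.default_rng(2)
blocks = [rng.normal(size=(300, 1)) for _ in range(3)]
print(dhsic(blocks))
```

For $d=2$ this reduces to the usual biased HSIC estimator.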
In [ ], we overcame long-standing theoretical limitations of the operator-based conditional mean embedding (CME) by viewing it as a Bochner conditional expectation. As by-products, we presented the maximum conditional mean discrepancy (MCMD) and the Hilbert-Schmidt conditional independence criterion (HSCIC) as natural extensions of the MMD and the HSIC to conditional distributions. In treatment effect analysis, we were thus able to account for distributional aspects of treatment effects while capturing heterogeneity across the population [ ]. In [ ], we introduced a conditional density estimator based on the CME that captures multivariate, multimodal output densities with competitive performance compared to recent neural conditional density models and Gaussian processes. We also applied the CME to model the Perron–Frobenius or Koopman operator, which is essential in the global analysis of complex dynamical systems [ ]. Lastly, we ventured into econometrics by providing a novel representation of conditional moment restrictions, called the maximum moment restriction (MMR), which enables a new family of kernel-based specification tests [ ].
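To make the CME concrete, here is a minimal sketch of its standard kernel-ridge-regression estimator, $\hat{\mu}_{Y|X=x} = \sum_i \beta_i(x)\, k_Y(y_i,\cdot)$ with $\beta(x) = (K_X + n\lambda I)^{-1} k_X(x)$. The Gaussian kernels, regularization value, and function names below are illustrative assumptions, not the estimators used in the cited works.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

def cme_weights(X, x_query, reg=1e-3, bandwidth=1.0):
    """Kernel-ridge weights beta(x) = (K_X + n*lambda*I)^{-1} k_X(x), defining the
    empirical conditional mean embedding mu_{Y|X=x} = sum_i beta_i(x) k_Y(y_i, .)."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, bandwidth)
    k_x = gaussian_kernel(X, x_query, bandwidth)          # shape (n, m)
    return np.linalg.solve(K + n * reg * np.eye(n), k_x)  # shape (n, m)

def cme_evaluate(X, Y, x_query, y_query, reg=1e-3, bandwidth=1.0):
    """Estimate E[k_Y(Y, y) | X = x] for each paired query point (x, y)."""
    beta = cme_weights(X, x_query, reg, bandwidth)         # (n, m)
    K_yq = gaussian_kernel(Y, y_query, bandwidth)          # (n, m)
    return np.sum(beta * K_yq, axis=0)

# Illustrative: Y = sin(X) + noise; probe the conditional embedding along the curve
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(400, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=X.shape)
xq = np.array([[0.0], [1.5]])
yq = np.sin(xq)
print(cme_evaluate(X, Y, xq, yq))
```

Comparing such weighted embeddings across two conditioning variables, or across treatment groups, is the basic computation behind quantities like the MCMD.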
In recent years, connections between quantum computing and machine learning have attracted attention [ ]. Building upon our tools, we defined quantum mean embeddings as a natural extension of kernel mean embeddings to the quantum setting, with appealing theoretical properties [ ]. Moreover, we investigated the limitations of learning with quantum kernels and showed that kernels defined on lower-dimensional subspaces may require exponentially many measurements to evaluate [ ], reminiscent of the barren plateau phenomenon in quantum neural networks (QNNs). For QNNs, we showed how to adaptively allocate measurements in order to improve the efficiency of gradient-based optimization [ ].
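As a toy illustration only (not drawn from the cited works), a fidelity-type quantum kernel $k(x,x') = |\langle \psi(x)|\psi(x')\rangle|^2$ can be simulated classically for a simple single-qubit angle-encoding feature map; the encoding and all names below are illustrative assumptions.

```python
import numpy as np

def angle_encoding_state(x):
    """Toy quantum feature map: product state
    |psi(x)> = tensor_j (cos(x_j/2)|0> + sin(x_j/2)|1>), built as an explicit state vector."""
    state = np.array([1.0])
    for xj in x:
        qubit = np.array([np.cos(xj / 2.0), np.sin(xj / 2.0)])
        state = np.kron(state, qubit)
    return state

def quantum_kernel(x, y):
    """Fidelity kernel k(x, y) = |<psi(x)|psi(y)>|^2."""
    return np.abs(angle_encoding_state(x) @ angle_encoding_state(y)) ** 2

x = np.array([0.3, 1.2, -0.7])
y = np.array([0.1, 1.0, -0.5])
print(quantum_kernel(x, y))
```

On hardware, this overlap is not read off a state vector but estimated from repeated measurements, which is where the measurement costs discussed above enter.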