The problems studied in the department can be subsumed under the heading of empirical inference, i.e., inference performed on the basis of empirical data. Empirical inference includes statistical learning and the inference of causal structures from data, leading to models that provide insight into the underlying mechanisms and make predictions about the effect of interventions. Likewise, the type of empirical data can vary, ranging from biological measurements (e.g., in neuroscience) to astronomical observations. We are conducting theoretical, algorithmic, and experimental studies in empirical inference.
The department was started around statistical learning theory and kernel methods. It has since broadened its set of inference tools to include probabilistic methods and a strong focus on issues of causality. In terms of the inference tasks being studied, we have moved towards tasks that go beyond the well-studied problem of supervised learning, such as semi-supervised learning or transfer learning. We analyze challenging datasets from biology, astronomy, and other domains. No matter whether applications are done in the department or in collaboration with external partners, considering a whole range of applications helps us study principles and methods of inference, rather than inference applied to one specific problem domain.
The most competitive publication venues in empirical inference are NeurIPS (Neural Information Processing Systems), ICML (International Conference on Machine Learning), ICLR (International Conference on Learning Representations), UAI (Uncertainty in Artificial Intelligence), and for theoretical work, COLT (Conference on Learning Theory). Our work has earned us a number of awards, including best paper prizes at most major conferences in the field (NeurIPS, ICML, UAI, COLT, ALT, CVPR, ECCV, ISMB, IROS, KDD). Recent awards include IEEE SMC 2016, ECML-PKDD 2016, an honorable mention at ICML 2017, the test-of-time award for Olivier Bousquet at NeurIPS 2018, received for work he started while he was a member of our department, the best paper award at ICML 2019, and a best student paper award at R:SS 2021.
Theoretical studies, algorithms, and applications often go hand in hand. A specific application may lead to a customized algorithm that turns out to be of independent theoretical interest. Such serendipity is a desired side effect of interaction across groups and research areas, for instance, during our departmental talks (one long talk and two short talks per week). It concerns cross-fertilization of methodology (e.g., kernel independence measures used in causal inference), the transfer of algorithmic developments or theoretical insights to application domains (e.g., causal inference in neuroscience or astronomy), or the combination of different application areas (e.g., using methods of computational photography for magnetic resonance imaging). The linear organization of this report cannot represent all of these connections. Below, we have opted for a structure that devotes individual sections to some main application areas and comprises methodological sections on kernel methods, causal inference, probabilistic inference, and statistical learning theory. In addition, one-page descriptions of a selection of projects are provided below.
Statistical Learning Theory
The goal of statistical learning theory is to assess to what extent algorithms learning from data can be successful in principle. The approach is to assume that the training data have been generated by an unknown random source and to develop mathematical tools to analyze the performance of a learning algorithm in statistical terms: for example, by bounding prediction errors ("generalization bounds") or by analyzing large sample behavior and convergence of algorithms on random input ("consistency").
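One classical result of this type, stated here purely for illustration (textbook material, not one of our own bounds): if the loss takes values in $[0,1]$ and $\mathfrak{R}_n(\mathcal{G})$ denotes the Rademacher complexity of the induced loss class $\mathcal{G}$, then with probability at least $1-\delta$ over an i.i.d. training sample of size $n$, simultaneously for all $f$ in the hypothesis class,
\[
  R(f) \;\le\; \widehat{R}_n(f) \;+\; 2\,\mathfrak{R}_n(\mathcal{G}) \;+\; \sqrt{\frac{\log(1/\delta)}{2n}},
\]
where $R(f)$ denotes the expected risk and $\widehat{R}_n(f)$ the empirical risk on the training sample.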
The department has made various contributions to this area, especially in areas of machine learning where statistical learning theory is less well developed. These include settings such as active [ ] and transfer learning, privacy-preserving machine learning, and unsupervised generative modeling.
In recent years, significant progress has been made in the field of unsupervised (deep) generative modeling with generative adversarial networks (GANs), variational autoencoders (VAEs), and other deep neural network based architectures, substantially improving the state of the art in sample quality, especially in the domain of natural images. Traditionally, the training objectives of VAEs and GANs have been based on f-divergences. Starting from Kantorovich's primal formulation of the optimal transport problem, we showed that it can be equivalently written in terms of probabilistic encoders, which are constrained to match the latent posterior and prior distributions [ ]. This leads to a new training procedure for latent variable models, called Wasserstein auto-encoders (WAEs) [ ]. Another theoretical study of generative modeling led us to propose AdaGAN, a boosting approach that greedily builds mixtures of generative models (e.g., GANs or VAEs) by solving, at each step, an optimization problem whose solution is the additional model that best reduces the discrepancy between the current mixture model and the target [ ].
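The following is a minimal sketch of the idea behind the MMD variant of the WAE objective: a reconstruction cost plus a penalty matching the aggregate distribution of latent codes to the prior. The architecture, hyperparameters, and names are illustrative and not those of the published implementation.
\begin{verbatim}
import torch
import torch.nn as nn

def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between two samples (RBF kernel)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

class WAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=8, h=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU(), nn.Linear(h, z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, x_dim))

    def loss(self, x, lam=10.0):
        z = self.enc(x)                            # deterministic encoder Q(Z | X)
        x_rec = self.dec(z)                        # reconstruction
        rec = ((x - x_rec) ** 2).sum(dim=1).mean() # transport cost c(x, x_rec)
        z_prior = torch.randn_like(z)              # samples from the prior P_Z
        penalty = rbf_mmd2(z, z_prior)             # match aggregate posterior to prior
        return rec + lam * penalty
\end{verbatim}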
Pioneered by our lab, kernel mean embedding (KME) of distributions plays an important role in many tasks of machine learning and statistics, including independence testing, density estimation, and more [ ]. Inspired by the James-Stein estimator, we introduced a type of KME shrinkage estimator and showed that it can converge faster than the standard KME estimator [ ]. We studied the optimality of KME estimators in the minimax sense in [ ] and showed that the convergence rate of the KME, as well as of other estimators published in the literature, is optimal and cannot be improved. In [ ], we study minimax optimal estimation of the maximum mean discrepancy (MMD) and its properties, which are linked to three fundamental concepts: universal, characteristic, and strictly positive definite kernels. Building on the analyses in [ ], we propose a three-sample test for comparing the relative fit of two models, based solely on their samples [ ], and further extend our results to derive a nonparametric goodness-of-fit test for conditional density models, one of the few tests of its kind [ ].
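For concreteness, the small sketch below implements the standard unbiased quadratic-time estimator of the squared MMD with a Gaussian kernel, together with a simple permutation test. It illustrates the basic KME/MMD machinery rather than the shrinkage, minimax-optimal, or relative goodness-of-fit estimators discussed above; all data and parameters are synthetic.
\begin{verbatim}
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    m, n = len(x), len(y)
    kxx = gauss_kernel(x, x, sigma); np.fill_diagonal(kxx, 0.0)
    kyy = gauss_kernel(y, y, sigma); np.fill_diagonal(kyy, 0.0)
    kxy = gauss_kernel(x, y, sigma)
    return kxx.sum() / (m * (m - 1)) + kyy.sum() / (n * (n - 1)) - 2 * kxy.mean()

def mmd_permutation_test(x, y, n_perm=500, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    stat = mmd2_unbiased(x, y, sigma)
    z = np.vstack([x, y]); m = len(x)
    null = np.empty(n_perm)
    for i in range(n_perm):
        idx = rng.permutation(len(z))
        null[i] = mmd2_unbiased(z[idx[:m]], z[idx[m:]], sigma)
    p_value = (np.sum(null >= stat) + 1) / (n_perm + 1)
    return stat, p_value

# Example: two samples from slightly shifted Gaussians.
rng = np.random.default_rng(1)
x = rng.standard_normal((100, 2))
y = rng.standard_normal((100, 2)) + 0.5
stat, p = mmd_permutation_test(x, y)
\end{verbatim}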
Kernel Methods
Learning algorithms based on kernel methods have enjoyed considerable success in a wide range of supervised learning tasks such as regression and classification. One reason for the popularity of these approaches is that they solve difficult nonparametric problems by mapping data points into high-dimensional spaces of features, specifically reproducing kernel Hilbert spaces (RKHS), in which linear algorithms can be brought to bear, leading to solutions taking the form of kernel expansions [ ].
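As a concrete instance of such a kernel expansion, the sketch below implements plain kernel ridge regression with a Gaussian kernel; by the representer theorem, the learned function is a weighted sum of kernel evaluations at the training points. The data and hyperparameters are synthetic and purely illustrative.
\begin{verbatim}
import numpy as np

def rbf(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_krr(x_train, y_train, lam=1e-2, sigma=1.0):
    """Solve (K + lam * n * I) alpha = y; the predictor is a kernel expansion."""
    n = len(x_train)
    K = rbf(x_train, x_train, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y_train)

def predict_krr(x_train, alpha, x_test, sigma=1.0):
    return rbf(x_test, x_train, sigma) @ alpha   # f(x) = sum_i alpha_i k(x_i, x)

# Example: fit a noisy sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit_krr(x, y)
y_hat = predict_krr(x, alpha, x)
\end{verbatim}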
Building on this, we develop kernel methods that allow for differentially private learning as well as for assessing the goodness of fit of a model. We also address the problem of measuring the relative goodness of fit of two models, with run-time complexity linear in the sample size [ ]. Inspired by the selective inference framework, we propose an approach for kernel-based hypothesis testing that enables learning the hyperparameters and testing on the full sample without data splitting [ ].
Privacy-preserving machine learning algorithms aim to provide database release mechanisms that allow third parties to construct consistent estimators of population statistics while ensuring that the privacy of each individual contributing to the database is protected. In [ ], we develop privacy-preserving algorithms based on kernel mean embeddings, allowing us to release a database while guaranteeing the privacy of the database records.
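The following is a simplified illustration of the general idea (not the algorithm of [ ]): approximate the RBF kernel with random Fourier features and perturb the empirical feature mean with the standard Gaussian mechanism, so that the released mean embedding satisfies (eps, delta)-differential privacy under the usual calibration. All dimensions and parameters are synthetic.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, d, D = 1000, 5, 200                        # records, data dim, feature dim
data = rng.standard_normal((n, d))            # the (synthetic) private database

sigma_k = 1.0                                 # RBF kernel bandwidth
W = rng.standard_normal((d, D)) / sigma_k     # random Fourier frequencies
b = rng.uniform(0.0, 2.0 * np.pi, D)
phi = np.sqrt(2.0 / D) * np.cos(data @ W + b) # features with ||phi(x)||_2 <= sqrt(2)

mean_embedding = phi.mean(axis=0)             # empirical kernel mean embedding

eps, delta = 1.0, 1e-5
sensitivity = 2.0 * np.sqrt(2.0) / n          # L2 change when one record is replaced
noise_std = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps  # Gaussian mechanism
released = mean_embedding + rng.normal(0.0, noise_std, D)            # private release
\end{verbatim}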
The expressiveness of the KME makes it suitable as a tool to evaluate the non-trivial effects of a policy intervention or a counterfactual change in certain background conditions. We introduce a Hilbert space representation of counterfactual distributions, called a counterfactual mean embedding (CME), that can capture higher-order treatment effects beyond the mean [ ]. We extend this idea to conditional treatment effects and propose the conditional distributional treatment effect (CoDiTE), which captures both the distributional aspects and the heterogeneity of a treatment's effect [ ]. Kernel methods are also applicable in econometrics, with estimation and inference methods based on a continuum of moment restrictions, as we demonstrated in [ ], with applications in instrumental variable (IV) regression [ ] and proximal causal learning [ ].
Optimization lies at the heart of most machine learning algorithms, and we have studied the convergence properties of coordinate descent as well as Frank-Wolfe optimization algorithms [ ]. A connection between matching pursuit and Frank-Wolfe optimization is explored in [ ]. Going beyond first-order methods, in [ ], we prove the exact equivalence (strong duality) between the distributionally robust optimization (DRO) formulation of learning objectives and dual programs that can be elegantly solved by kernel-based learning algorithms. Relevant to our previous work in statistical learning theory, the resulting kernel DRO algorithm can certify robustness against a large set of distribution shifts characterized using the MMD.
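To make the Frank-Wolfe (conditional gradient) template concrete, the toy sketch below minimizes a quadratic over the probability simplex, where the linear minimization oracle simply selects the best vertex. The problem data and step-size rule are generic and not taken from the cited papers.
\begin{verbatim}
import numpy as np

def frank_wolfe_simplex(A, b, n_iters=200):
    """Minimize 0.5 * ||A x - b||^2 over the probability simplex."""
    n = A.shape[1]
    x = np.ones(n) / n                       # start at the barycenter of the simplex
    for t in range(n_iters):
        grad = A.T @ (A @ x - b)             # gradient of the quadratic objective
        s = np.zeros(n)
        s[np.argmin(grad)] = 1.0             # linear minimization oracle: best vertex
        gamma = 2.0 / (t + 2.0)              # standard diminishing step size
        x = x + gamma * (s - x)              # convex update stays inside the simplex
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = A @ np.full(10, 0.1)                     # target reachable from within the simplex
x_hat = frank_wolfe_simplex(A, b)
\end{verbatim}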
In the emerging field of quantum machine learning, kernel methods play a central role. Our contributions to this field include the analysis of quantum mean embeddings of probability distributions [ ], and a sober look at where quantum kernels might be beneficial [ ].
Causal Inference
Statistical dependences underlie machine learning, and their detection in large-scale IID settings has led to impressive results in a range of domains. However, in many situations, we would prefer a causal model to a purely predictive one; e.g., a model that might tell us that a specific variable -- say, whether or not a person smokes -- is not just statistically associated with a disease, but causal for it.
Pearl's graphical approach to causal modeling generalizes Reichenbach's common cause principle and specifies which observable statistical independences a given causal structure entails.
This "graphical models" approach to causal inference has certain weaknesses that we try to address in our work: it only can infer causal graphs up to Markov equivalence, it does not address the hardness of conditional independence testing, it does not worry about the complexity of the underlying functional regularities that generate statistical dependencies in the first place, and it ignores the problem of how to infer causal variables from raw data. Our work in this field is characterized by pursuing four aspects:
1. We often work in terms of structural causal models (SCMs), i.e., we do not take statistical dependences as primary, but rather study mechanistic models which give rise to such dependences.
SCMs not only allow us to model observational distributions; one can also use them to model what happens under interventions (e.g., gene knockouts or randomized studies).
Under suitable model assumptions such as additive independent noise, SCMs admit novel techniques of noise removal via so-called half-sibling regression [ ] (a small illustrative sketch is given after this enumeration), reveal the arrow of time [ ], and permit the design of methods for causal recourse [ ].
2. The crucial assumption of the graphical approach to causality is the statistical independence of all SCM noise terms. Intuitively, as the noises propagate through a graph, they pick up dependences due to the graph structure; hence the assumption of initial independence of the noise terms allows us to tease out properties of that structure. We believe, however, that much can be gained by considering a more general independent mechanism assumption related to notions of invariance and autonomy of causal mechanisms. Here, the idea is that causal mechanisms are autonomous entities of the world that (in the generic case) do not depend on or influence each other, and changing (or intervening on) one of them often leaves the remaining ones invariant [ ].
3. This leads to the third characteristic aspect of our work on causality. Wherever possible, we attempt to establish connections to machine learning, and indeed we believe that some of the hardest problems of machine learning (such as those of domain adaptation and transfer learning) are best addressed using causal modeling. To this end, one may assume, for instance, that structural equations remain constant across data sets and only the noise distributions change [ ], that some of the causal conditionals in a causal Bayesian network change, while others remain constant [ ], or that they change independently [ ], which results in new approaches to domain adaptation [ ].
In the context of pattern recognition, learning causal models comprising independent mechanisms helps in transferring information across substantially different data sets [ ]. Theoretical work shows that the independence of causal mechanisms can be formalized via group symmetry [ ].
Our lab has firmly placed causal inference on the agenda of the machine learning community, contributing a recent award-winning textbook [ ] and working out various implications for machine learning, e.g., in deep learning [ ], reinforcement learning [ ], and semi-supervised learning [ ], as well as for societal aspects of AI, including fairness [ ] and interpretability/accountability/recourse [ ].
Causality is also beginning to contribute to a range of applications, e.g., in neuroscience [ ], psychophysics [ ], epidemiology [ ], climate science [ ], or natural language processing [ ]. A particular highlight was astronomy, where our half-sibling regression confounder removal technique [ ] led to the discovery of 21 new exoplanets. For one of them, K2-18b, astronomers in 2019 found traces of water in the atmosphere --- the first such discovery for an exoplanet in the habitable zone, i.e., allowing for liquid water (see \url{http://people.tuebingen.mpg.de/bs/K2-18b.html}).
4. Traditional causal discovery assumes that the symbols or observables (modeled as random variables) are given a priori (much like in classical AI), and connected by a causal directed acyclic graph. In contrast, real-world observations (e.g., objects in images) are not necessarily structured into those symbols to begin with. The task of identifying latent variables that admit causal models is challenging, but it aligns with the general goal of modern machine learning to learn meaningful representations for data, where 'meaningful' can, for instance, mean interpretable, robust, or transferable [ ]. We have argued for the development of causal representation learning and are pursuing a number of approaches to help build this field. We have formulated key assumptions, including (1) independent mechanisms and (2) sparse mechanism shift in a position paper in the Proceedings of the IEEE [ ], and are assaying ideas to exploit these assumptions for the purpose of learning causal representations. This includes a causal analysis of self-supervised learning methods for automatically separating style and content [ ], a novel approach for nonlinear ICA using independent mechanisms [ ], as well as a range of other approaches [ ].
We have proposed a notion of causal disentanglement [ ], generalizing statistical notions [ ]. The latter inherit certain shortcomings from the non-identifiability of nonlinear ICA, as pointed out in our paper winning the best paper prize at ICML 2019 [ ], cf. also [ ].
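To illustrate the half-sibling regression idea mentioned under aspect 1 above (and applied to the exoplanet searches), the minimal sketch below regresses a target signal on "half-sibling" signals that share the systematic noise but not the signal of interest, and subtracts the prediction. The synthetic data, the ridge regressor, and all parameters are illustrative and not those of the astronomy pipeline.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n_times, n_siblings = 500, 20
systematics = 0.1 * np.cumsum(rng.standard_normal(n_times))       # shared nuisance
siblings = (systematics[:, None] * rng.uniform(0.5, 1.5, n_siblings)
            + 0.05 * rng.standard_normal((n_times, n_siblings)))
signal = 0.3 * np.sin(np.linspace(0.0, 20.0, n_times))            # signal of interest
target = signal + systematics + 0.05 * rng.standard_normal(n_times)

# Ridge regression of the target on the half-siblings approximates E[Y | X].
lam = 0.1
w = np.linalg.solve(siblings.T @ siblings + lam * np.eye(n_siblings),
                    siblings.T @ target)
denoised = target - siblings @ w      # estimate of the signal (up to an offset)
\end{verbatim}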
Causality touches statistics, econometrics, and philosophy, and it constitutes one of the most exciting fields for conceptual basic research in machine learning today. Going forward, we expect that it will be instrumental in taking representation learning to the next level, moving beyond the mere representation of statistical dependences towards models that support intervention and planning, and thus Konrad Lorenz' notion of thinking as acting in an imagined space.
Probabilistic Inference
The probabilistic formulation of inference is one of the main research streams within machine learning. One of our main themes in this field has been nonparametric inference on function spaces using Gaussian process models [ ]. Our methods allow finding the best kernel bandwidth hyperparameter efficiently and are especially well-suited for online learning. Bayesian approaches are also increasingly studied in deep learning [ ].
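For reference, the compact sketch below shows vanilla Gaussian process regression with an RBF kernel on synthetic one-dimensional data. In practice the lengthscale (bandwidth) and noise level would be chosen, e.g., by maximizing the marginal likelihood; that step is omitted here, and the sketch is not the specific method of [ ].
\begin{verbatim}
import numpy as np

def rbf(a, b, lengthscale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def gp_posterior(x_train, y_train, x_test, lengthscale=1.0, noise=0.1):
    """Posterior mean and covariance of a zero-mean GP with an RBF kernel."""
    K = rbf(x_train, x_train, lengthscale) + noise ** 2 * np.eye(len(x_train))
    K_star = rbf(x_test, x_train, lengthscale)
    mean = K_star @ np.linalg.solve(K, y_train)
    cov = rbf(x_test, x_test, lengthscale) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, 20))
y = np.sin(x) + 0.1 * rng.standard_normal(20)
x_test = np.linspace(-3, 3, 100)
mean, cov = gp_posterior(x, y, x_test)
\end{verbatim}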
A crucial bottleneck in Bayesian models is the marginalization of latent variables. This can be computationally demanding, so approximate inference routines that reduce the computational complexity are crucial. In [ ], we study the convergence properties of one such approximate inference scheme, establishing connections to the classic Frank-Wolfe algorithm. The analyses yield novel theoretical insights regarding the sufficient conditions for convergence, explicit rates, and algorithmic simplifications. While probabilistic inference can indeed be computationally demanding, perhaps surprisingly, probabilistic considerations in numerical algorithms can also help speed up computations -- including those of high-level probabilistic inference itself [ ].
A theoretical study pertaining to probabilistic programming considers the problem of representing the distribution of f(X) for a random variable X [ ]. We use kernel mean embedding methods to construct consistent estimators of the mean embedding of f(X). The method is applicable to arbitrary data types on which suitable kernels can be defined. It thus allows us to generalize (to the probabilistic case) functions in computer programming, which are originally only defined on deterministic data types.
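A minimal sketch of the basic construction (with illustrative choices of kernel, function f, and input distribution): push samples of X through f and average the corresponding kernel features to obtain an empirical estimate of the mean embedding of f(X).
\begin{verbatim}
import numpy as np

def gauss(a, b, sigma=0.5):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
f = np.exp                              # the function applied to the random input
x = rng.standard_normal(1000)           # samples of X ~ N(0, 1)
fx = f(x)                               # samples of f(X)

def mean_embedding_fx(query_points):
    """Evaluate the empirical kernel mean embedding of f(X) at the query points."""
    return gauss(query_points, fx).mean(axis=1)

mu_hat = mean_embedding_fx(np.linspace(0.0, 5.0, 50))
\end{verbatim}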
We have also studied techniques for representation learning based on differential geometry. We developed fundamental methods relying mainly on stochastic generative models to learn and utilize the underlying geometric structure of the data manifold [ ].
Moreover, we have developed probabilistic methods to solve inverse problems in science. In one such study [ ], we accurately infer black-hole parameters (such as masses, spins, and sky position) from gravitational-wave measurements more than 1000 times faster than standard methods, enabling real-time analyses, which are crucial for timely electromagnetic follow-up observations.
Computational Imaging
Our work in computational imaging has gradually shifted focus from pure algorithmic approaches towards combinations with physical setups. In both cases, we aim to computationally enable or enhance imaging by data-driven computation.
To recover a high-resolution image from a single low-resolution input, we proposed a novel method for automated texture synthesis, combined with a perceptual loss that focuses on creating realistic textures rather than optimizing for a pixel-accurate reproduction of the ground truth images during training [ ]. By using feed-forward fully convolutional neural networks in an adversarial training setting, we achieve a significant boost in image quality even at high magnification ratios. We also developed an approach to estimate the modulation transfer functions of optical lenses [ ].
Together with the Department for High-Field Magnetic Resonance at the Max Planck Institute for Biological Cybernetics, we enhanced the quality of MR images [ ]. Additionally, we have worked on automatic MRI sequence generation [ ], hardware optimization [ ], and acceleration of acquisition processes [ ].
We have also started to collaborate in the fields of acoustic holography and optical computation with Peer Fischer's lab at our Stuttgart site. By solving suitable inverse problems using optimization methods, we design devices to generate custom acoustic wave intensity patterns applicable in medicine and one-shot manufacturing. In optical computation, we carry out certain computations using light, with the ultimate goal of designing faster and more energy-efficient machine learning systems. The challenge here lies in the implementation of suitable nonlinearities and in the design of architectures that lend themselves well to what can be implemented using optical devices.
Robot Learning
Robots need to possess a variety of physical abilities and skills to function in complex environments. Programming such skills is a labor- and time-intensive task that often leads to brittle solutions and requires considerable expert knowledge. We study learning-based approaches instead and use robot table tennis as a testbed. Tracking dynamic movements with inaccurate models is essential for such a task. To achieve accurate tracking, we have proposed a series of adaptive Iterative Learning Control (ILC) algorithms [ ]. In real robot experiments on a Barrett WAM, we have studied trajectory generation for table tennis strikes [ ] and how to learn such primitives from demonstrations [ ].
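As a toy illustration of the ILC principle (not one of the adaptive algorithms of [ ]), the sketch below runs P-type ILC on a scalar linear system: after each trial, the feedforward input is corrected with the previous trial's tracking error, so the same trajectory is tracked more accurately from trial to trial despite an imperfect plant model.
\begin{verbatim}
import numpy as np

a, b = 0.9, 0.5                          # simple (imperfectly known) scalar plant
T = 50
reference = np.sin(np.linspace(0.0, 2.0 * np.pi, T))
u = np.zeros(T)                          # feedforward input, refined across trials
gain = 0.8                               # ILC learning gain

for trial in range(30):
    x = np.zeros(T + 1)
    for t in range(T):                   # execute one trial with the current input
        x[t + 1] = a * x[t] + b * u[t]
    error = reference - x[1:]            # tracking error of this trial
    u = u + gain * error                 # P-type update: u_{k+1} = u_k + L * e_k
# After the trials, the tracking error is (nearly) eliminated for this toy plant.
\end{verbatim}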
Moreover, we have built a musculoskeletal robot system, which presents interesting challenges that make it well suited for learning. Muscular systems are hard to control using classical methods but allow safe exploration of highly accelerated motions directly on the real robot, making them suitable candidates for learning human-like performance on complex tasks [ ]. We have leveraged this to learn precision-demanding table tennis with imprecise muscular robots [ ].
Dexterous object manipulation remains an open problem in robotics due to the large variety of possible changes and outcomes in this task. We address this problem by designing benchmarks and datasets as well as a low-cost robotic platform, called TriFinger [ ]. The open-source TriFinger platform enables researchers around the world to either easily rebuild real robot benchmarks in their own labs or to remotely submit code that is executed automatically on eight platforms hosted at MPI-IS [ ] (allowing robotics research to continue during pandemic lockdown periods). We organized a robotics challenge (https://real-robot-challenge.com/) and conducted research using these platforms, especially on transfer learning [ ].
Machine Learning in Neuroscience
The neurosciences present some of the steepest challenges to machine learning. Almost always, there is a high-dimensional input structure -- particularly relative to the number of examples, since data points are often gathered at high cost. Regularities are often subtle, with the rest of the signal made up of noise that may be of much larger magnitude, often consisting largely of the manifestations of neurophysiological processes other than the ones of interest. In finding generalizable solutions, one usually has to contend with a high degree of variability, both between individuals and across time, leading to problems of covariate shift and non-stationarity.
We have designed machine learning techniques to assist with the interpretation of experimental brain data. Supervised and unsupervised learning tools were used to automatically identify activity patterns among large populations of recorded neurons [ ]. Causal relationships between neural processes are also of particular interest to neuroscientists, and the tools we develop for observational data have been used to uncover the interactions between brain regions during sleep [ ]. Machine learning also provided theoretical insight into biological learning by identifying key neural circuits underlying the reliable replay of memorized events [ ].
A neuroscientific application area in which we have a long-standing interest is that of brain-computer interfacing (BCI). BCIs hold promise for restoring communication in locked-in patients in late stages of amyotrophic lateral sclerosis (ALS); however, the transfer of findings into practice has been limited due to unclear cognitive abilities in late-stage ALS, inaccessible BCI control strategies, and a focus on laboratory experiments. We characterized how ALS affects neural and cognitive abilities in studies on self-referential processing [ ] and changes in brain rhythms [ ].
Based on these insights, we developed more accessible BCI control strategies for late-stage ALS patients [ ]. To translate this system from the laboratory into home use, we have pioneered a transfer learning approach for BCIs [ ] and developed a novel brain-decoding feature for low-channel setups [ ]. To evaluate the unsupervised use of this system across multiple days, we developed an open-source framework that couples an easy-to-use application with consumer-grade recording hardware [ ]. These advances enable us to build high-performance, cognitive BCIs with off-the-shelf hardware for large-scale applications.
Selected Projects
- Accountability and Recourse
- Astronomy
- Brain-Computer Interfaces
- Causal Inference
- Causal Representation Learning
- Computational Imaging
- Differential Geometry for Representation Learning
- Fairness
- Kernel Methods
- Learning for Control
- Learning on Social Networks
- Medical Applications
- Neural Nets
- Neuroscience
- Privacy
- Probabilistic Algorithms
- Probabilistic Inference
- Psychology
- Real-World Robot Benchmark
- Robotics
- Statistical Learning Theory
- Stochastic and Robust Optimization
- Treatment Effects