Much modern data is collected from devices enabling human-machine interactions. For instance, robot vacuum cleaners may have a built-in camera to navigate where to clean, but the camera can also capture scenes of people in the room. Similarly, user-facing artificial intelligence agents such as virtual assistants collect, store, and transmit potentially privacy-sensitive data. These data might be used to fit machine learning models to solve certain statistical tasks. Many recent papers indicate that fitted machine learning models can expose sensitive information from the dataset they were trained on.
To develop privacy-preserving machine learning methods, the first question we need to answer is how to define and measure privacy. Among the various privacy notions developed in the field, differential privacy is currently the gold standard, owing to its rigorous guarantees. Intuitively, the definition states that the probability of any outcome of an algorithm changes only slightly when a single individual's data is modified, thereby limiting the amount of information the algorithm reveals about any one individual.
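This intuition is commonly formalized as $(\varepsilon, \delta)$-differential privacy. In the notation below (the symbols are our choice for illustration), a randomized algorithm $\mathcal{M}$ satisfies it if, for all neighbouring datasets $D$ and $D'$ differing in a single individual's record and all measurable sets $S$ of outputs,
\[
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta,
\]
where smaller $\varepsilon$ and $\delta$ correspond to a stronger guarantee.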
Equipped with such a notion of privacy, developing differentially private data analysis methods first requires deciding which quantity we want to guard. Sharing data, statistics, or model parameters each calls for different techniques. Below, we look into these different modes of sharing.
Model sharing is probably the most popular way to achieve privacy in the current differential privacy literature. It typically guards the model parameters by adding noise to them before release. Generally, there are two ways of obtaining private models. First, one can add noise to the objective function ("objective perturbation") such that the resulting estimates guarantee some level of privacy. Second, one can add noise to the output of an optimization routine ("output perturbation"). Objective perturbation often requires less noise than output perturbation. However, analyzing the relationship between the amount of noise added to the objective function and the privacy level guaranteed for the estimate obtained by optimizing the perturbed objective unavoidably requires strong assumptions (e.g., strong convexity). Many learning problems in machine learning have non-convex objective functions, which limits the usefulness of existing objective perturbation techniques. We addressed the problem of non-convex objective functions in the context of deep learning [ ], and published two articles on privatizing approximate posterior distributions via variational inference [ ].
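To illustrate how noise can enter training when the objective is non-convex, the following sketch shows a gradient perturbation update in the spirit of DP-SGD, which clips each per-example gradient and adds Gaussian noise at every step. It is a minimal illustration rather than our published method, and the hyperparameter names (`clip_norm`, `noise_multiplier`, `lr`) are illustrative.

```python
# Minimal sketch of one gradient perturbation (DP-SGD-style) update.
# Per-example gradients are clipped so that one individual's contribution
# is bounded, then Gaussian noise proportional to the clipping norm is added.
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm=1.0,
                noise_multiplier=1.0, lr=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    mean_grad = np.mean(clipped, axis=0)
    # Gaussian noise on the averaged gradient; its scale, together with the
    # number of steps, determines the overall (epsilon, delta) via a privacy
    # accountant (not shown here).
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped),
                       size=mean_grad.shape)
    return params - lr * (mean_grad + noise)
```

In this regime the privacy guarantee follows from per-example clipping and the Gaussian mechanism rather than from convexity of the objective, which is what makes gradient perturbation attractive for deep learning.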
Data sharing requires adding noise to the dataset itself before releasing it. Most existing data sharing frameworks are designed only for particular data types (e.g., count data, low-dimensional data) or particular purposes (e.g., decision tree algorithms). We therefore need algorithms that add noise in a way that is faithful to the statistical properties of the raw data while remaining independent of downstream tasks, for better utility across various statistical analyses. We published a differentially private kernel method for this task, relying on the kernel mean embedding of datasets in reproducing kernel Hilbert spaces, which (for suitable kernels) retains all information about a distribution [ ]. In further work, we developed a privacy-preserving data generation framework using random Fourier features [ ] and, recently, a framework for medical image data sharing by generating synthetic images [ ].
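As a rough illustration of this idea (not the published algorithm), the sketch below approximates the kernel mean embedding of a dataset with random Fourier features and releases it with Gaussian noise; the feature dimension, bandwidth, and noise scale `sigma` are placeholder parameters that would be calibrated to the desired privacy level.

```python
# Minimal sketch: privatize a random Fourier feature approximation of the
# kernel mean embedding (Gaussian kernel). Each feature vector has norm at
# most sqrt(2), so replacing one record changes the mean embedding by at
# most 2*sqrt(2)/n in L2 norm, which bounds the sensitivity used below.
import numpy as np

def private_mean_embedding(X, num_features=500, bandwidth=1.0, sigma=1.0,
                           seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(0.0, 1.0 / bandwidth, size=(d, num_features))
    b = rng.uniform(0.0, 2 * np.pi, size=num_features)
    phi = np.sqrt(2.0 / num_features) * np.cos(X @ W + b)  # (n, num_features)
    mean_emb = phi.mean(axis=0)
    sensitivity = 2.0 * np.sqrt(2.0) / n
    # Gaussian mechanism: sigma is the noise multiplier set by (eps, delta).
    return mean_emb + rng.normal(0.0, sigma * sensitivity, size=num_features)
```

One way to use such a released embedding is to fit a generator whose samples' (non-private) embedding matches it; by the post-processing property of differential privacy, this incurs no additional privacy cost.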
Statistic sharing requires privatizing statistics computed on privacy-sensitive data. We established a statistic sharing framework in the context of the kernel two-sample test [ ].
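As an illustration of statistic sharing (again a sketch rather than our published test), the snippet below computes a biased estimate of the squared maximum mean discrepancy (MMD) with an RBF kernel and releases it through the Laplace mechanism; the sensitivity bound is a crude one that assumes the kernel is bounded by 1, and `epsilon` denotes the privacy budget.

```python
# Minimal sketch: release an MMD^2 two-sample statistic with Laplace noise.
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def private_mmd2(X, Y, bandwidth=1.0, epsilon=1.0, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    mmd2 = (rbf_kernel(X, X, bandwidth).mean()
            + rbf_kernel(Y, Y, bandwidth).mean()
            - 2.0 * rbf_kernel(X, Y, bandwidth).mean())
    # Crude sensitivity bound: with a kernel bounded by 1, replacing one
    # point of X changes the biased estimate by at most roughly 4/m.
    sensitivity = 4.0 / m
    return mmd2 + rng.laplace(0.0, sensitivity / epsilon)
```

The noisy statistic can then be compared against a rejection threshold that accounts for both the sampling variability and the added noise.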
We also worked on a project at the intersection of differential privacy and classifier interpretability [ ], where we confirmed our hypothesis that model interpretability compromises privacy.