Privacy Preserving Machine Learning: Maintaining confidentiality and preserving trust



Figure: Privacy Preserving Machine Learning: a holistic approach to protecting privacy. "Trust" sits at the center of the diagram, surrounded by five elements: privacy & confidentiality, transparency, empowering innovation, security, and current & upcoming policies and regulations.

Machine learning (ML) offers tremendous opportunities to increase productivity. However, ML systems are only as good as the data that informs their training, and training ML models requires more data than any single individual or organization can contribute. By sharing data to collaboratively train ML models, we can unlock value and develop powerful language models applicable to a wide variety of scenarios, such as text prediction and email reply suggestions. At the same time, we recognize the need to preserve the confidentiality and privacy of individuals and to earn and maintain the trust of the people who use our products. Protecting the confidentiality of our customers’ data is core to our mission, which is why we’re excited to share the work we’re doing as part of the Privacy Preserving Machine Learning (PPML) initiative.

The PPML initiative was started as a partnership between Microsoft Research and Microsoft product teams with the objective of protecting the confidentiality and privacy of customer data when training large-capacity language models. Its goal is to improve existing techniques, and develop new ones, for protecting sensitive information in ways that work for both individuals and enterprises. This helps ensure that our use of data protects people’s privacy and that data is handled safely, avoiding leakage of confidential and private information.

This blog post discusses emerging research on combining techniques to ensure privacy and confidentiality when using sensitive data to train ML models. We illustrate how employing PPML can help our ML pipelines meet stringent privacy requirements and give our researchers and engineers the tools they need to meet them. We also discuss how applying best practices in PPML enables us to be transparent about how customer data is used.

A holistic approach to PPML


Recent research has shown that deploying ML models can, in some cases, put privacy at risk in unexpected ways. For example, pretrained public language models that are fine-tuned on private data can be misused to recover private information, and very large language models have been shown to memorize training examples, potentially encoding personally identifiable information (PII). Finally, merely inferring that a specific user was part of the training data can also impact privacy. Therefore, we believe it’s critical to apply multiple techniques to achieve privacy and confidentiality; no single method can address all aspects alone. This is why we take a three-pronged approach to PPML: understanding the risks and requirements around privacy and confidentiality, measuring those risks, and mitigating the potential for breaches of privacy. We explain the details of this multi-faceted approach below.
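The membership inference risk mentioned above can be made concrete with a toy experiment. The sketch below is illustrative only: the loss values are simulated (not measured from any real model), but they reflect the pattern such attacks exploit, namely that models often assign lower loss to examples they were trained on than to unseen examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated losses: "members" are examples the model was trained on,
# "non-members" are unseen examples. These distributions are invented
# for illustration, not measurements of a real model.
member_losses = rng.normal(loc=0.5, scale=0.2, size=1000)
nonmember_losses = rng.normal(loc=1.5, scale=0.4, size=1000)

def infer_membership(loss, threshold=1.0):
    """Guess 'member' whenever the loss falls below the threshold."""
    return loss < threshold

# Balanced attack accuracy: average of true-positive and true-negative rates.
tpr = infer_membership(member_losses).mean()
tnr = (~infer_membership(nonmember_losses)).mean()
attack_accuracy = (tpr + tnr) / 2
print(f"membership inference accuracy: {attack_accuracy:.2f}")
```

An accuracy well above 0.5 means an observer can tell, better than chance, whether a given record was in the training set, which is exactly the kind of leakage a PPML pipeline must measure and mitigate.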

Understand: We work to understand the risk of customer data leakage and potential privacy attacks in a way that helps determine confidentiality properties of ML pipelines. In addition, we believe it’s critical to proactively align with policy makers. We take into account local and international laws and guidance regulating data privacy, such as the General Data Protection Regulation (GDPR) and the EU’s policy on trustworthy AI. We then map these legal principles, our contractual obligations, and responsible AI principles to our technical requirements and develop tools to communicate with policy makers how we meet these requirements.

Measure: Once we understand the risks to privacy and the requirements we must adhere to, we define metrics that can quantify the identified risks and track success towards mitigating them.

Mitigate: We then develop and apply mitigation strategies, such as differential privacy (DP), described in more detail later in this blog post. After we apply mitigation strategies, we measure their success and use our findings to refine our PPML approach.
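To give a sense of the kind of guarantee differential privacy provides, the sketch below shows the classic Laplace mechanism applied to a counting query over hypothetical sensitive records. This is a generic textbook illustration, not Microsoft’s production mechanism, and the data is invented.

```python
import numpy as np

def dp_count(values, predicate, epsilon, rng):
    """Release a count under epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so adding Laplace noise with
    scale 1/epsilon yields epsilon-DP.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(42)
ages = [23, 35, 41, 29, 62, 51, 38, 44]  # hypothetical sensitive records

# The true answer is 4, but each release is randomly perturbed, so the
# presence or absence of any single individual cannot be confidently
# inferred from the output.
noisy = dp_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
print(f"noisy count of people aged 40+: {noisy:.1f}")
```

In practice, training large models with DP uses noisy gradient updates (DP-SGD) rather than noisy counts, but the principle is the same: calibrated randomness bounds how much any one person’s data can influence what the system reveals.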