Research at Microsoft in 2022: A look back at accelerating progress in AI

2022 has seen remarkable progress in foundational technologies that have helped to advance human knowledge and create new possibilities to address some of society’s most challenging problems. Significant advances in AI have also enabled Microsoft to bring new capabilities to customers through our products and services, including GitHub Copilot, an AI pair programmer capable of turning natural language prompts into code, and a preview of Microsoft Designer, a graphic design app that supports the creation of social media posts, invitations, posters, and one-of-a-kind images.

These offerings provide an early glimpse of how new AI capabilities, such as large language models, can enable people to interact with machines in increasingly powerful ways. They build on a significant, long-term commitment to fundamental research in computing and across the sciences, and the research community at Microsoft plays an integral role in advancing the state of the art in AI, while working closely with engineering teams and other partners to transform that progress into tangible benefits.

In 2022, Microsoft Research established AI4Science, a global organization applying the latest advances in AI and machine learning toward fundamentally transforming science; added to and expanded the capabilities of the company’s family of foundation models; worked to make these models and technologies more adaptable, collaborative, and efficient; further developed approaches to ensure that AI is used responsibly and in alignment with human needs; and pursued different approaches to AI, such as causal machine learning and reinforcement learning.

We shared our advances across AI and many other disciplines during our second annual Microsoft Research Summit, where members of our research community gathered virtually with their counterparts across industry and academia to discuss how emerging technologies are being explored and deployed to bring the greatest possible benefits to humanity.  

Plenary sessions at the event focused on the transformational impact of deep learning on the way we practice science, research that empowers medical practitioners and reduces inequities in healthcare, and emerging foundations for planet-scale computing. Further tracks and sessions over three days provided deeper dives into the future of the cloud; efficient large-scale AI; amplifying human productivity and creativity; delivering precision healthcare; building user trust through privacy, identity, and responsible AI; and enabling a resilient and sustainable world.

  • Blog Microsoft Climate Research Initiative (MCRI) 

    In June, the Microsoft Climate Research Initiative (MCRI) announced its first phase of collaborations among multidisciplinary researchers working together to accelerate cutting-edge research and transformative innovation in climate science and technology.

  • Publication New Future of Work Report 2022 

    In May, researchers across Microsoft published the New Future of Work Report 2022, which summarizes important recent research developments related to hybrid work. It highlights themes that have emerged in the findings of the past year and resurfaces older research that has become newly relevant.

In this blog post, we look back at some of the key achievements and notable work in AI and highlight other advances across our diverse, multidisciplinary, and global organization.

Advancing AI foundations and accelerating progress

Over the past year, the research community at Microsoft made significant contributions to the rapidly evolving landscape of powerful large-scale AI models. Microsoft Research and the Microsoft Turing team unveiled a new Turing Universal Language Representation model capable of performing both English and multilingual understanding tasks. In computer vision, advancements for the Project Florence-VL (Florence-Vision and Language) team spanned still imagery and video: its GIT model was the first to surpass human performance on the image captioning benchmark TextCaps; LAVENDER showed strong performance in video question answering, text-to-video retrieval, and video captioning; and GLIP and GLIPv2 combined localization and vision-language understanding. The group also introduced NUWA-Infinity, a model capable of converting text, images, and video into high-resolution images or long-duration video. Meanwhile, the Visual Computing Group scaled up its Transformer-based general-purpose computer vision architecture, Swin Transformer, achieving applicability across more vision tasks than ever before.

Researchers from Microsoft Research Asia and the Microsoft Turing team also introduced BEiT-3, a general-purpose multimodal foundation model that achieves state-of-the-art transfer performance on both vision and vision-language tasks. In BEiT-3, researchers introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, BEiT-3 performs masked “language” modeling on images (Imglish), texts (English), and image-text pairs (“parallel sentences”) in a unified manner. The code and pretrained models will be available on GitHub.

One of the most crucial accelerators of progress in AI is the ability to optimize training and inference for large-scale models. In 2022, the DeepSpeed team made a number of breakthroughs to improve mixture of experts (MoE) models, making them more efficient, faster, and less costly. Specifically, they were able to reduce training cost by 5x, reduce MoE parameter size by up to 3.7x, and reduce MoE inference latency by 7.3x while offering up to 4.5x faster and 9x cheaper inference for MoE models compared to quality-equivalent dense models.
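To make the efficiency argument concrete, here is a minimal sketch of top-1 expert routing, the core idea behind MoE models: each token is processed by only one expert, so per-token compute stays roughly constant as total parameters grow. This is an illustrative toy, not DeepSpeed's implementation, and all names are our own.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Toy top-1 mixture-of-experts layer (illustrative only; DeepSpeed's
    production MoE kernels are far more sophisticated)."""
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model). Route each token to its top-1 expert,
        # so compute per token is independent of num_experts.
        scores = self.gate(x).softmax(dim=-1)
        weight, expert_idx = scores.max(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

layer = TinyMoELayer(d_model=64, num_experts=8)
print(layer(torch.randn(32, 64)).shape)  # torch.Size([32, 64])
```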

Transforming scientific discovery and adding societal value

Our ability to comprehend and reason about the natural world has advanced over time, and the new AI4Science organization, announced in July, represents another turn in the evolution of scientific discovery. Machine learning is already being used in the natural sciences to model physical systems using observational data. AI4Science aims to dramatically accelerate our ability to model and predict natural phenomena by creating deep learning emulators that learn by using computational solutions to fundamental equations as training data.

This new paradigm can help scientists gain greater insight into natural phenomena, right down to their smallest components. Such molecular understanding and powerful computational tools can help accelerate the discovery of new materials to combat climate change, and new drugs to help support the prevention and treatment of disease.  

For instance, AI4Science’s Project Carbonix is working on globally accessible, at-scale solutions for decarbonizing the world economy, including reverse engineering materials that can pull carbon out of the environment and recycling carbon into materials. Collaborating on these efforts through the Microsoft Climate Research Initiative (MCRI) are domain experts from academia, industry, and government. Announced in June, MCRI is focused on areas such as carbon accounting, climate risk assessments, and decarbonization.

As part of the Generative Chemistry project, Microsoft researchers have been working with the global medicines company Novartis to develop and execute machine learning tools and human-in-the-loop approaches to enhance the entire drug discovery process. In April, they introduced MoLeR, a graph-based generative model for designing compounds that is more reflective of how chemists think about the process and is more efficient and practical than an earlier generative model the team developed. 

While AI4Science is focused on computational simulation, we have seen with projects like InnerEye that AI can have societal value in many other ways. In March, Microsoft acquired Nuance Communications Inc., further cementing the companies’ shared commitment to outcome-based AI across industries, particularly in healthcare. Tools like the integration of Microsoft Teams and Dragon Ambient eXperience (Nuance DAX) to help ease the administrative burden of physicians and support meaningful doctor-patient interactions are already making a difference.

Making AI more adaptable, collaborative, and efficient 

To help accelerate the capabilities of large-scale AI while building a landscape in which everyone can benefit from it, the research community at Microsoft aimed to drive progress in three areas: adaptability, collaboration, and efficiency.

To provide consistent value, AI systems must respond to changes in task and environment. Research in this area includes multi-task learning with task-aware routing of inputs, knowledge-infused decoding, model repurposing with data-centric ML, pruning, and cognitive science- and brain-inspired AI. A good example of our work toward adaptability is GODEL, or Grounded Open Dialogue Language Model, which ushers in a new class of pretrained language models that enable chatbots to help with tasks and then engage in more general conversations.

Microsoft’s research into more collaborative AI includes AdaTest, which leverages human expertise alongside the generative power of large language models to help people more efficiently find and correct bugs in natural language processing models. Researchers have also explored expanding the use of AI in creative processes, including a project in which science fiction writer Gabrielle Loisel used OpenAI’s GPT-3 to co-author a novella and other stories.

To enable more people to make use of AI in an efficient and sustainable way, Microsoft researchers are pursuing several new architectures and training paradigms. This includes new modular architectures and novel techniques, such as DeepSpeed Compression, a composable library for extreme compression and zero-cost quantization, and Z-Code Mixture of Experts models, which boost translation efficiency and were deployed in Microsoft Translator in 2022.  

In December, researchers unveiled AutoDistil, a new technique that leverages knowledge distillation and neural architecture search to improve the balance between cost and performance when generating compressed models. They also introduced AdaMix, which improves the fine-tuning of large pretrained models for downstream tasks using mixture-of-adaptations modules for parameter-efficient model tuning. And compression research on the lottery ticket hypothesis showed that pretrained vision-language models can be significantly compressed without hurting their performance.

  • Blog Infusing AI into cloud computing systems 

    Cloud Intelligence/AIOps is a rapidly emerging technology trend and an interdisciplinary research direction across system, software engineering, and AI/ML communities. In this blog post from November, the researchers behind Microsoft’s AIOps work outline a research vision to make the cloud more autonomous, proactive, and manageable.

Building and deploying AI responsibly

Building AI that maximizes its benefit to humanity, and does so equitably, requires considering both the opportunities and risks that come with each new advancement in line with our guiding principles: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability.

Helping to put these principles into practice is Microsoft’s Responsible AI Standard, which the company made publicly available in June. The standard comprises tools and steps that AI practitioners can execute in their workflows today to help ensure that building AI responsibly is baked into every stage of development. These standards will evolve as the tools and resources to responsibly build AI evolve in response to the rapid pace of AI advancement, particularly pertaining to the growing size of AI models and the new challenges they bring.

With FedKD and InclusiveFL, researchers tackled some of the obstacles in applying federated learning, an ML method for protecting privacy, to model training. Two separate teams explored solutions for the harmful language that large generative models can reproduce—one presenting a unified framework for both detoxifying and debiasing models and another introducing methods for making content moderation tools more robust. Meanwhile, researchers sought to strengthen human-AI collaboration by giving users more insight into how models arrive at their outputs via explanations provided by the models themselves.

The responsible development of AI also means deploying technologies that operate the way they were designed to—and the way people expect them to. In a pair of blog posts, researchers draw on their respective experiences developing a technology to support social agency in children who are born blind and another to support mental health practitioners in guiding patient treatment. They stress the need for multiple measures of performance in determining the readiness of increasingly complex AI systems, and for incorporating domain experts and user research throughout the development process.

Advancing AI for decision making

Building the next generation of AI requires continuous research into fundamental new AI innovations. Two significant areas of study in 2022 were causal ML and reinforcement learning.

Causal ML

Identifying causal effects is an integral part of scientific inquiry. It helps us understand everything from educational outcomes to the effects of social policies to risk factors for diseases. Questions of cause and effect are also critical for the design and data-driven evaluation of many technological systems we build today.  

This year, Microsoft Research continued its work on causal ML, which combines traditional machine learning with causal inference methods. To help data scientists better understand and deploy causal inference, Microsoft researchers built the DoWhy library, an end-to-end causal inference tool, in 2018. To broaden access to this critical knowledge base, DoWhy has now migrated to an independent open-source governance model in a new PyWhy GitHub organization. As part of this new collaborative model, Amazon Web Services is contributing new technology based on structural causal models.
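For readers new to the library, the following is a minimal, hedged sketch of DoWhy's four-step workflow (model, identify, estimate, refute) on synthetic data; the column names and the true effect of 2.0 are our own illustrative choices.

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Synthetic data in which w confounds both treatment t and outcome y.
rng = np.random.default_rng(0)
n = 5000
w = rng.normal(size=n)
t = (w + rng.normal(size=n) > 0).astype(int)
y = 2.0 * t + 1.5 * w + rng.normal(size=n)  # true causal effect: 2.0
df = pd.DataFrame({"w": w, "t": t, "y": y})

# Step 1: model the causal assumptions.
model = CausalModel(data=df, treatment="t", outcome="y", common_causes=["w"])
# Step 2: identify the estimand implied by those assumptions.
estimand = model.identify_effect(proceed_when_unidentifiable=True)
# Step 3: estimate the effect.
estimate = model.estimate_effect(estimand,
                                 method_name="backdoor.linear_regression")
print(estimate.value)  # should be close to 2.0
# Step 4: refute, e.g., with a placebo treatment.
print(model.refute_estimate(estimand, estimate,
                            method_name="placebo_treatment_refuter"))
```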

At this year’s Conference on Neural Information Processing Systems (NeurIPS), researchers presented a suite of open-source causal tools and libraries that aims to simultaneously provide core causal AI functionality to practitioners and create a platform for research advances to be rapidly deployed. This includes ShowWhy, a no-code user interface suite that empowers domain experts to become decision scientists. We hope that our work accelerates use-inspired basic research for improvement of causal AI.

Reinforcement learning (RL)

Reinforcement learning is a powerful tool for learning which behaviors are likely to produce the best outcomes in a given scenario, typically through trial and error. But this powerful tool faces some challenges. Trial and error can consume enormous resources when applied to large datasets. And for many real-time applications, there’s no room to learn from mistakes.   

To address RL’s computational bottleneck, Microsoft researchers developed Path Predictive Elimination, a reinforcement learning method that is robust enough to remove noise from continuously changing environments. Also in 2022, a Microsoft team released MoCapAct, a library of pretrained simulated models to enable advanced research on artificial humanoid control at a fraction of the compute resources currently required.  

Researchers also developed a new method for using offline RL to augment human-designed strategies for making critical decisions. This team deployed game theory to design algorithms that can use existing data to learn policies that improve on current strategies.

Thank you for reading

2022 was an exciting year for research, and we look forward to the future breakthroughs our global research community will deliver. In the coming year, you can expect to hear more from us about our vision, and the impact we hope to achieve. We appreciate the opportunity to share our work with you, and we hope you will subscribe to the Microsoft Research Newsletter for the latest developments.

Writers and Editors
Elise Ballard
Kristina Dodge
Kate Forster
Chris Stetkiewicz
Larry West

Managing Editor
Amber Tingle

Project Manager
Amanda Melfi

Graphic Designer
Matt Sanderson

Editor in Chief
Matt Corwine

How Cloud Intelligence/AIOps is making cloud systems more autonomous, proactive and manageable

When legendary computer scientist Jim Gray accepted the Turing Award in 1999, he laid out a dozen long-range information technology research goals. One of those goals called for the creation of trouble-free server systems or, in Gray’s words, to “build a system used by millions of people each day and yet administered and managed by a single part-time person.”  

Gray envisioned a self-organizing “server in the sky” that would store massive amounts of data, and refresh or download data as needed. Today, with the emergence and rapid advancement of artificial intelligence (AI), machine learning (ML) and cloud computing, and Microsoft’s development of Cloud Intelligence/AIOps, we are closer than we have ever been to realizing that vision—and moving beyond it.  

Over the past fifteen years, the most significant paradigm shift in the computing industry has been the migration to cloud computing, which has created unprecedented digital transformation opportunities and benefits for business, society, and human life.  

The implication is profound: cloud computing platforms have become part of the world’s basic infrastructure. As a result, the non-functional properties of cloud computing platforms, including availability, reliability, performance, efficiency, security, and sustainability, have become immensely important. Yet the distributed nature, massive scale, and high complexity of cloud computing platforms—ranging from storage to networking, computing and beyond—present huge challenges to building and operating such systems.  

What is Cloud Intelligence/AIOps?

Cloud Intelligence/AIOps (“AIOps” for brevity) aims to innovate AI/ML technologies to help design, build, and operate complex cloud platforms and services at scale—effectively and efficiently.  

AIOps has three pillars, each with its own goal:  

  • AI for Systems to make intelligence a built-in capability to achieve high quality, high efficiency, self-control, and self-adaptation with less human intervention.  
  • AI for Customers to leverage AI/ML to create unparalleled user experiences and achieve exceptional user satisfaction using cloud services.  
  • AI for DevOps to infuse AI/ML into the entire software development lifecycle to achieve high productivity.  

Where did the research on AIOps begin?  

Gartner, a leading industry analyst firm, first coined the term AIOps (Artificial Intelligence for IT Operations) in 2017. According to Gartner, AIOps is the application of machine learning and data science to IT operation problems. While Gartner’s AIOps concept focuses only on DevOps, Microsoft’s Cloud Intelligence/AIOps research has a much broader scope, including AI for Systems and AI for Customers.  

The broader scope of Microsoft’s Cloud Intelligence/AIOps stems from the Software Analytics research we proposed in 2009, which seeks to enable software practitioners to explore and analyze data to obtain insightful and actionable information for data-driven tasks related to software and services. We started to focus our Software Analytics research on cloud computing in 2014 and named this new topic Cloud Intelligence (Figure 1). In retrospect, Software Analytics is about the digital transformation of the software industry itself, such as empowering practitioners to use data-driven approaches and technologies to develop software, operate software systems, and engage with customers.  

Figure 1: From Software Analytics to Cloud Intelligence/AIOps

What is the AIOps problem space? 

There are many scenarios around each of the three pillars of AIOps. Some example scenarios include predictive capacity forecasting for efficient and sustainable services, monitoring service health status, and detecting health issues in a timely manner in AI for Systems; ensuring code quality and preventing defective builds from being deployed into production in AI for DevOps; and providing effective customer support in AI for Customers. Across all these scenarios, there are four major problem categories that, taken together, constitute the AIOps problem space: detection, diagnosis, prediction, and optimization (Figure 2). Specifically, detection aims to identify unexpected system behaviors (or anomalies) in a timely manner. Given a symptom and its associated artifacts, the goal of diagnosis is to localize the cause of service issues and find the root cause. Prediction attempts to forecast system behaviors, customer workload patterns, DevOps activities, and so on. Lastly, optimization tries to identify the optimal strategies or decisions required to achieve certain performance targets related to system quality, customer experience, and DevOps productivity.

Figure 2: Problems and challenges of AIOps

Each problem has its own challenges. Take detection as an example. To ensure service health at runtime, it is important for engineers to continuously monitor various metrics and detect anomalies in a timely manner. In the development process, to ensure the quality of the continuous integration/continuous delivery (CI/CD) practice, engineers need to create mechanisms to catch defective builds and prevent them from being deployed to other production sites.  

Both scenarios require timely detection, and in both there are common challenges for conducting effective detection. For example, time series data and log data are the most common data forms. Yet they are often multi-dimensional, there may be noise in the data, and they often have different detection requirements—all of which can pose significant challenges to reliable detection.  
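As a flavor of the detection problem, here is a deliberately simple threshold detector on a single metric stream. It is a sketch of the general idea only, not one of the production detectors discussed in this post, which must also handle seasonality, multi-dimensional data, and per-signal requirements.

```python
import numpy as np

def rolling_zscore_anomalies(series, window=60, threshold=4.0):
    """Flag points that deviate from the trailing-window mean by more
    than `threshold` standard deviations (illustrative sketch only)."""
    series = np.asarray(series, dtype=float)
    flags = np.zeros(len(series), dtype=bool)
    for i in range(window, len(series)):
        past = series[i - window:i]
        mu, sigma = past.mean(), past.std()
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            flags[i] = True
    return flags

# Example: a latency metric with one injected spike.
rng = np.random.default_rng(1)
latency = rng.normal(100, 5, size=500)
latency[400] = 180  # simulated incident
print(np.flatnonzero(rolling_zscore_anomalies(latency)))  # [400]
```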

Microsoft Research: Our AIOps vision

Microsoft is conducting continuous research in each of the AIOps problem categories. Our goal for this research is to empower cloud systems to be more autonomous, more proactive, more manageable, and more comprehensive across the entire cloud stack.  

Making cloud systems more autonomous

AIOps strives to make cloud systems more autonomous by minimizing human operations and rule-based decisions, which significantly helps reduce user impact caused by system issues, improves operational decisions, and lowers maintenance costs. This is achieved by automating DevOps as much as possible, including build, deployment, monitoring, and diagnosis. For example, the purpose of safe deployment is to catch a defective build early to prevent it from rolling out to production and resulting in significant customer impact. Catching defective builds can be extremely labor intensive and time consuming for engineers, because anomalous behaviors have a variety of patterns that may change over time, and not all anomalous behaviors are caused by a new build, which may introduce false positives.

At Microsoft Research, we used transfer learning and active learning techniques to develop a safe deployment solution that overcomes these challenges. We’ve been running the solution in Microsoft Azure, and it has been highly effective at helping to catch defective builds – achieving more than 90% precision and near 100% recall in production over a period of 18 months.  

Root cause analysis is another way that AIOps is reducing human operations in cloud systems. To shorten the mitigation time, engineers in cloud systems must quickly identify the root causes of emerging incidents. Owing to the complex structure of cloud systems, however, incidents often contain only partial information and can be triggered by many services and components simultaneously, which forces engineers to spend extra time diagnosing the root causes before any effective actions can be taken.  By leveraging advanced contrast-mining algorithms, we have implemented autonomous incident-diagnosis systems, including HALO and Outage Scope, to reduce response time and increase accuracy in incident diagnosis tasks. These systems have been integrated in both Azure and Microsoft 365 (M365), which has considerably improved engineers’ ability to handle incidents in cloud systems. 

Making cloud systems more proactive 

AIOps makes cloud systems more proactive by introducing the concept of proactive design. In the design of a proactive system, an ML-based prediction component is added to the traditional system. The prediction component takes the input signals, does the necessary processing, and outputs the future status of the system: for example, what the capacity status of cluster A will look like next week, whether a disk will fail in a few days, or how many virtual machines (VMs) of a particular type will be needed in the next hour.

Knowing the future status makes it possible for the system to proactively avoid negative system impacts. For example, engineers can live migrate the services on an unhealthy computing node to a healthy one to reduce VM downtime, or pre-provision a certain number of VMs of a particular type for the next hour to reduce the latency of VM provisioning. In addition, AI/ML techniques can enable systems to learn over time which decision to make.  

As an example of proactive design, we built a system called Narya, which proactively mitigated potential hardware failures to reduce service interruption and minimize customer impact. Narya, which is in production in Microsoft Azure, performs prediction on hardware failures and uses a bandit algorithm to decide which mitigation action to take. 
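The post doesn't detail Narya's algorithm, but the bandit idea can be sketched in a few lines: keep a running estimate of each mitigation action's value, usually take the best-known action, and occasionally try others. The action names and reward signal below are our own illustrative assumptions.

```python
import random

class EpsilonGreedyMitigator:
    """Toy epsilon-greedy bandit over mitigation actions (a sketch of
    the general idea, not Narya's production algorithm)."""
    def __init__(self, actions, epsilon=0.1):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.counts = {a: 0 for a in self.actions}
        self.values = {a: 0.0 for a in self.actions}  # mean observed reward

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(self.actions)          # explore
        return max(self.actions, key=self.values.get)   # exploit

    def update(self, action, reward):
        # Incremental mean; the reward might reflect avoided VM downtime.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]

bandit = EpsilonGreedyMitigator(["live_migrate", "soft_reboot", "avoid_scheduling"])
action = bandit.choose()
bandit.update(action, reward=1.0)  # observed outcome of the mitigation
```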

Making cloud systems more manageable 

AIOps makes cloud systems more manageable by introducing the notion of tiered autonomy. Each tier represents a set of operations that require a certain level of human expertise and intervention. These tiers range from the top tier of autonomous routine operations to the bottom tier, which requires deep human expertise to respond to rare and complex problems.  

AI-driven automation often cannot handle such problems. By building AIOps solutions targeted at each tier, we can make cloud platforms easier to manage across the long tail of rare problems that inevitably arise in complex systems. Furthermore, the tiered design ensures that autonomous systems are developed from the start to evaluate certainty and risk, and that they have safe fallbacks when automation fails or the platform faces a previously unseen set of circumstances, such as the unforeseen increase in demand in 2020 due to the COVID-19 pandemic. 
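A tiered design can be expressed as a simple routing rule: try the most autonomous tier first, and fall through to a human only when no automated tier is confident enough. The sketch below is our own illustration of the concept, not code from any production system.

```python
def handle_incident(incident, tier_handlers, certainty_threshold=0.9):
    """Route an incident through autonomy tiers (illustrative sketch).
    Each handler returns (action, certainty); escalate to a human
    when no automated tier is sufficiently certain."""
    for tier, handler in enumerate(tier_handlers):
        action, certainty = handler(incident)
        if certainty >= certainty_threshold:
            return f"tier {tier}: {action}"
    return "escalate to on-call engineer"

# Example with one stubbed handler that is unsure about a rare incident.
print(handle_incident({"type": "rare"}, [lambda i: ("reboot node", 0.4)]))
# -> "escalate to on-call engineer"
```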

As an example of tiered autonomy, we built Safe On-Node Learning (SOL), a framework for safe learning and actuation on server nodes for the top tier. As another example, we are exploring how to predict the commands that operators should perform to mitigate incidents, while considering the associated certainty and risks of those commands when the top-tier automation fails to prevent the incidents. 

Making AIOps more comprehensive across the cloud stack

AIOps can also be made more comprehensive by spanning the cloud stack—from the lowest infrastructure layers (such as network and storage) through the service layer (such as the scheduler and database) and on to the application layer. The benefit of applying AIOps more broadly would be a significant increase in the capability for holistic diagnosis, optimization, and management. 

Microsoft services built on top of Azure are called first-party (1P) services. A 1P setting, which is often used to optimize system resources, is particularly suited to a more comprehensive approach to AIOps. This is because with the 1P setting a single entity has visibility into, and control over, the layers of the cloud stack, which enables engineers to amplify the AIOps impact. Examples of 1P services at Microsoft include large and established services such as Office 365, relatively new but sizeable services such as Teams, and up-and-coming services such as Windows 365 Cloud PC. These 1P services typically account for a significant share of resource usage, such as wide-area network (WAN) traffic and compute cores.

As an example of applying a more comprehensive AIOps approach to the 1P setting, the OneCOGS project, which is a joint effort of Azure, M365, and MSR, considers three broad opportunities for optimization:  

  1. Modeling users and their workload using signals cutting across the layers—such as using the user’s messaging activity versus fixed working hours to predict when a Cloud PC user will be active—thereby increasing accuracy and enabling appropriate allocation of system resources. 
  2. Jointly optimizing the application and the infrastructure to achieve cost savings and more.  
  3. Taming the complexity of data and configuration, thereby democratizing AIOps.  

The AIOps methodologies, technologies and practices used for cloud computing platforms and 1P services are also applicable to third-party (3P) services on the cloud stack. To achieve this, further research and development are needed to make AIOps methods and techniques more general and/or easily adaptable. For example, when operating cloud services, detecting anomalies in multi-dimensional space and the subsequent fault localization are common monitoring and diagnosis problems.  

Motivated by the real-world needs of Azure and M365, we proposed the technique AiDice, which automatically detects anomalies in multi-dimensional space, and HALO, a hierarchy-aware approach to locating fault-indicating combinations that uses telemetry data collected from cloud systems. In addition to deploying AiDice and HALO in Azure and M365, we’re also collaborating with product team partners to make AiDice and HALO AIOps services that can be leveraged by third-party services. 

Conclusion 

AIOps is a rapidly emerging technology trend and an interdisciplinary research direction across the systems, software engineering, and AI/ML communities. With years of research on Cloud Intelligence, Microsoft Research has built up rich technology assets in detection, diagnosis, prediction, and optimization. And through close collaboration with Azure and M365, we have deployed some of our technologies in production, which has created significant improvements in the reliability, performance, and efficiency of Azure and M365 while increasing the productivity of developers working on these products. In addition, we are collaborating with colleagues in academia and industry to promote AIOps research and practices. For example, through these joint efforts we have organized three editions of the AIOps workshop at the premier academic conferences AAAI 2020, ICSE 2021, and MLSys 2022.

Moving forward, we believe that as a new dimension of innovation, Cloud Intelligence/AIOps will play an increasingly important role in making cloud systems more autonomous, more proactive, more manageable, and more comprehensive across the entire cloud stack. Ultimately, Cloud Intelligence/AIOps will help us make our vision for the future of the cloud a reality. 

Assessing AI system performance: Thinking beyond models to deployment contexts

Figure 1: Performance assessment methods change across the development lifecycle for complex AI systems in ways that differ from general purpose AI. The emphasis shifts from rapid technical innovation that requires easy-to-calculate aggregate performance metrics at the beginning of the development process to metrics that reflect the performance of critical AI system attributes needed to underpin the user experience at the end.

AI systems are becoming increasingly complex as we move from visionary research to deployable technologies such as self-driving cars, clinical predictive models, and novel accessibility devices. Compared with singular AI models, it is more difficult to assess whether these more complex AI systems are performing consistently and as intended to realize human benefit. Several factors contribute to this difficulty:

    1. Real-world contexts in which the data might be noisy or different from training data;
    2. Multiple AI components that interact with each other, creating unanticipated dependencies and behaviors;
    3. Human-AI feedback loops that come from repeated engagements between people and the AI system;
    4. Very large AI models (e.g., transformer models);
    5. AI models that interact with other parts of a system (e.g., a user interface or heuristic algorithm).

How do we know when these more advanced systems are ‘good enough’ for their intended use? When assessing the performance of AI models, we often rely on aggregate performance metrics like percentage of accuracy. But this ignores the many elements, often human ones, that make up an AI system.

Our research on what it takes to build forward-looking, inclusive AI experiences has demonstrated that getting to ‘good enough’ requires multiple performance assessment approaches at different stages of the development lifecycle, based upon realistic data and key user needs (figure 1).

Shifting emphasis gradually from iterative adjustments in the AI models themselves toward approaches that improve the AI system as a whole has implications not only for how performance is assessed, but for who should be involved in the performance assessment process. Engaging (and training) non-technical domain experts earlier (e.g., for choosing test data or defining experience metrics) and in a larger capacity throughout the development lifecycle can enhance the relevance, usability, and reliability of the AI system.

Performance assessment best practices from the PeopleLens

The PeopleLens (figure 2) is a new Microsoft technology designed to enable children who are born blind to experience social agency and build up the range of social attention skills needed to initiate and maintain social interactions. Running on smart glasses, it provides the wearer with continuous, real-time information about the people around them through spatial audio, helping them build up a dynamic map of the whereabouts of others. Its underlying technology is a complex AI system that uses several computer vision algorithms to calculate pose, identify registered people, and track those entities over time.

The PeopleLens offers a useful illustration of the wide range of performance assessment methods and people necessary to comprehensively gauge its efficacy.

Figure 2: The PeopleLens is a new research technology designed to help people who are blind or have low vision better understand their immediate social environments by locating and identifying people in the space dynamically in real-time.

Getting started: AI model or AI system performance?

Calculating aggregate performance metrics on open-source benchmarked datasets may demonstrate the capability of an individual AI model, but may be insufficient when applied to an entire AI system. It can be tempting to believe a single aggregate performance metric (such as accuracy) can be sufficient to validate multiple AI models individually. But the performance of two AI models in a system cannot be comprehensively measured by simple summation of each model’s aggregate performance metric.

We used two AI models to test the accuracy of the PeopleLens to locate and identify people: the first was a benchmarked, state-of-the-art pose model used to indicate the location of people in an image. The second was a novel facial recognition algorithm previously demonstrated to have greater than 90% accuracy. Despite strong historical performance of these two models, when applied to the PeopleLens, the AI system recognized only 10% of people from a realistic dataset in which people were not always facing the camera.

This finding illustrates that multi-algorithm systems are more than a sum of their parts, requiring specific performance assessment approaches.
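A back-of-the-envelope calculation shows why this happens. Even under an optimistic independence assumption, per-model accuracies multiply when models are chained, and correlated failure modes, such as people not facing the camera, can drive joint performance far lower still, as observed here.

```python
# Two chained models, each ~90% accurate, yield ~81% under an
# independence assumption -- and correlated real-world failure modes
# (e.g., faces turned away) explain results as low as the observed 10%.
pose_accuracy = 0.90
face_id_accuracy = 0.90
print(f"independent-chain estimate: {pose_accuracy * face_id_accuracy:.0%}")
```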

Connecting to the human experience: Metric scorecards and realistic data 

Metrics scorecards, calculated on a realistic reference dataset, offer one way to connect to the human experience while the AI system is still undergoing significant technical iteration. A metrics scorecard can combine several metrics to measure aspects of the system that are most important to users.

We used ten metrics in the development of PeopleLens. The two most valuable metrics were time-to-first-identification, which measured how long it took from the time a person was seen in a frame to the user hearing the name of that person, and number of repeat false positives, which measured how often a false positive occurred in three frames or more in a row within the reference dataset.

The first metric captured the core value proposition for the user: having the social agency to be the first to say hello when someone approaches. The second was important because the AI system would self-correct single misidentifications, while repeated mistakes would lead to a poor user experience. This measured the ramifications of that accuracy throughout the system, rather than just on a per-frame basis.
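The second metric is straightforward to compute from per-frame detection flags. The sketch below is our reconstruction of the idea described above; the team's exact definition may differ in detail.

```python
def repeat_false_positives(frame_flags, min_run=3):
    """Count runs of at least `min_run` consecutive false-positive
    frames (a sketch of the 'repeat false positives' metric)."""
    count, run = 0, 0
    for is_false_positive in frame_flags:
        run = run + 1 if is_false_positive else 0
        if run == min_run:  # count each qualifying run exactly once
            count += 1
    return count

# 1 = the system reported a person who wasn't there in that frame.
flags = [0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0]
print(repeat_false_positives(flags))  # 2 repeated-mistake episodes
```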

Beyond metrics: Using visualization tools to finetune the user experience

While metrics play a critical role in the development of AI systems, a wider range of tools is needed to finetune the intended user experience. It is essential for development teams to test on realistic datasets to understand how the AI system generates the actual user experience. This is especially important with complex systems, where multiple models, human-AI feedback loops, or unpredictable data (e.g., user-controlled data capture) can cause the AI system to respond unpredictably.

Visualization tools can enhance the top-down statistical tools of data scientists, helping domain experts contribute to system development. In the PeopleLens, we used custom-built visualization tools to compare side-by-side renditions of the experience with different model parameters (figure 3). We leveraged these visualizations to enable domain experts—in this case parents and teachers—to spot patterns of odd system behavior across the data.

Figure 3: Visualization tools helped the development team, including domain experts, in connecting the AI system to the user experience using realistic data. In this image, the top bar shows images taken from the wearable camera stream overlayed with the various model outcomes. The bottom bar shows the output of the world-state tracking algorithm on the left and the ground truth on the right. The panel in the middle shows model parameters that are being changed with the impact on the user experience being viewed in real time.

AI system performance in the context of the user experience

A user experience can only be as good as the underlying AI system. Testing the AI system in a realistic context, measuring things that matter to the users, is a critical stage before widespread deployment. We know, for example, that improving AI system performance does not necessarily correspond to improved performance of AI teams (reference).

We also know that human-AI feedback loops can make it difficult to measure an AI system’s performance. These feedback loops, which are essentially repeated interactions between the AI system and the user, can surface (and intensify) errors. They can also, when the system is sufficiently intelligible, be repaired by the user.

The PeopleLens system gave users feedback about people’s locations and their faces. A missed identification (e.g., because the wearer is looking at a person’s chest rather than their face) can be resolved once the user responds to feedback (e.g., by looking up). This example shows us that we do not need to focus on missed identifications, as they will be resolved by the human-AI feedback loop. However, users were very perplexed by the identification of people who were no longer present, so performance assessments needed to focus on these false positive misidentifications. This experience points to several broader lessons:

    1. Multiple performance assessment methods should be used in AI system development. In contrast to the development of individual AI models, general aggregate performance metrics are only a small component, relevant primarily in the earliest stages of development.
    2. Documenting AI system performance should include a range of approaches, from metrics scorecards to system performance metrics for a deployed user experience, to visualization tools.
    3. Domain experts play an important role in performance assessment, beginning early in the development lifecycle. However, domain experts are often not prepared or trained for the in-depth participation that is optimal in AI system development.
    4. Visualization tools are as important as metrics in creating and documenting an AI system for a particular intended use. It is critical that domain experts have access to these tools as key decision-makers in AI system deployment.

Bringing it all together 

For complex AI systems, performance assessment methods change across the development lifecycle in ways that differ from individual AI models. Shifting performance assessment techniques from rapid technical innovation requiring easy-to-calculate aggregate metrics at the beginning of the development process, to the performance metrics that reflect critical AI system attributes that make up the user experience toward the end of development helps every type of stakeholder precisely and collectively define what is ‘good enough’ to achieve the intended use.  

It is useful for developers to remember performance assessment is not an end goal in itself; it is a process that defines how the system has reached its best state and whether that state is ready for deployment. The performance assessment process must include a broad range of stakeholders, including domain experts, who may need new tools to fulfill critical (sometimes unexpected) roles in the development and deployment of an AI system.

Microsoft Research Summit 2022 Oct. 18-20: What’s next for technology and humanity?

Today, we are experiencing waves of breakthroughs in computing that are transforming just about every aspect of our lives. Artificial intelligence is changing the way we develop and create. Human language technologies are revolutionizing the workflows of healthcare professionals. Deep learning is accelerating our ability to understand and predict natural phenomena, from atomic to galactic scales. Meanwhile, the foundations of cloud computing are undergoing a reinvention from the atoms up. 

Realizing the benefits of these new breakthroughs demands that we come together in new ways across the global research community. The vibrancy of invention and innovation increasingly lies at the intersections among traditional research disciplines, from the highly theoretical to the immediately applicable. Ensuring that the continuing advancement of technology is beneficial to all requires communication, collaboration and co-innovation across the communities that create new technologies and those that aim to use them to improve their lives. 

That’s why I’m excited to invite you to join us for this year’s Microsoft Research Summit, which will take place on October 18-20, 2022. This virtual event is where the global research community convenes to explore how emerging research might best address societal challenges and have significant impact on our lives in the coming years. This year’s event will feature over 120 speakers, including researchers and leaders from across the research community at Microsoft, alongside partners and collaborators from industry, academia and government who are advancing the frontiers of research in computing and across the sciences. 

Each of our three days will begin with a plenary session during which we’ll explore the potential impact of deep learning on scientific discovery, the opportunity to use technology to make healthcare more precise and accessible, and the re-invention of foundational technologies to enable the cloud of the future. These plenaries will lead into tracks that dive deeper into research that spans from more efficient and adaptable AI, to technologies that amplify human creativity and help foster a more sustainable society.

For further details – and to register to attend – check out the Microsoft Research Summit website.

We hope you will join us. 

MoCapAct: Training humanoid robots to ‘Move like Jagger’

What would it take to get humanoid, bipedal robots to dance like Mick Jagger? Or, for something more mundane, what does it take to get them to simply stand still? Sit down? Walk? Move in the myriad other ways many people take for granted?

Bipedalism provides unparalleled versatility in an environment designed for and by humans. By mixing and matching a wide range of basic motor skills, from walking to jumping to balancing on one foot, people routinely dance, play soccer, carry heavy objects, and perform other complex high-level motions. If robots are ever to reach their full potential as an assistive technology, mastery of diverse bipedal motion is a requirement, not a luxury. However, even the simplest of these skills can require a fine orchestration of dozens of joints. Sophisticated engineering can rein in some of this complexity, but endowing bipedal robots with the generality to cope with our messy, weakly structured world, or a metaverse that takes after it, requires learning. Training AI agents with humanoid morphology to match human performance across the entire diversity of human motion is one of the biggest challenges of artificial physical intelligence. Due to the vagaries of experimentation on physical robots, research in this direction is currently done mostly in simulation. 

Unfortunately, it involves computationally intensive methods, effectively restricting participation to research institutions with large compute budgets. In an effort to level the playing field and make this critical research area more inclusive, Microsoft Research’s Robot Learning group is releasing MoCapAct, a large library of pre-trained humanoid control models along with enriched data for training new ones. This will enable advanced research on artificial humanoid control at a fraction of the compute resources currently required. 

The reason why humanoid control research has been so computationally demanding is subtle and, at first glance, paradoxical. The prominent avenue for learning locomotive skills is based on using motion capture (MoCap) data. MoCap is an animation technique that has been widely used in the entertainment industry for decades. It involves recording the motion of several keypoints on a human actor’s body, such as their elbows, shoulders, and knees, while the actor is performing a task of interest, such as jogging. Thus, a MoCap clip can be thought of as a very concise and precise summary of an activity’s video clip. Thanks to this, useful information can be extracted from MoCap clips with much less computation than from the much more high-dimensional, ambiguous training data in other major areas of machine learning, which comes in the form of videos, images, and text. On top of this, MoCap data is widely available. Repositories such as the CMU Motion Capture Dataset contain hours of clips for just about any common motion of a human body, with visualizations of several examples shown below. Why, then, is it so hard to make physical and simulated humanoid robots mimic a person’s movements? 

The caveat is that MoCap clips don’t contain all the information necessary to imitate the demonstrated motions on a physical robot or in a simulation that models physical forces. They only show us what a motion skill looks like, not the underlying muscular movements that caused the actor’s muscles to yield that motion. Even if MoCap systems recorded these signals, it wouldn’t be of much help: simulated humanoids and real robots typically use motors instead of muscles, which is a dramatically different form of articulation. Nonetheless, actuation in artificial humanoids is also driven by a type of control signal. MoCap clips are a valuable aid in computing these control signals, if combined with additional learning and optimization methods that use MoCap data as guidance. The computational bottleneck that our MoCapAct release aims to remove is created exactly by these methods, collectively known as reinforcement learning (RL). In simulation, where much of AI locomotion research is currently focused, RL can recover the sequence of control inputs that takes a humanoid agent through the sequence of poses from a given MoCap clip. What results is a locomotion behavior that is indistinguishable from the clip’s. The availability of control policies for individual basic behaviors learned from separate MoCap clips can open the doors for fascinating locomotion research, e.g., in methods for combining these behaviors into a single “multi-skilled” neural network and training higher-level locomotion capabilities by switching among them. However, with thousands of basic locomotion skills to learn, RL’s expensive trial-and-error approach creates a massive barrier to entry on this research path. It is this scalability issue that our dataset release aims to address. 
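At the heart of this clip-tracking approach is a reward that scores how closely the simulated humanoid's pose matches the clip's reference pose at each timestep. The sketch below shows a common DeepMimic-style form of such a reward; it is illustrative, not MoCapAct's exact formulation.

```python
import numpy as np

def tracking_reward(sim_pose, ref_pose, scale=5.0):
    """Reward for matching the MoCap reference at one timestep:
    1.0 for a perfect match, decaying as pose error grows."""
    error = np.sum((np.asarray(sim_pose) - np.asarray(ref_pose)) ** 2)
    return float(np.exp(-scale * error))

# Inside an RL training loop (schematic):
#   action = policy(observation)
#   observation, sim_pose = env.step(action)
#   reward = tracking_reward(sim_pose, clip_reference[t])
```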

Figure 1: The MoCapAct dataset consists of policies that track individual MoCap clips and data from these agents.

Our MoCapAct dataset, designed to be compatible with the highly popular dm_control humanoid simulation environment and the extensive CMU Motion Capture Dataset, serves the research community in two ways: 

  1. For each of over 2500 MoCap clip snippets from the CMU Motion Capture Dataset, it provides an RL-trained “expert” control policy (represented as a PyTorch model) that enables dm_control’s simulated humanoid to faithfully recreate the skill depicted in that clip snippet, as shown in these videos of the experts’ behaviors: 

Training this model zoo has taken the equivalent of 50 years over many GPU-equipped Azure NC6v2 virtual machines (excluding hyperparameter tuning and other required experiments) – a testament to the computational hurdle MoCapAct removes for other researchers. 

  2. For each of the trained skill policies above, MoCapAct supplies a set of recorded trajectories generated by executing that skill’s control policy on the dm_control’s humanoid agent. These trajectories can be thought of as MoCap clips of the trained experts but, in a crucial difference from the original MoCap data, they contain both low-level sensory measurements (e.g., touch measurements) and control signals for the humanoid agent. Unlike typical MoCap data, these trajectories are suitable for learning to match and improve on skill experts via direct imitation – a much more efficient class of techniques than RL. 

We give two examples of how we used the MoCapAct dataset. 

First, we train a hierarchical policy based on the neural probabilistic motor primitives. To achieve this, we combine the thousands of MoCapAct’s clip-specialized policies together into a single policy that is capable of executing many different skills. This agent has a high-level component that takes MoCap frames as input and outputs a learned skill. The low-level component takes the learned skill and sensory measurements from the humanoid as input and outputs the motor action. 

Figure 2: The hierarchical policy consists of a high-level policy and low-level policy. The high-level policy maps the given MoCap frames to a learned skill. The low-level policy takes the skill and the humanoid observation and outputs an action that best realizes the skill. 

This hierarchical structure offers an appealing benefit. If we keep the low-level component, we can instead control the humanoid by inputting different skills to the low-level policy (e.g., “walk” instead of the corresponding motor actions). Therefore, we can re-use the low-level policy to efficiently learn new tasks. 
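In PyTorch-style pseudocode, the two-level structure looks roughly like the sketch below; the dimensions and module names are our own illustrative choices, not MoCapAct's released implementation.

```python
import torch
import torch.nn as nn

class HighLevelPolicy(nn.Module):
    """Maps upcoming MoCap reference frames to a latent skill vector."""
    def __init__(self, frame_dim, skill_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(frame_dim, 256), nn.ReLU(),
                                 nn.Linear(256, skill_dim))
    def forward(self, mocap_frames):
        return self.net(mocap_frames)

class LowLevelPolicy(nn.Module):
    """Maps (skill, proprioceptive observation) to a motor action."""
    def __init__(self, skill_dim, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(skill_dim + obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim))
    def forward(self, skill, observation):
        return self.net(torch.cat([skill, observation], dim=-1))

high = HighLevelPolicy(frame_dim=60, skill_dim=20)
low = LowLevelPolicy(skill_dim=20, obs_dim=100, act_dim=56)
skill = high(torch.randn(1, 60))          # e.g., "walk forward", as a vector
action = low(skill, torch.randn(1, 100))  # motor command for one timestep
```

Because only the low-level policy touches motor actions, a new task policy can later replace the high-level component while the low-level skills are reused unchanged.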

Figure 3: We can replace the high-level policy with a task policy that is trained to output skills required to achieve some new task, such as running to a target. 

In light of that, we replace the high-level policy with a task policy that is then trained to steer the low-level policy towards achieving some task. As an example, we train a task policy to have the humanoid reach a target. Notice that the humanoid uses many low-level skills, like running, turning, and side-stepping. 

Figure 4: Our GPT model takes in a sequence of observations from the humanoid (called the “context”) and outputs an action that it thinks best continues the observed motion. 

Our second example centers on motion completion, which is inspired by the task of sentence completion. Here, we use the GPT architecture, which accepts a sequence of sensory measurements (the “context”) and outputs a motor action. We train a control policy to take one second of sensory measurements from the dataset and output the corresponding motor actions from the specialized expert. Then, before executing the policy on our humanoid, we first generate a “prompt” (red humanoid in the videos) by executing a specialized expert for one second. Afterwards, we let the policy control the humanoid (bronze humanoid in the videos): at each time step, it takes the previous second of sensory measurements and predicts the motor actions. We find that this policy can reliably repeat the underlying motion of the clip, which is demonstrated in the first two videos. On other MoCap clips, we find that the policy can deviate from the underlying clip in a plausible way, such as in the third video, where the humanoid transitions from side-stepping to walking backwards.
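The motion-completion loop can be summarized as a sliding context window feeding a sequence model. The sketch below is schematic: the function and policy names are illustrative assumptions, not the released MoCapAct API.

```python
import torch

def motion_completion_rollout(gpt_policy, env, prompt_obs, horizon, context_len):
    """Roll out a GPT-style policy from a one-second 'prompt' of expert
    observations (schematic sketch; names are illustrative)."""
    context = list(prompt_obs)  # observations gathered while the expert acted
    for _ in range(horizon):
        # Feed the most recent observations (the "context") to the model.
        window = torch.stack(context[-context_len:]).unsqueeze(0)
        action = gpt_policy(window)      # predict the next motor action
        observation = env.step(action)   # assumed to return the observation
        context.append(observation)
    return context
```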

On top of the dataset, we also release the code used to generate the policies and results. We hope the community can build off of our dataset and work to do incredible research in the control of humanoid robots. 

Our paper is available here. You can read more at our website. 

The data used in this project was obtained from mocap.cs.cmu.edu.
The database was created with funding from NSF EIA-0196217. 


Confidential Containers: Verifiably secure computation in the cloud


For many organizations, trusting their data to the cloud requires having a complete understanding of and control over the environment in which that data resides and how it’s being processed. Microsoft understands this, and we are committed to building a trustworthy cloud—one in which security, privacy, and transparency are built into its core. A key part of this vision is confidential computing—a set of hardware and software capabilities that give data owners visibility into the data environment and verifiable security protection of their data in use. 

The Confidential Computing team at Microsoft Research is collaborating with hardware developers to create trusted execution environments (TEEs), where data stays encrypted not just when stored (encryption at rest) and in transit, but also during use. This work underpins the Azure confidential cloud platform, where users can upload encrypted code and data and get encrypted results back with strong privacy. 

At Microsoft Build 2022, the company announced serverless confidential containers with lift-and-shift support, the next step in the evolution of confidential computing. This service builds on the Confidential Containers work conducted at Microsoft Research. Confidential Containers offers a verifiably secure container environment in Azure where users can confirm that the software performing computations on their data is exactly the software they expect to be running, that it will do what they want it to do with their data, and that they can trust the results it returns. Confidential Containers enables users to take existing container workloads, and with a small amount of configuration, use them in a confidential environment.

Smaller trusted computing base 

Confidential Containers decreases the size of the trusted computing base (TCB)—the totality of elements in a computing environment that must be trusted not to violate the confidentiality of computation. The TCB can include software, hardware, and human administrators, among other things. By removing elements from the TCB, the components that can be compromised are reduced, decreasing the attack surface. Confidential Containers removes Microsoft administrators from the TCB, minimizing it as much as possible while still enabling customers to run existing workloads without modifying them.

This reduced TCB provides an option for organizations that currently run computations on their data on premises because they are concerned about the security of their data in the cloud. Even though setting up a computation environment in the cloud offers flexibility, data can be exposed to anyone who operates the servers on which the system runs. With Confidential Containers, the individuals who can access the data can be tightly controlled. This can be a single designated employee of the organization that owns the data or the business partner that is processing the data. It is never a Microsoft employee or another third party. 

Encrypted, policy-constrained computing environment 

A secure hardware environment enables data protection in use. Confidential Containers runs on AMD processors backed by AMD Secure Encrypted Virtualization-Secure Nested Paging (SEV-SNP), which provides a TEE. This hardware-enforced security boundary provides a shield so that nothing outside the encrypted memory space can read the data.

Users of Confidential Containers create a policy defining precisely what can run in the confidential container environment and how. The AMD SEV-SNP hardware produces an attestation report, which provides a succinct representation of everything in the confidential environment, including information about the code that will be enforcing the policy. Users can request this attestation report any time before providing the container with a key to unlock the encrypted dataset for processing. 
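
Conceptually, that key-release handshake looks like the sketch below. The function and field names are illustrative rather than a specific Azure or AMD SDK; the essential point is that the key to the encrypted dataset is handed over only after the attestation report checks out.

    # Conceptual sketch of attestation-gated key release.
    # Names are illustrative, not a specific Azure or AMD SDK.

    EXPECTED_MEASUREMENT = "..."   # expected launch measurement of the TEE
    EXPECTED_POLICY_HASH = "..."   # hash of the user-authored container policy

    def release_dataset_key(container, key_store):
        report = container.request_attestation_report()

        # The report is signed by the AMD SEV-SNP hardware; verify the
        # signature chain back to the silicon vendor's root of trust.
        if not report.signature_is_valid(root_of_trust="AMD"):
            raise PermissionError("attestation report signature invalid")

        # Confirm that the environment, and the code enforcing the policy,
        # are exactly what the data owner expects.
        if report.measurement != EXPECTED_MEASUREMENT:
            raise PermissionError("unexpected launch measurement")
        if report.policy_hash != EXPECTED_POLICY_HASH:
            raise PermissionError("container policy has been modified")

        # Only now is the dataset decryption key released to the container.
        return key_store.fetch_key("dataset-encryption-key")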


Sensitive data handling in the cloud 

Before the development of HTTPS, businesses could not securely run a storefront on the public web because communication over the internet was not secure. In the same way, individuals and organizations today cannot securely run containerized computation over sensitive data in the public cloud. Confidential Containers addresses this need. 

This is a game-changer for organizations that must comply with local and international regulations on how sensitive data is handled. For example, healthcare organizations that store encrypted patient information in the cloud are required by HIPAA regulations to download that data to perform computations on premises. This multistep process entails decrypting the data once it has been downloaded to an organization’s servers, performing the required computations, and then re-encrypting the data before re-uploading it to the cloud. It also requires ensuring that the on-premises environment contains the security architecture necessary to comply with HIPAA and other regulations. 

Because Confidential Containers provides advanced security safeguards for data in use in Azure, organizations no longer need to perform these time-consuming steps. This also means they no longer need to maintain servers on premises. Moreover, Azure users can define even stricter policies for their container environment in the cloud than they have in place in their on-premises environment.

Secure multiparty computations 

Another benefit of Confidential Containers is that they enable secure multiparty computations. A single organization can securely process multiple datasets that contain sensitive information, or multiple organizations with datasets that must remain secure can share those datasets with the assurance that their data will not leak. Organizations can perform computations on multiple datasets, such as for training a machine learning model, and gain better results than they would from computations on a single dataset, all without any party seeing what the other datasets contain. 

Easy deployment and lift-and-shift of Linux containers 

Creating a confidential container is straightforward for Azure users who are currently using or getting ready to use containers, requiring only a small amount of configuration to move existing workloads. Linux users can easily lift and shift their Linux containers to Confidential Containers on Azure. 

Unlimited potential with Confidential Containers 

We believe that in the future, all computing in the cloud will be confidential, and we’re excited to share Confidential Containers—a technology that plays a role in making this happen. The capabilities it provides will have implications that we have yet to imagine. We’re particularly excited by the potential of multiparty computations. The ability to perform computations in a protected environment on multiple datasets brings limitless possibilities, unlocking great value to Azure users. 

Confidential Containers is currently available for limited preview and will be available for public preview later this year. Sign up for the Confidential Containers preview. 


Microsoft Research AI4Science to empower the fifth paradigm of scientific discovery

Christopher Bishop, Distinguished Scientist, Managing Director, Microsoft Research Cambridge Lab

Over the coming decade, deep learning looks set to have a transformational impact on the natural sciences. The consequences are potentially far-reaching and could dramatically improve our ability to model and predict natural phenomena over widely varying scales of space and time. Could this capability represent the dawn of a new paradigm of scientific discovery?

Jim Gray, a Turing Award winner and former Microsoft Technical Fellow, characterised the historical evolution of scientific discovery through four paradigms. With origins dating back thousands of years, the first paradigm was purely empirical and based on direct observation of natural phenomena. While many regularities were apparent in these observations, there was no systematic way to capture or express them. The second paradigm was characterised by theoretical models of nature, such as Newton’s laws of motion in the seventeenth century, or Maxwell’s equations of electrodynamics in the nineteenth century. Derived by induction from empirical observation, such equations allowed generalization to a much broader range of situations than those observed directly. While these equations could be solved analytically for simple scenarios, it was not until the development of digital computers in the twentieth century that they could be solved in more general cases, leading to a third paradigm based on numerical computation. By the dawn of the twenty-first century, computation was again transforming science, this time through the ability to collect, store and process large volumes of data, leading to the fourth paradigm of data-intensive scientific discovery. Machine learning forms an increasingly important component of the fourth paradigm, allowing the modelling and analysis of large volumes of experimental scientific data. These four paradigms are complementary and coexist. 

The pioneering quantum physicist Paul Dirac commented in 1929 that “The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble.” For example, Schrödinger’s equation describes the behaviour of molecules and materials at the subatomic level with exquisite precision, and yet numerical solution with high accuracy is only possible for very small systems consisting of a handful of atoms. Scaling to larger systems requires increasingly drastic approximations leading to a challenging trade-off between scale and accuracy. Even so, quantum chemistry calculations are already of such high practical value that they form one of the largest supercomputer workloads. 

However, over the last year or two, we have seen the emergence of a new way to exploit deep learning, as a powerful tool to address this speed-versus-accuracy trade-off for scientific discovery. This is a very different use of machine learning from the modelling of data that characterizes the fourth paradigm, because the data that is used to train the neural networks itself comes from numerical solution of the fundamental equations of science rather than from empirical observation. We can view the numerical solutions of scientific equations as simulators of the natural world that can be used, at high computational cost, to compute quantities of interest in applications such as forecasting the weather, modelling the collision of galaxies, optimizing the design of fusion reactors, or calculating the binding affinities of candidate drug molecules to a target protein. From a machine learning perspective, however, the intermediate details of the simulation can be viewed as training data which can be used to train deep learning emulators. Such data is perfectly labelled, and the quantity of data is limited only by computational budget. Once trained, the emulator can perform new calculations with high efficiency, achieving significant improvements in speed, sometimes by several orders of magnitude. 
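
As a toy, runnable illustration of this pattern, the sketch below trains a small neural-network emulator on data produced by a stand-in “simulator” function. Real simulators and emulators are vastly more complex, but the loop is the same: simulate, train, then query the fast emulator.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def expensive_simulation(x):
        """Stand-in for a costly numerical solver (e.g., a quantum
        chemistry calculation); returns the quantity of interest."""
        return np.sin(x[:, 0]) * np.exp(-x[:, 1] ** 2)

    # 1) Generate perfectly labelled training data from the simulator;
    #    the amount is limited only by computational budget.
    rng = np.random.default_rng(0)
    X_train = rng.uniform(-2, 2, size=(10_000, 2))
    y_train = expensive_simulation(X_train)

    # 2) Train a neural-network emulator on the simulated data.
    emulator = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=500)
    emulator.fit(X_train, y_train)

    # 3) Once trained, the emulator answers new queries far faster than
    #    running the numerical solver itself.
    X_new = rng.uniform(-2, 2, size=(5, 2))
    print(emulator.predict(X_new))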

This ‘fifth paradigm’ of scientific discovery represents one of the most exciting frontiers for machine learning as well as for the natural sciences. While there is a long way to go before these emulators are sufficiently fast, robust, and general-purpose to become mainstream, the potential for real-world impact is clear. For example, the number of small-molecule drug candidates alone is estimated at 10⁶⁰, while the total number of stable materials is thought to be around 10¹⁸⁰ (roughly the square of the number of atoms in the known universe). Finding more efficient ways to explore these vast spaces would transform our ability to discover new substances such as better drugs to treat disease, improved substrates for capturing atmospheric carbon dioxide, better materials for batteries, new electrodes for fuel cells to power the hydrogen economy, and myriad others.

“AI4Science is an effort deeply rooted in Microsoft’s mission, applying the full breadth of our AI capabilities to develop new tools for scientific discovery so that we and others in the scientific community can confront some of humanity’s most important challenges. Microsoft Research has a 30+ year legacy of curiosity and discovery, and I believe that the AI4Science team – spanning geographies and scientific fields – has the potential to yield extraordinary contributions to that legacy.”

Kevin Scott, Executive Vice President and Chief Technology Officer, Microsoft

I’m delighted to announce today that I will be leading a new global team in Microsoft Research, spanning the UK, China and the Netherlands, to focus on bringing this fifth paradigm to reality. Our AI4Science team encompasses world experts in machine learning, quantum physics, computational chemistry, molecular biology, fluid dynamics, software engineering, and other disciplines who are working together to tackle some of the most pressing challenges in this field.

An example project is Graphormer, led by my colleague Tie-Yan Liu in our China team. This is a deep learning package that allows researchers and developers to train custom models for molecule modelling tasks, such as materials science or drug discovery. Recently, Graphormer won the Open Catalyst Challenge, a molecular dynamics competition that aims to model catalyst-adsorbate reaction systems with AI, using a dataset of more than 660,000 catalyst-adsorbate relaxation systems (144 million structure-energy frames) simulated with density functional theory (DFT) software. Another project, from our team in Cambridge in collaboration with Novartis, is Generative Chemistry, where together we are empowering scientists with AI to speed up the discovery and development of breakthrough medicines.

As Iya Khalil, Global Head of the AI Innovation Lab at Novartis, recently noted, the work is no longer science fiction but science-in-action:

“Not only can AI learn from our past experiments, but, with each new iteration of designing and testing in the lab, the machine learning algorithms can identify new patterns and help guide the early drug discovery and development process. Hopefully in doing this we can augment our human scientists’ expertise so they can design better molecules faster.”

The team has since used the platform to generate several promising early-stage molecules which have been synthesised for further exploration.

Alongside our teams in China and the UK, we have been growing a team in the Netherlands, including hiring the world-renowned machine learning expert, Max Welling. I am also excited to be able to announce today that our brand-new Lab in Amsterdam will be housed in Matrix One, which is currently under construction on the Amsterdam Science Park. This purpose-built space is in close proximity to the University of Amsterdam and the Vrije Universiteit Amsterdam, and we will maintain strong affiliations with both institutions through the co-supervision of PhD students.

Matrix One building in Amsterdam

It is with pride and excitement that we take this next step to come together as a cross-geographical team and follow in the footsteps of pioneers before us, to contribute to this next paradigm of scientific discovery, and in doing so impact many important societal challenges. If you share our excitement and ambition, and would like to join us, I encourage you to look at our open positions or get in touch to talk to anyone on the team.


Microsoft Research’s GODEL: Combining goal-oriented dialog with real-world conversations

Diagram showing GODEL’s architecture. The environment of the dialog system consists of both structured and unstructured content, which it uses to retrieve information. This source content, which we term “grounding,” is updated and repeatedly used by GODEL to produce a new response after each user input.

They make restaurant recommendations, help us pay bills, and remind us of appointments. Many people have come to rely on virtual assistants and chatbots to perform a wide range of routine tasks. But what if a single dialog agent, the technology behind these language-based apps, could perform all these tasks and then take the conversation further? In addition to providing on-topic expertise, such as recommending a restaurant, it could engage in a conversation about the history of the neighborhood or a recent sports game, and then bring the conversation back on track. What if the agent’s responses continually reflect the latest world events? And what if it could do all of this without the need for any additional work by the designer?   

With GODEL, this may not be far off. GODEL stands for Grounded Open Dialogue Language Model, and it ushers in a new class of pretrained language models that enable both task-oriented and social conversation and are evaluated by the usefulness of their responses.  

Pretrained language models are among the engines that power conversational AI, the technology that underlies these dialog agents. Dialog agents can either be task-oriented (“give me a job, and I’ll do it”) or open-domain, engaging in conversation without a specified outcome (often called chit-chat). GODEL combines both these capabilities, giving dialog agents the ability to generate responses based not just on the context of the conversation, but also on external information, content that was not part of the dataset when the model was trained. This includes both structured content, such as information stored in databases, and unstructured content, such as restaurant reviews, Wikipedia articles, and other publicly available material found on the web. This explains how a simple task-based query about restaurant recommendations can evolve into a dialog about ingredients, food, and even cooking techniques—the kind of winding path that real-world conversations take.  

In 2019, the Deep Learning and Natural Language Processing groups at Microsoft Research released DialoGPT, the first large-scale pretrained language model designed specifically for dialog. This helped make conversational AI more accessible and easier to work with, and it enabled the research community to make considerable progress in this area. With GODEL, our goal is to help further this progress by empowering researchers and developers to create dialog agents that are unrestricted in the types of queries they can respond to and the sources of information they can draw from. We also worked to ensure those responses are useful to the person making the query.    

In our paper, “GODEL: Large-Scale Pre-training for Goal-Directed Dialog,” we describe the technical details underlying GODEL, and we have made the code available on GitHub. 

A grounded model

One of GODEL’s key features is the flexibility it provides users in defining their model’s grounding—the sources from which their dialog agents retrieve information. This flexibility informs GODEL’s versatility in diverse conversational settings. If someone were to inquire about a local restaurant, for example, GODEL would be able to provide specific and accurate responses even though that venue may not have been included in the data used to train it. Responses would vary depending on whether the grounding information is empty, a snippet of a document, a search result (unstructured text), or information drawn from a database about the restaurant (structured text). However, each response would be appropriate and useful. 

In addition to specificity, grounded generation helps keep models up to date, as the grounded text can incorporate information that may not have been available at the time the model was trained. For example, if a model were developed before the 2022 Winter Olympics, GODEL would be able to provide details on those games and a list of winners even though all the data available to train it predates that event.
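
The released GODEL checkpoints make this grounding explicit at inference time. The snippet below follows the usage pattern from the GODEL model cards on the Hugging Face Hub; the checkpoint name and prompt format are taken from those cards, the grounding snippet is invented for illustration, and the repository remains the authoritative reference for the interface.

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("microsoft/GODEL-v1_1-large-seq2seq")
    model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/GODEL-v1_1-large-seq2seq")

    def generate(instruction, knowledge, dialog):
        # Grounding text is injected at inference time via a [KNOWLEDGE] segment.
        if knowledge:
            knowledge = "[KNOWLEDGE] " + knowledge
        dialog = " EOS ".join(dialog)
        query = f"{instruction} [CONTEXT] {dialog} {knowledge}"
        input_ids = tokenizer(query, return_tensors="pt").input_ids
        outputs = model.generate(input_ids, max_length=128, min_length=8,
                                 top_p=0.9, do_sample=True)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Example grounding snippet, invented for illustration:
    print(generate(
        "Instruction: given a dialog context and related knowledge, "
        "you need to respond based on the knowledge.",
        "The Riverside Noodle Bar serves ramen and is open until 11pm on weekdays.",
        ["Can you recommend a noodle place nearby that is open late?"],
    ))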

Broad application of GODEL

Another main feature of GODEL is its wide range of dialog applications. While its predecessor, DialoGPT, and other prior pretrained models for dialog have mostly focused on social bots, GODEL can be applied to a variety of dialogs, including those that are task-oriented, question-answering, and grounded chit-chat. In the same conversation, GODEL can produce reasonable responses for a variety of query types, including general questions or requests for specific actions.  

In addition, GODEL’s responses have been evaluated for their helpfulness. In our paper, we show that evaluation is done more reliably on datasets that are goal-directed, and that people generally agree on which responses are better when asked to judge their utility towards achieving certain goals. Equipped with this robust evaluation setup, we compared our model against several strong baselines and state-of-the-art approaches and show that GODEL is superior in terms of both human and automatic evaluation, as indicated in Figure 1. The paper describes extensive experiments against other state-of-the-art pretrained language models and demonstrates that performance gains are even larger in these cases. 

Two bar graphs showing that GODEL outperforms the baseline, in terms of both human and automated dialog evaluation. For human evaluation, GODEL received much higher human ratings (47, 41, and 27), while the human ratings for the best baseline were low (30, 22, and 17). For automatic evaluation, differences are smaller yet still statistically significant.
Figure 1: These charts illustrate GODEL’s performance against T5, a pretrained model that performed best in our evaluation. They compare the aggregate performance of models fine-tuned from GODEL against that of models fine-tuned from T5. They show that GODEL performs much better in human evaluations and makes appreciable gains in the automatic evaluation. The test set for these experiments combines a variety of dialog genres, including task-oriented dialog, conversational question-answering, and grounded chit-chat.

The following examples illustrate different dialog scenarios where GODEL uses a variety of sources to respond to identical user queries. 

  • This example illustrates how GODEL responds in an open-ended scenario in which the user asks a question that is completely unrelated to the initial question. Despite the lack of relevance, GODEL responds appropriately while trying to bring the conversation back on track. 

    Figure showing how GODEL responds to a user who just changed the topic, demonstrating that it can bring the conversation back on track. While the initial query is about a restaurant, the user suddenly mentions a series of tornadoes that have recently affected the area. GODEL uses grounding from a recent news article to provide information about the tornadoes, as requested by the user. Finally, it asks the user if there is anything else it can help with.

  • This example illustrates how GODEL responds in a task-oriented setting in which the model is connected to the components of a traditional goal-oriented dialog system, such as a database. In this case, the relevant environment contains structured information: a database returning two restaurants relevant to the current conversation.  

    Figure showing how GODEL responds appropriately to a user's request for a restaurant reservation. The user expresses a preference for a restaurant named Lucky Star, and GODEL extracts information from a database about that restaurant and retrieves relevant information, such as a reference number, to generate a response that flows naturally with the rest of the conversation.

  • This example illustrates how GODEL responds in a task-oriented setting in which traditional components of task-oriented dialog systems are not available. In this case, GODEL retrieves a restaurant review via a search engine. The response reflects both the context of the conversation and a snippet of the retrieved text, a restaurant review.  

    Figure showing how GODEL responds appropriately to a user's request for information about a specific restaurant. The user asks whether a given restaurant is good for groups, and GODEL uses text originating from restaurant reviews to infer that the restaurant is indeed good for groups. Also, GODEL provides additional information to address a concern with larger groups—that food is typically served quickly.

  • This example illustrates how GODEL responds in a question-answering scenario, where the user asks a general question and the context provides the dialog agent with the words it needs to search for the relevant information on the web. 

    Figure showing how GODEL responds appropriately when asked to give an example of a popular Chinese dish. GODEL uses grounding originating from search results to respond to the question while focusing on the most relevant information of the retrieved document.

GODEL available as open source

To advance research, we believe it is crucial to make code and models publicly available, and we have released GODEL as fully open source. We have made three versions of GODEL available: base, large, and extra-large. We are also including the code needed to retrain all pretrained models and to fine-tune models for specific tasks: the CoQA dataset, intended for conversational question-answering; the Wizard of Wikipedia and Wizard of the Internet datasets, aimed at information-seeking chats; and the MultiWOZ dataset, for task-completion dialogs.

We hope GODEL helps numerous academic research teams advance the field of conversational AI with innovative dialog models while eliminating the need for significant GPU resources. We plan to continuously improve GODEL and make more models available to the research community. Please visit our project page to learn more about the GODEL project and new releases.

Acknowledgements

We would like to thank our fellow colleagues at Microsoft Research who contributed to this work and blog post: Bill Dolan, Pengcheng He, Elnaz Nouri, Clarisse Simoes Ribeiro. 


(De)ToxiGen: Leveraging large language models to build more robust hate speech detection tools


It’s a well-known challenge that large language models (LLMs)—growing in popularity thanks to their adaptability across a variety of applications—carry risks. Because they’re trained on large amounts of data from across the internet, they’re capable of generating inappropriate and harmful language based on similar language encountered during training.  

Content moderation tools can be deployed to flag or filter such language in some contexts, but unfortunately, datasets available to train these tools often fail to capture the complexities of potentially inappropriate and toxic language, especially hate speech. Specifically, the toxic examples in many existing hate speech datasets tend either to be too hard or too easy for tools to learn from—the too-easy examples contain slurs, profanity, and explicit mentions of minority identity groups; the too-hard examples involve obscure references or inside jokes within the hate speech community. Additionally, the neutral examples in these datasets tend not to contain group mentions. As a result, tools may flag any language that references a minority identity group as hate speech, even when that language is neutral. Alternatively, tools trained on this data fail to detect harmful language when it lacks known or explicit slurs, profanity, or explicit mentions of minority identity groups.  

Generating the kind of data needed to strengthen content moderation tools against the above failures and harms is challenging for numerous reasons. In particular, toxic text that is implicit yet still learnable by existing machine learning architectures, and neutral text that mentions minority identity groups, are both difficult to collect at scale. Additionally, asking people to write such examples—particularly the toxic ones—can take a mental toll on those assigned the task. 

Inspired by the ability of large language models to mimic the tone, style, and vocabulary of prompts they receive—whether toxic or neutral—we set out to create a dataset for training content moderation tools that can be used to better flag implicitly harmful language. In our paper “ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection,” we collected initial examples of neutral statements with group mentions and examples of implicit hate speech across 13 minority identity groups and used a large-scale language model to scale up and guide the generation process. The outcome is the largest publicly available implicit hate speech dataset to date: 274,000 examples comprising both neutral and toxic statements. We conducted a human study on the generated dataset to better understand different aspects of harm beyond the binary labels of toxic and neutral assigned by content moderation tools. To stress test existing content moderation tools across the minority identity groups studied in this work, we also propose an adversarial classifier-in-the-loop decoding approach. The dataset, two content moderation tools trained on the dataset, the prompts used as seed data, and the source code for our proposed adversarial decoding approach are available in the ToxiGen GitHub repo (please see footnote).

We’re presenting this work at the 2022 Meeting of the Association for Computational Linguistics (ACL), where our colleagues will also be presenting work that leverages the generative power of large language models and human expertise. 

A horizontal chart comparing the proportion of minority identity group mentions in the prompts with the minority identity group mentions in the generated text for the 13 minority identity groups in this work: Black, Mexican, people with physical disabilities, LGBTQ+, people with cognitive disabilities, Chinese, Muslim, Jewish, Middle Eastern, Women, Asian, Native American, and Latino.
Figure 1: The ToxiGen dataset—an implicit hate speech dataset created by using a large-scale language model with both regular and adversarial decoding to scale up and guide the generation process—contains 274,000 examples comprising both neutral and toxic statements across 13 minority identity groups. As illustrated above, mentions of a specific minority identity group in the prompts and mentions of the same minority identity group in the corresponding generated text are proportional.

Demonstration-based prompting for building better datasets

Large Transformer-based language models don’t explicitly encode semantic information; nevertheless, these models can capture the statistical interactions of words in different contexts. Through experimentation with generating language via one of these large language models, we learned how to use careful prompt engineering strategies to create the ToxiGen implicit hate speech dataset. 

Our first experiments were to generate examples of hate speech and neutral speech related to the 13 minority identity groups in our work. We started by collecting implicit hate speech prompts from existing datasets and neutral prompts drawn from news articles, opinion pieces, podcast transcripts, and other similar public sources and feeding them into the LLM to create a broader, deeper set of prompts. What we found was that the LLM could generate examples that were qualitatively different depending on the source material. When prompted with bits from different writers on the above topics, in each case, the LLM produced linguistically diverse outputs that were nonetheless similar in style and tone. 

Furthermore, we found that through careful cultivation of prompt sets, we could generate a wide variety of text reflecting diverse opinions and thoughts on these topics that weren’t found in our original source materials. We could generate neutral statements about sensitive topics that mentioned the relevant minority identity groups, and we could consistently generate hate speech statements about these minority identity groups that didn’t contain slurs or profanity. And the more we experimented with the source material, the more interesting our dataset became. This is particularly exciting because we hope that other individuals and groups can use these tools to extend our dataset; different disciplinary experts could utilize the same strategies and collect even better prompt sets, resulting in even more subtle and rich examples of neutral speech and hate speech. 
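
In code, demonstration-based prompting is very simple. The sketch below shows its general shape: the demonstration statements are placeholders, the llm_complete call stands in for a real LLM API, and our exact prompt format is described in the paper.

    import random

    # Placeholder demonstrations; real prompt sets are curated per group
    # and per style (neutral or implicitly toxic).
    neutral_demonstrations = [
        "many deaf people attend mainstream schools with interpreters",
        "sign languages are full natural languages with their own grammar",
        "the deaf community has a rich culture and history",
    ]

    def build_prompt(demonstrations, k=3):
        """Sample k demonstrations; the LLM then continues in the same
        style, tone, and implicit stance, yielding new statements."""
        picked = random.sample(demonstrations, k)
        return "- " + "\n- ".join(picked) + "\n- "

    prompt = build_prompt(neutral_demonstrations)
    # new_statement = llm_complete(prompt)   # hypothetical LLM call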

We also found that the model often generated examples of speech that we ourselves had trouble labeling. In essence, we were using the LLM as a probe to explore the delicate boundaries between acceptable and offensive speech. As a result, our own understanding of the problem definition itself grew through our interactions with the model.  

The first 260,000 examples from our dataset were drawn from this experimental approach. 

Examples of statements generated by (De)ToxiGen that fool Google’s Perspective API, HateBERT, OpenAI content filter, AI2 Delphi, and RoBERTa.
Figure 2: Examples of statements generated by (De)ToxiGen that fool Google’s Perspective API, HateBERT, OpenAI content filter, AI2 Delphi, and RoBERTa. Five statements are neutral but mention minority identity groups, so the content moderation tools find them hateful. Five are toxic sentences, but the tools find them neutral. The proposed decoding approach, (De)ToxiGen (referred to as ALICE in the paper), can challenge these content moderation tools, allowing developers to increase their coverage by creating adversarial examples. 

(De)ToxiGen: An adversarial decoding approach for strengthening content moderation tools

While demonstration-based prompting can facilitate large-scale data generation, it doesn’t generate data targeted specifically to challenge a given content moderation tool, or content classifier. This is important because every content moderation tool has unique vulnerabilities depending on the type of data it has been trained on. To address this, we developed (De)ToxiGen (referred to as ALICE in the paper), an algorithmic mechanism that creates an adversarial set-up between an LLM and a given content moderation tool in which the content classifier is in the loop during decoding.  

The proposed approach can increase or decrease the likelihood that a generated statement is classified as hate speech while maintaining the coherence of the generated language. It can generate both false negatives and false positives for a given content moderation tool. For false negatives, toxic prompts are used to elicit toxic responses, and then the tool’s probability of the neutral class is maximized during decoding. Similarly, to generate false positives, neutral prompts are used to generate neutral responses, and then the probability of the toxic class is maximized during decoding. With this approach, we’re essentially trying to reveal weaknesses in a specific content moderation tool by guiding the LLM to produce statements that we know the tool will misidentify. The generated data can then be used to improve the performance and coverage of the targeted content moderation tool. Our ToxiGen dataset includes data generated by both demonstration-based prompting and our proposed adversarial decoding approach. Through empirical study on three existing human-written datasets, we found that starting with an existing content moderation tool and fine-tuning it on ToxiGen can improve the tool’s performance significantly, demonstrating the quality of the machine-generated data in ToxiGen.  
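
At a high level, a single decoding step of this approach can be sketched as follows. We assume a Hugging Face-style causal language model and a classifier wrapper that returns class logits for a partial statement; beam search and the other details from the paper are omitted.

    import torch

    # Simplified single-step sketch of classifier-in-the-loop decoding.
    # `classifier` is an assumed wrapper returning class logits
    # (e.g., index 0 = neutral, 1 = toxic) for a sequence of token ids.

    def adversarial_decode_step(lm, classifier, input_ids, target_class,
                                top_k=50, lam=1.0):
        logits = lm(input_ids).logits[0, -1]          # next-token scores
        top_scores, top_tokens = torch.topk(logits, top_k)

        rescored = []
        for score, token in zip(top_scores, top_tokens):
            candidate = torch.cat([input_ids, token.view(1, 1)], dim=1)
            # Log-probability the classifier assigns to the target class.
            cls_logprob = classifier(candidate).log_softmax(-1)[0, target_class]
            rescored.append(score + lam * cls_logprob)  # fluency + adversarial push

        best = top_tokens[torch.stack(rescored).argmax()]
        return torch.cat([input_ids, best.view(1, 1)], dim=1)

Maximizing the neutral class on toxic prompts mines false negatives; maximizing the toxic class on neutral prompts mines false positives.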

Human evaluation: Better understanding the data

Human language is complex, particularly when it comes to harmful statements. To better understand different aspects of the data in ToxiGen—its perceived harmfulness and intent and whether it presents as fact or opinion, for example—we conducted human evaluations on the data generated by both regular decoding (top-k), used in the demonstration-based prompting, and the proposed adversarial decoding. The human evaluation also allowed us to test the quality of the output of these methods and gauge how effective these methods were in guiding the generation of the data we sought. 

For the human evaluation, three annotators were used for each statement from a pool of 156 prequalified annotators with prior experience annotating toxic language. About 4,500 samples were randomly selected for each of the decoding methods with coverage across all 13 minority identity groups for each split. We found the following: 

  1. For both decoding methods, minority identity group mentions included in the prompt also exist in the generated statements. This means that both data generation methods reliably produce the data they were designed to produce—hateful and neutral statements with explicit reference to the specified minority identity group.
  2. In the neutral case, the label of the prompt matches the generated text more often than in the toxic case, as shown in Figure 3a. 
  3. The proposed decoding approach generates a higher percentage of adversarial text compared to regular decoding—that is, it produces data that is more likely to fool a given content moderation tool—as illustrated in Figure 3b. 
Two bar charts side by side. The one on the left, titled “Prompt-Response Matching,” shows that top-k decoding produces non-toxic responses 95.2 percent of the time when given a non-toxic prompt compared with 92.1 percent for (De)ToxiGen and that top-k decoding produces toxic responses 67.7 percent of the time when given a toxic prompt compared with 40.3 percent for (De)ToxiGen. The bar chart on the right, titled “Adversarial Power,” shows that statements generated by (De)ToxiGen fool HateBERT 26.4 percent of the time compared with 16.8 percent for statements generated via top-k decoding.
Figure 3a (left) and 3b (right): Human evaluations on the data generated by regular decoding (top-k) and the proposed adversarial decoding showed that the toxicity labels for the prompt and the generated response match more often for non-toxic prompts compared to toxic ones (left). It was also observed that (De)ToxiGen generates a higher percentage of adversarial text compared to regular decoding (right). 
  4. 90.5 percent of machine-generated examples were thought to be human-written by the majority of annotators.
  5. Perceived harmfulness with respect to human- or AI-authored text is similar. 

Looking ahead: Societal implications and opportunities

As advances continue to be made in large language models, we remain vigilant in our pursuit of AI systems that align with our commitment to technology that benefits society as a whole and empowers everyone to achieve more. We’re beginning to ask better questions to more deeply understand the risks associated with LLMs and build processes and methods for addressing them. Existing content moderation tools tend to be good only at flagging overtly inappropriate or harmful language. Our work aims to create data that can better target the challenge. While our work here specifically explores hate speech, our proposed methods could be applied to a variety of content moderation challenges, such as flagging potential misinformation content. By releasing the source code and prompt seeds for this work, we hope to encourage the research community to contribute to it by, for example, adding prompt seeds and generating data for minority identity groups that aren’t covered in our dataset. 

As with many technologies, the solutions we develop to make them stronger, more secure, and less vulnerable also have the potential to be used in unintended ways. While the methods described here may be used to generate inappropriate or harmful language, we believe that they provide far greater value in helping to combat such language, resulting in content moderation tools that can be used alongside human guidance to support fairer, safer, more reliable, and more inclusive AI systems.  

Considerations for responsible use

There is still a lot that this dataset does not capture about what constitutes problematic language, and its limitations should be acknowledged before it is used. Our annotations might not capture the full complexity of these issues, given that problematic language is context-dependent, dynamic, and can manifest in different forms and severities. Content moderation tools aren’t a silver bullet for addressing harmful online content. Problematic language is fundamentally a human-centric problem. It should be studied in conjunction with human experience, and tools to address it should be developed and deployed with human expertise and well-informed regulatory processes and policy. Multidisciplinary work is needed to better understand the aspects of this challenge.  

Also, this dataset captures implicit toxicity (more precisely, hate speech) only for 13 minority identity groups, and due to its large scale it naturally has imperfections. Our goal in this project is to provide the community with the means to improve hate speech detection on implicit toxic language for the identified minority identity groups. There are limitations to this dataset, and to models trained on it, that can be the subject of future research: for example, extending coverage to more minority identity groups, or to combinations of groups, that are not covered in our work. Stronger content moderation tools and systems can contribute to mitigating fairness-related harms in AI systems. For example, systems that don’t over-flag neutral statements with minority identity group mentions can help ensure better representation of diverse perspectives and experiences, while systems that can better flag implicit hate speech can support more inclusive technology.   

Acknowledgment 

This work was conducted by PhD students Thomas Hartvigsen and Saadia Gabriel during their internships at Microsoft Azure and Microsoft Research. Hamid Palangi, Dipankar Ray, Maarten Sap, and Ece Kamar served as advisors on the work. A special thanks to Misha Bilenko from Azure ML for making the compute resources available and to Microsoft Research for supporting our large-scale human study. 


Please note: This research, the GitHub repository, and examples from our work included in this blog contain and discuss content that is offensive or upsetting. All materials are intended to support research that improves hate speech detection methods. Included examples of hate speech don’t represent how the authors or sponsors feel about any minority identity groups. Hate speech applies to a range of minority identity groups; for the purposes of this research, we focus on 13 of them (as shown in Figure 1). Content moderation tools are part of larger content moderation systems. These systems also include human expertise and thoughtful policy and regulatory development. Even the most robust content moderation tools and datasets require systems with human supervision. 


MoLeR: Creating a path to more efficient drug design

Drug discovery has come a long way from its roots in serendipity. It is now an increasingly rational process, in which one important phase, called lead optimization, is the stepwise search for promising drug candidate compounds in the lab. In this phase, expert medicinal chemists work to improve “hit” molecules—compounds that demonstrate some promising properties, as well as some undesirable ones, in early screening. In subsequent testing, chemists try to adapt the structure of hit molecules to improve their biological efficacy and reduce potential side effects. This process combines knowledge, creativity, experience, and intuition, and often lasts for years. Over many decades, computational modelling techniques have been developed to help predict how the molecules will fare in the lab, so that costly and time-consuming experiments can focus on the most promising compounds.

Figure 1: Classic human-led drug design (bottom) is an iterative process of proposing new compounds and testing them in vitro. As this process requires synthesis in the lab, it is very costly and time consuming. By using computational modelling (top), molecule design can be rapidly performed in silico, with only the most promising molecules promoted to be made in the lab and then eventually tested in vivo.

The Microsoft Generative Chemistry team is working with Novartis to improve these modelling techniques with a new model called MoLeR. 

“MoLeR illustrates how generative models based on deep learning can help transform the drug discovery process and enable our colleagues at Novartis to increase the efficiency in finding new compounds.”

Christopher Bishop, Technical Fellow and Laboratory Director, Microsoft Research Cambridge

We recently focused on predicting molecular properties using machine learning methods in the FS-Mol project. To further support the drug discovery process, we are also working on methods that can automatically design compounds that better fit project requirements than existing candidate compounds. This is an extremely difficult task, as only a few promising molecules exist in the vast and largely unexplored chemical space—estimated to contain up to 10⁶⁰ drug-like molecules. Just how big is that number? It would be enough molecules to reproduce the Earth billions of times. Finding them requires creativity and intuition that cannot be captured by fixed rules or hand-designed algorithms. This is why learning is crucial not only for the predictive task, as done in FS-Mol, but also for the generative task of coming up with new structures. 

In our earlier work, published at the 2018 Conference on Neural Information Processing Systems (NeurIPS), we described a generative model of molecules called CGVAE. While that model performed well on simple, synthetic tasks, we noted then that further improvements required the expertise of drug discovery specialists. In collaboration with experts at Novartis, we identified two issues limiting the applicability of the CGVAE model in real drug discovery projects: it cannot be naturally constrained to explore only molecules containing a particular substructure (called the scaffold), and it struggles to reproduce key structures, such as complex ring systems, due to its low-level, atom-by-atom generative procedure. To remove these limitations, we built MoLeR, which we describe in our new paper, “Learning to Extend Molecular Scaffolds with Structural Motifs,” published at the 2022 International Conference on Learning Representations (ICLR). 

The MoLeR model

In the MoLeR model, we represent molecules as graphs, in which atoms appear as vertices connected by edges corresponding to the bonds. Our model is trained in the auto-encoder paradigm, meaning that it consists of an encoder—a graph neural network (GNN) that compresses an input molecule into a so-called latent code—and a decoder, which tries to reconstruct the original molecule from this code. As the decoder needs to decompress a short encoding into a graph of arbitrary size, we design the reconstruction process to be sequential: in each step, we extend a partially generated graph by adding new atoms or bonds. A crucial feature of our model is that the decoder makes predictions at each step based solely on the partial graph and the latent code, rather than on its own earlier predictions. We also train MoLeR to construct the same molecule in a variety of different orders, as the construction order is an arbitrary choice. 

Figure 2: Given a latent code, which may come either from encoding a molecule or from sampling the prior distribution, MoLeR learns to decode it step by step. In each step, it extends a given partial molecule by adding atoms, bonds, or entire structural motifs. These choices are guided by graph neural networks (GNNs) trained on construction sequences for molecules in the training dataset. 

As we alluded to earlier, drug molecules are not random combinations of atoms. They tend to be composed of larger structural motifs, much like sentences in a natural language are compositions of words, and not random sequences of letters. Thus, unlike CGVAE, MoLeR first discovers these common building blocks from data, and is then trained to extend a partial molecule using entire motifs (rather than single atoms). Consequently, MoLeR not only needs fewer steps to construct drug-like molecules, but its generation procedure also occurs in steps that are more akin to the way chemists think about the construction of molecules. 

Diagram with two parts (left and right), with an arrow pointing from left to right. The left part shows a molecule, while the right part shows the same molecule divided into chunks representing groups of atoms, which are formed by removing some of the bonds from the original molecule. Each chunk in the right part of the figure has a box around it.
Figure 3: Motif extraction strategy applied to Imatinib (a drug developed by Novartis, shown on the left) converts it into a collection of common building blocks and individual atoms (shown on the right, with motifs in red boxes and remaining atoms in blue ones). 
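
As a rough, runnable illustration of breaking a molecule into larger building blocks, the snippet below applies RDKit’s BRICS decomposition. Note that BRICS is a different, rule-based fragmentation scheme; MoLeR instead mines its motif vocabulary from frequency statistics over the training data, as described in the paper.

    from rdkit import Chem
    from rdkit.Chem import BRICS

    # Fragment aspirin into building blocks (illustration only; MoLeR's
    # motif extraction is frequency-based, not BRICS).
    aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
    for fragment in sorted(BRICS.BRICSDecompose(aspirin)):
        print(fragment)  # SMILES of each fragment, with attachment points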

Drug-discovery projects often focus on a specific subset of the chemical space, by first defining a scaffold—a central part of the molecule that has already shown promising properties—and then exploring only those compounds that contain the scaffold as a subgraph. The design of MoLeR’s decoder allows us to seamlessly integrate an arbitrary scaffold by using it as an initial state in the decoding loop. As we randomize the generation order during training, MoLeR implicitly learns to complete arbitrary subgraphs, making it ideal for focused scaffold-based exploration. 

Diagram showing a 5x5 grid, with each cell depicting one molecule. The molecule in the middle has a box around it. All the molecules are different, but relatively similar, and all contain a particular substructure, which is marked in red.
Figure 4: Given a molecule (shown in the box in the center) containing a particular scaffold of interest (highlighted in red), MoLeR can traverse its scaffold-constrained latent space, and propose “neighbors” of the given molecule that have similar structure and properties. 
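
The released molecule-generation package exposes this workflow directly. The snippet below follows the pattern from the project README; the checkpoint path is a placeholder and keyword names may differ across releases, so check the repository for the authoritative API.

    from molecule_generation import load_model_from_directory

    model_dir = "./moler_checkpoint_directory"  # placeholder path

    with load_model_from_directory(model_dir) as model:
        # Encode molecules (given as SMILES) into MoLeR's latent space...
        embeddings = model.encode(["c1ccccc1"])
        # ...and decode latent codes back into molecules, optionally
        # constraining every output to contain a given scaffold.
        print(model.decode(embeddings, scaffolds=["c1ccccc1"]))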

Optimization with MoLeR

Even after training our model as discussed above, MoLeR has no notion of “optimization” of molecules. However, like related approaches, we can perform optimization in the space of latent codes using an off-the-shelf black-box optimization algorithm. This was not possible with CGVAE, which used a much more complicated encoding of graphs. In our work, we opted for using Molecular Swarm Optimization (MSO), which shows state-of-the-art results for latent space optimization in other models, and indeed we found it to work very well for MoLeR. In particular, we evaluated optimization with MSO and MoLeR on new benchmark tasks that are similar to realistic drug discovery projects using large scaffolds and found this combination to outperform existing models. 
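
Stripped of the particle-swarm machinery, latent-space optimization reduces to a simple loop. The sketch below substitutes naive hill climbing for MSO; model.decode (mapping latent codes to SMILES strings) and the black-box score oracle are assumed to exist.

    import numpy as np

    def optimize_in_latent_space(model, score, z_init, iterations=200,
                                 step=0.1, seed=0):
        """Greedy stand-in for MSO: perturb the latent code and keep the
        move whenever the decoded molecule's property score improves."""
        rng = np.random.default_rng(seed)
        z_best = z_init
        best = score(model.decode([z_best])[0])
        for _ in range(iterations):
            z_cand = z_best + step * rng.standard_normal(z_best.shape)
            s = score(model.decode([z_cand])[0])
            if s > best:
                z_best, best = z_cand, s
        return model.decode([z_best])[0], best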

Outlook

We continue to work with Novartis to focus machine learning research on problems relevant to the real-world drug discovery process. The early results are substantially better than those of competing methods, including our earlier CGVAE model. With time, we hope MoLeR-generated compounds will reach the final stages of drug-discovery projects, eventually contributing to new useful drugs that benefit humanity.