
Powering the next generation of trustworthy AI in a confidential cloud using NVIDIA GPUs

Animation showing the process of how encrypted data is transferred between the GPU driver and the GPU through a secure channel. The GPU driver on the host CPU and the SEC2 microcontroller on the NVIDIA A100 Tensor Core GPU work together to achieve end-to-end encryption of data transfers.

Cloud computing is powering a new age of data and AI by democratizing access to scalable compute, storage, and networking infrastructure and services. Thanks to the cloud, organizations can now collect data at an unprecedented scale and use it to train complex models and generate insights.  

While this increasing demand for data has unlocked new possibilities, it also raises concerns about privacy and security, especially in regulated industries such as government, finance, and healthcare. One area where data privacy is crucial is patient records, which are used to train models to aid clinicians in diagnosis. Another example is in banking, where models that evaluate borrower creditworthiness are built from increasingly rich datasets, such as bank statements, tax returns, and even social media profiles. This data contains very personal information, and to ensure that it’s kept private, governments and regulatory bodies are implementing strong privacy laws and regulations to govern the use and sharing of data for AI, such as the General Data Protection Regulation (GDPR) and the proposed EU AI Act. You can learn more about some of the industries where it’s imperative to protect sensitive data in this Microsoft Azure Blog post.

Commitment to a confidential cloud

Microsoft recognizes that trustworthy AI requires a trustworthy cloud—one in which security, privacy, and transparency are built into its core. A key component of this vision is confidential computing—a set of hardware and software capabilities that give data owners technical and verifiable control over how their data is shared and used. Confidential computing relies on a new hardware abstraction called trusted execution environments (TEEs). In TEEs, data remains encrypted not just at rest or during transit, but also during use. TEEs also support remote attestation, which enables data owners to remotely verify the configuration of the hardware and firmware supporting a TEE and grant specific algorithms access to their data.  

At Microsoft, we are committed to providing a confidential cloud, where confidential computing is the default for all cloud services. Today, Azure offers a rich confidential computing platform comprising different kinds of confidential computing hardware (Intel SGX, AMD SEV-SNP), core confidential computing services like Azure Attestation and Azure Key Vault managed HSM, and application-level services such as Azure SQL Always Encrypted, Azure confidential ledger, and confidential containers on Azure. However, these offerings are limited to using CPUs. This poses a challenge for AI workloads, which rely heavily on AI accelerators like GPUs to provide the performance needed to process large amounts of data and train complex models.  

The Confidential Computing group at Microsoft Research identified this problem and defined a vision for confidential AI powered by confidential GPUs, proposed in two papers, “Oblivious Multi-Party Machine Learning on Trusted Processors” and “Graviton: Trusted Execution Environments on GPUs.” In this post, we share this vision. We also take a deep dive into the NVIDIA GPU technology that’s helping us realize this vision, and we discuss the collaboration among NVIDIA, Microsoft Research, and Azure that enabled NVIDIA GPUs to become a part of the Azure confidential computing ecosystem.

Vision for confidential GPUs

Today, CPUs from companies like Intel and AMD allow the creation of TEEs, which can isolate a process or an entire guest virtual machine (VM), effectively eliminating the host operating system and the hypervisor from the trust boundary. Our vision is to extend this trust boundary to GPUs, allowing code running in the CPU TEE to securely offload computation and data to GPUs.  

Diagram showing the trust boundary extended from the host trusted execution environment of the CPU to the trusted execution environment of the GPU through a secure channel.
Figure 1: Vision for confidential computing with NVIDIA GPUs.

Unfortunately, extending the trust boundary is not straightforward. On the one hand, we must protect against a variety of attacks, such as man-in-the-middle attacks, where the attacker can observe or tamper with traffic on the PCIe bus or on an NVIDIA NVLink connecting multiple GPUs, and impersonation attacks, where the host assigns the guest VM an incorrectly configured GPU, a GPU running outdated or malicious firmware, or a GPU without confidential computing support. On the other hand, we must ensure that the Azure host operating system retains enough control over the GPU to perform administrative tasks. Furthermore, the added protection must not introduce large performance overheads, increase thermal design power, or require significant changes to the GPU microarchitecture.  

Our research shows that this vision can be realized by extending the GPU with the following capabilities:

  • A new mode where all sensitive state on the GPU, including GPU memory, is isolated from the host
  • A hardware root-of-trust on the GPU chip that can generate verifiable attestations capturing all security-sensitive state of the GPU, including all firmware and microcode 
  • Extensions to the GPU driver to verify GPU attestations, set up a secure communication channel with the GPU, and transparently encrypt all communications between the CPU and GPU 
  • Hardware support to transparently encrypt all GPU-GPU communications over NVLink  
  • Support in the guest operating system and hypervisor to securely attach GPUs to a CPU TEE, even if the contents of the CPU TEE are encrypted

Confidential computing with NVIDIA A100 Tensor Core GPUs

NVIDIA and Azure have taken a significant step toward realizing this vision with a new feature called Ampere Protected Memory (APM) in the NVIDIA A100 Tensor Core GPUs. In this section, we describe how APM supports confidential computing within the A100 GPU to achieve end-to-end data confidentiality.  

APM introduces a new confidential mode of execution in the A100 GPU. When the GPU is initialized in this mode, the GPU designates a region in high-bandwidth memory (HBM) as protected and helps prevent leaks through memory-mapped I/O (MMIO) access into this region from the host and peer GPUs. Only authenticated and encrypted traffic is permitted to and from the region.  

In confidential mode, the GPU can be paired with any external entity, such as a TEE on the host CPU. To enable this pairing, the GPU includes a hardware root-of-trust (HRoT). NVIDIA provisions the HRoT with a unique identity and a corresponding certificate created during manufacturing. The HRoT also implements authenticated and measured boot by measuring the firmware of the GPU as well as that of other microcontrollers on the GPU, including a security microcontroller called SEC2. SEC2, in turn, can generate attestation reports that include these measurements and that are signed by a fresh attestation key, which is endorsed by the unique device key. These reports can be used by any external entity to verify that the GPU is in confidential mode and running known-good firmware.  

When the NVIDIA GPU driver in the CPU TEE loads, it checks whether the GPU is in confidential mode. If so, the driver requests an attestation report and checks that the GPU is a genuine NVIDIA GPU running known-good firmware. Once confirmed, the driver sets up a secure channel with the SEC2 microcontroller on the GPU, using a Security Protocol and Data Model (SPDM)-backed Diffie-Hellman key exchange to establish a fresh session key. When that exchange completes, both the GPU driver and SEC2 hold the same symmetric session key.  
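To make the sequence concrete, here is a minimal driver-side sketch of that handshake in Python. It is an illustration only: the `gpu` object and its query, report, and key-exchange calls are hypothetical placeholders rather than the real NVIDIA driver interface, and the `cryptography` package stands in for the actual SPDM message flow.

```python
# Illustrative sketch of the handshake, not the actual NVIDIA driver implementation.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey, X25519PublicKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

def establish_gpu_session(gpu, trusted_firmware_measurements):
    # 1. Refuse to continue unless the GPU reports that it is in confidential mode.
    if not gpu.in_confidential_mode():                       # hypothetical call
        raise RuntimeError("GPU is not in confidential mode")

    # 2. Verify the attestation report signed by the GPU's attestation key,
    #    which is endorsed by the unique device identity provisioned at manufacturing.
    report = gpu.get_attestation_report()                    # hypothetical call
    if not report.verify_signature(gpu.device_certificate):  # hypothetical call
        raise RuntimeError("Attestation report does not chain to a genuine device")
    if report.firmware_measurements not in trusted_firmware_measurements:
        raise RuntimeError("GPU is running unknown firmware")

    # 3. Diffie-Hellman key agreement with SEC2 (SPDM uses a similar key-agreement step).
    driver_key = X25519PrivateKey.generate()
    driver_public_bytes = driver_key.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)
    peer_public = X25519PublicKey.from_public_bytes(
        gpu.exchange_public_key(driver_public_bytes)         # hypothetical call
    )
    shared_secret = driver_key.exchange(peer_public)

    # 4. Derive the symmetric session key now shared by the driver and SEC2.
    return HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                info=b"cpu-tee/gpu-session").derive(shared_secret)
```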

The GPU driver uses the shared session key to encrypt all subsequent data transfers to and from the GPU. Because pages allocated to the CPU TEE are encrypted in memory and not readable by the GPU DMA engines, the GPU driver allocates pages outside the CPU TEE and writes encrypted data to those pages. On the GPU side, the SEC2 microcontroller is responsible for decrypting the encrypted data transferred from the CPU and copying it to the protected region. Once the data is in HBM in cleartext, the GPU kernels can freely use it for computation.
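The data path itself can be sketched in a few lines, again as an illustration rather than the real driver code: AES-GCM stands in for the production cipher, and the staging-buffer and DMA objects are hypothetical.

```python
# Illustrative sketch of an encrypted CPU-to-GPU transfer (not the real driver code).
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def transfer_to_gpu(session_key, plaintext, staging_buffer, gpu):
    # The plaintext exists only inside the CPU TEE; it is sealed before it touches
    # the unprotected staging pages that the GPU DMA engines can actually read.
    aesgcm = AESGCM(session_key)
    nonce = os.urandom(12)
    sealed = nonce + aesgcm.encrypt(nonce, plaintext, None)

    staging_buffer.write(sealed)       # pages allocated outside the CPU TEE (hypothetical)
    gpu.dma_from(staging_buffer)       # hypothetical DMA kick-off

    # On the device, SEC2 performs the mirror-image step: it authenticates and decrypts
    # the payload with the same session key, then copies the cleartext into protected HBM,
    # where only GPU kernels can use it.
```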

Diagram showing how the GPU driver on the host CPU and the SEC2 microcontroller on the NVIDIA Ampere GPU work together to achieve end-to-end encryption of data transfers.
Figure 2: The GPU driver on the host CPU and the SEC2 microcontroller on the NVIDIA A100 Tensor Core GPU work together to achieve end-to-end encryption of data transfers.

Accelerating innovation with confidential AI

The implementation of APM is an important milestone toward achieving broader adoption of confidential AI in the cloud and beyond. APM is the foundational building block of Azure Confidential GPU VMs, now in private preview. These VMs, designed through collaboration among NVIDIA, Azure, and Microsoft Research, feature up to four A100 GPUs with 80 GB of HBM and APM technology, and they enable users to host AI workloads on Azure with a new level of security.  

But this is just the beginning. We look forward to taking our collaboration with NVIDIA to the next level with NVIDIA’s Hopper architecture, which will enable customers to protect both the confidentiality and integrity of data and AI models in use. We believe that confidential GPUs can enable a confidential AI platform where multiple organizations can collaborate to train and deploy AI models by pooling together sensitive datasets while remaining in full control of their data and models. Such a platform can unlock the value of large amounts of data while preserving data privacy, giving organizations the opportunity to drive innovation.  

A real-world example involves Bosch Research, the research and advanced engineering division of Bosch, which is developing an AI pipeline to train models for autonomous driving. Much of the data it uses includes personally identifiable information (PII), such as license plate numbers and people’s faces. At the same time, it must comply with GDPR, which requires a legal basis for processing PII, namely, consent from data subjects or legitimate interest. The former is challenging because it is practically impossible to get consent from pedestrians and drivers recorded by test cars. Relying on legitimate interest is challenging too because, among other things, it requires showing that there is no less privacy-intrusive way of achieving the same result. This is where confidential AI shines: using confidential computing can help reduce risks for data subjects and data controllers by limiting exposure of data (for example, to specific algorithms), while enabling organizations to train more accurate models.   

At Microsoft Research, we are committed to working with the confidential computing ecosystem, including collaborators like NVIDIA and Bosch Research, to further strengthen security, enable seamless training and deployment of confidential AI models, and help power the next generation of technology.

About confidential computing at Microsoft Research  

The Confidential Computing team at Microsoft Research Cambridge conducts pioneering research in system design that aims to guarantee strong security and privacy properties to cloud users. We tackle problems around secure hardware design, cryptographic and security protocols, side channel resilience, and memory safety. We are also interested in new technologies and applications that security and privacy can uncover, such as blockchains and multiparty machine learning. Please visit our careers page to learn about opportunities for both researchers and engineers. We’re hiring.


New Z-code Mixture of Experts models improve quality, efficiency in Translator and Azure AI

Microsoft is making upgrades to Translator and other Azure AI services powered by a new family of artificial intelligence models its researchers have developed called Z-code, which offer the kind of performance and quality benefits that other large-scale language models have but can be run much more efficiently.

“Our goal is to help everyone and every organization on the planet to communicate better, and to achieve that goal there are really two important dimensions — we want the quality of translations to be as good as possible and we want to support as many languages as possible,” said Xuedong Huang, Microsoft technical fellow and Azure AI chief technology officer.

Z-code takes advantage of shared linguistic elements across multiple languages via transfer learning, which applies knowledge from one task to another related task, to improve quality for machine translation and other language understanding tasks. It also helps extend those capabilities beyond the most common languages across the globe to underrepresented languages that have less available training data.

“With Z-code we are really making amazing progress because we are leveraging both transfer learning and multitask learning from monolingual and multilingual data to create a state-of-the-art language model that we believe has the best combination of quality, performance and efficiency that we can provide to our customers,” Huang said.

These models use a sparse “Mixture of Experts” approach that is more efficient to run because it only needs to engage a portion of the model to complete a task, as opposed to other architectures that have to activate an entire AI model to run every request. This architecture allows massive scale in the number of model parameters while keeping the amount of compute constant.
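As a rough illustration of that sparse routing idea, here is a small NumPy sketch of a Mixture of Experts layer with top-2 gating; the sizes, the tanh experts, and the top-2 choice are assumptions made for the example, not Z-code’s actual configuration.

```python
import numpy as np

def moe_layer(x, expert_weights, router_weights, k=2):
    """Sparse Mixture of Experts: only k of the experts run for this token."""
    logits = x @ router_weights                       # router scores, one per expert
    top_k = np.argsort(logits)[-k:]                   # indices of the k highest-scoring experts
    gates = np.exp(logits[top_k] - logits[top_k].max())
    gates /= gates.sum()                              # softmax over the selected experts only

    # Compute scales with k, not with the total expert count, so parameters can grow
    # while the per-token amount of compute stays roughly constant.
    return sum(g * np.tanh(x @ expert_weights[i]) for g, i in zip(gates, top_k))

rng = np.random.default_rng(0)
d_model, num_experts = 16, 8
experts = rng.standard_normal((num_experts, d_model, d_model)) * 0.1
router = rng.standard_normal((d_model, num_experts)) * 0.1
print(moe_layer(rng.standard_normal(d_model), experts, router).shape)   # (16,)
```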

To put these models in production, Microsoft is using NVIDIA GPUs and Triton Inference Server to deploy and scale them efficiently for high-performance inference.

Microsoft has recently deployed Z-code models to improve common language understanding tasks such as named entity recognition, text summarization, custom text classification and key phrase extraction across its Azure AI services. But this is the first time a company has publicly demonstrated that it can use this new class of Mixture of Experts models to power machine translation products.

The new Z-code-based translation model is now available, by invitation initially, to customers using document translation in Translator, a Microsoft Azure Cognitive Service which is a part of Azure AI.

Microsoft’s Z-code models consistently improved translation quality over current production models, according to common industry metrics. Unlike typical multilingual transfer learning approaches, which tend to show quality gains mainly in languages that have fewer direct translation examples available for training, the Z-code Mixture of Experts models show consistent gains even in the largest languages.

A chart shows percentage improvements in translation quality across 37 different language pairs from Translator’s old AI models to a new class of models called Z-code.
New Z-code Mixture of Experts AI models are powering improvements and efficiencies in Translator and other Azure AI services.

Human evaluators in a blind test commissioned by Microsoft found that the Z-code Mixture of Experts models improved translations across languages, with an average gain of 4%. For instance, the models improved English to French translations by 3.2%, English to Turkish by 5.8%, Japanese to English by 7.6%, English to Arabic by 9.3% and English to Slovenian by 15%.

Creating more powerful and integrative AI systems

Z-code is part of Microsoft’s larger XYZ-code initiative that seeks to combine models for text, vision, audio and multiple languages to create more powerful and integrative AI systems that can speak, hear, see and understand people better.

Over the past five years, Microsoft has developed models that have matched human performance in conversational speech recognition, machine translation, image captioning, SuperGLUE natural language understanding and commonsense question answering. These breakthroughs provide the foundation to realize more ambitious AI systems that can achieve multisensory and multilingual learning that is closer to how people learn and understand, Huang said.

“Those are the pieces, the building blocks that we are using to build a truly differentiated intelligence…and to form production systems that are cost efficient,” Huang said.

Z-code models were developed as part of Microsoft’s AI at Scale and Turing initiatives, which seek to develop large models that are pretrained on vast amounts of textual data to understand nuances of language — which can be integrated in multiple Microsoft products and also made available to customers for their own uses.

The same underlying model can be fine-tuned to perform different language understanding tasks such as translating between languages, summarizing a speech, offering ways to complete a sentence or generating suggested tweets, instead of having to develop separate models for each of those narrow purposes.
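A toy sketch of that pattern is shown below: one shared backbone feeds several small task-specific heads. All shapes, the averaging "encoder," and the head names are invented for illustration and do not reflect the actual Z-code architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 32, 1000
backbone_embeddings = rng.standard_normal((vocab, d_model)) * 0.02   # stand-in for a pretrained model

def shared_encoder(token_ids):
    # A real backbone would be a large pretrained Transformer; here we just average embeddings.
    return backbone_embeddings[token_ids].mean(axis=0)

# Each downstream task gets a small head on top of the same shared representation,
# instead of a separate model trained from scratch for every narrow purpose.
task_heads = {
    "translation": rng.standard_normal((d_model, vocab)),   # next target-language token logits
    "summarization": rng.standard_normal((d_model, 1)),     # keep/drop score for a sentence
    "completion": rng.standard_normal((d_model, vocab)),    # next-token logits for suggestions
}

hidden = shared_encoder(np.array([12, 7, 530]))
for task, head in task_heads.items():
    print(task, (hidden @ head).shape)
```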


Microsoft AI model surpasses human performance on benchmark test for natural language understanding

Natural language understanding (NLU) is one of the longest running goals in AI, and SuperGLUE is currently among the most challenging benchmarks for evaluating NLU models. The benchmark consists of a wide range of NLU tasks, including question answering, natural language inference, co-reference resolution, word sense disambiguation, and others. Take the causal reasoning task (COPA in Figure 1) as an example. Given the premise “the child became immune to the disease” and the question “what’s the cause for this?,” the model is asked to choose an answer from two plausible candidates: 1) “he avoided exposure to the disease” and 2) “he received the vaccine for the disease.” While it is easy for a human to choose the right answer, it is challenging for an AI model. To get the right answer, the model needs to understand the causal relationship between the premise and those plausible options.
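To make the shape of the task concrete, here is a small sketch of a COPA-style instance and a scoring loop; the word-overlap scorer is a deliberately naive stand-in for a real pretrained model, included only so the example runs end to end.

```python
copa_example = {
    "premise": "The child became immune to the disease.",
    "question": "cause",
    "choices": [
        "He avoided exposure to the disease.",
        "He received the vaccine for the disease.",
    ],
    "label": 1,   # the second candidate is the correct cause
}

def plausibility(premise, question, choice):
    # Naive stand-in for a real model: count shared words between premise and choice.
    # An actual NLU model would estimate how plausible the causal link is.
    return len(set(premise.lower().split()) & set(choice.lower().split()))

def predict(example, score_fn=plausibility):
    scores = [score_fn(example["premise"], example["question"], c)
              for c in example["choices"]]
    return max(range(len(scores)), key=scores.__getitem__)

# The naive scorer may well pick the wrong choice, which is exactly why the task is hard.
print("predicted:", predict(copa_example), "expected:", copa_example["label"])
```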

Since the benchmark’s release in 2019, top research teams around the world have been developing large-scale pretrained language models (PLMs) that have driven striking performance improvements on SuperGLUE. Microsoft recently updated the DeBERTa model by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass human performance on SuperGLUE for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE benchmark rankings, outperforming the human baseline by a decent margin (90.3 versus 89.8). The model also sits at the top of the GLUE benchmark rankings with a macro-average score of 90.8.

Microsoft will release the 1.5-billion-parameter DeBERTa model and the source code to the public. In addition, DeBERTa is being integrated into the next version of the Microsoft Turing natural language representation model (Turing NLRv4). Our Turing models converge all language innovation across Microsoft, and they are then trained at large scale to support products like Bing, Office, Dynamics, and Azure Cognitive Services, powering a wide range of scenarios involving human-machine and human-human interactions via natural language (such as chatbots, recommendations, question answering, search, personal assistants, customer support automation, content generation, and others) to benefit hundreds of millions of users through the Microsoft AI at Scale initiative.

Figure 1: The SuperGLUE leaderboard as of January 6th, 2021.

DeBERTa (Decoding-enhanced BERT with disentangled attention) is a Transformer-based neural language model pretrained on large amounts of raw text corpora using self-supervised learning. Like other PLMs, DeBERTa is intended to learn universal language representations that can be adapted to various downstream NLU tasks. DeBERTa improves previous state-of-the-art PLMs (for example, BERT, RoBERTa, UniLM) using three novel techniques (illustrated in Figure 2): a disentangled attention mechanism, an enhanced mask decoder, and a virtual adversarial training method for fine-tuning.

Figure 2: The architecture of DeBERTa. DeBERTa improves the BERT and RoBERTa models by 1) using a disentangled attention mechanism where each word is represented using two vectors that encode its content and relative position, respectively, and 2) an enhanced mask decoder.

Disentangled attention: a two-vector approach to content and position embedding

Unlike BERT, where each word in the input layer is represented using a single vector that sums its word (content) embedding and its position embedding, each word in DeBERTa is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices based on their contents and relative positions. This is motivated by the observation that the attention weight (which measures the strength of word-word dependency) of a word pair depends not only on their contents but also on their relative positions. For example, the dependency between the words “deep” and “learning” is much stronger when they occur next to each other than when they occur in different sentences.
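The decomposition can be written out directly; the NumPy sketch below computes the three score terms (content-to-content, content-to-position, and position-to-content) for a short sequence. The dimensions, the distance clipping, and the single attention head are simplifications for illustration.

```python
import numpy as np

def disentangled_attention_scores(H, P, Wq_c, Wk_c, Wq_r, Wk_r, max_dist=3):
    """H: (n, d) content states; P: (2*max_dist + 1, d) relative-position embeddings."""
    n, d = H.shape
    Qc, Kc = H @ Wq_c, H @ Wk_c               # content projections
    Qr, Kr = P @ Wq_r, P @ Wk_r               # relative-position projections

    def delta(i, j):                           # clipped relative distance -> embedding index
        return int(np.clip(i - j, -max_dist, max_dist)) + max_dist

    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            scores[i, j] = (
                Qc[i] @ Kc[j]                  # content-to-content
                + Qc[i] @ Kr[delta(i, j)]      # content-to-relative-position
                + Kc[j] @ Qr[delta(j, i)]      # relative-position-to-content
            )
    return scores / np.sqrt(3 * d)             # scaled as in the DeBERTa paper

rng = np.random.default_rng(0)
n, d = 5, 8
H = rng.standard_normal((n, d))
P = rng.standard_normal((2 * 3 + 1, d))
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
raw = np.exp(disentangled_attention_scores(H, P, *W))
attention = raw / raw.sum(axis=1, keepdims=True)   # row-wise softmax gives the attention weights
```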

Enhanced mask decoder accounts for absolute word positions

Like BERT, DeBERTa is pretrained using masked language modeling (MLM). MLM is a fill-in-the-blank task, where a model is taught to use the words surrounding a mask token to predict what the masked word should be. DeBERTa uses the content and position information of the context words for MLM. The disentangled attention mechanism already considers the contents and relative positions of the context words, but not the absolute positions of these words, which in many cases are crucial for the prediction.

Consider the sentence “a new store opened beside the new mall” with the words “store” and “mall” masked for prediction. Although the local contexts of the two words are similar, they play different syntactic roles in the sentence. (Here, the subject of the sentence is “store,” not “mall,” for example.) These syntactic nuances depend, to a large degree, on the words’ absolute positions in the sentence, and so it is important to account for a word’s absolute position in the language modeling process. DeBERTa incorporates absolute word position embeddings right before the softmax layer, where the model decodes the masked words based on the aggregated contextual embeddings of word contents and positions.
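As a simplified illustration of that decoding step, the sketch below folds an absolute position embedding into the aggregated contextual state just before the vocabulary softmax; the additive combination and the shapes are assumptions made for the example, not the exact released architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def decode_masked_token(contextual_state, absolute_position_embedding, W_vocab):
    # The disentangled-attention layers supply content and relative-position information;
    # the masked word's absolute position is injected only here, right before prediction.
    h = contextual_state + absolute_position_embedding
    return softmax(h @ W_vocab)                     # probability distribution over the vocabulary

rng = np.random.default_rng(0)
d, vocab = 16, 1000
probs = decode_masked_token(rng.standard_normal(d), rng.standard_normal(d),
                            rng.standard_normal((d, vocab)) * 0.05)
print(probs.argmax(), round(probs.sum(), 3))        # predicted token id, 1.0
```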

Scale Invariant Fine-Tuning improves training stability

Virtual adversarial training is a regularization method for improving a model’s generalization. It does so by improving the model’s robustness to adversarial examples, which are created by making small perturbations to the input. The model is regularized so that, when given a task-specific example, it produces the same output distribution as it produces on an adversarial perturbation of that example. For NLU tasks, the perturbation is applied to the word embeddings instead of the original word sequence. However, the value ranges (norms) of the embedding vectors vary across words and models. The variance gets larger for bigger models with billions of parameters, leading to instability in adversarial training. Inspired by layer normalization, we developed a Scale-Invariant Fine-Tuning (SiFT) method that improves training stability by applying the perturbations to the normalized word embeddings.
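The core of SiFT can be sketched in a few functions: normalize the embeddings, perturb them along an adversarial direction, and penalize divergence between the clean and perturbed predictions. The gradient passed to `sift_perturbation` would come from backpropagation in a real training framework; here it is treated as a given input, and the whole snippet is an illustration rather than the released implementation.

```python
import numpy as np

def layer_norm(E, eps=1e-6):
    # Normalize each word embedding so perturbation size is comparable across words and models.
    mu, sigma = E.mean(-1, keepdims=True), E.std(-1, keepdims=True)
    return (E - mu) / (sigma + eps)

def sift_perturbation(E_norm, grad_wrt_embeddings, epsilon=1e-2):
    # Perturb the *normalized* embeddings along the direction that most increases the loss.
    g = grad_wrt_embeddings
    return E_norm + epsilon * g / (np.linalg.norm(g) + 1e-12)

def sift_regularizer(p_clean, p_adv, eps=1e-12):
    # Symmetric KL divergence between predictions on clean and perturbed embeddings.
    kl = lambda p, q: np.sum(p * (np.log(p + eps) - np.log(q + eps)))
    return kl(p_clean, p_adv) + kl(p_adv, p_clean)

E = np.random.default_rng(0).standard_normal((4, 8))        # 4 tokens, 8-dim embeddings
E_adv = sift_perturbation(layer_norm(E), np.ones_like(E))   # placeholder gradient for the demo
```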

Conclusion and looking forward

As shown in the SuperGLUE leaderboard (Figure 1), DeBERTa sets new state of the art on a wide range of NLU tasks by combining the three techniques detailed above. Compared to Google’s T5 model, which consists of 11 billion parameters, the 1.5-billion-parameter DeBERTa is much more energy efficient to train and maintain, and it is easier to compress and deploy to apps of various settings.

DeBERTa surpassing human performance on SuperGLUE marks an important milestone toward general AI. Despite its promising results on SuperGLUE, the model is by no means reaching human-level intelligence in NLU. Humans are extremely good at leveraging the knowledge learned from different tasks to solve a new task with little or no task-specific demonstration. This is referred to as compositional generalization, the ability to generalize to novel compositions (new tasks) of familiar constituents (subtasks or basic problem-solving skills). Moving forward, it is worth exploring how to make DeBERTa incorporate compositional structures in a more explicit manner, which could allow combining neural and symbolic computation of natural language similar to what humans do.

Acknowledgments

This research was conducted by Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. We thank our collaborators from Bing, Dynamics 365 AI, and Microsoft Research for providing compute resources for large-scale modeling and insightful discussions.


The perfect Cheeto: How PepsiCo is using Microsoft’s Project Bonsai to raise the (snack) bar

Once the developers had created that simulation framework, the AI algorithm learned through trial and error, as well as from feedback from operators – a process called reinforcement learning. In the simulation, the AI solution can simulate a day’s run in a mere 30 seconds.

That means the AI solution has easily gone through more simulated runs than an operator could see in many lifetimes. And its computing power means it can come up with the right option far faster. Plus, it learned from the company’s most skilled operators and Cheetos experts, so it’s monitoring the fluctuations in quality and productivity from the highest level of experience.

The AI solution “could encapsulate the knowledge and skill of the best operators, then apply that through other facilities,” says Jayson Stemmler, a technical project manager at Neal Analytics who worked on the PepsiCo pilot project. “This solution reveals interactions and relationships that might not be intuitive to operators but that exist in the data. Without the manual measurement process, PepsiCo’s engineers are able to be more efficient with their time and focus on breakthrough innovation.”

A cross section of a Cheetos puff with the words size, flavor, shape and air

A few bad Cheetos?

After the solution spent some time in its simulation proving ground, it was time to take it to a test plant in PepsiCo’s Plano facility to see how it did with the real thing, which meant testing it with some imperfect Cheetos.

“To develop this technology, we need to be able to make product that’s not good, so the AI can learn to take the product back into spec,” says Sean Eichenlaub, a senior principal engineer at PepsiCo.

Personally, I don’t see how any Cheetos could be “not good,” but I understand PepsiCo is going for perfect.

With the computer vision system continually monitoring and sending data to the Project Bonsai solution, any variance from that ideal can be fixed ASAP.

“With faster corrections, we can avoid the potential issues of going out of spec, such as having to discard product, or problems with packaging and waste,” Eichenlaub says.

I, for one, am all for a bag full of perfect Cheetos. And while the company prepares to use this Project Bonsai solution at a production plant, it’s also looking into using it with other Frito-Lay products, including the even-more-complex tortilla chip.

Leah Culler edits Microsoft’s AI Blog for Business & Technology.


Reinforcement learning helps bring a new class of AI solutions to customers

Someone looking to book a vacation online today might have very different preferences than they did before the COVID-19 pandemic.

Instead of flying to an exotic beach, they might feel more comfortable driving locally. With limited options for dining out, having a full kitchen might be essential. Motel rooms or cabins might be more appealing than hotels with shared lobbies.

Countless companies use online recommendation engines to show customers products and experiences that match their interests. And yet, traditional machine learning models that predict what people might prefer are often based on data from past experience. That means they aren’t necessarily able to pick up on quickly changing consumer preferences unless they are retrained with new data.

Personalizer, which is part of Azure Cognitive Services within the Azure AI platform, uses a more cutting-edge approach to machine learning called reinforcement learning, in which AI agents can interact and learn from their environment in real time.

The technique was once used primarily in research labs. But now, it’s making its way into more Microsoft products and services — from Azure Cognitive Services that developers can plug into apps and websites to autonomous systems that engineers can use to refine manufacturing processes. Azure Machine Learning is also previewing cloud-based reinforcement learning offerings for data scientists and machine learning professionals.

“We’ve come a long way in the last two years when we had a lot of proof of concept projects within Microsoft and deployments with a couple of customers,” said Rafah Hosn, senior director at Microsoft Research’s New York lab. “Now we are really progressing nicely into things that can be packaged and shrink wrapped and pointed to a particular set of problems.”

Rafah Hosn standing outside
Rafah Hosn, senior director at Microsoft Research Lab – New York City. Photo courtesy of Microsoft.

Z-Tech, the technology hub of Anheuser-Busch InBev, is using Personalizer to deliver tailored recommendations in an online marketplace to better serve small grocery stores across Mexico. Other Microsoft customers and partners are employing reinforcement learning to detect production anomalies and develop robots that can adjust to unpredictable real-world conditions — with models that can learn from environmental cues, expert feedback or customer behavior in real time.

Once Microsoft began using Personalizer on its homepage to contextually personalize the products displayed to each visitor, the company saw a 19-fold increase in engagement with the products that Personalizer chose. The company has also used Personalizer internally to select the right offers, products and content across Windows, Edge browser and Xbox. These scenarios are giving up to a 60% lift in engagement across billions of personalizations each month.

Microsoft Teams has also used reinforcement learning to find the optimal jitter buffer for a video meeting, which trades off millisecond-scale information delays to provide better connection continuity, while Azure is exploring reinforcement learning-based optimization to help determine when to reboot or remediate virtual machines.

Because reinforcement learning models learn from instantaneous feedback, they can quickly adapt to changing or unpredictable circumstances. Once the COVID-19 pandemic hit, some companies had no idea what to expect as people’s purchasing and travel behaviors changed overnight, said Jeff Mendenhall, a Microsoft principal program manager for Personalizer.

“All of their historic modeling and expert knowledge went out the window,” Mendenhall said. “But with reinforcement learning, Personalizer can update the model every minute if needed to learn and respond to what actual user behaviors are right now.”

In reinforcement learning, an AI agent learns largely by trial and error. It tests out different actions in either a real or simulated world and gets a reward when the actions achieve a desired result — whether that’s a customer hitting the button to book a vacation reservation or a robot successfully unloading an unwieldy bag of coins.

Training an AI agent through reinforcement learning is similar to teaching a puppy to do a trick, Hosn said. It gets a treat when it makes decisions that yield a desired result and learns to repeat the actions that get the most treats. But in complicated real-world scenarios, exploring the vast universe of potential actions and finding an optimal sequence of decisions can be far more complicated.

At the 34th Conference on Neural Information Processing Systems (NeurIPS 2020) this week, Microsoft researchers presented 17 research papers that mark significant progress in addressing some of the field’s biggest challenges. By investing in reinforcement learning teams across its network of Microsoft Research labs, the company says it is developing a portfolio of approaches to tackle different problems and exploring multiple paths to potential breakthroughs.

John Langford sits in an office
John Langford, partner research manager at Microsoft Research Lab – New York City. Photo by John Brecher.

Those teams have focused on developing a robust understanding of reinforcement learning’s foundational elements and creating practical solutions for customers — not just novelty demonstrations, researchers say.

They’ve spent a lot of time figuring out which scenarios reinforcement learning is well-suited to solve, as well as probing the technical underpinnings to understand why something works and how to repeat it, said John Langford, a partner research manager at Microsoft Research Lab – New York.

“Right now there’s a big gap between one-off applications where you can get PhDs to grind really hard and figure out a way to make it work as opposed to developing a routinely useful system that can be used over and over again,” Langford said.

“All of our reinforcement learning research at Microsoft really falls into two big buckets — how can we solve challenges that customers are bringing to us and what are the foundations we can use to build replicable, reliable solutions?” he said.

A different approach to machine learning

Reinforcement learning uses a fundamentally different approach than supervised learning, a more common machine learning technique in which models learn to make predictions from training examples they’ve been fed.

If a person is trying to learn French, exposing themselves to French text, grammar rules and vocabulary is closer to a supervised learning approach, said Raluca Georgescu, a research software engineer working on Project Paidia in the Microsoft Research Cambridge UK lab.

With a reinforcement learning approach, they would go to France and learn by talking to people. They’d be penalized with puzzled looks if they say the wrong thing and they’d get rewarded with a croissant if they order it correctly, she said.

A reinforcement learning agent learns from interacting with its environment, either in the real world or in a simulated environment that allows it to safely explore different options. It takes an action and waits to see if it results in a positive or negative outcome, based on a reward system that’s been established.  Once that feedback is received, the model learns whether that decision was good or bad and updates itself accordingly.
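The loop below is a minimal sketch of that cycle as an epsilon-greedy bandit in plain Python: pick an action, observe a reward, update the estimate, repeat. Personalizer’s production service is considerably more sophisticated (it uses contextual features and learned policies), so treat the action names and reward values here purely as illustration.

```python
import random

def run_bandit(actions, get_reward, steps=10_000, epsilon=0.1):
    """Epsilon-greedy loop: explore occasionally, otherwise exploit the current best estimate."""
    value = {a: 0.0 for a in actions}
    count = {a: 0 for a in actions}
    for _ in range(steps):
        explore = random.random() < epsilon
        action = random.choice(actions) if explore else max(value, key=value.get)
        reward = get_reward(action)                  # environment feedback: a click, a booking, ...
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]   # incremental mean update
    return value

# Toy environment: right now "cabin" gets rewarded most often, and the agent discovers
# that purely from live feedback, with no historical training data at all.
reward_rates = {"hotel room": 0.2, "motel room": 0.35, "cabin": 0.6}
learned = run_bandit(list(reward_rates), lambda a: float(random.random() < reward_rates[a]))
print(max(learned, key=learned.get))                 # most likely: 'cabin'
```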

It’s a really simple form of learning that’s endemic in the natural world, said Langford.

“Even worms can do reinforcement learning — they can learn to go towards things and avoid things based on some feedback,” Langford said. “That ability to learn at a very basic level from your environment is something that is super natural for us but in machine learning it’s a bit more tricky and delicate and requires more thought than supervised learning.”

The new papers presented at NeurIPS this week offer significant contributions in three key research areas: batch reinforcement learning, strategic exploration given rich observations, and representation learning. Taken together, researchers say, these breakthroughs aim to boost the efficiency of models and expand the scope of problems that reinforcement learning can solve.


The human side of AI for chess

As artificial intelligence continues its rapid progress, equaling or surpassing human performance on benchmarks in an increasing range of tasks, researchers in the field are directing more effort to the interaction between humans and AI in domains where both are active. Chess stands as a model system for studying how people can collaborate with AI, or learn from AI, just as chess has served as a leading indicator of many central questions in AI throughout the field’s history.

AI-powered chess engines have consistently bested human players since 2005, and the chess world has undergone further shifts since then, such as the introduction of the heuristics-based Stockfish engine in 2008 and the deep reinforcement learning-based AlphaZero engine in 2017. The impact of this evolution has been monumental: chess is now seeing record numbers of people playing the game even as AI itself continues to get better at playing. These shifts have created a unique testbed for studying the interactions between humans and AI: formidable AI chess-playing ability combined with a large, growing human interest in the game has resulted in a wide variety of playing styles and player skill levels.

There’s a lot of work out there that attempts to match AI chess play to varying human skill levels, but the result is often AI that makes decisions and plays moves differently from human players at that skill level. The goal of our research is to better bridge the gap between AI and human chess-playing abilities. The question for AI and its ability to learn is: can AI make the same fine-grained decisions that humans do at a specific skill level? This is a good starting point for aligning AI with human behavior in chess.

Our team of researchers at the University of Toronto, Microsoft Research, and Cornell University has begun investigating how to better match AI to different human skill levels and, beyond that, personalize an AI model to a specific player’s playing style. Our work comprises two papers, “Aligning Superhuman AI with Human Behavior: Chess as a Model System” and “Learning Personalized Models of Human Behavior in Chess,” as well as a novel chess engine, called Maia, which is trained on games played by humans to more closely match human play. Our results show that, in fact, human decisions at different levels of skill can be predicted by AI, even at the individual level. This represents a step forward in modeling human decisions in chess, opening new possibilities for collaboration and learning between humans and AI.

AlphaZero changed how AI played the game by practicing against itself with only knowledge of the rules (“self-play”), unlike previous models that relied heavily on libraries of moves and past games to inform training. Our model, Maia, is a customized version of Leela Chess Zero (an open-source implementation of AlphaZero). We trained Maia on human games with the goal of playing the most human-like moves, instead of being trained on self-play games with the goal of playing the optimal moves. In order to characterize human chess-playing at different skill levels, we developed a suite of nine Maias, one for each Elo rating between 1100 and 1900. (Elo ratings are a system for evaluating players’ relative skill in games like chess.) As you’ll see below, Maia matches human play more closely than any chess engine ever created.

  • Code: Maia Chess. Explore our nine final Maia models, saved as Leela Chess neural networks, and the code to create more and reproduce our results.

If you’re curious, you can play against a few versions of Maia on Lichess, the popular open-source online chess platform. Our bots on Lichess are named maia1, maia5, and maia9, which we trained on human games at Elo rating 1100, 1500, and 1900, respectively. You can also download these bots and other resources from the GitHub repo.

Measuring human play

What does it mean for a chess engine to match human play? For our purposes, we settled on a simple metric: given a position that occurred in an actual human game, what is the probability that the engine plays the move that the human played in the game?
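In code, that metric is just the fraction of recorded positions in which the engine’s top choice equals the move the human actually played. The `engine.best_move` call below is a hypothetical interface used for illustration, not the actual Leela, Stockfish, or Lichess API.

```python
def move_matching_accuracy(engine, positions):
    """positions: iterable of (position_fen, human_move_uci) pairs taken from real games."""
    matches = total = 0
    for fen, human_move in positions:
        predicted = engine.best_move(fen)    # hypothetical engine interface
        matches += int(predicted == human_move)
        total += 1
    return matches / total if total else 0.0
```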

Making an engine that matches human play according to this definition is a difficult task. The vast majority of positions seen in real games only happen once, because the sheer number of possible positions is astronomical: after just four moves by each player, the number of potential positions enters the hundreds of billions. Moreover, people have a wide variety of styles, even at the same rough skill level. And even the same exact person might make a different move if they see the same position twice!

Creating a dataset

To rigorously compare engines in how well they match human play, we need a good test set to evaluate them with. We made a collection of nine test sets, one for each narrow rating range. Here’s how we made them:

  • First, we made rating bins for each range of 100 rating points (such as 1200-1299, 1300-1399, and so on).
  • In each bin, we put all games where both players are in the same rating range.
  • We drew 10,000 games from each bin, ignoring games played at Bullet and HyperBullet speeds. At those speeds (one minute or less per player), players tend to play lower-quality moves to avoid losing on time.
  • Within each game, we discarded the first 10 moves made by each player to ignore most memorized opening moves.
  • We also discarded any move where the player had less than 30 seconds to complete the rest of the game (to avoid situations where players are making random moves).

After these restrictions we had nine test sets, one for each rating range, which contained roughly 500,000 positions each.
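Those filtering rules are easy to express directly. The sketch below applies them to a hypothetical pre-parsed game record (ratings, time control, and per-move clock values as plain fields); it omits the 10,000-game sampling step and is not the exact pipeline used for the paper.

```python
def build_test_positions(games, rating_lo, rating_hi):
    """Collect (position, played move) pairs from games where both players sit in one rating bin."""
    positions = []
    for game in games:
        both_in_bin = (rating_lo <= game["white_elo"] < rating_hi
                       and rating_lo <= game["black_elo"] < rating_hi)
        if not both_in_bin:
            continue
        if game["clock_initial_seconds"] <= 60:        # skip Bullet and HyperBullet games
            continue
        for ply, move in enumerate(game["moves"]):
            if ply < 20:                                # drop each player's first 10 moves (openings)
                continue
            if move["clock_seconds"] < 30:              # drop likely time-scramble moves
                continue
            positions.append((move["fen_before"], move["uci"]))
    return positions
```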

Differentiating our work from prior attempts

People have been trying to create chess engines that accurately match human play for decades. For one thing, they would make great sparring partners. But getting crushed like a bug every single game isn’t that fun, so the most popular attempts at engines that match human play have been some kind of attenuated version of a strong chess engine. Attenuated versions of an engine are created by limiting the engine’s ability in some way, such as reducing the amount of data it’s trained on or limiting how deeply it searches to find a move. For example, the “play with the computer” feature on Lichess is a series of Stockfish models that are limited in the number of moves they are allowed to look ahead. Chess.com, ICC, FICS, and other platforms all have similar engines. How well do these engines match human play?

Stockfish: We created several attenuated versions of Stockfish, one for each depth limit (for example, the depth 3 Stockfish can only look 3 moves ahead), and then we tested them on our test sets. In the plot below, we break out the accuracies by rating level so you can see if the engine thinks more like players of a specific skill level.

Figure 1: Accuracy of Stockfish models with depth 1, 3, 5, 7, 9, 11, 13, and 15 shown from 1100 to 1900 Elo ratings. Depth 5 matching is the lowest accuracy, starting at under 35% at 1100 and rising to just above 35% for 1900 rating. The best move matching is at Depth 15, starting at roughly 36% at 1100 and rising to over 40% at 1900.
Figure 1: Move matching accuracy for Stockfish compared with the targeted player’s Elo rating

As you can see, it doesn’t work that well. Attenuated versions of Stockfish only match human moves about 35-40% of the time. And equally importantly, each curve is strictly increasing, meaning that even depth-1 Stockfish does a better job at matching 1900-rated human moves than it does at matching 1100-rated human moves. This means that attenuating Stockfish by restricting the depth it can search doesn’t capture human play at lower skill levels—instead, it looks like it’s playing regular Stockfish chess with a lot of noise mixed in.

Leela Chess Zero: Attenuating Stockfish doesn’t characterize human play at specific levels. What about Leela Chess Zero, an open-source implementation of AlphaZero, which learns chess through self-play games and deep reinforcement learning? Unlike Stockfish, Leela incorporates no human knowledge in its design. Despite this, the chess community was very excited by how Leela seemed to play more like human players.

Figure 2: Leela ratings from 800 to 3200 graphed for accuracy. Leela does better than Stockfish for move matching, but as Elo rating gets better, each version of Leela has better or worse accuracy. Accuracy ranges from under 20% (800-rated Leela predicting 1900-level play) to about 47% (3200-rated Leela predicting 1900-level play).
Figure 2: Move matching accuracy for Leela compared with the targeted player’s Elo rating

In the analysis above, we looked at a number of different Leela generations, with the ratings being their relative skill (commentators noted that early Leela generations played particularly similar to humans). People were right in that the best versions of Leela match human moves more often than Stockfish. But Leela still doesn’t capture human play at different skill levels: each version is always getting better or always getting worse as the human skill level increases. To characterize human play at a particular level, we need another approach.

Maia: A better solution for matching human skill levels

Maia is an engine designed to play like humans at a particular skill level. To achieve this, we adapted the AlphaZero/Leela Chess framework to learn from human games. We created nine different versions, one for each rating range from 1100-1199 to 1900-1999. We made nine training datasets in the same way that we made the test datasets (described above), with each training set containing 12 million games. We then trained a separate Maia model for each rating bin to create our nine Maias, from Maia 1100 to Maia 1900.

Figure 3: Maia trained models from 1100 to 1900 ratings. These are shown predicting player moves at 1100 to 1900 ratings. Maia’s worst accuracy is 46% when a 1900-rated Maia model predicts moves of a 1100-rated player. The highest is 52%, far greater than prior AI chess models.
Figure 3: Move matching accuracy for Maia compared with the targeted player’s Elo rating

As you can see, the Maia results are qualitatively different from Stockfish and Leela. First off, the move matching performance is much higher: Maia’s lowest accuracy, when it is trained on 1900-rated players but predicts moves made by 1100-rated players, is 46%—as high as the best performance achieved by any Stockfish or Leela model on any human skill level we tested. Maia’s highest accuracy is over 52%. Over half the time, Maia 1900 predicts the exact move a 1900-rated human played in an actual game.

Figure 4: Figures 1, 2, and 3 combined showing that Maia’s accuracy greatly surpasses prior models’ performance.
Figure 4: Move matching accuracy for all the models compared with the targeted player’s Elo rating

Importantly, every version of Maia uniquely captures a specific human skill level since every curve achieves its maximum accuracy at a different human rating. Even Maia 1100 achieves over 50% accuracy in predicting 1100-rated moves, and it’s much better at predicting 1100-rated players than 1900-rated players!

This means something deep about chess: there is such a thing as “1100-rated style.” And furthermore, it can be captured by a machine learning model. This was surprising to us: it would have been possible that human play is a mixture of good moves and random blunders, with 1100-rated players blundering more often and 1900-rated players blundering less often. Then it would have been impossible to capture 1100-rated style, because random blunders are impossible to predict. But since we can predict human play at different levels, there is a reliable, predictable, and maybe even algorithmically teachable difference between one human skill level and the next.

Maia’s predictions

You can find all of the juicy details in the paper, but one of the most exciting things about Maia is that it can predict mistakes. Even when a human makes an absolute howler—“hanging” a queen, that is, letting an opponent capture it for free—Maia predicts the exact mistake more than 25% of the time. This could be really valuable for average players trying to improve their game: Maia could look at your games and tell you which blunders were predictable and which were random mistakes. If your mistakes are predictable, you know what to work on to hit the next level.

Figure 5: Matching accuracy (predicting move quality) of Maia versus Leela. Quality prediction is much more consistent and consistently higher across the full range of Maia models, at its height above 60%, when compared with Leela, which has a much broader range of accuracy when looking at the full range of models.
Figure 5: Move matching accuracy as a function of the quality of the move played in the game

Modeling individual players’ styles with Maia

In current work, we are pushing the modeling of human play to the next level: can we actually predict the moves a particular human player would make?

It turns out that personalizing Maia gives us our biggest performance gains. Whereas base Maia predicts human moves around 50% of the time, some personalized models can predict an individual’s moves with accuracies up to 75%!

We achieve these results by fine-tuning Maia. Starting with a base Maia, say Maia 1900, we update the model by continuing training on an individual player’s games. Below, you can see that for predicting individual players’ moves, the personalized models all show large improvements over the non-personalized models. The gains are so large that the personalized models are almost non-overlapping with the non-personalized ones: the personalized model for the hardest-to-predict player still gets almost 60% accuracy, whereas even the non-personalized models don’t achieve this accuracy on even the easiest-to-predict players.

Personalized Maia models show a greatly improved range of mean accuracy when compared to non-personalized Maia models: anywhere from just under 60% at the low end to just over 80% at the high end.

The personalized models are so accurate that given just a few games, we can tell which player played them! In this stylometry task—where the goal is to recognize an individual’s playing style—we train personalized models for 400 players of varying skill levels, and then have each model predict the moves from 4 games by each player. For 96% of the 4-game sets we tested, the personalized model that achieved the highest accuracy (that is, predicted the player’s actual moves most often) was the one that was trained on the player who played the games. With only 4 games of data, we can pick out who played the games from a set of 400 players. The personalized models are capturing individual chess-playing style in a highly accurate way.
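The identification step itself is straightforward to express. In the sketch below, `personalized_models` maps player names to their fine-tuned models, each exposing a hypothetical `predicts(position, move)` method that returns True when the model’s top choice matches the played move.

```python
def identify_player(four_games, personalized_models):
    """Attribute a small set of games to whichever personalized model predicts them best."""
    def accuracy(model, games):
        moves = [(pos, mv) for game in games for pos, mv in game]   # (position, played move) pairs
        hits = sum(model.predicts(pos, mv) for pos, mv in moves)    # hypothetical interface
        return hits / len(moves)

    scores = {player: accuracy(model, four_games)
              for player, model in personalized_models.items()}
    # In the paper, the top-scoring model is the true player's model for 96% of 4-game sets.
    return max(scores, key=scores.get)
```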

Using AI to help improve human chess play

We designed Maia to be a chess engine that predicts human moves at a particular skill level, and it has progressed into a personalized engine that can identify the games of individual players. This is an exciting step forward in our understanding of human chess play, and it brings us closer to our goal of creating AI chess-teaching tools that help humans improve. Among the many capabilities of a good chess teacher, two of them are understanding how students at different skill levels play and recognizing the playing styles of their students. Maia has shown that these capabilities are realizable using AI.

The ability to create personalized chess engines from publicly available, individual player data opens an interesting discussion on the possible uses (and misuses) of this technology. We initiate this discussion in our papers, but there is a long road ahead in understanding the full potential and implications of this line of research. As it has countless times before, chess will be a model AI system that sets the stage for this discussion.

Acknowledgments

Many thanks to Lichess.org for providing the human games that we trained on, and hosting our Maia models that you can play against. Ashton Anderson was supported in part by an NSERC grant, a Microsoft Research gift, and a CFI grant. Jon Kleinberg was supported in part by a Simons Investigator Award, a Vannevar Bush Faculty Fellowship, a MURI grant, and a MacArthur Foundation grant.


Microsoft, Code.org partner to teach AI + ethics from elementary to high school


Microsoft and Code.org are excited to announce a partnership that gives every student from elementary school to high school the opportunity to learn about artificial intelligence (AI).

We’re excited to unveil our new video series on artificial intelligence and machine learning. Microsoft CEO Satya Nadella introduces the series.

At a time when AI and machine learning are changing the very fabric of society and transforming entire industries, it is more important than ever to give every student the opportunity to not only learn how these technologies work, but also to think critically about the ethical and societal impacts of AI.

AI is used everywhere, from voice assistants to self-driving cars, and it’s rapidly becoming the most important technological innovation of current times. AI has the potential to play a major role in addressing global problems, such as detecting and curing diseases, cleaning oceans, eliminating poverty, or harnessing clean energy.

At the same time, with great power comes great responsibility, and budding computer scientists must learn to consider technology’s ethical impacts. How does algorithmic bias affect social justice? How do deepfakes threaten democracy? How does society cope with rapid job automation? By learning how to consider the ethical issues that AI raises, these future computer scientists will be better able to envision the appropriate safeguards that help to maximize the benefits of AI technologies and reduce their risks.

Made possible by Microsoft’s latest donation of $7.5 million, Code.org plans a comprehensive and age-appropriate approach to teaching how AI works along with the social and ethical considerations, from elementary school through high school.

Available on December 1:

  • A new video series on AI, featuring Microsoft CEO Satya Nadella along with leading technologists across industry and academia. See the playlist with all videos here.
  • AI for Oceans, available in 25+ languages and optimized for mobile devices.

Within the coming year, AI and machine learning lessons will be integrated into Code.org’s CS Discoveries curriculum, which is one of the most widely used computer science courses for students in grades 6–10, and into App Lab, Code.org’s popular app-creation platform used throughout middle school and high school.

In CS Discoveries, students will learn to work with datasets to create machine learning models that they can incorporate into their apps, and explore how advances in technologies such as computer vision and neural networks raise new ethical questions that computer scientists must address to avoid bias and harm. Curated datasets will help students better understand the real-world impact that these technologies have.

Code.org will also help students and teachers find additional educational resources from a variety of partners and the broader community behind AI education.

A look at a new lesson in Minecraft: Education Edition. In these new lessons, students use AI in a range of exciting real-world scenarios: to preserve wildlife and ecosystems, help people in remote areas, and research climate change.

Additionally, last month the Microsoft AI for Earth team partnered with Minecraft: Education Edition to release five lessons challenging students to use the power of AI in a range of exciting real-world scenarios: to preserve wildlife and ecosystems, help people in remote areas, and research climate change.

What’s more, Microsoft’s Imagine Cup Junior 2021 challenge provides students aged 13 to 18 the opportunity to learn about technology and how it can be used to positively change the world.

The global challenge is focused on Artificial Intelligence (AI), introducing students to AI and Microsoft’s AI for Good initiatives so they can come up with ideas to solve social, cultural and environmental issues.

Microsoft’s Imagine Cup Junior challenge is geared towards students ages 13 to 18. Learn more and join the competition here.

On Code.org, 45% of students are young women; in the US, 50% are from underrepresented racial and ethnic groups and 45% attend high-needs schools. By reaching the tens of millions of students in Code.org’s courses and on its platform, the partnership between Microsoft and Code.org works to democratize access to AI education, because all students deserve the opportunity to shape the world they live in, and because creating an equitable and socially just future will take all of us.

-Code.org CEO Hadi Partovi and Microsoft President Brad Smith


‘Humans and AI’ Ask Me Anything: Nicolas Villar answers your questions about using AI as a force of good

In August, we introduced Humans and AI, a new series of stories that highlight the people who make innovation matter. The series features passionate people from all walks of life who are using AI to transform our society and our world for the better.

Today, we are thrilled to share our next episode of “Humans and AI” featuring Nicolas Villar, a principal hardware architect for Microsoft Premonition, an early warning system that monitors the environment for signs of epidemics. Villar is building robotic devices to capture and track disease-carrying mosquitoes – a threat he understands well after living in places where mosquito-borne illnesses are a daily concern.

Villar’s past projects include Code Jumper, a physical programming language designed to be inclusive of children with all ranges of vision. He considers himself a maker who loves using technology to bring ideas to life to help others and solve problems.

Want to know more about Villar or his work? On Nov. 18, he will be answering your questions live on Twitter in a chat hosted by Microsoft Research. To share your questions ahead of time, tag @MSFTResearch and use #MicrosoftAIChat.


C3.ai, Microsoft, and Adobe combine forces to re-invent CRM with AI

C3 AI CRM enables a new category of customer-focused industry AI use cases and a new ecosystem

REDWOOD CITY, CA, REDMOND, WA, and SAN JOSE, CA – October 26, 2020 – C3.ai, Microsoft Corp. (NASDAQ:MSFT), and Adobe Inc. (NASDAQ:ADBE) today announced the launch of C3 AI® CRM powered by Microsoft Dynamics 365. The first enterprise-class, AI-first customer relationship management solution is purpose-built for industries, integrates with Adobe Experience Cloud, and drives customer-facing operations with predictive business insights.

The partners have agreed to:

  • Integrate Microsoft Dynamics 365, Adobe Experience Cloud (including Adobe Experience Platform), and C3.ai’s industry-specific data models, connectors, and AI models, in a joint go-to-market offering designed to provide an integrated suite of industry-specific AI-enabled CRM solutions including marketing, sales, and customer service.
  • Sell the industry-specific AI CRM offering through dedicated sales teams to target enterprise accounts across multiple industries globally, as well as through agents and industry partners.
  • Target industry vertical markets initially including financial services, oil and gas, utilities, manufacturing, telecommunications, public sector, healthcare, defense, intelligence, automotive, and aerospace.
  • Market the jointly branded offering globally, supported by the companies’ commitment to customer success.

“Microsoft, Adobe, and C3.ai are reinventing a market that Siebel Systems invented more than 25 years ago,” said Thomas M. Siebel, CEO of C3.ai. “The dynamics of the market and the mandates of digital transformation have dramatically changed CRM market requirements. A general-purpose CRM system of record is no longer sufficient. Customers today demand industry-specific, fully AI-enabled solutions that provide AI-enabled revenue forecasting, product forecasting, customer churn, next-best product, next-best offer, and predisposition to buy.”

“This year has made clear that businesses fortified by digital technology are more resilient and more capable of transforming when faced with sweeping changes like those we are experiencing,” said Satya Nadella, CEO, Microsoft. “Together with C3.ai and Adobe, we are bringing to market a new class of industry-specific AI solutions, powered by Dynamics 365, to help organizations digitize their operations and unlock real-time insights across their business.”

“We’re proud to partner with C3.ai and Microsoft to advance the imperative for digital customer engagement,” said Shantanu Narayen, president and CEO of Adobe. “The unique combination of Adobe Experience Cloud, the industry-leading solution for customer experiences, together with the C3 AI Suite and Microsoft Dynamics 365, will enable brands to deliver rich experiences that drive business growth.”

“This is an exciting development in the advancement of Enterprise AI,” said Lorenzo Simonelli, chairman and CEO of Baker Hughes. “This partnership between C3.ai, Microsoft, and Adobe will bring a unique and powerful new CRM offering to the market. We are adopting AI in multiple applications internally and in new products and services for our customers through our C3.ai partnership. We look forward to offering C3 AI CRM to our customers and benefitting from the capabilities internally.”

Combining the market-leading Microsoft Dynamics 365 CRM software with Adobe’s leading suite of customer experience management solutions alongside C3.ai’s enterprise AI capabilities, C3 AI CRM is the world’s first AI-driven, industry-specific CRM built with a modern AI-first architecture. C3 AI CRM integrates and unifies vast amounts of structured and unstructured data from enterprise and extraprise sources into a unified, federated image to drive real-time predictive insights across the entire revenue supply chain, from contact to cash. With embedded AI-driven, industry-specific workflows, C3 AI CRM helps teams:

  • Accurately forecast revenue
  • Accurately predict product demand
  • Identify and reduce customer churn
  • Identify highly-qualified prospects
  • Deliver next-best offers and next-best products
  • Apply AI-driven segmentation, marketing, and targeting

C3 AI CRM enables brands to take advantage of their real-time customer profiles for cross-channel journey orchestration. The joint solution empowers customers to combine leading CRM capabilities with an integrated ecosystem spanning Azure, Microsoft 365, and the Microsoft Power Platform. C3 AI CRM is pre-built and configured for industries – financial services, healthcare, telecommunications, oil and gas, manufacturing, utilities, aerospace, automotive, public sector, defense, and intelligence – enabling customers to deploy and operate C3 AI CRM and its industry-specific machine learning models quickly. In addition, C3 AI CRM leverages the common data model of the Open Data Initiative (ODI), making it easier to bring together disparate customer data from across the enterprise.

C3 AI CRM is immediately available, with Adobe Experience Cloud sold separately. C3 AI CRM powered by Dynamics 365 will be available from C3.ai, Adobe, Microsoft and through the Microsoft Dynamics 365 Marketplace. Please contact [email protected] to learn more.

###
About C3.ai

C3.ai is a leading enterprise AI software provider for accelerating digital transformation. C3.ai delivers the C3 AI Suite for developing, deploying, and operating large-scale AI, predictive analytics, and IoT applications in addition to an increasingly broad portfolio of turn-key AI applications. The core of the C3.ai offering is a revolutionary, model-driven AI architecture that dramatically enhances data science and application development.

About Microsoft

Microsoft (Nasdaq “MSFT” @microsoft) enables digital transformation for the era of an intelligent cloud and an intelligent edge. Its mission is to empower every person and every organization on the planet to achieve more.

About Adobe

Adobe is changing the world through digital experiences. For more information, visit www.adobe.com.

For more information, contact:

C3.ai Public Relations:
April Marks
(917) 574-5512
[email protected]

Microsoft Media Relations:
WE Communications for Microsoft
(425) 638-7777
[email protected]

Adobe Comms:
Ashley Levine
(408) 666-5888
[email protected]


Latest AI breakthrough describes images as well as people do

Novel object captioning

Image captioning is a core challenge in the discipline of computer vision, one that requires an AI system to understand and describe the salient content, or action, in an image, explained Lijuan Wang, a principal research manager in Microsoft’s research lab in Redmond.

“You really need to understand what is going on, you need to know the relationship between objects and actions and you need to summarize and describe it in a natural language sentence,” she said.

Wang led the research team that achieved – and beat – human parity on the novel object captioning at scale, or nocaps, benchmark. The benchmark evaluates AI systems on how well they generate captions for objects in images that are not in the dataset used to train them.

Image captioning systems are typically trained with datasets that contain images paired with sentences that describe the images, essentially a dataset of captioned images.

“The nocaps challenge is really how are you able to describe those novel objects that you haven’t seen in your training data?” Wang said.

To meet the challenge, the Microsoft team pre-trained a large AI model with a rich dataset of images paired with word tags, with each tag mapped to a specific object in an image.

Datasets of images with word tags instead of full captions are more efficient to create, which allowed Wang’s team to feed lots of data into their model. The approach imbued the model with what the team calls a visual vocabulary.

The visual vocabulary pre-training approach, explained Xuedong Huang, a Microsoft technical fellow and the chief technology officer of Azure AI Cognitive Services, is similar to prepping children to read by first using a picture book that associates individual words with images, such as a picture of an apple with the word “apple” beneath it and a picture of a cat with the word “cat” beneath it.

“This visual vocabulary pre-training essentially is the education needed to train the system; we are trying to educate this motor memory,” Huang said.

The pre-trained model is then fine-tuned for captioning on the dataset of captioned images. In this stage of training, the model learns how to compose a sentence. When presented with an image containing novel objects, the AI system leverages the visual vocabulary to generate an accurate caption.
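
To make the two-stage recipe concrete, here is a deliberately tiny sketch in PyTorch. Everything in it is hypothetical: the feature dimensions, the tag and word vocabularies, and the random tensors stand in for real image features and real datasets, and the actual Microsoft system is far larger and more sophisticated. The sketch only illustrates the structure described above: a shared image encoder first learns a visual vocabulary from image/tag pairs, then a small caption decoder is trained on image/caption pairs while reusing that encoder.

# Minimal, illustrative sketch of the two-stage idea; all names, sizes, and
# data here are hypothetical stand-ins, not Microsoft's actual model or code.
import torch
import torch.nn as nn

NUM_TAGS = 500       # size of a hypothetical "visual vocabulary" of object tags
VOCAB_SIZE = 1000    # size of a hypothetical caption word vocabulary
FEATURE_DIM = 2048   # assumed dimensionality of precomputed image features
HIDDEN_DIM = 256
MAX_LEN = 12         # maximum caption length for the toy decoder

# Shared image encoder: trained in stage 1, reused and fine-tuned in stage 2.
encoder = nn.Sequential(nn.Linear(FEATURE_DIM, HIDDEN_DIM), nn.ReLU())

# ---- Stage 1: visual-vocabulary pre-training on image/tag pairs ------------
# Each image is paired with a multi-hot vector of object tags ("apple", "cat"),
# and the model learns to predict which tags appear in the image.
tag_head = nn.Linear(HIDDEN_DIM, NUM_TAGS)
pretrain_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(tag_head.parameters()), lr=1e-3)
tag_loss_fn = nn.BCEWithLogitsLoss()

image_feats = torch.randn(32, FEATURE_DIM)                 # stand-in image features
tag_targets = (torch.rand(32, NUM_TAGS) < 0.02).float()    # stand-in multi-hot tags

for _ in range(5):  # a few illustrative training steps
    logits = tag_head(encoder(image_feats))
    loss = tag_loss_fn(logits, tag_targets)
    pretrain_opt.zero_grad()
    loss.backward()
    pretrain_opt.step()

# ---- Stage 2: fine-tuning on image/caption pairs ---------------------------
# A small decoder learns to compose sentences conditioned on the pre-trained
# image representation; the encoder keeps the visual vocabulary it learned.
embed = nn.Embedding(VOCAB_SIZE, HIDDEN_DIM)
decoder = nn.GRU(HIDDEN_DIM, HIDDEN_DIM, batch_first=True)
word_head = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)
finetune_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(embed.parameters())
    + list(decoder.parameters()) + list(word_head.parameters()), lr=1e-3)
caption_loss_fn = nn.CrossEntropyLoss()

captions = torch.randint(0, VOCAB_SIZE, (32, MAX_LEN))     # stand-in token ids

for _ in range(5):
    img_state = encoder(image_feats).unsqueeze(0)  # initial hidden state from the image
    inputs = embed(captions[:, :-1])               # teacher forcing: predict the next word
    outputs, _ = decoder(inputs, img_state)
    logits = word_head(outputs)
    loss = caption_loss_fn(logits.reshape(-1, VOCAB_SIZE), captions[:, 1:].reshape(-1))
    finetune_opt.zero_grad()
    loss.backward()
    finetune_opt.step()

Even in this toy form, the division of labor the researchers describe is visible: the tag stage teaches the encoder what objects look like, and the caption stage teaches the decoder how to talk about them, which is what lets the combined model describe objects it never saw captioned during training.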

“It combines what is learned in both the pre-training and the fine-tuning to handle novel objects in the testing,” Wang said.

When evaluated on nocaps, the AI system created captions that were more descriptive and accurate than those written by people for the same images, according to results presented in a research paper.