WIRED: Undersea servers stay cool while processing oceans of data

Most electronics suffer a debilitating aquaphobia. At the littlest spillage of water (heaven forbid Dorothy’s bucket), our wicked widgets shriek and melt.

Microsoft, it would seem, missed the memo. Last June, the company installed a smallish data center on a patch of seabed just off the coast of Scotland’s Orkney Islands; around it, approximately 933,333 bucketfuls of brine circulate every hour. As David Wolpert, who studies the thermodynamics of computing systems, wrote in a recent blog post for Scientific American, “Many people have impugned the rationality.”

The idea to submerge 864 servers in saltwater was, in fact, quite rational, the result of a five-year research project led by future-proofing engineers. Errant liquid might fritz your phone, but the slyer, far deadlier killer of technology is the opposing elemental force, fire. Nearly every system failure in the history of computers has been caused by overheating. As diodes and transistors work harder and get hotter, their susceptibility to degradation intensifies exponentially. Localized, it’s the warm iPhone on your cheek or a wheezing laptop giving you upper-leg sweats. At scale, it’s Outlook rendered inoperable by remote server meltdown for 16 excruciating hours—which happened in 2013.

Servers underlie the networked world, constantly refreshing the cloud with droplets of data, and they’re as valuable as they are vulnerable. Housed by the hundreds, and often the thousands, in millions of data centers across the United States, they cost billions every year to build and protect. The most significant number, however, might be a single-digit one: Running these machines, and therefore cooling them, blows through an estimated 5 percent of total energy use in the country. Without that power, the cloud burns up and you can’t even fact-check these stats on Google (an operation that costs some server, somewhere, a kilojoule of energy).

Alyssa Foote

Savings of even a few degrees Celsius can significantly extend the lifespan of electronic components; Microsoft reports that, on the ocean floor 117 feet down, its racks stay 10 degrees cooler than their land-based counterparts. Half a year after deployment, “the equipment is happy,” says Ben Cutler, the project’s manager. (The only exceptions are some of the facility’s outward-facing cameras, lately blinded by algal muck.)

Another Microsoft employee refers to the effort as “kind of a far-out idea.” But the truth is, most hyperscalers investing in superpowered cloud server farms, from Amazon to Alibaba, see in nature a reliable defense against ever more sophisticated, heat-spewing circuits. Google’s first data center, built in 2006, sits on the temperate banks of Oregon’s Columbia River. In 2013, Facebook opened a warehouse in northern Sweden, where winters average –20 degrees Celsius. The data company Green Mountain buried its massive DC1-Stavanger center inside a Norwegian mountain; pristine, near-freezing water from a fjord, guided by gravity, flows through the cooling system. What Tim Cook has been calling the “data-industrial complex” will rely, if it’s to sustainably expand to the farthest reaches, on a nonindustrial means of survival.

Underwater centers may represent the next phase, a reverse evolution from land to sea. It’s never been hard, after all, to waterproof large equipment—think of submarines, which get more watertight as they dive deeper and pressure increases. That’s really all Microsoft is doing, swapping out the payloads of people for packets of data and hooking up the trucklong pod to umbilical wiring.

Nonetheless, Cutler says, the concept “catches people’s imagination.” He receives enthusiastic emails about his sunken center all the time, including one from a man who builds residential swimming pools. “He was like, you guys could provide the heating for the pools I install!” Cutler says. When pressed on the feasibility of the business model, Cutler adds: “We have not studied this.”

Others have. IBM maintains a data center outside of Zurich that really does heat a public swimming pool in town, and the Dutch startup Nerdalize will erect a mini green data center in your home with promises of a warm shower and toasty living room. Hyperlocal servers, part of a move toward so-called edge computing, not only provide recyclable energy but also bring the network closer to you, making your connection speeds faster. Microsoft envisions sea-based facilities like the one in Scotland serving population-dense coastal cities all over the world.

“I’m not a philosopher, I’m an engineer,” Cutler says, declining to offer any quasipoetic contemplations on the imminent fusion of nature and machine. Still, he does note the weather on the morning his team hauled the servers out to sea. It was foggy, after a week of clear skies and bright sun, as though the literal cloud, reifying the digital, were peering into the shimmering, unknown depths.

Jason Kehe (@jkehe) wrote about drone swarms in issue 26.08.

This article appears in the January issue.


Podcast: Soundscaping the world with Amos Miller

Product Strategist Amos Miller

Episode 54, December 12, 2018

Amos Miller is a product strategist on the Microsoft Research NeXT Enable team, and he’s played a pivotal role in bringing some of MSR’s most innovative research to users with disabilities. He also happens to be blind, so he can appreciate, perhaps in ways others can’t, the value of the technologies he works on, like Soundscape, an app which enhances mobility independence through audio and sound.

On today’s podcast, Amos Miller answers burning questions like how do you make a microwave accessible, what’s the cocktail party effect, and how do you hear a landmark? He also talks about how researchers are exploring the untapped potential of 3D audio in virtual and augmented reality applications, and explains how, in the end, his work is not so much about making technology more accessible, but using technology to make life more accessible.


Episode Transcript

Amos Miller: Until you are out there in the wind, in the rain, with the people, experiencing, or at least trying to get a sense for the kind of experience they’re going through, you’ll never understand the context in which your technology is going to be used. It’s not something you can imagine, or glean from secondary data, or even from video or anything. Until you are there, seeing how they grapple with issues that they are dealing with, it’s almost impossible to really understand that context.

(music plays)

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: Amos Miller is a product strategist on the Microsoft Research NeXT Enable team, and he’s played a pivotal role in bringing some of MSR’s most innovative research to users with disabilities. He also happens to be blind, so he can appreciate, perhaps in ways others can’t, the value of the technologies he works on, like Soundscape, an app which enhances mobility independence through audio and sound.

On today’s podcast, Amos Miller answers burning questions like how do you make a microwave accessible, what’s the cocktail party effect, and how do you hear a landmark? He also talks about how researchers are exploring the untapped potential of 3D audio in virtual and augmented reality applications, and explains how, in the end, his work is not so much about making technology more accessible, but using technology to make life more accessible. That and much more on this episode of the Microsoft Research Podcast.

Host: Amos Miller, welcome to the podcast.

Amos Miller: Thank you. It’s great to be here.

Host: You are unique in the Microsoft Research ecosystem. Your work is mission-driven. Your personal life strongly informs your professional life and, we’ll get more specific in a bit. But for starters, in broad strokes, tell us what gets you up in the morning. Why do you do what you do?

Amos Miller: I’ve always been passionate about technology from a very young age. But, really, in the way that it impacts people’s lives. And it’s not a mission that I necessarily knew about when I went through my career and experiences with technology. But when I look back, I see that those are the areas where I could see that a person feels differently about themselves or about the environment as a result of their interaction with that technology. That’s where I thought okay, that is having meaning to this person. And I have this huge, wonderful opportunity to do what I do in Microsoft Research to actually have turned that passion into my day job, which is very… I feel extremely fortunate with that. And I sometimes have to pinch myself to see that it’s not a dream.

Host: Well, tell us a little bit about your background and how that plays into what you are doing here.

Amos Miller: I’m very much a person that grew up in the technology world. I also moved between a number of countries over my career, and my life. I grew up in Israel. I spent many years in the UK, in London. I spent a few other years in Asia, in Singapore, and now I’m here, so all of these aspects of my life have been very important to me. I also happen to be blind. I suffer from a genetic eye condition called retinitis pigmentosa. It was diagnosed when I was five and I gradually lost my sight. I started university with good enough sight to manage and finished university with a service dog and any kind of technology I could find to help me read the whiteboard, to help me read the text on the computer. And I’d say by the age of 30, I totally stopped using my sight. And that’s when I really started living life as a fully blind person.

Host: Let’s talk about your job for a second. You are a product strategist at Microsoft Research, so how would you describe what you do?

Amos Miller: So, I work in a part of the organization at Microsoft Research that looks at really transferring technology ideas into impact. Into a way that they impact business, impact people. A good idea will only have an impact when it’s applied in the right way, in the right environment, so that the social, the business, the technological context in which it operates is going to make it thrive. Otherwise it doesn’t matter how good it is, it’s not going to have an impact.

Host: Right. So, let’s circle over to this previous role you had which was in Microsoft’s Digital Advisory program. And I bring it up in as much as it speaks to how often our previous work can inform our current work, and you referred to that time as your “customer-facing life.” How does it inform your role as a strategist today?

Amos Miller: What always energizes me is when I see and observe the meaning and the impact that technology can really have for people. And I don’t say it lightly. Until you are out there in the wind, in the rain, with the people, experiencing, or at least trying to get a sense for the kind of experience they are going through, you’ll never understand the context in which your technology is going to be used. It’s not something you can imagine, or glean from secondary data, or even from video or anything. Until you are there, seeing how they grapple with the issues that they are dealing with, it’s almost impossible to really understand that context. And the work that I’ve done in, actually, my first nine years in Microsoft, I worked in a customer-facing part of the business, in the Strategic Advisory Services, today known as the Digital Advisory Services. It’s work that we do with our largest customers around the world to really help them figure out how they can transform their own businesses and leverage advancements in technology.

Host: Right. So now, as you are working in Microsoft Research, as a product strategist, how does that transfer to what you do today?

Amos Miller: First of all, I want to introduce, for a moment, the team that I work with, which is the Enable team in Microsoft Research. And the Enable team is looking at technological innovations, especially with disabilities in mind. In our case, our two primary groups are people with ALS and people who are blind. As a product strategist, my role is to work across the research, engineering, marketing and our customer segment and really figure out and understand how we can harness what we have from a technology perspective and, as an organization, to maximize and have that impact that we aspire to have with that community. And that takes a great deal of – again, going back to my earlier point – spending time with that community, going out there and spending time, in my case, with other people who are blind because I only know my own experience. I don’t have everybody else’s experience. The only way for me to learn about that is to be out there. And in our team, every developer goes out there to spend time with end users because that’s the only way you can really get under the covers and understand what’s going on.

Host: Right.

(music plays)

Host: So, the website says you drive a research program that “seeks to understand and invent accessibility in a world” – this is the fun part – “where AI agents and mixed reality are the primary forms of interaction.” It sounds kind of sci-fi to me…

Amos Miller: A little bit. Let me unpack that a little bit. When we traditionally think about accessibility, we think about, how do you make something accessible? So how do you make a microwave more accessible? Well, there isn’t anything inherently inaccessible in putting a piece of pizza and warming it up in the microwave. The only reason it’s inaccessible is because the microwave was designed in an inaccessible way. It could have been accessible from the beginning.

Host: Sure.

Amos Miller: But the world we are moving to is, it’s not about me operating the microwave, it’s not about the accessibility of the microwave, it’s about me preparing dinner for my family. That’s the experience that I’m in. And there’s a bunch of technologies that support that experience. And that experience is what I am seeking to make an accessible and inclusive experience.

Host: Okay.

Amos Miller: That means that we are no longer talking about the microwave, we are talking about a set of interactions that involve people, that involve technology, that involves physical things in the environment. It’s not about making the technology accessible, it’s about using technology to make life more accessible, whether you are going for a walk with a friend, whether you are going to see a movie with a friend, whether it’s sitting in a meeting and brainstorming a storyboard for video. All of these are experiences, and the goal is, how do you make those experiences accessible experiences? That kind of gets you thinking about accessibility in a very different way, where your interaction is with the person that you are sitting in front of. The technology is just there in support of that interaction.

Host: Right. As I’m researching the interview, I find myself thinking of the various solutions – maybe the “technical guide dog” mentality – like let’s replace all these things, with technology, that people have traditionally used for independence. And the technology as it enters that ecosystem, some people might think the aim is to replace those things, but I don’t think that’s the point of what’s going on here. Am I right?

Amos Miller: That’s right. There is a tendency, when you come at a problem with a technology solution, to look at what you are currently doing and replace that with something that’s automatic. Right? Oh, you are using a guide dog? How can I replace that guide dog and give you a robot? So, I work on technology that enhances mobility independence through audio and sound, which we’ll talk about in a minute.

Host: Right.

Amos Miller: But often people ask me, how would that work for people who can’t hear? And the natural inclination to them is to say, oh, okay, well you’ll have to deliver the information in a different way. The thing is that people get a sense of their space and their surroundings using the senses that they have. To me, the question is not, how do we shortcut that? It’s how do they sense their space today? They do. They don’t sit there feeling completely disconnected. And if you are going to intervene in that, you better be consistent with how they’re experiencing it today.

Host: Yeah, and that leads me right into the next question because you and I talked earlier about the fundamental role that design plays in the success of human computer interaction. And I’m really eager to have you weigh in on the topic. Let’s frame this broadly in terms of assumptions. And that’s kind of what you were just referring to.

Amos Miller: Yeah.

Host: You know, if I’m looking at you and I think, well my solution to how you interact with the world with technology would be Braille, that’s an assumption. So, I’m just going to give you free rein here. Tell us what you want us to know about this from your perspective.

Amos Miller: We all make assumptions about other people’s experience of life. You are referring to Bill Buxton who was on your podcast a few weeks ago.

Host: Right.

Amos Miller: And he’s actually been a very close friend and mentor throughout the work that we are doing on Soundscape, which we’ll talk about in a minute. And he’s really brought to our attention that what we’ve done, of going out there and experiencing the real situation that people are experiencing, is about empathy and it’s about trying to understand and probe ideas that challenge your assumptions about what effect they will have. But, really seeing, observing and understanding their experience in that particular situation, and then maybe applying, from your learning, some form of intervention into that experience and observing how that affects that experience. It doesn’t have to be a complete piece of software or technology, it’s just an intervention. It can be completely low-fi. That helps you to start expanding your understanding. And you don’t have to do it with 100 people. Do it with two… three people. You will discover a whole new world you didn’t know about. I’m sorry, but you don’t need 200 data points to support that experience, you’ve just seen it. And you can build on that. So, can you enhance that, in any way, to give them an even richer awareness of their surroundings? And those are the kind of questions that taking design through that very experiential lens has led us to the work that we are actually doing on Soundscape, which is the technology that we’ve been developing over the last few years, to really see how far we can take this notion of how people perceive the world and how you can enhance that so their perception is enhanced.

(music plays)

Host: Well, let’s talk about 3D sound and an exciting launch earlier this year in the form of Microsoft’s Soundscape. This is such a cool technology with so many angles to talk about. First, just give our listeners an overview of Soundscape. What is it, who is it for, how does it work, how do people experience it?

Amos Miller: Soundscape is a technology that we developed in collaboration with Guide Dogs, certainly in the early stages, and still do. And the idea is very much using audio that’s played in 3D. Using a stereo headset, you can hear the landmarks that are around you and you can, thereby, really enrich your awareness of your surroundings, of what’s where in a very natural, easy way. And that really helps you feel more independent, more confident, to explore the world beyond what you know.

Host: How do you hear a landmark?

Amos Miller: How do you hear a landmark? So, for example, if you are standing and Starbucks is in front of you and to the right, we will say the word Starbucks, but we won’t say it’s in front of you and to the right, it will sound like it is over there where Starbucks is.

Host: Oh.

Amos Miller: OK? And that’s generated using, the technical term is head rotation transfer of synthetic binaural audio. So, it’s work that actually was developed in Microsoft Research, over a number of years, by Ivan Tashev and his team. And effectively, you can generate sound to make it sound like it’s not in between your ears. You can hear it as though it’s out in the space around you. It’s really quite amazing. And we also use non-audio cues. For example, one of the ideas that we built into Soundscape is this notion of a virtual audio beacon. Not to be confused with Bluetooth beacons! It’s completely virtual. But let’s suppose that you are standing on a street corner and you are heading to a restaurant that’s a block and a half away. What you can do with Soundscape is play some audio beacon that will sound like it’s coming from that restaurant, so no matter which way you’re standing, which way you’re heading, you can always hear that “click-click” sound so you know exactly where that restaurant is. You can see it with your ears.

Host: How do you do that? How do you place a beacon someplace, technically?

Amos Miller: Binaural audio is when you have a slightly different sound in each ear which tricks the brain into having a sense of, that sound is three dimensional. It’s exactly the same way that 3D images work. Audio works almost the same. If Ivan were here, he’d say it’s not exactly the same, but by generating a slightly different soundwave in each ear, you’re able to make sound, sound like it’s coming from a specific direction. But by playing it in each ear slightly differently, it will actually sound like it’s coming from in front of you and to the right. OK? Now how do we know where to place that beacon?
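
One of the "slightly different sound in each ear" cues can be illustrated with a short calculation. The sketch below is not Soundscape's method; it swaps in the textbook Woodworth approximation for the interaural time difference, and the head radius, sample rate, and function names are all assumptions chosen for illustration.

```python
import math

HEAD_RADIUS_M = 0.0875      # assumed average head radius
SPEED_OF_SOUND_M_S = 343.0  # speed of sound in air at about 20 °C


def interaural_time_difference(azimuth_deg: float) -> float:
    """Woodworth approximation of the arrival-time gap between the ears, in seconds.

    azimuth_deg is the source direction relative to straight ahead,
    from -90 (hard left) to +90 (hard right). A positive result means
    the sound reaches the right ear first.
    """
    if not -90.0 <= azimuth_deg <= 90.0:
        raise ValueError("this approximation covers the frontal half-plane only")
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))


def delay_in_samples(azimuth_deg: float, sample_rate_hz: int = 48_000) -> int:
    """Whole-sample delay to apply to the far ear's channel (hypothetical helper)."""
    return round(abs(interaural_time_difference(azimuth_deg)) * sample_rate_hz)


if __name__ == "__main__":
    for az in (0, 45, 90):
        itd_ms = interaural_time_difference(az) * 1000
        print(f"{az:>3} deg: {itd_ms:.3f} ms, {delay_in_samples(az)} samples")
```

A time delay alone gives only a rough left-right impression; a full binaural renderer would also shape level and spectrum per ear.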

Host: Right.

Amos Miller: At present, we – it’s largely designed to be used outdoors – so, we use GPS, so we know where you are standing. We know where that restaurant is, so we have two coordinates to work with. We also estimate which way you are facing. So, if you were facing the restaurant, we would want to play that beacon right in front of you. If you were standing at 90 degrees to the restaurant, we’d want to make that beacon sound like it’s coming not only from your right ear, but 100 meters away to your right.
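
The geometry described here (two coordinates plus an estimated facing direction) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Soundscape's code: it uses the standard great-circle bearing and haversine formulas, and the function names and sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2, in degrees [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance via the haversine formula, in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def beacon_direction(listener_lat, listener_lon, heading_deg, target_lat, target_lon):
    """Angle of the target relative to where the listener faces, in (-180, 180].

    0 is straight ahead, positive is to the right, negative is to the left.
    """
    absolute = bearing_deg(listener_lat, listener_lon, target_lat, target_lon)
    relative = (absolute - heading_deg + 180.0) % 360.0 - 180.0
    return 180.0 if relative == -180.0 else relative


if __name__ == "__main__":
    # Hypothetical listener position, with the target a short way due north.
    here = (47.6, -122.3)
    there = (47.601, -122.3)
    for heading in (0.0, 90.0, 270.0):
        angle = beacon_direction(*here, heading, *there)
        print(f"heading {heading:5.1f}: beacon at {angle:+.1f} deg, "
              f"{distance_m(*here, *there):.0f} m away")
```

Recomputing the relative angle and distance each time the position or heading changes is what would let a rendered beacon stay pinned to the place rather than to the listener's head.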

Host: Unbelievable…

Amos Miller: Yeah? And so, taking all of those sensory inputs and taking the information from the map, the GPS location, the direction, we reproduce the sound image in your stereo headset so that you can hear the direction of the sound and where the thing is. And the most amazing thing is, this is all done in real time, completely dynamic. So, as you walk down the street, that restaurant may sound in front of you at 45 degrees to your right, and as you progress, you’ll hear it getting closer and closer and further and further to your right and further and further to your right. And if you overshoot it, it’ll start to sound behind you a little bit, yeah? Now, why is this so important? Because I’m not going to the restaurant on my own. I’m there with my kid or with my wife, or with my friend. And, if I were to hold a phone with the GPS instructions and all of that, I can’t hold a conversation with that person at the same time because I’m so engaged with the technology. And we talked earlier about, how do you get technology to be in the background? That beacon sound is totally in the background. You don’t have to think about it, you don’t have to attend to it mentally, it’s just there. So, you know where the restaurant is, and you continue to have a conversation with the person you are with, or you can daydream, or you can read your emails, listen to a podcast, and all of that happens at the same time. Because it’s played in 3D space, because it’s non-intrusive. You minimize the use of language. And all of these subtle aspects are absolutely crucial for this kind of technology to be relevant to this situation. You’re not sitting in front of the computer and it’s the only thing you are doing. You are outdoors. There’s a ton of things happening all the time that you have to deal with. You can’t expect the person to disassociate themselves from all of that. You know, Soundscape is one way of addressing this very, very interesting and important question. 
Throughout history, technology has always changed the way that we do things. But I think that we’re starting to see that, as technology developers, we really have to be much more mindful about just from the subtleties of how we design something on, what is the relationship between the technology and the person in that situation? How can a technology do exactly the same as it has done, but do so in a way that makes the person feel empowered and develop a new skill. Great runners learn to feel their heartbeat. But if they have a heart monitor, they’ll stop feeling that heartbeat because the device on their wrist tells them what it is. Well, that’s only because that’s how it was designed. If the heart monitor, instead of telling you, you are at, I don’t know, 150, it’d say, what do you think you’re at? And you’d say, oh, I’m at 140, and it’ll say, oh, you are actually at 150. You will have learned something new from that. It’s exactly the same function, but you have developed yourself as a result of that interaction. And I think that that’s the kind of opportunity that we need to start looking for.

Host: I want to circle back to this 3D audio and the technology behind it, and something that you referred to as “the cocktail party effect.” Can you explain that a little bit and how Microsoft Research is sort of leading the way here?

Amos Miller: The cocktail party effect is an effect, in the world of psycho-acoustics, that is very simple. If you imagine you’re sitting around a table in a cocktail party having a very exciting conversation with somebody, and there are lots of other similar conversations happening around you at the same time, because all of those conversations are happening in 3D space, you are actually able to hear all of those conversations even though you are attending just to yours. You are listening and you can understand and engage in your conversation, but if your name came up in any of those other conversations, you’ll immediately turn your head and say, hey guys, what are you talking about there? And that’s an incredible capability of the brain to manage a very rich set of inputs in the auditory space that is very much under-utilized today in the technology space. We always feel that if we need to convert something into audio, it’s got to be sequenced, because we can only hear one thing at a time. When it’s in 3D, that’s no longer the case. And that’s a huge opportunity. We play a lot of that in VR and augmented reality and we spend a lot of time on the visual aspect of virtual reality and really pushing the envelope on how far we can take the use of immersive experiences in objects in all directions. But the same is available with audio. Even more with audio because your eyes are no longer engaged. Audio is in 360. If we block our ears for a moment, all of a sudden, our awareness level drops. But we are so unaware of the power of audio because vision just takes over everything. And I think the work that we have done, both in the acoustic work on 3D audio, and the application, especially in the disability space where we placed the constraints on the team – there is no vision, now let’s figure it out – and that leads to new frontiers of discovery and innovation in this space that I think could be applicable and would be applicable in many other spaces. 
And that, you know, that heads-up experience when you are out and about in the streets, not focused on the screen, but engaged in your surroundings. And that’s a perfect situation where audio has huge advantages that we can look at.

(music plays)

Host: I ask each of my guests some version of the question, what keeps you up at night? Because I’m interested in how researchers are addressing unintended consequences of the work they’re doing. Is there anything that concerns you, Amos? Anything that keeps you up at night?

Amos Miller: I think things keep me up at night because they are so interesting and yet unsolved. You know, we talked a bit about, how do you really express and portray the physical space around you in ways that utilize your other senses and really maximize the ability of the brain to make sense of places without vision? And I really think that, with Soundscape, we’ve only started to scratch the surface of that question. Over half of the brain is devoted to perception. And I think that, when we find ways to really engage, even further engage that incredible human capability, we will discover a whole new frontier of machine and human interaction in ways that we don’t understand today.

Host: You said you arrived at Microsoft Research from “left field.” What’s your story on how you came to be working on research in accessibility at Microsoft Research?

Amos Miller: I started life as a developer, and I did a business degree and joined the Strategic Advisory Services in Microsoft Consulting in the UK. And I think it was a very special moment in Microsoft, over the last few years, when we really started to understand the meaning of impacting every person on the planet with technology and seeing that as our mission. And that led to a series of conversations that opened an opportunity for us to actually get behind that statement and we basically joined Microsoft Research through that mission, through the work that we’re doing in Soundscape. And because we already had very strong relationships, thanks to some wonderful people in the company, and strong relationships here in Microsoft Research and in other parts of the company.

Host: Before we close, Bill Buxton asked me to ask you about the kayak regatta that you organized.

Amos Miller: Uh huh. Oh, we didn’t talk about that.

Host: Just tell that story quickly because I do have one question I want to wrap up with before we go.

Amos Miller: Okay. Well we talked about Soundscape as a technology that really enables you to hear signals in 3D around you. And that was largely designed to be used in the street, right? And then we thought, what would happen if we placed that audio beacon on a lake? So, we got a bunch of people during the summer hackathon and said, okay, well let’s try it out. So, we organized an event on Lake Sammamish. We hacked Soundscape to work on the lake and placed some virtual audio beacons around the lake and invited a group of people who are blind to come and kayak with us and see how they enjoy it. And they absolutely loved it. And I think that was a real eye-opener for us. You have to understand the difference here, you know? Could they kayak before? Sure, no problem, because a sighted person would be with them and tell them, okay, now you go straight, now you row left… But I’m sorry, that’s a very boring experience. You are not in control, you are not independent, you are just doing the work. And by being able to hear where those beacons are, you are truly in the driving seat. And that is a sense of independence that we’ve not really seen to that extent before we did this event.

Host: I like how you called it an eye-opening event!

Amos Miller: It was!

Host: There are so many metaphors about vision that we just sort of take for granted, right?

Amos Miller: Maybe it’s because I have prior sight, maybe not, but I, first of all, I use those metaphors all the time, and I also feel, you know, I could close my eyes and feel that my eyes are closed and open them and feel that they’re open. And I definitely take everything in in a very different way, even though the eyes don’t actually do the scientific aspect of what they’re designed to do.

Host: As we close, I always ask my guests to offer some parting advice to our listeners whether that be in the form of inspiration for future research or challenges that remain to be solved or personal advice on next steps along the career path, whether you have a guide dog with you or Soundscape… What would you say to your 25-year old self if you were just getting started in this arena?

Amos Miller: I honestly would say, get real life experience. Especially in the areas that you are passionate about. Be passionate about them with even more energy and see the work that you do in the context of what you are passionate about. Because you can only really apply your personal experiences to what you do. It’s so great here, in Microsoft Research, to see the interns coming here in the summer. And the creativity and passion, and new perspectives that they bring to our work here. And there’s a little bit of a side of me that worried they’ll jump into the job before they went out and explored the world. And I think it’s important that they find a way to do something that gives them that meaningful context to the work that they’ll be doing here.

(music plays)

Host: Amos Miller, thank you for joining us today. It’s been – can I say it? – an eye-opening experience!

Amos Miller: Sure. My pleasure. Thanks so much for having me.

To learn more about Amos Miller and the latest innovations in audio, sound and accessibility technology, visit


First TextWorld Problems: Microsoft Research Montreal’s latest AI competition is really cooking

This week, Microsoft Research threw down the gauntlet with the launch of a competition challenging researchers around the world to develop AI agents that can solve text-based games. Conceived by the Machine Reading Comprehension team at Microsoft Research Montreal, the competition—First TextWorld Problems: A Reinforcement and Language Learning Challenge—runs from December 8, 2018 through May 31, 2019.

First TextWorld Problems is built on the TextWorld framework, which was released to the public in July 2018. TextWorld is an extensible sandbox learning environment for reinforcement learning in text-based games. Beyond game simulation, it has the capacity to generate games stochastically from a user-specified distribution. Such a distribution of games opens new possibilities for the study of generalization and continual or meta-learning in a reinforcement learning setting, by enabling researchers to train and test agents on distinct but related games. TextWorld’s generator gives fine control over game parameters like the size of the game world, the branching factor and length of quests, the density of rewards, and the stochasticity of transitions. Game vocabulary can also be controlled; this directly affects the action and observation spaces. Researchers can also use TextWorld to handcraft games that test for specific knowledge and skills.

The theme for First TextWorld Problems is gathering ingredients to cook a recipe. Agents must determine the necessary ingredients from a recipe book, explore the house to gather ingredients, and return to the kitchen to cook up a delicious meal. Additionally, agents will need to use tools like knives and frying pans. Locked doors and other obstacles along the way must be overcome. The necessary ingredients and their locations change from game to game, as does the layout of the house itself; agents cannot simply memorize a procedure in order to succeed.

Hang on … did someone change the floorplan in this house? Example house layouts generated by TextWorld.

While a simple cooking task may seem quotidian by human standards, it is still very difficult for AI. Observations and actions are all text-based (see the example below), so a successful agent must learn to understand and manipulate its environment through language, as well as to ground its language in the environmental dynamics. It must also deal with classic, open reinforcement learning problems like partial observability and sparse rewards.

An example of a text-based cooking game whipped up in the TextWorld framework kitchen.
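
To make the text-in, text-out interaction loop concrete, here is a minimal sketch of a two-room cooking game. It is a hypothetical toy written for illustration and does not use the TextWorld API; the class name, room names, and commands are all assumptions. It shows partial observability (the agent only sees the room it is in) and a sparse reward (only the final cooking step pays off).

```python
class TinyCookingGame:
    """Hypothetical two-room text game; not part of the TextWorld framework."""

    def __init__(self):
        self.room = "kitchen"
        self.inventory = set()
        self.done = False

    def observe(self):
        # Partial observability: only the current room is described.
        if self.room == "kitchen":
            return "You are in the kitchen. There is a stove here. A door leads east."
        carrot = "" if "carrot" in self.inventory else " A carrot lies on the shelf."
        return "You are in the pantry." + carrot + " A door leads west."

    def step(self, command):
        """Apply a text command; return (observation, reward, done)."""
        reward = 0
        if command == "go east" and self.room == "kitchen":
            self.room = "pantry"
        elif command == "go west" and self.room == "pantry":
            self.room = "kitchen"
        elif command == "take carrot" and self.room == "pantry":
            self.inventory.add("carrot")
        elif (command == "cook carrot" and self.room == "kitchen"
              and "carrot" in self.inventory):
            # Sparse reward: only completing the recipe is rewarded.
            reward = 1
            self.done = True
        # Unrecognized or impossible commands simply leave the state unchanged.
        return self.observe(), reward, self.done


game = TinyCookingGame()
print(game.observe())
for command in ["go east", "take carrot", "go west", "cook carrot"]:
    observation, reward, done = game.step(command)
    print(f"> {command}\n{observation} (reward={reward})")
```

An agent in the real competition faces the same loop, but with generated layouts, a far larger vocabulary, and no hand-written command list to follow.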

We hope this competition fosters research into generalization across tasks, meta-learning, zero-shot language understanding, common-sense reasoning, efficient exploration, and effective handling of combinatorial action spaces. The winning team will be awarded a prize of $2,000 USD and an exclusive one-hour discussion session with a Microsoft Research researcher, and will be featured in a Microsoft Research blog post and in an accompanying article in the Microsoft Research Newsletter. (Some restrictions apply; please check the competition rules and regulations for details.)

Did we pique your interest? We encourage everyone to put their reinforcement learning prowess, and culinary talents, to the test in First TextWorld Problems. Sign up today!


Getting into the groove: New approach encourages risk-taking in data-driven neural modeling

Microsoft Research’s Natural Language Processing group has set an ambitious goal for itself: to create a neural model that can engage in the full scope of conversational capabilities, providing answers to requests while also bringing the value of additional information relevant to the exchange and—in doing so—sustaining and encouraging further conversation.

Take the act of renting a car at the airport, for example. Across from you at the counter is the company representative, entering your information into the system, checking your driver’s license, and the like. If you’re lucky, the interaction isn’t merely a robotic back-and-forth; there is a social element that makes the mundane experience more enjoyable.

“They might ask you where you’re going, and, you say the Grand Canyon. As they’re typing, they’re saying, ‘The weather’s beautiful out there today; it looks gorgeous,’” explained Microsoft Principal Researcher and Research Manager Bill Dolan. “We’re aiming for that kind of interaction, where pleasantries that are linked to the context, even if it’s a very task-oriented context, are not just appropriate, but in many situations, making the conversation feel fluid and human.”

As is the case with many goals worth pursuing, there are obstacles. Existing end-to-end data-driven neural networks have proven highly effective in generating conversational responses that are coherent and relevant, and Microsoft has been at the forefront of this rapid progress; it was the first to publish on data-driven approaches to modeling conversational responses, back in 2010. But these neural models present two particularly large challenges. First, they tend to produce very bland, vague outputs, which are hallmarks of stale conversation and nonstarters if the goal is user engagement beyond the completion of singular tasks. Second, they take a top-level either-or approach, classifying inputs as either task-oriented or conversational and assigning to each a specific path in the code base that fails to account for the nuances of the other. The result? Responses to more sophisticated conversation are often varied but uninformative (for example, “I haven’t a clue” and “I couldn’t tell you”) or informative but not specific enough (such as “I like music” versus “I like jazz”), a consequence of traditional generation strategies that try to maximize the likelihood of the response.

The paper the team is presenting at the 2018 Conference on Neural Information Processing Systems (NeurIPS)—“Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization”—tackles the former challenge, introducing a new approach to producing more engaging responses that was inspired by the success of adversarial training techniques in such areas as image generation.

“Ideally, we would like to have the systems generate informative responses that are relevant and fully address the input query,” said leading author Yizhe Zhang. “By the same token, we also would like to promote responses that are more varied and less conventionally predictable, something that would help make conversations seem more natural and humanlike.”

“This work is focused on trying to force these modeling techniques to innovate more and not be so boring, to not be the person you’re desperately trying to avoid at the party,” added Dolan.

The force of two major algorithmic components

To accomplish this, the team determined it needed to generate responses that reduce the uncertainty of the query. In other words, the system needed to be better able to guess from the response what the original query might have been, reducing the chance that the system would produce bland outputs such as “I don’t know.”
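
The intuition that a good response should let you guess the query can be illustrated with mutual information. The sketch below is a toy calculation, not the training objective from the paper: the two query/response distributions are made up for illustration. It shows that a policy that always answers “I don’t know” shares zero bits with the query, while query-specific answers share one full bit.

```python
import math


def mutual_information(joint):
    """Mutual information in bits for a dict {(query, response): probability}."""
    p_query, p_response = {}, {}
    for (q, r), p in joint.items():
        p_query[q] = p_query.get(q, 0.0) + p
        p_response[r] = p_response.get(r, 0.0) + p
    return sum(
        p * math.log2(p / (p_query[q] * p_response[r]))
        for (q, r), p in joint.items()
        if p > 0
    )


# Hypothetical toy data: two equally likely queries.
bland = {
    ("How is the weather?", "I don't know"): 0.5,
    ("What music do you like?", "I don't know"): 0.5,
}
specific = {
    ("How is the weather?", "It looks gorgeous out there"): 0.5,
    ("What music do you like?", "I like jazz"): 0.5,
}

mi_bland = mutual_information(bland)
mi_specific = mutual_information(specific)
print(f"bland: {mi_bland:.1f} bits, specific: {mi_specific:.1f} bits")
```

In the bland case the response reveals nothing about which query was asked; in the specific case the response identifies the query exactly.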

In the paper, Zhang, Dolan, and their collaborators introduce adversarial information maximization (AIM). Designed to train end-to-end neural response generation models that produce conversational responses that are both informative and diverse, this new approach combines two major algorithmic components: generative adversarial networks (GANs) to encourage diversity and a variational information maximization objective (VIMO) to produce informative responses.

“This adversarial training technique has received great success in generating very diverse and realistic-looking synthetic data when it comes to image creation,” said Zhang, who began this work as a Microsoft Research intern while at Duke University and is now a researcher with the company. “It’s been less explored in the text domain because of the discrete nature of text, and we were inspired to see how it could help with natural language processing, especially in dialogue generation.”

GANs themselves are increasingly deployed in neural response generation and commonly use synthetic data during training. Equilibrium for the GAN objective is achieved when the synthetic data distribution matches the real data distribution. This has the effect of discouraging the generation of responses that demonstrate less variation than human responses. While this may help reduce the level of blandness, the GAN technique was not developed for the purpose of explicitly improving either informativeness or diversity. That is where VIMO comes in.

Going backward to move forward

The team trained a backward model that generates the query, or source, from the response, or target. The backward model is then used to guide the forward model—from query to response—to generate relevant responses during training, providing a principled approach to mutual information maximization. This work is the first application of a variational mutual information objective in text generation.

The authors also employed a dual adversarial objective that composes both source-to-target and target-to-source objectives. The dual objective requires the forward and backward model to work synergistically, and each improves the other.

To mitigate the well-known instability in training GAN models, the authors—inspired by the deep structured similarity model—applied an embedding-based discriminator rather than the binary classifier that is conventionally used in GAN training. To reduce the variance of gradient estimation, they used a deterministic policy gradient algorithm with a discrete approximation strategy.

The paper advances the team’s focus on improving the ranking of candidate hypotheses to push the system to take more risks and produce more interesting outputs.

“In ranking the candidate hypotheses, you might have hundreds and thousands of hypotheses that it’s trying to weigh, and the very top-ranked ones might be these really bland-type ones,” explained Dolan. “If you look down at candidate No. 2,043, it might have a lot of content words, but be wrong and completely odd in context even though it’s aggressively contentful. Go down a little farther, and maybe you find a candidate that’s contentful and appropriate in context.”

Persona non grata

Solving the fundamental problem of uninteresting and potentially uninformative outputs in today’s modeling techniques is an important pursuit, as it’s a significant obstacle in creating conversational agents that individuals will want to engage with regularly in their everyday lives. Without interesting and useful outputs, conversations, task-oriented or not, will quickly spiral into the trivial unless the user is continuously voicing keywords. In that way, current neural models are very reactive, requiring a lot of work from the user, and that can be frustrating and exhausting.

“It’s not that tempting to engage with these agents even though they sound, superficially, fluent as if they understand you, because they tend not to innovate in the conversation,” said Dolan.

Conversation generation stands to gain a lot from this work, but so do other tasks involving language and neural models, such as video and photo captioning or text summarization, let’s say of a spreadsheet you’re working in.

“You don’t want a generated spreadsheet caption that is just, ‘Lines are going up. Numbers are all over the place,’” said Dolan. “You actually need it to be contentful and tie to the context in interesting ways, and that’s at odds with the tendency of current neural modeling techniques.”

The team can envision a future in which exchanges with conversational agents are comparable to those with friends, an exploratory process in which you’re asking for an opinion, unsure of where the conversation will lead.

“You can use our system to improve that, to produce more engaging and interesting dialogue; that’s what this is all about,” said Zhang.


Microsoft Simple Encrypted Arithmetic Library goes open source

Today we are extremely excited to announce that our Microsoft Simple Encrypted Arithmetic Library (Microsoft SEAL), an easy-to-use homomorphic encryption library developed by researchers in the Cryptography Research group at Microsoft, is open source on GitHub under an MIT License for free use. The library has already been adopted by Intel to implement the underlying cryptography functions in HE-Transformer, the homomorphic encryption back end to its neural network compiler nGraph.

As we increasingly move our data to the cloud, there is a clear concern that arises: How can we balance convenience and privacy? We all love to get practical guidance on how to, for example, maximize our investments, improve our workouts, or reach our destinations as efficiently as possible. In exchange, we share personal information with service providers because we have few other options. With traditional encryption schemes, it is impossible to run any computation on encrypted data. So either we store our data encrypted in the cloud and download it to perform any useful operations, which can be logistically inconvenient, or we provide the decryption key to service providers, risking our privacy. Until now. Homomorphic encryption, which allows processing of encrypted data, gives us the ability to use these services without exposing our private information.
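
To show what “processing of encrypted data” means in practice, here is a toy sketch of an additively homomorphic scheme. Note the assumptions: it uses the textbook Paillier cryptosystem with tiny hard-coded primes, which is a different and much simpler scheme than the ones implemented in Microsoft SEAL, and it is completely insecure at this key size. The function names are ours, chosen for illustration.

```python
import math
import random

# Toy Paillier cryptosystem with tiny primes. Illustrative and insecure;
# this is NOT the scheme or the API of Microsoft SEAL.
P, Q = 61, 53
N = P * Q
N_SQ = N * N
G = N + 1                                              # standard simple generator
LAMBDA = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # lcm(P - 1, Q - 1)
MU = pow(LAMBDA, -1, N)                                # modular inverse (Python 3.8+)


def encrypt(message):
    """Encrypt an integer 0 <= message < N using fresh randomness."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(G, message, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def decrypt(ciphertext):
    """Recover the plaintext using the private values LAMBDA and MU."""
    x = pow(ciphertext, LAMBDA, N_SQ)
    return ((x - 1) // N * MU) % N


def add_encrypted(c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod N)."""
    return (c1 * c2) % N_SQ


# The "cloud" adds two numbers it never sees in the clear.
encrypted_sum = add_encrypted(encrypt(12), encrypt(30))
assert decrypt(encrypted_sum) == 42
```

Only the holder of the private key can decrypt the result, yet the addition itself needed nothing but the two ciphertexts.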

In 2015, Microsoft Research released the first version of Microsoft SEAL with the specific goal of providing a well-engineered and documented homomorphic encryption library, free of external dependencies, that would be easy for both cryptography experts and novice practitioners to use. In 2016, we demonstrated CryptoNets, showing that deep learning on homomorphically encrypted data is indeed feasible, revolutionizing our approach to responsible AI.

Now, homomorphic encryption is ready to be standardized, and Microsoft, other industry leaders, academic institutions, and government agencies are actively working toward this goal. This is the right moment to put our library in the hands of every developer, so we can work together for more secure, private, and trustworthy computing.

In addition to having no external dependencies, Microsoft SEAL is written in standard C++, making it easy to compile in many different environments. We are looking forward to engaging with the open-source community in continuing to develop our library. If you are interested, we warmly invite you to join us on GitHub or to participate in discussions on StackOverflow tag-SEAL.


Minimizing trial and error in the drug discovery process

In 1928, Alexander Fleming accidentally let his petri dishes go moldy, a mistake that would lead to the breakthrough discovery of penicillin and save the lives of countless people. From these haphazard beginnings, the pharmaceutical industry has grown into one of the most technically advanced and valuable sectors, driven by incredible progress in chemistry and molecular biology. Nevertheless, a great deal of trial and error still exists in the drug discovery process. With an estimated space of 10^60 small organic molecules that could be tried and tested, it is no surprise that finding useful compounds is difficult and that the process is full of costly dead ends and surprises.

The challenge of molecule design also lies at the heart of many applications outside pharmacology, including in the optimization of energy production, electronic displays, and plastics. Each of these fields has developed computational methods to search through molecular space and pinpoint useful leads that are followed up in the lab or in more detailed physical simulations. As a result, there are now vast libraries of molecules tagged with useful properties. The abundance of data has encouraged researchers to turn to data-driven approaches to reduce the degree of trial and error in chemical development, and the aim of our paper being presented at the 2018 Conference on Neural Information Processing Systems (NeurIPS) is to investigate how recent advances, specifically in deep learning techniques, could help harness these libraries for new molecular design tasks.

Deep learning with molecular data

Figure 1: The chemical structure of naturally occurring penicillin (penicillin G) and its representation as a graph in a GGNN. The messages passed in the environment of a single node are shown as curved arrows, and the neural networks that transform the messages are shown as small squares. Repeated rounds of message passing allow each node to learn about its surroundings (gray circles).

Deep learning methods have revolutionized a range of applications requiring understanding or generation of unstructured data such as pictures, audio, and text from large datasets. Applying similar methods to organic molecules poses an interesting challenge because molecules contain a lot of structure that is not easy to concisely capture with flat text strings or images (although some schemes do exist). Instead, organic chemists typically represent molecules as a graph where nodes represent atoms and edges represent covalent bonds between atoms. Recently, a class of methods that have collectively become known as neural message passing has been developed precisely to handle the task of deep learning on graph-structured data. The idea of these methods is to encode the local information, such as which element of the periodic table a node represents, into a low-dimensional vector at each node and then pass these vectors along the edges of the graph to inform each node about its neighbors (see Figure 1). Each message is channeled through small neural networks that are trained to extract and combine information to update the destination node’s vector representation to be informative for the downstream task. The message passing can be iterated to allow each node to learn about its more distant neighbors in the graph. Microsoft Research developed one of the earliest variants of this class of deep learning models—the gated graph neural network (GGNN). Microsoft’s primary application focus for GGNNs is in the Deep Program Understanding project, where they are used to analyze program source code (which can also be represented using graphs). Exactly the same underlying techniques are applicable to molecular graphs.
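
The message-passing idea can be sketched in a few lines. The example below is a deliberately simplified stand-in, assuming untrained sum aggregation in place of the learned networks and gated updates that a GGNN would use. The molecule (the heavy atoms of ethanol, C-C-O) and the one-hot element encoding are our own illustrative choices.

```python
def message_passing_round(node_vectors, edges):
    """One round: every node adds the vectors of its bonded neighbors to its own.

    A real GGNN would transform messages with trained networks and apply a
    gated update; plain summation is used here only to show information flow.
    """
    dim = len(node_vectors[0])
    incoming = [[0.0] * dim for _ in node_vectors]
    for u, v in edges:  # bonds are undirected, so messages flow both ways
        for k in range(dim):
            incoming[v][k] += node_vectors[u][k]
            incoming[u][k] += node_vectors[v][k]
    return [
        [h + m for h, m in zip(node_vectors[i], incoming[i])]
        for i in range(len(node_vectors))
    ]


# Heavy atoms of ethanol: C(0) - C(1) - O(2). One-hot features: [is_C, is_O].
initial = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
bonds = [(0, 1), (1, 2)]

after_one = message_passing_round(initial, bonds)
after_two = message_passing_round(after_one, bonds)

# After one round the two carbons differ, because only atom 1 is bonded to O.
print(after_one[0], after_one[1])
# After two rounds atom 0 has also "heard about" the oxygen two bonds away.
print(after_two[0])
```

The first carbon’s oxygen component stays at zero after one round and becomes nonzero after two, which is the sense in which iterating lets each node learn about more distant neighbors.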

Generating molecules

Figure 2: Example molecules generated by our system after being trained on organic solar cell molecules (CEP database).

Broadly speaking, there are two types of questions that a machine learning system could try to solve in molecule design tasks. First, there are discriminative questions of the following form: What is the property Y of molecule X? A system trained to answer such questions can be used to compare given molecules by predicting their properties from their graph structure. Second, there are generative questions—what is the structure of molecule X that has the optimum property Y?—that aim to invent structures that are similar to molecules seen during training but that optimize for some property. The new paper concentrates on the latter, generative question; GGNNs have already shown great promise in the discriminative setting (for example, see the code available here).

The basic idea of the generative model is to start with an unconnected set of atoms and some latent “specification” vector for the desired molecule and gradually build up molecules by asking a GGNN to inspect the partial graph at each construction step and decide where to add new bonds to grow a molecule satisfying the specification. The two key challenges in this process are ensuring the output of chemically stable molecules and designing useful low-dimensional specification vectors that can be decoded into molecules by the generative GGNN and are amenable to continuous optimization techniques for finding locally optimal molecules.

For the first challenge, there are many chemical rules that dictate whether a molecular structure is stable. The simplest are the valence rules, which dictate how many bonds an element can make in a molecule. For example, carbon atoms have a valency of four and oxygen a valency of two. Inferring these known rigid rules from data and learning to never violate them in the generative process is a waste of the neural network’s capacity. Instead, in the new work, we simply incorporate known rules into the model, leaving the network free to discover the softer trends and patterns in the data. This approach allows injection of domain expertise and is particularly important in applications where there is not enough data to spend on relearning existing knowledge. We believe that combining this domain knowledge and machine learning will produce the best methods in the future.
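
One way to picture hard-coding valence rules is as a mask over the generator’s choices: before the network picks where to add a bond, candidates that would exceed an atom’s valence are simply removed. The sketch below is an assumption about how such a mask could look, not the implementation from the paper. The carbon and oxygen valences come from the text above; the nitrogen and hydrogen values are standard chemistry added for illustration, and all function names are hypothetical.

```python
# Maximum number of bonds per element. C and O are from the text above;
# N and H are standard values added for illustration.
MAX_VALENCE = {"C": 4, "O": 2, "N": 3, "H": 1}


def used_valence(atom_index, bonds):
    """Sum of bond orders already attached to one atom."""
    return sum(order for (i, j), order in bonds.items() if atom_index in (i, j))


def allowed_bond_orders(atoms, bonds, i, j):
    """Bond orders (1 = single, 2 = double, 3 = triple) that keep both atoms valid."""
    if i == j or (i, j) in bonds or (j, i) in bonds:
        return []
    free_i = MAX_VALENCE[atoms[i]] - used_valence(i, bonds)
    free_j = MAX_VALENCE[atoms[j]] - used_valence(j, bonds)
    return [order for order in (1, 2, 3) if order <= min(free_i, free_j)]


# Partial graph: a carbon double-bonded to one oxygen, plus an unattached oxygen.
atoms = ["C", "O", "O"]
bonds = {(0, 1): 2}

print(allowed_bond_orders(atoms, bonds, 0, 2))  # carbon still has two free slots
print(allowed_bond_orders(atoms, bonds, 1, 2))  # the first oxygen is saturated
```

With the mask in place, the network never has to learn that a saturated oxygen cannot take another bond; that option is never offered.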

Figure 3: Example molecule optimization trajectory when optimizing the quantitative estimate of drug-likeness (QED) of a molecule after training on the ZINC database. The initial molecule has a QED of 0.4, and the final molecule has a QED of 0.9.

For the second challenge, we used an architecture known as a variational autoencoder to discover a space of meaningful specification vectors. In this architecture, a discriminative GGNN is used to predict some property Y of a molecule X, and the internal vector representations in this discriminative GGNN are used as the specification vector for a generative GGNN. Since these internal representations contain information about both the structure of molecule X and the property Y, continuous optimization methods can be used to find the representation that optimizes property Y; the representation is then decoded to find useful molecules. Example molecules generated by the new system are shown in Figures 2 and 3.
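
The “continuous optimization” step can be pictured as gradient ascent on the specification vector. The sketch below assumes a made-up, hand-written property function standing in for the trained predictor, and it stops before decoding, since the decoder is the generative GGNN itself. Every name here is hypothetical.

```python
def predicted_property(z):
    """Hypothetical stand-in for a trained property predictor over latent vectors.

    Its maximum is at z = (1, -2); a real predictor would be a neural network.
    """
    return -((z[0] - 1.0) ** 2) - ((z[1] + 2.0) ** 2)


def numerical_gradient(f, z, eps=1e-5):
    """Central-difference estimate of the gradient of f at z."""
    grad = []
    for k in range(len(z)):
        up, down = list(z), list(z)
        up[k] += eps
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def optimize_latent(f, z_start, step=0.1, iterations=200):
    """Plain gradient ascent: repeatedly move z in the direction that increases f."""
    z = list(z_start)
    for _ in range(iterations):
        gradient = numerical_gradient(f, z)
        z = [zk + step * gk for zk, gk in zip(z, gradient)]
    return z


z_start = [0.0, 0.0]  # latent vector of some starting molecule (made up)
z_best = optimize_latent(predicted_property, z_start)
print(predicted_property(z_start), predicted_property(z_best))
```

In the full system, the optimized vector would then be handed to the generative model to be decoded into a candidate molecule.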

Collaborating with experts

The results in the paper are very promising on simple molecule design tasks. However, deep learning methods for molecule generation are still in their infancy, and real-world molecule design is a very complicated process with many different objectives to consider, such as molecule efficacy, specificity, side effects, and production costs. To make significant further progress will require collaboration of machine learning experts and expert chemists. One of the main aims of this paper is to showcase the basic capabilities of deep learning in this space and thereby act as a starting point for dialogue with chemistry experts to see how these methods could enhance their productivity and have the most impact.


Princeton and Microsoft collaborate to tackle fundamental challenges in microbiology

Princeton University has teamed up with Microsoft to collaborate on the leading edge of microbiology and computational modelling research.   

In this project, Microsoft is helping Princeton to better understand the mechanisms of biofilm formation by providing advanced technology that will greatly extend the types of research analysis possible today. Biofilms, surface-associated communities of bacteria, are the leading cause of microbial infection worldwide and kill as many people as cancer does. They are also a leading cause of antibiotic resistance, a problem highlighted by the World Health Organization as “a global crisis that we cannot ignore.” Understanding how biofilms form could enable new strategies to disrupt them.

Ned Wingreen, the Howard A. Prior Professor in the Life Sciences and professor of molecular biology and the Lewis-Sigler Institute for Integrative Genomics.

To support Princeton, a Microsoft team led by Dr. Andrew Phillips, head of the Biological Computation group at Microsoft Research, will be working closely with Bonnie Bassler, a global pioneer in microbiology who is the Squibb Professor in Molecular Biology and chair of the Department of Molecular Biology at Princeton and a Howard Hughes Medical Institute Investigator, and with Ned Wingreen, the Howard A. Prior Professor in the Life Sciences and professor of molecular biology and the Lewis-Sigler Institute for Integrative Genomics.

Using the power of Microsoft’s cloud and advanced machine learning, Princeton will be able to study different strains of biofilms in new ways to better understand how they work. Microsoft is contributing a cloud-based prototype that can be used for biological modelling and experimentation that will be deployed at Princeton. This work combines programming languages and compilers, which generate biological protocols that can be executed using lab automation technology. It allows experimental data to be uploaded to the cloud where it can be analyzed at scale using advanced machine learning and data analysis methods, to generate biological knowledge. This in turn informs the design of subsequent experiments, to provide insight into the mechanisms of biofilm formation. Princeton is contributing world-leading expertise in experiments and modelling of microbial biofilms.  

“This collaboration enables us to bring together advances in computing and microbiology in powerful new ways,” said Brad Smith, president of Microsoft. “This partnership can help us unlock answers that we hope someday may help save millions of people around the world.”

“By combining our distinctive strengths, Princeton and Microsoft will increase our ability to make the discoveries needed to improve lives and serve society,” said Christopher L. Eisgruber, president of Princeton University. “Technology is creating new possibilities for collaboration, and we hope this venture will inspire other innovative partnerships in the years ahead.”

Pablo Debenedetti, Princeton’s dean for research, said: “We are delighted to be collaborating with Microsoft to advance scientific innovation with this new project, investigating the fundamentals that underlie urgent biomedical problems. Doing cutting-edge research that helps define the boundaries of knowledge and that could ultimately benefit society at large is what we strive for at Princeton.”

Princeton’s relationship with Microsoft is one of the University’s most extensive with industry, spanning collaborations in computer science, cybersecurity and now biomedical research.

As a global research university and leader in innovation, Princeton University cultivates mutually beneficial relationships with companies to support the University’s educational, scientific and scholarly mission. The University is guided by the principle that initiatives to fortify and connect with the innovation ecosystem will advance Princeton’s role as an internationally renowned institution of higher education and accelerate its ability to have greater impact in the world. 


MARLÖ competition challenges researchers to build AI agents that collaborate — and compete to win


With the latest Project Malmo competition, we’re calling on researchers and engineers to test the limits of their thinking as it pertains to artificial intelligence, particularly multi-task, multi-agent reinforcement learning. Last week, a group of attendees at the 14th Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE’18) participated in a one-day workshop featuring the competition, exchanging ideas on the unique challenges of the research area with some of the field’s leading minds.

Learning to Play: The Multi-Agent Reinforcement Learning in MalmÖ (MARLÖ) Competition requires participants to design learning agents capable of collaborating with or competing against other agents to complete tasks of varying difficulty across 3D games. It is the second competition affiliated with Project Malmo, which is an open-ended platform designed for artificial intelligence experimentation. Last year’s Malmo Collaborative AI Challenge yielded a diversity and creativity in approach that exceeded expectations, and we look forward to the same this time around.

The competition, co-hosted by Microsoft, Queen Mary University of London, and CrowdAI, is open to participants worldwide through December 31 (submit your entries here).

Sam Devlin, a game AI researcher from the Machine Intelligence and Perception group at Microsoft Research Cambridge, organized the MARLÖ AIIDE 2018 Workshop in collaboration with our academic partners Diego Perez-Liebana of Queen Mary and Sharada Mohanty of École Polytechnique Fédérale de Lausanne, Switzerland.

The workshop included keynote addresses from two distinguished speakers and a short tutorial on MARLÖ that allowed attendees to experiment with competition agents. There was also a series of short contributed talks and a panel session to encourage attendees to share ideas around the application of reinforcement learning in modern commercial video games.

Jesse Cluff, Principal Engineering Lead, The Coalition

The first keynote speaker was Jesse Cluff, Principal Engineering Lead with The Coalition. Jesse has more than 20 years of experience in the industry, working on many exciting game titles, including Jackie Chan Stuntmaster, The Simpsons: Hit & Run, Bully, and Gears of War 4. During the workshop, he explored two aspects of game AI—the hardware side in discussing how we run programs in real time with limited resources and the emotional side in discussing how we maximize the enjoyment of players while controlling difficulty. He also talked about how AI techniques are actually used in commercial game products and the challenges he’s facing that still need further research.

Martin Schmid, Research Scientist, DeepMind

Martin Schmid, a research scientist with DeepMind, was the second keynote speaker. He is the lead author of DeepStack, the first computer program to outplay human professionals at heads-up no-limit Texas Hold’em poker, and he spoke about the program as an example of how successful AI methods used in complex games of perfect information like Go can advance AI application in imperfect-information games like poker. The work has huge practical significance since we regularly have to deal with imperfect information in the real world. These two keynotes were inspiring for faculty, researchers, and graduate students in attendance.

From left: Mobchase, Buildbattle, and Treasurehunt

The workshop also featured the MARLÖ competition’s kickoff tournament. Agents of the participating teams competed in a round robin to achieve the highest scores across three different games—Mobchase, Buildbattle, and Treasurehunt. At the end of the day, we announced the rankings of the enrolled teams. The top three eligible teams will each be presented with the Progress Award, a travel grant worth up to $2,500 for use toward a relevant conference at which they can publish their competition results. The MARLÖ competition is open until December 31, after which the final tournament will be held offline. We hope to see more participants join.


Podcast: Hearing in 3D with Dr. Ivan Tashev

Partner Software Architect, Dr. Ivan Tashev

Episode 50, November 14, 2018

After decades of research in processing audio signals, we’ve reached the point of so-called performance saturation. But recent advances in machine learning and signal processing algorithms have paved the way for a revolution in speech recognition technology and audio signal processing. Dr. Ivan Tashev, a Partner Software Architect in the Audio and Acoustics Group at Microsoft Research, is no small part of the revolution, having both published papers and shipped products at the forefront of the science of sound.

On today’s podcast, Dr. Tashev gives us an overview of the quest for better sound processing and speech enhancement, tells us about the latest innovations in 3D audio, and explains why the research behind audio processing technology is, thanks to variations in human perception, equal parts science, art and craft.


Episode Transcript

Ivan Tashev: You know, humans, they don’t care about mean square error solution or maximum likelihood solution, they just want the sound to sound better. For them. And it’s about human perception. That’s one of the very tricky parts in audio signal processing.

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: After decades of research in processing audio signals, we’ve reached the point of so-called performance saturation. But recent advances in machine learning and signal processing algorithms have paved the way for a revolution in speech recognition technology and audio signal processing. Dr. Ivan Tashev, a Partner Software Architect in the Audio and Acoustics Group at Microsoft Research, is no small part of the revolution, having both published papers and shipped products at the forefront of the science of sound.

On today’s podcast, Dr. Tashev gives us an overview of the quest for better sound processing and speech enhancement, tells us about the latest innovations in 3D audio, and explains why the research behind audio processing technology is, thanks to variations in human perception, equal parts science, art and craft. That and much more on this episode of the Microsoft Research Podcast.

Host: Ivan Tashev, welcome to the podcast.

Ivan Tashev: Thank you.

Host: Great to have you here. You’re a Partner Software Architect in the Audio and Acoustics Group at Microsoft Research, so, in broad strokes, tell us about your work. What gets you up in the morning, what questions are you asking, what big problems are you trying to solve?

Ivan Tashev: So, in general, in Audio and Acoustics Research Group, we do audio signal processing. That includes enhancing of a captured sound by our microphones, better sound reproduction using binaural audio, so-called spatial audio. We do a lot of work in audio analytics, recognition of audio objects, recognition of the audio background. We design a lot of interesting audio devices. Our research ranges from applied research related to Microsoft products to a blue-sky research far from what is Microsoft business today.

Host: So, what’s the ultimate goal? Perfect sound?

Ivan Tashev: Hhhh… Perfect sound is a very tricky thing, because it is about human perception. And this is very difficult to be modeled using mathematical equations. So, the classic statistical signal processing was established in 1947 with a paper published by Norbert Wiener defining what we call, today, the Wiener Filtering. The approach is simple: you have a process, you make a statistical model, you define optimality criterion, make the first derivative, make it zero, voila! You have the analytical solution of the problem. The problem is that, you either have an approximate model, and find the solution analytically, or you have precise model which you cannot solve analytically. The other thing is the optimality criterion. You know, humans, they don’t care about mean square error solution or maximum likelihood solution, they just want the sound to sound better. For them. And it’s about human perception. That’s one of the very tricky parts in audio signal processing.
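
The recipe described here (model, optimality criterion, derivative, set to zero) can be illustrated with the textbook single-bin case. This is a sketch under assumed notation that does not appear in the conversation: $X$ is the clean signal, $N$ is zero-mean noise uncorrelated with $X$, $Y = X + N$ is what the microphone observes, $H$ is a real-valued gain, and $P_X$, $P_N$ are the signal and noise powers.

```latex
\begin{align*}
J(H) &= \mathbb{E}\big[\,|X - H Y|^2\,\big]
      = P_X - 2H\,P_X + H^2\,(P_X + P_N), \\
\frac{dJ}{dH} &= -2P_X + 2H\,(P_X + P_N) = 0
\quad\Longrightarrow\quad
H = \frac{P_X}{P_X + P_N}.
\end{align*}
```

The closed form exists only because the model is this simple and the criterion is mean square error, which is the limitation raised in the answer above.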

Host: So, where are we heading in audio signal processing, in the era of machine learning and neural networks?

Ivan Tashev: The machine learning and neural networks are capable to find the solution from the data without us making an approximate model. And this is the beauty of this whole application of machine learning in signal processing, and the reason why we achieve significantly better results than using statistical signal processing. Even more, we train the neural network using certain cost function and we can make the cost function to be even another neural network, trained on human perception for better audio which allows us to achieve better perception of a higher quality of the speech enhancement we do using neural network. I’m not saying that we should go in every single audio processing block using machine learning and neural networks. We have processing blocks which have a nice and clean analytical solution, and this runs fast and efficient, and they will remain the same. But in many cases, we operate with approximate models with not very natural optimality criteria. And then, this is where the machine learning shines. This is where we can achieve much better results and provide a higher quality of our output signal.

Host: One interesting area of research that you are doing is noise robust speech recognition. And this is where researchers are working to improve automatic speech recognition systems. So, what’s the science behind this and how are algorithms helping to clean up the signal?

Ivan Tashev: We are witnessing a revolution in speech recognition. The classic speech recognizer was based on so-called Hidden Markov Models or HMMs. And they served us quite well, but the revolution came when neural networks were implemented and trained to do speech recognition. My colleagues in the speech research group were the first to design a neural network-based speech recognition algorithm which instantly showed better results than the existing production HMM-based speech recognizer. The speech recognition engine has one channel input, while in audio processing, we can deal with multiple channels, so-called microphone arrays, and they give us a sense of spatiality. We can detect the direction where the sounds came from. We can enhance that sound. We can suppress sounds coming from other directions. And then provide this cleaner sound to the speech recognition engine. The microphone array processing technologies combined together with techniques like sound source localization and tracking and sound source separation allow us to even separate two simultaneously speaking humans in the conference room and feed two separate instances of the speech recognizer for meeting transcription.

Host: Are you serious?

Ivan Tashev: Yes, we can do that. Even more, the audio processing engine has more prior information. For example, the signal we send to the loudspeakers. And the goal of this engine is to remove the sound which is interfering for our sound. And this is also one of the oldest signal processing algorithms and every single speaker phone has it. But, in all instances, it has been implemented as a mono acoustic echo cancellation. In Microsoft, we were the first to design a stereo and surround sound echo canceller despite a paper written by the inventor of the acoustic echo cancellation himself, stating that stereo acoustic cancellation is not possible. And it’s relatively simple to understand: you have two channels between the left and the right speaker coming to one microphone, so you have one equation and two unknowns. And Microsoft released, as part of Kinect for Xbox, a surround sound echo cancellation engine. Not that we solved five unknowns from one equation, but we just found a workaround which was good enough for any practical purposes and allowed us to clean the surround sound coming from the Xbox to provide a cleaner sound to the speech recognition engine.

Host: So, did you write a paper and say, “Yes, it is possible, thank you very much!”?

Ivan Tashev: I did write a paper.

Host: Oh, you did!

Ivan Tashev: And it was rejected with the most crucial feedback from the reviewers I have ever seen in my career. It is the same to go to the French Academy of Sciences and to propose eternal engine. They have decided, since 18th century, not to discuss papers about that. When I received the rejection notice, I went downstairs in my lab, started the demo, listened to the output. Okay, it works! So, we should be fine!

(music plays)

Host: One thing that’s fascinated me about your work is the infamous anechoic chamber – or chambers, as I came to find out – at Microsoft, and one’s right here in Building 99, but there are others. And so, phrases like “the quietest place on earth” and “where sound goes to die” are kind of sensational, but these are really interesting structures and have really specific purposes which I was interested to find out about. So, tell us about these anechoic, or echo-free, chambers. How many are there here, how are they different from one another and what are they used for?

Ivan Tashev: So, the anechoic chamber is just a room insulated from the sounds outside. In our case, it’s a concrete cube which does not touch the building and sits on around half a meter of rubber to prevent vibrations from the street to come into the room. And internally, the walls, the ceiling and the floor are covered with sound absorption panels. This is pretty much it. What happens is that the sound from the source reaches the microphone, or the human ear, only using the direct path. There is no reflection from the walls and there is no other noise in the chamber. Pretty much that anechoic chamber simulates absence of a room. And it’s just an instrument for making acoustical measurements. What we do in the chamber is we measure the directivity patterns of microphones or radiation patterns of loudspeakers as they are installed in the devices we design. Initially, the anechoic chamber here, in Microsoft Building 99, the headquarters of Microsoft Research, was the only one in Microsoft. But with our engagement with product teams, it became overcrowded, and our business partners decided to build their own anechoic chambers. And there are, today, five in Microsoft Corporation. They all can perform the standard set of measurements, but all of them are a little bit different from each other. For example, the “Quietest Place in the Earth,” as recorded in the Guinness Book of Records, is the anechoic chamber in Building 88. And the largest anechoic chamber is in Studio B which allows making measurements with lower frequencies than in the rest of the chambers. In our chamber, in Building 99, it’s the only one in Microsoft which can allow human beings to stay prolonged amount of time in the chamber because we have air-conditioning connected to the chamber. It’s a different story how much effort it cost us to make the rumbling noise from the air conditioner not to enter the anechoic chamber. 
But this allowed us to do a lot of research on human spatial hearing in that chamber.

Host: So, drill in on that a little bit because, coming from a video production background, the air conditioner in a building is always the annoying part for the sound people. But you’ve got that figured out in the way that you situated the air conditioning unit and so on?

Ivan Tashev: To remove this rumbling sound from the air conditioner, we installed a gigantic filter which is under the floor of the entire equipment room. So, think about six by four meters floor and this is how we were able to reduce the sound from the air conditioning. Still, if you do a very precise acoustical measurement, we have the ability to switch it off.

Host: Okay. So, back to what you had said about having humans in this room for prolonged periods of time. I’ve heard that your brain starts to play tricks on you when you are in that quiet of a place for a prolonged period of time. What’s the deal there?

Ivan Tashev: OK. This is the human perception of the anechoic chamber. Humans, in general, are, I would say two and a half dimensional creatures. When we walk on the ground, we don’t have very good spatial hearing, vertically. We do much better horizontally. But also, we count on the first reflection from the ground to use it as a distance cue. When you enter the anechoic chamber, you subconsciously swallow, and this is a reaction because your brain thinks that there is a difference in the pressure between your inner ear and the atmosphere which presses the ear drums and you cannot hear anything.

Host: So that swallowing reaction is what you do when you’re in an airplane and the pressure actually changes. And you get the same perception in this room, but the pressure didn’t change.

Ivan Tashev: Exactly. But the problem in the room is that you cannot hear anything just because there is no sound in the chamber. And the other thing what happens is you cannot hear that reflection from the floor which is basically very hard-wired in our brains. We can distinguish two separate sounds when the distance between them is a couple of milliseconds. And when the sound source is far away, this difference between the direct path and the reflection from the ground is less than that. We hear this as one sound. We start to perceive those two as separate sounds when the sound source is closer than a couple of meters away… means two jumps. Then subconsciously alarm bells start to ring in our brain that, hey, there is a sound source less than two jumps away, watch out not to become the dinner! Or maybe this is the dinner!
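
The geometry behind this distance cue can be sketched with a little arithmetic. The snippet below is an illustration only: it assumes a flat, perfectly reflecting floor, a speed of sound of 343 m/s, and ear and source heights of 1.7 m, none of which come from the conversation. It uses the image-source trick (mirror the source below the floor) to get the length of the reflected path.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for room-temperature air


def reflection_delay_ms(distance_m, ear_height_m=1.7, source_height_m=1.7):
    """Extra arrival time of the floor reflection relative to the direct path.

    Heights are illustrative assumptions, not measured values.
    """
    # Direct path: straight line from source to ear.
    direct = math.hypot(distance_m, ear_height_m - source_height_m)
    # Reflected path: same as the straight line from the mirrored source.
    reflected = math.hypot(distance_m, ear_height_m + source_height_m)
    return (reflected - direct) / SPEED_OF_SOUND * 1000.0


for d in (1, 2, 5, 10, 20):
    print(f"{d:>3} m: {reflection_delay_ms(d):.2f} ms")
```

The delay shrinks steadily as the source moves away, so beyond some distance the direct sound and the reflection fuse into one. Where exactly that crossover falls depends on the assumed heights.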

Host: So, the progress, though, of what your brain does and what your hearing does inside the chamber for one minute, for ten minutes, what happens?

Ivan Tashev: So, there is no sound. And, the brain tries to acquire as much information as possible. And the situation when you don’t get information is called information deprival. You, first after a minute or so, start to hear a shhhhhh, which is actually the blood in the vessels of your ear. Then, after a couple of minutes, you start to hear your body sounds, your heartbeat, your breathing. And, under no other senses, eyes closed, no sound coming, literally you reach, after ten, fifteen minutes the stage of audio hallucinations. Our brains are pattern-matching machines, so sooner or later, the brain will start to recognize sounds you have heard somewhere in different places. We – people from my team – we have not reached that stage, simply because when you work there, the door is open, the tools are clanking, we have conversations, etcetera, etcetera. But maybe someday I will have to lay there and close my eyes and see, can I reach the hallucination stage?

(music plays)

Host: Well, let’s talk about the research behind Microsoft Kinect. And that’s been a huge driver of innovations in this field. Tell us how the legacy of research and hardware for Kinect led to progress in other areas of Microsoft.

Ivan Tashev: Kinect introduced us to new modalities in human-machine interfaces: voice and gesture. And it was a wildly successful product. Kinect entered the Guinness Book of Records for the fastest-selling electronic device in the history of mankind. Microsoft sold eight million devices in the first three months of the beginning of the production. Since then, most of the technologies in Kinect have been further developed. But even during the first year of Kinect, Microsoft released Kinect for Windows which allowed researchers from all over the globe to do things we even didn’t thought of. This is so-called Kinect Effect. We had more than fifty start-ups building their products using technologies from Microsoft Kinect. Today, most of them are further developed, enhanced, and are part of our products. I’ll give just two examples. The first is HoloLens. The device does not have a mouse or keyboard and the human-machine interface is built on three input modalities: gaze, gesture and voice. In HoloLens, we have a depth camera, quite similar to the one in Kinect, and we do gesture recognition using super-refined and improved algorithms, but they originate from the ones we had in Kinect. The second example is also HoloLens. HoloLens has four microphones, the same number as Kinect, and I would say that the audio enhancement pipeline for getting the voice of the person wearing the device is the granddaughter of the audio pipeline released in Kinect in 2010.

Host: Now let’s talk about one of the coolest projects you are working on. It’s the spatial audio or 3D audio. What’s your team doing to make the 3D audio experience a reality?

Ivan Tashev: In general, spatial audio or 3D audio is a technology that allows us to project audio sources in any desired position to be perceived by the human being wearing headphones. This technology is not something new. Actually, we have instances of it in mid-19th century, when two microphones and two rented telephone lines were used for stereo broadcasting of a theatrical play. Later, in the 20th century, there have been vinyl records marked to be listened with headphones because they were stereo recorded using a dummy head with two microphones in the ears. This technology did not fly because of two major deficiencies. The first is, you move your head left and right and the entire audio scene rotates with you. The second is that your brain may not exactly like the spatial cues coming from the microphones in the ear of the dummy head. And this is where we reach the topic of head-related transfer functions. Literally, if you have a sound source somewhere in the space, the sound from it reaches your left and right ear in a slightly different way. It can be modeled as two filters. And if you filter it through those two filters and play it through headphones, your brain will perceive the sound coming from that direction. If we know those pairs of filters for all directions around you, this is called head-related transfer functions. The problem is that they are highly individual. Head-related transfer functions are formed by the size and the dimensions of the head, the position of the ears, the fine structure of the pinna, the reflections from the shoulders. And we did a lot of research to find the way to quickly generate personalized head-related transfer functions. We put, in our anechoic chamber, more than four hundred subjects. We measured their HRTFs. We did a submillimeter precision scan of their head and torso, and we did measurement of certain anthropometric dimensions of those subjects. 
Today, we can just measure several dimensions of your head and generate your personalized head-related transfer function. We can do this even from a depth picture. Literally, you can tell how you hear from the way you look. And we polished this technology to the extent that in HoloLens, you have your spatial audio personalized without even knowing it. You put the device on and you hear through your own personalized spatial hearing.
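
The "two filters" idea can be sketched in a few lines: a mono source is convolved with a left and a right impulse response, and the two results are played over headphones. The impulse responses below are toy values invented for illustration (a pure delay and attenuation at the far ear). Real head-related impulse responses are measured or personalized and are much longer, and the function names here are hypothetical.

```python
def convolve(signal, kernel):
    """Direct-form FIR convolution; output has len(signal) + len(kernel) - 1 samples."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter one mono source through a left/right impulse-response pair."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy impulse responses for a source on the listener's left (assumed values):
# the right ear receives the sound 3 samples later and quieter.
hrir_left = [1.0, 0.0, 0.0, 0.0]
hrir_right = [0.0, 0.0, 0.0, 0.6]

mono = [1.0, 0.5, -0.25]
left, right = render_binaural(mono, hrir_left, hrir_right)
print(left)
print(right)
```

Rendering a source in a different direction means swapping in the filter pair for that direction, which is why a full set of pairs for all directions is needed.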

Host: How does that do that automatically?

Ivan Tashev: Silently, we measure certain anthropometrics of your head. Our engineering teams, our partners, decided that there should not be anything visible for generation of those personalized spatial hearing.

Host: So, if I put this on, say the HoloLens headset, it’s going to measure me on the fly?

Ivan Tashev: Mmm hmmm.

Host: And then the 3D audio will happen for me. Both of us could have the headset on and hear a noise in one of our ears that supposedly is coming from behind us, but really isn’t. It’s virtual.

Ivan Tashev: That’s absolutely correct. With the two loudspeakers in HoloLens or in your headphones, we can make you perceive the sound coming from above, from below, from behind. And this is actually the main difference between surround sound and 3D audio for headphones. Surround sound has five or seven loudspeakers, but they are all in one plane. So, surround audio world is actually flat. While with this spatial audio engine, we can actually render audio above and below which opens pretty much a new frontier in expressiveness of the audio, what we can do.

Host: Listen, as you talk, I have a vision of a bat in my head sending out signals and getting signals and echolocations and…

Ivan Tashev: We did that.

Host: What?

Ivan Tashev: We did that!

Host: Okay, tell.

Ivan Tashev: So, one of our projects – this is one of those more blue-sky research projects – is exactly about that. What we wanted to explore is using audio as echolocation in the same way the bats see in complete darkness. And we built a spherical loudspeaker array of eight transducers which sent ultrasound pulses towards given direction, and near it, an eight-element microphone array which, through the technology called beam forming, listens towards the same direction. With this, we utilized the energy of the loudspeakers well, and reduced the amount of sounds coming from other directions and this allows us to measure the energy reflected by the object in that direction. When you do the scanning of the space, you can create an image which is exactly the same as created from a depth camera using infrared light but with a fraction of the energy. The ultimate goal, eventually, will be to get the same gesture recognition with one tenth or one hundredth of the power necessary. This is important for all portable battery-operated devices.
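
Listening "towards the same direction" with a microphone array can be illustrated with delay-and-sum, the simplest beamformer. The conversation does not say which beamformer the project used, so this is a sketch under assumptions: eight microphones, whole-sample delays, and synthetic unit pulses standing in for echoes.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its steering delay (in samples) and average.

    channels: equal-length sample lists, one per microphone.
    delays:   per-microphone arrival delay, in whole samples, for the
              direction being listened to.
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for t in range(n - d):
            out[t] += samples[t + d]  # advance the channel by its delay
    return [v / len(channels) for v in out]


def pulse(n, at):
    """A length-n signal that is 1.0 at index `at` and 0.0 elsewhere."""
    x = [0.0] * n
    x[at] = 1.0
    return x


N_MICS, N = 8, 32
target_delays = list(range(N_MICS))                # echo from the look direction
interferer_delays = list(reversed(range(N_MICS)))  # sound from another direction

channels = []
for m in range(N_MICS):
    target = pulse(N, 5 + target_delays[m])
    interferer = pulse(N, 20 + interferer_delays[m])
    channels.append([a + b for a, b in zip(target, interferer)])

steered = delay_and_sum(channels, target_delays)
peak = max(steered)
print(peak, steered.index(peak))
```

After steering, the eight copies of the target pulse line up on one sample and add coherently, while the copies of the interferer land on different samples and are each scaled down by the number of microphones.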

Host: Yeah. Speaking of that, accessibility is a huge area of interest for Microsoft right now, especially here in Microsoft Research with the AI for Accessibility initiative. And it’s really revolutionizing access to technology for people with disabilities. Tell us how the research you’re doing is finding its way into the projects and products in the arena of accessibility.

Ivan Tashev: You know, accessibility finds a resonance among Microsoft employees. The first application of our spatial audio technology was actually not HoloLens. It was a project which was a kind of a grass roots project when Microsoft employees worked with a charity organization called Guide Dogs in United Kingdom. And from the name you can basically guess that they train guiding dogs for people with blindness. The idea was to use the spatial audio to help the visually impaired. Multiple teams in Microsoft Research, actually, have been involved to overcome a lot of problems, including my team, and this whole story ended up with releasing a product called Soundscape, which is a phone application which allows people with blindness to navigate easier where the spatial audio acts like a finger-pointer. When the system says, “And on the left is the department store,” actually that voice-prompt came from the direction where the department store is, and this is additional spatial cue which helps the orientation of the visually impaired people. Another interesting project we have been involved, also is a grass roots project. It was driven by a girl which was hearing-impaired. She initiated a project during one of the yearly hackathons. And the project was triggered by the fact that she was told by her neighbor that your CO2 alarm is beeping already a week. You have to replace the battery. So, we created a phone application which was able to recognize numerous sounds like CO2 alarm, fire alarm, door knock, phone ring, baby crying, etcetera, etcetera, and to signal the hearing-impaired person using vibration, or the display. And this is to help to navigate and to live a better life in our environment.

(music plays)

Host: You have an interesting personal story. Tell us a bit about your background. Where did you grow up, what got you interested in the work you are doing and how did you end up at Microsoft Research?

Ivan Tashev: I’m born in a small country in Southeast Europe called Bulgaria. I took my diploma in electronic engineering, and PhD in computer science from the Technical University of Sofia, and immediately after my graduation, started to work as a researcher there. In 1998, I was Assistant Professor in the Department of Electronic Engineering when Microsoft hired me, and I moved to Washington State. Spent two full shipping cycles in Microsoft engineering teams before, in 2001, to move in Microsoft Research. And what I have learned during those two shipping cycles actually helped me a lot to talk better with the engineers during the technology transfers I have done with Microsoft engineering teams.

Host: Yeah, and there’s quite a bit of tech transfer that’s coming out of your group. What are some examples of the things that have been “blue sky research” at the beginning that are now finding their way into millions of users’ desks and homes?

Ivan Tashev: I have been lucky enough to be a part of very strong research groups and to learn from masters like Anoop Gupta or Rico Malvar. My first project in Microsoft Research was called Distributed Meetings and we used that device to record meetings, to store them and to process them. Later, this device became a roundtable device which is part of many conference rooms worldwide. Then, I decided to generalize the microphone array support I designed for round table device and this became the microphone array support in Windows Vista. Next challenge was to make this speech enhancement pipeline to work even in more harsh conditions like the noisy car. And, I designed the algorithms and transferred them to the first speech-driven entertainment system in a mass-production car. And then, the story continues with Kinect, with HoloLens, many other products, and this is another difference between industrial research and academia. The satisfaction from your work is measurable. You know to how many homes your technology has been released, to how many people you changed the way they live, entertain or work.

Host: As we close, Ivan, perhaps you can give some parting advice to those of our listeners that might be interested in the science of sound, so to speak. What are the exciting challenges out there in audio and acoustics research, and what guidance would you offer would-be researchers in this area?

Ivan Tashev: So, audio processing is a very interesting area of research because it is a mixture between art, craft and science. It is science because we work with mathematical models and we have repetitive results. But it is an art because it’s about human perception. Humans have their own preferences and tastes, and this makes it very difficult to model with mathematical models. And it’s also a craft. There are always some small tricks and secret sauce which are not mathematical models but make the algorithms from one lab work much better than the algorithms from another lab. Into the mixture, we have to add the powerful innovation of machine learning technologies, neural networks and artificial intelligence which allow us to solve problems we thought were unsolvable and to produce algorithms which work much better than the classic ones. So, the advice is, learn signal processing and machine learning. This combination is very powerful!

Host: Ivan Tashev, thank you for joining us today.

Ivan Tashev: Thank you.

To learn more about Dr. Ivan Tashev and how Microsoft Research is working to make sound sound better, visit


Microsoft’s code-mixing project could help computers handle Spanglish


Communication is a large part of who we are as human beings, and today, technology has allowed us to communicate in new ways and to audiences much larger and wider than ever before. That technology has assumed single-language speech, which — quite often — does not reflect the way people naturally speak. India, like many other parts of the world, is multilingual on a societal level with most people speaking two or more languages. I speak Bengali, English, and Hindi, as do a lot of my friends and colleagues. When we talk, we move fluidly between these languages without much thought.

This mixing of words and phrases is referred to as code-mixing or code-switching, and from it, we’ve gained such combinations as Hinglish and Spanglish. More than half of the world’s population speaks two or more languages, so with as many people potentially code-switching, creating technology that can process it is important in not only creating useful translation and speech recognition tools, but also in building engaging user interfaces. Microsoft is progressing on that front in exciting ways.

In Project Mélange, we at Microsoft Research India have been building technologies for processing code-mixed speech and text. Through large-scale computational studies, we are also exploring some fascinating linguistic and behavioral questions around code-mixing, such as why and when people code-mix, that are helping us build technology people can relate to. At the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), my colleagues and I have the opportunity to share some of our recent research with our paper “Word Embeddings for Code-Mixed Language Processing.”

A data shortage in code-mixed language

Word embeddings — multidimensional vector representation where words similar in meaning or used in similar context are closer to each other — are learnt using deep learning from large language corpora and are valuable in solving a variety of natural language processing tasks using neural techniques. For processing code-mixed language — say, Hinglish — one would ideally need an embedding of words from both Hindi and English in the same space. There are standard methods for obtaining multilingual word embeddings; however, these techniques typically try to map translation equivalents from the two languages (e.g., school and vidyalay) close to each other. This helps in cross-lingual transfer of models. For instance, a sentiment analysis system trained for English can be appropriately transferred to work for Hindi using multilingual embeddings. But it’s not ideal for code-mixed language processing. While school and vidyalay are translation equivalents, in Hinglish, school is far more commonly used than vidyalay; also, these words are used in slightly different contexts. Further, there are grammatical constraints on code-mixing that disallow certain types of direct word substitutions, most notably for verbs in Hinglish. For processing code-mixed language, the word embeddings should ideally be learnt from a corpus of code-mixed text.
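
To make "closer to each other" concrete, closeness between word vectors is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration. Real embeddings are learnt from corpora and have hundreds of dimensions, and the third word is a hypothetical unrelated entry added for contrast.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors, not learnt from any corpus.
embeddings = {
    "school":   [0.9, 0.1, 0.2],
    "vidyalay": [0.8, 0.2, 0.3],  # translation equivalent, placed nearby
    "khana":    [0.1, 0.9, 0.1],  # unrelated word, placed far away
}

print(cosine(embeddings["school"], embeddings["vidyalay"]))
print(cosine(embeddings["school"], embeddings["khana"]))
```

A standard multilingual embedding aims for the first score to be high. The point of the paragraph above is that for code-mixed text this is not necessarily the right target, because the two words are not used interchangeably in Hinglish.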

It is difficult to estimate the amount of code-mixing that happens in the world. One good proxy is the code-mixing patterns on social media. Approximately 3.5 percent of the tweets on Twitter are code-mixed. The above pie charts show the distribution of monolingual and code-mixed, or code-switched (cs), tweets in seven major European languages: Dutch (nl), English (en), French (fr), German (de), Portuguese (pt), Spanish (es), and Turkish (tr).

The chart above shows the distributions of monolingual and code-mixed tweets for 12 major cities in Europe and the Americas that were found to have very large or very small fractions of code-mixed tweets, represented in the larger pies by the missing white wedge. The smaller pies show the top two code-mixed language pairs, the size being proportionate to their usage. The Microsoft Research India team found that code-mixing is more prevalent in cities where English is not the major language used to tweet.

Even though code-mixing is extremely common in multilingual societies, it happens in casual speech and rarely in text, so we’re limited in the amount of text data available for code-mixed language. What little we do have is from informal speech conversations, such as interactions on social media, where people write almost exactly how they speak. To address this challenge, we developed a technique to generate natural-looking code-mixed data from monolingual text data. Our method is based on a linguistic model known as the equivalence constraint theory of code-mixing, which imposes several syntactic constraints on code-mixing. In building the Spanglish corpus, for example, we used Bing Microsoft Translator to first translate an English sentence into Spanish. Then we aligned the words, identifying which English word corresponded to which Spanish word, and, in a process called parsing, identified the phrases in each sentence and how they’re related. Then using the equivalence constraint theory, we systematically generated all possible valid Spanglish versions of the input English sentence. A small number of the generated sentences were randomly sampled based on certain criteria that indicated how close they were to natural Spanglish data, and these sentences comprise our artificial Spanglish corpus. Since there is no dearth of monolingual English and Spanish sentences, using this fully automated technique, we can generate as large a Spanglish corpus as we want.
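
The enumeration step can be sketched in a heavily simplified form. The snippet below is not the equivalence constraint model: it assumes the sentence has already been translated, aligned, and chunked into phrase pairs that appear in the same order in both languages, and it simply tries every choice of language per phrase. The actual method also applies syntactic constraints to discard invalid mixes. The function name and the example phrases are hypothetical.

```python
from itertools import product


def enumerate_code_mixed(aligned_phrases):
    """Yield every sentence that takes each aligned phrase from one language.

    aligned_phrases: list of (english_phrase, spanish_phrase) pairs in
    sentence order. Purely monolingual outputs are skipped.
    """
    for choice in product((0, 1), repeat=len(aligned_phrases)):
        if len(set(choice)) < 2:
            continue  # all-English or all-Spanish, not code-mixed
        yield " ".join(pair[lang] for pair, lang in zip(aligned_phrases, choice))


# Hand-written alignment for illustration; a real pipeline would derive it
# from machine translation, word alignment, and parsing.
aligned = [
    ("I want", "Yo quiero"),
    ("to go", "ir"),
    ("to the beach", "a la playa"),
]

candidates = list(enumerate_code_mixed(aligned))
for sentence in candidates:
    print(sentence)
```

With three aligned phrases this produces six candidates. In the approach described above, the candidate set would then be filtered by the linguistic constraints and sampled for naturalness.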

Solving NLP tasks with an artificially generated corpus

Through experiments on part-of-speech tagging and sentiment classification, we showed that word embeddings learnt from the artificially generated Spanglish corpus were more effective in solving these NLP tasks for code-mixed language than the standard cross-lingual embedding techniques.

The linguistic theory–based generation of code-mixed text has applications beyond word embeddings. For instance, in one of our previous studies published earlier this year, we showed that this technique helps us in learning better language models that can help us build better speech recognition systems for code-mixed speech. We are exploring its application in machine translation to improve the accuracy of mixed-language requests. And imagine a multilingual chatbot that can code-mix depending on who you are, the context of the conversation, and what topic is being discussed, and switch in a natural and appropriate way. That would be true engagement.