Project Triton and the physics of sound with Microsoft Research’s Dr. Nikunj Raghuvanshi

Episode 68, March 20, 2019


Final Transcript

Nikunj Raghuvanshi: In a game scene, you will have multiple rooms, you’ll have caves, you’ll have courtyards, you’ll have all sorts of complex geometry, and then people love to blow off roofs and poke holes into geometry all over the place. And within that, sound is streaming all around the space and making its way around geometry. And the question becomes, how do you compute even the direct sound? Even the initial sound’s loudness and direction, which are important? How do you find those quickly? Because you are on the clock and you have 60 or 100 sources moving around, and you have to compute all of that very quickly.

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: If you’ve ever played video games, you know that for the most part, they look a lot better than they sound. That’s largely due to the fact that audible sound waves are much longer – and a lot more crafty – than visual light waves, and therefore, much more difficult to replicate in simulated environments. But Dr. Nikunj Raghuvanshi, a Senior Researcher in the Interactive Media Group at Microsoft Research, is working to change that by bringing the quality of game audio up to speed with the quality of game video. He wants you to hear how sound really travels – in rooms, around corners, behind walls, outdoors – and he’s using computational physics to do it.

Today, Dr. Raghuvanshi talks about the unique challenges of simulating realistic sound on a budget (both money and CPU), explains how classic ideas in concert hall acoustics need a fresh take for complex games like Gears of War, reveals the computational secret sauce you need to deliver the right sound at the right time, and tells us about Project Triton, an acoustic system that models how real sound waves behave in 3-D game environments to make us believe with our ears as well as our eyes. That and much more on this episode of the Microsoft Research Podcast.

Host: Nikunj Raghuvanshi, welcome to the podcast.

Nikunj Raghuvanshi: I’m glad to be here!

Host: You are a senior researcher in MSR’s Interactive Media Group, and you situate your research at the intersection of computational acoustics and graphics. Specifically, you call it “fast computational physics for interactive audio/visual applications.”

Nikunj Raghuvanshi: Yep, that’s a mouthful, right?

Host: It is a mouthful. So, unpack that! How would you describe what you do and why you do it? What gets you up in the morning?

Nikunj Raghuvanshi: Yeah, so my passion is physics. I really like the mixture of computers and physics. The way I got into this was, many, many years ago, I picked up this book on C++ and it was describing graphics and stuff. I didn’t understand half of it, and there was a color plate in there. It took me two days to realize that those were not photographs; they were generated by a machine, and I was like, somebody took a photo of a world that doesn’t exist. So, that is what excites me. I was like, this is amazing. This is as close to magic as you can get. Then I used to build these little simulations, and the exciting thing is you just code up these laws of physics into a machine and you see all this behavior emerge out of it. You didn’t tell the world to do this or that. It’s just basic Newtonian physics. So, that is computational physics. And when you try to do this for games, the challenge is you have to be super-fast. You have 1/60th of a second to render the next frame and to produce the next buffer of audio. Right? So, that’s the fast portion. How do you take all these laws and compute the results fast enough that it can happen in 1/60th of a second, repeatedly? That’s where the computer science enters the physics part of it. So, that’s the mixture of things I like to work in.
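To put the "1/60th of a second" budget into concrete numbers, here is a small Python sketch. It assumes a 48 kHz audio sample rate (a common choice, not a figure from the conversation) and a hypothetical 1 ms slice of each frame set aside for acoustics. The 100 sources come from the upper end of the "60, 100 sources" mentioned later in the episode.

```python
def frame_budget_ms(frame_rate_hz: float) -> float:
    """Wall-clock time available per rendered frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def samples_per_frame(sample_rate_hz: int, frame_rate_hz: int) -> int:
    """Audio samples that must be produced for every video frame."""
    if sample_rate_hz % frame_rate_hz:
        raise ValueError("frame rate must divide the sample rate evenly")
    return sample_rate_hz // frame_rate_hz


def per_source_budget_us(acoustics_budget_ms: float, num_sources: int) -> float:
    """Microseconds of acoustics work available per sound source per frame."""
    return acoustics_budget_ms * 1000.0 / num_sources


if __name__ == "__main__":
    print(f"{frame_budget_ms(60):.2f} ms per frame")            # 16.67 ms per frame
    print(samples_per_frame(48_000, 60), "samples per frame")  # 800 samples per frame
    # Hypothetical: 1 ms of the frame for acoustics, shared by 100 moving sources.
    print(per_source_budget_us(1.0, 100), "us per source")     # 10.0 us per source
```

Under these assumed numbers, each source gets on the order of ten microseconds per frame, which is far too little time to run a wave simulation live.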

Host: You’ve said that light and sound, or video and audio, work together to make gaming, augmented reality, virtual reality, believable. Why are the visual components so much more advanced than the audio? Is it because the audio is the poor relation in this equation, or is it that much harder to do?

Nikunj Raghuvanshi: It is kind of both. Humans are visually dominant creatures, right? Because visuals are what is on our conscious mind, and when you describe the world, our language is so visual, right? Even for sound, we sometimes use visual metaphors to describe things. So, that is part of it. And part of it is also that for sound, the physics is in many ways tougher, because you have much longer wavelengths and you need to model wave diffraction, wave scattering and all these things to produce a believable simulation. So, that is the physical aspect of it. There’s also a perceptual aspect. Our brain has evolved in a world where both audio and visual cues exist, and our brain is very clever. It goes for the physical aspects of each that give us separate, unique information. Visuals give you line-of-sight, high resolution, right? Audio is lower resolution directionally, but it goes around corners. It goes around rooms. That’s why, if you put on your headphones and just listen to music at a loud volume, you are a danger to everybody on the street, because you have no awareness.
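The wavelength gap mentioned here follows from λ = c/f. Here is a quick Python check, assuming a speed of sound of 343 m/s and using 550 nm as a representative visible wavelength. Audible wavelengths run from about 17 m at 20 Hz down to under 2 cm at 20 kHz. That is the same scale as rooms, doorways and furniture, which is why diffraction matters for sound.

```python
SPEED_OF_SOUND_M_S = 343.0   # dry air at roughly 20 °C
GREEN_LIGHT_M = 550e-9       # a representative visible-light wavelength


def wavelength_m(frequency_hz: float, speed_m_s: float = SPEED_OF_SOUND_M_S) -> float:
    """Wavelength from lambda = c / f."""
    return speed_m_s / frequency_hz


if __name__ == "__main__":
    for f in (20, 100, 1_000, 20_000):   # edges and middle of the audible band
        lam = wavelength_m(f)
        print(f"{f:>6} Hz -> {lam:8.4f} m  ({lam / GREEN_LIGHT_M:,.0f}x green light)")
```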

Host: Right.

Nikunj Raghuvanshi: So, audio is the awareness part of it.

Host: That is fascinating because you’re right. What you can see is what is in front of you, but you could hear things that aren’t in front of you.

Nikunj Raghuvanshi: Yeah.

Host: You can’t see behind you, but you can hear behind you.

Nikunj Raghuvanshi: Absolutely, you can hear behind yourself and you can hear around stuff, around corners. You can hear stuff you don’t see, and that’s important for anticipating stuff.

Host: Right.

Nikunj Raghuvanshi: People coming towards you and things like that.

Host: So, there are all kinds of people here who are working on 3D sound and head-related transfer functions and all that.

Nikunj Raghuvanshi: Yeah, Ivan’s group.

Host: Yeah! How is your work interacting with that?

Nikunj Raghuvanshi: So, that work is about, if I tell you the spatial sound field around your head, how does it translate into a personal experience in your two ears? So, the HRTF modeling is about that aspect. My work with John Snyder is about, how does the sound propagate in the world, right?

Host: Interesting.

Nikunj Raghuvanshi: So, if there is a sound down a hallway, what happens during the time it gets from there up to your head? That’s our work.

Host: I want you to give us a snapshot of the current state-of-the-art in computational acoustics, and there are apparently two main approaches in the field. What are they, what’s the case for each, and where do you land in this spectrum?

Nikunj Raghuvanshi: So, there’s a lot of work in room acoustics where people are thinking about, okay, what makes a concert hall sound great? Can you simulate a concert hall before you build it, so you know how it’s going to sound? Based on the constraints in those areas, people have used a lot of ray tracing approaches, which borrow from a lot of literature in graphics. In graphics, ray tracing is the main technique, and it works really well because you’re using a short-wavelength approximation. Light wavelengths are submicron, and if they hit something, they get blocked. But the analogy I like to use is that sound is very different; the wavelengths are much bigger. You can hold your thumb out in front of you and blot out the sun, but you are going to have a hard time blocking out the sound of thunder with a thumb held out in front of your ear, because the waves will just wrap around. That’s what motivates our approach, which is to go back to the physical laws and say, instead of making the short-wavelength approximation for sound, maybe sound needs the more fundamental wave equation to be solved, to actually model these diffraction effects for us. The usual thinking in games is that we want a certain set of perceptual cues. We want walls to occlude sound, we want a small room to reverberate less, we want a large hall to reverberate more. And the thought is, why are we solving this expensive partial differential equation? Can’t we just find some shortcut to jump straight to the answer instead of going through this long-winded route of physics? Our answer has been that you really have to do all the hard work, because there’s a ton of information folded in, what seems easy to us as humans isn’t quite so easy for a computer, and there’s no neat trick to get you straight to the perceptual answer you care about.
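For readers who have not seen a wave-equation solver, here is a Python sketch of the basic leapfrog (finite-difference time-domain) update for the 1-D wave equation p_tt = c^2 p_xx. This is a generic textbook scheme chosen for illustration. It is not the solver used in Project Triton, which this conversation does not describe. A 1-D grid has no edges to bend around, so the sketch only demonstrates the update rule. When the same local update runs on a 2-D or 3-D grid with the scene geometry as boundaries, diffraction around edges emerges without any special-case logic.

```python
import numpy as np


def simulate_1d(p0, courant, steps):
    """Leapfrog FDTD for the 1-D wave equation, starting from rest.

    p0:      initial pressure on a uniform grid (initial velocity is zero).
    courant: c*dt/dx; the explicit scheme is stable only for values <= 1.
    steps:   number of time steps to advance.
    Both ends are held at p = 0 (pressure-release boundaries).
    """
    if not 0.0 < courant <= 1.0:
        raise ValueError("Courant number c*dt/dx must be in (0, 1] for stability")
    c2 = courant ** 2
    prev = np.asarray(p0, dtype=float).copy()
    prev[0] = prev[-1] = 0.0
    if steps == 0:
        return prev
    # First step: second-order Taylor expansion using zero initial velocity.
    curr = prev.copy()
    curr[1:-1] = prev[1:-1] + 0.5 * c2 * (prev[2:] - 2.0 * prev[1:-1] + prev[:-2])
    for _ in range(steps - 1):
        nxt = np.zeros_like(curr)
        # Central differences in time and space: p_next = 2p - p_prev + C^2 * laplacian.
        nxt[1:-1] = (2.0 * curr[1:-1] - prev[1:-1]
                     + c2 * (curr[2:] - 2.0 * curr[1:-1] + curr[:-2]))
        prev, curr = curr, nxt
    return curr


if __name__ == "__main__":
    pulse = np.zeros(201)
    pulse[100] = 1.0
    p = simulate_1d(pulse, courant=1.0, steps=50)
    # The pulse splits in two halves that each move one cell per step,
    # so after 50 steps only cells 50 and 150 are nonzero (0.5 each).
    print(np.nonzero(p)[0], p[np.nonzero(p)])
```

A Courant number of exactly 1 makes the 1-D scheme propagate without numerical dispersion, which keeps the demonstration easy to check. Smaller values are stable but smear the pulse slightly.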

(music plays)

Host: Much of the work in audio and acoustic research is focused on indoor sound, where the sound source is within the line of sight and the audience and the listener can see what they are listening to…

Nikunj Raghuvanshi: Um-hum.

Host: …and you mentioned that the concert hall has a rich literature in this field. So, what’s the gap in the literature when we move from the concert hall to the computer, specifically in virtual environments?

Nikunj Raghuvanshi: Yeah, so for games and virtual reality, the key demand is that the scene is not one room, and with time it has become much more difficult. A concert hall is terrible if you can’t see the people who are playing, right? So, it allows for a certain set of assumptions that work extremely nicely. The direct sound, which is the initial sound and is perceptually very critical, just goes in a straight line from source to listener. You know the distance, so you can just use a simple formula and you know exactly how loud the initial sound is at the person. But in a game scene, you will have multiple rooms, you’ll have caves, you’ll have courtyards, you’ll have all sorts of complex geometry, and then people love to blow off roofs and poke holes into geometry all over the place. And within that, sound is streaming all around the space and making its way around geometry. And the question becomes, how do you compute even the direct sound? Even the initial sound’s loudness and direction, which are important? How do you find those quickly? Because you are on the clock and you have 60 or 100 sources moving around, and you have to compute all of that very quickly. So, that’s the challenge.
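The "simple formula" for the direct sound in an unobstructed space is spherical spreading. Here is a minimal Python sketch. It assumes a point source in free field with no obstacles, no air absorption, and a 1 m reference distance. Under those assumptions, the gain falls by 20·log10(d) dB, the arrival is delayed by d/c, and the direction points straight at the source. These three outputs become hard to compute once walls, doorways and holes in the geometry sit between the source and the listener.

```python
import math

SPEED_OF_SOUND_M_S = 343.0


def direct_sound(source, listener, ref_distance_m=1.0):
    """Free-field direct sound for a point source: (gain_db, delay_s, unit direction).

    Loudness follows spherical spreading (about -6 dB per doubling of distance),
    clamped to 0 dB inside ref_distance_m. The direction points from the
    listener toward the source. Obstacles and air absorption are ignored.
    """
    d = math.dist(source, listener)
    gain_db = -20.0 * math.log10(max(d, ref_distance_m) / ref_distance_m)
    delay_s = d / SPEED_OF_SOUND_M_S
    if d == 0.0:
        direction = (0.0, 0.0, 0.0)
    else:
        direction = tuple((s - l) / d for s, l in zip(source, listener))
    return gain_db, delay_s, direction


if __name__ == "__main__":
    gain, delay, direction = direct_sound((10.0, 0.0, 0.0), (0.0, 0.0, 0.0))
    print(f"{gain:.1f} dB, {delay * 1000:.1f} ms, from {direction}")
```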

Host: All right. So, let’s talk about how you’re addressing it. A recent paper you published made some waves, sound waves probably. No pun intended… It’s called Parametric Directional Coding for Precomputed Sound Propagation. Another mouthful. But it’s a great paper and the technology is so cool. Talk about this research that you’re doing.

Nikunj Raghuvanshi: Yeah. So, our main idea is, actually, to look at the literature in lighting again and see the path they followed to meet this computational challenge of how you do these extensive simulations and still hit that stringent CPU budget in real time. One of the key ideas is you precompute. You cheat. You just look at the scene and compute everything you need beforehand, right? Instead of trying to do it on the fly during the game. That does introduce the limitation that the scene has to be static. But then you can do these very nice physical computations, and you can ensure that the whole thing is robust, it is accurate, it doesn’t suffer from all the corner cases that approximations tend to suffer from, and you have your result. You basically have a giant look-up table. If somebody tells you the source is over there and the listener is over here, and asks what the loudness of the sound would be, we just say, okay, we have this giant table, we’ll just go look it up for you. That is the main way we bring the CPU usage under control. But it creates a knock-on challenge: now we have this huge table, this huge amount of stored data, and it’s six-dimensional. The source can move in three dimensions and the listener can move in three dimensions. So, we have a giant table which is terabytes or even more of data.
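The bake-then-look-up idea can be sketched in Python as follows. Everything here is hypothetical: the class and function names, the uniform grid, and the single float stored per entry are illustrative assumptions. The stand-in `free_field_db` callback is a free-field formula, not a wave simulation. The sketch shows how the work is split. The expensive call happens only in `bake`, which runs offline. At run time, `lookup` just quantizes two positions and reads a dictionary.

```python
import math
from typing import Callable, Dict, Iterable, Optional, Tuple

Vec3 = Tuple[float, float, float]
Cell = Tuple[int, int, int]


def to_cell(p: Vec3, spacing_m: float) -> Cell:
    """Quantize a position to the integer grid cell that contains it."""
    return (math.floor(p[0] / spacing_m),
            math.floor(p[1] / spacing_m),
            math.floor(p[2] / spacing_m))


class PrecomputedTable:
    """Hypothetical store of one baked value per (source cell, listener cell)."""

    def __init__(self, spacing_m: float) -> None:
        self.spacing_m = spacing_m
        self._entries: Dict[Tuple[Cell, Cell], float] = {}

    def bake(self, sources: Iterable[Vec3], listeners: Iterable[Vec3],
             simulate: Callable[[Vec3, Vec3], float]) -> None:
        """Offline step: run the expensive simulation once per pair."""
        listeners = list(listeners)
        for src in sources:
            for lst in listeners:
                key = (to_cell(src, self.spacing_m), to_cell(lst, self.spacing_m))
                self._entries[key] = simulate(src, lst)

    def lookup(self, source: Vec3, listener: Vec3) -> Optional[float]:
        """Run-time step: no simulation, just a dictionary read (None if not baked)."""
        key = (to_cell(source, self.spacing_m), to_cell(listener, self.spacing_m))
        return self._entries.get(key)

    def __len__(self) -> int:
        return len(self._entries)


def free_field_db(src: Vec3, lst: Vec3) -> float:
    """Stand-in for a real simulation: inverse-distance loudness in dB."""
    return -20.0 * math.log10(max(math.dist(src, lst), 1.0))


if __name__ == "__main__":
    table = PrecomputedTable(spacing_m=1.0)
    table.bake([(0.5, 0.5, 0.5)], [(5.5, 0.5, 0.5), (2.5, 0.5, 0.5)], free_field_db)
    print(len(table), table.lookup((0.2, 0.9, 0.1), (5.1, 0.3, 0.7)))
```

Each key holds three integers for the source and three for the listener. A scene with N grid cells can therefore need up to N² entries, and this quadratic, six-dimensional growth is why the raw tables reach terabytes.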

Host: Yeah.

Nikunj Raghuvanshi: And a game’s typical budget is like 100 megabytes. So, the key challenge we’re facing is, how do we fit everything in that? How do we take this data, extract something salient that people listen to, and use that? You start with the full computation, as close to nature as possible, and then say, okay, what would a person hear out of this? Now, instead of taking a shortcut, let’s think about it: a person hears the direction the sound comes from. If there is a doorway, the sound should come from the doorway. So, we pick out these perceptual parameters that are salient for human perception, and we store those. That’s the crucial way you bring down this enormous data set to a memory budget that’s feasible.
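Here is a Python sketch of extracting a handful of perceptual parameters from a simulated response. It is a generic illustration, not Project Triton's encoder. It assumes a single-channel impulse response and computes three textbook quantities:

1. the onset delay of the first arrival,
2. the energy in a short window after that arrival,
3. a reverberation time from Schroeder backward integration.

For the reverberation time, the sketch fits a line to the decay between -5 and -25 dB and extrapolates it to a 60 dB drop, which is the usual T20 estimate. The threshold and window defaults are arbitrary choices.

```python
import numpy as np


def extract_parameters(ir, fs, onset_threshold_db=-40.0, early_window_s=0.01):
    """Reduce an impulse response to onset delay, early energy and decay time."""
    ir = np.asarray(ir, dtype=float)
    energy = ir ** 2
    if energy.max() == 0.0:
        raise ValueError("impulse response is silent")
    # Onset: first sample whose energy is within onset_threshold_db of the peak.
    threshold = energy.max() * 10.0 ** (onset_threshold_db / 10.0)
    onset = int(np.argmax(energy >= threshold))
    # Early energy: sum over a short window starting at the onset, in dB.
    window = max(1, int(round(early_window_s * fs)))
    early_db = 10.0 * np.log10(energy[onset:onset + window].sum())
    # Schroeder backward integration gives the energy decay curve (EDC).
    edc = np.cumsum(energy[::-1])[::-1]
    edc_db = 10.0 * np.log10(np.maximum(edc / edc[0], 1e-30))
    # Fit the -5..-25 dB span and extrapolate the slope to a 60 dB drop (T20).
    fit = np.nonzero((edc_db <= -5.0) & (edc_db >= -25.0))[0]
    if fit.size < 2:
        raise ValueError("response too short to estimate a decay time")
    slope_db_per_s, _ = np.polyfit(fit / fs, edc_db[fit], 1)
    return {
        "onset_s": onset / fs,
        "early_energy_db": float(early_db),
        "decay_time_s": float(-60.0 / slope_db_per_s),
    }


if __name__ == "__main__":
    fs = 1_000
    n = np.arange(3 * fs)
    # Synthetic response: 100 ms of silence, then an exponential decay whose
    # energy falls by 60 dB in 0.5 s.
    tail = np.exp(-np.log(1000.0) / (0.5 * fs) * n)
    ir = np.concatenate([np.zeros(100), tail])
    print(extract_parameters(ir, fs))
```

Storing three numbers instead of thousands of samples per entry is the kind of reduction that brings a terabyte table closer to a game's memory budget. Each of these numbers is also one that a sound designer already recognizes.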

Host: So, that’s the paper.

Nikunj Raghuvanshi: Um-hum.

Host: And how has it played out in practice, or in project, as it were?

Nikunj Raghuvanshi: So, a little bit of history on this: we had a paper at SIGGRAPH 2010, me and John Snyder and some academic collaborators, and at that point, we were thinking only about physical accuracy. We took the physical data, tried to stay as close to physical reality as possible, and rendered that. Around 2012, we got to talking with the Gears of War studio, and we were going through what the budgets would be, how things would be. And we were like, we need… this needs to… this is gigabytes, it needs to go to megabytes…

Host: Really?

Nikunj Raghuvanshi: …very quickly. And that’s when we were like, okay, let’s simplify. What are the four most basic things that you really want from an acoustic system? Like, walls should occlude sound, and things like that. So, we rewound and came at it from the perceptual viewpoint I was just describing. Let’s keep only what’s necessary. And that’s how we were able to ship this in 2016 in Gears of War 4, by rewinding and going through this process.

Host: How is that playing in to, you know… Project Triton is the big project that we’re talking about. How would you describe what that’s about and where it’s going? Is it everything you’ve just described or is there… other aspects to it?

Nikunj Raghuvanshi: Yeah. Project Triton is this idea that you should precompute the wave physics, instead of starting with approximations. Approximate later. That’s one idea of Project Triton. And the second is, if you want to make it feasible for real games and real virtual reality and augmented reality, switch to perceptual parameters. Extract that out of this physical simulation and then you have something feasible. And the path we are on now, which brings me back to the recent paper you mentioned…

Host: Right.

Nikunj Raghuvanshi: …is, in Gears of War, we shipped some set of parameters. We were like, these make a big difference. But one thing we lacked was if the sound is, say, in a different room and you are separated by a doorway, you would hear the right loudness of the sound, but its direction would be wrong. Its direction would be straight through the wall, going from source to listener.

Host: Interesting.

Nikunj Raghuvanshi: And that’s an important spatial cue. It helps you orient yourself when sounds funnel through doorways.

Host: Right.

Nikunj Raghuvanshi: Right? And it’s a cue that sound designers really look for and try to hand-tune to get good ambiances going. So, in the recent 2018 paper, that’s what we fixed. We call this portaling. It’s a made-up word for this effect of sounds going around doorways, but that’s what we’re modeling now.
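The Python sketch below contrasts the direction cue that shipped in Gears of War with the portaling fix, using a deliberately crude geometric stand-in. It is not how Project Triton works. As described in the next answers, Triton gets the direction from the wave simulation itself. This hypothetical helper instead picks the opening with the shortest source-to-opening-to-listener path and reports the horizontal angle toward that opening. It assumes 2-D positions and that the caller already knows whether there is line of sight.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def azimuth_deg(frm: Point, to: Point) -> float:
    """Horizontal angle from one point toward another, in degrees (0 = +x)."""
    return math.degrees(math.atan2(to[1] - frm[1], to[0] - frm[0]))


def arrival_azimuth_deg(listener: Point, source: Point,
                        openings: Sequence[Point], line_of_sight: bool) -> float:
    """Hypothetical heuristic: when occluded, sound arrives from the best opening."""
    if line_of_sight or not openings:
        return azimuth_deg(listener, source)   # straight line, as before the fix
    best = min(openings,
               key=lambda o: math.dist(source, o) + math.dist(o, listener))
    return azimuth_deg(listener, best)


if __name__ == "__main__":
    listener, source = (0.0, 0.0), (0.0, 5.0)   # source in the next room, behind a wall
    doors = [(3.0, 0.0), (-8.0, 0.0)]
    print(azimuth_deg(listener, source))                                      # straight through the wall
    print(arrival_azimuth_deg(listener, source, doors, line_of_sight=False))  # via the nearer door
```

In this toy layout the straight-line cue and the doorway cue differ by 90 degrees, which is the kind of error the passage above describes.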

Host: Is this new stuff? I mean, people have tackled these problems for a long time.

Nikunj Raghuvanshi: Yeah.

Host: Are you people the first ones to come up with this, the portaling and…?

Nikunj Raghuvanshi: I mean, the basic ideas have been around. People know that, perceptually, this is important, and there are approaches to try to tackle this, but I’d say, because we’re using wave physics, this problem becomes much easier because you just have the waves diffract around the edge. With ray tracing you face the difficult problem that you have to trace out the rays “intelligently” somehow to hit an edge, which is like hitting a bullseye, right?

Host: Right.

Nikunj Raghuvanshi: You have to hit it exactly for the ray to wrap around the edge, so it becomes really difficult. Most practical ray tracing systems don’t try to deal with this edge diffraction effect because of that. Although there are academic approaches to it, in practice it becomes difficult. But as I’ve worked on this over the years, I’ve realized these are the real advantages of our approach. The disadvantages are pretty clear: it’s slow, right? So, you have to precompute. But we’re realizing, over time, that going to physics has these advantages.

Host: Well, but the precompute part is innovative in terms of a thought process on how you would accomplish the speed-up…

Nikunj Raghuvanshi: There have been papers on precomputed acoustics academically before, but the realization is that mixing precomputation with extracting these perceptual parameters is a recipe that makes a lot of practical sense. A third thing I haven’t mentioned yet is that, by going to the perceptual domain, the sound designer can now make sense of the numbers coming out of this whole system. Because it’s loudness. It’s reverberation time, how long the sound is reverberating. These numbers are super-intuitive to sound designers; they already deal with them. So, now what you are telling them is, hey, you used to start with a blank world which had nothing, right? Like the world before the act of creation, there’s nothing. It’s just empty space, and you were trying to make things reverberate this way or that. Now you don’t need to do that. Now physics will execute first, on the actual scene with the actual materials, and then you can say, I don’t like what physics did here or there, let me tweak it, let me modify the real result and make it meet the artistic goals I have for my game.
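The workflow described here, physics first and then artistic adjustment, amounts to applying designer-controlled modifiers on top of the simulated parameters. Here is a minimal Python sketch. The parameter set, the field names and the offset-plus-scale form of the override are assumptions for illustration, not the Project Acoustics API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AcousticParams:
    """Perceptual parameters produced by the simulation for one source/listener pair."""
    loudness_db: float
    decay_time_s: float


@dataclass(frozen=True)
class DesignerOverride:
    """Hypothetical artistic adjustments; the defaults leave the physics untouched."""
    loudness_offset_db: float = 0.0
    decay_time_scale: float = 1.0


def apply_override(physical: AcousticParams, o: DesignerOverride) -> AcousticParams:
    """Start from the simulated result and nudge it, rather than authoring from scratch."""
    return replace(physical,
                   loudness_db=physical.loudness_db + o.loudness_offset_db,
                   decay_time_s=physical.decay_time_s * o.decay_time_scale)


if __name__ == "__main__":
    simulated = AcousticParams(loudness_db=-12.0, decay_time_s=1.2)
    # "This hall rings too long for the scene": halve the decay, lift the level 3 dB.
    print(apply_override(simulated, DesignerOverride(3.0, 0.5)))
```

Because the default override is the identity, a designer only touches the spots where the physical result doesn't serve the game.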

(music plays)

Host: We’ve talked about indoor audio modeling, but let’s talk about the outdoors for now, and the computational challenges of making natural outdoor sounds sound convincing.

Nikunj Raghuvanshi: Yeah.

Host: How have people hacked it in the past and how does your work in ambient sound propagation move us forward here?

Nikunj Raghuvanshi: Yeah, we’ve hacked it in the past! Okay. This is something we realized on Gears of War, because the parameters we used were borrowed, again, from the concert hall literature, and because they’re parameters informed by concert halls, things sound like halls and rooms. Back in the days of Doom, this tech would have been great because it was all indoors and rooms, but in Gears of War, we have these open spaces and it doesn’t sound quite right. Outdoors sounds like a huge hall, and, you know, how do we do wind ambiances and rain that’s outdoors? So, we came up with a solution for them at that time which we called “outdoorness.” It’s, again, an invented word.

Host: Outdoorness.

Nikunj Raghuvanshi: Outdoorness.

Host: I’m going to use that. I like it.

Nikunj Raghuvanshi: Because the idea it’s trying to convey is, it’s not a binary indoor/outdoor. When you are crossing a doorway or a threshold, you expect a smooth transition. You expect that, I’m not hearing rain inside, I’m feeling nice and dry and comfortable and now I’m walking into the rain…

Host: Yeah.

Nikunj Raghuvanshi: …and you want the smooth transition on it. So, we built a sort of custom tech to do that outdoor transition. But it got us thinking, what’s the right way to do this? How do you produce the right spatial impression that there’s rain outside, it’s coming through a doorway, the doorway is to my left, and as you walk, it spreads all around you? You are standing in the middle of rain now and it’s all around you. So, we wanted to create that experience. The ambient sound propagation work was an intern project, and now we’ve finished it up with our collaborators at Cornell. That was about, how do you model extended sound sources? Again, going back to concert halls, people have usually dealt with point-like sources, which might have a directivity pattern. But rain is like a million little drops. If you try to model each and every drop, that’s not going to get you anywhere. So, that’s what the paper is about: how to treat it as one aggregate that somebody gave us. We produce an aggregate energy distribution of that thing, along with its directional characteristics, and just encode that.
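The transcript doesn't say how the "outdoorness" value is computed. The Python sketch below therefore takes it as a given number between 0 (fully indoors) and 1 (fully outdoors). It only shows why a continuous value avoids a binary switch. The equal-power crossfade between an indoor and an outdoor ambience bed is an assumption: it is a common mixing choice, not something stated in the episode. It keeps the combined power constant, so the rain doesn't dip in level halfway through the doorway.

```python
import math
from typing import Tuple


def ambience_gains(outdoorness: float) -> Tuple[float, float]:
    """Equal-power crossfade: returns (indoor_gain, outdoor_gain)."""
    x = min(max(outdoorness, 0.0), 1.0)        # clamp to [0, 1]
    return math.cos(0.5 * math.pi * x), math.sin(0.5 * math.pi * x)


if __name__ == "__main__":
    # Walking from a dry room out into the rain.
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        g_in, g_out = ambience_gains(x)
        print(f"outdoorness {x:.2f}: indoor {g_in:.3f}, rain {g_out:.3f}")
```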

Host: And just encode it.

Nikunj Raghuvanshi: And just encode it.
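What "one aggregate with directional characteristics" means can be illustrated with a brute-force Python sketch that sums the energy of many rain drops into azimuth bins around the listener. This is an assumption-laden toy. It treats every drop as an equal point emitter in free field with inverse-square falloff and ignores walls. As a result, it cannot reproduce the through-the-doorway effect that the actual work captures. It only shows the kind of compact result that gets encoded: a handful of directional energies instead of a million drops.

```python
import math
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def directional_energy(listener: Point, emitters: Iterable[Point],
                       num_bins: int = 8, ref_distance_m: float = 1.0) -> List[float]:
    """Fraction of incoming energy per azimuth bin (bin 0 centred on +x)."""
    width = 2.0 * math.pi / num_bins
    bins = [0.0] * num_bins
    for ex, ey in emitters:
        dx, dy = ex - listener[0], ey - listener[1]
        d = max(math.hypot(dx, dy), ref_distance_m)
        az = math.atan2(dy, dx) % (2.0 * math.pi)
        k = int((az + 0.5 * width) // width) % num_bins
        bins[k] += (ref_distance_m / d) ** 2      # inverse-square energy
    total = sum(bins)
    return [b / total for b in bins] if total > 0.0 else bins


if __name__ == "__main__":
    # 800 drops on a 10 m ring: rain all around, so every bin gets about 1/8.
    ring = [(10 * math.cos(2 * math.pi * (i + 0.5) / 800),
             10 * math.sin(2 * math.pi * (i + 0.5) / 800)) for i in range(800)]
    print(directional_energy((0.0, 0.0), ring))
    # Rain only off to the +x side, as if heard through a single opening.
    print(directional_energy((0.0, 0.0), [(10.0, 0.0), (10.0, 1.0), (10.0, -1.0)]))
```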

Host: How is it working?

Nikunj Raghuvanshi: It works nicely. It sounds good. To my ears it sounds great.

Host: Well you know, and you’re the picky one, I would imagine.

Nikunj Raghuvanshi: Yeah. I’m the picky one and also when you are doing iterations for a paper, you also completely lose objectivity at some point. So, you’re always looking for others to get some feedback.

Host: Here, listen to this.

Nikunj Raghuvanshi: Well, reviewers give their feedback, so, yeah.

Host: Sure. Okay. Well, kind of riffing on that, there’s another project going on, called Project Acoustics, that I’d love for you to talk about as much as you can, and kind of the future of where we’re going with this. Talk about that.

Nikunj Raghuvanshi: That’s really exciting. So, up to now, Project Triton was an internal tech which we managed to propagate from research into actual Microsoft products, internally.

Host: Um-hum.

Nikunj Raghuvanshi: Project Acoustics is being led by Noel Cross’s team in Azure Cognition. What they’re doing is turning it into a product that’s externally usable. They’re trying to democratize this technology so it can be used by any game audio team anywhere, backed by Azure compute for the precomputation.

Host: Which is key, the Azure compute.

Nikunj Raghuvanshi: Yeah, because you know, it took us a long time, with Gears of War to figure out, okay, where is all this precompute going to happen?

Host: Right.

Nikunj Raghuvanshi: We had to figure out the whole cluster story for ourselves: how to get the machines, how to procure them, and there’s a big headache of arranging compute for yourself. So, logistically, that’s a key problem people face when they think about precomputed acoustics. On the runtime side, with Project Acoustics we are going to have plug-ins for all the standard game audio engines and everything, so that makes things simpler on that side. But a key blocker, in my view, was always this question of where you are going to precompute. Now the answer is simple. You get your Azure Batch account, you send your stuff up there, and it just computes.

Host: Send it to the cloud and the cloud will rain it back down on you.

Nikunj Raghuvanshi: Yes. It will send down data.

Host: Who is your audience for Project Acoustics?

Nikunj Raghuvanshi: For Project Acoustics, the audience is the whole game audio industry. Our real hope is that we’ll see some uptake when we announce it at GDC in March, and we want as many teams as possible, small, big, medium, everybody, to start using this, because we feel there’s a positive feedback loop that can be set up. You have these new tools available, and designers realize that these tools have shipped in Triple A games, so they do work. And we want them to give us feedback. If they use these tools, we hope they can produce new audio experiences that are distinctly different, so that they can say to their tech director, or somebody, for the next game, we need more CPU budget, because we’ve shown you value. So, a big exercise was how to fit this within current budgets, so people can produce these examples of novel experiences and argue for more. The goal is to increase the budget for audio and bring it on par with graphics over time, as you alluded to earlier.

Host: You know, if we get nothing across in this podcast, it’s like, people, pay attention to good audio. Give it its props. Because it needs it. Let’s talk briefly about some of the other applications for computational acoustics. Where else might it be awesome to have a layer of realism with audio computing?

Nikunj Raghuvanshi: One of the applications that I find very exciting is audio rendering for people who are blind. I had the opportunity to show the demo of our latest system to Daniel Kish, who, if you don’t know, is the human echolocator. He uses clicks from his mouth to locate geometry around him, and he’s always oriented. He’s an amazing person. That was a collaboration we had with a team in the Garage. They released a game called Ear Hockey, and it was a nice collaboration; there was a good exchange of ideas. That’s nice because I feel it’s a whole different application where this can have a positive social impact. The other one that’s very interesting to me is that we lived on 2-D desktop screens for a while, and now computing is moving into the physical world. That’s the exciting thing about mixed reality: it’s moving compute out into this world. And then folding the acoustics of the real world into the sounds of virtual objects becomes extremely important. If something virtual is right behind the wall from you, you don’t want to hear it at full loudness. That would completely break the realism of something being situated in the real world. So, from that viewpoint, good light transport and good sound propagation are both required for the future compute platform in the physical world. That’s a very exciting future direction to me.

(music plays)

Host: It’s about this time in the podcast I ask all my guests the infamous “what keeps you up at night?” question. And when you and I talked before, we went down kind of two tracks here, and I felt like we could do a whole podcast on it, but sadly we can’t… But let’s talk about what keeps you up at night. Ironically to tee it up here, it deals with both getting people to use your technology…

Nikunj Raghuvanshi: Um-hum.

Host: And keeping people from using your technology.

Nikunj Raghuvanshi: No! I want everybody to use the technology. But I’d say, five years ago, what used to keep me up at night was, how are we going to ship this thing in Gears of War? Now what’s keeping me up at night is, how do we make Project Acoustics succeed, how do we, you know, expand its adoption and, in a small way, move the game audio industry forward a bit and help artists do the artistic expression they need to do in games? So, that’s what I’m thinking about right now: how can we move things forward in that direction? I frankly look at video games as an art form. And I’ve gamed a lot in my time. To be honest, not all of it was art; I was enjoying myself a lot, and I wasted some time playing games. But we all have our ways to unwind and waste time. Good games can be amazing. They can be much better than a Hollywood movie in terms of what you leave them with. And I just want to contribute in my small way to that, giving artists the tools to maybe make the next great story, you know.

Host: All right. So, let’s do talk a little bit, though, about this idea of you make a really good game…

Nikunj Raghuvanshi: Um-hum.

Host: Suddenly, you’ve got a lot of people spending a lot of time. I won’t say wasting. But we have to address the nature of gaming, and the fact that there are you know… you’re upstream of it. You are an artist, you are a technologist, you are a scientist…

Nikunj Raghuvanshi: Um-hum.

Host: And it’s like I just want to make this cool stuff.

Nikunj Raghuvanshi: Yeah.

Host: Downstream, people want people to use it a lot. So, how do you think about that, and about the responsibilities of a researcher in this arena?

Nikunj Raghuvanshi: Yeah. You know, this reminds me of Kurt Vonnegut’s book, Cat’s Cradle. There’s a scientist who makes ice-nine, and it freezes the whole planet or something. And you see things about video games in the news and stuff. But I frankly feel that the kinds of games I’ve participated in making are very social experiences. People meet in the games a lot. Sea of Thieves, for example, is all about getting a bunch of friends together: you’re sitting on the couch together, going crazy on these pirate ships, and just trying to have fun. So, they are not the sort of games where a person is separated from society by the act of gaming, immersed in the screen and not participating in the world. They are kind of the opposite. So, games have all these aspects. And I personally feel pretty good about the games I’ve contributed to. I can at least say that.

Host: So, I like to hear personal stories of the researchers that come on the podcast. So, tell us a little bit about yourself. When did you know you wanted to do science for a living and how did you go about making that happen?

Nikunj Raghuvanshi: Science for a living? I was the guy in 6th grade who’d get up and say I want to be a scientist. So, that was then, but what got me really hooked was graphics, initially. Like I told you, I found the book which had these color plates and I was like, wow, that’s awesome! So, I was at UNC Chapel Hill, in the graphics group, and I studied graphics for my graduate studies. Then, in my second or third year, I worked with my advisor, Ming Lin, who does a lot of research in physical simulations. How do we make water look nice in physical simulations? Lots of it is CGI. How do you model that? How do you model cloth? How do you model hair? So, there’s all this physics for that. I took a course with her and I was like, you know what? I want to do audio, because you get a different sense, right? It’s simulation, not for visuals, but you get to hear stuff. I’m like, okay, this is cool. This is different. So, I did a project with her and I published a paper on sound synthesis: how rigid bodies, like objects rolling and bouncing around and sliding, make sound, just from physical equations. And I found a cool technique and I was like, okay, let me do acoustics with this. It’s going to be fun, and I’m going to publish another paper in a year. And here I am, still trying to crack that problem of how to do acoustics in spaces!

Host: Yeah, but what a place to be. And speaking of that, you have a really interesting story about how you ended up at Microsoft Research and brought your entire PhD code base with you.

Nikunj Raghuvanshi: Yeah. It was an interesting time. So, when I was graduating, MSR was my number one choice because I was always thinking of this technology as, it would be great if games used this one day. This is the sort of thing that would have a good application in games. And then, around that time, I got hired at MSR. It was a multicore incubation back then; my group was looking at how these multicore systems enable all sorts of cool new things. And one of the things my hiring manager was looking at was how we can do physically based sound synthesis and propagation. That’s what my PhD was, so they licensed the whole code base and I built on that.

Host: You don’t see that very often.

Nikunj Raghuvanshi: Yeah, it was nice.

Host: That’s awesome. Well, Nikunj, as we close, I always like to ask guests to give some words of wisdom or advice or encouragement, however it looks to you. What would you say to the next generation of researchers who might want to make sound sound better?

Nikunj Raghuvanshi: Yeah, it’s an exciting area. It’s super-exciting right now. Because even like just to start from more technical stuff, there are so many problems to solve with acoustic propagation. I’d say we’ve taken just the first step of feasibility, maybe a second one with Project Acoustics, but we’re right at the beginning of this. And we’re thinking there are so many missing things, like outdoors is one thing that we’ve kind of fixed up a bit, but we’re going towards what sorts of effects can you model in the future? Like directional sources is one we’re looking at, but there are so many problems. I kind of think of it as the 1980s of graphics when people first figured out that you can make this work. You can make light propagation work. What are the things that you need to do to make it ever closer to reality? And we’re still at it. So, I think we’re at that phase with acoustics. We’ve just figured out this is one way that you can actually ship in practical applications and we know there are deficiencies in its realism in many, many places. So, I think of it as a very rich area that students can jump in and start contributing.

Host: Nowhere to go but up.

Nikunj Raghuvanshi: Yes. Absolutely!

Host: Nikunj Raghuvanshi, thank you for coming in and talking with us today.

Nikunj Raghuvanshi: Thanks for having me.

(music plays)

To learn more about Dr. Nikunj Raghuvanshi and the science of sound simulation, visit


Email overload: Using machine learning to manage messages, commitments

As email continues to be not only an important means of communication but also an official record of information and a tool for managing tasks, schedules, and collaborations, making sense of everything moving in and out of our inboxes will only get more difficult. The good news is there’s a method to the madness of staying on top of your email, and Microsoft researchers are drawing on this behavior to create tools to support users. Two teams working in the space will be presenting papers at this year’s ACM International Conference on Web Search and Data Mining February 11–15 in Melbourne, Australia.

“Identifying the emails you need to pay attention to is a challenging task,” says Partner Researcher and Research Manager Ryen White of Microsoft Research, who manages a team of about a dozen scientists and engineers and typically receives 100 to 200 emails a day. “Right now, we end up doing a lot of that on our own.”

According to the McKinsey Global Institute, professionals spend 28 percent of their time on email, so thoughtful support tools have the potential to make a tangible difference.

“We’re trying to bring in machine learning to make sense of a huge amount of data to make you more productive and efficient in your work,” says Senior Researcher and Research Manager Ahmed Hassan Awadallah. “Efficiency could come from a better ability to handle email, getting back to people faster, not missing things you would have missed otherwise. If we’re able to save some of that time so you could use it for your actual work function, that would be great.”

Email deferral: Deciding now or later

Awadallah has been studying the relationship between individuals and their email for years, exploring how machine learning can better support users in their email responses and help make information in inboxes more accessible. During these studies, he and fellow researchers began noticing varying behavior among users. Some tackled email-related tasks immediately, while others returned to messages multiple times before acting. The observations led them to wonder: How do users manage their messages, and how can we help them make the process more efficient?

“There’s this term called ‘email overload,’ where you have a lot of information flowing into your inbox and you are struggling to keep up with all the incoming messages,” explains Awadallah, “and different people come up with different strategies to cope.”

In “Characterizing and Predicting Email Deferral Behavior,” Awadallah and his coauthors reveal the inner workings of one such common strategy: email deferral, which they define as seeing an email but waiting until a later time to address it.

The team’s goal was twofold: to gain a deep understanding of deferral behavior and to build a predictive model that could help users in their deferral decisions and follow-up responses. The team—a collaboration between Microsoft Research’s Awadallah, Susan Dumais, and Bahareh Sarrafzadeh, lead author on the paper and an intern at the time, and Christopher Lin, Chia-Jung Lee, and Milad Shokouhi of the Microsoft Search, Assistant and Intelligence group—dedicated a significant amount of resources to the former.

“AI and machine learning should be inspired by the behavior people are doing right now,” says Awadallah.

The probability of deferring an email based on the workload of the user as measured by the number of unhandled emails. The number of unhandled emails is one of many features Awadallah and his coauthors used in training their deferral prediction model.

The team interviewed 15 subjects and analyzed the email logs of 40,000 anonymous users, finding that people defer for several reasons: they need more time and resources to respond than they have in that moment, or they’re juggling more immediate tasks. They also factor in who the sender is and how many others have been copied. Some of the more interesting reasons revolved around perception and boundaries: people choose whether or not to delay a reply in order to set expectations about how quickly they respond to messages.

The researchers used this information to create a dataset of features—such as the message length, the number of unanswered emails in an inbox, and whether a message was human- or machine-generated—to train a model to predict whether a message is deferred. The model has the potential to significantly improve the email experience, says Awadallah. For example, email clients could use such a model to remind users about emails they’ve deferred or even forgotten about, saving them the effort they would have spent searching for those emails and reducing the likelihood of missing important ones.
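To make the idea of a feature-based deferral predictor concrete, here is a minimal sketch, assuming a plain logistic regression over three of the features the article mentions: message length, the number of unhandled emails in the inbox, and whether the message came from a human. The article does not say which model family the researchers used; the `Email` class, the log scaling, and the toy training examples below are hypothetical and purely illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class Email:
    """Hypothetical feature record, loosely modeled on features named in the article."""
    length_words: int         # message length
    unhandled_in_inbox: int   # user's backlog of unhandled emails (workload)
    from_human: bool          # human- vs. machine-generated


def featurize(e: Email) -> list[float]:
    # Log-scale the counts so very long emails or huge backlogs don't dominate.
    return [math.log1p(e.length_words),
            math.log1p(e.unhandled_in_inbox),
            1.0 if e.from_human else 0.0]


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def train_logistic(X: list[list[float]], y: list[int],
                   lr: float = 0.1, epochs: int = 2000) -> tuple[list[float], float]:
    """Batch gradient descent on the logistic (cross-entropy) loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # d(loss)/d(logit) for cross-entropy with a sigmoid output is p - y.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def p_defer(w: list[float], b: float, e: Email) -> float:
    """Predicted probability that the user defers this email."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, featurize(e))) + b)


# Made-up toy examples (label 1 = deferred), for illustration only.
TOY = [(Email(400, 60, True), 1), (Email(250, 45, True), 1), (Email(600, 80, False), 1),
       (Email(20, 3, False), 0), (Email(35, 5, True), 0), (Email(15, 2, False), 0)]
w, b = train_logistic([featurize(e) for e, _ in TOY], [label for _, label in TOY])
print(round(p_defer(w, b, Email(500, 70, True)), 3))
```

A linear model like this is easy to inspect, since each learned weight shows how strongly a feature pushes toward deferral; a production system would use many more features and far more data.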

“If you have decided to leave an email for later, in many cases, you either just rely on memory or more primitive controls that your mail client provides like flagging your message or marking the message unread, and while these are useful strategies, we found that they do not provide enough support for users,” says Awadallah.

Commitment detection: A promise is a promise

Among the deluge of incoming emails are outgoing messages containing promises we make—promises to provide information, set up meetings, or follow up with coworkers—and losing track of them has ramifications.

“Meeting your commitments is incredibly important in collaborative settings and helps build your reputation and establish trust,” says Ryen White.

Current commitment detection tools, such as those available in Cortana, are pretty effective, but there’s room for further advancement. White, lead author Hosein Azarbonyad, who was interning with Microsoft at the time of the work, and coauthor Microsoft Research Principal Applied Scientist Robert Sim seek to tackle one particular obstacle in their paper “Domain Adaptation for Commitment Detection in Email”: bias in the datasets available to train commitment detection models.

Researcher access is generally limited to public corpora, which tend to be specific to the industry they’re from. In this case, the team used public datasets of email from the energy company Enron and an unspecified tech startup referred to as “Avocado.” They found a significant disparity between models trained and evaluated on the same collection of emails and models trained on one collection and applied to another; the latter performed worse.

“We want to learn transferable models,” explains White. “That’s the goal—to learn algorithms that can be applied to problems, scenarios, and corpora that are related but different to those used during training.”

To accomplish this, the group turned to transfer learning, which has been effective in other scenarios where datasets aren’t representative of the environments in which they’ll ultimately be deployed. In their paper, the researchers train their models to remove bias by identifying and devaluing certain information using three approaches: feature-level adaptation, sample-level adaptation, and an adversarial deep learning approach that uses an autoencoder.

Emails contain many different words and phrases, some more likely to be related to a commitment—“I will,” “I shall,” “let you know”—than others. In the Enron corpus, domain-specific words like “Enron,” “gas,” and “energy” may be overweighted in any model trained from it. Feature-level adaptation attempts to replace or transform these domain-specific terms, or features, with similar features from the target domain, explains Sim. For instance, “Enron” might be replaced with “Avocado,” and “energy forecast” might be replaced with a relevant tech industry term. Sample-level adaptation, meanwhile, aims to give more weight to emails in the training dataset that resemble emails in the target domain and less weight to those that aren’t very similar. So if an Enron email is “Avocado-like,” the researchers will give it more weight while training.
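The two non-adversarial strategies can be sketched in a few lines of Python. This is an illustrative sketch, not the authors’ implementation: the term map (beyond the article’s “Enron” to “Avocado” example), the tokenizer, and the use of cosine similarity to the pooled target-domain vocabulary as the sample weight are all assumptions. A common alternative is to weight each example by the output of a classifier trained to tell the two domains apart.

```python
import math
import re
from collections import Counter

# Hypothetical source-to-target term map. Only "enron" -> "avocado" comes from the
# article's example; the other entries are placeholders.
TERM_MAP = {"enron": "avocado", "gas": "software", "energy": "product"}


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())


def adapt_features(tokens: list[str], term_map: dict[str, str]) -> list[str]:
    """Feature-level adaptation: swap source-domain terms for target-domain analogues."""
    return [term_map.get(t, t) for t in tokens]


def cosine(a: Counter, b: Counter) -> float:
    # Counter returns 0 for missing keys, so tokens absent from b contribute nothing.
    dot = sum(count * b[tok] for tok, count in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def sample_weights(source_docs: list[str], target_docs: list[str],
                   floor: float = 0.05) -> list[float]:
    """Sample-level adaptation: upweight source emails that look like the target domain.

    Each weight is the email's cosine similarity to the pooled target-domain
    bag of words (floored so no example is dropped), rescaled to average 1.
    """
    target_bag = Counter()
    for doc in target_docs:
        target_bag.update(tokenize(doc))
    raw = [max(cosine(Counter(tokenize(d)), target_bag), floor) for d in source_docs]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


source = ["Enron gas forecast for the trading desk",
          "I will send the release notes tomorrow"]
target = ["I will send the build tomorrow",
          "release notes for the new app version"]
print(adapt_features(tokenize("Enron energy forecast"), TERM_MAP))  # ['avocado', 'product', 'forecast']
print(sample_weights(source, target))
```

Here the second source email, which shares more vocabulary with the target emails, receives the larger training weight.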

General schema of the proposed neural autoencoder model used for commitment detection.

The most novel—and successful—of the three techniques is the adversarial deep learning approach, which in addition to training the model to recognize commitments also trains the model to perform poorly at distinguishing between the emails it’s being trained on and the emails it will evaluate; this is the adversarial aspect. Essentially, the network receives negative feedback when it correctly identifies an email’s source, training it to be bad at recognizing which domain a particular email comes from. This has the effect of minimizing or removing domain-specific features from the model.

“There’s something counterintuitive to trying to train the network to be really bad at a classification problem, but it’s actually the nudge that helps steer the network to do the right thing for our main classification task, which is, is this a commitment or not,” says Sim.
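One widely used way to implement this kind of adversarial training is a gradient reversal layer, as in domain-adversarial neural networks (DANN). A domain classifier learns normally, but the gradient it sends back into the shared feature extractor is flipped in sign. The sketch below assumes that technique and a deliberately tiny linear model with hand-written gradients. It omits the autoencoder the researchers used, and every name, feature, and hyperparameter in it is hypothetical.

```python
import math


def sigmoid(t: float) -> float:
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def reversed_grad(dly_dz: float, dld_dz: float, lam: float) -> float:
    """Gradient the shared extractor receives: label loss as usual, domain loss flipped."""
    return dly_dz - lam * dld_dz


def dann_step(params: dict, x: list[float], y, domain: int,
              lam: float = 0.5, lr: float = 0.05) -> dict:
    """One SGD step on a toy domain-adversarial model.

    params: "w" = extractor weights (list); "a", "b" = commitment head;
            "c", "d" = domain head.
    y: 1/0 commitment label, or None for unlabeled target-domain emails.
    domain: 0 for the source corpus, 1 for the target corpus.
    """
    w = params["w"]
    z = sum(wi * xi for wi, xi in zip(w, x))  # shared representation (a scalar here)

    # Domain head: ordinary binary cross-entropy; d(loss)/d(logit) = p - target.
    g_dom = sigmoid(params["c"] * z + params["d"]) - domain
    dld_dz = g_dom * params["c"]

    # Commitment head: only labeled (source) emails contribute; it descends its loss.
    dly_dz = 0.0
    if y is not None:
        g_lab = sigmoid(params["a"] * z + params["b"]) - y
        dly_dz = g_lab * params["a"]
        params["a"] -= lr * g_lab * z
        params["b"] -= lr * g_lab

    # The domain head also descends its own loss, getting better at spotting the source.
    params["c"] -= lr * g_dom * z
    params["d"] -= lr * g_dom

    # The extractor instead ascends the domain loss, so z carries less domain information.
    g_z = reversed_grad(dly_dz, dld_dz, lam)
    params["w"] = [wi - lr * g_z * xi for wi, xi in zip(w, x)]
    return params


# Hypothetical toy data: x = [contains "I will", contains a domain-specific word].
toy = [([1.0, 1.0], 1, 0), ([0.0, 1.0], 0, 0), ([1.0, 0.0], None, 1), ([0.0, 0.0], None, 1)]
params = {"w": [0.1, 0.1], "a": 0.1, "b": 0.0, "c": 0.1, "d": 0.0}
for _ in range(200):
    for x, y, dom in toy:
        dann_step(params, x, y, dom)
```

The coefficient `lam` sets how hard the extractor is pushed to confuse the domain classifier relative to how hard it is pushed to detect commitments.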

Empowering users to do more

The two papers are aligned with the greater Microsoft goal of empowering individuals to do more, tapping into an ability to be more productive in a space full of opportunity for increased efficiency.

Reflecting on his own email usage, which finds him interacting with his email frequently throughout the day, White questions the cost-benefit of some of the behavior.

“If you think about it rationally, it’s like, ‘Wow, this is a thing that occupies a lot of our time and attention. Do we really get the return on that investment?’” he says.

He and other Microsoft researchers are confident they can help users feel better about the answer with the continued exploration of the tools needed to support them.


Podcast: Putting the ‘human’ in human-computer interaction with Haiyan Zhang


Haiyan Zhang, Innovation Director

Episode 62, February 6, 2019

Haiyan Zhang is a designer, technologist and maker of things (really cool technical things) who currently holds the unusual title of Innovation Director at the Microsoft Research lab in Cambridge, England. There, she applies her unusual skillset to a wide range of unusual solutions to real-life problems, many of which draw on novel applications of gaming technology in serious areas like healthcare.

On today’s podcast, Haiyan talks about her unique “brain hack” approach to the human-centered design process, and discusses a wide range of projects, from the connected play experience of Zanzibar, to Fizzyo, which turns laborious breathing exercises for children with cystic fibrosis into a video game, to Project Emma, an application of haptic vibration technology that, somewhat curiously, offsets the effects of tremors caused by Parkinson’s disease.


Episode Transcript

Haiyan Zhang: We started out going very broad, and looking at lots of different solutions out there, not necessarily just for tremor, but across the spectrum to address different symptoms of Parkinson’s disease. And this is actually really part of this whole design thinking methodology which is to look at analogous experiences. So, taking your core problem and then looking at adjacent spaces where there might be solutions in a completely different area that can inform upon the challenge that you are tackling.

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: Haiyan Zhang is a designer, technologist and maker of things (really cool technical things) who currently holds the unusual title of Innovation Director at the Microsoft Research lab in Cambridge, England. There, she applies her unusual skillset to a wide range of unusual solutions to real-life problems, many of which draw on novel applications of gaming technology in serious areas like healthcare.

On today’s podcast, Haiyan talks about her unique “brain hack” approach to the human-centered design process, and discusses a wide range of projects, from the connected play experience of Zanzibar, to Fizzyo, which turns laborious breathing exercises for children with cystic fibrosis into a video game, to Project Emma, an application of haptic vibration technology that, somewhat curiously, offsets the effects of tremors caused by Parkinson’s disease. That and much more on this episode of the Microsoft Research Podcast.

(music plays)

Host: Haiyan Zhang, welcome to the podcast.

Haiyan Zhang: Hi, thanks Gretchen. Great to be here.

Host: You are the Innovation Director at MSR Cambridge in England, which is a super interesting title. What is an Innovation Director? What does an Innovation Director do? What gets an Innovation Director up in the morning?

Haiyan Zhang: I guess it is quite an unusual title. It’s a kind of a bespoke role, I would say, because of my quite unusual background, I guess. Part of what I do is look at how technology can be applied in real use cases in the world to create business impact, within Microsoft and outside of Microsoft, and to make those connections between our deeply technical research with applied product groups across the company.

Host: So, is this a job that existed at MSR in Cambridge or did you arrive with this unique set of talents and skills and background and ability, and bring the job with you?

Haiyan Zhang: I would say it’s something I brought with me and it’s evolving over time. (laughs)

Host: Well, unpack that a little bit. How has it evolved since you began? When did you begin?

Haiyan Zhang: So, I actually joined Microsoft about five and a half years ago and I actually initially joined as part of the Xbox organization, running an innovation team in Xbox in London and looking at new play experiences for kids, for teens, that were completely outside of the box. And then from that, I transitioned into Microsoft Research. And part of my team also continued on that research in terms of creating completely new technology experiences around entertainment. And more recently, I’m working across the lab with various projects to see how we can connect our sort of fundamental computer science work better with products across Microsoft in terms of Azure cloud infrastructure, in terms of Xbox and gaming, in terms of Office and productivity.

Host: You’ve been in high-tech for nearly twenty years and you’ve worked in engineering and user experience and research… R&D, hardware, service design, etc., and even out in the “blue-sky envisioning space.” So, that brings a lot to the party in the form of one person. (laughter) Quite frankly, I’m impressed. How has your experience in each, or all, of these areas informed how you approached the research you do today?

Haiyan Zhang: Well thanks, Gretchen. I’m really… I’m quite honored to be on the podcast actually because I’m so impressed with all the researchers that you’ve been interviewing across all the MSR labs. So, I would say that, in the research work that I do, I bring a very human-centered lens to looking at technology. So I undertake a full, human-centered design process starting from talking to people, getting empathy with people, trying to extract insight from what people really need and then going deeply into the technical research to develop prototypes, technology ideas to support those needs, and then deploying those prototypes in the field to understand how that can be improved and how we can evolve our technology thinking.

Host: Let’s talk about design thinking, then, for a minute. I don’t know if you’d call it discrete from computational thinking or any other kind of thinking, but it seems to be a buzz phrase right now. So, as a self-described designer, technologist and maker of things, how would you define design thinking?

Haiyan Zhang: So, I would say that design thinking is not separate from computational thinking, it’s a layer above. It’s just an approach to problem-solving, and it’s basically a tool kit that allows you to utilize different methods to really gain an understanding of people’s needs, to gain an understanding of insight into how people’s lives can be improved through technology, and then tools around prototyping and evaluating those prototypes. So, I would say that it is not, in itself, a scientific method, but it can be used to improve and augment your existing practice.

Host: Let’s get specific now and talk about some of those projects that you’ve been working on, starting with Project Zanzibar. What was the inspiration behind this project? How did you bring it to life and how does it embody your idea of connected play experiences that you’ve talked about?

Haiyan Zhang: I think there is a rich history in computer science of tangible user interfaces. You know, some of the early work at Xerox PARC even, or at the MIT Media Lab, around how we can create these seamless interactions between people, their physical environment and a digital universe. And I think the approach we had to Zanzibar was that the most fruitful area for exploration in tangible user interfaces would be to enable kids to play and learn through physicality, through interacting with physical objects that were augmented with virtual information, because we’re really trying to tap into this idea of multi-modal learning and learning through play. So, coming from this initial approach, we dove very deeply into how we would invent a completely new technology platform to enable people to very seamlessly manipulate objects in a natural way using gestures, and then bring about some new digital experiences layered on top of that, such as games or education scenarios, and then bring those together in terms of really fundamental technology invention, but also applications that could demonstrate what that technology could do.

Host: Right. Well, and it’s too bad that this is an audio-only experience here on the podcast because there’s a really cool overview of this project on the Microsoft Research website and it’s a very visual, artifact-based approach to playing with computers.

Haiyan Zhang: Yeah, yeah. And I encourage everyone to visit the project page and take a look at some of the videos and our prototypes that we have published.

Host: Right. So, what was the thinking behind tying in the artifact and the digital?

Haiyan Zhang: You know, there’s this rich history of research with physical objects and we’ve proven out that physical/digital interaction is a great way forward in terms of novel interactions between people and computing. But the pragmatics of these systems have not been ideal. You know, you have to be sat at your desk and there has to be an overhead camera, which a lot of research projects involve, or there’s occlusion in terms of where your hand can be and where the physical objects can be, because the cameras won’t be able to track them. So, what we set out to do was think about how you would design a technology platform that overcomes a lot of these barriers, so that we can be freed up to think about those scenarios, and also empower other researchers who are doing research in this space to think about those scenarios. So, our research group had this idea of leveraging NFC, but leveraging it in terms of an NFC antenna array so that we could track objects in a 2-D space. And the additional novelty was layering that with a capacitive multi-touch layer. The NFC array tracks the objects by their physical IDs on top of the surface, and the capacitive multi-touch enhances that tracking, but we could also track hand gestures, both multi-touch gestures on top of the surface and some hover gestures just above the surface as well.

(music plays)

Host: Let’s talk a bit about another really cool project that you’re working on. I know Cambridge, your lab, is deeply, and maybe even uniquely, invested in researching and developing technologies to improve healthcare, and you have a couple projects going on in this area. One of them, Project Fizzyo. I’ll spell it. It’s F (as in Frank)-i-z-z-y-o. Tell us about this project. How did it come about? What’s the technology behind it and how does it work?

Haiyan Zhang: So, Fizzyo really started as a collaboration with the BBC and we were inspired by one family. The mom, Vicky, has four kids and two of her boys have cystic fibrosis, a genetic condition where their internal organs are constantly secreting mucus. And so, every day, twice a day, the boys have to do this laborious breathing exercise to expel the mucus from their lungs, and it involves breathing into these plastic apparatus. These devices basically apply pressure to your breath so that when you breathe, it creates an oscillating effect in your lungs and loosens the mucus, and then it culminates in you coughing and trying to cough out the mucus from your lungs. They’re usually plastic devices where, as you blow, the air kind of enters a chamber and there might be some sort of mechanism that oscillates the air, like a ball-bearing that bounces up and down, and so they are very low-fi; there’s no digital aspect to these devices. And you can imagine, these kids, they are having to do these exercises from a very early age, from as early as they can remember, twice a day for 30 minutes, for an hour at a time. It’s really intensive and it can be, you know, if not painful, at least really uncomfortable to do. And I actually tried to do this once and I felt really light-headed. I actually couldn’t do one session of it. And also, the kids, they want to be outside playing with their friends. You know, they don’t want to be stuck indoors doing this all the time. And there is no direct link between doing the exercise and feeling an improvement, because the activity is about maintenance, so you are trying to maintain your health, because if you don’t clear the mucus from your lungs, infection can set in and that means going to the hospital, that means getting antibiotics. And so, it’s a very challenging thing for Vicky, their mom, to be jostling them, be harassing them to do this all the time.
And she said that her role has really changed with her kids and that she’s no longer a mom, she’s sort of nagging them all the time. And so, we visited with the family to really understand their plight. And she asked, you know, can we create a piece of technology that can help us in getting the kids to do this kind of physio, the treatment is a type of physio. And so, we actually came up with this idea together where she said, you know, the boys really love to play video games so, what if we could create a way for the boys to be playing a video game as they are undertaking this exercise. So, we started this process of prototyping and developing a digital attachment, a sensor, that attaches to all these various different physio devices. And as the patient is expelling, is breathing out, the sensor actually senses the breath and transmits that digital signal to a tablet and we can translate that signal into controls for a video game. And we’re also able to upload that to the cloud, to do further analysis on that medical treatment.

Host: Wow. How is it working?

Haiyan Zhang: We started this project about two and a half years ago. It’s been a long process, but a really fruitful and rewarding one. So, we started out with just some early prototypes, just using off-the-shelf electronics to get the breath sensor working just right. We added a single button, because we realized if you were just using the breath to play video games, it’s actually really challenging. And then, within the team, our industrial designer, Greg Saul, designed the physical attachment. We developed our own sensor board and we had it manufactured along with the product design. And we partnered with University College London, their physiotherapy department, and the Great Ormond Street Hospital in London, where they’ve deployed over a hundred of these units with kids across the country to do a long-term trial. So actually, when we first met with the University College London physiotherapy department, I mean, this is a department whose staff have spent their entire careers working with kids in this domain. And they had never had any contact with the computer science department. This was not a digital research area. When they first met us, and they saw, on the computer screen, someone breathing out, and a graph showing that breath, the peak of that breath, one of the heads of the department that we were working with, she started to cry because she said that in her entire career, she had never seen physio data visualized in this way. It was just incredible for her.

Host: Wow.

Haiyan Zhang: And so, we decided to partner, and they’ve been amazing because, through this journey, they’ve gone to meet people in the computer science department, and they initiated master’s degrees incorporating data science and digital understanding. They just hired their first data scientist in order to leverage the platform that we’ve built to do further analysis to improve the health of these kids. And they said that even though this kind of exercise has been around for decades, no one has actually done a definitive, long-term study to track the efficacy of this kind of exercise on health, on outcomes. You know, because I think past studies have really relied on keeping paper diaries, answering questionnaires, but no one has done that digital study, which is what the power of Internet of Things can really bring you, which is tracking in the background in a very precise way.

Host: Talk about the role of machine learning. How do any of the new methodologies in computer science like machine learning methods and techniques play into this?

Haiyan Zhang: You know, what’s really interesting with machine learning is the availability of data. And, you know, we understand that what has driven this AI revolution is now the availability of large data sets to actually be able to develop new ML algorithms and models. And in many cases, especially in healthcare, there is a lack of data. So, I think throughout different areas of computer science research, there’s a real need to kind of connect the dots and actually develop IoT solutions that can start at the beginning and capture the data, because it’s only through cleverly capturing valid data that we can then do the machine learning in the back end once we’ve done the data collection. And so, I think the Fizzyo project is a really good proof point of that, in that we started out with IoT in order to gather the information that tracks the health exercises. And we just sort of deployed in the UK, so as we’ve collected this data, we’re now able to look at that and start to do some predictions around long-term health. So, you know, some of the questions that physiotherapy researchers are trying to answer: if kids are very adherent to this kind of exercise, if they are doing what they are being told, they are doing this twice a day for the duration that they are supposed to be doing it, does that mean, in six months’ time or a year’s time, their number of days in hospital is going to be reduced? Does it actually impact how much time they are spending being ill? If we see a trailing-off of this exercise, does that mean that we’ll see an increase in infection rates? So, with the data that we’re collecting, we’re now working with a different part of Microsoft, they’re called the Microsoft Commercial Software Engineering team, who are actively delving into projects around AI for good, and they are going to be working with UCL to do some of this clustering and developing models around health prediction.
So, clustering the patients into different cohorts to understand if there are predictive factors around how they are doing the exercises and how much time they are going to be spending in hospital in the years to come.

Host: Well, it almost would be hard for me to get more excited about something than what you just described in Project Fizzyo, but there is another project to talk about which is Project Emma. This is so cool it’s even been featured on a documentary series there in the UK called The Big Life Fix. And it didn’t just start with a specific idea, but with a specific person. Tell us the story of Emma.

Haiyan Zhang: Yes! So, again, Project Emma started with a single person, with Emma Lawton, who was diagnosed with early onset Parkinson’s disease when she was 28 years old. And, it had been five years since her diagnosis and some of her symptoms had progressed quite quickly and one of them was an active tremor. So, her tremor would get worse as she started to write or draw. And this really affected how she went about her day-to-day work because she was a creative director, a graphic designer, and day-to-day she would be in client meetings, talking with people and trying to sketch out what they meant in terms of the ideas that they had. And she would not be able to do that. And when I first met with her, she would sit with a colleague and her colleague would actually draw on her behalf. So, she really was looking for some kind of technology intervention to help her. And, we started out going very broad, and looking at lots of different solutions out there, not necessarily just for tremor, but across the spectrum to address different symptoms of Parkinson’s disease. And this is actually really part of this whole design thinking methodology which is to look at analogous experiences. So, taking your core problem and then looking at adjacent spaces where there might be solutions in a completely different area that can inform upon the challenge that you are tackling. So, we looked at lots of different solutions for other kinds of symptoms and of course, there was a lot of desk research. It was reading research papers that had been published over the decades that looked at tremors specifically. So, I think there were two aspects that really influenced our thinking. One was around going to visit with a local charity called Parkinson’s UK, and we were asking them to show us their catalogue of widgets and devices that they sold to Parkinson’s patients that helped them in their everyday lives. And on the table, there was a digital metronome.
So, you know, when you’re playing the piano, musicians have this ticking metronome. And I asked, so why is there a metronome on the table? And the lady said, well, some Parkinson’s patients have a symptom called freezing of gait, where, when you are walking along, your legs suddenly freeze and you lose control of them. And so, sometimes people find that if they take out this metronome and turn it on, and it makes this rhythmic ticking sound, it somehow distracts their brain into being able to walk again, which is really kind of odd. There’s been a little bit of literature around this. In the literature it’s called cueing, a cueing effect, but it doesn’t apply to tremor. But, for me, it signaled an interesting brain hack, and hinted at what might be going on in your brain when you have Parkinson’s disease. At the same time, there had been a number of papers around using vibration on the muscles to try to ameliorate tremor, with varying effect. They were not specifically looking at Parkinson’s, but at other kinds of tremor disorders, like essential tremor and dystonia. And so, we developed a hypothesis, and in order to test it, we developed a prototype: a wearable device for the wrist that had a number of vibrating motors on it. It would apply vibration to the wrist in a rhythmic fashion in order to somehow circumvent the mechanism that was causing the tremor. Of course, we had a number of other hypotheses, too; this was not the only one. We had other devices that worked in a completely different way, more about mechanically stopping or countering the tremor. And this device actually worked really well. So, we were surprised, but very, very happy, and this is the direction that we took to further develop this product.

Host: Right. So, drilling in, I do want to mention that there is a video on this on the website as well. It’s a video that made me cry. I think it made you cry, and it made Emma cry. We’re all just puddles of tears, because it’s so fantastic. And so, this kind of circles back to research writ large, and experimenting with ideas that may not necessarily be what we would call high-tech; maybe they are kind of lo-fi, you know, a vibration tool that can keep you from shaking. So, how did it play out? How did you prototype this? Give us a little overview of your process.

Haiyan Zhang: For us, it was a very simple prototyping exercise. We took some off-the-shelf coin cell motors and developed, basically, a haptic bracelet, with an app that let you program the haptics on the bracelet. And that’s what we experimented with. So, it was just research from the haptics area of computer science, which is really about mechanisms for sensing something about the digital world, for example in VR, now applied to this medical domain.

(music plays)

Host: You have a diverse slate of projects going on at any given time and your teams are really diverse. So, I want you to talk, specifically, about the composition of skills and expertise that are required to bring some of these really fascinating research projects to life, and ultimately to market. Who is on your team and what do they bring to the party?

Haiyan Zhang: Well, I think there’s just something really unique about Microsoft Research, and Microsoft Research Cambridge in particular. We have such a broad portfolio of projects, but also expertise in the different computer science fields, that we can pull together these multidisciplinary teams to go after a single topic. So, within our lab we have social scientists doing user research, gaining real insight into how people behave and how people think about various technologies. We have designers that are exploring user interfaces, exploring products to bring these ideas to life. We have, you know, computer vision specialists. We have machine learning specialists. We have natural language processing people, systems researchers, security researchers and, obviously, healthcare researchers. So, it’s that broad outlook that I think can really push forward technology innovation while really emphasizing the applications for people, for improving society as a whole.

Host: I ask all my guests some form of the question: is there anything that keeps you up at night? And I know that many people, mainly parents, are worried that their kids are too engaged with screens or not spending enough time in real life and so on. What would you say to them, and is there anything that keeps you up at night about the broader swath of what you are working on?

Haiyan Zhang: You know, on the topic of screen time, obviously it’s something that we really wrestled with in the Zanzibar research specifically, which is about thinking how you could interact with physical objects instead of a digital screen, and also bringing that kind of bigger interaction surface between family and friends so they could interact together. At the same time, I would say that culture is constantly changing and how we live our lives is constantly changing. We’ve only seen the internet really embedded in our lives in the last, I’d say, fifteen or twenty years. When we were younger, I think, we had television and there were no computers, and so, I say culture is constantly evolving. How we’re growing, how we’re living is constantly evolving. It’s important for parents to evaluate this changing landscape of technology and to figure out what is the best thing to do with their kids. And maybe you don’t have to rely on how you grew up, but rather evaluate whether our kids are getting the right kind of social interaction, the right amount of parental support and quality time with their family. I think that’s what is important, but also to accept that how we’re growing is changing.

Host: What about the idea of the internet of things and privacy when we’re talking about toys and kids?

Haiyan Zhang: Mmm, yeah, it is something we really have to watch out for, and um you know, we’ve seen some bad examples of the toy industry jumping ahead too far and enabling toys to be connected 24/7 and conversing with kids and what does that really mean? I’ve seen some really great research out of the MIT Media Lab where there was a researcher really looking at how kids are conversing with AI, with different AI agents and their mental model of these AI agents. So, I think that’s a really great piece of research to look at, but also maybe to expand upon. As a research community, if we’re thinking about kids, to understand that how kids are interacting with AI is going to be more commonplace, and rather than trying to avoid it, to really tackle it head-on and see how we can improve the principles around designing AI, how we can inform companies in the market out there of what is the ethical approach to doing this so that kids really understand what AI is as they are growing up with it.

Host: We’re coming up on an event at Microsoft Research called I Chose STEM and it’s all about encouraging women to… well, choose STEM! As an area of study or a career.

Haiyan Zhang: Yeah.

Host: So, tell us the story of how you chose it? What got you interested in a career in high-tech in general, and maybe even high-tech research specifically? Who were your influences?

Haiyan Zhang: I have, I guess, a slightly unique background, in that I was born in China, and at the time I had a very Communist kind of education growing up. My family moved to Australia when I was 8 years old. And I was always very technical and very nerdy, but I never thought about technology as a career. I actually wanted to study law when I was in high school. Computing was just something that was, you know, kind of fun, but I never thought about it as a career. And I’d say in the last year of high school, I decided to switch and do computer science, and I realized that I was actually really good at it. I guess what led me to choose STEM is just the fun and creativity you can have with programming. You know, I would always come up with my own little creative exercises to write on the computer. It wasn’t the rote exercises; it was the ability to be creative with this technical tool that really got me excited. At the same time, I love this huge effort within our industry to really focus on getting more women, more girls into technology, into STEM education, and we really want to increase representation, toward equal representation. I have also found it, at times, challenging to be the only woman in the room. You know, when I was in computer science, sometimes I’d be one of three women in the lecture theater or something. I think we need to adopt this kind of pioneer mindset, so that we can go into these new areas, go into a room where you’re the only person, where you’re unique in that room, knowing you have something to contribute, and not be afraid to speak up. I think that’s a really important mindset and skill for anybody to have.

Host: No interview would be complete if I didn’t ask my guest to predict the future. No pressure, Haiyan. Seriously though, you are living on the cutting edge of technology research which is what this podcast is all about. And so what advice or encouragement – you’ve just kind of given some – would you give to any of our listeners across the board who might be interested or inspired by what you are doing? Who is a good fit for the research you do?

Haiyan Zhang: My advice would be, especially in the research domain, to develop that deep research expertise, but to keep a holistic outlook. I think the research landscape is changing in that we are going to be working in more multidisciplinary teams, working across departments. You know, sometimes it’s the healthcare department, the physiotherapy department, with the computer science department. It’s through the connection of these disparate fields that I think we’re going to see dramatic impact from technology. And I think for researchers to have that holistic outlook, to visit other departments, to understand what are the challenges beyond their own group, I think is really, really important. And develop collaboration skills and techniques.

Host: Haiyan Zhang, it’s been a delight. Thanks for joining us today.

Haiyan Zhang: Thanks so much, Gretchen. It’s been a real pleasure, thank you.


Podcast with Dr. Rico Malvar, manager of Microsoft Research’s NExT Enable group

Rico Malvar, Chief Scientist and Distinguished Engineer

Episode 61, January 30, 2019

From his deep technical roots as a principal researcher and founder of the Communications, Collaboration and Signal Processing group at MSR, through his tenure as Managing Director of the lab in Redmond, to his current role as Distinguished Engineer, Chief Scientist for Microsoft Research and manager of the MSR NExT Enable group, Dr. Rico Malvar has seen – and pretty well done – it all.

Today, Dr. Malvar recalls his early years at a fledgling Microsoft Research, talks about the exciting work he oversees now, explains why designing with the user is as important as designing for the user, and tells us how a challenge from an ex-football player with ALS led to a prize-winning hackathon project and produced the core technology that allows you to type on a keyboard without your hands and drive a wheelchair with your eyes.


Episode Transcript

Rico Malvar: At some point, the leader of the team, Alex Kipman, came to us and said, oh, we want to do a new controller. What if you just spoke to the machine, made gestures, and we could recognize everything? You’d say, that sounds like sci-fi. And then we said, no, wait a second: to detect gestures, we need specialized computer vision. We’ve been doing computer vision for 15 years. To identify your voice, we need speech recognition. We’ve also been doing speech recognition for 15 years. Oh, but now there may be other sounds and multiple people… oh, but just a little over 10 years ago, we started these microphone arrays. They are acoustic antennas. And I said, wait a second, we actually have all the core elements, we could actually do this thing!

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: From his deep technical roots as a principal researcher and founder of the Communications, Collaboration and Signal Processing group at MSR, through his tenure as Managing Director of the lab in Redmond, to his current role as Distinguished Engineer, Chief Scientist for Microsoft Research and manager of the MSR NExT Enable group, Dr. Rico Malvar has seen – and pretty well done – it all.

Today, Dr. Malvar recalls his early years at a fledgling Microsoft Research, talks about the exciting work he oversees now, explains why designing with the user is as important as designing for the user, and tells us how a challenge from an ex-football player with ALS led to a prize-winning hackathon project and produced the core technology that allows you to type on a keyboard without your hands and drive a wheelchair with your eyes. That and much more on this episode of the Microsoft Research Podcast.

Host: Rico Malvar, welcome to the podcast.

Rico Malvar: It’s a pleasure to be with you, Gretchen.

Host: You’re a Distinguished Engineer and Chief Scientist at Microsoft Research. How would you define your current role? What gets you up in the morning?

Rico Malvar: Ha ha! Uh, yeah, by chief scientist, it means I tell everybody what to do, very simple. (laughing) Yeah… Not really, but Chief Scientist is basically a way for me to have my fingers and eyes, in particular, on everything going on at Microsoft Research. So, I have an opportunity to interact with, essentially, all the labs, many of the groups, and find opportunities to do collaborative projects. And that is really super-exciting. And it’s really hard to be on top of what everybody is doing. It’s quite the opposite of telling people what to do; it’s like trying to follow up on what they are doing.

Host: It’s um – on some level herding cats?

Rico Malvar: It’s not even herding. It’s where are they??

Host: You got to find the cats.

Rico Malvar: Find the cats, yeah.

Host: Well, talk a little bit about your role as Distinguished Engineer. What does that entail, what does that mean?

Rico Malvar: That’s basically… there’s a whole set of us. We have Distinguished Engineers and Technical Fellows, which are at the top of our technical ladder. The idea is partly recognition of some of the contributions we’ve made in the technical area, but it’s mostly our responsibility to go after big technical problems, and not think just about the group you’re in, but about the company: what the company needs, and how the technology in that particular area should be evolving. My area, in particular, on the technical side, is signal processing, data compression, media compression. And these days, with audio and video entering the internet, that matters a lot. There are a few other areas too, but that’s the idea: what are the big problems in technology, how can we drive new things, how can we watch out for new things coming up at the company level?

Host: You know, those two things that you mentioned, drive things and anticipate things, are two kind of different gears and two different, I won’t say skillsets, but maybe it’s having your brain in two places.

Rico Malvar: You are right. It’s not completely different skillsets but driving and following are both important and one helps the other. And it’s very important for us to do both.

Host: Let’s go back to your roots a little bit. When you started here at Microsoft Research, you were a principal researcher and the founder and manager of what was called the Communications, Collaboration and Signal Processing group at MSR. So, tell us a little bit about the work you used to do and give us a short “where are they now?” snapshot of that group.

Rico Malvar: Yeah, that name is funny. That name was a bad example of what happens when you get too democratic about choosing names: we got everybody in the team to give ideas, it got all complicated, we ended up with a little bit of everything, and we came up with a boring name instead of a cool one. But it was a very descriptive name, which was good. It was just called Signal Processing when we started, and then it evolved to Communication, Collaboration and Signal Processing because of the new things we were doing. For example, we had a big project in the collaboration area, the prototype of a system which later evolved to become the RoundTable product. That’s not just signal processing, it’s collaboration, so we put collaboration in. But people use it to communicate, so it’s also communication; okay, put it all in the name. So, it was just like that. And on your question of where people are, a cool thing is that we had a combination of expertise in the team to be able to do things like RoundTable. We had computer vision experts, distributed systems experts, streaming media experts and audio experts. On the audio side, for example, we later evolved a new group doing specifically audio signal processing, which is now led by Ivan Tashev, who was a member of my team and now has his own team. He already participated in your podcast, so it’s nice to see the interesting challenges in those areas continue. And we keep evolving, as you know. The groups are always changing, modifying, renewing.

Host: In fact, that leads into my next question. Microsoft Research, as an entity, has evolved quite a bit since it was formed in 1991. And you were Managing Director in the late 2000s, from like 2007 to 2010?

Rico Malvar: ‘10. Of the lab here in Redmond, yeah.

Host: Yeah. So, tell us a little bit about the history of the organization in the time you’ve been here.

Rico Malvar: Yeah. It’s great. One thing I really like about Microsoft Research is, first, that from early on the top leaders in the company always believed in the concept. Bill Gates started Microsoft Research, driven by Nathan Myhrvold, who was the CTO at the time, and it was a no-brainer for them to start Microsoft Research. They found Rick Rashid, who was our first leader of MSR, and I had the pleasure of reporting to Rick for many years. The vision he put in, which holds to this day, is: let’s really push the limits of technology. We don’t start by thinking how this is going to help Microsoft; we start by thinking how we push the technology, how it helps people. Later, we will figure out how it’s going to help Microsoft. And to this day, that’s how we operate. The difference, maybe, is that in the old days, the lab was more of a classical research lab. Almost everything was pivoted on research projects.

Host: Sure.

Rico Malvar: Which is great, and many, many of them generated good technology or even new products for the company. I was just talking about RoundTable as one example, and we have several. Of course, the vast majority fail, because research is a business of failure, and we all know that! We submit ten papers for publication, two or three get accepted. That is totally fine, and we keep playing the game. We do the papers as a validation and also as a way to interact with the community, and both are extremely valuable to us, so we can better understand whether we are pushing the state-of-the-art. And today, the new Microsoft Research puts even a little more emphasis on the impact side. We still want to push the state-of-the-art, we still do innovative things, but we want to spend a little more effort on making those things real.

Host: Yeah.

Rico Malvar: On helping the company. And the company itself evolved to a point where that has an even higher value, from Satya, our CEO, down. It is the mission of the company to empower people to do more. But empowering is not just developing the technology, it’s packaging it, shipping it in the right way, making products that actually leverage it. So, I would say the new MSR gets even more into, okay, what does it take to make this real?

Host: Well, let’s talk a little bit about Microsoft Research NExT. Give our listeners what I would call your elevator pitch for Microsoft Research NExT. What does it stand for, how does it fit in the portfolio of Microsoft Research? I kind of liken it to pick-up basketball, only with scientists and more money, but you’ll do it more justice than I do!

Rico Malvar: That’s funny. Yeah, NExT is actually a great idea. As I said, we’re always evolving. When Peter Lee came in, and also Harry Shum, who is our new leader, they thought hard about diversifying the approaches with which we do research. So, we still have the Microsoft Research labs, the part that is a bit more traditional in the sense that the research is mostly pivoted by areas. We have a graphics team, a natural language processing group, human computer interaction, systems, and so forth. Many, many of them. When you go to NExT, the idea is different. One way to achieve potentially even more impact is to pivot some of those activities, not by area, but by project, by impact goal. Oh, because of this technology and that technology, maybe we have an opportunity to do X, where X is this new project. Oh, but the first technology is computer vision, and the other one is hardware architecture. Oops, we’re going to need people in all those areas together in a project team. Peter Lee has been driving that, always trying to find disruptive, high-impact things so that we can take on new challenges. And lots of things are coming up from this new model, which we call NExT, which is New Experiences in Technology.

Host: I actually didn’t know that, what the acronym stood for. I just thought it was, what’s NExT, right?

Rico Malvar: Of course, that is a cool acronym. Peter did a much better job than we did on the CCSP thing.

Host: I love it.

(music plays)

Host: Well, let’s talk about Enable, the group. There’s a fascinating story of how this all got started and it involves a former football player and what’s now called the yearly hackathon. Tell us the story.

Rico Malvar: That is exactly right. It all started with that famous football player, ex-football player, Steve Gleason, who is still a good partner of ours and still a consultant to my team. Steve is a totally impressive person. He got diagnosed with ALS, and ALS is a very difficult disease because you basically lose mobility. At some point, your organs may lose their ability to function, so most people actually don’t survive ALS. But with some mitigations you can prolong life a little bit, and technology can help. Steve, actually, we quote him saying, “Until there is a cure for ALS, technology is the cure.” This is very inspiring. And he created a foundation, Team Gleason, that really does a wonderful job of securing resources and distributing resources to people with ALS. They really, really make a difference in the community. He came to us almost five years ago, when we were toying with the idea of creating this hackathon, which is a company-wide effort to create hack-projects. And then in one of those, which was actually the first one we did, in 2014, Steve told us, “You know what guys, I want to be able to do more. In particular, I want to be able to argue with my wife and play with my son. So, I need to communicate, and I need to move. My eyes still work, this eye tracking thing might be the way to go. Do you want to do something with that?” The hackathon team really got inspired by the challenge, and within a very short period of time, they created an eye tracking system where you look at the computer, there’s a keyboard, and you can type on the keys just by looking at them. And there is a play button, so you can compose sentences and then speak out with your eyes.

Host: That’s amazing.

Rico Malvar: And they also created an interface where they put buttons, similar to a joystick, on the screen. You look at those, and the wheelchair moves in the direction you are selecting. They did a nice overlay between the buttons and the video, so it’s almost like they mounted the computer on the wheelchair: you look through the computer, the camera shows what’s in front of you, and then the wheelchair goes. With lots of safety things, like a stop button. And that project was very successful. In fact, it won the first prize.

Host: The hackathon prize?

Rico Malvar: The hackathon prize. And then, a little bit later, Peter and I were thinking about where to go on new projects. And then Peter really suggested, Rico, what about that hackathon thing? That seems to be quite impactful, so maybe we want to develop that technology further. What do you think? I said, well if I had a team… (laughs) we could do that…

Host: (sings) If I only had a team…

Rico Malvar: (sings) If I only had a team… And then Peter said, ehh, how many people do you need? I don’t know, six, seven to start. I said, okay, let’s go do it. It was as easy as that.

Host: Well, let’s talk a little bit more about the hackathon. Like you said, it’s in about its fifth year. And, as I understand it, it’s kind of a ground-up approach. Satya replaced the annual “executive-inspirational-talk-top-down” kind of summer event with, hey, let’s get the whole company involved in invention. I would imagine it’s had a huge impact on the company at large. But how would you describe the role of the hackathon for people in Microsoft Research now? It seems like a lot of really interesting things have come out of that summer event.

Rico Malvar: You know, for us, it was a clear thing, because Microsoft Research was always bottom-up. I mean, we literally don’t tell researchers what to do. People, researchers, engineers, designers, managers, they all have great ideas, right? And they come up with those great ideas. When they click enough, they start developing something and we look from the top and say, that sounds good, keep going, right? So, we try to foster the most promising ones. But the idea of bottom-up was already there.

Host: Yeah.

Rico Malvar: When we look at the hackathon, we say, hey, thanks to Satya and the new leadership of Microsoft, the company’s embracing this concept of moving bottom-up. There’s The Garage. The Garage has been involved with many of those hackathons. Garage has been a driver and supporter of the hackathon. So, to us, it was like, hey, great, that’s how we work! And now we’re going to do more collaboration with the rest of the company.

Host: You have a fantastic and diverse group of researchers working with you, many of whom have been on the podcast already and been delightful. Who and what does it take to tackle big issues, huge ideas like hands-free keyboards and eye tracking and 3-D sound?

Rico Malvar: Right. One important concept, and it’s particularly important for Enable, is that we really need to pay attention to the user. Terms such as “user-centric” – yeah, they sound like a cliché – but especially in accessibility, this is super important. For example, in our Enable team, in the area working with eye tracking, our main intended users were people with ALS, given the motivation from Steve Gleason. And then, in our team, Ann Paradiso, who is our user experience manager, created what we call the PALS program. PALS means Person with ALS. And we actually brought people with ALS in their wheelchairs and everything to our lab and discussed ideas with them. So, they were not just testers, they were brainstorming with us on the design and technologies…

Host: Collaborators.

Rico Malvar: Collaborators. They loved doing it. They really felt, wow, I’m in this condition but I can contribute to something meaningful and we will make it better for the next generation…

Host: Sure.

Rico Malvar: …of people with this. So, this concept of strong user understanding through user design and user research, particularly on accessibility, makes a big difference.

Host: Mmm hmm. Talk a little bit about the technical side of things. What kinds of technical lines of inquiry are you really focusing on right now? I think our listeners are really curious about what they’re studying and how that might translate over here if they wanted to…

Rico Malvar: That’s a great question. Many of the advancements today are associated with artificial intelligence, AI, because of all the applications of AI, including in our projects. AI is typically a bunch of algorithms and data manipulation for finding patterns in data and so forth. But AI, itself, doesn’t talk to the user. You still need the last mile: the interface. Is the AI going to appear to the user as a voice? Or as something on the screen? How is the user going to interact with the AI? So, we need new interfaces. And then, with the evolution of technology, we can develop novel interfaces, eye tracking being an example. If I tell you that you’re going to control your computer with your eyes, you’re going to say, what? What does that mean? If I tell you you’re going to control the computer with your voice, you say, oh yeah, I’ve been doing that for a while. With eye tracking, a person with a disability immediately gets it and says, a-ha! I know what it means, and I want to use that. For everybody else, suppose, for example, that you are having your lunch break and you want to browse the news on the internet, get up to date on a topic of interest. But you’re eating a sandwich. Your hands are busy, your mouth is busy, but your eyes are free. You could actually flip through pages, do a lot of things, just with your eyes, and you don’t need to worry about cleaning your hands, because you don’t need to touch the computer. And you can think of a future where you may not even need your eyes. We may read your thoughts directly. And, at some point, it’s just a matter of time. It’s not that far away. We are going to read your thoughts directly.

Host: That’s both exciting and scary. Ummmm…

Rico Malvar: Yes.

Host: What does it take to say, all right, we’re going to make a machine be able to look at your eyes and tell you back what you are doing?

Rico Malvar: Yeah, you see, it’s a specialized version of computer vision. It’s basically cameras that look at your eyes. In fact, the sensor works by first illuminating your eyes with bright IR, infrared, lights, so it doesn’t bother you, because you can’t see it. But now you have this bright image that the camera, which can see IR, is looking at, and then models, with a little bit of AI and a little bit of graphics, computer vision and signal modeling, make an estimate of the position of your eyes and associate that with elements on the screen. So, it’s almost as if you have a cursor on the screen.
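A minimal sketch of the last step described above, associating an estimated gaze position with an element on the screen. It assumes, hypothetically, that the tracker already delivers gaze estimates in screen pixels, that keys are axis-aligned rectangles, and that jitter is reduced by averaging the most recent few samples. `Key`, `smooth_gaze` and `key_under_gaze` are illustrative names, not the API of any eye tracking product or of the Enable team’s software.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # hypothetical gaze estimate in screen pixels (x, y)


@dataclass(frozen=True)
class Key:
    """Hypothetical on-screen key: a label and an axis-aligned rectangle in pixels."""
    label: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        # Half-open bounds so two adjacent keys never both claim a boundary point.
        return self.x <= px < self.x + self.width and self.y <= py < self.y + self.height


def smooth_gaze(samples: Sequence[Point], window: int = 5) -> Point:
    """Average the last `window` raw gaze estimates to damp jitter (an assumed, simple filter)."""
    if not samples:
        raise ValueError("need at least one gaze sample")
    recent = samples[-window:]
    n = len(recent)
    return (sum(p[0] for p in recent) / n, sum(p[1] for p in recent) / n)


def key_under_gaze(gaze: Point, keys: Sequence[Key]) -> Optional[Key]:
    """Return the key whose rectangle contains the gaze point, or None if it is off the keys."""
    gx, gy = gaze
    for key in keys:
        if key.contains(gx, gy):
            return key
    return None


# Usage: three 100x100 keys in a row; noisy samples hovering around x = 150.
row = [Key("Q", 0, 0, 100, 100), Key("W", 100, 0, 100, 100), Key("E", 200, 0, 100, 100)]
raw = [(140.0, 45.0), (160.0, 55.0), (150.0, 50.0)]
hit = key_under_gaze(smooth_gaze(raw), row)
print(hit.label if hit else None)  # prints "W"
```

Averaging trades a little latency for stability; a real tracker pipeline may filter quite differently, and the window size here is an arbitrary choice.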

Host: Okay.

Rico Malvar: That is controlled with your eyes, very similar to a mouse, with the difference that the eye control works better if we don’t display the cursor. With the mouse, you actually should display the cursor…

Host: Ooohhh, interesting….

Rico Malvar: …with eye control, the cursor works better if it is invisible. But you see, the idea there is that you do need specialists, folks who understand that. And sometimes it’s a combination: some of that understanding is in the group, where we need to be the top leaders in that technology, or we partner with partners that have a piece of the technology. For example, for eye tracking, we put much more emphasis on designing the proper user interfaces and user experiences, because there are companies that do a good job introducing eye tracking devices. So, we leverage the eye tracking devices that these companies produce.

Host: And behind that, you are building on machine learning technologies, on computer vision technologies and… um… so…

Rico Malvar: Correct. For example, a typical one is the keyboard driven by your eyes. You still want to have a predictive keyboard.

Host: Sure.

Rico Malvar: So, as you are typing the letters, it guesses. But how you present the guesses is very interesting, because when you are typically using a keyboard, your eyes are looking at the letters and your fingers are typing on the keys. When you’re using an eye-controlled keyboard, your eyes have to do everything. So, the interface should be designed differently.

Host: Yeah.

Rico Malvar: And we’ve learned and designed good ways to make that different.
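Rico notes that an eye-driven keyboard still wants word prediction; what changes is how the guesses are presented. As a minimal sketch of the prediction half only, assuming a simple word-frequency model as a stand-in for whatever language model a real product uses, suggestions for the current prefix can be ranked by how often each word occurred in sample text. `build_model` and `suggest` are hypothetical names.

```python
from collections import Counter


def build_model(sample_text):
    """Count word frequencies in sample text (a stand-in for a real language model)."""
    return Counter(sample_text.lower().split())


def suggest(model, prefix, k=3):
    """Return up to k words starting with prefix, most frequent first, ties alphabetical."""
    prefix = prefix.lower()
    candidates = [(word, n) for word, n in model.items() if word.startswith(prefix)]
    candidates.sort(key=lambda wn: (-wn[1], wn[0]))
    return [word for word, _ in candidates[:k]]
```

With the model built from "the cat sat on the mat the cap then there", `suggest(model, "th")` returns `["the", "then", "there"]`. In an eye-typing layout each suggestion is just another target to look at, so keeping the list short (the `k` here is arbitrary) is one plausible choice.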

Host: If I’m looking at the screen and I’m moving my eyes, how does it know when I’m done, you know, like that’s the letter I want? Do I just laser beam the…??

Rico Malvar: You said you would be asking deep technical questions and you are. That one, we use the concept that we call “dwelling.” As you look around the keyboard, remember that I told you we don’t display the cursor?

Host: Right.

Rico Malvar: So, as the focus of your eye lands on a particular letter, we highlight that letter. It can be a different color, it can be a lighter shade of grey…

Host: Gotcha.

Rico Malvar: So, as you move around, you see the letters moving around. If you want to type a particular letter, once you get to that letter, you stop moving for a little bit, let’s say half a second. That’s a dwell. You dwell on that letter a little bit and we measure the dwell. And there’s a little bit of AI to learn what is the proper dwell time based on the user.
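The dwell mechanism Rico describes can be sketched as a small state machine over gaze samples, assuming each sample has already been mapped to the key under the gaze (or `None` when the gaze is off the keyboard). The 0.5-second default follows his "let's say half a second"; the `adapt` rule is a hypothetical stand-in for the per-user learning he mentions, not Microsoft's method.

```python
class DwellSelector:
    """Select the key under the gaze once it has been fixated for dwell_s seconds."""

    def __init__(self, dwell_s=0.5):
        self.dwell_s = dwell_s
        self.highlighted = None  # key to highlight; no cursor is drawn
        self._start = None       # time the gaze arrived on the highlighted key
        self._fired = False      # whether this fixation already produced a selection

    def update(self, t, key):
        """Feed one gaze sample (time in seconds, key or None); return a selected key or None."""
        if key != self.highlighted:
            # The gaze moved to another key or off the keyboard: restart the dwell timer.
            self.highlighted = key
            self._start = t
            self._fired = False
            return None
        if key is not None and not self._fired and t - self._start >= self.dwell_s:
            self._fired = True  # one selection per fixation; look away to type it again
            return key
        return None

    def adapt(self, was_corrected, step=0.05, lo=0.2, hi=1.5):
        """Hypothetical per-user tuning: lengthen the dwell after a selection the user
        had to undo, and shorten it slightly after one they kept."""
        delta = step if was_corrected else -step / 5
        self.dwell_s = min(hi, max(lo, self.dwell_s + delta))
```

Feeding samples every 0.25 s that rest on "H" and then on "I" selects "H" at 0.5 s and "I" half a second after the gaze arrives on it; lingering longer on a key does not repeat it.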

(music plays)

Host: One thing I’m fascinated by, not just here, but in scientific ventures everywhere, is the research “success story.” The one that chronicles the path of a blue-sky research thing to instantiation in a product. And, I know, over and over, researchers have told me, research is generally a slow business, so it’s not like, oh, the overnight success story, but there are a lot of hard-won success stories, or stories that sort of blossomed over multiple years of serendipitous discovery. Do you have any stories that you could share about things that you’ve seen that started out like a harebrained idea and now millions of people are using?

Rico Malvar: You know, there are so many examples. I particularly like the story of Kinect, which was actually not a product developed by Microsoft Research, but it was developed in close collaboration with Microsoft Research. It was the Kinect team, at the time, in Windows. At some point, the leader of the team, Alex Kipman, came to us and said, oh, we want to do a new controller. What if you just spoke to the machine, made gestures, and we could recognize everything? You say, that sounds like sci-fi. So, naahhh, that doesn’t work. But Alex was very insistent. And then we said, no, wait a second, to detect gestures, we need specialized computer vision. We’ve been doing computer vision for 15 years. To identify your voice, we need speech recognition. We’ve also been doing speech recognition for 15 years. Oh, but now there may be other sounds, and there may be multiple people… oh, but just a little over 10 years ago, we started working on these microphone arrays. They are acoustic antennas. They can tune in to the sound of whoever is speaking, all of that.

Host: Directional.

Rico Malvar: The directional sound input. And I said, wait a second, we actually have all the core elements, we could actually do this thing. So, after the third or fourth meeting, I said, okay Alex, I think we can do that. And he said, great, you have two years to do it. What??? Yeah, because we need to ship at this particular date. And it all worked. I doubt there’s some other institution or company that could have produced that, because we’d been doing what was, apparently, “blue-sky” research for many years, but we had created all those technologies, and when the need arose, I said, a-ha, we can put them all together.
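Rico calls the microphone arrays "acoustic antennas" that tune in to whoever is speaking. The textbook way to steer an array toward one direction is delay-and-sum beamforming: compute how much later the wavefront reaches each microphone, undo those delays, and average, so sound from the steered direction adds coherently while sound from elsewhere stays misaligned and is partly averaged away. The sketch below assumes a uniform linear array, a far-field source, and delays rounded to whole samples; it illustrates the principle only and is not a description of Kinect's actual audio pipeline.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C


def steering_delays(num_mics, spacing_m, angle_deg, fs_hz):
    """Arrival delay at each mic, in whole samples, for a far-field source.

    Uniform linear array; angle measured from broadside; mic 0 is the reference.
    """
    theta = math.radians(angle_deg)
    return [round(i * spacing_m * math.sin(theta) / SPEED_OF_SOUND * fs_hz)
            for i in range(num_mics)]


def delay_and_sum(channels, delays):
    """Undo each channel's arrival delay and average the aligned channels.

    Sound from the steered direction adds coherently; sound from other
    directions stays misaligned and tends to be attenuated by the averaging.
    """
    base = min(delays)
    shifts = [d - base for d in delays]
    n = min(len(ch) - s for ch, s in zip(channels, shifts))
    return [sum(ch[k + s] for ch, s in zip(channels, shifts)) / len(channels)
            for k in range(n)]
```

For example, with mics 343/8000 m (about 4.3 cm) apart sampled at 16 kHz, a talker 30 degrees off broadside arrives one sample later at each successive mic, so `steering_delays(4, 343 / 8000, 30, 16000)` gives `[0, 1, 2, 3]`, and `delay_and_sum` with those delays realigns the four copies of the talker's signal.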

Host: Where is Kinect today?

Rico Malvar: Kinect used to be a peripheral device for Xbox. We changed it into an IoT device. So, there’s a new Kinect kit that connects to Azure, so people can do Kinect-like things, not just for games but for everything. And all the technology that supports that is now in Azure.

Host: So, Rico, you have a reputation for being an optimist. You’ve actually said as much yourself.

Rico Malvar: (laughs) Yes, I am!

Host: Plus, you work with teams on projects that are actually making the lives of people with disabilities, and others, profoundly better. But I know some of the projects that you worked on fall somewhere in the bounds of medical interventions.

Rico Malvar: Mmm-hmm.

Host: So, is there anything about what you do that keeps you up at night, anything we should be concerned about?

Rico Malvar: Yeah, you know, when you are helping a person with a disability, sometimes what you are doing can be seen as, is that a treatment, is that a medical device? In most cases, they are not. But the answers to those questions can be complicated, and there can be regulations. And of course, Microsoft is a super-responsible company, and if anything is regulated, of course, we are going to pay attention to the regulations. But some of those are complex, so doing it right by the regulations can take a significant amount of work. So, we have to do this extra work. My team has to spend time, sometimes in collaboration with our legal team, to make sure we do the right things. And I hope also that we will help evolve those regulations, potentially by working with the regulatory bodies, educating them on the evolution of the technology. Because in almost all areas of technology, not just this area, regulations tend to be behind. They’re hard to move, and understandably so. So, the fact that we have to spend significant effort dealing with that does keep me up at night a little bit. But we do our best.

Host: You know, there’s a bit of a Wild West mentality where you have to, like you say, educate. And so, in a sense what I hear you saying is that, as you take responsibility for what you are doing, you are helping to shape and inform the way the culture onboards these things.

Rico Malvar: Exactly right, yes. Exactly right.

Host: So, how would you sort of frame that for people out there? How do we, you, help move the culture into a space that more understands what’s going on and can onboard it with responsibility themselves?

Rico Malvar: That is a great question. And you see for example, in areas such as AI, artificial intelligence, people are naturally afraid of how far can AI go? What are the kinds of things it could do?

Host: Yeah.

Rico Malvar: Can we regulate so that there will be some control in how it’s developed? And Microsoft has taken the stance that we have to be very serious about AI. We have to be ethical, we have to preserve privacy and all of those things. So, instead of waiting for regulation and regulatory aspects to develop, let’s help them. So, we were founders of – not just me, but the company and especially the Microsoft Research AI team – founders of the Partnership on AI, in partnership with other companies, to actually say no, let’s be proactive about that.

(music plays)

Host: Tell us a bit about Rico Malvar. Let’s go further back than your time here at MSR and tell us how you got interested in technology, technology research. How did you end up here at Microsoft Research?

Rico Malvar: Okay, on the first question, how I got interested in technology? It took me a long time. I think I was 8 years old when my dad gave me an electronics kit and I started playing with that thing and I said, a-ha! That’s what I want to do when I grow up. So, then I went through high school taking courses in electronics and then I went to college to become an electrical engineer, and I loved the academic environment, I loved doing research. So, I knew I wanted to do grad school. I got lucky enough to be accepted at MIT and when I arrived there, I was like, boy, this place is tough! And it was tough! But when I finished, I went back to my home country and created the signal processing group at the school there, which was… I was lucky to get fair amounts of funding, so we did lots of cool things. And then, one day, some colleagues at a company here in the US called me in Brazil and said, hey, our director of research decided to do something else. Do you want to apply for the position? And then I told my wife, hey, there’s a job opening in the US, what about that? She said, well, go talk to them. And I came, talked to them. They made me an offer. And then it took us about a whole month discussing, are we going to move our whole family to another country? Hey, we lived there before, it’s not so bad, because I studied here. And maybe it’s going to be good for the kids. Let’s go. If something doesn’t work, we move back. I said, okay. So, and… here we are. But that was not Microsoft. That was another company at the time, a company called PictureTel, which was actually the leading company in professional video conferencing systems.

Host: Oh, okay.

Rico Malvar: So, we were pushing the state-of-the-art on how do you compress video and audio and these other things? And I was working happily there for about four years, and then one day I see Microsoft and I say, wow, Microsoft Research is growing fast. Then one afternoon, I said, ah, okay, I thought about it, and I sent an email to the CTO of Microsoft saying, you guys are great, you are developing all these groups. You don’t yet have a group on signal processing. And signal processing is important because one day we’re going to be watching video on your computers via the internet and all of that, so you should be investing more in that. And I see you already have Windows Media Player. Anyway, if you want to do research in signal processing, here’s my CV. I could build and lead a group for you doing that. And then I tell my wife and she goes, you did what?? You sent an email to the CTO of Microsoft??

Host: Who was it at the time?

Rico Malvar: It was Nathan Myhrvold.

Host: Nathan.

Rico Malvar: And she said, nah. I say, what do I have to lose? The worst case, they don’t respond, and life is good. I have a good job here. It’s all good. And that was on a Sunday afternoon. Monday morning, I get an email from Microsoft. Hey, my name is Suzanne. I work on recruiting. I’m coordinating your interview trip. I said, alright! And then I show the email to my wife and she was like, what? It worked? Whoa! And then it actually was a great time. The environment here, from day one, since the interviews, the openness of everybody, of management, the possibilities and the desire of Microsoft to, yeah, let’s explore this area, this area. One big word here is diversity. Diversity of people, diversity of areas. It is so broad. And that’s super exciting. So, I was almost saying, whatever offer they make me, I’ll take it! Fortunately, they made a reasonable one, so it wasn’t too hard to make that decision.

Host: Well, two things I take away from what you’ve just told me. You keep using the word lucky and I think that has less to do with it than you are making it out to be. Um, because there are a lot of really smart people here who say, I was so lucky that they offered me this. It’s like, no, they’re lucky to have you, actually. But also, the idea that if you don’t ask, you are never going to know whether you could have or not. I think that’s a wonderful story of boldness and saying why not?

Rico Malvar: Yeah. And in fact, boldness is very characteristic of Microsoft Research. We’re not afraid. We have an idea, we just go and execute. And we’re fortunate, and I’m not going to say lucky, I’m going to say fortunate, that we’re in a company that sees that and gives us the resources to do so.

Host: Rico, I like to ask all my guests, as we come to the end of our conversation, to offer some parting thoughts to our listeners. I think what you just said is a fantastic parting thought. But maybe there’s more. So, what advice or wisdom would you pass on to what we might call the next generation of technical researchers? What’s important for them to know? What qualities should they be cultivating in their lives and work in order to be successful in this arena?

Rico Malvar: I would go back to boldness and diversity. Boldness, as you’ve already highlighted, Gretchen: if you have an idea, and it’s not just a rough idea, you know a thing or two about why it actually could work, go after it! Give it a try. Especially if you are young. Don’t worry if you fail at many things. I failed at many things in my life. But what matters is not the failures. You learn from the failures and you do it again. And the other one is diversity. Always think diversity in all the dimensions. All kinds of people, everywhere in the world. It doesn’t matter what gender, race, ethnicity, or upbringing people have, rich or poor, wherever they come from, everybody can have cool ideas. The person whom you least expect to invent something might be the one inventing. So, listen to everybody, because that diversity is great. And remember the diversity of users. Don’t assume that all users are the same. Go learn what users really think. If you are not sure whether Idea A or Idea B is better, go talk to them. Try them out, test, get their opinion, test things with them. So, push diversity on both sides: diversity in creation and diversity in who is going to use your technology. And don’t assume you know. In fact, Satya has been pushing the whole company towards that. Put us in a growth mindset, which basically means keep learning, right? Because if you do that, that diversity will expand and then we’ll be able to do more.

Host: Rico Malvar, I’m so glad that I finally got you on the podcast. It’s been delightful. Thanks for joining us today.

Rico Malvar: It has been a pleasure. Thanks for inviting me.

(music plays)

To learn more about Dr. Rico Malvar and how research for people with disabilities is enabling people of all abilities, visit


Podcast with Microsoft Research Cambridge’s Dr. Cecily Morrison: Empowering people with AI


Researcher Cecily Morrison from Microsoft Research Cambridge

Episode 60, January 23, 2019

You never know how an incident in your own life might inspire a breakthrough in science, but Dr. Cecily Morrison, a researcher in the Human Computer Interaction group at Microsoft Research Cambridge, can attest to how even unexpected events can cause us to see things through a different – more inclusive – lens and, ultimately, give rise to innovations in research that impact everyone.

On today’s podcast, Dr. Morrison gives us an overview of what she calls the “pillars” of inclusive design, shares how her research is positively impacting people with health issues and disabilities, and tells us how having a child born with blindness put her in touch with a community of people she would otherwise never have met, and on the path to developing Project Torino, an inclusive physical programming language for children with visual impairments.


Episode Transcript

Cecily Morrison: Working in the health and disability space has been a really interesting space to work with these technologies because you can see, on the one hand, that they can have a profound impact on the lives of the people that you’re working with. And when I say profound, I don’t mean, you know, they had a nicer day. I mean, they can have lives and careers that they couldn’t consider otherwise.

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: You never know how an incident in your own life might inspire a breakthrough in science, but Dr. Cecily Morrison, a researcher in the Human Computer Interaction group at Microsoft Research Cambridge, can attest to how even unexpected events can cause us to see things through a different – more inclusive – lens and, ultimately, give rise to innovations in research that impact everyone.

On today’s podcast, Dr. Morrison gives us an overview of what she calls the “pillars” of inclusive design, shares how her research is positively impacting people with health issues and disabilities, and tells us how having a child born with blindness put her in touch with a community of people she would otherwise never have met, and on the path to developing Project Torino, an inclusive physical programming language for children with visual impairments. That and much more on this episode of the Microsoft Research Podcast.

(music plays)

Host: Cecily Morrison, welcome to the podcast.

Cecily Morrison: Thank you.

Host: You’re a researcher under the big umbrella of Human Computer Interaction in the Cambridge, England, lab of Microsoft Research and you are working on technologies that enable human health and well-being in the broadest sense. So, tell us, in the broadest sense, about your research. What gets you up in the morning?

Cecily Morrison: I like technology that helps people live the lives that they want to live, whether that’s because they have a health issue or a disability, or they’re just trying to live better. I want to be part of making those technologies. We have quite an exciting group structure that we work in here. So, at the moment, we sit on a floor of multidisciplinary researchers that mix human computer interaction, design, engineering, software engineering, hardware engineering. We sort of sit together as a community, and then we work across three strands: the future of work, the future of the cloud, and empowering people with AI. And through those themes of work across our lab, we get to work with people in many different kinds of groups. I specifically work with people in the machine learning team, looking at how the kinds of machine learning opportunities that we have now can underpin experiences that really enable people to do things they couldn’t do before.

Host: I want to drill in on this idea of inclusive design for a second. It speaks to a mindset and assumptions that researchers make even as they approach working on a new technology. How would you define research that incorporates inclusion from the outset, and how might we change the paradigm so that inclusivity would be the default mode for everyone?

Cecily Morrison: So, inclusive design, as it’s been set out in the inclusive design handbook produced by Microsoft, has three important pillars. The first one is to recognize exclusion. It used to be that if you had a different physical makeup, if you were missing an arm or you couldn’t see, you were considered to have a disability. And the World Health Organization changed that definition some years back to say that, actually, disability is a mismatch between a person’s physical capabilities and the environment they’re in. So, if you’re a wheelchair user and there are no curb cuts, then you immediately feel disabled because it’s really hard for you to get around. You know what? If you’re a buggy user, you feel the same. Somehow, you have to get that massive buggy across the pavement. And thank goodness we have curb cuts, which were pioneered for people who were using wheelchairs.

Host: Right.

Cecily Morrison: I think, in that regard, we as technologists are people who can recognize and address that exclusion by creating technologies that ensure there isn’t a mismatch between the environments and technologies people are using and their particular physical makeup and needs. So, I start from that perspective: that we, as technology designers, have an important role in making the world a more inclusive place. Because it’s not about how people are born, or what happens to their bodies over their lives. It’s about the environments that we create, and technology is an important part of the environments that we create. So, the second part of inclusive design is really about saying that when we design things, we need to design for a set of people. And often, we implicitly do this by designing for ourselves. We just don’t recognize that we’re designing for ourselves. And if we don’t have very inclusive teams, that means we get the same ideas over and over again; they’re a little bit different, a little bit this way, a little bit that way, but they’re really the same idea. When we start to design for people who have a very different experience of the world, which people with disabilities do, we can start to pull ourselves into a different way of thinking and really start to generate ideas that we wouldn’t have considered before. So, I think people with disabilities can really inspire us to innovate in ways that we hadn’t expected. And the third thing is, then, to extend to many people. If we design for a particular group, people say, oh, well there aren’t very many of them, and, you know, where’s my technology? But actually, the exciting thing is that, by designing for a particular group who’s different, we get new ideas that we can potentially extend to many people.
So, if you think about designing for somebody with only one arm, that means, for example, using a computer, a phone, any technology with a single hand. You can think, well, there aren’t that many people who only have one arm. But then you start to think, well, how many people have a broken arm at some time in their lives? Well, that’s a much larger number. So that person has what we might think of as a temporary disability. And then what about those people who have what’s called a situational disability? In a particular situation, they only have access to one arm. I know this quite well, as the mother of a small baby. If you have to hold a baby and do something on your phone, you need to do it with one hand, I can guarantee you. So, inclusive design is a way of helping us really generate new ideas by thinking about and working with people with disabilities and then extending them to help all of us. So, we create more innovative technologies that include more people in our world and help us break down those barriers that create disabilities.

Host: Let’s talk about this idea of human health and well-being being central to the focus of your work. Even Christopher Bishop at your lab has said healthcare is a fruitful field for AI and machine learning and technology research in general, but it’s challenging because that particular area is woefully behind other industries simply in embracing current technologies, let alone emerging ones. So how do you see that landscape given the work you’re doing, and what can we do about it?

Cecily Morrison: Well, I remember when I arrived at Microsoft Research, I was really excited to come here because I had just spent four years working in our National Health Service in the UK, really trying to help them put into practice some of the technologies that already existed. And man, was it hard work! It was incredibly important work, but it was really, really hard work. And I don’t think it’s because people are afraid of technologies or they don’t want to use technologies, but you’re dealing with an incredibly complex organization, and you can’t get it wrong. You can’t get it wrong, because the impact you could have on someone’s life is beyond what I think we would ethically allow ourselves. So, I was excited to come to Microsoft Research, and I said you know I really want to work on technologies that impact people, but at the same time, we need a little bit more space to be able to experiment and think about new ideas without being so constrained by having to deliver a service every day. One challenge with healthcare is the easiest way to think about what a technology might do is to imagine what people do now and think, well how would a technology do that? But actually, that’s not really where we see innovation. We see innovation usually coming in at making something different, making something new, or making something easier, not doing something the same.

(music plays)

Host: Let’s talk about some of your specific research. I want to begin with a really cool project called Assess MS. Tell us how this came about. What was the specific problem you were addressing, and how does this illustrate the goal of collaboration between humans and machines?

Cecily Morrison: Right, so Assess MS was a project to track disease progression in multiple sclerosis using computer vision technology. It was a collaboration between Microsoft Research and Novartis Pharmaceutical, with a branch based in Basel, Switzerland. And it really came about as healthcare was moving into the technology space and technology was moving into the healthcare space, with these two large companies thinking about, what could we do together? How can we bring our expertise together? We were approached by our partner, Novartis, and they said, we would like to have a “neurologist in a box.” And it took a lot of time working with them, negotiating with them, doing design work with them to understand that a neurologist in a box is not really what technology is good at, but we could do something even more powerful. And what that something was, was tracking disease progression in multiple sclerosis. Now, patients with multiple sclerosis might have very, very different paths of that particular disease. It could progress very quickly, and within two years they lose their lives. Or they could have it for sixty years and really have minor symptoms, such as very numb feet or some cognitive difficulties. These are very, very different experiences, and it can be very difficult for patients to know when or how or which treatments to start if they don’t have any sense of how their disease might progress. And one step in helping patients and clinicians make those decisions is being able to very consistently track when the disease is progressing. Now, that was really difficult when we started, because they were using a range of paper and pencil tools where a neurologist would look at a patient, ask them to do a movement such as extending their arm out to the side and then touching their nose, and then check for a tremor in the hand. Now, in one year with one neurologist, they might say, oh, well that’s a tremor of one.
And the next year or the next neurologist, they might say, oh, that’s a tremor of two. Then there’s the question: has the patient changed, or is it just a different time and a different neurologist? Because there are no absolute criteria for what is a one and what is a two. And if you’re lucky enough to have the same doctor, it might be slightly better, but again, there’s been a year between the two experiences. Machines are not very good at helping a patient make decisions about their care, but they are very good at doing things consistently. So, tracking disease progression was something that we said, well, we can do very consistently with a machine. And we can then supply those results to the patient and the neurologist to really think through what are the best options for that patient that particular year.

Host: So, how is the machine learning technology playing into this? What specific technical aspects of Assess MS have you seen develop over the course of this project?

Cecily Morrison: There is quite a range of things, actually. In the first instance, we were using machine learning to do this categorization. At the moment, neurological symptoms in MS are already categorized with a particular tool called the Expanded Disability Status Scale, the EDSS. And we were attempting to replicate those measures, as they were measures that the clinical field was already comfortable with. In that regard, we were using a set of training data of 500-plus patients that we had collected and labeled, and using that to train algorithms and test them out; really, we were more testing out different kinds of algorithms that might be able to discriminate between those patient levels. But actually, what we did on the human-computer interaction side of things was make a lot of that machine learning work. So, the first thing that we needed to do was design a device that helped people capture the data in a form that was standardized enough for the machine learning to work well. The first thing that we saw when we ran a few small pilots was that the cameras were tilted, people were out of the frame, you couldn’t see half their legs because they had sparkly pants on. All kinds of things that you just don’t imagine until you go into a real-world context, and that we had to design for. And what’s, I think for me, quite interesting is that people are really willing to work with a machine so that the machine can see well, as long as they understand how the machine is seeing. And it’s not seeing like a person. So, we built a physical interface, as in a physical prototype, which allowed people to position and see and adjust the way the vision system was seeing, so it could capture really good quality data for machine learning.

Host: Right.

Cecily Morrison: That was step one. And then step two was, oh, we need labeled data to train against, and we discovered, very quickly, that if we’re trying to be more consistent than clinicians but we use the current way clinicians label data, we’re going to get the same level of consistency as clinicians. So, we won’t really have achieved our goal. So, we had to come up with a new way to get more precise and consistent labels from clinicians. And again, we did something pretty interesting there. Partially, we used interaction design features: we went with the idea that clinicians, and people generally, are much better at giving relative labels. So, this person is better than that person, rather than saying, this person is a one and that person’s a two, which we call a discrete label. So, what we did is a pairwise comparison. We said, okay, tell us which person is more disabled. This worked really well in terms of consistency, although we nearly had all of our clinicians quit because they figured, you know, this is incredibly tedious work. And again, that’s where machine learning and good design can come in. Because we said, well actually, we have this great algorithm called TrueSkill. This is an algorithm that was originally used for matching players in Xbox games. But actually, what it does is give us a probabilistic distribution of how likely someone is better than someone else. So, it takes a problem, pairwise comparison, which is an n-squared problem, and makes it a linear problem. To interpret that for people who don’t really work in this space, that basically means if you have 100 films to label, that takes you 100 times however long each label takes, which in this case is about a second, rather than taking 100 times 100.

Host: Right.

Cecily Morrison: Which is a much longer time. By using these thoughtful approaches and other kinds of machine learning, we could actually make that process much faster. So, we managed to show that we could get much more consistent and finer-grained labels much faster than the original approach. So, we went to build the big system, but in the end, we actually spent a lot of our time on these challenges that just make computer vision systems work in the real world.
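Cecily's point about pairwise labels can be made concrete in two parts. First, the counting: showing a rater every distinct pair of n clips takes n(n-1)/2 comparisons, which for n = 100 is 4,950, the "100 times 100" order of growth she mentions, whereas a budget of a few comparisons per clip grows linearly. Second, the rating update: below is a minimal sketch of the standard two-player, no-draw TrueSkill update with its usual default parameters (mu = 25, sigma = 25/3, beta = sigma/2), omitting the dynamics and draw terms and treating the clip judged more disabled as the winner. How Assess MS chose which pairs to show, and how many comparisons it used, are not described here, so this illustrates the algorithm rather than the project's pipeline.

```python
import math
from dataclasses import dataclass

MU0 = 25.0          # TrueSkill's conventional prior mean
SIGMA0 = MU0 / 3    # prior standard deviation
BETA = SIGMA0 / 2   # noise in a single comparison


@dataclass
class Rating:
    mu: float = MU0
    sigma: float = SIGMA0


def _pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def _cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def update(winner, loser):
    """Two-player, no-draw TrueSkill update (no dynamics term).

    Here the 'winner' is the clip a clinician judged more disabled.
    """
    c2 = 2 * BETA**2 + winner.sigma**2 + loser.sigma**2
    c = math.sqrt(c2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)   # how far the means move: large for an upset
    w = v * (v + t)         # how much the uncertainty shrinks, in (0, 1)
    new_winner = Rating(winner.mu + winner.sigma**2 / c * v,
                        winner.sigma * math.sqrt(1 - winner.sigma**2 / c2 * w))
    new_loser = Rating(loser.mu - loser.sigma**2 / c * v,
                       loser.sigma * math.sqrt(1 - loser.sigma**2 / c2 * w))
    return new_winner, new_loser


def all_pairs(n):
    """Comparisons needed to show a rater every distinct pair of n clips."""
    return n * (n - 1) // 2
```

After each judgment, clips can be ranked by `mu`, while `sigma` indicates how settled each clip's position is. Choosing the next pair among clips with close means and high `sigma` is one way to spend a roughly linear comparison budget where it is most informative.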

Host: Is this working in the real world, or is it still very much at the prototype research stage?

Cecily Morrison: Well, I think it was a very large project, a lot of data was collected, and the data sets are still there. But what we found was that the machine learning isn’t really up to discriminating the fine level of detail that we need yet. But we have a data set, because we expect that in the next couple of years, it will be. So, it’s on pause.

Host: Let’s talk about one of the most exciting projects you’re working on and it’s launching as we speak, called Project Torino. And you said this was sort of a serendipitous, if not accidental, project for you. Tell us all about Project Torino. This is so cool.

Cecily Morrison: So, Project Torino is a physical programming language for teaching basic programming concepts and computational learning skills to children ages seven to eleven regardless of their level of vision, whether they’re blind, low vision, partially sighted or sighted. It’s a tool that children can use. And it was, indeed, a serendipitous project. We were exploring technology that blind and low-vision children used, because we have a blind child. And at the time, he was quite young. He was about 18 months. And we really wondered how many blind and low-vision people were involved in the design of this technology. And we thought, what would it look like if these kids, these blind and low-vision kids that were in our community that we now knew through our son – what would it look like if they were designing the technologies of tomorrow, their own technologies, other technologies? So, we decided to run an outreach workshop teaching the children in our community how to do a design process and how to come up with their own ideas. So, we brought them together. We had a number of different design process activities that we did. And, you know, they came up with amazing things. We gave them a base technology based on Arduino that turns light into sound. And we just walked them through a process to create something new with that. And they came up with incredible things that you’d never think of. So, one young girl came up with an idea of this hat – very fashionable hat, I have to say – which adjusted the light so that she could always see, because she had a condition where, if the light was perfect, she could see almost perfectly, and if the light was just a little bit wrong, she was almost totally blind. So, it was quite difficult for her in school.
We had another child who created this, um, you might call it a robot, which was running around his 100-room castle (which, I learned in the end, was imaginary) to find out which rooms had windows and which rooms didn’t, because, at the age of seven, he had told me very confidently that his mom had told him that sighted people like windows, and he should put them in the rooms with windows. So, we were really excited about how engaged the children were, and the ideas they came up with were great. But it was an outreach workshop, so when we were finished with the day, we thought we were finished. And that week, a number of the parents phoned me back or emailed me and said, great, you know, my child has come up with several new ideas. They really want to build them, so, how can they code? And I thought, gosh, I have no idea! Most of the, you know, languages that we would use with children of that age group, between seven and eleven, are not very accessible. They’re block-based languages. So, I asked around, did anybody know? We tried a few things out. We tried putting assistive technologies on existing languages, and we discovered that this was a big failure. The first time I made a child cry, I was a little bit sad, a little bit depressed about that. So that was definitely not the right direction, but I was having lunch one day with a colleague of mine who works in my group as a hardware research engineer. And I said, you know, is there anything out there that we could hack together, just to enable these kids to learn to code, give them the basics before they’re ready to code with a text-based language with an assistive technology when they’re a bit older? And the answer was, well, not really, but actually, I think we can build that. I think we’ve got a bunch of the base tech there already. So, we got a bunch of interns together and off we went.

Host: And… where is it now?

Cecily Morrison: It’s been a very exciting journey from that first prototype, which was really a good prototype, tested with ten children, to a second and a third prototype which was then manufactured to test with a hundred children. And after an incredibly successful beta trial, we are partnering with American Printing House for the Blind who will take this technology to market as a product.

Host: Wow. How does it work?

Cecily Morrison: How does it work? It’s a set of physical pods that you connect together with wires. And each of these pods is a statement in your program, and you can connect a number of pods to create a multi-statement program which creates music, stories or poetry. And in the process, with different types of pods, we take children through the different types of control flows that you can have in a programming language.

Host: And so, this is not just, you know, the basics of programming languages. It’s computational thinking and, sort of, preparing them, as you say, for what they might want to do when they get older?

Cecily Morrison: Yeah, so I think whether children become, you know, software engineers or computer scientists in some way or not, a lot of the skills that they can learn through coding and through the computational learning aspect of what we were doing, are key to many, many careers. So those are things like breaking a problem down. You’re stuck; you can’t solve it. How are you going to break it down to a problem that you can solve? Or, you’ve got a bug; it’s not working. How are you going to figure out where it is? How are you going to fix it? Perhaps my favorite one, and perhaps this is just a beautiful memory I had of a child with one of those a-ha moments, is, how do you make something more efficient? A physical programming language can’t have very many pods. And I think, in our current version, we have about twenty-one pods. So, you have to use those really efficiently. That means, you have to use loops if you want to do things again, because you don’t have enough pods to do it out in a serial fashion. And I remember a child trying to create the program with Jingle Bells. It was just before Christmas. We were all ready to go off on holiday, and she was determined to solve this before any of us could go home. She’d mapped it all out, and she said, “But I don’t have enough pods for the last two words!” I said, well, you know, we have solved this, so it must be solvable. So, she’s sitting there and thinking, and her mom looks at her and goes, “Jingle Bells, Jingle Bells…” And all of a sudden, she goes, “Oh, I get it! I get it!” And she reaches for the loop and puts it in a loop. But I think those are the kinds of moments, both as a researcher, which are just beautiful to see when your technologies really help someone move forward. But also, the kind of thing that we’re trying to get children to get at, which is to really understand that they can do things in multiple ways.
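The loop trick in the Jingle Bells story can be sketched in a few lines. This is a hypothetical model, not Torino's actual pod set or firmware: `Play` and `Loop` stand in for pods, a loop pod is assumed to cost one pod plus the pods attached to it, and `POD_BUDGET = 7` is an illustrative limit chosen for this short phrase (the real kit, per the interview, has about twenty-one pods).

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Play:
    """Hypothetical pod that plays one word or sound."""
    sound: str


@dataclass
class Loop:
    """Hypothetical loop pod that repeats the pods attached to it."""
    times: int
    body: List["Pod"]


Pod = Union[Play, Loop]
POD_BUDGET = 7  # illustrative limit, not the real kit's


def pod_count(program: List[Pod]) -> int:
    """Physical pods used: a loop costs one pod plus everything in its body."""
    return sum(1 + pod_count(p.body) if isinstance(p, Loop) else 1 for p in program)


def run(program: List[Pod]) -> List[str]:
    """Execute the program and return the sequence of sounds it plays."""
    out: List[str] = []
    for pod in program:
        if isinstance(pod, Loop):
            for _ in range(pod.times):
                out.extend(run(pod.body))
        else:
            out.append(pod.sound)
    return out


phrase = "jingle bells jingle bells jingle all the way".split()
serial = [Play(w) for w in phrase]
looped = [Loop(2, [Play("jingle"), Play("bells")])] + [Play(w) for w in phrase[4:]]

assert run(looped) == run(serial)            # same song either way
print(pod_count(serial), pod_count(looped))  # 8 7: only the looped version fits
```

Both programs play the same phrase, but the serial one needs a pod per word while the looped one reuses "jingle bells", which is the kind of efficiency a fixed pod set forces children to discover.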

Host: Who would ever have thought that Jingle Bells would give someone an a-ha moment in technology research?!

(music plays)

Host: So, let’s talk a bit about some rather cutting-edge, ongoing inclusive design research you’re involved in, where the goal is to create a deeply personal visual agent. What can you tell us about the direction of this research and what it might bode for the future?

Cecily Morrison: I think, across all of the major industrial research labs and industrial partners in technology, there’s a lot of focus on agents, and agents as being a way to augment your world with useful information in the moment. We’ve been working on visual agents, so visual agents are ones that incorporate computer vision. And I think one of the interesting challenges that comes from working in this space is that there are many, many things that we can perceive in the world. You know, our computer vision is getting better by the month. Not even by the year, by the month. From when we started to now, the things that we can do are dramatically different. But that’s kind of a problem from a human experience point of view, because, what’s my agent going to tell me, now that I can recognize everything and recognize relationships between things, and I can recognize people? Now we have this relevance problem: what am I going to surface and actually tell the person that is relevant to them in their particular context? So, I think one of the exciting things that we’re thinking about is how do we make things personalized to people without either using a lot of their data or asking them to do things that require a deeper understanding of computer science? So, that’s a real challenge of how we build new kinds of algorithms and new kinds of interfaces to work hand-in-hand with agents to get the experience that people want without having to put too much effort in.

Host: So, I want to talk about a topic I’ve discussed with several guests on the podcast. It’s this trend towards cross- or multi-disciplinary research, and I know that’s important to you. Tell us how you view this trend – even the need – to work across disciplines in the research you’re doing today.

Cecily Morrison: Well, I can’t think of a project I’ve ever worked on in technology that hasn’t required working across disciplines. I think if you really want to impact people, that requires people with lots of different kinds of expertise. When I first started doing research as a PhD, I started right away working with clinicians, with social scientists, with computer scientists. That was a small team at the time. The Torino Project that I’ve just discussed, we were quite a large team. We had hardware engineers, software engineers, UX designers, user researchers, social scientists involved. Industrial designers as well. Everyone needed to bring together their particular perspective to enable that system to be built. And I feel, in some ways, incredibly privileged to work at Microsoft Research where I sit on a floor with all those people. So, it’s just a lunch conversation away to get the expertise you need to really think about, how can I get this aspect of what I’m trying to solve?

Host: Hmm. You know, there are some interesting, and even serious, challenges that arise in the area of safety and privacy when we talk about technologies that impact human health. You’ve alluded to that earlier. So, as we extend our reach, we also extend our risk. Is there anything that keeps you up at night about what you’re doing, and how are you addressing those challenges?

Cecily Morrison: No doubt any technology that uses computer vision sets many people into a worried expression. What are you capturing? What are you doing with it? So, I’ve certainly thought quite a lot, and quite deeply, about what we do and why we do it. And I think working in the health and disability space has been a really interesting space to work with these technologies because you can see, on the one hand, that they can have a profound impact on the lives of the people that you’re working with. And when I say profound, I don’t mean, you know, they had a nicer day. I mean, they can have lives and careers that they couldn’t consider otherwise. That said, we are, no doubt, with vision technology, capturing other people. But for me, that’s one of the most exciting design spaces that we can work in. We can start to think about, how do we build systems in which users and bystanders have enough feedback that they can make choices in the use of that system? So, it used to be that users of the systems were the ones that controlled the system. But I think we’re moving into an era where we allow people to participate in systems even when they’re not the direct user of those systems. And I think Assess MS was a good example, because there we were also capturing clinical data of people, and we had to be very careful about balancing the need to, for example, look at that data to figure out where our algorithms were going wrong, and respecting the privacy of the individuals, as there’s no way to anonymize the data. So, I can assure you, we thought very hard about how we do that within our team. But it was also a very interesting discussion with some of our colleagues who are working in cloud computing to say, you know, there’s a real open challenge here which hopefully won’t be open too much longer, about how we deal with clinical data, how we allow machine learning algorithms to work on data so not everyone can see all of the same data.
So, it’s certainly top of mind in how we do that ethically and respectfully, and of course, legally, now that we have many legal structures in place.

Host: Cecily, tell us a bit about yourself. Your undergrad is in anthropology, and then you got a diploma and a PhD in computer science. How and why did that happen, and how did you end up working in Microsoft Research?

Cecily Morrison: Well, I suppose life never takes the direction you quite expect. It certainly hasn’t for me. I did a lot of maths and science as a high school student. But I was getting a little bit frustrated, because I really liked understanding people. And what I really liked about anthropology was it was a very systematic way of looking at human behavior and how different behaviors could adjust the system in different ways. And that, to me, was a little bit like some of the maths that I was doing, but just with people. Sort of solving the same kind of problems but using people and systems rather than equations. So, I found that very interesting. I went off to do a Fulbright Scholarship in Hungary. I was studying the role of traditional music, in particular bagpipe music, in the changes of political regime in Hungary. As part of that, I spent a couple of years there, and I found some really interesting things with children. I started teaching kids. I started working with them on robotics, just because, well, it was fun. And having done that, I was then seeing that, actually, there could be a lot of better ways to build technology that supports interaction between children in the classroom. So off I set myself to find a way to build better technologies. I figured I needed to know something about computing first. So, I thought I’d do a diploma in computer science. But that, again, distracted me when I was given this opportunity to work in the healthcare space and I realized that really what I wanted to do was create technology that enabled people in ways they wanted to be enabled, whether that be education or health or disability. So, I ended up doing a PhD in computing and then, very quickly, moving into working in technology in the NHS. And soon after that I came to Microsoft to work on the Assess MS project.

Host: So, you have two boys, currently 11 months and 6 years. Do you feel like kids, in general, and your specific boys are informing your work, and how has that impacted things, as you see them, from a research perspective?

Cecily Morrison: Again, one of the serendipities of life, you can get frustrated with them, or you can take them and run with them. So, I have an older child who was born just before I started at Microsoft, who is blind, and I have another 11-month-old baby who… we call him a classic. We have the new age and the classic version. And it very much has impacted my work. Seeing the world in a different perspective, taking part in communities that I wouldn’t otherwise have seen or taken part in have definitely driven what we’ve done. So, Torino is certainly an example of that. But a lot of the work I’ve done around inclusive design is driven very much by that. And I think, interestingly enough, in the agent space, we have done some work with people who are blind and low vision because, at the time we started working with agents, typical people were not heavy users of agents. In fact, most people thought they were toys. Whereas for people who are blind and low vision, they were early adopters and heavy users of agent technologies and really could work with us to help push the boundaries of what these technologies can do. If you’re not using technology regularly, you can’t really imagine what the next steps are. So, it’s a great example of inclusive design where we can work with this cohort of young, very able, blind people to help us think about what agents of the future are going to look like for all of us.

Host: So, while we’re on the topic of you, you’re a successful young woman doing high-tech research. What was your path to getting interested? Was it just natural, or did you have role models or inspirations? Who were your influences?

Cecily Morrison: (laughs) Well, I think, as maybe some of the stories I’ve said so far, you could see serendipity has played a substantial role in my life, and I guess I’m grateful to my parents for being very proactive in helping me accept serendipity and running with it wherever it has taken me. I think I’ve been very lucky to have a boss and mentor, Abby Sellen, whom people may know from the HCI community, who’s been amazingly adept at building great technology while navigating the needs we all have as people in our own personal lives. I’m sure there have been many other people. I take inspiration wherever it’s offered.

Host: As we close, Cecily, I’d like you to share some personal advice or words of wisdom. What you’re doing is really inspirational and really interesting. How could academically minded people in any discipline get involved in building technologies that matter to people, like you?

Cecily Morrison: I think knowing about the world helps you build technologies that matter. And to take an example from the blind space, I’ve seen a lot of technology out there where people build technology because they want to do good, but they don’t know how to do good, because they don’t know the people they’re designing and building for. We have lots of techniques for getting to know people. But I think in some ways, the best is to just go out and have a life outside of your academic world that you can draw inspiration from. Go find people. Go talk to people. Go volunteer with people. To me, if we want to build technologies that matter to people, we need to spend a good part of our life with people understanding what matters to them, and that’s something that drives me as a person. And I think it then comes into the way I think about technology. Another thing to say is, be open to serendipity. Be open to the things that cross your path. And I know, as academic researchers, sometimes we feel that we need to define ourselves. And perhaps that’s important, although it’s never been the way that I’ve worked. But I think there’s also something about, you can be incredibly genuine if you go with things that are really meaningful to you. And being genuine in what you do gives you insights that nobody else will have. I never expected to have a blind child, but I think it’s been incredibly impactful in the way I approach my life and the way I approach the technology I build. And I don’t think I would have innovated in the same way if I had not had that sort of deep experience of living life in a different way.

Host: Cecily Morrison, thanks for joining us today.

Cecily Morrison: Thanks very much.

(music plays)

To learn more about Dr. Cecily Morrison and how researchers are using innovative approaches to empower people to do things they couldn’t do before, visit


Scientists discover how bacteria use noise to survive stress

January 22, 2019 | By Microsoft blog editor

Noisy expression of stress response in microcolony of E. coli.

Mutations in the genome of an organism give rise to variations in its form and function—its phenotype. However, phenotypic variations can also arise in other ways. The random collisions of molecules constituting an organism—including its DNA and the proteins that transcribe the DNA to RNA—result in noisy gene expression that can lead to variations in behavior even in the absence of mutations. In a research paper published in Nature Communications, researchers at Microsoft Research and the University of Cambridge have discovered how bacteria can couple noisy gene expression with noisy growth to survive rapidly changing environments.

“We have taken advantage of advances in microfluidics technology, time-lapse microscopy, and the availability of libraries of genetically modified bacteria that have happened in the past decade or so to provide unprecedented detail of how single cells survive stress,” says Microsoft PhD Scholar Om Patange. “We hope this will help fellow researchers see that studies of bacteria at the single-cell level can reveal important aspects of how these organisms live and contend with their environment.”

Cells stochastically turn on their stress response and slow down growth to survive future stressful times. A montage of E. coli grown in a microfluidics device illustrates this phenomenon.

Using a microfluidic device, Patange—together with colleagues and cosupervisors Andrew Phillips, head of Microsoft Research’s Biological Computation group, and James Locke, research group leader at Cambridge’s Sainsbury Laboratory—observed single Escherichia coli cells grow and divide over many generations. They found that a key regulator of stress response called RpoS pulsed on and off. When these happily growing cells were exposed to a sudden chemical stress, the few cells ready for the stress survived. This is a striking example of a microbial population partitioning into two populations despite being of the same genetic makeup. The researchers further discovered that the surviving population was paying a cost to survive: They grew slower than their neighbors.

To uncover the mechanism causing the cells to grow slowly and turn on their stress response, the researchers developed a stochastic simulation of biological reactions inside single cells. They found that a simple mutual inhibitory coupling of noisy stress response and noisy growth caused the pulses observed and also captured more subtle observations.
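The idea of mutually inhibiting noisy stress response and noisy growth can be explored with a textbook stochastic simulation. The sketch below uses Gillespie's direct method on a toy two-species model; it is an illustration under stated assumptions, not the authors' model or code. `s` stands in for RpoS-like stress-regulator molecules, `g` is a hypothetical proxy for growth machinery, and every rate constant (`k_s`, `k_g`, `decay`, `K`, `n`) is invented for illustration.

```python
import random
from typing import List, Tuple


def repression(x: float, k: float, n: float) -> float:
    """Hill-type repression: close to 1 when x << k, close to 0 when x >> k."""
    return 1.0 / (1.0 + (x / k) ** n)


def simulate(t_end: float = 500.0, seed: int = 1, k_s: float = 2.0,
             k_g: float = 2.0, decay: float = 0.05, K: float = 20.0,
             n: float = 4.0) -> List[Tuple[float, int, int]]:
    """Gillespie direct-method trajectory of a toy stress/growth toggle.

    Returns (time, s, g) after every reaction event.
    """
    rng = random.Random(seed)
    t, s, g = 0.0, 0, 0
    trajectory = [(t, s, g)]
    while True:
        # (propensity, change in s, change in g). The production reactions come
        # last and are always positive, so if rounding makes the selection loop
        # run to the end, it lands on a production step, never on a
        # zero-propensity degradation step.
        reactions = [
            (decay * s, -1, 0),                  # stress regulator degraded/diluted
            (decay * g, 0, -1),                  # growth proxy degraded/diluted
            (k_s * repression(g, K, n), +1, 0),  # stress regulator made, repressed by growth
            (k_g * repression(s, K, n), 0, +1),  # growth proxy made, repressed by stress
        ]
        total = sum(rate for rate, _, _ in reactions)
        t += rng.expovariate(total)  # exponential waiting time to the next event
        if t > t_end:
            break
        threshold = rng.random() * total
        cumulative = 0.0
        for rate, ds, dg in reactions:  # choose reaction i with probability rate_i/total
            cumulative += rate
            if threshold < cumulative:
                break
        s += ds
        g += dg
        trajectory.append((t, s, g))
    return trajectory


traj = simulate()
print(len(traj), traj[-1])
```

Each event fires one reaction chosen in proportion to its propensity, so randomness in molecule counts, not any external input, drives the fluctuations. The constants here have not been fitted to any data; whether a given parameter set produces pulses or switching between a high-stress, low-growth state and its reverse has to be checked by plotting `s` and `g` over time.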

This study, for which single-cell datasets are available on GitLab, has both pure and applied implications. The stress response phenomenon may be related to persistence, a strategy used by bacteria to evade antibiotics without mutations. Understanding the connection between persistence and stress response may lead to more nuanced approaches to antibiotic treatments. The idea that bacteria have evolved a population-level phenotype governed by single-cell actions is also intriguing. Understanding the benefit gained by the population at the expense of single bacteria may yield insights into the evolution of cooperative strategies.

“The bacteria might teach us about cooperative strategies we haven’t already come up with,” says Patange. “We might also learn how to use and defend against bacteria better if we can see the world from their perspective.”


WIRED: Undersea servers stay cool while processing oceans of data

Most electronics suffer a debilitating aquaphobia. At the littlest spillage—heaven forbid Dorothy’s bucket—of water, our wicked widgets shriek and melt.

Microsoft, it would seem, missed the memo. Last June, the company installed a smallish data center on a patch of seabed just off the coast of Scotland’s Orkney Islands; around it, approximately 933,333 bucketfuls of brine circulate every hour. As David Wolpert, who studies the thermodynamics of computing systems, wrote in a recent blog post for Scientific American, “Many people have impugned the rationality.”


The idea to submerge 864 servers in saltwater was, in fact, quite rational, the result of a five-year research project led by future-proofing engineers. Errant liquid might fritz your phone, but the slyer, far deadlier killer of technology is the opposing elemental force, fire. Nearly every system failure in the history of computers has been caused by overheating. As diodes and transistors work harder and get hotter, their susceptibility to degradation intensifies exponentially. Localized, it’s the warm iPhone on your cheek or a wheezing laptop giving you upper-leg sweats. At scale, it’s Outlook rendered inoperable by remote server meltdown for 16 excruciating hours—which happened in 2013.

Servers underlie the networked world, constantly refreshing the cloud with droplets of data, and they’re as valuable as they are vulnerable. Housed by the hundreds, and often the thousands, in millions of data centers across the United States, they cost billions every year to build and protect. The most significant number, however, might be a single-digit one: Running these machines, and therefore cooling them, blows through an estimated 5 percent of total energy use in the country. Without that power, the cloud burns up and you can’t even fact-check these stats on Google (an operation that costs some server, somewhere, a kilojoule of energy).


Savings of even a few degrees Celsius can significantly extend the lifespan of electronic components; Microsoft reports that, on the ocean floor 117 feet down, its racks stay 10 degrees cooler than their land-based counterparts. Half a year after deployment, “the equipment is happy,” says Ben Cutler, the project’s manager. (The only exceptions are some of the facility’s outward-facing cameras, lately blinded by algal muck.)

Another Microsoft employee refers to the effort as “kind of a far-out idea.” But the truth is, most hyperscalers investing in superpowered cloud server farms, from Amazon to Alibaba, see in nature a reliable defense against ever more sophisticated, heat-spewing circuits. Google’s first data center, built in 2006, sits on the temperate banks of Oregon’s Columbia River. In 2013, Facebook opened a warehouse in northern Sweden, where winters average –20 degrees Celsius. The data company Green Mountain buried its massive DC1-Stavanger center inside a Norwegian mountain; pristine, near-freezing water from a fjord, guided by gravity, flows through the cooling system. What Tim Cook has been calling the “data-industrial complex” will rely, if it’s to sustainably expand to the farthest reaches, on a nonindustrial means of survival.


Underwater centers may represent the next phase, a reverse evolution from land to sea. It’s never been hard, after all, to waterproof large equipment—think of submarines, which get more watertight as they dive deeper and pressure increases. That’s really all Microsoft is doing, swapping out the payloads of people for packets of data and hooking up the trucklong pod to umbilical wiring.

Nonetheless, Cutler says, the concept “catches people’s imagination.” He receives enthusiastic emails about his sunken center all the time, including one from a man who builds residential swimming pools. “He was like, you guys could provide the heating for the pools I install!” Cutler says. When pressed on the feasibility of the business model, Cutler adds: “We have not studied this.”


Others have. IBM maintains a data center outside of Zurich that really does heat a public swimming pool in town, and the Dutch startup Nerdalize will erect a mini green data center in your home with promises of a warm shower and toasty living room. Hyperlocal servers, part of a move toward so-called edge computing, not only provide recyclable energy but also bring the network closer to you, making your connection speeds faster. Microsoft envisions sea-based facilities like the one in Scotland serving population-dense coastal cities all over the world.

“I’m not a philosopher, I’m an engineer,” Cutler says, declining to offer any quasipoetic contemplations on the imminent fusion of nature and machine. Still, he does note the weather on the morning his team hauled the servers out to sea. It was foggy, after a week of clear skies and bright sun—as though the literal cloud, reifying the digital, were peering into the shimmering, unknown depths.

Jason Kehe (@jkehe) wrote about drone swarms in issue 26.08.

This article appears in the January issue.



Podcast: Soundscaping the world with Amos Miller

Product Strategist Amos Miller

Episode 54, December 12, 2018

Amos Miller is a product strategist on the Microsoft Research NeXT Enable team, and he’s played a pivotal role in bringing some of MSR’s most innovative research to users with disabilities. He also happens to be blind, so he can appreciate, perhaps in ways others can’t, the value of the technologies he works on, like Soundscape, an app which enhances mobility independence through audio and sound.

On today’s podcast, Amos Miller answers burning questions like how do you make a microwave accessible, what’s the cocktail party effect, and how do you hear a landmark? He also talks about how researchers are exploring the untapped potential of 3D audio in virtual and augmented reality applications, and explains how, in the end, his work is not so much about making technology more accessible, but using technology to make life more accessible.


Episode Transcript

Amos Miller: Until you are out there in the wind, in the rain, with the people, experiencing, or at least trying to get a sense for the kind of experience they’re going through, you’ll never understand the context in which your technology is going to be used. It’s not something you can imagine, or glean from secondary data, or even from video or anything. Until you are there, seeing how they grapple with issues that they are dealing with, it’s almost impossible to really understand that context.

(music plays)

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: Amos Miller is a product strategist on the Microsoft Research NeXT Enable team, and he’s played a pivotal role in bringing some of MSR’s most innovative research to users with disabilities. He also happens to be blind, so he can appreciate, perhaps in ways others can’t, the value of the technologies he works on, like Soundscape, an app which enhances mobility independence through audio and sound.

On today’s podcast, Amos Miller answers burning questions like how do you make a microwave accessible, what’s the cocktail party effect, and how do you hear a landmark? He also talks about how researchers are exploring the untapped potential of 3D audio in virtual and augmented reality applications, and explains how, in the end, his work is not so much about making technology more accessible, but using technology to make life more accessible. That and much more on this episode of the Microsoft Research Podcast.

Host: Amos Miller, welcome to the podcast.

Amos Miller: Thank you. It’s great to be here.

Host: You are unique in the Microsoft Research ecosystem. Your work is mission-driven. Your personal life strongly informs your professional life and, we’ll get more specific in a bit. But for starters, in broad strokes, tell us what gets you up in the morning. Why do you do what you do?

Amos Miller: I’ve always been passionate about technology from a very young age. But, really, in the way that it impacts people’s lives. And it’s not a mission that I necessarily knew about when I went through my career and experiences with technology. But when I look back, I see that those are the areas where I could see that a person feels differently about themselves or about the environment as a result of their interaction with that technology. That’s where I thought okay, that is having meaning to this person. And I have this huge, wonderful opportunity to do what I do in Microsoft Research to actually have turned that passion into my day job, which is very… I feel extremely fortunate with that. And I sometimes have to pinch myself to see that it’s not a dream.

Host: Well, tell us a little bit about your background and how that plays into what you are doing here.

Amos Miller: I’m very much a person that grew up in the technology world. I also moved to a number of countries over my career, and my life. I grew up in Israel. I spent many years in the UK, in London. I spent a few other years in Asia, in Singapore, and now I’m here, so all of these aspects of my life have been very important to me. I also happen to be blind. I suffer from a genetic eye condition called retinitis pigmentosa. It was diagnosed when I was five and I gradually lost my sight. I started university with good enough sight to manage and finished university with a service dog and any kind of technology I could find to help me read the whiteboard, to help me read the text on the computer. And I’d say by the age of 30, I totally stopped using my sight. And that’s when I really started living life as a fully blind person.

Host: Let’s talk about your job for a second. You are a product strategist at Microsoft Research, so how would you describe what you do?

Amos Miller: So, I work in a part of the organization at Microsoft Research that looks at really transferring technology ideas into impact. Into a way that they impact business, impact people. A good idea will only have an impact when it’s applied in the right way, in the right environment, so that the social, the business, the technological context in which it operates is going to make it thrive. Otherwise it doesn’t matter how good it is, it’s not going to have an impact.

Host: Right. So, let’s circle over to this previous role you had, which was in Microsoft’s Digital Advisory program. I bring it up inasmuch as it speaks to how often our previous work can inform our current work, and you referred to that time as your “customer-facing life.” How does it inform your role as a strategist today?

Amos Miller: What always energizes me is when I see and observe the meaning and the impact that technology can really have for people. And I don’t say it lightly. Until you are out there in the wind, in the rain, with the people, experiencing, or at least trying to get a sense for the kind of experience they are going through, you’ll never understand the context in which your technology is going to be used. It’s not something you can imagine, or glean from secondary data, or even from video or anything. Until you are there, seeing how they grapple with the issues that they are dealing with, it’s almost impossible to really understand that context. In fact, for my first nine years at Microsoft, I worked in a customer-facing part of the business, in the Strategic Advisory Services, today known as the Digital Advisory Services. It’s work that we do with our largest customers around the world to really help them figure out how they can transform their own businesses and leverage advancements in technology.

Host: Right. So now, as you are working in Microsoft Research, as a product strategist, how does that transfer to what you do today?

Amos Miller: First of all, I want to introduce, for a moment, the team that I work with, which is the Enable team in Microsoft Research. And the Enable team is looking at technological innovations, especially with disabilities in mind. In our case, our two primary groups are people with ALS and people who are blind. As a product strategist, my role is to work across the research, engineering, marketing and our customer segment and really figure out and understand how we can harness what we have from a technology perspective and, as an organization, to maximize and have that impact that we aspire to have with that community. And that takes a great deal of – again, going back to my earlier point – spending time with that community, going out there and spending time, in my case, with other people who are blind because I only know my own experience. I don’t have everybody else’s experience. The only way for me to learn about that is to be out there. And in our team, every developer goes out there to spend time with end users because that’s the only way you can really get under the covers and understand what’s going on.

Host: Right.

(music plays)

Host: So, the website says you drive a research program that “seeks to understand and invent accessibility in a world” – this is the fun part – “where AI agents and mixed reality are the primary forms of interaction.” It sounds kind of sci-fi to me…

Amos Miller: A little bit. Let me unpack that. When we traditionally think about accessibility, we think about how to make something accessible. So how do you make a microwave more accessible? Well, there isn’t anything inherently inaccessible about putting a piece of pizza in the microwave and warming it up. The only reason it’s inaccessible is because the microwave was designed in an inaccessible way. It could have been accessible from the beginning.

Host: Sure.

Amos Miller: But in the world we are moving to, it’s not about me operating the microwave, it’s not about the accessibility of the microwave, it’s about me preparing dinner for my family. That’s the experience that I’m in. And there’s a bunch of technologies that support that experience. And it’s that experience that I am seeking to make accessible and inclusive.

Host: Okay.

Amos Miller: That means that we are no longer talking about the microwave, we are talking about a set of interactions that involve people, that involve technology, that involve physical things in the environment. It’s not about making the technology accessible, it’s about using technology to make life more accessible, whether you are going for a walk with a friend, whether you are going to see a movie with a friend, whether it’s sitting in a meeting and brainstorming a storyboard for a video. All of these are experiences, and the goal is, how do you make those experiences accessible experiences? That kind of gets you thinking about accessibility in a very different way, where your interaction is with the person that you are sitting in front of. The technology is just there in support of that interaction.

Host: Right. As I was researching the interview, I found myself thinking of the various solutions, maybe the “technical guide dog” mentality: let’s replace, with technology, all the things that people have traditionally used for independence. And as technology enters that ecosystem, some people might think the aim is to replace those things, but I don’t think that’s the point of what’s going on here. Am I right?

Amos Miller: That’s right. There is a tendency, when you come at a problem with a technology solution, to look at what you are currently doing and replace that with something that’s automatic. Right? Oh, you are using a guide dog? How can I replace that guide dog and give you a robot? So, I work on technology that enhances mobility independence through audio and sound, which we’ll talk about in a minute.

Host: Right.

Amos Miller: But often people ask me, how would that work for people who can’t hear? And their natural inclination is to say, oh, okay, well you’ll have to deliver the information in a different way. The thing is that people get a sense of their space and their surroundings using the senses that they have. To me, the question is not, how do we shortcut that? It’s, how do they sense their space today? Because they do. They don’t sit there feeling completely disconnected. And if you are going to intervene in that, you’d better be consistent with how they’re experiencing it today.

Host: Yeah, and that leads me right into the next question because you and I talked earlier about the fundamental role that design plays in the success of human computer interaction. And I’m really eager to have you weigh in on the topic. Let’s frame this broadly in terms of assumptions. And that’s kind of what you were just referring to.

Amos Miller: Yeah.

Host: You know, if I’m looking at you and I think, well, my solution to how you interact with the world with technology would be Braille, that’s an assumption. So, I’m just going to give you free rein here. Tell us what you want us to know about this from your perspective.

Amos Miller: We all make assumptions about other people’s experience of life. You are referring to Bill Buxton, who was on your podcast a few weeks ago.

Host: Right.

Amos Miller: And he’s actually been a very close friend and mentor throughout the work that we are doing on Soundscape, which we’ll talk about in a minute. And he’s really brought to our attention that what we’ve done, going out there and experiencing the real situation that people are experiencing, is about empathy, and it’s about trying to understand and probe ideas that challenge your assumptions about what effect they will have. It’s really about seeing, observing and understanding their experience in that particular situation, and then maybe applying, from your learning, some form of intervention into that experience and observing how that affects it. It doesn’t have to be a complete piece of software or technology, it’s just an intervention. It can be completely low-fi. That helps you to start expanding your understanding. And you don’t have to do it with 100 people. Do it with two… three people. You will discover a whole new world you didn’t know about. I’m sorry, but you don’t need 200 data points to support that experience, you’ve just seen it. And you can build on that. So, can you enhance that, in any way, to give them an even richer awareness of their surroundings? Those are the kinds of questions, asked through that very experiential lens on design, that have led us to our work on Soundscape, which is the technology that we’ve been developing over the last few years to really see how far we can take this notion of how people perceive the world, and how you can enhance that perception.

(music plays)

Host: Well, let’s talk about 3D sound and an exciting launch earlier this year in the form of Microsoft’s Soundscape. This is such a cool technology with so many angles to talk about. First, just give our listeners an overview of Soundscape. What is it, who is it for, how does it work, how do people experience it?

Amos Miller: Soundscape is a technology that we developed in collaboration with Guide Dogs, certainly in the early stages, and still do. And the idea is very much using audio that’s played in 3D. Using a stereo headset, you can hear the landmarks that are around you and you can, thereby, really enrich your awareness of your surroundings, of what’s where in a very natural, easy way. And that really helps you feel more independent, more confident, to explore the world beyond what you know.

Host: How do you hear a landmark?

Amos Miller: How do you hear a landmark? So, for example, if you are standing and Starbucks is in front of you and to the right, we will say the word Starbucks, but we won’t say it’s in front of you and to the right, it will sound like it is over there where Starbucks is.

Host: Oh.

Amos Miller: OK? And that’s generated using what’s technically called head-related transfer functions for synthetic binaural audio. It’s work that was actually developed in Microsoft Research, over a number of years, by Ivan Tashev and his team. And effectively, you can generate sound so that it doesn’t sound like it’s in between your ears. You can hear it as though it’s out in the space around you. It’s really quite amazing. And we also use non-speech audio cues. For example, one of the ideas that we built into Soundscape is this notion of a virtual audio beacon. Not to be confused with Bluetooth beacons! It’s completely virtual. But let’s suppose that you are standing on a street corner and you are heading to a restaurant that’s a block and a half away. What you can do with Soundscape is play an audio beacon that will sound like it’s coming from that restaurant, so no matter which way you’re standing, which way you’re heading, you can always hear that “click-click” sound so you know exactly where that restaurant is. You can see it with your ears.

Host: How do you do that? How do you place a beacon someplace, technically?

Amos Miller: Binaural audio is when you have a slightly different sound in each ear, which tricks the brain into sensing that the sound is three-dimensional. It’s exactly the same way that 3D images work. Audio works almost the same. If Ivan were here, he’d say it’s not exactly the same, but by generating a slightly different soundwave in each ear, you’re able to make a sound seem like it’s coming from a specific direction. So by playing it in each ear slightly differently, it will actually sound like it’s coming from in front of you and to the right. OK? Now how do we know where to place that beacon?
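The idea of a slightly different signal in each ear can be sketched with the two classic localization cues, the interaural time difference (ITD) and the interaural level difference (ILD). The Python below is a minimal illustration under stated assumptions: it uses Woodworth’s spherical-head formula for the delay and an invented sine-shaped rule for the level drop. Soundscape itself relies on head-related transfer functions from Ivan Tashev’s team, and nothing here reproduces that work; the head radius, speed of sound, and 6 dB maximum level difference are assumed values.

```python
import math

HEAD_RADIUS_M = 0.0875   # assumed average adult head radius, in meters
SPEED_OF_SOUND = 343.0   # meters per second, in air at roughly 20 degrees C


def woodworth_itd(azimuth_deg: float) -> float:
    """Interaural time difference in seconds for a spherical head.

    Woodworth's approximation: ITD = (a / c) * (theta + sin(theta)).
    Azimuth 0 is straight ahead, +90 is hard right, -90 is hard left.
    Angles beyond +/-90 are clamped, so sources behind are not modeled.
    A positive result means the sound reaches the right ear first.
    """
    theta = math.radians(max(-90.0, min(90.0, azimuth_deg)))
    t = abs(theta)
    itd = (HEAD_RADIUS_M / SPEED_OF_SOUND) * (t + math.sin(t))
    return itd if theta >= 0 else -itd


def spatialize(mono, sample_rate, azimuth_deg, max_ild_db=6.0):
    """Return (left, right) sample lists for a mono signal.

    The far ear gets a delayed copy (time cue) that is also quieter
    (level cue). The level rule is a crude assumption, not an HRTF.
    """
    itd = woodworth_itd(azimuth_deg)
    delay = int(round(abs(itd) * sample_rate))       # delay in whole samples
    ild_db = max_ild_db * abs(math.sin(math.radians(azimuth_deg)))
    far_gain = 10 ** (-ild_db / 20)                  # dB to linear amplitude
    near = list(mono) + [0.0] * delay                # pad so both ears match in length
    far = [0.0] * delay + [s * far_gain for s in mono]
    return (far, near) if itd > 0 else (near, far)
```

Because angles beyond ±90 degrees are simply clamped, this sketch cannot place a sound behind the listener; the spectral filtering captured by measured HRTFs is what helps a real system tell front from back.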

Host: Right.

Amos Miller: At present, we – it’s largely designed to be used outdoors – so, we use GPS, so we know where you are standing. We know where that restaurant is, so we have two coordinates to work with. We also estimate which way you are facing. So, if you were facing the restaurant, we would want to play that beacon right in front of you. If you were standing at 90 degrees to the restaurant, we’d want to make that beacon sound like it’s coming not only from your right ear, but 100 meters away to your right.
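The two coordinates and the estimated heading are enough to decide where a beacon should sound. As a hedged sketch (the function names are hypothetical, and Soundscape’s actual pipeline is not described in this conversation), the Python below computes the great-circle bearing from the listener to the target, subtracts the listener’s heading to get a relative azimuth, and uses the haversine formula for distance. A spatializer could then render the beacon at that azimuth, and recomputing it on every GPS and compass update would give the real-time behavior Amos goes on to describe.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Great-circle bearing from point 1 to point 2, degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(x, y)) % 360.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlon = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def beacon_direction(user_lat, user_lon, heading_deg, target_lat, target_lon):
    """Return (relative azimuth in [-180, 180), distance in meters).

    Relative azimuth 0 is straight ahead, +90 is to the right, -90 to the left.
    """
    bearing = initial_bearing_deg(user_lat, user_lon, target_lat, target_lon)
    relative = (bearing - heading_deg + 180.0) % 360.0 - 180.0
    return relative, haversine_m(user_lat, user_lon, target_lat, target_lon)
```

For example, `beacon_direction(0.0, 0.0, 0.0, 0.0, 0.001)` (facing north, target slightly east) gives a relative azimuth of 90 degrees, to the right, at roughly 111 m.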

Host: Unbelievable…

Amos Miller: Yeah? And so, taking all of those sensory inputs and taking the information from the map, the GPS location, the direction, we reproduce the sound image in your stereo headset so that you can hear the direction of the sound and where the thing is. And the most amazing thing is, this is all done in real time, completely dynamic. So, as you walk down the street, that restaurant may sound in front of you at 45 degrees to your right, and as you progress, you’ll hear it getting closer and closer and further and further to your right. And if you overshoot it, it’ll start to sound behind you a little bit, yeah? Now, why is this so important? Because I’m not going to the restaurant on my own. I’m there with my kid or with my wife, or with my friend. And, if I were to hold a phone with the GPS instructions and all of that, I couldn’t hold a conversation with that person at the same time because I’d be so engaged with the technology. And we talked earlier about, how do you get technology to be in the background? That beacon sound is totally in the background. You don’t have to think about it, you don’t have to attend to it mentally, it’s just there. So, you know where the restaurant is, and you continue to have a conversation with the person you are with, or you can daydream, or you can read your emails, listen to a podcast, and all of that happens at the same time, because it’s played in 3D space, because it’s non-intrusive, and because you minimize the use of language. All of these subtle aspects are absolutely crucial for this kind of technology to be relevant to this situation. You’re not sitting in front of the computer and it’s the only thing you are doing. You are outdoors. There’s a ton of things happening all the time that you have to deal with. You can’t expect the person to disassociate themselves from all of that. You know, Soundscape is one way of addressing this very, very interesting and important question.
Throughout history, technology has always changed the way that we do things. But I think that we’re starting to see that, as technology developers, we really have to be much more mindful of everything from the subtleties of how we design something to the relationship between the technology and the person in that situation. How can a technology do exactly the same as it has done, but do so in a way that makes the person feel empowered and develop a new skill? Great runners learn to feel their heartbeat. But if they have a heart monitor, they’ll stop feeling that heartbeat because the device on their wrist tells them what it is. Well, that’s only because of how it was designed. If the heart monitor, instead of telling you that you are at, I don’t know, 150, asked, what do you think you’re at? And you said, oh, I’m at 140, and it said, oh, you are actually at 150, you would have learned something new from that. It’s exactly the same function, but you have developed yourself as a result of that interaction. And I think that’s the kind of opportunity that we need to start looking for.

Host: I want to circle back to this 3D audio and the technology behind it, and something that you referred to as “the cocktail party effect.” Can you explain that a little bit and how Microsoft Research is sort of leading the way here?

Amos Miller: The cocktail party effect is an effect, in the world of psycho-acoustics, that is very simple. If you imagine you’re sitting around a table in a cocktail party having a very exciting conversation with somebody, and there are lots of other similar conversations happening around you at the same time, because all of those conversations are happening in 3D space, you are actually able to hear all of those conversations even though you are attending just to yours. You are listening and you can understand and engage in your conversation, but if your name came up in any of those other conversations, you’ll immediately turn your head and say, hey guys, what are you talking about there? And that’s an incredible capability of the brain to manage a very rich set of inputs in the auditory space that is very much under-utilized today in the technology space. We always feel that if we need to convert something into audio, it’s got to be sequenced, because we can only hear one thing at a time. When it’s in 3D, that’s no longer the case. And that’s a huge opportunity. We play a lot of that in VR and augmented reality and we spend a lot of time on the visual aspect of virtual reality and really pushing the envelope on how far we can take the use of immersive experiences in objects in all directions. But the same is available with audio. Even more with audio because your eyes are no longer engaged. Audio is in 360. If we block our ears for a moment, all of a sudden, our awareness level drops. But we are so unaware of the power of audio because vision just takes over everything. And I think the work that we have done, both in the acoustic work on 3D audio, and the application, especially in the disability space where we placed the constraints on the team – there is no vision, now let’s figure it out – and that leads to new frontiers of discovery and innovation in this space that I think could be applicable and would be applicable in many other spaces. 
Think of that heads-up experience when you are out and about in the streets, not focused on the screen, but engaged in your surroundings. That’s a perfect situation where audio has huge advantages that we can look at.

(music plays)

Host: I ask each of my guests some version of the question, what keeps you up at night? Because I’m interested in how researchers are addressing unintended consequences of the work they’re doing. Is there anything that concerns you, Amos? Anything that keeps you up at night?

Amos Miller: I think things keep me up at night because they are so interesting and yet unsolved. You know, we talked a bit about, how do you really express and portray the physical space around you in ways that utilize your other senses and really maximize the ability of the brain to make sense of places without vision? And I really think that, with Soundscape, we’ve only started to scratch the surface of that question. Over half of the brain is devoted to perception. And I think that, when we find ways to really engage, even further engage that incredible human capability, we will discover a whole new frontier of machine and human interaction in ways that we don’t understand today.

Host: You said you arrived at Microsoft Research from “left field.” What’s your story on how you came to be working on research in accessibility at Microsoft Research?

Amos Miller: I started life as a developer, and I did a business degree and joined the Strategic Advisory Services in Microsoft Consulting in the UK. And I think it was a very special moment in Microsoft, over the last few years, when we really started to understand the meaning of impacting every person on the planet with technology and seeing that as our mission. And that led to a series of conversations that opened an opportunity for us to actually get behind that statement, and we basically joined Microsoft Research through that mission, through the work that we’re doing in Soundscape. That was possible because we already had very strong relationships, thanks to some wonderful people, here in Microsoft Research and in other parts of the company.

Host: Before we close, Bill Buxton asked me to ask you about the kayak regatta that you organized.

Amos Miller: Uh huh. Oh, we didn’t talk about that.

Host: Just tell that story quickly because I do have one question I want to wrap up with before we go.

Amos Miller: Okay. Well we talked about Soundscape as a technology that really enables you to hear signals in 3D around you. And that was largely designed to be used in the street, right? And then we thought, what would happen if we placed that audio beacon on a lake? So, we got a bunch of people during the summer hackathon and said, okay, well let’s try it out. So, we organized an event on Lake Sammamish. We hacked Soundscape to work on the lake and placed some virtual audio beacons around the lake and invited a group of people who are blind to come and kayak with us and see how they enjoy it. And they absolutely loved it. And I think that was a real eye-opener for us. You have to understand the difference here, you know? Could they kayak before? Sure, no problem, because a sighted person would be with them and tell them, okay, now you go straight, now you row left… But I’m sorry, that’s a very boring experience. You are not in control, you are not independent, you are just doing the work. And by being able to hear where those beacons are, you are truly in the driving seat. And that is a sense of independence that we’ve not really seen to that extent before we did this event.

Host: I like how you called it an eye-opening event!

Amos Miller: It was!

Host: There are so many metaphors about vision that we just sort of take for granted, right?

Amos Miller: Maybe it’s because I have prior sight, maybe not, but I, first of all, I use those metaphors all the time, and I also feel, you know, I could close my eyes and feel that my eyes are closed and open them and feel that they’re open. And I definitely take everything in in a very different way, even though the eyes don’t actually do the scientific aspect of what they’re designed to do.

Host: As we close, I always ask my guests to offer some parting advice to our listeners, whether that be in the form of inspiration for future research or challenges that remain to be solved or personal advice on next steps along the career path, whether you have a guide dog with you or Soundscape… What would you say to your 25-year-old self if you were just getting started in this arena?

Amos Miller: I honestly would say, get real life experience. Especially in the areas that you are passionate about. Be passionate about them with even more energy and see the work that you do in the context of what you are passionate about. Because you can only really apply your personal experiences to what you do. It’s so great here, in Microsoft Research, to see the interns coming here in the summer, and the creativity and passion, and new perspectives that they bring to our work here. And there’s a little bit of a side of me that worries they’ll jump into the job before they go out and explore the world. And I think it’s important that they find a way to do something that gives them that meaningful context for the work that they’ll be doing here.

(music plays)

Host: Amos Miller, thank you for joining us today. It’s been – can I say it? – an eye-opening experience!

Amos Miller: Sure. My pleasure. Thanks so much for having me.

To learn more about Amos Miller and the latest innovations in audio, sound and accessibility technology, visit


First TextWorld Problems: Microsoft Research Montreal’s latest AI competition is really cooking

textworld at neurips 2018

This week, Microsoft Research threw down the gauntlet with the launch of a competition challenging researchers around the world to develop AI agents that can solve text-based games. Conceived by the Machine Reading Comprehension team at Microsoft Research Montreal, the competition—First TextWorld Problems: A Reinforcement and Language Learning Challenge—runs from December 8, 2018 through May 31, 2019.

First TextWorld Problems is built on the TextWorld framework, which was released to the public in July 2018. TextWorld is an extensible sandbox learning environment for reinforcement learning in text-based games. Beyond game simulation, it has the capacity to generate games stochastically from a user-specified distribution. Such a distribution of games opens new possibilities for the study of generalization and continual or meta-learning in a reinforcement learning setting, by enabling researchers to train and test agents on distinct but related games. TextWorld’s generator gives fine control over game parameters like the size of the game world, the branching factor and length of quests, the density of rewards, and the stochasticity of transitions. Game vocabulary can also be controlled; this directly affects the action and observation spaces. Researchers can also use TextWorld to handcraft games that test for specific knowledge and skills.
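To make “train and test agents on distinct but related games” concrete, here is a minimal sketch of sampling game specifications from a distribution and splitting them by seed. Everything in it is hypothetical: `GameSpec`, its fields, and the parameter ranges only mirror the knobs listed above and are not TextWorld’s API; a real pipeline would hand each specification to TextWorld’s own generator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GameSpec:
    """Hypothetical knobs loosely mirroring the generator parameters above."""
    world_size: int        # number of rooms
    quest_length: int      # actions in the shortest winning walkthrough
    reward_density: float  # fraction of quest steps that pay an intermediate reward
    slip_prob: float       # chance a valid command fails (transition stochasticity)
    seed: int              # seed a game builder would use to lay out this game


def sample_spec(rng: random.Random, seed: int) -> GameSpec:
    """Draw one game from a user-specified distribution (ranges are assumptions)."""
    return GameSpec(
        world_size=rng.randint(3, 12),
        quest_length=rng.randint(2, 10),
        reward_density=rng.choice([0.0, 0.25, 0.5, 1.0]),
        slip_prob=rng.uniform(0.0, 0.1),
        seed=seed,
    )


def make_splits(n_train: int, n_test: int, master_seed: int = 0):
    """Distinct but related games: same distribution, non-overlapping seeds."""
    rng = random.Random(master_seed)
    seeds = rng.sample(range(1_000_000), n_train + n_test)  # drawn without replacement
    specs = [sample_spec(rng, s) for s in seeds]
    return specs[:n_train], specs[n_train:]
```

Both splits come from the same distribution, so the games share structure, but no test game reuses a training seed; that is what lets a held-out score speak to generalization rather than memorization.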

The theme for First TextWorld Problems is gathering ingredients to cook a recipe. Agents must determine the necessary ingredients from a recipe book, explore the house to gather ingredients, and return to the kitchen to cook up a delicious meal. Additionally, agents will need to use tools like knives and frying pans. Locked doors and other obstacles along the way must be overcome. The necessary ingredients and their locations change from game to game, as does the layout of the house itself; agents cannot simply memorize a procedure in order to succeed.

Hang on … did someone change the floorplan in this house? Example house layouts generated by TextWorld.

While a simple cooking task may seem quotidian by human standards, it is still very difficult for AI. Observations and actions are all text-based (see the example below), so a successful agent must learn to understand and manipulate its environment through language, as well as to ground its language in the environmental dynamics. It must also deal with classic, open reinforcement learning problems like partial observability and sparse rewards.

An example of a text-based cooking game whipped up in the TextWorld framework kitchen.
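Partial observability and sparse rewards can be seen in miniature with a toy game. The class below is a hypothetical two-room example, not a TextWorld game: the agent only ever receives a description of the room it is in, and the single reward arrives on the final step, so nothing along the way signals progress.

```python
class ToyKitchenGame:
    """Hypothetical two-room text game: fetch an egg from the pantry, cook it."""

    def __init__(self):
        self.room = "kitchen"
        self.inventory = set()
        self.done = False

    def observe(self) -> str:
        # Partial observability: only the current room is ever described.
        if self.room == "kitchen":
            return "You are in the kitchen. There is a stove. A door leads east."
        egg = " There is an egg here." if "egg" not in self.inventory else ""
        return "You are in the pantry." + egg + " A door leads west."

    def step(self, command: str):
        """Apply a text command; return (observation, reward, done)."""
        if self.done:
            return "The game is over.", 0, True
        command = command.strip().lower()
        if command == "go east" and self.room == "kitchen":
            self.room = "pantry"
        elif command == "go west" and self.room == "pantry":
            self.room = "kitchen"
        elif command == "take egg" and self.room == "pantry" and "egg" not in self.inventory:
            self.inventory.add("egg")
        elif command == "cook egg" and self.room == "kitchen" and "egg" in self.inventory:
            self.done = True
            # Sparse reward: the only nonzero reward in the whole game.
            return "You cook a delicious egg. You win!", 1, True
        else:
            return "Nothing happens.", 0, False
        return self.observe(), 0, False
```

A learning agent would see only the returned strings and rewards; here the walkthrough `go east`, `take egg`, `go west`, `cook egg` yields rewards 0, 0, 0, 1.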

We hope this competition fosters research into generalization across tasks, meta-learning, zero-shot language understanding, common-sense reasoning, efficient exploration, and effective handling of combinatorial action spaces. The winning team will be awarded a prize of $2,000 USD, plus an exclusive one-hour discussion session with a Microsoft Research researcher, and will be featured in a Microsoft Research blog post and in an accompanying article in the Microsoft Research Newsletter (some restrictions apply; please check the competition rules and regulations for details).

Did we pique your interest? We encourage everyone to put their reinforcement learning prowess—and culinary talents—to the test in First TextWorld Problems. Go to and sign up today!


Getting into the groove: New approach encourages risk-taking in data-driven neural modeling

Microsoft Research’s Natural Language Processing group has set an ambitious goal for itself: to create a neural model that can engage in the full scope of conversational capabilities, providing answers to requests while also bringing the value of additional information relevant to the exchange and—in doing so—sustaining and encouraging further conversation.

Take the act of renting a car at the airport, for example. Across from you at the counter is the company representative, entering your information into the system, checking your driver’s license, and the like. If you’re lucky, the interaction isn’t merely a robotic back-and-forth; there is a social element that makes the mundane experience more enjoyable.

“They might ask you where you’re going, and, you say the Grand Canyon. As they’re typing, they’re saying, ‘The weather’s beautiful out there today; it looks gorgeous,’” explained Microsoft Principal Researcher and Research Manager Bill Dolan. “We’re aiming for that kind of interaction, where pleasantries that are linked to the context, even if it’s a very task-oriented context, are not just appropriate, but in many situations, making the conversation feel fluid and human.”

As is the case with many goals worth pursuing, there are obstacles. Existing end-to-end data-driven neural networks have proven highly effective in generating conversational responses that are coherent and relevant, and Microsoft has been at the forefront of this rapid progress, having been the first to publish on data-driven approaches to modeling conversational responses, back in 2010. But these neural models present two particularly large challenges. First, they tend to produce very bland, vague outputs, which are hallmarks of stale conversation and nonstarters if the goal is user engagement beyond the completion of singular tasks. Second, they take a top-level either-or approach, classifying inputs as either task-oriented or conversational and assigning each a specific path in the code base that fails to account for the nuances of the other. The result? Responses to more sophisticated conversation are often varied but uninformative (for example, “I haven’t a clue” and “I couldn’t tell you”) or informative but not specific enough (such as “I like music” versus “I like jazz”), a consequence of traditional generation strategies that try to maximize the likelihood of the response.

The paper the team is presenting at the 2018 Conference on Neural Information Processing Systems (NeurIPS)—“Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization”—tackles the former challenge, introducing a new approach to producing more engaging responses that was inspired by the success of adversarial training techniques in such areas as image generation.

“Ideally, we would like to have the systems generate informative responses that are relevant and fully address the input query,” said leading author Yizhe Zhang. “By the same token, we also would like to promote responses that are more varied and less conventionally predictable, something that would help make conversations seem more natural and humanlike.”

“This work is focused on trying to force these modeling techniques to innovate more and not be so boring, to not be the person you’re desperately trying to avoid at the party,” added Dolan.

The force of two major algorithmic components

To accomplish this, the team determined it needed to generate responses that reduce the uncertainty of the query. In other words, the system needed to be better able to guess from the response what the original query might have been, reducing the chance that the system would produce bland outputs such as “I don’t know.”

In the paper, Zhang, Dolan, and their collaborators introduce adversarial information maximization (AIM). Designed to train end-to-end neural response generation models that produce conversational responses that are both informative and diverse, this new approach combines two major algorithmic components: generative adversarial networks (GANs) to encourage diversity and a variational information maximization objective (VIMO) to produce informative responses.

“This adversarial training technique has received great success in generating very diverse and realistic-looking synthetic data when it comes to image creation,” said Zhang, who began this work as a Microsoft Research intern while at Duke University and is now a researcher with the company. “It’s been less explored in the text domain because of the discrete nature of text, and we were inspired to see how it could help with natural language processing, especially in dialogue generation.”

GANs themselves are increasingly deployed in neural response generation and commonly use synthetic data during training. Equilibrium for the GAN objective is achieved when the synthetic data distribution matches the real data distribution. This has the effect of discouraging the generation of responses that demonstrate less variation than human responses. While this may help reduce blandness, the GAN technique was not developed for the purpose of explicitly improving either informativeness or diversity. That is where VIMO comes in.
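For reference, the equilibrium property mentioned above comes from the conventional GAN objective (Goodfellow et al., 2014), written here conditioned on a query S, with a generator p_θ producing responses T̃ and a binary discriminator D_ψ. This is the standard textbook form rather than AIM’s exact loss, which the authors modify with an embedding-based discriminator; the symbols are chosen here for illustration.

```latex
\min_{\theta}\max_{\psi}\;
\mathbb{E}_{(S,T)\sim p_{\text{data}}}\big[\log D_\psi(S,T)\big]
+ \mathbb{E}_{S\sim p_{\text{data}},\,\tilde{T}\sim p_\theta(\cdot\mid S)}\big[\log\big(1 - D_\psi(S,\tilde{T})\big)\big]

D^{*}(S,T) = \frac{p_{\text{data}}(T\mid S)}{p_{\text{data}}(T\mid S) + p_\theta(T\mid S)},
\qquad
\text{value at } D^{*} = \mathbb{E}_{S}\Big[\,2\,\mathrm{JSD}\big(p_{\text{data}}(\cdot\mid S)\,\big\|\,p_\theta(\cdot\mid S)\big)\Big] - \log 4
```

Since the Jensen–Shannon divergence is zero only when its two arguments coincide, the generator’s best response is p_θ(·|S) = p_data(·|S), at which point D* = 1/2 everywhere. Nothing in this objective rewards informativeness directly, which is the gap VIMO is meant to fill.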

Going backward to move forward

The team trained a backward model that generates the query, or source, from the response, or target. The backward model is then used to guide the forward model—from query to response—to generate relevant responses during training, providing a principled approach to mutual information maximization. This work is the first application of a variational mutual information objective in text generation.
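
Why a backward model helps can be seen from the standard variational lower bound on mutual information: for any backward model q(s | t), I(S;T) ≥ H(S) + E_{p(s,t)}[log q(s | t)], with equality when q is the true posterior p(s | t). The sketch below checks this on a tiny hand-made joint distribution of queries and responses. In AIM the backward model is a neural network trained on sentences. The exact enumeration, the names, and the numbers here are assumptions for illustration only.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple

Joint = Dict[Tuple[str, str], float]     # p(s, t) for query s and response t
Backward = Dict[Tuple[str, str], float]  # q(s | t), keyed by (s, t)


def marginals(joint: Joint):
    """Return p(s) and p(t) from the joint table."""
    p_s: Dict[str, float] = defaultdict(float)
    p_t: Dict[str, float] = defaultdict(float)
    for (s, t), p in joint.items():
        p_s[s] += p
        p_t[t] += p
    return p_s, p_t


def mutual_information(joint: Joint) -> float:
    """Exact I(S;T) in nats."""
    p_s, p_t = marginals(joint)
    return sum(p * math.log(p / (p_s[s] * p_t[t])) for (s, t), p in joint.items() if p > 0)


def variational_lower_bound(joint: Joint, backward: Backward) -> float:
    """H(S) + E_{p(s,t)}[log q(s|t)]; never exceeds I(S;T)."""
    p_s, _ = marginals(joint)
    entropy_s = -sum(p * math.log(p) for p in p_s.values() if p > 0)
    expected_log_q = sum(p * math.log(backward[(s, t)]) for (s, t), p in joint.items() if p > 0)
    return entropy_s + expected_log_q


def exact_backward(joint: Joint) -> Backward:
    """The true posterior p(s|t), which makes the bound tight."""
    _, p_t = marginals(joint)
    return {(s, t): p / p_t[t] for (s, t), p in joint.items() if p > 0}


if __name__ == "__main__":
    joint = {
        ("pizza?", "thin crust"): 0.3,
        ("weather?", "rain later"): 0.3,
        ("pizza?", "i don't know"): 0.2,    # bland reply: equally likely for both queries
        ("weather?", "i don't know"): 0.2,
    }
    uniform = {k: 0.5 for k in joint}       # a backward model that ignores the response
    print("I(S;T)          :", mutual_information(joint))
    print("bound, exact q  :", variational_lower_bound(joint, exact_backward(joint)))
    print("bound, uniform q:", variational_lower_bound(joint, uniform))
```

The “i don't know” pairs add nothing to the mutual information, because the backward model cannot tell which query produced them. A forward model trained to raise the bound is therefore steered toward replies like “thin crust” that point back to their query.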

The authors also employed a dual adversarial objective that composes both source-to-target and target-to-source objectives. The dual objective requires the forward and backward models to work synergistically, so that each improves the other.

To mitigate the well-known instability of GAN training, the authors, inspired by the deep structured semantic model (DSSM), applied an embedding-based discriminator rather than the binary classifier conventionally used in GAN training. To reduce the variance of gradient estimates, they used a deterministic policy gradient algorithm with a discrete approximation strategy.
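
The idea of an embedding-based discriminator can be sketched in a DSSM-like way: embed the query and each candidate response, then compare cosine similarities instead of outputting a real/fake probability. This is a minimal sketch under stated assumptions. The three-dimensional word vectors, the mean pooling, and the `discriminator_margin` score are all hypothetical stand-ins, not the paper's encoders or loss.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]

# Hypothetical word vectors; a DSSM-style model would learn deep encoders instead.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "pizza": [1.0, 0.0, 0.0],
    "recipe": [0.9, 0.1, 0.0],
    "dough": [0.8, 0.2, 0.0],
    "i": [0.0, 0.0, 1.0],
    "don't": [0.0, 0.2, 0.8],
    "know": [0.0, 0.1, 0.9],
}


def embed(tokens: Sequence[str], table: Dict[str, Vector]) -> Vector:
    """Mean-pool word vectors; unknown tokens contribute a zero vector."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for tok in tokens:
        vec = table.get(tok, [0.0] * dim)
        total = [a + b for a, b in zip(total, vec)]
    return [x / max(len(tokens), 1) for x in total]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def discriminator_margin(source: Sequence[str], real: Sequence[str],
                         fake: Sequence[str], table: Dict[str, Vector]) -> float:
    """How much closer the human reply is to the query than the generated one."""
    s = embed(source, table)
    return cosine(s, embed(real, table)) - cosine(s, embed(fake, table))


if __name__ == "__main__":
    margin = discriminator_margin(
        ["pizza", "recipe"], ["dough", "recipe"], ["i", "don't", "know"], TOY_EMBEDDINGS
    )
    print(f"margin = {margin:.3f}")
```

Here the margin comes out close to 1 because the bland reply shares no direction with the query. In an adversarial setup of this kind, the discriminator would be updated to widen such a margin and the generator to close it, which gives a graded similarity signal rather than a saturating binary decision.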

The paper advances the team’s focus on improving how candidate hypotheses are ranked, pushing the system to take more risks and produce more interesting outputs.

“In ranking the candidate hypotheses, you might have hundreds and thousands of hypotheses that it’s trying to weigh, and the very top-ranked ones might be these really bland-type ones,” explained Dolan. “If you look down at candidate No. 2,043, it might have a lot of content words, but be wrong and completely odd in context even though it’s aggressively contentful. Go down a little farther, and maybe you find a candidate that’s contentful and appropriate in context.”
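
The ranking problem Dolan describes can be illustrated with maximum-mutual-information (MMI) reranking, the approach used in earlier neural conversation work (Li et al., 2016). This is a related technique, not AIM's training procedure. Each candidate is scored by log p(T | S) + λ · log q(S | T), so a reply the forward model likes must also let a backward model recover the query. All scores below are invented, and the `Candidate` type and `rerank` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    forward_logprob: float   # log p(response | query) from a forward model
    backward_logprob: float  # log q(query | response) from a backward model


def rerank(candidates: List[Candidate], lam: float = 1.0) -> List[Candidate]:
    """Sort by log p(T|S) + lam * log q(S|T), best first."""
    return sorted(
        candidates,
        key=lambda c: c.forward_logprob + lam * c.backward_logprob,
        reverse=True,
    )


if __name__ == "__main__":
    # Made-up scores for the query "Any tips for homemade pizza?"
    pool = [
        Candidate("I don't know.", -2.0, -9.0),
        Candidate("Stretch the dough thin and bake it as hot as your oven goes.", -6.0, -3.0),
        Candidate("Pizza pizza cheese pizza oven pizza.", -12.0, -1.0),
    ]
    by_forward = sorted(pool, key=lambda c: c.forward_logprob, reverse=True)
    print("forward only:", by_forward[0].text)  # the bland reply wins
    print("reranked:    ", rerank(pool)[0].text)  # the contentful, on-topic reply wins
```

The third candidate is the “aggressively contentful” case: the backward model recovers the query easily, but the forward model finds it implausible, so it still ranks last. The weight λ sets how far down the list the system is willing to reach.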

Persona non grata

Solving the fundamental problem of uninteresting and potentially uninformative outputs in today’s modeling techniques is an important pursuit, as it’s a significant obstacle in creating conversational agents that individuals will want to engage with regularly in their everyday lives. Without interesting and useful outputs, conversations, task-oriented or not, will quickly spiral into the trivial unless the user is continuously voicing keywords. In that way, current neural models are very reactive, requiring a lot of work from the user, and that can be frustrating and exhausting.

“It’s not that tempting to engage with these agents even though they sound, superficially, fluent as if they understand you, because they tend not to innovate in the conversation,” said Dolan.

Conversation generation stands to gain a lot from this work, but so do other tasks that combine language and neural models, such as video and photo captioning or text summarization, for instance summarizing a spreadsheet you’re working in.

“You don’t want a generated spreadsheet caption that is just, ‘Lines are going up. Numbers are all over the place,’” said Dolan. “You actually need it to be contentful and tie to the context in interesting ways, and that’s at odds with the tendency of current neural modeling techniques.”

The team can envision a future in which exchanges with conversational agents are comparable to those with friends, an exploratory process in which you’re asking for an opinion, unsure of where the conversation will lead.

“You can use our system to improve that, to produce more engaging and interesting dialogue; that’s what this is all about,” said Zhang.