Faculty Fellow

Dan Klein

Electrical Engineering and Computer Sciences

Dr. Dan Klein is a professor in the Department of Electrical Engineering and Computer Sciences. As an undergraduate, he studied math, computer science, and linguistics at Cornell University. He then earned a master’s degree in linguistics from Oxford University and a PhD in computer science from Stanford.

Spark Award Project

Dr. Klein aims to build devices that synthesize and/or recognize speech from silently mouthed words. He imagines a headset that allows the user to carry on a private conversation free of noise and without being overheard, speaking silently but otherwise naturally. The headset tracks the user’s articulatory muscle movements through surface electrical signals (electromyography, or EMG) and translates them into synthetic sound that is heard on the other end of the call as the user’s own reconstructed voice. Unlike the audio from a microphone, EMG signals pick up no interference from other conversations, so noise is not transmitted to the remote participant. Privacy is also improved: because the speaker is speaking silently, there is nothing for others to overhear. Such a device would be useful in a wide range of scenarios, from workers on construction sites to commuters on buses.

Dan Klein’s Story

Translating the Sound of Silence

September 21, 2022

By: Wallace Ravven

In a private conversation, you want to be heard but not overheard. Talking on a cell phone as the subway rumbles by makes it hard to communicate and just as hard to keep things private. Mental telepathy is not an option. But an emerging strategy may accomplish the same goals. Dan Klein, professor of computer science, is developing a technology to translate silently mouthed speech into audible speech. He describes the strategy and its potential applications.

Q: It sounds a little like science fiction. How can a silent speaker be heard?

A: The basic idea here is that we can track the electrical signals that are generated when different muscles in the face and throat move to make speech sounds. We make use of electromyography or EMG, a procedure that was developed to directly monitor muscle movements and that can work through the skin. The muscle movements can be tracked just as well whether the words are spoken or just mouthed. The electrical impulses from mouth movements are picked up by electrodes on the surface of the face, and we use those signals to reconstruct what your voice would have sounded like if you had actually spoken.


Dr. Dan Klein, Professor in the Department of Electrical Engineering and Computer Sciences (Photo credit: Peg Skorpinski)

So it’s a speech synthesizing system, but without speech. You must need to track the movements of a lot of different muscles.

Yes, there are a lot of articulatory muscles involved in speech: the muscles of the jaw, tongue, lips, vocal folds, larynx, and throat, among others. Our prototype uses 8 electrodes placed around the face to measure their movements.

It seems like accuracy of the reconstructed silent speech would be quite a hurdle.

The more training data the system has, the more accurate it is. So data is one of the big challenges. For our prototype, we recorded my student speaking for around twenty hours, partly voiced speech and partly silent speech, all tracked with EMG. When you listen to the results of successful synthesis, it sounds just like him! It’s very accurate when we limit the range of words. When you can say absolutely anything, the error rate goes up, of course.

How should I picture a silent speaker using the prototype you’ve developed?

At the outset we needed to make sure EMG signals could actually be used to reconstruct speech at all, so there are currently a bunch of electrodes that get taped to your head. We already know we don’t need to target all of those muscles. Ideally, we would develop a headset with fewer electrodes, so it would be more ergonomic. The Bakar Fellows Program Spark Award will help us to do that.

We also need to pre-train the system with hundreds – ideally tens of thousands – of hours of different people speaking in many different contexts. There are so many variations in the way people speak, both in terms of their muscle movements and also the resulting sound of their voice. The more data we provide as pre-training, the better the system will be at adapting quickly to new users.

So a lot depends on subtle differences between one speaker’s voice and another’s.

I think it’s important that we reproduce the real voice of the speaker. You don’t want to have concerns about impersonation. Your voice is different from mine. Those differences come from a lot of different factors, partly how we move and position our articulators, and partly how the acoustics of our vocal tracts vary.

It also doesn’t have to be perfect all the time. The way we envision the technology being used, the speaker and the person on the other end of the conversation would both hear the reconstructed speech in their respective headsets. So if the speaker catches an error, they can mouth the correct word or phrase again.

Where do you see this speechless speech technology being used?

One of the things that has been evolving rapidly is the work environment. It’s hard to control noise and privacy in workplaces outside of the office — at home or in a coffee shop where you have other voices to contend with and also the need for privacy. Even in a more conventional work setting, a lot of offices weren’t built for as many private video conferences as we have today. With this technology, some people could mute themselves, so to speak, so as to allow different people at the same table to have separate, mutually private conversations.

It’s hard to envision that extraneous noise just doesn’t matter.

It’s really a huge change from the world of microphones, where your voice gets combined with all the ambient sound around you. With EMG, all concerns about ambient noise in the input go away. The electrodes don’t pick up any surrounding sound. And because you’re mouthing words, you’re not adding to the noise for others. Workers could have a conversation even on a construction site.

It’s not obvious to me which applications are going to dominate the others, but collectively there are a lot of possible scenarios. All of these possibilities together form the market. When you start on this kind of project, you don’t know exactly where it is going to go. We’ve got some cool questions and we’re starting to get cool answers. We’ll find out what works.