Yes, there are a lot of articulatory muscles involved in speech — the muscles of the jaw, tongue, lips, larynx (including the vocal folds), and throat, among others. Our prototype uses eight electrodes placed around the face to measure their movements.
It seems like the accuracy of the reconstructed silent speech would be quite a hurdle.
The more training data the system has, the more accurate it is. So data is one of the big challenges. For our prototype, we recorded my student speaking for around twenty hours — partly voiced speech and partly silent speech — tracked with EMG. When you listen to the results of a successful synthesis, it sounds just like him! It’s very accurate when we limit the range of words. When you can say absolutely anything, the error rate goes up, of course.
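[The interview doesn’t go into implementation detail, but the core idea described here — time-aligned EMG and audio recordings used to learn a mapping from muscle signals to speech — can be sketched very roughly. The framing parameters, RMS features, and the ridge-regression model below are illustrative assumptions for a toy example, not the group’s actual architecture.]

```python
import numpy as np

def frame_rms(x, frame_len, hop):
    """Per-frame root-mean-square energy of a 1-D signal."""
    n = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])
    return np.sqrt((frames ** 2).mean(axis=1))

rng = np.random.default_rng(0)
n_samples, frame_len, hop = 16000, 400, 160
emg = rng.standard_normal((8, n_samples))   # stand-in for 8 electrode channels
audio = rng.standard_normal(n_samples)      # stand-in for the time-aligned voiced recording

# Features: one RMS value per EMG channel per frame; target: audio RMS per frame.
X = np.stack([frame_rms(ch, frame_len, hop) for ch in emg], axis=1)
y = frame_rms(audio, frame_len, hop)

# Fit a linear map from EMG features to the speech feature
# (ridge regression, closed form: w = (X^T X + lam*I)^-1 X^T y).
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
pred = X @ w
print(X.shape, y.shape, pred.shape)
```

In a real system the linear map would be replaced by a neural network predicting full acoustic features (and then a vocoder), but the data layout — parallel EMG frames and speech frames — is the same.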
How should I picture a silent speaker using the prototype you’ve developed?
At the outset we needed to make sure EMG signals could actually be used to reconstruct speech at all, so there are currently a bunch of electrodes that get taped to your head. We already know we don’t need to target all of those muscles. Ideally, we would develop a headset with fewer electrodes, so it would be more ergonomic. The Bakar Fellows Program Spark Award will help us to do that.
We also need to pre-train the system with hundreds – ideally tens of thousands – of hours of different people speaking in many different contexts. There are so many variations in the way people speak, both in terms of their muscle movements and also the resulting sound of their voice. The more data we provide as pre-training, the better the system will be at adapting quickly to new users.
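[The pre-train-then-adapt idea can be sketched with a toy linear model — the real system would be a neural network, and every number below is an assumption made for illustration: fit shared weights on pooled data from many slightly different "speakers", then take a few gradient steps on a new user's small sample.]

```python
import numpy as np

rng = np.random.default_rng(1)

def make_speaker_data(n, true_w, noise=0.1):
    """Toy stand-in for one speaker's paired EMG-feature / audio-feature frames."""
    X = rng.standard_normal((n, 8))
    y = X @ true_w + noise * rng.standard_normal(n)
    return X, y

base_w = rng.standard_normal(8)

# "Pre-training": pool lots of data from many speakers who all differ slightly.
pool_X, pool_y = [], []
for _ in range(50):
    Xs, ys = make_speaker_data(200, base_w + 0.1 * rng.standard_normal(8))
    pool_X.append(Xs)
    pool_y.append(ys)
X, y = np.vstack(pool_X), np.concatenate(pool_y)
w = np.linalg.lstsq(X, y, rcond=None)[0]    # shared starting point

# "Adaptation": a few gradient steps on a new user's small recording.
Xn, yn = make_speaker_data(40, base_w + 0.1 * rng.standard_normal(8))
lr = 0.01
before = np.mean((Xn @ w - yn) ** 2)
for _ in range(100):
    grad = 2 * Xn.T @ (Xn @ w - yn) / len(yn)
    w -= lr * grad
after = np.mean((Xn @ w - yn) ** 2)
print(before, after)
```

The point of the sketch is only the shape of the workflow: the pooled fit captures what speakers have in common, so the per-user fine-tuning needs far less data than training from scratch.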
So a lot depends on subtle differences between one speaker’s voice and another’s.
I think it’s important that we reproduce the real voice of the speaker. You don’t want to have concerns about impersonation. Your voice is different from mine. Those differences come from a lot of different factors, partly how we move and position our articulators, and partly how the acoustics of our vocal tracts vary.
It also doesn’t have to be perfect all the time. The way we envision the technology being used, the speaker and the person on the other end of the conversation would both hear the reconstructed speech in their respective headsets. So if the speaker catches an error, they can mouth the correct word or phrase again.
Where do you see this speechless speech technology being used?
One of the things that has been evolving rapidly is the work environment. It’s hard to control noise and privacy in workplaces outside of the office — at home, or in a coffee shop where you have other voices to contend with as well as the need for privacy. Even in a more conventional work setting, a lot of offices weren’t built for as many private video conferences as we have today. With this technology, some people could mute themselves, so to speak, allowing different people at the same table to have separate, mutually private conversations.
It’s hard to imagine that extraneous noise simply doesn’t matter.
It’s really a huge change from the world of microphones, where your voice gets combined with all the ambient sound around you. With EMG, all concerns about ambient noise go away — the electrodes don’t pick up any surrounding sound. And because you’re mouthing words, you’re not adding to the noise for others. Workers could have a conversation even on a construction site.
It’s not obvious to me which applications are going to dominate the others, but collectively there are a lot of possible scenarios. All of these possibilities together form the market. When you start on this kind of project, you don’t know exactly where it is going to go. We’ve got some cool questions and we’re starting to get cool answers. We’ll find out what works.