Big Brother is listening. Companies use “bossware” to monitor their employees whenever they are near their computers. Several “spyware” apps can record phone calls. And home devices such as Amazon’s Echo can record everyday conversations. A new technology, called Neural Voice Camouflage, now offers a defense. It generates custom audio noise in the background while you speak, confusing the artificial intelligence (AI) that transcribes recorded voices.
The new system uses an “adversarial attack.” The strategy employs machine learning, in which algorithms find patterns in data, to alter sounds in such a way that an AI, but not a person, mistakes them for something else. Essentially, you use one AI to trick another.
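The core idea of an adversarial attack is to nudge an input in the direction that most increases a model's error, while keeping the change small enough that a person would not notice. A minimal sketch of that idea, using a toy linear classifier and a fast-gradient-sign-style step (this is a generic illustration with made-up numbers, not the method from the study):

```python
# Toy adversarial attack: a tiny change to the input flips the model's
# decision. The "classifier" here is a hypothetical linear scorer.
import numpy as np

w = np.array([1.0, -2.0, 0.5])        # weights of a toy linear classifier

def score(x):
    # score > 0 means the model assigns class A, otherwise class B
    return float(w @ x)

x = np.array([0.6, 0.1, 0.4])          # clean input; score(x) = 0.6, class A

# The gradient of the score with respect to x is just w, so stepping
# against its sign pushes the score toward the other class.
eps = 0.3                              # small perturbation budget
x_adv = x - eps * np.sign(w)           # adversarially perturbed input

print(score(x), score(x_adv))          # 0.6 vs. -0.45: the decision flips
```

The perturbation changes each coordinate by at most 0.3, yet the classification flips; audio adversarial attacks apply the same principle to waveforms, where the perturbation registers to listeners as faint background noise.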
The process is not as simple as it sounds, however. The machine-learning AI must process an entire sound clip before it knows how to tweak it, which doesn’t work when you want to camouflage speech in real time.
So in the new study, the researchers taught a neural network, a brain-inspired machine-learning system, to effectively predict the future. They trained it on many hours of recorded speech so it could continuously process 2-second audio clips and disguise whatever is likely to be said next.
For example, if someone has just said “enjoy the big party,” the system cannot predict exactly what will come next. But by taking into account what has just been said, as well as the characteristics of the speaker’s voice, it produces sounds that will disrupt a range of possible sentences that could follow, including what actually came next; here, the same speaker saying, “it’s cooking.” To human listeners, the audio camouflage sounds like background noise, and they have no trouble understanding the spoken words. But machines stumble.
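The real-time constraint described above can be sketched as a streaming loop: the defense cannot perturb audio that has already been played, so it must generate noise for the next chunk based only on the 2 seconds it has just heard. In this sketch the predictor is a placeholder (the actual study uses a trained neural network; the function and constants below are illustrative assumptions):

```python
# Streaming sketch of predictive camouflage: noise for the upcoming chunk
# is generated from the 2-second context that precedes it.
import numpy as np

SR = 16_000                  # assumed sample rate in Hz
CONTEXT = 2 * SR             # 2-second context window, as in the study

def predict_noise(context: np.ndarray, chunk_len: int) -> np.ndarray:
    # Placeholder predictor: the real system outputs noise tailored to
    # whatever speech is likely to follow this context.
    rng = np.random.default_rng(0)
    return 0.01 * rng.standard_normal(chunk_len)

def stream(audio: np.ndarray, chunk_len: int = SR // 10) -> np.ndarray:
    out = np.copy(audio)
    for start in range(CONTEXT, len(audio), chunk_len):
        context = audio[start - CONTEXT:start]            # what was just said
        noise = predict_noise(context, min(chunk_len, len(audio) - start))
        out[start:start + len(noise)] += noise            # mask what comes next
    return out

camouflaged = stream(np.zeros(3 * SR))   # 3 seconds of (silent) audio
print(camouflaged.shape)                  # (48000,)
```

The key point the loop captures is that each chunk of noise is committed before the speech it masks is spoken, which is why the network has to anticipate rather than react.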
The scientists overlaid the output of their system onto recorded speech as it was fed directly into one of the automatic speech recognition (ASR) systems that eavesdroppers might use for transcription. Their system increased the ASR software’s word error rate from 11.3% to 80.2%. “I almost starve myself, for conquering kingdoms is hard work,” for example, was transcribed as “im mearly starme my seal for threa for this conqernd kindoms as harenar ov the reson.”
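Word error rate, the metric used throughout these results, is the word-level edit distance between a reference transcript and the ASR output, divided by the number of words in the reference. A minimal sketch of the standard computation (not code from the study):

```python
# Word error rate (WER): minimum number of word insertions, deletions,
# and substitutions needed to turn the ASR output into the reference,
# divided by the reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("enjoy the big party", "enjoy the pig party"))  # 0.25
```

One wrong word out of four gives a WER of 25%; an 80.2% rate means roughly four out of five words in the transcript are wrong.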
By comparison, the error rates for speech disguised by white noise and by a concurrent adversarial attack (which, lacking predictive capability, could only mask what it had just heard with noise played half a second too late) were only 12.8% and 20.5%, respectively. The work was presented in a paper last month at the International Conference on Learning Representations, which peer-reviews manuscript submissions.
Even when the ASR system was trained to transcribe speech perturbed by Neural Voice Camouflage (a technique eavesdroppers could conceivably use), its error rate remained 52.5%. In general, the hardest words to disrupt were short ones, such as “the,” but those are the least revealing parts of a conversation.
The researchers also tested the method in the real world, playing a voice recording combined with the camouflage through a set of speakers in the same room as a microphone. It still worked. For example, “I also just got a new monitor” was transcribed as “with reasons with them also toscat and neumanitor.”
This is just the first step in protecting privacy in the face of AI, says Mia Chiquier, a computer scientist at Columbia University who led the research. “Artificial intelligence collects data about our voices, our faces and our actions. We need a new generation of technology that respects our privacy.”
Chiquier adds that the predictive part of the system has great potential for other applications requiring real-time processing, such as autonomous vehicles. “You have to anticipate where the car will be next, where the pedestrian might be,” she says. Brains also work by anticipation; you are surprised when your brain predicts something incorrectly. In that regard, says Chiquier, “We mimic the way humans do things.”
“There’s something neat about the way it combines predicting the future, a classic machine learning problem, with this other adversarial machine learning problem,” says Andrew Owens, a computer scientist at the University of Michigan, Ann Arbor, who studies audio and visual camouflage and was not involved in the work. Bo Li, a computer scientist at the University of Illinois, Urbana-Champaign, who has worked on audio adversarial attacks, was impressed that the new approach held up even against the fortified ASR system.
Audio camouflage is sorely needed, says Jay Stanley, senior policy analyst at the American Civil Liberties Union. “We are all susceptible to having our innocent speech misinterpreted by security algorithms.” Maintaining privacy is hard work, he says. Or rather, it is harenar ov the reson.