In a move that highlights the growing ethical concerns around advanced AI, Microsoft has developed a remarkably realistic text-to-speech system, VALL-E 2, but has chosen to keep it under wraps due to potential misuse.
Though AI advances are often associated with flashy releases and wide availability, they are increasingly forcing tech giants to tread carefully. Microsoft’s latest innovation, VALL-E 2, is a prime example of this trend. The system can mimic human speech with astonishing accuracy from just a few seconds of audio, marking a significant leap in text-to-speech (TTS) technology.
“VALL-E 2 is the first voice AI to reach human parity in speech robustness, naturalness, and speaker similarity,” the Microsoft researchers proudly declare. This “human parity” means that AI-generated speech is nearly indistinguishable from a real person’s voice.
So, what makes VALL-E 2 so believable?
Two key features contribute to its realism. The first, “Repetition Aware Sampling”, helps the AI avoid the monotonous repetition often found in TTS output by adjusting how repeated words or syllables are sampled during decoding, making the speech flow more naturally. The second, “Grouped Code Modeling”, boosts efficiency by processing the underlying audio codes in groups, which shortens the sequences the model has to handle, speeding up generation and making long, complex audio easier to produce.
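To make the first idea concrete, here is a minimal, hypothetical sketch of what a repetition-aware sampler can look like: an ordinary nucleus (top-p) sampler that falls back to sampling from the full distribution when a candidate token already dominates a recent window of output. The window size, threshold, and fallback rule below are illustrative assumptions for demonstration, not details taken from Microsoft’s paper.

```python
# Illustrative sketch only: one simple way a "repetition aware" sampler could work.
# This is NOT Microsoft's published VALL-E 2 implementation; window, threshold,
# and the fallback strategy are assumptions chosen for clarity.
import numpy as np

def nucleus_sample(probs, top_p=0.9, rng=None):
    """Sample a token id from the smallest set of tokens whose probability mass exceeds top_p."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]              # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

def repetition_aware_sample(probs, history, window=10, max_ratio=0.3, rng=None):
    """Use nucleus sampling, but if the chosen token already dominates the recent
    history, fall back to sampling from the full distribution to break the loop."""
    rng = rng or np.random.default_rng()
    token = nucleus_sample(probs, rng=rng)
    recent = history[-window:]
    if recent and recent.count(token) / len(recent) > max_ratio:
        token = int(rng.choice(len(probs), p=probs))   # random fallback over all tokens
    return token

# Toy usage: decode 20 tokens from a fixed, very peaky distribution.
rng = np.random.default_rng(0)
probs = np.array([0.85, 0.05, 0.04, 0.03, 0.03])
history = []
for _ in range(20):
    history.append(repetition_aware_sample(probs, history, rng=rng))
print(history)
```

With plain nucleus sampling, the dominant token would be picked almost every step; the repetition check forces occasional variety, which is the intuition behind avoiding the droning, looping output the researchers describe.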
Fears of misuse overshadow potential.
Despite its vast potential in education, entertainment, accessibility, and more, Microsoft has opted to keep VALL-E 2 under tight control. The company cites concerns about potential misuse, particularly regarding voice identification spoofing and convincing impersonations.
“VALL-E 2 is purely a research project. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public,” the researchers state, echoing similar restrictions imposed by other AI companies like OpenAI on their voice technology.
Despite these concerns, Microsoft remains optimistic about the future of AI speech technology. The researchers envision safe and ethical applications in which a speaker’s voice is reproduced only with their consent, supported by robust mechanisms for detecting synthesised speech.
This groundbreaking research has been detailed in a pre-print paper, offering a glimpse into the future of AI while raising crucial questions about its responsible development and deployment.
Jaspreet Bindra, Founder of Tech Whisperer, commented: “Microsoft calls this purely a research project and the reason they have not put it out for commercial use is primarily because the product is so good that it can be used for voice spoofing. Any actor or person’s voice can be spoofed, resulting in multiple issues, both legal and ethical. With the upcoming US elections, companies are being very careful in releasing AI products which generate realistic voice or video.”