Meta Unveils Voicebox AI: The Future of Speech Generation

Meta, the parent company of Facebook and Instagram, has revealed Voicebox AI, a generative model designed to produce speech from text input. The company presents it as a significant step forward for speech synthesis and voice assistants.

Voicebox AI is a generative model in the same vein as ChatGPT and DALL-E, but it produces spoken language rather than text or images. The system was trained on 50,000 hours of unfiltered audio, together with the matching transcripts, drawn from publicly available audiobooks recorded in English, French, Spanish, German, Polish, and Portuguese.

According to Meta’s researchers, this diverse and extensive dataset enables the system to produce more natural and conversational speech, regardless of the languages spoken in the conversation.

Meta reports that Voicebox outperforms Microsoft’s VALL-E in text-to-speech conversion on both intelligibility, with a word error rate of 1.9% compared to 5.9% for VALL-E (lower is better), and audio similarity, with a score of 0.681 versus 0.580 for VALL-E (higher is better). Voicebox achieves these results while running up to 20 times faster than its counterpart.
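
For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions, and insertions needed to turn a transcript of the generated audio into the intended text, divided by the number of words in the intended text. The Python sketch below is a generic illustration of that calculation, not Meta’s evaluation code; the function name and the simple lowercase, whitespace-based tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration only: real evaluations usually
    apply more careful text normalization before comparing words.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    intended = "the cat sat on the mat"
    transcribed = "the cat sat on a mat"
    print(f"WER: {word_error_rate(intended, transcribed):.1%}")
```

In a text-to-speech evaluation, the hypothesis is typically produced by running a speech recognizer over the synthesized audio, so a lower rate indicates that the generated speech is easier to understand.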

This generative model offers several valuable features, including audio editing capabilities to remove noise and rectify mispronunciations. Users can pinpoint segments of distorted speech caused by external noise, trim them, and instruct the model to rectify those sections.

Voicebox AI’s innovation extends beyond voice assistants and digital conversations. Researchers foresee the technology having applications in areas such as prosthetics for patients with damaged vocal cords, interactive gaming non-player characters (NPCs), and enhancing the capabilities of digital assistants.

Meta’s approach to training this speech synthesis model is described as “Flow Matching.” While the company has released a research paper along with audio examples showcasing the capabilities of Voicebox, neither the program nor its source code is currently available to the public. Meta cites concerns about potential misuse as the reason for withholding access.

This decision is notable given that, in February, Meta released its LLaMA AI language model to the AI research community. Shortly after it was made accessible, however, the model leaked and became available for unauthorized download on various online platforms.

Additionally, Meta has developed the Segment Anything Model (SAM), an AI image segmentation model capable of identifying specific objects in images or videos based on user cues, such as text or cursor indications. Separately, Meta is providing developers with open-source code and a dataset of 180,000 images for the Animated Drawings AI project, which aims to animate everyday drawings.

Meta’s foray into advanced AI models like Voicebox AI highlights the company’s commitment to pushing the boundaries of technology and exploring new horizons in AI-driven solutions. While the withholding of the model’s code reflects concerns about misuse, the potential for innovation and improvement in voice assistants and beyond is unmistakable. As the AI landscape continues to evolve, Meta’s contributions will likely play a significant role in shaping the future of AI-driven applications.
