rankingvilla.blogg.se - Audiobook builder for windows 10

#AUDIOBOOK BUILDER FOR WINDOWS 10 KEYGEN#

Microsoft calls VALL-E a "neural codec language model," and it builds on a technology called EnCodec, which Meta announced in October 2022. Unlike other text-to-speech methods, which typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. It analyzes how a person sounds, breaks that information into discrete components (called "tokens") using EnCodec, and uses training data to match what it "knows" about how that voice would sound if it spoke phrases outside of the three-second sample. Or, as Microsoft puts it in the VALL-E paper:

"To synthesize personalized speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder."

Microsoft trained VALL-E's speech-synthesis capabilities on an audio library, assembled by Meta, called LibriLight. It contains 60,000 hours of English language speech from more than 7,000 speakers, mostly pulled from LibriVox public domain audiobooks. For VALL-E to generate a good result, the voice in the three-second sample must closely match a voice in the training data.

To generate the sample results, the researchers fed only the three-second "Speaker Prompt" sample and a text string (what they wanted the voice to say) into VALL-E. The output can then be judged by comparing the "Ground Truth" sample to the "VALL-E" sample. In some cases, the two samples are very close. Some VALL-E results seem computer-generated, but others could potentially be mistaken for a human's speech, which is the goal of the model.

In addition to preserving a speaker's vocal timbre and emotional tone, VALL-E can also imitate the "acoustic environment" of the sample audio.
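The three stages described above (encode the speaker prompt into discrete tokens, generate new tokens conditioned on the prompt and the phonemes, decode tokens back into a waveform) can be illustrated with a toy sketch. This is not Microsoft's or Meta's code. Everything here is an assumption made for illustration: the function names are hypothetical, simple scalar quantization stands in for EnCodec, and random sampling from the prompt's tokens stands in for the trained language model.

```python
import math
import random

NUM_LEVELS = 256  # assumed codebook size for this toy example


def encode(waveform, levels=NUM_LEVELS):
    """Toy 'codec encoder': map samples in [-1, 1] to integer tokens."""
    tokens = []
    for sample in waveform:
        sample = max(-1.0, min(1.0, sample))  # clip out-of-range samples
        token = int((sample + 1.0) / 2.0 * levels)
        tokens.append(min(levels - 1, token))
    return tokens


def decode(tokens, levels=NUM_LEVELS):
    """Toy 'codec decoder': map each token to the centre of its bin."""
    return [(t + 0.5) / levels * 2.0 - 1.0 for t in tokens]


def generate_acoustic_tokens(phonemes, prompt_tokens,
                             tokens_per_phoneme=4, seed=0):
    """Placeholder for the language model (hypothetical, untrained).

    The phonemes decide how many tokens are produced and which ones are
    picked (content); the picks come only from the prompt's own tokens
    (speaker). A real model would learn this mapping from data.
    """
    generated = []
    for position, phoneme in enumerate(phonemes):
        rng = random.Random(f"{seed}:{position}:{phoneme}")
        for _ in range(tokens_per_phoneme):
            generated.append(rng.choice(prompt_tokens))
    return generated


def synthesize(phonemes, prompt_waveform):
    """Run the three stages: encode prompt, generate tokens, decode."""
    prompt_tokens = encode(prompt_waveform)
    new_tokens = generate_acoustic_tokens(phonemes, prompt_tokens)
    return decode(new_tokens)


# A synthetic three-second "speaker prompt": a 220 Hz tone at 8 kHz.
prompt = [0.5 * math.sin(2 * math.pi * 220 * n / 8000)
          for n in range(3 * 8000)]
speech = synthesize(["HH", "AH", "L", "OW"], prompt)
```

The output of this sketch is noise, not speech. Its only purpose is to show the data flow: the waveform is never edited directly, and all generation happens in the space of discrete tokens.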

VALL-E's creators speculate that it could be used for high-quality text-to-speech applications, for speech editing, where a recording of a person could be edited and changed from a text transcript (making them say something they originally didn't), and for audio content creation when combined with other generative AI models like GPT-3.