VALL-E, a neural codec language model, can reproduce a voice from a three-second audio recording

Text-to-speech models usually require significantly longer training samples, whereas VALL-E creates a much more natural-sounding synthetic voice from just a few seconds of recorded speech.

Feb 11, 2025 - 11:37