Amazon introduces Nova Sonic, a new foundation model for voice AI

The latest model processes voice input directly and produces speech output, unlike traditional systems that separately handle speech recognition, text-based processing, and speech synthesis.

Apr 9, 2025 - 08:55

Tech giant Amazon has rolled out Nova Sonic, its latest voice-focused artificial intelligence (AI) model, designed to support natural-sounding conversations in AI applications.

The new foundation model processes voice input directly and produces speech output, unlike traditional systems that handle speech recognition, text-based processing, and speech synthesis as separate steps.

“Traditional approaches in building voice-enabled applications involve complex orchestration of multiple models, such as speech recognition to convert speech to text, large language models (LLMs) to understand and generate responses, and text-to-speech to convert text back to audio,” the company said in a blog post.

According to Amazon, Nova Sonic integrates these functions into a single foundation model, which allows it to match generated speech to the style and tone of the conversation, handling nuances such as natural pauses, hesitations, and interruptions. 
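The contrast between the two designs can be illustrated with a toy sketch. This is not Nova Sonic's API or Amazon's implementation: every name below (`Audio`, `recognize_speech`, `generate_reply`, `synthesize_speech`, `speech_to_speech_turn`) is a hypothetical stand-in, and carrying a `tone` label through is only a simplified way to show why a text-only hand-off between separate models can lose information that a single speech-to-speech model could retain.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    """Hypothetical stand-in for an audio clip: what was said and how."""
    transcript: str
    tone: str = "neutral"


# Cascaded pipeline: three separate (stubbed) models, with text as the
# only hand-off between them.
def recognize_speech(audio: Audio) -> str:
    # Speech-to-text keeps the words; the tone is dropped at this step.
    return audio.transcript


def generate_reply(text: str) -> str:
    # Placeholder for an LLM call.
    return f"Reply to: {text}"


def synthesize_speech(text: str) -> Audio:
    # Text-to-speech only sees text, so it falls back to the default tone.
    return Audio(transcript=text)


def cascaded_turn(audio: Audio) -> Audio:
    return synthesize_speech(generate_reply(recognize_speech(audio)))


# Unified model: one (stubbed) call, audio in and audio out, so
# conversational cues remain available when the reply is produced.
def speech_to_speech_turn(audio: Audio) -> Audio:
    return Audio(transcript=f"Reply to: {audio.transcript}", tone=audio.tone)


if __name__ == "__main__":
    user = Audio("Where is my order?", tone="frustrated")
    print(cascaded_turn(user))
    print(speech_to_speech_turn(user))
```

In this sketch both paths produce the same words, but only the single-call path still has the speaker's tone available when it generates the reply.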

The model can be used to develop voice-based applications, including automated customer service calls and AI agents across various industries such as travel, education, healthcare, and entertainment.

The launch comes as several AI giants develop advanced voice AI models, including OpenAI's GPT-4o, which powers ChatGPT's Voice Mode, Google's Gemini, and Meta's voice assistants. These models are raising user expectations for natural, conversational interactions compared with legacy assistants such as Alexa and Siri.

The company also claims that Nova Sonic is among the most cost-efficient voice AI models available, costing approximately 80% less than OpenAI’s GPT-4o.

Amazon also introduced Nova Reel 1.1, a video generation model that produces short videos from text descriptions and optional reference images. The tool enables users to create videos for marketing campaigns, product design, and social media content.