A team at Tencent’s Hunyuan lab has created a new AI, ‘Hunyuan Video-Foley,’ that finally brings lifelike audio to generated video. It’s designed to listen to videos and generate a high-quality soundtrack that’s perfectly in sync with the action on screen.
Ever watched an AI-generated video and felt like something was missing? The visuals might be stunning, but they often have an eerie silence that breaks the spell. In the film industry, the sound that fills that silence – the rustle of leaves, the clap of thunder, the clink of a glass – is called Foley art, and it’s a painstaking craft performed by experts.
Matching that level of detail is a huge challenge for AI. For years, automated systems have struggled to create believable sounds for videos.
How is Tencent solving the problem of AI-generated audio for video?
One of the biggest reasons video-to-audio (V2A) models often fell short in the sound department was what the researchers call “modality imbalance”. Essentially, the AI was listening more to the text prompts it was given than it was watching the actual video.
For instance, if you gave a model a video of a busy beach with people walking and seagulls flying, but the text prompt only said “the sound of ocean waves,” you’d likely just get the sound of waves. The AI would completely ignore the footsteps in the sand and the calls of the birds, making the scene feel lifeless.
On top of that, the quality of the audio was often subpar, and there simply wasn’t enough high-quality video with sound to train the models effectively.
Tencent’s Hunyuan team tackled these problems from three different angles:
- Tencent realised the AI needed a better education, so they built a massive, 100,000-hour library of video, audio, and text descriptions for it to learn from. They created an automated pipeline that filtered out low-quality content from the internet – getting rid of clips with long silences or compressed, fuzzy audio – ensuring the AI learned from the best possible material (a rough sketch of this kind of filter follows the list).
- They designed a smarter architecture for the AI. Think of it like teaching the model to properly multitask. The system first pays close attention to the visual-audio link to get the timing just right – like matching the thump of a footstep to the exact moment a shoe hits the pavement. Once it has that timing locked down, it then incorporates the text prompt to understand the overall mood and context of the scene. This dual approach ensures the specific details of the video are never overlooked (see the architecture sketch below).
- To guarantee the sound was high-quality, they used a training strategy called Representation Alignment (REPA). It's like having an expert audio engineer constantly looking over the AI's shoulder during training, comparing the AI's work to features from a pre-trained, professional-grade audio model and guiding it towards cleaner, richer, and more stable sound (see the REPA sketch below).
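The article doesn't spell out the exact filtering rules, but the silence and compression checks described in the first bullet map onto standard signal measures. Here's a minimal Python sketch, assuming RMS energy as a silence test and spectral rolloff as a crude proxy for bandwidth lost to heavy compression; the thresholds and function name are illustrative, not the actual Hunyuan pipeline.

```python
# Illustrative audio-quality filter (not Tencent's pipeline): reject clips that
# are mostly silent or that look heavily band-limited by lossy compression.
import numpy as np
import librosa

def passes_quality_filter(audio_path: str,
                          max_silence_ratio: float = 0.5,
                          min_bandwidth_hz: float = 8000.0) -> bool:
    y, sr = librosa.load(audio_path, sr=None, mono=True)

    # Silence check: fraction of frames whose RMS energy is near zero.
    rms = librosa.feature.rms(y=y)[0]
    if float(np.mean(rms < 0.01)) > max_silence_ratio:
        return False

    # Compression check: a spectral rolloff far below Nyquist suggests the
    # high frequencies were thrown away by a low-bitrate encoder.
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.99)
    if float(np.median(rolloff)) < min_bandwidth_hz:
        return False

    return True
```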
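The second bullet describes an ordering rather than concrete layers, so the block below is only a rough PyTorch sketch of the idea: the audio stream cross-attends to per-frame visual features first (to nail timing), and only then to the text prompt (for mood and context). The module names, dimensions, and residual layout are assumptions, not the published Hunyuan Video-Foley architecture.

```python
# Sketch of "visuals first, text second" conditioning for an audio stream.
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio, visual, text):
        # 1) Align audio frames with video frames so timing comes from the pixels.
        attended, _ = self.visual_attn(self.norm1(audio), visual, visual)
        audio = audio + attended
        # 2) Fold in the text prompt for the overall mood and context.
        attended, _ = self.text_attn(self.norm2(audio), text, text)
        audio = audio + attended
        return audio + self.ff(self.norm3(audio))

# Audio latents, per-frame video features, and text tokens: (batch, length, dim).
block = DualStreamBlock()
out = block(torch.randn(2, 250, 512), torch.randn(2, 60, 512), torch.randn(2, 16, 512))
```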
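Representation Alignment (REPA) is an existing training technique, and the "audio engineer looking over the shoulder" is, roughly, an alignment loss against a frozen, pre-trained encoder. The sketch below shows one common form of that loss, assuming a small projection head and a negative cosine-similarity objective; the projector shape, the teacher encoder, and the loss weighting are illustrative assumptions, not Tencent's exact recipe.

```python
# Sketch of a REPA-style alignment loss: pull the generator's intermediate
# features towards those of a frozen, pre-trained audio encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepaLoss(nn.Module):
    def __init__(self, model_dim: int = 512, teacher_dim: int = 768):
        super().__init__()
        # Small MLP mapping the generator's hidden states into the teacher's space.
        self.proj = nn.Sequential(
            nn.Linear(model_dim, teacher_dim), nn.SiLU(), nn.Linear(teacher_dim, teacher_dim)
        )

    def forward(self, hidden: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
        # hidden:        (batch, frames, model_dim) from an intermediate generator layer
        # teacher_feats: (batch, frames, teacher_dim) from the frozen audio encoder
        pred = F.normalize(self.proj(hidden), dim=-1)
        target = F.normalize(teacher_feats.detach(), dim=-1)
        # Negative cosine similarity: the better the alignment, the lower the loss.
        return -(pred * target).sum(dim=-1).mean()

# In training this would typically be added to the usual generation loss with a
# small weight, e.g. total = diffusion_loss + 0.5 * repa(hidden, teacher_feats).
```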
The results speak – and sound – for themselves
When Tencent tested Hunyuan Video-Foley against other leading AI models, the results were clear. It wasn't just that the objective metrics were better; human listeners consistently rated its output as higher quality, better matched to the video, and more accurately timed.
Across the board, the AI delivered improvements in making the sound match the on-screen action, both in content and timing, and those gains held up across multiple evaluation datasets.
Tencent’s work helps to close the gap between silent AI videos and an immersive viewing experience with quality audio. It’s bringing the magic of Foley art to the world of automated content creation, which could be a powerful capability for filmmakers, animators, and creators everywhere.