SoundHound AI, already a major player in voice assistants, is now giving its technology a pair of eyes.
Imagine driving past a landmark and, without pulling out your phone, asking your car, "What's that building over there?" and getting an instant answer. That's what SoundHound AI is building.
With the launch of Vision AI, SoundHound's new system combines sight with sound to create a much smarter and more natural way to interact with technology. The idea is to mimic how we as humans operate; we don't just listen to someone, we also see their gestures and what they're looking at.
By bringing this same contextual understanding to AI, SoundHound hopes to smooth over the clunky and often frustrating experience we have with many of today's smart devices. The company is targeting real-world applications where this combined sense could make a huge difference, whether that's in your next car, at the restaurant drive-thru, or on a factory floor.
Keyvan Mohajer, CEO of SoundHound AI, said: "At SoundHound, we believe the future of AI isn't just multimodal; it's deeply integrated, responsive, and built for real-world impact.
"With Vision AI, we're extending our leadership in voice and conversational AI to redefine how humans interact with products and services offered and used by businesses."
So, how does it work? Vision AI takes a live feed from a camera and fuses it with the company's voice technology, which already excels at understanding natural speech. By processing what it sees and what it hears at the same time, the system can grasp the user's true intent in a way a simple voice assistant never could.
Think of a mechanic wearing smart glasses who can simply look at an engine part and ask for instructions, receiving instant visual and audio guidance without ever putting down their tools. In a shop, a staff member could scan shelves just by looking at them to get a real-time inventory count. For the rest of us, it might mean a drive-thru kiosk that visually confirms our order on screen the moment we say it.
One of the biggest technical problems in creating such a system is ensuring the audio and visual elements are perfectly synchronised. Any lag would shatter the illusion of a natural conversation.
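SoundHound has not published how Vision AI keeps the two streams aligned, so the following is only a minimal sketch of one common approach: stamping frames and utterances against a shared clock and pairing each utterance with the nearest frame. The `Frame` and `Utterance` types, the `nearest_frame` helper, and the 0.2-second `max_skew` tolerance are all hypothetical names and values chosen for illustration.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    timestamp: float  # seconds on a clock shared with the audio stream (assumed)
    label: str        # hypothetical output of a vision model


@dataclass
class Utterance:
    timestamp: float  # seconds on the same shared clock
    text: str


def nearest_frame(
    frames: List[Frame], utterance: Utterance, max_skew: float = 0.2
) -> Optional[Frame]:
    """Return the frame closest in time to the utterance.

    `frames` must be sorted by timestamp. Returns None when there are no
    frames or when the closest one is more than `max_skew` seconds away.
    """
    if not frames:
        return None
    times = [f.timestamp for f in frames]
    i = bisect_left(times, utterance.timestamp)
    # Only the frames just before and just after the utterance can be closest.
    candidates = frames[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda f: abs(f.timestamp - utterance.timestamp))
    if abs(best.timestamp - utterance.timestamp) > max_skew:
        return None
    return best


frames = [Frame(0.0, "road"), Frame(0.5, "clock tower"), Frame(1.0, "bridge")]
question = Utterance(0.55, "What's that building over there?")
match = nearest_frame(frames, question)
print(match.label if match else "no frame close enough")
```

Rejecting pairs that fall outside the tolerance, instead of always taking the closest frame, is one way to avoid answering a question about a scene the camera has already moved past.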
Pranav Singh, VP of Engineering at SoundHound AI, commented: "With Vision AI, we are fusing visual recognition and conversational intelligence into a single, synchronised flow. Every frame, every utterance, every intent is interpreted within the same ecosystem, ensuring faster, more natural user experiences that scale across surfaces from kiosks to embedded devices.
"This is innovation at the intersection of intelligence and execution, delivering AI that sees what you see, hears what you say, and responds in the moment."
For the businesses adopting this tech, the promise is faster service, fewer mistakes, and happier customers. It's about removing friction and making technology feel less like a tool you have to operate and more like a partner that helps you get things done.
This new visual capability isn't the only upgrade SoundHound is rolling out. The company also recently improved the "brain" of its system with a new update, Amelia 7.1. This enhancement makes its AI agents faster and more accurate, and gives businesses more control and transparency over how they work.
By combining sight and sound, SoundHound is aiming to push us closer to a world where interacting with AI feels as easy and intuitive as talking to another person.
(Photo by Christian Lue)