
Meta has unveiled SAM Audio, a groundbreaking AI model that brings the precision of image segmentation to the realm of sound. Just as the company’s Segment Anything Model transformed how machines handle visual data by letting users isolate any object in photos or videos with simple prompts, SAM Audio now does the same for audio clips. This new tool allows anyone to pull out specific sounds from messy recordings using everyday inputs like text descriptions, clicks on video elements, or highlights on timelines, making audio editing as intuitive as pointing and clicking.
Powering this innovation is the Perception Encoder Audiovisual, or PE-AV, an encoder that processes both sight and sound. Building on Meta’s open-source Perception Encoder released earlier this year, PE-AV acts as the sensory backbone, aligning video frames with audio timestamps to understand what is happening on screen or implied off it. For instance, in a clip of a live music performance, a single tap on the guitar in the video could extract its riff cleanly, while a text prompt like “filter out street noise” might silence honking cars in an outdoor interview.
The model shines in real-world applications, from cleaning up podcasts by removing persistent distractions like a yapping dog to enhancing creative projects with precise sound isolation. Meta is releasing SAM Audio and PE-AV today for public use, complete with detailed research papers. They are also introducing SAM Audio-Bench, the inaugural benchmark for testing audio separation in uncontrolled environments across speech, music, and everyday noises, and SAM Audio Judge, an AI evaluator that scores outputs based on how humans perceive quality, without needing perfect reference tracks.
Users can experiment with these tools right away on the Segment Anything Playground, a platform that also features recent additions like SAM 3 for video segmentation and SAM 3D for spatial analysis. SAM Audio supports three prompting styles: typing a phrase such as “dog barking” to grab that sound, selecting objects or people visually in accompanying footage, or marking time spans in which the target sound occurs so that it can be isolated across the whole segment. Meta says no other system offers span-based prompting at this scale.
Under the hood, SAM Audio employs a diffusion transformer setup that generates clean audio tracks from mixed inputs, trained on a vast dataset blending real recordings with synthetic ones created through smart mixing and labeling techniques. This ensures it handles everything from conversations to symphonies robustly. PE-AV, meanwhile, draws on over 100 million videos, using tools like PyTorchVideo for processing and contrastive learning to match visuals with audio cues, enabling seamless multimodal control.
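To make the contrastive learning idea concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss between paired audio and video embeddings. It is an illustration of the general technique, not Meta's training code: the function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(audio_embs, video_embs, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical helper, not an official API).

    Row i of audio_embs and row i of video_embs are assumed to come from the
    same clip. The loss is low when each audio embedding is most similar to
    its own video embedding, and high otherwise.
    """
    if len(audio_embs) != len(video_embs):
        raise ValueError("need one video embedding per audio embedding")
    a = [l2_normalize(v) for v in audio_embs]
    v = [l2_normalize(x) for x in video_embs]
    n = len(a)
    # logits[i][j] = cosine similarity of audio i and video j, sharpened by temperature
    logits = [
        [sum(p * q for p, q in zip(a[i], v[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-video direction: the correct "class" for row i is column i.
    a2v = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Video-to-audio direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    v2a = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (a2v + v2a)


# Toy data: two clips with 2-D embeddings (illustrative values only).
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_video = [[1.0, 0.0], [0.0, 1.0]]   # each video matches its own audio
swapped_video = [[0.0, 1.0], [1.0, 0.0]]   # pairs deliberately mismatched

aligned_loss = contrastive_loss(audio, aligned_video)
swapped_loss = contrastive_loss(audio, swapped_video)
```

In this toy case the aligned pairs yield a loss near zero and the swapped pairs yield a much larger one, which is the signal that pushes matching audio and video representations together during training.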
Early tests show SAM Audio outperforming rivals in general-purpose separation tasks while matching specialist models in areas like vocals or instruments. It also runs faster than real time, meaning it processes a clip in less time than the clip takes to play, at model sizes ranging from hundreds of millions to billions of parameters. Still, it cannot accept audio clips as prompts, and it cannot fully separate a mix when given no prompt at all. Distinguishing near-identical sources, such as one voice in a choir, also remains difficult.
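“Faster than it plays” is usually expressed as a real-time factor (RTF): processing time divided by audio duration, with values below 1 meaning faster than real time. The short sketch below shows the arithmetic; the function names and the timing numbers are illustrative assumptions, not measurements reported by Meta.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def is_faster_than_real_time(processing_seconds, audio_seconds):
    """True when the clip is processed in less time than it takes to play."""
    return real_time_factor(processing_seconds, audio_seconds) < 1.0


# Hypothetical example: a 10-second clip separated in 5 seconds.
example_rtf = real_time_factor(5.0, 10.0)
example_is_fast = is_faster_than_real_time(5.0, 10.0)
```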
Meta envisions SAM Audio fueling next-generation media apps, from noise reduction to immersive content creation, and is collaborating with Starkey, a major hearing aid maker, and 2gether-International, an accelerator for founders with disabilities, to boost accessibility. By open-sourcing this technology, the company hopes to spark a wave of audio AI innovations that make sound manipulation as simple and powerful as visual editing has become.
