Ai2 Releases Molmo 2: Compact AI Models Master Video Understanding

    Video is becoming a primary data source in fields such as autonomous vehicles, robotics and scientific monitoring, driving demand for artificial intelligence that can interpret dynamic scenes. A nonprofit research group has unveiled a new set of AI models designed to improve understanding of videos and multiple images, building on last year’s breakthrough in image analysis.

    The Allen Institute for AI, known as Ai2, launched Molmo in 2024 as a collection of freely available tools that excelled in interpreting still photos and introduced a method for precisely identifying elements within them. That release drew millions of downloads and inspired widespread use in academia and business, while its accompanying high-quality annotation set, PixMo, became a standard for training efficient visual systems.

    The latest version, Molmo 2, extends those capabilities to moving footage and sequences of pictures, aiming to set benchmarks in temporal and spatial awareness. The models come in three configurations to suit different requirements: an 8-billion-parameter edition, built on the Qwen 3 language model, delivers top performance in locating objects in videos and answering related questions; a 4-billion-parameter counterpart, also built on Qwen 3, prioritizes speed and resource savings; and a 7-billion-parameter open-source variant relies on Ai2’s own Olmo foundation model, granting users complete access to its components from visual processing to text generation.

    Despite their compact size, these models surpass larger predecessors and rivals. The 8-billion-parameter model exceeds the original 72-billion-parameter Molmo on tests for pinpointing and reasoning about image details, and it beats Google’s Gemini 3 Pro and other public options on video tracking tasks. Notably, it achieved these results with far less training material, using under 10 million video segments compared to the 70 million-plus employed by a similar effort from Meta Platforms Inc.

    In evaluations spanning image questions, short-video analysis, object counting, object tracking and human preference ratings, Molmo 2 tops or matches leading public models and holds its own against bigger commercial ones from companies like OpenAI and Anthropic. It excels particularly at tracking entities through videos, outpacing specialized tools and even some closed systems, and it leads public entries in image reasoning across 11 measures, coming close to elite proprietary APIs.

    Human evaluators favored Molmo 2’s responses over many alternatives, including certain paid services, and it shone in short-video comprehension on datasets like NextQA and Video-MME. For counting and localization in footage, it provides verifiable points and times rather than vague estimates, though accuracy remains below 40 percent, signaling opportunities for further refinement. On extended clips, it performs solidly but trails the largest commercial models slightly.

    Compared to earlier Molmo releases, the new iteration shows broad improvements, especially in localization and counting tasks, though the larger original model holds a slim edge in one informational query test.

    Molmo 2 processes single photos, grouped images or clips of any duration, evolving image-specific localization into full video awareness. It can respond to inquiries like tallying a robot’s interactions with an object by marking each occurrence with position and timing, or spotting when an item drops by highlighting the exact moment and spot. In sports footage, it identifies scorers and their positions across frames, while maintaining consistent labels for subjects amid interruptions.
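
    The grounded answers described above, where each occurrence is marked with a position and a time and each subject keeps a stable label, can be pictured as a small data structure. The following Python sketch is illustrative only: `GroundedPoint`, `track_id`, `time_s` and the normalized `x`/`y` coordinates are hypothetical names chosen for this example, not Molmo 2's actual output format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedPoint:
    """One marked occurrence (hypothetical schema, not Molmo 2's real format)."""
    track_id: int   # label that stays the same for one subject across frames
    time_s: float   # timestamp of the occurrence, in seconds
    x: float        # horizontal position, normalized to the range 0..1
    y: float        # vertical position, normalized to the range 0..1


def count_events(points):
    """A grounded count is the number of marked occurrences."""
    return len(points)


def count_subjects(points):
    """Distinct subjects are distinct track labels, however often each appears."""
    return len({p.track_id for p in points})


def group_by_track(points):
    """Collect each subject's points in time order, keyed by its label."""
    tracks = defaultdict(list)
    for p in points:
        tracks[p.track_id].append(p)
    for seq in tracks.values():
        seq.sort(key=lambda p: p.time_s)
    return dict(tracks)


# Made-up example: two subjects, one of them marked twice.
points = [
    GroundedPoint(track_id=1, time_s=4.0, x=0.30, y=0.55),
    GroundedPoint(track_id=2, time_s=1.5, x=0.70, y=0.40),
    GroundedPoint(track_id=1, time_s=0.5, x=0.25, y=0.50),
]
print(count_events(points), count_subjects(points))
```

    Because every counted item carries its own timestamp and coordinates, a reader can check each one against the footage instead of trusting a bare number.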

    These features enable applications such as detailed event narration, flaw detection in synthetic videos, anomaly spotting and queries blending visuals with on-screen text. For instance, it can outline every muscle flex by a specific individual in a sequence, assigning unique identifiers to prevent overlaps, or resolve descriptions like finding a gadget in someone’s hand or an animal affecting a balance.

    At inference time, the system handles long videos efficiently by spending high detail on pivotal moments and lighter processing elsewhere, preserving precision at a reduced computational cost. Its design combines a visual encoder for frames, a core language model and a simple connector that weaves in time and spatial data for unified reasoning.
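
    The idea of spending more detail on pivotal moments and less elsewhere can be sketched as a per-frame budget. The article does not say how Molmo 2 selects those moments or how much detail each frame receives, so everything below is an assumption made for illustration: the `importance` scores, the `high_tokens` and `low_tokens` values and the `top_fraction` cutoff are all hypothetical.

```python
def allocate_detail(importance, high_tokens=256, low_tokens=32, top_fraction=0.25):
    """Give the highest-scoring frames a large budget and the rest a small one.

    `importance` holds one score per frame. How such scores would be
    computed is not described in the article; they are assumed as input.
    """
    n = len(importance)
    if n == 0:
        return []
    # Always keep at least one frame at high detail.
    k = max(1, round(n * top_fraction))
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    top = set(ranked[:k])
    return [high_tokens if i in top else low_tokens for i in range(n)]


def total_cost(budgets):
    """Total budget spent across all frames."""
    return sum(budgets)


# Made-up scores for four frames; the second and fourth are "pivotal".
budgets = allocate_detail([0.1, 0.9, 0.2, 0.8], top_fraction=0.5)
uniform = 4 * 256  # cost if every frame were processed at high detail
print(budgets, total_cost(budgets), uniform)
```

    In this toy case the mixed budget costs 576 units against 1,024 for uniform high detail, which is the kind of saving the description points to.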

    Development involved dual phases: initial alignment via combined captioning and pointing on images with text integration, followed by refinement on a blend of visual, video and linguistic tasks. This drew from a freshly curated archive exceeding 9 million entries, including novel sets for intricate descriptions, question-answering and precise tracking.

    A key innovation was a detailed captioning process where people verbally describe clips, yielding transcripts augmented by the prior model for nuance, resulting in verbose annotations far richer than usual. Additional data covered segmented long videos, synthetic grounding examples and adapted scholarly resources, plus new query collections totaling over 1.5 million instances. Ai2 is sharing annotations for more than 100,000 distinct videos to support community efforts.

    Aimed at scholarly and teaching purposes under Ai2’s ethical guidelines, Molmo 2 launches with the main variants, tuned editions for specific tasks, datasets, assessment tools and sample implementations. Users can experiment immediately in Ai2’s online demo environment with uploads for analysis, viewing the model’s focus points in real time. An application programming interface and open training scripts are forthcoming.

    The models operate under the Apache 2.0 license but incorporate external data restricted to non-commercial research. For access, visit the Hugging Face collection or the Ai2 Playground.

