What Is MolmoMotion? Ai2's Open 3D Motion Model (July 2026)
On July 1, 2026, the Allen Institute for AI (Ai2) released MolmoMotion — an open vision-language model that takes a few seconds of video and predicts where the objects in that video will move next in 3D space. It’s a small release compared to the same-week Claude Sonnet 5 and GPT-5.6 Sol headlines, but it’s the most important open contribution to a rapidly commercializing field: language-conditioned 3D motion forecasting.
Last verified: July 2, 2026
What MolmoMotion actually does
Feed MolmoMotion:
- A short video (a few seconds)
- Optionally: a language prompt (“the person in the red jersey is about to jump”)
It outputs:
- 3D trajectories for objects in the scene — where they’ll be in the next N seconds
- Uncertainty estimates for each prediction
- Optional language grounding — which object the model thinks the prompt refers to
It’s not a video generator. It’s a motion prior — a compact structured output that other systems (robot planners, animation tools, sports-analytics dashboards) can consume.
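To make "compact structured output" concrete, here is a minimal sketch of how a downstream tool might hold such a prediction and answer the question "where will this object be at time t?". The class names, fields, units (meters, seconds), and the `position_at` helper are all hypothetical and chosen for illustration; they are not MolmoMotion's actual output schema or API.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class TrajectoryPoint:
    t: float      # seconds after the last observed frame (assumed convention)
    xyz: Vec3     # predicted 3D position in meters (assumed unit)
    sigma: float  # uncertainty as one standard deviation in meters (assumed)


@dataclass
class ObjectTrajectory:
    label: str                     # e.g. the object a language prompt referred to
    points: List[TrajectoryPoint]  # must be sorted by increasing t


def position_at(traj: ObjectTrajectory, t: float) -> Vec3:
    """Linearly interpolate the predicted position at time t.

    Times outside the predicted horizon are clamped to the first or last point.
    """
    pts = traj.points
    if not pts:
        raise ValueError("trajectory has no points")
    if t <= pts[0].t:
        return pts[0].xyz
    if t >= pts[-1].t:
        return pts[-1].xyz
    for a, b in zip(pts, pts[1:]):
        if a.t <= t <= b.t:
            w = (t - a.t) / (b.t - a.t)  # fraction of the way from a to b
            return tuple(pa + w * (pb - pa) for pa, pb in zip(a.xyz, b.xyz))
    raise ValueError("points must be sorted by time")


# Hypothetical prediction: a ball moving 2 m along x over one second.
ball = ObjectTrajectory(
    label="ball",
    points=[
        TrajectoryPoint(t=0.0, xyz=(0.0, 0.0, 1.0), sigma=0.02),
        TrajectoryPoint(t=1.0, xyz=(2.0, 0.0, 1.0), sigma=0.10),
    ],
)
print(position_at(ball, 0.5))  # position 500 ms into the forecast
```

A planner or dashboard would consume something like `ObjectTrajectory` directly, which is far cheaper than parsing generated video frames.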
Why this matters
Motion forecasting is the missing link between vision-language models and physical AI. VLMs today can describe what they see; world models like Google’s Genie 3 can hallucinate plausible futures as video. Neither directly answers the question a robot arm or a self-driving car actually needs: “where will this object be 500ms from now?”
Closed labs are building this internally: Waymo, Wayve, and Tesla, with OpenAI rumored to be working on world simulation. Ai2 is the first to ship a fully open attempt, including weights, code, and the training data recipe.
Three practical unlocks:
- Robotics research. Open motion priors mean smaller labs can train manipulation policies without commissioning a proprietary world model.
- Video generation. Language-conditioned motion trajectories can plug into diffusion video pipelines (Runway, Kling, Veo) as controllable motion guides.
- Sports and biomechanics. Analysts can fine-tune MolmoMotion on their sport of choice without paying per-frame for a closed API.
How MolmoMotion fits in the Molmo family
Ai2’s Molmo lineup as of July 2026:
| Model | Focus | Released |
|---|---|---|
| Molmo | Open multimodal (vision + language) | 2024, iterated through 2025-2026 |
| MolmoAct | Open action model for robotics | 2025 |
| MolmoMotion | Open 3D motion forecasting | July 1, 2026 |
MolmoMotion completes a family: perception (Molmo) → action (MolmoAct) → prediction (MolmoMotion). All three are Apache-licensed with open weights.
Comparison: MolmoMotion vs closed motion models
| Property | MolmoMotion | Google Genie 3 | Closed lab world models |
|---|---|---|---|
| License | Apache 2.0 (open weights) | Proprietary | Proprietary |
| Output | Structured 3D trajectories | Generated video | Varies |
| Language conditioning | Yes | Yes | Sometimes |
| Best for | Downstream planners | Video generation | Internal use |
| Cost | Free to self-host | API + credits | Not available externally |
Limitations
- Small model, small scenes. MolmoMotion is not a general-purpose world simulator. It’s tuned for a handful of seconds and a modest number of objects.
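- Depth ambiguity. Predicting 3D motion from monocular video is fundamentally underconstrained. MolmoMotion outputs uncertainty estimates, but acting on them is left to the downstream system, for example by widening safety margins when uncertainty is high.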
- No physics simulator. The model predicts what motion looks plausible, not what physics would actually produce. Downstream planners still need a real simulator or safety layer.
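One simple way a downstream safety layer could use the uncertainty estimates is to inflate a keep-out zone around each predicted position. The sketch below assumes the uncertainty is an isotropic standard deviation in meters and uses a multiplier `k` of 3; the function names, the units, and that choice of `k` are illustrative assumptions, not part of MolmoMotion or any specific planner.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def keep_out_radius(object_radius: float, sigma: float, k: float = 3.0) -> float:
    """Radius to avoid around a predicted position.

    The physical object radius is padded by k standard deviations, so a less
    certain prediction produces a larger keep-out zone.
    """
    return object_radius + k * sigma


def waypoint_is_clear(
    waypoint: Vec3,
    predicted: Vec3,
    object_radius: float,
    sigma: float,
    k: float = 3.0,
) -> bool:
    """Return True if the waypoint lies outside the inflated keep-out zone."""
    distance = math.dist(waypoint, predicted)
    return distance > keep_out_radius(object_radius, sigma, k)


# Hypothetical check: same geometry, two different uncertainty levels.
gripper = (0.0, 0.0, 0.0)
cup = (0.3, 0.0, 0.0)
print(waypoint_is_clear(gripper, cup, object_radius=0.1, sigma=0.05))  # confident forecast
print(waypoint_is_clear(gripper, cup, object_radius=0.1, sigma=0.10))  # uncertain forecast
```

A check like this only guards against the forecast being imprecise. It does not replace a physics simulator, because a plausible-looking trajectory can still be physically wrong.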
What to watch
- Community fine-tunes on sports (NBA/NFL analytics), driving (nuScenes, Waymo Open), and manipulation datasets
- Integrations with open robotics stacks (LeRobot, DROID) and open video generators (Runway ML’s open-source components, Stable Video Diffusion successors)
- Ai2’s next model: a Molmo-scale world simulator would close the gap with Google’s Genie 3
Bottom line
MolmoMotion is a small but important release: the first fully open, language-conditioned 3D motion forecasting model. It won’t compete with closed world models on generative video quality, but it gives the open community a compact, composable motion prior that can plug into robotics, video generation, and analytics pipelines. In a week dominated by proprietary flagship model announcements, it’s the most useful open contribution.