What Is Gemini Omni? Google's Multimodal World Model (May 2026)
Gemini Omni is Google DeepMind’s new “world model,” unveiled at Google I/O 2026 on May 19, 2026. It takes text, image, audio, or video as input and produces physics-accurate video as output. Demis Hassabis called it a step toward AGI. The first version, Gemini Omni Flash, began rolling out the same day.
Last verified: May 20, 2026
Quick facts
| Property | Value |
|---|---|
| Announced | May 19, 2026 (Google I/O 2026) |
| Vendor | Google DeepMind |
| First model | Gemini Omni Flash |
| Inputs | Text, image, audio, video (any combination) |
| Output | Video (with physics simulation), conversationally editable |
| Watermark | SynthID embedded in every generated video |
| Available | Gemini app + Google Flow (paid), YouTube Shorts/Create (free, rolling out) |
| Successor positioning | Beyond Veo — a “world model,” not just a video model |
What is a “world model”?
A world model is an AI system that has an internal simulation of how the physical world works — gravity, momentum, fluid dynamics, lighting, shadows, object permanence — and uses that simulation to generate or predict outcomes.
Where Veo 3.1 generates beautiful video, Omni generates video that respects physics: pour liquid and it falls correctly, push an object and it slides realistically, change the lighting and shadows update.
For Google DeepMind, this is the unification of three things that used to be separate models:
- Reasoning — language-model-style understanding of intent.
- Real-world knowledge — physics, materials, biology.
- Generation — turning all of that into video, image, or audio output.
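To make “respects physics” concrete, here is a minimal sketch of the textbook constant-gravity free-fall formula, sampled frame by frame. It is illustrative only and says nothing about how Omni works internally. The function name, the 24 fps frame rate, and the 2 m drop height are assumptions chosen for the example.

```python
# Illustrative only: textbook free fall, not a description of Omni's internals.
G = 9.81  # gravitational acceleration in m/s^2


def free_fall_heights(start_height_m: float, fps: int, n_frames: int) -> list[float]:
    """Return the height of a dropped object at each frame, ignoring air resistance."""
    heights = []
    for frame in range(n_frames):
        t = frame / fps                          # elapsed time in seconds
        y = start_height_m - 0.5 * G * t * t     # y(t) = y0 - (1/2) g t^2
        heights.append(max(y, 0.0))              # the object stops at the ground
    return heights


# Hypothetical example values: a 2 m drop sampled at 24 fps for 5 frames.
heights = free_fall_heights(start_height_m=2.0, fps=24, n_frames=5)
for frame, y in enumerate(heights):
    print(f"frame {frame}: {y:.4f} m")
```

The gap between consecutive frames grows over time because the object accelerates. A physically consistent clip of a dropped ball should show the same pattern, rather than a constant per-frame step.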
What Gemini Omni Flash can do
- Generate video from any input — text prompt, image, audio clip, or existing video, or any combination.
- Conversational editing — “make the sky brighter,” “remove the second character,” “add a chair on the left” — all in natural language, with visual consistency preserved across edits.
- Physics-grounded scenes — gravity, kinetic energy, fluid dynamics, light/shadow are modeled rather than guessed.
- Multi-turn refinement — keep editing the same video over multiple prompts.
- Aspect-ratio control — portrait (9:16) and landscape (16:9), with frame-level guidance.
- Speech input — provide a voice sample as part of the prompt (full speech generation/editing coming later).
Gemini Omni vs Veo 3.1 vs Sora 2
| | Gemini Omni Flash | Veo 3.1 | Sora 2 |
|---|---|---|---|
| Vendor | Google DeepMind | Google DeepMind | OpenAI |
| Released | May 19, 2026 | October 2025 | September 2025 |
| Status | Live, rolling out | Live | Deprecated (sunset Sep 24, 2026) |
| Multimodal input | Text + image + audio + video | Text + image | Text + image + video |
| Audio generation | Speech samples; full editing planned | Native synced dialogue + ambient + music | Native dialogue + SFX |
| Editing | Conversational, multi-turn | Frame-specific, video extension | Remix + targeted edits |
| Physics realism | World-model grounded | Strong | Strong (e.g. gymnastics, buoyancy) |
| Max clip length | Short clips (dynamic) | Up to 8s | Up to 20s |
| Resolution | High (not officially stated) | Up to 4K | Up to 1080p |
| Watermark | SynthID | SynthID | C2PA |
| Best for | Physics-accurate scenes + iterative editing | Cinematic motion + audio | (Not recommended — deprecated) |
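As a rough way to apply the comparison, the sketch below encodes two rows of the table (multimodal input and status) as data and filters models by the input types a project needs. The `VideoModel` class and `candidates` function are hypothetical names for this example, not part of any Google or OpenAI SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoModel:
    name: str
    inputs: frozenset[str]   # input modalities, from the "Multimodal input" row
    deprecated: bool         # from the "Status" row


# Values copied from the comparison table above.
MODELS = [
    VideoModel("Gemini Omni Flash", frozenset({"text", "image", "audio", "video"}), False),
    VideoModel("Veo 3.1", frozenset({"text", "image"}), False),
    VideoModel("Sora 2", frozenset({"text", "image", "video"}), True),
]


def candidates(required_inputs: set[str], include_deprecated: bool = False) -> list[str]:
    """Return the names of models that accept every required input modality."""
    return [
        m.name
        for m in MODELS
        if required_inputs <= m.inputs and (include_deprecated or not m.deprecated)
    ]


print(candidates({"text", "video"}))
print(candidates({"text", "image"}))
```

Deprecated models are excluded by default, which mirrors the table's “not recommended” note for Sora 2. Pass `include_deprecated=True` to see what an existing Sora 2 project could still rely on before the sunset.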
Why Sora 2 deprecation matters
OpenAI’s decision to shut down the Sora product and API by September 24, 2026 is the single biggest change in the AI video landscape, and it lands the same week as Gemini Omni. Net effect: of the three models compared above, the two that will remain live are both Google’s (Omni for editing and world simulation, Veo 3.1 for cinematic generation). For anyone building on top of OpenAI’s video stack, Omni is the obvious migration path.
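For migration planning, a small sketch like the following computes how many days remain before the Sora 2 sunset date listed in the comparison table (September 24, 2026). The function name is a hypothetical helper, and the example date is this article's “last verified” date.

```python
from datetime import date

# Sunset date as listed in the comparison table.
SORA_2_SUNSET = date(2026, 9, 24)


def days_until_sunset(today: date) -> int:
    """Days remaining until the Sora 2 sunset; negative once the date has passed."""
    return (SORA_2_SUNSET - today).days


# Example: measured from this article's "last verified" date.
remaining = days_until_sunset(date(2026, 5, 20))
print(f"{remaining} days until the Sora 2 sunset")  # prints "127 days until the Sora 2 sunset"
```

In practice you would pass `date.today()` and budget the remaining time across evaluation, re-prompting, and cutover.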
How to use Gemini Omni
| Path | Who | How |
|---|---|---|
| Gemini app | Paid AI Plus, Pro, Ultra | Open Gemini, switch to Omni Flash in the model picker |
| Google Flow | Paid AI Plus, Pro, Ultra | The video creation studio for Omni + Veo |
| YouTube Shorts | Everyone (rolling out) | Generate Shorts from a prompt or remix |
| YouTube Create | Everyone (rolling out) | Edit footage conversationally |
| Vertex AI / Gemini API | Developers (expected) | Programmatic access via Vertex (timing TBC) |
Limits and caveats
- Clip length — short by default; not yet for long-form film.
- Audio output — initial release is speech-sample-conditioned; full generative audio editing comes later “responsibly.”
- SynthID is mandatory — every Omni-generated frame is watermarked. This is a feature for trust, but worth knowing if you’re building on top of Omni.
- API access — consumer surfaces first; programmatic Vertex/Gemini API rollout following.
Who should care
- Creators / video pros — conversational editing is the workflow change.
- Marketing / ad teams — physics-accurate product visualization without a 3D pipeline.
- Educators — explainer videos with correct physics (drop the ball, it actually falls right).
- AGI watchers — Hassabis explicitly framed this as a step toward AGI; the bet is that learning physics from video is part of the path.
- Anyone building on Sora 2 — start migrating to Omni or Veo 3.1.
TL;DR
Gemini Omni is Google DeepMind’s world model for video: it accepts text, image, audio, or video input (in any combination), outputs physics-grounded video, and lets you edit conversationally. Sora 2 is deprecated and sunsets on September 24, 2026; Gemini Omni Flash began rolling out on May 19, 2026. If you make video, the workflow is changing this month.