Introducing Gemini Omni — Gemini Omni Flash (multimodal video creation & editing)
Key Points
- Multimodal inputs (text, image, video, voice) → video output
- Conversational multi‑turn video editing with scene memory and physics awareness
- Omni Flash rolling out to Gemini app/Flow for subscribers; YouTube access free; APIs coming
Summary
Gemini Omni is a new multimodal model family that combines Gemini's reasoning with generative creation, starting with high-quality video output. The first model, Gemini Omni Flash, accepts images, audio (voice references initially), video and text as combined inputs to generate or iteratively edit videos via natural-language conversation. Omni aims for more realistic results by incorporating physics-aware reasoning and grounding in real-world knowledge.
Key Points
- Multimodal inputs: accepts text, images, video and (initially) voice audio as references; can blend references into a single cohesive output.
- Initial output modality: video generation and conversational, multi-turn video editing; image and audio outputs planned in the future.
- Conversational editing: each instruction builds on previous turns while preserving scene continuity (characters, physics and camera stay consistent, and earlier edits are remembered).
- Physics-aware and knowledge-grounded: improved intuitive handling of forces, motion and contextual reasoning to create more realistic and meaningful scenes.
- Style, motion and effect transfer: apply motion from one source to another, transfer visual styles, and synchronize complex timing to audio.
- Avatars and voice: supports creating videos with a user's digital avatar and voice via Avatars; additional audio editing capabilities are being tested.
- Content transparency: all Omni-generated videos include an imperceptible SynthID watermark; verification available via the Gemini app, Gemini in Chrome and Google Search.
Rollout & developer access
- Gemini Omni Flash is rolling out today to Google AI Plus, Pro and Ultra subscribers via the Gemini app and Google Flow.
- Gemini Omni Flash is available at no cost in YouTube Shorts and the YouTube Create app starting this week.
- API access for developers and enterprise customers will roll out in the coming weeks.
Practical considerations for engineers
- Design the UI/UX to support multi-turn conversational edits, and keep stateful scene context so edits remain consistent across turns.
- Plan for verification/integrity checks using SynthID when processing or displaying user-generated content.
- Expect expanded input/output modality support (image/audio outputs and broader audio inputs) over time; handle evolving API capabilities and backward compatibility.
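As a rough illustration of the first and third considerations above, here is a minimal sketch of client-side state for a multi-turn edit. No Omni API has been published yet, so `EditSession`, `EditTurn`, and every field and method name are hypothetical, and the code makes no network calls. It only shows one way an application might keep the ordered instruction history and guard against requesting an output modality that is not yet supported.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass
class EditTurn:
    """One conversational instruction plus any references attached to it."""
    instruction: str
    reference_uris: List[str] = field(default_factory=list)


@dataclass
class EditSession:
    """Hypothetical client-side record of a multi-turn edit (not an Omni API type)."""
    session_id: str
    turns: List[EditTurn] = field(default_factory=list)
    # Assumption: only video output is available at launch, per the announcement.
    supported_outputs: FrozenSet[str] = frozenset({"video"})

    def add_turn(self, instruction: str,
                 reference_uris: Optional[List[str]] = None) -> EditTurn:
        """Append an instruction so later edits can build on earlier ones."""
        turn = EditTurn(instruction, list(reference_uris or []))
        self.turns.append(turn)
        return turn

    def undo(self) -> Optional[EditTurn]:
        """Drop the most recent turn, or return None if there is nothing to undo."""
        return self.turns.pop() if self.turns else None

    def context(self) -> List[str]:
        """Ordered instruction history to send along with the next request."""
        return [t.instruction for t in self.turns]

    def can_output(self, modality: str) -> bool:
        """Check a requested output modality against what is currently supported."""
        return modality in self.supported_outputs
```

A UI could call `add_turn` for each user message, pass `context()` with the next request, and use `can_output` to hide image or audio export options until those modalities ship. Keeping `supported_outputs` as data rather than hard-coded branches makes it easier to adapt as capabilities expand.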