Mochi 1 AI Video Model: Setup, Hardware Requirements, and Working Examples
Mochi 1 is an open-source text-to-video generation model released by Genmo AI in October 2024. It produces fluid, high-motion video clips from text prompts and is available under the Apache 2.0 license, meaning you can run it commercially and modify the weights. This guide covers what hardware you need to run Mochi 1 locally, the three main installation paths, and working Python examples you can adapt immediately.
What Is Mochi 1?
Mochi 1 is a 10-billion-parameter diffusion-based video generation model. Unlike closed-source models like OpenAI's Sora, Mochi 1 ships with full weights and source code on Hugging Face and GitHub. The model generates videos at 480p resolution, targeting approximately 5.4 seconds of video (163 frames at 30fps) from a text description.
Key characteristics:
- Architecture: Asymmetric Diffusion Transformer (AsymmDiT)
- Parameters: ~10B
- License: Apache 2.0 (commercial use allowed)
- Output: 480p, up to 163 frames (~5.4s at 30fps)
- Strengths: Strong prompt adherence, high-quality motion
- Developed by: Genmo AI (October 2024)
Mochi 1 particularly excels at following complex prompts that describe motion — panning shots, character movement, and physical interactions — which is where many competing models fall short. If you want to explore other open-source video tools, our roundup of top AI video generators covers the broader landscape.
Mochi 1 Hardware Requirements
This is where most developers hit a wall. Mochi 1's hardware demands are real, but there are three viable paths depending on your setup.
VRAM Tiers at a Glance
| VRAM Available | Mode | Suitable GPUs | Notes |
|---|---|---|---|
| 60GB+ | Full precision (float32) | A100 80GB, H100 | Full 163-frame generation, fastest inference |
| 22–24GB | bfloat16 | RTX 3090, RTX 4090, A5000 | Near-identical quality to float32; practical sweet spot |
| 8–20GB | CPU offload + VAE tiling | RTX 3080, RTX 4070 Ti, RTX 4080 | Works but significantly slower; shorter clips recommended |
| Multi-GPU | Model sharding | 2x RTX 3090, 2x A5000 | Official repo supports splitting across cards |
For most developers, the RTX 3090 or RTX 4090 with the bfloat16 variant is the practical sweet spot. The 24GB tier is the minimum for reasonable generation speed without CPU offloading. If you're hitting VRAM limits across AI projects, our breakdown of why 24GB+ VRAM is hard to access provides useful context.
Storage: Downloading the Weights
Before running anything, account for model storage:
- Model weights (dit.safetensors): ~19.5GB
- VAE weights (decoder.safetensors): ~500MB
- Total: ~20GB on disk
You'll also need 32GB+ system RAM when using CPU offloading mode, as model components get loaded to system RAM during generation. For standard high-VRAM setups, 16GB system RAM is sufficient.
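If you want to sanity-check a machine before committing to the ~20GB download, a small preflight script helps. This is a minimal sketch, assuming PyTorch with CUDA support is already installed; the thresholds simply mirror the VRAM tiers in the table above.

```python
# Preflight check before downloading Mochi 1: free disk space and GPU VRAM.
# Assumes PyTorch is installed; thresholds follow the VRAM tiers listed above.
import shutil
import torch

free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk space here: {free_gb:.1f} GB (weights need ~20 GB)")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1e9
    print(f"GPU: {props.name}, {vram_gb:.1f} GB VRAM")
    if vram_gb >= 22:
        print("-> bfloat16 without offloading should work")
    elif vram_gb >= 8:
        print("-> use enable_model_cpu_offload() and enable_vae_tiling()")
    else:
        print("-> consider ComfyUI quantization or a cloud GPU")
else:
    print("No CUDA GPU detected -- local inference needs an NVIDIA GPU")
```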
How to Install Mochi 1
There are three practical installation paths. Choose based on your hardware and workflow preference.
Method 1 – Official Genmo Repo with uv
The official repo uses uv for dependency management. This gives you the most control and supports multi-GPU inference.
# Clone the repository
git clone https://github.com/genmoai/mochi
cd mochi
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create a virtual environment and install dependencies
uv venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .
# Install FFMPEG (required for video export)
# Ubuntu/Debian:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg
# Download model weights
python -m scripts.download_weights /path/to/weights/dir
To run inference from the CLI:
# Single GPU
python -m demos.cli --model_dir /path/to/weights/dir
# With CPU offloading (lower VRAM usage)
python -m demos.cli --model_dir /path/to/weights/dir --cpu_offload
# Multi-GPU (example: 2 GPUs)
torchrun --nproc-per-node=2 -m demos.cli --model_dir /path/to/weights/dir
Method 2 – Hugging Face Diffusers (Recommended for Python Developers)
The diffusers integration is the cleanest path for Python developers already in the Hugging Face ecosystem. It handles weight downloading automatically and integrates directly with existing pipelines.
pip install diffusers transformers accelerate torch
pip install imageio[ffmpeg]
The MochiPipeline class is available from diffusers 0.31+ and fetches weights from genmo/mochi-1-preview on first run (~20GB download).
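If you'd rather pull that 20GB ahead of time (or onto a specific disk), huggingface_hub can pre-fetch the repository. A minimal sketch; the local_dir path is an assumption, and you can omit it to use the default cache:

```python
# Pre-download the Mochi 1 weights so the first MochiPipeline load doesn't block on a 20GB fetch.
# local_dir is illustrative; omit it to use the default Hugging Face cache.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="genmo/mochi-1-preview",
    local_dir="./mochi-1-preview",  # hypothetical target directory
)
```

MochiPipeline.from_pretrained can then be pointed at that local directory instead of the repo ID.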
Method 3 – ComfyUI for Lower VRAM
If you're running on a 12–20GB card and want a GUI workflow, ComfyUI's Mochi 1 nodes reduce VRAM requirements through aggressive memory optimization. The setup pattern is similar to running LTX-2 via ComfyUI if you've done that before.
- Install ComfyUI and the ComfyUI-MochiWrapper custom node
- Download the Mochi 1 weights to ComfyUI/models/mochi/
- Load the Mochi workflow JSON and connect your text prompt node
- Enable quantization options in the node settings to reduce VRAM further
Running Mochi 1: Python Examples and Prompt Tips
Basic Text-to-Video Generation
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video
# Load in bfloat16 (recommended for 24GB cards)
pipe = MochiPipeline.from_pretrained(
"genmo/mochi-1-preview",
torch_dtype=torch.bfloat16
)
pipe.to("cuda")
prompt = "A drone shot sweeping over a dense redwood forest at golden hour, fog drifting between the trees"
frames = pipe(
prompt,
num_frames=84, # 84 frames = 2.8s at 30fps
num_inference_steps=64,
guidance_scale=4.5,
).frames[0]
export_to_video(frames, "output.mp4", fps=30)Memory-Efficient Generation
For GPUs with less than 24GB VRAM, enable CPU offloading and VAE tiling. This trades generation speed for accessibility — expect 10–30 minutes on consumer hardware below 20GB.
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video
pipe = MochiPipeline.from_pretrained(
"genmo/mochi-1-preview",
torch_dtype=torch.bfloat16
)
# Enable memory optimizations
pipe.enable_model_cpu_offload() # Moves inactive layers to CPU RAM
pipe.enable_vae_tiling() # Tiles VAE decoding to reduce peak VRAM
prompt = "A close-up of water droplets falling in slow motion onto a dark stone surface"
with torch.autocast("cuda", torch.bfloat16):
frames = pipe(
prompt,
num_frames=19, # Start short when testing on limited VRAM
num_inference_steps=50,
guidance_scale=4.5,
).frames[0]
export_to_video(frames, "output_efficient.mp4", fps=30)Key Parameters and Prompt Writing
| Parameter | Default | Effect |
|---|---|---|
| num_frames | 19 | Total frames; 163 = ~5.4s at 30fps. Start at 19–84 while testing. |
| num_inference_steps | 50 | Denoising steps. Use 64–100 for final renders; 20–30 for fast previews. |
| guidance_scale | 4.5 | Prompt adherence strength. Values 3.5–6.0 are practical; above 7 can over-saturate. |
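One workflow these parameters enable is a fast, low-step preview pass while you iterate on a prompt, followed by a higher-step final render with the same seed. A sketch, reusing the pipe object from the earlier examples; the two renders won't match exactly because the step count differs, but the fixed seed keeps the initial noise and each individual run reproducible:

```python
# Preview-then-final workflow: same prompt, frame count, and seed; only the step count changes.
import torch
from diffusers.utils import export_to_video

prompt = "A slow dolly shot across a rain-soaked neon street at night"
seed = 42

# Fast preview for prompt iteration (20-30 steps)
preview = pipe(
    prompt,
    num_frames=84,
    num_inference_steps=24,
    guidance_scale=4.5,
    generator=torch.Generator("cuda").manual_seed(seed),
).frames[0]
export_to_video(preview, "preview.mp4", fps=30)

# Final render once the composition looks right (64+ steps)
final = pipe(
    prompt,
    num_frames=84,
    num_inference_steps=64,
    guidance_scale=4.5,
    generator=torch.Generator("cuda").manual_seed(seed),
).frames[0]
export_to_video(final, "final.mp4", fps=30)
```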
Prompt tips that matter for Mochi 1:
- Describe motion explicitly: Add phrases like "slowly rotating," "panning left," or "rising upward" — don't leave movement implicit
- Specify camera behavior: "drone shot," "close-up," "wide angle" significantly improve compositional output
- Include lighting: "golden hour," "overcast daylight," "neon-lit night" anchor the visual style consistently
- Keep subject count low: One subject with rich environmental detail outperforms multi-character scenes in temporal consistency
- Avoid conflicting verbs: "A cat slowly walking and running and jumping" confuses the motion model — pick one dominant action
Output Format and Resolution
Mochi 1 outputs frames as a list of PIL images, which export_to_video assembles into an mp4. Output resolution is fixed at 480p. There is no native upscaling in the base model — post-process with Real-ESRGAN or similar tools if you need higher resolution output.
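Because the output is just a list of PIL images, you can also write the frames to disk individually and run an upscaler over them before reassembling the clip. A minimal sketch, assuming frames is the list returned by one of the pipeline calls above; the directory and filename pattern are arbitrary:

```python
# Dump individual frames as PNGs so an external upscaler (e.g. Real-ESRGAN)
# can process them before the video is reassembled.
import os

os.makedirs("frames", exist_ok=True)
for i, frame in enumerate(frames):  # `frames` from an earlier pipeline call
    frame.save(f"frames/frame_{i:04d}.png")
```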
Mochi 1 Architecture: How AsymmDiT Works
Mochi 1 uses an Asymmetric Diffusion Transformer (AsymmDiT). Unlike standard DiT architectures where encoder and decoder mirror each other, AsymmDiT uses non-square QKV and output projection layers. This asymmetry reduces peak VRAM during inference compared to a symmetric model of equivalent total parameter count.
The model operates in latent space using a 3D VAE that compresses spatial and temporal dimensions together. This means the model treats video as a unified temporal volume rather than processing frames independently — the primary reason Mochi 1 produces coherent motion across frames rather than flickery independent predictions.
A separate T5-based text encoder processes prompts into conditioning vectors fed into the DiT. Both the DiT and the text encoder must be in memory during inference, which explains the high baseline VRAM figure when no CPU offloading is used.
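You can inspect these pieces directly on a loaded pipeline. A sketch, assuming the pipe object from the examples above and the usual diffusers attribute names (transformer, vae, text_encoder), which are an assumption rather than something this guide documents:

```python
# Rough parameter breakdown of the components that must be resident during inference:
# the AsymmDiT transformer, the 3D VAE, and the T5 text encoder.
def count_params_billions(module):
    return sum(p.numel() for p in module.parameters()) / 1e9

print(f"AsymmDiT transformer: {count_params_billions(pipe.transformer):.2f}B params")
print(f"3D VAE:               {count_params_billions(pipe.vae):.2f}B params")
print(f"T5 text encoder:      {count_params_billions(pipe.text_encoder):.2f}B params")
```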
Mochi 1 vs Alternatives
Choosing a video generation model depends on your hardware, use case, and licensing needs.
| Model | Min VRAM | License | Strengths | Weaknesses |
|---|---|---|---|---|
| Mochi 1 | ~8GB (offload) | Apache 2.0 | Prompt adherence, motion quality | 480p only, high native VRAM |
| Wan 2.1 (1.3B variant) | ~8GB | Apache 2.0 | Low VRAM, I2V support, multilingual | Lower motion fidelity than 10B models |
| LTX-Video | ~8GB | Apache 2.0 | Very fast inference | Lower prompt adherence |
| HunyuanVideo | ~60GB | Custom (non-commercial) | High cinematic quality | License restrictions, same VRAM demand |
| Sora (OpenAI) | N/A (cloud only) | Proprietary | Highest quality, longer clips | No local deployment, subscription cost |
If you need lower VRAM with commercial freedom, Wan 2.1 or LTX-Video are pragmatic choices. See our Wan 2.1 vs Runway Gen-3 comparison for detail on that trade-off. Mochi 1 wins when prompt fidelity and motion coherence matter more than hardware efficiency.
Cloud Options When Local Hardware Isn't Enough
If you need Mochi 1 output without the hardware investment, several platforms support it directly:
- Replicate: Pay-per-generation API at replicate.com/genmoai/mochi-1 — no setup required (a minimal API sketch follows this list)
- Vast.ai: Rent H100 or A100 instances hourly; official Mochi 1 template available in the model library
- RunPod: GPU pods with 40–80GB VRAM; community templates exist for Mochi 1
- Modal: Serverless GPU execution with an official Mochi 1 example in their docs
- Genmo Playground: Genmo's hosted demo at genmo.ai — free tier for low-volume testing
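As an illustration of the hosted route, here is roughly what calling the Replicate model listed above looks like from Python. A minimal sketch, assuming the replicate client is installed and REPLICATE_API_TOKEN is set; fields beyond prompt are assumptions, so check the model page for the actual input schema:

```python
# Minimal sketch of generating a clip through Replicate's hosted Mochi 1.
# Assumes `pip install replicate` and REPLICATE_API_TOKEN in the environment;
# inputs beyond `prompt` are assumptions -- see the model page for the real schema.
import replicate

output = replicate.run(
    "genmoai/mochi-1",
    input={"prompt": "A drone shot sweeping over a dense redwood forest at golden hour"},
)
print(output)  # typically a URL or file reference for the generated video
```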
For selecting cloud GPU providers by cost and performance, our guide on best cloud GPUs for AI workloads applies directly. If you just want quick text-to-video output without infrastructure, there are also free online text-to-video tools that can serve as a low-commitment starting point.
Bottom line: Mochi 1 is the right local video model if you have a 24GB card, need Apache 2.0 licensing, and want strong prompt-following motion. Start with the diffusers path and bfloat16 mode. If you're below 24GB, CPU offload works — budget for longer generation times. For teams that need lower hardware requirements without sacrificing output quality, Wan 2.1's 14B variant is the next thing to evaluate.