Mochi 1 AI Video Model: Setup, Hardware Requirements, and Working Examples

Mochi 1 is an open-source text-to-video generation model released by Genmo AI in October 2024. It produces fluid, high-motion video clips from text prompts and is available under the Apache 2.0 license, meaning you can run it commercially and modify the weights. This guide covers what hardware you need to run Mochi 1 locally, the three main installation paths, and working Python examples you can adapt immediately.

What Is Mochi 1?

Mochi 1 is a 10-billion-parameter diffusion-based video generation model. Unlike closed-source models like OpenAI's Sora, Mochi 1 ships with full weights and source code on Hugging Face and GitHub. The model generates videos at 480p resolution, targeting approximately 5.4 seconds of video (163 frames at 30fps) from a text description.

Key characteristics:

  • Architecture: Asymmetric Diffusion Transformer (AsymmDiT)
  • Parameters: ~10B
  • License: Apache 2.0 (commercial use allowed)
  • Output: 480p, up to 163 frames (~5.4s at 30fps)
  • Strengths: Strong prompt adherence, high-quality motion
  • Developed by: Genmo AI (October 2024)

Mochi 1 particularly excels at following complex prompts that describe motion — panning shots, character movement, and physical interactions — which is where many competing models fall short. If you want to explore other open-source video tools, our roundup of top AI video generators covers the broader landscape.

Mochi 1 Hardware Requirements

This is where most developers hit a wall. Mochi 1's hardware demands are real, but there are three viable paths depending on your setup.

VRAM Tiers at a Glance

| VRAM Available | Mode | Suitable GPUs | Notes |
|---|---|---|---|
| 60GB+ | Full precision (float32) | A100 80GB, H100 | Full 163-frame generation, fastest inference |
| 22–24GB | bfloat16 | RTX 3090, RTX 4090, A5000 | Near-identical quality to float32; practical sweet spot |
| 8–20GB | CPU offload + VAE tiling | RTX 3080, RTX 4070 Ti, RTX 4080 | Works but significantly slower; shorter clips recommended |
| Multi-GPU | Model sharding | 2x RTX 3090, 2x A5000 | Official repo supports splitting across cards |

For most developers, the RTX 3090 or RTX 4090 with the bfloat16 variant is the practical sweet spot. The 24GB tier is the minimum for reasonable generation speed without CPU offloading. If you're hitting VRAM limits across AI projects, our breakdown of why 24GB+ VRAM is hard to access provides useful context.
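If you want to script the tier decision, here is a minimal sketch that maps detected VRAM to an inference mode. The thresholds come from the table above; the helper name and structure are illustrative, not part of any official tooling:

```python
import torch

def pick_mode(vram_gb: float) -> str:
    """Map available VRAM (GB) to a Mochi 1 inference mode (heuristic)."""
    if vram_gb >= 60:
        return "full precision (float32)"
    if vram_gb >= 22:
        return "bfloat16"
    if vram_gb >= 8:
        return "CPU offload + VAE tiling"
    return "insufficient VRAM; consider a cloud option"

# Detect and report, if a CUDA GPU is present:
if torch.cuda.is_available():
    detected = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"{detected:.1f} GB detected -> {pick_mode(detected)}")
```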

Storage: Downloading the Weights

Before running anything, account for model storage:

  • Model weights (dit.safetensors): ~19.5GB
  • VAE weights (decoder.safetensors): ~500MB
  • Total: ~20GB on disk

You'll also need 32GB+ system RAM when using CPU offloading mode, as model components get loaded to system RAM during generation. For standard high-VRAM setups, 16GB system RAM is sufficient.
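A quick pre-flight check before the download, using only the standard library. The 25GB threshold is an assumption that adds headroom over the ~20GB of weights:

```python
import shutil

def enough_disk(path: str = ".", needed_gb: float = 25.0) -> bool:
    """Return True if `path` has at least `needed_gb` of free space
    (headroom over the ~20GB of Mochi 1 weights)."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= needed_gb

print(enough_disk())
```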

How to Install Mochi 1

There are three practical installation paths. Choose based on your hardware and workflow preference.

Method 1 – Official Genmo Repo with uv

The official repo uses uv for dependency management. This gives you the most control and supports multi-GPU inference.

# Clone the repository
git clone https://github.com/genmoai/mochi
cd mochi

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment and install dependencies
uv venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

# Install FFMPEG (required for video export)
# Ubuntu/Debian:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg

# Download model weights
python -m scripts.download_weights /path/to/weights/dir

To run inference from the CLI:

# Single GPU
python -m demos.cli --model_dir /path/to/weights/dir

# With CPU offloading (lower VRAM usage)
python -m demos.cli --model_dir /path/to/weights/dir --cpu_offload

# Multi-GPU (example: 2 GPUs)
torchrun --nproc-per-node=2 -m demos.cli --model_dir /path/to/weights/dir

Method 2 – Hugging Face Diffusers

The diffusers integration is the cleanest path for Python developers already in the Hugging Face ecosystem. It handles weight downloading automatically and integrates directly with existing pipelines.

pip install diffusers transformers accelerate torch
pip install "imageio[ffmpeg]"

The MochiPipeline class is available from diffusers 0.31+ and fetches weights from genmo/mochi-1-preview on first run (~20GB download).
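If you would rather pre-fetch the weights instead of blocking on the first pipeline load, `snapshot_download` from huggingface_hub (installed alongside diffusers) can populate the local cache ahead of time. The helper below and its `download` guard are illustrative; the guard exists so importing the file never triggers the ~20GB transfer by accident:

```python
from huggingface_hub import snapshot_download

def prefetch_mochi(download: bool = False):
    """Pre-fetch genmo/mochi-1-preview into the local Hugging Face cache.
    Pass download=True to actually start the ~20GB transfer."""
    if not download:
        return None
    return snapshot_download(
        "genmo/mochi-1-preview",
        allow_patterns=["*.safetensors", "*.json"],  # skip non-essential files
    )

# prefetch_mochi(download=True)  # uncomment to pull weights ahead of time
```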

Method 3 – ComfyUI for Lower VRAM

If you're running on a 12–20GB card and want a GUI workflow, ComfyUI's Mochi 1 nodes reduce VRAM requirements through aggressive memory optimization. The setup pattern is similar to running LTX-2 via ComfyUI if you've done that before.

  1. Install ComfyUI and the ComfyUI-MochiWrapper custom node
  2. Download the Mochi 1 weights to ComfyUI/models/mochi/
  3. Load the Mochi workflow JSON and connect your text prompt node
  4. Enable quantization options in the node settings to reduce VRAM further

Running Mochi 1: Python Examples and Prompt Tips

Basic Text-to-Video Generation

import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

# Load in bfloat16 (recommended for 24GB cards)
pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt = "A drone shot sweeping over a dense redwood forest at golden hour, fog drifting between the trees"

frames = pipe(
    prompt,
    num_frames=84,           # 84 frames = 2.8s at 30fps
    num_inference_steps=64,
    guidance_scale=4.5,
).frames[0]

export_to_video(frames, "output.mp4", fps=30)

Memory-Efficient Generation

For GPUs with less than 24GB VRAM, enable CPU offloading and VAE tiling. This trades generation speed for accessibility — expect 10–30 minutes on consumer hardware below 20GB.

import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    torch_dtype=torch.bfloat16
)

# Enable memory optimizations
pipe.enable_model_cpu_offload()  # Moves inactive layers to CPU RAM
pipe.enable_vae_tiling()         # Tiles VAE decoding to reduce peak VRAM

prompt = "A close-up of water droplets falling in slow motion onto a dark stone surface"

with torch.autocast("cuda", torch.bfloat16):
    frames = pipe(
        prompt,
        num_frames=19,          # Start short when testing on limited VRAM
        num_inference_steps=50,
        guidance_scale=4.5,
    ).frames[0]

export_to_video(frames, "output_efficient.mp4", fps=30)

Key Parameters and Prompt Writing

| Parameter | Default | Effect |
|---|---|---|
| num_frames | 19 | Total frames; 163 = ~5.4s at 30fps. Start at 19–84 while testing. |
| num_inference_steps | 50 | Denoising steps. Use 64–100 for final renders; 20–30 for fast previews. |
| guidance_scale | 4.5 | Prompt adherence strength. Values 3.5–6.0 are practical; above 7 can over-saturate. |
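In practice it helps to keep two parameter presets, one for iteration and one for final output. The values below follow the table's suggested ranges; they are not library defaults:

```python
# Presets following the parameter table: quick previews vs. final renders.
PREVIEW = dict(num_frames=31, num_inference_steps=25, guidance_scale=4.5)
FINAL = dict(num_frames=163, num_inference_steps=64, guidance_scale=4.5)

def clip_seconds(num_frames: int, fps: int = 30) -> float:
    """Length of the exported clip for a given frame count."""
    return num_frames / fps

print(f"preview: {clip_seconds(PREVIEW['num_frames']):.1f}s")  # ~1.0s
print(f"final:   {clip_seconds(FINAL['num_frames']):.1f}s")    # ~5.4s
```

A preset can then be passed straight into the pipeline call, e.g. `pipe(prompt, **PREVIEW)`.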

Prompt tips that matter for Mochi 1:

  • Describe motion explicitly: Add phrases like "slowly rotating," "panning left," or "rising upward" — don't leave movement implicit
  • Specify camera behavior: "drone shot," "close-up," "wide angle" significantly improve compositional output
  • Include lighting: "golden hour," "overcast daylight," "neon-lit night" anchor the visual style consistently
  • Keep subject count low: One subject with rich environmental detail outperforms multi-character scenes in temporal consistency
  • Avoid conflicting verbs: "A cat slowly walking and running and jumping" confuses the motion model — pick one dominant action
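The tips above can be folded into a small helper so every prompt covers the same bases: camera, subject, one dominant motion, and lighting. This is a hypothetical template, not an official prompt format:

```python
def build_prompt(subject: str, motion: str, camera: str, lighting: str) -> str:
    """Assemble a prompt covering camera, subject, one dominant motion,
    and lighting -- the elements Mochi 1 responds to most reliably."""
    return f"{camera} of {subject}, {motion}, {lighting}"

print(build_prompt(
    subject="a lighthouse on a rocky coast",
    motion="waves crashing against the rocks in slow motion",
    camera="wide-angle drone shot",
    lighting="overcast daylight",
))
```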

Output Format and Resolution

Mochi 1 outputs frames as a list of PIL images, which export_to_video assembles into an mp4. Output resolution is fixed at 480p. There is no native upscaling in the base model — post-process with Real-ESRGAN or similar tools if you need higher resolution output.
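Because the frames arrive as PIL images, a post-processing pass is straightforward to wire in. As a stand-in for a real super-resolution model, the sketch below does a naive Lanczos upscale over the frame list; a production pipeline would swap Real-ESRGAN in at the same point. The 848x480 dummy frames are just an example size:

```python
from PIL import Image

def upscale_frames(frames, scale=2):
    """Naive Lanczos upscale of each PIL frame -- a placeholder for a
    proper super-resolution pass (e.g. Real-ESRGAN)."""
    return [
        f.resize((f.width * scale, f.height * scale), Image.LANCZOS)
        for f in frames
    ]

# Dummy 480p frames standing in for pipeline output:
frames = [Image.new("RGB", (848, 480)) for _ in range(3)]
big = upscale_frames(frames)
print(big[0].size)  # (1696, 960)
```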

Mochi 1 Architecture: How AsymmDiT Works

Mochi 1 uses an Asymmetric Diffusion Transformer (AsymmDiT). Unlike standard DiT architectures where encoder and decoder mirror each other, AsymmDiT uses non-square QKV and output projection layers. This asymmetry reduces peak VRAM during inference compared to a symmetric model of equivalent total parameter count.

The model operates in latent space using a 3D VAE that compresses spatial and temporal dimensions together. This means the model treats video as a unified temporal volume rather than processing frames independently — the primary reason Mochi 1 produces coherent motion across frames rather than flickery independent predictions.

A separate T5-based text encoder processes prompts into conditioning vectors fed into the DiT. Both the DiT and the text encoder must be in memory during inference, which explains the high baseline VRAM figure when no CPU offloading is used.
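A back-of-envelope calculation shows why the VRAM tiers land where they do. This counts DiT weights only; activations, the T5 encoder, and the VAE add more on top:

```python
PARAMS = 10e9  # ~10B DiT parameters

def weight_gb(bytes_per_param: int) -> float:
    """Memory for the DiT weights alone at a given precision."""
    return PARAMS * bytes_per_param / 1024**3

print(f"float32:  {weight_gb(4):.1f} GB")  # ~37 GB, hence the 60GB+ tier
print(f"bfloat16: {weight_gb(2):.1f} GB")  # ~19 GB, fits on a 24GB card
```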

Mochi 1 vs Alternatives

Choosing a video generation model depends on your hardware, use case, and licensing needs.

| Model | Min VRAM | License | Strengths | Weaknesses |
|---|---|---|---|---|
| Mochi 1 | ~8GB (offload) | Apache 2.0 | Prompt adherence, motion quality | 480p only, high native VRAM |
| Wan 2.1 (1.3B variant) | ~8GB | Apache 2.0 | Low VRAM, I2V support, multilingual | Lower motion fidelity than 10B models |
| LTX-Video | ~8GB | Apache 2.0 | Very fast inference | Lower prompt adherence |
| HunyuanVideo | ~60GB | Custom (usage restrictions) | High cinematic quality | License restrictions; VRAM similar to full-precision Mochi |
| Sora (OpenAI) | N/A (cloud only) | Proprietary | Highest quality, longer clips | No local deployment, subscription cost |

If you need lower VRAM with commercial freedom, Wan 2.1 or LTX-Video are pragmatic choices. See our Wan 2.1 vs Runway Gen-3 comparison for detail on that trade-off. Mochi 1 wins when prompt fidelity and motion coherence matter more than hardware efficiency.

Cloud Options When Local Hardware Isn't Enough

If you need Mochi 1 output without the hardware investment, several platforms support it directly:

  • Replicate: Pay-per-generation API at replicate.com/genmoai/mochi-1 — no setup required
  • Vast.ai: Rent H100 or A100 instances hourly; official Mochi 1 template available in the model library
  • RunPod: GPU pods with 40–80GB VRAM; community templates exist for Mochi 1
  • Modal: Serverless GPU execution with an official Mochi 1 example in their docs
  • Genmo Playground: Genmo's hosted demo at genmo.ai — free tier for low-volume testing

For selecting cloud GPU providers by cost and performance, our guide on best cloud GPUs for AI workloads applies directly. If you just want quick text-to-video output without infrastructure, there are also free online text-to-video tools that can serve as a low-commitment starting point.

Bottom line: Mochi 1 is the right local video model if you have a 24GB card, need Apache 2.0 licensing, and want strong prompt-following motion. Start with the diffusers path and bfloat16 mode. If you're below 24GB, CPU offload works — budget for longer generation times. For teams that need lower hardware requirements without sacrificing too much output quality, Wan 2.1 is the next thing to evaluate.