How to Run Mochi 1 with Diffusers and Lower VRAM Settings
Mochi 1 normally needs 22+ GB VRAM, but with CPU offloading, VAE tiling, and 8-bit quantization you can run it on consumer hardware. Full Python code for each technique.
Mochi 1 is one of the most capable open-source text-to-video models available, but its memory requirements put it out of reach of most developers' hardware. Running the full-precision model requires tens of gigabytes of VRAM. With the right Mochi 1 Diffusers configuration, however, you can run inference on a consumer GPU with 16–24 GB of VRAM — or even less with quantization. This guide walks through every technique, from basic CPU offloading to NF4 quantized checkpoints, with complete working code for each step.
What Is Mochi 1 and Why VRAM Matters
Mochi 1 is a 10-billion-parameter video generation model developed by Genmo. Its architecture is based on an Asymmetric Diffusion Transformer (AsymmDiT), which processes video frames in a latent space using a joint attention mechanism between the text and video tokens. The asymmetry refers to the fact that text and video tokens share attention computations but use different projection dimensions.
The practical consequence of this architecture is sheer weight size. The full-precision weights total roughly 80 GB. The BF16 variant hosted on Hugging Face loads in approximately 22 GB of VRAM — already out of reach for the RTX 3080 or RTX 4070 Ti sitting on most developer desks. The three techniques in this guide progressively reduce that floor to 16 GB, then to 12 GB, making Mochi 1 accessible without renting cloud compute.
If you are looking for OS-specific installation steps, see the platform guides for running Mochi 1 on Ubuntu, Windows, and macOS. This guide focuses exclusively on the Diffusers API and VRAM reduction.
Install Diffusers for Mochi 1
Mochi 1 support landed in Diffusers in late 2024. Use a recent stable release to guarantee MochiPipeline is present.
pip install --upgrade diffusers transformers accelerate torch torchvision
# For quantization (Technique 3)
pip install bitsandbytes
Verify your CUDA setup before loading any model weights:
import torch
print(torch.cuda.is_available()) # must be True
print(torch.cuda.get_device_name(0)) # e.g. NVIDIA GeForce RTX 4090
print(torch.cuda.get_device_properties(0).total_memory // 1024**3, "GB")
You also need Hugging Face access to genmo/mochi-1-preview. The model is gated — accept the license on the model card, then authenticate:
huggingface-cli login
Approximate disk space needed: ~90 GB for the full checkpoint with all components. If disk space is a constraint, jump directly to Technique 3 and load the NF4 community checkpoint (~20 GB).
Baseline Mochi 1 Diffusers Pipeline
The following script is the minimal working example with no memory optimizations applied. Run it only if you have at least 22 GB of VRAM available. Its purpose is to establish a baseline you can compare against each technique.
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

prompt = "A timelapse of clouds moving over a mountain range at golden hour"

with torch.no_grad():
    frames = pipe(
        prompt,
        num_frames=25,
        guidance_scale=4.5,
        num_inference_steps=50,
    ).frames[0]

export_to_video(frames, "output.mp4", fps=30)
print("Saved to output.mp4")
Expected VRAM at this baseline: ~22 GB. If you hit an out-of-memory error immediately, skip to Technique 1 and return here once the offloading is active.
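To compare the VRAM figures quoted in this guide against your own hardware, you can read PyTorch's peak-allocation counter after each run. A minimal sketch (the helper name peak_vram_gb is my own; it returns 0.0 on machines without CUDA so the script stays portable):

```python
import torch

def peak_vram_gb() -> float:
    """Peak VRAM PyTorch has allocated since the last counter reset, in GB."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.max_memory_allocated() / 1024**3

# After a generation run:
# print(f"peak: {peak_vram_gb():.1f} GB")
# torch.cuda.reset_peak_memory_stats()  # reset before testing the next technique
```

Call torch.cuda.reset_peak_memory_stats() between techniques so each measurement reflects only the configuration under test.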
Technique 1 — Model CPU Offload
Calling enable_model_cpu_offload() instructs the Accelerate library to move model submodules to CPU RAM between forward passes. Only the submodule currently executing lives on the GPU; everything else sits in system RAM. The GPU never holds the full model weight set simultaneously.
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # manages device placement automatically

prompt = "A timelapse of clouds moving over a mountain range at golden hour"

with torch.no_grad():
    frames = pipe(
        prompt,
        num_frames=25,
        guidance_scale=4.5,
        num_inference_steps=50,
    ).frames[0]

export_to_video(frames, "output_offload.mp4", fps=30)
Important: Do not call pipe.to("cuda") after enabling CPU offload. The offloader manages device placement automatically — calling .to("cuda") after the fact defeats the optimization by forcing all weights back onto the GPU at once.
Expected VRAM after this change: ~18–20 GB, depending on driver and PyTorch version. Generation speed drops by roughly 2x because data is continuously transferred between CPU RAM and GPU VRAM during each denoising step. For a 25-frame clip on an RTX 4090, expect 8–15 minutes instead of 2–4 minutes.
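To put numbers on the slowdown for your own GPU, a small timing wrapper is enough (the helper name timed is my own, not part of Diffusers):

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print its wall-clock time, and return its result unchanged."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.1f} s")
    return result

# Hypothetical usage around a pipeline call:
# out = timed("offloaded run", pipe, prompt, num_frames=25)
# frames = out.frames[0]
```

Running the same prompt once with and once without enable_model_cpu_offload() makes the roughly 2x penalty concrete for your driver and PyTorch version.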
Technique 2 — VAE Tiling
The VAE (Variational Autoencoder) is the decoder stage that converts latent representations back into pixel space. For video, the VAE must decode every frame, making its memory footprint proportional to the number of frames requested. On a 25-frame clip at the default resolution this is already significant; at 85 frames it can cause an OOM spike even when the transformer fits comfortably.
VAE tiling splits the spatial dimensions of each frame into overlapping tiles and decodes them independently, then stitches them back together. The result is near-identical visually while the peak VRAM spike is bounded regardless of frame count.
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()  # bounds VAE memory regardless of frame count

prompt = "A timelapse of clouds moving over a mountain range at golden hour"

with torch.no_grad():
    frames = pipe(
        prompt,
        num_frames=37,  # safe to increase with tiling active
        guidance_scale=4.5,
        num_inference_steps=50,
    ).frames[0]

export_to_video(frames, "output_tiled.mp4", fps=30)
Expected VRAM with both offload and tiling: ~16–18 GB. You can also increase num_frames to 37 or 43 without hitting the OOM wall that occurs without tiling.
Technique 3 — 8-bit and NF4 Quantization
Quantization converts weight values from 16-bit floating point to lower bit-width representations. This reduces the in-memory footprint of model weights at the cost of a small precision loss. For most generative video tasks, 8-bit quantization is imperceptible in output quality; NF4 (4-bit NormalFloat) is more aggressive but often acceptable.
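The expected savings follow directly from bytes per parameter. A back-of-the-envelope calculation for a roughly 10-billion-parameter transformer (figures are approximate and exclude the text encoder and VAE):

```python
PARAMS = 10e9  # approximate Mochi 1 transformer parameter count

for fmt, bytes_per_param in [("bf16", 2.0), ("int8", 1.0), ("nf4", 0.5)]:
    weight_gb = PARAMS * bytes_per_param / 1024**3
    print(f"{fmt}: ~{weight_gb:.1f} GB of weights")
# bf16: ~18.6 GB of weights
# int8: ~9.3 GB of weights
# nf4: ~4.7 GB of weights
```

Activations, the text encoder, and the VAE sit on top of these figures, which is why the observed VRAM numbers below are higher than the raw weight sizes.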
8-bit Quantization with bitsandbytes
import torch
from diffusers import BitsAndBytesConfig, MochiPipeline, MochiTransformer3DModel
from diffusers.utils import export_to_video

# Quantize the transformer (the largest component) to 8-bit on load.
# Note: Diffusers applies bitsandbytes quantization per component, so the
# config goes on the transformer rather than on the pipeline itself.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
transformer = MochiTransformer3DModel.from_pretrained(
    "genmo/mochi-1-preview",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
)

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    transformer=transformer,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = "A robot walking through a rainy neon-lit city street"

with torch.no_grad():
    frames = pipe(
        prompt,
        num_frames=25,
        guidance_scale=4.5,
        num_inference_steps=50,
    ).frames[0]

export_to_video(frames, "output_8bit.mp4", fps=30)
Expected VRAM: ~12–14 GB. This brings Mochi 1 within reach of RTX 3080 (10 GB) with aggressive CPU offloading, though at that level generation speed will be very slow — 20–30 minutes per 25-frame clip.
NF4 Community Checkpoint
If you do not want to quantize on-the-fly, the community checkpoint imnotednamode/mochi-1-preview-mix-nf4 on Hugging Face provides pre-quantized NF4 weights. These load faster than on-the-fly quantization and produce consistent results.
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "imnotednamode/mochi-1-preview-mix-nf4",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = "A robot walking through a rainy neon-lit city street"

with torch.no_grad():
    frames = pipe(
        prompt,
        num_frames=25,
        guidance_scale=4.5,
        num_inference_steps=50,
    ).frames[0]

export_to_video(frames, "output_nf4.mp4", fps=30)
Note: Community checkpoints are not officially maintained by Genmo or the Diffusers team. Verify the checkpoint before using it in production workflows. The smaller variant imnotednamode/mochi-1-preview-mix-nf4-small cuts VRAM further but reduces visual fidelity noticeably.
Combining All Techniques for Maximum VRAM Savings
For developers with 12–16 GB VRAM, combining all three techniques is the recommended configuration. Below is the complete production-ready script:
import torch
from diffusers import BitsAndBytesConfig, MochiPipeline, MochiTransformer3DModel
from diffusers.utils import export_to_video

# Quantize the transformer to 8-bit to reduce weight memory
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
transformer = MochiTransformer3DModel.from_pretrained(
    "genmo/mochi-1-preview",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
)

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    transformer=transformer,
    torch_dtype=torch.float16,
)

# Offload inactive model submodules to CPU RAM
pipe.enable_model_cpu_offload()
# Tile the VAE decoder to bound the VRAM spike during decoding
pipe.enable_vae_tiling()

prompt = (
    "A golden retriever puppy running through autumn leaves in slow motion, "
    "cinematic bokeh, warm color grading"
)

with torch.no_grad():
    frames = pipe(
        prompt,
        num_frames=25,  # keep low for VRAM-constrained runs
        guidance_scale=4.5,
        num_inference_steps=50,
        generator=torch.Generator(device="cpu").manual_seed(42),
    ).frames[0]

export_to_video(frames, "output_optimized.mp4", fps=30)
print("Done — output_optimized.mp4")
Expected VRAM: ~10–14 GB depending on GPU, driver, and PyTorch version. System RAM requirement is ~32–64 GB when all submodules are offloaded to CPU. Generation will be slow — a 25-frame clip on an RTX 3080 may take 20–30 minutes.
Tuning num_frames, guidance_scale, and num_inference_steps
Understanding the constraints on these parameters prevents cryptic runtime errors and improves output quality.
num_frames — The 6n+1 Constraint
Mochi 1 requires num_frames to satisfy (num_frames - 1) % 6 == 0. Valid values are 25, 31, 37, 43, 49, 55, 61, 67, 73, 79, and 85. Passing any other value raises a ValueError at runtime. The most common mistake is passing 24 instead of 25. For VRAM-constrained environments, start at 25 and increase only after verifying the baseline configuration fits.
# Valid num_frames values
valid_frames = [6 * n + 1 for n in range(4, 15)]
print(valid_frames)  # [25, 31, 37, 43, 49, 55, 61, 67, 73, 79, 85]
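Because the ValueError only surfaces at runtime, it is worth failing fast before a long generation starts. A small guard (the helper name validate_num_frames is my own) can sit at the top of any script:

```python
def validate_num_frames(num_frames: int) -> int:
    """Raise early if num_frames violates Mochi 1's 6n+1 constraint."""
    if num_frames < 1 or (num_frames - 1) % 6 != 0:
        valid = [6 * n + 1 for n in range(4, 15)]
        raise ValueError(
            f"num_frames={num_frames} must satisfy (num_frames - 1) % 6 == 0; "
            f"common values: {valid}"
        )
    return num_frames

# validate_num_frames(25)  -> 25
# validate_num_frames(24)  -> raises ValueError
```

Calling it before building the pipeline turns a cryptic mid-script failure into an immediate, readable error.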
guidance_scale
The default is 4.5. The effective range is 3.5–7.0. Values below 3.5 produce less coherent motion; values above 6.0 tend to over-saturate colors and create jerky transitions. Increasing guidance scale reduces apparent motion — the model prioritises text fidelity over dynamic movement at high values. For prompts describing action, stay in the 4.0–5.0 range.
num_inference_steps
The default is 50 denoising steps. Increasing to 64–80 steps improves fine detail and temporal consistency at the cost of proportional generation time. For a quick VRAM test, 20–30 steps is sufficient.
- 20–30 steps: Draft quality — use for VRAM testing and quick iteration
- 50 steps: Default — good quality for most use cases
- 64–80 steps: Best quality — use for final output
Common Errors and Fixes
CUDA out of memory
RuntimeError: CUDA out of memory. Tried to allocate X GiB
Apply the techniques in order: CPU offload first, then VAE tiling, then 8-bit quantization. Reduce num_frames to 25. If OOM persists after all three, switch to the NF4-small community checkpoint and reduce num_inference_steps to 30.
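The "apply techniques in order" advice can be automated as a fallback ladder. A sketch under my own naming (generate_with_fallback takes label/callable pairs, where each callable would run one of the pipeline configurations above):

```python
def generate_with_fallback(attempts):
    """Try progressively more aggressive memory configurations in order.

    `attempts` is a list of (label, callable) pairs; each callable runs one
    pipeline configuration and may raise on CUDA OOM (PyTorch raises
    RuntimeError for that). Returns the first configuration that succeeds.
    """
    failures = []
    for label, run in attempts:
        try:
            return label, run()
        except RuntimeError as exc:
            failures.append(label)
    raise RuntimeError(f"All configurations failed: {failures}")
```

In practice each callable would rebuild the pipeline (offload only, then offload plus tiling, then 8-bit quantization), since a pipeline that has already hit OOM should not be reused as-is.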
Invalid num_frames value
ValueError: num_frames must satisfy (num_frames - 1) % 6 == 0
Use only values from the valid list above. The most common culprit is 24 instead of 25.
Model not found / 401 Unauthorized
OSError: genmo/mochi-1-preview does not exist or you don't have access
Accept the model license at huggingface.co/genmo/mochi-1-preview and run huggingface-cli login before the first download. The model is gated and requires authentication.
Slow generation with CPU offload
This is expected behaviour, not an error. On cards with 16 GB VRAM, experiment with removing enable_model_cpu_offload() while keeping enable_vae_tiling() and quantization — you may find the full 8-bit model fits on the GPU, significantly improving speed.
For a broader look at consumer GPU constraints in AI workflows, see why high-VRAM GPUs remain inaccessible to most developers. For other local video generation options, the top AI video generation models guide covers the full landscape.