Install Qwen2.5-Omni 3B on Windows

Qwen2.5-Omni 3B is Alibaba Cloud’s compact, multimodal AI model optimized for local deployment on consumer-grade hardware. Unlike the 7B variant, the 3B model significantly reduces VRAM usage—by more than 50%—while maintaining robust performance across text, image, audio, and video tasks.
With real-time output and simultaneous multimodal input support, Qwen2.5-Omni 3B is ideal for building local virtual assistants, media analytics tools, and interactive content engines.
This guide walks you through installing Qwen2.5-Omni 3B on Windows, including dependency management, GPU compatibility, and handling multimodal inputs.
System Requirements
Hardware
- GPU: NVIDIA GPU with ≥24GB VRAM (FP32) or ≥18GB VRAM (BF16 with flash attention, e.g., RTX 3090/4090).
- Note: Real-world VRAM usage typically runs about 1.2x above the theoretical minimums quoted in the benchmarks below (e.g., the ~18.38 GB theoretical figure for a 15-second video in BF16).
- RAM: At least 32GB recommended.
- Storage: Minimum 15GB free for model and dependencies.
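Before installing anything, you can confirm the installed driver version and the VRAM actually available on your card with the standard NVIDIA utility:
nvidia-smi
The output should list your GPU along with its total memory; if the command is not found, install or update the NVIDIA driver first.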
Software
- Operating System: Windows 10 or 11 (64-bit)
- CUDA: Version 12.1 (required for PyTorch compatibility)
- Python: Version 3.10 (via Conda for isolated environment)
- FFmpeg: Required for audio and video processing
Installation Steps
1. Set Up the Environment
Install Conda
Download Miniconda from the official site, install it, then run:
conda create -n qwen python=3.10 -y
conda activate qwen
Install PyTorch with CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
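Before continuing, it is worth confirming that the CUDA-enabled build was installed and that the GPU is visible. A quick one-line check (standard PyTorch calls, nothing Qwen-specific) from the activated environment is:
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
The last value should print True; if it prints False, a common cause is an outdated NVIDIA driver or an accidentally installed CPU-only wheel.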
Install Base Dependencies
pip install sentencepiece bitsandbytes protobuf numpy einops timm pillow soundfile
2. Install Qwen-Specific Packages
Transformers with Qwen2.5-Omni Support
pip uninstall -y transformers
pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
pip install accelerate
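To verify that this transformers build actually registers the Qwen2.5-Omni model classes (and to catch the KeyError: 'qwen2_5_omni' issue from the Troubleshooting section early), a minimal import check is:
python -c "from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor; print('Qwen2.5-Omni classes available')"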
Qwen Omni Utilities
pip install qwen-omni-utils[decord]
Note: If installation fails, fall back to:
pip install qwen-omni-utils
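Either way, a quick import check confirms that the package (and the process_mm_info helper used later in this guide) is available:
python -c "from qwen_omni_utils import process_mm_info; print('qwen-omni-utils OK')"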
3. Configure FFmpeg
- Download binaries from Gyan.dev.
- Extract the archive and add C:\path\to\ffmpeg\bin to your system PATH.
- Test the installation:
ffmpeg -version
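Note that the PATH change only applies to newly opened terminals. If you also want to confirm from Python that the environment used for inference can see FFmpeg, a small check using the standard library is:
import shutil
# Prints the full path to ffmpeg.exe if it is on PATH, otherwise a hint
print(shutil.which("ffmpeg") or "ffmpeg not found - reopen the terminal after editing PATH")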
4. Download the Model
from transformers import Qwen2_5OmniForConditionalGeneration
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    device_map="auto"
)
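The from_pretrained call above downloads several gigabytes of weights into the Hugging Face cache on first use. If you prefer to fetch the files ahead of time (for example, before going offline), a sketch using huggingface_hub, which is installed alongside transformers, looks like this:
from huggingface_hub import snapshot_download
# Pre-download all files for the model repo into the local Hugging Face cache;
# the later from_pretrained call will reuse them instead of downloading again.
snapshot_download("Qwen/Qwen2.5-Omni-3B")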
Inference Configuration
Memory Optimization
Use BF16 and flash attention for lower VRAM usage (this requires the separate flash-attn package to be installed):
import torch
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
Video Reader Configuration
Set the FORCE_QWENVL_VIDEO_READER environment variable to select the video decoding backend:
set FORCE_QWENVL_VIDEO_READER=decord
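The set command only affects the current terminal session (use setx FORCE_QWENVL_VIDEO_READER decord to persist it for future sessions). Alternatively, the variable can be set from inside a script so the choice travels with the code:
import os
# Set before importing qwen_omni_utils so the decord backend is used for video decoding
os.environ["FORCE_QWENVL_VIDEO_READER"] = "decord"
from qwen_omni_utils import process_mm_info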
Running a Multimodal Example
import torch
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
# Load model and processor
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-3B")
# Define conversation with video input; the system message below is the prompt
# the model card recommends when speech output is desired
conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}],
    },
    {
        "role": "user",
        "content": [{"type": "video", "video": "https://example.com/sample.mp4"}],
    },
]
# Process inputs
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=True
)
inputs = inputs.to(model.device).to(model.dtype)
# Generate text and speech output
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.decode(text_ids[0], skip_special_tokens=True))
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
Troubleshooting Common Issues
1. CUDA Out of Memory
- Trim video duration to ≤15 seconds.
- Use bitsandbytes for 4-bit quantization, if supported (see the sketch below).
- Set use_audio_in_video=False to save memory.
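As a sketch of the bitsandbytes option above, and assuming 4-bit loading is supported for this model class (hence the "if supported" caveat), the quantized load would look roughly like this:
import torch
from transformers import BitsAndBytesConfig, Qwen2_5OmniForConditionalGeneration
# 4-bit NF4 weight quantization via bitsandbytes; compute still runs in BF16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    quantization_config=bnb_config,
    device_map="auto",
)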
2. Dependency Errors
- Error: KeyError: 'qwen2_5_omni'
- Fix: Reinstall transformers from the Qwen-specific branch shown in step 2.
3. Video Input Compatibility
- Use decord for HTTP video URLs.
- For HTTPS URLs, ensure torchvision>=0.19.0 is installed.
Performance Benchmarks
| Task | Qwen2.5-Omni 3B | Qwen2.5-Omni 7B |
| --- | --- | --- |
| 15s Video (BF16) | 18.38 GB* | 31.11 GB* |
| Text-Only Inference | 6–8 GB | 10–12 GB |
*Values represent minimum theoretical usage with flash attention.
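The text-only row assumes no multimodal inputs and no speech synthesis. Reusing the model and processor loaded in the example above, and assuming the return_audio switch described in the model card is available in your transformers build, a text-only call looks roughly like this:
# Text-only prompt: no image, audio, or video inputs, so VRAM use stays far lower
conversation = [
    {"role": "user", "content": [{"type": "text", "text": "Summarize this guide in two sentences."}]}
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt", padding=True).to(model.device)
# return_audio=False skips speech generation entirely and returns only token IDs
text_ids = model.generate(**inputs, return_audio=False)
print(processor.decode(text_ids[0], skip_special_tokens=True))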
Use Cases and Customization
- Voice Selection: Choose between the built-in voices Chelsie and Ethan when generating speech output.
- Enterprise Use: Ideal for tasks like content moderation, real-time transcription, and video summarization.
Limitations and Licensing
- License: Non-commercial use only, under Alibaba’s Qwen Research License.
- Hardware Needs: High-end GPUs are necessary for full multimodal functionality. Cloud APIs may be better suited for low-resource environments.
Conclusion
Qwen2.5-Omni 3B brings advanced multimodal AI capabilities to local setups without requiring massive infrastructure. While setup requires careful attention to dependencies and GPU specs, the model's real-time performance and flexibility make it a powerful tool for researchers and developers alike.