Install Qwen2.5-Omni 3B on Windows

Qwen2.5-Omni 3B is Alibaba Cloud’s compact, multimodal AI model optimized for local deployment on consumer-grade hardware. Unlike the 7B variant, the 3B model significantly reduces VRAM usage—by more than 50%—while maintaining robust performance across text, image, audio, and video tasks.
With real-time output and simultaneous multimodal input support, Qwen2.5-Omni 3B is ideal for building local virtual assistants, media analytics tools, and interactive content engines.
This guide walks you through installing Qwen2.5-Omni 3B on Windows, including dependency management, GPU compatibility, and handling multimodal inputs.
System Requirements
Hardware
- GPU: NVIDIA GPU with ≥24GB VRAM (FP32) or ≥18GB VRAM (BF16 with flash attention, e.g., RTX 3090/4090).
- Note: Real-world VRAM usage typically runs about 1.2x above the theoretical minimums quoted in the benchmarks below (e.g., the ~18.38 GB theoretical figure for a 15-second video in BF16).
- RAM: At least 32GB recommended.
- Storage: Minimum 15GB free for model and dependencies.
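Before installing anything, you can confirm the installed driver version and the VRAM actually available on your card with the standard NVIDIA utility:
nvidia-smi
The output should list your GPU along with its total memory; if the command is not found, install or update the NVIDIA driver first.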
Software
- Operating System: Windows 10 or 11 (64-bit)
- CUDA: Version 12.1 (required for PyTorch compatibility)
- Python: Version 3.10 (via Conda for isolated environment)
- FFmpeg: Required for audio and video processing
Installation Steps
1. Set Up the Environment
Install Conda
Download Miniconda from the official site, install it, then run:
conda create -n qwen python=3.10 -y
conda activate qwen
Install PyTorch with CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
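Before continuing, it is worth confirming that the CUDA-enabled build was installed and that the GPU is visible. A quick one-line check (standard PyTorch calls, nothing Qwen-specific) from the activated environment is:
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
The last value should print True; if it prints False, a common cause is an outdated NVIDIA driver or an accidentally installed CPU-only wheel.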
Install Base Dependencies
pip install sentencepiece bitsandbytes protobuf numpy einops timm pillow soundfile
2. Install Qwen-Specific Packages
Transformers with Qwen2.5-Omni Support
pip uninstall -y transformers
pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
pip install accelerate
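To verify that this transformers build actually registers the Qwen2.5-Omni model classes (and to catch the KeyError: 'qwen2_5_omni' issue from the Troubleshooting section early), a minimal import check is:
python -c "from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor; print('Qwen2.5-Omni classes available')"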
Qwen Omni Utilities
pip install qwen-omni-utils[decord]
Note: If installation fails, fall back to:
pip install qwen-omni-utils
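Either way, a quick import check confirms that the package (and the process_mm_info helper used later in this guide) is available:
python -c "from qwen_omni_utils import process_mm_info; print('qwen-omni-utils OK')"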
3. Configure FFmpeg
- Download binaries from Gyan.dev.
- Extract the archive and add C:\path\to\ffmpeg\bin to your system PATH.
- Test the installation:
ffmpeg -version
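Note that the PATH change only applies to newly opened terminals. If you also want to confirm from Python that the environment used for inference can see FFmpeg, a small check using the standard library is:
import shutil
# Prints the full path to ffmpeg.exe if it is on PATH, otherwise a hint
print(shutil.which("ffmpeg") or "ffmpeg not found - reopen the terminal after editing PATH")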
4. Download the Model
from transformers import Qwen2_5OmniForConditionalGeneration
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    device_map="auto"
)
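The from_pretrained call above downloads several gigabytes of weights into the Hugging Face cache on first use. If you prefer to fetch the files ahead of time (for example, before going offline), a sketch using huggingface_hub, which is installed alongside transformers, looks like this:
from huggingface_hub import snapshot_download
# Pre-download all files for the model repo into the local Hugging Face cache;
# the later from_pretrained call will reuse them instead of downloading again.
snapshot_download("Qwen/Qwen2.5-Omni-3B")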
Inference Configuration
Memory Optimization
Use BF16 and flash attention for lower VRAM usage (this requires the separate flash-attn package to be installed):
import torch
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
Video Reader Configuration
Set the FORCE_QWENVL_VIDEO_READER environment variable to select the video decoding backend:
set FORCE_QWENVL_VIDEO_READER=decord
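The set command only affects the current terminal session (use setx FORCE_QWENVL_VIDEO_READER decord to persist it for future sessions). Alternatively, the variable can be set from inside a script so the choice travels with the code:
import os
# Set before importing qwen_omni_utils so the decord backend is used for video decoding
os.environ["FORCE_QWENVL_VIDEO_READER"] = "decord"
from qwen_omni_utils import process_mm_info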
Running a Multimodal Example
import torch
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
# Load model and processor
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-3B")
# Define conversation with video input; the system message below is the prompt
# the model card recommends when speech output is desired
conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}],
    },
    {
        "role": "user",
        "content": [{"type": "video", "video": "https://example.com/sample.mp4"}],
    },
]
# Process inputs
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=True
)
inputs = inputs.to(model.device).to(model.dtype)
# Generate text and speech output
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.decode(text_ids[0], skip_special_tokens=True))
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
Troubleshooting Common Issues
1. CUDA Out of Memory
- Trim video duration to ≤15 seconds.
- Use bitsandbytes for 4-bit quantization, if supported (see the sketch below).
- Set use_audio_in_video=False to save memory.
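As a sketch of the bitsandbytes option above, and assuming 4-bit loading is supported for this model class (hence the "if supported" caveat), the quantized load would look roughly like this:
import torch
from transformers import BitsAndBytesConfig, Qwen2_5OmniForConditionalGeneration
# 4-bit NF4 weight quantization via bitsandbytes; compute still runs in BF16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    quantization_config=bnb_config,
    device_map="auto",
)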
2. Dependency Errors
- Error: KeyError: 'qwen2_5_omni'
- Fix: Reinstall transformers from the Qwen-specific branch shown in step 2.
3. Video Input Compatibility
- Use decord for HTTP video URLs.
- For HTTPS URLs, ensure torchvision>=0.19.0 is installed.
Performance Benchmarks
| Task | Qwen2.5-Omni 3B | Qwen2.5-Omni 7B |
| --- | --- | --- |
| 15s Video (BF16) | 18.38 GB* | 31.11 GB* |
| Text-Only Inference | 6–8 GB | 10–12 GB |
*Values represent minimum theoretical usage with flash attention.
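The text-only row assumes no multimodal inputs and no speech synthesis. Reusing the model and processor loaded in the example above, and assuming the return_audio switch described in the model card is available in your transformers build, a text-only call looks roughly like this:
# Text-only prompt: no image, audio, or video inputs, so VRAM use stays far lower
conversation = [
    {"role": "user", "content": [{"type": "text", "text": "Summarize this guide in two sentences."}]}
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt", padding=True).to(model.device)
# return_audio=False skips speech generation entirely and returns only token IDs
text_ids = model.generate(**inputs, return_audio=False)
print(processor.decode(text_ids[0], skip_special_tokens=True))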
Use Cases and Customization
- Voice Selection: Choose between the built-in voices Chelsie and Ethan when generating speech output.
- Enterprise Use: Ideal for tasks like content moderation, real-time transcription, and video summarization.
Limitations and Licensing
- License: Non-commercial use only, under Alibaba’s Qwen Research License.
- Hardware Needs: High-end GPUs are necessary for full multimodal functionality. Cloud APIs may be better suited for low-resource environments.
Conclusion
Qwen2.5-Omni 3B brings advanced multimodal AI capabilities to local setups without requiring massive infrastructure. While setup requires careful attention to dependencies and GPU specs, the model's real-time performance and flexibility make it a powerful tool for researchers and developers alike.