Install Qwen2.5-Omni 3B on Windows

Qwen2.5-Omni 3B is Alibaba Cloud’s compact, multimodal AI model optimized for local deployment on consumer-grade hardware. Unlike the 7B variant, the 3B model significantly reduces VRAM usage—by more than 50%—while maintaining robust performance across text, image, audio, and video tasks.

With real-time output and simultaneous multimodal input support, Qwen2.5-Omni 3B is ideal for building local virtual assistants, media analytics tools, and interactive content engines.

This guide walks you through installing Qwen2.5-Omni 3B on Windows, including dependency management, GPU compatibility, and handling multimodal inputs.

System Requirements

Hardware

  • GPU: NVIDIA GPU with ≥24GB VRAM (FP32) or ≥18GB VRAM (BF16 with flash attention, e.g., RTX 3090/4090); a quick VRAM check follows this list.
    • Note: Real-world VRAM usage typically runs about 1.2x above the theoretical minimums quoted in the benchmarks below (e.g., ~18.38GB for a 15s video).
  • RAM: At least 32GB recommended.
  • Storage: Minimum 15GB free for model and dependencies.
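To verify your GPU model and available VRAM before installing anything, you can query the NVIDIA driver directly (nvidia-smi ships with the driver):

nvidia-smi --query-gpu=name,memory.total --format=csv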

Software

  • Operating System: Windows 10 or 11 (64-bit)
  • CUDA: Version 12.1 (the cu121 PyTorch wheels bundle the CUDA runtime, so only an NVIDIA driver recent enough for CUDA 12.1 is needed)
  • Python: Version 3.10 (via Conda for isolated environment)
  • FFmpeg: Required for audio and video processing

Installation Steps

1. Set Up the Environment

Install Conda

Download Miniconda from the official site, install it, then run:

conda create -n qwen python=3.10 -y
conda activate qwen

Install PyTorch with CUDA 12.1

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
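Verify that the CUDA-enabled build is active (the version string should end in +cu121 and the availability check should print True):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"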

Install Base Dependencies

pip install sentencepiece bitsandbytes protobuf numpy einops timm pillow soundfile

2. Install Qwen-Specific Packages

Transformers with Qwen2.5 Support

pip uninstall -y transformers
pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
pip install accelerate
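A quick import check confirms this build includes Qwen2.5-Omni support; if it raises an error, see Troubleshooting below:

python -c "from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor; print('Qwen2.5-Omni support OK')"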

Qwen Omni Utilities

pip install qwen-omni-utils[decord]

Note: The [decord] extra installs the decord video backend used later in this guide. If that installation fails, fall back to the base package (video decoding then uses torchvision):

pip install qwen-omni-utils

3. Configure FFmpeg

  1. Download binaries from Gyan.dev.
  2. Extract and add C:\path\to\ffmpeg\bin to your system PATH.
  3. Test installation:
ffmpeg -version

4. Download the Model

from transformers import Qwen2_5OmniForConditionalGeneration

# Downloads the weights to the Hugging Face cache on first run;
# without torch_dtype, the model loads in FP32 (see the BF16 option below)
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    device_map="auto"
)
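
The matching processor, which handles tokenization and image/audio/video preprocessing, is downloaded the same way; the examples below assume both objects are available:

from transformers import Qwen2_5OmniProcessor

processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-3B")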

Inference Configuration

Memory Optimization

Use BF16 and flash attention for lower VRAM usage:

import torch
from transformers import Qwen2_5OmniForConditionalGeneration

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)

Note: flash_attention_2 requires the separate flash-attn package, which is often difficult to build on Windows; if it is unavailable, drop the attn_implementation argument at the cost of higher VRAM usage.

Video Reader Configuration

Set the FORCE_QWENVL_VIDEO_READER environment variable to select the video decoding backend. In cmd.exe:

set FORCE_QWENVL_VIDEO_READER=decord
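In PowerShell, the equivalent is $env:FORCE_QWENVL_VIDEO_READER = "decord". As a sketch, you can also set it from Python, provided this happens before any video is processed:

import os

# Set before processing any video, since the backend choice is
# read from the environment when video decoding first runs
os.environ["FORCE_QWENVL_VIDEO_READER"] = "decord"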

Running a Multimodal Example

import torch
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# Load model and processor
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-3B")

# Define a conversation with video input; Qwen documents this exact
# system prompt as required for speech output
conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]
    },
    {
        "role": "user",
        "content": [{"type": "video", "video": "https://example.com/sample.mp4"}]
    }
]

# Process inputs; use_audio_in_video must match across the
# preprocessing, processor, and generate calls
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=True
).to(model.device).to(model.dtype)

# Generate text tokens and a 24kHz speech waveform
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)

Troubleshooting Common Issues

1. CUDA Out of Memory

  • Trim video duration to ≤15 seconds.
  • Use bitsandbytes for 4-bit quantization, if supported; see the sketch after this list.
  • Set use_audio_in_video=False to save memory (pass the same value to process_mm_info, the processor, and generate).
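A minimal sketch of a 4-bit load using the standard transformers BitsAndBytesConfig; bitsandbytes support for this multimodal architecture is not guaranteed, so treat it as an experiment:

import torch
from transformers import BitsAndBytesConfig, Qwen2_5OmniForConditionalGeneration

# 4-bit NF4 quantization with BF16 compute; experimental for this model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    quantization_config=bnb_config,
    device_map="auto",
)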

2. Dependency Errors

  • Error: KeyError: 'qwen2_5_omni'
    • Fix: Reinstall transformers from the Qwen2.5-Omni preview tag shown in step 2; standard releases that predate it do not register the qwen2_5_omni architecture.

3. Video Input Compatibility

  • Use decord for HTTP URLs.
  • For HTTPS, ensure torchvision>=0.19.0 is installed.

Performance Benchmarks

Task                  Qwen2.5-Omni 3B   Qwen2.5-Omni 7B
15s Video (BF16)      18.38 GB*         31.11 GB*
Text-Only Inference   6–8 GB            10–12 GB

*Values represent minimum theoretical usage with flash attention.

Use Cases and Customization

  • Voice Selection: Choose between the built-in voices Chelsie and Ethan via the speaker argument of generate(); see the sketch after this list.
  • Enterprise Use: Ideal for tasks like content moderation, real-time transcription, and video summarization.
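A minimal sketch of voice selection, assuming the speaker argument exposed by the Transformers integration and the inputs prepared in the multimodal example above:

# Built-in voices: "Chelsie" and "Ethan"
text_ids, audio = model.generate(**inputs, use_audio_in_video=True, speaker="Ethan")
sf.write("output_ethan.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)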

Limitations and Licensing

  • License: Non-commercial use only, under Alibaba’s Qwen Research License.
  • Hardware Needs: High-end GPUs are necessary for full multimodal functionality. Cloud APIs may be better suited for low-resource environments.

Conclusion

Qwen2.5-Omni 3B brings advanced multimodal AI capabilities to local setups without requiring massive infrastructure. While setup requires careful attention to dependencies and GPU specs, the model's real-time performance and flexibility make it a powerful tool for researchers and developers alike.
