Install Qwen2.5-Omni 3B on macOS

Qwen2.5-Omni 3B is a cutting-edge multimodal AI model developed to handle text, image, audio, and video processing tasks. While macOS lacks the CUDA GPU acceleration available on Linux and Windows, Apple's Metal (MPS) backend and some careful configuration make it possible to run Qwen2.5-Omni 3B locally.
This guide walks you through the complete installation process on macOS, with additional tips to improve performance on Apple Silicon and CPU-based systems.
System Requirements
To ensure smooth operation, check the following prerequisites:
- macOS: Monterey (12.x) or newer
- RAM: Minimum 16GB (32GB recommended)
- Storage: At least 10GB of free disk space
- Recommended Hardware:
  - Apple Silicon (M1/M2/M3) for ARM optimizations
  - Optional eGPU (24GB+ VRAM) for acceleration on Intel Macs (Apple Silicon does not support eGPUs)
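To check your machine against these requirements, here is a quick Python snippet (relying on the stock macOS sysctl tool) that reports the CPU architecture and installed RAM:

import platform, subprocess
# "arm64" indicates Apple Silicon; "x86_64" indicates an Intel Mac
print("Architecture:", platform.machine())
# Total physical memory, read from the macOS sysctl database
mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]))
print(f"Physical RAM: {mem_bytes / 1e9:.0f} GB")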
Step 1: Install Prerequisites
Homebrew
Install Homebrew by running:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Add Homebrew to your shell environment (this path applies to Apple Silicon; on Intel Macs, Homebrew installs to /usr/local, which is already on PATH):
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zshrc
source ~/.zshrc
Python and Tools
Install Python 3.10 and the core build dependencies:
brew install [email protected] cmake ffmpeg
Step 2: Configure Python Environment
Create and activate a virtual environment, then install PyTorch inside it so the packages land in the project environment rather than in Homebrew's global Python:
python3.10 -m venv qwen-env
source qwen-env/bin/activate
pip install --upgrade pip
pip install torch torchvision torchaudio
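To confirm that PyTorch installed correctly and can see Apple's Metal backend (MPS) on Apple Silicon, run a quick check:

import torch
print("PyTorch version:", torch.__version__)
# True on Apple Silicon builds of PyTorch; False means CPU-only execution
print("MPS available:", torch.backends.mps.is_available())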
Step 3: Install Custom Transformers for Qwen
Uninstall any existing transformers library and install the preview branch that adds Qwen2.5-Omni support:
pip uninstall transformers -y
pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
pip install accelerate sentencepiece soundfile einops
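A quick way to verify that the preview branch is active is to import one of the Qwen2.5-Omni classes; the import fails if a stock transformers release without Omni support is still installed:

import transformers
from transformers import Qwen2_5OmniProcessor  # raises ImportError without Omni support
print("transformers version:", transformers.__version__)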
Step 4: Install Qwen-Omni Utilities
Install the toolkit with video decoding support:
pip install "qwen-omni-utils[decord]"
(The quotes keep zsh from expanding the square brackets.)
Note: If the decord installation fails on macOS, use:
pip install qwen-omni-utils
This fallback may be slower for video processing.
Step 5: Download Qwen2.5-Omni 3B Model
Use the huggingface_hub API to download the model from Python:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="Qwen/Qwen2.5-Omni-3B", local_dir="qwen-3b")
Alternatively, download it manually from the Hugging Face Hub.
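Once the download finishes, you can sanity-check the snapshot from Python; the 3B checkpoint occupies several gigabytes:

from pathlib import Path
# Sum the sizes of all files under the local model directory
total = sum(f.stat().st_size for f in Path("qwen-3b").rglob("*") if f.is_file())
print(f"Downloaded {total / 1e9:.1f} GB to qwen-3b/")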
Step 6: Run Inference with Qwen2.5-Omni
Create an inference.py file with the following code:
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
# Load model and processor
model = Qwen2_5OmniForConditionalGeneration.from_pretrained("qwen-3b", device_map="auto", torch_dtype=torch.float16)
processor = Qwen2_5OmniProcessor.from_pretrained("qwen-3b")
# Prepare a chat-style input (replace [img_path] with the path to an image)
conversation = [{"role": "user", "content": [
    {"type": "image", "image": "[img_path]"},
    {"type": "text", "text": "Describe this image."}]}]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True)
inputs = inputs.to(model.device).to(model.dtype)
# Generate a text-only response and print it
text_ids = model.generate(**inputs, return_audio=False, max_new_tokens=128)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
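Run the script from the activated virtual environment with python inference.py. The first run has to load several gigabytes of weights, so expect a noticeable startup delay before output appears.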
Performance Optimization Tips
Memory Management
- Apple Silicon Optimization: Use torch_dtype=torch.bfloat16 when supported.
- Device Offloading: Use device_map="auto" to split workloads across available devices.
- Quantization for Low RAM:
model = Qwen2_5OmniForConditionalGeneration.from_pretrained("qwen-3b", load_in_4bit=True)
Note that load_in_4bit depends on bitsandbytes, which targets CUDA GPUs and generally does not work on macOS; an MPS-based alternative is sketched below.
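As that MPS-based alternative, here is a minimal sketch that loads the model in half precision on Apple's Metal backend when it is available:

import torch
from transformers import Qwen2_5OmniForConditionalGeneration
# Prefer the Metal (MPS) backend on Apple Silicon; fall back to CPU otherwise
device = "mps" if torch.backends.mps.is_available() else "cpu"
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "qwen-3b", torch_dtype=torch.float16
).to(device)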
Video Processing
- Short Clips: Limit video input to under 15 seconds for stability.
- Torchvision Backend: Force the torchvision video reader:
export FORCE_QWENVL_VIDEO_READER=torchvision
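The same selection can be made from Python, assuming the variable is set before qwen_omni_utils processes any video:

import os
# Must be set before qwen_omni_utils picks a video reader
os.environ["FORCE_QWENVL_VIDEO_READER"] = "torchvision"
from qwen_omni_utils import process_mm_info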
Troubleshooting Common Errors
- KeyError: 'qwen2_5_omni' → Reinstall the correct transformers branch.
- Video Load Failures → Update torchvision to at least version 0.19.0.
- Memory Overflow → Reduce input size or set a max_length value in generate().
Advanced Deployment Options
Ollama (Optional)
Use Ollama for a managed local LLM runtime:
brew install --cask ollama
ollama pull qwen2.5-omni-3b
⚠️ Qwen2.5-Omni may not be available in the official Ollama library, and you may need to configure custom templates for compatibility.
vLLM Server (Experimental)
Clone and run a custom vLLM fork:
git clone -b qwen2_omni_public https://github.com/fyabc/vllm.git
cd vllm && pip install -e .
python -m vllm.entrypoints.api_server --model Qwen/Qwen2.5-Omni-3B
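Assuming the fork keeps vLLM's simple /generate endpoint on the default port 8000, you can smoke-test the server from Python:

import requests
# Minimal request against vLLM's legacy /generate API
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Hello, Qwen!", "max_tokens": 32},
)
print(resp.json())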
Practical Use Cases
1. Voice-Based Chatbot
This reuses the model and processor from inference.py; "Ethan" is one of the model's built-in voices, and note that official examples also prepend Qwen's default system prompt when generating speech:
import soundfile as sf
conversation = [{"role": "user", "content": [{"type": "text", "text": "Speak a welcome message."}]}]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt", padding=True).to(model.device)
text_ids, audio = model.generate(**inputs, speaker="Ethan")
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
2. Video Summarization
Build a conversation that includes a video entry, then run it through the same process_mm_info and generate flow as in inference.py:
conversation = [{"role": "user", "content": [{"type": "video", "video": "[video_url]"}, {"type": "text", "text": "Summarize this video."}]}]
Limitations to Consider
- Slow Inference on CPU: Expect under 1 token/sec without GPU acceleration.
- License Restrictions: The 3B model is released under the Qwen Research License, which restricts commercial use; check the model card before deploying.
- Missing Dependencies: Audio tasks require the soundfile package.
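To put a number on throughput for your own machine, you can time a short generation using the model and inputs from the inference example:

import time
start = time.time()
out = model.generate(**inputs, return_audio=False, max_new_tokens=32)
# Count only the newly generated tokens, not the prompt
new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / (time.time() - start):.2f} tokens/sec")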
Conclusion
Installing and running Qwen2.5-Omni 3B on macOS is entirely feasible with the right configuration, even without powerful GPUs.
By following the steps outlined above, from setting up the Python environment to installing the custom libraries and tuning performance through quantization and precision settings, you can leverage this powerful multimodal AI model for local experimentation and prototyping.