Comprehensive Guide to Setting Up the Qwen2.5-1M Model on Windows
Deploying the Qwen2.5-1M model locally on a Windows machine may seem complex due to its advanced features and hardware requirements. This guide provides a detailed, step-by-step approach to setting up Qwen2.5-1M, enabling users to leverage its cutting-edge capabilities in natural language processing and machine learning.
What is Qwen2.5-1M?
The Qwen2.5-1M model is a powerful language model developed by Alibaba's Qwen team. It supports a context window of up to 1 million tokens, and with advanced features like Dual Chunk Attention it excels across a range of NLP and ML tasks. The model comes in two primary configurations:
- Qwen2.5-7B-Instruct-1M
- Qwen2.5-14B-Instruct-1M
Each configuration has significant VRAM requirements, making it essential to ensure your system can handle the load for optimal performance.
Prerequisites for Installation
Before you begin, make sure your system meets the following hardware and software requirements:
Hardware Requirements
- GPU: Recommended GPU architecture is either Ampere or Hopper for best performance.
- VRAM:
  - Qwen2.5-7B-Instruct-1M: Minimum of 120GB total across GPUs.
  - Qwen2.5-14B-Instruct-1M: Minimum of 320GB total across GPUs.
Software Requirements
- Operating System: Windows 10 or later.
- CUDA Version: 12.1 or 12.3.
- Python Version: Between 3.9 and 3.12.
Step-by-Step Installation Process
Step 1: Install CUDA
CUDA is necessary for utilizing the GPU capabilities of your system. Follow these steps:
- Go to the NVIDIA CUDA Toolkit page.
- Select your operating system (Windows) and download the appropriate installer for CUDA version 12.1 or 12.3.
- Complete the installation as per the on-screen instructions.
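Once the installer finishes, you can confirm the toolkit is reachable from a terminal. Below is a minimal Python sketch; it assumes the installer added nvcc to your PATH, which the default installation does:
import subprocess

# nvcc ships with the CUDA toolkit; its banner reports the installed version
subprocess.run(["nvcc", "--version"], check=True)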
Step 2: Install Python
Ensure you have a compatible version of Python:
- Download Python from the official website.
- Choose a version between 3.9 and 3.12.
- Run the installer and make sure to check the option to Add Python to PATH.
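After installation, a quick check confirms the interpreter is in the supported range:
import sys

# Qwen2.5-1M's tooling supports Python 3.9 through 3.12
assert (3, 9) <= sys.version_info[:2] <= (3, 12), sys.version
print("Python", sys.version.split()[0], "is in the supported range")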
Step 3: Install Git
Git is required to clone repositories. If it's not already installed, follow these steps:
- Download Git from git-scm.com.
- Follow the provided installation steps.
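As a quick sanity check, you can confirm Git is on your PATH:
import subprocess

# Prints something like "git version 2.x" if Git is installed correctly
subprocess.run(["git", "--version"], check=True)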
Step 4: Clone the vLLM Repository
Clone the necessary repository and install it in editable mode by running:
git clone -b dev/dual-chunk-attn git@github.com:QwenLM/vllm.git
cd vllm
pip install -e . -v
If you don't have SSH keys configured for GitHub, clone over HTTPS instead: git clone -b dev/dual-chunk-attn https://github.com/QwenLM/vllm.git.
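The editable install compiles vLLM from source and can take some time. Once it completes, you can verify the package imports cleanly (a minimal check, assuming the install finished without errors):
import vllm

# Confirms the editable install is importable and shows which build you got
print("vLLM version:", vllm.__version__)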
Step 5: Install Additional Dependencies
To run Qwen2.5-1M efficiently, install the following dependencies:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install transformers
If you're using CUDA 12.3, replace cu121 with cu123.
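You can then verify that PyTorch was built against CUDA and can see your GPUs. A minimal sketch, assuming at least one visible device:
import torch

print("CUDA available:", torch.cuda.is_available())
print("Torch built for CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    # Index 0 is the first visible GPU
    print("Device 0:", torch.cuda.get_device_name(0))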
Step 6: Set Up Environment Variables
To configure your system to recognize CUDA, follow these steps:
- Right-click "This PC" and choose "Properties."
- Click "Advanced system settings" and then "Environment Variables."
- Add a new system variable:
  - Variable name: CUDA_HOME
  - Variable value: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.X (replace v12.X with your installed CUDA version).
- Add C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.X\bin to your Path variable.
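To confirm the variables are visible, open a new terminal (existing sessions won't pick up the change) and run a quick check:
import os, shutil

print("CUDA_HOME:", os.environ.get("CUDA_HOME"))
# shutil.which searches the Path variable the same way the shell does
print("nvcc on Path:", shutil.which("nvcc"))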
Step 7: Start the OpenAI-Compatible API Service
Once the environment is set up, launch the API service with the following command:
vllm serve Qwen/Qwen2.5-7B-Instruct-1M \
--tensor-parallel-size 4 \
--max-model-len 1010000 \
--enable-chunked-prefill --max-num-batched-tokens 131072 \
--enforce-eager --max-num-seqs 1
Parameter Explanations:
- --tensor-parallel-size: Set this to the number of GPUs you are using (up to 4 for the 7B model and 8 for the 14B model).
- --max-model-len: Sets the maximum input sequence length; reduce this value if memory issues arise.
- --max-num-batched-tokens: Controls the chunk size in Chunked Prefill (recommended: 131072).
- --max-num-seqs: Limits the number of concurrent sequences processed.
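Before sending a full request, you can confirm the server is up by querying its model list endpoint. This sketch uses only the standard library and assumes the default port 8000:
import json
from urllib.request import urlopen

# The OpenAI-compatible server exposes the models it is serving at /v1/models
with urlopen("http://localhost:8000/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))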
Step 8: Test the Installation
To confirm that everything is working, you can test with a simple chat completion request using Python:
from openai import OpenAI

# vLLM accepts any placeholder key unless the server was started with --api-key
client = OpenAI(base_url='http://localhost:8000/v1/', api_key='your_api_key')

response = client.chat.completions.create(
    messages=[
        {'role': 'user', 'content': 'Hello! How can I use Qwen?'}
    ],
    # Must match the model name passed to vllm serve
    model='Qwen/Qwen2.5-7B-Instruct-1M',
)

print("Response:", response.choices[0].message.content)
Replace 'your_api_key' with a valid key only if you started the server with the --api-key option; otherwise any placeholder string works.
Common Troubleshooting Tips
VRAM Issues
If you encounter VRAM-related errors, try reducing --max-model-len or increasing --tensor-parallel-size so the model weights are split across more GPUs.
API Connection Issues
Ensure the API is running at http://localhost:8000. If you face connection issues, check your firewall settings and confirm the service is active.
Conclusion
This guide provides the essential steps to deploy Qwen2.5-1M on Windows. By following the outlined steps, you'll be able to utilize this powerful model for advanced language processing tasks. Keep up-to-date with future improvements from Alibaba’s Qwen team to maximize performance and capabilities.
For those who prefer a macOS setup, you can refer to our dedicated guide on setting up Qwen2.5-1M on Mac for detailed instructions on the process for Apple devices.