Set Up the Qwen2.5-1M Model Locally on Ubuntu/Linux
To set up the Qwen2.5-1M model locally on Ubuntu/Linux, follow this step-by-step guide. It covers system requirements, installing dependencies, launching the model, and troubleshooting common issues.
System Requirements
Before you begin the installation process, ensure your system meets the following requirements for optimal performance:
- GPU Requirements:
  - Qwen2.5-7B-Instruct-1M: at least 120GB VRAM (total across GPUs).
  - Qwen2.5-14B-Instruct-1M: at least 320GB VRAM (total across GPUs).
  - Recommended GPU architectures: Ampere or Hopper.
- Software Requirements:
  - CUDA version: 12.1 or 12.3.
  - Python version: between 3.9 and 3.12.
If your GPUs do not meet the VRAM requirements, you can still use the Qwen2.5-1M models for tasks with shorter context lengths; the figures above apply when processing sequences approaching the full 1M-token context.
Step 1: Install Dependencies
To run the Qwen2.5-1M model, you need to clone the vLLM repository from its custom dev/dual-chunk-attn branch and install it from source. Follow these steps:
- Open your terminal.
- Clone the vLLM repository from the custom branch:
git clone -b dev/dual-chunk-attn git@github.com:QwenLM/vllm.git
- Navigate to the cloned directory:
cd vllm
- Install the package in editable mode:
pip install -e . -v
This will set up the required environment for running the model.
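To confirm the custom build is actually the one on your path, you can run a quick import check; this is an optional sanity check, and the exact version string will depend on the branch you built:
```bash
# Verify that vLLM imports from the freshly installed build
python -c "import vllm; print(vllm.__version__)"
```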
Step 2: Launching the Model
Once you have installed all dependencies, you can launch the Qwen2.5-1M model using an OpenAI-compatible API service. Use the following command to start the service, adjusting it based on your hardware setup:
vllm serve Qwen/Qwen2.5-7B-Instruct-1M \
--tensor-parallel-size 4 \
--max-model-len 1010000 \
--enable-chunked-prefill --max-num-batched-tokens 131072 \
--enforce-eager \
--max-num-seqs 1
Parameter Explanations
- --tensor-parallel-size: Set this to the number of GPUs you are using (maximum of 4 for the 7B model and 8 for the 14B model).
- --max-model-len: Defines the maximum input sequence length; reduce this value if you encounter Out of Memory issues.
- --max-num-batched-tokens: Sets the chunk size in Chunked Prefill; a smaller value reduces activation memory usage but may slow down inference.
- --max-num-seqs: Limits the number of sequences processed concurrently.
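For reference, a launch command for the larger 14B model might look like the sketch below. This is an illustrative adaptation of the 7B command above (same flags, tensor parallelism raised to 8 GPUs), not an officially published configuration, so adjust the values to your hardware:
```bash
# Illustrative: serving Qwen2.5-14B-Instruct-1M across 8 GPUs
vllm serve Qwen/Qwen2.5-14B-Instruct-1M \
  --tensor-parallel-size 8 \
  --max-model-len 1010000 \
  --enable-chunked-prefill --max-num-batched-tokens 131072 \
  --enforce-eager \
  --max-num-seqs 1
```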
Optional Settings
You may also enable FP8 quantization for model weights to reduce memory usage by adding --quantization fp8 to your command, as shown below.
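For example, the flag can simply be appended to the 7B launch command from Step 2 (this assumes your GPUs and vLLM build support FP8 quantization):
```bash
# Same 7B launch command with FP8 weight quantization enabled
vllm serve Qwen/Qwen2.5-7B-Instruct-1M \
  --tensor-parallel-size 4 \
  --max-model-len 1010000 \
  --enable-chunked-prefill --max-num-batched-tokens 131072 \
  --enforce-eager \
  --max-num-seqs 1 \
  --quantization fp8
```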
Step 3: Testing Your Setup
After launching the model, it's crucial to test if everything is functioning correctly. You can do this by sending a sample request to your local server using a tool like curl or Postman.
Example Request
Using curl, you can send a request like this:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct-1M",
"messages": [{"role": "user", "content": "Hello, how can I set up Qwen locally?"}]
}'
If everything is set up correctly, you should receive a response from the model.
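If you are unsure whether the server is reachable at all, you can also list the models it exposes; this assumes the default port 8000 used in the launch command above:
```bash
# List the models served by the local OpenAI-compatible endpoint
curl http://localhost:8000/v1/models
```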
Troubleshooting Common Issues
While setting up and running Qwen2.5-1M, you may encounter some common issues:
Out of Memory Errors
If you experience Out of Memory (OOM) errors:
- Reduce --max-model-len.
- Decrease --max-num-batched-tokens.
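As a sketch, a lower-memory launch might look like the following; the reduced values are illustrative, not recommended settings, so tune them to the VRAM you actually have:
```bash
# Illustrative lower-memory launch: shorter context and smaller prefill chunks
vllm serve Qwen/Qwen2.5-7B-Instruct-1M \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --enable-chunked-prefill --max-num-batched-tokens 32768 \
  --enforce-eager \
  --max-num-seqs 1
```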
Installation Issues
If there are problems during installation:
- Ensure that your Python and CUDA versions meet the specified requirements.
- Check that all dependencies are properly installed.
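A quick way to verify the environment is to print the relevant versions; the expected outputs noted in the comments reflect the requirements listed earlier (nvcc is only available if the CUDA toolkit is installed):
```bash
# Check Python, CUDA toolkit, and driver/GPU visibility
python3 --version   # should report 3.9 - 3.12
nvcc --version      # should report CUDA 12.1 or 12.3
nvidia-smi          # should list all GPUs you intend to use
```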
Performance Optimization
For better performance:
- Ensure your system has enough total VRAM for the context lengths you plan to process.
- Tune the launch parameters above (for example, --tensor-parallel-size and --max-num-batched-tokens) to match your hardware.
Conclusion
Setting up the Qwen2.5-1M model locally on Ubuntu/Linux involves careful preparation and attention to system requirements and dependencies. By following this guide, you should be able to successfully deploy and test your own instance of this powerful language model, capable of processing long context lengths of up to one million tokens.
For setting up the Qwen2.5-1M model on macOS, refer to our detailed guide.
This concludes our detailed guide on setting up Qwen2.5-1M locally on Ubuntu/Linux. For further assistance or advanced configurations, refer to community forums or documentation related to Qwen models and vLLM usage.