Run DeepSeek-VL2 on Windows: Installation Guide

DeepSeek AI has rapidly gained prominence as a Chinese AI lab whose models rival OpenAI's ChatGPT. Its open-source reasoning model, DeepSeek R1, is released under the permissive MIT License, making it freely usable for both personal and professional projects.

Why DeepSeek-VL2 Matters

As the first open-source MoE (Mixture of Experts) vision-language model with MIT licensing, DeepSeek-VL2 offers:

  • Multimodal Understanding: Processes images (JPG/PNG) and text simultaneously
  • Commercial Flexibility: MIT license enables enterprise deployment
  • Efficiency: 3.37B to 27.5B parameter variants balance performance/resource use
  • Local Operation: Run offline after initial setup

Minimum System Requirements:

  • Operating System: Windows 10 or later
  • CPU: Multi-core processor (Quad-core or higher recommended)
  • GPU: High-performance GPU (NVIDIA with CUDA support is typically required for AI tasks)
  • RAM: Minimum 8GB (16GB or more recommended for AI-related tasks)
  • Storage: SSD with at least 50GB free space (more space required for handling large datasets)
  • Software Dependencies:
    • Python (for model training and scripting)
    • CUDA Toolkit (for GPU acceleration)
    • Required deep learning libraries (e.g., TensorFlow, PyTorch)
  • Critical Notes (see the environment check sketch after this list):
    • NVIDIA GPUs require CUDA Toolkit 11.8 or later
    • Enable WSL2 if you plan to run Linux-containerized AI workloads
    • 64-bit Windows is mandatory for Ollama compatibility
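
Before installing anything model-specific, it helps to confirm that Python can actually see your GPU. The following is a minimal sanity-check sketch; it assumes PyTorch is already installed and only reports CUDA availability and the VRAM of the first GPU.

import platform
import torch  # assumes PyTorch is already installed

# Report basic platform and GPU information before installing model packages
print(f"OS: {platform.system()} {platform.release()} ({platform.machine()})")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
    print(f"CUDA build: {torch.version.cuda}")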

Installation

Using Ollama

Ollama simplifies the installation process and removes the need for a cloud subscription.

  1. Download Ollama: Visit the official Ollama website and download the Windows installer.
  2. Install Ollama: Double-click the installer and follow the on-screen prompts. Ensure there is at least 4GB of free storage space before proceeding.
  3. Open PowerShell: Once installed, open PowerShell on your computer (the commands below use PowerShell syntax).

Start Ollama: Enter the following commands to start the application with debug logging enabled:

$env:OLLAMA_DEBUG="1"
& "ollama app.exe"

Directory Information:

  • Data and Logs: %LOCALAPPDATA%\Ollama
  • Program Binaries: %LOCALAPPDATA%\Programs\Ollama
  • Models and Settings: %HOMEPATH%\.ollama

Accessing DeepSeek on the Web

If you prefer not to download the software, you can access DeepSeek on the web:

  1. Visit the Website: Go to DeepSeek Chat.
  2. Registration: Register using your email address or Google account. Note that new registrations may be temporarily paused due to security issues.
  3. Interact with the Chatbot: Once registered, you can start interacting with the chatbot.

Step-by-Step Guide for Inference of DeepSeek Models

This guide explains how to deploy the DeepSeek model using the vLLM framework.

Prerequisites

  • Python Environment: Ensure you have Python installed (preferably Python 3.8 or later).

Install Required Packages: Install the required libraries using pip:

pip install vllm==0.6.6.post1
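
With vLLM installed, you can run a quick offline generation to confirm the setup. The snippet below is only a sketch: the model identifier deepseek-ai/deepseek-vl2-tiny, the trust_remote_code flag, and multimodal support all depend on your vLLM version and are assumptions here.

# Offline text generation with vLLM (model id and flags are assumptions)
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/deepseek-vl2-tiny",  # assumed Hugging Face model id
    trust_remote_code=True,                 # DeepSeek repos ship custom model code
    gpu_memory_utilization=0.85,
)

sampling = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
outputs = llm.generate(["Briefly describe what DeepSeek-VL2 can do."], sampling)
print(outputs[0].outputs[0].text)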

Installation Methods Compared

Method 1: Ollama (Beginner-Friendly)

  1. Download Windows installer from Ollama Official Site
  2. Select which GPU Ollama uses: setx CUDA_VISIBLE_DEVICES "0" (Admin CMD)
  3. Verify installation: ollama list

Launch with debugging:

$env:OLLAMA_DEBUG="1"; ollama run deepseek-vl2-tiny

Pros: One-click setup, automatic updates
Cons: Limited model customization
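
Once a model is running under Ollama, it can also be queried programmatically: Ollama exposes a local REST API on port 11434. A minimal Python sketch follows, assuming the requests package is installed and that a deepseek-vl2-tiny tag exists locally (the tag name is an assumption):

# Query a locally running Ollama model over its REST API
import base64
import requests

with open("example.png", "rb") as f:      # any local test image
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-vl2-tiny",   # assumed local model tag
        "prompt": "Describe this image.",
        "images": [image_b64],          # Ollama accepts base64-encoded images
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])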

Method 2: Manual Setup (Advanced Users)

# Create isolated environment
python -m venv deepseek_env
.\deepseek_env\Scripts\activate

# Install core dependencies (PyTorch CUDA 12.1 build)
pip install torch==2.3.0+cu121 -f https://download.pytorch.org/whl/torch_stable.html
pip install vllm==0.6.6.post1

# Install the deepseek_vl2 package from the official repository
git clone https://github.com/deepseek-ai/DeepSeek-VL2
cd DeepSeek-VL2
pip install -e .[gradio]

Method 3: Docker Containers (Production)

FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
RUN apt-get update && apt-get install -y python3.11 python3-pip && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install -r requirements.txt

DeepSeek-VL2 Code Implementation

To implement DeepSeek-VL2, follow these steps:

Installation

Clone the official DeepSeek-VL2 repository, then install the package with its Gradio extras from the repository root using pip:

pip install -e .[gradio]

Python Code

Example usage of DeepSeek-VL2 in Python:

import torch
from transformers import AutoModelForCausalLM
from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images

# specify the path to the model
model_path = "deepseek-ai/deepseek-vl2-tiny"
vl_chat_processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

Gradio Demo

To run the Gradio demo, use the following command (the inline CUDA_VISIBLE_DEVICES prefix is Unix-shell syntax; in Windows PowerShell, set $env:CUDA_VISIBLE_DEVICES on a separate line first):

CUDA_VISIBLE_DEVICES=2 python web_demo.py \
    --model_name "deepseek-ai/deepseek-vl2-tiny" \
    --port 37914

Vision-Language Workflow: Step-by-Step

1. Image Preprocessing

from deepseek_vl2.utils.io import load_pil_images

conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\nAnalyze this medical scan",
        "images": ["./patient_scan.png"]
    },
    {"role": "<|Assistant|>", "content": ""}
]

# load_pil_images reads every image path referenced in the conversation
pil_images = load_pil_images(conversation)

2. Multimodal Encoding

processor = DeepseekVLV2Processor.from_pretrained("deepseek-ai/deepseek-vl2-small")
inputs = processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True
).to("cuda")

3. Expert Model Inference

model = DeepseekVLV2ForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-vl2-small",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Fuse the processed text and image inputs into embeddings before generation
inputs_embeds = model.prepare_inputs_embeds(**inputs)

outputs = model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.8
)
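
4. Decoding the Output

To turn the generated token ids back into text, decode them with the tokenizer attached to the processor. A short sketch that reuses the objects created in the previous steps:

# Decode the generated token ids into a readable answer
answer = processor.tokenizer.decode(
    outputs[0].cpu().tolist(), skip_special_tokens=True
)
print(answer)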

Performance Optimization Tips

  1. VRAM Management:
    • 7B Model: Requires 20GB+ VRAM
    • 20B Model: Needs 40GB+ VRAM
    • Use --chunk_size 512 for memory-constrained systems

Batch Processing:

from vllm import LLM
llm = LLM(model="deepseek-ai/deepseek-vl2-small", max_num_seqs=8, gpu_memory_utilization=0.85, trust_remote_code=True)

Quantization (FP16 → INT8):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the model in 8-bit via bitsandbytes instead of full FP16
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)
)

Enterprise Deployment Checklist

  1. Security:
    • Ollama serves plain HTTP on 127.0.0.1:11434 by default; terminate TLS with a reverse proxy (e.g., nginx or Caddy) before exposing it beyond localhost
    • Enforce API rate limits at the proxy layer (for example, 10 requests/second)

CI/CD Pipeline:

# GitHub Actions Example
- name: DeepSeek Model Test
  run: |
    ollama pull deepseek-vl2-tiny
    pytest vision_tests/
  env:
    OLLAMA_HOST: 127.0.0.1
    CUDA_VISIBLE_DEVICES: 0

Monitoring:

# Show models currently loaded into memory
ollama ps
# Tail the server log on Windows (PowerShell)
Get-Content "$env:LOCALAPPDATA\Ollama\server.log" -Tail 50
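
For programmatic monitoring, Ollama's local API reports which models are currently loaded. A small Python sketch, assuming the default endpoint and the requests package:

# List models currently loaded by the local Ollama server
import requests

resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()
for m in resp.json().get("models", []):
    print(m.get("name"), m.get("size_vram"))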

Troubleshooting Common Issues

Problem: CUDA Out-of-Memory Error
Solution:

# Use a smaller variant (e.g., deepseek-vl2-tiny) and lower-resolution input images
# Cap generation length to reduce peak memory during inference
outputs = model.generate(inputs_embeds=inputs_embeds,
                         attention_mask=inputs.attention_mask,
                         max_new_tokens=256)

Problem: Ollama Connection Refused
Fix:

netsh advfirewall firewall add rule name="Ollama Port" dir=in action=allow protocol=TCP localport=11434

Problem: Slow Inference Speed
Optimizations:

  • Update NVIDIA drivers to a recent release (version 550 or later)
  • Select "Prefer maximum performance" in the NVIDIA Control Panel power settings (persistence mode via nvidia-smi -pm 1 is Linux-only)
  • Enable flash attention in Ollama: setx OLLAMA_FLASH_ATTENTION 1

Future-Proofing Your Setup

  1. Hardware Upgrades:
    • Newer NVIDIA GPUs (e.g., the RTX 5090, released in early 2025) for substantially higher FP8 throughput
    • PCIe 5.0 SSDs for faster model loading

Edge Deployment:

torch.onnx.export(model, inputs, "deepseek-vl2.onnx", opset_version=18)
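
If the export succeeds, the graph can be loaded with ONNX Runtime. This is only a sketch: exporting a multimodal MoE model to a single ONNX file is non-trivial in practice, and the file name and execution providers below are assumptions.

# Load the exported graph with ONNX Runtime and inspect its input signature
import onnxruntime as ort

session = ort.InferenceSession(
    "deepseek-vl2.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)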

Model Updates:

ollama pull deepseek-vl2-2024Q3

How to Use DeepSeek

To start using DeepSeek with Ollama:

  1. Open a Terminal: Open Windows Terminal, PowerShell, or Command Prompt.
  2. Run the Command: Type the following command and press Enter. DeepSeek R1 will start, and you can interact with the model directly in the terminal:

ollama run deepseek-r1:8b
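
The same model can also be driven from Python via the official ollama package (pip install ollama); the sketch below assumes you have already pulled the deepseek-r1:8b tag:

# Chat with a pulled model through the ollama Python client
import ollama

response = ollama.chat(
    model="deepseek-r1:8b",
    messages=[{"role": "user", "content": "Summarize what DeepSeek R1 is good at."}],
)
print(response["message"]["content"])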

Important Considerations

  • The provided demo implementation is basic and may result in slower performance.
  • For production environments, consider using optimized deployment solutions like vllm, sglang, or lmdeploy for faster response times and better cost efficiency.

By following these instructions and utilizing the code examples, you can effectively run and implement DeepSeek-VL2 on a Windows environment.
