Run Qwen3-VL-4B Locally with Transformers: Step-by-Step Developer Guide
A complete developer guide to loading and running Qwen3-VL-4B locally using the HuggingFace Transformers library — including quantization, multi-image inputs, and video frame inference.
Running Qwen3-VL-4B locally with HuggingFace Transformers gives developers full programmatic control over the model — from dtype selection and quantization to custom preprocessing pipelines and batch inference. Unlike Ollama, which abstracts the model behind an API server, the Transformers library exposes every layer of the inference stack. This guide walks you through the exact steps to load, run, and optimize Qwen3-VL-4B on your own hardware.
What Is Qwen3-VL-4B and Why Use HuggingFace Transformers?
Qwen3-VL-4B is a 4-billion-parameter vision-language model (VLM) from Alibaba's Qwen team. It processes images, multi-image sequences, and video frames alongside text, making it practical for OCR, document understanding, UI automation, and visual question answering. The 4B size sits in the sweet spot: capable enough for real tasks, small enough to run on a single consumer GPU.
Two common local deployment paths exist: Ollama and HuggingFace Transformers. Ollama is simpler — one command and you have an API server — but it gives you no control over quantization format, batch configuration, or preprocessing. Transformers requires more setup but lets you:
- Choose your own dtype (float16, bfloat16, or 4-bit via BitsAndBytes)
- Integrate the model directly into Python pipelines without an HTTP layer
- Control memory layout with device_map
- Access hidden states, attention weights, and logits directly
If you need the model as a component in a larger Python application — not a standalone chat server — Transformers is the right tool. For a broader overview of the model's capabilities and setup prerequisites, see our Qwen3-VL-4B setup guide and hardware requirements.
Hardware Requirements for Running Qwen3-VL-4B Locally
Before writing a single line of code, verify your hardware meets the minimum requirements. The figures below are approximate and will vary depending on input image size and sequence length.
- GPU (float16): 8 GB VRAM minimum — RTX 3080 / RTX 4070 or better. Approximately 10–25 tokens/sec.
- GPU (4-bit quantized): 4–6 GB VRAM — RTX 3060 / RTX 4060. Approximately 6–15 tokens/sec.
- CPU only (float32): 16 GB RAM minimum, 32 GB recommended. Approximately 0.5–2 tokens/sec.
- Apple Silicon (MPS): 16 GB unified memory — M2 Pro / M3 Max or better. Approximately 4–10 tokens/sec.
CPU-only note: If you have no compatible GPU, the model will run on CPU using device_map="cpu". Inference will be slow — expect several minutes per response for complex images — but fully functional. Use torch.float32 on CPU; float16 is not natively supported on most CPUs without explicit hardware acceleration.
Setting Up Your Python Environment
Use a virtual environment to keep dependencies isolated. Python 3.10 or 3.11 is recommended — newer patch versions are fine, but Python 3.12 may have compatibility issues with some CUDA libraries.
python -m venv qwen3vl-env
source qwen3vl-env/bin/activate # On Windows: qwen3vl-env\Scripts\activate
Install the required packages. Note qwen-vl-utils specifically — this package handles vision preprocessing, converting images and video frames into the tensor format the model expects. It is a required dependency, not optional.
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate
pip install qwen-vl-utils
pip install pillow requests
If you plan to use 4-bit quantization (covered in a later section), also install BitsAndBytes:
pip install bitsandbytes
Verify your CUDA installation matches your PyTorch build before proceeding:
import torch
print(torch.cuda.is_available()) # Should print True on GPU machines
print(torch.version.cuda) # Should match your driver's CUDA version
Loading Qwen3-VL-4B with HuggingFace Transformers
The model is hosted on the HuggingFace Hub. The first load will download approximately 8 GB of model weights — they are cached locally after that, so subsequent loads are fast.
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
MODEL_ID = "Qwen/Qwen3-VL-4B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID)
# Note: if AutoModelForImageTextToText fails with an unrecognized model error,
# check the model card on HuggingFace Hub for the correct class name.
model = AutoModelForImageTextToText.from_pretrained(
MODEL_ID,
torch_dtype=torch.float16,
device_map="auto",
)
model.eval()
device_map="auto" lets Accelerate distribute layers across available GPUs, or fall back to CPU if no GPU is detected. model.eval() switches the model into inference mode, disabling dropout and other training-only behaviors — always call this before inference.
Choosing the Right torch_dtype
- float16: Best for NVIDIA GPUs with Tensor Cores (Turing architecture and later). Uses approximately 8 GB VRAM for this model.
- bfloat16: Preferred on Ampere+ (RTX 3000 series and above) and Apple Silicon (MPS). Better numerical stability than float16 for some workloads.
- float32: CPU fallback only. Doubles memory usage compared to float16 — use only when a GPU is unavailable.
- auto: Pass torch_dtype="auto" to let from_pretrained decide based on the model config. Safe default when you are unsure.
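The guidance above can be condensed into a small helper. This sketch returns the dtype name as a string so it stays framework-agnostic (map it to a real dtype with getattr(torch, name) at load time); the function name and the capability threshold of 8 for Ampere are illustrative conventions, not part of the Transformers API:

```python
def pick_dtype_name(device: str, cuda_capability_major: int = 0) -> str:
    """Heuristic dtype choice mirroring the list above.

    device: "cuda", "mps", or "cpu".
    cuda_capability_major: the first element of
    torch.cuda.get_device_capability(); Ampere (RTX 3000 series)
    and newer GPUs report 8 or higher.
    """
    if device == "cuda":
        # bfloat16 on Ampere and later, float16 on older Tensor Core GPUs
        return "bfloat16" if cuda_capability_major >= 8 else "float16"
    if device == "mps":
        # Apple Silicon: bfloat16 is the better-behaved half precision
        return "bfloat16"
    # CPU fallback: float16 is not natively supported on most CPUs
    return "float32"
```

At load time you would then pass, for example, torch_dtype=getattr(torch, pick_dtype_name("cuda", torch.cuda.get_device_capability()[0])) to from_pretrained.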
Running Your First Image Inference with Qwen3-VL-4B
The following example loads a local image file and asks Qwen3-VL-4B to describe it. The process_vision_info function from qwen_vl_utils extracts image and video tensors from the message structure before they are passed to the processor.
from PIL import Image
from qwen_vl_utils import process_vision_info
# Load your image
image = Image.open("your_image.jpg").convert("RGB")
# Build the conversation message in Qwen VL format
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": "Describe this image in detail."},
],
}
]
# Apply the model's chat template to format the prompt correctly
text = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
# Extract image and video tensors from the messages
image_inputs, video_inputs = process_vision_info(messages)
# Build the final input tensors
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
).to(model.device)
# Generate the response
with torch.no_grad():
output_ids = model.generate(
**inputs,
max_new_tokens=512,
do_sample=False,
)
# Strip the input prompt from the output before decoding
generated_ids = [
output_ids[i][len(inputs.input_ids[i]):]
for i in range(len(output_ids))
]
response = processor.batch_decode(
generated_ids,
skip_special_tokens=True,
clean_up_tokenization_spaces=False,
)[0]
print(response)
The input-stripping step — slicing output_ids[i][len(inputs.input_ids[i]):] — is important. Without it, batch_decode will include the full prompt (including image tokens) in the decoded string, producing garbled output at the start of the response.
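The same slicing can be factored into a small helper so batched pipelines reuse it. The helper works on any indexable sequences (plain lists here for illustration; in practice you pass the tensors from generate):

```python
def strip_prompt(output_ids, input_ids):
    """Drop the echoed prompt tokens from each generated sequence.

    With model.generate, output_ids[i] always begins with the tokens
    of input_ids[i], so slicing off that length leaves only the newly
    generated tokens.
    """
    return [out[len(inp):] for out, inp in zip(output_ids, input_ids)]
```

Pass the result straight to processor.batch_decode, exactly as in the example above.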
Working with Multiple Images and Video Frames
Qwen3-VL-4B supports multiple images in a single conversation turn. Pass them as additional items in the content array:
image1 = Image.open("before.jpg").convert("RGB")
image2 = Image.open("after.jpg").convert("RGB")
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image1},
{"type": "image", "image": image2},
{"type": "text", "text": "What changed between these two images?"},
],
}
]
For video inference, pass a local file path with fps and max_pixels controls. Lower values reduce VRAM usage for long videos:
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "path/to/your_video.mp4",
"fps": 1.0, # sample 1 frame per second
"max_pixels": 360 * 420,
},
{"type": "text", "text": "Summarize what happens in this video."},
],
}
]
The rest of the inference code (applying the chat template, calling process_vision_info, generating) is identical to the single-image example above. More vision-language model options for local inference are covered in our guide to the best small LLMs to run locally.
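Because each sampled frame consumes vision tokens, it helps to estimate how many frames a given fps setting will produce before committing a long video. A rough planning aid, assuming simple duration-times-fps sampling (the exact sampler in qwen-vl-utils may round or clamp differently):

```python
from typing import Optional


def video_frame_budget(duration_s: float, fps: float,
                       max_frames: Optional[int] = None) -> int:
    """Estimate how many frames fps-based sampling yields.

    A planning aid for VRAM budgeting only; always at least one frame,
    optionally clamped to a hard frame cap.
    """
    frames = max(1, int(duration_s * fps))
    if max_frames is not None:
        frames = min(frames, max_frames)
    return frames
```

For example, a 90-second clip at fps=1.0 yields about 90 frames; halving fps halves the frame count and, roughly, the vision-token load.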
Quantizing Qwen3-VL-4B to Fit in 4-6 GB VRAM
If your GPU has less than 8 GB VRAM, 4-bit quantization via BitsAndBytes reduces memory usage to approximately 4-5 GB with a modest quality tradeoff. The nf4 (NormalFloat4) quantization type is recommended for vision-language models — it preserves more of the weight distribution than the default fp4 format.
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
import torch
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
)
MODEL_ID = "Qwen/Qwen3-VL-4B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
MODEL_ID,
quantization_config=quantization_config,
device_map="auto",
)
model.eval()
Once loaded, the inference code is identical to the non-quantized version — no changes required to the generation loop. BitsAndBytes handles the dequantization transparently at runtime.
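To sanity-check the VRAM figures quoted in this guide, a weights-only estimate is simple arithmetic; real usage adds activations, the KV cache, and CUDA context on top, which is why the quoted totals sit above the raw weight size. The helper below is illustrative:

```python
def approx_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Weights-only memory estimate in decimal GB: params * bits / 8.

    Ignores activations, KV cache, and framework overhead, so treat
    the result as a lower bound on actual VRAM usage.
    """
    return n_params * bits_per_weight / 8 / 1e9
```

For a 4-billion-parameter model this gives 8.0 GB at 16-bit and 2.0 GB at 4-bit, consistent with the approximately 4-5 GB total observed for the 4-bit load once runtime overhead is included.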
For an alternative approach to running compact vision models, see our guide on running IBM Granite 4.0 Vision locally — it covers similar quantization patterns for a different architecture and is useful for comparison.
Common Errors and Fixes
The list below covers the errors developers most commonly encounter when running Qwen3-VL-4B locally with Transformers for the first time.
- CUDA out of memory: Model too large for available VRAM. Switch to 4-bit quantization or reduce max_new_tokens.
- ModuleNotFoundError: qwen_vl_utils: Missing vision utils package. Run pip install qwen-vl-utils.
- ValueError: Unrecognized model: Transformers version too old. Upgrade with pip install --upgrade transformers.
- RuntimeError: CUDA device-side assert: CUDA/PyTorch version mismatch. Reinstall PyTorch matching your CUDA version from pytorch.org.
- Extremely slow inference: No GPU detected; model is running on CPU. Verify CUDA installation with torch.cuda.is_available().
- Empty or truncated responses: max_new_tokens too low. Increase to 512 or 1024.
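In logging-heavy pipelines it can be convenient to encode these mappings as a small triage helper that turns a caught exception's message into an actionable hint. The patterns and hint strings below are illustrative and worth tuning to your own stack:

```python
# Substring patterns from the common errors above, paired with fixes.
KNOWN_FIXES = [
    ("CUDA out of memory", "Switch to 4-bit quantization or reduce max_new_tokens."),
    ("qwen_vl_utils", "Run: pip install qwen-vl-utils"),
    ("Unrecognized model", "Run: pip install --upgrade transformers"),
    ("device-side assert", "Reinstall PyTorch matching your CUDA version."),
]


def suggest_fix(error_message: str) -> str:
    """Map a raised error message to the corresponding fix above."""
    for pattern, fix in KNOWN_FIXES:
        if pattern in error_message:
            return fix
    return "No known fix; check the Transformers issue tracker."
```

Wrap your generate call in try/except and log suggest_fix(str(exc)) alongside the traceback.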
Extending Your Local Inference Pipeline
Once your basic inference pipeline is working, consider these extensions for production use:
- Batch inference: Pass multiple message lists to processor() and generate in parallel — monitor VRAM usage carefully as batch size scales.
- Streaming output: Use TextStreamer from Transformers to print tokens as they are generated rather than waiting for the full sequence.
- Custom system prompts: Add {"role": "system", "content": [{"type": "text", "text": "..."}]} as the first message in the conversation.
- FastAPI serving: Wrap the inference code in an async endpoint for multi-client access without switching to Ollama.
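The conversation structure used throughout this guide, optionally prefixed with a system turn as noted above, can be assembled with a small helper. build_messages is a hypothetical name for illustration, not part of qwen-vl-utils:

```python
def build_messages(user_text, images=(), system_prompt=None):
    """Assemble a Qwen VL conversation: an optional system turn,
    then one user turn mixing image entries and a text entry."""
    messages = []
    if system_prompt is not None:
        messages.append(
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
        )
    content = [{"type": "image", "image": img} for img in images]
    content.append({"type": "text", "text": user_text})
    messages.append({"role": "user", "content": content})
    return messages
```

The returned list can be fed to processor.apply_chat_template and process_vision_info exactly as in the single-image example earlier in this guide.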
If you want to compare Qwen3-VL-4B Instruct against the Thinking variant to determine which is better for your specific use case before building out a full pipeline, the Qwen3-VL-4B Instruct vs Thinking comparison guide covers the behavioral and performance differences in detail. For a lightweight alternative using the same Transformers-based workflow, the SmolVLM2 2.2B local setup guide is a useful reference.