Run, Install & Benchmark Qwen3.5 + Claude Code: Free Local AI Coding Agent

Qwen3.5 is Alibaba’s latest open‑weight multimodal model family, released under the Apache 2.0 license and designed to run efficiently from phones to high‑end GPUs while still competing with frontier cloud models on many language, coding, and agent benchmarks.

Claude Code is Anthropic’s agentic coding tool that runs in the terminal, understands your codebase, and automates edits, refactors, and git workflows via natural‑language instructions.

By pointing Claude Code at a locally served Qwen3.5 instance (via llama.cpp or a similar OpenAI‑compatible server), developers can create a free, private, local AI coding agent that behaves like a Claude‑style co‑worker but runs entirely on their own hardware.

This report explains what Qwen3.5 and Claude Code are, how to install and connect them, how to benchmark and test the setup, and how it compares with popular alternatives like cloud Claude Code, Qwen Code CLI, Codeium, and Aider.

Qwen3.5 in a nutshell

Qwen3.5 is the newest generation in Alibaba’s Qwen series, offered as a family of dense and Mixture‑of‑Experts (MoE) models from 0.8B up to 397B “A17B” activated parameters. The models are open‑weighted under the Apache 2.0 license, allowing commercial and on‑prem deployments without usage‑based licensing.

On Ollama, the Qwen3.5 library exposes small to very large variants (0.8B, 2B, 4B, 9B, 27B, 35B, 122B, plus cloud variants) with a unified 256K‑token context window, suitable for large codebases and long agentic sessions.

Unsloth and Hugging Face host quantized GGUF builds such as Qwen3.5‑4B‑IQ4_NL and Qwen3.5‑4B‑Q4_K_M, which shrink model size to roughly 2.5–3 GB on disk and make local inference on consumer hardware practical.
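Those sizes follow directly from parameter count and bits per weight; a back‑of‑envelope check (the ~5 bits/weight figure for Q4‑class quantizations, including scales and metadata, is an assumption for illustration):

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float = 5.0) -> float:
    """Back-of-envelope on-disk size of a quantized GGUF model, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 4B model at ~5 bits/weight lands right in the 2.5-3 GB range quoted above.
print(round(gguf_size_gb(4), 2))  # → 2.5
```

The same arithmetic explains why an 8–9B model at Q4 needs roughly double the disk and memory.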

Coding and agent benchmarks

Although Qwen3.5 is a general multimodal model, its large variants score competitively on code and agent benchmarks against GPT‑5‑class and Gemini‑3‑class models.

For the flagship Qwen3.5‑397B‑A17B, Qwen reports, for example:

  • LiveCodeBench v6 (code generation): 83.6 vs 87.7 for GPT‑5.2 and 90.7 for Gemini‑3 Pro.
  • SWE‑bench Verified (coding agent, GitHub issues): 76.2, comparable to GPT‑5.2 (80.0) and Claude 4.5 Opus (80.9).
  • SWE‑bench Multilingual: 69.3 vs 72.0 (GPT‑5.2) and 77.5 (Claude 4.5 Opus).
  • SecCodeBench (security‑aware code): 68.3 vs 68.7 (GPT‑5.2) and 68.6 (Claude 4.5 Opus).

While these results are for the largest model, the 4B and 9B “Small” variants are designed to preserve strong instruction‑following and coding performance at much lower compute, and external reviews place Qwen3.5‑4B near or above peers like Llama 3.2 3B and Gemma 3 4B on coding tasks.

Hardware profile and quantization

Qwen3.5 Small models target on‑device and edge deployment. MindStudio’s hardware guide suggests:

  • Qwen3.5‑4B‑Q4 runs well on modern phones and low‑end laptops (5–15 tokens per second on an iPhone 15 Pro or modest CPU‑only PC).
  • Qwen3.5‑8B‑Q4 targets tablets and laptops with more RAM, reaching 15–30 tokens per second on M‑series Macs.

Quantized GGUF builds from Unsloth expose Q4 and Q8 variants; for Qwen3.5‑4B, the IQ4_NL and Q4_K_M files are around 2.5–3 GB, small enough for SSD‑only setups and single‑GPU cards with 8–12 GB VRAM.

Licensing and ecosystem

All Qwen3.x and Qwen3.5 models are released under Apache 2.0, explicitly allowing commercial use, redistribution, and modification as long as attribution and license terms are respected. The ecosystem supports common inference engines like llama.cpp, vLLM, and SGLang, and tools such as Ollama, LM Studio, and Qwen Code CLI.

This permissive licensing plus broad tooling support is what makes Qwen3.5 especially attractive for a free, local coding agent.

What is Claude Code?

Claude Code is an “agentic coding” tool from Anthropic that runs in the terminal and connects to Claude models in the cloud. It scans your repository, reads and writes files, runs tests and commands, and uses natural‑language prompts to drive incremental changes.

According to Anthropic’s docs and the npm package description, Claude Code:

  • Installs as a native binary or via npm (npm install -g @anthropic-ai/claude-code).
  • Is launched from a project directory with the claude command.
  • Understands codebases, git status, and build/test commands, and can perform refactors and multi‑file edits.
  • Uses Anthropic’s Claude models (e.g., Claude Sonnet/Opus) by default through an authenticated API connection.

Pricing and access

Claude Code access is bundled into Claude’s Pro and Max subscription tiers and some Team/Enterprise seats.

  • Claude Pro (about 17 USD/month on annual billing or 20 USD/month monthly) includes terminal‑based Claude Code alongside enhanced usage limits and longer context for chat.
  • Claude Max (100–200 USD/month) multiplies usage limits by roughly 5–20× and adds priority access and wider model access.
  • Free users can use Claude chat but do not get full Claude Code access.

Out of the box, Claude Code expects a paid Claude plan or API; however, its architecture can be pointed at compatible backends.

Why pair Qwen3.5 with Claude Code?

Normally, Claude Code sends your prompts, repository context, and tool calls to Anthropic’s servers, which then run Claude models in the cloud. For many teams this is fine, but it raises privacy and cost concerns for sensitive or large‑volume projects.

The key idea behind “Qwen3.5 + Claude Code as a free local AI coding agent” is:

  1. Run Qwen3.5 locally via llama.cpp or another OpenAI‑compatible server (for example, as in a recent Datacamp tutorial for Qwen3.5).
  2. Configure Claude Code (or a Claude‑style coding CLI) to speak to this local endpoint instead of Anthropic’s cloud, using a dummy key and the local URL.
  3. Let the agent workflow (file inspection, edits, test runs) remain in Claude Code, but all language modeling happens inside Qwen3.5 running on your own GPU or CPU.

YouTube demos and community tutorials show this pattern by combining a quantized Qwen3.5 4B GGUF model served with llama-server and wiring Claude Code or similar CLIs to that local endpoint.

Benefits of this hybrid setup

  • Zero per‑token cost: Once Qwen3.5 weights are downloaded, inference is free beyond electricity and hardware.
  • Stronger privacy: Source code never leaves your machine; llama.cpp does not require external calls, and even Anthropic’s cloud is bypassed.
  • Agentic tooling without lock‑in: You keep the mature terminal UX of Claude Code while retaining full control over which local model to serve (Qwen3.5 now, another model later).
  • Good coding quality at small sizes: Qwen3.5‑4B and 9B are tuned to deliver strong coding assistance relative to their size, making them practical for laptops and single‑GPU workstations.

Architecture of a local Qwen3.5 + Claude Code agent

A typical setup looks like this:

  1. llama.cpp (or vLLM/SGLang) runs Qwen3.5‑4B‑Q4_K_M or similar as an HTTP server exposing an OpenAI‑style /v1/chat/completions endpoint on localhost (for example, port 8080).
  2. Claude Code (or an equivalent agent CLI) is configured via environment variables or a config file to:
    • Use a dummy API key (e.g., OPENAI_API_KEY="EMPTY").
    • Point to the local base URL (e.g., OPENAI_BASE_URL=http://localhost:8080/v1).
    • Use the alias/model name exported by llama-server.
  3. When the user types a natural‑language command (“migrate this project to FastAPI and add tests”), Claude Code:
    • Gathers file context.
    • Sends a structured prompt to the local OpenAI‑compatible server.
    • Receives Qwen3.5’s response and applies file edits and CLI commands.

This pattern mirrors how Qwen’s own Qwen Code CLI and other tools (like OpenClaw, OpenCode, or Gemini‑based CLIs) integrate with local or remote models via OpenAI‑compatible endpoints.
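The gather → prompt → apply loop described above can be sketched as a minimal, stubbed agent core (the function names and wiring here are illustrative, not Claude Code internals; the model call is a placeholder for the HTTP round‑trip to localhost:8080):

```python
from typing import Callable

def agent_step(gather_context: Callable[[], str],
               call_model: Callable[[str], str],
               apply_edits: Callable[[str], bool],
               instruction: str) -> bool:
    """One iteration of the gather -> prompt -> apply loop."""
    context = gather_context()
    prompt = f"Project context:\n{context}\n\nTask: {instruction}"
    response = call_model(prompt)   # in the real setup: POST to the local /v1/chat/completions
    return apply_edits(response)    # True once edits are applied and checks pass

# Stubbed wiring, just to show the control flow:
ok = agent_step(
    gather_context=lambda: "app.py: print('hi')",
    call_model=lambda p: "edit: replace 'hi' with 'hello'",
    apply_edits=lambda r: r.startswith("edit:"),
    instruction="make the greeting friendlier",
)
print(ok)  # → True
```

In a real agent the loop repeats, feeding test failures back into the next prompt until the checks pass.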

Step‑by‑step installation guide

1. Install prerequisites

Operating system and tooling

  • A recent Linux distribution, macOS, or Windows with WSL2 is recommended for building and running llama.cpp and Claude Code smoothly.
  • Install git, CMake, a C/C++ compiler, and Python 3 if not already present; Datacamp’s Qwen3.5 tutorial shows this on Ubuntu when preparing a GPU VM.

Node.js or native installer for Claude Code

  • For Claude Code, Anthropic recommends a native installer:
    • macOS/Linux: curl -fsSL https://claude.ai/install.sh | bash
    • Windows: irm https://claude.ai/install.ps1 | iex in PowerShell.
  • If needed, you can still install via npm:
    • Ensure Node.js 18 or later is installed.
    • Run npm install -g @anthropic-ai/claude-code.

2. Build llama.cpp with server support

The llama.cpp project provides high‑performance local inference for GGUF models and includes a built‑in HTTP server (llama-server).

A typical build sequence is:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # enable CUDA for NVIDIA GPUs (optional)
cmake --build build --config Release -j
```

Guides from Arm, Datacamp, and others show similar commands; enabling LLAMA_BUILD_SERVER via CMake or targeting llama-server explicitly ensures the HTTP server binary is built.

After compilation, key binaries such as llama-cli, llama-server, and llama-quantize are available in the build directory.

3. Download a Qwen3.5 GGUF model

Qwen3.5 GGUF models suitable for llama.cpp are available from Unsloth’s Qwen3.5‑4B‑GGUF repository, among others.

General GGUF download instructions from Qwen and Hugging Face are:

  1. Install the Hugging Face CLI:

```bash
pip install huggingface_hub
```

  2. Download a quantized model:

```bash
huggingface-cli download unsloth/Qwen3.5-4B-GGUF \
  Qwen3.5-4B-IQ4_NL.gguf \
  --local-dir ./models/qwen35-4b
```

The exact filename may differ (for example, Qwen3.5-4B-Q4_K_M.gguf or Qwen3.5-4B-IQ4_XS.gguf); Unsloth’s model card lists available quantizations and disk sizes, such as roughly 2.5–3 GB for Q4 variants.

4. Start an OpenAI‑compatible server for Qwen3.5

llama.cpp can serve GGUF models through an OpenAI‑style chat completions API by running llama-server with appropriate parameters.

A Datacamp tutorial for Qwen3.5 shows a full‑featured example (here simplified):

```bash
./llama.cpp/llama-server \
  --model models/Qwen3.5-4B-IQ4_NL.gguf \
  --alias "Qwen3.5-4B" \
  --host 0.0.0.0 \
  --port 8080 \
  --fit on \
  --ctx-size 16384 \
  --jinja
```

Hugging Face’s GGUF/llama.cpp guide shows that you can also launch the server directly from a Hugging Face repo with -hf shorthand; llama.cpp will fetch and cache the model automatically.

Once running, the server exposes an OpenAI‑compatible /v1/chat/completions endpoint, which can be tested with a simple curl command.
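For instance, the same request can be built and sent with Python’s standard library alone (the endpoint, alias, and dummy token mirror the setup above; llama-server does not validate the Authorization value):

```python
import json
import urllib.request

def build_chat_request(model: str, user_prompt: str, max_tokens: int = 256) -> bytes:
    """OpenAI-style chat completions payload for the local llama-server."""
    return json.dumps({
        "model": model,  # must match the --alias passed to llama-server
        "messages": [{"role": "user", "content": user_prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")

body = build_chat_request("Qwen3.5-4B", "Reverse a string in Python, one line.")
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json", "Authorization": "Bearer EMPTY"},
)
# with urllib.request.urlopen(req) as resp:  # requires llama-server to be running
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
print(json.loads(body)["messages"][0]["role"])  # → user
```

If the server answers with a `choices` array containing a `message`, the endpoint is ready for Claude Code.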

5. Install and configure Claude Code

With Qwen3.5 served locally, configure Claude Code (or an equivalent coding agent CLI) to talk to the local endpoint.

  1. Install Claude Code using Anthropic’s installer or npm, as described earlier.
  2. Run claude once in a project directory to let it initialize configuration and prompt you for connection details.
  3. Instead of connecting to Anthropic’s cloud API, configure environment variables or settings as if you were using a generic OpenAI‑compatible backend:

```bash
export OPENAI_API_KEY="EMPTY"                      # llama.cpp ignores this
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_MODEL="Qwen3.5-4B"                   # must match --alias
```

YouTube tutorials that combine Qwen3.5 with Claude Code or OpenClaw use a similar pattern: an alias in llama-server and a configuration pointing the agent to localhost:8080 instead of a cloud provider.

6. Verify the end‑to‑end agent

To confirm that the setup is working, run claude in a small test repo and ask:

“Scan this project and create a new script hello_agent.py that prints the current time every second.”

Then:

  1. Watch Claude Code’s logs; they should show requests directed at the local OpenAI‑style endpoint (localhost:8080).
  2. llama.cpp’s console output will confirm the model is generating tokens for each request.
  3. After completion, verify that the requested file was created and contains plausible Qwen3.5‑generated code.

Some community demos show exactly this kind of workflow, where Claude Code or a Qwen‑based CLI creates files, updates tests, and refactors code while backed entirely by a local Qwen model.
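A plausible version of hello_agent.py that such a session might produce (written here by hand to illustrate the expected result; the iterations and interval parameters are additions so the script can terminate and be tested, which the one‑sentence prompt does not require):

```python
# hello_agent.py: print the current time once per interval, a fixed number of times.
import time
from datetime import datetime

def print_time(iterations: int = 5, interval: float = 1.0) -> list[str]:
    """Print the current time every `interval` seconds; return the printed stamps."""
    stamps = []
    for _ in range(iterations):
        now = datetime.now().strftime("%H:%M:%S")
        print(now)
        stamps.append(now)
        time.sleep(interval)
    return stamps

if __name__ == "__main__":
    print_time()  # five ticks, one per second
```

Whatever the agent actually writes, checking that the file exists, runs, and prints timestamps is the verification that matters.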

Benchmarking and testing your local agent

Measuring raw model performance

llama.cpp and vendor guides provide several approaches to benchmarking Qwen3.5 on your hardware.

Key metrics include:

  • Tokens per second (t/s) during generation; llama.cpp includes detailed timings (prompt processing, token generation) that can be accessed via llama_print_timings output or benchmark tools like llama-bench.
  • Latency to first token (TTFT) and inter‑token latency (ITL) for interactive tasks, which other engines like vLLM also optimize.

AMD and NVIDIA guides for llama.cpp show how to run benchmark commands that output average tokens per second across multiple runs, including on GPUs like RTX 40‑series or MI300‑class accelerators.
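Both metrics are easy to derive yourself from two timestamps and a token count, whichever server you benchmark; a small helper (the timing values in the example are synthetic):

```python
def throughput_stats(start: float, first_token: float, end: float, n_tokens: int) -> dict:
    """Compute time-to-first-token and generation tokens/second from wall-clock times."""
    ttft = first_token - start
    gen_seconds = end - first_token
    return {
        "ttft_s": round(ttft, 3),
        "tokens_per_s": round(n_tokens / gen_seconds, 1) if gen_seconds > 0 else float("inf"),
    }

# Synthetic example: 0.4 s to first token, then 256 tokens over 12.8 s of generation.
stats = throughput_stats(start=0.0, first_token=0.4, end=13.2, n_tokens=256)
print(stats)  # → {'ttft_s': 0.4, 'tokens_per_s': 20.0}
```

In practice, feed in timestamps captured around the request (for example, with time.monotonic()) and the completion token count reported by the server.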

Practical coding benchmarks

For a local coding agent, practical evaluation matters more than synthetic scores. Useful real‑world tests include:

  1. Unit‑test driven coding
    • Ask the agent to add a new feature plus tests to a small Python or TypeScript project.
    • Measure how many iterations are needed before all tests pass.
    • Compare with cloud Claude Code and with another local model (e.g., Qwen2.5‑Coder) under similar prompts.
  2. Refactor and multi‑file edits
    • Have the agent rename a core module, update imports, and adjust associated tests.
    • Inspect git diffs for correctness and clarity.
  3. Cross‑language tasks
    • Ask for a translation of logic from Python to Go or Rust.
    • See how well Qwen3.5 handles language‑specific idioms and tooling.

Qwen’s own benchmarks (LiveCodeBench, SWE‑bench) plus user reports from the LocalLLaMA community suggest that Qwen‑family coder models already deliver high‑quality code for many languages and scenarios, often rivaling earlier Claude or GPT‑4‑class models when unthrottled.

End‑to‑end agent reliability tests

To validate the agent behavior rather than just raw model quality, consider scripted scenarios like:

  • Bootstrap a new project: Provide a blank directory and ask the agent to create a minimal web API or CLI app with tests and a Dockerfile.
  • Bug reproduction and fix: Paste failing test output or a stack trace and expect the agent to identify and patch the bug.
  • Upgrade a dependency: Ask it to upgrade a major framework version and resolve deprecations.

You can automate these tests using the same approaches Anthropic and others use for evaluating coding agents, such as re‑running tests after each patch and scoring success based on pass/fail.
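A minimal version of that loop, re‑running the project’s test command after each patch and scoring pass/fail, needs only the standard library (the commands below are stand‑ins; in practice you would point run_check at pytest or your build):

```python
import subprocess
import sys

def run_check(cmd: list[str], timeout: int = 120) -> bool:
    """Return True when the test command exits 0, False otherwise."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def score_patches(results: list[bool]) -> float:
    """Fraction of agent patches whose test run passed."""
    return sum(results) / len(results) if results else 0.0

# Stand-in commands; replace with e.g. ["pytest", "-q"] after each agent patch.
passing = run_check([sys.executable, "-c", "raise SystemExit(0)"])
failing = run_check([sys.executable, "-c", "raise SystemExit(1)"])
print(score_patches([passing, failing]))  # → 0.5
```

Running the same scripted scenarios against cloud Claude Code and the local Qwen3.5 backend gives a directly comparable pass rate.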

How Qwen3.5 + Claude Code compares to alternatives

Quick comparison table (high level)

| Tool / Setup | Model location | Recurring cost | Offline? | Best for | Notes |
|---|---|---|---|---|---|
| Qwen3.5 + Claude Code (local) | Local Qwen3.5 via llama.cpp | None beyond hardware & power | Yes | Developers who want a Claude‑style terminal agent with full local control | Requires manual setup; quality depends on chosen Qwen3.5 size and quantization. |
| Claude Code (cloud default) | Anthropic cloud (Claude Sonnet/Opus) | ≈17–20 USD/month Pro; 100–200 USD/month Max; team seats 25–150 USD/user/month | No | Teams wanting managed, highest‑quality Claude models | Best quality & support, but source and usage go to Anthropic servers. |
| Qwen Code CLI + Qwen3‑Coder | Qwen3‑Coder models (cloud via OpenRouter/Alibaba, or local) | Model/API costs or local GPU | Optional | Power users focused on the Qwen ecosystem | First‑party CLI for Qwen models with strong Terminal‑Bench performance. |
| Codeium + VS Code | Codeium cloud models | Free for individuals; paid tiers for enterprises | No | Fast autocomplete and chat with almost no setup | Great developer UX, but code goes to the vendor backend. |
| Aider (terminal pair programmer) | Connects to Claude, OpenAI, DeepSeek, or local LLMs | Depends on chosen models | Optional | Terminal‑first workflow and git‑aware pairing | Strong git integration; works with Claude 3.5 and local models as backends. |

Unique selling points of Qwen3.5 + Claude Code

  1. Cloud‑grade model family, fully local: Qwen3.5’s larger variants now appear alongside GPT‑ and Gemini‑class models on global leaderboards, delivering high scores on language, coding, and agent benchmarks while being open‑weight. Small quantized variants like Qwen3.5‑4B‑Q4 still retain strong coding abilities while fitting on consumer hardware.
  2. Agent UX people already like: Claude Code (and similar CLIs) have become a standard “AI co‑worker in the terminal,” with mature support for repository context, git, and shell integration. Reusing this UX while swapping out the opaque cloud model for a transparent local one lets developers keep workflows they already know.
  3. Cost‑efficient at scale: Claude Pro and Max are cost‑effective for many individuals, but large teams or heavy workloads can quickly hit usage caps or high monthly bills. Once a local Qwen3.5 setup is in place, additional usage incurs negligible marginal cost.
  4. Flexible and swappable: Because llama.cpp exposes a generic OpenAI‑style API, swapping Qwen3.5 for another GGUF model (for example, a reasoning‑distilled Qwen3.5‑4B variant or Qwen3‑Coder‑Next) is mostly a matter of changing the model path and alias. This makes it easier to experiment with new weights without changing the agent layer.

Example workflow: from request to merged PR

Consider a simple but realistic scenario: migrating a small Flask API to FastAPI and adding tests.

  1. Start the stack
    • llama-server is already running Qwen3.5‑4B‑IQ4_NL on port 8080.
    • claude is launched in the project root with environment variables pointing to the local endpoint.
  2. Describe the task in natural language:

```text
Please migrate this Flask API to FastAPI, keeping all routes equivalent.
Add type hints, pydantic models, and pytest tests for each endpoint.
Use uvicorn for local dev.
```
  3. Agent planning and execution
    • Claude Code inspects the current files, identifies Flask usage, and drafts a plan.
    • It calls Qwen3.5 via the local API to generate new main.py, models.py, and test_api.py files.
    • It runs pytest and, if tests fail, loops back with the failure output as additional context.
  4. Review and merge
    • You inspect diffs, tweak anything that looks off, and commit.
    • Since everything happened locally, there is no external log of the code or prompts.

This is functionally similar to using cloud Claude Code, but users retain full control over data, versioning, and underlying model selection.

Pricing analysis

Cost profile of Qwen3.5 + Claude Code

Running Qwen3.5 locally incurs no per‑token charges, only one‑time or amortized costs:

  • Hardware (GPU/CPU and storage) and electricity.
  • Occasional downloads of updated GGUF weights.

Apache 2.0 licensing means there is no vendor‑imposed metering or seat‑based pricing, even for commercial applications.

Claude Code subscription costs for comparison

Claude Code is included in several Claude subscription tiers:

  • Claude Pro: about 17 USD/month (annual) or 20 USD/month (monthly), including Claude Code and extended usage.
  • Claude Max: 100–200 USD/month, with 5×–20× more usage and full access to Opus.
  • Team seats: roughly 25 USD/month for Standard and 150 USD/month for Premium (with Code access) per user.

While these plans are attractive for individuals and small teams, large organizations or heavy users can face substantial recurring costs, especially when combined with API usage.
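A rough break‑even sketch makes the trade‑off concrete (the hardware, electricity, and subscription figures below are illustrative assumptions, not quotes):

```python
def breakeven_months(hardware_cost: float, monthly_sub: float, monthly_power: float = 10.0) -> float:
    """Months until a one-time hardware spend beats a recurring subscription."""
    saving_per_month = monthly_sub - monthly_power
    if saving_per_month <= 0:
        return float("inf")  # the subscription is cheaper than the electricity alone
    return hardware_cost / saving_per_month

# Example: a 1200 USD GPU vs a 100 USD/month Max plan, ~10 USD/month electricity.
print(round(breakeven_months(1200, 100), 1))  # → 13.3
```

For a team, multiply the subscription side by seat count; the local setup’s cost stays flat, which is where the economics tilt fastest.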

Other competitor pricing snapshots

  • Codeium offers free AI code completion and chat for individuals, with paid enterprise plans.
  • Aider itself is open‑source, but costs depend on the connected models (Claude, GPT‑4‑class, or local LLMs).
  • Qwen Code CLI mostly depends on the cost of the backing Qwen3‑Coder model, which can be accessed via Alibaba Cloud, OpenRouter, or local GPUs; some providers offer generous free tiers.

In this landscape, Qwen3.5 + Claude Code stands out as a way to combine a polished agent UX with truly unmetered local inference.

Best practices and tips

Choosing the right Qwen3.5 size

  • Start with Qwen3.5‑4B‑Q4 if running on a laptop or single mid‑range GPU; it offers a good trade‑off between speed and code quality.
  • Upgrade to Qwen3.5‑9B or larger if you have more VRAM or use a remote GPU server, especially for complex refactors or multi‑file reasoning.

Tuning server parameters

Guides for llama.cpp and Qwen3‑Coder show several useful flags:

  • --threads and --threads-batch to match CPU cores.
  • --ctx-size to balance long‑context needs and memory.
  • --fit on to auto‑balance VRAM and RAM when the model does not fully fit on GPU.
  • --flash-attn on (where supported) to reduce latency.

Improving agent reliability

  • Keep prompts explicit: specify languages, frameworks, and constraints (“use pytest”, “avoid external services”).
  • Use git branches so the agent can make bold changes without risking main.
  • Periodically re‑benchmark with llama-bench to confirm that future updates or quantizations have not regressed performance.

Limitations and caveats

  • Model quality vs. top‑tier Claude: Even though Qwen3.5 is strong, small local variants will still lag behind Anthropic’s latest Sonnet/Opus models on very complex reasoning or huge codebases.
  • Setup complexity: Building llama.cpp, managing GGUF downloads, and wiring up an OpenAI‑compatible endpoint requires more DevOps effort than using a hosted IDE plugin.
  • Hardware dependence: Performance and usability depend heavily on your hardware; under‑powered machines may only achieve modest tokens per second and feel sluggish.

Despite these trade‑offs, the combination of open‑weight Qwen3.5 models and an agentic terminal UX like Claude Code offers one of the most compelling ways in 2026 to run a free, local AI coding agent with strong real‑world performance and no per‑token bill.

FAQs

1. What is Qwen3.5 and why use it for coding?
Qwen3.5 is Alibaba’s open‑weight multimodal model family with strong coding and agent performance, available from 0.8B to 397B parameters under Apache 2.0. Smaller quantized variants like Qwen3.5‑4B‑Q4 run comfortably on consumer hardware while still delivering reliable code generation and refactoring.

2. How does Claude Code fit into a local setup?
Claude Code is an agentic coding tool that lives in your terminal, understands your repo, and automates edits, tests, and git workflows. By pointing it at a local OpenAI‑compatible server running Qwen3.5, you keep the Claude‑style UX while all model inference stays on your machine.

3. What are the main benefits of Qwen3.5 + Claude Code vs cloud tools?
You get zero per‑token costs, stronger privacy (code never leaves your hardware), and flexibility to swap models or quantizations as your needs grow. Quality is competitive for many coding tasks, especially with 4B or 9B variants, though still below top‑tier cloud Claude on very complex problems.

4. What hardware do I need to run Qwen3.5 locally?
Quantized Qwen3.5‑4B‑Q4 models are around 2.5–3 GB and are designed to run on modern laptops, desktops without a GPU, and recent mobile devices at usable speeds. Larger models like 8B or 9B benefit from dedicated GPUs or remote GPU VMs but offer significantly better reasoning and coding depth.

5. How does this setup compare to tools like Codeium, Qwen Code, or Aider?
Codeium offers fast, free cloud autocomplete and chat, while Qwen Code and Aider provide powerful terminal‑first or git‑aware agents backed by cloud or local models. Qwen3.5 + Claude Code is unique in combining a Claude‑style terminal UX with fully local, unmetered inference using an Apache‑licensed model family.