DeepSeek V4: Release Date, Features, Benchmarks, and What to Expect
DeepSeek V4 is weeks away from launch. This article tracks the confirmed release timeline, explains the three architectural innovations (Engram, DSA, mHC), and gives developers a benchmark comparison and action plan for the transition.
DeepSeek V4 is the most anticipated AI model release in the open-source community right now — and as of April 2026, it has not officially launched. With over 21,000 monthly searches and a wave of pre-release leaks covering architecture papers, benchmark claims, and hardware choices, developers are rightly paying attention. This article tracks the confirmed release timeline, breaks down the DeepSeek V4 architectural changes, and tells you exactly what to do while you wait.
DeepSeek V4 Release Status — April 2026 Update
As of April 11, 2026, DeepSeek V4 has not been publicly released. DeepSeek's official API still serves deepseek-chat and deepseek-reasoner, both mapped to DeepSeek-V3.2 with a 128K context window. No V4 model ID has appeared, no changelog entry, and no official announcement has been made by DeepSeek AI.
The clearest signal of an imminent release came on April 3, 2026, when Reuters — citing The Information — reported that DeepSeek V4 is expected to launch within "the next few weeks" and will run on Huawei's latest chips. That is the closest thing to a confirmed timeline available today.
For a guide to the current production model, see our DeepSeek V3.2-Speciale installation and benchmarks guide.
DeepSeek V4 Release Date — What We Know
The timeline has slipped several times. Here is the full sequence:
- Late 2025: Pre-release leaks suggest a mid-February 2026 launch tied to Lunar New Year (February 17).
- February 2026: No launch. Chinese tech outlet Whale Lab reports the model is being held back for further testing.
- March 16, 2026: Dataconomy reports DeepSeek V4 and Tencent's Hunyuan model will both launch in April.
- April 3, 2026: Reuters, citing The Information, reports V4 is "weeks away" and will run on Huawei Ascend 950PR chips.
The Huawei chip detail matters for international developers. DeepSeek reportedly gave Huawei early hardware access to V4 for its Ascend 950PR chips while denying NVIDIA the same — a deliberate signal amid ongoing US export controls on advanced semiconductors destined for China. The key practical question: will this affect API availability for developers outside China?
Based on DeepSeek's established pattern with V3 and V3.2, international API access through api.deepseek.com is expected to continue. However, this is not officially confirmed for V4, and the geopolitical context is worth monitoring.
DeepSeek V4 Architecture — Three Key Innovations
DeepSeek V4 is not a simple parameter scale-up of V3.2. Three documented architectural innovations distinguish it, each targeting a specific limitation of the prior generation.
Engram Conditional Memory
Engram replaces attention-based retrieval for static knowledge with hash-based O(1) lookups stored in DRAM rather than GPU VRAM. Think of it as a read-only key-value cache for factual recall — separate from the transformer's attention mechanism and not subject to its quadratic scaling cost. A pre-release paper reports 3–5 point benchmark improvements and a Needle-in-a-Haystack accuracy jump from 84.2% to 97% on a 27B test model.
For developers, the practical impact is significant. V4's context window targets 1M+ tokens — the difference between passing in a single file versus an entire repository. Where V3.2 can lose coherence at the far end of long contexts, Engram's architecture is specifically designed for stable retrieval at that scale.
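To make the architectural distinction concrete, here is a minimal illustrative sketch of the core idea behind Engram-style conditional memory: static factual associations live in a hash table with O(1) average lookup, independent of context length, instead of being recovered by an attention scan whose cost grows with the sequence. The class and method names are hypothetical, not DeepSeek's implementation.

```python
# Illustrative sketch only -- not DeepSeek's actual Engram code.
# Static knowledge sits in a plain hash table (conceptually in DRAM),
# outside the transformer's attention path.

class EngramStore:
    """Read-only key-value memory for factual recall."""

    def __init__(self):
        self._table = {}  # hash table: O(1) average-case lookup

    def write(self, key: str, value: str) -> None:
        # Populated once, ahead of inference; read-only afterward.
        self._table[key] = value

    def lookup(self, key: str):
        # Constant-time retrieval, independent of context length --
        # contrast with attention, whose cost scales with sequence size.
        return self._table.get(key)

store = EngramStore()
store.write("capital_of_france", "Paris")
print(store.lookup("capital_of_france"))  # Paris
```

The point of the sketch is the asymptotic contrast: a dictionary hit costs the same whether the surrounding context is 1K or 1M tokens, which is why this kind of memory pairs naturally with a very long context window.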
DeepSeek Sparse Attention (DSA)
DSA is a dynamic attention routing mechanism that selects between dense and sparse computation paths based on token complexity. Routine tokens (boilerplate, repetitive syntax) get efficient sparse attention. Tokens that require deep reasoning get full dense attention. The result: substantially lower inference cost per token without degrading quality on hard reasoning tasks.
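The routing idea can be sketched in a few lines. This is a toy illustration of the dense-vs-sparse decision, not DeepSeek's routing code: the complexity scores and threshold here are made-up placeholders for whatever signal the real router uses.

```python
# Toy sketch of dynamic dense/sparse attention routing -- illustrative only.
# Each token carries a complexity score; cheap tokens take the sparse path,
# hard tokens take the full dense path.

def route_tokens(tokens, complexity, threshold=0.5):
    """Assign each token to 'dense' or 'sparse' attention by score."""
    routes = []
    for tok, score in zip(tokens, complexity):
        path = "dense" if score >= threshold else "sparse"
        routes.append((tok, path))
    return routes

# Boilerplate syntax scores low; a reasoning-heavy identifier scores high.
tokens = ["def", "solve", "(", ")", ":", "proof_step"]
scores = [0.1, 0.3, 0.05, 0.05, 0.05, 0.9]
print(route_tokens(tokens, scores))
```

The economic upshot is that if most tokens in real workloads are "routine," the average per-token cost approaches the sparse path's cost while hard tokens still get full attention.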
Manifold-Constrained Hyper-Connections (mHC)
Training a 1-trillion-parameter model reliably is a hard optimization problem. Gradient instability at that scale produces unpredictable training curves — a challenge that has historically made trillion-parameter MoE models unreliable to bring to convergence. mHC addresses this by constraining parameter updates to stable manifold paths during training, allowing DeepSeek's team to train V4 without the instability spikes that disrupted earlier large-scale attempts.
DeepSeek V4 Benchmarks — Pre-Release Claims
⚠ All V4 benchmark numbers below come from pre-release internal testing and leaks. No independent evaluations have been published. Treat these as projections, not confirmed results.
With that caveat stated, here is what the pre-release data claims:
- SWE-bench Verified: V3.2-Speciale 67.8% → V4 claimed ~81%
- HumanEval: V3.2 ~82% → V4 claimed ~90%
- MMLU-Pro: V3.2 85.0% → V4 claimed ~89%
- LiveCodeBench: V3.2 74.1% → V4 not yet reported
The SWE-bench claim is the most significant for coding use cases. A jump from 67.8% to 81% would put V4 ahead of all current open-weight models on software engineering tasks. The architectural changes — particularly DSA for coding-heavy inference and Engram for long-context retrieval — provide a plausible technical basis for the improvement, even before independent verification.
For current reasoning model benchmarks in the DeepSeek family, see our DeepSeek R1-0528 vs OpenAI O3 comparison.
DeepSeek V4 vs DeepSeek V3.2 — Feature Comparison
- Total parameters: V3.2 671B → V4 ~1T
- Active parameters per token: V3.2 37B → V4 ~32B
- Context window: V3.2 128K → V4 1M+ tokens
- Multimodal support: V3.2 text only → V4 text + image (expected)
- Architecture: V3.2 MoE + standard attention → V4 MoE + Engram + DSA + mHC
- Expert routing: V3.2 top-2/top-4 per token → V4 16 expert pathways per token
- Training hardware: V3.2 NVIDIA H800 → V4 Huawei Ascend 950PR
The active parameter reduction (37B → ~32B) alongside the expanded expert pool (top-4 → 16 pathways) means each inference call draws from a richer knowledge base while activating fewer parameters per token. Combined with the 1M+ context window, this changes the calculus for long-running coding agents significantly.
For a thorough architecture comparison across generations, see our DeepSeek V3 vs V4 architecture deep dive.
DeepSeek V4 vs DeepSeek R1 — Which Model Is Which?
A persistent source of confusion: DeepSeek V4 is a general-purpose language model. DeepSeek R1 (and variants like R1-0528) is the reasoning-focused line — designed for chain-of-thought tasks, math, multi-step logic, and structured problem-solving where showing work improves accuracy.
V4 is the successor to V3.2: the general-purpose model used for chat, code completion, document analysis, and API integrations. When V4 launches, expect DeepSeek to maintain both lines — V4 for general tasks and an eventual R2 for deep reasoning. These are not competing products; they serve different use cases in the same way GPT-4o and o3 serve different use cases. For current comparisons across the general-purpose space, our DeepSeek V3.1 Terminus vs ChatGPT 5 vs Claude 4.1 comparison covers the practical trade-offs.
DeepSeek V4 API Pricing and Access
DeepSeek has not announced official V4 pricing. Based on V3.2's pricing trajectory and pre-release reports, projected figures are:
- Input tokens: ~$0.14 per million (⚠ unverified projection)
- Output tokens: ~$0.28 per million (⚠ unverified projection)
- Cached input: ~$0.07 per million (⚠ unverified projection)
If these projections are accurate, output token costs would drop roughly 75% compared to current V3.2 rates — continuing DeepSeek's consistent pattern of aggressive price reduction with each model generation. For high-throughput applications, that change alone would be a compelling migration reason.
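A quick back-of-envelope calculation shows what the projected rates would mean at scale. The rates below are the unverified projections from this section; plug in the official numbers once DeepSeek publishes them.

```python
# Back-of-envelope monthly cost at the UNVERIFIED projected V4 rates above.
# Replace the rates with official pricing once it is announced.

def monthly_cost(input_tokens_m, output_tokens_m, in_rate, out_rate):
    """Cost in USD; token counts are in millions, rates in $/million."""
    return input_tokens_m * in_rate + output_tokens_m * out_rate

# Example workload: 1,000M input tokens and 200M output tokens per month.
v4_projected = monthly_cost(1000, 200, in_rate=0.14, out_rate=0.28)
print(f"Projected V4 cost: ${v4_projected:.2f}")  # 1000*0.14 + 200*0.28 = $196.00
```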
Today, DeepSeek's API at api.deepseek.com exposes two models: deepseek-chat (mapped to V3.2) and deepseek-reasoner (mapped to R1). When V4 launches, expect either a new deepseek-v4 model ID or a redirect of deepseek-chat — the same pattern used when V3.2 replaced V3.1. Code using the OpenAI-compatible SDK with DeepSeek's base URL requires only a model name change to migrate. For current API usage, see our DeepSeek V3.2-Exp API and performance guide.
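Because the API is OpenAI-compatible, the migration surface is essentially one string. The sketch below shows the request payload shape; deepseek-v4 is a hypothetical model ID used only to illustrate the swap — the real ID is unknown until launch. (In practice you would send this payload via the OpenAI SDK pointed at base_url https://api.deepseek.com.)

```python
# Sketch of the V3.2 -> V4 migration under the OpenAI-compatible API.
# "deepseek-v4" is a HYPOTHETICAL model ID; the real one ships at launch.

def build_chat_request(model: str, prompt: str) -> dict:
    """Payload shape accepted by OpenAI-compatible chat endpoints."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

current = build_chat_request("deepseek-chat", "Explain this function.")   # today: V3.2
future = build_chat_request("deepseek-v4", "Explain this function.")     # hypothetical
print(current["model"], "->", future["model"])  # deepseek-chat -> deepseek-v4
```

Everything else — authentication, base URL, response parsing — stays the same, which is why the article's "model ID swap" framing is accurate for code already on the OpenAI-compatible SDK.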
What Developers Should Do Before DeepSeek V4 Launches
V4 is not live yet, but you can be ready to adopt it the day it ships:
- Watch DeepSeek's API changelog and HuggingFace repo: The fastest signal will be a new model ID appearing at api.deepseek.com/models or a new model card on huggingface.co/deepseek-ai. Model releases typically appear on HuggingFace within hours of announcement.
- Stay on V3.2 for production today: deepseek-chat (V3.2) is production-ready, well-documented, and fully accessible via the OpenAI-compatible SDK. Do not delay current projects waiting for V4. The API interface is expected to be backward-compatible — migration should require only a model ID swap.
- Architect for long context now: V4's 1M+ token window is its headline developer feature. If you are building applications that would benefit from full codebase or large document ingestion, design your context strategy now. Structure your prompts and chunking logic to take advantage of extended context when it becomes available.
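The third point above can be made concrete: treat the context window as a configurable budget so the same assembly code serves V3.2's 128K window today and a 1M+ window later. This is a hedged sketch with a crude whitespace token approximation; in production you would use a real tokenizer.

```python
# Sketch: size context assembly against a configurable token budget so the
# jump from 128K to 1M+ is a one-line config change. Whitespace word count
# stands in for real tokenization here, purely for illustration.

CONTEXT_BUDGET = 128_000  # raise toward 1_000_000+ when V4 ships

def approx_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def pack_files(files: dict, budget: int = CONTEXT_BUDGET) -> list:
    """Greedily include whole files until the token budget is exhausted."""
    packed, used = [], 0
    for path, content in files.items():
        cost = approx_tokens(content)
        if used + cost > budget:
            break  # under a 1M+ budget, this break rarely triggers
        packed.append(path)
        used += cost
    return packed

repo = {"main.py": "def main(): pass", "util.py": "x = 1 " * 50}
print(pack_files(repo, budget=40))  # only main.py fits a tiny budget
```

The design choice worth stressing: keep the budget out of your prompt-building logic. When the window grows roughly eightfold, code written this way upgrades by changing one constant rather than rewriting the chunking strategy.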
DeepSeek V4 represents a substantive architectural step — not just more parameters. Engram memory, DeepSeek Sparse Attention, and mHC each address real limitations of the prior generation. The benchmark claims are compelling, the pricing trajectory is aggressive, and the release is weeks away. The developers who prepare now will be the first to ship with it.