How to Use DeepSeek V4 API: Complete Developer Guide (2026)
DeepSeek V4 is one of the most capable open-weight language models available in 2026, and its API is now a serious option for production workloads. With a 1M token context window, OpenAI-compatible endpoints, and aggressive pricing, the DeepSeek V4 API gives developers a drop-in alternative for coding assistants, AI agents, and general-purpose inference.
This guide walks you through everything you need to start building with the DeepSeek V4 API: from generating your first API key to streaming responses, using reasoning mode, and calling tools. Every example is copy-paste ready.
Prerequisites
Before you begin, make sure you have:
- A DeepSeek platform account (free to create)
- At least $2 in API credits loaded
- Python 3.8+ or Node.js 18+ installed
- The openai SDK installed for your language of choice
If you already have code that calls the OpenAI API, you can likely switch to DeepSeek V4 by changing two lines: the base URL and the model name.
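For a concrete sketch of that two-line switch (assuming your key is already in an environment variable; the full, runnable example appears later in this guide):

```python
import os
from openai import OpenAI

# Change 1: point the existing OpenAI client at DeepSeek's base URL.
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)
# Change 2: pass model="deepseek-v4-pro" (or "deepseek-v4-flash") in your
# existing chat.completions.create call.
```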
Getting Your API Key
Follow these steps to generate your DeepSeek API key:
- Go to platform.deepseek.com and create an account.
- Navigate to the billing section and add a minimum of $2 in credit. The platform requires a balance before you can make API calls.
- Open the API keys page from your dashboard.
- Click "Create new API key" and give it a descriptive name (e.g.,
my-app-dev). - Copy the key immediately. It will not be shown again.
Store the key in an environment variable rather than hardcoding it:
export DEEPSEEK_API_KEY="sk-your-key-here"
The base URL for all API requests is:
https://api.deepseek.com/v1
V4-Pro vs V4-Flash: Which Model to Choose
DeepSeek V4 ships two models optimized for different use cases. Here is a direct comparison:
| Feature | deepseek-v4-pro | deepseek-v4-flash |
|---|---|---|
| Total Parameters | 1.6T | 284B |
| Active Parameters | 49B | 13B |
| Context Window | 1M tokens | 1M tokens |
| Best For | Agentic tasks, complex coding, long-context analysis | Classification, simple Q&A, fast responses |
| Input Price | $0.145/M tokens | $0.14/M tokens |
| Output Price | $1.74/M tokens | $0.28/M tokens |
| Cache Hit Price | 20% of input rate | 20% of input rate |
Choose V4-Pro when you need strong reasoning, multi-step tool use, or complex code generation. The higher output cost is justified by measurably better results on hard tasks.
Choose V4-Flash when latency and cost matter more than peak quality. It handles straightforward extraction, classification, and templated generation well at a fraction of the output cost.
There is also a promotional pricing tier for V4-Pro at $0.435/M input and $0.87/M output. Check the DeepSeek pricing page for current availability.
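To see what these rates mean for a real workload, here is a rough cost estimator. The per-million prices are taken from the table above (standard, non-promotional rates); the function name and example numbers are just illustrative:

```python
# Rough per-request cost estimate, in USD, using the standard rates above (per million tokens).
PRICES = {
    "deepseek-v4-pro":   {"input": 0.145, "output": 1.74},
    "deepseek-v4-flash": {"input": 0.14,  "output": 0.28},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Cached input tokens are billed at 20% of the input rate; everything else at the listed rates."""
    p = PRICES[model]
    uncached = input_tokens - cached_tokens
    total = uncached * p["input"] + cached_tokens * p["input"] * 0.20 + output_tokens * p["output"]
    return total / 1_000_000

# Example: a 50K-token prompt (40K of it served from cache) with a 2K-token answer on V4-Pro.
print(f"${estimate_cost('deepseek-v4-pro', 50_000, 2_000, cached_tokens=40_000):.4f}")
```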
Basic Setup: Python with the OpenAI SDK
The fastest way to call the DeepSeek V4 API is through the OpenAI Python SDK, since the API is fully compatible with the OpenAI ChatCompletions format.
Install the SDK:
pip install openai
Make your first request:
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("DEEPSEEK_API_KEY"),
base_url="https://api.deepseek.com/v1",
)
response = client.chat.completions.create(
model="deepseek-v4-pro",
messages=[
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function that merges two sorted lists into one sorted list without using built-in sort."},
],
temperature=1.0,
top_p=1.0,
)
print(response.choices[0].message.content)
A note on temperature and top_p: DeepSeek V4 was tuned with temperature=1.0 and top_p=1.0. The documentation recommends keeping these defaults for best results. Lowering them can reduce output diversity but may also degrade quality on reasoning-heavy tasks.
Python with Raw HTTP Requests
If you prefer not to use the OpenAI SDK, you can call the API directly with requests:
import os
import requests
url = "https://api.deepseek.com/v1/chat/completions"
headers = {
"Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}",
"Content-Type": "application/json",
}
payload = {
"model": "deepseek-v4-flash",
"messages": [
{"role": "user", "content": "Explain the difference between a mutex and a semaphore in two sentences."}
],
"temperature": 1.0,
"top_p": 1.0,
}
response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()  # surface HTTP errors (401, 402, 429, ...) instead of failing on the lookup below
data = response.json()
print(data["choices"][0]["message"]["content"])
This approach works in any language that can make HTTP POST requests, which makes it useful for environments where you cannot install the OpenAI SDK.
Node.js / TypeScript Setup
Install the OpenAI Node.js SDK:
npm install openai
Then create a client and make a request:
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.DEEPSEEK_API_KEY,
baseURL: "https://api.deepseek.com/v1",
});
async function main() {
const response = await client.chat.completions.create({
model: "deepseek-v4-pro",
messages: [
{ role: "system", content: "You are a senior backend engineer." },
{
role: "user",
content:
"Write a TypeScript function that retries a fetch request up to 3 times with exponential backoff.",
},
],
temperature: 1.0,
top_p: 1.0,
});
console.log(response.choices[0].message.content);
}
main();
If you are using CommonJS instead of ES modules, replace the import with:
const OpenAI = require("openai").default;
Everything else stays the same.
Streaming Responses
For interactive applications, streaming lets you display tokens as they arrive instead of waiting for the full response. Both models support streaming.
Python Streaming
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("DEEPSEEK_API_KEY"),
base_url="https://api.deepseek.com/v1",
)
stream = client.chat.completions.create(
model="deepseek-v4-pro",
messages=[
{"role": "user", "content": "Write a detailed explanation of how B-trees work and why databases use them."},
],
temperature=1.0,
top_p=1.0,
stream=True,
)
for chunk in stream:
delta = chunk.choices[0].delta
if delta.content:
print(delta.content, end="", flush=True)
print() # newline at the end
Node.js Streaming
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.DEEPSEEK_API_KEY,
baseURL: "https://api.deepseek.com/v1",
});
async function main() {
const stream = await client.chat.completions.create({
model: "deepseek-v4-pro",
messages: [
{
role: "user",
content: "Explain how event loops work in Node.js.",
},
],
temperature: 1.0,
top_p: 1.0,
stream: true,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
process.stdout.write(content);
}
}
console.log();
}
main();
Streaming adds no extra cost. You are billed the same per-token rate regardless of whether you stream or not.
Using Think Max (Reasoning Mode)
DeepSeek V4-Pro supports a reasoning mode called Think Max, where the model performs extended chain-of-thought before producing a final answer. This is particularly effective for math, logic, and multi-step coding problems.
To enable Think Max, set max_tokens to at least 384K. The model uses this additional output budget to generate internal reasoning tokens before producing the final answer.
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("DEEPSEEK_API_KEY"),
base_url="https://api.deepseek.com/v1",
)
response = client.chat.completions.create(
model="deepseek-v4-pro",
messages=[
{
"role": "user",
"content": (
"A farmer has 3 fields. Field A produces 20% more wheat than Field B. "
"Field C produces 15% less than Field A. Together they produce 5,400 kg. "
"How much does each field produce? Show your reasoning step by step."
),
},
],
temperature=1.0,
top_p=1.0,
max_tokens=400000, # Set high enough to enable Think Max reasoning
)
message = response.choices[0].message
# The model may include reasoning in a separate field depending on the API version
if hasattr(message, "reasoning_content") and message.reasoning_content:
print("=== Reasoning ===")
print(message.reasoning_content)
print()
print("=== Answer ===")
print(message.content)
Keep in mind that reasoning tokens are billed as output tokens. For straightforward questions, standard mode is more cost-effective. Reserve Think Max for problems where accuracy on complex reasoning justifies the added cost.
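Because reasoning tokens are folded into the output count, it is worth logging token usage for Think Max calls. Continuing the example above, a minimal sketch using the standard usage object and the V4-Pro output rate from the pricing table:

```python
# Reasoning tokens are included in completion_tokens, so usage reflects the real output cost.
usage = response.usage
print(f"Prompt tokens:     {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")

# Rough output cost at $1.74 per million output tokens (V4-Pro standard rate).
print(f"Approx. output cost: ${usage.completion_tokens * 1.74 / 1_000_000:.4f}")
```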
Function Calling and Tool Use
DeepSeek V4 supports function calling (tool use), which lets the model invoke structured functions you define. This is essential for building AI agents that interact with databases, APIs, or external services.
import os
import json
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("DEEPSEEK_API_KEY"),
base_url="https://api.deepseek.com/v1",
)
# Define available tools
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a given city.",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The city name, e.g. 'San Francisco'",
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit",
},
},
"required": ["city"],
},
},
},
]
# First request: model decides whether to call a tool
response = client.chat.completions.create(
model="deepseek-v4-pro",
messages=[
{"role": "user", "content": "What's the weather like in Tokyo right now?"},
],
tools=tools,
tool_choice="auto",
temperature=1.0,
top_p=1.0,
)
message = response.choices[0].message
if message.tool_calls:
tool_call = message.tool_calls[0]
function_name = tool_call.function.name
arguments = json.loads(tool_call.function.arguments)
print(f"Model wants to call: {function_name}({arguments})")
# Simulate the function response
weather_data = json.dumps({
"city": arguments["city"],
"temperature": 22,
"unit": "celsius",
"condition": "partly cloudy",
})
# Second request: provide the tool result back to the model
follow_up = client.chat.completions.create(
model="deepseek-v4-pro",
messages=[
{"role": "user", "content": "What's the weather like in Tokyo right now?"},
message, # the assistant message with tool_calls
{
"role": "tool",
"tool_call_id": tool_call.id,
"content": weather_data,
},
],
tools=tools,
temperature=1.0,
top_p=1.0,
)
print(follow_up.choices[0].message.content)
else:
print(message.content)
V4-Pro handles multi-turn tool use reliably, including parallel tool calls where the model invokes multiple functions in a single response. For agentic workflows that require chained reasoning across several tool invocations, V4-Pro is the recommended model.
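When the model does return parallel tool calls, the pattern generalizes: append the assistant message once, then append one tool-role message per call, matched by tool_call_id. A sketch continuing the example above (execute_tool stands in for whatever dispatch logic you implement):

```python
# Handle parallel tool calls: one tool message per call, then ask the model to continue.
messages = [{"role": "user", "content": "What's the weather like in Tokyo and Osaka right now?"}]
messages.append(message)  # the assistant message containing tool_calls

for tool_call in message.tool_calls:
    args = json.loads(tool_call.function.arguments)
    result = execute_tool(tool_call.function.name, args)  # placeholder for your own dispatch logic
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    })

follow_up = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=messages,
    tools=tools,
)
print(follow_up.choices[0].message.content)
```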
Caching and Cost Optimization Tips
DeepSeek V4 automatically caches prompt prefixes. When subsequent requests share the same prefix (for example, a long system prompt or a document for analysis), cached tokens are billed at just 20% of the normal input rate. No setup or configuration is required.
Here are practical ways to reduce your API costs:
Use consistent system prompts. Place your system prompt at the start of every request. If the prompt stays identical across calls, those tokens will hit the cache after the first request.
Front-load static context. If you are sending a large document for analysis, put it early in the message list. Tokens that change between requests (like the user's question) should come last. The cache matches from the beginning of the prompt forward.
Choose the right model. V4-Flash output costs roughly 16% of V4-Pro output. If your task does not require deep reasoning, the cost savings compound quickly at scale.
Batch similar requests. If you have multiple short questions about the same context, combine them into a single request rather than making separate calls. This avoids redundant input token charges.
Monitor your usage. The DeepSeek dashboard shows token consumption and cache hit rates. Review this regularly to identify optimization opportunities.
| Optimization | Impact |
|---|---|
| Cache hits on system prompts | Up to 80% reduction in input costs |
| V4-Flash instead of V4-Pro | Up to 84% reduction in output costs |
| Batching context-heavy requests | Fewer redundant input tokens |
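To make the first two tips concrete, here is a sketch of a cache-friendly request layout: the system prompt and a large static document stay byte-for-byte identical across calls, and only the question changes. The file name and prompts are placeholders; response.usage is a standard field, though the exact cache-hit breakdown DeepSeek reports may differ, so verify against your dashboard:

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com/v1",
)

SYSTEM_PROMPT = "You are a contract analyst. Answer strictly from the provided document."
DOCUMENT = open("contract.txt").read()  # large static context, identical across requests

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            # Static prefix first: cached after the first request.
            {"role": "system", "content": SYSTEM_PROMPT},
            # Document before the question, so only the tail of the prompt changes.
            {"role": "user", "content": f"{DOCUMENT}\n\nQuestion: {question}"},
        ],
    )
    print(response.usage)  # compare token counts across calls to confirm the prefix is reused
    return response.choices[0].message.content

print(ask("What is the termination notice period?"))
print(ask("Who owns the IP created under this agreement?"))
```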
Common Errors and Troubleshooting
401 Unauthorized: Your API key is missing, invalid, or expired. Verify the key is set correctly in your environment and that it has not been revoked from the dashboard.
402 Payment Required / Insufficient Balance: Your account balance is too low. Add credits at platform.deepseek.com. The minimum top-up is $2.
429 Too Many Requests: You have exceeded the rate limit. Implement exponential backoff in your retry logic (see the sketch below). The API returns a Retry-After header when available.
Model not found: Double-check the model name string. Valid values are deepseek-v4-pro and deepseek-v4-flash. Older model names like deepseek-chat or deepseek-reasoner refer to previous generations.
Truncated responses: If your output is being cut short, increase max_tokens. The default may be lower than what your task requires. For Think Max reasoning mode, set max_tokens to at least 384,000.
Slow responses with V4-Pro: V4-Pro has higher latency than V4-Flash due to its larger active parameter count. If speed is critical for your use case, test whether V4-Flash produces acceptable quality for that specific task.
Connection timeouts: Set your HTTP client timeout to at least 120 seconds for long-generation requests. Streaming mode can help mitigate perceived latency in user-facing applications.
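Here is the retry pattern referenced in the 429 item, combined with the longer client timeout recommended above. It relies on the OpenAI SDK's RateLimitError and APITimeoutError exception classes and its timeout option; tune the retry count and delays to your own traffic:

```python
import os
import time
from openai import OpenAI, RateLimitError, APITimeoutError

client = OpenAI(
    api_key=os.environ.get("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com/v1",
    timeout=120,  # generous timeout for long generations
)

def chat_with_retry(messages, model="deepseek-v4-flash", max_retries=3):
    """Retry rate-limited or timed-out requests with exponential backoff (1s, 2s, 4s)."""
    for attempt in range(max_retries + 1):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except (RateLimitError, APITimeoutError):
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)  # back off before the next attempt

response = chat_with_retry([{"role": "user", "content": "Summarize what a mutex does in one sentence."}])
print(response.choices[0].message.content)
```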
Frequently Asked Questions
Is the DeepSeek V4 API compatible with OpenAI client libraries? Yes. The API implements the OpenAI ChatCompletions specification. You can use the official openai Python and Node.js SDKs by changing the base_url (or baseURL) to https://api.deepseek.com/v1 and setting your DeepSeek API key. It also supports the Anthropic API format, giving you flexibility in how you integrate.
Can I use DeepSeek V4 for code generation? Absolutely. DeepSeek V4-Pro performs well on coding benchmarks and is particularly strong at multi-file refactoring, debugging, and agentic coding tasks. V4-Flash can handle simpler code generation like boilerplate, templates, and single-function implementations.
How does the 1M token context window work in practice? Both V4-Pro and V4-Flash accept up to 1 million tokens of input context. This is large enough to process entire codebases, long legal documents, or extended conversation histories in a single request. Keep in mind that very long contexts increase latency and cost, so include only what is relevant.
Do I need to do anything special to enable caching? No. Caching is automatic and requires no setup fee or configuration. Any shared prefix between requests will be cached, and cache hits are billed at 20% of the standard input token rate.
Build Faster with Vetted AI Engineers
The DeepSeek V4 API opens up real possibilities for developers building AI-powered products, from coding assistants and document analysis tools to autonomous agents with tool use. The OpenAI-compatible interface means you can integrate it quickly, and the pricing makes it viable for production workloads.
Need developers who can integrate AI APIs like DeepSeek V4 into your product? Codersera connects you with vetted remote developers skilled in AI/ML engineering. Whether you are building an AI agent framework, adding LLM features to an existing product, or scaling your inference pipeline, Codersera helps you find the right technical fit, fast.