How to Run MiniMax‑M2.7 Locally: Step‑by‑Step Guide
MiniMax‑M2.7 is a new open‑weight model built for coding, agents, and complex office tasks. You can now download its weights and run it on your own hardware instead of relying only on cloud APIs.
This guide explains what MiniMax‑M2.7 is, the hardware it needs, and how to run it locally with different tools.
What Is MiniMax‑M2.7
MiniMax‑M2.7 is a large language model designed for software engineering, agent workflows, and professional office work.
It uses a sparse Mixture‑of‑Experts architecture with about 230 billion total parameters and around 10 billion active per token. Sparse Mixture‑of‑Experts means the model activates only a subset of its experts for each token, which reduces compute while keeping quality.
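As a rough illustration of how sparse routing works (the expert count and top‑k below are toy values for illustration, not MiniMax's actual router), a gate can rank experts per token and run only the best few:

```python
import random

NUM_EXPERTS = 32  # toy expert count, for illustration only
TOP_K = 2         # experts activated per token in this sketch

def route(gate_scores, top_k=TOP_K):
    """Return the indices of the top-k experts for one token."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return ranked[:top_k]

random.seed(0)
scores = [random.random() for _ in range(NUM_EXPERTS)]
active = route(scores)
print(f"active experts: {active} "
      f"({TOP_K}/{NUM_EXPERTS} = {TOP_K / NUM_EXPERTS:.0%} of experts per token)")
```

With top‑2 routing over 32 toy experts, only about 6 percent of expert parameters run for each token; the same principle underlies M2.7's roughly 10B‑of‑230B activation ratio.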
The model took part in its own training loop, where it updated tools and scaffolds based on experiment feedback. During internal runs it optimized a programming scaffold over 100 rounds, improving performance on that task by about 30 percent.
MiniMax calls this process “self‑evolution,” because the model helps improve the system that trains it. It reaches 56.22 percent on the SWE‑Pro benchmark and strong scores on VIBE‑Pro, Terminal Bench 2, and several software‑engineering suites.
For office tasks it reaches an ELO score of 1495 on the GDPval‑AA benchmark, the highest among open‑weight models reported so far. The open‑weight release is hosted on Hugging Face as MiniMaxAI/MiniMax‑M2.7, with many quantized variants and GGUF conversions.
Third‑party guides from Unsloth and community GGUF maintainers show how to run it with llama.cpp on large‑RAM systems. NVIDIA and MiniMax also publish vLLM and SGLang server commands for running it on multi‑GPU nodes.
Key Features
- Self‑evolving training loop: The model updates its memory, skills, and scaffolds during reinforcement‑learning experiments to improve performance.
- Sparse Mixture‑of‑Experts design: 230B total parameters with about 10B active per token, so compute stays lower than dense peers.
- Strong coding performance: Scores 56.22 percent on SWE‑Pro and competitive scores on SWE Multilingual, Multi SWE Bench, and NL2Repo.
- Real‑world engineering focus: Designed for log analysis, bug hunting, refactoring, code security, and ML workflows, not only toy tasks.
- Agent‑ready capabilities: Supports complex tool use, agent teams, and dynamic tool search for multi‑step workflows.
- Office productivity strength: ELO 1495 on GDPval‑AA, with 97 percent skill adherence on 40 long office tasks over 2000 tokens each.
- Long context window: OpenRouter and partner docs list about 204k tokens of context for API usage.
- Open‑weight with quantizations: Hugging Face lists dozens of quantized versions plus GGUF ports for llama.cpp.
- GGUF support: Community GGUF sets provide Q2_K to Q8_0 and BF16 files, with sizes from about 83GB to 427GB.
- 4‑bit dynamic GGUF: Unsloth’s UD‑IQ4_XS quant reduces the working set to about 108GB so it fits on a 128GB RAM machine.
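The file sizes above imply an average bit‑width per weight. A minimal sketch (treating the roughly 230B parameter count as exact, which it is not) backs that out from the reported sizes:

```python
def effective_bpw(size_gb, total_params_b=230):
    """Back out average bits per weight from a reported GGUF file size."""
    # size_gb * 1e9 bytes * 8 bits, divided by total_params_b * 1e9 weights
    return size_gb * 8 / total_params_b

# File sizes reported above for the quantized releases
for name, size_gb in [("Q2_K", 83), ("UD-IQ4_XS", 108), ("BF16", 427)]:
    print(f"{name:10s} {size_gb:4d} GB -> ~{effective_bpw(size_gb):.1f} bits/weight")
```

The UD quant comes out under 4 bits per weight on average, consistent with a dynamic scheme that mixes precisions across layers rather than quantizing every tensor the same way.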
How to Install or Set Up
This section focuses on a practical local setup that a power user can build at home or in a small lab.
Option 1: llama.cpp with 4‑bit GGUF (Unsloth)
Goal: Run MiniMax‑M2.7 locally on a single workstation with about 128GB RAM using a 4‑bit GGUF file.
- Prepare hardware
- Use a machine with at least 16 CPU cores and 128GB of RAM for the 4‑bit GGUF.
- A modern NVIDIA GPU with 24GB or more VRAM helps, but the GGUF path can also stream from system memory.
- Install basic tools
- Install Git, CMake, and a C++ compiler from your OS package manager.
- Clone llama.cpp from its GitHub repository and build it with `cmake` and `make` as described in its README.
- Download the 4‑bit GGUF
- Unsloth hosts MiniMax‑M2.7 GGUF under `unsloth/MiniMax-M2.7-GGUF` on Hugging Face.
- The UD‑IQ4_XS quantization reduces model size to about 108GB for 128GB RAM devices.
- Use `huggingface-cli download unsloth/MiniMax-M2.7-GGUF` or download the specific UD‑IQ4_XS file from the model page.
- Place the GGUF file
- Copy the chosen GGUF file into a folder under your llama.cpp directory, for example `models/minimax-m2.7`.
- Ensure the file name matches the commands you plan to use, such as `MiniMax-M2.7-UD-IQ4_XS-00001-of-00004.gguf`.
- Test the CLI binary
Use the Unsloth example command to confirm the model loads and answers prompts.
```bash
export LLAMA_CACHE="unsloth/MiniMax-M2.7-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/MiniMax-M2.7-GGUF:UD-IQ4_XS \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 40
```
This uses Unsloth’s dynamic GGUF cache and the recommended generation parameters from the MiniMax model card.
Option 2: vLLM server with full HF weights
Goal: Run the full MiniMax‑M2.7 model on a multi‑GPU rig with vLLM, similar to production clusters.
- Prepare hardware
- NVIDIA and MiniMax describe MiniMax M‑series deployments using multiple high‑memory GPUs such as H100 or A100.
- For full‑precision or fp8 deployments, plan for hundreds of GB of effective VRAM or expert‑parallel setups.
- Install vLLM
- Create a new Python environment with Conda or uv.
- Use one of the official methods, for example:
```bash
conda create -n m27-env python=3.12 -y
conda activate m27-env
pip install "vllm>=0.9.2"
```
- Download or cache the HF model
- MiniMax‑M2.7 lives at `MiniMaxAI/MiniMax-M2.7` on Hugging Face.
- You can let vLLM download it on first run, or pre‑download with `huggingface-cli download MiniMaxAI/MiniMax-M2.7`.
- Start vLLM with official MiniMax flags
NVIDIA and MiniMax show an example that enables tool calling and reasoning parsers.
```bash
vllm serve MiniMaxAI/MiniMax-M2.7 \
    --tensor-parallel-size 4 \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --enable-auto-tool-choice \
    --trust-remote-code \
    --enable-expert-parallel
```
This exposes an OpenAI‑compatible API on a local HTTP port, which you can call from any OpenAI‑style client.
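Before committing to this option, a back‑of‑the‑envelope GPU count helps with planning. In the sketch below, the 1.3× overhead factor for KV cache and activations is an assumption for illustration, not an official figure:

```python
import math

def min_gpus(total_params_b, bytes_per_param, gpu_mem_gb=80, overhead=1.3):
    """Rough GPU count: weight bytes times an overhead factor
    for KV cache and activations, divided by per-GPU memory."""
    weights_gb = total_params_b * bytes_per_param
    return math.ceil(weights_gb * overhead / gpu_mem_gb)

# 230B parameters on 80GB H100/A100-class GPUs
print("fp8 :", min_gpus(230, 1.0), "GPUs")
print("bf16:", min_gpus(230, 2.0), "GPUs")
```

Under these assumptions, fp8 weights fit a 4‑GPU node while bf16 needs roughly twice that, which matches the article's point that full‑precision deployments call for hundreds of GB of effective VRAM.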
Option 3: llama.cpp server with community GGUF
If you prefer a more generic GGUF deployment, you can use the community GGUF set from AaryanK on Hugging Face.
- Download a GGUF like `MiniMax-M2.7.Q4_K_M.gguf` for a balance of quality and memory.
- Place it in your llama.cpp models folder.
- Start a local HTTP server:
```bash
./llama-server -m MiniMax-M2.7.Q4_K_M.gguf \
    --port 8080 \
    --host 0.0.0.0 \
    -c 16384 \
    -ngl 99
```
This gives you a lightweight HTTP endpoint for local apps while using a 4‑bit or 5‑bit GGUF quant.
How to Run or Use It
Running interactive chat with llama.cpp
After you install llama.cpp and download a MiniMax‑M2.7 GGUF, you can start an interactive chat from the terminal.
```bash
./llama-cli -m MiniMax-M2.7.Q4_K_M.gguf \
    -c 8192 \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 40 \
    -p "You are a helpful assistant for coding and debugging."
```
This command loads the quantized model, sets a context length of 8192 tokens, and passes a short system prompt.
You then type user questions, such as “Explain this Python error log and suggest a fix.” The model responds with analysis and a code patch using its strong software‑engineering abilities.
Calling the vLLM server from Python
When you run vllm serve with MiniMax‑M2.7, it exposes an OpenAI‑style /v1/chat/completions endpoint.
Many SDKs already support this format, so you can reuse existing OpenAI clients with a custom base URL.
Example using the official openai Python client with a local vLLM server:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy-key",
)

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.7",
    messages=[
        {"role": "system", "content": "You help debug production incidents."},
        {"role": "user", "content": "Our service returns 500s after a deploy. Logs show a DB timeout. What should I check?"},
    ],
    temperature=1.0,
    top_p=0.95,
    max_tokens=512,
)

print(response.choices[0].message.content)
```
The temperature and top‑p values follow MiniMax’s recommended inference settings for this model.
Using MiniMax‑M2.7 with tools and agents
MiniMax‑M2.7 is built for heavy tool use and agent teams, so the vLLM example includes dedicated parsers.
The --tool-call-parser minimax_m2 and --enable-auto-tool-choice flags let the model choose and call tools without manual parsing.
This suits workflows like automated bug‑fixing, long‑running ML experiments, or data pipelines controlled by an agent harness.
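As a sketch of what the harness side looks like (the `grep_logs` tool and its schema below are made up for illustration, not part of MiniMax's release), you define tools in the OpenAI tools format, pass them with the request, and dispatch any tool calls the model returns:

```python
import json

# Hypothetical tool definition in the OpenAI "tools" schema, which an
# OpenAI-compatible endpoint with tool parsing enabled can accept.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "grep_logs",  # made-up example tool
            "description": "Search service logs for a pattern.",
            "parameters": {
                "type": "object",
                "properties": {
                    "pattern": {"type": "string", "description": "Substring to search for."},
                },
                "required": ["pattern"],
            },
        },
    }
]

def dispatch(tool_name, arguments_json, registry):
    """Run a tool call emitted by the model against local implementations."""
    args = json.loads(arguments_json)
    return registry[tool_name](**args)

# Local implementation the agent harness would invoke when the model
# emits a grep_logs tool call (toy log lines stand in for real logs).
registry = {"grep_logs": lambda pattern: [line for line in ["db timeout", "ok"] if pattern in line]}
print(dispatch("grep_logs", '{"pattern": "timeout"}', registry))
```

With `--enable-auto-tool-choice`, the server parses the model's tool calls for you; your harness only needs the dispatch step shown here.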
Benchmark Results
MiniMax‑M2.7 shows strong scores across software‑engineering, office, and agent benchmarks from both MiniMax and independent reviewers.
MiniMax‑M2.7 Benchmark Table
Artificial Analysis also rates MiniMax‑M2.7 at 50 on its Intelligence Index, well above the peer average of 27.
They report around 45.7 tokens per second via the MiniMax API, slightly below the median speed for similar‑size open‑weight models.
Testing Details
The scores listed above come from three main groups: official MiniMax releases, partner benchmarks, and third‑party community tests.
Official values come from MiniMax’s M2.7 announcement, Hugging Face model card, and NVIDIA NIM model card. These cover software engineering suites, office productivity, and multi‑agent evaluations like Toolathon and MM Claw.
Taken together, the data shows that M2.7 sits in a “frontier‑class” band for coding and agent work while using fewer active parameters than many peers. This is why many users consider it a strong core for self‑hosted coding assistants and agent frameworks.
Comparison Table: MiniMax‑M2.7 vs Competitors
This table compares MiniMax‑M2.7 with Claude Opus 4.6 and GLM‑5, using public pricing and context data.
MiniMax‑M2.7 stands out on price, with input and output costs far below Claude Opus and below or similar to GLM‑5.
Unlike Claude Opus, both MiniMax‑M2.7 and GLM‑5 ship open weights, so you can run them fully local if your hardware is strong enough.
Pricing Table
This section focuses on how much you pay for MiniMax‑M2.7 in different usage modes.
For most hobby users, the local GGUF option has zero marginal cost once you own the machine. For production teams, the trade‑off lies between GPU cluster costs for self‑hosting and higher per‑token API prices for managed services.
USP — What Makes MiniMax‑M2.7 Different
MiniMax‑M2.7 combines three traits that are rare together: strong coding and agent performance, open weights, and aggressive pricing.
Its sparse MoE design activates about 10 billion parameters per token yet matches or approaches frontier dense models on major coding and agent benchmarks.
The model also supports deep tool use, agent teams, and long office workflows, plus a self‑evolving training loop that optimized its own scaffolds.
Pros and Cons
Pros
- Open‑weight release with many quantizations and GGUF ports for local deployment.
- Strong coding results on SWE‑Pro, VIBE‑Pro, and other real‑world benchmarks.
- Excellent office‑task performance with ELO 1495 on GDPval‑AA and high skill adherence.
- Designed for agent workflows with tool parsers and agent‑team support in official deployments.
- Competitive price per token on OpenRouter compared with other frontier‑class models.
- 4‑bit GGUF options enable near‑frontier intelligence on a single 128GB workstation.
Cons
- Full‑precision deployments still require very large GPU memory or multi‑node clusters.
- Even 4‑bit GGUF files are large, with Q4_K_M around 138–160GB and UD‑IQ4_XS about 108GB.
- Community tooling and tutorials are newer than for older open models, so you may debug more on first setup.
- Some benchmarks and pricing data focus on API usage, not local GGUF performance, so local speeds vary by hardware.
Quick Comparison Chart
This chart compares three common ways to run MiniMax‑M2.7 for local or near‑local work.
Ollama also exposes minimax-m2.7:cloud, but that option sends data to cloud inference rather than running the full model locally.
Demo or Real‑World Example: Local Coding Assistant with llama.cpp
This example shows how to use MiniMax‑M2.7 GGUF as a local coding assistant for debugging and refactoring.
Step 1: Start the local server
Assume you downloaded MiniMax-M2.7.Q4_K_M.gguf and placed it in your models folder.
```bash
./llama-server -m MiniMax-M2.7.Q4_K_M.gguf \
    --port 8080 \
    --host 127.0.0.1 \
    -c 16384 \
    -ngl 99
```
This starts a local HTTP server on port 8080 using a 4‑bit quantization that fits in high‑end workstation memory.
Step 2: Prepare a faulty code snippet
You can take a real error from your logs, for example a Python stack trace that shows a database timeout.
Save the trace and the related function into a file bug_report.txt for easy reuse.
This matches MiniMax‑M2.7’s strength in log analysis and bug hunting.
Step 3: Send a request from Python
Many llama.cpp builds expose an OpenAI‑style endpoint, so you can use a simple HTTP client.
The exact path can vary, but many builds support a /v1/chat/completions route with a JSON schema similar to OpenAI’s.
Example using requests in Python:
```python
import json

import requests

with open("bug_report.txt", "r") as f:
    bug_text = f.read()

payload = {
    "model": "MiniMax-M2.7.Q4_K_M.gguf",
    "messages": [
        {
            "role": "system",
            "content": "You are a senior backend engineer. Explain bugs and propose safe fixes.",
        },
        {
            "role": "user",
            "content": f"Here is a failing request log:\n\n{bug_text}\n\nExplain the root cause and propose a patch.",
        },
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 512,
}

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
    timeout=120,
)

print(resp.json()["choices"][0]["message"]["content"])
```
The system prompt sets expectations for code quality and safety, while the user message passes raw logs.
Step 4: Integrate with your editor
Next, you can connect this local endpoint to an editor extension that supports custom OpenAI endpoints. For example, you can configure VS Code plug‑ins to point their “OpenAI Base URL” to http://127.0.0.1:8080/v1.
Conclusion
MiniMax‑M2.7 offers frontier‑level coding and agent performance with open weights and competitive pricing. You can run it locally with 4‑bit GGUF on a large‑RAM workstation or with vLLM on multi‑GPU clusters, depending on your needs.
If you want a powerful, self‑hostable model for coding, debugging, and agents, MiniMax‑M2.7 is a strong option to test on your hardware.
FAQ
1. Can I run MiniMax‑M2.7 on a single GPU?
Full‑precision deployments usually need multiple high‑memory GPUs or expert‑parallel setups. With 4‑bit GGUF you can instead run it on a workstation with about 128GB of RAM plus a strong GPU.
2. What is the minimum RAM for a local GGUF run?
Unsloth reports 108GB for the UD‑IQ4_XS 4‑bit GGUF, which targets 128GB RAM systems. Larger GGUF variants like Q4_K_M need more memory, up to around 138–160GB based on community reports.
3. How fast is MiniMax‑M2.7 compared with other models?
Artificial Analysis measures about 45.7 tokens per second on the MiniMax API, slightly below peer median speed. Local GGUF speed depends on CPU, GPU, and settings, so you should benchmark on your own hardware.
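A minimal way to benchmark is to time a generation call. In the sketch below, `fake_generate` is only a placeholder; in practice you would wrap a real call into your local llama.cpp or vLLM endpoint that returns the number of tokens produced:

```python
import time

def tokens_per_second(generate, prompt):
    """Time one generation call and report throughput.

    `generate` is any callable taking a prompt and returning
    the number of tokens it produced.
    """
    start = time.perf_counter()
    n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stand-in generator so the sketch runs without a model loaded.
def fake_generate(prompt):
    time.sleep(0.01)
    return 50

rate = tokens_per_second(fake_generate, "Explain this stack trace.")
print(f"~{rate:.0f} tokens/sec")
```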
4. Is MiniMax‑M2.7 free to use?
The open‑weight release allows local use, subject to the license terms on the Hugging Face page. Hosted APIs such as OpenRouter and MiniMax cloud charge per token based on their published pricing.
5. Should I choose MiniMax‑M2.7 or Claude Opus 4.6?
Claude Opus 4.6 remains closed‑source, costs about $5/$25 per million input/output tokens, and runs only as an API. MiniMax‑M2.7 is cheaper per token on OpenRouter and also gives you the option of full local deployment with open weights.