How to Run IBM Granite 4.0 3B Vision Locally for Chart, Table, and Document Extraction
Modern documents mix text, tables, charts, and scanned pages, and many teams need to extract this content on local hardware for privacy reasons. IBM Granite 4.0 3B Vision is a compact vision-language model built for this need: it focuses on chart, table, and key-value extraction while keeping hardware demands moderate.
What Is IBM Granite 4.0 3B Vision
IBM Granite 4.0 3B Vision is a 3‑billion‑parameter vision-language model for document data extraction. A vision-language model, or VLM, can read images and text together and answer questions or output structured data.
This model targets enterprise documents with charts, tables, forms, and complex layouts. It is available as a LoRA adapter on top of the Granite 4.0 Micro language model and uses an Apache 2.0 license.
In Granite 4.0 3B Vision, LoRA layers cover attention and MLP blocks so the base model can serve both text and multimodal tasks. This design keeps deployment flexible while keeping the model size moderate for local use.
The model focuses on three main jobs:
- It converts charts into CSV, natural language summaries, or Python plotting code.
- It extracts tables from scanned pages into JSON, HTML, or an intermediate OTSL format.
- It reads forms and pulls out semantic key‑value pairs, even when layout and wording change between documents.
Key Features
- Chart extraction tags: Special prompt tags like <chart2csv>, <chart2summary>, and <chart2code> trigger chart-specific outputs such as data tables, text summaries, or Python plotting scripts.
- Table extraction tags: Tags like <tables_json>, <tables_html>, and <tables_otsl> return structured table outputs with rows, columns, and merges.
- Semantic key-value extraction: The model reads forms and returns key-value pairs based on field names and short descriptions, not just exact string matches.
- DeepStack visual architecture: A DeepStack‑style vision encoder injects image features into multiple language model layers for stronger grounding on charts and layouts.
- ChartNet training data: A custom ChartNet dataset of about 1.7 million chart samples links plotting code, chart images, data tables, summaries, and QA pairs.
- Unified table benchmark focus: The model targets PubTablesV2, OmniDocBench, and TableVQA for table extraction quality across cropped and full‑page images.
- VAREX benchmark strength: On the VAREX benchmark for structured form extraction, it reaches 85.5% exact‑match accuracy in zero‑shot mode.
- vLLM server support: A dedicated vLLM integration script exposes the model through an OpenAI‑compatible HTTP API for local or on‑prem use.
- Open‑source license: The Apache 2.0 license allows commercial and research use, including local deployments and model modifications.
- Reasonable local hardware needs: Community tests report practical use on GPUs in the 8–12 GB VRAM range with about an 8 GB model size.
How to Install or Set Up
Step 1: Prepare hardware and OS
Use a Linux or Windows machine with a recent NVIDIA GPU. For comfortable speed, aim for a GPU with 8–12 GB of VRAM, such as an RTX 3060. The model can run on CPU, but performance drops sharply, so CPU-only setups are best reserved for testing. Ensure that Python 3.10 or newer and Git are installed.
Step 2: Install GPU drivers and CUDA
Install the latest NVIDIA driver that matches your GPU. Install the CUDA toolkit and cuDNN that match your driver and PyTorch version. Follow NVIDIA’s platform guide for your OS to avoid version conflicts.
Step 3: Create a Python environment
Create a virtual environment so Granite 4.0 3B Vision and its dependencies stay separate from other projects. For example, create a venv directory and activate it, then install core Python packages:
pip install --upgrade pip
pip install vllm openai huggingface_hub pillow
vLLM is a high‑performance inference engine that supports image and LoRA features for Granite 4.0 3B Vision. The openai package provides a simple client for the HTTP API.
Step 4: Download the model from Hugging Face
Granite 4.0 3B Vision is hosted on Hugging Face under the ID ibm-granite/granite-4.0-3b-vision. You can let vLLM download weights on first run, so no manual clone is required. If you prefer a full local copy, run:
git lfs install
git clone https://huggingface.co/ibm-granite/granite-4.0-3b-vision
The repository includes example images and a vLLM integration script.
Step 5: Start the vLLM Granite 4.0 Vision server
The Hugging Face repo includes a start_granite4_vision_server.py script that wires Granite 4.0 3B Vision into vLLM. This script exposes an OpenAI‑style endpoint on a configurable host and port. A typical start command looks like this:
python start_granite4_vision_server.py \
  --model ibm-granite/granite-4.0-3b-vision \
  --trust_remote_code --host 0.0.0.0 --port 8000 \
  --hf-overrides '{"adapter_path": "ibm-granite/granite-4.0-3b-vision"}'
This setup lets vLLM load the Granite 4.0 Micro base and apply the vision LoRA per request. Text‑only prompts use the base model, while image prompts trigger the vision path.
Step 6: Optional GGUF and Ollama setup
IBM also maintains GGUF‑encoded variants of Granite models and conversion scripts for llama.cpp and Ollama.
GGUF is a file format for quantized models that fit on smaller GPUs or CPUs. For Granite 4.0 3B Vision, the IBM GGUF repository includes configuration entries and scripts for local testing with an Ollama server on macOS.
To use this route, convert the model to GGUF using the IBM tools and load it in an Ollama configuration file. Expect some feature gaps compared to the full vLLM path, especially around advanced multimodal features.
How to Run or Use It
Basic request pattern
Granite 4.0 3B Vision exposes a chat completion style interface. Each request includes a list of messages with roles like user and assistant.
The user message contains two parts: an image and a short text string with a control tag. The image is passed as a URL or as base64 data, and the text holds a tag such as <chart2csv> or <tables_json>.
With the OpenAI Python client, a request has this structure:
- Set base_url to http://localhost:8000/v1 and use an arbitrary API key.
- Load an image from disk and encode it as base64.
- Build a message where content is a list with an image_url item and a text item that holds the tag.
- Call client.chat.completions.create() with model="ibm-granite/granite-4.0-3b-vision" and read the text from the first choice.
The Hugging Face model card provides a complete code example for chart and table tasks.
Running chart extraction
For chart extraction, Granite 4.0 3B Vision supports three main tags.
- <chart2csv>: The model reads the chart and outputs a CSV text table with headers and numeric values.
- <chart2summary>: The model returns a short, structured description of the main trends and values in the chart.
- <chart2code>: The model returns Python plotting code that recreates the chart, often with libraries like Matplotlib.
You can reuse the same chart image and change only the tag to get different outputs. This approach keeps prompts simple and removes the need to design complex natural language instructions for each chart.
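Because only the tag changes, all three chart views can be collected in one loop. In this sketch, `send` is a stand-in for whatever request helper you use against the local server:

```python
# Control tags for the three chart tasks; the `send(image_path, tag)` callable
# is illustrative and stands in for any function that posts one image plus one
# tag to the local Granite endpoint.
CHART_TAGS = {
    "csv": "<chart2csv>",
    "summary": "<chart2summary>",
    "code": "<chart2code>",
}


def extract_chart_views(image_path, send):
    """Run all three chart tags against the same image and collect the outputs."""
    return {name: send(image_path, tag) for name, tag in CHART_TAGS.items()}
```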
Running table extraction
For tables, Granite 4.0 3B Vision supports three structured formats via tags.
- <tables_json>: Returns a JSON structure with table dimensions, cell content, and merge information.
- <tables_html>: Returns HTML <table> markup that standard tools can render and parse.
- <tables_otsl>: Returns OTSL, an intermediate markup language that captures spans, merges, and structure for further processing.
To use these tags, send a page image that contains one or more tables. The model segments the tables and fills the chosen format. Output can feed into pipelines that load JSON into databases or HTML into downstream parsers.
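As a sketch of that downstream step, a JSON table can be rebuilt into a plain row grid. The schema used here (`tables`, `num_rows`, `num_cols`, `cells`) is an assumed shape for illustration; adjust the key names to match the actual <tables_json> output of your model version:

```python
import json


def table_json_to_rows(raw: str):
    """Parse a hypothetical <tables_json> reply into a list of row grids.

    The schema is illustrative: one top-level "tables" list whose entries
    carry dimensions plus a flat "cells" list with row/col coordinates.
    """
    data = json.loads(raw)
    grids = []
    for table in data.get("tables", []):
        grid = [["" for _ in range(table["num_cols"])]
                for _ in range(table["num_rows"])]
        for cell in table["cells"]:
            grid[cell["row"]][cell["col"]] = cell["text"]
        grids.append(grid)
    return grids
```

Each grid can then be handed to `pandas.DataFrame` or written straight to a database.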
Running key‑value pair extraction
Semantic key‑value extraction reads forms and pulls out fields such as name, date, and total amount. Instead of matching exact labels, the model uses descriptions from the prompt.
A typical prompt includes a short schema description like “Extract the following fields: invoice_number, issue_date, supplier_name, net_amount, tax_amount, total_amount. Return JSON.”
The model then scans the form image and returns a JSON object with keys and values.
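Model replies sometimes wrap the JSON object in prose, so a tolerant parser on the client side helps. This is a minimal sketch of such a parser, not part of the model's API:

```python
import json
import re


def parse_kv_response(text: str, required: list) -> dict:
    """Pull the first JSON object out of a model reply and check required keys.

    Searches for the outermost braces rather than calling json.loads on the
    raw reply, since the object may be surrounded by explanatory text.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))
    missing = [k for k in required if k not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data
```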
On the VAREX benchmark, which uses more than 1,700 US government forms, Granite 4.0 3B Vision reaches 85.5% exact‑match accuracy in zero‑shot mode. Exact‑match means that all key‑value pairs for a form must match ground truth to count as correct.
Handling multiple pages and text prompts
Granite 4.0 3B Vision focuses on single images per request, but it fits into larger pipelines with multiple pages.
For multi‑page PDFs, split pages into images and send them one by one with tags that match each task. Use a separate text‑only model, such as a Granite text model, for long‑form analysis or summarization of extracted content.
The vLLM integration allows both text‑only and image requests on one endpoint. That keeps deployment simple for applications that mix RAG, question answering, and structured extraction.
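A page-by-page driver for that pattern can stay very small. In this sketch, `send` again stands in for your request helper, and the page images are assumed to come from an upstream PDF-to-image step:

```python
def extract_document(page_images, tasks, send):
    """Run each extraction tag over every page image, one request at a time.

    `page_images` is a list of image file paths (one per PDF page), `tasks`
    maps result names to control tags, and `send(path, tag)` is any callable
    that posts one image plus one tag to the local server.
    """
    results = []
    for page_num, path in enumerate(page_images, start=1):
        page_result = {"page": page_num}
        for name, tag in tasks.items():
            page_result[name] = send(path, tag)
        results.append(page_result)
    return results
```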
Benchmark Results
The table below summarizes key Granite 4.0 3B Vision benchmark scores on chart, table, and form extraction tasks.

| Benchmark | Task | Score |
| --- | --- | --- |
| ChartNet (human-verified) | Chart2Summary | 86.4% |
| ChartNet (human-verified) | Chart2CSV | 62.1% |
| PubTablesV2, cropped tables | Table extraction (TEDS) | 92.1 |
| PubTablesV2, full pages | Table extraction (TEDS) | 79.3 |
| OmniDocBench | Table extraction (TEDS) | 64.0 |
| TableVQA | Table question answering | 88.1 |
| VAREX | Key-value extraction (exact match, zero-shot) | 85.5% |
These scores show that Granite 4.0 3B Vision performs near the top of its parameter class. It often matches or outperforms larger vision models on chart, table, and form extraction tasks.
Testing Details
ChartNet chart understanding
IBM created ChartNet, a dataset with about 1.7 million synthetic and real chart samples. Each sample includes plotting code, a rendered chart image, the underlying data table, a natural‑language chart summary, and question–answer pairs.
Granite 4.0 3B Vision was evaluated on a human‑verified ChartNet benchmark that tests Chart2Summary and Chart2CSV quality. An LLM‑as‑a‑judge procedure scores outputs on correctness and faithfulness to the chart.
On this benchmark, Granite 4.0 3B Vision achieves 86.4% for Chart2Summary and 62.1% for Chart2CSV, placing first and second among all tested models on each task.
Unified table extraction suite
For tables, IBM built a unified evaluation suite spanning PubTablesV2, OmniDocBench, and TableVQA. PubTablesV2 measures table structure and content reconstruction with the TEDS metric.
OmniDocBench includes complex layouts with multiple tables and surrounding content. TableVQA tests whether models can answer questions about tables after extracting them.
Granite 4.0 3B Vision outputs HTML tables, which are then scored by TEDS. The model leads across this suite, with 92.1 on cropped PubTablesV2 tables, 79.3 on full‑page PubTablesV2 tables, 64.0 on OmniDocBench, and 88.1 on TableVQA.
These results indicate strong performance both on clean table crops and on full pages with surrounding text and figures.
VAREX key‑value extraction
The VAREX benchmark targets structured key‑value extraction from real US government forms. It includes 1,777 forms with diverse layouts, from simple flat structures to nested and tabular fields. Models must output key‑value pairs for a specified schema and are scored by exact‑match accuracy.
Granite 4.0 3B Vision obtains 85.5% exact‑match accuracy in a zero‑shot setting. Zero‑shot here means the model was not fine‑tuned on VAREX forms.
The score places it third among 2–4B parameter models as of March 2026 and competitive with larger models.
Relation to earlier Granite Vision models
Earlier Granite Vision models, such as Granite Vision 3.2 2B and Granite Vision 3.3 2B, already reached strong scores on document benchmarks like DocVQA and ChartQA.
For example, granite‑vision‑3.3‑2b reports ChartQA scores around 0.87 and DocVQA scores up to 0.91. Granite 4.0 3B Vision extends this focus on charts and tables with a modern DeepStack variant, ChartNet data, and stronger KVP extraction.
Comparison Table
The table below compares Granite 4.0 3B Vision with related and competing vision models.
Pricing Table
Granite 4.0 3B Vision itself is an open‑source model, but many teams mix local and hosted options. The table below summarizes realistic pricing paths based on current IBM watsonx.ai information and related Granite Vision models.
Prices may change, and regional differences apply, so always check the latest IBM pricing page before planning costs.
USP — What Makes It Different
Granite 4.0 3B Vision stands out through its narrow but deep focus on document structure rather than broad natural image tasks. The ChartNet dataset and DeepStack architecture together push chart summarization and chart-to-CSV extraction to the top of current benchmarks for models of similar or larger size. Its table extraction results on PubTablesV2, OmniDocBench, and TableVQA show that a 3B model can rival or beat much larger VLMs in this niche.
At the same time, the Apache 2.0 license and vLLM integration make it practical for local deployments where data cannot leave internal networks.
Pros and Cons
Pros
- Strong results on chart summarization and chart‑to‑CSV tasks for its size.
- Leading table extraction scores on PubTablesV2, OmniDocBench, and TableVQA.
- High exact‑match accuracy on VAREX for form key‑value extraction.
- Apache 2.0 license supports commercial local use with no model licensing fees.
- vLLM integration with an OpenAI‑style API reduces application changes.
- 3B parameter scale fits mid‑range GPUs and some laptop‑class GPUs.
Cons
- Focus on document images means weaker support for general natural image tasks than some larger VLMs.
- CPU‑only setups run, but speed often does not suit production pipelines.
- Training and benchmarks focus on English, so other languages may see lower accuracy.
- Hosted pricing for Granite Vision models exists only for earlier 3.x versions so far, which can complicate planning for 4.0.
Quick Comparison Chart
The chart below maps common needs to suitable options.
Demo or Real‑World Example
Use case: Local extraction from a financial report page
This example walks through a simple workflow: extract a chart, a table, and key fields from a page of a financial report using Granite 4.0 3B Vision running on vLLM.
A. Prepare the input image
- Export a single page from a PDF report that contains a revenue trend chart, a quarterly revenue table, and a small summary box with key figures.
- Save this page as a high-resolution PNG file, for example report_q4_2025.png.
B. Start the Granite 4.0 3B Vision server
- Activate your Python environment and start the vLLM server with the start_granite4_vision_server.py script.
- Confirm that the server listens on http://localhost:8000/v1.
C. Extract the chart as CSV
- Write a short Python script that loads report_q4_2025.png, encodes it as base64, and sends a chat completion request with the tag <chart2csv>.
- The response body contains CSV text with headers like Quarter,Revenue and numeric values that match the chart.
D. Extract the chart summary
- Reuse the same image but change the tag to <chart2summary>.
- The model returns a paragraph that explains growth or decline across quarters, with explicit numbers where present in the chart.
E. Extract the table structure as JSON
- Send another request with the same image and the tag <tables_json>.
- The response includes a JSON object that lists table rows, columns, and cell content, which you can store directly in a database or transform into a DataFrame.
F. Extract key‑value pairs for a summary box
- Craft a prompt such as “Extract the following fields: total_revenue, operating_income, net_income, earnings_per_share. Return JSON.” and send it with the page image.
- The model returns a JSON object with the requested keys and their values pulled from the summary box or nearby text.
G. Integrate into an ETL pipeline
- Wrap these steps into a small service that accepts PDFs, converts pages to images, calls Granite 4.0 3B Vision for each task, and writes CSV, JSON, and summary text into your data warehouse or lake.
- Over time, this pipeline can replace manual data entry for recurring financial reports while keeping all processing on internal hardware.
This workflow mirrors how Granite 4.0 3B Vision was designed to operate: as a local, structured extraction engine for enterprise documents.
Conclusion
Granite 4.0 3B Vision offers a focused answer to chart, table, and form extraction needs in enterprise documents. The open Apache 2.0 license and vLLM integration give teams a clear path to private, on‑prem deployments.
For organizations that handle many complex PDFs and scanned forms, Granite 4.0 3B Vision is a strong candidate for the extraction layer in a broader document AI stack.
FAQ
1. Does Granite 4.0 3B Vision support languages other than English?
Training and documentation focus on English, and most benchmarks use English documents. It may work on other languages, but accuracy can drop, so careful evaluation is important.
2. Can Granite 4.0 3B Vision replace OCR engines?
The model includes strong OCR‑related abilities on tables, charts, and forms, similar to earlier Granite Vision models that score well on OCRBench. For simple text‑only scans, a classic OCR tool may still be faster, but Granite 4.0 3B Vision shines when layout and structure matter.
3. What GPU do I need to run it locally?
Community reports and third‑party tests suggest that GPUs with 8–12 GB VRAM, such as RTX 3060‑class cards, can run Granite 4.0 3B Vision at practical speeds. Larger GPUs improve throughput and batch size, especially in production.
4. How does it compare with larger proprietary models like GPT‑4V or Gemini?
Research on Granite Vision models shows that small Granite Vision variants can approach or match much larger proprietary models on document tasks like DocVQA and ChartQA. Granite 4.0 3B Vision extends this trend on chart and table extraction, but large proprietary models still lead on broad general‑purpose multimodal reasoning.
5. Is there a managed cloud option for Granite 4.0 3B Vision today?
Granite 4.0 3B Vision is currently available first as an open model on Hugging Face. For teams that want cloud hosting, using Granite Vision 3.x on watsonx is the closest managed alternative until Granite 4.0 appears as a hosted option.