How to Run IBM Granite 4.0 3B Vision Locally for Chart, Table, and Document Extraction
Modern documents mix text, tables, charts, and scanned pages, and many teams need to extract this content on local hardware for privacy reasons. IBM Granite 4.0 3B Vision is a compact vision-language model built for this need: it focuses on chart, table, and key-value extraction while keeping hardware demands moderate.
What Is IBM Granite 4.0 3B Vision
IBM Granite 4.0 3B Vision is a 3‑billion‑parameter vision-language model for document data extraction. A vision-language model, or VLM, can read images and text together and answer questions or output structured data.
This model targets enterprise documents with charts, tables, forms, and complex layouts. It is available as a LoRA adapter on top of the Granite 4.0 Micro language model and uses an Apache 2.0 license.
In Granite 4.0 3B Vision, LoRA layers cover attention and MLP blocks so the base model can serve both text and multimodal tasks. This design keeps deployment flexible while keeping the model size moderate for local use.
The model focuses on three main jobs:
- It converts charts into CSV, natural language summaries, or Python plotting code.
- It extracts tables from scanned pages into JSON, HTML, or an intermediate OTSL format.
- It reads forms and pulls out semantic key‑value pairs, even when layout and wording change between documents.
Key Features
- Chart extraction tags: Special prompt tags like <chart2csv>, <chart2summary>, and <chart2code> trigger chart-specific outputs such as data tables, text summaries, or Python plotting scripts.
- Table extraction tags: Tags like <tables_json>, <tables_html>, and <tables_otsl> return structured table outputs with rows, columns, and merges.
- Semantic key-value extraction: The model reads forms and returns key-value pairs based on field names and short descriptions, not just exact string matches.
- DeepStack visual architecture: A DeepStack‑style vision encoder injects image features into multiple language model layers for stronger grounding on charts and layouts.
- ChartNet training data: A custom ChartNet dataset of about 1.7 million chart samples links plotting code, chart images, data tables, summaries, and QA pairs.
- Unified table benchmark focus: The model targets PubTablesV2, OmniDocBench, and TableVQA for table extraction quality across cropped and full‑page images.
- VAREX benchmark strength: On the VAREX benchmark for structured form extraction, it reaches 85.5% exact‑match accuracy in zero‑shot mode.
- vLLM server support: A dedicated vLLM integration script exposes the model through an OpenAI‑compatible HTTP API for local or on‑prem use.
- Open‑source license: The Apache 2.0 license allows commercial and research use, including local deployments and model modifications.
- Reasonable local hardware needs: Community tests report practical use on GPUs in the 8–12 GB VRAM range with about an 8 GB model size.
How to Install or Set Up
Step 1: Prepare hardware and OS
Use a Linux or Windows machine with a recent NVIDIA GPU. For comfortable speed, aim for a GPU with 8–12 GB of VRAM, such as an RTX 3060. The model can run on CPU, but performance drops sharply, so CPU-only setups are best reserved for testing. Ensure that Python 3.10 or newer and Git are installed.
Step 2: Install GPU drivers and CUDA
Install the latest NVIDIA driver that matches your GPU. Install the CUDA toolkit and cuDNN that match your driver and PyTorch version. Follow NVIDIA’s platform guide for your OS to avoid version conflicts.
Step 3: Create a Python environment
Create a virtual environment so Granite 4.0 3B Vision and its dependencies stay separate from other projects. For example, create a venv directory and activate it, then install core Python packages:
pip install --upgrade pip
pip install vllm openai huggingface_hub pillow
vLLM is a high‑performance inference engine that supports image and LoRA features for Granite 4.0 3B Vision. The openai package provides a simple client for the HTTP API.
Step 4: Download the model from Hugging Face
Granite 4.0 3B Vision is hosted on Hugging Face under the ID ibm-granite/granite-4.0-3b-vision. You can let vLLM download weights on first run, so no manual clone is required. If you prefer a full local copy, run:
git lfs install
git clone https://huggingface.co/ibm-granite/granite-4.0-3b-vision
The repository includes example images and a vLLM integration script.
Step 5: Start the vLLM Granite 4.0 Vision server
The Hugging Face repo includes a start_granite4_vision_server.py script that wires Granite 4.0 3B Vision into vLLM. This script exposes an OpenAI‑style endpoint on a configurable host and port. A typical start command looks like this:
python start_granite4_vision_server.py \
  --model ibm-granite/granite-4.0-3b-vision \
  --trust_remote_code --host 0.0.0.0 --port 8000 \
  --hf-overrides '{"adapter_path": "ibm-granite/granite-4.0-3b-vision"}'
This setup lets vLLM load the Granite 4.0 Micro base and apply the vision LoRA per request. Text‑only prompts use the base model, while image prompts trigger the vision path.
Step 6: Optional GGUF and Ollama setup
IBM also maintains GGUF‑encoded variants of Granite models and conversion scripts for llama.cpp and Ollama.
GGUF is a file format for quantized models that fit on smaller GPUs or CPUs. For Granite 4.0 3B Vision, the IBM GGUF repository includes configuration entries and scripts for local testing with an Ollama server on macOS.
To use this route, convert the model to GGUF using the IBM tools and load it in an Ollama configuration file. Expect some feature gaps compared to the full vLLM path, especially around advanced multimodal features.
How to Run or Use It
Basic request pattern
Granite 4.0 3B Vision exposes a chat completion style interface. Each request includes a list of messages with roles like user and assistant.
The user message contains two parts: an image and a short text string with a control tag. The image is passed as a URL or as base64 data, and the text holds a tag such as <chart2csv> or <tables_json>.
With the OpenAI Python client, a request has this structure:
- Set base_url to http://localhost:8000/v1 and use an arbitrary API key.
- Load an image from disk and encode it as base64.
- Build a message where content is a list with an image_url item and a text item that holds the tag.
- Call client.chat.completions.create() with model="ibm-granite/granite-4.0-3b-vision" and read the text from the first choice.
The Hugging Face model card provides a complete code example for chart and table tasks.
Running chart extraction
For chart extraction, Granite 4.0 3B Vision supports three main tags.
- <chart2csv>: The model reads the chart and outputs a CSV text table with headers and numeric values.
- <chart2summary>: The model returns a short, structured description of the main trends and values in the chart.
- <chart2code>: The model returns Python plotting code that recreates the chart, often with libraries like Matplotlib.
You can reuse the same chart image and change only the tag to get different outputs. This approach keeps prompts simple and removes the need to design complex natural language instructions for each chart.
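Because only the tag changes, all three chart views can be collected in one loop. In this sketch, `send` is a stand-in for whatever request helper you use against the local server:

```python
# Control tags for the three chart tasks; the `send(image_path, tag)` callable
# is illustrative and stands in for any function that posts one image plus one
# tag to the local Granite endpoint.
CHART_TAGS = {
    "csv": "<chart2csv>",
    "summary": "<chart2summary>",
    "code": "<chart2code>",
}


def extract_chart_views(image_path, send):
    """Run all three chart tags against the same image and collect the outputs."""
    return {name: send(image_path, tag) for name, tag in CHART_TAGS.items()}
```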
Running table extraction
For tables, Granite 4.0 3B Vision supports three structured formats via tags.
- <tables_json>: Returns a JSON structure with table dimensions, cell content, and merge information.
- <tables_html>: Returns HTML <table> markup that standard tools can render and parse.
- <tables_otsl>: Returns OTSL, an intermediate markup language that captures spans, merges, and structure for further processing.
To use these tags, send a page image that contains one or more tables. The model segments the tables and fills the chosen format. Output can feed into pipelines that load JSON into databases or HTML into downstream parsers.
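As a sketch of that downstream step, a JSON table can be rebuilt into a plain row grid. The schema used here (`tables`, `num_rows`, `num_cols`, `cells`) is an assumed shape for illustration; adjust the key names to match the actual <tables_json> output of your model version:

```python
import json


def table_json_to_rows(raw: str):
    """Parse a hypothetical <tables_json> reply into a list of row grids.

    The schema is illustrative: one top-level "tables" list whose entries
    carry dimensions plus a flat "cells" list with row/col coordinates.
    """
    data = json.loads(raw)
    grids = []
    for table in data.get("tables", []):
        grid = [["" for _ in range(table["num_cols"])]
                for _ in range(table["num_rows"])]
        for cell in table["cells"]:
            grid[cell["row"]][cell["col"]] = cell["text"]
        grids.append(grid)
    return grids
```

Each grid can then be handed to `pandas.DataFrame` or written straight to a database.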
Running key‑value pair extraction
Semantic key‑value extraction reads forms and pulls out fields such as name, date, and total amount. Instead of matching exact labels, the model uses descriptions from the prompt.
A typical prompt includes a short schema description like “Extract the following fields: invoice_number, issue_date, supplier_name, net_amount, tax_amount, total_amount. Return JSON.”
The model then scans the form image and returns a JSON object with keys and values.
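Model replies sometimes wrap the JSON object in prose, so a tolerant parser on the client side helps. This is a minimal sketch of such a parser, not part of the model's API:

```python
import json
import re


def parse_kv_response(text: str, required: list) -> dict:
    """Pull the first JSON object out of a model reply and check required keys.

    Searches for the outermost braces rather than calling json.loads on the
    raw reply, since the object may be surrounded by explanatory text.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))
    missing = [k for k in required if k not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data
```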
On the VAREX benchmark, which uses more than 1,700 US government forms, Granite 4.0 3B Vision reaches 85.5% exact‑match accuracy in zero‑shot mode. Exact‑match means that all key‑value pairs for a form must match ground truth to count as correct.
Handling multiple pages and text prompts
Granite 4.0 3B Vision focuses on single images per request, but it fits into larger pipelines with multiple pages.
For multi‑page PDFs, split pages into images and send them one by one with tags that match each task. Use a separate text‑only model, such as a Granite text model, for long‑form analysis or summarization of extracted content.
The vLLM integration allows both text‑only and image requests on one endpoint. That keeps deployment simple for applications that mix RAG, question answering, and structured extraction.
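A page-by-page driver for that pattern can stay very small. In this sketch, `send` again stands in for your request helper, and the page images are assumed to come from an upstream PDF-to-image step:

```python
def extract_document(page_images, tasks, send):
    """Run each extraction tag over every page image, one request at a time.

    `page_images` is a list of image file paths (one per PDF page), `tasks`
    maps result names to control tags, and `send(path, tag)` is any callable
    that posts one image plus one tag to the local server.
    """
    results = []
    for page_num, path in enumerate(page_images, start=1):
        page_result = {"page": page_num}
        for name, tag in tasks.items():
            page_result[name] = send(path, tag)
        results.append(page_result)
    return results
```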
Benchmark Results
The table below summarizes key Granite 4.0 3B Vision benchmark scores on chart, table, and form extraction tasks.

| Benchmark | Task | Score |
| --- | --- | --- |
| ChartNet (human-verified) | Chart2Summary | 86.4% |
| ChartNet (human-verified) | Chart2CSV | 62.1% |
| PubTablesV2, cropped tables | Table extraction (TEDS) | 92.1 |
| PubTablesV2, full pages | Table extraction (TEDS) | 79.3 |
| OmniDocBench | Table extraction (TEDS) | 64.0 |
| TableVQA | Table question answering | 88.1 |
| VAREX | Key-value extraction (exact match, zero-shot) | 85.5% |
These scores show that Granite 4.0 3B Vision performs near the top of its parameter class. It often matches or outperforms larger vision models on chart, table, and form extraction tasks.
Testing Details
ChartNet chart understanding
IBM created ChartNet, a dataset with about 1.7 million synthetic and real chart samples. Each sample includes plotting code, a rendered chart image, the underlying data table, a natural‑language chart summary, and question–answer pairs.
Granite 4.0 3B Vision was evaluated on a human‑verified ChartNet benchmark that tests Chart2Summary and Chart2CSV quality. An LLM‑as‑a‑judge procedure scores outputs on correctness and faithfulness to the chart.
On this benchmark, Granite 4.0 3B Vision achieves 86.4% for Chart2Summary and 62.1% for Chart2CSV, placing first and second among all tested models on each task.
Unified table extraction suite
For tables, IBM built a unified evaluation suite spanning PubTablesV2, OmniDocBench, and TableVQA. PubTablesV2 measures table structure and content reconstruction with the TEDS metric.
OmniDocBench includes complex layouts with multiple tables and surrounding content. TableVQA tests whether models can answer questions about tables after extracting them.
Granite 4.0 3B Vision outputs HTML tables, which are then scored by TEDS. The model leads across this suite, with 92.1 on cropped PubTablesV2 tables, 79.3 on full‑page PubTablesV2 tables, 64.0 on OmniDocBench, and 88.1 on TableVQA.
These results indicate strong performance both on clean table crops and on full pages with surrounding text and figures.
VAREX key‑value extraction
The VAREX benchmark targets structured key‑value extraction from real US government forms. It includes 1,777 forms with diverse layouts, from simple flat structures to nested and tabular fields. Models must output key‑value pairs for a specified schema and are scored by exact‑match accuracy.
Granite 4.0 3B Vision obtains 85.5% exact‑match accuracy in a zero‑shot setting. Zero‑shot here means the model was not fine‑tuned on VAREX forms.
The score places it third among 2–4B parameter models as of March 2026 and competitive with larger models.
Relation to earlier Granite Vision models
Earlier Granite Vision models, such as Granite Vision 3.2 2B and Granite Vision 3.3 2B, already reached strong scores on document benchmarks like DocVQA and ChartQA.
For example, granite‑vision‑3.3‑2b reports ChartQA scores around 0.87 and DocVQA scores up to 0.91. Granite 4.0 3B Vision extends this focus on charts and tables with a modern DeepStack variant, ChartNet data, and stronger KVP extraction.
Comparison Table
The table below compares Granite 4.0 3B Vision with related and competing vision models.
Pricing Table
Granite 4.0 3B Vision itself is an open‑source model, but many teams mix local and hosted options. The table below summarizes realistic pricing paths based on current IBM watsonx.ai information and related Granite Vision models.
Prices may change, and regional differences apply, so always check the latest IBM pricing page before planning costs.
USP — What Makes It Different
Granite 4.0 3B Vision stands out through its narrow but deep focus on document structure rather than broad natural image tasks. The ChartNet dataset and DeepStack architecture together push chart summarization and chart-to-CSV extraction to the top of current benchmarks for models of similar or larger size. Its table extraction results on PubTablesV2, OmniDocBench, and TableVQA show that a 3B model can rival or beat much larger VLMs in this niche.
At the same time, the Apache 2.0 license and vLLM integration make it practical for local deployments where data cannot leave internal networks.
Pros and Cons
Pros
- Strong results on chart summarization and chart‑to‑CSV tasks for its size.
- Leading table extraction scores on PubTablesV2, OmniDocBench, and TableVQA.
- High exact‑match accuracy on VAREX for form key‑value extraction.
- Apache 2.0 license supports commercial local use with no model licensing fees.
- vLLM integration with an OpenAI‑style API reduces application changes.
- 3B parameter scale fits mid‑range GPUs and some laptop‑class GPUs.
Cons
- Focus on document images means weaker support for general natural image tasks than some larger VLMs.
- CPU‑only setups run, but speed often does not suit production pipelines.
- Training and benchmarks focus on English, so other languages may see lower accuracy.
- Hosted pricing for Granite Vision models exists only for earlier 3.x versions so far, which can complicate planning for 4.0.
Quick Comparison Chart
The chart below maps common needs to suitable options.
Demo or Real‑World Example
Use case: Local extraction from a financial report page
This example walks through a simple workflow: extract a chart, a table, and key fields from a page of a financial report using Granite 4.0 3B Vision running on vLLM.
A. Prepare the input image
- Export a single page from a PDF report that contains a revenue trend chart, a quarterly revenue table, and a small summary box with key figures.
- Save this page as a high-resolution PNG file, for example report_q4_2025.png.
B. Start the Granite 4.0 3B Vision server
- Activate your Python environment and start the vLLM server with the start_granite4_vision_server.py script.
- Confirm that the server listens on http://localhost:8000/v1.
C. Extract the chart as CSV
- Write a short Python script that loads report_q4_2025.png, encodes it as base64, and sends a chat completion request with the tag <chart2csv>.
- The response body contains CSV text with headers like Quarter,Revenue and numeric values that match the chart.
D. Extract the chart summary
- Reuse the same image but change the tag to <chart2summary>.
- The model returns a paragraph that explains growth or decline across quarters, with explicit numbers where present in the chart.
E. Extract the table structure as JSON
- Send another request with the same image and the tag <tables_json>.
- The response includes a JSON object that lists table rows, columns, and cell content, which you can store directly in a database or transform into a DataFrame.
F. Extract key‑value pairs for a summary box
- Craft a prompt such as “Extract the following fields: total_revenue, operating_income, net_income, earnings_per_share. Return JSON.” and send it with the page image.
- The model returns a JSON object with the requested keys and their values pulled from the summary box or nearby text.
G. Integrate into an ETL pipeline
- Wrap these steps into a small service that accepts PDFs, converts pages to images, calls Granite 4.0 3B Vision for each task, and writes CSV, JSON, and summary text into your data warehouse or lake.
- Over time, this pipeline can replace manual data entry for recurring financial reports while keeping all processing on internal hardware.
This workflow mirrors how Granite 4.0 3B Vision was designed to operate: as a local, structured extraction engine for enterprise documents.
Conclusion
Granite 4.0 3B Vision offers a focused answer to chart, table, and form extraction needs in enterprise documents. The open Apache 2.0 license and vLLM integration give teams a clear path to private, on‑prem deployments.
For organizations that handle many complex PDFs and scanned forms, Granite 4.0 3B Vision is a strong candidate for the extraction layer in a broader document AI stack.
FAQ
1. Does Granite 4.0 3B Vision support languages other than English?
Training and documentation focus on English, and most benchmarks use English documents. It may work on other languages, but accuracy can drop, so careful evaluation is important.
2. Can Granite 4.0 3B Vision replace OCR engines?
The model includes strong OCR‑related abilities on tables, charts, and forms, similar to earlier Granite Vision models that score well on OCRBench. For simple text‑only scans, a classic OCR tool may still be faster, but Granite 4.0 3B Vision shines when layout and structure matter.
3. What GPU do I need to run it locally?
Community reports and third‑party tests suggest that GPUs with 8–12 GB VRAM, such as RTX 3060‑class cards, can run Granite 4.0 3B Vision at practical speeds. Larger GPUs improve throughput and batch size, especially in production.
4. How does it compare with larger proprietary models like GPT‑4V or Gemini?
Research on Granite Vision models shows that small Granite Vision variants can approach or match much larger proprietary models on document tasks like DocVQA and ChartQA. Granite 4.0 3B Vision extends this trend on chart and table extraction, but large proprietary models still lead on broad general‑purpose multimodal reasoning.
5. Is there a managed cloud option for Granite 4.0 3B Vision today?
Granite 4.0 3B Vision is currently available first as an open model on Hugging Face. For teams that want cloud hosting, using Granite Vision 3.x on watsonx is the closest managed alternative until Granite 4.0 appears as a hosted option.