Run Schematron-3B Guide (2026): Local Web Scraping AI
Learn how to install, run, benchmark, and compare Schematron-3B, a 3B local AI model for HTML‑to‑JSON web scraping. Includes setup steps, code demo, benchmarks, pricing, and competitor comparison.
1. What Is Schematron‑3B?
Schematron‑3B is a 3‑billion‑parameter language model specialized for turning messy HTML pages into clean, strictly schema‑valid JSON.
Instead of trying to chat, translate, code, and scrape all at once (like general LLMs), it focuses on one thing: reliable web data extraction.
Key ideas:
- Schema‑first extraction: you provide a JSON Schema (the structure of the data you want), plus HTML. The model outputs JSON that always matches the schema.
- Long context (up to 128K tokens): it can handle very long, noisy pages without truncation, such as product lists, full article pages, or documentation.
- Small but specialized: at 3B parameters, it runs locally on ordinary hardware but reaches extraction quality close to much larger models.
Benchmarks show:
- Schematron‑3B scores 4.41/5 on structured extraction quality. The 8B version scores 4.64/5, only 0.1 behind GPT‑4.1 (4.74) on the same task.
- When used as an HTML‑to‑JSON extractor in a web‑augmented QA pipeline, accuracy jumps from 8.54% (using only GPT‑5 Nano) to 82.87% when Schematron handles structured extraction.
This makes Schematron‑3B a strong fit if the main goal is:
- Scraping e‑commerce sites for prices and product details
- Aggregating articles, listings, or reviews into a database
- Feeding clean structured data into downstream LLMs or analytics systems
2. Core Features of Schematron‑3B
2.1 Schema‑First, 100% JSON‑Conformant Outputs
The model is trained to always obey a given JSON Schema. You describe the fields and types you want (strings, numbers, arrays, nested objects), then feed that schema plus HTML.
Benefits:
- No extra “cleanup” step to remove explanation text or hallucinated fields
- Safer to plug directly into downstream systems that expect strict JSON
- Easier validation: standard JSON Schema validators can be used to check outputs
According to the model card, the output is strict JSON that conforms to your schema, with no conversational fluff. This is reinforced in community demos which show the model returning just the data fields requested.
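The validation claim above can be sketched with a minimal stdlib‑only checker. This is an illustrative helper (the name `conforms` and its simplified semantics are assumptions, not part of any library); a real pipeline would use a full validator such as the `jsonschema` package:

```python
import json

def conforms(instance, schema: dict) -> bool:
    """Minimal sketch of JSON Schema checking: required fields present and
    primitive types matching. Use a real validator (e.g. `jsonschema`) in production."""
    type_map = {
        "string": str, "number": (int, float), "integer": int,
        "boolean": bool, "array": list, "object": dict,
    }
    if not isinstance(instance, dict):
        return False
    # Every required field must be present.
    for field in schema.get("required", []):
        if field not in instance:
            return False
    # Every declared field that is present must have the declared type.
    for field, spec in schema.get("properties", {}).items():
        if field in instance:
            expected = type_map.get(spec.get("type"))
            if expected and not isinstance(instance[field], expected):
                return False
    return True

product_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "price": {"type": "number"}},
    "required": ["name", "price"],
}

print(conforms({"name": "Widget", "price": 9.99}, product_schema))  # True
print(conforms({"name": "Widget"}, product_schema))                 # False: price missing
```

Because the model's output is already strict JSON, a check like this (or a full validator) is the only post‑processing step you need before handing data to downstream systems.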
2.2 Long‑Context HTML Handling (Up to 128K Tokens)
Schematron models support context windows up to 128K tokens.
This matters because:
- Real‑world HTML often includes huge navigation menus, tracking scripts, and long product lists.
- General LLM APIs can struggle or become expensive when fed raw 100K+ token HTML.
The model is trained with curriculum strategies specifically to remain accurate at long contexts, and benchmarks confirm that it maintains quality even at those lengths.
2.3 Cost‑Performance “King” for Web Scraping
The 3B variant is described as the “cost‑performance king” in the Schematron family: it delivers nearly the same extraction quality as 8B, at about half the inference cost.
From internal benchmarks and public discussion:
- Schematron‑3B: recommended default for most scraping and ingestion work
- Schematron‑8B: marginally higher quality for edge cases, but ~2× cost per request
A Reddit engineering report shows that processing 1 million pages per day with a frontier model (GPT‑5) would cost roughly 20,000 USD, while using Schematron‑8B brings that down to about 480 USD, and Schematron‑3B to around 240 USD for the same workload. That is roughly 40–80× cheaper than frontier APIs for this specific task.
2.4 Local‑Friendly Deployment
Schematron‑3B is available as:
- An open‑source model on Hugging Face (`inference-net/Schematron-3B`)
- A backend for MCP (Model Context Protocol) servers using MLX on Apple Silicon
- A model that can run in popular local LLM GUIs such as LM Studio (demonstrated for the 8B version at around 8 GB RAM on a Mac)
General VRAM guidance suggests that 3–4B models run comfortably on entry‑level GPUs (3–4 GB VRAM) at moderate context windows, with CPU‑only setups also possible at lower speed. Given the 8B variant is reported to run locally on a Mac with about 8 GB of RAM, the 3B variant is even more accessible for local setups.
3. Quick Comparison Chart
A short, high‑level comparison of Schematron‑3B vs other options for HTML‑to‑JSON extraction:
Note: API models can also scrape HTML, but they are not schema‑first, and cost can be much higher at scraping scale.
4. How to Install and Run Schematron‑3B Locally
This section focuses on a practical, step‑by‑step setup for local use.
4.1 Option 1 – Quick Test via Hugging Face / Hosted API
If you only want to test the model before installing it:
- The model is hosted as `inference-net/Schematron-3B` on Hugging Face.
- Some third‑party sites list it with “free API” or quick‑start links for testing.
- The creators also offer a serverless API with $10 free credits, so you can try it without infrastructure.
This is useful to prototype schemas and prompts before committing to a full local deployment.
4.2 Option 2 – Python + Hugging Face Transformers (Local)
Prerequisites:
- Python 3.10+
- `torch` installed (with CUDA or Metal/MPS if you want GPU acceleration)
- `transformers` and optional quantization libraries
1. Create a virtual environment
```bash
python -m venv schematron-env
source schematron-env/bin/activate  # On Windows: schematron-env\Scripts\activate
```
2. Install dependencies
```bash
pip install "torch" "transformers" "accelerate" "sentencepiece" "lxml"
# Optional for 4-bit quantization:
pip install "bitsandbytes"
```
3. Download the model
Using transformers in code automatically pulls from Hugging Face:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inference-net/Schematron-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
)
```
The model card confirms it accepts HTML plus JSON Schema and outputs strict JSON.
5. Demo: Extracting Product Data from HTML
Below is a simplified example inspired by the official demos and video walkthroughs.
5.1 Define Your JSON Schema
The schema tells the model exactly what fields to extract.
```python
import json

product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["name", "price"],
}
```
5.2 Clean the HTML (Optional but Recommended)
The YouTube demo uses lxml to remove scripts and styles before sending HTML to the model.
```python
from lxml import html, etree

def clean_html(raw_html: str) -> str:
    doc = html.fromstring(raw_html)
    etree.strip_elements(doc, "script", "style", "noscript", with_tail=False)
    return etree.tostring(doc, encoding="unicode")
```
5.3 Build the Prompt and Run Inference
```python
from transformers import pipeline

def build_prompt(schema: dict, html_text: str) -> str:
    return f"""
You are an HTML-to-JSON extraction model.
- Input: HTML of a web page.
- Goal: Return ONLY valid JSON that strictly conforms to this JSON Schema:
{json.dumps(schema, indent=2)}
HTML:
<document>
{html_text}
</document>
Return only JSON. No explanation.""".strip()

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=False,  # Deterministic output, as recommended in the demos
)

raw_html = open("sample_product_page.html").read()
cleaned_html = clean_html(raw_html)
prompt = build_prompt(product_schema, cleaned_html)
output = generator(prompt)[0]["generated_text"]

# Extract the JSON segment (if needed)
start = output.find("{")
end = output.rfind("}")
json_str = output[start:end + 1]

product = json.loads(json_str)
print(product)
```
In real demos, this approach extracts field values like product name and price from arbitrary product pages and returns valid JSON that exactly matches the schema.
6. Installing Schematron‑3B via MCP
If you use an AI IDE or agent framework that supports MCP (Model Context Protocol), a dedicated Schematron MCP server exists.
According to the MCP project:
- The server runs Schematron‑3B locally using MLX (Apple Silicon‑optimized inference).
- It exposes a tool for “HTML‑to‑JSON extraction” so agents can call it with HTML + schema and get JSON back.
- The model path can be configured via the `SCHEMATRON_MODEL_PATH` environment variable for custom installations.
This is ideal if:
- You want your main LLM (e.g., GPT‑4.1) to call Schematron as a tool for scraping.
- You prefer a GUI or agent interface instead of writing raw Python scripts.
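As an illustration, an MCP client configuration for such a server might look like the following. The server command and package name here are placeholders, not documented values — consult the Schematron MCP project’s README for the real ones; only the `SCHEMATRON_MODEL_PATH` variable is mentioned above:

```json
{
  "mcpServers": {
    "schematron": {
      "command": "uvx",
      "args": ["schematron-mcp-server"],
      "env": {
        "SCHEMATRON_MODEL_PATH": "/path/to/models/Schematron-3B"
      }
    }
  }
}
```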
7. Hardware Requirements and Performance
7.1 RAM and VRAM Considerations
Key data points from public sources:
- 8B Schematron has been demonstrated running locally on Mac via LM Studio using around 8 GB of RAM, which is in the same range as a heavy browser session.
- General VRAM guidance shows that 3–4B parameter models work on entry‑level GPUs with 3–4 GB VRAM at moderate context windows (e.g., 4K tokens), while longer contexts demand more memory.
Implications for Schematron‑3B:
- For small to medium pages (few thousand tokens), a mid‑range GPU or even CPU‑only setup is realistic.
- For very long pages or heavy batch jobs at 128K context, a stronger GPU or more system RAM will help.
- Quantized variants (e.g., 4‑bit) can reduce memory significantly while keeping most of the accuracy, similar to other 3B models in the ecosystem.
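As a back‑of‑the‑envelope check (a rule‑of‑thumb sketch, not vendor guidance): weight memory is roughly parameter count times bytes per weight, plus some overhead for activations and KV cache at moderate context lengths:

```python
def est_memory_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Rough weight-memory estimate: params * bytes/weight, plus ~20%
    overhead for activations and KV cache at moderate context lengths."""
    bytes_total = params_billion * 1e9 * (bits_per_weight / 8)
    return round(bytes_total * overhead / 1e9, 1)

print(est_memory_gb(3, 16))  # fp16: ~7.2 GB
print(est_memory_gb(3, 4))   # 4-bit quantized: ~1.8 GB
```

These numbers are consistent with the guidance above: a 4‑bit quantized 3B model fits comfortably on an entry‑level GPU, while fp16 wants a mid‑range card or unified memory on Apple Silicon. Long 128K contexts add KV‑cache memory on top of this estimate.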
7.2 Speed vs. Frontier Models
The Reddit benchmark describes:
- Average 0.54 seconds per page for Schematron models vs about 6 seconds per page for GPT‑5 on the same workload, roughly 10× faster per page.
- In addition, because Schematron outputs compact JSON instead of long raw HTML, downstream LLMs process far fewer tokens, saving more time and money.
8. Benchmarks and How to Test Schematron‑3B Yourself
8.1 Published Benchmarks
From model card and blog analysis:
- Extraction quality (0–5 rating, LLM‑as‑judge):
- Schematron‑3B: 4.41
- Schematron‑8B: 4.64
- GPT‑4.1: 4.74 (reference frontier model)
Web‑augmented QA (SimpleQA pipeline):
| Setup | Accuracy |
|---|---|
| GPT‑5 Nano alone (no Schematron) | 8.54% |
| GPT‑5 Nano + Schematron‑8B extraction | 82.87% |
| GPT‑5 Nano + Schematron + SERP provider | 64.2% |
| GPT‑5 Nano + Schematron + Exa provider | 82.9% |
| Gemini 2.5 Flash baseline | 80.61% |
| GPT‑4.1 + Schematron‑8B | 85.58% |
This shows that:
- Structured JSON extraction is the key factor in accuracy (jump from 8.54% to 82.87%).
- Specialized extraction models like Schematron can outperform much larger general LLMs (e.g., Schematron‑8B vs Gemini 2.5 Flash on this task).
8.2 How to Create Your Own Benchmark
To test Schematron‑3B in your environment:
- Build a small dataset
- 50–200 HTML pages from real target websites (e‑commerce, blogs, job boards).
- For each page, manually label ground‑truth JSON with the fields you care about.
- Define JSON Schemas
- One schema per page type (e.g., `Product`, `Article`, `JobListing`).
- Use simple types first: strings, numbers, booleans, arrays.
- Write a test harness
- For each page: clean HTML → build prompt → call Schematron‑3B → parse JSON.
- Validate outputs using a JSON Schema validator.
- Track metrics
- Schema adherence rate: % of outputs that pass JSON Schema validation (expect near 100% with proper prompting).
- Field‑level accuracy: precision/recall or simple match rate for each field.
- Latency: time per page, with and without batching.
- Cost: if using API, track tokens or request cost.
- Compare to alternatives
- Run the same benchmark using a general LLM like GPT‑4.1, prompting it to “output JSON only”.
- Count how often it breaks schema, adds extra text, or misses fields.
This method gives a realistic view of how Schematron‑3B performs on your HTML and schema design.
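A minimal scorer for the field‑level accuracy metric in step 4 might look like this (the function name and the exact‑match definition are illustrative assumptions; real harnesses often add fuzzy matching for prices and dates):

```python
def field_accuracy(predicted: dict, truth: dict) -> dict:
    """Score one page: a field counts as correct only when the predicted
    value exactly matches the labeled ground truth."""
    correct = sum(1 for k, v in truth.items() if predicted.get(k) == v)
    return {
        "match_rate": correct / len(truth) if truth else 1.0,
        "missing": [k for k in truth if k not in predicted],
        "extra": [k for k in predicted if k not in truth],
    }

truth = {"name": "Widget", "price": 9.99, "currency": "USD"}
pred = {"name": "Widget", "price": 9.99, "in_stock": True}
print(field_accuracy(pred, truth))
# match_rate ~0.67, missing: ['currency'], extra: ['in_stock']
```

Averaging `match_rate` over the whole labeled set, alongside schema adherence and latency, gives a single comparable number per model and per schema design.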
9. How Schematron‑3B Differs from Competitors
9.1 Versus General LLMs (GPT‑4.x, GPT‑5, Gemini, etc.)
General LLMs:
- Can parse HTML but are not specialized for it.
- Prone to adding explanations, comments, or partial JSON that breaks downstream tools.
- Have higher per‑token costs, especially at long context windows.
Schematron‑3B:
- Trained explicitly on HTML‑to‑JSON with schema adherence as a hard constraint.
- Optimized prompts and training ensure only JSON output, with no narrative text.
- Much cheaper and faster at scale, up to 40–80× lower cost for large scraping workloads.
9.2 Versus Traditional Scraping Tools (BeautifulSoup, XPath, Regex)
Traditional tools like BeautifulSoup or XPath:
- Require brittle, site‑specific selectors. Small HTML changes can break them.
- Need manual maintenance for each site and layout.
- Struggle with semantic understanding (e.g., guessing which span is “current price”).
Schematron‑3B:
- Uses language understanding to adapt to different HTML structures and naming patterns.
- You describe what you want (via schema), not how to locate it in DOM nodes.
- Handles messy or inconsistent markup across multiple sites more robustly.
Traditional tools remain useful for pre‑cleaning or handling highly structured sites, but Schematron is more resilient for messy, modern web pages.
9.3 Unique Selling Points (USP) of Schematron‑3B
- Schema‑First, Reliable JSON: Guarantees schema‑conformant JSON, which is rare among LLM‑based scrapers.
- Long‑Context HTML Support: Robust handling of up to 128K tokens allows entire pages or multi‑page content to be processed in one go.
- Cost‑Performance Leadership: Delivers near‑frontier quality at a fraction of the cost and resource usage, especially at scraping scale.
- Local‑First, Privacy‑Friendly: Runs fully on local hardware (CPU, GPU, or Apple Silicon/MLX), keeping raw HTML and extracted data inside your infrastructure.
- Ecosystem Integration: Available on Hugging Face, usable via MCP, and integrated into multiple toolchains for web‑augmented LLM systems.
10. Pricing Overview
Schematron‑3B itself is distributed as an open‑source model; the “price” depends on how you use it:
10.1 Self‑Hosted / Local
- Model cost: free to download (standard open‑source distribution on Hugging Face).
- Infrastructure cost: your hardware, electricity, and any cloud VMs.
- Best for: high‑volume scraping where API costs would be large, and for strict data privacy.
Given benchmark data, a self‑hosted Schematron deployment can reduce scraping costs dramatically compared with calling frontier APIs repeatedly.
10.2 Serverless API from the Creators
The creators offer:
- A serverless API with $10 free credits to start.
- A pricing structure that, according to their benchmarks, allows processing 1M pages daily for around 240–480 USD with Schematron models, compared to ~20,000 USD with GPT‑5 on the same workload.
This keeps operational overhead low while still taking advantage of the specialized model.
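To put those figures in per‑page terms, here is the simple arithmetic on the numbers quoted above:

```python
def cost_per_page(usd_per_day: float, pages_per_day: int = 1_000_000) -> float:
    """Per-page cost for a 1M-pages/day workload."""
    return usd_per_day / pages_per_day

# Daily cost figures quoted in the benchmarks above:
for label, daily_usd in [("GPT-5", 20_000), ("Schematron-8B", 480), ("Schematron-3B", 240)]:
    print(f"{label}: ${cost_per_page(daily_usd):.5f} per page")

print(round(20_000 / 240))  # frontier-to-3B cost ratio, ~83x
```

That is $0.02 per page with a frontier model versus roughly $0.0002–0.0005 per page with Schematron, which is where the 40–80× figure comes from.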
11. Practical Testing Scenarios
To get a realistic feel for Schematron‑3B, consider testing it with these practical scenarios:
11.1 E‑Commerce Price Monitoring
- Goal: Extract product name, brand, price, currency, availability, rating, and URL.
- Steps:
- Crawl product pages from several competitor sites.
- Write a generic `Product` JSON Schema that works across sites.
- Run Schematron‑3B extraction and compare to manual ground truth.
- Track schema adherence, price accuracy, and extraction latency.
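A starting point for the generic `Product` schema in this scenario might look like the following (field names are illustrative choices; adapt them to your target sites):

```python
# Generic cross-site product schema: only the fields that almost every
# product page exposes are required; the rest are optional.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "brand": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "availability": {"type": "string"},
        "rating": {"type": "number"},
        "url": {"type": "string"},
    },
    "required": ["name", "price", "url"],
}
```

Keeping the `required` list short makes one schema reusable across sites that expose different subsets of fields.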
11.2 News or Blog Aggregation
- Goal: Turn arbitrary article pages into structured records: title, author, date, tags, summary.
- Steps:
- Use a long context window for pages with heavy navigation and comments.
- Provide a schema including a `summary` field to compress main ideas.
- Compare outputs to a general LLM’s extraction for accuracy and JSON cleanliness.
11.3 Knowledge Base Ingestion
- Goal: Build a structured knowledge base from documentation pages.
- Steps:
- Design a schema for `Section`, `CodeExample`, `FAQItem`, etc.
- Feed full documentation pages to Schematron‑3B.
- Use the resulting JSON as input for search, RAG, or analytics.
Each scenario helps measure not only extraction quality but also long‑term maintainability. Because the model understands semantics, changes in HTML layout often require zero or minimal maintenance compared to XPath rules.
12. Best Practices for Using Schematron‑3B
- Invest time in schema design: Clear and minimal schemas usually give the most stable results. Avoid unnecessary nested fields at the beginning.
- Keep temperature low (0–0.2): Official demos set temperature to zero to ensure deterministic and repeatable JSON outputs.
- Always validate outputs: Run JSON Schema validation as a post‑step. This both catches errors and provides feedback for prompt/schema tweaks.
- Pre‑clean HTML: Remove scripts, styles, and irrelevant elements using `lxml` or another parser. Benchmarks and demos show this makes extraction more reliable.
- Monitor for drift: Websites change. Periodically re‑sample pages and re‑evaluate accuracy so you can adjust schemas or prompts if needed.
13. When to Choose Schematron‑3B vs Schematron‑8B
Choose Schematron‑3B when:
- Running on modest hardware or local machines.
- Doing standard scraping: products, articles, listings, with relatively predictable structure.
- Optimizing for cost and throughput.
Consider Schematron‑8B when:
- You see frequent edge‑case failures with 3B on complex or very noisy pages.
- You can afford slightly higher compute costs for a small quality lift.
- You need the strongest possible extraction in a mission‑critical pipeline.
The model authors themselves recommend Schematron‑3B as the default and only switching to 8B for special cases.
14. FAQ
Q1. Do I need a powerful GPU to run Schematron‑3B locally?
Not necessarily. Community guidance shows 3–4B models running on entry‑level GPUs and even CPUs at smaller context windows, though a GPU improves speed.
Q2. Can Schematron‑3B scrape any website out of the box?
It can parse any HTML, but you must define a JSON Schema and prompt for your use case. Different sites may need slightly different schemas.
Q3. How is Schematron‑3B better than using GPT‑4 or GPT‑5 for scraping?
It is specialized for HTML‑to‑JSON, produces schema‑conformant JSON, and can be 40–80× cheaper at large scale than frontier APIs while keeping similar extraction quality.
Q4. Is the model safe to use with sensitive data?
When run locally or on your own servers, raw HTML and JSON never leave your infrastructure, which is better for privacy than remote APIs.
Q5. Can I combine Schematron‑3B with other LLMs?
Yes. A common pattern is: search → HTML pages → Schematron‑3B → structured JSON → a general LLM like GPT‑4.1 for reasoning or answer synthesis, greatly boosting accuracy.
15. Summary
Schematron‑3B is a specialized local AI model designed to convert messy, real‑world HTML into clean, schema‑valid JSON. It combines:
- Schema‑first training with 100% JSON conformity goals
- Long‑context resilience up to 128K tokens
- Cost and speed advantages over frontier API models, especially at scale
- Flexibility to run locally via Python, MCP servers, or GUI tools, including on consumer‑grade hardware.
For teams serious about web scraping, ingestion, or building web‑grounded AI agents, Schematron‑3B provides a modern alternative to fragile XPaths and expensive general LLM calls. By carefully designing schemas, validating outputs, and benchmarking against your real pages, it is possible to build a robust, cost‑effective, and privacy‑friendly web data pipeline.