Run and Install TADA TTS Locally: A Free, Hallucination‑Free Speech TTS Model
Discover how to install, run, demo, benchmark and compare TADA, Hume AI’s new open‑source speech model with 1:1 text‑audio alignment, 5x faster TTS and zero content hallucinations—entirely on your local machine.
TADA is a brand‑new open‑source text‑to‑speech and speech‑language model that aligns every text token with exactly one acoustic vector, giving you fast, natural speech with zero content hallucinations, and you can run it entirely on your local machine.
There are already plenty of free TTS models, so in this article we'll look at what sets TADA apart from the rest.
In this guide, we’ll walk through what TADA is, why it “breaks the rules” of TTS, and how to install, run, benchmark, compare, demo and test it on your own hardware step by step.
What is TADA?
TADA (Text‑Acoustic Dual Alignment) is Hume AI’s open‑source speech‑language model that synchronizes text and audio in a single stream using a strict 1:1 mapping between text tokens and acoustic features.
Instead of generating dozens of audio frames for each word, TADA generates one rich acoustic vector per text token, which is later decoded into high‑fidelity speech.
Hume has released two main checkpoints: TADA‑1B, an English model based on Llama 3.2 1B, and TADA‑3B‑ml, a multilingual 3B‑parameter model covering English plus seven other languages. Both use the same TADA codec (HumeAI/tada-codec) and are published on Hugging Face under permissive open‑source licenses.
Why TADA “Breaks the Rules” of TTS
Most modern LLM‑based TTS systems discretize audio into fixed‑rate acoustic tokens—often 12.5 to 75 frames per second—while text is only 2–3 tokens per second.
This mismatch creates very long audio sequences, high memory use, latency problems, and frequent alignment failures like skipped or hallucinated words.
TADA solves this by synchronous tokenization: it compresses each variable‑length audio segment (for a word or subword) into a single continuous vector aligned exactly with one text token.
This gives three big practical wins for you as a user: much shorter sequences (2–3 “frames” per second of audio), greatly reduced inference cost, and an inductive bias that almost completely eliminates content hallucinations.
Key Features and USPs of TADA
Here are the main capabilities that make TADA stand out in today’s open‑source TTS ecosystem:
- 1:1 Text–Acoustic Alignment – Every text token has exactly one aligned acoustic vector, so the model cannot skip, insert, or re‑order content by construction.
- Very Low Hallucination Rate – On 1,000+ test samples from LibriTTSR, TADA produced zero hallucinated outputs when using a character‑error‑rate (CER) threshold of 0.15 to flag failures, while competing systems produced between 17 and 41 such samples.
- Fast Inference (RTF ≈ 0.09) – TADA‑1B achieves a real‑time factor (RTF) of about 0.09, over 5x faster than similar LLM‑based TTS models in the same benchmark.
- 10x Longer Context – With its compressed token stream, a 2,048‑token context window covers roughly 700 seconds of audio for TADA vs around 70 seconds for conventional fixed‑frame systems.
- Unified Speech‑Language Modeling – TADA can generate both text and speech in one transformer, acting as both a TTS model and a spoken language model (SLM), with a mechanism called Speech Free Guidance (SFG) to blend text‑only and speech‑conditioned logits.
- On‑Device Friendly – Thanks to the low frame rate and efficient architecture, TADA is lightweight enough to run on modern phones and edge devices, improving latency and privacy when you deploy it locally.
- Open Source & Permissive Licensing – The released checkpoints are open source with permissive licenses suitable for many commercial and research use cases, unlike some TTS models restricted to non‑commercial use.
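The context‑window arithmetic in the list above is easy to sanity‑check. The sketch below assumes ~3 aligned tokens per second for TADA and ~30 frames per second for a conventional fixed‑frame codec (both consistent with the numbers quoted in this article; the helper function is purely illustrative):

```python
def context_coverage_seconds(context_tokens: int, frames_per_second: float) -> float:
    """Seconds of audio a fixed context window covers at a given frame rate."""
    return context_tokens / frames_per_second

# TADA: ~3 aligned acoustic tokens per second of audio
tada_seconds = context_coverage_seconds(2048, 3)    # ~683 s, close to the "700 seconds" cited

# Conventional fixed-frame codec at ~30 frames per second
fixed_seconds = context_coverage_seconds(2048, 30)  # ~68 s, close to the "70 seconds" cited

print(f"TADA: {tada_seconds:.0f} s, fixed-frame: {fixed_seconds:.0f} s")
```

The roughly 10x gap between the two results is exactly the "10x longer context" claim restated as arithmetic.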
TADA Architecture
Under the hood, TADA has three major components plus the LLM backbone.
- Aligner – A Wav2Vec2‑style model learns to align audio frames with Llama text tokens using CTC and Viterbi decoding, producing a time index for each text token. This step figures out when each token is spoken in the waveform.
- Encoder & Codec (“TADA‑Codec”) – A transformer‑based encoder aggregates the variable‑length audio segment around each aligned position into a single latent vector; a paired decoder then reconstructs 24 kHz audio from these latent vectors at around 2–3 frames per second.
- LLM Backbone + Flow‑Matching Head – The Llama 3.2 backbone receives both text embeddings and acoustic embeddings (shifted in time), and a flow‑matching head generates the next acoustic token plus its duration, while the language head predicts the next text token.
Because each autoregressive step covers one full token of speech, TADA can apply streamable rejection sampling at the token level—for example, rejecting samples where the speaker embedding drifts too far from the prompt voice—without a huge cost.
This is a big part of why it maintains speaker identity and avoids catastrophic failures in long runs.
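The idea can be illustrated with a toy sketch: resample any candidate acoustic token whose speaker embedding drifts too far (by cosine similarity) from the prompt voice. Everything here (`propose_token`, `speaker_embedding`, the 0.8 threshold) is a hypothetical stand‑in for illustration, not TADA's actual API:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def sample_with_rejection(propose_token, speaker_embedding, prompt_embedding,
                          threshold=0.8, max_tries=5):
    """Resample a candidate acoustic token until its speaker embedding stays
    close enough to the prompt voice (or give up and keep the last candidate)."""
    candidate = None
    for _ in range(max_tries):
        candidate = propose_token()
        if cosine_similarity(speaker_embedding(candidate), prompt_embedding) >= threshold:
            break
    return candidate
```

Because each TADA step covers a whole token of speech, this check only runs 2–3 times per second of audio, which is why it stays cheap.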
Supported Languages and Use Cases
The initial releases cover:
- TADA‑1B (English) – Optimized for English tasks like narration, assistants, and content creation.
- TADA‑3B‑ml (Multilingual) – Adds Chinese, French, Italian, Japanese, Portuguese, Polish, and German alongside English.
Because TADA is a unified speech‑language model, it supports:
- Zero‑shot voice cloning from short reference audio (via the encoder and voice conditioning).
- Long‑form expressive narration (10+ minutes) with much lower risk of skipped or added content than typical systems.
- Dialogue and spoken conversation where you may want both generated speech and an aligned text transcript, which TADA provides “for free.”
Quick Performance Snapshot (TADA vs Other LLM‑TTS Models)
Here is a high‑level performance chart using numbers from TADA’s paper and benchmarks:
TADA is not always the very best on every perceptual score, but it is by far the fastest in this set and the only one that achieved zero hallucinations under the paper’s metric.
Installing TADA Locally
Let’s go step‑by‑step through installing and running TADA on your own machine.
1. Prerequisites
- Python environment (3.9+ recommended).
- GPU with CUDA: for real‑time or faster‑than‑real‑time TTS you realistically want a modern NVIDIA GPU; 10–12 GB VRAM is comfortable for TADA‑1B, and more for TADA‑3B‑ml (practical guidance, not from the paper).
- PyTorch + CUDA: install the right build from pytorch.org for your OS and CUDA version.
- FFmpeg (optional but recommended) for saving audio in different formats.
TADA uses PyTorch, Torchaudio, and Hume’s own encoder and model classes, which are pulled in via pip from the GitHub repo / package.
2. Install the Library
From the Hugging Face card and docs:
```bash
# Install directly from GitHub
pip install git+https://github.com/HumeAI/tada.git

# Or, if you cloned the repo
pip install -e .
```
Either command installs the TADA library, its codec, and the required Python dependencies.
If you’re on Windows and pip install git+... fails with a “git not found” error, you’ll need to install Git and ensure it’s on your PATH before re‑running the command (general Git + pip behavior).
3. Verify GPU and PyTorch
Inside Python:
```python
import torch

print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")
```
If this returns True for CUDA, you’re ready to run TADA on GPU.
Running a Simple Text‑to‑Speech Demo
The Hugging Face model card shows a minimal example for text‑to‑speech with a reference audio prompt. Here’s a slightly cleaned version:
```python
import torch
import torchaudio

from tada.modules.encoder import Encoder
from tada.modules.tada import TadaForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load encoder and TADA model
encoder = Encoder.from_pretrained("HumeAI/tada-codec", subfolder="encoder").to(device)
model = TadaForCausalLM.from_pretrained("HumeAI/tada-1b").to(device)
model.eval()

# 2. Load a short reference audio clip (for voice & style)
audio, sample_rate = torchaudio.load("samples/ljspeech.wav")
audio = audio.to(device)

prompt_text = (
    "The examination and testimony of the experts enabled the commission "
    "to conclude that five shots may have been fired."
)
prompt = encoder(audio, text=[prompt_text], sample_rate=sample_rate)

# 3. Generate new speech in the same voice
output = model.generate(
    prompt=prompt,
    text="Please call Stella. Ask her to bring these things with her from the store.",
)

# 4. Save result
waveform = output.audio  # check the actual field name in the docs
torchaudio.save("tada_output.wav", waveform.cpu(), sample_rate)
```
This follows the official usage: load the codec encoder and the TadaForCausalLM model from Hugging Face, encode a prompt audio + text, then call .generate() to synthesize a new utterance.
If you don’t care about voice cloning initially, you can use a generic reference audio from the provided samples or a neutral voice clip you recorded.
Building a Local Gradio Demo
Hume provides an official Gradio demo as a Hugging Face Space (HumeAI/tada). You can either use the hosted space or run something similar locally.
A simple pattern:
- Install Gradio:

```bash
pip install gradio
```

- Reuse the encoder + model loading code from above.
- Wrap generation in a Gradio interface:
```python
import gradio as gr

def tts_fn(text, ref_audio):
    # With type="filepath", ref_audio is the path to the recorded clip
    audio, sample_rate = torchaudio.load(ref_audio)
    audio = audio.to(device)
    prompt = encoder(audio, text=[text], sample_rate=sample_rate)
    out = model.generate(prompt=prompt, text=text)
    waveform = out.audio  # adjust to the real attribute
    return sample_rate, waveform.cpu().numpy()

demo = gr.Interface(
    fn=tts_fn,
    inputs=[
        gr.Textbox(label="Text to Speak"),
        gr.Audio(source="microphone", type="filepath", label="Reference Voice"),
    ],
    outputs=gr.Audio(label="Generated Speech"),
    title="TADA Local TTS Demo",
)
demo.launch()
```
The official HF Space includes extras like alignment visualizations and configuration sliders, and you can copy ideas from its app.py for a richer UI.
Benchmarking TADA Locally
To really understand how TADA performs on your hardware, you should benchmark:
- Real‑Time Factor (RTF) – generation time / audio duration.
- Latency to First Audio – how long it takes to produce the first playable chunk (if you stream).
- Quality (subjective + CER) – how natural it sounds and how often words are wrong or missing.
Official Benchmarks (From the Paper)
The TADA paper reports the following for voice cloning benchmarks (SeedTTS‑Eval and LibriTTSR‑Eval):
- TADA‑1B: RTF 0.09; LibriTTSR CER 0.55, SIM 80.2, oMOS 3.11.
- TADA‑3B‑ml: RTF 0.13; LibriTTSR CER 0.40, SIM 79.9, oMOS 3.17.
- Competing models (XTTS‑v2, Index‑TTS2, Higgs v2, VibeVoice 1.5B, FireRedTTS‑2) have higher RTFs (0.19–0.76) and non‑zero hallucination counts under the CER>0.15 criterion.
In reconstruction benchmarks, TADA’s codec runs at just 2–3 fps with oMOS around 3.34, matching or beating other continuous codecs that need 7.5–75 fps.
How to Measure RTF Yourself
You can compute RTF in a simple script:
- Use a fixed reference audio + prompt text.
- Time the model.generate call.
- Measure the generated audio length in seconds (e.g., via waveform.shape[-1] / sample_rate).
- RTF = generation_time / audio_duration.
Example pattern (pseudo‑code you can adapt):
```python
import time

start = time.perf_counter()
out = model.generate(prompt=prompt, text=text)
elapsed = time.perf_counter() - start

waveform = out.audio
duration = waveform.shape[-1] / sample_rate
rtf = elapsed / duration
print(f"RTF: {rtf:.3f}")
```
If your RTF is below 1.0, TADA is faster than real‑time; values near the reported 0.09–0.13 mean you’re very close to paper‑level performance.
Deeper Comparison: TADA vs Other Open‑Source TTS Models
Beyond the paper’s baselines, several popular open‑source TTS models are widely used today: XTTS‑v2, Mozilla TTS, ChatTTS, MeloTTS, Coqui TTS, Mimic 3, and Bark among others. Here’s how TADA fits in that landscape.
Conceptual / Architectural Differences
- XTTS‑v2, Bark, ChatTTS, Coqui TTS typically rely on fixed‑frame acoustic codecs (e.g., Encodec, DAC, Mimi) with 12.5–75 tokens per second of audio.
- TADA uses continuous latent vectors aligned 1:1 with text tokens at only 2–3 frames per second, massively reducing sequence length and compute.
- Some competitors use semantic tokens (like HuBERT units) between text and audio; TADA skips this semantic layer and models speech and text as a single synchronous stream.
- Traditional frameworks like Mozilla TTS and Coqui TTS are modular toolkits; TADA is an end‑to‑end LLM‑style model that can do TTS and spoken language modeling.
Feature‑Level Comparison Table
This table combines paper benchmarks with independent overviews of other open‑source TTS models:
The USP of TADA is not just sound quality; it’s the combination of speed, reliability, and unified modeling, enabled by the 1:1 text–acoustic design.
Pricing and Licensing
TADA itself is free to download and run locally; your only costs are hardware, electricity, and any surrounding infrastructure you use. Hume describes the models as open source with permissive licenses.
These licenses typically allow commercial deployment, although you should always read the exact license text on Hugging Face or GitHub before shipping a product.
This stands in contrast to some other open‑source TTS models, like ChatTTS (Creative Commons NonCommercial) or certain Coqui models, which legally block or restrict commercial use even though the code is public.
Bark recently moved to an MIT license, making it commercially friendly as well, but it does not provide TADA’s strict 1:1 content guarantees.
If you prefer a hosted option, Hume also offers commercial APIs and infrastructure, but their pricing is separate from the open‑source TADA release and must be checked on Hume’s official site.
How to Systematically Test and Compare TADA Locally
To write a serious benchmarking article or choose a model for production, you should design a small, reproducible test suite.
1. Prepare Test Prompts
Use at least these categories:
- Short neutral sentences (5–10 seconds) – for latency and clarity.
- Long‑form paragraphs (60–120 seconds) – for drift, stability, and hallucinations.
- Emotion / prosody prompts – expressive text like stories or passionate reviews.
- Multilingual samples (if you use TADA‑3B‑ml) – to check pronunciation across languages.
Keep the same text for all models you compare.
2. Implement a Common Benchmark Script
For each model (TADA, XTTS‑v2, Bark, etc.):
- Load the model and any required codec or voice encoder, using the official instructions.
- Generate audio for each prompt while timing generation to compute RTF.
- Save outputs with a clear naming convention (e.g., model_promptid.wav).
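The loop above can be sketched as a small harness. Here `synthesize` is a hypothetical adapter you would replace with each model's real generate call, and the file names follow the naming convention suggested above:

```python
import time

def benchmark(models, prompts, synthesize, sample_rate=24_000):
    """Run every (model, prompt) pair, timing generation to compute RTF.

    `synthesize(model, text)` is a placeholder adapter that should return a
    1-D sequence of audio samples; swap in each model's actual API.
    """
    results = []
    for model_name, model in models.items():
        for prompt_id, text in prompts.items():
            start = time.perf_counter()
            samples = synthesize(model, text)
            elapsed = time.perf_counter() - start
            duration = len(samples) / sample_rate
            results.append({
                "file": f"{model_name}_{prompt_id}.wav",  # naming convention from above
                "rtf": elapsed / duration,
            })
    return results
```

Keeping the harness identical across models is what makes the resulting RTF numbers comparable.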
You can optionally compute CER by passing generated audio through an ASR model (Parakeet‑TDT or Whisper) and comparing the transcript to the ground truth—the same metric TADA’s paper uses for hallucinations.
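CER itself is just a character‑level edit distance normalized by the reference length. A minimal, dependency‑free implementation you could apply to ASR transcripts looks like this (a sketch, not the paper's exact scoring code):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    ref, hyp = reference.lower(), hypothesis.lower()
    # Standard dynamic-programming edit distance over characters
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, start=1):
        curr = [i]
        for j, hc in enumerate(hyp, start=1):
            cost = 0 if rc == hc else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / max(len(ref), 1)

def is_failure(reference: str, transcript: str, threshold: float = 0.15) -> bool:
    # Flag a sample as a hallucination/failure using the paper's CER > 0.15 criterion
    return cer(reference, transcript) > threshold
```

Feed it the ground‑truth text and the ASR transcript of the generated audio, and count how many samples each model pushes over the 0.15 threshold.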
3. Subjective Listening Tests
Ask colleagues or testers to rate:
- Naturalness (1–5).
- Speaker similarity (if doing cloning).
- Intelligibility / correctness (did any words vanish or change?).
TADA’s human ratings on expressive long‑form speech were around 4.18/5 for speaker similarity and 3.78/5 for naturalness, placing second overall among evaluated systems. That gives you a reference point when designing your listening scale.
Limitations and Things to Watch Out For
Even though TADA is very strong in benchmarks, Hume explicitly calls out some limitations.
- Long‑Form Speaker Drift – In extremely long generations (10+ minutes), the speaker identity can drift slightly; they mitigate this with online rejection sampling, but it’s not perfect, and periodically resetting context is recommended.
- Modality Gap – When generating both speech and text, language quality can degrade relative to text‑only LLMs; Speech Free Guidance (SFG) reduces this but doesn’t fully close the gap.
- Pre‑training Focused on Speech Continuation – For assistant‑style tasks, fine‑tuning is recommended, since the pre‑training objective is mostly speech continuation rather than instruction following.
- Language Coverage – Current open‑source models focus on English and seven additional languages; global coverage is not yet as broad as some multilingual TTS toolkits.
Being aware of these helps you design realistic demos and benchmarks instead of over‑promising.
Real‑World Examples and Workflows
Here are some concrete ways you might use TADA locally.
1. Offline Voice Assistant
- Run TADA‑1B alongside a small on‑device LLM.
- Use microphone input to transcribe speech, run the assistant logic, then synthesize responses with TADA for low‑latency, private interaction.
- Benefit from zero content hallucinations on the TTS side, so what the assistant “says” in audio matches the text exactly.
2. Video or Podcast Voiceover
- Batch‑generate narration for scripts using TADA‑1B, with a single voice reference file to maintain consistent tone across episodes.
- Use the long context capabilities to synthesize multi‑minute segments without splitting sentences, reducing unnatural boundaries.
3. Multilingual Dubbing
- With TADA‑3B‑ml, feed translated scripts and a short original voice clip to create multilingual versions that preserve approximate speaker character and rhythm.
- The 1:1 alignment makes it easier to maintain timing across languages by adjusting prompt text and durations.
4. Research on Spoken Language Modeling
- Because TADA can generate both spoken and written outputs, it’s a strong base for experiments on conversational agents that “think in audio.”
- The paper evaluates TADA on tasks like Spoken StoryCloze and conversational perplexity, where it competes with or beats much larger multimodal models.
How TADA’s USP Compares to Competitors
If you need a simple mental model for where TADA fits:
- Choose TADA when:
- You care a lot about content correctness (no missing or invented words).
- You want long‑form, context‑rich speech with aligned transcripts.
- You plan to run on‑device or under tight latency/compute budgets.
- Choose XTTS‑v2 or Coqui TTS when:
- You need a mature ecosystem and lots of pre‑trained voices and languages, and you’re OK with a more conventional pipeline.
- Choose Bark or ChatTTS when:
- You want creative, expressive, or heavily stylized speech (with background sounds, laughs, etc.), and you can handle stricter licenses or higher risk of hallucinated content.
TADA’s unique selling proposition is that it treats speech as a first‑class citizen inside an LLM without sacrificing text‑like efficiency, which no older open‑source TTS system does to this extent yet.
FAQs
1. What exactly is TADA?
TADA (Text‑Acoustic Dual Alignment) is an open‑source speech‑language model from Hume AI that generates text and high‑quality speech together using a 1:1 mapping between text tokens and acoustic vectors.
2. Can I run TADA fully offline on my own PC?
Yes, you can install TADA via pip, download the models from Hugging Face, and run everything locally—no cloud calls required, though a modern GPU is strongly recommended for real‑time speed.
3. How fast is TADA compared to other TTS models?
On benchmark hardware, TADA‑1B reaches an RTF of about 0.09, making it over 5x faster than comparable LLM‑based TTS systems like XTTS‑v2, FireRedTTS‑2, or VibeVoice in the same evaluation.
4. Is TADA free for commercial use?
The released checkpoints are open source under permissive licenses, so many commercial uses are allowed, but you must always verify the specific license text on Hugging Face or GitHub for your use case.
5. How is TADA different from Bark or XTTS‑v2?
Bark and XTTS‑v2 are excellent TTS models, but they use higher‑rate acoustic tokens and can still hallucinate content, whereas TADA’s 1:1 alignment architecture is explicitly designed to minimize hallucinations while delivering much faster inference and longer context.