Running Zonos TTS on Windows: Multilingual Local Installation

Zonos-TTS, a recent release from Zyphra, is an open-source, multilingual text-to-speech (TTS) model that supports real-time voice cloning and is licensed for commercial use under the Apache 2.0 License.

Trained on more than 200,000 hours of multilingual speech, Zonos-TTS runs at roughly twice real-time speed on an RTX 4090 graphics card in Zyphra's own tests.

What is Zonos-TTS?

Zonos-TTS is a text-to-speech model designed to generate natural-sounding speech from text prompts using a speaker embedding or audio prefix. It allows for high-fidelity voice cloning with just 5 to 30 seconds of speech and enables conditioning based on speaking rate, pitch variation, audio quality, and emotions. The model supports multiple languages, including English, Japanese, Chinese, French, and German, outputting speech natively at 44kHz.

Key Features of Zonos-TTS:

Zero-shot TTS with voice cloning: Generates high-quality TTS output using a 10-30 second speaker sample.
Audio prefix inputs: Enhances speaker matching by adding text plus an audio prefix, which can elicit behaviors like whispering.
Multilingual support: Supports English, Japanese, Chinese, French, and German.
Audio quality and emotion control: Offers fine-grained control over speaking rate, pitch, and emotions like happiness, anger, sadness, and fear.
Fast performance: Runs with a real-time factor of approximately 2x on an RTX 4090.
WebUI Gradio interface: Comes with an easy-to-use Gradio interface for generating speech.
Simple installation and deployment: Can be installed easily using the provided Dockerfile.
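
To make the "2x real time" figure concrete, here is a minimal sketch of the arithmetic. The function name is hypothetical, and it assumes throughput stays constant regardless of clip length, which is a simplification rather than a measured result.

```python
def estimated_generation_seconds(audio_seconds: float, speedup: float = 2.0) -> float:
    """Estimate wall-clock time needed to synthesize `audio_seconds` of speech.

    `speedup` is the number of audio seconds produced per wall-clock second
    (about 2.0 on an RTX 4090, per the figure quoted above). Slower GPUs will
    have a lower value; this default is an assumption, not a benchmark.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


if __name__ == "__main__":
    for clip in (10, 60, 300):
        eta = estimated_generation_seconds(clip)
        print(f"{clip} s of audio -> about {eta:.1f} s to generate")
```

On hardware slower than an RTX 4090, pass a smaller `speedup` that you have measured yourself.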

Installation Methods

There are two primary methods to install Zonos-TTS on Windows:

Using Docker – Recommended for users who prefer a straightforward, containerized approach.
DIY Installation – A manual method that provides more control over the environment setup.

Why Choose Zonos-TTS? 💡

Feature	Zonos-TTS	Other TTS Tools
Speed	2x real-time	Often slower
Voice Cloning	5-second samples	Typically 1min+
Audio Quality	44kHz output	Usually 16-24kHz
Languages	5 supported	Often 1-2
Commercial Use	Allowed (Apache 2.0)	Many restrict usage

System Requirements 🖥️

Minimum:

OS: Windows 10/11 64-bit
RAM: 8GB+
GPU: NVIDIA GTX 1660 (6GB VRAM)
Storage: 10GB free space

Recommended:

GPU: RTX 3060 (12GB VRAM) or better
RAM: 16GB+
Python 3.10+
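
Two of these requirements can be checked from Python itself before installing anything. The sketch below uses only the standard library; the constant and function names are illustrative, and the thresholds are simply copied from the lists above.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # "Python 3.10+" from the recommended list
MIN_FREE_GB = 10      # "10GB free space" from the minimum list


def python_ok(version=sys.version_info) -> bool:
    """True if the interpreter's major.minor version is at least MIN_PYTHON."""
    return tuple(version[:2]) >= MIN_PYTHON


def free_gb(path: str = ".") -> float:
    """Free space, in GiB, on the drive that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print(f"Free disk space: {free_gb():.1f} GB (guide suggests {MIN_FREE_GB}+)")
```

RAM and VRAM are not covered here because the standard library has no portable way to query them.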

Installation Methods ⚙️

Method 1: Docker Installation (Recommended) 🐳

Step 1: Install Docker Desktop

Step 2: Launch PowerShell as Admin and Clone the Repository

git clone https://github.com/Zyphra/Zonos
cd Zonos

Step 3: Start Container

docker compose up

Step 4: Access Web Interface
Open http://localhost:7860 in your browser.
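
If the page does not load, it helps to know whether anything is listening on the port at all. The following sketch assumes the interface is served on port 7860 as described above; the helper name is hypothetical.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Something is listening on port 7860; try the browser again.")
    else:
        print("Nothing is listening on port 7860; check the container logs.")
```

The same check is useful before starting the container: if the port is already open, another program is occupying it.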

Alternatively, build and run the Docker image for development. Docker image names must be lowercase, so the tag is zonos rather than Zonos:

docker build -t zonos .
docker run -it --gpus=all --net=host -v /path/to/Zonos:/Zonos -t zonos

Then, inside the container:

cd /Zonos
python3 sample.py  # Generates sample.wav

Replace /path/to/Zonos with the path to your cloned repository on the host.

Method 2: Manual Installation 🔧

Step 1: Install Dependencies

Install Python 3.10+

Install Git:

winget install --id Git.Git

Install eSpeak-NG via Chocolatey:

choco install espeak-ng
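
After installing, open a new terminal (so PATH changes take effect) and confirm that the tools can be found. This is a small sketch using the standard library; the executable names `git` and `espeak-ng` are assumptions about how the installers register them.

```python
import shutil

# Executable names are assumed; adjust if your installer uses different ones.
REQUIRED_TOOLS = ("git", "espeak-ng")


def missing_tools(tools=REQUIRED_TOOLS) -> list:
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```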

Step 2: Set Up Python Environment

git clone https://github.com/Zyphra/Zonos
cd Zonos
python -m venv zonos-env
.\zonos-env\Scripts\activate
pip install -e .  # dependencies are declared in the repository's pyproject.toml

Step 3: Verify Installation

python sample.py
# Output: sample.wav created
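
Beyond checking that the file exists, you can inspect its header to confirm it is a valid WAV with the expected sample rate. The sketch below parses the RIFF header directly, so it works for both integer and floating-point WAV files; `read_wav_format` is a hypothetical helper, not part of Zonos.

```python
import struct


def read_wav_format(path: str) -> dict:
    """Return the format tag, channel count, and sample rate from a WAV file."""
    with open(path, "rb") as f:
        header = f.read(12)
        if len(header) < 12:
            raise ValueError(f"{path} is too short to be a WAV file")
        riff, _size, wave_id = struct.unpack("<4sI4s", header)
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError(f"{path} is not a RIFF/WAVE file")
        while True:
            chunk_header = f.read(8)
            if len(chunk_header) < 8:
                raise ValueError(f"{path} has no fmt chunk")
            chunk_id, chunk_size = struct.unpack("<4sI", chunk_header)
            if chunk_id == b"fmt ":
                fmt_tag, channels, sample_rate = struct.unpack("<HHI", f.read(8))
                return {
                    "format_tag": fmt_tag,  # 1 = integer PCM, 3 = IEEE float
                    "channels": channels,
                    "sample_rate": sample_rate,
                }
            # Skip any other chunk; chunks are padded to an even byte count.
            f.seek(chunk_size + (chunk_size & 1), 1)
```

Calling `read_wav_format("sample.wav")` after the step above should report a sample rate around 44 kHz if generation worked as described.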

Usage Examples

Using the Gradio Interface

Open http://localhost:7860.
Input text in the provided box.
Upload a 10-30 second audio sample for voice cloning.
Adjust parameters like speaking rate, pitch, and emotion.
Click "Generate" to produce speech.
Download the generated audio.

Using Python Code

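# Method names below follow the release this guide was written against;
# newer checkouts may rename them, so check the repository README if a call fails.
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load the pretrained transformer checkpoint onto the GPU and cast it to bfloat16.
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")
model.bfloat16()

# Build a speaker embedding from a short reference clip (point this at your own file).
wav, sampling_rate = torchaudio.load("./exampleaudio.mp3")
spk_embedding = model.embed_spk_audio(wav, sampling_rate)

# Fix the random seed so repeated runs produce the same audio.
torch.manual_seed(421)

# Describe what to say, in which voice, and in which language.
cond_dict = make_cond_dict(
    text="Hello, world!",
    speaker=spk_embedding.to(torch.bfloat16),
    language="en-us",
)

# Generate audio codes, then decode them into a waveform.
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()

# decode() returns a batch; torchaudio.save expects a single (channels, samples) tensor.
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)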

Tips and Troubleshooting

Issue	Solution
CUDA Out of Memory	Reduce batch size in `config.yml`
eSpeak Not Found	Add `C:\Program Files\eSpeak NG` to PATH
Gradio Port Conflict	`docker compose up` has no `--port` option; stop whatever is using port 7860, or change the port in the Gradio launch settings
Slow Generation	Enable GPU in Docker Desktop Settings

Ensure GPU Support: Verify that PyTorch is using your GPU.
Check Dependencies: Resolve any version conflicts.
File Paths: Ensure file paths are correct.
Memory Issues: Reduce batch size or use a smaller model if needed.
Docker Issues: Verify Docker Desktop is running correctly.
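
For the first tip, a quick way to see whether PyTorch can reach the GPU is sketched below. It only uses standard `torch.cuda` calls and degrades gracefully when PyTorch is not installed; the function name is illustrative.

```python
import importlib.util


def gpu_status() -> str:
    """Describe whether PyTorch is installed and can see a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch  # imported lazily so the check works without PyTorch

    if not torch.cuda.is_available():
        return "PyTorch is installed but cannot see a CUDA GPU"
    name = torch.cuda.get_device_name(0)
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return f"CUDA GPU detected: {name} ({total_gb:.1f} GB VRAM)"


if __name__ == "__main__":
    print(gpu_status())
```

Run it inside the same virtual environment (or container) that you use for Zonos-TTS, since a different environment may have a CPU-only PyTorch build.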

Alternatives to Zonos-TTS

If Zonos-TTS does not meet your needs, consider these alternatives: 🔄

StyleTTS 2
- Pros: Better for emotional speech
- Cons: No commercial license
Tortoise-TTS
- Pros: More voice presets
- Cons: Slower generation
Microsoft Azure TTS
- Pros: Enterprise support
- Cons: Monthly costs

Conclusion

Zonos-TTS is a significant advancement in open-source TTS technology, providing high-quality voice cloning and multilingual support. Whether using Docker or manual installation, this guide equips you with the steps to get Zonos-TTS running on your Windows machine.