Running Zonos TTS on Windows: Multilingual Local Installation
Zonos-TTS, a recent offering from ZyphraAI, is a fully open-source, multilingual text-to-speech (TTS) model that supports real-time voice cloning and is commercially usable under the Apache 2.0 License.
Trained on more than 200,000 hours of multilingual speech data, Zonos-TTS delivers impressive performance: in ZyphraAI's own tests on an RTX 4090 graphics card, the model generated audio at approximately twice real-time speed.
What is Zonos-TTS?
Zonos-TTS is a text-to-speech model designed to generate natural-sounding speech from text prompts using a speaker embedding or audio prefix. It allows for high-fidelity voice cloning with just 5 to 30 seconds of speech and enables conditioning based on speaking rate, pitch variation, audio quality, and emotions. The model supports multiple languages, including English, Japanese, Chinese, French, and German, outputting speech natively at 44kHz.
Key Features of Zonos-TTS:
- Zero-shot TTS with voice cloning: Generates high-quality TTS output using a 10-30 second speaker sample.
- Audio prefix inputs: Enhances speaker matching by adding text plus an audio prefix, which can elicit behaviors like whispering.
- Multilingual support: Supports English, Japanese, Chinese, French, and German.
- Audio quality and emotion control: Offers fine-grained control over speaking rate, pitch, and emotions like happiness, anger, sadness, and fear.
- Fast performance: Runs with a real-time factor of approximately 2x on an RTX 4090.
- WebUI Gradio interface: Comes with an easy-to-use Gradio interface for generating speech.
- Simple installation and deployment: Can be installed easily using the provided Dockerfile.
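To make the speed figure concrete, the arithmetic can be sketched as follows. This assumes "2x real-time" means the model produces two seconds of audio per second of compute; the function name `estimate_generation_seconds` is illustrative and not part of Zonos, and real throughput will vary with hardware and text length.

```python
def estimate_generation_seconds(audio_seconds: float, speed_factor: float = 2.0) -> float:
    """Estimate the wall-clock time needed to synthesize `audio_seconds` of speech.

    speed_factor is the number of seconds of audio produced per second of
    compute (assumed to be about 2.0 on an RTX 4090, per the figure above).
    """
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


if __name__ == "__main__":
    # Under the 2x assumption, a 30-second clip needs about 15 seconds.
    print(estimate_generation_seconds(30.0))
```

On a slower GPU the same formula applies with a smaller factor; a factor below 1.0 means generation takes longer than playback.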
Installation Methods
There are two primary methods to install Zonos-TTS on Windows:
- Using Docker – Recommended for users who prefer a straightforward, containerized approach.
- DIY Installation – A manual method that provides more control over the environment setup.
Why Choose Zonos-TTS? 💡
Feature | Zonos-TTS | Other TTS Tools |
---|---|---|
Speed | 2x real-time | Often slower |
Voice Cloning | 5-30 second samples | Typically 1min+ |
Audio Quality | 44kHz output | Usually 16-24kHz |
Languages | 5 supported | Often 1-2 |
Commercial Use | Allowed (Apache 2.0) | Many restrict usage |
System Requirements 🖥️
Minimum:
- OS: Windows 10/11 64-bit
- RAM: 8GB+
- GPU: NVIDIA GTX 1660 (6GB VRAM)
- Storage: 10GB free space
Recommended:
- GPU: RTX 3060 (12GB VRAM) or better
- RAM: 16GB+
- Python 3.10+
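Some of these requirements can be checked before installing anything. The sketch below uses only the Python standard library to compare the interpreter version and free disk space against the figures listed above; the function name `check_requirements` and its thresholds are illustrative assumptions, and it does not inspect RAM or the GPU.

```python
import shutil
import sys


def check_requirements(path: str = ".", min_free_gb: float = 10.0,
                       min_python: tuple = (3, 10)) -> dict:
    """Return simple pass/fail flags for the Python version and free disk space."""
    # Free space on the drive that contains `path`, converted to gigabytes.
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python_ok": sys.version_info[:2] >= min_python,
        "disk_ok": free_gb >= min_free_gb,
        "free_gb": round(free_gb, 1),
    }


if __name__ == "__main__":
    for name, value in check_requirements().items():
        print(f"{name}: {value}")
```

Run it from the drive where you plan to clone the repository so the disk check measures the right volume.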
Installation Methods ⚙️
Method 1: Docker Installation (Recommended) 🐳
Step 1: Install Docker Desktop
Step 2: Launch PowerShell as Admin and Clone the Repository
git clone https://github.com/Zyphra/Zonos
cd Zonos
Step 3: Start Container
docker compose up
Step 4: Access Web Interface
Open http://localhost:7860 in your browser.
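If the page does not load, it can help to confirm that something is actually listening on the port. The following is a minimal sketch using the standard `socket` module; the helper name `is_port_open` is illustrative, and the default port 7860 is taken from the URL above.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    state = "reachable" if is_port_open() else "not reachable"
    print(f"Web interface on port 7860 is {state}")
```

A result of "not reachable" shortly after starting the container may only mean the model is still loading, so it is worth retrying after a minute.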
Alternatively, build and run the Docker image for development:
docker build -t zonos .
docker run -it --gpus=all --net=host -v /path/to/Zonos:/Zonos -t zonos
cd /Zonos
python3 sample.py # Generates sample.wav
Replace /path/to/Zonos with your actual directory path. Docker image names must be lowercase, which is why the image is tagged zonos.
Method 2: Manual Installation 🔧
Step 1: Install Dependencies
- Install Python 3.10+
Install Git:
winget install --id Git.Git
Install eSpeak-NG via Chocolatey:
choco install espeak-ng
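After installing the dependencies, a quick way to confirm they are visible from your shell is to look them up on PATH. This sketch assumes the executables are named `python`, `git`, and `espeak-ng`; the helper name `find_missing_tools` is illustrative.

```python
import shutil


def find_missing_tools(tools=("python", "git", "espeak-ng")) -> list:
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH")
```

If a tool was just installed, open a new PowerShell window before running the check so the updated PATH is picked up.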
Step 2: Set Up Python Environment
git clone https://github.com/Zyphra/Zonos
cd Zonos
python -m venv zonos-env
.\zonos-env\Scripts\activate
pip install -e .  # Installs Zonos and its dependencies from pyproject.toml
Step 3: Verify Installation
python sample.py
# Output: sample.wav created
Usage Examples
Using the Gradio Interface
- Open http://localhost:7860.
- Input text in the provided box.
- Upload a 10-30 second audio sample for voice cloning.
- Adjust parameters like speaking rate, pitch, and emotion.
- Click "Generate" to produce speech.
- Download the generated audio.
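Before uploading a reference clip, you may want to confirm that its length falls in the suggested 10-30 second window. The sketch below reads the duration of an uncompressed WAV file with the standard `wave` module; the helper names are illustrative, and compressed formats such as MP3 are not handled here.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_suitable_reference(path: str, min_s: float = 10.0, max_s: float = 30.0) -> bool:
    """Check that the clip length lies inside the suggested 10-30 second window."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

For example, `is_suitable_reference("my_voice.wav")` returns `False` for a 4-second clip; the file name here is a placeholder.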
Using Python Code
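import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load the pretrained transformer checkpoint onto the GPU and switch to bfloat16.
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")
model.bfloat16()

# Build a speaker embedding from a short reference recording.
wav, sampling_rate = torchaudio.load("./exampleaudio.mp3")
spk_embedding = model.embed_spk_audio(wav, sampling_rate)

# Fix the random seed so repeated runs produce the same audio.
torch.manual_seed(421)

cond_dict = make_cond_dict(
    text="Hello, world!",
    speaker=spk_embedding.to(torch.bfloat16),
    language="en-us",
)
conditioning = model.prepare_conditioning(cond_dict)

# Generate audio codes, then decode them to a waveform on the CPU.
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()

# decode() returns a batch; save the first item as a (channels, samples) tensor.
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)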
Tips and Troubleshooting
Issue | Solution |
---|---|
CUDA Out of Memory | Reduce batch size in config.yml |
eSpeak Not Found | Add C:\Program Files\eSpeak NG to PATH |
Gradio Port Conflict | Stop the process that is using port 7860 and restart the container (docker compose up has no --port option) |
Slow Generation | Enable GPU in Docker Desktop Settings |
- Ensure GPU Support: Verify that PyTorch is using your GPU.
- Check Dependencies: Resolve any version conflicts.
- File Paths: Ensure file paths are correct.
- Memory Issues: Reduce batch size or use a smaller model if needed.
- Docker Issues: Verify Docker Desktop is running correctly.
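The GPU check from the first tip can be sketched as below. It assumes PyTorch may or may not be installed in the active environment and reports each case separately; the function name `describe_gpu` is illustrative.

```python
def describe_gpu() -> str:
    """Summarize whether PyTorch can see a CUDA GPU, and how much VRAM it has."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch is installed but cannot see a CUDA GPU"
    # Query the first CUDA device for its name and total memory.
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    return f"{props.name} with {vram_gb:.1f} GB VRAM"


if __name__ == "__main__":
    print(describe_gpu())
```

If the second message appears, a CPU-only PyTorch build or an outdated NVIDIA driver is a common cause.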
Alternatives to Zonos-TTS 🔄
If Zonos-TTS does not meet your needs, consider these alternatives:
- StyleTTS 2
  - Pros: Better for emotional speech
  - Cons: No commercial license
- Tortoise-TTS
  - Pros: More voice presets
  - Cons: Slower generation
- Microsoft Azure TTS
  - Pros: Enterprise support
  - Cons: Monthly costs
Conclusion
Zonos-TTS is a significant advancement in open-source TTS technology, providing high-quality voice cloning and multilingual support. Whether using Docker or manual installation, this guide equips you with the steps to get Zonos-TTS running on your Windows machine.