Running Zonos-TTS Multilingual Locally on Ubuntu: A Step-by-Step Guide
Zonos-TTS is an open-source, multilingual, real-time text-to-speech (TTS) model that offers high expressiveness and voice cloning capabilities. Released by ZyphraAI under the Apache 2.0 license, Zonos-TTS supports features like real-time voice cloning, audio prefix input, and fine control over speech attributes such as rate, pitch, and emotion.
This guide provides a step-by-step method to install and run Zonos-TTS locally on an Ubuntu system.
Architectural Overview of Zonos-TTS
Zonos-TTS leverages deep learning methodologies to generate naturalistic speech outputs from textual inputs. The framework incorporates speaker embeddings and audio prefix conditioning to enhance voice fidelity. Notable features include:
- High-Fidelity Voice Cloning: Capable of reproducing a speaker’s voice with minimal audio input (5-30 seconds of reference speech).
- Audio Prefix Conditioning: Enables nuanced speaker adaptation, supporting whispering and stylistic modifications.
- Extensive Multilingual Support: Facilitates speech synthesis in English, Japanese, Chinese, French, and German.
- Parametric Speech Modulation: Provides granular control over speech attributes, including tempo, pitch modulation, and emotional tone.
- Computational Efficiency: Achieves approximately 2× real-time inference speed on high-end GPUs (e.g., RTX 4090).
- Open-Source Accessibility: Distributed under Apache 2.0, allowing unrestricted commercial and academic use.
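To make the throughput claim above concrete, the real-time factor (RTF) can be computed from wall-clock generation time and the duration of the produced audio. The helper below is a hypothetical illustration, not part of the Zonos API:

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Return how many seconds of audio are produced per second of compute."""
    if generation_seconds <= 0:
        raise ValueError("generation time must be positive")
    return audio_seconds / generation_seconds

# A 2x real-time model renders 10 s of speech in roughly 5 s of compute.
print(real_time_factor(10.0, 5.0))  # → 2.0
```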
Prerequisites
Ensure your Ubuntu system meets the following requirements:
Hardware Requirements
- GPU: NVIDIA GPU recommended (e.g., RTX 4090 for optimal performance).
- RAM: Minimum 16GB.
- Storage: Adequate space for model files and dependencies.
Software Requirements
- Ubuntu: Latest LTS version recommended.
- Python: Version 3.10 or higher.
- pip: Python package installer.
- git: Required for cloning the repository.
- Docker: If using the containerized installation.
- eSpeak: For text normalization and phonemization.
Installation Methods
You can install Zonos-TTS via Docker or a manual (DIY) installation.
1. Docker Installation
Docker simplifies dependency management and deployment.
Steps:
Install Docker & Docker Compose:
sudo apt update
sudo apt install docker.io docker-compose
sudo systemctl start docker
sudo systemctl enable docker
Clone the Zonos Repository:
git clone https://github.com/Zyphra/Zonos.git
cd Zonos
Run Docker Compose:
docker compose up
Generate Sample Audio:
python3 sample.py
2. DIY Installation
For manual installation, follow these steps:
Clone the Zonos Repository:
git clone https://github.com/Zyphra/Zonos.git
cd Zonos
Install eSpeak:
sudo apt install espeak-ng
Install Python Dependencies:
python3 -m pip install --upgrade uv
uv venv
source .venv/bin/activate
uv sync --no-group main
uv sync
Generate Sample Audio:
python3 sample.py
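Before running the sample script, it can help to confirm that the uv-created virtual environment is actually active. A minimal stdlib-only check (illustrative, not part of the Zonos tooling):

```python
import sys

def in_virtualenv() -> bool:
    """True when Python is running inside a venv/virtualenv."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print("venv active:", in_virtualenv())
```

If this prints False, re-run source .venv/bin/activate before invoking python3 sample.py.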
Running Zonos-TTS with Python
Once installed, use Python to generate speech:
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict
# Load the model
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-hybrid", device="cuda")
model.bfloat16()
# Load example audio for voice cloning
wav, sampling_rate = torchaudio.load("./exampleaudio.mp3")
spk_embedding = model.embed_spk_audio(wav, sampling_rate)
torch.manual_seed(421)
# Define conditioning parameters
cond_dict = make_cond_dict(
    text="Hello, world!",
    speaker=spk_embedding.to(torch.bfloat16),
    language="en-us",
)
# Prepare conditioning and generate speech
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()
# Save the generated audio (decode returns a batch; save the first waveform)
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
Customization Options
Model Selection
Zonos provides two models:
- Hybrid Model (default): Balances quality and computational efficiency.
- Transformer Model: Higher fidelity output at increased computational cost:
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")
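One way to choose between the two checkpoints programmatically is to key off available GPU memory. The VRAM threshold below is an assumption for illustration, not an official requirement:

```python
def pick_checkpoint(vram_gb: float) -> str:
    """Pick a Zonos checkpoint id from available VRAM (illustrative threshold)."""
    if vram_gb >= 12:
        return "Zyphra/Zonos-v0.1-transformer"  # higher fidelity, heavier
    return "Zyphra/Zonos-v0.1-hybrid"           # default, lighter

print(pick_checkpoint(24))  # → Zyphra/Zonos-v0.1-transformer
```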
Multilingual Capabilities
Adjust the language parameter for speech synthesis in different languages:
cond_dict = make_cond_dict(
    text="Bonjour le monde!",
    speaker=spk_embedding.to(torch.bfloat16),
    language="fr-fr",
)
Speech Style and Emotion Control
Fine-tune output speech by modifying expressive parameters:
cond_dict = make_cond_dict(
    text="I am very happy!",
    speaker=spk_embedding.to(torch.bfloat16),
    language="en-us",
    emotion="happiness",
    speaking_rate=1.2,
    pitch_variation=0.1,
)
Troubleshooting
- Dependency Issues: Re-run uv sync to restore missing packages.
- GPU Errors: Ensure NVIDIA drivers are installed and PyTorch is configured for GPU usage.
- Audio Quality Issues: Experiment with voice cloning samples and adjust speech parameters.
- Installation Errors: Verify Docker is running or ensure all manual installation steps are followed correctly.
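Several of the issues above come down to missing system tools. A quick stdlib diagnostic that checks whether espeak-ng, the NVIDIA driver utility, and git are on PATH (illustrative, not part of Zonos):

```python
import shutil

def check_tools(tools=("espeak-ng", "nvidia-smi", "git")):
    """Map each tool name to its resolved path, or None if not found."""
    return {tool: shutil.which(tool) for tool in tools}

for tool, path in check_tools().items():
    print(f"{tool}: {'OK ' + path if path else 'MISSING'}")
```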
Use Cases
Zonos-TTS can be used for:
- Content Creation: Audiobooks, podcasts, video voiceovers.
- Accessibility: Text-to-speech applications for visually impaired users.
- Entertainment: Character voices for games, virtual assistants.
- Education: E-learning content, language pronunciation tools.
- Research: Speech synthesis and voice cloning advancements.
Community & Contribution Channels
Engage with the Zonos-TTS ecosystem via:
- GitHub: Zonos Repository
- Hugging Face Hub: Model downloads and discussions.
- Reddit Forums: Participate in r/AudioAI community discussions.
- ZyphraAI Website: Access documentation and API specifications.
Conclusion
Zonos-TTS is a powerful open-source TTS model, offering multilingual support and expressive voice synthesis. Whether using Docker for quick deployment or DIY installation for greater control, this guide helps set up and run Zonos-TTS efficiently on Ubuntu. Its applications range from content creation to accessibility and research, making it a versatile tool for real-time voice synthesis.