Running Zonos-TTS Multilingual Locally on Ubuntu: Step-by-Step Guide

Zonos-TTS is an open-source, multilingual, real-time text-to-speech (TTS) model that offers high expressiveness and voice cloning capabilities. Released by ZyphraAI under the Apache 2.0 license, Zonos-TTS supports features like real-time voice cloning, audio prefix input, and fine control over speech attributes such as rate, pitch, and emotion.

This guide provides a step-by-step method to install and run Zonos-TTS locally on an Ubuntu system.

Architectural Overview of Zonos-TTS

Zonos-TTS is a deep-learning model that generates natural-sounding speech from text. It conditions generation on speaker embeddings and, optionally, an audio prefix to improve voice fidelity. Notable features include:

High-Fidelity Voice Cloning: Capable of reproducing a speaker’s voice with minimal audio input (5-30 seconds of reference speech).
Audio Prefix Conditioning: Enables nuanced speaker adaptation, supporting whispering and stylistic modifications.
Extensive Multilingual Support: Facilitates speech synthesis in English, Japanese, Chinese, French, and German.
Parametric Speech Modulation: Provides granular control over speech attributes, including tempo, pitch modulation, and emotional tone.
Computational Efficiency: Achieves approximately 2× real-time inference speed on high-end GPUs (e.g., RTX 4090).
Open-Source Accessibility: Distributed under Apache 2.0, which permits commercial and academic use subject to the license's attribution and notice terms.
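
To make the real-time figure concrete, the sketch below computes a real-time factor, assuming that "2× real-time" means audio is produced twice as fast as it plays back. The function name and the example numbers are illustrative, not measurements.

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Return seconds of audio produced per second of compute.

    Assumed reading of "2x real-time": a value of 2.0 means the model
    generates audio twice as fast as it plays back.
    """
    if audio_seconds < 0 or generation_seconds <= 0:
        raise ValueError("audio_seconds must be >= 0 and generation_seconds > 0")
    return audio_seconds / generation_seconds


# Illustrative numbers only: 10 s of speech generated in 5 s of compute.
print(real_time_factor(10.0, 5.0))  # 2.0
```

Timing a few of your own generations this way is a simple method for comparing your GPU against the quoted figure.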

Prerequisites

Ensure your Ubuntu system meets the following requirements:

Hardware Requirements

GPU: NVIDIA GPU recommended (e.g., RTX 4090 for optimal performance).
RAM: Minimum 16GB.
Storage: Adequate space for model files and dependencies.

Software Requirements

Ubuntu: Latest LTS version recommended.
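Python: Version 3.10 or higher (the Zonos code uses type-annotation syntax that older interpreters reject; confirm the exact minimum in the repository's pyproject.toml).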
pip: Python package installer.
git: Required for cloning the repository.
Docker: If using the containerized installation.
eSpeak NG (the espeak-ng package): For text normalization and phonemization.
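
The requirements above can be checked from Python before installing anything. This is a minimal sketch: the command names are assumptions based on the list above, and nvidia-smi (which ships with the NVIDIA driver) is used here only as a rough signal that a driver is installed.

```python
import shutil
import sys

# Command names are assumptions based on the requirements listed above.
REQUIRED_TOOLS = ("git", "espeak-ng")
OPTIONAL_TOOLS = ("docker", "nvidia-smi")


def check_prerequisites() -> dict[str, bool]:
    """Report which command-line tools are visible on PATH."""
    return {
        tool: shutil.which(tool) is not None
        for tool in REQUIRED_TOOLS + OPTIONAL_TOOLS
    }


# Print the interpreter version so it can be compared with the requirement.
print("python       ", sys.version.split()[0])
for name, found in check_prerequisites().items():
    print(f"{name:13s}", "found" if found else "missing")
```

A missing docker entry only matters if you plan to use the containerized installation.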

Installation Methods

You can install Zonos-TTS via Docker or a manual (DIY) installation.

1. Docker Installation

Docker simplifies dependency management and deployment.

Steps:

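Install Docker & Docker Compose:

sudo apt update
sudo apt install docker.io docker-compose
sudo systemctl start docker
sudo systemctl enable docker

Clone the Zonos Repository:

git clone https://github.com/Zyphra/Zonos.git
cd Zonos

Run Docker Compose:

docker compose up

If the docker compose subcommand is not available, the docker-compose package installed above provides the standalone command instead: docker-compose up.

Generate Sample Audio (run this inside the container, where the dependencies are installed):

python3 sample.py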

2. DIY Installation

For manual installation, follow these steps:

Install eSpeak:

sudo apt install espeak-ng

Clone the Zonos Repository:

git clone https://github.com/Zyphra/Zonos.git
cd Zonos

Install Python Dependencies:

python3 -m pip install --upgrade uv
uv venv
source .venv/bin/activate
uv sync --no-group main
uv sync

Generate Sample Audio (with the virtual environment still active):

python3 sample.py

Running Zonos-TTS with Python

Once installed, use Python to generate speech:

import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load the model onto the GPU and cast its weights to bfloat16
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-hybrid", device="cuda")
model.bfloat16()

# Load the reference audio for voice cloning and compute a speaker embedding.
# Note: newer revisions of the repository may expose this method under a
# different name; check zonos/model.py if embed_spk_audio is missing.
wav, sampling_rate = torchaudio.load("./exampleaudio.mp3")
spk_embedding = model.embed_spk_audio(wav, sampling_rate)

# Fix the random seed so repeated runs produce the same audio
torch.manual_seed(421)

# Define conditioning parameters
cond_dict = make_cond_dict(
    text="Hello, world!",
    speaker=spk_embedding.to(torch.bfloat16),
    language="en-us",
)

# Prepare conditioning, generate audio codes, and decode them to a waveform
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()

# Save the first item of the batch; torchaudio.save expects a 2-D
# (channels, samples) tensor
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
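
The feature list gives 5-30 seconds as the useful length for a reference clip. The sketch below checks a clip against that range before it is embedded. Both function names are hypothetical helpers, not part of Zonos, and they take plain numbers so the check runs without a GPU.

```python
def reference_clip_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of an audio clip in seconds."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def is_usable_reference(
    num_samples: int,
    sample_rate: int,
    min_seconds: float = 5.0,
    max_seconds: float = 30.0,
) -> bool:
    """True if the clip length falls inside the 5-30 s range quoted above."""
    duration = reference_clip_seconds(num_samples, sample_rate)
    return min_seconds <= duration <= max_seconds


# A 10-second clip at 44.1 kHz is inside the range; a 2-second clip is not.
print(is_usable_reference(num_samples=441_000, sample_rate=44_100))  # True
print(is_usable_reference(num_samples=88_200, sample_rate=44_100))   # False
```

With torchaudio, the loaded tensor has shape (channels, samples), so wav.shape[-1] and sampling_rate can be passed in directly.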

Customization Options

Model Selection

Zonos provides two models:

Hybrid Model (default): Balances quality and computational efficiency.

Transformer Model: Higher-fidelity output at a higher computational cost. Load it by changing the model identifier:

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")
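
If a script needs to switch between the two variants, a small lookup keeps the identifiers in one place. This is a sketch: the two identifiers come from this guide, while MODEL_IDS and resolve_model_id are hypothetical names introduced for illustration.

```python
# Model identifiers as used in this guide.
MODEL_IDS = {
    "hybrid": "Zyphra/Zonos-v0.1-hybrid",
    "transformer": "Zyphra/Zonos-v0.1-transformer",
}


def resolve_model_id(variant: str = "hybrid") -> str:
    """Map a short variant name to its model identifier."""
    key = variant.strip().lower()
    if key not in MODEL_IDS:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(MODEL_IDS)}"
        )
    return MODEL_IDS[key]


print(resolve_model_id("Transformer"))  # Zyphra/Zonos-v0.1-transformer
```

The result can be passed straight to the loader, for example Zonos.from_pretrained(resolve_model_id("hybrid"), device="cuda").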

Multilingual Capabilities

Adjust the language parameter for speech synthesis in different languages:

cond_dict = make_cond_dict(
    text="Bonjour le monde!",
    speaker=spk_embedding.to(torch.bfloat16),
    language="fr-fr",
)
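
For scripts that handle several languages, it can help to map readable names to language codes. In the sketch below, only "en-us" and "fr-fr" appear in this guide; the codes for German, Japanese, and Chinese are assumptions based on eSpeak NG voice identifiers and should be checked against the languages your Zonos checkout accepts. LANGUAGE_CODES and cond_kwargs are hypothetical names.

```python
LANGUAGE_CODES = {
    "english": "en-us",  # used in this guide
    "french": "fr-fr",   # used in this guide
    "german": "de",      # assumed eSpeak NG identifier
    "japanese": "ja",    # assumed eSpeak NG identifier
    "chinese": "cmn",    # assumed eSpeak NG identifier (Mandarin)
}


def cond_kwargs(text: str, language_name: str) -> dict[str, str]:
    """Build the text and language arguments for make_cond_dict."""
    key = language_name.strip().lower()
    if key not in LANGUAGE_CODES:
        raise ValueError(f"unsupported language: {language_name!r}")
    return {"text": text, "language": LANGUAGE_CODES[key]}


print(cond_kwargs("Bonjour le monde!", "French"))
# {'text': 'Bonjour le monde!', 'language': 'fr-fr'}
```

The returned dictionary can be unpacked into the call, for example make_cond_dict(speaker=spk_embedding.to(torch.bfloat16), **cond_kwargs("Bonjour le monde!", "French")).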

Speech Style and Emotion Control

Fine-tune output speech by modifying expressive parameters:

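cond_dict = make_cond_dict(
    text="I am very happy!",
    speaker=spk_embedding.to(torch.bfloat16),
    language="en-us",
    # Assumed from the Zonos v0.1 source (verify in zonos/conditioning.py):
    # emotion is a list of eight weights ordered happiness, sadness, disgust,
    # fear, surprise, anger, other, neutral. The values here are illustrative.
    emotion=[0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.1],
    speaking_rate=18.0,  # assumed unit: phonemes per second (default 15.0)
    pitch_std=40.0,      # assumed parameter name: pitch standard deviation (default 20.0)
)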
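
In the Zonos v0.1 source, make_cond_dict appears to take emotion as a list of eight weights rather than a label, so a helper that builds the list from named weights can make calls easier to read. The sketch below rests on two assumptions that should be verified in zonos/conditioning.py: the eight-slot layout and the slot order. Normalizing the weights to sum to 1 is a choice made here, not a documented requirement, and emotion_vector is a hypothetical helper.

```python
# Assumed slot order; verify against zonos/conditioning.py in your checkout.
EMOTION_ORDER = (
    "happiness", "sadness", "disgust", "fear",
    "surprise", "anger", "other", "neutral",
)


def emotion_vector(weights: dict[str, float]) -> list[float]:
    """Turn named, non-negative weights into a normalized eight-slot list."""
    unknown = set(weights) - set(EMOTION_ORDER)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    raw = [float(weights.get(name, 0.0)) for name in EMOTION_ORDER]
    if any(value < 0 for value in raw):
        raise ValueError("weights must be non-negative")
    total = sum(raw)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return [value / total for value in raw]


vec = emotion_vector({"happiness": 3.0, "neutral": 1.0})
print(vec)  # [0.75, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
```

The resulting list would be passed as the emotion argument of make_cond_dict.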

Troubleshooting

Dependency Issues: Run uv sync from the repository root, with the virtual environment active, to install missing packages.
GPU Errors: Ensure NVIDIA drivers are installed and PyTorch is configured for GPU usage.
Audio Quality Issues: Experiment with voice cloning samples and adjust speech parameters.
Installation Errors: Verify Docker is running or ensure all manual installation steps are followed correctly.
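
For GPU errors in particular, a quick check from Python shows whether PyTorch is installed, whether it is a CUDA build, and whether it can see a GPU. This is a minimal sketch; diagnose_gpu is a hypothetical helper, and it only uses standard PyTorch attributes (torch.version.cuda and torch.cuda.is_available).

```python
def diagnose_gpu() -> dict[str, object]:
    """Collect basic facts about the PyTorch and CUDA setup."""
    report: dict[str, object] = {
        "torch_importable": False,
        "torch_cuda_build": None,   # CUDA version PyTorch was built with, if any
        "cuda_available": False,
    }
    try:
        import torch
    except (ImportError, OSError):
        # PyTorch is missing or its shared libraries failed to load.
        return report
    report["torch_importable"] = True
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


for key, value in diagnose_gpu().items():
    print(f"{key}: {value}")
```

If torch_cuda_build is None, the installed PyTorch is a CPU-only build. If it is set but cuda_available is False, the NVIDIA driver is the more likely culprit.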

Use Cases

Zonos-TTS can be used for:

Content Creation: Audiobooks, podcasts, video voiceovers.
Accessibility: Text-to-speech applications for visually impaired users.
Entertainment: Character voices for games, virtual assistants.
Education: E-learning content, language pronunciation tools.
Research: Speech synthesis and voice cloning advancements.

Community & Contribution Channels

Engage with the Zonos-TTS ecosystem via:

GitHub: Zonos repository (https://github.com/Zyphra/Zonos)
Hugging Face Hub: Model downloads and discussions.
Reddit Forums: Participate in r/AudioAI community discussions.
ZyphraAI Website: Access documentation and API specifications.

Conclusion

Zonos-TTS is a powerful open-source TTS model, offering multilingual support and expressive voice synthesis. Whether using Docker for quick deployment or DIY installation for greater control, this guide helps set up and run Zonos-TTS efficiently on Ubuntu. Its applications range from content creation to accessibility and research, making it a versatile tool for real-time voice synthesis.