Install LLaSA TTS 3B on Ubuntu: Voice Cloning & Text-to-Speech

LLaSA (LLaMA-based Speech Synthesis) is a text-to-speech (TTS) system that extends the text-based LLaMA language model by incorporating speech tokens. LLaSA models come in different sizes, such as 1B, 3B, and 8B.

This article focuses on running the LLaSA TTS 3B model on Ubuntu, providing a comprehensive guide covering installation, setup, and usage.

What is LLaSA TTS 3B? 🚀

LLaSA (LLaMA-based Speech Synthesis) is a cutting-edge text-to-speech system built on Meta's LLaMA architecture. The 3B parameter version offers:

  • Human-like voice synthesis 🗣️
  • Voice cloning from short audio samples (5-10 sec) 🎤
  • Support for long-form text processing 📜
  • GPU-accelerated inference (NVIDIA recommended)

System Requirements

Before installation, ensure your system meets the necessary requirements, as running LLaSA TTS 3B can be resource-intensive, particularly when loading additional models like Whisper for transcription.

  • VRAM: A dedicated GPU with sufficient video memory is required. Running LLaSA TTS 3B with Whisper (large turbo) in 4-bit quantization requires approximately 8.5GB of VRAM. Without Whisper, the requirement drops to about 6.5GB.
  • RAM: A minimum of 16GB RAM is recommended for optimal performance.
  • Storage: At least 50GB of free storage is advisable for downloading and installing models, libraries, and dependencies.
  • Operating System: This guide is tailored for Ubuntu, but steps might be adaptable to other Linux distributions.
  • CUDA-enabled GPU: An NVIDIA GPU with CUDA support is highly recommended for faster inference. Ensure appropriate NVIDIA drivers are installed.

| Component | Minimum Spec | Recommended Spec |
| --- | --- | --- |
| GPU (NVIDIA) | 6GB VRAM (4-bit) | 12GB+ VRAM (FP16) |
| RAM | 16GB | 32GB |
| Storage | 50GB HDD | 100GB NVMe SSD |
| OS | Ubuntu 20.04 LTS | Ubuntu 22.04 LTS |
| CUDA Version | 11.7 | 12.1 |

Key Notes:

  • CPU-only mode is possible but extremely slow (not recommended)
  • Requires Python 3.9+ and PyTorch 2.0+ (see the quick check below)
  • Tested on NVIDIA RTX 3060 (12GB) and A100 (40GB)
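
A quick way to verify the Python, PyTorch, and GPU side of these requirements before installing anything else (a generic check, not specific to LLaSA):

import sys
import torch

print("Python:", sys.version.split()[0])             # needs 3.9+
print("PyTorch:", torch.__version__)                 # needs 2.0+
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GiB")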

Installation Steps

1. Clone the Repository

Start by cloning the local-llasa-tts repository from GitHub, which contains the necessary scripts and files.

git clone https://github.com/nivibilla/local-llasa-tts.git
cd local-llasa-tts

2. Install Dependencies

Install required Python packages using pip:

pip install -r ./requirements_base.txt
pip install -r ./requirements_native_hf.txt

3. Set Up llama.cpp (Optional)

If you wish to use llama.cpp for inference, follow these steps:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
LLAMA_CURL=1 make LLAMA_CUDA=1

Download a GGUF version of the LLaSA TTS 3B model from Hugging Face and run a quick test generation. LLaSA emits discrete speech tokens rather than audio, so treat this command as a smoke test that the model loads and runs:

./llama-cli --hf-repo srinivasbilla/llasa-3b-Q4_K_M-GGUF --hf-file llasa-3b-q4_k_m.gguf -p "The meaning of life and the universe is"

4. Run the Gradio App

Launch the Gradio web interface for interacting with the model:

python ./hf_app.py

5. Long Text Inference with vLLM

For long texts, use vLLM for efficient inference:

pip install vllm
jupyter notebook llasa_vllm_longtext_inference.ipynb

Usage and Examples

1. Using the Gradio App

Access the web interface (usually http://localhost:7860), enter text, select a voice, and generate speech.
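
The app can also be driven from a script via the gradio_client package, which is useful for batching requests. The endpoint name and argument list below are assumptions; check the app's "Use via API" link in the Gradio footer for the real signature:

from gradio_client import Client

# Connect to the locally running Gradio app.
client = Client("http://localhost:7860")

# api_name and the argument list are hypothetical; inspect the app's API docs.
result = client.predict(
    "Hello, this is a scripted request.",  # text to synthesize
    api_name="/predict",
)
print(result)  # typically a filepath to the generated audio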

2. Command-Line Inference

Generate speech from text with the Transformers library. LLaSA produces discrete speech tokens, which a neural codec then decodes into audio; the sketch below follows the approach shown on the Hugging Face model card:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "HKUSTAudio/Llasa-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda").eval()

# LLaSA wraps the text in special tokens and uses its chat template, leaving
# the assistant turn open so generation continues with speech tokens.
text = "Hello, this is a test of the LLaSA TTS model."
chat = [
    {"role": "user", "content": "Convert the text to speech:"
        f"<|TEXT_UNDERSTANDING_START|>{text}<|TEXT_UNDERSTANDING_END|>"},
    {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"},
]
input_ids = tokenizer.apply_chat_template(
    chat, tokenize=True, return_tensors="pt", continue_final_message=True
).to("cuda")

with torch.no_grad():
    # The result is a sequence of speech-token IDs, not audio samples.
    output_ids = model.generate(input_ids, max_length=2048,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>"))
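
The generated IDs are codebook indices for LLaSA's paired XCodec2 speech codec, not audio samples. A minimal decoding sketch, adapted from the model card's example and assuming the xcodec2 package is installed (pip install xcodec2); it continues from output_ids and input_ids above:

import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

# Load the codec that pairs with Llasa-3B.
codec = XCodec2Model.from_pretrained("HKUSTAudio/xcodec2").eval().cuda()

# Keep only the newly generated tokens (drop the prompt and the trailing EOS).
generated = output_ids[0][input_ids.shape[1]:-1]
token_strs = tokenizer.batch_decode(generated, skip_special_tokens=True)

# Speech tokens are serialized as "<|s_12345|>"; recover the integer IDs.
speech_ids = [int(t[4:-2]) for t in token_strs if t.startswith("<|s_") and t.endswith("|>")]

codes = torch.tensor(speech_ids).cuda().unsqueeze(0).unsqueeze(0)
with torch.no_grad():
    waveform = codec.decode_code(codes)  # (batch, channels, samples) at 16 kHz

sf.write("output.wav", waveform[0, 0, :].cpu().numpy(), 16000)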

3. Voice Cloning

LLaSA TTS supports voice cloning from a few seconds of reference audio: the reference clip is encoded into speech tokens with the codec, its transcript and the target text go into the prompt, and the reference tokens seed the assistant turn so the model continues in the same voice. A sketch following the model card's approach, reusing the model, tokenizer, and codec loaded above (the transcript string and file path are placeholders; the repository's Gradio app uses Whisper to transcribe the reference clip automatically):

import torch
import soundfile as sf

# Encode a short (5-10 s) 16 kHz mono reference clip into speech tokens.
audio, sr = sf.read("path/to/audio_sample.wav")
wav = torch.from_numpy(audio).float().unsqueeze(0)  # (1, samples)
with torch.no_grad():
    ref_codes = codec.encode_code(input_waveform=wav)

# Serialize the reference tokens in LLaSA's "<|s_N|>" format and seed the
# assistant turn with them, so generation continues in the reference voice.
ref_tokens = "".join(f"<|s_{int(i)}|>" for i in ref_codes[0, 0, :])
ref_transcript = "Transcript of the reference clip."  # placeholder, e.g. from Whisper
target_text = "Hello, I am a cloned voice."

chat = [
    {"role": "user", "content": "Convert the text to speech:"
        f"<|TEXT_UNDERSTANDING_START|>{ref_transcript} {target_text}<|TEXT_UNDERSTANDING_END|>"},
    {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ref_tokens},
]
input_ids = tokenizer.apply_chat_template(
    chat, tokenize=True, return_tensors="pt", continue_final_message=True
).to("cuda")

with torch.no_grad():
    output_ids = model.generate(input_ids, max_length=2048,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>"))

# Decode only the newly generated tokens back to a waveform, as before.
token_strs = tokenizer.batch_decode(output_ids[0][input_ids.shape[1]:-1], skip_special_tokens=True)
speech_ids = [int(t[4:-2]) for t in token_strs if t.startswith("<|s_") and t.endswith("|>")]
codes = torch.tensor(speech_ids).cuda().unsqueeze(0).unsqueeze(0)
with torch.no_grad():
    waveform = codec.decode_code(codes)
sf.write("cloned_voice_speech.wav", waveform[0, 0, :].cpu().numpy(), 16000)

Advanced Features 🧠

Long Text Processing with vLLM

Long-form synthesis uses the same vLLM setup as step 5 above (pip install vllm, then open llasa_vllm_longtext_inference.ipynb).

Key Features:

  • Automatic text chunking (a minimal chunker is sketched after this list)
  • Context-aware synthesis
  • Batch processing
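
Long inputs must be split into pieces that fit the model's context window before synthesis. The notebook is expected to handle this for you; the sketch below shows the general idea with a simple sentence-aligned chunker (illustrative only, not the notebook's exact logic):

import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text into sentence-aligned chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would overflow.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("First sentence. Second one! A third, slightly longer sentence?", max_chars=40))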

Performance Optimization

| Technique | VRAM Reduction | Speed Boost |
| --- | --- | --- |
| 4-bit Quant | 40% | 1.2x |
| FP16 Precision | 50% | 3x |
| Flash Attention | - | 5x |

Enable FP16 and Flash Attention 2 when loading the model (Flash Attention 2 requires the flash-attn package and an Ampere-or-newer NVIDIA GPU):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "HKUSTAudio/Llasa-3B",
    torch_dtype=torch.float16,               # halves VRAM versus FP32
    attn_implementation="flash_attention_2"  # requires the flash-attn package
)
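
For the 4-bit row in the table, a bitsandbytes quantization config can be passed at load time. A minimal sketch, assuming the bitsandbytes package is installed; exact savings vary by model and settings:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization with FP16 compute to cut VRAM use.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "HKUSTAudio/Llasa-3B",
    quantization_config=bnb_config,
    device_map="auto",
)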

Troubleshooting 🚑

1: CUDA Out of Memory Error
➔ Reduce batch size
➔ Use 4-bit quantization
➔ Upgrade GPU

2: Audio Artifacts
➔ Check sample rate (16kHz recommended)
➔ Clean input text
➔ Increase num_mel_bins in config
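
LLaSA's codec expects 16 kHz mono audio, so reference clips recorded at other rates should be resampled first. A minimal sketch using librosa (torchaudio works equally well):

import librosa
import soundfile as sf

# librosa resamples to the requested rate on load and downmixes to mono.
audio, sr = librosa.load("path/to/audio_sample.wav", sr=16000, mono=True)
sf.write("audio_sample_16k.wav", audio, sr)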

3: Slow Inference

# Enable GPU acceleration
model.to("cuda")

# Use Torch Compile
model = torch.compile(model)

Alternative TTS Solutions 🔄

| Model | VRAM | Languages | Voice Cloning |
| --- | --- | --- | --- |
| LLaSA 3B | 8GB | 50+ | ✅ (5 sec) |
| Coqui TTS | 4GB | 20+ | - |
| Bark | 12GB | 100+ | ✅ (10 sec) |
| Tortoise TTS | 16GB | English | ✅ (1 min) |

Points to Consider:

1: Can I run this on Google Colab?
A: Yes. A T4 GPU is sufficient, especially with 4-bit quantization.

2: Commercial use allowed?
A: No. Check the model's license on Hugging Face and LLaMA's licensing terms; the released checkpoints are intended for non-commercial research only.

3: Chinese/Japanese support?
A: Yes, via custom tokenizers.

Optimizing Performance

  • Quantization: Reduces memory and computational needs by converting weights to lower precision.
  • Hardware Acceleration: Utilize GPU acceleration for faster inference.
  • Batching: Improves throughput by processing multiple inputs simultaneously.
  • Caching: Stores frequently accessed outputs to minimize repeated computation (see the sketch below).
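
For the caching point, even a simple in-process cache keyed by input text lets repeated requests skip generation entirely. An illustrative, framework-agnostic sketch (synthesize is a stand-in for the real LLaSA pipeline):

from functools import lru_cache

def synthesize(text: str) -> bytes:
    # Stand-in for the real generate-and-decode pipeline shown earlier.
    return text.encode("utf-8")

@lru_cache(maxsize=256)
def synthesize_cached(text: str) -> bytes:
    """Return cached audio bytes for texts that were already synthesized."""
    return synthesize(text)

synthesize_cached("Welcome to our service.")  # computed once
synthesize_cached("Welcome to our service.")  # served from the cache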

Alternatives to LLaSA TTS 3B

Other TTS solutions include:

  • Coqui TTS: Open-source, supports custom training.
  • Mozilla TTS: Customizable with pre-trained models.
  • Tacotron 2: Neural network-based, high-quality synthesis.
  • FastSpeech: Transformer-based, fast inference speed.
  • Bark: Generates multilingual speech and other audio elements.

Conclusion 🎯

LLaSA TTS 3B brings state-of-the-art speech synthesis to Ubuntu users. With proper GPU setup and our optimization tips, you can deploy realistic voice AI for:

  • Audiobook generation 📚
  • IVR systems 📞
  • Podcast automation 🎙️
  • Voice cloning apps 👥
