Install LLaSA TTS 3B on Ubuntu: Voice Cloning & Text-to-Speech
LLaSA (LLaMA-based Speech Synthesis) is a text-to-speech (TTS) system that extends the text-based LLaMA language model by incorporating speech tokens. LLaSA models come in different sizes, such as 1B, 3B, and 8B.
This article focuses on running the LLaSA TTS 3B model on Ubuntu, providing a comprehensive guide covering installation, setup, and usage.
What is LLaSA TTS 3B? 🚀
LLaSA (LLaMA-based Speech Synthesis) is a cutting-edge text-to-speech system built on Meta's LLaMA architecture. The 3B parameter version offers:
- Human-like voice synthesis 🗣️
- Voice cloning from short audio samples (5-10 sec) 🎤
- Support for long-form text processing 📜
- GPU-accelerated inference (NVIDIA recommended)
System Requirements
Before installation, ensure your system meets the necessary requirements, as running LLaSA TTS 3B can be resource-intensive, particularly when loading additional models like Whisper for transcription.
- VRAM: A dedicated GPU with sufficient video memory is required. Running LLaSA TTS 3B with Whisper (large turbo) in 4-bit quantization requires approximately 8.5GB of VRAM. Without Whisper, the requirement drops to about 6.5GB.
- RAM: A minimum of 16GB RAM is recommended for optimal performance.
- Storage: At least 50GB of free storage is advisable for downloading and installing models, libraries, and dependencies.
- Operating System: This guide is tailored for Ubuntu, but steps might be adaptable to other Linux distributions.
- CUDA-enabled GPU: An NVIDIA GPU with CUDA support is highly recommended for faster inference. Ensure appropriate NVIDIA drivers are installed.
| Component | Minimum Spec | Recommended Spec |
|---|---|---|
| GPU (NVIDIA) | 6GB VRAM (4-bit) | 12GB+ VRAM (FP16) |
| RAM | 16GB | 32GB |
| Storage | 50GB HDD | 100GB NVMe SSD |
| OS | Ubuntu 20.04 LTS | Ubuntu 22.04 LTS |
| CUDA Version | 11.7 | 12.1 |
Key Notes:
- CPU-only mode possible but extremely slow (not recommended)
- Requires Python 3.9+ and PyTorch 2.0+
- Tested on NVIDIA RTX 3060 (12GB) and A100 (40GB)
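Before installing anything, it is worth confirming that PyTorch can actually see your GPU. A quick check, assuming PyTorch is already installed:

import torch

# Report the detected CUDA device and its VRAM, or warn about CPU fallback.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
    print(f"CUDA runtime: {torch.version.cuda}")
else:
    print("No CUDA device found; inference would run on the (very slow) CPU path.")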
Installation Steps
1. Clone the Repository
Start by cloning the local-llasa-tts repository from GitHub, which contains the necessary scripts and files.
git clone https://github.com/nivibilla/local-llasa-tts.git
cd local-llasa-tts
2. Install Dependencies
Install the required Python packages using pip:
pip install -r ./requirements_base.txt
pip install -r ./requirements_native_hf.txt
3. Setting Up llama.cpp (Optional)
If you wish to use llama.cpp for inference, follow these steps:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CURL=1 LLAMA_CUDA=1
Download a GGUF version of the LLaSA TTS 3B model, available on Hugging Face, and run inference:
./llama-cli --hf-repo srinivasbilla/llasa-3b-Q4_K_M-GGUF --hf-file llasa-3b-q4_k_m.gguf -p "The meaning of life and the universe is"
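If you would rather call the GGUF model from Python than from llama-cli, the llama-cpp-python bindings wrap the same backend. A minimal sketch, assuming llama-cpp-python was installed with CUDA support:

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Fetch the quantized model file from Hugging Face, then load it.
model_path = hf_hub_download(
    repo_id="srinivasbilla/llasa-3b-Q4_K_M-GGUF",
    filename="llasa-3b-q4_k_m.gguf",
)
llm = Llama(model_path=model_path, n_gpu_layers=-1)  # -1 offloads all layers to GPU

out = llm("The meaning of life and the universe is", max_tokens=64)
print(out["choices"][0]["text"])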
4. Running the Gradio App
Launch the Gradio web interface for interacting with the model:
python ./hf_app.py
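Conceptually, such an app is just a text-to-audio function wired into a Gradio interface. The skeleton below illustrates the pattern only; it is not the repository's actual hf_app.py, and run_llasa_tts is a hypothetical helper standing in for the full pipeline:

import gradio as gr

def synthesize(text: str):
    # Hypothetical helper: run the LLaSA pipeline and return audio.
    sample_rate, waveform = run_llasa_tts(text)
    # Gradio's audio output accepts a (sample_rate, numpy_array) tuple.
    return sample_rate, waveform

demo = gr.Interface(fn=synthesize, inputs="text", outputs="audio")
demo.launch()  # serves on http://localhost:7860 by default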
5. Long Text Inference with vLLM
For long texts, use vLLM for efficient inference:
pip install vllm
jupyter notebook llasa_vllm_longtext_inference.ipynb
Usage and Examples
1. Using the Gradio App
Access the web interface (usually http://localhost:7860), enter text, select a voice, and generate speech.
2. Command-Line Inference
Generate speech from text using the Transformers library:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "HKUSTAudio/Llasa-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

text = "Hello, this is a test of the LLaSA TTS model."
inputs = tokenizer(text, return_tensors="pt").to("cuda")

with torch.no_grad():
    # generate() returns speech-token ids, not a waveform; they still need
    # to be decoded by a neural codec (see the sketch below).
    speech = model.generate(**inputs, max_new_tokens=512)
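Note that generate() returns a sequence of speech-token ids rather than audio. According to the model card, tokens of the form <|s_12345|> are decoded to a 16 kHz waveform with the XCodec2 codec. The sketch below follows that recipe but should be verified against the current model card; it assumes pip install xcodec2 and the HKUST-Audio/xcodec2 checkpoint, and it reuses speech, inputs, and tokenizer from the snippet above.

import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

# Neural codec that maps LLaSA speech tokens back to a 16 kHz waveform.
codec = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval().to("cuda")

# Keep only the newly generated tokens and pull the integer code out of
# each <|s_12345|>-style token string.
gen_ids = speech[0, inputs["input_ids"].shape[1]:]
token_strs = tokenizer.convert_ids_to_tokens(gen_ids.tolist())
codes = [int(t[4:-2]) for t in token_strs if t.startswith("<|s_") and t.endswith("|>")]

with torch.no_grad():
    # decode_code expects a (batch, num_quantizers, time) tensor of codes.
    wav = codec.decode_code(torch.tensor(codes, device="cuda").unsqueeze(0).unsqueeze(0))

sf.write("output.wav", wav[0, 0].cpu().numpy(), 16000)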
3. Voice Cloning
LLaSA TTS supports voice cloning from a few seconds of reference audio. The snippet below is a simplified illustration, not the repository's exact pipeline: in the Gradio app, cloning works by transcribing the reference clip (e.g., with Whisper) and conditioning generation on its speech tokens. voice_encoder_model here is a placeholder for whatever speech encoder your pipeline uses:

import torch
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "HKUSTAudio/Llasa-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load a short (5-10 s) reference clip of the target voice.
audio, sr = sf.read("path/to/audio_sample.wav")

# Placeholder: encode the reference audio into a conditioning input.
# This is not a transformers API; substitute your pipeline's speech encoder.
voice_embedding = voice_encoder_model.encode(audio, sr)

text = "Hello, I am a cloned voice."
inputs = tokenizer(text, return_tensors="pt")
inputs["voice_embedding"] = torch.tensor(voice_embedding).unsqueeze(0)

with torch.no_grad():
    speech = model.generate(**inputs)

# As above, `speech` holds token ids; decode them to a waveform before saving.
sf.write("cloned_voice_speech.wav", speech.cpu().numpy(), sr)
Advanced Features 🧠
1. Long Text Processing with vLLM
Long texts are handled by the vLLM notebook from step 5 (llasa_vllm_longtext_inference.ipynb), which splits the input and synthesizes it in batches.
Key Features:
- Automatic text chunking
- Context-aware synthesis
- Batch processing
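The chunking idea itself is straightforward. Below is a minimal sketch of the split-then-synthesize pattern, where synthesize_chunk is a hypothetical stand-in for a vLLM or Transformers generation call:

import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    # Greedily pack whole sentences into chunks of at most max_chars.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# waveforms = [synthesize_chunk(c) for c in chunk_text(long_text)]  # hypothetical
# Concatenate the per-chunk waveforms in order to get the final audio.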
2. Performance Optimization
| Technique | VRAM Reduction | Speed Boost |
|---|---|---|
| 4-bit Quant | 40% | 1.2x |
| FP16 Precision | 50% | 3x |
| Flash Attention | - | 5x |
Enable optimizations in code:
model = AutoModelForCausalLM.from_pretrained(
    "HKUSTAudio/Llasa-3B",
    torch_dtype=torch.float16,                # FP16 weights halve memory use
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
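The 4-bit row in the table maps to load-time quantization with bitsandbytes (pip install bitsandbytes); a minimal sketch:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Store weights in 4-bit NF4 but run matmuls in FP16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "HKUSTAudio/Llasa-3B",
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU automatically
)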
Troubleshooting 🚑
1: CUDA Out of Memory Error
➔ Reduce batch size ➔ Use 4-bit quantization ➔ Upgrade GPU
2: Audio Artifacts
➔ Check sample rate (16kHz recommended) ➔ Clean input text ➔ Increase num_mel_bins in config
3: Slow Inference
# Enable GPU acceleration
model.to("cuda")
# Use Torch Compile
model = torch.compile(model)
Alternative TTS Solutions 🔄
| Model | VRAM | Languages | Voice Cloning |
|---|---|---|---|
| LLaSA 3B | 8GB | English, Chinese | ✅ (5 sec) |
| Coqui TTS | 4GB | 20+ | ❌ |
| Bark | 12GB | 100+ | ✅ (10 sec) |
| Tortoise TTS | 16GB | English | ✅ (1 min) |
FAQ ❓
1: Can I run this on Google Colab?
A: Yes! Use a T4 GPU runtime with this Colab template.
2: Commercial use allowed?
A: Check the LLaMA license and the model card on Hugging Face; the LLaSA checkpoints are released for non-commercial research only.
3: Chinese/Japanese support?
A: Yes, via custom tokenizers.
Optimizing Performance
- Quantization: Reduces memory and computational needs by converting weights to lower precision.
- Hardware Acceleration: Utilize GPU acceleration for faster inference.
- Batching: Improves throughput by processing multiple inputs simultaneously (see the sketch after this list).
- Caching: Stores frequently accessed data to minimize repeated computations.
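Batching in particular is easy to apply with the Transformers setup from earlier; a minimal sketch, reusing the model and tokenizer loaded above:

import torch

texts = [
    "First sentence to synthesize.",
    "Second sentence to synthesize.",
]

# LLaMA-style tokenizers often ship without a pad token; reuse EOS, and pad
# on the left so generation continues from the real end of each prompt.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

batch = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    outputs = model.generate(**batch, max_new_tokens=256)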
Alternatives to LLaSA TTS 3B
Other TTS solutions include:
- Coqui TTS: Open-source, supports custom training.
- Mozilla TTS: Customizable with pre-trained models.
- Tacotron 2: Neural network-based, high-quality synthesis.
- FastSpeech: Transformer-based, fast inference speed.
- Bark: Generates multilingual speech and other audio elements.
Conclusion 🎯
LLaSA TTS 3B brings state-of-the-art speech synthesis to Ubuntu users. With proper GPU setup and our optimization tips, you can deploy realistic voice AI for:
- Audiobook generation 📚
- IVR systems 📞
- Podcast automation 🎙️
- Voice cloning apps 👥