Run Llasa TTS 3B on Windows: A Step-by-Step Guide

Llasa 3B is an advanced open-source AI model that generates lifelike, emotionally expressive speech in English and Chinese. Built on the LLaMA framework, it integrates speech tokens via the XCodec2 architecture for seamless text-to-speech (TTS) and voice cloning capabilities[1][3][7]. While running it locally on Windows can be challenging, this guide simplifies the process with clear instructions, troubleshooting tips, and alternative solutions.
System Requirements
Before starting, ensure your system meets these requirements:
- Operating System: Windows 10/11 (64-bit)
- GPU: NVIDIA GPU with 8.5GB+ VRAM (e.g., RTX 2080 Ti, RTX 3090, or newer)
- Software: Python 3.9, Conda, CUDA Toolkit 11.8+
- Storage: 15GB+ free space (for models and dependencies)
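The non-GPU items in this list can be checked with a few lines of standard-library Python. The sketch below is only illustrative: the function name `check_requirements` and its dictionary keys are made up for this guide, and GPU memory is not covered (`nvidia-smi` reports it once the NVIDIA driver is installed).

```python
import platform
import shutil
import sys

MIN_FREE_GB = 15          # storage figure from the list above
REQUIRED_PYTHON = (3, 9)  # Python version from the list above


def check_requirements(path=".", min_free_gb=MIN_FREE_GB):
    """Return a dict mapping each (hypothetical) check name to True/False."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "windows": platform.system() == "Windows",
        "64_bit": sys.maxsize > 2**32,
        "python_3_9": sys.version_info[:2] == REQUIRED_PYTHON,
        "disk_space": free_gb >= min_free_gb,
    }


# Print one line per check so failures are easy to spot.
for name, ok in check_requirements().items():
    print(f"{name}: {'OK' if ok else 'check this'}")
```

Run it from the drive where the models will be stored, since free space is measured for the current directory by default.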
Setting Up Llasa 3B on Windows
To run Llasa 3B on your Windows machine, follow these steps:
Step 1: Install XCodec2
Why XCodec2?
XCodec2 decodes the speech tokens that Llasa 3B generates into audio, so the model cannot produce sound without it. Create the Conda environment first, then install the package inside it:
Create a Conda Environment
conda create -n llasa_tts python=3.9 -y
conda activate llasa_tts
Install XCodec2
pip install xcodec2==0.1.3
(Note: Use Python 3.9 to avoid compatibility issues)[4]
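To confirm that the active environment contains the pinned package, a small check with `importlib.metadata` (standard library since Python 3.8) may help. This is a sketch; the helper names `installed_version` and `report` are hypothetical and not part of xcodec2.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(expected):
    """Compare installed versions against a {package: version-or-None} mapping."""
    lines = []
    for package, wanted in expected.items():
        found = installed_version(package)
        if found is None:
            status = "missing"
        elif wanted is not None and found != wanted:
            status = f"found {found}, expected {wanted}"
        else:
            status = f"ok ({found})"
        lines.append(f"{package}: {status}")
    return lines


# xcodec2 is pinned to 0.1.3 in this step.
print("\n".join(report({"xcodec2": "0.1.3"})))
```

If the output says `missing`, the most likely cause is that the package was installed outside the `llasa_tts` environment.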
Step 2: Set Up Llasa 3B for Text-to-Speech
Code Implementation
Install Required Libraries
pip install torch transformers soundfile --extra-index-url https://download.pytorch.org/whl/cu118
Save the Script
Create a file named text_to_speech.py and paste this code:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model
# Initialize models
llasa_3b = 'HKUST-Audio/Llasa-3B'
tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b).eval().cuda()
Codec_model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval().cuda()
# Customize your input text here
input_text = 'Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection...'
# Token processing functions (retain from original code)
# ... [include the same functions as in the original code] ...
# Generate and save audio
with torch.no_grad():
    # ... [include the same generation logic as in the original code] ...
    # gen_wav is produced by the generation logic elided above
    sf.write("output.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
Run the Script
python text_to_speech.py
A file named output.wav is written to your working directory.
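Once the script has run, the result can be sanity-checked without playing it. The sketch below reads the file header with Python's built-in `wave` module. It assumes `output.wav` was written as ordinary PCM audio, which is soundfile's default for WAV files. The helper name `describe_wav` is hypothetical.

```python
import wave
from pathlib import Path


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


# Only inspect the file if the TTS script has already produced it.
if Path("output.wav").exists():
    rate, channels, duration = describe_wav("output.wav")
    print(f"{rate} Hz, {channels} channel(s), {duration:.2f} s")
    if rate != 16000:
        print("Unexpected sample rate: the script writes 16000 Hz audio.")
```

A duration close to zero usually means generation stopped early and the script's output is worth inspecting before any further debugging.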
Troubleshooting Common Issues
Issue | Solution |
---|---|
CUDA Out of Memory | Reduce input text length or upgrade to a GPU with more VRAM. |
XCodec2 Errors | Reinstall xcodec2 in a fresh Conda environment with Python 3.9. |
Missing Dependencies | Ensure torch is a CUDA-enabled build and that transformers and soundfile are installed in the same Conda environment. |
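For the out-of-memory case, one way to reduce input length is to split long text at sentence boundaries and synthesize each piece in turn. The sketch below covers only the splitting. It assumes English-style punctuation followed by whitespace, and the `max_chars=200` default is an arbitrary illustration, not a documented Llasa limit. Running the model once per chunk and joining the audio is left to the reader.

```python
import re


def split_text(text, max_chars=200):
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    A single sentence longer than max_chars is kept whole rather than cut mid-word.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be assigned to `input_text` in the script, one run at a time.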
Alternative Methods to Run Llasa 3B
1. Google Colab (No Local GPU Required)
Use a Colab notebook to run Llasa 3B in the cloud[2]. Recommended for users with limited VRAM.
2. Replicate API
Run Llasa-3B-Long via Replicate’s API for a serverless experience[6].
Advanced Features: Voice Cloning
While this guide focuses on standard TTS, Llasa 3B also supports voice cloning, which involves:
- Providing a reference audio file.
- Fine-tuning the model with speaker embeddings[4][7].
(Note: Requires additional code and computational resources)
Key Takeaways
- Llasa 3B produces natural, emotionally nuanced speech but demands significant GPU power.
- Use Conda to manage dependencies and avoid conflicts.
- For low-resource systems, cloud options like Colab or Replicate are ideal.
By following this guide, you’ll harness Llasa 3B’s capabilities directly on your Windows machine. Experiment with input texts, adjust parameters like temperature for creativity, and explore voice cloning for personalized outputs.
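For readers unfamiliar with temperature, the sketch below shows the general idea behind it in language-model sampling: raw scores are divided by the temperature before being turned into probabilities. This is a generic illustration, not Llasa's own code, and the function name is hypothetical. With Hugging Face transformers, the value is typically passed as `temperature` to `model.generate` together with `do_sample=True`.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(scores, temperature=0.5)  # sharper distribution
hot = softmax_with_temperature(scores, temperature=2.0)   # flatter distribution
print("top-choice probability:", round(cold[0], 3), "vs", round(hot[0], 3))
```

Lower values concentrate probability on the most likely token, which tends to give more predictable speech. Higher values spread it out and give more varied output.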
References
- Run DeepSeek Janus-Pro 7B on Mac: A Comprehensive Guide Using ComfyUI
- Run DeepSeek Janus-Pro 7B on Mac: Step-by-Step Guide
- Run DeepSeek Janus-Pro 7B on Windows: A Complete Installation Guide
- Run DeepSeek-VL2 on macOS: Step-by-Step Installation Guide
- Install and Run DeepSeek-VL2 on Ubuntu: A Step-by-Step Guide