Install and Run LLaSA TTS 3B on Windows: Step by Step Guide

LLaSA-3B revolutionizes text-to-speech technology with emotional nuance recognition and bilingual capabilities (English/Chinese). Built on Meta's LLaMA framework, this open-source model leverages XCodec2 architecture for studio-quality audio output at 24kHz sampling rate. Perfect for developers creating voice assistants, audiobook tools, or multilingual content platforms.
System Requirements Checklist
Before installation, verify your Windows setup meets these specs:
- Operating System: Windows 10 or later
- Python: Version 3.9 is recommended to avoid compatibility issues.
- RAM: At least 16GB is required, but 32GB is preferred for optimal performance.
- Storage: A minimum of 50GB of free space, preferably an NVMe SSD with 100GB, for models, libraries, and dependencies.
- GPU (NVIDIA): A dedicated NVIDIA GPU with CUDA support is recommended. Minimum 6GB VRAM for 4-bit quantization or 12GB+ VRAM for FP16 processing. CPU-only mode is possible but extremely slow.
Component | Minimum | Recommended |
---|---|---|
RAM | 16GB | 32GB DDR4 |
Storage | 50GB HDD | 100GB NVMe SSD |
GPU | NVIDIA GTX 1660 (6GB) | RTX 3090 (24GB) |
Python | 3.8 | 3.9 |
Critical Notes:
- 🔴 CPU-only mode possible but impractical (10x slower)
- 🟢 CUDA 11.8+ required for GPU acceleration
- 💡 Validate CUDA compatibility by running nvidia-smi in Command Prompt (or use the Python check below)
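If you prefer to verify from Python once PyTorch is installed (step 2 below), a minimal sketch:

```python
import torch

# Confirm that PyTorch can see the GPU and report its total VRAM
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {name}, VRAM: {vram_gb:.1f} GB")  # 6GB+ for 4-bit, 12GB+ for FP16
else:
    print("No CUDA device found - inference will fall back to CPU (very slow)")
```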
Step-by-Step Installation Walkthrough
1. Install XCodec2
XCodec2 is required for decoding speech tokens into audio.
pip install xcodec2==0.1.3
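To confirm the install worked, a quick check that loads the codec checkpoint used later in this guide (the first run downloads the weights from Hugging Face):

```python
from xcodec2.modeling_xcodec2 import XCodec2Model

# Load the codec LLaSA uses to turn speech tokens back into audio
codec = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval()
print("XCodec2 loaded:", sum(p.numel() for p in codec.parameters()) // 1_000_000, "M parameters")
```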
2. Environment Setup
# Create dedicated Conda environment
conda create -n llasa_tts python=3.9 -y
conda activate llasa_tts
# Install core dependencies
pip install torch==2.0.1+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install xcodec2==0.1.3 transformers==4.31.0
Pro Tip: Use Windows Subsystem for Linux (WSL2) for smoother CLI operations.
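Once the environment is activated, a quick sanity check that the core packages import and the GPU is visible:

```python
# verify_env.py - run inside the activated llasa_tts environment
import torch
import transformers
import xcodec2

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```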
3. Model Deployment
git clone https://github.com/nivibilla/local-llasa-tts.git
cd local-llasa-tts
# Download 4-bit quantized model (3.8GB)
wget https://huggingface.co/srinivasbilla/llasa-3b-Q4_K_M-GGUF/resolve/main/llasa-3b-q4_k_m.gguf
OR, for inference using llama.cpp:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
LLAMA_CURL=1 make LLAMA_CUDA=1
Download a GGUF version of the LLaSA TTS 3B model and run inference:
./llama-cli --hf-repo srinivasbilla/llasa-3b-Q4_K_M-GGUF --hf-file llasa-3b-q4_k_m.gguf -p "The meaning of life and the universe is"
4. Hardware Acceleration Setup
- Install NVIDIA CUDA Toolkit 12.1
Verify installation:
nvcc --version # Should show CUDA 12.1+
nvidia-smi # Check GPU memory allocation
5. Running the Gradio App
Launch the Gradio web interface to interact with the model:
python ./hf_app.py
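hf_app.py ships with the repository; if you want a feel for what it does, a stripped-down Gradio wrapper looks roughly like this sketch (the synthesize function here is a placeholder for the generation pipeline covered in the script walkthrough below):

```python
import gradio as gr

def synthesize(text: str) -> str:
    # Placeholder: run the LLaSA + XCodec2 pipeline from the script below
    # and return the path of the generated WAV file.
    ...
    return "gen.wav"

demo = gr.Interface(
    fn=synthesize,
    inputs=gr.Textbox(label="Text to speak"),
    outputs=gr.Audio(label="Generated speech"),
    title="LLaSA TTS 3B",
)
demo.launch()  # serves on http://127.0.0.1:7860 by default
```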
6. Long Text Inference with VLLM
For efficient inference of longer texts:
pip install vllm
jupyter notebook llasa_vllm_longtext_inference.ipynb
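The notebook handles chunking long passages; the core idea, generating speech tokens with vLLM's offline API and then decoding them with XCodec2, looks roughly like this sketch (it assumes the same prompt format as the script in the next section):

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "HKUST-Audio/Llasa-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)

text = "A longer passage of text to narrate..."
formatted = f"<|TEXT_UNDERSTANDING_START|>{text}<|TEXT_UNDERSTANDING_END|>"
chat = [
    {"role": "user", "content": "Convert the text to speech:" + formatted},
    {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, continue_final_message=True)

params = SamplingParams(temperature=0.8, top_p=1.0, max_tokens=2048,
                        stop=["<|SPEECH_GENERATION_END|>"])
outputs = llm.generate([prompt], params)

# The generated text is a string of '<|s_123|>' speech tokens; split and
# decode it with XCodec2 exactly as in the script below.
speech_token_str = outputs[0].outputs[0].text
```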
Complete Text-to-Speech Script Walkthrough
Code Implementation for Text-to-Speech
Here’s a basic implementation to convert text into speech using LLaSA 3B:
from transformers import AutoTokenizer, AutoModelForCausalLM
from xcodec2.modeling_xcodec2 import XCodec2Model
import torch
import soundfile as sf

# Load the LLaSA-3B language model and the XCodec2 speech codec
llasa_3b = 'HKUST-Audio/Llasa-3B'
tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b).eval().to('cuda')
codec_model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval().cuda()

input_text = 'This is a test sentence for speech synthesis.'

def ids_to_speech_tokens(speech_ids):
    # Convert integer codec ids into '<|s_123|>'-style token strings
    return [f"<|s_{speech_id}|>" for speech_id in speech_ids]

def extract_speech_ids(speech_tokens_str):
    # Convert '<|s_123|>'-style token strings back into integer codec ids
    return [int(token[4:-2]) for token in speech_tokens_str
            if token.startswith('<|s_') and token.endswith('|>')]

with torch.no_grad():
    # Wrap the input text in the model's text-understanding markers
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]
    tokenizer.padding_side = "left"
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, return_tensors='pt', continue_final_message=True
    ).to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

    # Generate speech tokens until the end-of-speech marker
    outputs = model.generate(input_ids, max_length=2048, eos_token_id=speech_end_id,
                             do_sample=True, top_p=1, temperature=0.8)

    # Keep only the newly generated tokens (drop the prompt and the end marker)
    generated_ids = outputs[0][input_ids.shape[-1]:-1]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    speech_tokens = extract_speech_ids(speech_tokens)
    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

    # Decode the speech tokens into a waveform with XCodec2
    gen_wav = codec_model.decode_code(speech_tokens)

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
print("Audio saved to gen.wav")
OR, as a more compact variant:
# text_to_speech.py
import torch
import torchaudio
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "HKUST-Audio/Llasa-3B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             device_map="auto",
                                             torch_dtype=torch.float16)

def synthesize(text, output_file="output.wav"):
    inputs = tokenizer(f"<|TEXT|>{text}<|SPEECH|>", return_tensors="pt").to("cuda")
    with torch.inference_mode():
        outputs = model.generate(**inputs,
                                 max_new_tokens=500,
                                 temperature=0.7,
                                 top_p=0.95)
    # decode_speech_tokens stands in for the XCodec2 decoding step shown above
    audio = decode_speech_tokens(outputs[0])
    torchaudio.save(output_file, audio, 24000)
    print(f"Generated {output_file}")
Key Parameters to Tweak:
- temperature (0.1-1.0): lower values produce more deterministic output
- top_p (0.5-0.95): controls vocabulary diversity
- max_new_tokens: adjust based on text length (longer passages need more tokens)
Running the Script
- Save the code as a .py file (e.g., text_to_speech.py).
- Open a terminal and activate the Conda environment: conda activate llasa_tts.
- Navigate to the script directory.
- Run the script: python text_to_speech.py.
- The generated speech will be saved as gen.wav (or output.wav for the compact variant).
Advanced Features Unleashed
Real-Time Voice Cloning
# Clone voices with 5-second reference audio
from llasa.voice_cloning import VoiceCloneEngine
cloner = VoiceCloneEngine()
cloner.load_reference("reference.wav")
cloner.generate("Target text here", output_file="clone_output.wav")
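If the VoiceCloneEngine helper above is not available in your install, the approach used in the model's reference code is to encode a short reference clip with XCodec2 and let LLaSA continue from those speech tokens. A sketch, reusing tokenizer, model, codec_model, and ids_to_speech_tokens from the script above (ref_transcript is assumed to be the text spoken in reference.wav):

```python
import torch
import soundfile as sf

# Encode the reference clip into speech tokens with XCodec2
prompt_wav, sr = sf.read("reference.wav")                  # short mono reference clip
prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)
with torch.no_grad():
    vq_codes = codec_model.encode_code(input_waveform=prompt_wav)
prefix_tokens = ids_to_speech_tokens(vq_codes[0, 0, :].tolist())

# Prompt = reference transcript + target text; the assistant turn starts with the
# reference speech tokens so generation continues in that voice.
ref_transcript = "Transcript of the reference clip."
target_text = "Target text here."
formatted = ("<|TEXT_UNDERSTANDING_START|>" + ref_transcript + " " + target_text +
             "<|TEXT_UNDERSTANDING_END|>")
chat = [
    {"role": "user", "content": "Convert the text to speech:" + formatted},
    {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + "".join(prefix_tokens)},
]
```

Generation and decoding then proceed exactly as in the main script, and the output waveform follows the reference speaker's voice.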
Batch Processing
# Process multiple texts via CLI
python -m llasa.batch \
--input-file texts.txt \
--output-dir ./audio_output \
--batch-size 8 \
--precision fp16
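If the llasa.batch module is not present in your install, the same effect takes only a few lines of Python around the synthesize function defined in the script above (a sketch; texts.txt is assumed to hold one sentence per line):

```python
from pathlib import Path

# Synthesize one output file per non-empty line of texts.txt
output_dir = Path("./audio_output")
output_dir.mkdir(exist_ok=True)

lines = Path("texts.txt").read_text(encoding="utf-8").splitlines()
for i, line in enumerate(lines):
    if line.strip():
        synthesize(line.strip(), output_file=str(output_dir / f"utt_{i:04d}.wav"))
```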
Performance Optimization Table
Technique | VRAM Usage | Speed (RTX 4090) | Quality |
---|---|---|---|
FP32 | 24GB | 1.0x | Lossless |
FP16 | 12GB | 1.8x | Near-lossless |
4-bit Quant | 6GB | 2.5x | Good |
8-bit Quant | 8GB | 2.1x | Excellent |
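To reach the 6GB row in practice, the model can be loaded in 4-bit with bitsandbytes through transformers (a sketch; assumes pip install bitsandbytes and leaves the rest of the pipeline unchanged):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16 for speed
)
model = AutoModelForCausalLM.from_pretrained(
    "HKUST-Audio/Llasa-3B",
    quantization_config=quant_config,
    device_map="auto",
)
```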
Troubleshooting Matrix
Symptom | Solution |
---|---|
CUDA Out of Memory | Reduce batch size, enable 4-bit quantization |
Audio Artifacts | Increase top_p to 0.9+, check sample rate consistency |
Slow Inference | Enable Flash Attention 2, use llama.cpp optimizations |
Chinese Text Failures | Ensure proper tokenization with tokenizer.apply_chat_template() |
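For the Slow Inference row, Flash Attention 2 can be enabled at load time if the flash-attn package is installed (easiest under WSL2; Ampere-class GPU or newer required). A sketch:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the model with the Flash Attention 2 kernel enabled
model = AutoModelForCausalLM.from_pretrained(
    "HKUST-Audio/Llasa-3B",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires pip install flash-attn
).to("cuda")
```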
Cloud Alternatives for Low-Spec Hardware
Docker Deployment
docker pull kjjk10/llasa-3b-long
docker run -p 7860:7860 --gpus all kjjk10/llasa-3b-long
# Access via http://localhost:7860
Google Colab Pro
!pip install -q llasa-tts
from llasa import RemoteEngine
engine = RemoteEngine(api_key="your_key")
engine.synthesize("Your text", voice="chinese-female")
Alternative Solutions
- Google Colab: Utilize free cloud GPUs if local hardware is insufficient.
- Cloud-Based Services: Services like Replicate allow model execution without local hardware.
- Docker: Use the kjjk10/llasa-3b-long image for streamlined deployment.
Troubleshooting
- Compatibility Issues: Use Python 3.9 to avoid dependency conflicts.
- CUDA Errors: Ensure your NVIDIA drivers are installed and compatible with your CUDA version.
- Out of Memory: Reduce batch size or use a GPU with higher VRAM.
- Slow Inference: A CUDA-enabled GPU is highly recommended.
Conclusion
By following this guide, you can set up and run LLaSA TTS 3B on Windows for text-to-speech conversion and voice cloning. Ensure system compatibility, follow the installation steps, and use troubleshooting tips for a smooth experience. LLaSA 3B opens up new possibilities for high-quality AI-driven speech synthesis.