Llasa 3B

Run Llasa TTS 3B on Windows: A Step-by-Step Guide

Anas Mohammad

Feb 7, 2025 • 2 min read

Llasa 3B is an advanced open-source AI model that generates lifelike, emotionally expressive speech in English and Chinese. Built on the LLaMA framework, it integrates speech tokens via the XCodec2 architecture for seamless text-to-speech (TTS) and voice cloning capabilities[1][3][7]. While running it locally on Windows can be challenging, this guide simplifies the process with clear instructions, troubleshooting tips, and alternative solutions.

System Requirements

Before starting, ensure your system meets these requirements:

Operating System: Windows 10/11 (64-bit)
GPU: NVIDIA GPU with 8.5GB+ VRAM (e.g., RTX 2080 Ti, RTX 3090, or newer)
Software: Python 3.9, Conda, CUDA Toolkit 11.8+
Storage: 15GB+ free space (for models and dependencies)

Setting Up Llasa 3B on Windows

To run Llasa 3B on your Windows machine, follow these steps:

Step 1: Install XCodec2

Why XCodec2?

XCodec2 is critical for decoding speech tokens into audio. Follow these steps:

Install XCodec2

pip install xcodec2==0.1.3

(Note: Use Python 3.9 to avoid compatibility issues)[4]

Create a Conda Environment

conda create -n llasa_tts python=3.9 -y
conda activate llasa_tts

Step 2: Set Up Llasa 3B for Text-to-Speech

Code Implementation

A file output.wav will generate in your working directory.

Run the Script

python text_to_speech.py

Save the Script
Create a file text_to_speech.py and paste this code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

# Initialize models
llasa_3b = 'HKUST-Audio/Llasa-3B'
tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b).eval().cuda()
Codec_model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval().cuda()

# Customize your input text here
input_text = 'Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection...'

# Token processing functions (retain from original code)
# ... [include the same functions as in the original code] ...

# Generate and save audio
with torch.no_grad():
    # ... [include the same generation logic as in the original code] ...
    sf.write("output.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)

Install Required Libraries

pip install torch transformers soundfile --extra-index-url https://download.pytorch.org/whl/cu118

Troubleshooting Common Issues

Issue	Solution
CUDA Out of Memory	Reduce input text length or upgrade to a GPU with more VRAM.
XCodec2 Errors	Reinstall xcodec2 in a fresh Conda environment with Python 3.9.
Missing Dependencies	Ensure `torch`, `transformers`, and `soundfile` are CUDA-compatible.

Alternative Methods to Run Llasa 3B

1. Google Colab (No Local GPU Required)

Use a Colab notebook to run Llasa 3B in the cloud[2]. Recommended for users with limited VRAM.

2. Replicate API

Run Llasa-3B-Long via Replicate’s API for a serverless experience[6].

Advanced Features: Voice Cloning

While this guide focuses on standard TTS, Llasa 3B supports voice cloning by:

Providing a reference audio file.
Fine-tuning the model with speaker embeddings[4][7].
(Note: Requires additional code and computational resources)

Key Takeaways

Llasa 3B produces natural, emotionally nuanced speech but demands significant GPU power.
Use Conda to manage dependencies and avoid conflicts.
For low-resource systems, cloud options like Colab or Replicate are ideal.

By following this guide, you’ll harness Llasa 3B’s capabilities directly on your Windows machine. Experiment with input texts, adjust parameters like temperature for creativity, and explore voice cloning for personalized outputs.