Install YuE-7B on Ubuntu: Step-by-Step Guide

YuE-7B is an open-source text-to-audio model designed to generate high-quality, realistic audio clips from simple text prompts.
Developed by Declare Lab and powered by Stability AI, it uses Flow Matching and CLAP-Ranked Preference Optimization (CRPO) to produce audio that aligns closely with user expectations.
This guide will walk you through setting up YuE-7B on Ubuntu, covering installation, usage, troubleshooting, and real-world applications.
General Requirements:
- NVIDIA Drivers: Ensure that your NVIDIA drivers are correctly installed and up to date for GPU support[2].
- Homebrew: Consider using Homebrew, a package manager, to simplify the installation of dependencies[1].
- GPU: A high-performance GPU with at least 80GB of GPU memory (e.g., NVIDIA A100) is recommended for optimal performance[2].
- Docker: You can use Docker to manage the YuE Interface[2]. Use Docker Compose for simpler setup and management, defining and running multi-container Docker applications with a single configuration file[2].
- RunPod: Alternatively, RunPod allows you to quickly deploy an instance based on the YuE Interface image[2].
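Before going further, it can help to confirm that the command-line tools implied by the requirements above are present. The following is a minimal sketch, assuming the NVIDIA driver exposes the usual `nvidia-smi` command and Docker is installed as `docker`; the tool list and the helper name `find_tools` are illustrative, not something the YuE project prescribes.

```python
import shutil

# Command names are assumptions about a typical setup, not requirements
# stated by the YuE project.
REQUIRED_TOOLS = ("nvidia-smi", "docker")


def find_tools(names=REQUIRED_TOOLS):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found'}")
```

A `None` entry only means the command is not on PATH; it does not by itself prove the driver or Docker is missing.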
To get started, follow these general steps using Docker Compose[2]:
- Install Docker and Docker Compose.
- Download the docker-compose.yml file from the YuE-Interface GitHub repository[2].
- Modify the docker-compose.yml file to map the host's model and output directories[2].
- Run docker-compose up -d in the same directory as the docker-compose.yml file[2].
After the container is running, access the Gradio web UI at http://localhost:7860[2]. If deployed on RunPod, use the provided RunPod URL to access the interface[2].
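The container may take a while to start serving the web UI, so a small polling helper can be handy. This is a rough sketch using only the standard library; the function name `wait_for_ui`, the timeout, and the polling interval are illustrative choices, and the default URL is the local address mentioned above.

```python
import time
import urllib.error
import urllib.request

# Bypass any configured HTTP proxy, since the target is a local address.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_ui(url="http://localhost:7860", timeout_s=60.0, interval_s=2.0):
    """Poll `url` until it answers an HTTP request or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with _opener.open(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server responded, even if with an error status.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: the UI is not up yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_ui()` after `docker-compose up -d` returns `True` once the port answers, or `False` if the timeout passes first.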
Overview of YuE-7B
Key Features
- High-Quality Audio Generation – Produces audio clips up to 30 seconds long at a 44.1 kHz sample rate.
- Fast Inference – Generates audio in around 3.7 seconds on an A40 GPU, making it suitable for real-time applications.
- Open Source – Fully customizable, allowing modifications based on user requirements.
- User-Friendly Interface – Simple text prompts enable easy audio generation.
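To put the 30-second, 44.1 kHz figure in concrete terms, the arithmetic below works out how many samples such a clip contains and roughly how large the raw audio is. The channel count and 16-bit sample width are assumptions made for illustration; the guide does not state them.

```python
def num_samples(duration_s, sample_rate_hz=44_100):
    """Samples per channel for a clip of the given length."""
    return round(duration_s * sample_rate_hz)


def pcm_size_bytes(duration_s, sample_rate_hz=44_100, channels=2, bytes_per_sample=2):
    """Uncompressed PCM size; channels and sample width are assumed values."""
    return num_samples(duration_s, sample_rate_hz) * channels * bytes_per_sample


print(num_samples(30))      # 1323000 samples per channel
print(pcm_size_bytes(30))   # 5292000 bytes, about 5 MiB for 16-bit stereo
```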
Technical Architecture
YuE-7B employs a combination of Diffusion Transformer (DiT) and Multimodal Diffusion Transformer (MMDiT) architectures. It follows a three-stage training process:
- Pre-training – Initial training on large datasets for basic audio generation.
- Fine-tuning – Optimized for specific datasets to improve performance.
- Preference Optimization – Utilizes CRPO to enhance output quality based on user preferences.
Setting Up YuE-7B on Ubuntu
Prerequisites
Ensure your system meets the following requirements before installation:
- OS: Ubuntu 20.04 or later
- Python: Version 3.8 or higher
- RAM: At least 6 GB for smooth operation
- GPU: NVIDIA GPU (e.g., A40, RTX series) for optimal performance
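The OS and Python requirements above can be checked programmatically. The sketch below assumes the distribution details are read from `/etc/os-release`, the standard location on Ubuntu; the helper names are hypothetical.

```python
import sys

MIN_PYTHON = (3, 8)


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """True if the interpreter is at least the minimum (major, minor)."""
    return tuple(version_info[:2]) >= minimum


def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    return fields


def ubuntu_ok(fields, minimum=(20, 4)):
    """True if the parsed fields describe Ubuntu 20.04 or later."""
    if fields.get("ID") != "ubuntu":
        return False
    major, _, minor = fields.get("VERSION_ID", "0.0").partition(".")
    return (int(major), int(minor or 0)) >= minimum
```

On a real system, this could be used as `ubuntu_ok(parse_os_release(open("/etc/os-release").read()))`.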
Step 1: Install Python and Pip
If Python isn’t installed, run:
sudo apt update
sudo apt install python3 python3-pip
Step 2: Install Required Libraries
Install dependencies via pip:
pip install torch torchaudio transformers
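After the pip command finishes, one way to confirm the three libraries are importable without actually loading them is to look up their module specs. This is a small sketch; the helper name `missing_modules` is made up for illustration.

```python
import importlib.util


def missing_modules(names=("torch", "torchaudio", "transformers")):
    """Return the top-level module names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("All dependencies found" if not missing else f"Missing: {', '.join(missing)}")
```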
Step 3: Clone the YuE-7B Repository
Retrieve the source code from GitHub:
git clone https://github.com/declare-lab/YuE-7B.git
cd YuE-7B
Step 4: Install YuE-7B
Use pip to install YuE-7B in editable mode:
pip install -e .
Step 5: Verify Installation
Ensure the installation was successful:
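# "YuE-7B" is not a valid Python identifier, so query the installed distribution instead
# (assumes the distribution name matches the repository name)
from importlib.metadata import version
print(version("YuE-7B"))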
If the version number appears without errors, the setup is complete.
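If the check should fail gracefully rather than raise, the lookup can be wrapped as below. This is a sketch that assumes the installed distribution is registered under the name `YuE-7B`; the actual name depends on the repository's packaging metadata and may differ.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "YuE-7B" is an assumed distribution name, not confirmed by the project.
    found = installed_version("YuE-7B")
    print(found if found else "YuE-7B does not appear to be installed")
```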
Generating Audio with YuE-7B
Step 1: Import Necessary Libraries
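import torchaudio
# Placeholder names: hyphens are not valid in Python identifiers, so take the
# real module and class names from the repository's README.
from yue_7b import YuE7BInference
from IPython.display import Audio
Step 2: Initialize the Model
model = YuE7BInference(name='declare-lab/YuE-7B')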
Step 3: Generate Audio from a Text Prompt
audio = model.generate('Hammer slowly hitting the wooden table', steps=50, duration=10)
Step 4: Play or Save the Generated Audio
Play audio directly in a notebook:
Audio(data=audio, rate=44100)
Save it as a WAV file:
torchaudio.save('output.wav', audio.unsqueeze(0), sample_rate=44100)
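To sanity-check the saved file, its sample rate and duration can be read back with the standard library. One caveat: Python's `wave` module only reads integer PCM, so this sketch assumes the file was written that way (for example by passing `encoding="PCM_S", bits_per_sample=16` to `torchaudio.save`); a float-encoded WAV would raise an error here. The helper name `wav_info` is illustrative.

```python
import wave


def wav_info(path):
    """Read channel count, sample rate, and duration from an integer-PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }
```

For a 10-second request, `wav_info('output.wav')["duration_s"]` should come out close to 10.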
Troubleshooting Common Issues
1. Installation Errors
Verify that dependencies are correctly installed and that your Python version is compatible.
2. Insufficient RAM
Close unnecessary applications or upgrade hardware if memory-related errors occur.
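To see whether memory is actually the bottleneck, the available RAM can be compared against the 6 GB figure from the prerequisites. The sketch below parses text in the format of Linux's `/proc/meminfo`; the helper names are hypothetical, and the threshold simply mirrors the prerequisite above.

```python
def parse_meminfo_kib(text):
    """Parse /proc/meminfo style lines into a dict of integer values (KiB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def has_enough_ram(text, required_gib=6.0):
    """True if MemAvailable is at least `required_gib` GiB."""
    available_kib = parse_meminfo_kib(text).get("MemAvailable", 0)
    return available_kib / (1024 ** 2) >= required_gib
```

On Ubuntu, this could be called as `has_enough_ram(open("/proc/meminfo").read())`.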
3. Audio Quality Issues
Increase the sampling steps in the generate function for better output quality, but note that this may increase processing time.
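The trade-off between steps and time can be illustrated with a toy example. Flow-matching models typically generate output by numerically integrating an ODE, and each integration step costs one model evaluation. The sketch below is not YuE-7B's sampler; it assumes a plain fixed-step Euler solver and a made-up velocity field, purely to show that more steps reduce integration error while the number of evaluations grows linearly.

```python
import math


def euler_integrate(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 using `steps` Euler updates."""
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        # One call to `velocity` per step: this stands in for one model forward pass.
        x = x + dt * velocity(x, i * dt)
    return x


def demo_error(steps):
    """Error of Euler integration on the toy field dx/dt = x, whose exact answer is e."""
    approx = euler_integrate(lambda x, t: x, 1.0, steps)
    return abs(approx - math.e)


if __name__ == "__main__":
    for n in (10, 25, 50, 100):
        print(f"steps={n:>3}  error={demo_error(n):.4f}")
```

In this toy setting the error shrinks as the step count rises, which mirrors why raising `steps` tends to improve quality at the cost of longer generation.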
Practical Applications of YuE-7B
YuE-7B can be applied across various industries:
- Game Development – Generates dynamic sound effects for interactive experiences.
- Film Production – Creates realistic soundscapes for movie scenes.
- Education – Enhances learning materials with custom-generated audio.
- Accessibility Tools – Converts written content into audio for visually impaired users.
Conclusion
YuE-7B offers a seamless text-to-audio generation experience on Ubuntu, enabling high-quality, AI-driven sound production.
With its powerful architecture and ease of use, it opens new possibilities in gaming, film production, education, and accessibility. By following this guide, you can harness YuE-7B effectively for your projects.