Run Kimi-Audio on Ubuntu: Installation and Usage Guide

Kimi-Audio is Moonshot AI's state-of-the-art 7B-parameter audio foundation model, capable of speech recognition, audio generation, and multimodal conversation.

System Requirements

Hardware

  • GPU: Minimum NVIDIA RTX 3090 (24GB VRAM) / Recommended RTX 6000 Ada (48GB VRAM)
  • RAM: 64GB DDR4 minimum
  • Storage: 100GB+ free SSD space (for models and datasets)

Software

  • OS: Ubuntu 22.04 LTS (recommended)
  • NVIDIA Drivers: 535+ with CUDA 12.2
  • Dependencies: Python 3.10+, PyTorch 2.1+, Ninja build system
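
Before starting the installation, it is worth confirming that the GPU, its VRAM, and the driver version meet the requirements above. The snippet below is a minimal sketch that shells out to nvidia-smi (so it assumes the NVIDIA driver is already installed) and runs with any Python 3 interpreter:

# check_gpu.py - report GPU name, total VRAM, and driver version
import subprocess

query = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in query.stdout.strip().splitlines():
    name, vram, driver = [field.strip() for field in line.split(",")]
    print(f"GPU: {name} | VRAM: {vram} | driver: {driver}")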

Installation Process

Step 1: System Setup

# Update system packages
sudo apt update && sudo apt full-upgrade -y

# Install essential tools
sudo apt install -y git-lfs build-essential ninja-build ffmpeg

Step 2: CUDA Toolkit Installation

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-2

Step 3: Python Environment Setup

# Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Create virtual environment
conda create -n kimi-audio python=3.10 -y
conda activate kimi-audio

Step 4: Model Repository Setup

# Clone main repository
git clone https://github.com/MoonshotAI/Kimi-Audio.git
cd Kimi-Audio

# Initialize submodules
git submodule update --init --recursive

# Manual fix for GLM tokenizer (critical step)
git clone https://github.com/THUDM/GLM-4-Voice.git
cp -r GLM-4-Voice/ glm4_voice/
mv glm4_voice/ tokenizers/GLM4/

Step 5: Dependency Installation

# Install PyTorch built against CUDA 12.1 (forward-compatible with the CUDA 12.2 driver stack)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install project requirements
pip install -r requirements.txt

# Additional audio processing libraries
pip install soundfile librosa==0.10.1 torchaudio==2.1.0
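
After the installs finish, a quick import test catches CUDA or version mismatches before you download any weights. This is a minimal sketch that uses only the packages installed above:

# verify_env.py - confirm PyTorch, CUDA, and the audio libraries are usable
import soundfile
import torch
import torchaudio

print("PyTorch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("Device:", props.name)
    print("VRAM (GB):", round(props.total_memory / 1e9, 1))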

Model Download and Configuration

Download Pre-trained Models

# Install Hugging Face Hub tools
pip install huggingface_hub

# Download model weights
huggingface-cli download moonshotai/Kimi-Audio-7B-Instruct --local-dir ./models/7B-Instruct
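
If you prefer to script the download (for example inside a provisioning step), huggingface_hub's snapshot_download does the same job as the CLI call above. The sketch below assumes the same target directory:

# download_model.py - programmatic alternative to huggingface-cli download
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="moonshotai/Kimi-Audio-7B-Instruct",
    local_dir="./models/7B-Instruct",
)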

Environment Configuration

Create config.yaml:

model_path: "./models/7B-Instruct"
device: "cuda:0"
audio_sample_rate: 24000
text_tokenizer: "Qwen-7B"
audio_tokenizer: "GLM4-Voice"
max_audio_length: 600
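
A small sanity check that the YAML parses and points at an existing model directory can save a confusing startup error later. The sketch below assumes PyYAML is available (pip install pyyaml if it is not already pulled in by the requirements) and only checks the keys shown above:

# check_config.py - validate config.yaml before launching inference
import os
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

assert os.path.isdir(cfg["model_path"]), f"model_path not found: {cfg['model_path']}"
assert cfg["audio_sample_rate"] == 24000, "Kimi-Audio expects 24 kHz audio"
print("Config OK:", cfg)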

Inference Execution

Basic ASR Example

from kimia_infer.api.kimia import KimiAudio
import soundfile as sf

model = KimiAudio(config_path="config.yaml")
messages = [
    {"role": "user", "message_type": "text", "content": "Transcribe this:"},
    {"role": "user", "message_type": "audio", "content": "test.wav"}
]

_, transcription = model.generate(messages, output_type="text")
print(f"Transcription: {transcription}")

Audio Conversation Example

messages = [
    {"role": "user", "message_type": "audio", "content": "question.wav"}
]

audio_output, text_output = model.generate(
    messages,
    audio_temperature=0.7,
    text_temperature=0.3,
    output_type="both"
)

sf.write("response.wav", audio_output.cpu().numpy(), 24000)

Advanced Configuration

Performance Optimization

# Enable FlashAttention
export USE_FLASH_ATTENTION=1

# Set memory-efficient attention
export MAX_JOBS=4
pip install xformers==0.0.23
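
To confirm that the memory-efficient attention path is actually usable after installing xformers, a quick import test is enough. This sketch assumes only the packages installed in this step:

# check_attention.py - verify xformers and PyTorch's fused attention backends
import torch
import xformers
import xformers.ops

print("xformers:", xformers.__version__)
print("flash SDP enabled:", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient SDP enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())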

Batch Processing Script

Create batch_process.py:

import glob

from tqdm import tqdm
from kimia_infer.api.kimia import KimiAudio

# Load the model once and reuse it for every file
model = KimiAudio(config_path="config.yaml")

audio_files = glob.glob("dataset/*.wav")

for file in tqdm(audio_files):
    messages = [
        {"role": "user", "message_type": "text", "content": "Describe this audio:"},
        {"role": "user", "message_type": "audio", "content": file}
    ]

    _, description = model.generate(messages, output_type="text")
    with open(f"{file}.txt", "w") as f:
        f.write(description)

Troubleshooting Guide

Common Issues

  1. CUDA Out of Memory:
    • Reduce max_audio_length in config
    • Enable gradient checkpointing: model.enable_gradient_checkpointing()
  2. Tokenizer Errors:
    rm -rf tokenizers/GLM4
    git clone https://github.com/THUDM/GLM-4-Voice.git tokenizers/GLM4
  3. Audio Generation Artifacts: constrain audio sampling (see the sketch after this list), for example:
    sampling_params = {
        "audio_prior_temperature": 0.5,
        "audio_top_k": 50,
        "audio_repetition_penalty": 1.2
    }
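
For item 3, the dictionary above can be unpacked directly into the generate call, reusing the model and messages objects from the earlier examples. This is a sketch that assumes model.generate accepts these keys as keyword arguments alongside output_type, which may differ between releases:

# Hypothetical usage of the sampling parameters from item 3
sampling_params = {
    "audio_prior_temperature": 0.5,
    "audio_top_k": 50,
    "audio_repetition_penalty": 1.2,
}
audio_output, text_output = model.generate(messages, **sampling_params, output_type="both")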

Benchmark Results

Task                  Metric      Score
ASR (LibriSpeech)     WER         2.1%
Audio Captioning      BLEU        42.5
Speech Emotion        Accuracy    85.3%
Text-to-Speech        MCD         3.8

Custom Training Guide

Data Preparation

# Convert audio to 24kHz mono
find ./custom_data -name "*.wav" -exec ffmpeg -i {} -ar 24000 -ac 1 {}.converted.wav \;

# Create manifest.json

python tools/create_manifest.py --input_dir ./custom_data --output manifest.json
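
The exact manifest schema is defined by tools/create_manifest.py. If you need to build one by hand, the sketch below writes a hypothetical JSON-lines manifest (one object per clip with an audio path and an empty transcript placeholder); adapt the field names to whatever the tool actually produces:

# make_manifest.py - hypothetical JSON-lines manifest; adjust the fields to
# match what tools/create_manifest.py generates
import glob
import json

with open("manifest.json", "w") as out:
    for path in sorted(glob.glob("custom_data/*.converted.wav")):
        out.write(json.dumps({"audio": path, "text": ""}) + "\n")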

Training Command

accelerate launch train.py \
    --model_name_or_path ./models/7B-Instruct \
    --train_files manifest.json \
    --output_dir ./finetuned_model \
    --per_device_train_batch_size 2 \
    --learning_rate 1e-5 \
    --num_train_epochs 3

Deployment Strategies

Docker Setup

FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
RUN apt update && apt install -y git-lfs python3.10 python3-pip ffmpeg
COPY . /app
WORKDIR /app
RUN pip3 install -r requirements.txt
CMD ["python3", "api_server.py"]

REST API Setup

from fastapi import FastAPI, UploadFile
import uvicorn

from kimia_infer.api.kimia import KimiAudio

app = FastAPI()
model = KimiAudio(config_path="config.yaml")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    with open("temp.wav", "wb") as f:
        f.write(await file.read())

    messages = [
        {"role": "user", "message_type": "audio", "content": "temp.wav"}
    ]

    _, text = model.generate(messages, output_type="text")
    return {"transcription": text}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
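
Once the server is running, it can be exercised with a small client. This sketch uses the requests library (pip install requests) and assumes the /transcribe route defined above:

# client.py - exercise the /transcribe endpoint
import requests

with open("test.wav", "rb") as f:
    resp = requests.post("http://localhost:8000/transcribe", files={"file": f})
print(resp.json())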

Maintenance and Updates

  1. Model Updates:
    git pull origin main
    huggingface-cli download moonshotai/Kimi-Audio-7B-Instruct --revision v2.0 --local-dir ./models
  2. Dependency Management:
    conda env export > environment.yml
    pip freeze > requirements.txt
  3. Logging Setup:
    import logging
    logging.basicConfig(
        filename='kimi_audio.log',
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )

This guide covers installation, configuration, optimization, and deployment of Kimi-Audio on Ubuntu systems. For extended troubleshooting scenarios, advanced optimization techniques, and production deployment checklists, refer to the official documentation and technical report.

Key Pro Tips:

  • Use bitsandbytes for 4-bit quantization on consumer GPUs (see the sketch after this list)
  • Implement speculative decoding for faster inference
  • Leverage NVIDIA Triton for production-scale deployment
  • Monitor GPU memory usage with nvidia-smi dmon during inference
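
On the 4-bit quantization tip: Kimi-Audio's own loader may not expose a quantization flag, so the sketch below shows the generic transformers + bitsandbytes pattern as an assumption of how the checkpoint could be loaded in 4-bit; adapt it to the kimia_infer API if and when it gains direct support.

# Hypothetical 4-bit load via transformers + bitsandbytes; the official
# kimia_infer loader may require a different entry point.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "./models/7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)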

For the full technical specifications and architecture details, consult the Kimi-Audio Technical Report and GitHub repository.
