Run Sesame CSM 1B on macOS: Step-by-Step Guide

Sesame AI's CSM 1B is a state-of-the-art conversational speech model renowned for its ability to generate human-like voices. This guide provides a step-by-step walkthrough for running Sesame CSM 1B on a Mac, detailing prerequisites, installation procedures, and troubleshooting tips to ensure a seamless experience.

What is Sesame CSM 1B?

Sesame CSM 1B is part of a suite of AI models developed to enhance conversational experiences. It employs advanced deep learning techniques to synthesize speech that closely resembles human voices, making it ideal for applications such as audiobooks, voice assistants, and more.

Notably, CSM 1B generates residual vector quantization (RVQ) audio codes from text and audio inputs, using a Llama backbone and a specialized audio decoder that produces Mimi audio codes.

Hardware and Software Requirements

Before proceeding, ensure your Mac meets the following specifications:

  • Operating System: macOS 10.15 (Catalina) or later.
  • Processor: Intel Core i5 or newer, or Apple Silicon (M1 or later).
  • Memory: At least 8 GB RAM (16 GB recommended).
  • Storage: Sufficient space for the model and dependencies.
  • Graphics Processing Unit (GPU): While not mandatory, a GPU can enhance performance. For Apple Silicon Macs, the integrated GPU is adequate.
  • Python Environment: Python 3.8 or higher.
  • Additional Libraries: Installation of libraries such as PyTorch, TorchAudio, and Transformers is required.
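Before installing anything, it can help to confirm the basics from a terminal. A quick sanity check (assuming python3 is on your PATH):

```shell
# Check the Python version (3.8 or higher is required)
python3 --version

# Check the CPU architecture: arm64 means Apple Silicon, x86_64 an Intel Mac
uname -m

# Verify the minimum version programmatically
python3 -c 'import sys; assert sys.version_info >= (3, 8), "Python 3.8+ required"'
```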

Setting Up Your Environment

  1. Install Python and Required Libraries:

If Python isn't already installed, download it from the official Python website.

Create a virtual environment to manage dependencies:

python3 -m venv myenv

Activate the virtual environment:

source myenv/bin/activate

Install the necessary libraries using pip:

pip install torch torchvision torchaudio transformers
  2. Hugging Face CLI Setup:

Install the Hugging Face CLI:

pip install huggingface_hub

If you don't have a Hugging Face account, create one and generate an access token for authentication.

Log in to your Hugging Face account:

huggingface-cli login
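As an alternative to the interactive login, the Hugging Face libraries also honor a token supplied via environment variable. A small helper that resolves one (the function name get_hf_token is our own, but HF_TOKEN and HUGGING_FACE_HUB_TOKEN are conventions the huggingface_hub library recognizes):

```python
import os

def get_hf_token(env=None):
    """Return a Hugging Face token from the environment, or None.

    Falling back to None means you should authenticate via
    `huggingface-cli login` instead.
    """
    env = os.environ if env is None else env
    return env.get("HF_TOKEN") or env.get("HUGGING_FACE_HUB_TOKEN")
```

A token resolved this way can be passed explicitly to download calls later (hf_hub_download accepts a token argument); if it returns None, the interactive login above is the fallback.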

Downloading and Installing Sesame CSM 1B

  1. Clone the Repository:

Navigate to your desired installation directory and clone the Sesame CSM repository using Git:

git clone https://github.com/SesameAILabs/csm.git

This repository contains the necessary code and instructions to run the model.

  2. Install Dependencies:

Navigate into the cloned repository:

cd csm

Install dependencies listed in the requirements.txt file:

pip install -r requirements.txt

Note: The triton package publishes prebuilt wheels for Linux only, so installing it may fail on macOS. If pip errors on triton, remove it from requirements.txt; it is not used by the MPS or CPU backends. (Windows users can substitute the community triton-windows package instead.)
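If triton does fail to install, one workaround on macOS is to filter it out of the requirements before running pip. A minimal sketch (the filter_requirements helper is our own, assuming one requirement per line):

```python
import sys

def filter_requirements(lines, platform=None):
    """Drop the triton requirement on macOS, where no prebuilt wheel exists."""
    platform = sys.platform if platform is None else platform
    if platform == "darwin":  # macOS
        return [l for l in lines if not l.strip().lower().startswith("triton")]
    return list(lines)

# Example: filter requirements.txt before `pip install -r`
# with open("requirements.txt") as f:
#     kept = filter_requirements(f.read().splitlines())
```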

  3. Download Models:

Use the Hugging Face Hub to download the CSM 1B model checkpoint:

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="sesame/csm-1b", filename="ckpt.pt")

Ensure you have accepted the model's terms on Hugging Face and that your authentication token is correctly set.

Running Sesame CSM 1B

  1. Load and Run the Model: Create a Python script that loads the model and generates speech. Make sure the generator is loaded on the correct device; for Apple Silicon Macs, the Metal Performance Shaders (MPS) backend is recommended for best performance.
  2. Audio Generation: The script below converts the input text into an audio waveform and saves it as audio.wav.

Here's an example:

import torch
from generator import load_csm_1b
import torchaudio

# Load model
model_path = "path_to_downloaded_ckpt.pt"
generator = load_csm_1b(model_path, "mps")  # Use 'mps' for Apple Silicon Macs

# Generate audio
input_text = "Hello from Sesame."
audio = generator.generate(
    text=input_text,
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

# Save audio
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
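Since each generate call is capped by max_audio_length_ms, a long passage is best fed to the model sentence by sentence. A naive splitter you could use for that (split_sentences is a hypothetical helper, not part of the csm repository):

```python
import re

def split_sentences(text):
    """Split text on sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

Each chunk can then be generated separately and the resulting waveforms concatenated before saving.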

Usage

Generate Speech with Context:

import torchaudio
from generator import Segment, load_csm_1b

# Assumes a generator has been loaded, e.g.:
# generator = load_csm_1b(device="mps")

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)

Generate a Sentence:

from generator import load_csm_1b
import torchaudio
import torch

if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

generator = load_csm_1b(device=device)

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)

Performance and User Ratings

Sesame CSM 1B has received widespread acclaim for its natural-sounding speech and low-latency performance. Users have reported that the model's speech naturalness is so high that it is "impossible to distinguish from a human voice".

In blind tests, participants could not distinguish between CSM and real humans during short conversation snippets. However, longer dialogues still revealed some limitations, such as occasional unnatural pauses and audio artifacts.

Common Errors and Solutions

Error                        Solution
MPS Compatibility Issues     Modify code to avoid unsupported operations, or use a different backend.
Missing Dependencies         Install missing libraries using pip.
Hugging Face Access Issues   Ensure you have the necessary permissions and tokens.
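For the MPS issues in particular, PyTorch supports falling back to the CPU for operators not yet implemented on MPS, via the PYTORCH_ENABLE_MPS_FALLBACK environment variable (it must be set before torch is imported). A sketch of that plus a device-selection helper (pick_device is our own name):

```python
import os

# Opt in to CPU fallback for ops missing on MPS; set before importing torch.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

def pick_device(mps_available, cuda_available):
    """Prefer MPS on Apple Silicon, then CUDA, then CPU."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"

# In practice, feed it the torch capability checks:
# import torch
# device = pick_device(torch.backends.mps.is_available(), torch.cuda.is_available())
```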

Future Developments

Sesame plans to release key components of their research as open source under the Apache 2.0 license. In the coming months, they aim to scale up both model size and training scope, with plans to expand to over 20 languages.

The company is also focusing on integrating pre-trained language models and developing fully duplex-capable systems that can learn conversation dynamics like speaker transitions, pauses, and pacing directly from data.

Conclusion

Sesame CSM 1B represents a significant breakthrough in AI speech technology, offering high-quality speech generation with contextual understanding and real-time performance. By following the steps outlined in this guide, you can install and run Sesame CSM 1B locally.