Nari Dia 1.6B vs Sesame CSM 1B: Which Is the Best TTS?

Text-to-speech (TTS) technology has seen rapid advancements, evolving from robotic voices to lifelike AI-generated speech. In 2025, two of the leading open-source models are Nari Dia 1.6B and Sesame CSM 1B.
Both offer impressive capabilities in realistic speech synthesis, but they cater to different use cases and offer distinct strengths.
Overview of the Models
Nari Dia 1.6B
- Developer: Nari Labs
- Parameters: 1.6 billion
- Focus: Expressive, dialogue-driven speech with nonverbal cues
- Language Support: English only
- License: Apache 2.0 (open source)
- Key Strengths: Emotion control, nonverbal sound generation, voice cloning, full offline support
Sesame CSM 1B
- Developer: Sesame AI Labs
- Parameters: 1 billion
- Focus: Real-time, context-aware conversational speech
- Language Support: English (primary)
- License: Apache 2.0 (open source)
- Key Strengths: Low-latency performance, multimodal learning, seamless context integration
Technical Architecture
Feature | Nari Dia 1.6B | Sesame CSM 1B |
---|---|---|
Model Size | 1.6B parameters | 1B parameters |
Core Technology | TTS-optimized language model | Multimodal transformer with RVQ |
Input Modalities | Text + optional audio prompt | Text + optional audio context |
Output Format | Direct waveform generation | RVQ codes → waveform reconstruction |
Dialogue Support | Multi-speaker via text tags `[S1]`, `[S2]` | Context-aware dialogue modeling |
Nonverbal Sounds | Yes (e.g., `(laughs)`, `(coughs)`) | Not natively; possible with custom context |
Voice Cloning | Via audio conditioning | Contextual prompting |
Real-Time Capability | Fast on GPU | Low-latency, real-time on GPU or CPU |
Customization | Fully modifiable and fine-tunable | Customizable via context and open weights |
Key Features and Innovations
Nari Dia 1.6B
- Expressive Dialogue Generation: Optimized for natural back-and-forth exchanges using speaker tags; see the sketch after this list.
- Emotion and Tone Control: Users can guide intonation and style using audio prompts.
- Nonverbal Sound Synthesis: Generates authentic nonverbal cues like laughter or coughing from text.
- Voice Cloning: Supports imitation of speaker characteristics via sample conditioning.
- Offline Operation: Can run locally for maximum privacy and data security.
- Community and Openness: Apache 2.0 licensed with active developer support and open-source code.
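To make the tag-based workflow concrete, here is a minimal sketch modeled on the examples in the nari-labs/dia repository. The `Dia.from_pretrained` entry point and the 44.1 kHz output follow that repo's README, but exact names and arguments may shift between releases, so treat this as illustrative rather than definitive:

```python
# Minimal Dia dialogue sketch, based on the nari-labs/dia README examples.
import soundfile as sf
from dia.model import Dia

# Load the pretrained 1.6B checkpoint from Hugging Face.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] tags mark dialogue turns; parenthesized cues such as (laughs)
# are rendered as nonverbal sounds rather than read aloud.
script = (
    "[S1] Have you tried the new dialogue model yet? "
    "[S2] I have, and the laughter actually sounds real. (laughs) "
    "[S1] That's wild."
)

audio = model.generate(script)

# Dia generates audio at 44.1 kHz.
sf.write("dialogue.wav", audio, 44100)
```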
Sesame CSM 1B
- Multimodal Input: Processes both text and audio in a unified architecture.
- Conversational Context Awareness: Tracks speaker turns and dialogue flow using context segments; a short example follows this list.
- Real-Time Output: Engineered for interactive, low-latency applications.
- Flexible Style Transfer: Voice, tone, and flow can be influenced using previous utterances.
- Accessible for Developers: CPU-compatible for lightweight workloads; hosted models on Hugging Face.
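Here is a minimal sketch of the context-segment workflow, modeled on the examples in the SesameAILabs/csm repository. The `load_csm_1b` and `Segment` helpers follow that repo's README, and the WAV paths are hypothetical placeholders; details may differ between versions:

```python
# Context-aware generation sketch for CSM 1B, after SesameAILabs/csm.
import torch
import torchaudio
from generator import Segment, load_csm_1b

# Load the 1B model; device="cpu" also works, at higher latency.
generator = load_csm_1b(device="cuda")

def load_turn(path: str) -> torch.Tensor:
    # Load a prior utterance and resample it to the generator's rate.
    audio, sr = torchaudio.load(path)
    return torchaudio.functional.resample(
        audio.squeeze(0), orig_freq=sr, new_freq=generator.sample_rate
    )

# Prior turns become context segments: text + speaker id + audio.
context = [
    Segment(text="Hey, how was your day?", speaker=0,
            audio=load_turn("turn_0.wav")),
    Segment(text="Pretty good, thanks for asking.", speaker=1,
            audio=load_turn("turn_1.wav")),
]

# The model conditions on this history, keeping each voice consistent
# and matching the flow of the preceding turns.
audio = generator.generate(
    text="Glad to hear it. What are you up to tonight?",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)

torchaudio.save("reply.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```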
Installation and Hardware Requirements
Nari Dia 1.6B
- GPU: NVIDIA GPU with ≥10GB VRAM (e.g., RTX 3070/4070)
- OS: Linux preferred; Windows compatible
- Dependencies: Python 3.8+, PyTorch 2.0+, CUDA 12.6; `uv` recommended
- Setup: GitHub repo → virtual environment → install deps → run UI or script
- CPU Support: Planned for future
- Demo: Hugging Face ZeroGPU Space available
Sesame CSM 1B
- GPU (Optional): Recommended for real-time; CPU support available
- OS: Linux recommended
- Dependencies: Python 3.10+, PyTorch, torchaudio
- Setup: GitHub repo → virtual environment → install deps → Hugging Face CLI for model access
- Advanced Features: Requires scripting for context segments
Performance and Output Quality
Naturalness and Expressiveness
- Nari Dia 1.6B delivers highly expressive, realistic dialogue with superior emotional inflection and timing, outperforming many competitors in natural delivery.
- Sesame CSM 1B shines in fluid, context-driven conversations, adapting dynamically to conversation history for lifelike exchanges.
Speaker Consistency and Cloning
- Dia 1.6B: Uses audio prompts and random seed control for consistent voices; see the cloning sketch below.
- CSM 1B: Maintains speaker identity through context history and previous audio inputs.
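A hedged sketch of audio-conditioned cloning with Dia, loosely following the voice-clone example in the nari-labs/dia repository. The `audio_prompt` argument, the seed handling, and the reference file path are assumptions based on that example and may change between versions:

```python
# Voice-cloning sketch for Dia via audio conditioning (assumed API).
import soundfile as sf
import torch
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Transcript of the reference clip is prepended to the new script so the
# model conditions on the reference speaker's voice.
clone_transcript = "[S1] This is a short reference clip of the target voice."
new_script = "[S1] And this line is generated in that same voice."

torch.manual_seed(42)  # fixing the seed keeps output consistent across runs

audio = model.generate(
    clone_transcript + " " + new_script,
    audio_prompt="reference.wav",  # hypothetical path to the reference clip
)

sf.write("cloned.wav", audio, 44100)
```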
Nonverbal Sound Handling
- Dia 1.6B: Generates expressive sounds from written cues like `(laughs)` or `(clears throat)`.
- CSM 1B: Mimics such sounds only via contextual scripting.
Latency and Interactivity
- Dia 1.6B: Fast generation on GPUs, but lacks CPU optimization for now.
- CSM 1B: Built for low-latency, real-time output, including CPU fallback.
Customization, Privacy, and Ecosystem
Customization
- Dia 1.6B: Fully open, supports fine-tuning for new voices or domains.
- CSM 1B: Modular and adaptable via structured context; less direct for voice cloning.
Privacy and Control
- Dia 1.6B: Ideal for privacy-first environments with local-only processing.
- CSM 1B: Supports local deployment but shines in connected, dynamic settings.
Community Support
- Dia 1.6B: Developer Discord, open demos, and detailed documentation.
- CSM 1B: Hugging Face-hosted models, GitHub discussions, and research-friendly tools.
Use Cases and Applications
Best Use Cases for Nari Dia 1.6B
- Audiobooks, podcasts, YouTube narration
- Conversational AI with expressive delivery
- Voiceover for privacy-sensitive environments
- Offline TTS deployment
Best Use Cases for Sesame CSM 1B
- Real-time AI assistants and customer service bots
- Educational apps and interactive simulations
- Low-latency voice interfaces
- Multimodal speech research and prototyping
Strengths and Weaknesses
Model | Strengths | Weaknesses |
---|---|---|
Nari Dia 1.6B | Expressive and realistic speech; nonverbal sound generation; full local control; voice cloning | Requires a powerful GPU; English only; no current CPU support |
Sesame CSM 1B | Low-latency real-time output; context-aware dialogue; CPU support; multimodal learning | Lacks direct nonverbal cue support; slightly less expressive; requires structured context input |
Direct Comparison Summary
Feature | Nari Dia 1.6B | Sesame CSM 1B |
---|---|---|
Parameters | 1.6B | 1B |
Dialogue Modeling | Speaker tags (`[S1]`, `[S2]`) | Context segments |
Nonverbal Sound Support | Yes (text-based) | Partial (via context only) |
Voice Cloning | Audio conditioning | Contextual prompting |
Real-Time Capability | GPU only | GPU/CPU (low-latency) |
Customization | Fine-tuning, modifiable code | Context-based customization |
Privacy & Offline Use | Full offline control | Local or cloud-supported |
Language Support | English only | English (primary) |
Hardware Requirements | GPU with 10GB+ VRAM | GPU preferred, CPU usable |
Community | Active Discord, demos available | GitHub + Hugging Face |
License | Apache 2.0 | Apache 2.0 |
Which Should You Choose?
Choose Nari Dia 1.6B if:
- You need high-fidelity, emotionally expressive voice output.
- Your content benefits from nonverbal expressions like laughter or sighs.
- You require strict data privacy and offline deployment.
- You have a compatible GPU and plan to fine-tune or experiment.
Choose Sesame CSM 1B if:
- You’re building real-time, interactive speech applications.
- You need dynamic conversational adaptation based on dialogue history.
- You're developing on CPU-constrained devices or prefer fast prototyping.
- You’re exploring multimodal AI or educational simulations.
Final Thoughts
Both Nari Dia 1.6B and Sesame CSM 1B push the boundaries of open-source TTS in 2025. Whether you prioritize realism and privacy (Dia), or real-time adaptability and contextual flow (CSM), each model offers unique advantages.
Experiment with both to find which aligns best with your workflow, audience, and creative vision.