Chatterbox uses zero-shot voice cloning to synthesize speech in any voice from just a short reference audio clip. You don't need to train or fine-tune the model: simply provide a reference audio file and the model will clone the voice characteristics.
## How Zero-Shot Cloning Works
Chatterbox extracts voice characteristics from your reference audio and uses them to condition speech generation. The model analyzes:

- Speaker identity, using a voice encoder that creates speaker embeddings
- Speech patterns, by tokenizing the reference audio into speech tokens
- Prosody and style, through the conditioning mechanism
## Prepare your reference audio
Choose a clean audio clip of the voice you want to clone. The clip should:
- Be at least 5 seconds long (Turbo requires 5+ seconds, standard models work with shorter clips)
- Ideally be 6-15 seconds for best results
- Have clear speech without background noise
- Contain natural speaking patterns representative of the target voice
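Before handing a clip to the model, it can help to verify its duration programmatically. A minimal sketch using only Python's standard-library `wave` module (WAV files only; other formats would need a library such as librosa):

```python
import wave

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def is_valid_reference(path: str, minimum: float = 5.0) -> bool:
    """Check a clip meets the Turbo model's 5-second minimum."""
    return wav_duration_seconds(path) >= minimum
```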
## Load the model
Initialize the Chatterbox model on your preferred device. For the standard English or multilingual models:
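A minimal sketch, assuming the `ChatterboxTTS.from_pretrained` and `ChatterboxMultilingualTTS` entry points published in the project's README (verify the class paths against your installed version):

```python
import torch
from chatterbox.tts import ChatterboxTTS

# Pick the best available device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Standard English model
model = ChatterboxTTS.from_pretrained(device=device)

# Multilingual model lives in a separate module of the same package:
# from chatterbox.mtl_tts import ChatterboxMultilingualTTS
# model = ChatterboxMultilingualTTS.from_pretrained(device=device)
```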
## Using the `audio_prompt_path` Parameter
The `audio_prompt_path` parameter accepts a path to your reference audio file:
When you pass `audio_prompt_path`, the model automatically calls `prepare_conditionals()` internally to extract voice characteristics from the reference audio.
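A sketch of how this looks in practice, assuming the `generate` signature from the project's README (file names are placeholders):

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Zero-shot cloning needs only a short reference clip."

# The reference clip conditions the generated voice
wav = model.generate(text, audio_prompt_path="reference_voice.wav")

# model.sr is the model's output sample rate
ta.save("cloned_output.wav", wav, model.sr)
```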
## Best Practices for Reference Audio
### Duration Guidelines
- Minimum: 5 seconds (required for Turbo model)
- Recommended: 6-15 seconds
- Maximum: The model uses only the first 6-10 seconds depending on the model variant
### Content Recommendations
Choose reference audio that matches the speaking style you want to generate. If you want energetic speech, use an energetic reference. For calm narration, use calm reference audio.
- Clear articulation: Avoid mumbled or unclear speech
- Natural pacing: Not too fast or too slow
- Single speaker: Ensure only one person is speaking
- Complete sentences: Avoid fragments or cutoffs
- Representative style: Match the intended output style
### Format Support
The model uses `librosa` for loading audio, which supports:
- WAV
- MP3
- FLAC
- OGG
- and other common formats
## Pre-computing Voice Conditionals
For repeated use of the same voice, you can pre-compute conditionals to save time:

## Adjusting Exaggeration
The `exaggeration` parameter controls how expressive the cloned voice sounds:
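The two ideas above can be combined in one sketch, assuming the `prepare_conditionals(wav_fpath, exaggeration=...)` and `generate(..., exaggeration=...)` signatures found in the project source (paths are placeholders):

```python
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

# Pre-compute voice conditionals once for a reference clip;
# subsequent generate() calls reuse them without reprocessing the audio.
model.prepare_conditionals("reference_voice.wav", exaggeration=0.5)

# Neutral delivery using the cached conditionals
calm = model.generate("Welcome to the evening news.")

# More expressive delivery of the same voice
excited = model.generate(
    "We have incredible breaking news tonight!",
    exaggeration=0.8,  # higher values produce more expressive speech
)
```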
## Built-in Voice
All models come with a built-in default voice that's used when you don't provide an `audio_prompt_path`:
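For instance (a sketch; the bundled default voice is used whenever no reference clip is passed):

```python
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

# No audio_prompt_path: output uses the built-in default voice
wav = model.generate("This sentence uses the built-in voice.")
```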
## Troubleshooting
### "Audio prompt must be longer than 5 seconds"
The Turbo model requires at least 5 seconds of reference audio. Ensure your audio file meets this minimum duration.

### Poor Voice Similarity

If the generated voice doesn't match your reference well:

- Use higher-quality reference audio: Reduce background noise
- Try a longer reference clip: Use 10-15 seconds instead of 5-6
- Adjust `cfg_weight` (standard models only): Try lower values like `cfg_weight=0.3`
- Match speaking style: Ensure the reference style matches your target text
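As an illustration of the `cfg_weight` adjustment, a sketch assuming the standard model's `generate` accepts `cfg_weight` as documented in the project README (file names are placeholders):

```python
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
text = "A line that should closely match the reference voice."

# First attempt with default settings
wav = model.generate(text, audio_prompt_path="reference_voice.wav")

# If similarity is poor, retry with a lower cfg_weight
wav = model.generate(
    text,
    audio_prompt_path="reference_voice.wav",
    cfg_weight=0.3,  # lower values can improve speaker similarity
)
```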