🎤 TTS Voice Creator - Clone and Design

Voice Clone (3-20s audio file rapid clone): Voice Clone/Base (WAV file). Rapid voice clone from 3 to 20 seconds of user audio input. Best when you need to preserve the speaker identity from a WAV reference.
Voice Design (description to voice): Performs voice design based on user-provided descriptions. Instruction control is strong; it creates a persona rather than cloning an existing WAV speaker.
Voice Custom (style over target timbres): CustomVoice provides style control over target timbres via user instructions, covering speaker combinations such as gender, age, language, and dialect.
TTS Generation (playground for voices/models): Playground for all reachable TTS models and voices. Pick a backend, fetch its voices, then preview, stream, or save audio.
STT -> TTS (transcribe then speak): Upload audio, transcribe it through the configured STT endpoint, edit the text, then synthesize it with any reachable TTS backend.
Routing (map apps to voices): Route OpenAI-compatible TTS requests by app, incoming voice, language, and backend.
How to use (in apps): How to use the Creator proxy and TTS backends in Open WebUI, SillyTavern, Home Assistant, and other compatible apps.
Get Voices (scrape voice sources): Scrape public voice clip and dataset indexes, preview direct MP3 files, and open source pages for download or licensing details.
Settings (URLs, keys, folders): Configure TTS/STT backend URLs, API keys, voice folders, playback mode, and NVIDIA endpoints.

Load audio

Drop an audio / video file here WAV · MP3 · OGG · FLAC · M4A · MP4 · MKV · WEBM or click to browse

YouTube / URL

Microphone

Voice Design samples: preview a sample or use it to fill the editor below.
Name · Sex · Language · Description · Sample text · Actions

Describe the voice

Describe the voice you want. Qwen3-TTS VoiceDesign will synthesise it from the description and sample text.

Dimension        Examples
Gender           Male, female, neutral
Age              Child (5-12), teenager (13-18), young adult (19-35), middle-aged (36-55), elderly (55+)
Pitch            High, mid, low, slightly high, slightly low
Speaking rate    Fast, moderate, slow, slightly fast, slightly slow
Emotion          Cheerful, calm, gentle, serious, lively, composed, soothing
Characteristics  Magnetic, crisp, husky, smooth, sweet, rich, powerful
Use case         News broadcasting, advertisement voice-over, audiobook, animated character, voice assistant, documentary narration
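
One way to turn the dimensions above into a description is to join them into a short sentence. The sketch below is only an illustration: the function name, the parameters, and the sentence template are assumptions, not something Qwen3-TTS VoiceDesign requires.

```python
from typing import Optional


def build_voice_description(
    gender: str,
    age: str,
    pitch: Optional[str] = None,
    rate: Optional[str] = None,
    emotion: Optional[str] = None,
    characteristics: Optional[str] = None,
    use_case: Optional[str] = None,
) -> str:
    """Join the table's dimensions into one plain-English description.

    Hypothetical helper: the wording template is an assumption.
    """
    opening = f"A {age} {gender} voice"
    if pitch:
        opening += f" with {pitch} pitch"
    if rate:
        opening += f", speaking at a {rate} pace"
    sentences = [opening + "."]
    if emotion:
        sentences.append(f"Tone: {emotion}.")
    if characteristics:
        sentences.append(f"Timbre: {characteristics}.")
    if use_case:
        sentences.append(f"Use case: {use_case}.")
    return " ".join(sentences)


description = build_voice_description(
    gender="female",
    age="young adult",
    pitch="slightly low",
    rate="moderate",
    emotion="calm",
    characteristics="smooth",
    use_case="audiobook",
)
print(description)
```

Optional dimensions that are left out are simply skipped, so a description can start with gender and age and grow as you refine the voice.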

Reference transcript & generate

This shared reference transcript is used by samples, prompt presets, generation, preview, download, and export to the Voice Clone Library.

Voice Design prompt library

Voice Custom

CustomVoice uses Qwen's configured premium/custom speakers. It is the best place to test style instructions when you can use one of the CustomVoice timbres.

Best for style control over configured target timbres.
model voice · style-aware · stream-capable model
It does not automatically reuse arbitrary WAV voices from the Voice Clone library. For your own recurring character, create/fine-tune/configure that voice here, or design/export a WAV and then clone it.

TTS voice routing

Recommended OpenAI-compatible TTS base URL: . If an app asks for the full speech endpoint instead of a base URL, use /v1/audio/speech.

Rules can turn an incoming app voice such as default into a real cloned voice before the request is sent to Qwen3-TTS. The app name is read from the request JSON fields app/client, from headers such as X-TTS-App, or guessed from User-Agent/Origin. If a client cannot send any of these, use app * or give each app a unique incoming voice name.

  • Open WebUI: set Audio → Text-to-Speech → Additional Parameters to {"app":"Open WebUI"} so these routes match explicitly.
  • Home Assistant: set the TTS agent base URL to this Creator proxy, not the direct Qwen backend, and use the extra payload {"app":"Home Assistant"}. Set Response splitting to Punctuation for lower perceived latency.

Backend chooses whether this route uses normal Voice Clone, low-latency Streaming, or Voice Design presets such as vd_.... Streaming routes cannot apply before/after sounds without buffering. Language rules use lightweight text detection for EN, DE, FR, ES, IT, PT, NL, and PL. Optional before/after sounds are audio files inside the configured voices folder, for example sounds/start.wav.
Do not use 0.0.0.0 in Open WebUI. Use this machine's LAN IP, hostname, or Docker service name instead.
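
The matching described above can be pictured with a small sketch. It is an assumption about how such rules could be evaluated, not the proxy's actual code: the Route fields, the first-match order, the use of * as a language wildcard, and the voice names are all hypothetical, and the User-Agent/Origin guessing step is left out.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass
class Route:
    app: str           # app name, or "*" for any app
    input_voice: str   # voice name the app sends, e.g. "default"
    language: str      # language code, or "*" for any (assumed wildcard)
    output_voice: str  # real voice to use instead
    enabled: bool = True


def resolve_app(body: Mapping[str, object], headers: Mapping[str, str]) -> Optional[str]:
    """Read the app name from JSON fields app/client, then the X-TTS-App header."""
    for key in ("app", "client"):
        value = body.get(key)
        if isinstance(value, str) and value:
            return value
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get("x-tts-app") or None  # User-Agent/Origin guessing omitted


def pick_voice(routes: Sequence[Route], app: Optional[str], voice: str, language: str) -> str:
    """Return the output voice of the first enabled matching route, else the incoming voice."""
    for route in routes:
        if not route.enabled:
            continue
        if route.app not in ("*", app):
            continue
        if route.input_voice != voice:
            continue
        if route.language not in ("*", language):
            continue
        return route.output_voice
    return voice


routes = [
    Route("Open WebUI", "default", "de", "de_female_anna"),  # hypothetical voice names
    Route("*", "default", "*", "en_male_tom"),
]
app = resolve_app({"app": "Open WebUI", "voice": "default"}, {})
print(pick_voice(routes, app, "default", "de"))
```

Placing the catch-all * route last keeps it from shadowing the more specific app and language rules.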
No routes loaded.
On · App · Input voice · Language · Backend · Output voice · Before sound · After sound

Routing log

Recent route tests and proxy requests. This log is kept in memory and resets when the server restarts.

No routing log entries yet.

Get voices

Not scraped yet.

Edit the source list, one URL per line. External sources may block scraping; source errors are shown without hiding successful results. Check each source page for license, consent, and usage rights before importing or publishing a voice.

Click Scrape sources to fetch Aiartes VoiceAI clips, yaph/tts-samples MP3 files, and the jim-schwoebel voice dataset index.

Use voices in other apps

The editor creates and manages the voice files. External apps should connect to the Creator proxy or a reachable TTS backend, then use one of the active voice names.

SillyTavern

Use an OpenAI-compatible TTS provider. Paste one active voice into the voice field, or paste the comma-separated list where SillyTavern accepts custom voices.

Open WebUI

Configure TTS as OpenAI-compatible audio. Use the Creator proxy if you want Routing rules to apply, for example mapping the incoming voice default to a different voice per language.

Home Assistant

Use this as a REST example for automations or scripts that call the TTS backend. Save the returned audio somewhere Home Assistant can play from.

Generic curl test

Quick terminal test for the voice list and speech endpoint after restarting the TTS container.
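
For scripts, the speech request can also be built in Python. This is a minimal sketch under assumptions: the host and port in the usage note are placeholders, the model value is a guess that the backend may ignore, and only the /v1/audio/speech path and the app field come from this page.

```python
import json
import urllib.request
from typing import Optional


def build_speech_request(
    endpoint: str,
    voice: str,
    text: str,
    app: Optional[str] = None,
    model: str = "tts-1",          # assumption: placeholder model name
    response_format: str = "wav",
    api_key: Optional[str] = None,  # usually empty for local containers
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-compatible speech request."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    if app:
        payload["app"] = app  # lets routing rules match this client explicitly
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def save_speech(request: urllib.request.Request, out_path: str, timeout: float = 60.0) -> int:
    """Send the request and write the returned audio to out_path; returns bytes written."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

Example use, with a placeholder address: save_speech(build_speech_request("http://localhost:8000/v1/audio/speech", "default", "Hello"), "test.wav").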

VoiceDesign virtual voices

Use saved Voice Design prompt presets without exporting WAVs. Point the external app at this creator app as an OpenAI-compatible TTS proxy and select a vd_... voice.

Streaming TTS

Use this when the target app can play audio progressively. For routed streaming, keep the response format as WAV and avoid before/after route sounds; otherwise the proxy must buffer the audio before playback.
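
The buffering rule and progressive reading can be sketched as follows. The function names and the chunk size are assumptions; the rule itself (non-WAV output or any before/after sound forces buffering) is taken from the paragraph above.

```python
import io
from typing import BinaryIO, Iterator, Optional


def needs_buffering(
    response_format: str,
    before_sound: Optional[str] = None,
    after_sound: Optional[str] = None,
) -> bool:
    """True when a routed request cannot be played progressively."""
    return response_format.lower() != "wav" or bool(before_sound) or bool(after_sound)


def iter_audio_chunks(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield audio bytes as they arrive so a player can start before the end."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Stand-in for an HTTP response body; a real response object is read the same way.
fake_response = io.BytesIO(b"\x00" * 10000)
sizes = [len(chunk) for chunk in iter_audio_chunks(fake_response)]
print(sizes)
print(needs_buffering("wav", before_sound="sounds/start.wav"))
```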

Important after voice changes

After enabling, hiding, adding, renaming, cropping, or normalising voices, restart the Qwen3-TTS container so its engine scans the updated active_voices folder. Then refresh the model or voice list in the target app.

Virtual VoiceDesign voices are different: they use saved prompt presets through this app's proxy and do not need a WAV export or TTS-container rescan. They do need the faster-qwen3-tts-voicedesign container reachable from Settings.

Voice ID

Build from parts (LANG · GENDER · Name helper)

Transcript (reference text)

Audio preview

No audio loaded yet. Go to tab 1 (trim a file) or tab 2 (voice design).

Voice library

1 Load or record source

Drop an audio / video file here WAV · MP3 · OGG · FLAC · M4A · MP4 · MKV · WEBM or click to browse
Record a fresh sample
Input level -∞ dB
Best peaks: -18 to -9 dB, never red.
Use this as the spoken script if you record.
Quiet room · 20 cm from mic · No clipping · Natural pace
If the microphone is blocked:
  • Chrome, Brave, Edge: click the lock/tune icon in the address bar, set Microphone to Allow, then reload.
  • Firefox: click the microphone or lock icon in the address bar, remove Blocked or choose Allow, then reload.
  • Safari: open Safari Settings, Websites, Microphone, then allow this site.
  • Requested device not found: choose or enable a microphone in your OS input settings, then reload.
  • Browsers require localhost or HTTPS for microphone access.
Load a file, paste a URL, or record a sample. Then trim, name, and save the voice.
Select 3-20 seconds for best cloning.
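
A clip can be checked against the 3-20 second and -18 to -9 dB guidance before saving. The sketch below assumes 16-bit PCM WAV input and treats the dB figures as peak level relative to full scale; the function name and the returned keys are hypothetical.

```python
import io
import math
import struct
import wave


def inspect_wav(source) -> dict:
    """Report duration and peak level of a 16-bit PCM WAV (path or file object)."""
    with wave.open(source, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frame_count = wav.getnframes()
        frame_rate = wav.getframerate()
        raw = wav.readframes(frame_count)
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    peak = max((abs(sample) for sample in samples), default=0)
    # Assumption: dB here means peak level relative to 16-bit full scale.
    peak_db = 20 * math.log10(peak / 32768) if peak else float("-inf")
    duration = frame_count / frame_rate
    return {
        "duration_s": duration,
        "peak_db": peak_db,
        "duration_ok": 3 <= duration <= 20,
        "peak_ok": -18 <= peak_db <= -9,
    }


def make_test_wav(seconds: int, level: int, rate: int = 8000) -> io.BytesIO:
    """Create an in-memory mono WAV with a constant sample value, for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(struct.pack(f"<{seconds * rate}h", *([level] * (seconds * rate))))
    buffer.seek(0)
    return buffer


report = inspect_wav(make_test_wav(seconds=5, level=8192))
print(report)
```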

2 Name and save voice

Voices
Image · Language / Sex / Name · Filetype · Length · dB · Benchmark · Rating · Play / Pause · Edit

TTS generation playground

Pick any reachable TTS backend, fetch its voices, then synthesize text. WAV/NVIDIA clone backends preserve reference identity; instruction-control backends follow style better.

Generate speech


After changing active voices, restart the TTS container so the engine reads the updated voice folder.

The style instruction is sent as instruct. In this setup, Voice Clone/Base and Streaming are fastest but usually preserve WAV identity more than they obey per-request style. CustomVoice and Voice Design are the style-aware choices.

STT -> TTS workspace

Upload speech audio, transcribe it with the configured STT endpoint, then synthesize the resulting text with any available TTS backend.

Source speech

Uses Settings -> Whisper/STT URL by default.
Record or upload speech audio, then transcribe it with the selected recognition engine.
No source audio loaded.

Synthesize transcription


Settings

Configure the service URLs you actually use first. Advanced payloads, folders, and keys are tucked away below.

Local stack
Quick setup
1. Check core TTS URLs.
2. Point STT at Whisper, Parakeet, or the NVIDIA router.
3. Save settings.

Core connections

These are the endpoints you change most often. Qwen3 TTS, NVIDIA TTS, and STT are grouped separately.

TTS (Text to Speech)
Qwen3 engines: clone, design, custom, streaming
  • Uploaded/cloned WAV voices. Expected: POST /v1/audio/speech.
  • Prompt-designed voices and vd_... virtual voices.
  • Style over configured speakers such as Ryan, Vivian, Serena.
  • Progressive low-latency WAV playback.
NVIDIA speech stack: router, Magpie TTS, Parakeet ASR, clone NIM
  • OpenAI-compatible base URL for the NVIDIA speech router.
  • Direct Magpie endpoint. Fixed speakers, not WAV cloning.
  • Direct Parakeet endpoint for transcription.
  • Magpie Zeroshot clone endpoint with audio_prompt.
  • Magpie Flow clone endpoint with audio_prompt and transcript.
STT (Speech to Text): Whisper, Parakeet, or NVIDIA router
  • Reference text recognition. Expected: POST /v1/audio/transcriptions.

Playback behavior

Small behavior switches for previews and OpenAI-compatible TTS calls.

Buffered keeps Save WAV available. Streaming starts sooner.
Controls payload shape for the normal TTS API URL.
Advanced request payloads: JSON extras sent to each backend
  • Extra fields for the 8020 WAV voice clone/base model.
  • Extra fields for the 8023 streaming model.
  • Extra fields for the CustomVoice backend.
  • Extra fields for Voice Design and virtual vd_... voices.
  • Usually empty. Magpie accepts fixed speaker voices such as sofia, aria, jason, leo, and john.
  • Optional multipart fields. The app supplies text, language, and audio_prompt.
  • Optional multipart fields. The app also sends the saved reference transcript.
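
As an illustration of how JSON extras might be combined with a request, the sketch below merges an extras string into a base payload. It is an assumption, not the app's implementation: the function name is hypothetical, and skipping app-supplied fields is a design choice made here so that extras cannot overwrite the text or voice.

```python
import json
from typing import Iterable


def apply_extras(
    payload: dict,
    extras_json: str,
    protected: Iterable[str] = ("input", "voice"),  # assumed app-supplied fields
) -> dict:
    """Return a copy of payload with the JSON extras merged in."""
    merged = dict(payload)
    text = extras_json.strip()
    if not text:
        return merged  # an empty settings field means no extras
    extras = json.loads(text)
    if not isinstance(extras, dict):
        raise ValueError("extras must be a JSON object")
    blocked = set(protected)
    for key, value in extras.items():
        if key in blocked:
            continue  # keep the app-supplied value
        merged[key] = value
    return merged


base = {"input": "Hello", "voice": "default", "response_format": "wav"}
merged = apply_extras(base, '{"app": "Open WebUI", "voice": "ignored"}')
print(merged)
```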
Voice folders: container paths and Portainer volume mounts
  • Contains active_voices, hidden_voices, sounds, and metadata.
  • New cloned/exported voices are saved here.
API keys: usually empty for local containers