TTS Voice Creator - Clone and Design

Voice Clone: rapid clone from a 3-20 s audio file

Voice Design: description to voice

Voice Custom: style over target timbres

TTS Generation: playground for voices/models

STT -> TTS: transcribe, then speak

Routing: map apps to voices

How to use in apps

Get Voices: scrape voice sources

Settings: URLs, keys, folders

Load audio

Drop an audio / video file here, or click to browse. Supported formats: WAV · MP3 · OGG · FLAC · M4A · MP4 · MKV · WEBM

YouTube / URL

Microphone

Voice Design samples Preview a sample or use it to fill the editor below.

Name

Sex

Language

Description

Sample text

Actions

Describe the voice

Describe the voice you want. Qwen3-TTS VoiceDesign will synthesise it from the description and sample text.

Gender

Language

Name

Dimension	Examples
Gender	Male, female, neutral
Age	Child (5-12), teenager (13-18), young adult (19-35), middle-aged (36-55), elderly (55+)
Pitch	High, mid, low, slightly high, slightly low
Speaking rate	Fast, moderate, slow, slightly fast, slightly slow
Emotion	Cheerful, calm, gentle, serious, lively, composed, soothing
Characteristics	Magnetic, crisp, husky, smooth, sweet, rich, powerful
Use case	News broadcasting, advertisement voice-over, audiobook, animated character, voice assistant, documentary narration

Reference transcript & generate

This shared reference transcript is used by samples, prompt presets, generation, preview, download, and export to the Voice Clone Library.

Voice Design prompt library

Preset

Voice Custom

CustomVoice uses Qwen's configured premium/custom speakers. It is the best place to test style instructions when you can use one of the CustomVoice timbres.

Best for style control over configured target timbres.

model voice · style-aware · stream-capable model

It does not automatically reuse arbitrary WAV voices from the Voice Clone library. For your own recurring character, create/fine-tune/configure that voice here, or design/export a WAV and then clone it.

TTS voice routing

Recommended OpenAI-compatible TTS base URL: . If an app asks for the full speech endpoint instead of a base URL, use /v1/audio/speech.

Rules can turn an incoming app voice such as default into a real cloned voice before the request is sent to Qwen3-TTS. The app name is read from the request JSON fields app/client, from headers such as X-TTS-App, or guessed from User-Agent/Origin. If a client cannot send any of these, use app * or give each app a unique incoming voice name.

For Open WebUI, set Audio → Text-to-Speech → Additional Parameters to {"app":"Open WebUI"} so these routes match explicitly. For Home Assistant, set the TTS agent base URL to this Creator proxy, not the direct Qwen backend, and use the extra payload {"app":"Home Assistant"}. Set Response splitting to Punctuation for lower perceived latency.

The Backend field of a route chooses whether it uses normal Voice Clone, low-latency Streaming, or Voice Design presets such as vd_.... Streaming routes cannot apply before/after sounds without buffering. Language rules use lightweight text detection for EN, DE, FR, ES, IT, PT, NL, and PL. Optional before/after sounds are audio files inside the configured voices folder, for example sounds/start.wav.

Do not use 0.0.0.0 in Open WebUI. Use this machine's LAN IP, hostname, or Docker service name instead.
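A request that identifies its app to the proxy could look roughly like the sketch below. It only builds the request and does not send it. The host name, port, and the model value "tts-1" are hypothetical placeholders. The input, voice, and response_format fields follow the usual OpenAI-style speech payload and are assumed rather than confirmed for this proxy. The app field, the X-TTS-App header, and the /v1/audio/speech path come from the routing notes above.

```python
import json
import urllib.request


def build_speech_request(base_url, text, voice="default", app="Open WebUI"):
    """Build (but do not send) a POST to the proxy's speech endpoint."""
    payload = {
        "model": "tts-1",          # hypothetical placeholder model name
        "input": text,
        "voice": voice,            # incoming voice name that routing rules can remap
        "response_format": "wav",
        "app": app,                # JSON field the proxy reads the app name from
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        # The header is a second way to announce the app name.
        headers={"Content-Type": "application/json", "X-TTS-App": app},
        method="POST",
    )


# Hypothetical LAN host name; use the machine's LAN IP, hostname, or Docker service name.
request = build_speech_request("http://tts-creator.lan:8000", "Hello there.")
# To actually send it: audio = urllib.request.urlopen(request).read()
```

Sending both the JSON field and the header is redundant; either one should be enough if the client can set it.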

App

Input voice

Language

Backend

Output voice

Before sound

After sound
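One way to picture how a rule with these fields might be resolved is a first-match lookup with * as a wildcard. This is only an illustrative sketch: the matching order, case-insensitive comparison, backend labels, and voice names are all assumptions, not the app's documented behaviour, and the before/after sounds are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    app: str           # app name, or "*" for any app
    input_voice: str   # voice name the app sends, e.g. "default"
    language: str      # e.g. "EN", "DE", or "*" for any language
    backend: str       # hypothetical labels: "clone", "streaming", "design"
    output_voice: str  # voice actually used for synthesis


def _matches(pattern: str, value: str) -> bool:
    # "*" matches anything; otherwise compare case-insensitively (an assumption).
    return pattern == "*" or pattern.lower() == value.lower()


def resolve(routes, app: str, voice: str, language: str) -> Optional[Route]:
    """Return the first route whose app, input voice, and language all match."""
    for route in routes:
        if (_matches(route.app, app)
                and _matches(route.input_voice, voice)
                and _matches(route.language, language)):
            return route
    return None


# Hypothetical rule set: a specific German rule first, then a catch-all.
routes = [
    Route("Open WebUI", "default", "DE", "clone", "DE_M_Jonas"),
    Route("*", "default", "*", "streaming", "EN_F_Anna"),
]
chosen = resolve(routes, "Open WebUI", "default", "DE")
```

With first-match semantics, specific rules need to come before catch-all rules such as app *.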

Routing log

Recent route tests and proxy requests. This log is kept in memory and resets when the server restarts.

No routing log entries yet.

Get voices

Not scraped yet.

Edit the source list, one URL per line. External sources may block scraping; source errors are shown without hiding successful results. Check each source page for license, consent, and usage rights before importing or publishing a voice.

Click Scrape sources to fetch Aiartes VoiceAI clips, yaph/tts-samples MP3 files, and the jim-schwoebel voice dataset index.

Use voices in other apps

The editor creates and manages the voice files. External apps should connect to the Creator proxy or a reachable TTS backend, then use one of the active voice names.

SillyTavern

Use an OpenAI-compatible TTS provider. Paste one active voice into the voice field, or paste the comma-separated list where SillyTavern accepts custom voices.

Open WebUI

Configure TTS as OpenAI-compatible audio. Use the creator proxy if you want Routing rules such as incoming voice default mapped by language.

Home Assistant

Use this as a REST example for automations or scripts that call the TTS backend. Save the returned audio somewhere Home Assistant can play from.

Generic curl test

Quick terminal test for the voice list and speech endpoint after restarting the TTS container.

VoiceDesign virtual voices

Use saved Voice Design prompt presets without exporting WAVs. Point the external app at this creator app as an OpenAI-compatible TTS proxy and select a vd_... voice.

Streaming TTS

Use this when the target app can play audio progressively. For routed streaming, keep response format WAV and avoid before/after route sounds, otherwise the proxy must buffer before playback.

Important after voice changes

After enabling, hiding, adding, renaming, cropping, or normalising voices, restart the Qwen3-TTS container so its engine scans the updated active_voices folder. Then refresh the model or voice list in the target app.

Virtual VoiceDesign voices are different: they use saved prompt presets through this app's proxy and do not need a WAV export or TTS-container rescan. They do need the faster-qwen3-tts-voicedesign container reachable from Settings.

Voice ID

Build from parts (LANG · GENDER · Name helper)

Language

Gender

Name (no spaces)
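The helper's parts could be joined along these lines. The underscore separator, the upper-casing of language and gender, and the example values are assumptions for illustration; only the three parts and the no-spaces rule for the name come from the form above.

```python
import re


def build_voice_id(language: str, gender: str, name: str) -> str:
    """Join LANG, GENDER, and Name into one ID; the separator is an assumption."""
    clean_name = re.sub(r"\s+", "", name.strip())  # names must not contain spaces
    if not clean_name:
        raise ValueError("name must not be empty")
    return "_".join([language.strip().upper(), gender.strip().upper(), clean_name])


voice_id = build_voice_id("en", "f", "Anna Lee")
```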

Transcript (reference text)

Audio preview

No audio loaded yet. Go to tab 1 (trim a file) or tab 2 (voice design).

Sample sentence for generated playback and benchmarks

Select TTS Engine

Target dBFS

1 Load or record source

Drop an audio / video file here, or click to browse. Supported formats: WAV · MP3 · OGG · FLAC · M4A · MP4 · MKV · WEBM

Audio or video URL

Record a fresh sample

Input level -∞ dB

Mic gain 1.00x

Best peaks: -18 to -9 dB, never red.
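For reference, a peak level in dB relative to full scale can be computed from normalised samples as 20·log10 of the largest absolute value. The sketch below assumes float samples in the range -1.0 to 1.0 and checks them against the -18 to -9 dB window mentioned above. The function names are illustrative, not the app's own.

```python
import math


def peak_dbfs(samples) -> float:
    """Peak level in dBFS for float samples in [-1.0, 1.0]; silence gives -inf."""
    peak = max((abs(s) for s in samples), default=0.0)
    return -math.inf if peak == 0 else 20.0 * math.log10(peak)


def level_ok(samples, low: float = -18.0, high: float = -9.0) -> bool:
    """True when the peak falls inside the recommended recording window."""
    return low <= peak_dbfs(samples) <= high


quiet_take = [0.02, -0.03, 0.01]   # peaks well below -18 dB
good_take = [0.10, -0.25, 0.20]    # peaks at roughly -12 dB
```

A full-scale sample (1.0) is 0 dBFS, so anything approaching that value is close to clipping.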

Read sample

Use this as the spoken script if you record.

Quiet room · 20 cm from mic · No clipping · Natural pace

If the microphone is blocked:

Chrome, Brave, Edge: click the lock/tune icon in the address bar, set Microphone to Allow, then reload.
Firefox: click the microphone or lock icon in the address bar, remove Blocked or choose Allow, then reload.
Safari: open Safari Settings, Websites, Microphone, then allow this site.
Requested device not found: choose or enable a microphone in your OS input settings, then reload.
Browsers require localhost or HTTPS for microphone access.

Load a file, paste a URL, or record a sample. Then trim, name, and save the voice.

Start

End

Select 3-20 seconds for best cloning.
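The 3-20 second guideline amounts to a simple range check on the trim selection. A minimal sketch, with a hypothetical function name and times in seconds:

```python
def check_trim(start_s: float, end_s: float,
               min_len: float = 3.0, max_len: float = 20.0):
    """Return (length, ok) where ok means the selection is 3-20 seconds long."""
    if start_s < 0 or end_s <= start_s:
        raise ValueError("end must be after a non-negative start")
    length = end_s - start_s
    return length, min_len <= length <= max_len


length, ok = check_trim(2.0, 10.0)
```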

2 Name and save voice

Language

Gender

Voice ID

Voices

Show disabled

About disabled voices: The Active toggle moves the complete voice package between active_voices and hidden_voices. Qwen3-TTS should scan only active_voices. To permanently remove a voice, delete its .wav and .reference.txt files directly.
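The toggle's effect on disk can be sketched as a move between the two folders. This is an approximation under stated assumptions: it handles only the .wav and .reference.txt files named above, although the real voice package may include more (for example an image or metadata), and the function name and folder root are hypothetical.

```python
import shutil
from pathlib import Path

# Only the two file types named in the note above; a real package may hold more.
PACKAGE_SUFFIXES = (".wav", ".reference.txt")


def set_active(voices_root, voice_id: str, active: bool):
    """Move a voice's files between active_voices and hidden_voices."""
    root = Path(voices_root)
    src_dir = root / ("hidden_voices" if active else "active_voices")
    dst_dir = root / ("active_voices" if active else "hidden_voices")
    dst_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for suffix in PACKAGE_SUFFIXES:
        src = src_dir / (voice_id + suffix)
        if src.exists():
            shutil.move(str(src), str(dst_dir / src.name))
            moved.append(src.name)
    return moved  # names of the files that were actually moved
```

After a move like this, the TTS container still needs a restart before its engine picks up the changed active_voices folder.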

Image

Language / Sex / Name

Filetype

Length

Benchmark

Rating

Play / Pause

Edit

TTS generation playground

Pick any reachable TTS backend, fetch its voices, then synthesize text. WAV/NVIDIA clone backends preserve reference identity; instruction-control backends follow style better.

Generate speech

Backend

Backend voice

Playback

After changing active voices, restart the TTS container so the engine reads the updated voice folder.

Target Text (text to synthesize)

Style Instruction (optional)

This is sent as instruct. In this setup, Voice Clone/Base and Streaming are fastest but usually preserve WAV identity more than they obey per-request style. CustomVoice and Voice Design are the style-aware choices.

STT -> TTS workspace

Upload speech audio, transcribe it with the configured STT endpoint, then synthesize the resulting text with any available TTS backend.

Source speech

Speech recognition

Uses Settings -> Whisper/STT URL by default.

Speech audio

Record or upload speech audio, then transcribe it with the selected recognition engine.

Source preview

No source audio loaded.

Transcribed text

Synthesize transcription

Backend

Backend voice

Playback

Style Instruction (optional)

Settings

Configure the service URLs you actually use first. Advanced payloads, folders, and keys are tucked away below.

Local stack

Quick setup

1. Check core TTS URLs.
2. Point STT at Whisper, Parakeet, or the NVIDIA router.
3. Save settings.

Core connections

These are the endpoints you change most often. Qwen3 TTS, NVIDIA TTS, and STT are grouped separately.

TTS (Text to Speech) - Qwen3 engines: clone, design, custom, streaming

Voice Clone/Base URL (WAV voices): Uploaded/cloned WAV voices. Expected: POST /v1/audio/speech.

Voice Design URL (instruction voices): Prompt-designed voices and vd_... virtual voices.

CustomVoice URL: Style over configured speakers such as Ryan, Vivian, Serena.

Streaming URL: Progressive low-latency WAV playback.

NVIDIA speech stack - router, Magpie TTS, Parakeet ASR, clone NIM

NVIDIA router URL (TTS + STT): OpenAI-compatible base URL for the NVIDIA speech router.

NVIDIA Magpie TTS URL: Direct Magpie endpoint. Fixed speakers, not WAV cloning.

NVIDIA Parakeet ASR URL: Direct Parakeet endpoint for transcription.

NVIDIA Zeroshot NIM URL: Magpie Zeroshot clone endpoint with audio_prompt.

NVIDIA Flow NIM URL: Magpie Flow clone endpoint with audio_prompt and transcript.

STT (Speech to Text) - Whisper, Parakeet, or NVIDIA router

Whisper/STT URL: Reference text recognition. Expected: POST /v1/audio/transcriptions.

Playback behavior

Small behavior switches for previews and OpenAI-compatible TTS calls.

TTS preview playback: Buffered keeps Save WAV available. Streaming starts sooner.

OpenAI-compatible request style: Controls payload shape for the normal TTS API URL.

Advanced request payloads - JSON extras sent to each backend

Voice Clone/Base params: Extra fields for the 8020 WAV voice clone/base model.

Streaming params: Extra fields for the 8023 streaming model.

CustomVoice params: Extra fields for the CustomVoice backend.

Voice Design params: Extra fields for Voice Design and virtual vd_... voices.

NVIDIA Magpie params: Usually empty. Magpie accepts fixed speaker voices such as sofia, aria, jason, leo, and john.

NVIDIA Zeroshot params: Optional multipart fields. The app supplies text, language, and audio_prompt.

NVIDIA Flow params: Optional multipart fields. The app also sends the saved reference transcript.
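Conceptually, these JSON extras are merged into the request the app builds for a backend. The sketch below shows one plausible merge. It assumes that fields supplied by the app win over extras with the same key, which is a guess and not documented behaviour. The function name and the temperature field are hypothetical.

```python
import json


def merge_extras(base_payload: dict, extras_json: str) -> dict:
    """Merge a JSON-object string of extra fields into a request payload."""
    extras = json.loads(extras_json) if extras_json.strip() else {}
    if not isinstance(extras, dict):
        raise ValueError("extras must be a JSON object")
    # Assumption: app-supplied fields take precedence over extras.
    return {**extras, **base_payload}


payload = merge_extras(
    {"input": "Hello", "voice": "EN_F_Anna"},   # fields the app supplies
    '{"temperature": 0.7}',                     # hypothetical extra field
)
```

An empty settings box is treated as no extras, which matches the note that some parameter fields are usually empty.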

Voice folders - container paths and Portainer volume mounts

Voice scan directory: Contains active_voices, hidden_voices, sounds, and metadata.

Active voices directory: New cloned/exported voices are saved here.

API keys - usually empty for local containers

TTS API key (optional)

Voice Design API key (optional)

Whisper API key (optional)
