TTS Providers
Configure text-to-speech for voiceover generation — ElevenLabs, Inworld, Kokoro (free local), Gemini TTS, and OpenAI TTS.
The TTS provider generates the voiceover audio and word-level timestamps that drive karaoke-style animated captions. Every TTS provider in OpenReels produces both an audio file and a WordTimestamp[] array mapping each word to its start/end time in the audio.
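As a rough illustration of what that array can look like, here is a minimal sketch. The field names (`word`, `start`, `end`), the use of seconds, and the `activeWordAt` helper are assumptions for illustration; the text above only states that each word maps to a start and end time.

```typescript
// Hypothetical shape of one caption entry; field names and units are assumed.
interface WordTimestamp {
  word: string;  // the spoken word as it appears in the script
  start: number; // assumed: seconds from the start of the audio
  end: number;   // assumed: seconds from the start of the audio
}

// Returns the word being spoken at time t, or undefined during a pause.
function activeWordAt(words: WordTimestamp[], t: number): WordTimestamp | undefined {
  return words.find((w) => t >= w.start && t < w.end);
}

const sampleWords: WordTimestamp[] = [
  { word: "Hello", start: 0.0, end: 0.42 },
  { word: "world", start: 0.48, end: 0.9 },
];
```

A karaoke-style caption renderer could call something like `activeWordAt` once per frame to decide which word to highlight.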
Supported providers
| Provider | Default model/voice | Env var | Flag value | Cost |
|---|---|---|---|---|
| ElevenLabs | Multilingual v2 | ELEVENLABS_API_KEY | elevenlabs | ~$0.18/1K chars |
| Inworld | TTS-1.5 Max, "Dennis" | INWORLD_TTS_API_KEY | inworld | ~$0.01/1K chars |
| Kokoro | 82M local model, "af_heart" | none | kokoro | Free |
| Gemini TTS | gemini-2.5-flash-preview-tts, "Kore" | GOOGLE_API_KEY | gemini-tts | ~$0.02/1K chars |
| OpenAI TTS | gpt-4o-mini-tts, "alloy" | OPENAI_API_KEY | openai-tts | ~$0.05/1K chars |
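The table can also be read as a lookup from flag value to required environment variable. The sketch below encodes it that way; the type, map, and function names are hypothetical and are not part of OpenReels.

```typescript
// Flag values and env var names are copied from the table above.
// TtsProviderFlag, REQUIRED_ENV_VAR, and missingKeyFor are hypothetical names.
type TtsProviderFlag = "elevenlabs" | "inworld" | "kokoro" | "gemini-tts" | "openai-tts";

const REQUIRED_ENV_VAR: Record<TtsProviderFlag, string | null> = {
  elevenlabs: "ELEVENLABS_API_KEY",
  inworld: "INWORLD_TTS_API_KEY",
  kokoro: null, // local model, no key needed
  "gemini-tts": "GOOGLE_API_KEY",
  "openai-tts": "OPENAI_API_KEY",
};

// Returns the name of the missing env var, or null if nothing is missing.
function missingKeyFor(
  provider: TtsProviderFlag,
  env: Record<string, string | undefined>,
): string | null {
  const name = REQUIRED_ENV_VAR[provider];
  if (name === null) return null;
  return env[name] ? null : name;
}
```

In Node, passing `process.env` as the second argument would check the current environment before starting a run.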
Usage
pnpm start "topic" --tts-provider elevenlabs
pnpm start "topic" --tts-provider inworld
pnpm start "topic" --tts-provider kokoro
pnpm start "topic" --tts-provider gemini-tts
pnpm start "topic" --tts-provider openai-tts
The --provider local shortcut sets TTS to Kokoro automatically:
pnpm start "topic" --provider local
Timestamps and the alignment system
Animated captions require word-level timestamps — the exact start and end time of every word in the voiceover audio. Providers handle this differently:
Native timestamps (no extra processing):
- ElevenLabs: Returns character-level timestamps with the audio. OpenReels aggregates these into word timestamps.
- Inworld: Returns word-level timestamps directly via WORD timestamp mode.
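Aggregating character-level timings into word timings can be sketched as below. This is one plausible approach under stated assumptions, not the OpenReels implementation: the `CharTiming` input shape is hypothetical (it is not the ElevenLabs response format), and words are split on whitespace only.

```typescript
// Hypothetical input: one entry per character with its own start/end time.
interface CharTiming { char: string; start: number; end: number; }
// Assumed output shape for word-level timestamps.
interface WordTimestamp { word: string; start: number; end: number; }

// Groups consecutive non-whitespace characters into words.
// A word starts at its first character's start and ends at its last character's end.
function charsToWords(chars: CharTiming[]): WordTimestamp[] {
  const words: WordTimestamp[] = [];
  let current: WordTimestamp | null = null;
  for (const c of chars) {
    if (/\s/.test(c.char)) {
      // Whitespace closes the word in progress, if any.
      if (current) words.push(current);
      current = null;
    } else if (current) {
      current.word += c.char;
      current.end = c.end;
    } else {
      current = { word: c.char, start: c.start, end: c.end };
    }
  }
  if (current) words.push(current);
  return words;
}
```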
Whisper alignment (automatic post-processing):
- Kokoro, Gemini TTS, and OpenAI TTS do not return timestamps. These providers are wrapped in the AlignedTTSProvider decorator, which:
  - Generates the audio from the inner provider
  - Runs the audio through a local Whisper model (whisper-small.en_timestamped) for automatic speech recognition
  - Aligns the Whisper output to the known transcript using greedy window matching with substring fallback
  - Transcodes WAV audio to MP3 (the pipeline's standard format)
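The four decorator steps can be pictured with the sketch below. Every interface and name here (`UntimedTTS`, `AlignmentHooks`, `AlignedTTSProviderSketch`, the method names) is hypothetical and stands in for code this page does not show; only the order of the steps comes from the list above.

```typescript
interface WordTimestamp { word: string; start: number; end: number; }
interface RawAudio { audio: Uint8Array; format: "mp3" | "wav"; }
interface TTSResult extends RawAudio { words: WordTimestamp[]; }

// A provider that returns audio only, with no timestamps (assumed interface).
interface UntimedTTS { synthesize(text: string): Promise<RawAudio>; }

// Hypothetical hooks standing in for the Whisper, alignment, and transcoding steps.
interface AlignmentHooks {
  transcribe(audio: Uint8Array): Promise<WordTimestamp[]>;
  align(script: string, heard: WordTimestamp[]): WordTimestamp[];
  toMp3(wav: Uint8Array): Promise<Uint8Array>;
}

class AlignedTTSProviderSketch {
  private readonly inner: UntimedTTS;
  private readonly hooks: AlignmentHooks;

  constructor(inner: UntimedTTS, hooks: AlignmentHooks) {
    this.inner = inner;
    this.hooks = hooks;
  }

  async generate(text: string): Promise<TTSResult> {
    const raw = await this.inner.synthesize(text);        // 1. audio from the inner provider
    const heard = await this.hooks.transcribe(raw.audio); // 2. speech recognition
    const words = this.hooks.align(text, heard);          // 3. align to the known transcript
    const audio = raw.format === "wav"                    // 4. transcode WAV to MP3
      ? await this.hooks.toMp3(raw.audio)
      : raw.audio;
    return { audio, format: "mp3", words };
  }
}
```

Wrapping rather than modifying each provider keeps the timestamp logic in one place, which is presumably why a decorator is used.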
The Whisper model (~460MB) auto-downloads from HuggingFace on first use. Alignment adds a few seconds of processing time per voiceover but produces reliable word-level timestamps.
If Whisper produces zero usable words, the pipeline hard fails rather than producing broken captions. This is intentional — interpolated timestamps without anchor points create unwatchable results.
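To make "greedy window matching with substring fallback" and the hard-fail rule more concrete, here is a minimal sketch of one way such an aligner could work. It is not the OpenReels implementation: the window size, the normalization rule, and all names are assumptions, and interpolation of unmatched words is left out.

```typescript
interface WordTimestamp { word: string; start: number; end: number; }

// Lowercases and strips everything except ASCII letters and digits (assumed rule).
const normalizeWord = (w: string): string => w.toLowerCase().replace(/[^a-z0-9]/g, "");

// Returns one entry per script word; null marks a word that found no anchor.
function greedyAlign(
  scriptWords: string[],
  heard: WordTimestamp[],
  windowSize = 5, // assumed look-ahead; the real value is not documented here
): (WordTimestamp | null)[] {
  let cursor = 0; // index of the first Whisper word not yet consumed
  const result = scriptWords.map((word): WordTimestamp | null => {
    const target = normalizeWord(word);
    const candidates = heard.slice(cursor, cursor + windowSize);
    // First try an exact match inside the window.
    let idx = candidates.findIndex((h) => normalizeWord(h.word) === target);
    if (idx === -1 && target.length > 0) {
      // Fallback: accept a candidate that contains, or is contained in, the target.
      idx = candidates.findIndex((h) => {
        const c = normalizeWord(h.word);
        return c.length > 0 && (c.includes(target) || target.includes(c));
      });
    }
    if (idx === -1) return null;
    const hit = candidates[idx];
    cursor += idx + 1; // greedy: never look back past a consumed word
    return { word, start: hit.start, end: hit.end };
  });
  // Hard fail: with no anchors at all, any timestamps would be guesses.
  if (result.every((r) => r === null)) {
    throw new Error("No usable words from speech recognition; cannot align captions");
  }
  return result;
}
```

For example, the script word "believing" would still anchor to a recognized "believin" through the substring fallback, while an empty recognition result raises an error instead of returning guessed timings.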
Provider details
ElevenLabs
The highest quality option. Uses the Multilingual v2 model with character-level timestamp alignment built in. Supports multiple languages.
- Audio format: MP3
- Timestamps: Native (character-level, aggregated to words)
- Voice: Configurable via voice ID (default: yl2ZDV1MzN4HbQJbMihG)
- Limit: No hard character limit
ELEVENLABS_API_KEY=your-key-here
Inworld
Low-cost option with native word timestamps. Uses the TTS-1.5 Max model.
- Audio format: MP3
- Timestamps: Native (word-level)
- Voice: "Dennis" (configurable)
- Limit: 2,000 characters per request. Longer scripts need a different provider.
INWORLD_TTS_API_KEY=your-key-here
Kokoro (free, local)
Zero-cost local TTS using the Kokoro 82M model via kokoro-js. Runs entirely on your machine — no API key, no network calls. The model (~86MB) auto-downloads from HuggingFace on first run.
Kokoro runs in a subprocess to avoid ONNX runtime conflicts. The AlignedTTSProvider decorator handles timestamp extraction via Whisper.
- Audio format: WAV (transcoded to MP3)
- Timestamps: Whisper alignment
- Voice: Configurable with --kokoro-voice (default: af_heart)
- Quality: Good for English. Not as natural as ElevenLabs but entirely free.
pnpm start "topic" --tts-provider kokoro --kokoro-voice af_heart
Gemini TTS
Uses the Gemini 2.5 Flash TTS model. Reuses your existing GOOGLE_API_KEY — no additional key needed. Included automatically when using --provider google.
- Audio format: WAV (raw 24kHz 16-bit mono PCM, transcoded to MP3)
- Timestamps: Whisper alignment
- Voice: "Kore" (hardcoded)
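Raw PCM has no header, so it typically needs a WAV container before other tools can read it. The sketch below shows the standard 44-byte RIFF/WAVE header for the format listed above (24 kHz, 16-bit, mono). The function name is hypothetical, and whether OpenReels builds the header this way is an assumption.

```typescript
// Wraps raw little-endian PCM bytes in a 44-byte RIFF/WAVE header.
// Defaults match the format listed above: 24 kHz, mono, 16-bit.
function pcmToWav(
  pcm: Uint8Array,
  sampleRate = 24000,
  channels = 1,
  bitsPerSample = 16,
): Uint8Array {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size for PCM
  view.setUint16(20, 1, true);              // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true);     // size of the sample data
  out.set(pcm, 44);
  return out;
}
```

At these settings one second of audio is 48,000 bytes of PCM, which gives a quick sanity check on clip duration.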
OpenAI TTS
Uses gpt-4o-mini-tts. Reuses your existing OPENAI_API_KEY.
- Audio format: WAV (transcoded to MP3)
- Timestamps: Whisper alignment
- Voice: "alloy" (hardcoded)
Cost comparison
For a typical 800-character script:
| Provider | Cost | Quality |
|---|---|---|
| Kokoro | Free | Good (English) |
| Inworld | ~$0.008 | Good |
| Gemini TTS | ~$0.016 | Good |
| OpenAI TTS | ~$0.040 | Very good |
| ElevenLabs | ~$0.144 | Excellent |
ElevenLabs costs roughly 18x more than Inworld but delivers noticeably more natural-sounding speech. Kokoro is the only option that costs nothing.
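The figures in the comparison follow from the approximate per-1K-character rates in the provider table. A small sketch of that arithmetic, with hypothetical names and the rates copied from that table:

```typescript
// Approximate rates in USD per 1,000 characters, copied from the provider table.
const RATE_PER_1K_CHARS: Record<string, number> = {
  kokoro: 0,
  inworld: 0.01,
  "gemini-tts": 0.02,
  "openai-tts": 0.05,
  elevenlabs: 0.18,
};

// Estimated cost in USD for a script of the given length.
function estimateCost(provider: string, charCount: number): number {
  const rate = RATE_PER_1K_CHARS[provider];
  if (rate === undefined) throw new Error(`Unknown provider: ${provider}`);
  return (charCount / 1000) * rate;
}

// Print the estimate for an 800-character script, rounded to three decimals.
for (const provider of Object.keys(RATE_PER_1K_CHARS)) {
  console.log(provider, estimateCost(provider, 800).toFixed(3));
}
```

Since the rates are approximate, the results are estimates rather than billing figures.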