Voice / TTS Architecture
Four-tier text-to-speech fallback chain. Best-quality cloud voices when online; bundled neural models when not. Used by Prism AAC, in-call notifications, accessibility narration, and any app feature that speaks.
Four-Tier Fallback Chain
| Tier | Engine | Quality | Offline | Cost | When |
|---|---|---|---|---|---|
| 1 | Inworld TTS-2 (cloud) | Best-in-class: natural prosody, 60+ voices, voice cloning | No | Per-character billing; free for ro/uk/ru/de/ko/ar (Synalux absorbs cost) | Default for paid tiers; default for free tier in subsidized languages |
| 1.5 | Kokoro-82M neural (WASM) | Very good: locally run neural voice | Yes (en/es/fr/pt/ja/zh) | $0 | Free tier in non-subsidized languages; offline; Tier 1 unavailable |
| 2 | OS Web Speech API premium voices | Good: varies by OS (Apple voices best) | Yes | $0 | Tier 1.5 unavailable in the user's language; bandwidth-saver mode |
| 3 | WASM espeak-ng | Acceptable: robotic but always works | Yes | $0 | Last resort; covers 100+ languages |

The chain is automatic: no user configuration is required. The voice picker (paid) lets users override their default with any Tier 1 voice.
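The tier order in the table can be read as a single decision function. The following is a minimal sketch, not the service's implementation: `RouteInput`, `Route`, `chooseTier`, and `listHas` are hypothetical names, Tier 1 is assumed to be reachable whenever the device is online, and the paid check for the voice picker is assumed to happen before this function is called. The two language lists are copied from the table.

```typescript
// Sketch only: type and function names are hypothetical, not the service's API.
type TierId = "1" | "1.5" | "2" | "3";

interface RouteInput {
  lang: string;                        // language code such as "en" or "de"
  paid: boolean;                       // true on any paid plan
  online: boolean;                     // assumption: Tier 1 is reachable whenever online
  osVoiceLangs: ReadonlyArray<string>; // languages with an OS Web Speech voice
  voice?: string;                      // explicit Tier 1 voice from the voice picker
}

interface Route {
  tier: TierId;
  voice: string | null; // null means "use the engine's default voice"
}

// Language lists copied from the tier table.
const SUBSIDIZED_LANGS: ReadonlyArray<string> = ["ro", "uk", "ru", "de", "ko", "ar"];
const KOKORO_LANGS: ReadonlyArray<string> = ["en", "es", "fr", "pt", "ja", "zh"];

function listHas(list: ReadonlyArray<string>, lang: string): boolean {
  return list.indexOf(lang) !== -1;
}

function chooseTier(input: RouteInput): Route {
  // An explicit voice override wins whenever the cloud tier is reachable.
  if (input.voice !== undefined && input.online) {
    return { tier: "1", voice: input.voice };
  }
  // Cloud voices for paid plans, or for free users in a subsidized language.
  if (input.online && (input.paid || listHas(SUBSIDIZED_LANGS, input.lang))) {
    return { tier: "1", voice: null };
  }
  // Bundled neural model, limited to the languages Kokoro covers.
  if (listHas(KOKORO_LANGS, input.lang)) {
    return { tier: "1.5", voice: null };
  }
  // OS voices, if the platform has one for this language.
  if (listHas(input.osVoiceLangs, input.lang)) {
    return { tier: "2", voice: null };
  }
  // espeak-ng is the unconditional last resort.
  return { tier: "3", voice: null };
}
```

Under these assumptions, a free-tier user who is online and requests `de` lands on Tier 1, while the same user requesting `en` lands on Tier 1.5. Because the final branch has no condition, the function always returns a tier, which is the property the chain is meant to guarantee.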
Voice Cloning (Paid)
"Speak in MY voice. I trained it last week."
- 3-minute training sample uploaded once via the Voice Picker UI.
- Inworld voice-clone training takes ~10 minutes; the cloned voice is then available alongside the standard 60+ voices.
- Workspace-scoped: clones are bound to the workspace; staff voices are auto-shared, patient voices are private.
- Use cases: AAC users speak in a parent's voice; clinicians who prefer to hear notes in their own voice; demo videos for parents/caregivers.
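The workspace scoping rule can be expressed as a small access check. This is a sketch under stated assumptions: the `ClonedVoice` and `Viewer` shapes and the `canUseClone` function are hypothetical, and "private" is assumed to mean usable only by the account that owns the clone. The actual sharing model may be richer than this.

```typescript
// Sketch only: these shapes are hypothetical, not the platform's data model.
type OwnerRole = "staff" | "patient";

interface ClonedVoice {
  id: string;
  workspaceId: string; // clones are bound to one workspace
  ownerId: string;
  ownerRole: OwnerRole;
}

interface Viewer {
  userId: string;
  workspaceId: string;
}

function canUseClone(voice: ClonedVoice, viewer: Viewer): boolean {
  // A clone never leaves the workspace it was trained in.
  if (voice.workspaceId !== viewer.workspaceId) {
    return false;
  }
  // Staff voices are auto-shared with everyone in the workspace.
  if (voice.ownerRole === "staff") {
    return true;
  }
  // Assumption: a patient voice is usable only by its owner.
  return voice.ownerId === viewer.userId;
}
```

Checking the workspace first keeps the rule fail-closed: a staff voice is shared widely, but never across workspace boundaries.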

Subsidized Languages (Free Tier)
Synalux absorbs the Inworld TTS-2 cost for these languages on the free tier, so that users in regions where the platform serves underserved populations get premium voices at no cost:
ro (Romanian) · uk (Ukrainian) · ru (Russian) · de (German) · ko (Korean) · ar (Arabic)
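Clients often report a full locale tag such as `de-AT` rather than a bare language code, so a subsidy check probably needs to reduce the tag to its primary language first. The sketch below shows one way to do that. `primaryLanguage` and `isSubsidized` are hypothetical helpers, and treating every regional variant of a subsidized language as subsidized is an assumption, not something the list above states.

```typescript
// Language codes copied from the subsidized list; helper names are hypothetical.
const SUBSIDIZED: ReadonlyArray<string> = ["ro", "uk", "ru", "de", "ko", "ar"];

// Reduce a tag such as "de-AT" or "uk_UA" to its primary language subtag.
function primaryLanguage(tag: string): string {
  return tag.trim().toLowerCase().replace(/[-_].*$/, "");
}

// Assumption: regional variants inherit the subsidy of their base language.
function isSubsidized(tag: string): boolean {
  return SUBSIDIZED.indexOf(primaryLanguage(tag)) !== -1;
}
```

Note that `uk` is the language code for Ukrainian, not a region: `isSubsidized("uk-UA")` is true under this sketch, while `isSubsidized("en-GB")` is false.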
Why This Matters Clinically
- Children with speech impairments depend on AAC's TTS as their actual voice. Robotic Tier 3 fallbacks aren't acceptable for daily use; Tier 1 / 1.5 must succeed.
- Bilingual households need accurate prosody in both languages, so Tier 1 is critical for non-English speech.
- Offline reliability: a child at school without Wi-Fi must still have their device speak. Tiers 1.5, 2, and 3 are bundled, so the device always talks.
Architecture
POST /api/v1/tts Generate TTS audio (auto tier-route based on lang + tier + connectivity)
POST /api/v1/tts/public Anonymous TTS for AAC widgets (rate-limited per IP)
GET /api/v1/tts/voices List available voices for the user's tier + language
POST /api/v1/voices/clone Submit a voice-clone training sample (paid)

Routing logic:
synthesize(text, lang, voice?) →
  if voice && Tier1.available(voice): return Tier1.speak(text, voice)
  if Tier1.available(lang) && (paid || Tier1.subsidizes(lang)): return Tier1.speak(text)
  if Tier1_5.supports(lang): return Tier1_5.speak(text) // Kokoro
  if Tier2.has(lang): return Tier2.speak(text) // Web Speech API
  return Tier3.speak(text) // espeak-ng

Plans
| Feature | Free | Standard | Advanced | Enterprise |
|---|---|---|---|---|
| Default Inworld voice | โ | โ | โ | โ |
| All 60+ Inworld voices | โ | โ | โ | โ |
| Voice cloning | โ | โ | โ | โ |
| In-call clinical dictation TTS feedback | โ | โ | โ | โ |
| Custom voice library (workspace-curated) | โ | โ | โ | โ |
Where TTS Is Used
- Prism AAC: picture tiles to speech, keyboard speak button, math panel AI tutor.
- In-call notifications: meeting started, patient joined, recording started.
- Accessibility narration: for users with low vision; reads notifications and form errors.
- AAC chat: assistant responses can be auto-spoken in the user's selected voice.
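Each of these features ultimately hands text to one of the TTS endpoints listed under Architecture. The sketch below shows how a caller might assemble such a request. Only the two paths come from the endpoint list. The JSON body, which mirrors the `synthesize(text, lang, voice?)` signature, the `Content-Type` header, and the names `TtsRequestBody`, `HttpCall`, and `buildTtsCall` are all assumptions made for illustration.

```typescript
// Sketch only: the body shape and helper names are hypothetical.
interface TtsRequestBody {
  text: string;
  lang: string;
  voice?: string; // optional Tier 1 voice override
}

interface HttpCall {
  method: "POST";
  path: string;
  headers: Record<string, string>;
  body: string;
}

function buildTtsCall(body: TtsRequestBody, anonymous: boolean): HttpCall {
  // Reject empty input early rather than sending a request that cannot speak.
  if (body.text.trim().length === 0) {
    throw new Error("text must not be empty");
  }
  return {
    method: "POST",
    // Anonymous AAC widgets use the rate-limited public endpoint.
    path: anonymous ? "/api/v1/tts/public" : "/api/v1/tts",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  };
}
```

The returned object is deliberately transport-agnostic, so it could be passed to whichever HTTP client a given feature already uses.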