# Configure Transcription Action
Change the STT (Speech-to-Text) provider and/or recognition language(s) during an active call session without hanging up.
## Action Structure
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM",
  "languages": ["en-US"]
}
```

## Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | string | Yes | — | Always "configure_transcription" |
| session_id | string (UUID) | Yes | — | Session identifier from the triggering event |
| provider | string | No | Current provider | STT provider to switch to. Valid values: "AZURE", "DEEPGRAM", "ELEVEN_LABS". Omitting keeps the current provider. |
| languages | string[] | No | Provider default | BCP-47 language codes (1–4 entries). Fully replaces the current config. Omitting resets to the provider default (auto-detection). |
At least one of provider or languages should be provided; sending neither is a no-op.
## Behavioral Details
### Full Replace Semantics
Both provider and languages use full replace semantics — they never merge with existing settings.
| provider field | languages field | Result |
|---|---|---|
| Provided | Provided | Switches to new provider with specified languages |
| Provided | Omitted | Switches to new provider; languages reset to [] (default) |
| Omitted | Provided | Keeps current provider; languages fully replaced |
| Omitted | Omitted | No-op (transcription unchanged) |
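The full-replace semantics above can be sketched as a small builder that includes a field only when you intend to change it. The helper name `buildConfigureTranscription` is our own illustration, not part of the API:

```javascript
// Sketch: build a configure_transcription action. Field names match the
// action schema above; the helper itself is a hypothetical convenience.
function buildConfigureTranscription(sessionId, { provider, languages } = {}) {
  const action = { type: 'configure_transcription', session_id: sessionId };
  // Full replace semantics: include a field only when you want to change it.
  // An included field never merges with existing settings.
  if (provider !== undefined) action.provider = provider;
  if (languages !== undefined) action.languages = languages;
  return action;
}

// Provider omitted: current provider is kept, languages are fully replaced.
const action = buildConfigureTranscription(
  '550e8400-e29b-41d4-a716-446655440000',
  { languages: ['de-DE'] }
);
```

Because omitting a field means "keep" (for provider) or "reset" (for languages), building the payload conditionally like this avoids accidentally resetting languages when you only meant to switch providers.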
### Brief Audio Gap During Restart
Any change — language or provider — requires the transcription engine to restart. Audio received during the restart is dropped and will not appear in any user_speak event.
| Change type | Typical gap |
|---|---|
| Language change only | ~100–500 ms |
| Provider switch | ~200–800 ms |
Design your call flow to trigger changes at natural pause points (e.g., after the assistant finishes speaking) to minimize the impact of the gap.
### Barge-In Latency After Provider Switch
Each provider has different Voice Activity Detection (VAD) characteristics. Switching providers may change barge-in latency for the immediate strategy:
| Provider | Approximate barge-in latency |
|---|---|
| Azure | ~20–80 ms |
| Deepgram | ~20–100 ms |
| ElevenLabs | ~30–120 ms |
## Compatible Channels
The configure_transcription action is accepted on all three delivery channels:
- HTTP webhook response
- Client-transport WebSocket
- External API POST
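For the External API POST channel, a request might be assembled as below. The base URL, path, and auth header here are placeholders, not the platform's documented endpoint; consult your API reference for the real values:

```javascript
// Sketch: prepare an External API POST delivering the action.
// API_BASE and the /v1/sessions/:id/actions path are hypothetical.
const API_BASE = 'https://api.example.com';

function buildActionRequest(sessionId, action) {
  return {
    url: `${API_BASE}/v1/sessions/${sessionId}/actions`, // hypothetical path
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: 'Bearer <API_KEY>', // placeholder credential
      },
      body: JSON.stringify(action),
    },
  };
}

// Usage: const { url, options } = buildActionRequest(id, action);
//        await fetch(url, options);
```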
## Multi-Language Support per Provider
Not all providers support simultaneous multi-language detection. When more than one language code is supplied, providers that only accept a single language will silently use the first entry and ignore the rest.
| Provider value | Multi-language support | Notes |
|---|---|---|
"AZURE" | ✅ Up to 4 languages | All entries used for Language Identification (LID) |
"DEEPGRAM" | ❌ Single language only | Only the first entry is used; rest are ignored |
"ELEVEN_LABS" | ❌ Single language only | Only the first entry is used; rest are ignored |
Recommendation: When targeting Deepgram or ElevenLabs, always supply exactly one language code. Supplying multiple codes will not cause an error, but only the first will take effect.
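One way to follow this recommendation is to clamp the languages array before sending, per the table above. The helper name `normalizeLanguages` and the provider set are our own sketch:

```javascript
// Sketch: trim the languages array to what each provider supports.
// Only Azure accepts multiple codes (up to 4, for Language Identification).
const MULTI_LANGUAGE_PROVIDERS = new Set(['AZURE']);

function normalizeLanguages(provider, languages) {
  if (!Array.isArray(languages) || languages.length === 0) return languages;
  if (MULTI_LANGUAGE_PROVIDERS.has(provider)) return languages.slice(0, 4);
  // Deepgram and ElevenLabs use only the first entry anyway;
  // sending exactly one code makes the intent explicit.
  return languages.slice(0, 1);
}
```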
## Examples
### Change Language Only (Keep Current Provider)
Switch an active session to German transcription:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "languages": ["de-DE"]
}
```

### Switch Provider Only (Languages Reset to Default)
Switch from Azure to Deepgram; languages reset to auto-detection:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM"
}
```

### Switch Provider and Language Simultaneously
Switch to ElevenLabs and set English as the recognition language:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "ELEVEN_LABS",
  "languages": ["en-US"]
}
```

### Use Multiple Languages Simultaneously
Enable multi-language detection for German and English:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "languages": ["de-DE", "en-US"]
}
```

Up to 4 language codes may be provided in a single request.
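Before sending, the array can be sanity-checked against the 1–4 entry limit. The function name is ours, and the regex is a loose BCP-47 shape check (e.g. "de" or "de-DE"), not a full parser:

```javascript
// Sketch: validate a languages array before sending (1-4 BCP-47 codes).
function validateLanguages(languages) {
  if (!Array.isArray(languages) || languages.length < 1 || languages.length > 4) {
    return false;
  }
  // Loose shape check: 2-3 lowercase letters, optional -XX region subtag.
  return languages.every((code) => /^[a-z]{2,3}(-[A-Z]{2})?$/.test(code));
}
```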
### Reset to Provider Default
Omit languages to restore automatic language detection:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
```

### Switching Language Based on User Input
A common pattern: detect the caller's preferred language from their first utterance, then reconfigure transcription mid-call.
```javascript
const express = require('express');

const app = express();
app.use(express.json());

app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    // Start with multi-language detection
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'Hello! Guten Tag! Please speak in your preferred language.',
    });
  }

  if (event.type === 'user_speak') {
    const detectedLanguage = event.language; // BCP-47 code from STT
    if (detectedLanguage && detectedLanguage.startsWith('de')) {
      // Caller is speaking German — lock transcription to German only
      return res.json({
        type: 'configure_transcription',
        session_id: event.session.id,
        languages: ['de-DE'],
      });
    }
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`,
    });
  }

  // Acknowledge any other event types so the request doesn't hang
  return res.sendStatus(204);
});
```

### Provider Fallback Pattern
Switch to a backup provider if the primary fails or for specific call scenarios:
```javascript
// Inside a webhook handler: switch to Deepgram for better handling
// of a specific language/accent
return res.json({
  type: 'configure_transcription',
  session_id: event.session.id,
  provider: 'DEEPGRAM',
  languages: ['en-US'],
});
```

## Next Steps
- Actions Overview - Complete action reference
- Event Types - What events carry transcribed text
- Barge-In Configuration - Control how users interrupt the assistant