# Configure Transcription Action
Change the STT (Speech-to-Text) provider and/or recognition language(s) during an active call session without hanging up.
## Action Structure
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM",
  "languages": ["en-US"]
}
```

### Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | string | Yes | — | Always "configure_transcription" |
| session_id | string (UUID) | Yes | — | Session identifier from the triggering event |
| provider | string | No | Current provider | STT provider to switch to. Valid values: "AZURE", "DEEPGRAM", "ELEVEN_LABS". Omitting keeps the current provider. |
| languages | string[] | No | Provider default | BCP-47 language codes (1–4 entries). Fully replaces the current config. Omitting while switching provider resets to the provider default (auto-detection); omitting without a provider change leaves languages unchanged. |
| custom_vocabulary | string[] | No | — | Words or phrases to boost STT recognition accuracy. Max 100 entries, max 200 characters per entry. Fully replaces the current session-level vocabulary. Merged with client-level vocabulary configured during onboarding. Supported by Azure, Deepgram, and ElevenLabs. |
| vad | object | No | Current setting | Voice-activity detection tuning, applied for the rest of the session. See VAD Configuration. |
At least one of provider, languages, custom_vocabulary, or vad should be provided; sending none of them is a no-op.
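The field rules above can be enforced client-side before an action is sent. The following is a minimal sketch; `buildConfigureTranscription` is an illustrative helper, not part of the API.

```javascript
// Build a configure_transcription action, enforcing the documented
// constraints: at least one optional field, and 1-4 language codes.
function buildConfigureTranscription(sessionId, fields = {}) {
  const { provider, languages, customVocabulary, vad } = fields;
  if (provider === undefined && languages === undefined &&
      customVocabulary === undefined && vad === undefined) {
    throw new Error('Provide at least one of provider, languages, custom_vocabulary, or vad');
  }
  const action = { type: 'configure_transcription', session_id: sessionId };
  if (provider !== undefined) action.provider = provider;
  if (languages !== undefined) {
    if (languages.length < 1 || languages.length > 4) {
      throw new Error('languages must contain 1-4 BCP-47 codes');
    }
    action.languages = languages;
  }
  if (customVocabulary !== undefined) action.custom_vocabulary = customVocabulary;
  if (vad !== undefined) action.vad = vad;
  return action;
}
```

Throwing early keeps accidental no-op requests (all optional fields omitted) out of your call flow.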
## Configuring VAD Session-Wide
Use this action to set or change VAD parameters for the entire remaining session (equivalent to setting vad on every subsequent speak).
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "vad": {
    "end_of_turn_silence_ms": 1200
  }
}
```

Out-of-range or invalid values are silently ignored.
## Behavioral Details

### Full Replace Semantics
Both provider and languages use full replace semantics — they never merge with existing settings.
| provider field | languages field | Result |
|---|---|---|
| Provided | Provided | Switches to new provider with specified languages |
| Provided | Omitted | Switches to new provider; languages reset to [] (default) |
| Omitted | Provided | Keeps current provider; languages fully replaced |
| Omitted | Omitted | No-op (transcription unchanged) |
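A practical consequence of full-replace semantics: to keep the current languages across a provider switch, you must re-send them explicitly. The sketch below assumes your application tracks `currentLanguages` in its own state.

```javascript
// Switch provider without losing the active language configuration.
// Omitting languages here would reset them to the new provider's default.
function switchProviderKeepLanguages(sessionId, newProvider, currentLanguages) {
  const action = {
    type: 'configure_transcription',
    session_id: sessionId,
    provider: newProvider,
  };
  if (currentLanguages && currentLanguages.length > 0) {
    action.languages = currentLanguages; // explicit re-send preserves them
  }
  return action;
}
```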
### Custom Vocabulary
Pass a custom_vocabulary array to boost recognition of domain-specific terms, product names, proper nouns, or technical terms your callers are likely to use.
- Entries are matched case-insensitively during deduplication and merged with client-level vocabulary.
- Multi-word phrases (e.g. "SIP-Trunk") are supported by all providers.
- If omitted, the current session vocabulary is kept unchanged.
- Max 100 entries; max 200 characters per entry.
Supported providers: Azure, Deepgram, ElevenLabs
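Vocabulary lists coming from user input or a CMS often need cleanup before sending. A minimal sketch mirroring the documented limits (case-insensitive dedupe, at most 100 entries, at most 200 characters each):

```javascript
// Sanitize a vocabulary list to match the documented constraints.
function sanitizeVocabulary(entries) {
  const seen = new Set();
  const out = [];
  for (const entry of entries) {
    const trimmed = entry.trim();
    if (trimmed.length === 0 || trimmed.length > 200) continue; // drop empty/over-long
    const key = trimmed.toLowerCase();
    if (seen.has(key)) continue; // case-insensitive duplicate
    seen.add(key);
    out.push(trimmed);
    if (out.length === 100) break; // hard cap on entry count
  }
  return out;
}
```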
### Brief Audio Gap During Restart
Any change — language or provider — requires the transcription engine to restart. Audio received during the restart is dropped and will not appear in any user_speak event.
| Change type | Typical gap |
|---|---|
| Language change only | ~100–500 ms |
| Provider switch | ~200–800 ms |
Design your call flow to trigger changes at natural pause points (e.g., after the assistant finishes speaking) to minimize the impact of the gap.
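One way to hit such a pause point is to queue the change and flush it when the assistant stops talking. How you detect "assistant finished speaking" is an assumption of this sketch; wire `onAssistantIdle` to whatever end-of-playback signal your integration exposes.

```javascript
// Defer a transcription change until a natural pause in the conversation.
const pendingReconfig = new Map(); // session_id -> queued action

function queueReconfigure(action) {
  pendingReconfig.set(action.session_id, action); // last queued change wins
}

function onAssistantIdle(sessionId, send) {
  const action = pendingReconfig.get(sessionId);
  if (!action) return false;
  pendingReconfig.delete(sessionId);
  send(action); // the engine-restart gap now falls into the pause
  return true;
}
```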
### Barge-In Latency After Provider Switch
Each provider has different Voice Activity Detection (VAD) characteristics. Switching providers may change barge-in latency for the immediate strategy:
| Provider | Approximate barge-in latency |
|---|---|
| Azure | ~20–80 ms |
| Deepgram | ~20–100 ms |
| ElevenLabs | ~30–120 ms |
## Compatible Channels
The configure_transcription action is accepted on all three delivery channels:
- HTTP webhook response
- Client-transport WebSocket
- External API POST
## Multi-Language Support per Provider
Not all providers support simultaneous multi-language detection. When more than one language code is supplied, providers that only accept a single language will silently use the first entry and ignore the rest.
| Provider value | Multi-language support | Notes |
|---|---|---|
| "AZURE" | ✅ Up to 4 languages | All entries used for Language Identification (LID) |
| "DEEPGRAM" | ✅ Multilingual | Auto-detects across the supplied languages; supply none for full auto-detect |
| "ELEVEN_LABS" | ❌ Single language only | Only the first entry is used; the rest are ignored |
Recommendation: When targeting ElevenLabs, supply exactly one language code. Deepgram and Azure both accept multiple codes; supplying none lets the provider auto-detect.
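The table above can be applied client-side before sending an action. This is an illustrative helper, not part of the API:

```javascript
// Trim a language list to what each provider accepts.
function normalizeLanguages(provider, languages) {
  if (!languages || languages.length === 0) return undefined; // provider auto-detect
  if (provider === 'ELEVEN_LABS') return languages.slice(0, 1); // single language only
  if (provider === 'AZURE') return languages.slice(0, 4); // LID supports up to 4
  return languages; // Deepgram is multilingual
}
```

Returning `undefined` lets the caller omit the field entirely, which keeps the provider in auto-detect mode.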
## Examples

### Change Language Only (Keep Current Provider)
Switch an active session to German transcription:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "languages": ["de-DE"]
}
```

### Switch Provider Only (Languages Reset to Default)
Switch from Azure to Deepgram; languages reset to auto-detection:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM"
}
```

### Switch Provider and Language Simultaneously
Switch to ElevenLabs and set English as the recognition language:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "ELEVEN_LABS",
  "languages": ["en-US"]
}
```

### Use Multiple Languages Simultaneously
Enable multi-language detection for German and English:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "languages": ["de-DE", "en-US"]
}
```

Up to 4 language codes may be provided in a single request.
### Reset to Provider Default

Re-send the current provider with languages omitted to restore automatic language detection. (Sending only type and session_id is a no-op; the provider field must be present to trigger the reset.)

```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM"
}
```

### Boost Recognition with Custom Vocabulary
Improve accuracy for product names and technical terms:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "custom_vocabulary": ["sipgate", "VoIP", "ISDN", "Portsplitter"]
}
```

### Switching Language Based on User Input
A common pattern: detect the caller's preferred language from their first utterance, then reconfigure transcription mid-call.
```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;
  if (event.type === 'session_start') {
    // Start with multi-language detection
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'Hello! Guten Tag! Please speak in your preferred language.',
    });
  }
  if (event.type === 'user_speak') {
    const detectedLanguage = event.language; // BCP-47 code from STT
    if (detectedLanguage && detectedLanguage.startsWith('de')) {
      // Caller is speaking German: lock transcription to German only
      return res.json({
        type: 'configure_transcription',
        session_id: event.session.id,
        languages: ['de-DE'],
      });
    }
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`,
    });
  }
  // Acknowledge any other event type so the request does not hang
  return res.sendStatus(204);
});
```

### Provider Fallback Pattern
Switch to a backup provider if the primary fails or for specific call scenarios:
```javascript
// Switch to Deepgram for better handling of a specific language/accent
return res.json({
  type: 'configure_transcription',
  session_id: event.session.id,
  provider: 'DEEPGRAM',
  languages: ['en-US'],
});
```

## Next Steps
- Actions Overview - Complete action reference
- Event Types - What events carry transcribed text
- Barge-In Configuration - Control how users interrupt the assistant