# Configure Transcription Action
Change the STT (Speech-to-Text) provider and/or recognition language(s) during an active call session without hanging up.
## Action Structure
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM",
  "languages": ["en-US"]
}
```

### Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | string | Yes | — | Always "configure_transcription" |
| session_id | string (UUID) | Yes | — | Session identifier from the triggering event |
| provider | string | No | Current provider | STT provider to switch to. Valid values: "AZURE", "DEEPGRAM", "ELEVEN_LABS". Omitting keeps the current provider. |
| languages | string[] | No | Provider default | BCP-47 language codes (1–4 entries). Fully replaces the current config. Omitting while switching provider resets to the provider default (auto-detection); omitting without a provider change leaves languages unchanged. |
| custom_vocabulary | string[] | No | — | Words or phrases to boost STT recognition accuracy. Max 100 entries, max 200 characters per entry. Fully replaces the current session-level vocabulary. Merged with client-level vocabulary configured during onboarding. Supported by Azure, Deepgram, and ElevenLabs. |
| vad | object | No | Current setting | Voice-activity detection tuning, applied for the rest of the session. See VAD Configuration. |
At least one of provider, languages, custom_vocabulary, or vad should be provided; sending none of them is a no-op.
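The field rules above can be enforced client-side before an action is sent. The following is a minimal sketch; `buildConfigureTranscription` is an illustrative helper, not part of the API.

```javascript
// Build a configure_transcription action, enforcing the documented
// constraints: at least one optional field, and 1-4 language codes.
function buildConfigureTranscription(sessionId, fields = {}) {
  const { provider, languages, customVocabulary, vad } = fields;
  if (provider === undefined && languages === undefined &&
      customVocabulary === undefined && vad === undefined) {
    throw new Error('Provide at least one of provider, languages, custom_vocabulary, or vad');
  }
  const action = { type: 'configure_transcription', session_id: sessionId };
  if (provider !== undefined) action.provider = provider;
  if (languages !== undefined) {
    if (languages.length < 1 || languages.length > 4) {
      throw new Error('languages must contain 1-4 BCP-47 codes');
    }
    action.languages = languages;
  }
  if (customVocabulary !== undefined) action.custom_vocabulary = customVocabulary;
  if (vad !== undefined) action.vad = vad;
  return action;
}
```

Throwing early keeps accidental no-op requests (all optional fields omitted) out of your call flow.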
## Configuring VAD Session-Wide
Use this action to set or change VAD parameters for the entire remaining session (equivalent to setting vad on every subsequent speak).
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "vad": {
    "end_of_turn_silence_ms": 1200
  }
}
```

Out-of-range or invalid values are silently ignored.
## Behavioral Details

### Full Replace Semantics
Both provider and languages use full replace semantics — they never merge with existing settings.
| provider field | languages field | Result |
|---|---|---|
| Provided | Provided | Switches to new provider with specified languages |
| Provided | Omitted | Switches to new provider; languages reset to [] (default) |
| Omitted | Provided | Keeps current provider; languages fully replaced |
| Omitted | Omitted | No-op (transcription unchanged) |
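A practical consequence of full-replace semantics: to keep the current languages across a provider switch, you must re-send them explicitly. The sketch below assumes your application tracks `currentLanguages` in its own state.

```javascript
// Switch provider without losing the active language configuration.
// Omitting languages here would reset them to the new provider's default.
function switchProviderKeepLanguages(sessionId, newProvider, currentLanguages) {
  const action = {
    type: 'configure_transcription',
    session_id: sessionId,
    provider: newProvider,
  };
  if (currentLanguages && currentLanguages.length > 0) {
    action.languages = currentLanguages; // explicit re-send preserves them
  }
  return action;
}
```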
### Custom Vocabulary
Pass a custom_vocabulary array to boost recognition of domain-specific terms, product names, proper nouns, or technical terms your callers are likely to use.
- Entries are matched case-insensitively during deduplication and merged with client-level vocabulary.
- Multi-word phrases (e.g. "SIP-Trunk") are supported by all providers.
- If omitted, the current session vocabulary is kept unchanged.
- Max 100 entries; max 200 characters per entry.
Supported providers: Azure, Deepgram, ElevenLabs
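Vocabulary lists coming from user input or a CMS often need cleanup before sending. A minimal sketch mirroring the documented limits (case-insensitive dedupe, at most 100 entries, at most 200 characters each):

```javascript
// Sanitize a vocabulary list to match the documented constraints.
function sanitizeVocabulary(entries) {
  const seen = new Set();
  const out = [];
  for (const entry of entries) {
    const trimmed = entry.trim();
    if (trimmed.length === 0 || trimmed.length > 200) continue; // drop empty/over-long
    const key = trimmed.toLowerCase();
    if (seen.has(key)) continue; // case-insensitive duplicate
    seen.add(key);
    out.push(trimmed);
    if (out.length === 100) break; // hard cap on entry count
  }
  return out;
}
```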
### Brief Audio Gap During Restart
Any change — language or provider — requires the transcription engine to restart. Audio received during the restart is dropped and will not appear in any user_speak event.
| Change type | Typical gap |
|---|---|
| Language change only | ~100–500 ms |
| Provider switch | ~200–800 ms |
Design your call flow to trigger changes at natural pause points (e.g., after the assistant finishes speaking) to minimize the impact of the gap.
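One way to hit such a pause point is to queue the change and flush it when the assistant stops talking. How you detect "assistant finished speaking" is an assumption of this sketch; wire `onAssistantIdle` to whatever end-of-playback signal your integration exposes.

```javascript
// Defer a transcription change until a natural pause in the conversation.
const pendingReconfig = new Map(); // session_id -> queued action

function queueReconfigure(action) {
  pendingReconfig.set(action.session_id, action); // last queued change wins
}

function onAssistantIdle(sessionId, send) {
  const action = pendingReconfig.get(sessionId);
  if (!action) return false;
  pendingReconfig.delete(sessionId);
  send(action); // the engine-restart gap now falls into the pause
  return true;
}
```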
### Barge-In Latency After Provider Switch
Each provider has different Voice Activity Detection (VAD) characteristics. Switching providers may change barge-in latency for the immediate strategy:
| Provider | Approximate barge-in latency |
|---|---|
| Azure | ~20–80 ms |
| Deepgram | ~20–100 ms |
| ElevenLabs | ~30–120 ms |
## Compatible Channels
The configure_transcription action is accepted on all three delivery channels:
- HTTP webhook response
- Client-transport WebSocket
- External API POST
## Multi-Language Support per Provider
Not all providers support simultaneous multi-language detection. When more than one language code is supplied, providers that only accept a single language will silently use the first entry and ignore the rest.
| Provider value | Multi-language support | Notes |
|---|---|---|
| "AZURE" | ✅ Up to 4 languages | All entries used for Language Identification (LID) |
| "DEEPGRAM" | ✅ Multilingual | Auto-detects across the supplied languages; supply none for full auto-detect |
| "ELEVEN_LABS" | ❌ Single language only | Only the first entry is used; the rest are ignored |
Recommendation: When targeting ElevenLabs, supply exactly one language code. Deepgram and Azure both accept multiple codes; supplying none lets the provider auto-detect.
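The table above can be applied client-side before sending an action. This is an illustrative helper, not part of the API:

```javascript
// Trim a language list to what each provider accepts.
function normalizeLanguages(provider, languages) {
  if (!languages || languages.length === 0) return undefined; // provider auto-detect
  if (provider === 'ELEVEN_LABS') return languages.slice(0, 1); // single language only
  if (provider === 'AZURE') return languages.slice(0, 4); // LID supports up to 4
  return languages; // Deepgram is multilingual
}
```

Returning `undefined` lets the caller omit the field entirely, which keeps the provider in auto-detect mode.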
## Examples

### Change Language Only (Keep Current Provider)
Switch an active session to German transcription:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "languages": ["de-DE"]
}
```

### Switch Provider Only (Languages Reset to Default)
Switch from Azure to Deepgram; languages reset to auto-detection:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM"
}
```

### Switch Provider and Language Simultaneously
Switch to ElevenLabs and set English as the recognition language:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "ELEVEN_LABS",
  "languages": ["en-US"]
}
```

### Use Multiple Languages Simultaneously
Enable multi-language detection for German and English:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "languages": ["de-DE", "en-US"]
}
```

Up to 4 language codes may be provided in a single request.
### Reset to Provider Default

Re-send the current provider with languages omitted to restore automatic language detection. (Sending only type and session_id is a no-op; the provider field must be present to trigger the reset.)

```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM"
}
```

### Boost Recognition with Custom Vocabulary
Improve accuracy for product names and technical terms:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "custom_vocabulary": ["sipgate", "VoIP", "ISDN", "Portsplitter"]
}
```

### Switching Language Based on User Input
A common pattern: detect the caller's preferred language from their first utterance, then reconfigure transcription mid-call.
```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;
  if (event.type === 'session_start') {
    // Start with multi-language detection
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'Hello! Guten Tag! Please speak in your preferred language.',
    });
  }
  if (event.type === 'user_speak') {
    const detectedLanguage = event.language; // BCP-47 code from STT
    if (detectedLanguage && detectedLanguage.startsWith('de')) {
      // Caller is speaking German: lock transcription to German only
      return res.json({
        type: 'configure_transcription',
        session_id: event.session.id,
        languages: ['de-DE'],
      });
    }
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`,
    });
  }
  // Acknowledge any other event type so the request does not hang
  return res.sendStatus(204);
});
```

### Provider Fallback Pattern
Switch to a backup provider if the primary fails or for specific call scenarios:
```javascript
// Switch to Deepgram for better handling of a specific language/accent
return res.json({
  type: 'configure_transcription',
  session_id: event.session.id,
  provider: 'DEEPGRAM',
  languages: ['en-US'],
});
```

## Next Steps
- Actions Overview - Complete action reference
- Event Types - What events carry transcribed text
- Barge-In Configuration - Control how users interrupt the assistant