Configure Transcription Action

Change the STT (Speech-to-Text) provider and/or recognition language(s) during an active call session without hanging up.

Action Structure

```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM",
  "languages": ["en-US"]
}
```

Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `type` | string | Yes | — | Always `"configure_transcription"` |
| `session_id` | string (UUID) | Yes | — | Session identifier from the triggering event |
| `provider` | string | No | Current provider | STT provider to switch to. Valid values: `"AZURE"`, `"DEEPGRAM"`, `"ELEVEN_LABS"`. Omitting keeps the current provider. |
| `languages` | string[] | No | Provider default | BCP-47 language codes (1–4 entries). Fully replaces the current config. Omitting resets to the provider default (auto-detection). |
| `custom_vocabulary` | string[] | No | Current vocabulary | Words or phrases to boost STT recognition accuracy. Max 100 entries, max 200 characters per entry. Fully replaces the current session-level vocabulary. Merged with the client-level vocabulary configured during onboarding. Supported by Azure, Deepgram, and ElevenLabs. |
| `vad` | object | No | Current setting | Voice-activity detection tuning, applied for the rest of the session. See VAD Configuration. |

At least one of provider, languages, custom_vocabulary, or vad should be provided; sending none of them is a no-op.
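
The constraints above can be checked client-side before sending the action. The following sketch enforces the "at least one field" rule and the documented limits (1–4 languages, 100 vocabulary entries, 200 characters per entry); the function name is our own, not part of the platform API:

```javascript
// Client-side validation sketch for a configure_transcription action.
// Limits are taken from the field table above.
function validateConfigureTranscription(action) {
  const errors = [];
  const hasChange =
    action.provider !== undefined ||
    action.languages !== undefined ||
    action.custom_vocabulary !== undefined ||
    action.vad !== undefined;
  if (!hasChange) {
    errors.push('no-op: provide at least one of provider, languages, custom_vocabulary, vad');
  }
  if (action.languages && (action.languages.length < 1 || action.languages.length > 4)) {
    errors.push('languages must contain 1-4 BCP-47 codes');
  }
  if (action.custom_vocabulary) {
    if (action.custom_vocabulary.length > 100) errors.push('custom_vocabulary: max 100 entries');
    if (action.custom_vocabulary.some((e) => e.length > 200)) {
      errors.push('custom_vocabulary: max 200 characters per entry');
    }
  }
  return errors;
}
```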

Configuring VAD Session-Wide

Use this action to set or change VAD parameters for the entire remaining session (equivalent to setting vad on every subsequent speak).

```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "vad": {
    "end_of_turn_silence_ms": 1200
  }
}
```

Out-of-range or invalid values are silently ignored.

Behavioral Details

Full Replace Semantics

Both provider and languages use full replace semantics — they never merge with existing settings.

| `provider` field | `languages` field | Result |
|---|---|---|
| Provided | Provided | Switches to the new provider with the specified languages |
| Provided | Omitted | Switches to the new provider; languages reset to `[]` (provider default) |
| Omitted | Provided | Keeps the current provider; languages fully replaced |
| Omitted | Omitted | No-op (transcription unchanged) |
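
The four cases can be expressed as a small state transition. This is an illustrative sketch of the documented semantics; the `state` object shape is our own, not a platform type:

```javascript
// Compute the resulting transcription state after applying an action,
// following the full-replace semantics in the table above.
function applyTranscriptionAction(state, action) {
  if (action.provider === undefined && action.languages === undefined) {
    return state; // no-op: transcription unchanged
  }
  if (action.provider !== undefined) {
    return {
      provider: action.provider,
      // Omitting languages alongside a provider switch resets to [] (default)
      languages: action.languages ?? [],
    };
  }
  // Provider kept; languages fully replaced
  return { provider: state.provider, languages: action.languages };
}
```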

Custom Vocabulary

Pass a custom_vocabulary array to boost recognition of domain-specific terms, product names, proper nouns, or technical terms your callers are likely to use.

- Entries are matched case-insensitively during deduplication and merged with client-level vocabulary.
- Multi-word phrases (e.g. "SIP-Trunk") are supported by all providers.
- If omitted, the current session vocabulary is kept unchanged.
- Max 100 entries; max 200 characters per entry.

Supported providers: Azure, Deepgram, ElevenLabs
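
The merge rules above can be sketched as follows. This mirrors the documented behavior (case-insensitive deduplication, the 100-entry and 200-character caps) but is our own illustration, not the platform's implementation:

```javascript
// Combine client-level and session-level vocabulary per the rules above.
function mergeVocabulary(clientLevel, sessionLevel) {
  const seen = new Set();
  const merged = [];
  for (const entry of [...clientLevel, ...sessionLevel]) {
    const key = entry.toLowerCase();
    // Drop over-long entries and case-insensitive duplicates
    if (entry.length > 200 || seen.has(key)) continue;
    seen.add(key);
    merged.push(entry);
    if (merged.length === 100) break; // max 100 entries
  }
  return merged;
}
```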

Brief Audio Gap During Restart

Any change — language or provider — requires the transcription engine to restart. Audio received during the restart is dropped and will not appear in any user_speak event.

| Change type | Typical gap |
|---|---|
| Language change only | ~100–500 ms |
| Provider switch | ~200–800 ms |

Design your call flow to trigger changes at natural pause points (e.g., after the assistant finishes speaking) to minimize the impact of the gap.
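
One way to follow this advice is to queue a requested reconfiguration and emit it only when the call flow reaches a pause point. The queue pattern below is our own sketch, not a platform feature:

```javascript
// Defer a configure_transcription action until a natural pause point,
// so the restart gap does not swallow caller audio mid-utterance.
class DeferredReconfig {
  constructor() {
    this.pending = null;
  }
  // Record the desired reconfiguration; the latest request wins.
  request(action) {
    this.pending = action;
  }
  // Call when the flow reaches a pause point (e.g. the assistant finished
  // speaking); returns the queued action to send now, or null.
  flushAtPausePoint() {
    const action = this.pending;
    this.pending = null;
    return action;
  }
}
```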

Barge-In Latency After Provider Switch

Each provider has different Voice Activity Detection (VAD) characteristics. Switching providers may change barge-in latency for the immediate strategy:

| Provider | Approximate barge-in latency |
|---|---|
| Azure | ~20–80 ms |
| Deepgram | ~20–100 ms |
| ElevenLabs | ~30–120 ms |

Compatible Channels

The configure_transcription action is accepted on all three delivery channels:

- HTTP webhook response
- Client-transport WebSocket
- External API POST

Multi-Language Support per Provider

Not all providers support simultaneous multi-language detection. When more than one language code is supplied, providers that only accept a single language will silently use the first entry and ignore the rest.

| Provider value | Multi-language support | Notes |
|---|---|---|
| `"AZURE"` | ✅ Up to 4 languages | All entries used for Language Identification (LID) |
| `"DEEPGRAM"` | ✅ Multilingual | Auto-detects across the supplied languages; supply none for full auto-detect |
| `"ELEVEN_LABS"` | ❌ Single language only | Only the first entry is used; the rest are ignored |

Recommendation: When targeting ElevenLabs, supply exactly one language code. Deepgram and Azure both accept multiple codes; supplying none lets the provider auto-detect.
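A small helper can enforce this recommendation client-side, so you never rely on a provider silently discarding extra codes. The limits come from the table above; the function is our own sketch:

```javascript
// Clamp a languages array to what the target provider supports.
function clampLanguages(provider, languages) {
  if (!languages || languages.length === 0) return languages; // provider default / auto-detect
  if (provider === 'ELEVEN_LABS') return languages.slice(0, 1); // single language only
  if (provider === 'AZURE') return languages.slice(0, 4); // up to 4 for LID
  return languages; // DEEPGRAM: multilingual
}
```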

Examples

Change Language Only (Keep Current Provider)

Switch an active session to German transcription:

```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "languages": ["de-DE"]
}
```

Switch Provider Only (Languages Reset to Default)

Switch from Azure to Deepgram; languages reset to auto-detection:

```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM"
}
```

Switch Provider and Language Simultaneously

Switch to ElevenLabs and set English as the recognition language:

```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "ELEVEN_LABS",
  "languages": ["en-US"]
}
```

Use Multiple Languages Simultaneously

Enable multi-language detection for German and English:

```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "languages": ["de-DE", "en-US"]
}
```

Up to 4 language codes may be provided in a single request.

Reset to Provider Default

Per the full-replace semantics above, sending only `session_id` is a no-op. To restore automatic language detection, re-send the provider (here the current one) and omit `languages`, which resets them to the provider default:

```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM"
}
```

Boost Recognition with Custom Vocabulary

Improve accuracy for product names and technical terms:

```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "custom_vocabulary": ["sipgate", "VoIP", "ISDN", "Portsplitter"]
}
```

Switching Language Based on User Input

A common pattern: detect the caller's preferred language from their first utterance, then reconfigure transcription mid-call.

```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    // Start with multi-language detection
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'Hello! Guten Tag! Please speak in your preferred language.',
    });
  }

  if (event.type === 'user_speak') {
    const detectedLanguage = event.language; // BCP-47 code from STT

    if (detectedLanguage && detectedLanguage.startsWith('de')) {
      // Caller is speaking German: lock transcription to German only
      return res.json({
        type: 'configure_transcription',
        session_id: event.session.id,
        languages: ['de-DE'],
      });
    }

    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`,
    });
  }

  // Acknowledge any other event types so the request does not hang
  return res.sendStatus(200);
});
```

Provider Fallback Pattern

Switch to a backup provider if the primary fails or for specific call scenarios:

```javascript
// Switch to Deepgram for better handling of a specific language/accent
return res.json({
  type: 'configure_transcription',
  session_id: event.session.id,
  provider: 'DEEPGRAM',
  languages: ['en-US'],
});
```
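
A fallback chain can be kept as simple data. The ordering below is an arbitrary assumption for illustration, not a platform recommendation:

```javascript
// Pick the next provider to try when the current one misbehaves.
const FALLBACK_ORDER = ['DEEPGRAM', 'AZURE', 'ELEVEN_LABS'];

function nextProvider(current) {
  const idx = FALLBACK_ORDER.indexOf(current);
  return FALLBACK_ORDER[(idx + 1) % FALLBACK_ORDER.length];
}
```

Send the result as the `provider` field of a `configure_transcription` action, remembering that omitting `languages` resets them to the new provider's default.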

Next Steps