
Configure Transcription Action

Change the STT (Speech-to-Text) provider and/or recognition language(s) during an active call session without hanging up.

Action Structure

json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM",
  "languages": ["en-US"]
}

Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `type` | string | Yes | — | Always `"configure_transcription"` |
| `session_id` | string (UUID) | Yes | — | Session identifier from the event |
| `provider` | string | No | Current provider | STT provider to switch to. Valid values: `"AZURE"`, `"DEEPGRAM"`, `"ELEVEN_LABS"`. Omitting keeps the current provider. |
| `languages` | string[] | No | Provider default | BCP-47 language codes (1–4 entries). Fully replaces the current configuration. Omitting resets to the provider default (auto-detection). |

At least one of `provider` or `languages` should be provided; sending neither is a no-op.
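As a sketch, a small helper can build and validate the action before sending it. The `buildConfigureTranscription` function below is illustrative (not part of the API); it simply enforces the field constraints documented above:

```javascript
// Illustrative helper: builds a configure_transcription action and
// enforces the documented field constraints before it is sent.
const VALID_PROVIDERS = ['AZURE', 'DEEPGRAM', 'ELEVEN_LABS'];

function buildConfigureTranscription(sessionId, { provider, languages } = {}) {
  if (provider === undefined && languages === undefined) {
    // Sending neither field is a no-op, so treat it as a caller error here.
    throw new Error('Provide at least one of provider or languages');
  }
  const action = { type: 'configure_transcription', session_id: sessionId };
  if (provider !== undefined) {
    if (!VALID_PROVIDERS.includes(provider)) {
      throw new Error(`Unknown provider: ${provider}`);
    }
    action.provider = provider;
  }
  if (languages !== undefined) {
    if (!Array.isArray(languages) || languages.length < 1 || languages.length > 4) {
      throw new Error('languages must contain 1-4 BCP-47 codes');
    }
    action.languages = languages;
  }
  return action;
}
```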

Behavioral Details

Full Replace Semantics

Both provider and languages use full replace semantics — they never merge with existing settings.

| `provider` field | `languages` field | Result |
|---|---|---|
| Provided | Provided | Switches to the new provider with the specified languages |
| Provided | Omitted | Switches to the new provider; languages reset to `[]` (provider default) |
| Omitted | Provided | Keeps the current provider; languages fully replaced |
| Omitted | Omitted | No-op (transcription unchanged) |

Brief Audio Gap During Restart

Any change — language or provider — requires the transcription engine to restart. Audio received during the restart is dropped and will not appear in any user_speak event.

| Change type | Typical gap |
|---|---|
| Language change only | ~100–500 ms |
| Provider switch | ~200–800 ms |

Design your call flow to trigger changes at natural pause points (e.g., after the assistant finishes speaking) to minimize the impact of the gap.
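One way to follow this advice is to queue the reconfiguration and apply it only when the assistant finishes its utterance. The sketch below assumes an end-of-utterance signal; the `assistant_speak_finished` event name is illustrative — substitute whatever signal your integration exposes:

```javascript
// Sketch: defer a pending transcription change until the assistant
// finishes speaking, so the restart gap falls at a natural pause.
// 'assistant_speak_finished' is an assumed event name, not part of
// the documented API.
class DeferredReconfigure {
  constructor(send) {
    this.send = send;    // function that delivers an action to the session
    this.pending = null; // at most one queued configure_transcription action
  }

  request(action) {
    this.pending = action; // latest request wins
  }

  onEvent(event) {
    if (event.type === 'assistant_speak_finished' && this.pending) {
      const action = this.pending;
      this.pending = null;
      this.send(action); // apply the change at the pause point
    }
  }
}
```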

Barge-In Latency After Provider Switch

Each provider has different Voice Activity Detection (VAD) characteristics. Switching providers may change barge-in latency for the immediate strategy:

| Provider | Approximate barge-in latency |
|---|---|
| Azure | ~20–80 ms |
| Deepgram | ~20–100 ms |
| ElevenLabs | ~30–120 ms |

Compatible Channels

The configure_transcription action is accepted on all three delivery channels:

  • HTTP webhook response
  • Client-transport WebSocket
  • External API POST
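For the external API channel, a minimal sketch of building the POST request is shown below. The endpoint path (`/sessions/{id}/actions`) and the bearer-token auth header are placeholder assumptions — substitute the URL and authentication scheme from your deployment:

```javascript
// Sketch: build an external-API POST request carrying the action.
// The path and Authorization header are placeholders, not the
// documented endpoint.
function buildConfigureTranscriptionRequest(baseUrl, apiKey, sessionId, config) {
  return {
    url: `${baseUrl}/sessions/${sessionId}/actions`, // placeholder path
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`, // placeholder auth scheme
      },
      body: JSON.stringify({
        type: 'configure_transcription',
        session_id: sessionId,
        ...config, // provider and/or languages
      }),
    },
  };
}

// Usage: const { url, options } = buildConfigureTranscriptionRequest(...);
//        await fetch(url, options);
```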

Multi-Language Support per Provider

Not all providers support simultaneous multi-language detection. When more than one language code is supplied, providers that only accept a single language will silently use the first entry and ignore the rest.

| Provider value | Multi-language support | Notes |
|---|---|---|
| `"AZURE"` | ✅ Up to 4 languages | All entries used for Language Identification (LID) |
| `"DEEPGRAM"` | ❌ Single language only | Only the first entry is used; the rest are ignored |
| `"ELEVEN_LABS"` | ❌ Single language only | Only the first entry is used; the rest are ignored |

Recommendation: When targeting Deepgram or ElevenLabs, always supply exactly one language code. Supplying multiple codes will not cause an error, but only the first will take effect.
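To make the single-language limitation explicit in client code, a helper like the illustrative `normalizeLanguages` below can trim the list to what the target provider actually uses, per the table above:

```javascript
// Sketch: trim a language list to the number of entries the target
// provider supports (illustrative helper, not part of the API).
function normalizeLanguages(provider, languages) {
  const maxByProvider = { AZURE: 4, DEEPGRAM: 1, ELEVEN_LABS: 1 };
  const max = maxByProvider[provider];
  if (max === undefined) throw new Error(`Unknown provider: ${provider}`);
  if (!Array.isArray(languages) || languages.length === 0) return languages;
  // Extra entries would be silently ignored server-side anyway;
  // dropping them here makes the effective config visible to the caller.
  return languages.slice(0, max);
}
```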

Examples

Change Language Only (Keep Current Provider)

Switch an active session to German transcription:

json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "languages": ["de-DE"]
}

Switch Provider Only (Languages Reset to Default)

Switch from Azure to Deepgram; languages reset to auto-detection:

json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM"
}

Switch Provider and Language Simultaneously

Switch to ElevenLabs and set English as the recognition language:

json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "ELEVEN_LABS",
  "languages": ["en-US"]
}

Use Multiple Languages Simultaneously

Enable multi-language detection for German and English:

json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "languages": ["de-DE", "en-US"]
}

Up to 4 language codes may be provided in a single request.

Reset to Provider Default

Omit languages to restore automatic language detection:

json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}

Switching Language Based on User Input

A common pattern: detect the caller's preferred language from their first utterance, then reconfigure transcription mid-call.

javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    // Start with multi-language detection
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'Hello! Guten Tag! Please speak in your preferred language.',
    });
  }

  if (event.type === 'user_speak') {
    const detectedLanguage = event.language; // BCP-47 code from STT

    if (detectedLanguage && detectedLanguage.startsWith('de')) {
      // Caller is speaking German — lock transcription to German only
      return res.json({
        type: 'configure_transcription',
        session_id: event.session.id,
        languages: ['de-DE'],
      });
    }

    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`,
    });
  }

  // Acknowledge any other event types so the HTTP request doesn't hang
  return res.sendStatus(204);
});

Provider Fallback Pattern

Switch to a backup provider if the primary fails or for specific call scenarios:

javascript
// Switch to Deepgram for better handling of a specific language/accent
return res.json({
  type: 'configure_transcription',
  session_id: event.session.id,
  provider: 'DEEPGRAM',
  languages: ['en-US'],
});

Next Steps