Action Types

Complete reference for all actions you can return from event handlers.

Overview

Actions are responses that tell the AI Flow service what to do next. Every action must include a session_id and a type field.

Base Action Structure

typescript
interface BaseAction {
  session_id: string; // UUID from the event's session.id
  type: string;       // Action type identifier
}
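The SDK's types enforce these fields at compile time. As an illustrative sketch (the `isBaseAction` name is ours, not part of the SDK), the same invariant can be checked at runtime, e.g. before logging or forwarding an action:

```typescript
// Runtime guard mirroring the BaseAction shape above.
// Hypothetical helper for illustration; the SDK already
// guarantees this via its TypeScript types.
function isBaseAction(
  value: unknown
): value is { session_id: string; type: string } {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { session_id?: unknown }).session_id === "string" &&
    typeof (value as { type?: unknown }).type === "string"
  );
}
```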

Action Summary

| Action Type | Description | Primary Use Case |
| --- | --- | --- |
| speak | Speak text or SSML | Respond to user with synthesized speech |
| audio | Play pre-recorded audio | Play hold music, pre-recorded messages |
| hangup | End the call | Terminate conversation |
| transfer | Transfer to another number | Route to human agent or department |
| barge_in | Manually interrupt playback | Stop current audio immediately |
| configure_transcription | Change STT language(s) mid-call | Switch recognition language without hanging up |

Speak Action

Speaks text or SSML to the user.

typescript
interface AiFlowActionSpeak {
  type: "speak";
  session_id: string;

  // Either text OR ssml (not both)
  text?: string; // Plain text to speak
  ssml?: string; // SSML markup for advanced control

  // Optional configurations
  tts?: TtsConfig;      // TTS provider settings
  barge_in?: BargeInConfig; // Barge-in behavior
}

Examples:

typescript
// Simple text
return {
  type: "speak",
  session_id: event.session.id,
  text: "Hello, how can I help you?",
};

// With SSML
return {
  type: "speak",
  session_id: event.session.id,
  ssml: `
    <speak version="1.0" xml:lang="en-US">
      <voice name="en-US-JennyNeural">
        <prosody rate="slow">Please listen carefully.</prosody>
        <break time="500ms"/>
        Your account balance is <say-as interpret-as="currency">$42.50</say-as>
      </voice>
    </speak>
  `,
};

// With custom TTS provider
return {
  type: "speak",
  session_id: event.session.id,
  text: "Hello in a different voice",
  tts: {
    provider: "azure",
    language: "en-US",
    voice: "en-US-JennyNeural",
  },
};

Audio Action

Plays pre-recorded audio to the user.

typescript
interface AiFlowActionAudio {
  type: "audio";
  session_id: string;
  audio: string; // Base64 encoded WAV (16kHz, mono, 16-bit)
  barge_in?: BargeInConfig;
}

Example:

typescript
// Play hold music or pre-recorded message
return {
  type: "audio",
  session_id: event.session.id,
  audio: base64EncodedWavData,
  barge_in: {
    strategy: "minimum_characters",
    minimum_characters: 3,
  },
};

Audio Format Requirements:

  • Format: WAV
  • Sample Rate: 16kHz
  • Channels: Mono
  • Bit Depth: 16-bit PCM
  • Encoding: Base64
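As a sketch of how the audio field can be produced: the helper below (the name is ours, not part of the SDK) simply Base64-encodes raw WAV bytes. It assumes the file already matches the required format (16 kHz, mono, 16-bit PCM); convert it beforehand if needed, e.g. with ffmpeg.

```typescript
import { readFileSync } from "node:fs";

// Wraps raw WAV bytes into the Base64 string the audio action expects.
// Note: this does NOT resample or convert; the input must already be
// 16 kHz / mono / 16-bit PCM WAV.
function encodeWavForAudioAction(wavBytes: Buffer): string {
  return wavBytes.toString("base64");
}

// Typical usage: load once at startup, reuse for every call.
// const base64EncodedWavData = encodeWavForAudioAction(
//   readFileSync("hold-music.wav")
// );
```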

Hangup Action

Ends the call.

typescript
interface AiFlowActionHangup {
  type: "hangup";
  session_id: string;
}

Example:

typescript
onUserSpeak: async (event) => {
  if (event.text.toLowerCase().includes("goodbye")) {
    return {
      type: "hangup",
      session_id: event.session.id,
    };
  }
};

Transfer Action

Transfers the call to another phone number.

typescript
interface AiFlowActionTransfer {
  type: "transfer";
  session_id: string;
  target_phone_number: string; // E.164 format recommended
  caller_id_name: string;
  caller_id_number: string;
}

Example:

typescript
// Transfer to sales department
return {
  type: "transfer",
  session_id: event.session.id,
  target_phone_number: "+1234567890",
  caller_id_name: "Sales Department",
  caller_id_number: "+1234567890",
};

Barge-In Action

Manually triggers barge-in (interrupts current playback).

typescript
interface AiFlowActionBargeIn {
  type: "barge_in";
  session_id: string;
}

Example:

typescript
// Manually interrupt current playback
return {
  type: "barge_in",
  session_id: event.session.id,
};

Configure Transcription Action

Change the STT (Speech-to-Text) provider and/or recognition language(s) during an active call session without hanging up.

typescript
import { TranscriptionProvider } from "@sipgate/ai-flow-sdk";

interface AiFlowActionConfigureTranscription {
  type: "configure_transcription";
  session_id: string;
  provider?: TranscriptionProvider; // "AZURE" | "DEEPGRAM" | "ELEVEN_LABS" — omit to keep current
  languages?: string[];             // BCP-47 codes, 1-4 entries — omit to reset to provider default
}

At least one of provider or languages should be provided; sending neither is a no-op.

Both fields use full replace semantics — they never merge with existing settings.

Examples:

typescript
// Switch to German
return {
  type: "configure_transcription",
  session_id: event.session.id,
  languages: ["de-DE"],
};

// Multi-language detection (German + English)
return {
  type: "configure_transcription",
  session_id: event.session.id,
  languages: ["de-DE", "en-US"],
};

// Switch STT provider to Deepgram
return {
  type: "configure_transcription",
  session_id: event.session.id,
  provider: "DEEPGRAM",
};

// Switch provider AND language in one step
return {
  type: "configure_transcription",
  session_id: event.session.id,
  provider: "DEEPGRAM",
  languages: ["en-US"],
};

// Reset to provider default (automatic detection)
return {
  type: "configure_transcription",
  session_id: event.session.id,
};

Audio gap during restart: Any change requires the transcription engine to restart. Audio during the restart (~100–500 ms for language-only change, ~200–800 ms for provider switch) is dropped.

Multi-language support depends on the active STT provider:

  • Azure: up to 4 languages, all used for simultaneous Language Identification (LID)
  • Deepgram / ElevenLabs: single language only — only the first entry is used; additional entries are silently ignored
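Given these limits, a handler can defensively trim its language list before sending the action. The sketch below encodes the rules above (`clampLanguages` is a hypothetical helper, not part of the SDK; the service itself also ignores excess entries):

```typescript
type Provider = "AZURE" | "DEEPGRAM" | "ELEVEN_LABS";

// Azure supports up to 4 languages for simultaneous LID;
// Deepgram and ElevenLabs use only the first entry.
function clampLanguages(provider: Provider, languages: string[]): string[] {
  const max = provider === "AZURE" ? 4 : 1;
  return languages.slice(0, max);
}
```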

Barge-in latency after provider switch (for immediate strategy):

  • Azure: ~20–80 ms
  • Deepgram: ~20–100 ms
  • ElevenLabs: ~30–120 ms

Type Safety

All actions are fully typed. Import types from the SDK:

typescript
import type {
  AiFlowAction,
  AiFlowActionSpeak,
  AiFlowActionAudio,
  AiFlowActionHangup,
  AiFlowActionTransfer,
  AiFlowActionBargeIn,
  AiFlowActionConfigureTranscription,
} from "@sipgate/ai-flow-sdk";
import { TranscriptionProvider } from "@sipgate/ai-flow-sdk";

onUserSpeak: async (event) => {
  const action: AiFlowActionSpeak = {
    type: "speak",
    session_id: event.session.id,
    text: "Hello!",
  };
  return action;
};

Next Steps