Action Types
Complete reference for all actions you can return from event handlers.
Overview
Actions are responses that tell the AI Flow service what to do next. Every action requires a session_id and a type field.
Base Action Structure
typescript
interface BaseAction {
session_id: string; // UUID from the event's session.id
type: string; // Action type identifier
}
Action Summary
| Action Type | Description | Primary Use Case |
|---|---|---|
| speak | Speak text or SSML | Respond to user with synthesized speech |
| audio | Play pre-recorded audio | Play hold music, pre-recorded messages |
| mix_audio | Loop a background sound mixed into speech | Add ambient noise (café, office, train station) under the agent |
| hangup | End the call | Terminate conversation |
| transfer | Transfer to another number | Route to human agent or department |
| barge_in | Manually interrupt playback | Stop current audio immediately |
| configure_transcription | Change STT language(s) mid-call | Switch recognition language without hanging up |
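Because every action shares the session_id and type fields, a small runtime guard can catch malformed actions before they are returned from a handler. The following is a minimal sketch, not SDK code: ACTION_TYPES, ActionType and isBaseAction are hypothetical local names, and the check assumes that a non-empty session_id string is sufficient.

```typescript
// Action type identifiers taken from the summary table above.
const ACTION_TYPES = [
  "speak",
  "audio",
  "mix_audio",
  "hangup",
  "transfer",
  "barge_in",
  "configure_transcription",
] as const;

type ActionType = (typeof ACTION_TYPES)[number];

interface BaseAction {
  session_id: string;
  type: ActionType;
}

// Hypothetical helper (not part of the SDK): narrows an unknown value to BaseAction.
function isBaseAction(value: unknown): value is BaseAction {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.session_id === "string" &&
    candidate.session_id.length > 0 &&
    typeof candidate.type === "string" &&
    (ACTION_TYPES as readonly string[]).indexOf(candidate.type) !== -1
  );
}
```

A guard like this is mostly useful in tests or when actions are assembled from untyped data; statically typed handler code already gets the same guarantee from the compiler.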
Speak Action
Speaks text or SSML to the user.
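A speak action carries its content as either plain text or SSML, never both. One way to have the compiler enforce that choice is a stricter local type. The sketch below is an illustration under that assumption: SpeakContent, StrictSpeakAction, speakText and speakSsml are hypothetical names and do not come from the SDK.

```typescript
// Local illustration types: exactly one of text / ssml may be present.
type SpeakContent =
  | { text: string; ssml?: never }
  | { ssml: string; text?: never };

type StrictSpeakAction = { type: "speak"; session_id: string } & SpeakContent;

// Hypothetical helper: builds a plain-text speak action.
function speakText(sessionId: string, text: string): StrictSpeakAction {
  return { type: "speak", session_id: sessionId, text };
}

// Hypothetical helper: builds an SSML speak action.
function speakSsml(sessionId: string, ssml: string): StrictSpeakAction {
  return { type: "speak", session_id: sessionId, ssml };
}

// An object literal that sets both fields is rejected at compile time:
// const bad: StrictSpeakAction = {
//   type: "speak", session_id: "s", text: "hi", ssml: "<speak>hi</speak>",
// };
```

The stricter type is a design choice for call sites that build actions by hand; it does not change what the service accepts.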
typescript
interface AiFlowActionSpeak {
type: "speak";
session_id: string;
// Either text OR ssml (not both)
text?: string; // Plain text to speak
ssml?: string; // SSML markup for advanced control
// Optional configurations
tts?: TtsConfig; // TTS provider settings
barge_in?: BargeInConfig; // Barge-in behavior
user_input_timeout_seconds?: number; // Wait this long for the caller to start
vad?: VadConfig; // Tune end-of-turn silence (advanced — see /sdk/vad)
}
Examples:
typescript
// Simple text
return {
type: "speak",
session_id: event.session.id,
text: "Hello, how can I help you?",
};
// With SSML
return {
type: "speak",
session_id: event.session.id,
ssml: `
<speak version="1.0" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<prosody rate="slow">Please listen carefully.</prosody>
<break time="500ms"/>
Your account balance is <say-as interpret-as="currency">$42.50</say-as>
</voice>
</speak>
`,
};
// With custom TTS provider
return {
type: "speak",
session_id: event.session.id,
text: "Hello in a different voice",
tts: {
provider: "azure",
language: "en-US",
voice: "en-US-JennyNeural",
},
};
Audio Action
Plays pre-recorded audio to the user.
typescript
interface AiFlowActionAudio {
type: "audio";
session_id: string;
audio: string; // Base64 encoded WAV (16kHz, mono, 16-bit)
barge_in?: BargeInConfig;
}
Example:
typescript
// Play hold music or pre-recorded message
return {
type: "audio",
session_id: event.session.id,
audio: base64EncodedWavData,
barge_in: {
strategy: "minimum_characters",
minimum_characters: 3,
},
};
Audio Format Requirements:
- Format: WAV
- Sample Rate: 16kHz
- Channels: Mono
- Bit Depth: 16-bit PCM
- Encoding: Base64
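The format requirements above can be checked before a file is base64-encoded and sent. The following is a rough sketch rather than a complete WAV parser: checkWavFormat is a hypothetical helper, and it assumes a canonical header in which the "fmt " chunk directly follows the RIFF header.

```typescript
// Hypothetical helper: inspects a canonical 44-byte RIFF/WAVE header and
// returns a list of mismatches against the requirements listed above.
// Files with extra chunks before "fmt " would need a real chunk walker.
function checkWavFormat(bytes: Uint8Array): string[] {
  if (bytes.byteLength < 44) return ["file is shorter than a 44-byte WAV header"];

  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const tag = (offset: number): string => {
    let s = "";
    for (let i = 0; i < 4; i++) s += String.fromCharCode(view.getUint8(offset + i));
    return s;
  };

  if (tag(0) !== "RIFF" || tag(8) !== "WAVE" || tag(12) !== "fmt ") {
    return ["not a RIFF/WAVE file with a leading fmt chunk"];
  }

  const problems: string[] = [];
  const audioFormat = view.getUint16(20, true); // 1 = uncompressed PCM
  const channels = view.getUint16(22, true);
  const sampleRate = view.getUint32(24, true);
  const bitsPerSample = view.getUint16(34, true);

  if (audioFormat !== 1) problems.push(`expected PCM (format 1), got format ${audioFormat}`);
  if (channels !== 1) problems.push(`expected mono, got ${channels} channels`);
  if (sampleRate !== 16000) problems.push(`expected 16000 Hz, got ${sampleRate} Hz`);
  if (bitsPerSample !== 16) problems.push(`expected 16-bit samples, got ${bitsPerSample}-bit`);
  return problems;
}
```

In Node, the Buffer returned by readFileSync is a Uint8Array, so it can be passed to a check like this directly; an empty result means the header matches the listed requirements.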
Mix Audio Action
Plays a looping background sound (e.g. train station, café, office) under the call. The loop plays continuously for the rest of the session, both during the assistant's TTS turns and during silences. Sending mix_audio again replaces the active loop; sending it with stop: true removes the loop. The loop is dropped automatically when the session ends.
typescript
import { readFileSync } from "node:fs";
interface AiFlowActionMixAudio {
type: "mix_audio";
session_id: string;
/** Base64-encoded WAV (16 kHz, mono, 16-bit PCM). Required unless stop=true. */
audio?: string;
/** Mix volume for the background loop, 0.0–1.0. Defaults to 0.5. */
volume?: number;
/** When true, removes the active background loop. */
stop?: boolean;
}
Example (start an ambient loop alongside the greeting):
typescript
// Load and base64-encode the loop once at startup
const AMBIENT_AUDIO = readFileSync("./cafe.wav").toString("base64");
onSessionStart: async (event) => {
return [
{
type: "mix_audio",
session_id: event.session.id,
audio: AMBIENT_AUDIO,
volume: 0.3,
},
{
type: "speak",
session_id: event.session.id,
text: "Welcome, how can I help you?",
},
];
};
Example (stop the ambient loop before hanging up):
typescript
onUserSpeak: async (event) => {
if (event.text.toLowerCase().includes("goodbye")) {
return [
{ type: "mix_audio", session_id: event.session.id, stop: true },
{ type: "speak", session_id: event.session.id, text: "Goodbye!" },
{ type: "hangup", session_id: event.session.id },
];
}
};
Audio Format Requirements: identical to the audio action (WAV, 16 kHz, mono, 16-bit PCM, base64-encoded). The same FFmpeg conversion command applies.
Best practice: keep the ambient loop quiet. Background loops should sit under the agent's voice. Start around volume: 0.3 and adjust from there. Loudness-normalize source files to about -30 LUFS so different presets stay comparable at a given volume value.
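The start, replace and stop behaviour described in this section can be wrapped in two small builders. This is a sketch under stated assumptions: MixAudioAction is a local copy of the interface shape shown earlier, startLoop and stopLoop are hypothetical helper names, and clamping out-of-range volumes (instead of rejecting them) is a choice made only for this example.

```typescript
interface MixAudioAction {
  type: "mix_audio";
  session_id: string;
  audio?: string;
  volume?: number;
  stop?: boolean;
}

// Hypothetical helper: starts the background loop, or replaces the active one.
// Out-of-range volumes are clamped to the documented 0.0 to 1.0 range.
function startLoop(sessionId: string, audioBase64: string, volume?: number): MixAudioAction {
  const action: MixAudioAction = {
    type: "mix_audio",
    session_id: sessionId,
    audio: audioBase64,
  };
  if (volume !== undefined) {
    action.volume = Math.min(1, Math.max(0, volume));
  }
  // When volume is omitted, the documented default of 0.5 applies.
  return action;
}

// Hypothetical helper: removes the active background loop.
function stopLoop(sessionId: string): MixAudioAction {
  return { type: "mix_audio", session_id: sessionId, stop: true };
}
```

Calling startLoop a second time with different audio is enough to switch ambiences, since a new mix_audio action replaces the active loop.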
Hangup Action
Ends the call.
typescript
interface AiFlowActionHangup {
type: "hangup";
session_id: string;
}
Example:
typescript
onUserSpeak: async (event) => {
if (event.text.toLowerCase().includes("goodbye")) {
return {
type: "hangup",
session_id: event.session.id,
};
}
};
Transfer Action
Transfers the call to another phone number. Pass an optional timeout to enable transfer fallback: if the target doesn't pick up, rejects the call, or hangs up, the service re-emits session_start with the same session.id so the agent can handle the call again.
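Since a fallback arrives as a second session_start with the same session.id, the handler has to remember which sessions it transferred in order to tell a fallback from a fresh call. The sketch below shows one possible approach. It is illustrative only: TransferAction is a local copy of the interface shape, buildTransfer and isTransferFallback are hypothetical helpers, the 5 to 120 second check mirrors the documented timeout range, and the in-memory Set assumes a single long-lived process.

```typescript
interface TransferAction {
  type: "transfer";
  session_id: string;
  target_phone_number: string;
  caller_id_name: string;
  caller_id_number: string;
  timeout?: number;
}

// Sessions for which a transfer with fallback was requested (in-memory only).
const pendingTransfers = new Set<string>();

// Hypothetical helper: builds a transfer action and, when a timeout is set,
// remembers the session so a later session_start can be recognised as a fallback.
function buildTransfer(
  sessionId: string,
  target: string,
  callerIdName: string,
  callerIdNumber: string,
  timeoutSeconds?: number,
): TransferAction {
  const action: TransferAction = {
    type: "transfer",
    session_id: sessionId,
    target_phone_number: target,
    caller_id_name: callerIdName,
    caller_id_number: callerIdNumber,
  };
  if (timeoutSeconds !== undefined) {
    if (timeoutSeconds < 5 || timeoutSeconds > 120) {
      throw new RangeError("timeout must be between 5 and 120 seconds");
    }
    action.timeout = timeoutSeconds;
    pendingTransfers.add(sessionId);
  }
  return action;
}

// Returns true once per fallback: the session was transferred earlier and
// has now come back via session_start.
function isTransferFallback(sessionId: string): boolean {
  return pendingTransfers.delete(sessionId);
}
```

A session-start handler could call isTransferFallback(event.session.id) and, when it returns true, speak something like "Sorry, nobody is available right now" instead of the normal greeting. Entries for transfers that succeed never come back, so a real implementation would also clear them when the call ends.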
typescript
interface AiFlowActionTransfer {
type: "transfer";
session_id: string;
target_phone_number: string; // Recommended: E.164 format without the leading +
caller_id_name: string;
caller_id_number: string;
/** Optional transfer timeout in seconds (5–120). Enables transfer fallback. */
timeout?: number;
}
Example:
typescript
// Transfer to sales department — fall back to the agent after 30s of no answer
return {
type: "transfer",
session_id: event.session.id,
target_phone_number: "1234567890",
caller_id_name: "Sales Department",
caller_id_number: "1234567890",
timeout: 30,
};
Barge-In Action
Manually triggers barge-in (interrupts current playback).
typescript
interface AiFlowActionBargeIn {
type: "barge_in";
session_id: string;
}
Example:
typescript
// Manually interrupt current playback
return {
type: "barge_in",
session_id: event.session.id,
};
Configure Transcription Action
Changes the STT (Speech-to-Text) provider and/or recognition language(s) during an active call, without hanging up.
typescript
import { TranscriptionProvider } from "@sipgate/ai-flow-sdk";
interface AiFlowActionConfigureTranscription {
type: "configure_transcription";
session_id: string;
provider?: TranscriptionProvider; // "AZURE" | "DEEPGRAM" | "ELEVEN_LABS" | "SIPGATE_QWEN" | "SIPGATE_PARAKEET" — omit to keep current
languages?: string[]; // BCP-47 codes, 1-4 entries — omit to reset to provider default
custom_vocabulary?: string[]; // Words/phrases to boost STT recognition
vad?: VadConfig; // Session-wide VAD tuning — see /sdk/vad
}
At least one of provider, languages, custom_vocabulary, or vad should be provided; sending none is a no-op.
Both fields use full replace semantics — they never merge with existing settings.
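Replace semantics mean that adding a language requires re-sending the complete list, not just the new entry. The sketch below assumes that languages is among the fields that are replaced rather than merged; withAddedLanguage is a hypothetical helper, and the session id is a placeholder.

```typescript
// Hypothetical helper: returns the full list to send when one more language
// should be recognised. The previous list is not merged by the service, so
// the caller keeps track of it and sends it whole.
function withAddedLanguage(current: readonly string[], extra: string): string[] {
  const next = current.indexOf(extra) === -1 ? [...current, extra] : [...current];
  if (next.length > 4) {
    throw new RangeError("languages accepts at most 4 entries");
  }
  return next;
}

// Usage: the handler tracks the active list itself.
const activeLanguages = ["de-DE"];
const addEnglish = {
  type: "configure_transcription" as const,
  session_id: "example-session-id", // placeholder; use event.session.id in a handler
  languages: withAddedLanguage(activeLanguages, "en-US"),
};
```

Sending only ["en-US"] here would switch recognition to English alone instead of adding English to German.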
Examples:
typescript
// Switch to German
return {
type: "configure_transcription",
session_id: event.session.id,
languages: ["de-DE"],
};
// Multi-language detection (German + English)
return {
type: "configure_transcription",
session_id: event.session.id,
languages: ["de-DE", "en-US"],
};
// Switch STT provider to Deepgram
return {
type: "configure_transcription",
session_id: event.session.id,
provider: "DEEPGRAM",
};
// Switch provider AND language in one step
return {
type: "configure_transcription",
session_id: event.session.id,
provider: "DEEPGRAM",
languages: ["en-US"],
};
// Reset to provider default (automatic detection)
return {
type: "configure_transcription",
session_id: event.session.id,
};
Audio gap during restart: any change restarts the transcription engine, and audio received during the restart is dropped (about 100–500 ms for a language-only change, about 200–800 ms for a provider switch).
Multi-language support depends on the active STT provider:
- Azure: up to 4 languages, all used for simultaneous Language Identification (LID)
- Deepgram: multilingual auto-detection across all supplied languages
- ElevenLabs: single language only — only the first entry is used; additional entries are silently ignored
- Qwen (hosted by sipgate) ("SIPGATE_QWEN") / Parakeet (hosted by sipgate) ("SIPGATE_PARAKEET"): single language only — only the first entry is used; additional entries are silently ignored
Barge-in latency after provider switch (for immediate strategy):
- Azure: ~20–80 ms
- Deepgram: ~20–100 ms
- ElevenLabs: ~30–120 ms
- Qwen / Parakeet (hosted by sipgate): ~30–120 ms
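Because single-language providers silently ignore extra entries, it can help to compute client-side which languages a provider will actually use. The following sketch mirrors the multi-language rules listed in this section; Provider, MULTI_LANGUAGE and effectiveLanguages are hypothetical local names, not SDK exports.

```typescript
type Provider =
  | "AZURE"
  | "DEEPGRAM"
  | "ELEVEN_LABS"
  | "SIPGATE_QWEN"
  | "SIPGATE_PARAKEET";

// Providers documented in this section as supporting several languages at once.
const MULTI_LANGUAGE: readonly Provider[] = ["AZURE", "DEEPGRAM"];

// Hypothetical helper: returns the languages a provider is documented to use.
// Multi-language providers keep up to 4 entries (the documented maximum for
// the languages field); all others use only the first entry.
function effectiveLanguages(provider: Provider, languages: readonly string[]): string[] {
  if (MULTI_LANGUAGE.indexOf(provider) !== -1) {
    return languages.slice(0, 4);
  }
  return languages.slice(0, 1);
}
```

Comparing the returned list with the requested one lets a handler log a warning before entries are dropped without notice.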
Type Safety
All actions are fully typed. Import types from the SDK:
typescript
import type {
AiFlowAction,
AiFlowActionSpeak,
AiFlowActionAudio,
AiFlowActionMixAudio,
AiFlowActionHangup,
AiFlowActionTransfer,
AiFlowActionBargeIn,
AiFlowActionConfigureTranscription,
} from "@sipgate/ai-flow-sdk";
import { TranscriptionProvider } from "@sipgate/ai-flow-sdk";
onUserSpeak: async (event) => {
const action: AiFlowActionSpeak = {
type: "speak",
session_id: event.session.id,
text: "Hello!",
};
return action;
};
Next Steps
- TTS Providers - Configure text-to-speech voices
- Barge-In Configuration - Control interruption behavior
- API Reference - Complete API documentation