# sipgate AI Flow API - LLM Reference

> Complete reference for code generation LLMs (Claude, ChatGPT, Cursor, etc.)
> This document contains all API and SDK knowledge in a single file for efficient code generation.

## Table of Contents

1. [Overview](#overview)
2. [Core Concepts](#core-concepts)
3. [Authentication](#authentication)
4. [Integration Methods](#integration-methods)
5. [Event Types](#event-types)
6. [Action Types](#action-types)
7. [Barge-In Configuration](#barge-in-configuration)
8. [TTS Providers](#tts-providers)
9. [TypeScript SDK](#typescript-sdk)
10. [Complete Examples](#complete-examples)

---

## Overview

sipgate AI Flow is a voice assistant platform for building AI-powered voice applications with real-time speech processing. It supports:

- **Language Agnostic**: Works with ANY programming language via HTTP/WebSocket with plain JSON
- **TypeScript SDK**: Convenient SDK available for TypeScript/JavaScript developers
- **Event-Driven**: Clean event-driven architecture for handling conversations
- **Real-Time Speech**: Built-in speech-to-text and text-to-speech (Azure, ElevenLabs)
- **Multiple Integration Methods**: HTTP webhooks, WebSocket, or TypeScript SDK

### Architecture

The platform uses an event-driven model:

1. AI Flow Service receives phone calls
2. Service sends events (JSON) to your application via HTTP or WebSocket
3. Your application processes events and returns actions (JSON)
4. Service executes actions (speak, transfer, hangup, etc.)

```
Phone Call → AI Flow Service → Your Application
                                      ↓
                                Process Event
                                      ↓
                                Return Action
                                      ↓
             AI Flow Service ← Execute Action
                   ↓
               Phone Call
```

---

## Core Concepts

### Event-Driven Architecture

Your application receives events and responds with actions:

**Events (Service → Your App):**

- `session_start` - Call begins
- `user_speak` - User speaks (after STT, includes `barged_in` flag if interrupted)
- `assistant_speak` - Assistant speaks
- `assistant_speech_ended` - Assistant finishes speaking
- `user_input_timeout` - Timeout waiting for user input
- `session_end` - Call ends

**Actions (Your App → Service):**

- `speak` - Speak text/SSML to user
- `audio` - Play pre-recorded audio
- `transfer` - Transfer call to another number
- `hangup` - End the call
- `barge_in` - Manually interrupt playback

### Session Information

All events include session information:

```json
{
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "account_id": "account-123",
    "phone_number": "+1234567890",
    "direction": "inbound",
    "from_phone_number": "+9876543210",
    "to_phone_number": "+1234567890"
  }
}
```

### Event Flow

```
session_start → user_speak → assistant_speak → assistant_speech_ended → user_speak
                                  ↓                        ↓
                user_speak (barged_in=true) ←┘   user_input_timeout (no speech)
                                  ↓                        ↓
                            session_end ←──────────────────┘
```

**Notes:**

- After `assistant_speech_ended`, the system waits for user input
- If configured with `user_input_timeout_seconds`, a timeout event fires if no speech is detected
- `user_speak` with `barged_in=true` indicates the user interrupted during assistant speech
- All paths eventually lead to `session_end`
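The sections below define these events and actions precisely. As a quick orientation, here is a minimal sketch of the dispatch loop in TypeScript (the type shapes are simplified stand-ins for the full definitions later in this document; any language that can parse JSON works the same way):

```typescript
// A minimal sketch of the event → action loop.
type AiFlowEvent = { type: string; text?: string; session: { id: string } };
type AiFlowAction = { type: string; session_id: string; text?: string } | null;

function handleEvent(event: AiFlowEvent): AiFlowAction {
  switch (event.type) {
    case "session_start":
      return { type: "speak", session_id: event.session.id, text: "Welcome!" };
    case "user_speak":
      return { type: "speak", session_id: event.session.id, text: `You said: ${event.text}` };
    case "session_end":
      return null; // no action allowed here; cleanup only
    default:
      return null; // translates to 204 No Content (or no WebSocket message)
  }
}
```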
---

## Authentication

### Shared Secret Authentication

AI Flow can authenticate webhook requests using shared secrets sent in HTTP headers.

**Request Headers:**

- `X-API-TOKEN`: The shared secret you configured
- `Content-Type`: `application/json`
- `User-Agent`: Service identifier

**Example Validation:**

Python:

```python
import os

from flask import abort, request

SHARED_SECRET = os.environ.get('AI_FLOW_SHARED_SECRET')

@app.route('/webhook', methods=['POST'])
def webhook():
    provided_secret = request.headers.get('X-API-TOKEN')
    if provided_secret != SHARED_SECRET:
        abort(401)
    # Process event...
```

Node.js:

```javascript
const SHARED_SECRET = process.env.AI_FLOW_SHARED_SECRET;

app.post('/webhook', (req, res) => {
  if (req.headers['x-api-token'] !== SHARED_SECRET) {
    return res.status(401).json({ error: 'Unauthorized' });
  }
  // Process event...
});
```

Go:

```go
var sharedSecret = os.Getenv("AI_FLOW_SHARED_SECRET")

func webhook(w http.ResponseWriter, r *http.Request) {
	if r.Header.Get("X-API-TOKEN") != sharedSecret {
		w.WriteHeader(http.StatusUnauthorized)
		return
	}
	// Process event...
}
```

---

## Integration Methods

### HTTP Webhooks

**Best for:** Serverless functions, REST APIs, simple integrations

Your webhook endpoint receives POST requests with JSON events.

**Requirements:**

1. Accept POST requests at a public URL
2. Parse JSON from the request body
3. Return JSON actions or `204 No Content`
4. Respond quickly (under 1 second recommended)
5. Use HTTPS in production

**Request Format:**

```http
POST /webhook HTTP/1.1
Content-Type: application/json
X-API-TOKEN: your-secret

{
  "type": "user_speak",
  "text": "Hello",
  "session": { ... }
}
```

**Response Format (with action):**

```http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?"
}
```

**Response Format (no action):**

```http
HTTP/1.1 204 No Content
```

**Python Example:**

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json

    if event['type'] == 'session_start':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': 'Welcome! How can I help you?'
        })

    if event['type'] == 'user_speak':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': f"You said: {event['text']}"
        })

    return '', 204
```
**Node.js Example:**

```javascript
const express = require('express');
const app = express();
app.use(express.json());

app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'Welcome! How can I help you?'
    });
  }

  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`
    });
  }

  res.status(204).send();
});

app.listen(3000);
```

**Go Example:**

```go
package main

import (
	"encoding/json"
	"net/http"
)

type Event struct {
	Type    string  `json:"type"`
	Text    string  `json:"text,omitempty"`
	Session Session `json:"session"`
}

type Session struct {
	ID string `json:"id"`
}

func webhook(w http.ResponseWriter, r *http.Request) {
	var event Event
	if err := json.NewDecoder(r.Body).Decode(&event); err != nil {
		w.WriteHeader(http.StatusBadRequest)
		return
	}

	if event.Type == "user_speak" {
		action := map[string]interface{}{
			"type":       "speak",
			"session_id": event.Session.ID,
			"text":       "You said: " + event.Text,
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(action)
		return
	}

	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/webhook", webhook)
	http.ListenAndServe(":3000", nil)
}
```

### WebSocket

**Best for:** Real-time applications, lower latency, high-volume scenarios

AI Flow Service initiates the WebSocket connection to your server when a call starts.

**How it works:**

1. AI Flow Service connects to your WebSocket server
2. Service sends JSON events as text messages
3. Your server processes events and sends JSON actions back
4. The connection remains open for the duration of the call

**Message Format (Receive Event):**

```json
{
  "type": "user_speak",
  "text": "Hello",
  "session": { ... }
}
```

**Message Format (Send Action):**

```json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello!"
}
```

**Python Example:**

```python
import asyncio
import json

import websockets

# Note: newer releases of the websockets library pass only the
# connection to the handler; drop the `path` parameter there.
async def handle_message(websocket, path):
    async for message in websocket:
        event = json.loads(message)

        if event['type'] == 'user_speak':
            action = {
                'type': 'speak',
                'session_id': event['session']['id'],
                'text': f"You said: {event['text']}"
            }
            await websocket.send(json.dumps(action))

async def main():
    async with websockets.serve(handle_message, "localhost", 8765):
        await asyncio.Future()

asyncio.run(main())
```

**Node.js Example:**

```javascript
const WebSocket = require('ws');

const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (ws) => {
  ws.on('message', (data) => {
    const event = JSON.parse(data.toString());

    if (event.type === 'user_speak') {
      const action = {
        type: 'speak',
        session_id: event.session.id,
        text: `You said: ${event.text}`
      };
      ws.send(JSON.stringify(action));
    }
  });

  ws.on('error', (error) => {
    console.error('WebSocket error:', error);
  });
});
```

**Go Example:**

```go
package main

import (
	"encoding/json"
	"net/http"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{
	CheckOrigin: func(r *http.Request) bool { return true },
}

func websocketHandler(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	defer conn.Close()

	for {
		var event map[string]interface{}
		if err := conn.ReadJSON(&event); err != nil {
			break
		}

		if event["type"] == "user_speak" {
			session := event["session"].(map[string]interface{})
			action := map[string]interface{}{
				"type":       "speak",
				"session_id": session["id"],
				"text":       "You said: " + event["text"].(string),
			}
			conn.WriteJSON(action)
		}
	}
}

func main() {
	http.HandleFunc("/ws", websocketHandler)
	http.ListenAndServe(":8080", nil)
}
```

---

## Event Types

### Base Event Structure

All events include:

- `type`: Event type identifier (string)
- `session`: Session information object

### 1. session_start

Triggered when a new call session begins.
**Structure:**

```json
{
  "type": "session_start",
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "account_id": "account-123",
    "phone_number": "+1234567890",
    "direction": "inbound",
    "from_phone_number": "+9876543210",
    "to_phone_number": "+1234567890"
  }
}
```

**Fields:**

- `type`: Always `"session_start"`
- `session.id`: UUID - Unique session identifier
- `session.account_id`: Account identifier
- `session.phone_number`: Phone number for this flow session
- `session.direction`: `"inbound"` or `"outbound"` (optional)
- `session.from_phone_number`: Caller's phone number
- `session.to_phone_number`: Callee's phone number

**Response:** Can return any action or `204 No Content`

**Example:**

```python
if event['type'] == 'session_start':
    return {
        'type': 'speak',
        'session_id': event['session']['id'],
        'text': 'Welcome! How can I help you today?'
    }
```

### 2. user_speak

Triggered when the user speaks (after speech-to-text completes).

**Structure:**

```json
{
  "type": "user_speak",
  "text": "Hello, I need help",
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "account_id": "account-123",
    "phone_number": "+1234567890"
  }
}
```

**Fields:**

- `type`: Always `"user_speak"`
- `text`: Recognized speech text (string)
- `session`: Session information

**Response:** Can return any action or `204 No Content`

**Example:**

```javascript
if (event.type === 'user_speak') {
  const userText = event.text.toLowerCase();

  if (userText.includes('goodbye')) {
    return { type: 'hangup', session_id: event.session.id };
  }

  return {
    type: 'speak',
    session_id: event.session.id,
    text: `You said: ${event.text}`
  };
}
```

### 3. assistant_speak

Triggered when the assistant starts speaking. May be omitted for some TTS models.

**Structure:**

```json
{
  "type": "assistant_speak",
  "text": "Hello! How can I help?",
  "duration_ms": 2000,
  "speech_started_at": 1234567890,
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000"
  }
}
```

**Fields:**

- `type`: Always `"assistant_speak"`
- `text`: Text that was spoken (optional)
- `ssml`: SSML that was used (optional)
- `duration_ms`: Duration of speech in milliseconds (number)
- `speech_started_at`: Unix timestamp in milliseconds when speech started (number)
- `session`: Session information

**Response:** Can return any action or `204 No Content`

**Use for:** Tracking metrics, triggering next actions, logging

### 4. assistant_speech_ended

Triggered when the assistant finishes speaking.

**Structure:**

```json
{
  "type": "assistant_speech_ended",
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000"
  }
}
```

**Fields:**

- `type`: Always `"assistant_speech_ended"`
- `session`: Session information

**Response:** Can return any action or `204 No Content` (one pattern for chaining actions is sketched after section 5 below)

### 5. Barge-In Detection

User interruptions are detected via the `barged_in` flag in `user_speak` events. When a user interrupts the assistant mid-speech, the `user_speak` event includes `barged_in: true`.

**Example:**

```json
{
  "type": "user_speak",
  "text": "Wait",
  "barged_in": true,
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000"
  }
}
```

**Handling:**

```python
if event['type'] == 'user_speak':
    if event.get('barged_in'):
        # User interrupted
        return {
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': "I'm listening, please continue."
        }
    else:
        # Normal speech processing
        return process_user_input(event['text'])
```
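Because `assistant_speech_ended` (section 4) may return any action, one useful pattern is chaining multi-part prompts: queue the remaining parts per session and emit the next one each time speech ends. A minimal sketch, assuming in-memory state keyed by session id (the queue and function names are illustrative, not part of the API); returning `null` when the queue is empty avoids an endless speak loop:

```typescript
// Sketch: chaining multi-part announcements via assistant_speech_ended.
const pendingPrompts = new Map<string, string[]>(); // session_id -> queued texts

function onAssistantSpeechEnded(event: { session: { id: string } }) {
  const queue = pendingPrompts.get(event.session.id);
  const next = queue?.shift();
  if (!next) return null; // nothing queued: wait for user input instead
  return { type: "speak", session_id: event.session.id, text: next };
}
```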
### 6. user_input_timeout

Triggered when no user speech is detected within the configured timeout period after the assistant finishes speaking.

**Structure:**

```json
{
  "type": "user_input_timeout",
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "account_id": "account-123",
    "phone_number": "+1234567890",
    "direction": "inbound",
    "from_phone_number": "+9876543210",
    "to_phone_number": "+1234567890"
  }
}
```

**Fields:**

- `type`: Always `"user_input_timeout"`
- `session`: Session information

**When Triggered:**

1. A `speak` action includes a `user_input_timeout_seconds` field
2. The assistant finishes speaking (`assistant_speech_ended` event fires)
3. The specified timeout period elapses without any user speech detected

**Response:** Can return any action or `204 No Content`

**Examples:**

Retry question:

```python
if event['type'] == 'user_input_timeout':
    return {
        'type': 'speak',
        'session_id': event['session']['id'],
        'text': 'Are you still there? Please say yes or no.',
        'user_input_timeout_seconds': 5
    }
```

Hangup after multiple timeouts:

```javascript
const timeoutCounts = new Map();

if (event.type === 'user_input_timeout') {
  const sessionId = event.session.id;
  const count = (timeoutCounts.get(sessionId) || 0) + 1;
  timeoutCounts.set(sessionId, count);

  if (count >= 3) {
    return { type: 'hangup', session_id: sessionId };
  }

  return {
    type: 'speak',
    session_id: sessionId,
    text: `I didn't hear anything. Please respond. Attempt ${count} of 3.`,
    user_input_timeout_seconds: 5
  };
}
```

Go example:

```go
var timeoutCounts = make(map[string]int)

if event.Type == "user_input_timeout" {
	sessionID := event.Session.ID
	timeoutCounts[sessionID]++
	count := timeoutCounts[sessionID]

	if count >= 3 {
		return map[string]interface{}{
			"type":       "hangup",
			"session_id": sessionID,
		}
	}

	return map[string]interface{}{
		"type":                       "speak",
		"session_id":                 sessionID,
		"text":                       fmt.Sprintf("I didn't hear anything. Please respond. Attempt %d of 3.", count),
		"user_input_timeout_seconds": 5,
	}
}
```

Ruby example:

```ruby
# Keep the counter somewhere that outlives a single request,
# e.g. a global here (use a real store in production).
$timeout_counts = {}

if event['type'] == 'user_input_timeout'
  session_id = event['session']['id']
  $timeout_counts[session_id] ||= 0
  $timeout_counts[session_id] += 1
  count = $timeout_counts[session_id]

  if count >= 3
    return { type: 'hangup', session_id: session_id }
  end

  {
    type: 'speak',
    session_id: session_id,
    text: "I didn't hear anything. Please respond. Attempt #{count} of 3.",
    user_input_timeout_seconds: 5
  }
end
```

### 7. session_end

Triggered when the call session ends.

**Structure:**

```json
{
  "type": "session_end",
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000"
  }
}
```

**Fields:**

- `type`: Always `"session_end"`
- `session`: Session information

**Response:** **NO ACTION ALLOWED** - Use only for cleanup

**Example:**

```javascript
if (event.type === 'session_end') {
  console.log(`Session ${event.session.id} ended`);
  // Cleanup, logging, analytics only
  // Do NOT return any action
  return null;
}
```
---

## Action Types

### Base Action Structure

All actions require:

- `session_id`: UUID from the event's `session.id` (string)
- `type`: Action type identifier (string)

### 1. speak

Speak text or SSML to the user using text-to-speech.

**Structure:**

```json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?",
  "tts": {
    "provider": "azure",
    "language": "en-US",
    "voice": "en-US-JennyNeural"
  },
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 3
  }
}
```

**Fields:**

- `type`: Always `"speak"` (required)
- `session_id`: Session identifier (required)
- `text`: Plain text to speak (required if `ssml` not provided)
- `ssml`: SSML markup for advanced control (required if `text` not provided)
- `tts`: TTS provider configuration (optional)
- `barge_in`: Barge-in behavior configuration (optional)
- `user_input_timeout_seconds`: Timeout in seconds to wait for user input after speech ends. If no speech is detected within this time, a `user_input_timeout` event is sent (optional)

**Note:** Either `text` OR `ssml` is required (not both)

**Simple Example:**

```json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?"
}
```

**SSML Example:**

```json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "ssml": "<speak>Please listen carefully. <break time=\"500ms\"/> Your account balance is $42.50</speak>"
}
```

**With Custom TTS:**

```json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello in a different voice",
  "tts": {
    "provider": "eleven_labs",
    "voice": "21m00Tcm4TlvDq8ikWAM"
  }
}
```

**With User Input Timeout:**

```json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "What is your account number?",
  "user_input_timeout_seconds": 5
}
```

**Behavior:**

- The timer starts when the assistant finishes speaking (`assistant_speech_ended` event)
- The timer is cleared when the user starts speaking (any STT event)
- If the timeout is reached, a `user_input_timeout` event is sent
- Your application can respond with any action (e.g., repeat the question, hang up)

**Example with timeout handling:**

```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'What is your account number?',
      user_input_timeout_seconds: 5
    });
  }

  if (event.type === 'user_input_timeout') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'I didn\'t hear anything. Let me try again. What is your account number?',
      user_input_timeout_seconds: 5
    });
  }

  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `Your account number is ${event.text}`
    });
  }

  res.status(204).send();
});
```

### 2. audio

Play pre-recorded audio to the user.

**Structure:**

```json
{
  "type": "audio",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=",
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 3
  }
}
```

**Fields:**

- `type`: Always `"audio"` (required)
- `session_id`: Session identifier (required)
- `audio`: Base64 encoded WAV audio data (required)
- `barge_in`: Barge-in behavior configuration (optional)

**Audio Format Requirements:**

- **Format**: WAV
- **Sample Rate**: 16kHz
- **Channels**: Mono (single channel)
- **Bit Depth**: 16-bit PCM
- **Encoding**: Base64

**Python Example:**

```python
import base64

with open('hold-music.wav', 'rb') as audio_file:
    audio_data = audio_file.read()

base64_audio = base64.b64encode(audio_data).decode('utf-8')

action = {
    'type': 'audio',
    'session_id': event['session']['id'],
    'audio': base64_audio
}
```

**Converting Audio with FFmpeg:**

```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -sample_fmt s16 -f wav output.wav
```
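The same encoding step in Node.js/TypeScript might look like this (a sketch; the file path is a placeholder, and the file must already meet the format requirements above):

```typescript
import { readFileSync } from "node:fs";

// Sketch: base64-encode a WAV file for an `audio` action.
// 'hold-music.wav' is illustrative and must be 16 kHz mono 16-bit PCM WAV.
function audioAction(sessionId: string, path = "hold-music.wav") {
  const base64Audio = readFileSync(path).toString("base64");
  return { type: "audio", session_id: sessionId, audio: base64Audio };
}
```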
### 3. hangup

End the call.

**Structure:**

```json
{
  "type": "hangup",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
```

**Fields:**

- `type`: Always `"hangup"` (required)
- `session_id`: Session identifier (required)

**Example:**

```python
if 'goodbye' in event['text'].lower():
    return {
        'type': 'hangup',
        'session_id': event['session']['id']
    }
```

### 4. transfer

Transfer the call to another phone number.

**Structure:**

```json
{
  "type": "transfer",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "target_phone_number": "+1234567890",
  "caller_id_name": "Support Department",
  "caller_id_number": "+1234567890"
}
```

**Fields:**

- `type`: Always `"transfer"` (required)
- `session_id`: Session identifier (required)
- `target_phone_number`: Phone number to transfer to, E.164 format recommended (required)
- `caller_id_name`: Caller ID name to display (required)
- `caller_id_number`: Caller ID number to display (required)

**Example:**

```javascript
if (event.text.toLowerCase().includes('sales')) {
  return {
    type: 'transfer',
    session_id: event.session.id,
    target_phone_number: '+1234567890',
    caller_id_name: 'Sales Department',
    caller_id_number: '+1234567890'
  };
}
```

**Phone Number Format:**

- Use E.164 format: `+1234567890` ✅
- Avoid: `123-456-7890` ❌

### 5. barge_in

Manually trigger barge-in (interrupt current playback).

**Structure:**

```json
{
  "type": "barge_in",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
```

**Fields:**

- `type`: Always `"barge_in"` (required)
- `session_id`: Session identifier (required)

---

## Barge-In Configuration

Control how users can interrupt the assistant while it is speaking.

### Configuration Structure

```json
{
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 3,
    "allow_after_ms": 500
  }
}
```

### Strategies

#### 1. none

Disables barge-in completely. Audio plays fully without interruption.

```json
{
  "barge_in": {
    "strategy": "none"
  }
}
```

**Use cases:**

- Critical information
- Legal disclaimers
- Emergency instructions

#### 2. manual

Allows manual barge-in via API only (no automatic detection).

```json
{
  "barge_in": {
    "strategy": "manual"
  }
}
```

**Use cases:**

- Custom interruption logic
- Button-triggered interruption
- External event-based interruption

#### 3. minimum_characters (Default)

Automatically detects barge-in when the user's speech exceeds a character threshold.
```json
{
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 5,
    "allow_after_ms": 500
  }
}
```

**Use cases:**

- Natural conversation flow
- Customer service scenarios
- Interactive voice menus

### Configuration Options

**minimum_characters:**

- Default: `3`
- Range: `1` to `100`
- Higher values require more speech before interruption

**allow_after_ms:**

- Default: `0` (immediate)
- Range: `0` to `10000` (10 seconds)
- Delay before barge-in is allowed (protection period)

### Examples

**Natural Conversation:**

```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "I can help you with billing, support, or sales.",
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 3
  }
}
```

**Critical Information (No Interruption):**

```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "Your verification code is 1-2-3-4-5-6.",
  "barge_in": {
    "strategy": "none"
  }
}
```

**Protected Announcement:**

```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "Your account number is 1234567890.",
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 10,
    "allow_after_ms": 2000
  }
}
```
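One way to keep these configurations consistent across an application is a small helper that maps message sensitivity to a barge-in config. A sketch (the tiers and thresholds are illustrative choices modeled on the examples above, not API defaults):

```typescript
// Sketch: pick a barge-in config by message sensitivity.
type BargeIn =
  | { strategy: "none" }
  | { strategy: "manual" }
  | { strategy: "minimum_characters"; minimum_characters?: number; allow_after_ms?: number };

function bargeInFor(kind: "critical" | "announcement" | "conversational"): BargeIn {
  switch (kind) {
    case "critical":
      return { strategy: "none" }; // e.g. verification codes, legal disclaimers
    case "announcement":
      return { strategy: "minimum_characters", minimum_characters: 10, allow_after_ms: 2000 };
    case "conversational":
      return { strategy: "minimum_characters", minimum_characters: 3 };
  }
}
```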
---

## TTS Providers

Configure text-to-speech providers for different voices and languages.

### Supported Providers

1. **Azure Cognitive Services** - 400+ voices in 140+ languages
2. **ElevenLabs** - Ultra-realistic conversational voices

### Azure Cognitive Services

**Configuration:**

```json
{
  "tts": {
    "provider": "azure",
    "language": "en-US",
    "voice": "en-US-JennyNeural"
  }
}
```

**Popular Voices:**

| Language | Voice Name         | Gender | Description            |
| -------- | ------------------ | ------ | ---------------------- |
| en-US    | en-US-JennyNeural  | Female | Friendly, professional |
| en-US    | en-US-GuyNeural    | Male   | Clear, neutral         |
| en-GB    | en-GB-SoniaNeural  | Female | British, professional  |
| en-GB    | en-GB-RyanNeural   | Male   | British, friendly      |
| de-DE    | de-DE-KatjaNeural  | Female | Professional, clear    |
| de-DE    | de-DE-ConradNeural | Male   | Deep, authoritative    |

**Full Example:**

```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "Hallo, wie kann ich Ihnen helfen?",
  "tts": {
    "provider": "azure",
    "language": "de-DE",
    "voice": "de-DE-KatjaNeural"
  }
}
```

### ElevenLabs

**Configuration:**

```json
{
  "tts": {
    "provider": "eleven_labs",
    "voice": "21m00Tcm4TlvDq8ikWAM"
  }
}
```

The `voice` field accepts the ElevenLabs voice ID as a string. If omitted, the first available voice will be used.

**Minimal Configuration (uses default voice):**

```json
{
  "tts": {
    "provider": "eleven_labs"
  }
}
```

**Popular Voices:**

| Voice Name | ID                   | Description                                                                |
| ---------- | -------------------- | -------------------------------------------------------------------------- |
| Rachel     | 21m00Tcm4TlvDq8ikWAM | Matter-of-fact, personable woman. Great for conversational use cases.      |
| Sarah      | EXAVITQu4vr4xnSDxMaL | Young adult woman with confident, warm tone. Reassuring and professional.   |
| George     | JBFqnCBsd6RMkjVDRZzb | Warm resonance that instantly captivates listeners.                         |
| Thomas     | GBv7mTt0atIp3Br8iCZE | Soft and subdued male voice, optimal for narrations or meditations.         |
| Adam       | pNInz6obpgDQGcFmaJgB | -                                                                           |
| Brian      | nPczCjzI2devNBz1zQrb | Middle-aged man with resonant and comforting tone. Great for narrations.    |
| Charlie    | IKne3meq5aSn9XLyUdCD | Young Australian male with confident and energetic voice.                   |
| Lily       | pFZP5JQG7iQjIQuC4Bku | Velvety British female voice delivers news with warmth and clarity.         |

**Full Example:**

```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "Hello! How can I help you today?",
  "tts": {
    "provider": "eleven_labs",
    "voice": "21m00Tcm4TlvDq8ikWAM"
  }
}
```

### Choosing a Provider

**Use Azure when:**

- You need many languages (140+)
- You want consistent quality
- You need regional accents
- Budget is a concern

**Use ElevenLabs when:**

- You need the most natural voices
- Conversational quality is critical
- You're working with English/European languages
- You want distinct personalities
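This guidance can be encoded in a small helper. A sketch (the language-to-voice mapping is an illustrative assumption using voices from the tables above; adjust to taste):

```typescript
// Sketch: pick a TTS config based on the caller's language.
const azureVoices: Record<string, string> = {
  "de-DE": "de-DE-KatjaNeural",
  "en-GB": "en-GB-SoniaNeural",
};

function ttsFor(language: string) {
  if (language === "en-US") {
    // ElevenLabs for conversational English quality; Rachel is the example voice above.
    return { provider: "eleven_labs" as const, voice: "21m00Tcm4TlvDq8ikWAM" };
  }
  // Azure covers many more languages; voice may be undefined for unmapped languages.
  return { provider: "azure" as const, language, voice: azureVoices[language] };
}
```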
---

## TypeScript SDK

The `@sipgate/ai-flow-sdk` provides a convenient TypeScript/JavaScript SDK that wraps the API.

### Installation

```bash
npm install @sipgate/ai-flow-sdk
# or
yarn add @sipgate/ai-flow-sdk
# or
pnpm add @sipgate/ai-flow-sdk
```

**Requirements:**

- Node.js >= 22.0.0
- TypeScript 5.x (recommended)

### Basic Usage

```typescript
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const assistant = AiFlowAssistant.create({
  debug: true,

  onSessionStart: async (event) => {
    console.log(`Session started for ${event.session.phone_number}`);
    return "Hello! How can I help you today?";
  },

  onUserSpeak: async (event) => {
    console.log(`User said: ${event.text}`);
    return `You said: ${event.text}`;
  },

  onSessionEnd: async (event) => {
    console.log(`Session ${event.session.id} ended`);
  },
});
```

### AiFlowAssistant.create(options)

Creates a new assistant instance.

**Options:**

```typescript
interface AiFlowAssistantOptions {
  // Optional API key for authentication
  apiKey?: string;

  // Enable debug logging
  debug?: boolean;

  // Event handlers
  onSessionStart?: (event: AiFlowEventSessionStart) => Promise<InvocationResponseType>;
  onUserSpeak?: (event: AiFlowEventUserSpeak) => Promise<InvocationResponseType>;
  onUserInputTimeout?: (event: AiFlowEventUserInputTimeout) => Promise<InvocationResponseType>;
  onAssistantSpeak?: (event: AiFlowEventAssistantSpeak) => Promise<InvocationResponseType>;
  onAssistantSpeechEnded?: (event: AiFlowEventAssistantSpeechEnded) => Promise<InvocationResponseType>;
  onSessionEnd?: (event: AiFlowEventSessionEnd) => Promise<InvocationResponseType>;
  onUserBargeIn?: (event: AiFlowEventUserBargeIn) => Promise<InvocationResponseType>;
}

type InvocationResponseType = AiFlowAction | string | null | undefined;
```

### Response Types

Event handlers can return three types:

**1. Simple String (auto-converted to speak action):**

```typescript
onUserSpeak: async (event) => {
  return "Hello, how can I help?";
}
```

**2. Action Object (for advanced control):**

```typescript
onUserSpeak: async (event) => {
  return {
    type: "speak",
    session_id: event.session.id,
    text: "Hello!",
    barge_in: {
      strategy: "minimum_characters",
      minimum_characters: 3
    }
  };
}
```

**3. No Response (null/undefined):**

```typescript
onAssistantSpeak: async (event) => {
  // Track metrics, no response needed
  trackMetrics(event);
  return null;
}
```

### Express.js Integration

```typescript
import express from "express";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const app = express();
app.use(express.json());

const assistant = AiFlowAssistant.create({
  onSessionStart: async (event) => {
    return "Welcome! How can I help you today?";
  },

  onUserSpeak: async (event) => {
    return processUserInput(event.text);
  },

  onSessionEnd: async (event) => {
    await cleanupSession(event.session.id);
  },
});

// Webhook endpoint
app.post("/webhook", assistant.express());

// Health check
app.get("/health", (req, res) => {
  res.json({ status: "ok" });
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`AI Flow assistant running on port ${PORT}`);
});
```

### WebSocket Integration

```typescript
import WebSocket from "ws";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const wss = new WebSocket.Server({
  port: 8080,
  perMessageDeflate: false,
});

const assistant = AiFlowAssistant.create({
  onUserSpeak: async (event) => {
    return "Hello from WebSocket!";
  },
});

wss.on("connection", (ws, req) => {
  console.log("New WebSocket connection");

  ws.on("message", assistant.ws(ws));

  ws.on("error", (error) => {
    console.error("WebSocket error:", error);
  });

  ws.on("close", () => {
    console.log("WebSocket connection closed");
  });
});

console.log("WebSocket server listening on port 8080");
```

### Custom Integration

```typescript
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const assistant = AiFlowAssistant.create({
  onUserSpeak: async (event) => {
    return "Hello!";
  },
});

// Custom integration
app.post("/custom-webhook", async (req, res) => {
  const event = req.body;
  const action = await assistant.onEvent(event);

  if (action) {
    res.json(action);
  } else {
    res.status(204).send();
  }
});
```

### Type Definitions

All types are exported from the SDK:

```typescript
import type {
  // Events
  AiFlowEventSessionStart,
  AiFlowEventUserSpeak,
  AiFlowEventAssistantSpeak,
  AiFlowEventAssistantSpeechEnded,
  AiFlowEventSessionEnd,
  AiFlowEventUserBargeIn,

  // Actions
  AiFlowAction,
  AiFlowActionSpeak,
  AiFlowActionAudio,
  AiFlowActionHangup,
  AiFlowActionTransfer,
  AiFlowActionBargeIn,

  // Session
  AiFlowEventSessionInfo,
} from "@sipgate/ai-flow-sdk";

// Typed handler example:
onUserSpeak: async (event: AiFlowEventUserSpeak) => {
  const text: string = event.text;
  const sessionId: string = event.session.id;

  return {
    type: "speak",
    session_id: sessionId,
    text: `You said: ${text}`,
  } as AiFlowAction;
}
```

### SDK Action Types (TypeScript)

All SDK action types with full TypeScript definitions:

```typescript
// Speak Action
interface AiFlowActionSpeak {
  type: "speak";
  session_id: string;
  text?: string;
  ssml?: string;
  tts?: {
    provider: "azure";
    language?: string;
    voice?: string;
  } | {
    provider: "eleven_labs";
    voice?: string; // ElevenLabs voice ID (optional, uses default if omitted)
  };
  barge_in?: {
    strategy: "none" | "manual" | "minimum_characters";
    minimum_characters?: number;
    allow_after_ms?: number;
  };
}

// Audio Action
interface AiFlowActionAudio {
  type: "audio";
  session_id: string;
  audio: string; // Base64 encoded WAV
  barge_in?: {
    strategy: "none" | "manual" | "minimum_characters";
    minimum_characters?: number;
    allow_after_ms?: number;
  };
}

// Hangup Action
interface AiFlowActionHangup {
  type: "hangup";
  session_id: string;
}

// Transfer Action
interface AiFlowActionTransfer {
  type: "transfer";
  session_id: string;
  target_phone_number: string;
  caller_id_name: string;
  caller_id_number: string;
}

// Barge-In Action
interface AiFlowActionBargeIn {
  type: "barge_in";
  session_id: string;
}
```
### SDK Event Types (TypeScript)

All SDK event types with full TypeScript definitions:

```typescript
// Session Info (included in all events)
interface AiFlowEventSessionInfo {
  id: string;
  account_id: string;
  phone_number: string;
  direction?: "inbound" | "outbound";
  from_phone_number: string;
  to_phone_number: string;
}

// Session Start Event
interface AiFlowEventSessionStart {
  type: "session_start";
  session: AiFlowEventSessionInfo;
}

// User Speak Event
interface AiFlowEventUserSpeak {
  type: "user_speak";
  text: string;
  barged_in?: boolean; // true if user interrupted
  session: AiFlowEventSessionInfo;
}

// Assistant Speak Event
interface AiFlowEventAssistantSpeak {
  type: "assistant_speak";
  text?: string;
  ssml?: string;
  duration_ms: number;
  speech_started_at: number;
  session: AiFlowEventSessionInfo;
}

// Assistant Speech Ended Event
interface AiFlowEventAssistantSpeechEnded {
  type: "assistant_speech_ended";
  session: AiFlowEventSessionInfo;
}

// User Input Timeout Event
interface AiFlowEventUserInputTimeout {
  type: "user_input_timeout";
  session: AiFlowEventSessionInfo;
}

// Session End Event
interface AiFlowEventSessionEnd {
  type: "session_end";
  session: AiFlowEventSessionInfo;
}
```
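Because every event carries a discriminated `type` field, a union over these interfaces lets TypeScript check handlers exhaustively. A sketch (the `AiFlowEvent` union name is an assumption; the member interfaces are the ones defined above):

```typescript
// Sketch: exhaustive narrowing over the event union.
type AiFlowEvent =
  | AiFlowEventSessionStart
  | AiFlowEventUserSpeak
  | AiFlowEventAssistantSpeak
  | AiFlowEventAssistantSpeechEnded
  | AiFlowEventUserInputTimeout
  | AiFlowEventSessionEnd;

function describe(event: AiFlowEvent): string {
  switch (event.type) {
    case "session_start":
      return `call from ${event.session.from_phone_number}`;
    case "user_speak":
      return event.barged_in ? `interrupted with "${event.text}"` : `said "${event.text}"`;
    case "assistant_speak":
      return `assistant spoke for ${event.duration_ms}ms`;
    case "assistant_speech_ended":
      return "assistant finished speaking";
    case "user_input_timeout":
      return "user stayed silent";
    case "session_end":
      return "call ended";
  }
}
```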
---

## Complete Examples

### Complete Python (Flask) Example

```python
import os
import time

from flask import Flask, request, jsonify, abort

app = Flask(__name__)

# Configuration
SHARED_SECRET = os.environ.get('AI_FLOW_SHARED_SECRET')

# Session state (use a database in production)
sessions = {}

def authenticate():
    provided_secret = request.headers.get('X-API-TOKEN')
    if provided_secret != SHARED_SECRET:
        abort(401)

@app.route('/webhook', methods=['POST'])
def webhook():
    authenticate()

    try:
        event = request.json
        if not event or 'type' not in event:
            return jsonify({'error': 'Invalid event'}), 400

        session_id = event['session']['id']
        event_type = event['type']

        # Session Start
        if event_type == 'session_start':
            sessions[session_id] = {
                'started_at': time.time(),
                'phone_number': event['session']['phone_number'],
                'timeout_count': 0
            }
            return jsonify({
                'type': 'speak',
                'session_id': session_id,
                'text': 'Welcome to AI Flow! How can I help you today?',
                'user_input_timeout_seconds': 8,
                'barge_in': {
                    'strategy': 'minimum_characters',
                    'minimum_characters': 3
                }
            })

        # User Speak
        elif event_type == 'user_speak':
            # Handle barge-in inside this branch; a separate elif on the
            # same event type further down would never be reached
            if event.get('barged_in'):
                return jsonify({
                    'type': 'speak',
                    'session_id': session_id,
                    'text': "I'm listening, please continue."
                })

            user_text = event['text'].lower()

            # Handle goodbye
            if 'goodbye' in user_text or 'bye' in user_text:
                return jsonify({
                    'type': 'speak',
                    'session_id': session_id,
                    'text': 'Goodbye! Have a great day!'
                })

            # Handle transfer request
            if 'transfer' in user_text or 'agent' in user_text:
                return jsonify({
                    'type': 'transfer',
                    'session_id': session_id,
                    'target_phone_number': '+1234567890',
                    'caller_id_name': 'Support Team',
                    'caller_id_number': '+1234567890'
                })

            # Echo response
            return jsonify({
                'type': 'speak',
                'session_id': session_id,
                'text': f"You said: {event['text']}. How else can I help?",
                'tts': {
                    'provider': 'azure',
                    'language': 'en-US',
                    'voice': 'en-US-JennyNeural'
                }
            })

        # Assistant Speak
        elif event_type == 'assistant_speak':
            print(f"Assistant spoke for {event['duration_ms']}ms")
            return '', 204

        # Handle user input timeout
        elif event_type == 'user_input_timeout':
            session = sessions.setdefault(session_id, {'timeout_count': 0})
            session['timeout_count'] += 1
            count = session['timeout_count']

            # Hangup after 3 timeouts
            if count >= 3:
                return jsonify({
                    'type': 'hangup',
                    'session_id': session_id
                })

            # Retry with prompt
            return jsonify({
                'type': 'speak',
                'session_id': session_id,
                'text': f"I didn't hear anything. Please respond. Attempt {count} of 3.",
                'user_input_timeout_seconds': 5
            })

        # Session End
        elif event_type == 'session_end':
            if session_id in sessions:
                duration = time.time() - sessions[session_id]['started_at']
                print(f"Session {session_id} ended after {duration:.2f} seconds")
                del sessions[session_id]
            return '', 204

        return '', 204

    except Exception as e:
        print(f"Error processing webhook: {e}")
        return jsonify({'error': 'Internal server error'}), 500

@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'ok'})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=3000, debug=True)
```
### Complete Node.js (Express) Example

```javascript
const express = require('express');

const app = express();
app.use(express.json());

// Configuration
const SHARED_SECRET = process.env.AI_FLOW_SHARED_SECRET;

// Session state (use a database in production)
const sessions = new Map();

// Authentication middleware
const authenticate = (req, res, next) => {
  const providedSecret = req.headers['x-api-token'];
  if (providedSecret !== SHARED_SECRET) {
    return res.status(401).json({ error: 'Unauthorized' });
  }
  next();
};

app.post('/webhook', authenticate, (req, res) => {
  try {
    const event = req.body;
    if (!event || !event.type) {
      return res.status(400).json({ error: 'Invalid event' });
    }

    const sessionId = event.session.id;
    const eventType = event.type;

    // Session Start
    if (eventType === 'session_start') {
      sessions.set(sessionId, {
        startedAt: Date.now(),
        phoneNumber: event.session.phone_number,
        timeoutCount: 0
      });

      return res.json({
        type: 'speak',
        session_id: sessionId,
        text: 'Welcome to AI Flow! How can I help you today?',
        user_input_timeout_seconds: 8,
        barge_in: {
          strategy: 'minimum_characters',
          minimum_characters: 3
        }
      });
    }

    // User Speak
    if (eventType === 'user_speak') {
      // Handle barge-in inside this branch; a separate check on the
      // same event type further down would never be reached
      if (event.barged_in) {
        return res.json({
          type: 'speak',
          session_id: sessionId,
          text: "I'm listening, please continue."
        });
      }

      const userText = event.text.toLowerCase();

      // Handle goodbye
      if (userText.includes('goodbye') || userText.includes('bye')) {
        return res.json({
          type: 'speak',
          session_id: sessionId,
          text: 'Goodbye! Have a great day!'
        });
      }

      // Handle transfer request
      if (userText.includes('transfer') || userText.includes('agent')) {
        return res.json({
          type: 'transfer',
          session_id: sessionId,
          target_phone_number: '+1234567890',
          caller_id_name: 'Support Team',
          caller_id_number: '+1234567890'
        });
      }

      // Echo response
      return res.json({
        type: 'speak',
        session_id: sessionId,
        text: `You said: ${event.text}. How else can I help?`,
        tts: {
          provider: 'azure',
          language: 'en-US',
          voice: 'en-US-JennyNeural'
        }
      });
    }

    // Assistant Speak
    if (eventType === 'assistant_speak') {
      console.log(`Assistant spoke for ${event.duration_ms}ms`);
      return res.status(204).send();
    }

    // Handle user input timeout
    if (eventType === 'user_input_timeout') {
      const session = sessions.get(sessionId);
      if (session) {
        session.timeoutCount = (session.timeoutCount || 0) + 1;

        // Hangup after 3 timeouts
        if (session.timeoutCount >= 3) {
          return res.json({ type: 'hangup', session_id: sessionId });
        }

        // Retry with prompt
        return res.json({
          type: 'speak',
          session_id: sessionId,
          text: `I didn't hear anything. Please respond. Attempt ${session.timeoutCount} of 3.`,
          user_input_timeout_seconds: 5
        });
      }
    }

    // Session End
    if (eventType === 'session_end') {
      const session = sessions.get(sessionId);
      if (session) {
        const duration = (Date.now() - session.startedAt) / 1000;
        console.log(`Session ${sessionId} ended after ${duration.toFixed(2)} seconds`);
        sessions.delete(sessionId);
      }
      return res.status(204).send();
    }

    res.status(204).send();
  } catch (error) {
    console.error('Error processing webhook:', error);
    res.status(500).json({ error: 'Internal server error' });
  }
});

app.get('/health', (req, res) => {
  res.json({ status: 'ok' });
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`AI Flow webhook server running on port ${PORT}`);
});
```
### Complete TypeScript SDK Example

```typescript
import express from "express";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";
import type {
  AiFlowEventSessionStart,
  AiFlowEventUserSpeak,
  AiFlowEventAssistantSpeak,
  AiFlowEventUserInputTimeout,
  AiFlowEventSessionEnd,
  AiFlowEventUserBargeIn,
} from "@sipgate/ai-flow-sdk";

const app = express();
app.use(express.json());

// Session state (use a database in production)
interface SessionData {
  startedAt: number;
  phoneNumber: string;
  conversationHistory: string[];
  timeoutCount: number;
}

const sessions = new Map<string, SessionData>();

// Create assistant
const assistant = AiFlowAssistant.create({
  debug: true,
  apiKey: process.env.AI_FLOW_API_KEY,

  onSessionStart: async (event: AiFlowEventSessionStart) => {
    const sessionId = event.session.id;

    // Initialize session
    sessions.set(sessionId, {
      startedAt: Date.now(),
      phoneNumber: event.session.phone_number,
      conversationHistory: [],
      timeoutCount: 0,
    });

    console.log(`Session ${sessionId} started for ${event.session.phone_number}`);

    // Return greeting
    return {
      type: "speak",
      session_id: sessionId,
      text: "Welcome to AI Flow! How can I help you today?",
      user_input_timeout_seconds: 8,
      tts: {
        provider: "azure",
        language: "en-US",
        voice: "en-US-JennyNeural",
      },
      barge_in: {
        strategy: "minimum_characters",
        minimum_characters: 3,
      },
    };
  },

  onUserSpeak: async (event: AiFlowEventUserSpeak) => {
    const sessionId = event.session.id;
    const userText = event.text;

    // Add to conversation history
    const session = sessions.get(sessionId);
    if (session) {
      session.conversationHistory.push(`User: ${userText}`);
    }

    console.log(`User said: ${userText}`);

    // Handle goodbye
    if (userText.toLowerCase().includes('goodbye') || userText.toLowerCase().includes('bye')) {
      return {
        type: "speak",
        session_id: sessionId,
        text: "Goodbye! Have a great day!",
      };
    }

    // Handle transfer request
    if (userText.toLowerCase().includes('transfer') || userText.toLowerCase().includes('agent')) {
      return {
        type: "transfer",
        session_id: sessionId,
        target_phone_number: "+1234567890",
        caller_id_name: "Support Team",
        caller_id_number: "+1234567890",
      };
    }

    // Echo response
    const response = `You said: ${userText}. How else can I help?`;
    if (session) {
      session.conversationHistory.push(`Assistant: ${response}`);
    }
    return response; // Simple string response
  },

  onAssistantSpeak: async (event: AiFlowEventAssistantSpeak) => {
    console.log(`Assistant spoke for ${event.duration_ms}ms`);
    // Track metrics
    // trackMetrics({
    //   sessionId: event.session.id,
    //   duration: event.duration_ms,
    //   text: event.text,
    // });
    return null; // No response needed
  },

  onUserInputTimeout: async (event: AiFlowEventUserInputTimeout) => {
    const sessionId = event.session.id;
    const session = sessions.get(sessionId);

    if (session) {
      session.timeoutCount++;
      console.log(`Timeout #${session.timeoutCount} for session ${sessionId}`);

      // Hangup after 3 timeouts
      if (session.timeoutCount >= 3) {
        return {
          type: "hangup",
          session_id: sessionId,
        };
      }

      // Retry with prompt
      return {
        type: "speak",
        session_id: sessionId,
        text: `I didn't hear anything. Please respond. Attempt ${session.timeoutCount} of 3.`,
        user_input_timeout_seconds: 5,
      };
    }

    return null;
  },

  onSessionEnd: async (event: AiFlowEventSessionEnd) => {
    const sessionId = event.session.id;
    const session = sessions.get(sessionId);

    if (session) {
      const duration = (Date.now() - session.startedAt) / 1000;
      console.log(`Session ${sessionId} ended after ${duration.toFixed(2)} seconds`);
      console.log('Conversation history:', session.conversationHistory);
      // Cleanup
      sessions.delete(sessionId);
    }

    return null; // No action allowed on session_end
  },

  // DEPRECATED: Use onUserSpeak with barged_in instead
  onUserBargeIn: async (event: AiFlowEventUserBargeIn) => {
    console.log(`User interrupted with: ${event.text}`);
    return "I'm listening, please continue.";
  },
});

// Webhook endpoint
app.post("/webhook", assistant.express());

// Health check
app.get("/health", (req, res) => {
  res.json({ status: "ok", sessions: sessions.size });
});

// Start server
const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`AI Flow assistant running on port ${PORT}`);
});
```

---

## Best Practices

### General Guidelines

1. **Respond Quickly**: Keep response times under 1 second for optimal user experience
2. **Handle All Events**: Even if you don't need to respond, implement all event handlers
3. **Use Type Safety**: Leverage TypeScript types when using the SDK
4. **Error Handling**: Always handle errors gracefully and return fallback responses
5. **State Management**: Use databases for session state in production, not in-memory maps
6. **Logging**: Log all events and actions for debugging and analytics
7. **Testing**: Test with real phone calls before deploying to production

### Security

1. **Use HTTPS**: Always use HTTPS in production
2. **Validate Shared Secrets**: Always verify the shared secret sent by AI Flow (see the sketch below)
3. **Store Secrets Securely**: Use environment variables or secret management services
4. **Use Strong Secrets**: Generate cryptographically secure random secrets
5. **Rate Limiting**: Implement rate limiting to prevent abuse
6. **Input Validation**: Validate all incoming events
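As a sketch of point 2, Node's built-in `crypto.timingSafeEqual` compares secrets in constant time, which avoids leaking information through response timing (the helper name is illustrative):

```typescript
import { timingSafeEqual } from "node:crypto";

// Sketch: constant-time comparison of the X-API-TOKEN header against
// the configured secret. The length check comes first because
// timingSafeEqual throws on buffers of unequal length.
function secretMatches(provided: string | undefined, expected: string): boolean {
  if (!provided || provided.length !== expected.length) return false;
  return timingSafeEqual(Buffer.from(provided), Buffer.from(expected));
}
```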
### Performance

1. **Async Processing**: Process long-running tasks asynchronously
2. **Caching**: Cache frequently used data (audio files, responses)
3. **Database Connection Pooling**: Use connection pooling for database operations
4. **Minimize External API Calls**: Batch requests when possible

### Barge-In Configuration

1. **Use `none` sparingly**: Only for truly critical information
2. **Default to `minimum_characters`**: Provides natural conversation flow
3. **Set protection periods**: For important announcements
4. **Test with users**: Find the right balance for your use case

### TTS Provider Selection

1. **Choose based on use case**: Azure for multi-language, ElevenLabs for quality
2. **Test voices**: Try different voices to find the best fit
3. **Consider latency**: ElevenLabs may have higher latency
4. **Monitor costs**: Track TTS usage and costs

### Session Management

1. **Initialize state on session_start**: Set up any required session tracking
2. **Clean up on session_end**: Always clean up resources
3. **Track conversation history**: Log interactions for analytics
4. **Handle disconnections**: Be prepared for unexpected disconnections

### User Input Timeout Handling

1. **Set appropriate timeouts**: Use 5-10 seconds for most prompts, longer for complex questions
2. **Track timeout counts**: Limit retry attempts (typically 2-3 before escalation)
3. **Provide clear prompts**: On timeout, give the user clear instructions
4. **Escalate gracefully**: After multiple timeouts, offer a human transfer or hang up
5. **Reset on success**: Clear the timeout counter when the user successfully responds
6. **Use context-aware timeouts**: Adjust timeout duration based on conversation state

**Example Strategy:**

```javascript
// First timeout: gentle reminder
if (timeoutCount === 1) {
  return { text: "Are you still there?", user_input_timeout_seconds: 8 };
}

// Second timeout: clearer instruction
if (timeoutCount === 2) {
  return { text: "Please speak now or say 'agent' to talk to a person.", user_input_timeout_seconds: 10 };
}

// Third timeout: escalate or hangup, e.g.
// return { type: "transfer", ... } or { type: "hangup", session_id: sessionId };
```

---

## Quick Reference

### HTTP Status Codes

| Code | Meaning        | When to Use                    |
| ---- | -------------- | ------------------------------ |
| 200  | OK             | Returning an action            |
| 204  | No Content     | No action needed               |
| 400  | Bad Request    | Invalid event format           |
| 401  | Unauthorized   | Invalid or missing credentials |
| 500  | Internal Error | Server error                   |

### Event Response Matrix

| Event Type             | Can Return Action? | Common Responses          |
| ---------------------- | ------------------ | ------------------------- |
| session_start          | ✅ Yes             | speak, audio              |
| user_speak             | ✅ Yes             | speak, transfer, hangup   |
| assistant_speak        | ✅ Yes             | Usually none (track only) |
| assistant_speech_ended | ✅ Yes             | Usually none (track only) |
| user_input_timeout     | ✅ Yes             | speak, transfer, hangup   |
| session_end            | ❌ No              | None (cleanup only)       |

### Audio Format Checklist

- [ ] Format: WAV
- [ ] Sample Rate: 16kHz
- [ ] Channels: Mono
- [ ] Bit Depth: 16-bit PCM
- [ ] Encoding: Base64

### Phone Number Format

- ✅ E.164 format: `+1234567890`
- ❌ Formatted: `(123) 456-7890`
- ❌ Dashes: `123-456-7890`

---

This reference document contains all essential information for building applications with the sipgate AI Flow API. For the latest updates and additional information, refer to the official documentation at https://sipgate.github.io/sipgate-ai-flow-api/