--- url: /sipgate-ai-flow-api/api.md --- # API Reference Welcome to the sipgate AI Flow API documentation. This documentation is **language-agnostic** and describes the HTTP and WebSocket protocols that power the AI Flow service. ## Overview sipgate AI Flow is a voice assistant platform that uses an **event-driven architecture**. Your application receives events (like when a user speaks) and responds with actions (like speaking text back to the user). ## Architecture ```mermaid graph TB A[Phone Call] --> B[AI Flow Service] B --> C{Event Type} C -->|session_start| D[Your Webhook/WebSocket] C -->|user_speak| D C -->|assistant_speak| D C -->|session_end| D D --> E[Process Event] E --> F{Response Type} F -->|speak| G[Action: Speak] F -->|audio| H[Action: Audio] F -->|transfer| I[Action: Transfer] F -->|hangup| J[Action: Hangup] G --> B H --> B I --> B J --> B B --> A ``` ## Integration Methods ### HTTP Webhooks Receive events via HTTP POST requests to your webhook endpoint. **Best for:** * Serverless functions (AWS Lambda, Google Cloud Functions) * REST APIs * Simple integrations [Learn more →](/api/http-webhooks) ### WebSocket Maintain a persistent WebSocket connection for real-time event streaming. **Best for:** * Real-time applications * Lower latency requirements * High-volume scenarios [Learn more →](/api/websocket) ## Event-Driven Flow ```mermaid sequenceDiagram participant Phone as Phone Call participant Service as AI Flow Service participant App as Your Application Phone->>Service: Call Starts Service->>App: POST /webhook
{type: "session_start", ...} App->>Service: {type: "speak", text: "Hello!"} Service->>Phone: Plays Audio Phone->>Service: User Speaks Service->>App: POST /webhook
{type: "user_speak", text: "..."} App->>Service: {type: "speak", text: "How can I help?"} Service->>Phone: Plays Audio Phone->>Service: Call Ends Service->>App: POST /webhook
{type: "session_end", ...} ``` ## Core Concepts ### Events Events are JSON objects sent from the AI Flow service to your application: * **session\_start** - When a call begins * **user\_speak** - When the user speaks (includes `barged_in` flag if user interrupted) * **assistant\_speak** - After your assistant speaks * **session\_end** - When the call ends [View all events →](/api/events) ### Actions Actions are JSON objects you send back to the AI Flow service: * **speak** - Speak text or SSML * **audio** - Play pre-recorded audio * **hangup** - End the call * **transfer** - Transfer to another number [View all actions →](/api/actions) ## Quick Example Here's a minimal example using HTTP webhooks: **1. Receive an event:** ```json POST /webhook Content-Type: application/json { "type": "user_speak", "text": "Hello", "session": { "id": "550e8400-e29b-41d4-a716-446655440000", "account_id": "account-123", "phone_number": "1234567890", "from_phone_number": "9876543210", "to_phone_number": "1234567890" } } ``` **2. Respond with an action:** ```json HTTP/1.1 200 OK Content-Type: application/json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "Hello! How can I help you?" } ``` ## Language Support This API works with **any programming language** that can: * Receive HTTP requests (for webhooks) * Receive WebSocket connections (for real-time) * Parse and generate JSON (for events and actions) Examples are provided in multiple languages throughout the documentation. ## Next Steps * **[Quick Start](/api/quick-start)** - Build your first integration * **[HTTP Webhooks](/api/http-webhooks)** - Set up HTTP integration * **[WebSocket](/api/websocket)** - Set up WebSocket integration * **[Event Types](/api/events)** - Complete event reference * **[Action Types](/api/actions)** - Complete action reference ## TypeScript SDK If you're using TypeScript, check out our [TypeScript SDK documentation](/sdk/) for a more convenient wrapper around this API. ## For AI-Assisted Development Using AI coding assistants like **Claude Code**, **ChatGPT**, or **Cursor**? We publish two auto-generated files following the [llms.txt spec](https://llmstxt.org/): * **[`/llms.txt`](/llms.txt)** — short index, auto-discovered by AI tooling. * **[`/llms-full.txt`](/llms-full.txt)** — full documentation corpus in a single file, ideal for pasting into an LLM context. --- --- url: /sipgate-ai-flow-api/api/authentication.md --- # Authentication How to authenticate requests with the AI Flow API. ## API Key Authentication The AI Flow service can authenticate your webhook endpoint using shared secrets. ### Setting Up Shared Secrets When configuring your webhook settings, you can optionally store a shared secret if you want to use AI Flow authentication: 1. In your webhook settings, optionally store a shared secret for AI Flow authentication 2. The service will send this shared secret in request headers for validation 3. Validate the shared secret in your webhook handler to authenticate requests ### Verifying Shared Secrets AI Flow sends the shared secret in the request headers. 
Validate this shared secret in your webhook handler:

### Python (Flask)

```python
from flask import Flask, request, abort
import os

app = Flask(__name__)

# The shared secret you configured in your webhook settings
SHARED_SECRET = os.environ.get('AI_FLOW_SHARED_SECRET')

@app.route('/webhook', methods=['POST'])
def webhook():
    # Verify shared secret sent by AI Flow
    provided_secret = request.headers.get('X-API-TOKEN')
    if provided_secret != SHARED_SECRET:
        abort(401)

    # Process event
    event = request.json
    # ...
```

### Node.js (Express)

```javascript
// The shared secret you configured in your webhook settings
const SHARED_SECRET = process.env.AI_FLOW_SHARED_SECRET;

app.post('/webhook', (req, res) => {
  // Node.js lowercases incoming header names
  const providedSecret = req.headers['x-api-token'];
  if (providedSecret !== SHARED_SECRET) {
    return res.status(401).json({ error: 'Unauthorized' });
  }

  // Process event
  const event = req.body;
  // ...
});
```

### Go

```go
import (
    "net/http"
    "os"
)

// The shared secret you configured in your webhook settings
var sharedSecret = os.Getenv("AI_FLOW_SHARED_SECRET")

func webhook(w http.ResponseWriter, r *http.Request) {
    providedSecret := r.Header.Get("X-API-TOKEN")
    if providedSecret != sharedSecret {
        w.WriteHeader(http.StatusUnauthorized)
        return
    }

    // Process event
    // ...
}
```

## Request Headers

The AI Flow service sends the following headers:

* `X-API-TOKEN` - The shared secret you configured in your webhook settings
* `Content-Type: application/json` - Always JSON
* `User-Agent` - Service identifier

## Response Headers

Your responses should include:

* `Content-Type: application/json` - When returning an action
* An appropriate HTTP status code:
  * `200` - Action returned
  * `204` - No action (No Content)
  * `400` - Invalid request
  * `401` - Unauthorized
  * `500` - Server error

## Security Best Practices

1. **Use HTTPS** - Always use HTTPS in production
2. **Validate Shared Secrets** - Always verify the shared secret sent by AI Flow
3. **Store Secrets Securely** - Use environment variables or secret management
4. **Use Strong Secrets** - Generate cryptographically secure random secrets
5. **Rate Limiting** - Implement rate limiting to prevent abuse
6. **Input Validation** - Validate all incoming events

## Environment Variables

Store shared secrets securely:

### Python

```python
import os
SHARED_SECRET = os.environ.get('AI_FLOW_SHARED_SECRET')
```

### Node.js

```javascript
const SHARED_SECRET = process.env.AI_FLOW_SHARED_SECRET;
```

### Go

```go
import "os"

sharedSecret := os.Getenv("AI_FLOW_SHARED_SECRET")
```

## Next Steps

* **[HTTP Webhooks](/api/http-webhooks)** - Complete HTTP integration guide
* **[Quick Start](/api/quick-start)** - Build your first integration

---

---
url: /sipgate-ai-flow-api/api/quick-start.md
---

# Quick Start

Get up and running with the AI Flow API in minutes, using any programming language.

## Prerequisites

* A webhook endpoint that can receive HTTP POST requests
* Ability to send HTTP responses with JSON
* (Optional) WebSocket support for real-time integration

## Step 1: Set Up Your Webhook Endpoint

Create an HTTP endpoint that receives POST requests.
Here are examples in different languages: ### Python (Flask) ```python from flask import Flask, request, jsonify app = Flask(__name__) @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'user_speak': # Respond with a speak action return jsonify({ 'type': 'speak', 'session_id': event['session']['id'], 'text': f"You said: {event['text']}" }) return '', 204 # No response needed if __name__ == '__main__': app.run(port=3000) ``` ### Node.js (Express) ```javascript const express = require('express'); const app = express(); app.use(express.json()); app.post('/webhook', (req, res) => { const event = req.body; if (event.type === 'user_speak') { return res.json({ type: 'speak', session_id: event.session.id, text: `You said: ${event.text}` }); } res.status(204).send(); }); app.listen(3000, () => { console.log('Webhook server running on port 3000'); }); ``` ### Go ```go package main import ( "encoding/json" "net/http" ) type Event struct { Type string `json:"type"` Text string `json:"text,omitempty"` Session Session `json:"session"` } type Session struct { ID string `json:"id"` } func webhook(w http.ResponseWriter, r *http.Request) { var event Event json.NewDecoder(r.Body).Decode(&event) if event.Type == "user_speak" { action := map[string]interface{}{ "type": "speak", "session_id": event.Session.ID, "text": "You said: " + event.Text, } w.Header().Set("Content-Type", "application/json") json.NewEncoder(w).Encode(action) return } w.WriteHeader(http.StatusNoContent) } func main() { http.HandleFunc("/webhook", webhook) http.ListenAndServe(":3000", nil) } ``` ### Ruby (Sinatra) ```ruby require 'sinatra' require 'json' post '/webhook' do event = JSON.parse(request.body.read) if event['type'] == 'user_speak' return JSON.generate({ type: 'speak', session_id: event['session']['id'], text: "You said: #{event['text']}" }) end status 204 end ``` ## Step 2: Handle Session Start Respond when a call begins: ```python @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'session_start': return jsonify({ 'type': 'speak', 'session_id': event['session']['id'], 'text': 'Welcome! How can I help you today?' }) if event['type'] == 'user_speak': return jsonify({ 'type': 'speak', 'session_id': event['session']['id'], 'text': f"You said: {event['text']}" }) return '', 204 ``` ## Step 3: Expose Your Endpoint Make your endpoint accessible to the AI Flow service: 1. **Local Development**: Use a tunneling service like ngrok: ```bash ngrok http 3000 ``` 2. **Production**: Deploy to a public URL (AWS, Heroku, Railway, etc.) 3. **Configure**: Add your webhook URL in the AI Flow dashboard ## Step 4: Test Your Integration 1. Make a test call to your configured phone number 2. Speak something 3. Check your server logs to see events 4. Verify responses are working ## Complete Example Here's a complete example that handles all event types: ```python from flask import Flask, request, jsonify app = Flask(__name__) @app.route('/webhook', methods=['POST']) def webhook(): event = request.json session_id = event['session']['id'] if event['type'] == 'session_start': return jsonify({ 'type': 'speak', 'session_id': session_id, 'text': 'Hello! How can I help you today?' 
}) elif event['type'] == 'user_speak': user_text = event['text'].lower() if 'goodbye' in user_text or 'bye' in user_text: return jsonify({ 'type': 'hangup', 'session_id': session_id }) return jsonify({ 'type': 'speak', 'session_id': session_id, 'text': f"You said: {event['text']}" }) elif event['type'] == 'session_end': print(f"Session {session_id} ended") return '', 204 return '', 204 if __name__ == '__main__': app.run(port=3000, debug=True) ``` ## Next Steps * **[HTTP Webhooks](/api/http-webhooks)** - Detailed HTTP integration guide * **[WebSocket](/api/websocket)** - Real-time WebSocket integration * **[Event Types](/api/events)** - Complete event reference * **[Action Types](/api/actions)** - Complete action reference --- --- url: /sipgate-ai-flow-api/api/http-webhooks.md --- # HTTP Webhooks Receive events via HTTP POST requests to your webhook endpoint. ## Overview HTTP webhooks are the simplest way to integrate with AI Flow. The service sends events as JSON in HTTP POST requests to your endpoint. ## How It Works ```mermaid sequenceDiagram participant Call as Phone Call participant Service as AI Flow Service participant YourApp as Your Webhook Endpoint Call->>Service: User speaks Service->>YourApp: POST /webhook
JSON event YourApp->>YourApp: Process event YourApp->>Service: HTTP 200
JSON event
    YourApp->>YourApp: Process event
    YourApp->>Service: HTTP 200
JSON action
    Service->>Call: Execute action
```

## Endpoint Requirements

Your webhook endpoint must:

1. **Accept POST requests** at a public URL
2. **Parse JSON** from the request body
3. **Return JSON actions** or `204 No Content`
4. **Respond as quickly as possible**
5. **Use HTTPS** in production

## Request Format

All requests are POST with JSON body:

```http
POST /webhook HTTP/1.1
Host: your-domain.com
Content-Type: application/json
X-API-TOKEN: your-shared-secret (optional)

{
  "type": "user_speak",
  "text": "Hello",
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "account_id": "account-123",
    "phone_number": "1234567890",
    "from_phone_number": "9876543210",
    "to_phone_number": "1234567890"
  }
}
```

## Response Format

### Return an Action

```http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?"
}
```

### No Action Needed

```http
HTTP/1.1 204 No Content
```

## Implementation Examples

### Python (Flask)

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json

    # Handle different event types
    if event['type'] == 'session_start':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': 'Welcome!'
        })

    elif event['type'] == 'user_speak':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': f"You said: {event['text']}"
        })

    # No response needed
    return '', 204

if __name__ == '__main__':
    app.run(port=3000)
```

### Node.js (Express)

```javascript
const express = require('express');
const app = express();

app.use(express.json());

app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'Welcome!'
    });
  }

  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`
    });
  }

  res.status(204).send();
});

app.listen(3000);
```

### Go

```go
package main

import (
    "encoding/json"
    "net/http"
)

type Event struct {
    Type    string  `json:"type"`
    Text    string  `json:"text,omitempty"`
    Session Session `json:"session"`
}

type Session struct {
    ID string `json:"id"`
}

func webhook(w http.ResponseWriter, r *http.Request) {
    var event Event
    json.NewDecoder(r.Body).Decode(&event)

    if event.Type == "user_speak" {
        action := map[string]interface{}{
            "type":       "speak",
            "session_id": event.Session.ID,
            "text":       "You said: " + event.Text,
        }
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(action)
        return
    }

    w.WriteHeader(http.StatusNoContent)
}

func main() {
    http.HandleFunc("/webhook", webhook)
    http.ListenAndServe(":3000", nil)
}
```

### Ruby (Sinatra)

```ruby
require 'sinatra'
require 'json'

post '/webhook' do
  event = JSON.parse(request.body.read)

  if event['type'] == 'user_speak'
    return JSON.generate({
      type: 'speak',
      session_id: event['session']['id'],
      text: "You said: #{event['text']}"
    })
  end

  status 204
end
```

## Error Handling

Handle errors gracefully:

```python
@app.route('/webhook', methods=['POST'])
def webhook():
    try:
        event = request.json

        if not event or 'type' not in event:
            return jsonify({'error': 'Invalid event'}), 400

        # Process event
        # ...

    except Exception as e:
        print(f"Error processing webhook: {e}")
        return jsonify({'error': 'Internal server error'}), 500
```

## Best Practices

1. **Idempotency** - Handle duplicate events gracefully
2. **Async Processing** - Process long-running tasks asynchronously
3. **Logging** - Log all events for debugging
4. **Validation** - Validate event structure
5. **Error Responses** - Return appropriate HTTP status codes

## Testing Locally

Use a tunneling service to expose your local server:

```bash
# Using ngrok
ngrok http 3000

# Using localtunnel
npx localtunnel --port 3000
```

Then configure the tunnel URL in your AI Flow dashboard.

## Production Deployment

Deploy to any platform that supports HTTP:

* **AWS Lambda** - Serverless functions
* **Google Cloud Functions** - Serverless
* **Heroku** - Platform as a service
* **Railway** - Modern deployment
* **Your own server** - VPS, dedicated server

## Next Steps

* **[WebSocket](/api/websocket)** - Real-time WebSocket integration
* **[Event Types](/api/events)** - Complete event reference
* **[Action Types](/api/actions)** - Complete action reference

---

---
url: /sipgate-ai-flow-api/api/websocket.md
---

# WebSocket Integration

Maintain a persistent WebSocket connection for real-time event streaming.

## Overview

WebSocket provides lower latency and real-time bidirectional communication compared to HTTP webhooks. When a phone call starts, the AI Flow Service initiates a WebSocket connection to your application. Your application runs a WebSocket server that accepts these connections.

## How It Works

```mermaid
sequenceDiagram
    participant Call as Phone Call
    participant Service as AI Flow Service
    participant App as Your Application

    Call->>Service: Call Starts
    Service->>App: WebSocket Connection
    App->>Service: Connection Established
    Call->>Service: User speaks
    Service->>App: JSON Event
    App->>App: Process event
    App->>Service: JSON Action
    Service->>Call: Execute action
```

## Connection

The AI Flow Service initiates the WebSocket connection when a phone call starts; your application's job is to run a server that accepts it.

### WebSocket Server

Expose a WebSocket endpoint that the AI Flow Service can reach. The service connects to your configured WebSocket URL when a call begins.

### Connection URL

Configure your WebSocket server URL in the AI Flow dashboard. The service will connect to this URL, for example:

```
wss://your-domain.com/ai-flow/websocket
```

or for local development:

```
ws://localhost:8080/websocket
```

## Message Format

### Receiving Events

Events are sent as JSON strings:

```json
{
  "type": "user_speak",
  "text": "Hello",
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "account_id": "account-123",
    "phone_number": "1234567890"
  }
}
```

### Sending Actions

Send actions as JSON strings:

```json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?"
} ``` ## Implementation Examples ### Python ```python import asyncio import websockets import json async def handle_message(websocket, path): async for message in websocket: event = json.loads(message) if event['type'] == 'user_speak': action = { 'type': 'speak', 'session_id': event['session']['id'], 'text': f"You said: {event['text']}" } await websocket.send(json.dumps(action)) async def main(): async with websockets.serve(handle_message, "localhost", 8765): await asyncio.Future() # run forever asyncio.run(main()) ``` ### Node.js ```javascript const WebSocket = require('ws'); const wss = new WebSocket.Server({ port: 8080 }); wss.on('connection', (ws) => { ws.on('message', (data) => { const event = JSON.parse(data.toString()); if (event.type === 'user_speak') { const action = { type: 'speak', session_id: event.session.id, text: `You said: ${event.text}` }; ws.send(JSON.stringify(action)); } }); ws.on('error', (error) => { console.error('WebSocket error:', error); }); }); ``` ### Go ```go package main import ( "encoding/json" "github.com/gorilla/websocket" "net/http" ) var upgrader = websocket.Upgrader{ CheckOrigin: func(r *http.Request) bool { return true }, } func websocketHandler(w http.ResponseWriter, r *http.Request) { conn, err := upgrader.Upgrade(w, r, nil) if err != nil { return } defer conn.Close() for { var event map[string]interface{} err := conn.ReadJSON(&event) if err != nil { break } if event["type"] == "user_speak" { session := event["session"].(map[string]interface{}) action := map[string]interface{}{ "type": "speak", "session_id": session["id"], "text": "You said: " + event["text"].(string), } conn.WriteJSON(action) } } } func main() { http.HandleFunc("/ws", websocketHandler) http.ListenAndServe(":8080", nil) } ``` ## Connection Management The AI Flow Service manages the WebSocket connection lifecycle. When a call starts, it connects to your server. When the call ends, it may close the connection. ### Handling Connections Your WebSocket server should handle incoming connections from the AI Flow Service: ```javascript const WebSocket = require('ws'); const wss = new WebSocket.Server({ port: 8080 }); wss.on('connection', (ws, req) => { console.log('New connection from AI Flow Service'); ws.on('message', (data) => { const event = JSON.parse(data.toString()); handleEvent(event, ws); }); ws.on('error', (error) => { console.error('WebSocket error:', error); }); ws.on('close', () => { console.log('Connection closed by AI Flow Service'); }); }); ``` ## Heartbeat The AI Flow Service may send ping frames to keep the connection alive. Your server should respond with pong frames: ```javascript wss.on('connection', (ws) => { ws.on('ping', () => { ws.pong(); // Respond to ping with pong }); // ... 
rest of connection handling }); ``` ## Error Handling ```python async def handle_message(websocket, path): try: async for message in websocket: try: event = json.loads(message) action = process_event(event) if action: await websocket.send(json.dumps(action)) except json.JSONDecodeError: print(f"Invalid JSON: {message}") except Exception as e: print(f"Error processing event: {e}") except websockets.exceptions.ConnectionClosed: print("Connection closed") except Exception as e: print(f"WebSocket error: {e}") ``` ## Advantages Over HTTP * **Lower Latency** - No HTTP overhead * **Persistent Connection** - No connection setup per request * **Bidirectional** - Can send messages anytime, from either side * **Real-time** - Instant event delivery * **Proactive Communication** - Send actions without waiting for events; receive events without requests ## When to Use WebSocket WebSockets enable a more flexible communication pattern than HTTP webhooks. Unlike the request/response model of HTTP webhooks, WebSockets allow: * **Proactive event delivery** - The AI Flow Service can send events to your application at any time, not just in response to a request * **Unsolicited actions** - Your application can send actions to the service without waiting for an event first * **True bidirectional communication** - Both sides can initiate communication independently Use WebSocket when: * You need the lowest possible latency * You're handling high-volume traffic * You can run a persistent WebSocket server * You're building a real-time application * You have control over your server infrastructure * You need to send actions proactively without waiting for events Use HTTP webhooks when: * You're using serverless functions (which can't maintain WebSocket connections) * You want simpler deployment * You prefer the request/response model (each event triggers a response) * You're building a simple integration * You can't run a persistent server ## Next Steps * **[HTTP Webhooks](/api/http-webhooks)** - HTTP integration alternative * **[Event Flow](/api/event-flow)** - Understand the event lifecycle * **[Event Types](/api/events)** - Complete event reference --- --- url: /sipgate-ai-flow-api/api/event-flow.md --- # Event Flow Understand the complete lifecycle of events and actions in AI Flow. ## Complete Flow Diagram ```mermaid stateDiagram-v2 [*] --> SessionStart: Call Begins SessionStart --> UserSpeak: Assistant Greets UserSpeak --> AssistantSpeak: User Responds AssistantSpeak --> AssistantSpeechEnded: Speech Completes AssistantSpeechEnded --> UserSpeak: Wait for User AssistantSpeak --> UserBargeIn: User Interrupts UserBargeIn --> UserSpeak: Continue Conversation UserSpeak --> SessionEnd: User Says Goodbye AssistantSpeechEnded --> SessionEnd: Call Ends SessionEnd --> [*] ``` ## Sequence Diagram ```mermaid sequenceDiagram participant Phone as Phone Call participant Service as AI Flow Service participant App as Your Application Note over Phone,App: Call Begins Phone->>Service: Call Initiated Service->>App: Event: session_start App->>Service: Action: speak "Welcome!" Service->>Phone: Plays Audio Note over Phone,App: User Speaks Phone->>Service: User Speech Service->>App: Event: user_speak
{text: "Hello"} App->>Service: Action: speak "Hello! How can I help?" Service->>Phone: Plays Audio Note over Phone,App: Assistant Speaking Service->>App: Event: assistant_speak
{duration_ms: 2000} Service->>App: Event: assistant_speech_ended Note over App: Speech completed Note over Phone,App: User Interrupts Phone->>Service: User Speech (during playback) Service->>App: Event: user_speak with barged_in flag
{text: "Wait"} App->>Service: Action: speak "Yes, I'm listening" Service->>Phone: Plays Audio Note over Phone,App: Call Ends Phone->>Service: Call Terminated Service->>App: Event: session_end Note over App: Cleanup (no action) ``` ## Event Lifecycle ### 1. Session Start ```mermaid graph LR A[Call Begins] --> B[session_start Event] B --> C{Your Response} C -->|speak| D[Greet User] C -->|audio| E[Play Welcome] C -->|null| F[Silent Start] D --> G[Continue] E --> G F --> G ``` **Event:** ```json { "type": "session_start", "session": { "id": "session-123", "phone_number": "1234567890", "from_phone_number": "9876543210", "to_phone_number": "1234567890" } } ``` ### 2. User Speak ```mermaid graph LR A[User Speaks] --> B[Speech-to-Text] B --> C[user_speak Event] C --> D{Your Logic} D -->|speak| E[Respond] D -->|transfer| F[Transfer Call] D -->|hangup| G[End Call] E --> I[Continue] F --> J[Call Transferred] G --> K[Call Ended] ``` **Event:** ```json { "type": "user_speak", "text": "Hello", "session": { "id": "session-123" } } ``` ### 3. Assistant Speak ```mermaid graph LR A[Assistant Speaks] --> B[assistant_speak Event] B --> C{Your Response} C -->|null| D[Track Metrics] C -->|speak| E[Continue Speaking] C -->|audio| F[Play Audio] D --> G[Wait for User] E --> G F --> G ``` **Event:** ```json { "type": "assistant_speak", "text": "Hello! How can I help?", "duration_ms": 2000, "speech_started_at": 1234567890, "session": { "id": "session-123" } } ``` ### 4. Assistant Speech Ended ```mermaid graph LR A[Speech Ends] --> B[assistant_speech_ended Event] B --> C{Your Response} C -->|speak| D[Continue Conversation] C -->|null| E[Track Completion] D --> F[Wait for User] E --> F ``` **Event:** ```json { "type": "assistant_speech_ended", "session": { "id": "session-123" } } ``` ### 5. User Barge In ```mermaid graph LR A[User Interrupts] --> B[user_speak with barged_in flag Event] B --> C{Your Response} C -->|speak| D[Acknowledge] C -->|null| E[Continue] D --> F[Listen] E --> F ``` **Event:** ```json { "type": "user_speak with barged_in flag", "text": "Wait", "session": { "id": "session-123" } } ``` ### 6. Session End ```mermaid graph LR A[Call Ends] --> B[session_end Event] B --> C[Cleanup] C --> D[No Action Allowed] ``` **Event:** ```json { "type": "session_end", "session": { "id": "session-123" } } ``` ## State Management ```mermaid stateDiagram-v2 [*] --> Idle Idle --> Greeting: session_start Greeting --> Listening: speak action Listening --> Processing: user_speak Processing --> Speaking: speak action Speaking --> SpeechEnded: assistant_speech_ended SpeechEnded --> Listening: wait for user Speaking --> Interrupted: user_speak with barged_in flag Interrupted --> Listening: acknowledge Listening --> Ended: hangup action SpeechEnded --> Ended: session_end Speaking --> Ended: session_end Processing --> Ended: session_end Ended --> [*] ``` ## Response Timing ```mermaid gantt title Event Response Timeline dateFormat X axisFormat %L ms section Call Flow session_start event :0, 50 Process & respond :50, 100 speak action :100, 150 Audio playback :150, 2150 assistant_speak event :2150, 2200 user_speak event :2200, 2250 Process & respond :2250, 2300 speak action :2300, 2350 ``` ## Outbound Call Flow For outbound calls initiated via the REST API, the flow is identical to inbound — except that `session.direction` is `"outbound"` and the AI flow dials the recipient first. 
```mermaid sequenceDiagram participant API as REST Client participant Service as AI Flow Service participant Phone as Recipient Phone participant App as Your Application Note over API,App: Initiate Outbound Call API->>Service: POST /v3/ai-flows/:id/call Service-->>API: 201 Created Note over Service,App: Recipient Answers Service->>Phone: Dial toPhoneNumber Phone-->>Service: Answers Service->>App: Event: session_start
{ direction: "outbound" } App->>Service: Action: speak "Hello, this is an automated call..." Service->>Phone: Plays Audio Note over Phone,App: Conversation continues normally Phone->>Service: User Speech Service->>App: Event: user_speak App->>Service: Action: speak / hangup / transfer ``` ::: tip Check `event.session.direction === "outbound"` in your `session_start` handler to customize the opening message for calls your assistant initiated. ::: ::: warning Access Required Outbound calls require explicit access granted by sipgate support. See the [Outbound Calls guide](/api/guides/outbound-calls) for details. ::: ## Best Practices 1. **Respond Quickly** - Keep response times under 1 second 2. **Handle All Events** - Even if you don't need to respond 3. **Clean Up State** - Use `session_end` for cleanup 4. **Track Metrics** - Use `assistant_speak` for analytics 5. **Handle Errors** - Always return valid responses or 204 ## Next Steps * **[Event Types](/api/events)** - Complete event reference * **[Action Types](/api/actions)** - Complete action reference * **[HTTP Webhooks](/api/http-webhooks)** - HTTP integration * **[WebSocket](/api/websocket)** - WebSocket integration --- --- url: /sipgate-ai-flow-api/api/guides/outbound-calls.md --- # Outbound Calls Initiate AI-powered outbound calls programmatically — your assistant dials the recipient and handles the conversation once connected. ::: warning Access Required Outbound calls are **only available upon request** and after a positive review by sipgate support. This restriction exists to prevent fraud and spam. Please contact support to request access before using this feature. ::: ## How It Works ``` POST /v3/ai-flows/:aiFlowId/call → AI flow dials toPhoneNumber → Recipient answers → Session is created with direction: "outbound" → Webhook fires to your AI flow's webhook URL → Normal event/action flow begins ``` The call lifecycle after connection is identical to an inbound call — the same events (`session_start`, `user_speak`, etc.) and actions (`speak`, `transfer`, `hangup`, etc.) apply. ## Prerequisites * Access granted by sipgate support (see warning above) * AI flow must have a `phone_number` configured (used as caller ID) * Target phone number in E.164 format without leading + (e.g. `4915790000687`) ## Initiating a Call ### Endpoint ```http POST /v3/ai-flows/:aiFlowId/call Authorization: Bearer Content-Type: application/json ``` ### Request Body ```json { "billingDevice": "e2", "toPhoneNumber": "4915790000687" } ``` | Field | Type | Description | |-----------------|--------|-------------------------------------| | `billingDevice` | string | Billing device suffix (e.g. `"e2"`) | | `toPhoneNumber` | string | Target phone number in E.164 format without leading + | ### Response The endpoint returns `201 Created` when the call has been successfully initiated. 
### Error Responses

| Status | Reason                                   |
|--------|------------------------------------------|
| `400`  | AI flow has no `phone_number` configured |
| `404`  | AI flow not found                        |

## Session Direction

When an outbound call connects, the `session_start` event's session object will contain `"direction": "outbound"`:

```json
{
  "type": "session_start",
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "phone_number": "4915790000687",
    "direction": "outbound",
    "from_phone_number": "4921155660",
    "to_phone_number": "4915790000687"
  }
}
```

Use the `direction` field to tailor your greeting — for outbound calls your assistant initiated the contact, so the opening message should reflect that context.

## Example

```http
POST /v3/ai-flows/:aiFlowId/call
Authorization: Bearer <token>
Content-Type: application/json

{
  "billingDevice": "e2",
  "toPhoneNumber": "4915790000687"
}
```

::: tip TypeScript SDK
Using the `@sipgate/ai-flow-sdk`? See the **[Outbound Calls SDK guide](/sdk/outbound-calls)** for `assistant.call()` with full examples.
:::

## Next Steps

* **[Event Types](/api/events)** — complete event reference
* **[Event Flow](/api/event-flow)** — full call lifecycle
* **[Action Types](/api/actions)** — how to respond to events

---

---
url: /sipgate-ai-flow-api/api/guides/phone-number-routing.md
---

# Phone Number Routing: Multiple Assistants, One Webhook

When you have multiple phone numbers - each for a different purpose - you don't need separate webhook endpoints. Route them all to a single endpoint and dispatch based on the called number.

## The Pattern

Every sipgate AI Flow event includes the phone number in the session object:

```json
{
  "type": "session_start",
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "to_phone_number": "4921234567890",
    "from_phone_number": "4915112345678",
    "direction": "inbound"
  }
}
```

Use `to_phone_number` to determine which assistant or behavior to use.

## Basic Routing

```typescript
const ASSISTANTS = {
  '4921234567890': {
    name: 'Sales',
    greeting: 'Hi! Thanks for calling our sales team. How can I help?',
    systemPrompt: 'You are a helpful sales assistant...',
  },
  '4921234567891': {
    name: 'Support',
    greeting: 'Hello! You\'ve reached customer support. What can I help you with?',
    systemPrompt: 'You are a friendly support agent...',
  },
  '4921234567892': {
    name: 'Booking',
    greeting: 'Welcome! I can help you book an appointment. When would you like to come in?',
    systemPrompt: 'You are an appointment booking assistant...',
  },
}

export async function POST(req: Request) {
  const event = await req.json()
  const phoneNumber = event.session.to_phone_number

  // Get assistant config for this number
  const assistant = ASSISTANTS[phoneNumber]

  if (!assistant) {
    // Unknown number - use fallback
    return speak(event.session.id, "Sorry, this number is not configured.")
  }

  switch (event.type) {
    case 'session_start':
      return speak(event.session.id, assistant.greeting)
    case 'user_speak':
      return handleUserSpeak(event, assistant)
    // other events
  }
}
```

## Database-Driven Routing

For dynamic configuration, store the mapping in a database:

```typescript
// Database schema
// phone_numbers: id, phone_number, assistant_id
// assistants: id, name, greeting, system_prompt, voice_provider, voice_id

async function getAssistantForNumber(phoneNumber: string) {
  const { data } = await supabase
    .from('phone_numbers')
    .select(`
      phone_number,
      assistants (
        id,
        name,
        greeting,
        system_prompt,
        voice_provider,
        voice_id
      )
    `)
    .eq('phone_number', phoneNumber)
    .single()

  return data?.assistants
}

export async function POST(req: Request) {
  const event = await req.json()
  const phoneNumber = event.session.to_phone_number

  const assistant = await getAssistantForNumber(phoneNumber)

  if (!assistant) {
    return speak(event.session.id, "This number is not currently in service.")
  }

  // Route to appropriate handler
  return handleEvent(event, assistant)
}
```

## Normalizing Phone Numbers

Phone numbers can arrive in different formats. Normalize before lookup:

```typescript
function normalizePhoneNumber(phone: string): string {
  // Remove spaces, dashes, parentheses
  let normalized = phone.replace(/[\s\-\(\)]/g, '')

  // Remove leading + if present (E.164 without leading +)
  if (normalized.startsWith('+')) {
    normalized = normalized.slice(1)
  }

  return normalized
}

async function getAssistantForNumber(phoneNumber: string) {
  const normalized = normalizePhoneNumber(phoneNumber)

  const { data } = await supabase
    .from('phone_numbers')
    .select('*, assistants(*)')
    .eq('phone_number', normalized)
    .single()

  return data?.assistants
}
```

## Multi-Language Routing

Route to different languages based on phone number:

```typescript
const LANGUAGE_NUMBERS = {
  '4921234567890': { language: 'de-DE', voice: 'de-DE-KatjaNeural' },
  '4421234567890': { language: 'en-GB', voice: 'en-GB-SoniaNeural' },
  '3321234567890': { language: 'fr-FR', voice: 'fr-FR-DeniseNeural' },
}

function getLanguageConfig(phoneNumber: string) {
  return LANGUAGE_NUMBERS[phoneNumber] || {
    language: 'en-US',
    voice: 'en-US-JennyNeural',
  }
}

export async function POST(req: Request) {
  const event = await req.json()
  const langConfig = getLanguageConfig(event.session.to_phone_number)

  // Use language-specific TTS
  return Response.json({
    type: 'speak',
    session_id: event.session.id,
    text: getGreeting(langConfig.language),
    tts: {
      provider: 'azure',
      language: langConfig.language,
      voice: langConfig.voice,
    },
  })
}
```

## Routing by Caller Number

You can also route based on who's calling (`from_phone_number`):

```typescript
async function handleSessionStart(event: SessionStartEvent) {
  const callerNumber = event.session.from_phone_number

  // Check if this is a known VIP customer
  const customer = await getCustomerByPhone(callerNumber)

  if (customer?.is_vip) {
    return speak("Welcome back! I see you're a VIP member. How can I assist you today?")
  }

  // Check if this is a repeat caller
  const previousCalls = await getRecentCalls(callerNumber)

  if (previousCalls.length > 0) {
    const lastTopic = previousCalls[0].topic
    return speak(`Hello again! Are you calling about ${lastTopic}, or something new?`)
  }

  // First-time caller
  return speak("Welcome! How can I help you today?")
How can I help you today?") } ``` ## Fallback Handling Always handle unknown numbers gracefully: ```typescript async function getAssistantForNumber(phoneNumber: string) { const normalized = normalizePhoneNumber(phoneNumber) const { data } = await supabase .from('phone_numbers') .select('*, assistants(*)') .eq('phone_number', normalized) .single() if (!data?.assistants) { // Log for debugging console.warn(`No assistant configured for: ${normalized}`) // Return a default fallback assistant return { id: 'fallback', name: 'Fallback', greeting: "I'm sorry, but this number is not currently configured. Please try again later.", system_prompt: 'Politely explain that the service is unavailable.', voice_provider: 'azure', voice_id: 'en-US-JennyNeural', } } return data.assistants } ``` ## Caching for Performance If you're looking up the same numbers repeatedly, cache the results: ```typescript const assistantCache = new Map() const CACHE_TTL_MS = 60000 // 1 minute async function getAssistantForNumber(phoneNumber: string): Promise { const normalized = normalizePhoneNumber(phoneNumber) // Check cache const cached = assistantCache.get(normalized) if (cached && Date.now() - cached.cachedAt < CACHE_TTL_MS) { return cached.assistant } // Fetch from database const { data } = await supabase .from('phone_numbers') .select('*, assistants(*)') .eq('phone_number', normalized) .single() const assistant = data?.assistants || getFallbackAssistant() // Cache result assistantCache.set(normalized, { assistant, cachedAt: Date.now() }) return assistant } ``` ## Complete Example ```typescript // types.ts interface Assistant { id: string name: string greeting: string system_prompt: string voice_provider: 'azure' | 'eleven_labs' voice_id: string language: string } // routing.ts const assistantCache = new Map() function normalizePhoneNumber(phone: string): string { let normalized = phone.replace(/[\s\-\(\)]/g, '') if (normalized.startsWith('+')) normalized = normalized.slice(1) return normalized } async function getAssistant(phoneNumber: string): Promise { const normalized = normalizePhoneNumber(phoneNumber) // Check cache if (assistantCache.has(normalized)) { return assistantCache.get(normalized)! } // Fetch from database const { data } = await db .from('phone_numbers') .select('*, assistants(*)') .eq('phone_number', normalized) .single() const assistant = data?.assistants || { id: 'fallback', name: 'Fallback', greeting: 'This number is not configured.', system_prompt: 'Explain the service is unavailable.', voice_provider: 'azure', voice_id: 'en-US-JennyNeural', language: 'en-US', } assistantCache.set(normalized, assistant) return assistant } // webhook.ts export async function POST(req: Request): Promise { const event = await req.json() const sessionId = event.session.id // Route to assistant based on called number const assistant = await getAssistant(event.session.to_phone_number) switch (event.type) { case 'session_start': console.log(`Call to ${assistant.name} assistant`) return speak(sessionId, assistant.greeting, assistant) case 'user_speak': const response = await generateLLMResponse(event.text, assistant) return speak(sessionId, response, assistant) case 'session_end': return new Response(null, { status: 204 }) default: return new Response(null, { status: 204 }) } } function speak(sessionId: string, text: string, assistant: Assistant): Response { const ttsConfig = assistant.voice_provider === 'azure' ? 
{ provider: 'azure' as const, language: assistant.language, voice: assistant.voice_id, } : { provider: 'eleven_labs' as const, voice: assistant.voice_id, } return Response.json({ type: 'speak', session_id: sessionId, text, tts: ttsConfig, }) } ``` ## Best Practices 1. **Normalize phone numbers** - Handle different formats (49, 0049, +49, etc.) and strip any leading + 2. **Always have a fallback** - Unknown numbers should get a polite message, not an error 3. **Cache lookups** - Database queries on every event add latency 4. **Log unknown numbers** - Helps you spot misconfiguration 5. **Use `to_phone_number`** - That's the number they dialed (your number) 6. **Consider `from_phone_number`** - For personalization based on caller ## Related Documentation * **[Session Start Event](/api/events/session-start)** - Event structure with phone numbers * **[HTTP Webhooks](/api/http-webhooks)** - Webhook endpoint setup --- --- url: /sipgate-ai-flow-api/api/guides/streaming-llm-responses.md --- # Streaming LLM Responses: Sentence-by-Sentence Best Practices When integrating Large Language Models (LLMs) like OpenAI's GPT, Anthropic's Claude, or similar services with sipgate AI Flow, how you stream responses significantly impacts the naturalness of synthesized speech. This guide shows you how to achieve smooth, natural-sounding voice output by sending complete sentences rather than individual tokens. ## The Problem: Token-by-Token Streaming LLMs stream responses token-by-token (small text fragments). Sending each token directly to the TTS provider creates choppy, unnatural speech: ```typescript // ❌ BAD: Sends every token immediately for await (const chunk of llmStream) { await sendAction({ type: 'speak', session_id: sessionId, text: chunk.content, // Individual tokens: "Hello", ", ", "how", " ", "can", " ", "I"... tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' }, }) } ``` **Why this sounds bad:** * Each TTS call treats the text as a complete utterance with sentence-ending prosody (falling intonation, longer pauses) * Results in robotic, choppy speech: "Hello↘️ \[pause] how↘️ \[pause] can↘️ \[pause] I↘️ \[pause] help↘️ \[pause]" * TTS providers optimize for complete sentences, not fragments ## The Solution: Sentence Segmentation **✅ Best Practice:** Buffer LLM tokens and send complete sentences to the TTS provider. ```mermaid sequenceDiagram participant LLM as OpenAI/Claude participant App as Your Application participant Segmenter as Intl.Segmenter participant Flow as AI Flow LLM->>App: Token: "Hello" LLM->>App: Token: ", how" LLM->>App: Token: " can I" LLM->>App: Token: " help?" App->>Segmenter: Buffer: "Hello, how can I help?" Segmenter->>App: Sentence detected! App->>Flow: speak: "Hello, how can I help?" Flow->>Flow: Synthesize complete sentence LLM->>App: Token: " I'm" LLM->>App: Token: " here" Note right of App: Continue buffering... ``` **Benefits:** * Natural prosody and intonation * Appropriate pauses between sentences * Better voice quality from TTS providers * Maintains low latency (sentences typically complete within 1-2 seconds) ## Prompting LLMs for Voice Output **Critical:** Instruct your LLM to avoid abbreviations that interfere with speech synthesis and sentence detection. ### The Problem with Abbreviations Abbreviations like "Dr.", "bzw.", "z.B.", "etc." cause two issues: 1. **Incorrect sentence segmentation** - `Intl.Segmenter` detects periods as sentence boundaries: ```typescript // "Dr. Smith will help you." // Incorrectly splits into: // Sentence 1: "Dr." 
// Sentence 2: "Smith will help you." ``` 2. **Poor TTS pronunciation** - Text-to-speech may mispronounce abbreviations: * "Dr." → "D R" or "Doctor point" instead of "Doctor" * "bzw." → "B Z W" instead of "beziehungsweise" * "z.B." → "Z B" instead of "zum Beispiel" ### System Prompt Guidelines Add these instructions to your LLM system prompt: ```typescript const systemPrompt = `You are a voice assistant. Follow these rules strictly: VOICE OUTPUT RULES: - Write out all abbreviations fully (e.g., "Doctor" not "Dr.", "for example" not "e.g.") - Use complete words instead of shortened forms - Avoid punctuation-based abbreviations that end with periods - Use natural, spoken language as if talking to someone in person Examples: ❌ "Dr. Smith can help you with that." ✅ "Doctor Smith can help you with that." ❌ "You can use method A, B, or C, e.g., the first one." ✅ "You can use method A, B, or C, for example the first one." ❌ "This is available Mon.-Fri." ✅ "This is available Monday through Friday." Your responses will be converted to speech, so write exactly how you would say it out loud.` ``` ### Language-Specific Examples **English:** ```typescript const englishVoiceRules = ` - "Dr." → "Doctor" - "Mr." → "Mister" - "Mrs." → "Missus" - "e.g." → "for example" - "i.e." → "that is" - "etc." → "and so on" or "etcetera" - "vs." → "versus" - "approx." → "approximately" ` ``` **German:** ```typescript const germanVoiceRules = ` - "Dr." → "Doktor" - "bzw." → "beziehungsweise" - "z.B." → "zum Beispiel" - "usw." → "und so weiter" - "ca." → "circa" - "etc." → "et cetera" or "und so weiter" - "inkl." → "inklusive" - "ggf." → "gegebenenfalls" - "evtl." → "eventuell" ` ``` ### Complete OpenAI Example ```typescript const stream = await openai.chat.completions.create({ model: 'gpt-4', messages: [ { role: 'system', content: `You are a voice assistant for customer service. CRITICAL VOICE RULES: - Never use abbreviations with periods (Dr., e.g., etc.) - Write everything as you would speak it out loud - Use complete words: "Doctor" not "Dr.", "for example" not "e.g." - Your responses will be synthesized to speech Be helpful, concise, and conversational.` }, { role: 'user', content: userMessage } ], stream: true, }) ``` ### Complete Anthropic Example ```typescript const stream = await anthropic.messages.create({ model: 'claude-3-5-sonnet-20241022', max_tokens: 1024, system: `You are a voice assistant. Your responses will be converted to speech. VOICE OUTPUT REQUIREMENTS: - Write all abbreviations in full (Doctor, not Dr.) - Avoid period-based abbreviations (e.g., i.e., etc.) - Use natural spoken language - Write numbers as words when it sounds more natural Examples: Wrong: "Dr. Schmidt can help you, e.g., with billing." Right: "Doctor Schmidt can help you, for example with billing." Wrong: "Available Mon.-Fri., 9 a.m.-5 p.m." Right: "Available Monday through Friday, 9 AM to 5 PM."`, messages: [ { role: 'user', content: userMessage } ], stream: true, }) ``` ### Testing Your Prompt Verify your LLM follows voice rules by testing with edge cases: ```typescript const testCases = [ "Tell me about Dr. Smith", "What are the benefits, e.g., cost savings?", "This applies to companies like IBM, Microsoft, etc.", "Available Mon.-Fri.", ] // Expected responses should have NO abbreviations with periods ``` ::: warning Common Mistake Don't rely on post-processing to fix abbreviations. LLMs are excellent at following voice guidelines when properly instructed. Post-processing is fragile and language-dependent. 
:::

## Using JavaScript's Built-in Sentence Segmenter

JavaScript provides `Intl.Segmenter` - a native API for text segmentation, including sentence detection. It's available in Node.js ≥16.

### Basic Example

```typescript
// Create a sentence segmenter (do this once, reuse for performance)
const sentenceSegmenter = new Intl.Segmenter('en', { granularity: 'sentence' })

function* extractSentences(text: string): Generator<string> {
  const segments = sentenceSegmenter.segment(text)
  for (const segment of segments) {
    yield segment.segment.trim()
  }
}

// Usage
const text = "Hello, how can I help? I'm here to assist you today."
for (const sentence of extractSentences(text)) {
  console.log(sentence)
  // Output:
  // "Hello, how can I help?"
  // "I'm here to assist you today."
}
```

### Streaming with OpenAI

```typescript
import OpenAI from 'openai'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const segmenter = new Intl.Segmenter('en', { granularity: 'sentence' })

async function streamOpenAIResponse(
  sessionId: string,
  userMessage: string,
  sendAction: (action: any) => Promise<void>
) {
  let buffer = ''
  let lastSentenceEnd = 0

  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  })

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content
    if (!content) continue

    // Add token to buffer
    buffer += content

    // Check for complete sentences
    const segments = Array.from(segmenter.segment(buffer))

    // Find complete sentences (all but possibly the last incomplete one)
    for (let i = lastSentenceEnd; i < segments.length - 1; i++) {
      const sentence = segments[i].segment.trim()
      if (sentence) {
        // Send complete sentence to TTS
        await sendAction({
          type: 'speak',
          session_id: sessionId,
          text: sentence,
          tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
        })
      }
      lastSentenceEnd = i + 1
    }
  }

  // Send any remaining text as final sentence
  const remainingSegments = Array.from(segmenter.segment(buffer))
  for (let i = lastSentenceEnd; i < remainingSegments.length; i++) {
    const sentence = remainingSegments[i].segment.trim()
    if (sentence) {
      await sendAction({
        type: 'speak',
        session_id: sessionId,
        text: sentence,
        tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
      })
    }
  }
}
```

### Streaming with Anthropic Claude

```typescript
import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })
const segmenter = new Intl.Segmenter('en', { granularity: 'sentence' })

async function streamClaudeResponse(
  sessionId: string,
  userMessage: string,
  sendAction: (action: any) => Promise<void>
) {
  let buffer = ''
  let lastSentenceEnd = 0

  const stream = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  })

  for await (const event of stream) {
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      const content = event.delta.text

      // Add token to buffer
      buffer += content

      // Check for complete sentences
      const segments = Array.from(segmenter.segment(buffer))

      // Send complete sentences
      for (let i = lastSentenceEnd; i < segments.length - 1; i++) {
        const sentence = segments[i].segment.trim()
        if (sentence) {
          await sendAction({
            type: 'speak',
            session_id: sessionId,
            text: sentence,
            tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
          })
        }
        lastSentenceEnd = i + 1
      }
    }
  }

  // Send remaining text
  const remainingSegments = Array.from(segmenter.segment(buffer))
  for (let i = lastSentenceEnd; i < remainingSegments.length; i++) {
    const sentence = remainingSegments[i].segment.trim()
    if (sentence) {
      await sendAction({
        type: 'speak',
        session_id: sessionId,
        text: sentence,
        tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
      })
    }
  }
}
```

## Reusable Helper Class

For production use, extract this logic into a reusable helper:

```typescript
export class SentenceStreamBuffer {
  private buffer = ''
  private lastSentenceEnd = 0
  private segmenter: Intl.Segmenter

  constructor(locale: string = 'en') {
    this.segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' })
  }

  /**
   * Add a token/chunk to the buffer and return any complete sentences.
   * @returns Array of complete sentences ready to be sent to TTS
   */
  push(chunk: string): string[] {
    this.buffer += chunk

    const segments = Array.from(this.segmenter.segment(this.buffer))
    const completeSentences: string[] = []

    // Extract complete sentences (all but possibly the last incomplete one)
    for (let i = this.lastSentenceEnd; i < segments.length - 1; i++) {
      const sentence = segments[i].segment.trim()
      if (sentence) {
        completeSentences.push(sentence)
      }
      this.lastSentenceEnd = i + 1
    }

    return completeSentences
  }

  /**
   * Flush remaining buffer as final sentence(s).
   * Call this when the stream ends.
   */
  flush(): string[] {
    const segments = Array.from(this.segmenter.segment(this.buffer))
    const remainingSentences: string[] = []

    for (let i = this.lastSentenceEnd; i < segments.length; i++) {
      const sentence = segments[i].segment.trim()
      if (sentence) {
        remainingSentences.push(sentence)
      }
    }

    // Reset state
    this.buffer = ''
    this.lastSentenceEnd = 0

    return remainingSentences
  }

  /**
   * Reset the buffer (useful for error handling or conversation resets)
   */
  reset(): void {
    this.buffer = ''
    this.lastSentenceEnd = 0
  }
}
```

### Using the Helper

```typescript
async function streamLLMToVoice(
  sessionId: string,
  llmStream: AsyncIterable<string>,
  sendAction: (action: any) => Promise<void>,
  locale: string = 'en'
) {
  const buffer = new SentenceStreamBuffer(locale)

  try {
    // Process streaming tokens
    for await (const token of llmStream) {
      const sentences = buffer.push(token)

      // Send each complete sentence to TTS
      for (const sentence of sentences) {
        await sendAction({
          type: 'speak',
          session_id: sessionId,
          text: sentence,
          tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
        })
      }
    }

    // Send any remaining text when stream completes
    const finalSentences = buffer.flush()
    for (const sentence of finalSentences) {
      await sendAction({
        type: 'speak',
        session_id: sessionId,
        text: sentence,
        tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
      })
    }
  } catch (error) {
    buffer.reset() // Clean up on error
    throw error
  }
}
```

## Multi-Language Support

`Intl.Segmenter` supports multiple languages out of the box:

```typescript
// German
const germanSegmenter = new Intl.Segmenter('de', { granularity: 'sentence' })

// Spanish
const spanishSegmenter = new Intl.Segmenter('es', { granularity: 'sentence' })

// French
const frenchSegmenter = new Intl.Segmenter('fr', { granularity: 'sentence' })

// Reusable buffer with language detection
function createStreamBuffer(languageCode: string): SentenceStreamBuffer {
  return new SentenceStreamBuffer(languageCode)
}
```

## Handling Edge Cases

### Short Responses

For very short responses (single sentence or fragment), the buffer approach still works:

```typescript
// LLM response: "Hello!"
buffer.push("Hello!") // Returns: [] buffer.flush() // Returns: ["Hello!"] ``` ### Incomplete Sentences During Interruption If the user interrupts (barge-in), you may have incomplete sentences in the buffer: ```typescript // Handle barge-in event function handleBargeIn(sessionId: string) { const buffer = sessionBuffers.get(sessionId) if (buffer) { // Option 1: Discard incomplete sentence buffer.reset() // Option 2: Send incomplete sentence as-is (for context) const remaining = buffer.flush() // Log or store for context but don't send to TTS } } ``` ### Very Long Sentences Sometimes LLMs generate very long sentences. Consider adding a character limit: ```typescript class SentenceStreamBuffer { private maxSentenceLength = 500 // characters push(chunk: string): string[] { this.buffer += chunk // Force break on very long buffers if (this.buffer.length > this.maxSentenceLength && this.buffer.includes(' ')) { const lastSpace = this.buffer.lastIndexOf(' ', this.maxSentenceLength) const forcedSentence = this.buffer.substring(0, lastSpace).trim() this.buffer = this.buffer.substring(lastSpace).trim() this.lastSentenceEnd = 0 return [forcedSentence] } // Normal sentence detection... const segments = Array.from(this.segmenter.segment(this.buffer)) const completeSentences: string[] = [] for (let i = this.lastSentenceEnd; i < segments.length - 1; i++) { const sentence = segments[i].segment.trim() if (sentence) { completeSentences.push(sentence) } this.lastSentenceEnd = i + 1 } return completeSentences } // ... rest of class } ``` ## Performance Considerations ### Buffer Management For production deployments with many concurrent sessions: ```typescript // Store buffers per session const sessionBuffers = new Map() function getOrCreateBuffer(sessionId: string, locale: string): SentenceStreamBuffer { if (!sessionBuffers.has(sessionId)) { sessionBuffers.set(sessionId, new SentenceStreamBuffer(locale)) } return sessionBuffers.get(sessionId)! 
} // Clean up on session end function handleSessionEnd(sessionId: string) { sessionBuffers.delete(sessionId) } ``` ### Timeout Protection Add timeout to prevent indefinitely buffered text: ```typescript class SentenceStreamBufferWithTimeout extends SentenceStreamBuffer { private lastPushTime = Date.now() private timeout = 5000 // 5 seconds push(chunk: string): string[] { this.lastPushTime = Date.now() return super.push(chunk) } hasTimedOut(): boolean { return Date.now() - this.lastPushTime > this.timeout } flushIfTimedOut(): string[] { if (this.hasTimedOut()) { return this.flush() } return [] } } ``` ## Complete Example: Express.js Integration ```typescript import express from 'express' import OpenAI from 'openai' const app = express() const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY }) const sessionBuffers = new Map() app.use(express.json()) app.post('/webhook', async (req, res) => { const event = req.body switch (event.type) { case 'user_speak': // Don't await - respond immediately to avoid timeout handleUserSpeak(event).catch(console.error) return res.status(204).send() case 'session_end': sessionBuffers.delete(event.session.id) return res.status(204).send() default: return res.status(204).send() } }) async function handleUserSpeak(event: any) { const sessionId = event.session.id const userText = event.text // Get or create buffer for this session const buffer = sessionBuffers.get(sessionId) || new SentenceStreamBuffer('en') sessionBuffers.set(sessionId, buffer) // Stream LLM response const stream = await openai.chat.completions.create({ model: 'gpt-4', messages: [{ role: 'user', content: userText }], stream: true, }) for await (const chunk of stream) { const content = chunk.choices[0]?.delta?.content if (!content) continue const sentences = buffer.push(content) for (const sentence of sentences) { await sendAction({ type: 'speak', session_id: sessionId, text: sentence, tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' }, }) } } // Flush remaining text const finalSentences = buffer.flush() for (const sentence of finalSentences) { await sendAction({ type: 'speak', session_id: sessionId, text: sentence, tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' }, }) } } async function sendAction(action: any) { // Send action via WebSocket or HTTP to sipgate AI Flow // Implementation depends on your integration method await fetch('https://your-aiflow-endpoint/actions', { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify(action), }) } app.listen(3000, () => console.log('Server running on port 3000')) ``` ## Fallback for Older Node.js Versions If you're using Node.js <16, you can use a simple regex-based fallback: ```typescript // Simple fallback for environments without Intl.Segmenter function splitSentencesSimple(text: string): string[] { // Basic sentence splitting (not as robust as Intl.Segmenter) // Matches sentence endings followed by whitespace return text .split(/(?<=[.!?])\s+/) .map(s => s.trim()) .filter(s => s.length > 0) } // Use in SentenceStreamBuffer as fallback class SentenceStreamBufferLegacy { private buffer = '' push(chunk: string): string[] { this.buffer += chunk const sentences = splitSentencesSimple(this.buffer) if (sentences.length > 1) { // Keep last sentence in buffer (might be incomplete) const complete = sentences.slice(0, -1) this.buffer = sentences[sentences.length - 1] return complete } return [] } flush(): string[] { const sentence = this.buffer.trim() this.buffer = '' return 
sentence ? [sentence] : [] } } ``` ::: warning Regex Limitations The regex fallback is less robust than `Intl.Segmenter` and may incorrectly split on abbreviations (Dr., e.g., etc.). If using the fallback, it's even more critical to follow the [LLM prompting guidelines](#prompting-llms-for-voice-output) to avoid abbreviations. ::: ## Best Practices Summary 1. **Prompt LLMs to avoid abbreviations** - Instruct your LLM to write out "Doctor" not "Dr.", "for example" not "e.g." to prevent incorrect segmentation and poor pronunciation 2. **Always segment sentences** - Never send individual tokens to TTS, always buffer and send complete sentences 3. **Use `Intl.Segmenter`** - Native, robust, multi-language support (Node.js ≥16) 4. **Buffer per session** - Keep separate buffers for concurrent conversations 5. **Clean up on session end** - Delete buffers to prevent memory leaks 6. **Handle timeouts** - Flush buffer if no new tokens arrive within 5 seconds 7. **Support multiple languages** - Pass correct locale to `Intl.Segmenter` 8. **Handle barge-in** - Reset or discard incomplete sentences on interruption 9. **Limit sentence length** - Force breaks for very long sentences (500+ characters) ::: tip Token Accumulation Speed In practice, sentences complete quickly (typically 1-2 seconds with modern LLMs). Users won't notice the buffering delay, but they will notice the dramatic improvement in speech quality. ::: ## Related Documentation * **[Speak Action](/api/actions/speak)** - Complete reference for the speak action * **[TTS Providers](/api/tts-providers)** - Azure and ElevenLabs configuration * **[Barge-In Best Practices](/api/guides/barge-in-best-practices)** - Handling interruptions during speech * **[Async Hold Pattern](/api/guides/async-hold-pattern)** - Managing long-running LLM requests --- --- url: /sipgate-ai-flow-api/api/guides/barge-in-best-practices.md --- # Barge-In Best Practices: Handling User Interruptions Gracefully When users interrupt your voice AI assistant mid-sentence, how you respond makes the difference between a frustrating experience and a natural conversation. This guide covers best practices for handling barge-in interruptions using the `barged_in` flag in `user_speak` events. ## Why Users Interrupt In natural human conversations, interruptions happen constantly: * **"Got it!"** - They understood and don't need the rest * **"Wait, actually..."** - They want to change direction * **"No, that's not what I meant"** - Correcting a misunderstanding * **"Yes yes, I know"** - Impatient, want to move on * **"Hold on"** - Something came up A voice assistant that ignores interruptions or handles them poorly feels robotic. Done well, barge-in handling makes your assistant feel responsive and human-like. ## How Barge-In Detection Works When a user speaks while the assistant is talking, sipgate AI Flow: 1. **Stops the assistant's speech** immediately 2. **Sends a `user_speak` event** with `barged_in: true` and what the user said 3. **Waits for your response** (action or 204 No Content) ```mermaid sequenceDiagram participant User participant Flow as AI Flow participant App as Your Application Flow->>User: Speaking: "Let me explain how..." User->>Flow: Interrupts: "Got it, thanks!" Flow->>Flow: Stops playback Flow->>App: Event: user_speak {text: "Got it, thanks!", barged_in: true} App->>Flow: Action: speak "Great! What else can I help with?" Flow->>User: "Great! What else can I help with?" 
``` ## Basic Handling Check the `barged_in` flag to detect interruptions and respond appropriately: ```typescript async function handleUserSpeak(event: { type: 'user_speak' text: string barged_in?: boolean session: { id: string } }) { if (event.barged_in) { // User interrupted - acknowledge quickly return { type: 'speak', session_id: event.session.id, text: "Of course. What would you like to know?", tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' }, } } // Normal speech processing return processUserInput(event) } ``` ## Respond to What They Said The `text` field contains what the user said when interrupting. Use it to respond appropriately: ```typescript function handleUserSpeak(event: { type: 'user_speak', text: string, barged_in?: boolean }) { if (!event.barged_in) { // Normal processing for non-interruptions return processNormalSpeech(event) } const text = event.text.toLowerCase() // User understood - move on if (text.includes('got it') || text.includes('understood') || text.includes('okay')) { return speak("Great! What else can I help you with?") } // User wants to change direction if (text.includes('actually') || text.includes('wait') || text.includes('no')) { return speak("Of course. What would you like instead?") } // User is correcting something if (text.includes('not what i') || text.includes('i meant')) { return speak("I apologize for the confusion. Please tell me more.") } // User has a new question - process it directly if (text.length > 25 || text.includes('?')) { return processAsNewQuestion(event.session.id, event.text) } // Default acknowledgment return speak("I'm listening.") } ``` ## Natural Acknowledgment Phrases Vary your responses to avoid sounding robotic: ```typescript const ACKNOWLEDGMENTS = { understood: [ "Great! What else can I help with?", "Perfect. Anything else?", "Alright! What's next?", ], redirect: [ "Of course. What would you like instead?", "Sure thing. Go ahead.", "No problem. What did you have in mind?", ], listening: [ "I'm listening.", "Go ahead.", "Yes?", ], } // German equivalents const ACKNOWLEDGMENTS_DE = { understood: [ "Sehr gut! Kann ich sonst noch helfen?", "Alles klar. Was noch?", "Prima! Was möchten Sie noch wissen?", ], redirect: [ "Natürlich. Was kann ich für Sie tun?", "Kein Problem. Was hätten Sie gerne?", "Selbstverständlich. Bitte?", ], listening: [ "Ich höre.", "Ja bitte?", "Ja?", ], } ``` ## When to Process vs. Acknowledge If the user said something substantial during a barge-in, treat it as a new question rather than just acknowledging: ```typescript function handleUserSpeak(event: { type: 'user_speak', text: string, barged_in?: boolean }) { if (!event.barged_in) { return processNormalSpeech(event) } const interruptText = event.text.trim() // Substantial interruption = likely a complete thought or question if (interruptText.length > 25 || interruptText.includes('?')) { // Process as a complete question return processUserQuestion(event.text) } // Short interruption = just acknowledge return speak("I'm listening.") } ``` This provides a smoother experience - users don't have to repeat themselves. ## Silent Acknowledgment Sometimes the best response is no response. Return `204 No Content` to simply listen: ```typescript function handleUserSpeak(event: { type: 'user_speak', text: string, barged_in?: boolean }) { if (!event.barged_in) { return processNormalSpeech(event) } const text = event.text.toLowerCase() // User just said "um", "uh", background noise, etc. 
if (text.length < 3) { return new Response(null, { status: 204 }) } // User said "stop" or similar - they probably want silence if (text === 'stop' || text === 'quiet') { return new Response(null, { status: 204 }) } return speak("I'm listening.") } ``` ## Configure Barge-In Sensitivity Use the [barge-in configuration](/api/barge-in) to control when interruptions trigger: ### Immediate Response (Most Natural) ⚡ ```typescript // Most responsive - triggers on voice detection (20-100ms) return { type: 'speak', session_id: sessionId, text: "I can help you with billing, support, or sales...", barge_in: { strategy: 'immediate', allow_after_ms: 500, // Protect first 500ms from accidental noise }, } ``` **Best for:** * Natural conversations where instant response matters * Customer service with high urgency * Interactive dialogues **Trade-off:** May trigger on background noise. Use `allow_after_ms` as buffer. ### Character-Based (Balanced) ```typescript // Balanced - triggers after 3+ characters recognized return { type: 'speak', session_id: sessionId, text: "Let me explain how this works...", barge_in: { strategy: 'minimum_characters', minimum_characters: 3, // Trigger quickly but reliably }, } // Protect important information return { type: 'speak', session_id: sessionId, text: "Your confirmation code is 7-4-2-9. Please write this down.", barge_in: { strategy: 'minimum_characters', minimum_characters: 10, // Require more speech allow_after_ms: 3000, // Protect first 3 seconds }, } ``` ### No Interruption (Critical Info) ```typescript // Never allow interruption for critical info return { type: 'speak', session_id: sessionId, text: "This call may be recorded for quality assurance.", barge_in: { strategy: 'none', }, } ``` ### Strategy Comparison | Strategy | Latency | Reliability | Use Case | |----------|---------|-------------|----------| | `immediate` | 20-100ms | May trigger on noise | Most natural conversations | | `minimum_characters` | 50-200ms | Very reliable | Balanced approach | | `manual` | N/A | Perfect | Custom logic | | `none` | N/A | Perfect | Critical info only | ## Handling Impatient Users Some users interrupt frequently. Keep acknowledgments brief: ```typescript // Track interruption count per session const interruptCounts = new Map() function handleUserSpeak(event: { type: 'user_speak', text: string, barged_in?: boolean, session: any }) { if (!event.barged_in) { return processNormalSpeech(event) } const sessionId = event.session.id const count = (interruptCounts.get(sessionId) || 0) + 1 interruptCounts.set(sessionId, count) // User interrupts a lot - be extra brief if (count > 3) { return speak("Yes?") } return speak("Of course. 
What would you like?") } ``` ## Complete Example ```typescript export async function POST(req: Request): Promise { const event = await req.json() const sessionId = event.session.id switch (event.type) { case 'user_speak': return handleUserSpeak(event) case 'session_end': // Clean up session state sessionStates.delete(sessionId) interruptCounts.delete(sessionId) return new Response(null, { status: 204 }) default: return new Response(null, { status: 204 }) } } function handleUserSpeak(event: { type: 'user_speak' text: string barged_in?: boolean session: { id: string } }): Response { const sessionId = event.session.id // Handle normal speech if (!event.barged_in) { return processNormalUserSpeech(event) } // Barge-in handling const text = event.text.trim().toLowerCase() // Very short - probably noise, stay silent if (text.length < 3) { return new Response(null, { status: 204 }) } // User understood / confirmed if (text.includes('got it') || text.includes('thanks') || text.includes('okay')) { return speak(sessionId, "Great! What else can I help with?") } // User wants to redirect if (text.includes('actually') || text.includes('wait') || text.includes('but')) { return speak(sessionId, "Of course. What would you like?") } // Substantial text - treat as new input if (event.text.length > 25 || event.text.includes('?')) { return processNormalUserSpeech(event) } // Default return speak(sessionId, "I'm listening.") } function speak(sessionId: string, text: string): Response { return Response.json({ type: 'speak', session_id: sessionId, text, tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural', }, }) } ``` ## Best Practices Summary 1. **Respond to intent** - Use the `text` field to understand why they interrupted 2. **Be brief** - Short acknowledgments sound natural ("Got it!" not "I understand that you have indicated...") 3. **Vary your phrases** - Rotate through different acknowledgments 4. **Process substantial interruptions** - If they said a lot, treat it as a new question 5. **Sometimes stay silent** - Return 204 for noise or "stop" commands 6. **Configure sensitivity** - Use `barge_in` config to protect important information 7. **Keep impatient users happy** - Shorter responses for frequent interrupters 8. **Clean up state** - If you're tracking conversation state, consider resetting flags like "expecting confirmation" when the user interrupts ::: tip Async Operations If you're using the [Async Hold Pattern](/api/guides/async-hold-pattern) for slow operations, remember to cancel pending work when the user interrupts - they've moved on and don't want the old answer. ::: ## Related Documentation * **[Barge-In Configuration](/api/barge-in)** - Configure interruption sensitivity * **[User Speak Event with Barge-In Flag](/api/events/user-speak)** - Event reference * **[Event Flow](/api/event-flow)** - Complete event lifecycle --- --- url: /sipgate-ai-flow-api/api/guides/async-hold-pattern.md --- # Handling Long-Running Requests in Voice AI: The Async Hold Pattern When building voice AI assistants that integrate with external tools (like MCP servers, RAG systems, or slow APIs), you'll inevitably face a challenge: **sipgate AI Flow has webhook timeout limits**, but your backend operations might take much longer. This guide explains how to implement an elegant solution we call the "Async Hold Pattern" - keeping callers engaged while your system processes their request in the background. ## The Problem sipgate AI Flow enforces webhook timeout limits of approximately **5 seconds**. 
If your server doesn't respond in time, the platform may drop the connection or return an error to the caller. But what if your assistant needs to:

* Query a slow external API (20+ seconds)
* Search through a large knowledge base
* Call an MCP (Model Context Protocol) server
* Perform complex RAG operations
* Fetch real-time data from third-party services

You can't make the caller wait in silence, and you can't speed up the external service. So what do you do?

## The Solution: Async Hold Pattern

Instead of blocking on the slow operation, we:

1. **Start the operation in the background** (don't await it)
2. **Wait briefly** for a quick response (e.g., 4 seconds)
3. **If completed quickly** → return the result directly
4. **If still pending** → tell the caller to wait, then check again when they're done listening

This leverages a key insight: **sipgate AI Flow sends an `assistant_speech_ended` event when the assistant finishes speaking**. We can use this to create a polling loop that keeps the caller informed.

```mermaid
flowchart TD
  A[/"user_speak event"/] --> B["Start slow operation<br/>(don't await)"]
  B --> C["Wait up to 4 seconds"]
  C --> D{Completed?}
  D -->|Yes| E[/"Return result to caller"/]
  D -->|No| F["Return hold message:<br/>'One moment, let me check...'"]
  F --> G[/"assistant_speech_ended event"/]
  G --> H["Wait up to 4 seconds"]
  H --> I{Completed?}
  I -->|Yes| J[/"Return result to caller"/]
  I -->|No| K["Return next hold message:<br/>'Still searching...'"]
  K --> G

  style A fill:#e1f5fe
  style G fill:#e1f5fe
  style E fill:#c8e6c9
  style J fill:#c8e6c9
  style F fill:#fff3e0
  style K fill:#fff3e0
```

## Implementation

### Step 1: Create a Pending State Manager

First, we need a way to store the background promise and track state across webhook calls. Since each webhook call is a separate HTTP request, we need shared state:

```typescript
// pending-state.ts
interface PendingState {
  promise: Promise<{ response: string; error?: string }>
  startedAt: number
  holdMessageCount: number
  userMessage: string
}

// In-memory store (use Redis for multi-instance deployments)
const pendingStates = new Map<string, PendingState>()

// Hold messages - rotate through these while waiting
const HOLD_MESSAGES = [
  'One moment, let me check...',
  'Still searching...',
  'Just a moment longer...',
  'Almost there...',
]

// How long to wait before responding (stay under sipgate's timeout!)
const WAIT_BEFORE_RESPONSE_MS = 4000

export function startPending(
  sessionId: string,
  promise: Promise<{ response: string; error?: string }>,
  userMessage: string
): void {
  pendingStates.set(sessionId, {
    promise,
    startedAt: Date.now(),
    holdMessageCount: 0,
    userMessage,
  })
}

export function hasPending(sessionId: string): boolean {
  return pendingStates.has(sessionId)
}

export function cancelPending(sessionId: string): void {
  pendingStates.delete(sessionId)
}

export function getNextHoldMessage(sessionId: string): string {
  const state = pendingStates.get(sessionId)
  if (!state) return HOLD_MESSAGES[0]

  const index = Math.min(state.holdMessageCount, HOLD_MESSAGES.length - 1)
  state.holdMessageCount++
  return HOLD_MESSAGES[index]
}

export async function waitForCompletion(
  sessionId: string
): Promise<{ response: string; error?: string } | null> {
  const state = pendingStates.get(sessionId)
  if (!state) return null

  // Race between the promise and a timeout
  const timeoutPromise = new Promise<null>((resolve) => {
    setTimeout(() => resolve(null), WAIT_BEFORE_RESPONSE_MS)
  })

  const result = await Promise.race([state.promise, timeoutPromise])

  if (result !== null) {
    // Completed! Clean up and return
    pendingStates.delete(sessionId)
    return result
  }

  return null // Still pending
}
```

### Step 2: Handle the Initial `user_speak` Event

When a user speaks, start the background operation and wait briefly:

```typescript
// webhook-handler.ts
async function handleUserSpeak(event: {
  type: 'user_speak'
  session: { id: string }
  text: string
}) {
  const sessionId = event.session.id

  // Cancel any existing pending operation (user asked a new question)
  cancelPending(sessionId)

  // Start the slow operation in background (DON'T await the full operation!)
  const operationPromise: Promise<{ response: string; error?: string }> =
    performSlowOperation(event.text)
      .then(result => ({ response: result }))
      .catch(error => ({ response: '', error: String(error) }))

  // Wait up to 4 seconds for completion
  const INITIAL_WAIT_MS = 4000
  const timeoutPromise = new Promise<null>((resolve) => {
    setTimeout(() => resolve(null), INITIAL_WAIT_MS)
  })

  const quickResult = await Promise.race([operationPromise, timeoutPromise])

  if (quickResult !== null) {
    // Completed quickly! Return result directly - no hold message needed
    console.log('Operation completed within 4s - returning direct response')
    if (quickResult.error) {
      return createSpeakResponse(sessionId, 'I\'m sorry, there was an error.')
    }
    return createSpeakResponse(sessionId, quickResult.response)
  }

  // Taking too long - switch to hold pattern
  console.log('Operation taking >4s - using hold pattern')

  // Store the promise for the assistant_speech_ended handler
  startPending(sessionId, operationPromise, event.text)

  // Return hold message
  return createSpeakResponse(sessionId, getNextHoldMessage(sessionId))
}

function createSpeakResponse(sessionId: string, text: string) {
  return {
    type: 'speak',
    session_id: sessionId,
    text: text,
    tts: {
      provider: 'azure',
      language: 'en-US',
      voice: 'en-US-JennyNeural',
    },
  }
}
```

### Step 3: Handle the `assistant_speech_ended` Event

This is the key insight: sipgate AI Flow sends an `assistant_speech_ended` event when the assistant finishes speaking. **You can return a new action from this event!**

```typescript
async function handleAssistantSpeechEnded(event: {
  type: 'assistant_speech_ended'
  session: { id: string }
}) {
  const sessionId = event.session.id

  // No pending operation? Nothing to do - return 204 No Content
  if (!hasPending(sessionId)) {
    return new Response(null, { status: 204 })
  }

  // Wait for completion (up to 4 seconds to maximize processing time)
  const result = await waitForCompletion(sessionId)

  if (result !== null) {
    // Done! Return the actual response
    console.log('Operation completed - returning result')
    if (result.error) {
      return createSpeakResponse(
        sessionId,
        'I\'m sorry, I couldn\'t find that information.'
      )
    }
    return createSpeakResponse(sessionId, result.response)
  }

  // Still pending - say another hold message and wait for next speech_ended
  console.log('Still waiting - returning another hold message')
  return createSpeakResponse(sessionId, getNextHoldMessage(sessionId))
}
```

### Step 4: Wire Up the Webhook Router

```typescript
export async function POST(request: Request) {
  const event = await request.json()

  switch (event.type) {
    case 'session_start':
      return handleSessionStart(event)

    case 'user_speak':
      return handleUserSpeak(event)

    case 'assistant_speech_ended':
      return handleAssistantSpeechEnded(event)

    case 'session_end':
      // Clean up any pending state
      cancelPending(event.session.id)
      return handleSessionEnd(event)

    default:
      // Return 204 for events we don't handle
      return new Response(null, { status: 204 })
  }
}
```

## Key Insights

### Why 4 Seconds?

sipgate AI Flow has approximately a 5 second timeout. We use 4 seconds to:

* Leave buffer for network latency
* Allow time for the response to be transmitted
* Stay safely under the limit

You can adjust this value, but always leave at least 500ms-1s of buffer.

### The `assistant_speech_ended` Event is Powerful

Many developers overlook this event, but it's the key to the async pattern. When the assistant finishes speaking, sipgate sends this event and **waits for your response**. You can:

* Return a new `speak` action to continue talking
* Return `204 No Content` to stay silent and wait for user input
* Check if your background operation completed

This creates a natural polling mechanism without awkward silences.

### Memory vs. Redis

The example uses an in-memory `Map` for simplicity.
This works for single-instance deployments, but for production with multiple server instances behind a load balancer, use Redis: ```typescript import Redis from 'ioredis' const redis = new Redis(process.env.REDIS_URL) // Store metadata in Redis (promises can't be serialized) export async function startPending(sessionId: string, ...) { await redis.setex( `pending:${sessionId}`, 120, // 2 minute TTL JSON.stringify({ startedAt: Date.now(), holdMessageCount: 0, userMessage }) ) // Keep promise reference in memory (same instance will handle it) pendingPromises.set(sessionId, promise) } ``` ### Always Cancel Previous Operations When a user asks a new question while you're still processing the old one, cancel the old operation: ```typescript // In user_speak handler - always cancel previous pending operation cancelPending(event.session.id) ``` This prevents confusion and wasted resources. The user doesn't care about the old answer anymore. ### Clean Up on Session End Always clean up when the call ends: ```typescript case 'session_end': cancelPending(event.session.id) return handleSessionEnd(event) ``` ## Advanced: Caching Slow Initializations If your slow operation has a one-time initialization step (like discovering available tools from an MCP server), cache it separately: ```typescript // BAD: Fetching tool definitions on every request async function handleUserSpeak(event) { const tools = await mcpServer.listTools() // SLOW - 5+ seconds! const response = await llm.generate({ tools, message: event.text }) return createSpeakResponse(sessionId, response) } // GOOD: Cache tool definitions, only fetch once async function handleUserSpeak(event) { // Tools were cached when server was configured const tools = await database.get('mcp_server_tools', serverId) // FAST - <100ms const response = await llm.generate({ tools, message: event.text }) return createSpeakResponse(sessionId, response) } // Cache tools when MCP server is configured (admin action) async function configureMcpServer(serverUrl: string) { const tools = await mcpServer.listTools() // Slow, but only happens once await database.set('mcp_server_tools', serverId, tools) } ``` This separates "one-time setup" (caching tool definitions) from "per-request work" (calling tools), dramatically improving response times. 
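One more caveat: `cancelPending` only forgets the stored promise; the underlying work keeps running to completion. If the operation supports cancellation, you can additionally thread an `AbortSignal` through it. Below is a minimal sketch of this extension, assuming the slow operation is a `fetch` call; the controller map and the endpoint URL are illustrative, not part of the pattern above:

```typescript
// Hypothetical extension: one AbortController per session, so that
// cancelling a pending operation actually stops in-flight work
// instead of just forgetting it.
const controllers = new Map<string, AbortController>()

function startCancellableOperation(sessionId: string, query: string): Promise<string> {
  // Abort any previous operation for this session
  controllers.get(sessionId)?.abort()

  const controller = new AbortController()
  controllers.set(sessionId, controller)

  // Assumes the slow operation accepts an AbortSignal (fetch does);
  // the URL is a placeholder for your real backend.
  return fetch(`https://slow-api.example.com/search?q=${encodeURIComponent(query)}`, {
    signal: controller.signal,
  }).then(res => res.text())
}

function cancelOperation(sessionId: string): void {
  controllers.get(sessionId)?.abort()
  controllers.delete(sessionId)
}
```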
## Complete Minimal Example

Here's a self-contained example you can adapt:

```typescript
// ============================================
// pending-state.ts
// ============================================

const pendingStates = new Map<string, {
  promise: Promise<{ response: string; error?: string }>
  holdCount: number
}>()

const HOLD_MESSAGES = [
  'One moment please...',
  'Still searching...',
  'Almost there...',
]

const WAIT_MS = 4000

export const pending = {
  start(id: string, promise: Promise<{ response: string; error?: string }>) {
    pendingStates.set(id, { promise, holdCount: 0 })
  },

  has(id: string): boolean {
    return pendingStates.has(id)
  },

  cancel(id: string): void {
    pendingStates.delete(id)
  },

  getHoldMessage(id: string): string {
    const state = pendingStates.get(id)
    if (!state) return HOLD_MESSAGES[0]
    return HOLD_MESSAGES[Math.min(state.holdCount++, HOLD_MESSAGES.length - 1)]
  },

  async wait(id: string): Promise<{ response: string; error?: string } | null> {
    const state = pendingStates.get(id)
    if (!state) return null

    const timeout = new Promise<null>(r => setTimeout(() => r(null), WAIT_MS))
    const result = await Promise.race([state.promise, timeout])

    if (result !== null) {
      pendingStates.delete(id)
    }
    return result
  },
}

// ============================================
// webhook.ts
// ============================================

import { pending } from './pending-state'

export async function POST(req: Request): Promise<Response> {
  const event = await req.json()
  const sessionId = event.session.id

  // Handle user_speak - start background operation
  if (event.type === 'user_speak') {
    pending.cancel(sessionId) // Cancel any previous operation

    // Start slow operation (don't await fully!)
    const promise: Promise<{ response: string; error?: string }> =
      slowExternalApiCall(event.text)
        .then(result => ({ response: result }))
        .catch(err => ({ response: '', error: String(err) }))

    // Wait up to 4 seconds
    const timeout = new Promise<null>(r => setTimeout(() => r(null), 4000))
    const quick = await Promise.race([promise, timeout])

    // If completed quickly, return result directly
    if (quick !== null) {
      if (quick.error) {
        return speak(sessionId, 'Sorry, there was an error.')
      }
      return speak(sessionId, quick.response)
    }

    // Taking too long - use hold pattern
    pending.start(sessionId, promise)
    return speak(sessionId, pending.getHoldMessage(sessionId))
  }

  // Handle assistant_speech_ended - check if operation completed
  if (event.type === 'assistant_speech_ended') {
    if (!pending.has(sessionId)) {
      return new Response(null, { status: 204 })
    }

    const result = await pending.wait(sessionId)

    if (result !== null) {
      if (result.error) {
        return speak(sessionId, 'The request could not be processed.')
      }
      return speak(sessionId, result.response)
    }

    // Still waiting - another hold message
    return speak(sessionId, pending.getHoldMessage(sessionId))
  }

  // Handle session_end - clean up
  if (event.type === 'session_end') {
    pending.cancel(sessionId)
  }

  return new Response(null, { status: 204 })
}

function speak(sessionId: string, text: string): Response {
  return Response.json({
    type: 'speak',
    session_id: sessionId,
    text,
    tts: {
      provider: 'azure',
      language: 'en-US',
      voice: 'en-US-GuyNeural',
    },
  })
}

// Your slow operation (replace with actual implementation)
async function slowExternalApiCall(query: string): Promise<string> {
  // Simulating a slow API call
  await new Promise(r => setTimeout(r, 15000))
  return `Here's what I found about "${query}"...`
}
```

## Conclusion

The Async Hold Pattern transforms a technical limitation into a natural conversation flow. Instead of timing out or making users wait in awkward silence, your assistant says "One moment please..." - just like a human would.
**Key takeaways:**

1. **Start slow operations without awaiting** - let them run in the background
2. **Wait briefly (4 seconds)** before deciding to use hold messages
3. **Use the `assistant_speech_ended` event** to poll for completion
4. **Keep messages varied** - rotate through different hold phrases
5. **Always clean up** - cancel pending operations when no longer needed
6. **Cache when possible** - separate one-time setup from per-request work

This pattern works with any slow backend operation: MCP servers, RAG pipelines, external APIs, database queries, or anything else that might exceed the webhook timeout.

***

*For more information about sipgate AI Flow events and actions, see the [sipgate AI Flow API documentation](https://sipgate.github.io/sipgate-ai-flow-api/).*

---

---
url: /sipgate-ai-flow-api/api/guides/testing-voice-assistants.md
---

# Testing Voice Assistants Without Making Phone Calls

Testing voice assistants is challenging - you can't just write unit tests and call it a day. Real phone calls are slow, awkward to automate, and expensive at scale. This guide covers practical strategies for testing your sipgate AI Flow integration at every level.

## The Testing Challenge

Voice assistants have unique testing challenges:

* **Real calls are slow** - Each test takes 30+ seconds of actual talking
* **Hard to automate** - You can't easily script "say this, wait for response"
* **Expensive at scale** - Phone minutes add up during development
* **Environment-dependent** - Need a publicly accessible webhook URL
* **Non-deterministic** - Speech recognition varies, LLM responses vary

The solution: test at multiple levels, saving real phone calls for final validation.

## Testing Pyramid for Voice AI

```mermaid
graph TB
  subgraph "Testing Pyramid"
    A["🔺 Real Phone Calls<br/>(Few, Final Validation)"]
    B["🔸 Event Simulation<br/>(HTTP requests to your webhook)"]
    C["🔹 Chat Simulator<br/>(Test LLM logic via text)"]
    D["🟦 Unit Tests<br/>(Business logic, utilities)"]
  end

  D --> C --> B --> A

  style A fill:#ffcdd2
  style B fill:#fff3e0
  style C fill:#e3f2fd
  style D fill:#c8e6c9
```

## Level 1: Unit Tests

Test your business logic in isolation - no sipgate, no LLM calls.

```typescript
// utils/intent-detection.ts
export function detectIntent(text: string): 'greeting' | 'question' | 'goodbye' | 'unknown' {
  const lower = text.toLowerCase()

  if (lower.match(/^(hi|hello|hey|good morning)/)) return 'greeting'
  if (lower.match(/(bye|goodbye|see you|thanks)/)) return 'goodbye'
  if (lower.includes('?')) return 'question'
  return 'unknown'
}

// utils/intent-detection.test.ts
import { detectIntent } from './intent-detection'

describe('detectIntent', () => {
  it('detects greetings', () => {
    expect(detectIntent('Hello there')).toBe('greeting')
    expect(detectIntent('Hi!')).toBe('greeting')
    expect(detectIntent('Good morning')).toBe('greeting')
  })

  it('detects questions', () => {
    expect(detectIntent('What are your hours?')).toBe('question')
    expect(detectIntent('Can you help me?')).toBe('question')
  })

  it('detects goodbyes', () => {
    expect(detectIntent('Goodbye')).toBe('goodbye')
    expect(detectIntent('Thanks, bye!')).toBe('goodbye')
  })
})
```

**What to unit test:**

* Intent detection logic
* Response formatting
* State machine transitions
* Phone number normalization
* TTS configuration building

## Level 2: Chat Simulator

Build a text-based interface that uses the same LLM logic as your voice assistant. This lets you rapidly iterate on prompts and conversation flow without any phone infrastructure.

```typescript
// The key insight: extract your LLM logic into a shared service
// lib/conversation-service.ts
export async function generateResponse(params: {
  systemPrompt: string
  conversationHistory: { role: 'user' | 'assistant'; content: string }[]
  userMessage: string
}): Promise<string> {
  // Your LLM call logic here
  // This is used by BOTH the webhook AND the chat simulator
}
```

```typescript
// Webhook uses it
async function handleUserSpeak(event: UserSpeakEvent) {
  const response = await generateResponse({
    systemPrompt: assistant.system_prompt,
    conversationHistory: history,
    userMessage: event.text,
  })
  return speak(response)
}

// Chat simulator uses the SAME function
async function handleChatMessage(message: string, sessionId: string) {
  const response = await generateResponse({
    systemPrompt: assistant.system_prompt,
    conversationHistory: history,
    userMessage: message,
  })
  return { response }
}
```

**Benefits:**

* Test conversation flow in seconds, not minutes
* Iterate on system prompts quickly
* Debug LLM issues without phone overhead
* Share sessions with teammates for review

**Limitations:**

* Doesn't test speech recognition accuracy
* Doesn't test TTS pronunciation
* Doesn't test real-time timing

## Level 3: Event Simulation

Send fake sipgate events directly to your webhook. This tests your actual webhook handler without needing a phone call.
### Manual Testing with curl

```bash
# Simulate session_start
curl -X POST http://localhost:3000/api/webhook \
  -H "Content-Type: application/json" \
  -d '{
    "type": "session_start",
    "session": {
      "id": "test-session-123",
      "account_id": "test-account",
      "phone_number": "1234567890",
      "direction": "inbound",
      "from_phone_number": "0987654321",
      "to_phone_number": "1234567890"
    }
  }'

# Simulate user_speak
curl -X POST http://localhost:3000/api/webhook \
  -H "Content-Type: application/json" \
  -d '{
    "type": "user_speak",
    "session": {
      "id": "test-session-123",
      "account_id": "test-account",
      "phone_number": "1234567890"
    },
    "text": "What are your business hours?"
  }'

# Simulate user_speak with interruption (barged_in)
curl -X POST http://localhost:3000/api/webhook \
  -H "Content-Type: application/json" \
  -d '{
    "type": "user_speak",
    "session": {
      "id": "test-session-123",
      "account_id": "test-account",
      "phone_number": "1234567890"
    },
    "text": "Actually, never mind",
    "barged_in": true
  }'

# Simulate session_end
curl -X POST http://localhost:3000/api/webhook \
  -H "Content-Type: application/json" \
  -d '{
    "type": "session_end",
    "session": {
      "id": "test-session-123",
      "account_id": "test-account",
      "phone_number": "1234567890"
    },
    "reason": "caller_hangup"
  }'
```

### Automated Integration Tests

```typescript
// tests/webhook.test.ts
import { describe, it, expect, beforeEach } from 'vitest'

const WEBHOOK_URL = 'http://localhost:3000/api/webhook'

describe('Webhook Integration', () => {
  const sessionId = `test-${Date.now()}`

  it('handles session_start and returns greeting', async () => {
    const response = await fetch(WEBHOOK_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        type: 'session_start',
        session: {
          id: sessionId,
          account_id: 'test',
          phone_number: '1234567890',
          direction: 'inbound',
          from_phone_number: '0987654321',
          to_phone_number: '1234567890',
        },
      }),
    })

    expect(response.ok).toBe(true)
    const data = await response.json()
    expect(data.type).toBe('speak')
    expect(data.text).toBeTruthy()
    expect(data.session_id).toBe(sessionId)
  })

  it('handles user_speak and returns response', async () => {
    const response = await fetch(WEBHOOK_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        type: 'user_speak',
        session: { id: sessionId, account_id: 'test', phone_number: '1234567890' },
        text: 'What are your hours?',
      }),
    })

    expect(response.ok).toBe(true)
    const data = await response.json()
    expect(data.type).toBe('speak')
    expect(data.text).toBeTruthy()
  })

  it('handles barge-in gracefully', async () => {
    const response = await fetch(WEBHOOK_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        type: 'user_speak',
        barged_in: true,
        session: { id: sessionId, account_id: 'test', phone_number: '1234567890' },
        text: 'Wait',
      }),
    })

    expect(response.ok).toBe(true)
    // Could be 204 or a speak action
    if (response.status !== 204) {
      const data = await response.json()
      expect(data.type).toBe('speak')
    }
  })

  it('handles session_end and cleans up', async () => {
    const response = await fetch(WEBHOOK_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        type: 'session_end',
        session: { id: sessionId, account_id: 'test', phone_number: '1234567890' },
        reason: 'caller_hangup',
      }),
    })

    expect(response.ok).toBe(true)
  })
})
```

### Conversation Flow Tests

Test complete conversation scenarios:

```typescript
// tests/flows/booking-flow.test.ts
async function simulateConversation(messages: string[]): Promise<string[]> {
const sessionId = `test-${Date.now()}` const responses: string[] = [] // Start session await sendEvent({ type: 'session_start', session: { id: sessionId, ... } }) // Simulate each user message for (const message of messages) { const response = await sendEvent({ type: 'user_speak', session: { id: sessionId, ... }, text: message, }) responses.push(response.text) } // End session await sendEvent({ type: 'session_end', session: { id: sessionId, ... } }) return responses } describe('Booking Flow', () => { it('completes a booking conversation', async () => { const responses = await simulateConversation([ 'I want to book an appointment', 'Tomorrow at 2pm', 'John Smith', 'Yes, that is correct', ]) expect(responses[0]).toMatch(/when|date|time/i) expect(responses[1]).toMatch(/name/i) expect(responses[2]).toMatch(/confirm/i) expect(responses[3]).toMatch(/booked|confirmed|scheduled/i) }) it('handles corrections mid-flow', async () => { const responses = await simulateConversation([ 'I want to book an appointment', 'Tomorrow at 2pm', 'Actually, make it 3pm instead', ]) expect(responses[2]).toMatch(/3|three|pm/i) }) }) ``` ## Level 4: Local Development with ngrok For testing with real sipgate infrastructure (but simulated calls), expose your local server: ```bash # Start your dev server npm run dev # In another terminal, expose it ngrok http 3000 ``` Configure the ngrok URL as your webhook endpoint in sipgate. Now sipgate can reach your local development server. **Use cases:** * Test webhook authentication * Test with sipgate's actual event format * Debug production issues locally ## Level 5: Real Phone Calls Save these for final validation. Create a testing checklist: ```markdown ## Pre-Release Phone Test Checklist ### Basic Flow - [ ] Call connects and greeting plays - [ ] Assistant responds to simple question - [ ] Assistant handles "I don't understand" gracefully - [ ] Call ends cleanly when user says goodbye ### Barge-In - [ ] Interrupting mid-sentence works - [ ] Assistant acknowledges interruption - [ ] No "stale" responses after interruption ### Edge Cases - [ ] Long silence from user (10+ seconds) - [ ] Very long user input (30+ seconds of speaking) - [ ] Background noise doesn't trigger false responses - [ ] Accent/dialect recognition (if applicable) ### Error Handling - [ ] Network timeout during LLM call - [ ] Invalid user input - [ ] Session state recovery after errors ``` ## Testing Utilities ### Event Factory Create a helper for generating test events: ```typescript // tests/utils/event-factory.ts export function createSessionStartEvent(overrides = {}) { return { type: 'session_start', session: { id: `test-${Date.now()}`, account_id: 'test-account', phone_number: '1234567890', direction: 'inbound', from_phone_number: '0987654321', to_phone_number: '1234567890', }, ...overrides, } } export function createUserSpeakEvent(sessionId: string, text: string, overrides = {}) { return { type: 'user_speak', session: { id: sessionId, account_id: 'test-account', phone_number: '1234567890', }, text, ...overrides, } } export function createBargeInEvent(sessionId: string, text: string, overrides = {}) { return { type: 'user_speak', barged_in: true, session: { id: sessionId, account_id: 'test-account', phone_number: '1234567890', }, text, ...overrides, } } ``` ### Response Assertions ```typescript // tests/utils/assertions.ts export function assertSpeakAction(response: any, options: { containsText?: string sessionId?: string } = {}) { expect(response.type).toBe('speak') expect(response.text).toBeTruthy() 
expect(response.tts).toBeDefined()

  if (options.containsText) {
    expect(response.text.toLowerCase()).toContain(options.containsText.toLowerCase())
  }

  if (options.sessionId) {
    expect(response.session_id).toBe(options.sessionId)
  }
}

export function assertTransferAction(response: any, targetNumber?: string) {
  expect(response.type).toBe('transfer')
  expect(response.target_phone_number).toBeTruthy()

  if (targetNumber) {
    expect(response.target_phone_number).toBe(targetNumber)
  }
}
```

## CI/CD Integration

Run event simulation tests in your pipeline:

```yaml
# .github/workflows/test.yml
name: Test Voice Assistant

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install dependencies
        run: npm ci

      - name: Run unit tests
        run: npm test

      - name: Start server
        run: npm run dev &
        env:
          # Use test/mock API keys
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY_TEST }}

      - name: Wait for server
        run: npx wait-on http://localhost:3000/api/health

      - name: Run integration tests
        run: npm run test:integration
```

## Best Practices Summary

1. **Extract shared logic** - Same LLM service for chat and voice
2. **Test the pyramid** - Most tests at unit level, fewest at phone level
3. **Automate event simulation** - Integration tests catch regressions
4. **Use deterministic test data** - Fixed session IDs, predictable inputs
5. **Test conversation flows** - Not just individual events
6. **Create test utilities** - Event factories, response assertions
7. **Run in CI** - Catch issues before deployment
8. **Save phone tests for validation** - Manual checklist for final sign-off

## Related Documentation

* **[HTTP Webhooks](/api/http-webhooks)** - Webhook endpoint reference
* **[Event Types](/api/events)** - All event structures
* **[Action Types](/api/actions)** - Response format reference

---

---
url: /sipgate-ai-flow-api/api/events.md
---

# Event Types

Complete reference for all events sent by the AI Flow service.

## Overview

Events are JSON objects sent from the AI Flow service to your application. All events include a `type` field and session information.

## Base Event Structure

All events include session information:

```json
{
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "account_id": "account-123",
    "phone_number": "1234567890",
    "direction": "inbound",
    "from_phone_number": "9876543210",
    "to_phone_number": "1234567890"
  }
}
```

The `direction` field indicates whether the call was initiated by the caller (`"inbound"`) or by the AI flow via the outbound call API (`"outbound"`). Use it in your `session_start` handler to tailor the greeting accordingly.
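For example, a `session_start` handler might branch on `direction` like this (a quick sketch; the greeting texts are illustrative):

```typescript
function handleSessionStart(event: {
  type: 'session_start'
  session: { id: string; direction?: 'inbound' | 'outbound' }
}) {
  // Outbound calls were initiated by the AI flow itself, so the assistant
  // should open the conversation differently than for an inbound caller.
  const text =
    event.session.direction === 'outbound'
      ? 'Hello, this is the automated assistant calling about your appointment.'
      : 'Welcome! How can I help you today?'

  return {
    type: 'speak',
    session_id: event.session.id,
    text,
  }
}
```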
## Event Types

| Event Type | Transport | Description | When Triggered |
|--------------------------|--------------------|------------------------------|--------------------------------------------------------------------------------|
| `session_start` | HTTP + WebSocket | Call session begins | When a new call is initiated |
| `user_speech_started` | **WebSocket only** | Speech onset detected | When VAD detects the user starting to speak (before the full transcript) |
| `user_speak` | HTTP + WebSocket | User speech detected | After speech-to-text completes (includes `barged_in` flag if the user interrupted) |
| `dtmf_received` | HTTP + WebSocket | DTMF digit pressed | When the user presses a key on their phone keypad |
| `assistant_speak` | HTTP + WebSocket | Assistant started speaking | After the assistant starts speaking (may be omitted for some TTS models) |
| `assistant_speech_ended` | HTTP + WebSocket | Assistant finished speaking | After speech playback ends |
| `user_input_timeout` | HTTP + WebSocket | User input timeout reached | When no speech is detected within the timeout |
| `session_end` | HTTP + WebSocket | Call session ends | When the call terminates |
| `sms_failed` | HTTP + WebSocket | SMS delivery failed | After a `send_sms` action fails — includes `reason` so the agent can react |

## Quick Reference

* **[Session Start](/api/events/session-start)** - Call begins
* **[User Speech Started](/api/events/user-speech-started)** - Speech onset detected (WebSocket only)
* **[User Speak](/api/events/user-speak)** - User speaks (includes barge-in detection)
* **[DTMF Received](/api/events/dtmf-received)** - User pressed a phone key
* **[Assistant Speak](/api/events/assistant-speak)** - Assistant starts speaking
* **[Assistant Speech Ended](/api/events/assistant-speech-ended)** - Assistant finished speaking
* **[User Input Timeout](/api/events/user-input-timeout)** - Timeout reached waiting for user
* **[Session End](/api/events/session-end)** - Call ends
* **SMS Failed** — emitted when a `send_sms` action fails; see below.

## SMS Failed

Emitted to your webhook / WebSocket when a `send_sms` action fails. The call continues normally — handle this event to react conversationally (e.g. apologize, retry with a corrected number).

```json
{
  "type": "sms_failed",
  "session": {
    "id": "550e8400-...",
    "account_id": "...",
    "phone_number": "...",
    "from_phone_number": "...",
    "to_phone_number": "..."
  },
  "recipient": "4915112345678",
  "reason": "sender_not_allowed",
  "message": "SMSC returned faultCode 403"
}
```

| Field | Type | Description |
|-------------|--------|--------------------------------------------------------------------------------------------|
| `type` | string | Always `"sms_failed"` |
| `session` | object | Standard session info |
| `recipient` | string | Phone number that failed (the `phone_number` from your `send_sms` action) |
| `reason` | string | One of: `sender_not_allowed`, `insufficient_balance`, `no_sms_extension`, `smsc_unavailable`, `unknown` |
| `message` | string | Optional human-readable detail (safe to log, may contain technical error text) |

See **[Send SMS Action](/api/actions/send-sms)** for details on each failure reason.
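A handler could map each failure `reason` to a spoken reaction, since the call is still live. A minimal sketch (the handler name and phrasings are illustrative):

```typescript
function handleSmsFailed(event: {
  type: 'sms_failed'
  session: { id: string }
  recipient: string
  reason:
    | 'sender_not_allowed'
    | 'insufficient_balance'
    | 'no_sms_extension'
    | 'smsc_unavailable'
    | 'unknown'
  message?: string
}) {
  // The call continues normally, so react conversationally instead of failing hard.
  console.warn(`SMS to ${event.recipient} failed: ${event.reason}`, event.message)

  const text =
    event.reason === 'smsc_unavailable' || event.reason === 'unknown'
      ? "I'm sorry, the text message could not be sent right now. Should I try again?"
      : "I'm sorry, I wasn't able to send you a text message on this line."

  return {
    type: 'speak',
    session_id: event.session.id,
    text,
  }
}
```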
## Event Flow ```mermaid graph LR A[session_start] --> B[user_speak] B --> C[assistant_speak] C --> B C --> D[user_speak with barged_in=true] D --> B B --> E[session_end] C --> E ``` ## Handling Events ### HTTP Webhook ```python @app.route('/webhook', methods=['POST']) def webhook(): event = request.json event_type = event['type'] if event_type == 'session_start': # Handle session start pass elif event_type == 'user_speak': # Handle user speech pass # ... handle other events ``` ### WebSocket ```javascript ws.on('message', (data) => { const event = JSON.parse(data.toString()); switch (event.type) { case 'session_start': // Handle session start break; case 'user_speak': // Handle user speech break; // ... handle other events } }); ``` ## Response Requirements All events (except `session_end`) accept a single action, an array of actions (executed in sequence), or `204 No Content`: * **session\_start**: Can return action(s) or `204 No Content` * **user\_speak**: Can return action(s) or `204 No Content` (check `barged_in` flag for interruptions) * **dtmf\_received**: Can return action(s) or `204 No Content` * **assistant\_speak**: Can return action(s) or `204 No Content` * **assistant\_speech\_ended**: Can return action(s) or `204 No Content` * **user\_input\_timeout**: Can return action(s) or `204 No Content` * **session\_end**: **No action allowed**, cleanup only ## Next Steps * **[Session Start Event](/api/events/session-start)** - Detailed reference * **[User Speak Event](/api/events/user-speak)** - Detailed reference * **[Action Types](/api/actions)** - How to respond to events --- --- url: /sipgate-ai-flow-api/api/events/session-start.md --- # Session Start Event Triggered when a new call session begins. ## Event Structure ```json { "type": "session_start", "session": { "id": "550e8400-e29b-41d4-a716-446655440000", "account_id": "account-123", "phone_number": "1234567890", "direction": "inbound", "from_phone_number": "9876543210", "to_phone_number": "1234567890" } } ``` ## Fields | Field | Type | Required | Description | |-------|------|----------|-------------| | `type` | string | Yes | Always `"session_start"` | | `session.id` | string (UUID) | Yes | Unique session identifier | | `session.account_id` | string | Yes | Account identifier | | `session.phone_number` | string | Yes | Phone number for this flow session | | `session.direction` | string | No | `"inbound"` or `"outbound"` | | `session.from_phone_number` | string | Yes | Phone number of the caller | | `session.to_phone_number` | string | Yes | Phone number of the callee | ## Response You can return a single action, an array of actions (executed in sequence), or `204 No Content`. Common responses: ### Greet the User ```json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "Welcome! How can I help you today?" } ``` ### Play Welcome Audio ```json { "type": "audio", "session_id": "550e8400-e29b-41d4-a716-446655440000", "audio": "base64-encoded-wav-data" } ``` ### No Response ```http HTTP/1.1 204 No Content ``` ## Examples ### Python (Flask) ```python @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'session_start': session_id = event['session']['id'] return jsonify({ 'type': 'speak', 'session_id': session_id, 'text': 'Welcome! How can I help you?' 
}) return '', 204 ``` ### Node.js (Express) ```javascript app.post('/webhook', (req, res) => { const event = req.body; if (event.type === 'session_start') { return res.json({ type: 'speak', session_id: event.session.id, text: 'Welcome! How can I help you?' }); } res.status(204).send(); }); ``` ### Go ```go func webhook(w http.ResponseWriter, r *http.Request) { var event map[string]interface{} json.NewDecoder(r.Body).Decode(&event) if event["type"] == "session_start" { session := event["session"].(map[string]interface{}) action := map[string]interface{}{ "type": "speak", "session_id": session["id"], "text": "Welcome! How can I help you?", } w.Header().Set("Content-Type", "application/json") json.NewEncoder(w).Encode(action) return } w.WriteHeader(http.StatusNoContent) } ``` ## Use Cases * **Initialize session state** - Set up conversation context * **Greet the user** - Welcome message * **Log call information** - Track incoming calls * **Route based on number** - Different greetings for different numbers ## Best Practices 1. **Respond quickly** - Keep greeting under 2 seconds 2. **Initialize state** - Set up any session tracking 3. **Log session info** - Record call metadata 4. **Handle errors** - Always return a valid response ## Next Steps * **[User Speak Event](/api/events/user-speak)** - Handle user input * **[Action Types](/api/actions)** - All available actions * **[Event Flow](/api/event-flow)** - Understand the complete flow --- --- url: /sipgate-ai-flow-api/api/events/user-speech-started.md --- # User Speech Started Event Triggered when the user's speech is first detected — before the full transcript is available. Uses Voice Activity Detection (VAD) and typically fires 20–120 ms after the user starts speaking. ::: info WebSocket only This event is only delivered via WebSocket connections. It is not sent to HTTP webhook endpoints. ::: ## Event Structure ```json { "type": "user_speech_started", "session": { "id": "550e8400-e29b-41d4-a716-446655440000", "account_id": "account-123", "phone_number": "1234567890", "direction": "inbound", "from_phone_number": "9876543210", "to_phone_number": "1234567890" } } ``` ## Fields | Field | Type | Required | Description | |-------|------|----------|-------------| | `type` | string | Yes | Always `"user_speech_started"` | | `session.id` | string (UUID) | Yes | Session identifier | | `session.account_id` | string | Yes | Account identifier | | `session.phone_number` | string | Yes | Phone number for this flow session | ## Behaviour * Fires **at most once per speech turn** — subsequent partial transcripts within the same turn are suppressed * Resets automatically after the corresponding `user_speak` event is received, so it fires again on the next speech turn * No response or actions are expected; the service ignores any payload returned for this event ## Use Cases * **Show "user is speaking" indicators** in real-time dashboards or call monitoring UIs * **Start latency optimisations early** — e.g. 
pre-warm LLM context or fetch data before the full transcript arrives * **Interrupt ongoing workflows** — cancel queued background processing when the user begins to speak ## Example (TypeScript SDK) ```typescript import { AiFlowAssistant } from '@sipgate/ai-flow-sdk'; import WebSocket from 'ws'; const assistant = AiFlowAssistant.create({ onUserSpeechStarted: async (event) => { console.log('User started speaking, session:', event.session.id); // No return value needed }, onUserSpeak: async (event) => { return `You said: ${event.text}`; }, }); const wss = new WebSocket.Server({ port: 3000 }); wss.on('connection', (ws) => { ws.on('message', assistant.ws(ws)); }); ``` ## Example (Raw WebSocket) ```javascript ws.on('message', (data) => { const event = JSON.parse(data.toString()); if (event.type === 'user_speech_started') { console.log('User started speaking in session', event.session.id); // No response needed — the service ignores any reply } if (event.type === 'user_speak') { ws.send(JSON.stringify({ type: 'speak', session_id: event.session.id, text: `You said: ${event.text}`, })); } }); ``` ## Next Steps * **[User Speak Event](/api/events/user-speak)** - Full transcript after STT completes * **[Barge-In Guide](/api/barge-in)** - Interrupting assistant speech * **[WebSocket Integration](/api/websocket)** - How to connect via WebSocket --- --- url: /sipgate-ai-flow-api/api/events/user-speak.md --- # User Speak Event Triggered when the user speaks and speech-to-text completes. ## Event Structure ```json { "type": "user_speak", "text": "Hello, I need help", "barged_in": false, "session": { "id": "550e8400-e29b-41d4-a716-446655440000", "account_id": "account-123", "phone_number": "1234567890" } } ``` ### Barge-In Detection When a user interrupts the assistant mid-speech, the event includes `barged_in: true`: ```json { "type": "user_speak", "text": "Wait", "barged_in": true, "session": { "id": "550e8400-e29b-41d4-a716-446655440000", "account_id": "account-123", "phone_number": "1234567890" } } ``` ## Fields | Field | Type | Required | Description | |-------|------|----------|-------------| | `type` | string | Yes | Always `"user_speak"` | | `text` | string | Yes | Recognized speech text | | `barged_in` | boolean | No | `true` if user interrupted assistant, `false` or omitted otherwise | | `session.id` | string (UUID) | Yes | Session identifier | | `session.account_id` | string | Yes | Account identifier | | `session.phone_number` | string | Yes | Phone number for this flow session | ## End-of-Utterance Detection The service does not send a `user_speak` event after every individual STT segment. Instead, it buffers recognized speech and uses an on-device model to detect when the user has actually finished speaking. ### How it works After each STT recognition result, the service checks whether the accumulated text is a complete utterance: | Condition | Behaviour | |-----------|-----------| | Utterance is complete (e.g. full sentence, question) | `user_speak` is emitted immediately with the full accumulated text | | Utterance is incomplete (e.g. dangling fragment like *"Ich möchte"*) | Service waits up to **2 seconds** for the user to continue speaking | | User continues speaking within 2 seconds | The 2-second timer resets; both segments are merged into one event | | 2 seconds pass with no further speech | `user_speak` is emitted with all buffered text | ### Practical implications * The `text` field may contain **multiple speech segments merged** into a single string when the user speaks in bursts. 
* Your webhook receives **one** `user_speak` per coherent utterance, not one per STT segment. * Response latency is lowest for complete sentences — the model triggers the event immediately without waiting. ### Language sensitivity The end-of-utterance model uses **language-specific thresholds** to decide what counts as a complete utterance. The active language is determined by the `languages` field set via the [`configure_transcription`](/api/actions/configure-transcription) action. If no language is configured, a default threshold is used. Setting the correct language improves detection accuracy and reduces unnecessary delays. ## Response You can return a single action, an array of actions (executed in sequence), or `204 No Content`. Common responses: ### Speak Back ```json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "I understand. How can I help you?" } ``` ### Transfer Call ```json { "type": "transfer", "session_id": "550e8400-e29b-41d4-a716-446655440000", "target_phone_number": "1234567890", "caller_id_name": "Support", "caller_id_number": "1234567890" } ``` ### Hangup ```json { "type": "hangup", "session_id": "550e8400-e29b-41d4-a716-446655440000" } ``` ## Examples ### Python (Flask) ```python @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'user_speak': session_id = event['session']['id'] user_text = event['text'].lower() if 'goodbye' in user_text or 'bye' in user_text: return jsonify({ 'type': 'hangup', 'session_id': session_id }) if 'transfer' in user_text: return jsonify({ 'type': 'transfer', 'session_id': session_id, 'target_phone_number': '1234567890', 'caller_id_name': 'Support', 'caller_id_number': '1234567890' }) return jsonify({ 'type': 'speak', 'session_id': session_id, 'text': f"You said: {event['text']}" }) return '', 204 ``` ### Node.js (Express) ```javascript app.post('/webhook', (req, res) => { const event = req.body; if (event.type === 'user_speak') { const userText = event.text.toLowerCase(); if (userText.includes('goodbye') || userText.includes('bye')) { return res.json({ type: 'hangup', session_id: event.session.id }); } if (userText.includes('transfer')) { return res.json({ type: 'transfer', session_id: event.session.id, target_phone_number: '1234567890', caller_id_name: 'Support', caller_id_number: '1234567890' }); } return res.json({ type: 'speak', session_id: event.session.id, text: `You said: ${event.text}` }); } res.status(204).send(); }); ``` ### Go ```go func webhook(w http.ResponseWriter, r *http.Request) { var event map[string]interface{} json.NewDecoder(r.Body).Decode(&event) if event["type"] == "user_speak" { session := event["session"].(map[string]interface{}) text := strings.ToLower(event["text"].(string)) if strings.Contains(text, "goodbye") || strings.Contains(text, "bye") { action := map[string]interface{}{ "type": "hangup", "session_id": session["id"], } w.Header().Set("Content-Type", "application/json") json.NewEncoder(w).Encode(action) return } action := map[string]interface{}{ "type": "speak", "session_id": session["id"], "text": "You said: " + event["text"].(string), } w.Header().Set("Content-Type", "application/json") json.NewEncoder(w).Encode(action) return } w.WriteHeader(http.StatusNoContent) } ``` ## Handling Barge-In You can check the `barged_in` flag to provide special handling for interruptions: ```python @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'user_speak': if event.get('barged_in'): # User 
interrupted - acknowledge quickly return jsonify({ 'type': 'speak', 'session_id': event['session']['id'], 'text': 'Yes, I\'m listening.' }) else: # Normal speech processing return process_user_input(event['text']) ``` See the **[Barge-In Best Practices Guide](/api/guides/barge-in-best-practices)** for detailed strategies. ## Use Cases * **Process user input** - Understand what the user wants * **Detect interruptions** - Handle barge-in with `barged_in` flag * **Route conversations** - Direct to appropriate handler * **Collect information** - Gather details from user * **Transfer calls** - Route to human agents * **End calls** - Handle goodbye messages ## Best Practices 1. **Process quickly** - Respond within 1-2 seconds 2. **Handle barge-in gracefully** - Check `barged_in` flag for interruptions 3. **Handle errors** - Always return a valid response 4. **Log interactions** - Track conversation for analytics 5. **Validate input** - Check for expected patterns ## Next Steps * **[Assistant Speak Event](/api/events/assistant-speak)** - Track when assistant speaks * **[Action Types](/api/actions)** - All available actions * **[Event Flow](/api/event-flow)** - Understand the complete flow --- --- url: /sipgate-ai-flow-api/api/events/assistant-speak.md --- # Assistant Speak Event Triggered when the assistant starts speaking. This event may be omitted by some text-to-speech models. ## Event Structure ```json { "type": "assistant_speak", "text": "Hello! How can I help you?", "ssml": "<speak>Hello!</speak>", "duration_ms": 2000, "speech_started_at": 1234567890000, "session": { "id": "550e8400-e29b-41d4-a716-446655440000", "account_id": "account-123", "phone_number": "1234567890" } } ``` ## Fields | Field | Type | Required | Description | |-------|------|----------|-------------| | `type` | string | Yes | Always `"assistant_speak"` | | `text` | string | No | Text that was spoken | | `ssml` | string | No | SSML that was used (if applicable) | | `duration_ms` | number | Yes | Duration of speech in milliseconds | | `speech_started_at` | number | Yes | Unix timestamp (ms) when speech started | | `session.id` | string (UUID) | Yes | Session identifier | | `session.account_id` | string | Yes | Account identifier | | `session.phone_number` | string | Yes | Phone number for this flow session | ## Response You can return a single action, an array of actions (executed in sequence), or `204 No Content`. Common uses: * **Track metrics** - Log conversation analytics * **Chain actions** - Trigger follow-up actions * **No response** - Just track the event ## Examples ### Track Metrics ```python @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'assistant_speak': # Track metrics track_metrics({ 'session_id': event['session']['id'], 'duration_ms': event['duration_ms'], 'text': event.get('text', '') }) return '', 204 ``` ### Chain Actions ```python # Store what to do next session_state = {} @app.route('/webhook', methods=['POST']) def webhook(): event = request.json session_id = event['session']['id'] if event['type'] == 'user_speak': # Set next action session_state[session_id] = 'play_audio' return jsonify({ 'type': 'speak', 'session_id': session_id, 'text': 'Please listen to this message.'
}) if event['type'] == 'assistant_speak': # Execute next action if session_state.get(session_id) == 'play_audio': del session_state[session_id] return jsonify({ 'type': 'audio', 'session_id': session_id, 'audio': 'base64-audio-data' }) return '', 204 ``` ## Use Cases * **Analytics** - Track conversation metrics * **Action chaining** - Trigger follow-up actions * **Logging** - Record what was said * **Timing** - Measure response times ## Best Practices 1. **Don't block** - Process quickly 2. **Track metrics** - Use for analytics 3. **Chain carefully** - Avoid infinite loops 4. **Log interactions** - For debugging ## Next Steps * **[User Speak Event](/api/events/user-speak)** - Handle user input * **[Action Types](/api/actions)** - All available actions * **[Event Flow](/api/event-flow)** - Understand the complete flow --- --- url: /sipgate-ai-flow-api/api/events/assistant-speech-ended.md --- # Assistant Speech Ended Event Triggered after the assistant finishes speaking. ## Event Structure ```json { "type": "assistant_speech_ended", "session": { "id": "550e8400-e29b-41d4-a716-446655440000", "account_id": "account-123", "phone_number": "1234567890" } } ``` ## Fields | Field | Type | Required | Description | |-------|------|----------|-------------| | `type` | string | Yes | Always `"assistant_speech_ended"` | | `session.id` | string (UUID) | Yes | Session identifier | | `session.account_id` | string | Yes | Account identifier | | `session.phone_number` | string | Yes | Phone number for this flow session | ## Response You can return a single action, an array of actions (executed in sequence), or `204 No Content`. Common uses: * **Trigger follow-up actions** - Continue the conversation flow * **Track completion** - Log that speech finished * **No response** - Just track the event ## Examples ### Trigger Follow-Up Action ```python @app.route('/webhook', methods=['POST']) def webhook(): event = request.json session_id = event['session']['id'] if event['type'] == 'assistant_speech_ended': # Trigger next action in conversation flow return jsonify({ 'type': 'speak', 'session_id': session_id, 'text': 'Is there anything else I can help you with?' }) return '', 204 ``` ### Track Completion ```python @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'assistant_speech_ended': # Log that speech completed log_speech_completed(event['session']['id']) return '', 204 ``` ### Node.js ```javascript app.post('/webhook', (req, res) => { const event = req.body; if (event.type === 'assistant_speech_ended') { // Trigger next action return res.json({ type: 'speak', session_id: event.session.id, text: 'Is there anything else I can help you with?' 
}); } res.status(204).send(); }); ``` ### Go ```go func webhook(w http.ResponseWriter, r *http.Request) { var event map[string]interface{} json.NewDecoder(r.Body).Decode(&event) if event["type"] == "assistant_speech_ended" { session := event["session"].(map[string]interface{}) action := map[string]interface{}{ "type": "speak", "session_id": session["id"], "text": "Is there anything else I can help you with?", } w.Header().Set("Content-Type", "application/json") json.NewEncoder(w).Encode(action) return } w.WriteHeader(http.StatusNoContent) } ``` ## Use Cases * **Continue conversation** - Trigger follow-up questions or actions * **Track completion** - Log that speech playback finished * **Chain actions** - Execute next step in conversation flow * **Analytics** - Track when assistant finishes speaking ## Difference from assistant\_speak * **assistant\_speak** - Triggered when assistant **starts** speaking (includes duration, text, etc.) * **assistant\_speech\_ended** - Triggered when assistant **finishes** speaking (simpler, just session info) ## Best Practices 1. **Use for follow-ups** - Great for continuing conversation flow 2. **Track timing** - Log when speech completes 3. **Chain actions** - Trigger next action in sequence 4. **Don't block** - Process quickly ## Next Steps * **[Assistant Speak Event](/api/events/assistant-speak)** - When assistant starts speaking * **[User Speak Event](/api/events/user-speak)** - Handle user input * **[Action Types](/api/actions)** - All available actions * **[Event Flow](/api/event-flow)** - Understand the complete flow --- --- url: /sipgate-ai-flow-api/api/events/dtmf-received.md --- # DTMF Received Event Triggered when the user presses a key on their phone keypad during a call. ## Event Structure ```json { "type": "dtmf_received", "digit": "1", "session": { "id": "550e8400-e29b-41d4-a716-446655440000", "account_id": "account-123", "phone_number": "1234567890", "direction": "inbound", "from_phone_number": "9876543210", "to_phone_number": "1234567890" } } ``` ## Fields | Field | Type | Description | |---------|--------|-----------------------------------------------------| | `type` | string | Always `"dtmf_received"` | | `digit` | string | The key pressed: `0`–`9`, `*`, or `#` | | `session` | object | Session information (see [Base Event Structure](/api/events)) | ## Example ### IVR Menu ```python @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'session_start': return jsonify({ 'type': 'speak', 'session_id': event['session']['id'], 'text': 'Press 1 for sales, press 2 for support.' }) if event['type'] == 'dtmf_received': digit = event['digit'] session_id = event['session']['id'] if digit == '1': return jsonify({ 'type': 'transfer', 'session_id': session_id, 'target_phone_number': '49211100200', 'caller_id_name': 'Sales', 'caller_id_number': '49211100200' }) elif digit == '2': return jsonify({ 'type': 'transfer', 'session_id': session_id, 'target_phone_number': '49211100201', 'caller_id_name': 'Support', 'caller_id_number': '49211100201' }) else: return jsonify({ 'type': 'speak', 'session_id': session_id, 'text': 'Invalid selection. Press 1 for sales, press 2 for support.'
}) return '', 204 ``` ### Node.js ```javascript app.post('/webhook', (req, res) => { const event = req.body; if (event.type === 'dtmf_received') { const { digit } = event; console.log(`User pressed: ${digit}`); if (digit === '#') { return res.json({ type: 'hangup', session_id: event.session.id }); } } res.status(204).send(); }); ``` ## TypeScript SDK ```typescript const assistant = AiFlowAssistant.create({ onDtmfReceived: async (event) => { console.log(`User pressed: ${event.digit}`); if (event.digit === '1') { return { type: 'transfer', session_id: event.session.id, target_phone_number: '49211100200', caller_id_name: 'Sales', caller_id_number: '49211100200' }; } return { type: 'speak', session_id: event.session.id, text: `You pressed ${event.digit}.` }; }, }); ``` ## Use Cases * **IVR menus** — route calls based on key presses * **PIN entry** — collect numeric input without speech recognition * **Confirmation flows** — press 1 to confirm, 2 to cancel * **Accessibility** — provide keypad alternatives to voice commands ## Notes * All standard DTMF tones are supported: `0`–`9`, `*`, `#` * Each key press triggers a separate `dtmf_received` event * DTMF events can occur at any point during the call, including while the assistant is speaking ## Next Steps * **[Action Types](/api/actions)** - How to respond to events * **[User Speak Event](/api/events/user-speak)** - Voice input alternative --- --- url: /sipgate-ai-flow-api/api/events/session-end.md --- # Session End Event Triggered when the call session ends. ## Event Structure ```json { "type": "session_end", "session": { "id": "550e8400-e29b-41d4-a716-446655440000", "account_id": "account-123", "phone_number": "1234567890" } } ``` ## Fields | Field | Type | Required | Description | |-------|------|----------|-------------| | `type` | string | Yes | Always `"session_end"` | | `session.id` | string (UUID) | Yes | Session identifier | | `session.account_id` | string | Yes | Account identifier | | `session.phone_number` | string | Yes | Phone number for this flow session | ## Response **No action is allowed** for `session_end` events. Always return `204 No Content`. ```http HTTP/1.1 204 No Content ``` ## Examples ### Python ```python @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'session_end': # Cleanup session state cleanup_session(event['session']['id']) return '', 204 ``` ### Node.js ```javascript app.post('/webhook', (req, res) => { const event = req.body; if (event.type === 'session_end') { // Cleanup session state cleanupSession(event.session.id); } res.status(204).send(); }); ``` ### Go ```go func webhook(w http.ResponseWriter, r *http.Request) { var event map[string]interface{} json.NewDecoder(r.Body).Decode(&event) if event["type"] == "session_end" { session := event["session"].(map[string]interface{}) cleanupSession(session["id"].(string)) } w.WriteHeader(http.StatusNoContent) } ``` ## Use Cases * **Cleanup state** - Remove session data * **Save logs** - Store conversation history * **Send analytics** - Track session metrics * **Close connections** - Clean up resources ## Best Practices 1. **Always cleanup** - Remove session state 2. **Log the session** - Save for analytics 3. **Don't return actions** - No actions are processed 4.
**Handle errors** - Don't fail silently ## Next Steps * **[Session Start Event](/api/events/session-start)** - When calls begin * **[Event Flow](/api/event-flow)** - Understand the complete flow * **[Action Types](/api/actions)** - Actions you can send --- --- url: /sipgate-ai-flow-api/api/actions.md --- # Action Types Complete reference for all actions you can send to the AI Flow service. ## Overview Actions are JSON objects you send back to the AI Flow service in response to events. All actions require a `session_id` and `type` field. ## Base Action Structure ```json { "session_id": "550e8400-e29b-41d4-a716-446655440000", "type": "speak" } ``` ## Action Summary | Action Type | Description | Primary Use Case | | -------------- | --------------------------- | --------------------------------------- | | `speak` | Speak text or SSML | Respond to user with synthesized speech | | `audio` | Play pre-recorded audio | Play hold music, pre-recorded messages | | `mix_audio` | Loop a background sound mixed into speech | Add ambient noise (café, office, train station) under the agent | | `hangup` | End the call | Terminate conversation | | `transfer` | Transfer to another number | Route to human agent or department | | `barge_in` | Manually interrupt playback | Stop current audio immediately | | `configure_transcription` | Change STT language(s) mid-call | Switch recognition language without hanging up | | `configure_voice_to_voice` | Switch the session into end-to-end voice-to-voice mode | Hand the conversation to a speech-to-speech model that owns audio I/O | | `send_sms` | Send an SMS from the account | Deliver confirmation codes, summaries, links | ## Quick Reference * **[Speak Action](/api/actions/speak)** - Text-to-speech * **[Audio Action](/api/actions/audio)** - Play audio file * **[Mix Audio Action](/api/actions/mix-audio)** - Loop a background sound mixed into outbound speech * **[Hangup Action](/api/actions/hangup)** - End call * **[Transfer Action](/api/actions/transfer)** - Transfer call * **[Barge-In Action](/api/actions/barge-in)** - Manually interrupt current playback * **[Configure Transcription Action](/api/actions/configure-transcription)** - Change STT language mid-call * **[Configure Voice-to-Voice Action](/api/actions/configure-voice-to-voice)** - End-to-end speech-to-speech mode (preview) * **[Send SMS Action](/api/actions/send-sms)** - Send an SMS from your account ## Response Format ### HTTP Webhook Return a single action or an array of actions as JSON with `200 OK`: ```http HTTP/1.1 200 OK Content-Type: application/json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "Hello!" } ``` To execute multiple actions in sequence, return an array: ```http HTTP/1.1 200 OK Content-Type: application/json [ { "type": "barge_in", "session_id": "550e8400-e29b-41d4-a716-446655440000" }, { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "Sorry, let me correct that." } ] ``` Or return `204 No Content` if no action is needed: ```http HTTP/1.1 204 No Content ``` ### WebSocket Send a single action or an array of actions as JSON strings: ```json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "Hello!" } ``` ```json [ { "type": "barge_in", "session_id": "..." }, { "type": "speak", "session_id": "...", "text": "Sorry, let me correct that." 
} ] ``` ## Action Flow ```mermaid graph TB A[Receive Event] --> B{Event Type} B -->|user_speak| C[Process Input] B -->|session_start| D[Initialize] C --> E{Decision} E -->|speak| F[Speak Action] E -->|transfer| G[Transfer Action] E -->|hangup| H[Hangup Action] D --> F F --> I[Service Executes] G --> I H --> I ``` ## Common Patterns ### Simple Response ```json { "type": "speak", "session_id": "session-123", "text": "Hello! How can I help you?" } ``` ### Conditional Response ```python if "goodbye" in event['text'].lower(): return { "type": "hangup", "session_id": event['session']['id'] } else: return { "type": "speak", "session_id": event['session']['id'], "text": "I understand." } ``` ### Multiple Actions You can return an array of actions to execute them in sequence: ```python if event['type'] == 'user_speak': return [ { "type": "barge_in", "session_id": event['session']['id'] }, { "type": "speak", "session_id": event['session']['id'], "text": "Sorry, let me correct that." } ] ``` Actions in the array are executed one after another in order. Alternatively, you can chain actions across events using the `assistant_speak` event: ```python # First response if event['type'] == 'user_speak': return { "type": "speak", "session_id": event['session']['id'], "text": "Please listen to this message." } # Follow-up after assistant speaks if event['type'] == 'assistant_speak': return { "type": "audio", "session_id": event['session']['id'], "audio": "base64-audio-data" } ``` ## Next Steps * **[Speak Action](/api/actions/speak)** - Detailed reference * **[Event Types](/api/events)** - What triggers actions * **[Event Flow](/api/event-flow)** - Understand the complete flow --- --- url: /sipgate-ai-flow-api/api/actions/speak.md --- # Speak Action Speak text or SSML to the user using text-to-speech. ## Action Structure ```json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "Hello! How can I help you?", "tts": { "provider": "azure", "language": "en-US", "voice": "en-US-JennyNeural" }, "barge_in": { "strategy": "minimum_characters", "minimum_characters": 3 } } ``` ## Fields | Field | Type | Required | Description | |------------------------------|---------------|----------|----------------------------------------------------------------------------------------------------------------------------------------------| | `type` | string | Yes | Always `"speak"` | | `session_id` | string (UUID) | Yes | Session identifier from event | | `text` | string | No\* | Plain text to speak | | `ssml` | string | No\* | SSML markup for advanced control | | `tts` | object | No | TTS provider configuration | | `barge_in` | object | No | Barge-in behavior configuration | | `user_input_timeout_seconds` | number | No | Timeout in seconds to wait for user input after speech ends. If no speech is detected within this time, a `user_input_timeout` event is sent | | `vad` | object | No | Voice-activity detection tuning for the caller's reply. See [VAD Configuration](/api/vad) | \* Either `text` OR `ssml` is required (not both) ## Simple Text ```json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "Hello! How can I help you?" 
} ## SSML (Advanced) ```json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "ssml": "<speak>Please listen carefully. <break time='500ms'/> Your account balance is $42.50</speak>" } ``` ## TTS Provider Configuration ### Azure ```json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "Hello in a different voice", "tts": { "provider": "azure", "language": "en-US", "voice": "en-US-JennyNeural" } } ``` ### ElevenLabs ```json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "Hello from ElevenLabs", "tts": { "provider": "eleven_labs", "voice": "zrHiDhphv9ZnVXBqCLjz" } } ``` ::: tip Voice IDs The `voice` field accepts the ElevenLabs voice ID (e.g., `"zrHiDhphv9ZnVXBqCLjz"` for "Mimi"). If omitted, the first available voice will be used. See the [TTS Providers](/api/tts-providers) documentation for a list of available voices. ::: **Minimal Configuration (uses default voice):** ```json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "Hello from ElevenLabs", "tts": { "provider": "eleven_labs" } } ``` ## Barge-In Configuration Control how users can interrupt: ### Immediate Response (Most Responsive) ⚡ ```json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "I can help you with billing, support, or sales. What would you like?", "barge_in": { "strategy": "immediate", "allow_after_ms": 500 } } ``` **Result:** Assistant stops instantly when user starts speaking (20-100ms latency). ### Character-Based Interruption ```json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "Your account number is 1234567890. Please write this down.", "barge_in": { "strategy": "minimum_characters", "minimum_characters": 10, "allow_after_ms": 2000 } } ``` **Result:** Assistant stops after user speaks 10+ characters. See [Barge-In Configuration](/api/barge-in) for all strategies and details. ## VAD (Voice Activity Detection) Tuning Optional advanced setting that lets the caller pause longer (or shorter) before their turn is considered finished. When omitted, the system default applies. ```json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "Please tell me your address.", "vad": { "end_of_turn_silence_ms": 1500 } } ``` | Field | Type | Description | |--------------------------|--------|----------------------------------------------------------------------------------------------------------| | `end_of_turn_silence_ms` | number | Milliseconds of silence after the caller stops speaking before their turn ends. Recommended range 150–2000. | Out-of-range or invalid values are silently ignored — the speak action still runs as if `vad` were not set. See [VAD Configuration](/api/vad) for details. ## User Input Timeout Set a timeout to wait for user input after the assistant finishes speaking.
If the user doesn't speak within the specified time, a `user_input_timeout` event is sent to your application: ```json { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "What is your account number?", "user_input_timeout_seconds": 5 } ``` **Behavior:** * Timer starts when the assistant finishes speaking (`assistant_speech_ended` event) * Timer is cleared when the user starts speaking (any STT event) * If timeout is reached, a `user_input_timeout` event is sent * Your application can respond with any action (e.g., repeat question, hangup) **Example with timeout handling:** ```javascript app.post('/webhook', (req, res) => { const event = req.body; if (event.type === 'session_start') { return res.json({ type: 'speak', session_id: event.session.id, text: 'What is your account number?', user_input_timeout_seconds: 5 }); } if (event.type === 'user_input_timeout') { return res.json({ type: 'speak', session_id: event.session.id, text: 'I didn\'t hear anything. Let me try again. What is your account number?', user_input_timeout_seconds: 5 }); } if (event.type === 'user_speak') { return res.json({ type: 'speak', session_id: event.session.id, text: `Your account number is ${event.text}` }); } }); ``` ## Examples ### Python ```python @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'user_speak': return jsonify({ 'type': 'speak', 'session_id': event['session']['id'], 'text': f"You said: {event['text']}" }) ``` ### Node.js ```javascript app.post('/webhook', (req, res) => { const event = req.body; if (event.type === 'user_speak') { return res.json({ type: 'speak', session_id: event.session.id, text: `You said: ${event.text}` }); } }); ``` ### Go ```go action := map[string]interface{}{ "type": "speak", "session_id": session["id"], "text": "Hello! How can I help you?", } json.NewEncoder(w).Encode(action) ``` ## Use Cases * **Respond to user** - Answer questions * **Provide information** - Share details * **Guide conversation** - Direct the flow * **Confirm actions** - Acknowledge user input ## Best Practices 1. **Keep it concise** - Short responses work better 2. **Use SSML sparingly** - Only when needed for emphasis 3. **Configure barge-in** - Allow natural interruptions 4. **Choose appropriate voice** - Match language and tone ## Next Steps * **[TTS Providers](/api/tts-providers)** - Configure voices * **[Barge-In Configuration](/api/barge-in)** - Control interruptions * **[Other Actions](/api/actions)** - Complete action reference --- --- url: /sipgate-ai-flow-api/api/actions/audio.md --- # Audio Action Play pre-recorded audio to the user. 
## Action Structure ```json { "type": "audio", "session_id": "550e8400-e29b-41d4-a716-446655440000", "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=", "barge_in": { "strategy": "minimum_characters", "minimum_characters": 3 } } ``` ## Fields | Field | Type | Required | Description | |-------|------|----------|-------------| | `type` | string | Yes | Always `"audio"` | | `session_id` | string (UUID) | Yes | Session identifier from event | | `audio` | string | Yes | Base64 encoded WAV audio data | | `barge_in` | object | No | Barge-in behavior configuration | ## Audio Format Requirements The audio must be in the following format: * **Format**: WAV * **Sample Rate**: 16kHz * **Channels**: Mono (single channel) * **Bit Depth**: 16-bit PCM * **Encoding**: Base64 ## Simple Example ```json { "type": "audio", "session_id": "550e8400-e29b-41d4-a716-446655440000", "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=" } ``` ## With Barge-In Configuration ```json { "type": "audio", "session_id": "550e8400-e29b-41d4-a716-446655440000", "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=", "barge_in": { "strategy": "minimum_characters", "minimum_characters": 3, "allow_after_ms": 1000 } } ``` ## Examples ### Python ```python import base64 @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'user_speak': # Read audio file and encode to base64 with open('hold-music.wav', 'rb') as audio_file: audio_data = audio_file.read() base64_audio = base64.b64encode(audio_data).decode('utf-8') return jsonify({ 'type': 'audio', 'session_id': event['session']['id'], 'audio': base64_audio, 'barge_in': { 'strategy': 'minimum_characters', 'minimum_characters': 3 } }) ``` ### Node.js ```javascript const fs = require('fs'); app.post('/webhook', (req, res) => { const event = req.body; if (event.type === 'user_speak') { // Read audio file and encode to base64 const audioData = fs.readFileSync('hold-music.wav'); const base64Audio = audioData.toString('base64'); return res.json({ type: 'audio', session_id: event.session.id, audio: base64Audio, barge_in: { strategy: 'minimum_characters', minimum_characters: 3 } }); } }); ``` ### Go ```go import ( "encoding/base64" "io/ioutil" ) func webhook(w http.ResponseWriter, r *http.Request) { var event map[string]interface{} json.NewDecoder(r.Body).Decode(&event) if event["type"] == "user_speak" { // Read audio file and encode to base64 audioData, _ := ioutil.ReadFile("hold-music.wav") base64Audio := base64.StdEncoding.EncodeToString(audioData) session := event["session"].(map[string]interface{}) action := map[string]interface{}{ "type": "audio", "session_id": session["id"], "audio": base64Audio, "barge_in": map[string]interface{}{ "strategy": "minimum_characters", "minimum_characters": 3, }, } w.Header().Set("Content-Type", "application/json") json.NewEncoder(w).Encode(action) return } } ``` ## Converting Audio Files ### Using FFmpeg Convert any audio file to the required format: ```bash ffmpeg -i input.mp3 -ar 16000 -ac 1 -sample_fmt s16 -f wav output.wav ``` **Parameters:** * `-ar 16000` - Set sample rate to 16kHz * `-ac 1` - Set to mono (1 channel) * `-sample_fmt s16` - Set to 16-bit PCM * `-f wav` - Output WAV format ### Python Script ```python import base64 def convert_audio_to_base64(audio_file_path): with open(audio_file_path, 'rb') as f: audio_data = f.read() return base64.b64encode(audio_data).decode('utf-8') # Usage base64_audio = convert_audio_to_base64('hold-music.wav') ``` ## 
Barge-In Configuration Control how users can interrupt audio playback: ```json { "barge_in": { "strategy": "none" } } ``` See [Barge-In Configuration](/api/barge-in) for details. ## Use Cases * **Hold music** - Play music while user waits * **Pre-recorded messages** - Play announcements or greetings * **Sound effects** - Play notification sounds * **Background audio** - Ambient sounds during conversation ## Best Practices 1. **Keep files small** - Large audio files increase latency 2. **Use appropriate format** - Ensure WAV, 16kHz, mono, 16-bit 3. **Test playback** - Verify audio quality before production 4. **Configure barge-in** - Allow natural interruptions when appropriate 5. **Cache base64** - Encode once, reuse the base64 string ## Troubleshooting ### Audio Not Playing * Verify audio format matches requirements exactly * Check base64 encoding is correct * Ensure audio file is not corrupted * Test with a known-good audio file ### Audio Quality Issues * Ensure sample rate is exactly 16kHz * Verify mono channel (not stereo) * Check bit depth is 16-bit PCM * Re-encode source audio if needed ## Next Steps * **[Barge-In Configuration](/api/barge-in)** - Control interruption behavior * **[Speak Action](/api/actions/speak)** - Text-to-speech alternative * **[Action Types](/api/actions)** - Complete action reference --- --- url: /sipgate-ai-flow-api/api/actions/mix-audio.md --- # Mix Audio Action Play a looping background sound (e.g. train station, café, office ambience) under the call. The loop plays continuously for the lifetime of the session — also during the assistant's TTS turns and during silences between turns. Sending `mix_audio` again replaces the active loop. Sending it with `stop: true` removes the loop. The active loop is dropped automatically when the session ends. ## Action Structure ### Start a background loop ```json { "type": "mix_audio", "session_id": "550e8400-e29b-41d4-a716-446655440000", "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=", "volume": 0.3 } ``` ### Stop an active background loop ```json { "type": "mix_audio", "session_id": "550e8400-e29b-41d4-a716-446655440000", "stop": true } ``` ## Fields | Field | Type | Required | Description | |-------|------|----------|-------------| | `type` | string | Yes | Always `"mix_audio"` | | `session_id` | string (UUID) | Yes | Session identifier from event | | `audio` | string | Conditional | Base64-encoded WAV (16 kHz, 16-bit, mono PCM). **Required when `stop` is not `true`.** | | `volume` | number | No | Background loop volume, `0.0`–`1.0`. Defaults to `0.5`. | | `stop` | boolean | No | When `true`, removes the active loop. | ## Audio Format Requirements Identical to the [`audio` action](/api/actions/audio): * **Format**: WAV * **Sample Rate**: 16 kHz * **Channels**: Mono (single channel) * **Bit Depth**: 16-bit PCM * **Encoding**: Base64 A 30-second loop at this format is approximately 940 KB raw and ~1.25 MB as a base64 string in the JSON action payload. ## Behavior Notes * **Continuous playback.** Once started, ambient plays for the rest of the call — under the assistant's TTS during turns and on its own during silences. * **Replace semantics.** A second `mix_audio` (without `stop`) replaces the buffer and volume of the running loop. * **Restart-safe.** If the service restarts during an active call, the loop continues automatically. * **Auto-cleanup.** The loop is dropped when the session ends. 
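Because a second `mix_audio` replaces the running loop in place, switching ambience mid-call only takes sending the action again; no explicit stop is needed in between. Below is a minimal sketch of that replace pattern in Python (Flask). The preset files and the "office" trigger phrase are illustrative assumptions, not part of the API:

```python
import base64

from flask import Flask, request, jsonify

app = Flask(__name__)

def load_b64(path):
    # Encode each ambient preset once at startup and reuse the string
    with open(path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

PRESETS = {
    'cafe': load_b64('cafe.wav'),      # assumed 16 kHz / 16-bit / mono WAV files
    'office': load_b64('office.wav'),
}

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json
    if event['type'] == 'user_speak' and 'office' in event['text'].lower():
        # Re-sending mix_audio swaps the buffer and volume of the active loop
        return jsonify({
            'type': 'mix_audio',
            'session_id': event['session']['id'],
            'audio': PRESETS['office'],
            'volume': 0.25,
        })
    return '', 204
```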
## Use Cases * **Setting the scene.** Add café or train-station ambience to make a virtual receptionist feel located somewhere specific. * **Wait-state cues.** Light office hum during long lookups so the line doesn't feel dead. * **Accessibility / signaling.** Subtle sounds that indicate the agent is "in" a particular context. ## Examples ### Python (Flask) ```python import base64 # Load and base64-encode the loop once at startup with open('cafe.wav', 'rb') as f: AMBIENT_AUDIO = base64.b64encode(f.read()).decode('utf-8') @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'session_start': # Start the ambient loop AND speak the greeting in one response return jsonify([ { 'type': 'mix_audio', 'session_id': event['session']['id'], 'audio': AMBIENT_AUDIO, 'volume': 0.3, }, { 'type': 'speak', 'session_id': event['session']['id'], 'text': 'Welcome, how can I help you?', }, ]) if event['type'] == 'user_speak' and 'goodbye' in event['text'].lower(): # Stop the ambient before saying goodbye, then hang up return jsonify([ { 'type': 'mix_audio', 'session_id': event['session']['id'], 'stop': True, }, { 'type': 'speak', 'session_id': event['session']['id'], 'text': 'Goodbye!', }, { 'type': 'hangup', 'session_id': event['session']['id'] }, ]) ``` ### Node.js ```javascript import { readFileSync } from "node:fs"; // Load and base64-encode the loop once at startup const AMBIENT_AUDIO = readFileSync("./cafe.wav").toString("base64"); app.post('/webhook', (req, res) => { const event = req.body; if (event.type === 'session_start') { return res.json([ { type: 'mix_audio', session_id: event.session.id, audio: AMBIENT_AUDIO, volume: 0.3, }, { type: 'speak', session_id: event.session.id, text: 'Welcome, how can I help you?', }, ]); } }); ``` ### Go ```go import ( "encoding/base64" "io/ioutil" ) func main() { // Load and base64-encode the loop once at startup audioBytes, _ := ioutil.ReadFile("cafe.wav") ambientAudio := base64.StdEncoding.EncodeToString(audioBytes) http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) { var event map[string]interface{} json.NewDecoder(r.Body).Decode(&event) if event["type"] == "session_start" { session := event["session"].(map[string]interface{}) actions := []map[string]interface{}{ { "type": "mix_audio", "session_id": session["id"], "audio": ambientAudio, "volume": 0.3, }, { "type": "speak", "session_id": session["id"], "text": "Welcome, how can I help you?", }, } w.Header().Set("Content-Type", "application/json") json.NewEncoder(w).Encode(actions) } }) } ``` ## Converting Audio Files Convert any audio file to the required format with FFmpeg: ```bash ffmpeg -i input.mp3 -ar 16000 -ac 1 -sample_fmt s16 -f wav output.wav ``` For ambient sound, normalizing loudness across presets keeps the relative volume consistent at a given `volume` value. A target of `-30 LUFS` sits well below typical TTS speech (`~-16 LUFS`), so the slider stays useful around `0.2`–`0.5`: ```bash ffmpeg -i input.mp3 -t 30 -af "loudnorm=I=-30:LRA=11:TP=-2" \ -ar 16000 -ac 1 -sample_fmt s16 -f wav output.wav ``` ## Best Practices 1. **Load once, encode once.** Encode each ambient WAV to base64 at startup and reuse the string — don't read+encode per call. 2. **Start the loop with the greeting.** Return `[mix_audio, speak]` together on `session_start` so the ambient is in place from the first word. 3. **Keep the volume low.** Ambient sound should sit *under* the agent. Start around `0.3` and lower from there. 4. 
**Trim long files.** A 30-second loop is plenty for ambience; longer files just mean larger one-time payloads at session start. 5. **Stop explicitly when ending the call.** Sending `mix_audio { stop: true }` before a farewell is optional (the loop is dropped at `session_end` anyway), but it makes the goodbye land cleanly without ambient bleed. ## Mix Audio vs. Audio Action | Aspect | `audio` | `mix_audio` | |---|---|---| | Plays | Once, then stops | Loops continuously for the rest of the call | | Audible during silence | No | Yes | | Plays under TTS | No | Yes | | Use case | Hold music, announcements, sound effects | Scene/atmosphere under the agent | | Restart-safe | No (one-shot) | Yes (loop continues automatically) | ## Troubleshooting ### Ambient is too loud / drowns out speech * Lower the `volume` (try `0.2`). * Re-normalize the source file to a quieter target LUFS (e.g. `-30 LUFS` instead of `-23`). ### Loop pops at the boundary For material with strong transients, fade the source file in/out by 50 ms in your editor before encoding so the loop point is silent. ## Next Steps * **[Audio Action](/api/actions/audio)** - Play a single pre-recorded clip * **[Speak Action](/api/actions/speak)** - Text-to-speech under the loop * **[Action Types](/api/actions)** - Complete action reference --- --- url: /sipgate-ai-flow-api/api/actions/hangup.md --- # Hangup Action End the call. ## Action Structure ```json { "type": "hangup", "session_id": "550e8400-e29b-41d4-a716-446655440000" } ``` ## Fields | Field | Type | Required | Description | |-------|------|----------|-------------| | `type` | string | Yes | Always `"hangup"` | | `session_id` | string (UUID) | Yes | Session identifier from event | ## Examples ### Python ```python @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'user_speak': user_text = event['text'].lower() if 'goodbye' in user_text or 'bye' in user_text: return jsonify({ 'type': 'hangup', 'session_id': event['session']['id'] }) ``` ### Node.js ```javascript app.post('/webhook', (req, res) => { const event = req.body; if (event.type === 'user_speak') { const userText = event.text.toLowerCase(); if (userText.includes('goodbye') || userText.includes('bye')) { return res.json({ type: 'hangup', session_id: event.session.id }); } } }); ``` ### Go ```go if strings.Contains(text, "goodbye") || strings.Contains(text, "bye") { action := map[string]interface{}{ "type": "hangup", "session_id": session["id"], } json.NewEncoder(w).Encode(action) } ``` ## Use Cases * **User says goodbye** - End call politely * **Task complete** - After completing a task * **Error handling** - When something goes wrong * **Timeout** - After inactivity ## Best Practices 1. **Say goodbye first** - Optionally speak before hanging up 2. **Clean up state** - Session will end, but cleanup in `session_end` 3. **Log the reason** - Track why calls ended 4. **Handle gracefully** - Don't hang up abruptly ## Next Steps * **[Transfer Action](/api/actions/transfer)** - Transfer to another number * **[Event Types](/api/events)** - What triggers actions * **[Event Flow](/api/event-flow)** - Understand the complete flow --- --- url: /sipgate-ai-flow-api/api/actions/transfer.md --- # Transfer Action Transfer the call to another phone number. 
## Action Structure ```json { "type": "transfer", "session_id": "550e8400-e29b-41d4-a716-446655440000", "target_phone_number": "1234567890", "caller_id_name": "Support Department", "caller_id_number": "1234567890", "timeout": 30 } ``` ## Fields | Field | Type | Required | Description | |-------|------|----------|-------------| | `type` | string | Yes | Always `"transfer"` | | `session_id` | string (UUID) | Yes | Session identifier from event | | `target_phone_number` | string | Yes | Phone number to transfer to (E.164 format without leading + recommended) | | `caller_id_name` | string | Yes | Caller ID name to display | | `caller_id_number` | string | Yes | Caller ID number to display | | `timeout` | integer (5–120) | No | Seconds to wait for the transfer target to answer. When set, enables **transfer fallback** (see below). When omitted, transfer failures end the call. | ## Transfer Fallback When `timeout` is provided, the call is returned to the agent if the transfer fails: * Target does not answer within `timeout` seconds * Target rejects the call (busy, unavailable) * Target hangs up without answering On a failed transfer, the service re-emits a [`session_start`](/api/events/session-start) event **with the same `session.id`** and the agent can either continue the conversation with the original caller or attempt another transfer. On a successful transfer, no further events are sent — the call ends normally once the transferred parties hang up. ```json { "type": "transfer", "session_id": "550e8400-e29b-41d4-a716-446655440000", "target_phone_number": "1234567890", "caller_id_name": "Support Department", "caller_id_number": "1234567890", "timeout": 30 } ``` Your webhook should treat a repeated `session_start` for a known session id as "the call came back" and respond with a recovery prompt (for example: *"Sorry, no one picked up. Would you like to try something else?"*). ## Examples ### Python ```python @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'user_speak': user_text = event['text'].lower() if 'sales' in user_text: return jsonify({ 'type': 'transfer', 'session_id': event['session']['id'], 'target_phone_number': '1234567890', 'caller_id_name': 'Sales Department', 'caller_id_number': '1234567890' }) ``` ### Node.js ```javascript app.post('/webhook', (req, res) => { const event = req.body; if (event.type === 'user_speak') { const userText = event.text.toLowerCase(); if (userText.includes('sales')) { return res.json({ type: 'transfer', session_id: event.session.id, target_phone_number: '1234567890', caller_id_name: 'Sales Department', caller_id_number: '1234567890' }); } } }); ``` ### Go ```go if strings.Contains(text, "sales") { action := map[string]interface{}{ "type": "transfer", "session_id": session["id"], "target_phone_number": "1234567890", "caller_id_name": "Sales Department", "caller_id_number": "1234567890", } json.NewEncoder(w).Encode(action) } ``` ## Phone Number Format Use E.164 format without leading + (recommended): * ✅ `1234567890` * ✅ `491234567890` * ❌ `123-456-7890` (not recommended) ## Use Cases * **Route to departments** - Sales, support, billing * **Escalate to human** - When AI can't help * **Specialized services** - Connect to experts * **Emergency routing** - Urgent situations ## Best Practices 1. **Announce transfer** - Tell user before transferring 2. **Use E.164 format** - International phone numbers 3. **Set caller ID** - Identify the source 4. 
**Log transfers** - Track routing decisions ## Next Steps * **[Hangup Action](/api/actions/hangup)** - End the call * **[Event Types](/api/events)** - What triggers actions * **[Event Flow](/api/event-flow)** - Understand the complete flow --- --- url: /sipgate-ai-flow-api/api/actions/barge-in.md --- # Barge-In Action Immediately stop whatever audio the service is currently playing to the caller (synthesized speech from a `speak` action or pre-recorded audio from an `audio` action). This is the manual, application-triggered counterpart to the automatic user-driven interruption. ::: warning Action vs. configuration — don't confuse these Two things are called "barge-in" and they do different things: * **`barge_in` action** (this page): a top-level action you send, `{ "type": "barge_in", "session_id": "..." }`. **You** interrupt the playback — right now — from your application. * **`barge_in` config** on `speak` / `audio` actions: an optional object describing **how and when the caller** is allowed to interrupt. See [Barge-In Configuration](/api/barge-in). The action stops current playback. The configuration controls whether the caller is allowed to do the same thing by speaking. ::: ## Action Structure ```json { "type": "barge_in", "session_id": "550e8400-e29b-41d4-a716-446655440000" } ``` ## Fields | Field | Type | Required | Description | |--------------|---------------|----------|----------------------------------| | `type` | string | Yes | Always `"barge_in"` | | `session_id` | string (UUID) | Yes | Session identifier from an event | The action has no other fields. It always targets whatever is currently being played on this session. ## Typical Pattern — Interrupt Then Speak The most useful form is an **array of actions**: first `barge_in` to cut off the current playback, then `speak` (or `audio`) with the new content. The service executes array entries in order, so the caller hears the playback stop and the new message begin without any manual coordination on your side. ```json [ { "type": "barge_in", "session_id": "550e8400-e29b-41d4-a716-446655440000" }, { "type": "speak", "session_id": "550e8400-e29b-41d4-a716-446655440000", "text": "Sorry, let me correct that — your order ships tomorrow, not today." } ] ``` This works anywhere an action response is accepted: HTTP webhook response body, WebSocket message, or external API POST. ### Replace in-progress audio with new audio ```json [ { "type": "barge_in", "session_id": "550e8400-e29b-41d4-a716-446655440000" }, { "type": "audio", "session_id": "550e8400-e29b-41d4-a716-446655440000", "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEA..." } ] ``` ### Stop playback without saying anything Send `barge_in` on its own if you only want silence (for example, to cut off a long response because an external system just produced a final answer you're about to deliver separately): ```json { "type": "barge_in", "session_id": "550e8400-e29b-41d4-a716-446655440000" } ``` ## When to Use It * **Agent self-correction.** Your LLM streamed a tentative answer via `speak`, then a tool call returned a better one. Send `[barge_in, speak]` to replace the in-flight utterance. * **External event trumps current playback.** A human operator joins, a priority notification arrives, or a fresh webhook result invalidates what's being said right now. * **Cutting off a long pre-recorded `audio` clip.** The caller gave new intent mid-playback and you've decided to stop the clip early, regardless of their `barge_in` configuration. 
If all you want is for the caller to be able to interrupt by speaking, you don't need this action — use the `barge_in` **configuration** on the `speak` or `audio` action instead. See [Barge-In Configuration](/api/barge-in) for the available strategies. ## Examples ### Node.js ```javascript app.post('/webhook', (req, res) => { const event = req.body; if (event.type === 'user_speak' && correctionNeeded(event.text)) { return res.json([ { type: 'barge_in', session_id: event.session.id }, { type: 'speak', session_id: event.session.id, text: 'Sorry, let me correct that.', }, ]); } }); ``` ### Python ```python @app.route('/webhook', methods=['POST']) def webhook(): event = request.json if event['type'] == 'user_speak' and correction_needed(event['text']): return jsonify([ { 'type': 'barge_in', 'session_id': event['session']['id'] }, { 'type': 'speak', 'session_id': event['session']['id'], 'text': 'Sorry, let me correct that.', }, ]) ``` ### Go ```go actions := []map[string]interface{}{ {"type": "barge_in", "session_id": sessionID}, {"type": "speak", "session_id": sessionID, "text": "Sorry, let me correct that."}, } json.NewEncoder(w).Encode(actions) ``` ## Behavior Notes * `barge_in` is a no-op if nothing is currently being played. It does not produce an error. * The service emits an `assistant_speech_ended` event for the interrupted `speak`/`audio`, followed by the events for the next action in the array. * Array entries are processed strictly in order. Putting `barge_in` after a `speak` in the same array does not "cancel" that speak before it starts — the speak is dispatched first, then `barge_in` stops it mid-playback. ## Next Steps * **[Barge-In Configuration](/api/barge-in)** — Let the caller interrupt by speaking (strategies, timing) * **[Speak Action](/api/actions/speak)** — Synthesize and play text * **[Audio Action](/api/actions/audio)** — Play pre-recorded audio * **[Action Types](/api/actions)** — Complete action reference --- --- url: /sipgate-ai-flow-api/api/actions/configure-transcription.md --- # Configure Transcription Action Change the STT (Speech-to-Text) provider and/or recognition language(s) during an active call session without hanging up. ## Action Structure ```json { "type": "configure_transcription", "session_id": "550e8400-e29b-41d4-a716-446655440000", "provider": "DEEPGRAM", "languages": ["en-US"] } ``` ## Fields | Field | Type | Required | Default | Description | |--------------|---------------|----------|------------------|------------------------------------------------------------------------------------------------------| | `type` | string | Yes | — | Always `"configure_transcription"` | | `session_id` | string (UUID) | Yes | — | Session identifier from event | | `provider` | string | No | Current provider | STT provider to switch to. Valid values: `"AZURE"`, `"DEEPGRAM"`, `"ELEVEN_LABS"`. Omitting keeps the current provider. | | `languages` | string\[] | No | Provider default | BCP-47 language codes (1–4 entries). Fully replaces the current config. Omitting resets to provider default (auto-detection). | | `custom_vocabulary` | string\[] | No | — | Words or phrases to boost STT recognition accuracy. Max 100 entries, max 200 characters per entry. Fully replaces the current session-level vocabulary. Merged with client-level vocabulary configured during onboarding. Supported by Azure, Deepgram, and ElevenLabs. | | `vad` | object | No | Current setting | Voice-activity detection tuning, applied for the rest of the session. See [VAD Configuration](/api/vad). 
| At least one of `provider`, `languages`, `custom_vocabulary`, or `vad` should be provided; sending none of them is a no-op. ### Configuring VAD Session-Wide Use this action to set or change VAD parameters for the entire remaining session (equivalent to setting `vad` on every subsequent `speak`). ```json { "type": "configure_transcription", "session_id": "550e8400-e29b-41d4-a716-446655440000", "vad": { "end_of_turn_silence_ms": 1200 } } ``` Out-of-range or invalid values are silently ignored. ## Behavioral Details ### Full Replace Semantics Both `provider` and `languages` use **full replace** semantics — they never merge with existing settings. | `provider` field | `languages` field | Result | |-----------------|-------------------|-------------------------------------------------------------| | Provided | Provided | Switches to new provider with specified languages | | Provided | Omitted | Switches to new provider; languages reset to `[]` (default) | | Omitted | Provided | Keeps current provider; languages fully replaced | | Omitted | Omitted | No-op (transcription unchanged) | ### Custom Vocabulary Pass a `custom_vocabulary` array to boost recognition of domain-specific terms, product names, proper nouns, or technical terms your callers are likely to use. * Entries are matched case-insensitively during deduplication and merged with client-level vocabulary. * Multi-word phrases (e.g. `"SIP-Trunk"`) are supported by all providers. * If omitted, the current session vocabulary is kept unchanged. * Max 100 entries; max 200 characters per entry. **Supported providers:** Azure, Deepgram, ElevenLabs ### Brief Audio Gap During Restart Any change — language or provider — requires the transcription engine to restart. Audio received during the restart is dropped and will not appear in any `user_speak` event. | Change type | Typical gap | |---------------------|----------------| | Language change only | ~100–500 ms | | Provider switch | ~200–800 ms | Design your call flow to trigger changes at natural pause points (e.g., after the assistant finishes speaking) to minimize the impact of the gap. ### Barge-In Latency After Provider Switch Each provider has different Voice Activity Detection (VAD) characteristics. Switching providers may change barge-in latency for the `immediate` strategy: | Provider | Approximate barge-in latency | |----------|------------------------------| | Azure | ~20–80 ms | | Deepgram | ~20–100 ms | | ElevenLabs | ~30–120 ms | ### Compatible Channels The `configure_transcription` action is accepted on all three delivery channels: * HTTP webhook response * Client-transport WebSocket * External API POST ### Multi-Language Support per Provider Not all providers support simultaneous multi-language detection. When more than one language code is supplied, providers that only accept a single language will silently use the **first entry** and ignore the rest. | Provider value | Multi-language support | Notes | |-----------------|------------------------|-------| | `"AZURE"` | ✅ Up to 4 languages | All entries used for Language Identification (LID) | | `"DEEPGRAM"` | ✅ Multilingual | Auto-detects across the supplied languages; supply none for full auto-detect | | `"ELEVEN_LABS"` | ❌ Single language only | Only the first entry is used; rest are ignored | **Recommendation:** When targeting ElevenLabs, supply exactly one language code. Deepgram and Azure both accept multiple codes; supplying none lets the provider auto-detect. 
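To keep this restart gap out of the middle of a turn, one option is to defer the switch until the next natural pause. Below is a minimal sketch in Python (Flask) that postpones the reconfiguration to the next `assistant_speech_ended` event; the `pending_switch` store and the trigger phrase are illustrative assumptions, not part of the API:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# session_id -> languages to apply at the next pause (illustrative in-memory store)
pending_switch = {}

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json
    session_id = event['session']['id']

    if event['type'] == 'user_speak' and 'deutsch' in event['text'].lower():
        # Don't switch mid-turn: remember the request and confirm first
        pending_switch[session_id] = ['de-DE']
        return jsonify({
            'type': 'speak',
            'session_id': session_id,
            'text': 'Alles klar, wir wechseln zu Deutsch.',
        })

    if event['type'] == 'assistant_speech_ended' and session_id in pending_switch:
        # Natural pause point: the brief restart gap falls into silence
        return jsonify({
            'type': 'configure_transcription',
            'session_id': session_id,
            'languages': pending_switch.pop(session_id),
        })

    return '', 204
```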
## Examples ### Change Language Only (Keep Current Provider) Switch an active session to German transcription: ```json { "type": "configure_transcription", "session_id": "550e8400-e29b-41d4-a716-446655440000", "languages": ["de-DE"] } ``` ### Switch Provider Only (Languages Reset to Default) Switch from Azure to Deepgram; languages reset to auto-detection: ```json { "type": "configure_transcription", "session_id": "550e8400-e29b-41d4-a716-446655440000", "provider": "DEEPGRAM" } ``` ### Switch Provider and Language Simultaneously Switch to ElevenLabs and set English as the recognition language: ```json { "type": "configure_transcription", "session_id": "550e8400-e29b-41d4-a716-446655440000", "provider": "ELEVEN_LABS", "languages": ["en-US"] } ``` ### Use Multiple Languages Simultaneously Enable multi-language detection for German and English: ```json { "type": "configure_transcription", "session_id": "550e8400-e29b-41d4-a716-446655440000", "languages": ["de-DE", "en-US"] } ``` Up to 4 language codes may be provided in a single request. ### Reset to Provider Default Re-send the provider and omit `languages` to restore automatic language detection (per the full-replace semantics above, omitting both fields would be a no-op): ```json { "type": "configure_transcription", "session_id": "550e8400-e29b-41d4-a716-446655440000", "provider": "DEEPGRAM" } ``` ### Boost Recognition with Custom Vocabulary Improve accuracy for product names and technical terms: ```json { "type": "configure_transcription", "session_id": "550e8400-e29b-41d4-a716-446655440000", "custom_vocabulary": ["sipgate", "VoIP", "ISDN", "Portsplitter"] } ``` ### Switching Language Based on User Input A common pattern: detect the caller's preferred language from their first utterance, then reconfigure transcription mid-call. ```javascript app.post('/webhook', (req, res) => { const event = req.body; if (event.type === 'session_start') { // Start with multi-language detection return res.json({ type: 'speak', session_id: event.session.id, text: 'Hello! Guten Tag! Please speak in your preferred language.', }); } if (event.type === 'user_speak') { const detectedLanguage = event.language; // BCP-47 code from STT if (detectedLanguage && detectedLanguage.startsWith('de')) { // Caller is speaking German — lock transcription to German only return res.json({ type: 'configure_transcription', session_id: event.session.id, languages: ['de-DE'], }); } return res.json({ type: 'speak', session_id: event.session.id, text: `You said: ${event.text}`, }); } }); ``` ### Provider Fallback Pattern Switch to a backup provider if the primary fails or for specific call scenarios: ```javascript // Switch to Deepgram for better handling of a specific language/accent return res.json({ type: 'configure_transcription', session_id: event.session.id, provider: 'DEEPGRAM', languages: ['en-US'], }); ``` ## Next Steps * **[Actions Overview](/api/actions)** - Complete action reference * **[Event Types](/api/events)** - What events carry transcribed text * **[Barge-In Configuration](/api/barge-in)** - Control how users interrupt the assistant --- --- url: /sipgate-ai-flow-api/api/actions/configure-voice-to-voice.md --- # Configure Voice-to-Voice Action ::: warning Preview End-to-end voice-to-voice mode is a preview feature. Available only after a positive review by sipgate support. See **Access Gate** below. ::: Switch a session into **end-to-end voice-to-voice** mode.
From the moment this action is processed, the assistant no longer goes through the standard STT → text → TTS pipeline — caller audio is forwarded directly to a speech-to-speech model, and the model's spoken response is sent back to the caller in real time.

The transcribed user text is still surfaced as `user_speak` events for logging and call traces, but you don't need (and shouldn't send) `speak` actions in response to them — the model speaks autonomously.

## Action Structure

```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "You are a friendly assistant for the Acme dental practice. Be concise.",
  "greeting": "Hello, this is Acme Dental — how can I help you?",
  "temperature": 0.8,
  "language": "en"
}
```

## Fields

| Field | Type | Required | Default | Description |
|-----------------|---------------|----------|---------|--------------------------------------------------------------------------------------------------------|
| `type` | string | Yes | — | Always `"configure_voice_to_voice"` |
| `session_id` | string (UUID) | Yes | — | Session identifier from the event |
| `system_prompt` | string | Yes | — | Persona / behaviour instructions for the model. Sent once at the start of the session. |
| `greeting` | string | No | — | Opening line the model should speak after connecting. Delivered as an inference trigger so the model phrases it naturally. |
| `temperature` | number | No | `0.8` | Sampling temperature (0–2). Lower values make replies more deterministic. |
| `language` | string | No | — | Preferred response language hint (e.g. `"de"`, `"en"`). The model makes the final decision. |

## Behavioral Details

### STT and TTS are inactive

Once voice-to-voice is active for a session:

* `user_speak` events still arrive, but they reflect the model's own transcription of the caller's turns — not your configured STT provider.
* `speak` actions are honoured by forwarding the text to the model as a speaking instruction. The model will speak the text in its own voice — it may rephrase slightly (the protocol has no verbatim-TTS path). `tts`, `ssml`, `barge_in`, `vad` and `user_input_timeout_seconds` fields on the `speak` action are ignored.
* Barge-in is handled inside the model — the configured barge-in strategy has no effect for the rest of the session.
* VAD parameters set via `configure_transcription.vad` or `speak.vad` are ignored.

### Reverting to the normal pipeline

Send a `configure_transcription` action to switch the session back to the standard STT/TTS pipeline. After that, you can send `speak` actions again.

```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "AZURE",
  "languages": ["de-DE"]
}
```

### Greeting

When `greeting` is provided, the model speaks an opening line as soon as the session is ready (typically within 1–2 seconds). The text is given to the model as guidance — the exact wording may differ slightly.

If you want full silence at the start (e.g. you announce yourself first via a `speak` action *before* sending `configure_voice_to_voice`), simply omit `greeting`.

### Latency

End-to-end speech-to-speech models respond noticeably faster than the standard STT → LLM → TTS pipeline because there are no per-stage decode/encode steps. First-byte latency for the spoken response is typically in the 200–600 ms range from the end of the caller's turn.
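Since `user_speak` transcripts keep arriving in voice-to-voice mode, your handler can watch them and revert on demand. A minimal sketch (assuming an Express-style webhook; the trigger phrase is illustrative):

```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'user_speak' && /menu|keypad/i.test(event.text)) {
    // Leave voice-to-voice mode: this restores the standard STT/TTS pipeline,
    // after which speak actions are honoured verbatim again
    return res.json({
      type: 'configure_transcription',
      session_id: event.session.id,
      provider: 'AZURE',
      languages: ['de-DE'],
    });
  }

  // Otherwise the model answers autonomously; just acknowledge the event
  return res.status(200).send();
});
```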
## Examples

### Minimal: persona-only, no greeting

```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "You are a friendly assistant for the Acme dental practice. Be concise."
}
```

### Persona + greeting in German

```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "Du bist ein freundlicher Assistent für die Zahnarztpraxis Acme.",
  "greeting": "Guten Tag, hier ist die Praxis Acme. Wie kann ich Ihnen helfen?",
  "language": "de"
}
```

### Logging caller turns while the model handles the conversation

Your code receives `user_speak` events for the call trace but does not need (and should not send) any further actions:

```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    return res.json({
      type: 'configure_voice_to_voice',
      session_id: event.session.id,
      system_prompt: 'You are a helpful assistant.',
      greeting: 'Hi! How can I help today?',
    });
  }

  if (event.type === 'user_speak') {
    // Log only — the model is already responding.
    console.log(`Caller said: ${event.text}`);
    return res.status(200).send();
  }

  return res.status(200).send();
});
```

## Access Gate

Voice-to-voice mode is only available upon request and after a positive review by sipgate support. Mention `configure_voice_to_voice` when you reach out so we can enable it for your account.

## Next Steps

* **[Actions Overview](/api/actions)** - Complete action reference
* **[Configure Transcription](/api/actions/configure-transcription)** - Switch back to the STT/TTS pipeline
* **[Event Types](/api/events)** - What events carry transcribed text

---

---
url: /sipgate-ai-flow-api/api/actions/send-sms.md
---

# Send SMS Action

Send an SMS from the sipgate account behind the AI Flow to any phone number. Useful for delivering confirmation codes, booking summaries, or follow-up links while (or after) a call.

::: info Availability
`send_sms` is **only available upon request** and after a positive review by sipgate support (fraud / scam protection). Ask your sipgate contact to enable SMS sending for your account.
:::

## Action Structure

```json
{
  "type": "send_sms",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "phone_number": "4915112345678",
  "message": "Your confirmation code is 4242."
}
```

## Fields

| Field | Type | Required | Description |
|----------------|-------------|----------|----------------------------------------------------------------------------------|
| `type` | string | Yes | Always `"send_sms"` |
| `session_id` | string (UUID) | Yes | Session identifier from the event |
| `phone_number` | string | Yes | Recipient number in E.164 format — digits only, **without** leading `+` (preferred; a leading `+` is accepted and stripped automatically). Matches the format used by `transfer` and outbound calls. |
| `message` | string | Yes | SMS body. No hard length limit; long texts are billed per standard SMS segment. |

## Sender

The sender shown to the recipient is determined by the following rules:

1. If your account already has outbound calls enabled, the sender is the same as for outbound calls.
2. Otherwise, the recipient sees the called number of the current session (i.e. the number the user dialed to reach you).

You cannot override the sender per request.

## Delivery Semantics

* SMS sending is **fire-and-forget**: the call is not blocked waiting for delivery confirmation.
* There is no delivery receipt in the event stream.
Use your own monitoring if you need per-message confirmation.
* A failed send does **not** interrupt the call — the agent can still speak, hang up, or transfer. You receive an `sms_failed` event to react conversationally (e.g. apologize, retry, or collect a corrected number).

## Examples

### Python

```python
@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json

    if event['type'] == 'user_speak':
        text = event['text'].lower()
        if 'send me the code' in text:
            return jsonify({
                'type': 'send_sms',
                'session_id': event['session']['id'],
                'phone_number': event['session']['from_phone_number'],
                'message': 'Your confirmation code is 4242.',
            })

    # Acknowledge all other events without an action
    return ('', 204)
```

### Node.js

```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'user_speak' && /code/i.test(event.text)) {
    return res.json({
      type: 'send_sms',
      session_id: event.session.id,
      phone_number: event.session.from_phone_number,
      message: 'Your confirmation code is 4242.',
    });
  }

  // Acknowledge all other events without an action
  return res.status(204).send();
});
```

### Go

```go
if strings.Contains(strings.ToLower(text), "code") {
    action := map[string]interface{}{
        "type":         "send_sms",
        "session_id":   session["id"],
        "phone_number": session["from_phone_number"].(string),
        "message":      "Your confirmation code is 4242.",
    }
    json.NewEncoder(w).Encode(action)
}
```

### Phone Number Format

Align with the rest of the AI Flow API:

* ✅ `4915112345678` (preferred; E.164 without `+`)
* ✅ `+4915112345678` (accepted; `+` is stripped before delivery)
* ❌ `+49 151 1234 5678` (spaces and dashes rejected)
* ❌ `0151 1234 5678` (national format rejected)

## Handling Failure — the `sms_failed` Event

When sending fails, the AI Flow emits an `sms_failed` event to your webhook / WebSocket. Handle it to keep the conversation natural:

```json
{
  "type": "sms_failed",
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "...": "..."
  },
  "recipient": "4915112345678",
  "reason": "sender_not_allowed",
  "message": "SMSC returned faultCode 403"
}
```

`reason` is one of:

| Value | Meaning |
|------------------------|---------------------------------------------------------------------------------------|
| `sender_not_allowed` | Your configured sender number isn't verified for SMS — fix in account settings. |
| `insufficient_balance` | Account has insufficient credits for the send. |
| `no_sms_extension` | No SMS extension is provisioned for this account — contact sipgate support. |
| `smsc_unavailable` | Transient infrastructure issue; safe to retry later. |
| `unknown` | Any other failure; check the optional `message` field for details. |

See **[Events Reference](/api/events)** for the full event schema.

## Best Practices

1. **Ask for consent before sending.** Announce over the call that you'll send an SMS.
2. **Normalize to E.164.** Always normalize user-provided numbers to digits-only E.164 (without the leading `+`) before passing them in.
3. **Keep messages short.** Each SMS segment is billed; long messages split into multiple segments silently.
4. **Handle `sms_failed`.** Have a fallback (speak an apology, retry with a corrected number, or skip the SMS and continue); see the sketch below.
5. **Don't loop.** A single SMS per session is usually enough — sending multiple in quick succession can look spammy.
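Putting practice 4 into code: a minimal sketch of a conversational fallback for `sms_failed` (Express-style webhook; the spoken fallback lines are illustrative):

```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'sms_failed') {
    // Transient SMSC outages may resolve on a later attempt;
    // for every other reason, apologize and keep the call going.
    const retryable = event.reason === 'smsc_unavailable';
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: retryable
        ? "I couldn't send the text message just now, so I'll read the code out instead: 4242."
        : "I wasn't able to send you a text message, so here is your code: 4242.",
    });
  }

  // Acknowledge all other events without an action
  return res.status(204).send();
});
```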
## Next Steps

* **[Event Types](/api/events)** — including the `sms_failed` event schema
* **[Speak Action](/api/actions/speak)** — acknowledge the SMS over the call
* **[Hangup Action](/api/actions/hangup)** — wrap up after the SMS is queued

---

---
url: /sipgate-ai-flow-api/api/tts-providers.md
---

# TTS Providers

Configure text-to-speech providers for different voices and languages.

## Overview

The AI Flow service supports multiple TTS providers. Configure them per action in the `tts` field.

## Supported Providers

* **Azure Cognitive Services** - 400+ voices in 140+ languages
* **ElevenLabs** - Ultra-realistic conversational voices

## Azure Cognitive Services

### Configuration

```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "Hello!",
  "tts": {
    "provider": "azure",
    "language": "en-US",
    "voice": "en-US-JennyNeural"
  }
}
```

### Popular Voices

| Language | Voice Name | Gender | Description |
| -------- | ------------------ | ------ | ---------------------- |
| en-US | en-US-JennyNeural | Female | Friendly, professional |
| en-US | en-US-GuyNeural | Male | Clear, neutral |
| en-GB | en-GB-SoniaNeural | Female | British, professional |
| en-GB | en-GB-RyanNeural | Male | British, friendly |
| de-DE | de-DE-KatjaNeural | Female | Professional, clear |
| de-DE | de-DE-ConradNeural | Male | Deep, authoritative |

**Full Voice List:** See [Azure TTS documentation](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support)

## ElevenLabs

### Configuration

```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "Hello!",
  "tts": {
    "provider": "eleven_labs",
    "voice": "21m00Tcm4TlvDq8ikWAM"
  }
}
```

::: tip Voice IDs
The `voice` field is optional and accepts the ElevenLabs voice ID as a string. For example, `"21m00Tcm4TlvDq8ikWAM"` for "Rachel". If omitted, the first available voice will be used.
:::

**Minimal Configuration (uses default voice):**

```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "Hello!",
  "tts": {
    "provider": "eleven_labs"
  }
}
```

### Available Voices

| Voice Name | ID | Description |
| ------------ | -------------------- | ------------------------------------------------------------------------ |
| sipgate | dSu12TX3MEDQXAarG4s6 | Clean male voice used by sipgate for system announcements (default). |
| Rachel | 21m00Tcm4TlvDq8ikWAM | Matter-of-fact, personable woman. Great for conversational use cases. |
| Drew | 29vD33N1CtxCmqQRPOHJ | - |
| Clyde | 2EiwWnXFnvU5JabPnv8n | Great for character use-cases |
| Paul | 5Q0t7uMcjvnagumLfvZi | - |
| Aria | 9BWtsMINqrJLrRacOk9x | Middle-aged female with African-American accent. Calm with hint of rasp. |
| Domi | AZnzlk1XvdvUeBnXmlld | - |
| Dave | CYw3kZ02Hs0563khs1Fj | - |
| Roger | CwhRBWXzGAHq8TQ4Fs17 | Easy going and perfect for casual conversations. |
| Fin | D38z5RcWu1voky8WS1ja | - |
| Sarah | EXAVITQu4vr4xnSDxMaL | Young adult woman with confident, warm tone. Reassuring and professional. |
| Antoni | ErXwobaYiN019PkySvjV | - |
| Laura | FGY2WhTYpPnrIDTdsKH5 | Young adult female with sunny enthusiasm and quirky attitude. |
| Thomas | GBv7mTt0atIp3Br8iCZE | Soft and subdued male voice, optimal for narrations or meditations |
| Charlie | IKne3meq5aSn9XLyUdCD | Young Australian male with confident and energetic voice. |
| George | JBFqnCBsd6RMkjVDRZzb | Warm resonance that instantly captivates listeners. |
| Emily | LcfcDJNUP1GQjkzn1xUU | - |
| Elli | MF3mGyEYCl7XYWbV9V6O | - |
| Callum | N2lVS1w4EtoT3dr4eOWO | Deceptively gravelly, yet unsettling edge. |
| Patrick | ODq5zmih8GrVes37Dizd | - |
| River | SAz9YHcvj6GT2YYXdXww | Relaxed, neutral voice ready for narrations or conversational projects. |
| Harry | SOYHLrjzK2X1ezoPC6cr | An animated warrior ready to charge forward. |
| Liam | TX3LPaxmHKxFdv7VOQHJ | Young adult with energy and warmth - suitable for reels and shorts. |
| Dorothy | ThT5KcBeYPX3keUQqHPh | - |
| Josh | TxGEqnHWrfWFTfGW9XjX | - |
| Arnold | VR6AewLTigWG4xSOukaG | - |
| Charlotte | XB0fDUnXU5powFXDhCwa | Sensual and raspy, ready to voice your temptress in video games. |
| Alice | Xb7hH8MSUJpSbSDYk0k2 | Clear and engaging British woman, suitable for e-learning. |
| Matilda | XrExE9yKIg1WjnnlVkGX | Professional woman with pleasing alto pitch. Suitable for many use cases. |
| James | ZQe5CZNOzWyzPSCn5a3c | - |
| Joseph | Zlb1dXrM653N07WRdFW3 | - |
| Will | bIHbv24MWmeRgasZH58o | Conversational and laid back. |
| Jeremy | bVMeCyTHy58xNoL34h3p | - |
| Jessica | cgSgspJ2msm6clMCkdW9 | Young and playful American female, perfect for trendy content. |
| Eric | cjVigY5qzO86Huf0OWal | Smooth tenor pitch from man in his 40s - perfect for agentic use cases. |
| Michael | flq6f7yk4E4fJM5XTYuZ | - |
| Ethan | g5CIjZEefAph4nQFvHAz | - |
| Chris | iP95p4xoKVk53GoZ742B | Natural and real, down-to-earth voice great across many use-cases. |
| Gigi | jBpfuIE2acCO8z3wKNLl | - |
| Freya | jsCqWAovK2LkecY7zXl4 | - |
| Brian | nPczCjzI2devNBz1zQrb | Middle-aged man with resonant and comforting tone. Great for narrations. |
| Grace | oWAxZDx7w5VEj9dCyTzz | - |
| Daniel | onwK4e9ZLuTAKqWW03F9 | Strong voice perfect for professional broadcast or news story. |
| Lily | pFZP5JQG7iQjIQuC4Bku | Velvety British female voice delivers news with warmth and clarity. |
| Serena | pMsXgVXv3BLzUgSXRplE | - |
| Adam | pNInz6obpgDQGcFmaJgB | - |
| Nicole | piTKgcLEGmPE4e6mEKli | - |
| Bill | pqHfZKP75CvOlQylNhV4 | Friendly and comforting voice ready to narrate your stories. |
| Jessie | t0jbNlBVZ17f02VDIeMI | - |
| Sam | yoZ06aMxZJJ28mfd3POQ | - |
| Glinda | z9fAnlkpzviPz146aGWa | - |
| Giovanni | zcAOhNBS3c14rBihAFp1 | - |
| Mimi | zrHiDhphv9ZnVXBqCLjz | - |

## Choosing a Provider

### Use Azure when:

* You need many languages (140+)
* You want consistent quality
* You need regional accents
* Budget is a concern

### Use ElevenLabs when:

* You need the most natural voices
* Conversational quality is critical
* You're working with English/European languages
* You want distinct personalities

## Examples

### Python

```python
# Azure voice
action = {
    'type': 'speak',
    'session_id': session_id,
    'text': 'Hello!',
    'tts': {
        'provider': 'azure',
        'language': 'en-US',
        'voice': 'en-US-JennyNeural'
    }
}

# ElevenLabs voice
action = {
    'type': 'speak',
    'session_id': session_id,
    'text': 'Hello!',
    'tts': {
        'provider': 'eleven_labs',
        'voice': '21m00Tcm4TlvDq8ikWAM'  # Rachel
    }
}
```

## Next Steps

* **[Speak Action](/api/actions/speak)** - How to use TTS
* **[Barge-In Configuration](/api/barge-in)** - Control interruptions

---

---
url: /sipgate-ai-flow-api/api/barge-in.md
---

# Barge-In Configuration

Control how users can interrupt the assistant while speaking.

::: tip Looking for the `barge_in` action?
This page covers the `barge_in` **configuration object** attached to `speak` / `audio` actions — it decides whether and how the **caller** may interrupt. The top-level `barge_in` **action**, which lets **your application** interrupt the current playback, has its own page: [Barge-In Action](/api/actions/barge-in).
:::

## Overview

Barge-in allows users to interrupt the assistant's speech. Configure it per action using the `barge_in` field.

## Configuration

```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "Hello!",
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 3,
    "allow_after_ms": 500
  }
}
```

## Strategies

### `none`

Disables barge-in completely. Audio plays fully without interruption.

```json
{
  "barge_in": {
    "strategy": "none"
  }
}
```

**Use cases:**

* Critical information
* Legal disclaimers
* Emergency instructions

### `manual`

Allows manual barge-in via API only (no automatic detection).

```json
{
  "barge_in": {
    "strategy": "manual"
  }
}
```

**Use cases:**

* Custom interruption logic
* Button-triggered interruption
* External event-based interruption

### `minimum_characters`

Automatically detects barge-in when user speech exceeds character threshold.

```json
{
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 5,
    "allow_after_ms": 500
  }
}
```

**Use cases:**

* Natural conversation flow
* Customer service scenarios
* Interactive voice menus

### `immediate` ⚡ NEW

**Most responsive option** - Interrupts immediately when user starts speaking, using Voice Activity Detection (VAD).
```json { "barge_in": { "strategy": "immediate", "allow_after_ms": 500 } } ``` **How it works:** * **Azure/Deepgram**: Uses VAD (Voice Activity Detection) - triggers before any text is recognized * **ElevenLabs**: Uses first partial transcript * **Latency**: 20-100ms (2-4x faster than `minimum_characters`) * **No text required**: Interrupts on voice detection, not transcription **Use cases:** * High-priority conversations requiring instant responsiveness * Natural dialogue where interruptions should feel seamless * Customer service where quick response matters * Urgent or time-sensitive interactions **Best practices:** * Use `allow_after_ms: 500-1000` to prevent accidental interruptions at start * Test with real users to find optimal `allow_after_ms` value * Consider network latency in production environments **Comparison with `minimum_characters`:** | Feature | `immediate` | `minimum_characters` | |---------|-------------|---------------------| | **Trigger** | Voice Activity (VAD) | Text recognition (3+ characters) | | **Latency** | 20-100ms | 50-200ms | | **User Experience** | Instant interruption | Slight delay | | **Accuracy** | May trigger on noise | More reliable (text-based) | ## Configuration Options ### minimum\_characters Minimum number of characters before barge-in triggers. * **Default**: `3` * **Range**: `1` to `100` * **Higher values**: Require more speech before interruption ### allow\_after\_ms Delay in milliseconds before barge-in is allowed (protection period). * **Default**: `0` (immediate) * **Range**: `0` to `10000` (10 seconds) * **Use**: Prevent interruption during critical information ## Examples ### Natural Conversation ```json { "type": "speak", "session_id": "session-123", "text": "I can help you with billing, support, or sales.", "barge_in": { "strategy": "minimum_characters", "minimum_characters": 3 } } ``` ### Critical Information ```json { "type": "speak", "session_id": "session-123", "text": "Your verification code is 1-2-3-4-5-6.", "barge_in": { "strategy": "none" } } ``` ### Protected Announcement ```json { "type": "speak", "session_id": "session-123", "text": "Your account number is 1234567890.", "barge_in": { "strategy": "minimum_characters", "minimum_characters": 10, "allow_after_ms": 2000 } } ``` ### Instant Response (Immediate) ⚡ ```json { "type": "speak", "session_id": "session-123", "text": "I can help you with your order, account, or technical support. What would you like to know?", "barge_in": { "strategy": "immediate", "allow_after_ms": 500 } } ``` **Result**: Assistant stops speaking the moment user starts talking (20-100ms latency), providing the most natural conversation experience. ## Best Practices 1. **Use `none` sparingly** - Only for truly critical information 2. **Choose the right strategy**: * `immediate` - For most natural, responsive conversations * `minimum_characters` - For balance between responsiveness and reliability * `manual` - For custom logic * `none` - For critical announcements only 3. **Set protection periods** - Use `allow_after_ms: 500-1000` to prevent cutting off important intro 4. **Test with users** - Find the right balance for your use case 5. 
**Consider noise** - `immediate` may trigger on background noise; use `allow_after_ms` as buffer

## Next Steps

* **[Speak Action](/api/actions/speak)** - How to use barge-in
* **[User Speak Event with Barge-In Flag](/api/events/user-speak)** - Handle interruptions

---

---
url: /sipgate-ai-flow-api/api/vad.md
---

# VAD (Voice Activity Detection) Configuration

Advanced setting that lets you tune how long the system waits in silence before treating the caller's turn as finished. Useful for call flows where the caller is expected to pause (think aloud, list items, spell things out) or where you want a snappier turn-taking rhythm.

::: warning Optional advanced setting
The default behaviour is tuned for typical conversations. Only set `vad` when you have a concrete use case where the system's default end-of-turn timing is too eager or too patient. When omitted, the system default applies.
:::

## Where to set it

VAD config is accepted in two places:

* **Per `speak` action** — applies to the caller's reply that follows. The setting persists until overridden by another `speak.vad` or by `configure_transcription.vad`.
* **On `configure_transcription`** — sets the value for the rest of the session (until overridden again).

## Schema

```json
{
  "vad": {
    "end_of_turn_silence_ms": 1200
  }
}
```

| Field | Type | Recommended range | Description |
|--------------------------|--------|-------------------|--------------------------------------------------------------------------------------------------------------|
| `end_of_turn_silence_ms` | number | 150–2000 | Milliseconds of silence after the caller stops speaking before their turn is considered finished. |

Lower values yield faster turn-taking; higher values tolerate longer pauses.

## Lenient validation

If you send an out-of-range, non-integer, or otherwise invalid value, the value is **silently ignored** — the system default takes over and the rest of your action is processed normally. This avoids breaking call flows over a typo.

## Example: tolerate long pauses (e.g. spelling)

```json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Please spell your last name, letter by letter.",
  "vad": {
    "end_of_turn_silence_ms": 1500
  }
}
```

## Example: snappy back-and-forth (e.g. yes/no questions)

```json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Did you mean account number 1234?",
  "vad": {
    "end_of_turn_silence_ms": 250
  }
}
```

## Example: set once for the whole session

```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "vad": {
    "end_of_turn_silence_ms": 1000
  }
}
```

## Notes

* The setting takes effect immediately — the assistant's speech plays before the caller can reply, so any internal reconfiguration completes before the system needs to listen again.
* VAD tuning and [barge-in](/api/barge-in) are related but distinct: `vad` governs *when the caller's turn is considered finished*, while `barge_in` governs *whether and how the caller may interrupt the assistant while it is speaking*. Both can be set on the same `speak` action.

---

---
url: /sipgate-ai-flow-api/sdk.md
---

# SDK Guide

Welcome to the sipgate AI Flow SDK documentation! This guide will help you build powerful AI-powered voice assistants with real-time speech processing capabilities.

## What is the SDK?

The `@sipgate/ai-flow-sdk` is a TypeScript SDK that provides a simple, event-driven interface for building voice assistants.
It handles the complexity of real-time speech processing, event management, and action responses, so you can focus on building great conversational experiences. ## Key Concepts ### Event-Driven Architecture The SDK uses an event-driven model where your assistant responds to events from the AI Flow service: * **Session Start** - When a new call begins * **User Speak** - When the user says something * **User Barge In** - When the user interrupts the assistant * **Assistant Speak** - After your assistant speaks * **Session End** - When the call ends ### Simple Response Model Event handlers can return: * **Simple strings** - Automatically converted to speech * **Action objects** - For advanced control (speak, transfer, hangup, etc.) * **null/undefined** - No response needed ### Easy Integration The SDK provides built-in middleware for: * **Express.js** - `assistant.express()` middleware * **WebSocket** - `assistant.ws(ws)` message handler * **Custom** - `assistant.onEvent(event)` for any integration ## Quick Example ```typescript import { AiFlowAssistant } from "@sipgate/ai-flow-sdk"; const assistant = AiFlowAssistant.create({ onSessionStart: async (event) => { return "Hello! How can I help you today?"; }, onUserSpeak: async (event) => { const userText = event.text; console.log(`User said: ${userText}`); return `You said: ${userText}`; }, onSessionEnd: async (event) => { console.log(`Session ${event.session.id} ended`); }, }); // Use with Express app.post("/webhook", assistant.express()); ``` ## What's Next? * **[Installation](/sdk/installation)** - Install the SDK and set up your project * **[Quick Start](/sdk/quick-start)** - Build your first voice assistant * **[Core Concepts](/sdk/core-concepts)** - Learn about events and responses * **[API Reference](/sdk/api-reference)** - Complete API documentation ## For AI-Assisted Development Using AI coding assistants like **Claude Code**, **ChatGPT**, or **Cursor**? We publish two auto-generated files following the [llms.txt spec](https://llmstxt.org/): * **[`/llms.txt`](/llms.txt)** — short index, auto-discovered by AI tooling. * **[`/llms-full.txt`](/llms-full.txt)** — full documentation corpus in a single file, ideal for pasting into an LLM context. --- --- url: /sipgate-ai-flow-api/sdk/installation.md --- # Installation Install the sipgate AI Flow SDK to start building voice assistants. ## Package Managers ```bash npm install @sipgate/ai-flow-sdk ``` ```bash yarn add @sipgate/ai-flow-sdk ``` ```bash pnpm add @sipgate/ai-flow-sdk ``` ## Requirements * **Node.js** >= 22.0.0 * **TypeScript** 5.x (recommended) ## TypeScript Setup The SDK is written in TypeScript and includes full type definitions. No additional `@types` package is needed. If you're using TypeScript, make sure your `tsconfig.json` includes: ```json { "compilerOptions": { "target": "ES2022", "module": "ESNext", "moduleResolution": "bundler", "strict": true, "esModuleInterop": true, "skipLibCheck": true } } ``` ## Verify Installation You can verify the installation by importing the SDK: ```typescript import { AiFlowAssistant } from "@sipgate/ai-flow-sdk"; console.log("SDK installed successfully!"); ``` ## Next Steps * **[Quick Start](/sdk/quick-start)** - Build your first voice assistant * **[API Reference](/sdk/api-reference)** - Explore the complete API --- --- url: /sipgate-ai-flow-api/sdk/quick-start.md --- # Quick Start Get up and running with your first voice assistant in minutes. 
## Basic Assistant Here's a minimal example that responds to user speech: ```typescript import { AiFlowAssistant } from "@sipgate/ai-flow-sdk"; const assistant = AiFlowAssistant.create({ debug: true, onSessionStart: async (event) => { console.log(`Session started for ${event.session.phone_number}`); return "Hello! How can I help you today?"; }, onUserSpeak: async (event) => { const userText = event.text; console.log(`User said: ${userText}`); // Process user input and return response return `You said: ${userText}`; }, onSessionEnd: async (event) => { console.log(`Session ${event.session.id} ended`); }, onUserBargeIn: async (event) => { console.log(`User interrupted with: ${event.text}`); return "I'm listening, please continue."; }, }); ``` ## Express.js Integration The easiest way to get started is with Express.js: ```typescript import express from "express"; import { AiFlowAssistant } from "@sipgate/ai-flow-sdk"; const app = express(); app.use(express.json()); const assistant = AiFlowAssistant.create({ onSessionStart: async (event) => { return "Welcome! How can I help you today?"; }, onUserSpeak: async (event) => { // Your conversation logic here return processUserInput(event.text); }, onSessionEnd: async (event) => { await cleanupSession(event.session.id); }, }); // Webhook endpoint app.post("/webhook", assistant.express()); // Health check app.get("/health", (req, res) => { res.json({ status: "ok" }); }); const PORT = process.env.PORT || 3000; app.listen(PORT, () => { console.log(`AI Flow assistant running on port ${PORT}`); }); ``` ## WebSocket Integration For WebSocket-based integrations: ```typescript import WebSocket from "ws"; import { AiFlowAssistant } from "@sipgate/ai-flow-sdk"; const wss = new WebSocket.Server({ port: 8080, perMessageDeflate: false, }); const assistant = AiFlowAssistant.create({ onUserSpeak: async (event) => { return "Hello from WebSocket!"; }, }); wss.on("connection", (ws, req) => { console.log("New WebSocket connection"); ws.on("message", assistant.ws(ws)); ws.on("error", (error) => { console.error("WebSocket error:", error); }); ws.on("close", () => { console.log("WebSocket connection closed"); }); }); console.log("WebSocket server listening on port 8080"); ``` ## Response Types You can return different types of responses: ```typescript // 1. Simple string (automatically converted to speak action) return "Hello, how can I help?"; // 2. Action object (for advanced control) return { type: "speak", session_id: event.session.id, text: "Hello!", barge_in: { strategy: "minimum_characters" }, }; // 3. null/undefined (no response needed) return null; ``` ## Next Steps * **[Core Concepts](/sdk/core-concepts)** - Learn about events and responses in detail * **[API Reference](/sdk/api-reference)** - Explore the complete API * **[Integration Guides](/sdk/integrations/express)** - See more integration examples --- --- url: /sipgate-ai-flow-api/sdk/core-concepts.md --- # Core Concepts Understanding the event-driven architecture and response model. ## Event-Driven Architecture The SDK uses an event-driven model where your assistant responds to events from the AI Flow service: 1. **Session Start** - Called when a new call session begins 2. **User Speak** - Called when the user says something (after speech-to-text) 3. **User Barge In** - Called when the user interrupts the assistant 4. **Assistant Speak** - Called after your assistant starts speaking (event may be left out) 5. **Assistant Speech Ended** - Called when the assistant's speech playback ends 6. 
**Session End** - Called when the call ends ### Event Flow ``` ┌─────────────────┐ │ session_start │──> Respond with speak/audio or do nothing └─────────────────┘ ┌─────────────────┐ │ user_speak │──> Respond with speak/audio/transfer/hangup │ (barged_in?) │ Check barged_in flag for interruptions └─────────────────┘ ┌─────────────────┐ │ assistant_speak │──> Optional: track metrics, trigger next action └─────────────────┘ ┌─────────────────┐ │ session_end │──> Cleanup only, no actions accepted └─────────────────┘ ``` ## Response Types Event handlers can return three types of responses: ### 1. Simple String The simplest way to respond - just return a string: ```typescript onUserSpeak: async (event) => { return "Hello, how can I help?"; } ``` This is automatically converted to a `speak` action. ### 2. Action Object For advanced control, return an action object: ```typescript onUserSpeak: async (event) => { return { type: "speak", session_id: event.session.id, text: "Hello!", barge_in: { strategy: "minimum_characters", minimum_characters: 3 }, }; } ``` Available action types: * `speak` - Text-to-speech response * `audio` - Play pre-recorded audio * `hangup` - End the call * `transfer` - Transfer to another number * `barge_in` - Manually interrupt playback ### 3. No Response Return `null` or `undefined` when no response is needed: ```typescript onAssistantSpeak: async (event) => { // Track metrics, no response needed trackMetrics(event); return null; } ``` ## Session Information All events include session information: ```typescript interface SessionInfo { id: string; // UUID of the session account_id: string; // Account identifier phone_number: string; // Phone number for this flow session direction?: "inbound" | "outbound"; from_phone_number: string; to_phone_number: string; } ``` ## Best Practices ### 1. Handle All Events Even if you don't need to respond, it's good practice to handle all events: ```typescript const assistant = AiFlowAssistant.create({ onSessionStart: async (event) => { // Initialize session state initializeSession(event.session.id); return "Welcome!"; }, onUserSpeak: async (event) => { // Main conversation logic return processUserInput(event.text); }, onSessionEnd: async (event) => { // Cleanup cleanupSession(event.session.id); }, }); ``` ### 2. Use Type Safety The SDK provides full TypeScript types: ```typescript import type { AiFlowEventUserSpeak, AiFlowAction } from "@sipgate/ai-flow-sdk"; onUserSpeak: async (event: AiFlowEventUserSpeak) => { // event is fully typed const text: string = event.text; const sessionId: string = event.session.id; return { type: "speak", session_id: sessionId, text: `You said: ${text}`, } as AiFlowAction; } ``` ### 3. Error Handling Always handle errors gracefully: ```typescript onUserSpeak: async (event) => { try { return await processUserInput(event.text); } catch (error) { console.error("Error processing user input:", error); return "I'm sorry, I encountered an error. Please try again."; } } ``` ## Next Steps * **[API Reference](/sdk/api-reference)** - Complete API documentation * **[Event Types](/sdk/events)** - Detailed event reference * **[Action Types](/sdk/actions)** - All available actions --- --- url: /sipgate-ai-flow-api/sdk/response-types.md --- # Response Types Learn about the different ways to respond to events. ## Overview Event handlers can return these response types: 1. **Simple string** - Automatically converted to a speak action 2. **Action object** - For advanced control 3. 
**Array of actions** - Execute multiple actions in sequence
4. **null/undefined** - No response needed

## Simple String Response

The simplest way to respond is to return a string:

```typescript
onUserSpeak: async (event) => {
  return "Hello, how can I help?";
}
```

This is automatically converted to:

```typescript
{
  type: "speak",
  session_id: event.session.id,
  text: "Hello, how can I help?",
}
```

## Action Object Response

For advanced control, return an action object directly:

```typescript
onUserSpeak: async (event) => {
  return {
    type: "speak",
    session_id: event.session.id,
    text: "Hello!",
    barge_in: {
      strategy: "minimum_characters",
      minimum_characters: 3,
    },
  };
}
```

### Available Action Types

* **[Speak Action](/sdk/actions#speak-action)** - Text-to-speech response
* **[Audio Action](/sdk/actions#audio-action)** - Play pre-recorded audio
* **[Hangup Action](/sdk/actions#hangup-action)** - End the call
* **[Transfer Action](/sdk/actions#transfer-action)** - Transfer to another number
* **[Barge-In Action](/sdk/actions#barge-in-action)** - Manually interrupt playback

## No Response

Return `null` or `undefined` when no response is needed:

```typescript
onAssistantSpeak: async (event) => {
  // Track metrics, no response needed
  trackMetrics(event);
  return null;
}
```

## Type Safety

The SDK provides TypeScript types for all responses:

```typescript
import type { InvocationResponseType, AiFlowAction } from "@sipgate/ai-flow-sdk";

// InvocationResponseType is a union of:
// string | AiFlowAction | null | undefined

onUserSpeak: async (event): Promise<InvocationResponseType> => {
  // You can return any of these types
  return "Hello"; // string
  // or
  return { type: "speak", ... }; // AiFlowAction
  // or
  return null; // null/undefined
}
```

## Examples

### Conditional Response

```typescript
onUserSpeak: async (event) => {
  if (event.text.toLowerCase().includes("goodbye")) {
    return {
      type: "hangup",
      session_id: event.session.id,
    };
  }
  return "How can I help you?";
}
```

### Multiple Actions

You can return an array of actions to execute them in sequence:

```typescript
onUserSpeak: async (event) => {
  return [
    {
      type: "barge_in",
      session_id: event.session.id,
    },
    {
      type: "speak",
      session_id: event.session.id,
      text: "Sorry, let me correct that.",
    },
  ];
}
```

Actions in the array are executed one after another in order.

Alternatively, you can chain actions across events using the `onAssistantSpeak` event:

```typescript
const sessionState = new Map();

onUserSpeak: async (event) => {
  // Store what we want to do next
  sessionState.set(event.session.id, "play_audio");
  return "Please listen to this message.";
},

onAssistantSpeak: async (event) => {
  const nextAction = sessionState.get(event.session.id);
  if (nextAction === "play_audio") {
    sessionState.delete(event.session.id);
    return {
      type: "audio",
      session_id: event.session.id,
      audio: base64AudioData,
    };
  }
  return null;
}
```

## Next Steps

* **[Action Types](/sdk/actions)** - Complete reference for all actions
* **[API Reference](/sdk/api-reference)** - Full API documentation

---

---
url: /sipgate-ai-flow-api/sdk/api-reference.md
---

# API Reference

Complete API documentation for the `AiFlowAssistant` class.

## AiFlowAssistant

The main class for creating AI voice assistants.

### `AiFlowAssistant.create(options)`

Creates a new assistant instance.
**Options:**

```typescript
interface AiFlowAssistantOptions {
  // Bearer token for outbound call API requests
  token?: string;

  // Base URL of the sipgate API (default: "https://api.sipgate.com")
  baseUrl?: string;

  // Enable debug logging
  debug?: boolean;

  // Event handlers
  onSessionStart?: (
    event: AiFlowEventSessionStart
  ) => Promise<InvocationResponseType>;
  onUserSpeak?: (
    event: AiFlowEventUserSpeak
  ) => Promise<InvocationResponseType>;
  onAssistantSpeak?: (
    event: AiFlowEventAssistantSpeak
  ) => Promise<InvocationResponseType>;
  onAssistantSpeechEnded?: (
    event: AiFlowEventAssistantSpeechEnded
  ) => Promise<InvocationResponseType>;
  onUserInputTimeout?: (
    event: AiFlowEventUserInputTimeout
  ) => Promise<InvocationResponseType>;
  onSessionEnd?: (
    event: AiFlowEventSessionEnd
  ) => Promise<InvocationResponseType>;

  // DEPRECATED: Use onUserSpeak instead
  onUserBargeIn?: (
    event: AiFlowEventUserBargeIn
  ) => Promise<InvocationResponseType>;
}

type InvocationResponseType = AiFlowAction | string | null | undefined;
```

**Example:**

```typescript
const assistant = AiFlowAssistant.create({
  debug: true,
  token: process.env.API_KEY,
  onSessionStart: async (event) => {
    return "Welcome!";
  },
  onUserSpeak: async (event) => {
    return "Hello!";
  },
});
```

### Instance Methods

#### `assistant.express()`

Returns an Express.js middleware function for handling webhook requests.

```typescript
app.post("/webhook", assistant.express());
```

**Usage:**

```typescript
import express from "express";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const app = express();
app.use(express.json());

const assistant = AiFlowAssistant.create({
  onUserSpeak: async (event) => {
    return "Hello!";
  },
});

app.post("/webhook", assistant.express());
```

#### `assistant.ws(websocket)`

Returns a WebSocket message handler.

```typescript
wss.on("connection", (ws) => {
  ws.on("message", assistant.ws(ws));
});
```

**Usage:**

```typescript
import WebSocket from "ws";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const wss = new WebSocket.Server({ port: 8080 });

const assistant = AiFlowAssistant.create({
  onUserSpeak: async (event) => {
    return "Hello!";
  },
});

wss.on("connection", (ws) => {
  ws.on("message", assistant.ws(ws));
});
```

#### `assistant.call(params)`

Initiates an outbound call. Requires `token` to be set in options.

```typescript
await assistant.call({
  aiFlowId: string; // ID of the AI flow
  billingDevice: string; // Billing device suffix (provided during onboarding)
  toPhoneNumber: string; // Target number in E.164 format
});
```

Returns `Promise<void>`. Throws on API errors (e.g. flow not found, missing phone number configuration).

See **[Outbound Calls](/sdk/outbound-calls)** for a full guide.

#### `assistant.onEvent(event)`

Manually process an event (useful for custom integrations).

```typescript
const action = await assistant.onEvent(event);
```

**Usage:**

```typescript
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const assistant = AiFlowAssistant.create({
  onUserSpeak: async (event) => {
    return "Hello!";
  },
});

// Custom integration
app.post("/custom-webhook", async (req, res) => {
  const event = req.body;
  const action = await assistant.onEvent(event);

  if (action) {
    res.json(action);
  } else {
    res.status(204).send();
  }
});
```

## Options Reference

### `token?: string`

Bearer token for authenticating outbound call API requests. Required when using `assistant.call()`.

### `baseUrl?: string`

Base URL of the sipgate API. Defaults to `"https://api.sipgate.com"`. Override for custom environments.

### `debug?: boolean`

Enable debug logging. When `true`, the SDK will log all events and actions to the console.
```typescript
const assistant = AiFlowAssistant.create({
  debug: true, // Logs all events and actions
  // ...
});
```

### Event Handlers

All event handlers are optional and follow the same pattern:

```typescript
onEventName?: (event: EventType) => Promise<InvocationResponseType>
```

See the [Event Types](/sdk/events) documentation for details on each event.

## Type Definitions

### `InvocationResponseType`

The return type for all event handlers:

```typescript
type InvocationResponseType =
  | AiFlowAction // Action object
  | string // Simple string (converted to speak action)
  | null // No response
  | undefined; // No response
```

## Error Handling

The SDK handles errors gracefully. If an event handler throws an error, it will be logged and the SDK will continue processing other events.

```typescript
const assistant = AiFlowAssistant.create({
  onUserSpeak: async (event) => {
    try {
      return await processUserInput(event.text);
    } catch (error) {
      console.error("Error:", error);
      return "I'm sorry, I encountered an error.";
    }
  },
});
```

## Next Steps

* **[Event Types](/sdk/events)** - Complete event reference
* **[Action Types](/sdk/actions)** - All available actions
* **[Integration Guides](/sdk/integrations/express)** - Integration examples

---

---
url: /sipgate-ai-flow-api/sdk/events.md
---

# Event Types

Complete reference for all events in the SDK.

## Overview

Events are triggered by the AI Flow service and handled by your assistant. All events include session information and are typed with TypeScript.

## Base Event Structure

All events extend a base structure with session information:

```typescript
interface SessionInfo {
  id: string; // UUID of the session
  account_id: string; // Account identifier
  phone_number: string; // Phone number for this flow session
  direction?: "inbound" | "outbound";
  from_phone_number: string;
  to_phone_number: string;
}
```

## Event Types

### SessionStart Event

Triggered when a new call session begins.

```typescript
interface AiFlowEventSessionStart {
  type: "session_start";
  session: {
    id: string; // UUID of the session
    account_id: string; // Account identifier
    phone_number: string; // Phone number for this flow session
    direction?: "inbound" | "outbound"; // Call direction
    from_phone_number: string; // Phone number of the caller
    to_phone_number: string; // Phone number of the callee
  };
}
```

**Example:**

```typescript
onSessionStart: async (event) => {
  // Log session details
  console.log(
    `${event.session.direction} call from ${event.session.from_phone_number} to ${event.session.to_phone_number}`
  );

  // Return greeting
  return "Welcome to our service!";
};
```

### UserSpeechStarted Event

Triggered when the user's speech is first detected, before the full transcript is available. Uses Voice Activity Detection (VAD) and fires 20–120 ms after the user starts speaking.

> **WebSocket only** — this event is not delivered to HTTP webhook handlers.

```typescript
interface AiFlowEventUserSpeechStarted {
  type: "user_speech_started";
  session: SessionInfo;
}
```

**Notes:**

* Fires at most once per speech turn; resets after `user_speak` is received
* No return value is expected; returning an action has no effect

**Example:**

```typescript
onUserSpeechStarted: async (event) => {
  console.log('User started speaking, session:', event.session.id);
  // No return value needed
},
```

### UserSpeak Event

Triggered when the user speaks and speech-to-text completes.
```typescript interface AiFlowEventUserSpeak { type: "user_speak"; text: string; // Recognized speech text session: SessionInfo; } ``` **Example:** ```typescript onUserSpeak: async (event) => { const intent = analyzeIntent(event.text); if (intent === "help") { return "I can help you with billing, support, or sales."; } return processUserInput(event.text); }; ``` ### AssistantSpeak Event Triggered after the assistant starts speaking. Event may be omitted for some text-to-speech models. ```typescript interface AiFlowEventAssistantSpeak { type: "assistant_speak"; text?: string; // Text that was spoken ssml?: string; // SSML that was used (if applicable) duration_ms: number; // Duration of speech in milliseconds speech_started_at: number; // Unix timestamp (ms) when speech started session: SessionInfo; } ``` **Example:** ```typescript onAssistantSpeak: async (event) => { console.log(`Spoke for ${event.duration_ms}ms`); // Track conversation metrics trackMetrics({ sessionId: event.session.id, duration: event.duration_ms, text: event.text, }); }; ``` ### AssistantSpeechEnded Event Triggered after the assistant finishes speaking. ```typescript interface AiFlowEventAssistantSpeechEnded { type: "assistant_speech_ended"; session: SessionInfo; } ``` **Example:** ```typescript onAssistantSpeechEnded: async (event) => { console.log(`Finished speaking for session ${event.session.id}`); // Trigger next action if needed await triggerNextAction(event.session.id); }; ``` ### UserInputTimeout Event Triggered when no user speech is detected within the configured timeout period after the assistant finishes speaking. ```typescript interface AiFlowEventUserInputTimeout { type: "user_input_timeout"; session: SessionInfo; } ``` **When Triggered:** 1. A `speak` action includes a `user_input_timeout_seconds` field 2. The assistant finishes speaking (`assistant_speech_ended` event fires) 3. The specified timeout period elapses without any user speech detected **Example:** ```typescript onUserInputTimeout: async (event) => { console.log(`No user input received for session ${event.session.id}`); // Retry the question return { type: "speak", session_id: event.session.id, text: "Are you still there? Please say yes or no.", user_input_timeout_seconds: 5 }; }; ``` **Configuring Timeout:** Set `user_input_timeout_seconds` in the speak action: ```typescript onSessionStart: async (event) => { return { type: "speak", session_id: event.session.id, text: "What is your account number?", user_input_timeout_seconds: 5 // Wait 5 seconds for response }; }; ``` **Common Use Cases:** ```typescript // Hangup after multiple timeouts const timeoutCounts = new Map(); onUserInputTimeout: async (event) => { const sessionId = event.session.id; const count = (timeoutCounts.get(sessionId) || 0) + 1; timeoutCounts.set(sessionId, count); if (count >= 3) { return { type: "hangup", session_id: sessionId }; } return { type: "speak", session_id: sessionId, text: `I didn't hear anything. Please respond. Attempt ${count} of 3.`, user_input_timeout_seconds: 5 }; }; // Transfer to agent after timeout onUserInputTimeout: async (event) => { return { type: "speak", session_id: event.session.id, text: "Let me connect you with a live agent who can help you." // Follow with transfer action }; }; ``` ### DtmfReceived Event Triggered when the user presses a key on their phone keypad. 
```typescript
interface AiFlowEventDtmfReceived {
  type: "dtmf_received";
  digit: string; // The key pressed: "0"–"9", "*", or "#"
  session: SessionInfo;
}
```

**Example:**

```typescript
onDtmfReceived: async (event) => {
  console.log(`User pressed: ${event.digit}`);

  if (event.digit === '1') {
    return {
      type: 'transfer',
      session_id: event.session.id,
      target_phone_number: '49211100200',
      caller_id_name: 'Support',
      caller_id_number: '49211100200'
    };
  }

  return {
    type: 'speak',
    session_id: event.session.id,
    text: `You pressed ${event.digit}.`
  };
},
```

**Notes:**

* All standard DTMF tones are supported: `0`–`9`, `*`, `#`
* Each key press triggers a separate event
* DTMF events can occur at any point during the call

### SessionEnd Event

Triggered when the call session ends.

```typescript
interface AiFlowEventSessionEnd {
  type: "session_end";
  session: SessionInfo;
}
```

**Example:**

```typescript
onSessionEnd: async (event) => {
  // Save conversation history
  await saveConversation(event.session.id);

  // Send analytics
  await trackSessionEnd(event.session);
};
```

### Barge-In Detection

User interruptions are detected via the `barged_in` flag in `user_speak` events:

```typescript
interface AiFlowEventUserSpeak {
  type: "user_speak";
  text: string;
  barged_in?: boolean; // true if user interrupted
  session: SessionInfo;
}
```

**Example:**

```typescript
onUserBargeIn: async (event) => {
  // Called automatically when event.barged_in === true
  console.log(`User interrupted with: ${event.text}`);
  return "I'm listening, please continue.";
};
```

## Event Flow

```
┌─────────────────┐
│  session_start  │──> Respond with speak/audio or do nothing
└─────────────────┘

┌─────────────────┐
│   user_speak    │──> Respond with speak/audio/transfer/hangup
│  (barged_in?)   │    Check barged_in flag for interruptions
└─────────────────┘

┌─────────────────┐
│ assistant_speak │──> Optional: track metrics, trigger next action
└─────────────────┘

┌─────────────────┐
│   session_end   │──> Cleanup only, no actions accepted
└─────────────────┘
```

## Event Summary Table

| Event Type | Transport | Description | When Triggered | Can Return Action? |
| ----------------------- | ------------------ | --------------------------- | ------------------------------------------ | ------------------- |
| `session_start` | HTTP + WebSocket | Call session begins | When a new call is initiated | ✅ Yes |
| `user_speech_started` | **WebSocket only** | Speech onset detected | When VAD detects the user starting to speak | ❌ No |
| `user_speak` | HTTP + WebSocket | User speech detected | After speech-to-text completes (includes `barged_in` flag) | ✅ Yes |
| `dtmf_received` | HTTP + WebSocket | DTMF digit pressed | When the user presses a phone key | ✅ Yes |
| `assistant_speak` | HTTP + WebSocket | Assistant started speaking | When TTS playback begins (may be omitted for some TTS models) | ✅ Yes |
| `assistant_speech_ended`| HTTP + WebSocket | Assistant finished speaking | After speech playback ends | ✅ Yes |
| `user_input_timeout` | HTTP + WebSocket | User input timeout reached | When no speech detected after timeout | ✅ Yes |
| `session_end` | HTTP + WebSocket | Call session ends | When the call terminates | ❌ No |

## Type Safety

All events are fully typed.
Import types from the SDK:

```typescript
import type {
  AiFlowEventSessionStart,
  AiFlowEventUserSpeechStarted,
  AiFlowEventUserSpeak,
  AiFlowEventDtmfReceived,
  AiFlowEventAssistantSpeak,
  AiFlowEventAssistantSpeechEnded,
  AiFlowEventUserInputTimeout,
  AiFlowEventSessionEnd,
  AiFlowEventUserBargeIn,
} from "@sipgate/ai-flow-sdk";

onSessionStart: async (event: AiFlowEventSessionStart) => {
  // event is fully typed
  const sessionId: string = event.session.id;
  // ...
};
```

## Next Steps

* **[Action Types](/sdk/actions)** - Learn how to respond to events
* **[API Reference](/sdk/api-reference)** - Complete API documentation

---

---
url: /sipgate-ai-flow-api/sdk/actions.md
---

# Action Types

Complete reference for all actions you can return from event handlers.

## Overview

Actions are responses that tell the AI Flow service what to do next. All actions require a `session_id` and `type` field.

## Base Action Structure

```typescript
interface BaseAction {
  session_id: string; // UUID from the event's session.id
  type: string; // Action type identifier
}
```

## Action Summary

| Action Type | Description | Primary Use Case |
| -------------- | --------------------------- | --------------------------------------- |
| `speak` | Speak text or SSML | Respond to user with synthesized speech |
| `audio` | Play pre-recorded audio | Play hold music, pre-recorded messages |
| `mix_audio` | Loop a background sound mixed into speech | Add ambient noise (café, office, train station) under the agent |
| `hangup` | End the call | Terminate conversation |
| `transfer` | Transfer to another number | Route to human agent or department |
| `barge_in` | Manually interrupt playback | Stop current audio immediately |
| `configure_transcription` | Change STT language(s) mid-call | Switch recognition language without hanging up |

## Speak Action

Speaks text or SSML to the user.

```typescript
interface AiFlowActionSpeak {
  type: "speak";
  session_id: string;

  // Either text OR ssml (not both)
  text?: string; // Plain text to speak
  ssml?: string; // SSML markup for advanced control

  // Optional configurations
  tts?: TtsConfig; // TTS provider settings
  barge_in?: BargeInConfig; // Barge-in behavior
  user_input_timeout_seconds?: number; // Wait this long for the caller to start speaking
  vad?: VadConfig; // Tune end-of-turn silence (advanced — see /sdk/vad)
}
```

**Examples:**

```typescript
// Simple text
return {
  type: "speak",
  session_id: event.session.id,
  text: "Hello, how can I help you?",
};

// With SSML
return {
  type: "speak",
  session_id: event.session.id,
  ssml: `<speak>
    Please listen carefully.
    <break time="500ms" />
    Your account balance is $42.50
  </speak>`,
};

// With custom TTS provider
return {
  type: "speak",
  session_id: event.session.id,
  text: "Hello in a different voice",
  tts: {
    provider: "azure",
    language: "en-US",
    voice: "en-US-JennyNeural",
  },
};
```

## Audio Action

Plays pre-recorded audio to the user.

```typescript
interface AiFlowActionAudio {
  type: "audio";
  session_id: string;
  audio: string; // Base64 encoded WAV (16kHz, mono, 16-bit)
  barge_in?: BargeInConfig;
}
```

**Example:**

```typescript
// Play hold music or pre-recorded message
return {
  type: "audio",
  session_id: event.session.id,
  audio: base64EncodedWavData,
  barge_in: {
    strategy: "minimum_characters",
    minimum_characters: 3,
  },
};
```

**Audio Format Requirements:**

* **Format**: WAV
* **Sample Rate**: 16kHz
* **Channels**: Mono
* **Bit Depth**: 16-bit PCM
* **Encoding**: Base64

## Mix Audio Action

Play a looping background sound (e.g. train station, café, office) under the call.
The loop plays continuously for the rest of the session — both during the assistant's TTS turns and during silences. Sending `mix_audio` again replaces the active loop; sending with `stop: true` removes it. The loop is dropped automatically when the session ends.

```typescript
interface AiFlowActionMixAudio {
  type: "mix_audio";
  session_id: string;

  /** Base64-encoded WAV (16 kHz, mono, 16-bit PCM). Required unless stop=true. */
  audio?: string;

  /** Mix volume for the background loop, 0.0–1.0. Defaults to 0.5. */
  volume?: number;

  /** When true, removes the active background loop. */
  stop?: boolean;
}
```

**Example — start an ambient loop alongside the greeting:**

```typescript
import { readFileSync } from "node:fs";

// Load and base64-encode the loop once at startup
const AMBIENT_AUDIO = readFileSync("./cafe.wav").toString("base64");

onSessionStart: async (event) => {
  return [
    {
      type: "mix_audio",
      session_id: event.session.id,
      audio: AMBIENT_AUDIO,
      volume: 0.3,
    },
    {
      type: "speak",
      session_id: event.session.id,
      text: "Welcome, how can I help you?",
    },
  ];
};
```

**Example — stop the ambient before hanging up:**

```typescript
onUserSpeak: async (event) => {
  if (event.text.toLowerCase().includes("goodbye")) {
    return [
      { type: "mix_audio", session_id: event.session.id, stop: true },
      { type: "speak", session_id: event.session.id, text: "Goodbye!" },
      { type: "hangup", session_id: event.session.id },
    ];
  }
};
```

**Audio Format Requirements:** identical to the `audio` action — WAV, 16 kHz, mono, 16-bit PCM, base64-encoded. Same FFmpeg conversion command applies.

**Best practice — keep ambient quiet.** Background loops should sit *under* the agent's voice. Start around `volume: 0.3` and adjust from there. Loudness-normalize source files to about `-30 LUFS` so different presets stay comparable at a given volume value.

## Hangup Action

Ends the call.

```typescript
interface AiFlowActionHangup {
  type: "hangup";
  session_id: string;
}
```

**Example:**

```typescript
onUserSpeak: async (event) => {
  if (event.text.toLowerCase().includes("goodbye")) {
    return {
      type: "hangup",
      session_id: event.session.id,
    };
  }
};
```

## Transfer Action

Transfers the call to another phone number.

Pass an optional `timeout` to enable **transfer fallback** — if the target doesn't pick up (or rejects / hangs up), the service re-emits `session_start` with the same `session.id` so the agent can handle the call again.

```typescript
interface AiFlowActionTransfer {
  type: "transfer";
  session_id: string;
  target_phone_number: string; // E.164 format without leading + recommended
  caller_id_name: string;
  caller_id_number: string;

  /** Optional transfer timeout in seconds (5–120). Enables transfer fallback. */
  timeout?: number;
}
```

**Example:**

```typescript
// Transfer to sales department — fall back to the agent after 30s of no answer
return {
  type: "transfer",
  session_id: event.session.id,
  target_phone_number: "1234567890",
  caller_id_name: "Sales Department",
  caller_id_number: "1234567890",
  timeout: 30,
};
```

## Barge-In Action

Manually triggers barge-in (interrupts current playback).

```typescript
interface AiFlowActionBargeIn {
  type: "barge_in";
  session_id: string;
}
```

**Example:**

```typescript
// Manually interrupt current playback
return {
  type: "barge_in",
  session_id: event.session.id,
};
```

## Configure Transcription Action

Change the STT (Speech-to-Text) provider and/or recognition language(s) during an active call session without hanging up.
## Barge-In Action

Manually triggers barge-in (interrupts current playback).

```typescript
interface AiFlowActionBargeIn {
  type: "barge_in";
  session_id: string;
}
```

**Example:**

```typescript
// Manually interrupt current playback
return {
  type: "barge_in",
  session_id: event.session.id,
};
```

## Configure Transcription Action

Change the STT (Speech-to-Text) provider and/or recognition language(s) during an active call session without hanging up.

```typescript
import { TranscriptionProvider } from "@sipgate/ai-flow-sdk";

interface AiFlowActionConfigureTranscription {
  type: "configure_transcription";
  session_id: string;
  provider?: TranscriptionProvider; // "AZURE" | "DEEPGRAM" | "ELEVEN_LABS" — omit to keep current
  languages?: string[];             // BCP-47 codes, 1-4 entries — omit to reset to provider default
  custom_vocabulary?: string[];     // Words/phrases to boost STT recognition
  vad?: VadConfig;                  // Session-wide VAD tuning — see /sdk/vad
}
```

Provide at least one of `provider`, `languages`, `custom_vocabulary`, or `vad`. `languages` and `custom_vocabulary` use **full replace** semantics — they never merge with existing settings. Omitting `languages` resets recognition to the provider default.

**Examples:**

```typescript
// Switch to German
return {
  type: "configure_transcription",
  session_id: event.session.id,
  languages: ["de-DE"],
};

// Multi-language detection (German + English)
return {
  type: "configure_transcription",
  session_id: event.session.id,
  languages: ["de-DE", "en-US"],
};

// Switch STT provider to Deepgram
return {
  type: "configure_transcription",
  session_id: event.session.id,
  provider: "DEEPGRAM",
};

// Switch provider AND language in one step
return {
  type: "configure_transcription",
  session_id: event.session.id,
  provider: "DEEPGRAM",
  languages: ["en-US"],
};

// Reset to provider default (automatic detection)
return {
  type: "configure_transcription",
  session_id: event.session.id,
};
```

**Audio gap during restart:** Any change requires the transcription engine to restart. Audio during the restart (~100–500 ms for a language-only change, ~200–800 ms for a provider switch) is dropped.

**Multi-language support depends on the active STT provider:**

* **Azure**: up to 4 languages, all used for simultaneous Language Identification (LID)
* **Deepgram**: multilingual auto-detection across all supplied languages
* **ElevenLabs**: single language only — only the **first** entry is used; additional entries are silently ignored

**Barge-in latency after provider switch** (for the `immediate` strategy):

* **Azure**: ~20–80 ms
* **Deepgram**: ~20–100 ms
* **ElevenLabs**: ~30–120 ms

## Type Safety

All actions are fully typed. Import types from the SDK:

```typescript
import type {
  AiFlowAction,
  AiFlowActionSpeak,
  AiFlowActionAudio,
  AiFlowActionMixAudio,
  AiFlowActionHangup,
  AiFlowActionTransfer,
  AiFlowActionBargeIn,
  AiFlowActionConfigureTranscription,
} from "@sipgate/ai-flow-sdk";

onUserSpeak: async (event) => {
  const action: AiFlowActionSpeak = {
    type: "speak",
    session_id: event.session.id,
    text: "Hello!",
  };
  return action;
};
```

## Next Steps

* **[TTS Providers](/sdk/tts-providers)** - Configure text-to-speech voices
* **[Barge-In Configuration](/sdk/barge-in)** - Control interruption behavior
* **[API Reference](/sdk/api-reference)** - Complete API documentation

---

---
url: /sipgate-ai-flow-api/sdk/tts-providers.md
---

# TTS Providers

Configure text-to-speech providers for different voices and languages.

## Overview

The SDK supports both Azure Cognitive Services and ElevenLabs for high-quality voice synthesis. You can configure TTS providers per action or use default settings.

## Azure Cognitive Services

Azure provides a wide range of neural voices across many languages and regions.
```typescript
interface TtsProviderConfigAzure {
  provider: "azure";
  language?: string; // BCP-47 format (e.g., "en-US", "de-DE")
  voice?: string;    // Voice name (e.g., "en-US-JennyNeural")
}
```

**Examples:**

```typescript
// English (US) - Female
tts: {
  provider: "azure",
  language: "en-US",
  voice: "en-US-JennyNeural"
}

// English (GB) - Female
tts: {
  provider: "azure",
  language: "en-GB",
  voice: "en-GB-SoniaNeural"
}

// German - Male
tts: {
  provider: "azure",
  language: "de-DE",
  voice: "de-DE-ConradNeural"
}

// Spanish - Female
tts: {
  provider: "azure",
  language: "es-ES",
  voice: "es-ES-ElviraNeural"
}
```

### Popular Azure Voices

| Language | Voice Name | Gender | Description |
| -------- | ------------------ | ------ | ---------------------- |
| en-US | en-US-JennyNeural | Female | Friendly, professional |
| en-US | en-US-GuyNeural | Male | Clear, neutral |
| en-GB | en-GB-SoniaNeural | Female | British, professional |
| en-GB | en-GB-RyanNeural | Male | British, friendly |
| de-DE | de-DE-KatjaNeural | Female | Professional, clear |
| de-DE | de-DE-ConradNeural | Male | Deep, authoritative |

**Full Voice List:** See the [Azure TTS documentation](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support) for the complete list of 400+ voices in 140+ languages.

## ElevenLabs

ElevenLabs provides ultra-realistic AI voices optimized for conversational use cases.

```typescript
interface TtsProviderConfigElevenLabs {
  provider: "eleven_labs";
  voice?: string; // Voice ID (e.g., "21m00Tcm4TlvDq8ikWAM") - optional, uses default if omitted
}
```

**Example:**

```typescript
// With specific voice
tts: {
  provider: "eleven_labs",
  voice: "21m00Tcm4TlvDq8ikWAM" // Rachel
}

// With default voice
tts: {
  provider: "eleven_labs"
}
```

### Available ElevenLabs Voices

| Voice Name | ID | Description | Verified Locales |
| ----------- | -------------------- | ------------------------------------------------------------------------- | ---------------------------------- |
| **sipgate** | dSu12TX3MEDQXAarG4s6 | Clean male voice used by sipgate for system announcements (default). | de-DE |
| **Rachel** | 21m00Tcm4TlvDq8ikWAM | Matter-of-fact, personable woman. Great for conversational use cases. | en-US |
| **Sarah** | EXAVITQu4vr4xnSDxMaL | Young adult woman with a confident and warm, mature quality. | en-US, fr-FR, cmn-CN, hi-IN |
| **Laura** | FGY2WhTYpPnrIDTdsKH5 | Young adult female delivers sunny enthusiasm with quirky attitude. | en-US, fr-FR, cmn-CN, de-DE |
| **George** | JBFqnCBsd6RMkjVDRZzb | Warm resonance that instantly captivates listeners. | en-GB, fr-FR, ja-JP, cs-CZ |
| **Thomas** | GBv7mTt0atIp3Br8iCZE | Soft and subdued male, optimal for narrations or meditations. | en-US |
| **Roger** | CwhRBWXzGAHq8TQ4Fs17 | Easy going and perfect for casual conversations. | en-US, fr-FR, de-DE, nl-NL |
| **Eric** | cjVigY5qzO86Huf0OWal | Smooth tenor pitch from a man in his 40s - perfect for agentic use cases. | en-US, fr-FR, de-DE, sk-SK |
| **Brian** | nPczCjzI2devNBz1zQrb | Middle-aged man with resonant and comforting tone. | en-US, cmn-CN, de-DE, nl-NL |
| **Jessica** | cgSgspJ2msm6clMCkdW9 | Young and playful American female, perfect for trendy content. | en-US, fr-FR, ja-JP, cmn-CN, de-DE |
| **Liam** | TX3LPaxmHKxFdv7VOQHJ | Young adult with energy and warmth - suitable for reels and shorts. | en-US, de-DE, cs-CZ, pl-PL, tr-TR |
| **Alice** | Xb7hH8MSUJpSbSDYk0k2 | Clear and engaging, friendly British woman suitable for e-learning. | en-GB, it-IT, fr-FR, ja-JP, pl-PL |
| **Daniel** | onwK4e9ZLuTAKqWW03F9 | Strong voice perfect for professional broadcast or news. | en-GB, de-DE, tr-TR |
| **Lily** | pFZP5JQG7iQjIQuC4Bku | Velvety British female delivers news with warmth and clarity. | it-IT, de-DE, cmn-CN, cs-CZ, nl-NL |
| **River** | SAz9YHcvj6GT2YYXdXww | Relaxed, neutral voice ready for narrations or conversational projects. | en-US, it-IT, fr-FR, cmn-CN |
| **Charlie** | IKne3meq5aSn9XLyUdCD | Young Australian male with confident and energetic voice. | en-AU, cmn-CN, fil-PH |
| **Aria** | 9BWtsMINqrJLrRacOk9x | Middle-aged female with African-American accent. Calm with hint of rasp. | en-US, fr-FR, cmn-CN, tr-TR |
| **Matilda** | XrExE9yKIg1WjnnlVkGX | Professional woman with pleasing alto pitch. Suitable for many use cases. | en-US, it-IT, fr-FR, de-DE |
| **Will** | bIHbv24MWmeRgasZH58o | Conversational and laid back. | en-US, fr-FR, de-DE, cmn-CN, cs-CZ |
| **Chris** | iP95p4xoKVk53GoZ742B | Natural and real, down-to-earth voice great across many use-cases. | en-US, fr-FR, sv-SE, hi-IN |
| **Bill** | pqHfZKP75CvOlQylNhV4 | Friendly and comforting voice ready to narrate stories. | en-US, fr-FR, cmn-CN, de-DE, cs-CZ |

**Note:** 50+ voices are available in total. The full list with audio samples is in the [API reference](/api/tts-providers#available-voices). The SDK includes full TypeScript type definitions for all voice IDs and names.

## Choosing a TTS Provider

### Use Azure when:

* You need support for many languages (140+ languages available)
* You want consistent quality across all locales
* You need specific regional accents or dialects
* Budget is a primary concern

### Use ElevenLabs when:

* You need the most natural, human-like voices
* Conversational quality is critical (phone calls, virtual assistants)
* You're primarily working with English or common European languages
* You want voices with distinct personalities

## Usage Examples

### Per-Action Configuration

```typescript
onUserSpeak: async (event) => {
  return {
    type: "speak",
    session_id: event.session.id,
    text: "Hello in a different voice",
    tts: {
      provider: "azure",
      language: "en-US",
      voice: "en-US-JennyNeural",
    },
  };
}
```

### Using ElevenLabs

```typescript
onUserSpeak: async (event) => {
  return {
    type: "speak",
    session_id: event.session.id,
    text: "Hello from ElevenLabs!",
    tts: {
      provider: "eleven_labs",
      voice: "21m00Tcm4TlvDq8ikWAM", // Rachel
    },
  };
}
```

## Next Steps

* **[Barge-In Configuration](/sdk/barge-in)** - Control interruption behavior
* **[Action Types](/sdk/actions)** - Complete action reference

---

---
url: /sipgate-ai-flow-api/sdk/barge-in.md
---

# Barge-In Configuration

Control how users can interrupt the assistant while speaking.

## Overview

Barge-in allows users to interrupt the assistant's speech. You can configure barge-in behavior for each `speak` or `audio` action.
## Configuration ```typescript interface BargeInConfig { strategy: "none" | "manual" | "minimum_characters" | "immediate"; minimum_characters?: number; // Default: 3 (only for minimum_characters) allow_after_ms?: number; // Delay before allowing interruption } ``` ## Strategies ### `none` Disables barge-in completely. Audio plays fully without interruption. ```typescript barge_in: { strategy: "none" } ``` **Use cases:** * Critical information that must be heard * Legal disclaimers * Emergency instructions **Example:** ```typescript return { type: "speak", session_id: event.session.id, text: "This is important information. Please listen carefully.", barge_in: { strategy: "none", }, }; ``` ### `manual` Allows manual barge-in via API only (no automatic detection). ```typescript barge_in: { strategy: "manual" } ``` **Use cases:** * Custom interruption logic * Button-triggered interruption * External event-based interruption **Example:** ```typescript return { type: "speak", session_id: event.session.id, text: "Press a button to interrupt.", barge_in: { strategy: "manual", }, }; ``` ### `minimum_characters` Automatically detects barge-in when user speech exceeds character threshold. ```typescript barge_in: { strategy: "minimum_characters", minimum_characters: 5, // Trigger after 5 characters allow_after_ms: 500 // Wait 500ms before allowing interruption } ``` **Use cases:** * Natural conversation flow * Customer service scenarios * Interactive voice menus **Example:** ```typescript return { type: "speak", session_id: event.session.id, text: "How can I help you today?", barge_in: { strategy: "minimum_characters", minimum_characters: 3, }, }; ``` ### `immediate` ⚡ NEW **Most responsive option** - Interrupts immediately when user starts speaking using Voice Activity Detection (VAD). ```typescript barge_in: { strategy: "immediate", allow_after_ms: 500 // Optional: protect first 500ms } ``` **How it works:** * **Azure/Deepgram**: Uses Voice Activity Detection (VAD) - triggers before any text is recognized * **ElevenLabs**: Uses first partial transcript * **Latency**: 20-100ms (2-4x faster than `minimum_characters`) * **No text required**: Interrupts on voice detection, not transcription **Use cases:** * High-priority conversations requiring instant responsiveness * Natural dialogue where interruptions should feel seamless * Customer service where quick response matters * Urgent or time-sensitive interactions **Example:** ```typescript onUserSpeak: async (event) => { return { type: "speak", session_id: event.session.id, text: "I can help you with billing, support, or sales. What would you like?", barge_in: { strategy: "immediate", allow_after_ms: 500, // Protect first 500ms from accidental noise }, }; } ``` **Comparison:** | Strategy | Trigger | Latency | Use Case | |----------|---------|---------|----------| | `immediate` | Voice Activity (VAD) | 20-100ms | Most natural, instant response | | `minimum_characters` | Text recognition | 50-200ms | Balanced reliability | | `manual` | API call | N/A | Custom logic | | `none` | Never | N/A | Critical info only | **Best practices:** * Use `allow_after_ms: 500-1000` to prevent accidental interruptions * Test with real users to find optimal settings * Consider background noise in your environment ### Protection Period You can add a protection period to prevent interruption during critical parts of speech: ```typescript return { type: "speak", session_id: event.session.id, text: "Your account number is 1234567890. 
Please write this down.", barge_in: { strategy: "minimum_characters", minimum_characters: 10, // Require substantial speech allow_after_ms: 2000, // Protect first 2 seconds }, }; ``` ## Configuration Options ### `minimum_characters` The minimum number of characters the user must speak before barge-in is triggered. * **Default**: `3` * **Range**: `1` to `100` * **Use**: Higher values require more speech before interruption ### `allow_after_ms` Delay in milliseconds before barge-in is allowed. This creates a "protection period" at the start of speech. * **Default**: `0` (immediate) * **Range**: `0` to `10000` (10 seconds) * **Use**: Prevent interruption during critical information ## Examples ### Natural Conversation ```typescript onUserSpeak: async (event) => { return { type: "speak", session_id: event.session.id, text: "I can help you with billing, support, or sales. What would you like?", barge_in: { strategy: "minimum_characters", minimum_characters: 3, }, }; } ``` ### Critical Information ```typescript onUserSpeak: async (event) => { return { type: "speak", session_id: event.session.id, text: "Your verification code is 1-2-3-4-5-6. Please write this down.", barge_in: { strategy: "none", // Don't allow interruption }, }; } ``` ### Protected Announcement ```typescript onSessionStart: async (event) => { return { type: "speak", session_id: event.session.id, text: "Welcome! Your call may be recorded for quality assurance.", barge_in: { strategy: "minimum_characters", minimum_characters: 5, allow_after_ms: 3000, // Protect first 3 seconds }, }; } ``` ## Best Practices 1. **Use `none` sparingly** - Only for truly critical information 2. **Choose the right strategy**: * `immediate` - For most natural, responsive conversations * `minimum_characters` - For balance between responsiveness and reliability * `manual` - For custom logic * `none` - For critical announcements only 3. **Set protection periods** - Use `allow_after_ms: 500-1000` to prevent cutting off important intro 4. **Test with users** - Find the right balance for your use case 5. **Consider noise** - `immediate` may trigger on background noise; use `allow_after_ms` as buffer ## Related: VAD Configuration Barge-in controls *whether the caller may interrupt the assistant while it is speaking*. The related [VAD Configuration](/sdk/vad) controls *how long the caller may pause before their turn is considered finished*. Both can be set on the same `speak` action. ## Next Steps * **[Action Types](/sdk/actions)** - Complete action reference * **[VAD Configuration](/sdk/vad)** - Tune end-of-turn silence * **[API Reference](/sdk/api-reference)** - Full API documentation --- --- url: /sipgate-ai-flow-api/sdk/vad.md --- # VAD (Voice Activity Detection) Configuration Optional advanced setting that lets you tune how long the system waits in silence before treating the caller's turn as finished. When omitted, the system default applies. ::: warning Optional advanced setting Only set `vad` when you have a concrete use case where the system's default end-of-turn timing is too eager or too patient. ::: ## Type ```typescript interface VadConfig { /** * Milliseconds of silence after the caller stops speaking before their turn * is considered finished. Recommended range 150–2000. * Lower values yield faster turn-taking; higher values tolerate longer pauses. */ end_of_turn_silence_ms?: number; } ``` ## Where to set it `VadConfig` is accepted on two action types: * **`speak.vad`** — applies to the caller's reply that follows. Persists until overridden. 
* **`configure_transcription.vad`** — applies for the rest of the session. ## Lenient validation If you send an out-of-range, non-integer, or otherwise invalid value, the field is **silently ignored** — the system default takes over. Your action still runs normally; only the bad VAD value is dropped. This avoids breaking call flows over a typo. ## Example: tolerate long pauses (e.g. spelling) ```typescript return { type: "speak", session_id: event.session.id, text: "Please spell your last name, letter by letter.", vad: { end_of_turn_silence_ms: 1500, }, }; ``` ## Example: snappy back-and-forth ```typescript return { type: "speak", session_id: event.session.id, text: "Did you mean account number 1234?", vad: { end_of_turn_silence_ms: 250, }, }; ``` ## Example: set once for the whole session ```typescript return { type: "configure_transcription", session_id: event.session.id, vad: { end_of_turn_silence_ms: 1000, }, }; ``` ## VAD vs Barge-In `vad` and [`barge_in`](/sdk/barge-in) are related but distinct: * **`vad`** governs *when the caller's turn is considered finished*. * **`barge_in`** governs *whether and how the caller may interrupt the assistant while it is speaking*. Both can be set on the same `speak` action. ## Next Steps * **[Action Types](/sdk/actions)** - Complete action reference * **[Barge-In Configuration](/sdk/barge-in)** - Control caller interruptions * **[API Reference](/sdk/api-reference)** - Full API documentation --- --- url: /sipgate-ai-flow-api/sdk/integrations/express.md --- # Express.js Integration Complete guide for integrating the SDK with Express.js. ## Basic Setup The simplest way to use the SDK with Express.js: ```typescript import express from "express"; import { AiFlowAssistant } from "@sipgate/ai-flow-sdk"; const app = express(); app.use(express.json()); const assistant = AiFlowAssistant.create({ onSessionStart: async (event) => { return "Welcome! How can I help you today?"; }, onUserSpeak: async (event) => { return processUserInput(event.text); }, onSessionEnd: async (event) => { await cleanupSession(event.session.id); }, }); // Webhook endpoint app.post("/webhook", assistant.express()); const PORT = process.env.PORT || 3000; app.listen(PORT, () => { console.log(`AI Flow assistant running on port ${PORT}`); }); ``` ## Complete Example Here's a complete example with error handling and logging: ```typescript import express from "express"; import { AiFlowAssistant } from "@sipgate/ai-flow-sdk"; const app = express(); app.use(express.json()); const assistant = AiFlowAssistant.create({ debug: process.env.NODE_ENV !== "production", onSessionStart: async (event) => { console.log(`Session started: ${event.session.id}`); return "Welcome! How can I help you today?"; }, onUserSpeak: async (event) => { try { return await processUserInput(event.text); } catch (error) { console.error("Error processing input:", error); return "I'm sorry, I encountered an error. 
Please try again."; } }, onSessionEnd: async (event) => { await cleanupSession(event.session.id); }, }); // Webhook endpoint app.post("/webhook", assistant.express()); // Health check app.get("/health", (req, res) => { res.json({ status: "ok" }); }); const PORT = process.env.PORT || 3000; app.listen(PORT, () => { console.log(`AI Flow assistant running on port ${PORT}`); }); ``` ## Error Handling The `express()` middleware automatically handles errors, but you can add custom error handling: ```typescript app.post("/webhook", (req, res, next) => { assistant.express()(req, res).catch(next); }); app.use((err, req, res, next) => { console.error("Error:", err); res.status(500).json({ error: "Internal server error" }); }); ``` ## Authentication Add authentication middleware before the webhook: ```typescript app.post("/webhook", authenticate, assistant.express()); function authenticate(req, res, next) { const apiKey = req.headers["x-api-key"]; if (apiKey !== process.env.API_KEY) { return res.status(401).json({ error: "Unauthorized" }); } next(); } ``` ## Multiple Endpoints You can use multiple assistants for different endpoints: ```typescript const salesAssistant = AiFlowAssistant.create({ onUserSpeak: async (event) => { return "Welcome to sales!"; }, }); const supportAssistant = AiFlowAssistant.create({ onUserSpeak: async (event) => { return "Welcome to support!"; }, }); app.post("/webhook/sales", salesAssistant.express()); app.post("/webhook/support", supportAssistant.express()); ``` ## Next Steps * **[WebSocket Integration](/sdk/integrations/websocket)** - WebSocket integration guide * **[Examples](/sdk/examples)** - More integration examples --- --- url: /sipgate-ai-flow-api/sdk/integrations/websocket.md --- # WebSocket Integration Complete guide for integrating the SDK with WebSocket. 
## Basic Setup ```typescript import WebSocket from "ws"; import { AiFlowAssistant } from "@sipgate/ai-flow-sdk"; const wss = new WebSocket.Server({ port: 8080, perMessageDeflate: false, }); const assistant = AiFlowAssistant.create({ onUserSpeak: async (event) => { return "Hello from WebSocket!"; }, }); wss.on("connection", (ws, req) => { console.log("New WebSocket connection"); ws.on("message", assistant.ws(ws)); ws.on("error", (error) => { console.error("WebSocket error:", error); }); ws.on("close", () => { console.log("WebSocket connection closed"); }); }); console.log("WebSocket server listening on port 8080"); ``` ## Complete Example Here's a complete example with error handling and connection management: ```typescript import WebSocket from "ws"; import { AiFlowAssistant } from "@sipgate/ai-flow-sdk"; const wss = new WebSocket.Server({ port: 8080, perMessageDeflate: false, }); const assistant = AiFlowAssistant.create({ debug: true, onSessionStart: async (event) => { console.log(`Session started: ${event.session.id}`); return "Welcome!"; }, onUserSpeak: async (event) => { return processUserInput(event.text); }, onSessionEnd: async (event) => { console.log(`Session ended: ${event.session.id}`); }, }); wss.on("connection", (ws, req) => { console.log("New WebSocket connection from", req.socket.remoteAddress); // Handle messages ws.on("message", async (data) => { try { await assistant.ws(ws)(data); } catch (error) { console.error("Error processing message:", error); ws.send(JSON.stringify({ error: "Internal server error" })); } }); // Error handling ws.on("error", (error) => { console.error("WebSocket error:", error); }); // Connection cleanup ws.on("close", (code, reason) => { console.log(`Connection closed: ${code} - ${reason}`); }); // Send welcome message ws.send(JSON.stringify({ type: "connected" })); }); console.log("WebSocket server listening on port 8080"); ``` ## Message Format The SDK expects messages in JSON format: ```typescript { "type": "session_start", "session": { "id": "uuid", "account_id": "account-id", "phone_number": "1234567890", // ... } } ``` ## Custom Message Handling You can handle messages manually: ```typescript wss.on("connection", (ws) => { ws.on("message", async (data) => { try { const event = JSON.parse(data.toString()); const action = await assistant.onEvent(event); if (action) { ws.send(JSON.stringify(action)); } } catch (error) { console.error("Error:", error); } }); }); ``` ## Connection Management Track active connections: ```typescript const connections = new Map(); wss.on("connection", (ws, req) => { const connectionId = generateId(); connections.set(connectionId, ws); ws.on("close", () => { connections.delete(connectionId); }); }); ``` ## Next Steps * **[Express.js Integration](/sdk/integrations/express)** - Express.js integration guide * **[Examples](/sdk/examples)** - More integration examples --- --- url: /sipgate-ai-flow-api/sdk/outbound-calls.md --- # Outbound Calls Initiate outbound calls directly from your assistant using `assistant.call()`. ::: warning Access Required Outbound calls are **only available upon request** and after a positive review by sipgate support. Please contact support to request access before using this feature. ::: ## Setup Pass `token` when creating the assistant. `baseUrl` is optional and defaults to `https://api.sipgate.com`. 
```typescript import { AiFlowAssistant } from "@sipgate/ai-flow-sdk"; const assistant = AiFlowAssistant.create({ token: process.env.SIPGATE_TOKEN, onSessionStart: async (event) => { if (event.session.direction === "outbound") { return "Hello! This is an automated call. Do you have a moment?"; } return "Hello! How can I help you today?"; }, onUserSpeak: async (event) => { return processWithLLM(event.text); }, }); ``` ## Initiating a Call ```typescript await assistant.call({ aiFlowId: "e3670012-96a3-4ae5-ac42-87abe22015c3", billingDevice: "e2", // provided by sipgate support during onboarding toPhoneNumber: "4915790000687", // E.164 format without leading + }); ``` | Parameter | Type | Description | |-----------------|--------|---------------------------------------------------| | `aiFlowId` | string | ID of the AI flow to use for the call | | `billingDevice` | string | Billing device suffix, provided during onboarding | | `toPhoneNumber` | string | Target phone number in E.164 format without leading + | `call()` resolves when the call has been successfully initiated (`201 Created`). It throws if the request fails. ## Handling the Session Once the recipient answers, the normal event flow begins. Your existing handlers (`onSessionStart`, `onUserSpeak`, etc.) are called exactly as for inbound calls. Check `event.session.direction` to distinguish outbound from inbound sessions: ```typescript onSessionStart: async (event) => { if (event.session.direction === "outbound") { // Your assistant placed this call return "Hi, I'm calling from Example Corp regarding your appointment."; } // Inbound call return "Hello! How can I help you?"; }, ``` The `direction` field is available on the `session` object of every event. ## Complete Example ```typescript import { AiFlowAssistant } from "@sipgate/ai-flow-sdk"; import express from "express"; const assistant = AiFlowAssistant.create({ token: process.env.SIPGATE_TOKEN, onSessionStart: async (event) => { if (event.session.direction === "outbound") { return "Hello! This is an automated reminder from Example Corp. Your appointment is tomorrow at 10am. Press 1 to confirm or say 'cancel' to cancel."; } return "Hello! How can I help you?"; }, onUserSpeak: async (event) => { const text = event.text.toLowerCase(); if (text.includes("confirm") || text.includes("1")) { return [ { type: "speak", session_id: event.session.id, text: "Great, your appointment is confirmed. Goodbye!" }, { type: "hangup", session_id: event.session.id }, ]; } if (text.includes("cancel")) { await cancelAppointment(event.session.id); return [ { type: "speak", session_id: event.session.id, text: "Your appointment has been cancelled. Goodbye!" }, { type: "hangup", session_id: event.session.id }, ]; } return "I didn't catch that. Say 'confirm' to confirm or 'cancel' to cancel your appointment."; }, }); // Webhook server (receives events when calls connect) const app = express(); app.use(express.json()); app.post("/webhook", assistant.express()); app.listen(3000); // Initiate the call await assistant.call({ aiFlowId: process.env.AI_FLOW_ID!, billingDevice: "e2", toPhoneNumber: "4915790000687", }); ``` ## Next Steps * **[API Reference](/sdk/api-reference)** — `call()` method and all options * **[Outbound Calls (API)](/api/guides/outbound-calls)** — raw HTTP reference * **[Event Types](/sdk/events)** — complete event reference --- --- url: /sipgate-ai-flow-api/sdk/examples.md --- # Examples Real-world examples and use cases. 
## Customer Service Bot A complete customer service bot with state management and routing: ```typescript import { AiFlowAssistant, BargeInStrategy } from "@sipgate/ai-flow-sdk"; import express from "express"; // Session state management const sessions = new Map(); const assistant = AiFlowAssistant.create({ debug: true, onSessionStart: async (event) => { // Initialize session state sessions.set(event.session.id, { state: "greeting", data: { attempts: 0 }, }); return { type: "speak", session_id: event.session.id, text: "Welcome to customer support. How can I help you today? You can ask about billing, technical support, or sales.", barge_in: { strategy: "minimum_characters", minimum_characters: 3, }, }; }, onUserSpeak: async (event) => { const session = sessions.get(event.session.id); if (!session) return null; const text = event.text.toLowerCase(); // Intent routing if (text.includes("billing") || text.includes("invoice")) { return { type: "transfer", session_id: event.session.id, target_phone_number: "1234567890", caller_id_name: "Billing Department", caller_id_number: "1234567890", }; } if (text.includes("goodbye") || text.includes("bye")) { return { type: "speak", session_id: event.session.id, text: "Thank you for calling. Have a great day!", barge_in: { strategy: "none" }, // Don't allow interruption }; } if (text.includes("technical") || text.includes("support")) { session.state = "technical_support"; return "I'll connect you with our technical support team. Please describe your issue."; } // Default response session.data.attempts++; if (session.data.attempts > 2) { return "I'm having trouble understanding. Let me transfer you to a representative."; } return "I can help with billing, technical support, or sales. Which would you like?"; }, onUserBargeIn: async (event) => { console.log(`User interrupted: ${event.text}`); return "Yes, I'm listening."; }, onSessionEnd: async (event) => { // Cleanup session state sessions.delete(event.session.id); console.log(`Session ${event.session.id} ended`); }, }); const app = express(); app.use(express.json()); app.post("/webhook", assistant.express()); app.listen(3000, () => { console.log("Customer service bot running on port 3000"); }); ``` ## Multi-Language Support Switch languages based on user preference: ```typescript const sessions = new Map(); const assistant = AiFlowAssistant.create({ onSessionStart: async (event) => { sessions.set(event.session.id, { language: "en" }); return "Welcome! Say 'deutsch' for German or 'english' for English."; }, onUserSpeak: async (event) => { const session = sessions.get(event.session.id); if (!session) return null; const text = event.text.toLowerCase(); if (text.includes("deutsch") || text.includes("german")) { session.language = "de"; return { type: "speak", session_id: event.session.id, text: "Willkommen! Wie kann ich Ihnen helfen?", tts: { provider: "azure", language: "de-DE", voice: "de-DE-KatjaNeural", }, }; } if (text.includes("english") || text.includes("englisch")) { session.language = "en"; return "Welcome! 
How can I help you?"; } // Continue in selected language if (session.language === "de") { return { type: "speak", session_id: event.session.id, text: "Wie kann ich Ihnen helfen?", tts: { provider: "azure", language: "de-DE", voice: "de-DE-KatjaNeural", }, }; } return "How can I help you?"; }, }); ``` ## User Input Timeout Handling Handle scenarios where users don't respond within a specified time period: ```typescript import { AiFlowAssistant } from "@sipgate/ai-flow-sdk"; import express from "express"; // Track timeout counts per session const timeoutCounts = new Map(); const assistant = AiFlowAssistant.create({ debug: true, onSessionStart: async (event) => { // Initialize timeout counter timeoutCounts.set(event.session.id, 0); return { type: "speak", session_id: event.session.id, text: "Welcome to our automated assistant. What can I help you with today?", user_input_timeout_seconds: 8 // Wait 8 seconds for initial response }; }, onUserSpeak: async (event) => { // Reset timeout counter on successful user input timeoutCounts.set(event.session.id, 0); const text = event.text.toLowerCase(); if (text.includes("account") || text.includes("balance")) { return { type: "speak", session_id: event.session.id, text: "Please tell me your account number.", user_input_timeout_seconds: 10 // Give more time for account number }; } if (text.includes("speak") || text.includes("agent") || text.includes("human")) { return { type: "speak", session_id: event.session.id, text: "Let me transfer you to a live agent. Please hold." // Follow with transfer action }; } return { type: "speak", session_id: event.session.id, text: "I can help you with account information, billing questions, or connect you to an agent. What would you like?", user_input_timeout_seconds: 8 }; }, onUserInputTimeout: async (event) => { const sessionId = event.session.id; const count = (timeoutCounts.get(sessionId) || 0) + 1; timeoutCounts.set(sessionId, count); console.log(`Timeout #${count} for session ${sessionId}`); // After 3 timeouts, offer to transfer to agent if (count >= 3) { return { type: "speak", session_id: sessionId, text: "I'm having trouble hearing you. Let me transfer you to a live agent who can better assist you." // Could follow with transfer action }; } // After 2 timeouts, give clearer instructions if (count === 2) { return { type: "speak", session_id: sessionId, text: "I still haven't heard your response. Please speak clearly after the beep. Say 'agent' if you'd like to speak to a person.", user_input_timeout_seconds: 10 // Give more time }; } // First timeout - gentle prompt return { type: "speak", session_id: sessionId, text: "I didn't catch that. Are you still there? 
Please let me know how I can help you.", user_input_timeout_seconds: 8 }; }, onSessionEnd: async (event) => { // Cleanup timeout counters timeoutCounts.delete(event.session.id); console.log(`Session ${event.session.id} ended`); }, }); const app = express(); app.use(express.json()); app.post("/webhook", assistant.express()); app.listen(3000, () => { console.log("Timeout-aware assistant running on port 3000"); }); ``` ### Advanced Timeout Strategy Context-aware timeout handling with different strategies based on the conversation state: ```typescript interface SessionData { state: "greeting" | "collecting_info" | "confirming" | "completing"; timeouts: number; data: Record<string, string>; } const sessions = new Map<string, SessionData>(); const assistant = AiFlowAssistant.create({ onSessionStart: async (event) => { sessions.set(event.session.id, { state: "greeting", timeouts: 0, data: {} }); return { type: "speak", session_id: event.session.id, text: "Hello! To help you with your order, I'll need some information. What's your order number?", user_input_timeout_seconds: 10 }; }, onUserSpeak: async (event) => { const session = sessions.get(event.session.id); if (!session) return null; // Reset timeout counter on successful input session.timeouts = 0; const text = event.text; if (session.state === "greeting") { session.data.orderNumber = text; session.state = "collecting_info"; return { type: "speak", session_id: event.session.id, text: `Thank you. Order number ${text} received. Can you verify your email address?`, user_input_timeout_seconds: 10 }; } if (session.state === "collecting_info") { session.data.email = text; session.state = "confirming"; return { type: "speak", session_id: event.session.id, text: `Perfect. Let me look up order ${session.data.orderNumber} for ${text}. One moment please.`, user_input_timeout_seconds: 5 // Shorter timeout for confirmation }; } return "How else can I help you?"; }, onUserInputTimeout: async (event) => { const session = sessions.get(event.session.id); if (!session) return null; session.timeouts++; // Different strategies based on conversation state switch (session.state) { case "greeting": if (session.timeouts >= 2) { return { type: "speak", session_id: event.session.id, text: "I'm having trouble hearing your order number. Let me transfer you to someone who can help.", // Follow with transfer }; } return { type: "speak", session_id: event.session.id, text: "I didn't hear your order number. Please say or spell it out for me.", user_input_timeout_seconds: 12 // Give extra time }; case "collecting_info": return { type: "speak", session_id: event.session.id, text: "I need your email address to proceed. Please provide it now, or say 'skip' to continue without it.", user_input_timeout_seconds: 10 }; case "confirming": // Just continue with the process session.state = "completing"; return { type: "speak", session_id: event.session.id, text: "I found your order. Your package is scheduled for delivery tomorrow.
Is there anything else I can help with?", user_input_timeout_seconds: 8 }; default: if (session.timeouts >= 3) { return { type: "hangup", session_id: event.session.id }; } return { type: "speak", session_id: event.session.id, text: "Are you still there?", user_input_timeout_seconds: 5 }; } }, onSessionEnd: async (event) => { sessions.delete(event.session.id); }, }); ``` ## Next Steps * **[Integration Guides](/sdk/integrations/express)** - Detailed integration guides * **[API Reference](/sdk/api-reference)** - Complete API documentation --- --- url: /sipgate-ai-flow-api/sdk/advanced/direct-integration.md --- # Working Without the Assistant Wrapper If you prefer to work directly with the SDK's event and action system without using the `AiFlowAssistant` wrapper, you can manually handle events and construct actions. ## Direct Event Handling Here's how to handle events and construct actions without the assistant wrapper: ```typescript import express from "express"; import { AiFlowEventType, AiFlowActionType } from "@sipgate/ai-flow-sdk"; const app = express(); app.use(express.json()); app.post("/webhook", async (req, res) => { const event = req.body; let action = null; switch (event.type) { case "session_start": action = { type: AiFlowActionType.SPEAK, session_id: event.session.id, text: "Welcome to our service!", barge_in: { strategy: "minimum_characters", minimum_characters: 5, }, }; break; case "user_speak": // Check if user interrupted (barge-in) if (event.barged_in) { console.log(`User interrupted with: ${event.text}`); action = { type: AiFlowActionType.SPEAK, session_id: event.session.id, text: "I'm listening, go ahead.", }; break; } // Normal user speech handling if (event.text.toLowerCase().includes("transfer")) { action = { type: AiFlowActionType.TRANSFER, session_id: event.session.id, target_phone_number: "1234567890", caller_id_name: "Support", caller_id_number: "1234567890", }; } else if (event.text.toLowerCase().includes("goodbye")) { action = { type: AiFlowActionType.HANGUP, session_id: event.session.id, }; } else { action = { type: AiFlowActionType.SPEAK, session_id: event.session.id, text: `You said: ${event.text}`, }; } break; case "assistant_speak": console.log(`Spoke for ${event.duration_ms}ms`); // Optional: track metrics, no action needed break; case "session_end": console.log(`Session ${event.session.id} ended`); // Cleanup logic, no action needed break; } // Return action if one was created if (action) { res.json(action); } else { res.status(204).send(); } }); app.listen(3000, () => { console.log("Webhook server listening on port 3000"); }); ``` ## Benefits of Direct Integration * **Full Control** - Complete control over event handling * **Custom Logic** - Easier to implement complex routing logic * **No Abstraction** - Direct access to events and actions * **Flexibility** - Can integrate with any framework or system ## When to Use Direct Integration Use direct integration when: * You need custom event processing logic * You're integrating with a non-standard framework * You want to implement your own state management * You need fine-grained control over responses ## Next Steps * **[Complete Event Reference](/sdk/advanced/events)** - All event types * **[Complete Action Reference](/sdk/advanced/actions)** - All action types * **[Validation with Zod](/sdk/advanced/validation)** - Runtime validation --- --- url: /sipgate-ai-flow-api/sdk/advanced/events.md --- # Complete Event Reference Complete reference for all events in the SDK. 
## Base Event Structure

All events extend the base event structure:

```typescript
interface BaseEvent {
  session: {
    id: string;           // UUID of the session
    account_id: string;   // Account identifier
    phone_number: string; // Phone number for this flow session
    direction?: "inbound" | "outbound"; // Call direction
    from_phone_number: string; // Phone number of the caller
    to_phone_number: string;   // Phone number of the callee
  };
}
```

## All Event Types

| Event Type | Description | When Triggered |
| ----------------- | --------------------------- | ------------------------------------------ |
| `session_start` | Call session begins | When a new call is initiated |
| `user_speak` | User speech detected | After speech-to-text completes (includes `barged_in` flag) |
| `assistant_speak` | Details of the assistant's finished speech (text, duration) | After TTS playback completes |
| `assistant_speech_ended` | Assistant audio playback ended | As soon as speech playback ends |
| `session_end` | Call session ends | When the call terminates |

The SDK delivers further events not listed here (for example `user_speech_started`, `dtmf_received`, `user_input_timeout`, and `user_barge_in`); see [Event Types](/sdk/events) for the complete reference.

## Event Type Definitions

### session\_start

```typescript
interface AiFlowEventSessionStart {
  type: "session_start";
  session: {
    id: string;
    account_id: string;
    phone_number: string; // Phone number for this flow session
    direction?: "inbound" | "outbound"; // Call direction
    from_phone_number: string;
    to_phone_number: string;
  };
}
```

### user\_speak

```typescript
interface AiFlowEventUserSpeak {
  type: "user_speak";
  text: string;        // Recognized speech text
  barged_in?: boolean; // true if user interrupted assistant
  session: SessionInfo;
}
```

The `barged_in` flag is set to `true` when the user interrupts the assistant mid-speech.

### assistant\_speak

```typescript
interface AiFlowEventAssistantSpeak {
  type: "assistant_speak";
  text?: string; // Text that was spoken
  ssml?: string; // SSML that was used (if applicable)
  duration_ms: number;       // Duration of speech in milliseconds
  speech_started_at: number; // Unix timestamp (ms) when speech started
  session: SessionInfo;
}
```

### assistant\_speech\_ended

```typescript
interface AiFlowEventAssistantSpeechEnded {
  type: "assistant_speech_ended";
  session: SessionInfo;
}
```

### session\_end

```typescript
interface AiFlowEventSessionEnd {
  type: "session_end";
  session: SessionInfo;
}
```

## Type Safety

All events are fully typed. Import types from the SDK:

```typescript
import type {
  AiFlowEventSessionStart,
  AiFlowEventUserSpeak,
  AiFlowEventAssistantSpeak,
  AiFlowEventAssistantSpeechEnded,
  AiFlowEventSessionEnd,
} from "@sipgate/ai-flow-sdk";
```
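Because `type` is the discriminant of these interfaces, a `switch` statement narrows each event automatically. A minimal sketch using only the types above; the local `KnownEvent` union is illustrative and not an SDK export:

```typescript
import type {
  AiFlowEventUserSpeak,
  AiFlowEventAssistantSpeak,
  AiFlowEventSessionEnd,
} from "@sipgate/ai-flow-sdk";

// Illustrative local union; the SDK's full event set covers more types.
type KnownEvent =
  | AiFlowEventUserSpeak
  | AiFlowEventAssistantSpeak
  | AiFlowEventSessionEnd;

function describe(event: KnownEvent): string {
  switch (event.type) {
    case "user_speak":
      return `User said: ${event.text}`; // narrowed to AiFlowEventUserSpeak
    case "assistant_speak":
      return `Assistant spoke for ${event.duration_ms} ms`;
    case "session_end":
      return `Session ${event.session.id} ended`;
  }
}
```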
## Next Steps

* **[Complete Action Reference](/sdk/advanced/actions)** - All action types
* **[Direct Integration](/sdk/advanced/direct-integration)** - Working without the wrapper

---

---
url: /sipgate-ai-flow-api/sdk/advanced/actions.md
---

# Complete Action Reference

Complete reference for all actions in the SDK.

## Base Action Structure

All actions require a `session_id` and `type` field:

```typescript
interface BaseAction {
  session_id: string; // UUID from the event's session.id
  type: string;       // Action type identifier
}
```

## All Action Types

| Action Type | Description | Primary Use Case |
| -------------- | --------------------------- | --------------------------------------- |
| `speak` | Speak text or SSML | Respond to user with synthesized speech |
| `audio` | Play pre-recorded audio | Play hold music, pre-recorded messages |
| `mix_audio` | Loop a background sound mixed into speech | Add ambient noise (café, office, train station) under the agent |
| `hangup` | End the call | Terminate conversation |
| `transfer` | Transfer to another number | Route to human agent or department |
| `barge_in` | Manually interrupt playback | Stop current audio immediately |
| `configure_transcription` | Change STT language(s) mid-call | Switch recognition language without hanging up |

## Action Type Definitions

### speak - Text-to-speech response

```typescript
interface AiFlowActionSpeak {
  type: "speak";
  session_id: string;

  // Provide either text OR ssml (not both)
  text?: string;
  ssml?: string;

  // Optional TTS configuration
  tts?: {
    provider: "azure";
    language?: string; // e.g., "en-US", "de-DE"
    voice?: string;    // Azure voice name
  } | {
    provider: "eleven_labs";
    voice?: string; // ElevenLabs voice ID (optional, uses default if omitted)
  };

  barge_in?: {
    strategy: "none" | "manual" | "minimum_characters" | "immediate";
    minimum_characters?: number; // Default: 3
    allow_after_ms?: number;     // Delay before allowing interruption
  };
}
```

### audio - Play pre-recorded audio

```typescript
interface AiFlowActionAudio {
  type: "audio";
  session_id: string;
  audio: string; // Base64 encoded WAV (16kHz, mono, 16-bit PCM)
  barge_in?: {
    strategy: "none" | "manual" | "minimum_characters" | "immediate";
    minimum_characters?: number;
    allow_after_ms?: number;
  };
}
```

### mix\_audio - Loop a background sound under outbound speech

```typescript
interface AiFlowActionMixAudio {
  type: "mix_audio";
  session_id: string;
  audio?: string;  // Base64 WAV (16 kHz, mono, 16-bit PCM); required unless stop=true
  volume?: number; // 0.0–1.0, default 0.5
  stop?: boolean;  // true to remove the active loop
}
```

The loop plays continuously for the rest of the call — under TTS during turns and on its own during silences. Sending `mix_audio` again replaces the loop. The loop is dropped automatically when the session ends.

### hangup - End call

```typescript
interface AiFlowActionHangup {
  type: "hangup";
  session_id: string;
}
```

### transfer - Transfer call

```typescript
interface AiFlowActionTransfer {
  type: "transfer";
  session_id: string;
  target_phone_number: string; // E.164 format recommended
  caller_id_name: string;
  caller_id_number: string;
  /**
   * Optional transfer timeout in seconds (5–120). When set, a failed transfer
   * returns the call to the agent via a new `session_start` event for the
   * same session id (transfer fallback). Omit for legacy behavior where a
   * failed transfer ends the call.
   */
  timeout?: number;
}
```

### barge\_in - Manual interrupt

```typescript
interface AiFlowActionBargeIn {
  type: "barge_in";
  session_id: string;
}
```

### configure\_transcription - Change STT language mid-call

```typescript
interface AiFlowActionConfigureTranscription {
  type: "configure_transcription";
  session_id: string;
  provider?: "AZURE" | "DEEPGRAM" | "ELEVEN_LABS"; // Omit to keep current provider.
  languages?: string[]; // BCP-47 codes, 1-4 entries. Omit to reset to provider default.
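  custom_vocabulary?: string[]; // Words/phrases to boost STT recognition (full replace, never merged).
  vad?: { end_of_turn_silence_ms?: number }; // Session-wide VAD tuning (see /sdk/vad).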
} ``` > **Multi-language support:** Azure uses all supplied language codes for simultaneous detection (up to 4). Deepgram performs multilingual auto-detection across the supplied languages. ElevenLabs accepts only a single language — when multiple codes are provided, only the **first** is used and the rest are silently ignored. ## Type Safety All actions are fully typed. Import types from the SDK: ```typescript import type { AiFlowAction, AiFlowActionSpeak, AiFlowActionAudio, AiFlowActionHangup, AiFlowActionTransfer, AiFlowActionBargeIn, AiFlowActionConfigureTranscription, } from "@sipgate/ai-flow-sdk"; ``` ## Next Steps * **[Complete Event Reference](/sdk/advanced/events)** - All event types * **[Direct Integration](/sdk/advanced/direct-integration)** - Working without the wrapper --- --- url: /sipgate-ai-flow-api/sdk/advanced/validation.md --- # Validation with Zod The SDK exports Zod schemas for runtime validation of events and actions. ## Event Validation Validate incoming events to ensure they match the expected format: ```typescript import { AiFlowEventSchema } from "@sipgate/ai-flow-sdk"; import { z } from "zod"; app.post("/webhook", async (req, res) => { try { // Validate incoming event const event = AiFlowEventSchema.parse(req.body); // event is now type-safe and validated const action = await assistant.onEvent(event); if (action) { res.json(action); } else { res.status(204).send(); } } catch (error) { if (error instanceof z.ZodError) { console.error("Invalid event:", error.errors); res.status(400).json({ error: "Invalid event format", details: error.errors }); } else { console.error("Error:", error); res.status(500).json({ error: "Internal server error" }); } } }); ``` ## Action Validation Validate outgoing actions before sending: ```typescript import { AiFlowActionSchema } from "@sipgate/ai-flow-sdk"; import { z } from "zod"; onUserSpeak: async (event) => { const action = { type: "speak", session_id: event.session.id, text: "Hello!", }; try { // Validate action before returning const validatedAction = AiFlowActionSchema.parse(action); return validatedAction; } catch (error) { if (error instanceof z.ZodError) { console.error("Invalid action:", error.errors); // Return a safe fallback return "I encountered an error. Please try again."; } throw error; } } ``` ## Custom Validation You can extend the schemas for custom validation: ```typescript import { AiFlowEventSchema } from "@sipgate/ai-flow-sdk"; import { z } from "zod"; // Extend the schema with custom validation const CustomEventSchema = AiFlowEventSchema.extend({ session: z.object({ id: z.string().uuid(), account_id: z.string().min(1), // Add custom validation }), }); app.post("/webhook", async (req, res) => { try { const event = CustomEventSchema.parse(req.body); // Process validated event } catch (error) { // Handle validation errors } }); ``` ## Benefits * **Type Safety** - Catch errors at runtime * **Better Error Messages** - Zod provides detailed error information * **Data Integrity** - Ensure events and actions match expected format * **Debugging** - Easier to identify malformed data ## Next Steps * **[Direct Integration](/sdk/advanced/direct-integration)** - Working without the wrapper * **[API Reference](/sdk/api-reference)** - Complete API documentation --- --- url: /sipgate-ai-flow-api/sdk/troubleshooting.md --- # Troubleshooting Common issues and solutions. 
## Common Issues ### WebSocket Connection Errors If you encounter WebSocket connection issues: ```typescript wss.on("connection", (ws, req) => { ws.on("error", (error) => { console.error("WebSocket error:", error); }); ws.on("close", (code, reason) => { console.log(`Connection closed: ${code} - ${reason}`); }); ws.on("message", assistant.ws(ws)); }); ``` **Common causes:** * Network connectivity issues * Firewall blocking WebSocket connections * Incorrect WebSocket URL or protocol ### Event Validation Errors Use Zod schemas to validate incoming events: ```typescript import { AiFlowEventSchema } from "@sipgate/ai-flow-sdk"; app.post("/webhook", async (req, res) => { try { const event = AiFlowEventSchema.parse(req.body); const action = await assistant.onEvent(event); if (action) { res.json(action); } else { res.status(204).send(); } } catch (error) { console.error("Invalid event:", error); res.status(400).json({ error: "Invalid event format" }); } }); ``` ### Debug Mode Enable debug logging to see all events and actions: ```typescript const assistant = AiFlowAssistant.create({ debug: true, // Logs all events and actions // ... your handlers }); ``` ### Audio Format Issues When using the audio action, ensure your audio is in the correct format: * **Format**: WAV * **Sample Rate**: 16kHz * **Channels**: Mono * **Bit Depth**: 16-bit PCM * **Encoding**: Base64 ```typescript // Example: Convert audio file to correct format import fs from "fs"; const audioBuffer = fs.readFileSync("audio.wav"); const base64Audio = audioBuffer.toString("base64"); return { type: "audio", session_id: event.session.id, audio: base64Audio, }; ``` ## TypeScript Issues ### Type Errors Make sure you're importing types correctly: ```typescript import type { AiFlowEventUserSpeak, AiFlowAction, } from "@sipgate/ai-flow-sdk"; ``` ### Module Resolution If you encounter module resolution errors, check your `tsconfig.json`: ```json { "compilerOptions": { "moduleResolution": "bundler", "esModuleInterop": true, "skipLibCheck": true } } ``` ## Performance Issues ### Slow Response Times * Check your event handler performance * Use async/await properly * Avoid blocking operations * Consider caching for frequently accessed data ### Memory Leaks * Clean up session state in `onSessionEnd` * Remove event listeners * Clear timers and intervals ## Integration Issues ### Express Middleware If the Express middleware isn't working: ```typescript // Make sure express.json() is used app.use(express.json()); // Check the route order app.post("/webhook", assistant.express()); ``` ### WebSocket Handler If WebSocket messages aren't being processed: ```typescript // Ensure message handler is set up correctly ws.on("message", assistant.ws(ws)); // Check message format ws.on("message", (data) => { console.log("Received:", data.toString()); assistant.ws(ws)(data); }); ``` ## Next Steps * **[API Reference](/sdk/api-reference)** - Complete API documentation * **[Examples](/sdk/examples)** - More examples and use cases --- --- url: /sipgate-ai-flow-api/changelog.md --- # Changelog Release notes for the sipgate AI Flow API and SDK. Only customer-visible changes are listed here. *** ## Preview — May 2026 ### End-to-End Voice-to-Voice Mode (Preview) You can now connect your assistant to an end-to-end speech-to-speech model. With the new `configure_voice_to_voice` action the assistant bypasses the standard STT → text → TTS pipeline: caller audio flows directly into the model and the model's spoken response is sent straight back to the caller. 
Conversations feel snappier and more natural, with first-byte response latencies typically in the 200–600 ms range. User turns are still surfaced as `user_speak` events so call traces and logs keep working — you only need to send a single `configure_voice_to_voice` action on `session_start`. To revert to the standard pipeline mid-call, send a `configure_transcription` action.

This is a preview feature, available on request after a sipgate support review.

***

## v1.9.0 — May 2026

### Per-Action VAD Configuration

You can now configure Voice Activity Detection (VAD) individually for each `speak` action. This lets you fine-tune how long the system waits in silence before treating the caller's turn as finished — for example, tolerating long pauses while a caller spells out a name, and using snappier turn-taking for short confirmation questions.

***

## Improvements — April 2026

### Faster, More Natural Conversation Turns

Upgraded to a next-generation transcription backend with significantly improved end-of-utterance detection. The assistant responds faster at natural sentence endings and is less likely to cut in while the caller is still speaking.

### Background Audio Looping (`mix_audio`)

The `mix_audio` action now supports looping — play hold music or ambient sound continuously in the background while the assistant speaks, without gaps or manual re-triggering.

### Transfer with Timeout Fallback

The `transfer` action accepts an optional timeout. If the transfer destination does not answer within the configured time, the call returns to your assistant automatically, allowing you to handle the fallback gracefully.

### Send SMS During a Call (`send_sms`)

A new `send_sms` action lets your assistant send an SMS to the caller while the call is still active — useful for sending confirmation links, reference numbers, or follow-up information in real time.

### Keypad (DTMF) Input Support

Your assistant can now react to keypad presses during a call. DTMF digits are delivered as events, enabling menu navigation, PIN entry, and other touch-tone interactions. User input timeouts also reset correctly when the caller presses a key.

### Consistent E.164 Phone Numbers in All Events

Caller and callee phone numbers in all events are now consistently formatted as E.164 (e.g. `+4921112345678`). If you were normalising numbers on your side, this step is no longer necessary.

***

## v1.5.1 — March 2026

### Outbound Calls

Initiate AI-powered calls programmatically via `POST /ai-flows/:aiFlowId/call`. Your assistant handles the call as soon as the recipient picks up — the same event-driven flow as inbound calls. Available on request after a review by sipgate support.

### Real-Time Speech Start Event (`user_speech_started`)

A new `user_speech_started` event is sent the moment the caller begins speaking — before transcription completes. Use it to interrupt the assistant or trigger visual feedback without waiting for the full transcript.

### Faster ElevenLabs Voices

ElevenLabs voices now use the latest `eleven_flash_v2_5` model by default, delivering noticeably lower latency for generated speech.

### ElevenLabs EU Data Residency

ElevenLabs voices now route through the EU endpoint by default, keeping audio data within the European Union.

***

## Improvements — February 2026

### Immediate Barge-In Strategy

A new `immediate` barge-in strategy detects speech using Voice Activity Detection (VAD) the moment a caller starts talking — typically 20–100 ms before the first word is transcribed. Conversations feel as natural as talking to a real person.
***

## Improvements — January 2026

### SSML Support in Speak Actions

The `speak` action now accepts SSML (Speech Synthesis Markup Language) in addition to plain text. Use SSML to control pronunciation, pauses, emphasis, and speaking rate for fine-tuned voice output.

***

## Early Access — November–December 2025

### Multi-Provider Transcription

Deepgram and ElevenLabs are now available as speech-to-text providers alongside Azure. Select the provider that best fits your use case — each offers different strengths in accuracy, latency, and supported languages.

### Phone Number Routing

AI flows can now be associated with specific phone numbers directly through the API, making it easier to build multi-flow routing logic without external IVR configuration.

### SDK Launch

The `@sipgate/ai-flow-sdk` TypeScript SDK is now publicly available on npm. It provides fully typed event handlers and action builders, removing the need to manage raw WebSocket or HTTP webhook payloads manually.
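A minimal webhook setup with the SDK might look like the sketch below; `AiFlowAssistant.create()` and `assistant.express()` appear elsewhere in these docs, while the `onUserSpeak` handler name is an assumption mirroring the `user_speak` event:

```typescript
import express from "express";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

// `onUserSpeak` is an assumed handler name (mirroring the user_speak event);
// create() and express() are shown elsewhere in these docs.
const assistant = AiFlowAssistant.create({
  onUserSpeak: async (event) => ({
    type: "speak",
    session_id: event.session.id,
    text: `You said: ${event.text}`,
  }),
});

const app = express();
app.use(express.json());
app.post("/webhook", assistant.express());
app.listen(3000);
```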
***

> **Note:** The AI Flow API follows continuous delivery — not all improvements correspond to an SDK version bump. Check this page regularly for the latest changes.

---

---
url: /sipgate-ai-flow-api/README.md
---

# Documentation

This directory contains the documentation for the sipgate AI Flow SDK, built with [VitePress](https://vitepress.dev/).

## Quick Links

* **[API Reference](./api/)** - Language-agnostic HTTP/WebSocket API documentation
* **[TypeScript SDK](./sdk/)** - TypeScript SDK documentation
* **LLM-friendly docs** — `/llms.txt` (index) and `/llms-full.txt` (full corpus) are auto-generated on build by `vitepress-plugin-llms` and follow the [llms.txt spec](https://llmstxt.org/). Linked from [the homepage](./index.md#for-ai-assisted-development).

## Development

```bash
# Install dependencies
pnpm install

# Start dev server
pnpm dev

# Build for production
pnpm build

# Preview production build
pnpm preview
```

## Structure

* `index.md` - Homepage
* `api/` - API reference documentation
* `sdk/` - SDK documentation
* `.vitepress/` - VitePress configuration
  * `config.ts` - Main configuration
  * `theme/` - Custom theme and styles

## Deployment

The documentation is automatically deployed to GitHub Pages when changes are pushed to the `main` branch. The deployment is handled by the `.github/workflows/docs.yml` workflow.

## Base URL

The documentation is configured to be served from `/sipgate-ai-flow-api/` on GitHub Pages. If you need to change this, update the `base` option in `.vitepress/config.ts`.

---

---
url: /sipgate-ai-flow-api/SETUP.md
---

# Documentation Setup Guide

This guide will help you set up and deploy the documentation to GitHub Pages.

## Prerequisites

* Node.js 22+
* pnpm 10+

## Local Development

1. **Install dependencies:**

   ```bash
   cd docs
   pnpm install
   ```

2. **Start the development server:**

   ```bash
   pnpm dev
   ```

   The documentation will be available at `http://localhost:5173`.

3. **Build for production:**

   ```bash
   pnpm build
   ```

4. **Preview the production build:**

   ```bash
   pnpm preview
   ```

## GitHub Pages Setup

### 1. Enable GitHub Pages

1. Go to your repository settings on GitHub
2. Navigate to **Pages** in the left sidebar
3. Under **Source**, select **GitHub Actions**
4. Save the changes

### 2. Configure Base URL

The documentation is configured to be served from `/sipgate-ai-flow-api/` on GitHub Pages. If your repository name is different, update the `base` option in `.vitepress/config.ts`:

```typescript
export default defineConfig({
  base: '/your-repo-name/', // Update this
  // ...
})
```

### 3. Deploy

The documentation will automatically deploy when you:

1. Push changes to the `main` branch that affect files in the `docs/` folder
2. Manually trigger the workflow from the **Actions** tab

### 4. Access Your Documentation

Once deployed, your documentation will be available at:

```
https://sipgate.github.io/sipgate-ai-flow-api/
```

(Replace `sipgate` with your GitHub username/organization and `sipgate-ai-flow-api` with your repository name.)

## Customization

### Adding a Logo

1. Add your logo file (e.g., `logo.svg`) to the `docs/public/` folder
2. Uncomment the logo line in `.vitepress/config.ts`:

```typescript
themeConfig: {
  logo: '/logo.svg',
  // ...
}
```

### Changing Colors

Edit `.vitepress/theme/custom.css` to change the color scheme:

```css
:root {
  --vp-c-brand: #6366f1; /* Primary brand color */
  --vp-c-brand-light: #818cf8;
  /* ... */
}
```

### Adding Pages

1. Create a new `.md` file in the appropriate directory
2. Add it to the sidebar in `.vitepress/config.ts`
3. Add navigation links if needed

## Troubleshooting

### Build Fails

* Check that all dependencies are installed: `pnpm install`
* Verify your Node.js version is 22+
* Check for syntax errors in markdown files

### Pages Not Updating

* Ensure GitHub Pages is enabled in repository settings
* Check the Actions tab for workflow errors
* Verify the base URL matches your repository name

### Links Not Working

* Ensure all internal links use relative paths
* Check that the base URL is correctly configured
* Verify file paths match the actual file structure

## Support

For issues or questions:

* Check the [VitePress documentation](https://vitepress.dev/)
* Review the workflow logs in GitHub Actions
* Contact the development team

---

---
url: /sipgate-ai-flow-api/api/events/user-input-timeout.md
---

# User Input Timeout Event

Sent when no user speech is detected within the configured timeout period after the assistant finishes speaking.

## Event Structure

```json
{
  "type": "user_input_timeout",
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "account_id": "account-123",
    "phone_number": "1234567890",
    "direction": "inbound",
    "from_phone_number": "9876543210",
    "to_phone_number": "1234567890"
  }
}
```

## When Triggered

This event is sent when:

1. A `speak` action includes a `user_input_timeout_seconds` field
2. The assistant finishes speaking (`assistant_speech_ended` event fires)
3. The specified timeout period elapses without any user speech detected

## Response

You can respond with any action:

```json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "I didn't hear anything. Let me repeat the question."
}
```

## Use Cases

### Retry Question

```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'user_input_timeout') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'Are you still there? Please say yes or no.',
      user_input_timeout_seconds: 5
    });
  }
});
```

### Escalate to Human

```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'user_input_timeout') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'Let me transfer you to a human agent.',
      // Follow with a transfer action once the announcement
      // has finished playing (see the sketch below)
    });
  }
});
```
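One way to sequence that follow-up, sketched under assumptions: remember the session, then send the `transfer` action when the `assistant_speech_ended` event for the announcement arrives. The `phone_number` field name on the transfer action is an assumption for this sketch:

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Sketch under assumptions: `pendingTransfers` is our own bookkeeping, and
// the transfer action's `phone_number` field name is not confirmed here.
const pendingTransfers = new Set();

app.post("/webhook", (req, res) => {
  const event = req.body;

  if (event.type === "user_input_timeout") {
    pendingTransfers.add(event.session.id);
    return res.json({
      type: "speak",
      session_id: event.session.id,
      text: "Let me transfer you to a human agent.",
    });
  }

  // Once the announcement has finished playing, issue the transfer.
  if (
    event.type === "assistant_speech_ended" &&
    pendingTransfers.has(event.session.id)
  ) {
    pendingTransfers.delete(event.session.id);
    return res.json({
      type: "transfer",
      session_id: event.session.id,
      phone_number: "+4921112345678", // assumed field name
    });
  }

  res.status(204).send();
});

app.listen(3000);
```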
### Hangup After Multiple Timeouts

```javascript
const timeoutCounts = new Map();

app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'user_input_timeout') {
    const sessionId = event.session.id;
    const count = (timeoutCounts.get(sessionId) || 0) + 1;
    timeoutCounts.set(sessionId, count);

    if (count >= 3) {
      return res.json({
        type: 'hangup',
        session_id: sessionId
      });
    }

    return res.json({
      type: 'speak',
      session_id: sessionId,
      text: `I didn't hear anything. Please respond. Attempt ${count} of 3.`,
      user_input_timeout_seconds: 5
    });
  }

  // Clean up the counter when the call ends to avoid leaking memory
  if (event.type === 'session_end') {
    timeoutCounts.delete(event.session.id);
  }

  res.status(204).send();
});
```

## Configuration

The timeout is configured in the `speak` action:

```json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "What is your account number?",
  "user_input_timeout_seconds": 5
}
```

See [Speak Action](/api/actions/speak#user-input-timeout) for details.

## Behavior

* **Timer starts**: When the `assistant_speech_ended` event fires
* **Timer cleared**: When any user speech is detected (STT events)
* **Event sent**: When the timeout period elapses without speech
* **New speak action**: Clears any existing timeout and sets a new one (if specified)

## Examples

### Python

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json

    if event['type'] == 'user_input_timeout':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': "I didn't hear you. Please try again."
        })

    return '', 204
```

### Go

```go
// event was decoded from the request body into a map[string]interface{}
session := event["session"].(map[string]interface{})

if event["type"] == "user_input_timeout" {
	action := map[string]interface{}{
		"type":       "speak",
		"session_id": session["id"],
		"text":       "I didn't hear you. Please try again.",
	}
	json.NewEncoder(w).Encode(action)
}
```

### Ruby

```ruby
post '/webhook' do
  event = JSON.parse(request.body.read)

  if event['type'] == 'user_input_timeout'
    content_type :json
    {
      type: 'speak',
      session_id: event['session']['id'],
      text: "I didn't hear you. Please try again."
    }.to_json
  end
end
```

## Best Practices

1. **Set reasonable timeouts** - 5-10 seconds is typical for most interactions
2. **Provide feedback** - Let users know why they're being prompted again
3. **Limit retries** - After 2-3 timeouts, consider escalating or hanging up
4. **Use context** - Different questions may need different timeout durations
5. **Handle gracefully** - Don't frustrate users with immediate hangups

## Related

* **[Speak Action](/api/actions/speak)** - Configure the timeout
* **[Assistant Speech Ended](/api/events/assistant-speech-ended)** - When the timer starts
* **[User Speak](/api/events/user-speak)** - Clears the timeout