---
url: /sipgate-ai-flow-api/api.md
---
# API Reference
Welcome to the sipgate AI Flow API documentation. This documentation is **language-agnostic** and describes the HTTP and WebSocket protocols that power the AI Flow service.
## Overview
sipgate AI Flow is a voice assistant platform that uses an **event-driven architecture**. Your application receives events (like when a user speaks) and responds with actions (like speaking text back to the user).
## Architecture
```mermaid
graph TB
A[Phone Call] --> B[AI Flow Service]
B --> C{Event Type}
C -->|session_start| D[Your Webhook/WebSocket]
C -->|user_speak| D
C -->|assistant_speak| D
C -->|session_end| D
D --> E[Process Event]
E --> F{Response Type}
F -->|speak| G[Action: Speak]
F -->|audio| H[Action: Audio]
F -->|transfer| I[Action: Transfer]
F -->|hangup| J[Action: Hangup]
G --> B
H --> B
I --> B
J --> B
B --> A
```
## Integration Methods
### HTTP Webhooks
Receive events via HTTP POST requests to your webhook endpoint.
**Best for:**
* Serverless functions (AWS Lambda, Google Cloud Functions)
* REST APIs
* Simple integrations
[Learn more →](/api/http-webhooks)
### WebSocket
Maintain a persistent WebSocket connection for real-time event streaming.
**Best for:**
* Real-time applications
* Lower latency requirements
* High-volume scenarios
[Learn more →](/api/websocket)
## Event-Driven Flow
```mermaid
sequenceDiagram
participant Phone as Phone Call
participant Service as AI Flow Service
participant App as Your Application
Phone->>Service: Call Starts
Service->>App: POST /webhook<br/>{type: "session_start", ...}
App->>Service: {type: "speak", text: "Hello!"}
Service->>Phone: Plays Audio
Phone->>Service: User Speaks
Service->>App: POST /webhook<br/>{type: "user_speak", text: "..."}
App->>Service: {type: "speak", text: "How can I help?"}
Service->>Phone: Plays Audio
Phone->>Service: Call Ends
Service->>App: POST /webhook<br/>{type: "session_end", ...}
```
## Core Concepts
### Events
Events are JSON objects sent from the AI Flow service to your application:
* **session\_start** - When a call begins
* **user\_speak** - When the user speaks (includes `barged_in` flag if user interrupted)
* **assistant\_speak** - After your assistant speaks
* **session\_end** - When the call ends
[View all events →](/api/events)
### Actions
Actions are JSON objects you send back to the AI Flow service:
* **speak** - Speak text or SSML
* **audio** - Play pre-recorded audio
* **hangup** - End the call
* **transfer** - Transfer to another number
[View all actions →](/api/actions)
## Quick Example
Here's a minimal example using HTTP webhooks:
**1. Receive an event:**
```http
POST /webhook
Content-Type: application/json

{
  "type": "user_speak",
  "text": "Hello",
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "account_id": "account-123",
    "phone_number": "1234567890",
    "from_phone_number": "9876543210",
    "to_phone_number": "1234567890"
  }
}
```
**2. Respond with an action:**
```http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?"
}
```
## Language Support
This API works with **any programming language** that can:
* Receive HTTP requests (for webhooks)
* Receive WebSocket connections (for real-time)
* Parse and generate JSON (for events and actions)
Examples are provided in multiple languages throughout the documentation.
## Next Steps
* **[Quick Start](/api/quick-start)** - Build your first integration
* **[HTTP Webhooks](/api/http-webhooks)** - Set up HTTP integration
* **[WebSocket](/api/websocket)** - Set up WebSocket integration
* **[Event Types](/api/events)** - Complete event reference
* **[Action Types](/api/actions)** - Complete action reference
## TypeScript SDK
If you're using TypeScript, check out our [TypeScript SDK documentation](/sdk/) for a more convenient wrapper around this API.
## For AI-Assisted Development
Using AI coding assistants like **Claude Code**, **ChatGPT**, or **Cursor**? We publish two auto-generated files following the [llms.txt spec](https://llmstxt.org/):
* **[`/llms.txt`](/llms.txt)** — short index, auto-discovered by AI tooling.
* **[`/llms-full.txt`](/llms-full.txt)** — full documentation corpus in a single file, ideal for pasting into an LLM context.
---
---
url: /sipgate-ai-flow-api/api/authentication.md
---
# Authentication
How to authenticate requests with the AI Flow API.
## API Key Authentication
The AI Flow service can authenticate your webhook endpoint using shared secrets.
### Setting Up Shared Secrets
To use AI Flow authentication, configure a shared secret in your webhook settings:
1. In your webhook settings, store a shared secret for AI Flow authentication
2. The service sends this shared secret in the request headers of every event
3. Validate the shared secret in your webhook handler to authenticate requests
### Verifying Shared Secrets
AI Flow sends the shared secret in the `X-API-TOKEN` request header. Validate it in your webhook handler:
### Python (Flask)
```python
from flask import Flask, request, abort
import os

app = Flask(__name__)

# The shared secret you configured in your webhook settings
SHARED_SECRET = os.environ.get('AI_FLOW_SHARED_SECRET')

@app.route('/webhook', methods=['POST'])
def webhook():
    # Verify shared secret sent by AI Flow
    provided_secret = request.headers.get('X-API-TOKEN')
    if provided_secret != SHARED_SECRET:
        abort(401)

    # Process event
    event = request.json
    # ...
    return '', 204
```
### Node.js (Express)
```javascript
const express = require('express');
const app = express();
app.use(express.json());

// The shared secret you configured in your webhook settings
const SHARED_SECRET = process.env.AI_FLOW_SHARED_SECRET;

app.post('/webhook', (req, res) => {
  // Note: Express lowercases incoming header names
  const providedSecret = req.headers['x-api-token'];
  if (providedSecret !== SHARED_SECRET) {
    return res.status(401).json({ error: 'Unauthorized' });
  }

  // Process event
  const event = req.body;
  // ...
  res.status(204).send();
});
```
### Go
```go
import (
	"net/http"
	"os"
)

// The shared secret you configured in your webhook settings
var sharedSecret = os.Getenv("AI_FLOW_SHARED_SECRET")

func webhook(w http.ResponseWriter, r *http.Request) {
	providedSecret := r.Header.Get("X-API-TOKEN")
	if providedSecret != sharedSecret {
		w.WriteHeader(http.StatusUnauthorized)
		return
	}

	// Process event
	// ...
}
```
## Request Headers
The AI Flow service sends the following headers:
* `X-API-TOKEN` - The shared secret you configured in your webhook settings
* `Content-Type: application/json` - Always JSON
* `User-Agent` - Service identifier
## Response Headers
Your responses should include:
* `Content-Type: application/json` - When returning an action
* `HTTP Status Code`:
* `200` - Action returned
* `204` - No action (No Content)
* `400` - Invalid request
* `401` - Unauthorized
* `500` - Server error
## Security Best Practices
1. **Use HTTPS** - Always use HTTPS in production
2. **Validate Shared Secrets** - Always verify the shared secret sent by AI Flow
3. **Store Secrets Securely** - Use environment variables or secret management
4. **Use Strong Secrets** - Generate cryptographically secure random secrets
5. **Rate Limiting** - Implement rate limiting to prevent abuse
6. **Input Validation** - Validate all incoming events
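Points 2 through 4 can be combined into a short sketch. A minimal Python example, assuming the secret arrives in the `X-API-TOKEN` header as described above:

```python
import hmac
import secrets

def generate_secret() -> str:
    """Generate a cryptographically secure shared secret.

    Run once, then store the value in your webhook settings
    and your server's environment.
    """
    return secrets.token_urlsafe(32)

def is_authorized(provided_secret, expected_secret: str) -> bool:
    """Compare the X-API-TOKEN value in constant time."""
    if not provided_secret:
        return False
    return hmac.compare_digest(provided_secret, expected_secret)
```

`hmac.compare_digest` avoids leaking information through comparison timing; a plain `==` can short-circuit on the first differing character.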
## Environment Variables
Store shared secrets securely:
### Python
```python
import os
SHARED_SECRET = os.environ.get('AI_FLOW_SHARED_SECRET')
```
### Node.js
```javascript
const SHARED_SECRET = process.env.AI_FLOW_SHARED_SECRET;
```
### Go
```go
import "os"

var sharedSecret = os.Getenv("AI_FLOW_SHARED_SECRET")
```
## Next Steps
* **[HTTP Webhooks](/api/http-webhooks)** - Complete HTTP integration guide
* **[Quick Start](/api/quick-start)** - Build your first integration
---
---
url: /sipgate-ai-flow-api/api/quick-start.md
---
# Quick Start
Get up and running with the AI Flow API in minutes, using any programming language.
## Prerequisites
* A webhook endpoint that can receive HTTP POST requests
* Ability to send HTTP responses with JSON
* (Optional) WebSocket support for real-time integration
## Step 1: Set Up Your Webhook Endpoint
Create an HTTP endpoint that receives POST requests. Here are examples in different languages:
### Python (Flask)
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json

    if event['type'] == 'user_speak':
        # Respond with a speak action
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': f"You said: {event['text']}"
        })

    return '', 204  # No response needed

if __name__ == '__main__':
    app.run(port=3000)
```
### Node.js (Express)
```javascript
const express = require('express');
const app = express();
app.use(express.json());

app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`
    });
  }

  res.status(204).send();
});

app.listen(3000, () => {
  console.log('Webhook server running on port 3000');
});
```
### Go
```go
package main

import (
	"encoding/json"
	"net/http"
)

type Event struct {
	Type    string  `json:"type"`
	Text    string  `json:"text,omitempty"`
	Session Session `json:"session"`
}

type Session struct {
	ID string `json:"id"`
}

func webhook(w http.ResponseWriter, r *http.Request) {
	var event Event
	if err := json.NewDecoder(r.Body).Decode(&event); err != nil {
		w.WriteHeader(http.StatusBadRequest)
		return
	}

	if event.Type == "user_speak" {
		action := map[string]interface{}{
			"type":       "speak",
			"session_id": event.Session.ID,
			"text":       "You said: " + event.Text,
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(action)
		return
	}

	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/webhook", webhook)
	http.ListenAndServe(":3000", nil)
}
```
### Ruby (Sinatra)
```ruby
require 'sinatra'
require 'json'

post '/webhook' do
  event = JSON.parse(request.body.read)

  if event['type'] == 'user_speak'
    return JSON.generate({
      type: 'speak',
      session_id: event['session']['id'],
      text: "You said: #{event['text']}"
    })
  end

  status 204
end
```
## Step 2: Handle Session Start
Respond when a call begins:
```python
@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json

    if event['type'] == 'session_start':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': 'Welcome! How can I help you today?'
        })

    if event['type'] == 'user_speak':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': f"You said: {event['text']}"
        })

    return '', 204
```
## Step 3: Expose Your Endpoint
Make your endpoint accessible to the AI Flow service:
1. **Local Development**: Use a tunneling service like ngrok:
```bash
ngrok http 3000
```
2. **Production**: Deploy to a public URL (AWS, Heroku, Railway, etc.)
3. **Configure**: Add your webhook URL in the AI Flow dashboard
## Step 4: Test Your Integration
1. Make a test call to your configured phone number
2. Speak something
3. Check your server logs to see events
4. Verify responses are working
## Complete Example
Here's a complete example that handles all event types:
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json
    session_id = event['session']['id']

    if event['type'] == 'session_start':
        return jsonify({
            'type': 'speak',
            'session_id': session_id,
            'text': 'Hello! How can I help you today?'
        })

    elif event['type'] == 'user_speak':
        user_text = event['text'].lower()

        if 'goodbye' in user_text or 'bye' in user_text:
            return jsonify({
                'type': 'hangup',
                'session_id': session_id
            })

        return jsonify({
            'type': 'speak',
            'session_id': session_id,
            'text': f"You said: {event['text']}"
        })

    elif event['type'] == 'session_end':
        print(f"Session {session_id} ended")
        return '', 204

    return '', 204

if __name__ == '__main__':
    app.run(port=3000, debug=True)
```
## Next Steps
* **[HTTP Webhooks](/api/http-webhooks)** - Detailed HTTP integration guide
* **[WebSocket](/api/websocket)** - Real-time WebSocket integration
* **[Event Types](/api/events)** - Complete event reference
* **[Action Types](/api/actions)** - Complete action reference
---
---
url: /sipgate-ai-flow-api/api/http-webhooks.md
---
# HTTP Webhooks
Receive events via HTTP POST requests to your webhook endpoint.
## Overview
HTTP webhooks are the simplest way to integrate with AI Flow. The service sends events as JSON in HTTP POST requests to your endpoint.
## How It Works
```mermaid
sequenceDiagram
participant Call as Phone Call
participant Service as AI Flow Service
participant YourApp as Your Webhook Endpoint
Call->>Service: User speaks
Service->>YourApp: POST /webhook<br/>JSON event
YourApp->>YourApp: Process event
YourApp->>Service: HTTP 200<br/>JSON action
Service->>Call: Execute action
```
## Endpoint Requirements
Your webhook endpoint must:
1. **Accept POST requests** at a public URL
2. **Parse JSON** from the request body
3. **Return JSON actions** or `204 No Content`
4. **Respond as quickly as possible**
5. **Use HTTPS** in production
## Request Format
All requests are POST with JSON body:
```http
POST /webhook HTTP/1.1
Host: your-domain.com
Content-Type: application/json
X-API-TOKEN: your-shared-secret (optional)

{
  "type": "user_speak",
  "text": "Hello",
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "account_id": "account-123",
    "phone_number": "1234567890",
    "from_phone_number": "9876543210",
    "to_phone_number": "1234567890"
  }
}
```
## Response Format
### Return an Action
```http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?"
}
```
### No Action Needed
```http
HTTP/1.1 204 No Content
```
## Implementation Examples
### Python (Flask)
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json

    # Handle different event types
    if event['type'] == 'session_start':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': 'Welcome!'
        })

    elif event['type'] == 'user_speak':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': f"You said: {event['text']}"
        })

    # No response needed
    return '', 204

if __name__ == '__main__':
    app.run(port=3000)
```
### Node.js (Express)
```javascript
const express = require('express');
const app = express();
app.use(express.json());

app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'Welcome!'
    });
  }

  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`
    });
  }

  res.status(204).send();
});

app.listen(3000);
```
### Go
```go
package main

import (
	"encoding/json"
	"net/http"
)

type Event struct {
	Type    string  `json:"type"`
	Text    string  `json:"text,omitempty"`
	Session Session `json:"session"`
}

type Session struct {
	ID string `json:"id"`
}

func webhook(w http.ResponseWriter, r *http.Request) {
	var event Event
	if err := json.NewDecoder(r.Body).Decode(&event); err != nil {
		w.WriteHeader(http.StatusBadRequest)
		return
	}

	if event.Type == "user_speak" {
		action := map[string]interface{}{
			"type":       "speak",
			"session_id": event.Session.ID,
			"text":       "You said: " + event.Text,
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(action)
		return
	}

	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/webhook", webhook)
	http.ListenAndServe(":3000", nil)
}
```
### Ruby (Sinatra)
```ruby
require 'sinatra'
require 'json'

post '/webhook' do
  event = JSON.parse(request.body.read)

  if event['type'] == 'user_speak'
    return JSON.generate({
      type: 'speak',
      session_id: event['session']['id'],
      text: "You said: #{event['text']}"
    })
  end

  status 204
end
```
## Error Handling
Handle errors gracefully:
```python
@app.route('/webhook', methods=['POST'])
def webhook():
    try:
        event = request.json
        if not event or 'type' not in event:
            return jsonify({'error': 'Invalid event'}), 400

        # Process event
        # ...
        return '', 204
    except Exception as e:
        print(f"Error processing webhook: {e}")
        return jsonify({'error': 'Internal server error'}), 500
```
## Best Practices
1. **Idempotency** - Handle duplicate events gracefully
2. **Async Processing** - Process long-running tasks asynchronously
3. **Logging** - Log all events for debugging
4. **Validation** - Validate event structure
5. **Error Responses** - Return appropriate HTTP status codes
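The idempotency point can be sketched with a small in-memory dedup cache. Events in the examples above carry no explicit delivery ID, so the key below is a hypothetical derivation from session ID, event type, and text; adapt it to whatever uniquely identifies a delivery in your setup:

```python
import threading

# Dedup cache shared across webhook handler threads
seen_events = set()
seen_lock = threading.Lock()

def dedup_key(event: dict) -> tuple:
    # Hypothetical key; replace with a real delivery identifier if available
    return (event["session"]["id"], event["type"], event.get("text"))

def is_duplicate(event: dict) -> bool:
    """Return True if this event was already processed."""
    key = dedup_key(event)
    with seen_lock:
        if key in seen_events:
            return True
        seen_events.add(key)
        return False
```

In production, back this with a TTL cache or an external store such as Redis so the set does not grow without bound.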
## Testing Locally
Use a tunneling service to expose your local server:
```bash
# Using ngrok
ngrok http 3000
# Using localtunnel
npx localtunnel --port 3000
```
Then configure the tunnel URL in your AI Flow dashboard.
## Production Deployment
Deploy to any platform that supports HTTP:
* **AWS Lambda** - Serverless functions
* **Google Cloud Functions** - Serverless
* **Heroku** - Platform as a service
* **Railway** - Modern deployment
* **Your own server** - VPS, dedicated server
## Next Steps
* **[WebSocket](/api/websocket)** - Real-time WebSocket integration
* **[Event Types](/api/events)** - Complete event reference
* **[Action Types](/api/actions)** - Complete action reference
---
---
url: /sipgate-ai-flow-api/api/websocket.md
---
# WebSocket Integration
Maintain a persistent WebSocket connection for real-time event streaming.
## Overview
WebSocket provides lower latency and real-time bidirectional communication compared to HTTP webhooks. Note the reversed connection direction: the AI Flow Service connects to a WebSocket server that your application runs.
## How It Works
```mermaid
sequenceDiagram
participant Call as Phone Call
participant Service as AI Flow Service
participant App as Your Application
Call->>Service: Call Starts
Service->>App: WebSocket Connection
App->>Service: Connection Established
Call->>Service: User speaks
Service->>App: JSON Event
App->>App: Process event
App->>Service: JSON Action
Service->>Call: Execute action
```
## Connection
When a phone call starts, the AI Flow Service initiates a WebSocket connection to your application. Your application must run a WebSocket server that accepts incoming connections.
### WebSocket Server
Your application exposes a WebSocket endpoint that the AI Flow Service can reach; the service connects to your configured WebSocket URL when calls begin.
### Connection URL
Configure your WebSocket server URL in the AI Flow dashboard. The service will connect to this URL, for example:
```
wss://your-domain.com/ai-flow/websocket
```
or for local development:
```
ws://localhost:8080/websocket
```
## Message Format
### Receiving Events
Events are sent as JSON strings:
```json
{
  "type": "user_speak",
  "text": "Hello",
  "session": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "account_id": "account-123",
    "phone_number": "1234567890"
  }
}
```
### Sending Actions
Send actions as JSON strings:
```json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?"
}
```
## Implementation Examples
### Python
```python
import asyncio
import json

import websockets

# With websockets >= 11, the handler receives only the connection object
async def handle_message(websocket):
    async for message in websocket:
        event = json.loads(message)

        if event['type'] == 'user_speak':
            action = {
                'type': 'speak',
                'session_id': event['session']['id'],
                'text': f"You said: {event['text']}"
            }
            await websocket.send(json.dumps(action))

async def main():
    async with websockets.serve(handle_message, "localhost", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())
```
### Node.js
```javascript
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (ws) => {
  ws.on('message', (data) => {
    const event = JSON.parse(data.toString());

    if (event.type === 'user_speak') {
      const action = {
        type: 'speak',
        session_id: event.session.id,
        text: `You said: ${event.text}`
      };
      ws.send(JSON.stringify(action));
    }
  });

  ws.on('error', (error) => {
    console.error('WebSocket error:', error);
  });
});
```
### Go
```go
package main

import (
	"net/http"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{
	CheckOrigin: func(r *http.Request) bool {
		return true
	},
}

func websocketHandler(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	defer conn.Close()

	for {
		var event map[string]interface{}
		if err := conn.ReadJSON(&event); err != nil {
			break
		}

		if event["type"] == "user_speak" {
			session := event["session"].(map[string]interface{})
			action := map[string]interface{}{
				"type":       "speak",
				"session_id": session["id"],
				"text":       "You said: " + event["text"].(string),
			}
			conn.WriteJSON(action)
		}
	}
}

func main() {
	http.HandleFunc("/ws", websocketHandler)
	http.ListenAndServe(":8080", nil)
}
```
## Connection Management
The AI Flow Service manages the WebSocket connection lifecycle. When a call starts, it connects to your server. When the call ends, it may close the connection.
### Handling Connections
Your WebSocket server should handle incoming connections from the AI Flow Service:
```javascript
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (ws, req) => {
  console.log('New connection from AI Flow Service');

  ws.on('message', (data) => {
    const event = JSON.parse(data.toString());
    handleEvent(event, ws);
  });

  ws.on('error', (error) => {
    console.error('WebSocket error:', error);
  });

  ws.on('close', () => {
    console.log('Connection closed by AI Flow Service');
  });
});
```
## Heartbeat
The AI Flow Service may send ping frames to keep the connection alive. Your server should respond with pong frames:
```javascript
wss.on('connection', (ws) => {
  ws.on('ping', () => {
    ws.pong(); // Respond to ping with pong
  });

  // ... rest of connection handling
});
```
## Error Handling
```python
async def handle_message(websocket):
    try:
        async for message in websocket:
            try:
                event = json.loads(message)
                action = process_event(event)
                if action:
                    await websocket.send(json.dumps(action))
            except json.JSONDecodeError:
                print(f"Invalid JSON: {message}")
            except Exception as e:
                print(f"Error processing event: {e}")
    except websockets.exceptions.ConnectionClosed:
        print("Connection closed")
    except Exception as e:
        print(f"WebSocket error: {e}")
```
## Advantages Over HTTP
* **Lower Latency** - No HTTP overhead
* **Persistent Connection** - No connection setup per request
* **Bidirectional** - Can send messages anytime, from either side
* **Real-time** - Instant event delivery
* **Proactive Communication** - Send actions without waiting for events; receive events without requests
## When to Use WebSocket
WebSockets enable a more flexible communication pattern than HTTP webhooks. Unlike the request/response model of HTTP webhooks, WebSockets allow:
* **Proactive event delivery** - The AI Flow Service can send events to your application at any time, not just in response to a request
* **Unsolicited actions** - Your application can send actions to the service without waiting for an event first
* **True bidirectional communication** - Both sides can initiate communication independently
Use WebSocket when:
* You need the lowest possible latency
* You're handling high-volume traffic
* You can run a persistent WebSocket server
* You're building a real-time application
* You have control over your server infrastructure
* You need to send actions proactively without waiting for events
Use HTTP webhooks when:
* You're using serverless functions (which can't maintain WebSocket connections)
* You want simpler deployment
* You prefer the request/response model (each event triggers a response)
* You're building a simple integration
* You can't run a persistent server
## Next Steps
* **[HTTP Webhooks](/api/http-webhooks)** - HTTP integration alternative
* **[Event Flow](/api/event-flow)** - Understand the event lifecycle
* **[Event Types](/api/events)** - Complete event reference
---
---
url: /sipgate-ai-flow-api/api/event-flow.md
---
# Event Flow
Understand the complete lifecycle of events and actions in AI Flow.
## Complete Flow Diagram
```mermaid
stateDiagram-v2
[*] --> SessionStart: Call Begins
SessionStart --> UserSpeak: Assistant Greets
UserSpeak --> AssistantSpeak: User Responds
AssistantSpeak --> AssistantSpeechEnded: Speech Completes
AssistantSpeechEnded --> UserSpeak: Wait for User
AssistantSpeak --> UserBargeIn: User Interrupts
UserBargeIn --> UserSpeak: Continue Conversation
UserSpeak --> SessionEnd: User Says Goodbye
AssistantSpeechEnded --> SessionEnd: Call Ends
SessionEnd --> [*]
```
## Sequence Diagram
```mermaid
sequenceDiagram
participant Phone as Phone Call
participant Service as AI Flow Service
participant App as Your Application
Note over Phone,App: Call Begins
Phone->>Service: Call Initiated
Service->>App: Event: session_start
App->>Service: Action: speak "Welcome!"
Service->>Phone: Plays Audio
Note over Phone,App: User Speaks
Phone->>Service: User Speech
Service->>App: Event: user_speak<br/>{text: "Hello"}
App->>Service: Action: speak "Hello! How can I help?"
Service->>Phone: Plays Audio
Note over Phone,App: Assistant Speaking
Service->>App: Event: assistant_speak<br/>{duration_ms: 2000}
Service->>App: Event: assistant_speech_ended
Note over App: Speech completed
Note over Phone,App: User Interrupts
Phone->>Service: User Speech (during playback)
Service->>App: Event: user_speak with barged_in flag<br/>{text: "Wait"}
App->>Service: Action: speak "Yes, I'm listening"
Service->>Phone: Plays Audio
Note over Phone,App: Call Ends
Phone->>Service: Call Terminated
Service->>App: Event: session_end
Note over App: Cleanup (no action)
```
## Event Lifecycle
### 1. Session Start
```mermaid
graph LR
A[Call Begins] --> B[session_start Event]
B --> C{Your Response}
C -->|speak| D[Greet User]
C -->|audio| E[Play Welcome]
C -->|null| F[Silent Start]
D --> G[Continue]
E --> G
F --> G
```
**Event:**
```json
{
  "type": "session_start",
  "session": {
    "id": "session-123",
    "phone_number": "1234567890",
    "from_phone_number": "9876543210",
    "to_phone_number": "1234567890"
  }
}
```
### 2. User Speak
```mermaid
graph LR
A[User Speaks] --> B[Speech-to-Text]
B --> C[user_speak Event]
C --> D{Your Logic}
D -->|speak| E[Respond]
D -->|transfer| F[Transfer Call]
D -->|hangup| G[End Call]
E --> I[Continue]
F --> J[Call Transferred]
G --> K[Call Ended]
```
**Event:**
```json
{
  "type": "user_speak",
  "text": "Hello",
  "session": {
    "id": "session-123"
  }
}
```
### 3. Assistant Speak
```mermaid
graph LR
A[Assistant Speaks] --> B[assistant_speak Event]
B --> C{Your Response}
C -->|null| D[Track Metrics]
C -->|speak| E[Continue Speaking]
C -->|audio| F[Play Audio]
D --> G[Wait for User]
E --> G
F --> G
```
**Event:**
```json
{
  "type": "assistant_speak",
  "text": "Hello! How can I help?",
  "duration_ms": 2000,
  "speech_started_at": 1234567890,
  "session": {
    "id": "session-123"
  }
}
```
### 4. Assistant Speech Ended
```mermaid
graph LR
A[Speech Ends] --> B[assistant_speech_ended Event]
B --> C{Your Response}
C -->|speak| D[Continue Conversation]
C -->|null| E[Track Completion]
D --> F[Wait for User]
E --> F
```
**Event:**
```json
{
  "type": "assistant_speech_ended",
  "session": {
    "id": "session-123"
  }
}
```
### 5. User Barge In
```mermaid
graph LR
A[User Interrupts] --> B[user_speak Event with barged_in flag]
B --> C{Your Response}
C -->|speak| D[Acknowledge]
C -->|null| E[Continue]
D --> F[Listen]
E --> F
```
**Event:**
```json
{
  "type": "user_speak",
  "barged_in": true,
  "text": "Wait",
  "session": {
    "id": "session-123"
  }
}
```
### 6. Session End
```mermaid
graph LR
A[Call Ends] --> B[session_end Event]
B --> C[Cleanup]
C --> D[No Action Allowed]
```
**Event:**
```json
{
  "type": "session_end",
  "session": {
    "id": "session-123"
  }
}
```
## State Management
```mermaid
stateDiagram-v2
[*] --> Idle
Idle --> Greeting: session_start
Greeting --> Listening: speak action
Listening --> Processing: user_speak
Processing --> Speaking: speak action
Speaking --> SpeechEnded: assistant_speech_ended
SpeechEnded --> Listening: wait for user
Speaking --> Interrupted: user_speak with barged_in flag
Interrupted --> Listening: acknowledge
Listening --> Ended: hangup action
SpeechEnded --> Ended: session_end
Speaking --> Ended: session_end
Processing --> Ended: session_end
Ended --> [*]
```
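The diagram above can be tracked with a small per-session state machine. A minimal Python sketch follows; the state names are illustrative bookkeeping on your side, not part of the API, which does not require you to hold state at all:

```python
# Per-session state, keyed by session ID
sessions = {}

# (current state, incoming message) -> next state, mirroring the diagram
TRANSITIONS = {
    ("idle", "session_start"): "greeting",
    ("greeting", "speak"): "listening",          # your speak action sent
    ("listening", "user_speak"): "processing",
    ("processing", "speak"): "speaking",
    ("speaking", "assistant_speech_ended"): "speech_ended",
    ("speech_ended", "user_speak"): "processing",
    ("speaking", "user_speak"): "interrupted",   # barge-in
    ("interrupted", "speak"): "listening",
}

def on_message(session_id: str, message_type: str) -> str:
    """Advance the session state machine; return the new state."""
    if message_type == "session_end":
        sessions.pop(session_id, None)  # clean up finished sessions
        return "ended"
    state = sessions.get(session_id, "idle")
    next_state = TRANSITIONS.get((state, message_type), state)
    sessions[session_id] = next_state
    return next_state
```

Unknown combinations keep the current state, which makes the tracker tolerant of events you do not model.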
## Response Timing
```mermaid
gantt
title Event Response Timeline
dateFormat X
axisFormat %L ms
section Call Flow
session_start event :0, 50
Process & respond :50, 100
speak action :100, 150
Audio playback :150, 2150
assistant_speak event :2150, 2200
user_speak event :2200, 2250
Process & respond :2250, 2300
speak action :2300, 2350
```
## Outbound Call Flow
For outbound calls initiated via the REST API, the flow is identical to inbound — except that `session.direction` is `"outbound"` and the AI flow dials the recipient first.
```mermaid
sequenceDiagram
participant API as REST Client
participant Service as AI Flow Service
participant Phone as Recipient Phone
participant App as Your Application
Note over API,App: Initiate Outbound Call
API->>Service: POST /v3/ai-flows/:id/call
Service-->>API: 201 Created
Note over Service,App: Recipient Answers
Service->>Phone: Dial toPhoneNumber
Phone-->>Service: Answers
Service->>App: Event: session_start<br/>{ direction: "outbound" }
App->>Service: Action: speak "Hello, this is an automated call..."
Service->>Phone: Plays Audio
Note over Phone,App: Conversation continues normally
Phone->>Service: User Speech
Service->>App: Event: user_speak
App->>Service: Action: speak / hangup / transfer
```
::: tip
Check `event.session.direction === "outbound"` in your `session_start` handler to customize the opening message for calls your assistant initiated.
:::
::: warning Access Required
Outbound calls require explicit access granted by sipgate support. See the [Outbound Calls guide](/api/guides/outbound-calls) for details.
:::
## Best Practices
1. **Respond Quickly** - Keep response times under 1 second
2. **Handle All Events** - Even if you don't need to respond
3. **Clean Up State** - Use `session_end` for cleanup
4. **Track Metrics** - Use `assistant_speak` for analytics
5. **Handle Errors** - Always return valid responses or 204
## Next Steps
* **[Event Types](/api/events)** - Complete event reference
* **[Action Types](/api/actions)** - Complete action reference
* **[HTTP Webhooks](/api/http-webhooks)** - HTTP integration
* **[WebSocket](/api/websocket)** - WebSocket integration
---
---
url: /sipgate-ai-flow-api/api/guides/outbound-calls.md
---
# Outbound Calls
Initiate AI-powered outbound calls programmatically — your assistant dials the recipient and handles the conversation
once connected.
::: warning Access Required
Outbound calls are **only available upon request** and after a positive review by sipgate support. This restriction
exists to prevent fraud and spam. Please contact support to request access before using this feature.
:::
## How It Works
```
POST /v3/ai-flows/:aiFlowId/call
→ AI flow dials toPhoneNumber
→ Recipient answers
→ Session is created with direction: "outbound"
→ Webhook fires to your AI flow's webhook URL
→ Normal event/action flow begins
```
The call lifecycle after connection is identical to an inbound call — the same events (`session_start`, `user_speak`,
etc.) and actions (`speak`, `transfer`, `hangup`, etc.) apply.
## Prerequisites
* Access granted by sipgate support (see warning above)
* AI flow must have a `phone_number` configured (used as caller ID)
* Target phone number in E.164 format without leading + (e.g. `4915790000687`)
## Initiating a Call
### Endpoint
```http
POST /v3/ai-flows/:aiFlowId/call
Authorization: Bearer <token>
Content-Type: application/json
```
### Request Body
```json
{
  "billingDevice": "e2",
  "toPhoneNumber": "4915790000687"
}
```
| Field | Type | Description |
|-----------------|--------|-------------------------------------|
| `billingDevice` | string | Billing device suffix (e.g. `"e2"`) |
| `toPhoneNumber` | string | Target phone number in E.164 format without leading + |
### Response
The endpoint returns `201 Created` when the call has been successfully initiated.
### Error Responses
| Status | Reason |
|--------|------------------------------------------|
| `400` | AI flow has no `phone_number` configured |
| `404` | AI flow not found |
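To sketch what a client for this endpoint could look like in TypeScript (the base URL, `API_TOKEN` placeholder, and `buildCallRequest` helper are illustrative assumptions, not part of the official API):

```typescript
// Sketch: initiate an outbound call over HTTP.
// BASE_URL and API_TOKEN are placeholders - substitute your actual
// API host and credentials.
const BASE_URL = 'https://api.sipgate.com'
const API_TOKEN = 'YOUR_TOKEN'

// Hypothetical helper that assembles the request for the call endpoint
function buildCallRequest(aiFlowId: string, billingDevice: string, toPhoneNumber: string) {
  return {
    url: `${BASE_URL}/v3/ai-flows/${aiFlowId}/call`,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${API_TOKEN}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ billingDevice, toPhoneNumber }),
    },
  }
}

async function startOutboundCall(aiFlowId: string): Promise<void> {
  const { url, init } = buildCallRequest(aiFlowId, 'e2', '4915790000687')
  const res = await fetch(url, init)
  // 201 Created signals the call was initiated
  if (res.status !== 201) throw new Error(`Call initiation failed: ${res.status}`)
}
```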
## Session Direction
When an outbound call connects, the `session_start` event's session object will contain `"direction": "outbound"`:
```json
{
"type": "session_start",
"session": {
"id": "550e8400-e29b-41d4-a716-446655440000",
"phone_number": "4915790000687",
"direction": "outbound",
"from_phone_number": "4921155660",
"to_phone_number": "4915790000687"
}
}
```
Use the `direction` field to tailor your greeting — for outbound calls your assistant initiated the contact, so the
opening message should reflect that context.
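A minimal sketch of a direction-aware greeting (the greeting texts and the "Example Corp" name are illustrative only):

```typescript
// Pick a greeting based on whether the assistant placed the call.
type Direction = 'inbound' | 'outbound'

function greetingFor(direction: Direction): string {
  // For outbound calls the assistant initiated the contact, so it
  // should introduce itself and state why it is calling.
  if (direction === 'outbound') {
    return 'Hello! This is the assistant from Example Corp, calling about your appointment.'
  }
  // For inbound calls the caller initiated the contact.
  return 'Hello! Thanks for calling. How can I help you?'
}

function handleSessionStart(event: { session: { id: string; direction: Direction } }) {
  return {
    type: 'speak',
    session_id: event.session.id,
    text: greetingFor(event.session.direction),
    tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
  }
}
```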
## Example
```http
POST /v3/ai-flows/:aiFlowId/call
Authorization: Bearer <token>
Content-Type: application/json

{
"billingDevice": "e2",
"toPhoneNumber": "4915790000687"
}
```
::: tip TypeScript SDK
Using the `@sipgate/ai-flow-sdk`? See the **[Outbound Calls SDK guide](/sdk/outbound-calls)** for `assistant.call()` with full examples.
:::
## Next Steps
* **[Event Types](/api/events)** — complete event reference
* **[Event Flow](/api/event-flow)** — full call lifecycle
* **[Action Types](/api/actions)** — how to respond to events
---
---
url: /sipgate-ai-flow-api/api/guides/phone-number-routing.md
---
# Phone Number Routing: Multiple Assistants, One Webhook
When you have multiple phone numbers - each for a different purpose - you don't need separate webhook endpoints. Route them all to a single endpoint and dispatch based on the called number.
## The Pattern
Every sipgate AI Flow event includes the phone number in the session object:
```json
{
"type": "session_start",
"session": {
"id": "550e8400-e29b-41d4-a716-446655440000",
"to_phone_number": "4921234567890",
"from_phone_number": "4915112345678",
"direction": "inbound"
}
}
```
Use `to_phone_number` to determine which assistant or behavior to use.
## Basic Routing
```typescript
const ASSISTANTS: Record<string, { name: string; greeting: string; systemPrompt: string }> = {
'4921234567890': {
name: 'Sales',
greeting: 'Hi! Thanks for calling our sales team. How can I help?',
systemPrompt: 'You are a helpful sales assistant...',
},
'4921234567891': {
name: 'Support',
greeting: 'Hello! You\'ve reached customer support. What can I help you with?',
systemPrompt: 'You are a friendly support agent...',
},
'4921234567892': {
name: 'Booking',
greeting: 'Welcome! I can help you book an appointment. When would you like to come in?',
systemPrompt: 'You are an appointment booking assistant...',
},
}
export async function POST(req: Request) {
const event = await req.json()
const phoneNumber = event.session.to_phone_number
// Get assistant config for this number
const assistant = ASSISTANTS[phoneNumber]
if (!assistant) {
// Unknown number - use fallback
return speak(event.session.id, "Sorry, this number is not configured.")
}
switch (event.type) {
case 'session_start':
return speak(event.session.id, assistant.greeting)
case 'user_speak':
return handleUserSpeak(event, assistant)
// ... other events
}
}
```
## Database-Driven Routing
For dynamic configuration, store the mapping in a database:
```typescript
// Database schema
// phone_numbers: id, phone_number, assistant_id
// assistants: id, name, greeting, system_prompt, voice_provider, voice_id
async function getAssistantForNumber(phoneNumber: string) {
const { data } = await supabase
.from('phone_numbers')
.select(`
phone_number,
assistants (
id,
name,
greeting,
system_prompt,
voice_provider,
voice_id
)
`)
.eq('phone_number', phoneNumber)
.single()
return data?.assistants
}
export async function POST(req: Request) {
const event = await req.json()
const phoneNumber = event.session.to_phone_number
const assistant = await getAssistantForNumber(phoneNumber)
if (!assistant) {
return speak(event.session.id, "This number is not currently in service.")
}
// Route to appropriate handler
return handleEvent(event, assistant)
}
```
## Normalizing Phone Numbers
Phone numbers can arrive in different formats. Normalize before lookup:
```typescript
function normalizePhoneNumber(phone: string): string {
// Remove spaces, dashes, parentheses
let normalized = phone.replace(/[\s\-\(\)]/g, '')
// Remove leading + if present (E.164 without leading +)
if (normalized.startsWith('+')) {
normalized = normalized.slice(1)
}
return normalized
}
async function getAssistantForNumber(phoneNumber: string) {
const normalized = normalizePhoneNumber(phoneNumber)
const { data } = await supabase
.from('phone_numbers')
.select('*, assistants(*)')
.eq('phone_number', normalized)
.single()
return data?.assistants
}
```
## Multi-Language Routing
Route to different languages based on phone number:
```typescript
const LANGUAGE_NUMBERS: Record<string, { language: string; voice: string }> = {
'4921234567890': { language: 'de-DE', voice: 'de-DE-KatjaNeural' },
'4421234567890': { language: 'en-GB', voice: 'en-GB-SoniaNeural' },
'3321234567890': { language: 'fr-FR', voice: 'fr-FR-DeniseNeural' },
}
function getLanguageConfig(phoneNumber: string) {
return LANGUAGE_NUMBERS[phoneNumber] || {
language: 'en-US',
voice: 'en-US-JennyNeural',
}
}
export async function POST(req: Request) {
const event = await req.json()
const langConfig = getLanguageConfig(event.session.to_phone_number)
// Use language-specific TTS
return Response.json({
type: 'speak',
session_id: event.session.id,
text: getGreeting(langConfig.language),
tts: {
provider: 'azure',
language: langConfig.language,
voice: langConfig.voice,
},
})
}
```
## Routing by Caller Number
You can also route based on who's calling (`from_phone_number`):
```typescript
async function handleSessionStart(event: SessionStartEvent) {
  const callerNumber = event.session.from_phone_number
  // Check if this is a known VIP customer
  const customer = await getCustomerByPhone(callerNumber)
  if (customer?.is_vip) {
    return speak(event.session.id, "Welcome back! I see you're a VIP member. How can I assist you today?")
  }
  // Check if this is a repeat caller
  const previousCalls = await getRecentCalls(callerNumber)
  if (previousCalls.length > 0) {
    const lastTopic = previousCalls[0].topic
    return speak(event.session.id, `Hello again! Are you calling about ${lastTopic}, or something new?`)
  }
  // First-time caller
  return speak(event.session.id, "Welcome! How can I help you today?")
}
```
## Fallback Handling
Always handle unknown numbers gracefully:
```typescript
async function getAssistantForNumber(phoneNumber: string) {
const normalized = normalizePhoneNumber(phoneNumber)
const { data } = await supabase
.from('phone_numbers')
.select('*, assistants(*)')
.eq('phone_number', normalized)
.single()
if (!data?.assistants) {
// Log for debugging
console.warn(`No assistant configured for: ${normalized}`)
// Return a default fallback assistant
return {
id: 'fallback',
name: 'Fallback',
greeting: "I'm sorry, but this number is not currently configured. Please try again later.",
system_prompt: 'Politely explain that the service is unavailable.',
voice_provider: 'azure',
voice_id: 'en-US-JennyNeural',
}
}
return data.assistants
}
```
## Caching for Performance
If you're looking up the same numbers repeatedly, cache the results:
```typescript
const assistantCache = new Map()
const CACHE_TTL_MS = 60000 // 1 minute
async function getAssistantForNumber(phoneNumber: string): Promise<Assistant> {
const normalized = normalizePhoneNumber(phoneNumber)
// Check cache
const cached = assistantCache.get(normalized)
if (cached && Date.now() - cached.cachedAt < CACHE_TTL_MS) {
return cached.assistant
}
// Fetch from database
const { data } = await supabase
.from('phone_numbers')
.select('*, assistants(*)')
.eq('phone_number', normalized)
.single()
const assistant = data?.assistants || getFallbackAssistant()
// Cache result
assistantCache.set(normalized, { assistant, cachedAt: Date.now() })
return assistant
}
```
## Complete Example
```typescript
// types.ts
interface Assistant {
id: string
name: string
greeting: string
system_prompt: string
voice_provider: 'azure' | 'eleven_labs'
voice_id: string
language: string
}
// routing.ts
const assistantCache = new Map()
function normalizePhoneNumber(phone: string): string {
let normalized = phone.replace(/[\s\-\(\)]/g, '')
if (normalized.startsWith('+')) normalized = normalized.slice(1)
return normalized
}
async function getAssistant(phoneNumber: string): Promise<Assistant> {
const normalized = normalizePhoneNumber(phoneNumber)
// Check cache
if (assistantCache.has(normalized)) {
return assistantCache.get(normalized)!
}
// Fetch from database
const { data } = await db
.from('phone_numbers')
.select('*, assistants(*)')
.eq('phone_number', normalized)
.single()
const assistant = data?.assistants || {
id: 'fallback',
name: 'Fallback',
greeting: 'This number is not configured.',
system_prompt: 'Explain the service is unavailable.',
voice_provider: 'azure',
voice_id: 'en-US-JennyNeural',
language: 'en-US',
}
assistantCache.set(normalized, assistant)
return assistant
}
// webhook.ts
export async function POST(req: Request): Promise<Response> {
const event = await req.json()
const sessionId = event.session.id
// Route to assistant based on called number
const assistant = await getAssistant(event.session.to_phone_number)
switch (event.type) {
case 'session_start':
console.log(`Call to ${assistant.name} assistant`)
return speak(sessionId, assistant.greeting, assistant)
case 'user_speak': {
  const response = await generateLLMResponse(event.text, assistant)
  return speak(sessionId, response, assistant)
}
case 'session_end':
return new Response(null, { status: 204 })
default:
return new Response(null, { status: 204 })
}
}
function speak(sessionId: string, text: string, assistant: Assistant): Response {
const ttsConfig = assistant.voice_provider === 'azure'
? {
provider: 'azure' as const,
language: assistant.language,
voice: assistant.voice_id,
}
: {
provider: 'eleven_labs' as const,
voice: assistant.voice_id,
}
return Response.json({
type: 'speak',
session_id: sessionId,
text,
tts: ttsConfig,
})
}
```
## Best Practices
1. **Normalize phone numbers** - Handle different formats (49, 0049, +49, etc.) and strip any leading +
2. **Always have a fallback** - Unknown numbers should get a polite message, not an error
3. **Cache lookups** - Database queries on every event add latency
4. **Log unknown numbers** - Helps you spot misconfiguration
5. **Use `to_phone_number`** - That's the number they dialed (your number)
6. **Consider `from_phone_number`** - For personalization based on caller
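The first practice goes beyond stripping the leading plus: numbers can also arrive with a `00` international prefix or a national `0` prefix. A stricter normalizer could look like this (the default country code is an assumption for a German numbering plan; adjust for your own):

```typescript
// Stricter normalizer that also handles the "00" international prefix
// and national numbers with a leading "0".
// DEFAULT_COUNTRY_CODE is an assumption (Germany, "49").
const DEFAULT_COUNTRY_CODE = '49'

function normalizePhoneNumberStrict(phone: string): string {
  // Strip spaces, dashes, parentheses, and dots
  let n = phone.replace(/[\s\-().]/g, '')
  if (n.startsWith('+')) {
    n = n.slice(1) // +4921... -> 4921...
  } else if (n.startsWith('00')) {
    n = n.slice(2) // 004921... -> 4921...
  } else if (n.startsWith('0')) {
    n = DEFAULT_COUNTRY_CODE + n.slice(1) // 0211... -> 49211...
  }
  return n
}
```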
## Related Documentation
* **[Session Start Event](/api/events/session-start)** - Event structure with phone numbers
* **[HTTP Webhooks](/api/http-webhooks)** - Webhook endpoint setup
---
---
url: /sipgate-ai-flow-api/api/guides/streaming-llm-responses.md
---
# Streaming LLM Responses: Sentence-by-Sentence Best Practices
When integrating Large Language Models (LLMs) like OpenAI's GPT, Anthropic's Claude, or similar services with sipgate AI Flow, how you stream responses significantly impacts the naturalness of synthesized speech. This guide shows you how to achieve smooth, natural-sounding voice output by sending complete sentences rather than individual tokens.
## The Problem: Token-by-Token Streaming
LLMs stream responses token-by-token (small text fragments). Sending each token directly to the TTS provider creates choppy, unnatural speech:
```typescript
// ❌ BAD: Sends every token immediately
for await (const chunk of llmStream) {
await sendAction({
type: 'speak',
session_id: sessionId,
text: chunk.content, // Individual tokens: "Hello", ", ", "how", " ", "can", " ", "I"...
tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
})
}
```
**Why this sounds bad:**
* Each TTS call treats the text as a complete utterance with sentence-ending prosody (falling intonation, longer pauses)
* Results in robotic, choppy speech: "Hello↘️ \[pause] how↘️ \[pause] can↘️ \[pause] I↘️ \[pause] help↘️ \[pause]"
* TTS providers optimize for complete sentences, not fragments
## The Solution: Sentence Segmentation
**✅ Best Practice:** Buffer LLM tokens and send complete sentences to the TTS provider.
```mermaid
sequenceDiagram
participant LLM as OpenAI/Claude
participant App as Your Application
participant Segmenter as Intl.Segmenter
participant Flow as AI Flow
LLM->>App: Token: "Hello"
LLM->>App: Token: ", how"
LLM->>App: Token: " can I"
LLM->>App: Token: " help?"
App->>Segmenter: Buffer: "Hello, how can I help?"
Segmenter->>App: Sentence detected!
App->>Flow: speak: "Hello, how can I help?"
Flow->>Flow: Synthesize complete sentence
LLM->>App: Token: " I'm"
LLM->>App: Token: " here"
Note right of App: Continue buffering...
```
**Benefits:**
* Natural prosody and intonation
* Appropriate pauses between sentences
* Better voice quality from TTS providers
* Maintains low latency (sentences typically complete within 1-2 seconds)
## Prompting LLMs for Voice Output
**Critical:** Instruct your LLM to avoid abbreviations that interfere with speech synthesis and sentence detection.
### The Problem with Abbreviations
Abbreviations like "Dr.", "bzw.", "z.B.", "etc." cause two issues:
1. **Incorrect sentence segmentation** - `Intl.Segmenter` may treat the period after an abbreviation as a sentence boundary:
```typescript
// "Dr. Smith will help you."
// Incorrectly splits into:
// Sentence 1: "Dr."
// Sentence 2: "Smith will help you."
```
2. **Poor TTS pronunciation** - Text-to-speech may mispronounce abbreviations:
* "Dr." → "D R" or "Doctor point" instead of "Doctor"
* "bzw." → "B Z W" instead of "beziehungsweise"
* "z.B." → "Z B" instead of "zum Beispiel"
### System Prompt Guidelines
Add these instructions to your LLM system prompt:
```typescript
const systemPrompt = `You are a voice assistant. Follow these rules strictly:
VOICE OUTPUT RULES:
- Write out all abbreviations fully (e.g., "Doctor" not "Dr.", "for example" not "e.g.")
- Use complete words instead of shortened forms
- Avoid punctuation-based abbreviations that end with periods
- Use natural, spoken language as if talking to someone in person
Examples:
❌ "Dr. Smith can help you with that."
✅ "Doctor Smith can help you with that."
❌ "You can use method A, B, or C, e.g., the first one."
✅ "You can use method A, B, or C, for example the first one."
❌ "This is available Mon.-Fri."
✅ "This is available Monday through Friday."
Your responses will be converted to speech, so write exactly how you would say it out loud.`
```
### Language-Specific Examples
**English:**
```typescript
const englishVoiceRules = `
- "Dr." → "Doctor"
- "Mr." → "Mister"
- "Mrs." → "Missus"
- "e.g." → "for example"
- "i.e." → "that is"
- "etc." → "and so on" or "etcetera"
- "vs." → "versus"
- "approx." → "approximately"
`
```
**German:**
```typescript
const germanVoiceRules = `
- "Dr." → "Doktor"
- "bzw." → "beziehungsweise"
- "z.B." → "zum Beispiel"
- "usw." → "und so weiter"
- "ca." → "circa"
- "etc." → "et cetera" or "und so weiter"
- "inkl." → "inklusive"
- "ggf." → "gegebenenfalls"
- "evtl." → "eventuell"
`
```
### Complete OpenAI Example
```typescript
const stream = await openai.chat.completions.create({
model: 'gpt-4',
messages: [
{
role: 'system',
content: `You are a voice assistant for customer service.
CRITICAL VOICE RULES:
- Never use abbreviations with periods (Dr., e.g., etc.)
- Write everything as you would speak it out loud
- Use complete words: "Doctor" not "Dr.", "for example" not "e.g."
- Your responses will be synthesized to speech
Be helpful, concise, and conversational.`
},
{
role: 'user',
content: userMessage
}
],
stream: true,
})
```
### Complete Anthropic Example
```typescript
const stream = await anthropic.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 1024,
system: `You are a voice assistant. Your responses will be converted to speech.
VOICE OUTPUT REQUIREMENTS:
- Write all abbreviations in full (Doctor, not Dr.)
- Avoid period-based abbreviations (e.g., i.e., etc.)
- Use natural spoken language
- Write numbers as words when it sounds more natural
Examples:
Wrong: "Dr. Schmidt can help you, e.g., with billing."
Right: "Doctor Schmidt can help you, for example with billing."
Wrong: "Available Mon.-Fri., 9 a.m.-5 p.m."
Right: "Available Monday through Friday, 9 AM to 5 PM."`,
messages: [
{
role: 'user',
content: userMessage
}
],
stream: true,
})
```
### Testing Your Prompt
Verify your LLM follows voice rules by testing with edge cases:
```typescript
const testCases = [
"Tell me about Dr. Smith",
"What are the benefits, e.g., cost savings?",
"This applies to companies like IBM, Microsoft, etc.",
"Available Mon.-Fri.",
]
// Expected responses should have NO abbreviations with periods
```
::: warning Common Mistake
Don't rely on post-processing to fix abbreviations. LLMs are excellent at following voice guidelines when properly instructed. Post-processing is fragile and language-dependent.
:::
## Using JavaScript's Built-in Sentence Segmenter
JavaScript provides `Intl.Segmenter` - a native API for text segmentation, including sentence detection. It's available in Node.js ≥16.
### Basic Example
```typescript
// Create a sentence segmenter (do this once, reuse for performance)
const sentenceSegmenter = new Intl.Segmenter('en', { granularity: 'sentence' })
function* extractSentences(text: string): Generator<string> {
const segments = sentenceSegmenter.segment(text)
for (const segment of segments) {
yield segment.segment.trim()
}
}
// Usage
const text = "Hello, how can I help? I'm here to assist you today."
for (const sentence of extractSentences(text)) {
console.log(sentence)
// Output:
// "Hello, how can I help?"
// "I'm here to assist you today."
}
```
### Streaming with OpenAI
```typescript
import OpenAI from 'openai'
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const segmenter = new Intl.Segmenter('en', { granularity: 'sentence' })
async function streamOpenAIResponse(
sessionId: string,
userMessage: string,
sendAction: (action: any) => Promise<void>
) {
let buffer = ''
let lastSentenceEnd = 0
const stream = await openai.chat.completions.create({
model: 'gpt-4',
messages: [{ role: 'user', content: userMessage }],
stream: true,
})
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content
if (!content) continue
// Add token to buffer
buffer += content
// Check for complete sentences
const segments = Array.from(segmenter.segment(buffer))
// Find complete sentences (all but possibly the last incomplete one)
for (let i = lastSentenceEnd; i < segments.length - 1; i++) {
const sentence = segments[i].segment.trim()
if (sentence) {
// Send complete sentence to TTS
await sendAction({
type: 'speak',
session_id: sessionId,
text: sentence,
tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
})
}
lastSentenceEnd = i + 1
}
}
// Send any remaining text as final sentence
const remainingSegments = Array.from(segmenter.segment(buffer))
for (let i = lastSentenceEnd; i < remainingSegments.length; i++) {
const sentence = remainingSegments[i].segment.trim()
if (sentence) {
await sendAction({
type: 'speak',
session_id: sessionId,
text: sentence,
tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
})
}
}
}
```
### Streaming with Anthropic Claude
```typescript
import Anthropic from '@anthropic-ai/sdk'
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })
const segmenter = new Intl.Segmenter('en', { granularity: 'sentence' })
async function streamClaudeResponse(
sessionId: string,
userMessage: string,
sendAction: (action: any) => Promise<void>
) {
let buffer = ''
let lastSentenceEnd = 0
const stream = await anthropic.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 1024,
messages: [{ role: 'user', content: userMessage }],
stream: true,
})
for await (const event of stream) {
if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
const content = event.delta.text
// Add token to buffer
buffer += content
// Check for complete sentences
const segments = Array.from(segmenter.segment(buffer))
// Send complete sentences
for (let i = lastSentenceEnd; i < segments.length - 1; i++) {
const sentence = segments[i].segment.trim()
if (sentence) {
await sendAction({
type: 'speak',
session_id: sessionId,
text: sentence,
tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
})
}
lastSentenceEnd = i + 1
}
}
}
// Send remaining text
const remainingSegments = Array.from(segmenter.segment(buffer))
for (let i = lastSentenceEnd; i < remainingSegments.length; i++) {
const sentence = remainingSegments[i].segment.trim()
if (sentence) {
await sendAction({
type: 'speak',
session_id: sessionId,
text: sentence,
tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
})
}
}
}
```
## Reusable Helper Class
For production use, extract this logic into a reusable helper:
```typescript
export class SentenceStreamBuffer {
private buffer = ''
private lastSentenceEnd = 0
private segmenter: Intl.Segmenter
constructor(locale: string = 'en') {
this.segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' })
}
/**
* Add a token/chunk to the buffer and return any complete sentences.
* @returns Array of complete sentences ready to be sent to TTS
*/
push(chunk: string): string[] {
this.buffer += chunk
const segments = Array.from(this.segmenter.segment(this.buffer))
const completeSentences: string[] = []
// Extract complete sentences (all but possibly the last incomplete one)
for (let i = this.lastSentenceEnd; i < segments.length - 1; i++) {
const sentence = segments[i].segment.trim()
if (sentence) {
completeSentences.push(sentence)
}
this.lastSentenceEnd = i + 1
}
return completeSentences
}
/**
* Flush remaining buffer as final sentence(s).
* Call this when the stream ends.
*/
flush(): string[] {
const segments = Array.from(this.segmenter.segment(this.buffer))
const remainingSentences: string[] = []
for (let i = this.lastSentenceEnd; i < segments.length; i++) {
const sentence = segments[i].segment.trim()
if (sentence) {
remainingSentences.push(sentence)
}
}
// Reset state
this.buffer = ''
this.lastSentenceEnd = 0
return remainingSentences
}
/**
* Reset the buffer (useful for error handling or conversation resets)
*/
reset(): void {
this.buffer = ''
this.lastSentenceEnd = 0
}
}
```
### Using the Helper
```typescript
async function streamLLMToVoice(
sessionId: string,
llmStream: AsyncIterable<string>,
sendAction: (action: any) => Promise<void>,
locale: string = 'en'
) {
const buffer = new SentenceStreamBuffer(locale)
try {
// Process streaming tokens
for await (const token of llmStream) {
const sentences = buffer.push(token)
// Send each complete sentence to TTS
for (const sentence of sentences) {
await sendAction({
type: 'speak',
session_id: sessionId,
text: sentence,
tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
})
}
}
// Send any remaining text when stream completes
const finalSentences = buffer.flush()
for (const sentence of finalSentences) {
await sendAction({
type: 'speak',
session_id: sessionId,
text: sentence,
tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
})
}
} catch (error) {
buffer.reset() // Clean up on error
throw error
}
}
```
## Multi-Language Support
`Intl.Segmenter` supports multiple languages out of the box:
```typescript
// German
const germanSegmenter = new Intl.Segmenter('de', { granularity: 'sentence' })
// Spanish
const spanishSegmenter = new Intl.Segmenter('es', { granularity: 'sentence' })
// French
const frenchSegmenter = new Intl.Segmenter('fr', { granularity: 'sentence' })
// Reusable buffer with language detection
function createStreamBuffer(languageCode: string): SentenceStreamBuffer {
return new SentenceStreamBuffer(languageCode)
}
```
## Handling Edge Cases
### Short Responses
For very short responses (single sentence or fragment), the buffer approach still works:
```typescript
// LLM response: "Hello!"
buffer.push("Hello!") // Returns: []
buffer.flush() // Returns: ["Hello!"]
```
### Incomplete Sentences During Interruption
If the user interrupts (barge-in), you may have incomplete sentences in the buffer:
```typescript
// Handle barge-in event
function handleBargeIn(sessionId: string) {
const buffer = sessionBuffers.get(sessionId)
if (buffer) {
// Option 1: Discard incomplete sentence
buffer.reset()
// Option 2: Flush and keep the incomplete sentence for context
const remaining = buffer.flush()
// Store it in your conversation history, but don't send it to TTS
}
}
```
### Very Long Sentences
Sometimes LLMs generate very long sentences. Consider adding a character limit:
```typescript
class SentenceStreamBuffer {
private maxSentenceLength = 500 // characters
push(chunk: string): string[] {
this.buffer += chunk
// Force break on very long buffers
if (this.buffer.length > this.maxSentenceLength && this.buffer.includes(' ')) {
const lastSpace = this.buffer.lastIndexOf(' ', this.maxSentenceLength)
const forcedSentence = this.buffer.substring(0, lastSpace).trim()
this.buffer = this.buffer.substring(lastSpace).trim()
this.lastSentenceEnd = 0
return [forcedSentence]
}
// Normal sentence detection...
const segments = Array.from(this.segmenter.segment(this.buffer))
const completeSentences: string[] = []
for (let i = this.lastSentenceEnd; i < segments.length - 1; i++) {
const sentence = segments[i].segment.trim()
if (sentence) {
completeSentences.push(sentence)
}
this.lastSentenceEnd = i + 1
}
return completeSentences
}
// ... rest of class
}
```
## Performance Considerations
### Buffer Management
For production deployments with many concurrent sessions:
```typescript
// Store buffers per session
const sessionBuffers = new Map<string, SentenceStreamBuffer>()
function getOrCreateBuffer(sessionId: string, locale: string): SentenceStreamBuffer {
if (!sessionBuffers.has(sessionId)) {
sessionBuffers.set(sessionId, new SentenceStreamBuffer(locale))
}
return sessionBuffers.get(sessionId)!
}
// Clean up on session end
function handleSessionEnd(sessionId: string) {
sessionBuffers.delete(sessionId)
}
```
### Timeout Protection
Add timeout to prevent indefinitely buffered text:
```typescript
class SentenceStreamBufferWithTimeout extends SentenceStreamBuffer {
private lastPushTime = Date.now()
private timeout = 5000 // 5 seconds
push(chunk: string): string[] {
this.lastPushTime = Date.now()
return super.push(chunk)
}
hasTimedOut(): boolean {
return Date.now() - this.lastPushTime > this.timeout
}
flushIfTimedOut(): string[] {
if (this.hasTimedOut()) {
return this.flush()
}
return []
}
}
```
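The timed-out flush still needs something to trigger it. A periodic check per session is one way to drive it (a sketch; `sendSentence` is a hypothetical callback that forwards text to TTS):

```typescript
// Periodically flush stalled buffers so no text is held indefinitely.
// Works with any buffer exposing flushIfTimedOut().
interface TimedBuffer {
  flushIfTimedOut(): string[]
}

function watchBuffer(
  buffer: TimedBuffer,
  sendSentence: (text: string) => void,
  intervalMs = 1000,
): () => void {
  const timer = setInterval(() => {
    for (const sentence of buffer.flushIfTimedOut()) {
      sendSentence(sentence)
    }
  }, intervalMs)
  // Call the returned function on session_end to stop the check
  return () => clearInterval(timer)
}
```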
## Complete Example: Express.js Integration
```typescript
import express from 'express'
import OpenAI from 'openai'
const app = express()
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const sessionBuffers = new Map<string, SentenceStreamBuffer>()
app.use(express.json())
app.post('/webhook', async (req, res) => {
const event = req.body
switch (event.type) {
case 'user_speak':
// Don't await - respond immediately to avoid timeout
handleUserSpeak(event).catch(console.error)
return res.status(204).send()
case 'session_end':
sessionBuffers.delete(event.session.id)
return res.status(204).send()
default:
return res.status(204).send()
}
})
async function handleUserSpeak(event: any) {
const sessionId = event.session.id
const userText = event.text
// Get or create buffer for this session
const buffer = sessionBuffers.get(sessionId) || new SentenceStreamBuffer('en')
sessionBuffers.set(sessionId, buffer)
// Stream LLM response
const stream = await openai.chat.completions.create({
model: 'gpt-4',
messages: [{ role: 'user', content: userText }],
stream: true,
})
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content
if (!content) continue
const sentences = buffer.push(content)
for (const sentence of sentences) {
await sendAction({
type: 'speak',
session_id: sessionId,
text: sentence,
tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
})
}
}
// Flush remaining text
const finalSentences = buffer.flush()
for (const sentence of finalSentences) {
await sendAction({
type: 'speak',
session_id: sessionId,
text: sentence,
tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
})
}
}
async function sendAction(action: any) {
// Send action via WebSocket or HTTP to sipgate AI Flow
// Implementation depends on your integration method
await fetch('https://your-aiflow-endpoint/actions', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(action),
})
}
app.listen(3000, () => console.log('Server running on port 3000'))
```
## Fallback for Older Node.js Versions
If you're using Node.js <16, you can use a simple regex-based fallback:
```typescript
// Simple fallback for environments without Intl.Segmenter
function splitSentencesSimple(text: string): string[] {
// Basic sentence splitting (not as robust as Intl.Segmenter)
// Matches sentence endings followed by whitespace
return text
.split(/(?<=[.!?])\s+/)
.map(s => s.trim())
.filter(s => s.length > 0)
}
// Use in SentenceStreamBuffer as fallback
class SentenceStreamBufferLegacy {
private buffer = ''
push(chunk: string): string[] {
this.buffer += chunk
const sentences = splitSentencesSimple(this.buffer)
if (sentences.length > 1) {
// Keep last sentence in buffer (might be incomplete)
const complete = sentences.slice(0, -1)
this.buffer = sentences[sentences.length - 1]
return complete
}
return []
}
flush(): string[] {
const sentence = this.buffer.trim()
this.buffer = ''
return sentence ? [sentence] : []
}
}
```
::: warning Regex Limitations
The regex fallback is less robust than `Intl.Segmenter` and may incorrectly split on abbreviations (Dr., e.g., etc.). If using the fallback, it's even more critical to follow the [LLM prompting guidelines](#prompting-llms-for-voice-output) to avoid abbreviations.
:::
## Best Practices Summary
1. **Prompt LLMs to avoid abbreviations** - Instruct your LLM to write out "Doctor" not "Dr.", "for example" not "e.g." to prevent incorrect segmentation and poor pronunciation
2. **Always segment sentences** - Never send individual tokens to TTS, always buffer and send complete sentences
3. **Use `Intl.Segmenter`** - Native, robust, multi-language support (Node.js ≥16)
4. **Buffer per session** - Keep separate buffers for concurrent conversations
5. **Clean up on session end** - Delete buffers to prevent memory leaks
6. **Handle timeouts** - Flush buffer if no new tokens arrive within 5 seconds
7. **Support multiple languages** - Pass correct locale to `Intl.Segmenter`
8. **Handle barge-in** - Reset or discard incomplete sentences on interruption
9. **Limit sentence length** - Force breaks for very long sentences (500+ characters)
::: tip Token Accumulation Speed
In practice, sentences complete quickly (typically 1-2 seconds with modern LLMs). Users won't notice the buffering delay, but they will notice the dramatic improvement in speech quality.
:::
## Related Documentation
* **[Speak Action](/api/actions/speak)** - Complete reference for the speak action
* **[TTS Providers](/api/tts-providers)** - Azure and ElevenLabs configuration
* **[Barge-In Best Practices](/api/guides/barge-in-best-practices)** - Handling interruptions during speech
* **[Async Hold Pattern](/api/guides/async-hold-pattern)** - Managing long-running LLM requests
---
---
url: /sipgate-ai-flow-api/api/guides/barge-in-best-practices.md
---
# Barge-In Best Practices: Handling User Interruptions Gracefully
When users interrupt your voice AI assistant mid-sentence, how you respond makes the difference between a frustrating experience and a natural conversation. This guide covers best practices for handling barge-in interruptions using the `barged_in` flag in `user_speak` events.
## Why Users Interrupt
In natural human conversations, interruptions happen constantly:
* **"Got it!"** - They understood and don't need the rest
* **"Wait, actually..."** - They want to change direction
* **"No, that's not what I meant"** - Correcting a misunderstanding
* **"Yes yes, I know"** - Impatient, want to move on
* **"Hold on"** - Something came up
A voice assistant that ignores interruptions or handles them poorly feels robotic. Done well, barge-in handling makes your assistant feel responsive and human-like.
## How Barge-In Detection Works
When a user speaks while the assistant is talking, sipgate AI Flow:
1. **Stops the assistant's speech** immediately
2. **Sends a `user_speak` event** with `barged_in: true` and what the user said
3. **Waits for your response** (action or 204 No Content)
```mermaid
sequenceDiagram
participant User
participant Flow as AI Flow
participant App as Your Application
Flow->>User: Speaking: "Let me explain how..."
User->>Flow: Interrupts: "Got it, thanks!"
Flow->>Flow: Stops playback
Flow->>App: Event: user_speak {text: "Got it, thanks!", barged_in: true}
App->>Flow: Action: speak "Great! What else can I help with?"
Flow->>User: "Great! What else can I help with?"
```
## Basic Handling
Check the `barged_in` flag to detect interruptions and respond appropriately:
```typescript
async function handleUserSpeak(event: {
type: 'user_speak'
text: string
barged_in?: boolean
session: { id: string }
}) {
if (event.barged_in) {
// User interrupted - acknowledge quickly
return {
type: 'speak',
session_id: event.session.id,
text: "Of course. What would you like to know?",
tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
}
}
// Normal speech processing
return processUserInput(event)
}
```
## Respond to What They Said
The `text` field contains what the user said when interrupting. Use it to respond appropriately:
```typescript
function handleUserSpeak(event: { type: 'user_speak', text: string, barged_in?: boolean, session: { id: string } }) {
if (!event.barged_in) {
// Normal processing for non-interruptions
return processNormalSpeech(event)
}
const text = event.text.toLowerCase()
// User understood - move on
if (text.includes('got it') || text.includes('understood') || text.includes('okay')) {
return speak("Great! What else can I help you with?")
}
// User wants to change direction
if (text.includes('actually') || text.includes('wait') || text.includes('no')) {
return speak("Of course. What would you like instead?")
}
// User is correcting something
if (text.includes('not what i') || text.includes('i meant')) {
return speak("I apologize for the confusion. Please tell me more.")
}
// User has a new question - process it directly
if (text.length > 25 || text.includes('?')) {
return processAsNewQuestion(event.session.id, event.text)
}
// Default acknowledgment
return speak("I'm listening.")
}
```
## Natural Acknowledgment Phrases
Vary your responses to avoid sounding robotic:
```typescript
const ACKNOWLEDGMENTS = {
understood: [
"Great! What else can I help with?",
"Perfect. Anything else?",
"Alright! What's next?",
],
redirect: [
"Of course. What would you like instead?",
"Sure thing. Go ahead.",
"No problem. What did you have in mind?",
],
listening: [
"I'm listening.",
"Go ahead.",
"Yes?",
],
}
// German equivalents
const ACKNOWLEDGMENTS_DE = {
understood: [
"Sehr gut! Kann ich sonst noch helfen?",
"Alles klar. Was noch?",
"Prima! Was möchten Sie noch wissen?",
],
redirect: [
"Natürlich. Was kann ich für Sie tun?",
"Kein Problem. Was hätten Sie gerne?",
"Selbstverständlich. Bitte?",
],
listening: [
"Ich höre.",
"Ja bitte?",
"Ja?",
],
}
```
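To actually rotate through these lists, a small per-session counter works. This is a sketch; you could equally pick at random or keep the counter in your session state:

```typescript
// Rotate through acknowledgment phrases per session and category,
// so consecutive interruptions get different responses.
const ackCounters = new Map<string, number>()

function nextAcknowledgment(
  sessionId: string,
  category: 'understood' | 'redirect' | 'listening',
  phrases: Record<string, string[]>,
): string {
  const key = `${sessionId}:${category}`
  const count = ackCounters.get(key) ?? 0
  ackCounters.set(key, count + 1)
  const list = phrases[category]
  return list[count % list.length]
}
```

Remember to delete the session's counters on `session_end`.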
## When to Process vs. Acknowledge
If the user said something substantial during a barge-in, treat it as a new question rather than just acknowledging:
```typescript
function handleUserSpeak(event: { type: 'user_speak', text: string, barged_in?: boolean }) {
if (!event.barged_in) {
return processNormalSpeech(event)
}
const interruptText = event.text.trim()
// Substantial interruption = likely a complete thought or question
if (interruptText.length > 25 || interruptText.includes('?')) {
// Process as a complete question
return processUserQuestion(event.text)
}
// Short interruption = just acknowledge
return speak("I'm listening.")
}
```
This provides a smoother experience - users don't have to repeat themselves.
## Silent Acknowledgment
Sometimes the best response is no response. Return `204 No Content` to simply listen:
```typescript
function handleUserSpeak(event: { type: 'user_speak', text: string, barged_in?: boolean }) {
if (!event.barged_in) {
return processNormalSpeech(event)
}
const text = event.text.toLowerCase()
// User just said "um", "uh", background noise, etc.
if (text.length < 3) {
return new Response(null, { status: 204 })
}
// User said "stop" or similar - they probably want silence
if (text === 'stop' || text === 'quiet') {
return new Response(null, { status: 204 })
}
return speak("I'm listening.")
}
```
## Configure Barge-In Sensitivity
Use the [barge-in configuration](/api/barge-in) to control when interruptions trigger:
### Immediate Response (Most Natural) ⚡
```typescript
// Most responsive - triggers on voice detection (20-100ms)
return {
type: 'speak',
session_id: sessionId,
text: "I can help you with billing, support, or sales...",
barge_in: {
strategy: 'immediate',
allow_after_ms: 500, // Protect first 500ms from accidental noise
},
}
```
**Best for:**
* Natural conversations where instant response matters
* Customer service with high urgency
* Interactive dialogues
**Trade-off:** May trigger on background noise. Use `allow_after_ms` as a buffer.
### Character-Based (Balanced)
```typescript
// Balanced - triggers after 3+ characters recognized
return {
type: 'speak',
session_id: sessionId,
text: "Let me explain how this works...",
barge_in: {
strategy: 'minimum_characters',
minimum_characters: 3, // Trigger quickly but reliably
},
}
// Protect important information
return {
type: 'speak',
session_id: sessionId,
text: "Your confirmation code is 7-4-2-9. Please write this down.",
barge_in: {
strategy: 'minimum_characters',
minimum_characters: 10, // Require more speech
allow_after_ms: 3000, // Protect first 3 seconds
},
}
```
### No Interruption (Critical Info)
```typescript
// Never allow interruption for critical info
return {
type: 'speak',
session_id: sessionId,
text: "This call may be recorded for quality assurance.",
barge_in: {
strategy: 'none',
},
}
```
### Strategy Comparison
| Strategy | Latency | Reliability | Use Case |
|----------|---------|-------------|----------|
| `immediate` | 20-100ms | May trigger on noise | Most natural conversations |
| `minimum_characters` | 50-200ms | Very reliable | Balanced approach |
| `manual` | N/A | Perfect | Custom logic |
| `none` | N/A | Perfect | Critical info only |
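If you send many speak actions, a small helper can centralize the choice of strategy. A sketch - the message categories and thresholds below are illustrative defaults, not part of the API:

```typescript
// Pick a barge-in config based on how important the message is.
type BargeInConfig =
  | { strategy: 'none' }
  | { strategy: 'immediate'; allow_after_ms: number }
  | { strategy: 'minimum_characters'; minimum_characters: number; allow_after_ms?: number }

function bargeInFor(kind: 'critical' | 'important' | 'conversational'): BargeInConfig {
  switch (kind) {
    case 'critical':
      // Legal notices, confirmation codes - never interruptible
      return { strategy: 'none' }
    case 'important':
      // Require substantial speech and protect the first 3 seconds
      return { strategy: 'minimum_characters', minimum_characters: 10, allow_after_ms: 3000 }
    case 'conversational':
      // Most natural - instant interruption after a short noise guard
      return { strategy: 'immediate', allow_after_ms: 500 }
  }
}
```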
## Handling Impatient Users
Some users interrupt frequently. Keep acknowledgments brief:
```typescript
// Track interruption count per session
const interruptCounts = new Map()
function handleUserSpeak(event: { type: 'user_speak', text: string, barged_in?: boolean, session: { id: string } }) {
if (!event.barged_in) {
return processNormalSpeech(event)
}
const sessionId = event.session.id
const count = (interruptCounts.get(sessionId) || 0) + 1
interruptCounts.set(sessionId, count)
// User interrupts a lot - be extra brief
if (count > 3) {
return speak("Yes?")
}
return speak("Of course. What would you like?")
}
```
## Complete Example
```typescript
// Session state maps used below (interruptCounts mirrors the "Handling Impatient Users" example)
const sessionStates = new Map<string, unknown>()
const interruptCounts = new Map<string, number>()
export async function POST(req: Request): Promise<Response> {
const event = await req.json()
const sessionId = event.session.id
switch (event.type) {
case 'user_speak':
return handleUserSpeak(event)
case 'session_end':
// Clean up session state
sessionStates.delete(sessionId)
interruptCounts.delete(sessionId)
return new Response(null, { status: 204 })
default:
return new Response(null, { status: 204 })
}
}
function handleUserSpeak(event: {
type: 'user_speak'
text: string
barged_in?: boolean
session: { id: string }
}): Response {
const sessionId = event.session.id
// Handle normal speech
if (!event.barged_in) {
return processNormalUserSpeech(event)
}
// Barge-in handling
const text = event.text.trim().toLowerCase()
// Very short - probably noise, stay silent
if (text.length < 3) {
return new Response(null, { status: 204 })
}
// User understood / confirmed
if (text.includes('got it') || text.includes('thanks') || text.includes('okay')) {
return speak(sessionId, "Great! What else can I help with?")
}
// User wants to redirect
if (text.includes('actually') || text.includes('wait') || text.includes('but')) {
return speak(sessionId, "Of course. What would you like?")
}
// Substantial text - treat as new input
if (event.text.length > 25 || event.text.includes('?')) {
return processNormalUserSpeech(event)
}
// Default
return speak(sessionId, "I'm listening.")
}
function speak(sessionId: string, text: string): Response {
return Response.json({
type: 'speak',
session_id: sessionId,
text,
tts: {
provider: 'azure',
language: 'en-US',
voice: 'en-US-JennyNeural',
},
})
}
```
## Best Practices Summary
1. **Respond to intent** - Use the `text` field to understand why they interrupted
2. **Be brief** - Short acknowledgments sound natural ("Got it!" not "I understand that you have indicated...")
3. **Vary your phrases** - Rotate through different acknowledgments
4. **Process substantial interruptions** - If they said a lot, treat it as a new question
5. **Sometimes stay silent** - Return 204 for noise or "stop" commands
6. **Configure sensitivity** - Use `barge_in` config to protect important information
7. **Keep impatient users happy** - Shorter responses for frequent interrupters
8. **Clean up state** - If you're tracking conversation state, consider resetting flags like "expecting confirmation" when the user interrupts
::: tip Async Operations
If you're using the [Async Hold Pattern](/api/guides/async-hold-pattern) for slow operations, remember to cancel pending work when the user interrupts - they've moved on and don't want the old answer.
:::
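Concretely, the interrupt path can drop the stale work before responding. The sketch below inlines minimal stand-ins for the `hasPending`/`cancelPending` helpers described in that guide:

```typescript
// Minimal stand-ins for the pending-state helpers from the Async Hold Pattern guide
const pendingOps = new Map<string, Promise<unknown>>()
const hasPending = (id: string): boolean => pendingOps.has(id)
const cancelPending = (id: string): void => { pendingOps.delete(id) }

// On barge-in, drop any in-flight background work before responding
function handleBargeIn(event: { text: string; session: { id: string } }) {
  const sessionId = event.session.id
  if (hasPending(sessionId)) {
    // The user moved on - the old answer is no longer wanted
    cancelPending(sessionId)
  }
  return {
    type: 'speak',
    session_id: sessionId,
    text: 'Of course. What would you like instead?',
  }
}
```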
## Related Documentation
* **[Barge-In Configuration](/api/barge-in)** - Configure interruption sensitivity
* **[User Speak Event with Barge-In Flag](/api/events/user-speak)** - Event reference
* **[Event Flow](/api/event-flow)** - Complete event lifecycle
---
---
url: /sipgate-ai-flow-api/api/guides/async-hold-pattern.md
---
# Handling Long-Running Requests in Voice AI: The Async Hold Pattern
When building voice AI assistants that integrate with external tools (like MCP servers, RAG systems, or slow APIs), you'll inevitably face a challenge: **sipgate AI Flow has webhook timeout limits**, but your backend operations might take much longer.
This guide explains how to implement an elegant solution we call the "Async Hold Pattern" - keeping callers engaged while your system processes their request in the background.
## The Problem
sipgate AI Flow enforces webhook timeout limits of approximately **5 seconds**. If your server doesn't respond in time, the platform may drop the connection or return an error to the caller.
But what if your assistant needs to:
* Query a slow external API (20+ seconds)
* Search through a large knowledge base
* Call an MCP (Model Context Protocol) server
* Perform complex RAG operations
* Fetch real-time data from third-party services
You can't make the caller wait in silence, and you can't speed up the external service. So what do you do?
## The Solution: Async Hold Pattern
Instead of blocking on the slow operation, we:
1. **Start the operation in the background** (don't await it)
2. **Wait briefly** for a quick response (e.g., 4 seconds)
3. **If completed quickly** → return the result directly
4. **If still pending** → tell the caller to wait, then check again when they're done listening
This leverages a key insight: **sipgate AI Flow sends an `assistant_speech_ended` event when the assistant finishes speaking**. We can use this to create a polling loop that keeps the caller informed.
```mermaid
flowchart TD
A[/"user_speak event"/] --> B["Start slow operation
(don't await)"]
B --> C["Wait up to 4 seconds"]
C --> D{Completed?}
D -->|Yes| E[/"Return result to caller"/]
D -->|No| F["Return hold message:
'One moment, let me check...'"]
F --> G[/"assistant_speech_ended event"/]
G --> H["Wait up to 4 seconds"]
H --> I{Completed?}
I -->|Yes| J[/"Return result to caller"/]
I -->|No| K["Return next hold message:
'Still searching...'"]
K --> G
style A fill:#e1f5fe
style G fill:#e1f5fe
style E fill:#c8e6c9
style J fill:#c8e6c9
style F fill:#fff3e0
style K fill:#fff3e0
```
## Implementation
### Step 1: Create a Pending State Manager
First, we need a way to store the background promise and track state across webhook calls. Since each webhook call is a separate HTTP request, we need shared state:
```typescript
// pending-state.ts
interface PendingState {
promise: Promise<{ response: string; error?: string }>
startedAt: number
holdMessageCount: number
userMessage: string
}
// In-memory store (use Redis for multi-instance deployments)
const pendingStates = new Map<string, PendingState>()
// Hold messages - rotate through these while waiting
const HOLD_MESSAGES = [
'One moment, let me check...',
'Still searching...',
'Just a moment longer...',
'Almost there...',
]
// How long to wait before responding (stay under sipgate's timeout!)
const WAIT_BEFORE_RESPONSE_MS = 4000
export function startPending(
sessionId: string,
promise: Promise<{ response: string; error?: string }>,
userMessage: string
): void {
pendingStates.set(sessionId, {
promise,
startedAt: Date.now(),
holdMessageCount: 0,
userMessage,
})
}
export function hasPending(sessionId: string): boolean {
return pendingStates.has(sessionId)
}
export function cancelPending(sessionId: string): void {
pendingStates.delete(sessionId)
}
export function getNextHoldMessage(sessionId: string): string {
const state = pendingStates.get(sessionId)
if (!state) return HOLD_MESSAGES[0]
const index = Math.min(state.holdMessageCount, HOLD_MESSAGES.length - 1)
state.holdMessageCount++
return HOLD_MESSAGES[index]
}
export async function waitForCompletion(
sessionId: string
): Promise<{ response: string; error?: string } | null> {
const state = pendingStates.get(sessionId)
if (!state) return null
// Race between the promise and a timeout
const timeoutPromise = new Promise<null>((resolve) => {
setTimeout(() => resolve(null), WAIT_BEFORE_RESPONSE_MS)
})
const result = await Promise.race([state.promise, timeoutPromise])
if (result !== null) {
// Completed! Clean up and return
pendingStates.delete(sessionId)
return result
}
return null // Still pending
}
```
### Step 2: Handle the Initial `user_speak` Event
When a user speaks, start the background operation and wait briefly:
```typescript
// webhook-handler.ts
async function handleUserSpeak(event: {
type: 'user_speak'
session: { id: string }
text: string
}) {
const sessionId = event.session.id
// Cancel any existing pending operation (user asked a new question)
cancelPending(sessionId)
// Start the slow operation in background (DON'T await the full operation!)
const operationPromise = performSlowOperation(event.text)
.then(result => ({ response: result }))
.catch(error => ({ response: '', error: String(error) }))
// Wait up to 4 seconds for completion
const INITIAL_WAIT_MS = 4000
const timeoutPromise = new Promise<null>((resolve) => {
setTimeout(() => resolve(null), INITIAL_WAIT_MS)
})
const quickResult = await Promise.race([operationPromise, timeoutPromise])
if (quickResult !== null) {
// Completed quickly! Return result directly - no hold message needed
console.log('Operation completed within 4s - returning direct response')
if (quickResult.error) {
return createSpeakResponse(sessionId, 'I\'m sorry, there was an error.')
}
return createSpeakResponse(sessionId, quickResult.response)
}
// Taking too long - switch to hold pattern
console.log('Operation taking >4s - using hold pattern')
// Store the promise for the assistant_speech_ended handler
startPending(sessionId, operationPromise, event.text)
// Return hold message
return createSpeakResponse(sessionId, getNextHoldMessage(sessionId))
}
function createSpeakResponse(sessionId: string, text: string) {
return {
type: 'speak',
session_id: sessionId,
text: text,
tts: {
provider: 'azure',
language: 'en-US',
voice: 'en-US-JennyNeural',
},
}
}
```
### Step 3: Handle the `assistant_speech_ended` Event
This is the key insight: sipgate AI Flow sends an `assistant_speech_ended` event when the assistant finishes speaking. **You can return a new action from this event!**
```typescript
async function handleAssistantSpeechEnded(event: {
type: 'assistant_speech_ended'
session: { id: string }
}) {
const sessionId = event.session.id
// No pending operation? Nothing to do - return 204 No Content
if (!hasPending(sessionId)) {
return new Response(null, { status: 204 })
}
// Wait for completion (up to 4 seconds to maximize processing time)
const result = await waitForCompletion(sessionId)
if (result !== null) {
// Done! Return the actual response
console.log('Operation completed - returning result')
if (result.error) {
return createSpeakResponse(
sessionId,
'I\'m sorry, I couldn\'t find that information.'
)
}
return createSpeakResponse(sessionId, result.response)
}
// Still pending - say another hold message and wait for next speech_ended
console.log('Still waiting - returning another hold message')
return createSpeakResponse(sessionId, getNextHoldMessage(sessionId))
}
```
### Step 4: Wire Up the Webhook Router
```typescript
export async function POST(request: Request) {
const event = await request.json()
switch (event.type) {
case 'session_start':
return handleSessionStart(event)
case 'user_speak':
return handleUserSpeak(event)
case 'assistant_speech_ended':
return handleAssistantSpeechEnded(event)
case 'session_end':
// Clean up any pending state
cancelPending(event.session.id)
return handleSessionEnd(event)
default:
// Return 204 for events we don't handle
return new Response(null, { status: 204 })
}
}
```
## Key Insights
### Why 4 Seconds?
sipgate AI Flow's webhook timeout is approximately **5 seconds**. We use 4 seconds to:
* Leave buffer for network latency
* Allow time for the response to be transmitted
* Stay safely under the limit
You can adjust this value, but always leave at least 500ms-1s of buffer.
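The arithmetic is simple enough to encode directly, so the buffer stays explicit if the platform timeout ever changes:

```typescript
// Derive the wait budget from the platform timeout and a safety buffer.
// 5000ms is the approximate sipgate webhook timeout described above.
const PLATFORM_TIMEOUT_MS = 5000
const SAFETY_BUFFER_MS = 1000 // covers network latency + response transmission
const WAIT_BEFORE_RESPONSE_MS = PLATFORM_TIMEOUT_MS - SAFETY_BUFFER_MS // 4000ms
```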
### The `assistant_speech_ended` Event is Powerful
Many developers overlook this event, but it's the key to the async pattern. When the assistant finishes speaking, sipgate sends this event and **waits for your response**. You can:
* Return a new `speak` action to continue talking
* Return `204 No Content` to stay silent and wait for user input
* Check if your background operation completed
This creates a natural polling mechanism without awkward silences.
### Memory vs. Redis
The example uses an in-memory `Map` for simplicity. This works for single-instance deployments, but for production with multiple server instances behind a load balancer, use Redis:
```typescript
import Redis from 'ioredis'
const redis = new Redis(process.env.REDIS_URL)
// Keep promise references in memory - promises can't be serialized,
// and the instance that started the work will also see its completion
const pendingPromises = new Map<string, Promise<{ response: string; error?: string }>>()
// Store metadata in Redis so any instance can answer "is something pending?"
export async function startPending(
sessionId: string,
promise: Promise<{ response: string; error?: string }>,
userMessage: string
): Promise<void> {
await redis.setex(
`pending:${sessionId}`,
120, // 2 minute TTL
JSON.stringify({
startedAt: Date.now(),
holdMessageCount: 0,
userMessage
})
)
// Keep the promise reference in memory (the same instance will handle it)
pendingPromises.set(sessionId, promise)
}
```
### Always Cancel Previous Operations
When a user asks a new question while you're still processing the old one, cancel the old operation:
```typescript
// In user_speak handler - always cancel previous pending operation
cancelPending(event.session.id)
```
This prevents confusion and wasted resources. The user doesn't care about the old answer anymore.
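Deleting the map entry stops you from using the stale result, but the underlying work keeps running. If the operation is abortable (for example, a `fetch` call), you can go further with an `AbortController` - a sketch, assuming the slow operation is an HTTP request:

```typescript
// Track an AbortController alongside each pending operation so cancelling
// actually aborts the underlying HTTP request, not just the bookkeeping.
const pendingControllers = new Map<string, AbortController>()

function startCancellableFetch(sessionId: string, url: string): Promise<string> {
  // A new question supersedes the old one - abort any previous request
  pendingControllers.get(sessionId)?.abort()
  const controller = new AbortController()
  pendingControllers.set(sessionId, controller)
  return fetch(url, { signal: controller.signal })
    .then(res => res.text())
    .finally(() => {
      // Only clean up if a newer request hasn't replaced this controller
      if (pendingControllers.get(sessionId) === controller) {
        pendingControllers.delete(sessionId)
      }
    })
}

function abortPending(sessionId: string): void {
  pendingControllers.get(sessionId)?.abort()
  pendingControllers.delete(sessionId)
}
```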
### Clean Up on Session End
Always clean up when the call ends:
```typescript
case 'session_end':
cancelPending(event.session.id)
return handleSessionEnd(event)
```
## Advanced: Caching Slow Initializations
If your slow operation has a one-time initialization step (like discovering available tools from an MCP server), cache it separately:
```typescript
// BAD: Fetching tool definitions on every request
async function handleUserSpeak(event) {
const tools = await mcpServer.listTools() // SLOW - 5+ seconds!
const response = await llm.generate({ tools, message: event.text })
return createSpeakResponse(sessionId, response)
}
// GOOD: Cache tool definitions, only fetch once
async function handleUserSpeak(event) {
// Tools were cached when server was configured
const tools = await database.get('mcp_server_tools', serverId) // FAST - <100ms
const response = await llm.generate({ tools, message: event.text })
return createSpeakResponse(sessionId, response)
}
// Cache tools when MCP server is configured (admin action)
async function configureMcpServer(serverUrl: string) {
const tools = await mcpServer.listTools() // Slow, but only happens once
await database.set('mcp_server_tools', serverId, tools)
}
```
This separates "one-time setup" (caching tool definitions) from "per-request work" (calling tools), dramatically improving response times.
## Complete Minimal Example
Here's a self-contained example you can adapt:
```typescript
// ============================================
// pending-state.ts
// ============================================
const pendingStates = new Map<string, {
promise: Promise<{ response: string; error?: string }>
holdCount: number
}>()
const HOLD_MESSAGES = [
'One moment please...',
'Still searching...',
'Almost there...',
]
const WAIT_MS = 4000
export const pending = {
start(id: string, promise: Promise<{ response: string; error?: string }>) {
pendingStates.set(id, { promise, holdCount: 0 })
},
has(id: string): boolean {
return pendingStates.has(id)
},
cancel(id: string): void {
pendingStates.delete(id)
},
getHoldMessage(id: string): string {
const state = pendingStates.get(id)
if (!state) return HOLD_MESSAGES[0]
return HOLD_MESSAGES[Math.min(state.holdCount++, HOLD_MESSAGES.length - 1)]
},
async wait(id: string): Promise<{ response: string; error?: string } | null> {
const state = pendingStates.get(id)
if (!state) return null
const timeout = new Promise<null>(r => setTimeout(() => r(null), WAIT_MS))
const result = await Promise.race([state.promise, timeout])
if (result !== null) {
pendingStates.delete(id)
}
return result
},
}
// ============================================
// webhook.ts
// ============================================
import { pending } from './pending-state'
export async function POST(req: Request): Promise<Response> {
const event = await req.json()
const sessionId = event.session.id
// Handle user_speak - start background operation
if (event.type === 'user_speak') {
pending.cancel(sessionId) // Cancel any previous operation
// Start slow operation (don't await fully!)
const promise = slowExternalApiCall(event.text)
.then(result => ({ response: result }))
.catch(err => ({ response: '', error: String(err) }))
// Wait up to 4 seconds
const timeout = new Promise<null>(r => setTimeout(() => r(null), 4000))
const quick = await Promise.race([promise, timeout])
// If completed quickly, return result directly
if (quick !== null) {
if (quick.error) {
return speak(sessionId, 'Sorry, there was an error.')
}
return speak(sessionId, quick.response)
}
// Taking too long - use hold pattern
pending.start(sessionId, promise)
return speak(sessionId, pending.getHoldMessage(sessionId))
}
// Handle assistant_speech_ended - check if operation completed
if (event.type === 'assistant_speech_ended') {
if (!pending.has(sessionId)) {
return new Response(null, { status: 204 })
}
const result = await pending.wait(sessionId)
if (result !== null) {
if (result.error) {
return speak(sessionId, 'The request could not be processed.')
}
return speak(sessionId, result.response)
}
// Still waiting - another hold message
return speak(sessionId, pending.getHoldMessage(sessionId))
}
// Handle session_end - clean up
if (event.type === 'session_end') {
pending.cancel(sessionId)
}
return new Response(null, { status: 204 })
}
function speak(sessionId: string, text: string): Response {
return Response.json({
type: 'speak',
session_id: sessionId,
text,
tts: {
provider: 'azure',
language: 'en-US',
voice: 'en-US-GuyNeural',
},
})
}
// Your slow operation (replace with actual implementation)
async function slowExternalApiCall(query: string): Promise<string> {
// Simulating a slow API call
await new Promise(r => setTimeout(r, 15000))
return `Here's what I found about "${query}"...`
}
```
## Conclusion
The Async Hold Pattern transforms a technical limitation into a natural conversation flow. Instead of timing out or making users wait in awkward silence, your assistant says "One moment please..." - just like a human would.
**Key takeaways:**
1. **Start slow operations without awaiting** - let them run in the background
2. **Wait briefly (4 seconds)** before deciding to use hold messages
3. **Use the `assistant_speech_ended` event** to poll for completion
4. **Keep messages varied** - rotate through different hold phrases
5. **Always clean up** - cancel pending operations when no longer needed
6. **Cache when possible** - separate one-time setup from per-request work
This pattern works with any slow backend operation: MCP servers, RAG pipelines, external APIs, database queries, or anything else that might exceed the webhook timeout.
***
*For more information about sipgate AI Flow events and actions, see the [sipgate AI Flow API documentation](https://sipgate.github.io/sipgate-ai-flow-api/).*
---
---
url: /sipgate-ai-flow-api/api/guides/testing-voice-assistants.md
---
# Testing Voice Assistants Without Making Phone Calls
Testing voice assistants is challenging - you can't just write unit tests and call it a day. Real phone calls are slow, awkward to automate, and expensive at scale. This guide covers practical strategies for testing your sipgate AI Flow integration at every level.
## The Testing Challenge
Voice assistants have unique testing challenges:
* **Real calls are slow** - Each test takes 30+ seconds of actual talking
* **Hard to automate** - You can't easily script "say this, wait for response"
* **Expensive at scale** - Phone minutes add up during development
* **Environment-dependent** - Need a publicly accessible webhook URL
* **Non-deterministic** - Speech recognition varies, LLM responses vary
The solution: test at multiple levels, saving real phone calls for final validation.
## Testing Pyramid for Voice AI
```mermaid
graph TB
subgraph "Testing Pyramid"
A["🔺 Real Phone Calls
(Few, Final Validation)"]
B["🔸 Event Simulation
(HTTP requests to your webhook)"]
C["🔹 Chat Simulator
(Test LLM logic via text)"]
D["🟦 Unit Tests
(Business logic, utilities)"]
end
D --> C --> B --> A
style A fill:#ffcdd2
style B fill:#fff3e0
style C fill:#e3f2fd
style D fill:#c8e6c9
```
## Level 1: Unit Tests
Test your business logic in isolation - no sipgate, no LLM calls.
```typescript
// utils/intent-detection.ts
export function detectIntent(text: string): 'greeting' | 'question' | 'goodbye' | 'unknown' {
const lower = text.toLowerCase()
if (lower.match(/^(hi|hello|hey|good morning)/)) return 'greeting'
if (lower.match(/(bye|goodbye|see you|thanks)/)) return 'goodbye'
if (lower.includes('?')) return 'question'
return 'unknown'
}
// utils/intent-detection.test.ts
import { detectIntent } from './intent-detection'
describe('detectIntent', () => {
it('detects greetings', () => {
expect(detectIntent('Hello there')).toBe('greeting')
expect(detectIntent('Hi!')).toBe('greeting')
expect(detectIntent('Good morning')).toBe('greeting')
})
it('detects questions', () => {
expect(detectIntent('What are your hours?')).toBe('question')
expect(detectIntent('Can you help me?')).toBe('question')
})
it('detects goodbyes', () => {
expect(detectIntent('Goodbye')).toBe('goodbye')
expect(detectIntent('Thanks, bye!')).toBe('goodbye')
})
})
```
**What to unit test:**
* Intent detection logic
* Response formatting
* State machine transitions
* Phone number normalization
* TTS configuration building
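As an example of the kind of helper worth covering this way, here is a sketch of phone number normalization. The `+49` default region and the exact rules are illustrative assumptions:

```typescript
// Normalize phone numbers to a rough E.164 form for comparison.
// Assumes Germany (+49) as the default region - adjust for your deployment.
function normalizePhoneNumber(input: string, defaultCountryCode = '49'): string {
  // Strip everything except digits and a leading plus
  const digits = input.replace(/[^\d+]/g, '')
  if (digits.startsWith('+')) return digits
  // 00-prefixed international format
  if (digits.startsWith('00')) return '+' + digits.slice(2)
  // National format: replace the leading 0 with the country code
  if (digits.startsWith('0')) return '+' + defaultCountryCode + digits.slice(1)
  return '+' + digits
}
```

Unit tests can then assert that all three input formats collapse to the same value.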
## Level 2: Chat Simulator
Build a text-based interface that uses the same LLM logic as your voice assistant. This lets you rapidly iterate on prompts and conversation flow without any phone infrastructure.
```typescript
// The key insight: extract your LLM logic into a shared service
// lib/conversation-service.ts
export async function generateResponse(params: {
systemPrompt: string
conversationHistory: { role: 'user' | 'assistant'; content: string }[]
userMessage: string
}): Promise<string> {
// Your LLM call logic here
// This is used by BOTH the webhook AND the chat simulator
}
```
```typescript
// Webhook uses it
async function handleUserSpeak(event: UserSpeakEvent) {
const response = await generateResponse({
systemPrompt: assistant.system_prompt,
conversationHistory: history,
userMessage: event.text,
})
return speak(response)
}
// Chat simulator uses the SAME function
async function handleChatMessage(message: string, sessionId: string) {
const response = await generateResponse({
systemPrompt: assistant.system_prompt,
conversationHistory: history,
userMessage: message,
})
return { response }
}
```
**Benefits:**
* Test conversation flow in seconds, not minutes
* Iterate on system prompts quickly
* Debug LLM issues without phone overhead
* Share sessions with teammates for review
**Limitations:**
* Doesn't test speech recognition accuracy
* Doesn't test TTS pronunciation
* Doesn't test real-time timing
## Level 3: Event Simulation
Send fake sipgate events directly to your webhook. This tests your actual webhook handler without needing a phone call.
### Manual Testing with curl
```bash
# Simulate session_start
curl -X POST http://localhost:3000/api/webhook \
-H "Content-Type: application/json" \
-d '{
"type": "session_start",
"session": {
"id": "test-session-123",
"account_id": "test-account",
"phone_number": "1234567890",
"direction": "inbound",
"from_phone_number": "0987654321",
"to_phone_number": "1234567890"
}
}'
# Simulate user_speak
curl -X POST http://localhost:3000/api/webhook \
-H "Content-Type: application/json" \
-d '{
"type": "user_speak",
"session": {
"id": "test-session-123",
"account_id": "test-account",
"phone_number": "1234567890"
},
"text": "What are your business hours?"
}'
# Simulate user_speak with interruption (barged_in: true)
curl -X POST http://localhost:3000/api/webhook \
-H "Content-Type: application/json" \
-d '{
"type": "user_speak",
"session": {
"id": "test-session-123",
"account_id": "test-account",
"phone_number": "1234567890"
},
"text": "Actually, never mind",
"barged_in": true
}'
# Simulate session_end
curl -X POST http://localhost:3000/api/webhook \
-H "Content-Type: application/json" \
-d '{
"type": "session_end",
"session": {
"id": "test-session-123",
"account_id": "test-account",
"phone_number": "1234567890"
},
"reason": "caller_hangup"
}'
```
### Automated Integration Tests
```typescript
// tests/webhook.test.ts
import { describe, it, expect, beforeEach } from 'vitest'
const WEBHOOK_URL = 'http://localhost:3000/api/webhook'
describe('Webhook Integration', () => {
const sessionId = `test-${Date.now()}`
it('handles session_start and returns greeting', async () => {
const response = await fetch(WEBHOOK_URL, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
type: 'session_start',
session: {
id: sessionId,
account_id: 'test',
phone_number: '1234567890',
direction: 'inbound',
from_phone_number: '0987654321',
to_phone_number: '1234567890',
},
}),
})
expect(response.ok).toBe(true)
const data = await response.json()
expect(data.type).toBe('speak')
expect(data.text).toBeTruthy()
expect(data.session_id).toBe(sessionId)
})
it('handles user_speak and returns response', async () => {
const response = await fetch(WEBHOOK_URL, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
type: 'user_speak',
session: { id: sessionId, account_id: 'test', phone_number: '1234567890' },
text: 'What are your hours?',
}),
})
expect(response.ok).toBe(true)
const data = await response.json()
expect(data.type).toBe('speak')
expect(data.text).toBeTruthy()
})
it('handles barge-in gracefully', async () => {
const response = await fetch(WEBHOOK_URL, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
type: 'user_speak',
barged_in: true,
session: { id: sessionId, account_id: 'test', phone_number: '1234567890' },
text: 'Wait',
}),
})
expect(response.ok).toBe(true)
// Could be 204 or a speak action
if (response.status !== 204) {
const data = await response.json()
expect(data.type).toBe('speak')
}
})
it('handles session_end and cleans up', async () => {
const response = await fetch(WEBHOOK_URL, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
type: 'session_end',
session: { id: sessionId, account_id: 'test', phone_number: '1234567890' },
reason: 'caller_hangup',
}),
})
expect(response.ok).toBe(true)
})
})
```
### Conversation Flow Tests
Test complete conversation scenarios:
```typescript
// tests/flows/booking-flow.test.ts
import { describe, it, expect } from 'vitest'

const WEBHOOK_URL = 'http://localhost:3000/api/webhook'

// Minimal helper: POST an event and parse the JSON response (null on 204)
async function sendEvent(event: object): Promise<any> {
  const response = await fetch(WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(event),
  })
  return response.status === 204 ? null : response.json()
}

function testSession(sessionId: string) {
  return { id: sessionId, account_id: 'test', phone_number: '1234567890' }
}

async function simulateConversation(messages: string[]): Promise<string[]> {
  const sessionId = `test-${Date.now()}`
  const responses: string[] = []
  // Start session
  await sendEvent({ type: 'session_start', session: testSession(sessionId) })
  // Simulate each user message
  for (const message of messages) {
    const response = await sendEvent({
      type: 'user_speak',
      session: testSession(sessionId),
      text: message,
    })
    responses.push(response.text)
  }
  // End session
  await sendEvent({ type: 'session_end', session: testSession(sessionId) })
  return responses
}
describe('Booking Flow', () => {
it('completes a booking conversation', async () => {
const responses = await simulateConversation([
'I want to book an appointment',
'Tomorrow at 2pm',
'John Smith',
'Yes, that is correct',
])
expect(responses[0]).toMatch(/when|date|time/i)
expect(responses[1]).toMatch(/name/i)
expect(responses[2]).toMatch(/confirm/i)
expect(responses[3]).toMatch(/booked|confirmed|scheduled/i)
})
it('handles corrections mid-flow', async () => {
const responses = await simulateConversation([
'I want to book an appointment',
'Tomorrow at 2pm',
'Actually, make it 3pm instead',
])
expect(responses[2]).toMatch(/3|three|pm/i)
})
})
```
## Level 4: Local Development with ngrok
For testing with real sipgate infrastructure (but simulated calls), expose your local server:
```bash
# Start your dev server
npm run dev
# In another terminal, expose it
ngrok http 3000
```
Configure the ngrok URL as your webhook endpoint in sipgate. Now sipgate can reach your local development server.
**Use cases:**
* Test webhook authentication
* Test with sipgate's actual event format
* Debug production issues locally
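Once the tunnel is up, a quick smoke test confirms it works end to end. The sketch below posts a simulated `session_start` event through the tunnel; the ngrok URL is a placeholder, and `build_test_event` / `post_event` are illustrative helper names, not part of any API:

```python
import json
import urllib.request

def build_test_event(session_id='smoke-test'):
    # Same shape as a real session_start event
    return {
        'type': 'session_start',
        'session': {
            'id': session_id,
            'account_id': 'test-account',
            'phone_number': '1234567890',
            'direction': 'inbound',
            'from_phone_number': '0987654321',
            'to_phone_number': '1234567890',
        },
    }

def post_event(url, event):
    # POST the event as JSON and return the HTTP status code
    req = urllib.request.Request(
        url,
        data=json.dumps(event).encode(),
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Calling `post_event('https://<your-id>.ngrok-free.app/api/webhook', build_test_event())` should return whatever status your webhook answers with (typically 200 or 204).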
## Level 5: Real Phone Calls
Save these for final validation. Create a testing checklist:
```markdown
## Pre-Release Phone Test Checklist
### Basic Flow
- [ ] Call connects and greeting plays
- [ ] Assistant responds to simple question
- [ ] Assistant handles "I don't understand" gracefully
- [ ] Call ends cleanly when user says goodbye
### Barge-In
- [ ] Interrupting mid-sentence works
- [ ] Assistant acknowledges interruption
- [ ] No "stale" responses after interruption
### Edge Cases
- [ ] Long silence from user (10+ seconds)
- [ ] Very long user input (30+ seconds of speaking)
- [ ] Background noise doesn't trigger false responses
- [ ] Accent/dialect recognition (if applicable)
### Error Handling
- [ ] Network timeout during LLM call
- [ ] Invalid user input
- [ ] Session state recovery after errors
```
## Testing Utilities
### Event Factory
Create a helper for generating test events:
```typescript
// tests/utils/event-factory.ts
export function createSessionStartEvent(overrides = {}) {
return {
type: 'session_start',
session: {
id: `test-${Date.now()}`,
account_id: 'test-account',
phone_number: '1234567890',
direction: 'inbound',
from_phone_number: '0987654321',
to_phone_number: '1234567890',
},
...overrides,
}
}
export function createUserSpeakEvent(sessionId: string, text: string, overrides = {}) {
return {
type: 'user_speak',
session: {
id: sessionId,
account_id: 'test-account',
phone_number: '1234567890',
},
text,
...overrides,
}
}
export function createBargeInEvent(sessionId: string, text: string, overrides = {}) {
return {
type: 'user_speak',
barged_in: true,
session: {
id: sessionId,
account_id: 'test-account',
phone_number: '1234567890',
},
text,
...overrides,
}
}
```
### Response Assertions
```typescript
// tests/utils/assertions.ts
export function assertSpeakAction(response: any, options: {
containsText?: string
sessionId?: string
} = {}) {
expect(response.type).toBe('speak')
expect(response.text).toBeTruthy()
expect(response.tts).toBeDefined()
if (options.containsText) {
expect(response.text.toLowerCase()).toContain(options.containsText.toLowerCase())
}
if (options.sessionId) {
expect(response.session_id).toBe(options.sessionId)
}
}
export function assertTransferAction(response: any, targetNumber?: string) {
expect(response.type).toBe('transfer')
expect(response.target).toBeTruthy()
if (targetNumber) {
expect(response.target).toBe(targetNumber)
}
}
```
## CI/CD Integration
Run event simulation tests in your pipeline:
```yaml
# .github/workflows/test.yml
name: Test Voice Assistant
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install dependencies
run: npm ci
- name: Run unit tests
run: npm test
- name: Start server
run: npm run dev &
env:
# Use test/mock API keys
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY_TEST }}
- name: Wait for server
run: npx wait-on http://localhost:3000/api/health
- name: Run integration tests
run: npm run test:integration
```
## Best Practices Summary
1. **Extract shared logic** - Same LLM service for chat and voice
2. **Test the pyramid** - Most tests at unit level, fewest at phone level
3. **Automate event simulation** - Integration tests catch regressions
4. **Use deterministic test data** - Fixed session IDs, predictable inputs
5. **Test conversation flows** - Not just individual events
6. **Create test utilities** - Event factories, response assertions
7. **Run in CI** - Catch issues before deployment
8. **Save phone tests for validation** - Manual checklist for final sign-off
## Related Documentation
* **[HTTP Webhooks](/api/http-webhooks)** - Webhook endpoint reference
* **[Event Types](/api/events)** - All event structures
* **[Action Types](/api/actions)** - Response format reference
---
---
url: /sipgate-ai-flow-api/api/events.md
---
# Event Types
Complete reference for all events sent by the AI Flow service.
## Overview
Events are JSON objects sent from the AI Flow service to your application. All events include a `type` field and session information.
## Base Event Structure
All events include session information:
```json
{
"session": {
"id": "550e8400-e29b-41d4-a716-446655440000",
"account_id": "account-123",
"phone_number": "1234567890",
"direction": "inbound",
"from_phone_number": "9876543210",
"to_phone_number": "1234567890"
}
}
```
The `direction` field indicates whether the call was initiated by the caller (`"inbound"`) or by the AI flow via the outbound call API (`"outbound"`). Use it in your `session_start` handler to tailor the greeting accordingly.
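For example, a `session_start` handler might branch on `direction` to choose a greeting. A minimal Flask sketch (the greeting texts are illustrative):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json
    if event['type'] == 'session_start':
        # direction is optional, so default to "inbound"
        direction = event['session'].get('direction', 'inbound')
        if direction == 'inbound':
            text = 'Welcome! How can I help you today?'
        else:
            text = 'Hello! This is an automated call from our service.'
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': text,
        })
    return '', 204
```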
## Event Types
| Event Type | Transport | Description | When Triggered |
|--------------------------|--------------------|-----------------------------|--------------------------------------------------------------------------------|
| `session_start` | HTTP + WebSocket | Call session begins | When a new call is initiated |
| `user_speech_started` | **WebSocket only** | Speech onset detected | When VAD detects the user starting to speak (before full transcript) |
| `user_speak` | HTTP + WebSocket | User speech detected | After speech-to-text completes (includes `barged_in` flag if user interrupted) |
| `dtmf_received` | HTTP + WebSocket | DTMF digit pressed | When the user presses a key on their phone keypad |
| `assistant_speak` | HTTP + WebSocket | Assistant started speaking | When TTS playback begins (may be omitted for some TTS models) |
| `assistant_speech_ended` | HTTP + WebSocket | Assistant finished speaking | After speech playback ends |
| `user_input_timeout` | HTTP + WebSocket | User input timeout reached | When no speech detected after timeout |
| `session_end` | HTTP + WebSocket | Call session ends | When the call terminates |
| `sms_failed` | HTTP + WebSocket | SMS delivery failed | After a `send_sms` action fails — includes `reason` so the agent can react |
## Quick Reference
* **[Session Start](/api/events/session-start)** - Call begins
* **[User Speech Started](/api/events/user-speech-started)** - Speech onset detected (WebSocket only)
* **[User Speak](/api/events/user-speak)** - User speaks (includes barge-in detection)
* **[DTMF Received](/api/events/dtmf-received)** - User pressed a phone key
* **[Assistant Speak](/api/events/assistant-speak)** - Assistant starts speaking
* **[Assistant Speech Ended](/api/events/assistant-speech-ended)** - Assistant finished speaking
* **[User Input Timeout](/api/events/user-input-timeout)** - Timeout reached waiting for user
* **[Session End](/api/events/session-end)** - Call ends
* **SMS Failed** — emitted when a `send_sms` action fails; see below.
## SMS Failed
Emitted to your webhook / WebSocket when a `send_sms` action fails. The call continues normally — handle this event to react conversationally (e.g. apologize, retry with a corrected number).
```json
{
"type": "sms_failed",
  "session": {
    "id": "550e8400-...",
    "account_id": "...",
    "phone_number": "...",
    "from_phone_number": "...",
    "to_phone_number": "..."
  },
"recipient": "4915112345678",
"reason": "sender_not_allowed",
"message": "SMSC returned faultCode 403"
}
```
| Field | Type | Description |
|-------------|--------|--------------------------------------------------------------------------------------------|
| `type` | string | Always `"sms_failed"` |
| `session` | object | Standard session info |
| `recipient` | string | Phone number that failed (the `phone_number` from your `send_sms` action) |
| `reason` | string | One of: `sender_not_allowed`, `insufficient_balance`, `no_sms_extension`, `smsc_unavailable`, `unknown` |
| `message` | string | Optional human-readable detail (safe to log, may contain technical error text) |
See **[Send SMS Action](/api/actions/send-sms)** for details on each failure reason.
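A handler for this event can map the failure reason to a spoken apology. A minimal sketch (`handle_sms_failed` and the wording are illustrative, not part of the API):

```python
# Map documented failure reasons to user-facing apologies
FAILURE_MESSAGES = {
    'sender_not_allowed': "I'm sorry, I wasn't able to send the text from this number.",
    'insufficient_balance': "I'm sorry, the text message could not be sent right now.",
}

def handle_sms_failed(event):
    reason = event.get('reason', 'unknown')
    text = FAILURE_MESSAGES.get(
        reason, "I'm sorry, the text message could not be delivered.")
    # Return a speak action so the call continues conversationally
    return {'type': 'speak', 'session_id': event['session']['id'], 'text': text}
```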
## Event Flow
```mermaid
graph LR
A[session_start] --> B[user_speak]
B --> C[assistant_speak]
C --> B
C --> D[user_speak with barged_in=true]
D --> B
B --> E[session_end]
C --> E
```
## Handling Events
### HTTP Webhook
```python
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
event_type = event['type']
if event_type == 'session_start':
# Handle session start
pass
elif event_type == 'user_speak':
# Handle user speech
pass
# ... handle other events
```
### WebSocket
```javascript
ws.on('message', (data) => {
const event = JSON.parse(data.toString());
switch (event.type) {
case 'session_start':
// Handle session start
break;
case 'user_speak':
// Handle user speech
break;
// ... handle other events
}
});
```
## Response Requirements
All events (except `session_end`) accept a single action, an array of actions (executed in sequence), or `204 No Content`:
* **session\_start**: Can return action(s) or `204 No Content`
* **user\_speak**: Can return action(s) or `204 No Content` (check `barged_in` flag for interruptions)
* **dtmf\_received**: Can return action(s) or `204 No Content`
* **assistant\_speak**: Can return action(s) or `204 No Content`
* **assistant\_speech\_ended**: Can return action(s) or `204 No Content`
* **user\_input\_timeout**: Can return action(s) or `204 No Content`
* **session\_end**: **No action allowed**, cleanup only
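The rules above can be sketched in a single Flask handler: return an action array (executed in sequence) for `session_start`, and plain `204 No Content` for `session_end`. The greeting texts are illustrative:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json
    session_id = event['session']['id']
    if event['type'] == 'session_start':
        # An array of actions is executed in sequence
        return jsonify([
            {'type': 'speak', 'session_id': session_id, 'text': 'Welcome!'},
            {'type': 'speak', 'session_id': session_id,
             'text': 'How can I help you today?'},
        ])
    # session_end (and any event we ignore): no action, just 204
    return '', 204
```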
## Next Steps
* **[Session Start Event](/api/events/session-start)** - Detailed reference
* **[User Speak Event](/api/events/user-speak)** - Detailed reference
* **[Action Types](/api/actions)** - How to respond to events
---
---
url: /sipgate-ai-flow-api/api/events/session-start.md
---
# Session Start Event
Triggered when a new call session begins.
## Event Structure
```json
{
"type": "session_start",
"session": {
"id": "550e8400-e29b-41d4-a716-446655440000",
"account_id": "account-123",
"phone_number": "1234567890",
"direction": "inbound",
"from_phone_number": "9876543210",
"to_phone_number": "1234567890"
}
}
```
## Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Always `"session_start"` |
| `session.id` | string (UUID) | Yes | Unique session identifier |
| `session.account_id` | string | Yes | Account identifier |
| `session.phone_number` | string | Yes | Phone number for this flow session |
| `session.direction` | string | No | `"inbound"` or `"outbound"` |
| `session.from_phone_number` | string | Yes | Phone number of the caller |
| `session.to_phone_number` | string | Yes | Phone number of the callee |
## Response
You can return a single action, an array of actions (executed in sequence), or `204 No Content`. Common responses:
### Greet the User
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Welcome! How can I help you today?"
}
```
### Play Welcome Audio
```json
{
"type": "audio",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"audio": "base64-encoded-wav-data"
}
```
### No Response
```http
HTTP/1.1 204 No Content
```
## Examples
### Python (Flask)
```python
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
if event['type'] == 'session_start':
session_id = event['session']['id']
return jsonify({
'type': 'speak',
'session_id': session_id,
'text': 'Welcome! How can I help you?'
})
return '', 204
```
### Node.js (Express)
```javascript
app.post('/webhook', (req, res) => {
const event = req.body;
if (event.type === 'session_start') {
return res.json({
type: 'speak',
session_id: event.session.id,
text: 'Welcome! How can I help you?'
});
}
res.status(204).send();
});
```
### Go
```go
func webhook(w http.ResponseWriter, r *http.Request) {
var event map[string]interface{}
json.NewDecoder(r.Body).Decode(&event)
if event["type"] == "session_start" {
session := event["session"].(map[string]interface{})
action := map[string]interface{}{
"type": "speak",
"session_id": session["id"],
"text": "Welcome! How can I help you?",
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(action)
return
}
w.WriteHeader(http.StatusNoContent)
}
```
## Use Cases
* **Initialize session state** - Set up conversation context
* **Greet the user** - Welcome message
* **Log call information** - Track incoming calls
* **Route based on number** - Different greetings for different numbers
## Best Practices
1. **Respond quickly** - Keep greeting under 2 seconds
2. **Initialize state** - Set up any session tracking
3. **Log session info** - Record call metadata
4. **Handle errors** - Always return a valid response
## Next Steps
* **[User Speak Event](/api/events/user-speak)** - Handle user input
* **[Action Types](/api/actions)** - All available actions
* **[Event Flow](/api/event-flow)** - Understand the complete flow
---
---
url: /sipgate-ai-flow-api/api/events/user-speech-started.md
---
# User Speech Started Event
Triggered when the user's speech is first detected — before the full transcript is available. Uses Voice Activity Detection (VAD) and typically fires 20–120 ms after the user starts speaking.
::: info WebSocket only
This event is only delivered via WebSocket connections. It is not sent to HTTP webhook endpoints.
:::
## Event Structure
```json
{
"type": "user_speech_started",
"session": {
"id": "550e8400-e29b-41d4-a716-446655440000",
"account_id": "account-123",
"phone_number": "1234567890",
"direction": "inbound",
"from_phone_number": "9876543210",
"to_phone_number": "1234567890"
}
}
```
## Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Always `"user_speech_started"` |
| `session.id` | string (UUID) | Yes | Session identifier |
| `session.account_id` | string | Yes | Account identifier |
| `session.phone_number` | string | Yes | Phone number for this flow session |
## Behaviour
* Fires **at most once per speech turn** — subsequent partial transcripts within the same turn are suppressed
* Resets automatically after the corresponding `user_speak` event is received, so it fires again on the next speech turn
* No response or actions are expected; the service ignores any payload returned for this event
## Use Cases
* **Show "user is speaking" indicators** in real-time dashboards or call monitoring UIs
* **Start latency optimisations early** — e.g. pre-warm LLM context or fetch data before the full transcript arrives
* **Interrupt ongoing workflows** — cancel queued background processing when the user begins to speak
## Example (TypeScript SDK)
```typescript
import { AiFlowAssistant } from '@sipgate/ai-flow-sdk';
import WebSocket from 'ws';
const assistant = AiFlowAssistant.create({
onUserSpeechStarted: async (event) => {
console.log('User started speaking, session:', event.session.id);
// No return value needed
},
onUserSpeak: async (event) => {
return `You said: ${event.text}`;
},
});
const wss = new WebSocket.Server({ port: 3000 });
wss.on('connection', (ws) => {
ws.on('message', assistant.ws(ws));
});
```
## Example (Raw WebSocket)
```javascript
ws.on('message', (data) => {
const event = JSON.parse(data.toString());
if (event.type === 'user_speech_started') {
console.log('User started speaking in session', event.session.id);
// No response needed — the service ignores any reply
}
if (event.type === 'user_speak') {
ws.send(JSON.stringify({
type: 'speak',
session_id: event.session.id,
text: `You said: ${event.text}`,
}));
}
});
```
## Next Steps
* **[User Speak Event](/api/events/user-speak)** - Full transcript after STT completes
* **[Barge-In Guide](/api/barge-in)** - Interrupting assistant speech
* **[WebSocket Integration](/api/websocket)** - How to connect via WebSocket
---
---
url: /sipgate-ai-flow-api/api/events/user-speak.md
---
# User Speak Event
Triggered when the user speaks and speech-to-text completes.
## Event Structure
```json
{
"type": "user_speak",
"text": "Hello, I need help",
"barged_in": false,
"session": {
"id": "550e8400-e29b-41d4-a716-446655440000",
"account_id": "account-123",
"phone_number": "1234567890"
}
}
```
### Barge-In Detection
When a user interrupts the assistant mid-speech, the event includes `barged_in: true`:
```json
{
"type": "user_speak",
"text": "Wait",
"barged_in": true,
"session": {
"id": "550e8400-e29b-41d4-a716-446655440000",
"account_id": "account-123",
"phone_number": "1234567890"
}
}
```
## Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Always `"user_speak"` |
| `text` | string | Yes | Recognized speech text |
| `barged_in` | boolean | No | `true` if user interrupted assistant, `false` or omitted otherwise |
| `session.id` | string (UUID) | Yes | Session identifier |
| `session.account_id` | string | Yes | Account identifier |
| `session.phone_number` | string | Yes | Phone number for this flow session |
## End-of-Utterance Detection
The service does not send a `user_speak` event after every individual STT segment. Instead, it buffers recognized speech and uses an on-device model to detect when the user has actually finished speaking.
### How it works
After each STT recognition result, the service checks whether the accumulated text is a complete utterance:
| Condition | Behaviour |
|-----------|-----------|
| Utterance is complete (e.g. full sentence, question) | `user_speak` is emitted immediately with the full accumulated text |
| Utterance is incomplete (e.g. dangling fragment like *"Ich möchte"*, "I would like") | Service waits up to **2 seconds** for the user to continue speaking |
| User continues speaking within 2 seconds | The 2-second timer resets; both segments are merged into one event |
| 2 seconds pass with no further speech | `user_speak` is emitted with all buffered text |
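The buffering behaviour above can be sketched as a small generator. This is an illustrative model only; `segments` (pairs of text and arrival time) and `is_complete` (standing in for the service's completeness model) are hypothetical names, not part of the API:

```python
def buffer_utterance(segments, is_complete, timeout=2.0):
    """Sketch of the end-of-utterance buffering described above."""
    buffered = []
    last_time = None
    for text, arrived in segments:
        if last_time is not None and arrived - last_time > timeout:
            # Timeout passed with no further speech: emit buffered text
            yield ' '.join(buffered)
            buffered = []
        buffered.append(text)
        last_time = arrived
        if is_complete(' '.join(buffered)):
            # Complete utterance: emit immediately, no waiting
            yield ' '.join(buffered)
            buffered = []
            last_time = None
    if buffered:
        yield ' '.join(buffered)
```

Note how two segments arriving within the timeout merge into one emitted utterance, mirroring how one `user_speak` event can carry several STT segments.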
### Practical implications
* The `text` field may contain **multiple speech segments merged** into a single string when the user speaks in bursts.
* Your webhook receives **one** `user_speak` per coherent utterance, not one per STT segment.
* Response latency is lowest for complete sentences — the model triggers the event immediately without waiting.
### Language sensitivity
The end-of-utterance model uses **language-specific thresholds** to decide what counts as a complete utterance. The active language is determined by the `languages` field set via the [`configure_transcription`](/api/actions/configure-transcription) action. If no language is configured, a default threshold is used.
Setting the correct language improves detection accuracy and reduces unnecessary delays.
## Response
You can return a single action, an array of actions (executed in sequence), or `204 No Content`. Common responses:
### Speak Back
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "I understand. How can I help you?"
}
```
### Transfer Call
```json
{
"type": "transfer",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"target_phone_number": "1234567890",
"caller_id_name": "Support",
"caller_id_number": "1234567890"
}
```
### Hangup
```json
{
"type": "hangup",
"session_id": "550e8400-e29b-41d4-a716-446655440000"
}
```
## Examples
### Python (Flask)
```python
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
if event['type'] == 'user_speak':
session_id = event['session']['id']
user_text = event['text'].lower()
if 'goodbye' in user_text or 'bye' in user_text:
return jsonify({
'type': 'hangup',
'session_id': session_id
})
if 'transfer' in user_text:
return jsonify({
'type': 'transfer',
'session_id': session_id,
'target_phone_number': '1234567890',
'caller_id_name': 'Support',
'caller_id_number': '1234567890'
})
return jsonify({
'type': 'speak',
'session_id': session_id,
'text': f"You said: {event['text']}"
})
return '', 204
```
### Node.js (Express)
```javascript
app.post('/webhook', (req, res) => {
const event = req.body;
if (event.type === 'user_speak') {
const userText = event.text.toLowerCase();
if (userText.includes('goodbye') || userText.includes('bye')) {
return res.json({
type: 'hangup',
session_id: event.session.id
});
}
if (userText.includes('transfer')) {
return res.json({
type: 'transfer',
session_id: event.session.id,
target_phone_number: '1234567890',
caller_id_name: 'Support',
caller_id_number: '1234567890'
});
}
return res.json({
type: 'speak',
session_id: event.session.id,
text: `You said: ${event.text}`
});
}
res.status(204).send();
});
```
### Go
```go
func webhook(w http.ResponseWriter, r *http.Request) {
var event map[string]interface{}
json.NewDecoder(r.Body).Decode(&event)
if event["type"] == "user_speak" {
session := event["session"].(map[string]interface{})
text := strings.ToLower(event["text"].(string))
if strings.Contains(text, "goodbye") || strings.Contains(text, "bye") {
action := map[string]interface{}{
"type": "hangup",
"session_id": session["id"],
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(action)
return
}
action := map[string]interface{}{
"type": "speak",
"session_id": session["id"],
"text": "You said: " + event["text"].(string),
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(action)
return
}
w.WriteHeader(http.StatusNoContent)
}
```
## Handling Barge-In
You can check the `barged_in` flag to provide special handling for interruptions:
```python
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
    if event['type'] == 'user_speak':
        if event.get('barged_in'):
            # User interrupted - acknowledge quickly
            return jsonify({
                'type': 'speak',
                'session_id': event['session']['id'],
                'text': "Yes, I'm listening."
            })
        # Normal speech processing
        return process_user_input(event['text'])
    return '', 204
```
See the **[Barge-In Best Practices Guide](/api/guides/barge-in-best-practices)** for detailed strategies.
## Use Cases
* **Process user input** - Understand what the user wants
* **Detect interruptions** - Handle barge-in with `barged_in` flag
* **Route conversations** - Direct to appropriate handler
* **Collect information** - Gather details from user
* **Transfer calls** - Route to human agents
* **End calls** - Handle goodbye messages
## Best Practices
1. **Process quickly** - Respond within 1-2 seconds
2. **Handle barge-in gracefully** - Check `barged_in` flag for interruptions
3. **Handle errors** - Always return a valid response
4. **Log interactions** - Track conversation for analytics
5. **Validate input** - Check for expected patterns
## Next Steps
* **[Assistant Speak Event](/api/events/assistant-speak)** - Track when assistant speaks
* **[Action Types](/api/actions)** - All available actions
* **[Event Flow](/api/event-flow)** - Understand the complete flow
---
---
url: /sipgate-ai-flow-api/api/events/assistant-speak.md
---
# Assistant Speak Event
Triggered when the assistant starts speaking. The event may be omitted for some text-to-speech models.
## Event Structure
```json
{
"type": "assistant_speak",
"text": "Hello! How can I help you?",
"ssml": "Hello!",
"duration_ms": 2000,
"speech_started_at": 1234567890,
"session": {
"id": "550e8400-e29b-41d4-a716-446655440000",
"account_id": "account-123",
"phone_number": "1234567890"
}
}
```
## Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Always `"assistant_speak"` |
| `text` | string | No | Text that was spoken |
| `ssml` | string | No | SSML that was used (if applicable) |
| `duration_ms` | number | Yes | Duration of speech in milliseconds |
| `speech_started_at` | number | Yes | Unix timestamp (ms) when speech started |
| `session.id` | string (UUID) | Yes | Session identifier |
| `session.account_id` | string | Yes | Account identifier |
| `session.phone_number` | string | Yes | Phone number for this flow session |
## Response
You can return a single action, an array of actions (executed in sequence), or `204 No Content`. Common uses:
* **Track metrics** - Log conversation analytics
* **Chain actions** - Trigger follow-up actions
* **No response** - Just track the event
## Examples
### Track Metrics
```python
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
if event['type'] == 'assistant_speak':
# Track metrics
track_metrics({
'session_id': event['session']['id'],
'duration_ms': event['duration_ms'],
'text': event.get('text', '')
})
return '', 204
```
### Chain Actions
```python
# Store what to do next
session_state = {}
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
session_id = event['session']['id']
if event['type'] == 'user_speak':
# Set next action
session_state[session_id] = 'play_audio'
return jsonify({
'type': 'speak',
'session_id': session_id,
'text': 'Please listen to this message.'
})
if event['type'] == 'assistant_speak':
# Execute next action
if session_state.get(session_id) == 'play_audio':
del session_state[session_id]
return jsonify({
'type': 'audio',
'session_id': session_id,
'audio': 'base64-audio-data'
})
return '', 204
```
## Use Cases
* **Analytics** - Track conversation metrics
* **Action chaining** - Trigger follow-up actions
* **Logging** - Record what was said
* **Timing** - Measure response times
## Best Practices
1. **Don't block** - Process quickly
2. **Track metrics** - Use for analytics
3. **Chain carefully** - Avoid infinite loops
4. **Log interactions** - For debugging
## Next Steps
* **[User Speak Event](/api/events/user-speak)** - Handle user input
* **[Action Types](/api/actions)** - All available actions
* **[Event Flow](/api/event-flow)** - Understand the complete flow
---
---
url: /sipgate-ai-flow-api/api/events/assistant-speech-ended.md
---
# Assistant Speech Ended Event
Triggered after the assistant finishes speaking.
## Event Structure
```json
{
"type": "assistant_speech_ended",
"session": {
"id": "550e8400-e29b-41d4-a716-446655440000",
"account_id": "account-123",
"phone_number": "1234567890"
}
}
```
## Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Always `"assistant_speech_ended"` |
| `session.id` | string (UUID) | Yes | Session identifier |
| `session.account_id` | string | Yes | Account identifier |
| `session.phone_number` | string | Yes | Phone number for this flow session |
## Response
You can return a single action, an array of actions (executed in sequence), or `204 No Content`. Common uses:
* **Trigger follow-up actions** - Continue the conversation flow
* **Track completion** - Log that speech finished
* **No response** - Just track the event
## Examples
### Trigger Follow-Up Action
```python
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
session_id = event['session']['id']
if event['type'] == 'assistant_speech_ended':
# Trigger next action in conversation flow
return jsonify({
'type': 'speak',
'session_id': session_id,
'text': 'Is there anything else I can help you with?'
})
return '', 204
```
### Track Completion
```python
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
if event['type'] == 'assistant_speech_ended':
# Log that speech completed
log_speech_completed(event['session']['id'])
return '', 204
```
### Node.js
```javascript
app.post('/webhook', (req, res) => {
const event = req.body;
if (event.type === 'assistant_speech_ended') {
// Trigger next action
return res.json({
type: 'speak',
session_id: event.session.id,
text: 'Is there anything else I can help you with?'
});
}
res.status(204).send();
});
```
### Go
```go
func webhook(w http.ResponseWriter, r *http.Request) {
var event map[string]interface{}
json.NewDecoder(r.Body).Decode(&event)
if event["type"] == "assistant_speech_ended" {
session := event["session"].(map[string]interface{})
action := map[string]interface{}{
"type": "speak",
"session_id": session["id"],
"text": "Is there anything else I can help you with?",
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(action)
return
}
w.WriteHeader(http.StatusNoContent)
}
```
## Use Cases
* **Continue conversation** - Trigger follow-up questions or actions
* **Track completion** - Log that speech playback finished
* **Chain actions** - Execute next step in conversation flow
* **Analytics** - Track when assistant finishes speaking
## Difference from assistant\_speak
* **assistant\_speak** - Triggered when assistant **starts** speaking (includes duration, text, etc.)
* **assistant\_speech\_ended** - Triggered when assistant **finishes** speaking (simpler, just session info)
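Because `assistant_speak` carries `speech_started_at` and `duration_ms`, the two events can be correlated to measure playback drift, e.g. whether speech ended early due to barge-in. A minimal sketch (`speech_log` is an illustrative in-memory store, not part of the API):

```python
import time

# Expected end times keyed by session id
speech_log: dict[str, float] = {}

def on_assistant_speak(event):
    # Expected end = start timestamp plus reported duration (both in ms)
    speech_log[event['session']['id']] = (
        event['speech_started_at'] + event['duration_ms']) / 1000.0

def on_assistant_speech_ended(event, now=None):
    now = now if now is not None else time.time()
    expected_end = speech_log.pop(event['session']['id'], None)
    if expected_end is not None:
        # Negative drift means playback ended earlier than expected
        return now - expected_end
    return None
```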
## Best Practices
1. **Use for follow-ups** - Great for continuing conversation flow
2. **Track timing** - Log when speech completes
3. **Chain actions** - Trigger next action in sequence
4. **Don't block** - Process quickly
## Next Steps
* **[Assistant Speak Event](/api/events/assistant-speak)** - When assistant starts speaking
* **[User Speak Event](/api/events/user-speak)** - Handle user input
* **[Action Types](/api/actions)** - All available actions
* **[Event Flow](/api/event-flow)** - Understand the complete flow
---
---
url: /sipgate-ai-flow-api/api/events/dtmf-received.md
---
# DTMF Received Event
Triggered when the user presses a key on their phone keypad during a call.
## Event Structure
```json
{
"type": "dtmf_received",
"digit": "1",
"session": {
"id": "550e8400-e29b-41d4-a716-446655440000",
"account_id": "account-123",
"phone_number": "1234567890",
"direction": "inbound",
"from_phone_number": "9876543210",
"to_phone_number": "1234567890"
}
}
```
## Fields
| Field | Type | Description |
|---------|--------|-----------------------------------------------------|
| `type` | string | Always `"dtmf_received"` |
| `digit` | string | The key pressed: `0`–`9`, `*`, or `#` |
| `session` | object | Session information (see [Base Event Structure](/api/events)) |
## Example
### IVR Menu
```python
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
if event['type'] == 'session_start':
return jsonify({
'type': 'speak',
'session_id': event['session']['id'],
'text': 'Press 1 for sales, press 2 for support.'
})
if event['type'] == 'dtmf_received':
digit = event['digit']
session_id = event['session']['id']
        if digit == '1':
            return jsonify({
                'type': 'transfer',
                'session_id': session_id,
                'target_phone_number': '49211100200',
                'caller_id_name': 'Sales Department',
                'caller_id_number': '49211100200'
            })
        elif digit == '2':
            return jsonify({
                'type': 'transfer',
                'session_id': session_id,
                'target_phone_number': '49211100201',
                'caller_id_name': 'Support Department',
                'caller_id_number': '49211100201'
            })
        else:
            return jsonify({
                'type': 'speak',
                'session_id': session_id,
                'text': 'Invalid selection. Press 1 for sales, press 2 for support.'
            })
return '', 204
```
### Node.js
```javascript
app.post('/webhook', (req, res) => {
const event = req.body;
if (event.type === 'dtmf_received') {
const { digit } = event;
console.log(`User pressed: ${digit}`);
if (digit === '#') {
return res.json({
type: 'hangup',
session_id: event.session.id
});
}
}
res.status(204).send();
});
```
## TypeScript SDK
```typescript
const assistant = AiFlowAssistant.create({
onDtmfReceived: async (event) => {
console.log(`User pressed: ${event.digit}`);
    if (event.digit === '1') {
      return {
        type: 'transfer',
        session_id: event.session.id,
        target_phone_number: '49211100200',
        caller_id_name: 'Sales Department',
        caller_id_number: '49211100200'
      };
    }
return {
type: 'speak',
session_id: event.session.id,
text: `You pressed ${event.digit}.`
};
},
});
```
## Use Cases
* **IVR menus** — route calls based on key presses
* **PIN entry** — collect numeric input without speech recognition
* **Confirmation flows** — press 1 to confirm, 2 to cancel
* **Accessibility** — provide keypad alternatives to voice commands
## Notes
* All standard DTMF tones are supported: `0`–`9`, `*`, `#`
* Each key press triggers a separate `dtmf_received` event
* DTMF events can occur at any point during the call, including while the assistant is speaking
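Because each key press arrives as its own `dtmf_received` event (see the notes above), multi-digit input such as a PIN has to be accumulated per session. A minimal sketch — `PIN_BUFFERS` and `on_dtmf` are hypothetical helpers, not part of the API:

```python
# Hypothetical in-memory buffer of digits collected so far, keyed by session id.
PIN_BUFFERS = {}

def on_dtmf(event, pin_length=4):
    """Accumulate digits from dtmf_received events.

    Returns the collected PIN once pin_length digits arrived (or on '#'),
    and None while still collecting."""
    session_id = event['session']['id']
    digit = event['digit']
    if digit == '#':  # caller signals they are done early
        return PIN_BUFFERS.pop(session_id, '')
    PIN_BUFFERS[session_id] = PIN_BUFFERS.get(session_id, '') + digit
    if len(PIN_BUFFERS[session_id]) >= pin_length:
        return PIN_BUFFERS.pop(session_id)
    return None  # still collecting; typically respond with 204 No Content
```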
## Next Steps
* **[Action Types](/api/actions)** - How to respond to events
* **[User Speak Event](/api/events/user-speak)** - Voice input alternative
---
---
url: /sipgate-ai-flow-api/api/events/session-end.md
---
# Session End Event
Triggered when the call session ends.
## Event Structure
```json
{
"type": "session_end",
"session": {
"id": "550e8400-e29b-41d4-a716-446655440000",
"account_id": "account-123",
"phone_number": "1234567890"
}
}
```
## Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Always `"session_end"` |
| `session.id` | string (UUID) | Yes | Session identifier |
| `session.account_id` | string | Yes | Account identifier |
| `session.phone_number` | string | Yes | Phone number for this flow session |
## Response
**No action is allowed** for `session_end` events. Always return `204 No Content`.
```http
HTTP/1.1 204 No Content
```
## Examples
### Python
```python
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
if event['type'] == 'session_end':
# Cleanup session state
cleanup_session(event['session']['id'])
return '', 204
```
### Node.js
```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;
  if (event.type === 'session_end') {
    // Cleanup session state
    cleanupSession(event.session.id);
  }
  res.status(204).send();
});
```
### Go
```go
func webhook(w http.ResponseWriter, r *http.Request) {
var event map[string]interface{}
json.NewDecoder(r.Body).Decode(&event)
if event["type"] == "session_end" {
session := event["session"].(map[string]interface{})
cleanupSession(session["id"].(string))
w.WriteHeader(http.StatusNoContent)
return
}
}
```
## Use Cases
* **Cleanup state** - Remove session data
* **Save logs** - Store conversation history
* **Send analytics** - Track session metrics
* **Close connections** - Clean up resources
## Best Practices
1. **Always cleanup** - Remove session state
2. **Log the session** - Save for analytics
3. **Don't return actions** - No actions are processed
4. **Handle errors** - Don't fail silently
## Next Steps
* **[Session Start Event](/api/events/session-start)** - When calls begin
* **[Event Flow](/api/event-flow)** - Understand the complete flow
* **[Action Types](/api/actions)** - Actions you can send
---
---
url: /sipgate-ai-flow-api/api/actions.md
---
# Action Types
Complete reference for all actions you can send to the AI Flow service.
## Overview
Actions are JSON objects you send back to the AI Flow service in response to events. All actions require a `session_id` and `type` field.
## Base Action Structure
```json
{
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"type": "speak"
}
```
## Action Summary
| Action Type | Description | Primary Use Case |
| -------------- | --------------------------- | --------------------------------------- |
| `speak` | Speak text or SSML | Respond to user with synthesized speech |
| `audio` | Play pre-recorded audio | Play hold music, pre-recorded messages |
| `mix_audio` | Loop a background sound mixed into speech | Add ambient noise (café, office, train station) under the agent |
| `hangup` | End the call | Terminate conversation |
| `transfer` | Transfer to another number | Route to human agent or department |
| `barge_in` | Manually interrupt playback | Stop current audio immediately |
| `configure_transcription` | Change STT language(s) mid-call | Switch recognition language without hanging up |
| `configure_voice_to_voice` | Switch the session into end-to-end voice-to-voice mode | Hand the conversation to a speech-to-speech model that owns audio I/O |
| `send_sms` | Send an SMS from the account | Deliver confirmation codes, summaries, links |
## Quick Reference
* **[Speak Action](/api/actions/speak)** - Text-to-speech
* **[Audio Action](/api/actions/audio)** - Play audio file
* **[Mix Audio Action](/api/actions/mix-audio)** - Loop a background sound mixed into outbound speech
* **[Hangup Action](/api/actions/hangup)** - End call
* **[Transfer Action](/api/actions/transfer)** - Transfer call
* **[Barge-In Action](/api/actions/barge-in)** - Manually interrupt current playback
* **[Configure Transcription Action](/api/actions/configure-transcription)** - Change STT language mid-call
* **[Configure Voice-to-Voice Action](/api/actions/configure-voice-to-voice)** - End-to-end speech-to-speech mode (preview)
* **[Send SMS Action](/api/actions/send-sms)** - Send an SMS from your account
## Response Format
### HTTP Webhook
Return a single action or an array of actions as JSON with `200 OK`:
```http
HTTP/1.1 200 OK
Content-Type: application/json

{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Hello!"
}
```
To execute multiple actions in sequence, return an array:
```http
HTTP/1.1 200 OK
Content-Type: application/json

[
{
"type": "barge_in",
"session_id": "550e8400-e29b-41d4-a716-446655440000"
},
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Sorry, let me correct that."
}
]
```
Or return `204 No Content` if no action is needed:
```http
HTTP/1.1 204 No Content
```
### WebSocket
Send a single action or an array of actions as JSON strings:
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Hello!"
}
```
```json
[
{ "type": "barge_in", "session_id": "..." },
{ "type": "speak", "session_id": "...", "text": "Sorry, let me correct that." }
]
```
## Action Flow
```mermaid
graph TB
A[Receive Event] --> B{Event Type}
B -->|user_speak| C[Process Input]
B -->|session_start| D[Initialize]
C --> E{Decision}
E -->|speak| F[Speak Action]
E -->|transfer| G[Transfer Action]
E -->|hangup| H[Hangup Action]
D --> F
F --> I[Service Executes]
G --> I
H --> I
```
## Common Patterns
### Simple Response
```json
{
"type": "speak",
"session_id": "session-123",
"text": "Hello! How can I help you?"
}
```
### Conditional Response
```python
if "goodbye" in event['text'].lower():
return {
"type": "hangup",
"session_id": event['session']['id']
}
else:
return {
"type": "speak",
"session_id": event['session']['id'],
"text": "I understand."
}
```
### Multiple Actions
You can return an array of actions to execute them in sequence:
```python
if event['type'] == 'user_speak':
return [
{
"type": "barge_in",
"session_id": event['session']['id']
},
{
"type": "speak",
"session_id": event['session']['id'],
"text": "Sorry, let me correct that."
}
]
```
Actions in the array are executed one after another in order.
Alternatively, you can chain actions across events using the `assistant_speak` event:
```python
# First response
if event['type'] == 'user_speak':
return {
"type": "speak",
"session_id": event['session']['id'],
"text": "Please listen to this message."
}
# Follow-up after assistant speaks
if event['type'] == 'assistant_speak':
return {
"type": "audio",
"session_id": event['session']['id'],
"audio": "base64-audio-data"
}
```
## Next Steps
* **[Speak Action](/api/actions/speak)** - Detailed reference
* **[Event Types](/api/events)** - What triggers actions
* **[Event Flow](/api/event-flow)** - Understand the complete flow
---
---
url: /sipgate-ai-flow-api/api/actions/speak.md
---
# Speak Action
Speak text or SSML to the user using text-to-speech.
## Action Structure
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Hello! How can I help you?",
"tts": {
"provider": "azure",
"language": "en-US",
"voice": "en-US-JennyNeural"
},
"barge_in": {
"strategy": "minimum_characters",
"minimum_characters": 3
}
}
```
## Fields
| Field | Type | Required | Description |
|------------------------------|---------------|----------|----------------------------------------------------------------------------------------------------------------------------------------------|
| `type` | string | Yes | Always `"speak"` |
| `session_id` | string (UUID) | Yes | Session identifier from event |
| `text` | string | No\* | Plain text to speak |
| `ssml` | string | No\* | SSML markup for advanced control |
| `tts` | object | No | TTS provider configuration |
| `barge_in` | object | No | Barge-in behavior configuration |
| `user_input_timeout_seconds` | number | No | Timeout in seconds to wait for user input after speech ends. If no speech is detected within this time, a `user_input_timeout` event is sent |
| `vad` | object | No | Voice-activity detection tuning for the caller's reply. See [VAD Configuration](/api/vad) |
\* Either `text` OR `ssml` is required (not both)
## Simple Text
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Hello! How can I help you?"
}
```
## SSML (Advanced)
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"ssml": "<speak>Please listen carefully.<break time=\"500ms\"/>Your account balance is $42.50</speak>"
}
```
## TTS Provider Configuration
### Azure
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Hello in a different voice",
"tts": {
"provider": "azure",
"language": "en-US",
"voice": "en-US-JennyNeural"
}
}
```
### ElevenLabs
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Hello from ElevenLabs",
"tts": {
"provider": "eleven_labs",
"voice": "zrHiDhphv9ZnVXBqCLjz"
}
}
```
::: tip Voice IDs
The `voice` field accepts the ElevenLabs voice ID (e.g., `"zrHiDhphv9ZnVXBqCLjz"` for "Mimi"). If omitted, the first available voice will be used. See the [TTS Providers](/api/tts-providers) documentation for a list of available voices.
:::
**Minimal Configuration (uses default voice):**
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Hello from ElevenLabs",
"tts": {
"provider": "eleven_labs"
}
}
```
## Barge-In Configuration
Control how users can interrupt:
### Immediate Response (Most Responsive) ⚡
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "I can help you with billing, support, or sales. What would you like?",
"barge_in": {
"strategy": "immediate",
"allow_after_ms": 500
}
}
```
**Result:** Assistant stops instantly when user starts speaking (20–100 ms latency).
### Character-Based Interruption
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Your account number is 1234567890. Please write this down.",
"barge_in": {
"strategy": "minimum_characters",
"minimum_characters": 10,
"allow_after_ms": 2000
}
}
```
**Result:** Assistant stops after user speaks 10+ characters.
See [Barge-In Configuration](/api/barge-in) for all strategies and details.
## VAD (Voice Activity Detection) Tuning
Optional advanced setting that lets the caller pause longer (or shorter) before
their turn is considered finished. When omitted, the system default applies.
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Please tell me your address.",
"vad": {
"end_of_turn_silence_ms": 1500
}
}
```
| Field | Type | Description |
|--------------------------|--------|----------------------------------------------------------------------------------------------------------|
| `end_of_turn_silence_ms` | number | Milliseconds of silence after the caller stops speaking before their turn ends. Recommended range 150–2000. |
Out-of-range or invalid values are silently ignored — the speak action still
runs as if `vad` were not set. See [VAD Configuration](/api/vad) for details.
## User Input Timeout
Set a timeout to wait for user input after the assistant finishes speaking. If the user doesn't speak within the specified time, a `user_input_timeout` event is sent to your application:
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "What is your account number?",
"user_input_timeout_seconds": 5
}
```
**Behavior:**
* Timer starts when the assistant finishes speaking (`assistant_speech_ended` event)
* Timer is cleared when the user starts speaking (any STT event)
* If timeout is reached, a `user_input_timeout` event is sent
* Your application can respond with any action (e.g., repeat question, hangup)
**Example with timeout handling:**
```javascript
app.post('/webhook', (req, res) => {
const event = req.body;
if (event.type === 'session_start') {
return res.json({
type: 'speak',
session_id: event.session.id,
text: 'What is your account number?',
user_input_timeout_seconds: 5
});
}
if (event.type === 'user_input_timeout') {
return res.json({
type: 'speak',
session_id: event.session.id,
text: 'I didn\'t hear anything. Let me try again. What is your account number?',
user_input_timeout_seconds: 5
});
}
if (event.type === 'user_speak') {
return res.json({
type: 'speak',
session_id: event.session.id,
text: `Your account number is ${event.text}`
});
  }
  res.status(204).send();
});
```
## Examples
### Python
```python
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
if event['type'] == 'user_speak':
return jsonify({
'type': 'speak',
'session_id': event['session']['id'],
'text': f"You said: {event['text']}"
        })
    return '', 204
```
### Node.js
```javascript
app.post('/webhook', (req, res) => {
const event = req.body;
if (event.type === 'user_speak') {
return res.json({
type: 'speak',
session_id: event.session.id,
text: `You said: ${event.text}`
});
  }
  res.status(204).send();
});
```
### Go
```go
action := map[string]interface{}{
"type": "speak",
"session_id": session["id"],
"text": "Hello! How can I help you?",
}
json.NewEncoder(w).Encode(action)
```
## Use Cases
* **Respond to user** - Answer questions
* **Provide information** - Share details
* **Guide conversation** - Direct the flow
* **Confirm actions** - Acknowledge user input
## Best Practices
1. **Keep it concise** - Short responses work better
2. **Use SSML sparingly** - Only when needed for emphasis
3. **Configure barge-in** - Allow natural interruptions
4. **Choose appropriate voice** - Match language and tone
## Next Steps
* **[TTS Providers](/api/tts-providers)** - Configure voices
* **[Barge-In Configuration](/api/barge-in)** - Control interruptions
* **[Other Actions](/api/actions)** - Complete action reference
---
---
url: /sipgate-ai-flow-api/api/actions/audio.md
---
# Audio Action
Play pre-recorded audio to the user.
## Action Structure
```json
{
"type": "audio",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=",
"barge_in": {
"strategy": "minimum_characters",
"minimum_characters": 3
}
}
```
## Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Always `"audio"` |
| `session_id` | string (UUID) | Yes | Session identifier from event |
| `audio` | string | Yes | Base64 encoded WAV audio data |
| `barge_in` | object | No | Barge-in behavior configuration |
## Audio Format Requirements
The audio must be in the following format:
* **Format**: WAV
* **Sample Rate**: 16kHz
* **Channels**: Mono (single channel)
* **Bit Depth**: 16-bit PCM
* **Encoding**: Base64
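A quick way to check a file against these requirements before encoding it is Python's standard-library `wave` module. `validate_wav` is a hypothetical helper name for illustration:

```python
import wave

def validate_wav(path):
    """Check a WAV file against the AI Flow audio requirements:
    16 kHz sample rate, mono, 16-bit PCM. Returns a list of problems
    (empty means the file is compliant)."""
    with wave.open(path, 'rb') as w:
        problems = []
        if w.getframerate() != 16000:
            problems.append(f"sample rate is {w.getframerate()} Hz, expected 16000")
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels, expected mono")
        if w.getsampwidth() != 2:
            problems.append(f"{w.getsampwidth() * 8}-bit samples, expected 16-bit")
        return problems
```

Run this before base64-encoding a file, and re-encode with FFmpeg (see below) if any problems are reported.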
## Simple Example
```json
{
"type": "audio",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA="
}
```
## With Barge-In Configuration
```json
{
"type": "audio",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=",
"barge_in": {
"strategy": "minimum_characters",
"minimum_characters": 3,
"allow_after_ms": 1000
}
}
```
## Examples
### Python
```python
import base64
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
if event['type'] == 'user_speak':
# Read audio file and encode to base64
with open('hold-music.wav', 'rb') as audio_file:
audio_data = audio_file.read()
base64_audio = base64.b64encode(audio_data).decode('utf-8')
return jsonify({
'type': 'audio',
'session_id': event['session']['id'],
'audio': base64_audio,
'barge_in': {
'strategy': 'minimum_characters',
'minimum_characters': 3
}
})
```
### Node.js
```javascript
const fs = require('fs');
app.post('/webhook', (req, res) => {
const event = req.body;
if (event.type === 'user_speak') {
// Read audio file and encode to base64
const audioData = fs.readFileSync('hold-music.wav');
const base64Audio = audioData.toString('base64');
return res.json({
type: 'audio',
session_id: event.session.id,
audio: base64Audio,
barge_in: {
strategy: 'minimum_characters',
minimum_characters: 3
}
});
}
});
```
### Go
```go
import (
	"encoding/base64"
	"os"
)
func webhook(w http.ResponseWriter, r *http.Request) {
	var event map[string]interface{}
	json.NewDecoder(r.Body).Decode(&event)
	if event["type"] == "user_speak" {
		// Read audio file and encode to base64
		audioData, _ := os.ReadFile("hold-music.wav")
		base64Audio := base64.StdEncoding.EncodeToString(audioData)
session := event["session"].(map[string]interface{})
action := map[string]interface{}{
"type": "audio",
"session_id": session["id"],
"audio": base64Audio,
"barge_in": map[string]interface{}{
"strategy": "minimum_characters",
"minimum_characters": 3,
},
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(action)
return
}
}
```
## Converting Audio Files
### Using FFmpeg
Convert any audio file to the required format:
```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -sample_fmt s16 -f wav output.wav
```
**Parameters:**
* `-ar 16000` - Set sample rate to 16kHz
* `-ac 1` - Set to mono (1 channel)
* `-sample_fmt s16` - Set to 16-bit PCM
* `-f wav` - Output WAV format
### Python Script
```python
import base64
def convert_audio_to_base64(audio_file_path):
with open(audio_file_path, 'rb') as f:
audio_data = f.read()
return base64.b64encode(audio_data).decode('utf-8')
# Usage
base64_audio = convert_audio_to_base64('hold-music.wav')
```
## Barge-In Configuration
Control how users can interrupt audio playback:
```json
{
"barge_in": {
"strategy": "none"
}
}
```
See [Barge-In Configuration](/api/barge-in) for details.
## Use Cases
* **Hold music** - Play music while user waits
* **Pre-recorded messages** - Play announcements or greetings
* **Sound effects** - Play notification sounds
* **Background audio** - Ambient sounds during conversation
## Best Practices
1. **Keep files small** - Large audio files increase latency
2. **Use appropriate format** - Ensure WAV, 16kHz, mono, 16-bit
3. **Test playback** - Verify audio quality before production
4. **Configure barge-in** - Allow natural interruptions when appropriate
5. **Cache base64** - Encode once, reuse the base64 string
## Troubleshooting
### Audio Not Playing
* Verify audio format matches requirements exactly
* Check base64 encoding is correct
* Ensure audio file is not corrupted
* Test with a known-good audio file
### Audio Quality Issues
* Ensure sample rate is exactly 16kHz
* Verify mono channel (not stereo)
* Check bit depth is 16-bit PCM
* Re-encode source audio if needed
## Next Steps
* **[Barge-In Configuration](/api/barge-in)** - Control interruption behavior
* **[Speak Action](/api/actions/speak)** - Text-to-speech alternative
* **[Action Types](/api/actions)** - Complete action reference
---
---
url: /sipgate-ai-flow-api/api/actions/mix-audio.md
---
# Mix Audio Action
Play a looping background sound (e.g. train station, café, office ambience) under the call. The loop plays continuously for the lifetime of the session — also during the assistant's TTS turns and during silences between turns.
Sending `mix_audio` again replaces the active loop. Sending it with `stop: true` removes the loop. The active loop is dropped automatically when the session ends.
## Action Structure
### Start a background loop
```json
{
"type": "mix_audio",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=",
"volume": 0.3
}
```
### Stop an active background loop
```json
{
"type": "mix_audio",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"stop": true
}
```
## Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Always `"mix_audio"` |
| `session_id` | string (UUID) | Yes | Session identifier from event |
| `audio` | string | Conditional | Base64-encoded WAV (16 kHz, 16-bit, mono PCM). **Required when `stop` is not `true`.** |
| `volume` | number | No | Background loop volume, `0.0`–`1.0`. Defaults to `0.5`. |
| `stop` | boolean | No | When `true`, removes the active loop. |
## Audio Format Requirements
Identical to the [`audio` action](/api/actions/audio):
* **Format**: WAV
* **Sample Rate**: 16 kHz
* **Channels**: Mono (single channel)
* **Bit Depth**: 16-bit PCM
* **Encoding**: Base64
A 30-second loop at this format is approximately 940 KB raw and ~1.25 MB as a base64 string in the JSON action payload.
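The estimate above can be checked with a little arithmetic: 30 seconds of 16 kHz, 16-bit (2 bytes per sample), mono PCM, plus base64's 4/3 expansion:

```python
# Back-of-envelope check of the mix_audio payload size (WAV header ignored).
RATE_HZ, BYTES_PER_SAMPLE, SECONDS = 16_000, 2, 30
raw_bytes = RATE_HZ * BYTES_PER_SAMPLE * SECONDS  # raw PCM data
b64_bytes = (raw_bytes + 2) // 3 * 4              # base64 output length
print(raw_bytes)  # 960000 bytes, roughly 940 KiB
print(b64_bytes)  # 1280000 bytes, roughly 1.2 MiB
```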
## Behavior Notes
* **Continuous playback.** Once started, ambient plays for the rest of the call — under the assistant's TTS during turns and on its own during silences.
* **Replace semantics.** A second `mix_audio` (without `stop`) replaces the buffer and volume of the running loop.
* **Restart-safe.** If the service restarts during an active call, the loop continues automatically.
* **Auto-cleanup.** The loop is dropped when the session ends.
## Use Cases
* **Setting the scene.** Add café or train-station ambience to make a virtual receptionist feel located somewhere specific.
* **Wait-state cues.** Light office hum during long lookups so the line doesn't feel dead.
* **Accessibility / signaling.** Subtle sounds that indicate the agent is "in" a particular context.
## Examples
### Python (Flask)
```python
import base64
# Load and base64-encode the loop once at startup
with open('cafe.wav', 'rb') as f:
AMBIENT_AUDIO = base64.b64encode(f.read()).decode('utf-8')
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
if event['type'] == 'session_start':
# Start the ambient loop AND speak the greeting in one response
return jsonify([
{
'type': 'mix_audio',
'session_id': event['session']['id'],
'audio': AMBIENT_AUDIO,
'volume': 0.3,
},
{
'type': 'speak',
'session_id': event['session']['id'],
'text': 'Welcome, how can I help you?',
},
])
if event['type'] == 'user_speak' and 'goodbye' in event['text'].lower():
# Stop the ambient before saying goodbye, then hang up
return jsonify([
{
'type': 'mix_audio',
'session_id': event['session']['id'],
'stop': True,
},
{
'type': 'speak',
'session_id': event['session']['id'],
'text': 'Goodbye!',
},
{ 'type': 'hangup', 'session_id': event['session']['id'] },
])
```
### Node.js
```javascript
import { readFileSync } from "node:fs";
// Load and base64-encode the loop once at startup
const AMBIENT_AUDIO = readFileSync("./cafe.wav").toString("base64");
app.post('/webhook', (req, res) => {
const event = req.body;
if (event.type === 'session_start') {
return res.json([
{
type: 'mix_audio',
session_id: event.session.id,
audio: AMBIENT_AUDIO,
volume: 0.3,
},
{
type: 'speak',
session_id: event.session.id,
text: 'Welcome, how can I help you?',
},
]);
}
});
```
### Go
```go
import (
	"encoding/base64"
	"os"
)
func main() {
	// Load and base64-encode the loop once at startup
	audioBytes, _ := os.ReadFile("cafe.wav")
	ambientAudio := base64.StdEncoding.EncodeToString(audioBytes)
http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
var event map[string]interface{}
json.NewDecoder(r.Body).Decode(&event)
if event["type"] == "session_start" {
session := event["session"].(map[string]interface{})
actions := []map[string]interface{}{
{
"type": "mix_audio",
"session_id": session["id"],
"audio": ambientAudio,
"volume": 0.3,
},
{
"type": "speak",
"session_id": session["id"],
"text": "Welcome, how can I help you?",
},
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(actions)
}
})
}
```
## Converting Audio Files
Convert any audio file to the required format with FFmpeg:
```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -sample_fmt s16 -f wav output.wav
```
For ambient sound, normalizing loudness across presets keeps the relative volume consistent at a given `volume` value. A target of `-30 LUFS` sits well below typical TTS speech (`~-16 LUFS`), so the slider stays useful around `0.2`–`0.5`:
```bash
ffmpeg -i input.mp3 -t 30 -af "loudnorm=I=-30:LRA=11:TP=-2" \
-ar 16000 -ac 1 -sample_fmt s16 -f wav output.wav
```
## Best Practices
1. **Load once, encode once.** Encode each ambient WAV to base64 at startup and reuse the string — don't read+encode per call.
2. **Start the loop with the greeting.** Return `[mix_audio, speak]` together on `session_start` so the ambient is in place from the first word.
3. **Keep the volume low.** Ambient sound should sit *under* the agent. Start around `0.3` and lower from there.
4. **Trim long files.** A 30-second loop is plenty for ambience; longer files just mean larger one-time payloads at session start.
5. **Stop explicitly when ending the call.** Sending `mix_audio { stop: true }` before a farewell is optional (the loop is dropped at `session_end` anyway), but it makes the goodbye land cleanly without ambient bleed.
## Mix Audio vs. Audio Action
| Aspect | `audio` | `mix_audio` |
|---|---|---|
| Plays | Once, then stops | Loops continuously for the rest of the call |
| Audible during silence | No | Yes |
| Plays under TTS | No | Yes |
| Use case | Hold music, announcements, sound effects | Scene/atmosphere under the agent |
| Restart-safe | No (one-shot) | Yes (loop continues automatically) |
## Troubleshooting
### Ambient is too loud / drowns out speech
* Lower the `volume` (try `0.2`).
* Re-normalize the source file to a quieter target LUFS (e.g. `-30 LUFS` instead of `-23`).
### Loop pops at the boundary
For material with strong transients, fade the source file in/out by 50 ms in your editor before encoding so the loop point is silent.
## Next Steps
* **[Audio Action](/api/actions/audio)** - Play a single pre-recorded clip
* **[Speak Action](/api/actions/speak)** - Text-to-speech under the loop
* **[Action Types](/api/actions)** - Complete action reference
---
---
url: /sipgate-ai-flow-api/api/actions/hangup.md
---
# Hangup Action
End the call.
## Action Structure
```json
{
"type": "hangup",
"session_id": "550e8400-e29b-41d4-a716-446655440000"
}
```
## Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Always `"hangup"` |
| `session_id` | string (UUID) | Yes | Session identifier from event |
## Examples
### Python
```python
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
if event['type'] == 'user_speak':
user_text = event['text'].lower()
if 'goodbye' in user_text or 'bye' in user_text:
return jsonify({
'type': 'hangup',
'session_id': event['session']['id']
            })
    return '', 204
```
### Node.js
```javascript
app.post('/webhook', (req, res) => {
const event = req.body;
if (event.type === 'user_speak') {
const userText = event.text.toLowerCase();
if (userText.includes('goodbye') || userText.includes('bye')) {
return res.json({
type: 'hangup',
session_id: event.session.id
});
}
  }
  res.status(204).send();
});
```
### Go
```go
if strings.Contains(text, "goodbye") || strings.Contains(text, "bye") {
action := map[string]interface{}{
"type": "hangup",
"session_id": session["id"],
}
json.NewEncoder(w).Encode(action)
}
```
## Use Cases
* **User says goodbye** - End call politely
* **Task complete** - After completing a task
* **Error handling** - When something goes wrong
* **Timeout** - After inactivity
## Best Practices
1. **Say goodbye first** - Optionally speak before hanging up
2. **Clean up state** - Session will end, but cleanup in `session_end`
3. **Log the reason** - Track why calls ended
4. **Handle gracefully** - Don't hang up abruptly
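Practice 1 ("say goodbye first") pairs naturally with the multi-action response format: return a `speak` followed by a `hangup` in one array so the farewell plays before the call ends. `goodbye_actions` is a hypothetical helper for illustration:

```python
def goodbye_actions(session_id):
    """Ordered action array: the farewell plays first, then the call ends."""
    return [
        {'type': 'speak', 'session_id': session_id,
         'text': 'Thanks for calling. Goodbye!'},
        {'type': 'hangup', 'session_id': session_id},
    ]
```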
## Next Steps
* **[Transfer Action](/api/actions/transfer)** - Transfer to another number
* **[Event Types](/api/events)** - What triggers actions
* **[Event Flow](/api/event-flow)** - Understand the complete flow
---
---
url: /sipgate-ai-flow-api/api/actions/transfer.md
---
# Transfer Action
Transfer the call to another phone number.
## Action Structure
```json
{
"type": "transfer",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"target_phone_number": "1234567890",
"caller_id_name": "Support Department",
"caller_id_number": "1234567890",
"timeout": 30
}
```
## Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Always `"transfer"` |
| `session_id` | string (UUID) | Yes | Session identifier from event |
| `target_phone_number` | string | Yes | Phone number to transfer to (E.164 format without leading + recommended) |
| `caller_id_name` | string | Yes | Caller ID name to display |
| `caller_id_number` | string | Yes | Caller ID number to display |
| `timeout` | integer (5–120) | No | Seconds to wait for the transfer target to answer. When set, enables **transfer fallback** (see below). When omitted, transfer failures end the call. |
## Transfer Fallback
When `timeout` is provided, the call is returned to the agent if the transfer fails:
* Target does not answer within `timeout` seconds
* Target rejects the call (busy, unavailable)
* Target hangs up without answering
On a failed transfer, the service re-emits a [`session_start`](/api/events/session-start) event **with the same `session.id`** and the agent can either continue the conversation with the original caller or attempt another transfer.
On a successful transfer, no further events are sent — the call ends normally once the transferred parties hang up.
```json
{
"type": "transfer",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"target_phone_number": "1234567890",
"caller_id_name": "Support Department",
"caller_id_number": "1234567890",
"timeout": 30
}
```
Your webhook should treat a repeated `session_start` for a known session id as "the call came back" and respond with a recovery prompt (for example: *"Sorry, no one picked up. Would you like to try something else?"*).
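A minimal sketch of that bookkeeping, using a hypothetical in-memory `SEEN_SESSIONS` set to tell a fresh call apart from one that came back after a failed transfer (in production, use a shared store with a TTL):

```python
# Hypothetical store of session ids this webhook has already greeted.
SEEN_SESSIONS = set()

def on_session_start(event):
    """Return the action for a session_start event. A repeated session id
    means a transfer with `timeout` failed and the caller is back."""
    session_id = event['session']['id']
    if session_id in SEEN_SESSIONS:
        return {
            'type': 'speak',
            'session_id': session_id,
            'text': 'Sorry, no one picked up. Would you like to try something else?',
        }
    SEEN_SESSIONS.add(session_id)
    return {
        'type': 'speak',
        'session_id': session_id,
        'text': 'Welcome! How can I help you?',
    }
```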
## Examples
### Python
```python
@app.route('/webhook', methods=['POST'])
def webhook():
event = request.json
if event['type'] == 'user_speak':
user_text = event['text'].lower()
if 'sales' in user_text:
return jsonify({
'type': 'transfer',
'session_id': event['session']['id'],
'target_phone_number': '1234567890',
'caller_id_name': 'Sales Department',
'caller_id_number': '1234567890'
            })
    return '', 204
```
### Node.js
```javascript
app.post('/webhook', (req, res) => {
const event = req.body;
if (event.type === 'user_speak') {
const userText = event.text.toLowerCase();
if (userText.includes('sales')) {
return res.json({
type: 'transfer',
session_id: event.session.id,
target_phone_number: '1234567890',
caller_id_name: 'Sales Department',
caller_id_number: '1234567890'
});
}
}
});
```
### Go
```go
if strings.Contains(text, "sales") {
    action := map[string]interface{}{
        "type":                "transfer",
        "session_id":          session["id"],
        "target_phone_number": "1234567890",
        "caller_id_name":      "Sales Department",
        "caller_id_number":    "1234567890",
    }
    json.NewEncoder(w).Encode(action)
}
```
## Phone Number Format
Use E.164 format without a leading `+`:
* ✅ `1234567890`
* ✅ `491234567890`
* ❌ `123-456-7890` (dashes are not supported)
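When numbers come from user input or a CRM, normalize them before building the `transfer` action. A small sketch (this helper is ours, not part of the API; it intentionally does not convert national formats such as a leading `0`, since that requires knowing the country code):

```python
import re

def normalize_phone_number(raw: str) -> str:
    """Best-effort normalization to E.164 without the leading '+'.

    Strips '+', spaces, dashes, dots and parentheses, then rejects
    anything that is not purely digits.
    """
    cleaned = re.sub(r"[+\s\-.()]", "", raw)
    if not cleaned.isdigit():
        raise ValueError(f"not a phone number: {raw!r}")
    return cleaned
```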
## Use Cases
* **Route to departments** - Sales, support, billing
* **Escalate to human** - When AI can't help
* **Specialized services** - Connect to experts
* **Emergency routing** - Urgent situations
## Best Practices
1. **Announce transfer** - Tell user before transferring
2. **Use E.164 format** - International phone numbers
3. **Set caller ID** - Identify the source
4. **Log transfers** - Track routing decisions
## Next Steps
* **[Hangup Action](/api/actions/hangup)** - End the call
* **[Event Types](/api/events)** - What triggers actions
* **[Event Flow](/api/event-flow)** - Understand the complete flow
---
---
url: /sipgate-ai-flow-api/api/actions/barge-in.md
---
# Barge-In Action
Immediately stop whatever audio the service is currently playing to the caller (synthesized speech from a `speak` action or pre-recorded audio from an `audio` action). This is the manual, application-triggered counterpart to the automatic user-driven interruption.
::: warning Action vs. configuration — don't confuse these
Two things are called "barge-in" and they do different things:
* **`barge_in` action** (this page): a top-level action you send, `{ "type": "barge_in", "session_id": "..." }`. **You** interrupt the playback — right now — from your application.
* **`barge_in` config** on `speak` / `audio` actions: an optional object describing **how and when the caller** is allowed to interrupt. See [Barge-In Configuration](/api/barge-in).
The action stops current playback. The configuration controls whether the caller is allowed to do the same thing by speaking.
:::
## Action Structure
```json
{
  "type": "barge_in",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
```
## Fields
| Field | Type | Required | Description |
|--------------|---------------|----------|----------------------------------|
| `type` | string | Yes | Always `"barge_in"` |
| `session_id` | string (UUID) | Yes | Session identifier from an event |
The action has no other fields. It always targets whatever is currently being played on this session.
## Typical Pattern — Interrupt Then Speak
The most useful form is an **array of actions**: first `barge_in` to cut off the current playback, then `speak` (or `audio`) with the new content. The service executes array entries in order, so the caller hears the playback stop and the new message begin without any manual coordination on your side.
```json
[
  {
    "type": "barge_in",
    "session_id": "550e8400-e29b-41d4-a716-446655440000"
  },
  {
    "type": "speak",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "text": "Sorry, let me correct that — your order ships tomorrow, not today."
  }
]
```
This works anywhere an action response is accepted: HTTP webhook response body, WebSocket message, or external API POST.
### Replace in-progress audio with new audio
```json
[
  { "type": "barge_in", "session_id": "550e8400-e29b-41d4-a716-446655440000" },
  {
    "type": "audio",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEA..."
  }
]
```
### Stop playback without saying anything
Send `barge_in` on its own if you only want silence (for example, to cut off a long response because an external system just produced a final answer you're about to deliver separately):
```json
{
  "type": "barge_in",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
```
## When to Use It
* **Agent self-correction.** Your LLM streamed a tentative answer via `speak`, then a tool call returned a better one. Send `[barge_in, speak]` to replace the in-flight utterance.
* **External event trumps current playback.** A human operator joins, a priority notification arrives, or a fresh webhook result invalidates what's being said right now.
* **Cutting off a long pre-recorded `audio` clip.** The caller gave new intent mid-playback and you've decided to stop the clip early, regardless of their `barge_in` configuration.
If all you want is for the caller to be able to interrupt by speaking, you don't need this action — use the `barge_in` **configuration** on the `speak` or `audio` action instead. See [Barge-In Configuration](/api/barge-in) for the available strategies.
## Examples
### Node.js
```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;
  if (event.type === 'user_speak' && correctionNeeded(event.text)) {
    return res.json([
      { type: 'barge_in', session_id: event.session.id },
      {
        type: 'speak',
        session_id: event.session.id,
        text: 'Sorry, let me correct that.',
      },
    ]);
  }
  res.sendStatus(200);
});
```
### Python
```python
@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json
    if event['type'] == 'user_speak' and correction_needed(event['text']):
        return jsonify([
            {'type': 'barge_in', 'session_id': event['session']['id']},
            {
                'type': 'speak',
                'session_id': event['session']['id'],
                'text': 'Sorry, let me correct that.',
            },
        ])
    return ('', 204)
```
### Go
```go
actions := []map[string]interface{}{
    {"type": "barge_in", "session_id": sessionID},
    {"type": "speak", "session_id": sessionID, "text": "Sorry, let me correct that."},
}
json.NewEncoder(w).Encode(actions)
```
## Behavior Notes
* `barge_in` is a no-op if nothing is currently being played. It does not produce an error.
* The service emits an `assistant_speech_ended` event for the interrupted `speak`/`audio`, followed by the events for the next action in the array.
* Array entries are processed strictly in order. Putting `barge_in` after a `speak` in the same array does not "cancel" that speak before it starts — the speak is dispatched first, then `barge_in` stops it mid-playback.
## Next Steps
* **[Barge-In Configuration](/api/barge-in)** — Let the caller interrupt by speaking (strategies, timing)
* **[Speak Action](/api/actions/speak)** — Synthesize and play text
* **[Audio Action](/api/actions/audio)** — Play pre-recorded audio
* **[Action Types](/api/actions)** — Complete action reference
---
---
url: /sipgate-ai-flow-api/api/actions/configure-transcription.md
---
# Configure Transcription Action
Change the STT (Speech-to-Text) provider and/or recognition language(s) during an active call session without hanging up.
## Action Structure
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM",
  "languages": ["en-US"]
}
```
## Fields
| Field | Type | Required | Default | Description |
|--------------|---------------|----------|------------------|------------------------------------------------------------------------------------------------------|
| `type` | string | Yes | — | Always `"configure_transcription"` |
| `session_id` | string (UUID) | Yes | — | Session identifier from event |
| `provider` | string | No | Current provider | STT provider to switch to. Valid values: `"AZURE"`, `"DEEPGRAM"`, `"ELEVEN_LABS"`. Omitting keeps the current provider. |
| `languages` | string\[] | No | Provider default | BCP-47 language codes (1–4 entries). Fully replaces the current config. Omitting while `provider` is provided resets languages to the provider default (auto-detection); omitting both fields leaves transcription unchanged. |
| `custom_vocabulary` | string\[] | No | — | Words or phrases to boost STT recognition accuracy. Max 100 entries, max 200 characters per entry. Fully replaces the current session-level vocabulary. Merged with client-level vocabulary configured during onboarding. Supported by Azure, Deepgram, and ElevenLabs. |
| `vad` | object | No | Current setting | Voice-activity detection tuning, applied for the rest of the session. See [VAD Configuration](/api/vad). |
At least one of `provider`, `languages`, `custom_vocabulary`, or `vad` should be provided; sending none of them is a no-op.
### Configuring VAD Session-Wide
Use this action to set or change VAD parameters for the entire remaining session
(equivalent to setting `vad` on every subsequent `speak`).
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "vad": {
    "end_of_turn_silence_ms": 1200
  }
}
```
Out-of-range or invalid values are silently ignored.
## Behavioral Details
### Full Replace Semantics
Both `provider` and `languages` use **full replace** semantics — they never merge with existing settings.
| `provider` field | `languages` field | Result |
|-----------------|-------------------|-------------------------------------------------------------|
| Provided | Provided | Switches to new provider with specified languages |
| Provided | Omitted | Switches to new provider; languages reset to `[]` (default) |
| Omitted | Provided | Keeps current provider; languages fully replaced |
| Omitted | Omitted | No-op (transcription unchanged) |
### Custom Vocabulary
Pass a `custom_vocabulary` array to boost recognition of domain-specific terms, product names, proper nouns, or technical terms your callers are likely to use.
* Entries are matched case-insensitively during deduplication and merged with client-level vocabulary.
* Multi-word phrases (e.g. `"SIP-Trunk"`) are supported by all providers.
* If omitted, the current session vocabulary is kept unchanged.
* Max 100 entries; max 200 characters per entry.
**Supported providers:** Azure, Deepgram, ElevenLabs
### Brief Audio Gap During Restart
Any change — language or provider — requires the transcription engine to restart. Audio received during the restart is dropped and will not appear in any `user_speak` event.
| Change type | Typical gap |
|---------------------|----------------|
| Language change only | ~100–500 ms |
| Provider switch | ~200–800 ms |
Design your call flow to trigger changes at natural pause points (e.g., after the assistant finishes speaking) to minimize the impact of the gap.
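One way to hit a natural pause point is to queue the change and flush it once the assistant finishes speaking. A sketch, assuming you receive the `assistant_speech_ended` event mentioned elsewhere in this reference (`pending_changes` is illustrative application state, not part of the API):

```python
# Defer a transcription change until the assistant finishes speaking,
# so the engine-restart audio gap falls into a natural pause.
pending_changes = {}  # session_id -> queued configure_transcription action

def request_language_change(session_id, languages):
    pending_changes[session_id] = {
        "type": "configure_transcription",
        "session_id": session_id,
        "languages": languages,
    }

def handle_event(event):
    """Return the action to send for this event, or None."""
    if event["type"] == "assistant_speech_ended":
        # Natural pause point: flush any deferred change now.
        return pending_changes.pop(event["session"]["id"], None)
    return None
```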
### Barge-In Latency After Provider Switch
Each provider has different Voice Activity Detection (VAD) characteristics. Switching providers may change barge-in latency for the `immediate` strategy:
| Provider | Approximate barge-in latency |
|----------|------------------------------|
| Azure | ~20–80 ms |
| Deepgram | ~20–100 ms |
| ElevenLabs | ~30–120 ms |
### Compatible Channels
The `configure_transcription` action is accepted on all three delivery channels:
* HTTP webhook response
* Client-transport WebSocket
* External API POST
### Multi-Language Support per Provider
Not all providers support simultaneous multi-language detection. When more than one language code is supplied, providers that only accept a single language will silently use the **first entry** and ignore the rest.
| Provider value | Multi-language support | Notes |
|-----------------|------------------------|-------|
| `"AZURE"` | ✅ Up to 4 languages | All entries used for Language Identification (LID) |
| `"DEEPGRAM"` | ✅ Multilingual | Auto-detects across the supplied languages; supply none for full auto-detect |
| `"ELEVEN_LABS"` | ❌ Single language only | Only the first entry is used; rest are ignored |
**Recommendation:** When targeting ElevenLabs, supply exactly one language code. Deepgram and Azure both accept multiple codes; supplying none lets the provider auto-detect.
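To avoid relying on the silent truncation, you can clamp the language list yourself before sending the action. A sketch (the helper and the limits encoded in `MAX_LANGUAGES` mirror the table above and are not part of the API):

```python
# Clamp the language list to what the target provider can use.
MAX_LANGUAGES = {"AZURE": 4, "DEEPGRAM": 4, "ELEVEN_LABS": 1}

def build_transcription_action(session_id, provider, languages):
    limit = MAX_LANGUAGES[provider]
    if len(languages) > limit:
        # Single-language providers would silently use only the first
        # entry anyway; truncating explicitly makes that visible.
        languages = languages[:limit]
    action = {
        "type": "configure_transcription",
        "session_id": session_id,
        "provider": provider,
    }
    if languages:
        action["languages"] = languages
    return action
```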
## Examples
### Change Language Only (Keep Current Provider)
Switch an active session to German transcription:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "languages": ["de-DE"]
}
```
### Switch Provider Only (Languages Reset to Default)
Switch from Azure to Deepgram; languages reset to auto-detection:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "DEEPGRAM"
}
```
### Switch Provider and Language Simultaneously
Switch to ElevenLabs and set English as the recognition language:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "ELEVEN_LABS",
  "languages": ["en-US"]
}
```
### Use Multiple Languages Simultaneously
Enable multi-language detection for German and English:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "languages": ["de-DE", "en-US"]
}
```
Up to 4 language codes may be provided in a single request.
### Reset to Provider Default
Re-send the current provider and omit `languages` to restore automatic language detection (under the full-replace semantics above, a request with neither field is a no-op):
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "AZURE"
}
```
### Boost Recognition with Custom Vocabulary
Improve accuracy for product names and technical terms:
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "custom_vocabulary": ["sipgate", "VoIP", "ISDN", "Portsplitter"]
}
```
### Switching Language Based on User Input
A common pattern: detect the caller's preferred language from their first utterance, then reconfigure transcription mid-call.
```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;
  if (event.type === 'session_start') {
    // Start with multi-language detection
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'Hello! Guten Tag! Please speak in your preferred language.',
    });
  }
  if (event.type === 'user_speak') {
    const detectedLanguage = event.language; // BCP-47 code from STT
    if (detectedLanguage && detectedLanguage.startsWith('de')) {
      // Caller is speaking German — lock transcription to German only
      return res.json({
        type: 'configure_transcription',
        session_id: event.session.id,
        languages: ['de-DE'],
      });
    }
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`,
    });
  }
  res.sendStatus(200);
});
```
### Provider Fallback Pattern
Switch to a backup provider if the primary fails or for specific call scenarios:
```javascript
// Switch to Deepgram for better handling of a specific language/accent
return res.json({
  type: 'configure_transcription',
  session_id: event.session.id,
  provider: 'DEEPGRAM',
  languages: ['en-US'],
});
```
## Next Steps
* **[Actions Overview](/api/actions)** - Complete action reference
* **[Event Types](/api/events)** - What events carry transcribed text
* **[Barge-In Configuration](/api/barge-in)** - Control how users interrupt the assistant
---
---
url: /sipgate-ai-flow-api/api/actions/configure-voice-to-voice.md
---
# Configure Voice-to-Voice Action
::: warning Preview
End-to-end voice-to-voice mode is a preview feature. Available only after a
positive review by sipgate support. See **Access Gate** below.
:::
Switch a session into **end-to-end voice-to-voice** mode. From the moment this
action is processed the assistant no longer goes through the standard STT →
text → TTS pipeline — caller audio is forwarded directly to a speech-to-speech
model and the model's spoken response is sent back to the caller in real time.
The transcribed user text is still surfaced as `user_speak` events for logging
and call traces, but you don't need (and shouldn't send) `speak` actions in
response to them — the model speaks autonomously.
## Action Structure
```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "You are a friendly assistant for the Acme dental practice. Be concise.",
  "greeting": "Hello, this is Acme Dental — how can I help you?",
  "temperature": 0.8,
  "language": "en"
}
```
## Fields
| Field | Type | Required | Default | Description |
|-----------------|---------------|----------|---------|--------------------------------------------------------------------------------------------------------|
| `type` | string | Yes | — | Always `"configure_voice_to_voice"` |
| `session_id` | string (UUID) | Yes | — | Session identifier from the event |
| `system_prompt` | string | Yes | — | Persona / behaviour instructions for the model. Sent once at the start of the session. |
| `greeting` | string | No | — | Opening line the model should speak after connecting. Delivered as an inference trigger so the model phrases it naturally. |
| `temperature` | number | No | `0.8` | Sampling temperature (0–2). Lower values make replies more deterministic. |
| `language` | string | No | — | Preferred response language hint (e.g. `"de"`, `"en"`). The model decides ultimately. |
## Behavioral Details
### STT and TTS are inactive
Once voice-to-voice is active for a session:
* `user_speak` events still arrive, but they reflect the model's own
transcription of the caller's turns — not your configured STT provider.
* `speak` actions are honoured by forwarding the text to the model as a
speaking instruction. The model will speak the text in its own voice — it
may rephrase slightly (the protocol has no verbatim-TTS path). `tts`,
`ssml`, `barge_in`, `vad` and `user_input_timeout_seconds` fields on the
`speak` action are ignored.
* Barge-in is handled inside the model — the configured barge-in strategy
has no effect for the rest of the session.
* VAD parameters set via `configure_transcription.vad` or `speak.vad` are
ignored.
### Reverting to the normal pipeline
Send a `configure_transcription` action to switch the session back to the
standard STT/TTS pipeline. After that, you can send `speak` actions again.
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "AZURE",
  "languages": ["de-DE"]
}
```
### Greeting
When `greeting` is provided, the model speaks an opening line as soon as the
session is ready (typically within 1–2 seconds). The text is given to the model
as guidance — the exact wording may differ slightly.
If you want full silence at the start (e.g. you announce yourself first via a
`speak` action *before* sending `configure_voice_to_voice`), simply omit
`greeting`.
### Latency
End-to-end speech-to-speech models respond noticeably faster than the standard
STT → LLM → TTS pipeline because there are no per-stage decode/encode steps.
First-byte latency for the spoken response is typically in the 200–600 ms
range from the end of the caller's turn.
## Examples
### Minimal: persona-only, no greeting
```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "You are a friendly assistant for the Acme dental practice. Be concise."
}
```
### Persona + greeting in German
```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "Du bist ein freundlicher Assistent für die Zahnarztpraxis Acme.",
  "greeting": "Guten Tag, hier ist die Praxis Acme. Wie kann ich Ihnen helfen?",
  "language": "de"
}
```
### Logging caller turns while the model handles the conversation
Your code receives `user_speak` events for the call trace but does not need
(and should not send) any further actions:
```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;
  if (event.type === 'session_start') {
    return res.json({
      type: 'configure_voice_to_voice',
      session_id: event.session.id,
      system_prompt: 'You are a helpful assistant.',
      greeting: 'Hi! How can I help today?',
    });
  }
  if (event.type === 'user_speak') {
    // Log only — the model is already responding.
    console.log(`Caller said: ${event.text}`);
    return res.status(200).send();
  }
  return res.status(200).send();
});
```
## Access Gate
Voice-to-voice mode is only available upon request and after a positive
review by sipgate support. Mention `configure_voice_to_voice` when you reach
out so we can enable it for your account.
## Next Steps
* **[Actions Overview](/api/actions)** - Complete action reference
* **[Configure Transcription](/api/actions/configure-transcription)** - Switch back to the STT/TTS pipeline
* **[Event Types](/api/events)** - What events carry transcribed text
---
---
url: /sipgate-ai-flow-api/api/actions/send-sms.md
---
# Send SMS Action
Send an SMS from the sipgate account behind the AI Flow to any phone number. Useful for delivering confirmation codes, booking summaries, or follow-up links while (or after) a call.
::: info Availability
`send_sms` is **only available upon request** and after a positive review by sipgate support (fraud / scam protection). Ask your sipgate contact to enable SMS sending for your account.
:::
## Action Structure
```json
{
  "type": "send_sms",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "phone_number": "4915112345678",
  "message": "Your confirmation code is 4242."
}
```
## Fields
| Field | Type | Required | Description |
|----------------|-------------|----------|----------------------------------------------------------------------------------|
| `type` | string | Yes | Always `"send_sms"` |
| `session_id` | string (UUID) | Yes | Session identifier from the event |
| `phone_number` | string | Yes | Recipient number in E.164 format — digits only, **without** leading `+` (preferred; a leading `+` is accepted and stripped automatically). Matches the format used by `transfer` and outbound calls. |
| `message` | string | Yes | SMS body. No hard length limit; long texts are billed per standard SMS segment. |
## Sender
The sender shown to the recipient is determined by the following rules:
1. If outbound calls are already enabled for your account, the sender is the same number used for outbound calls.
2. Otherwise, the recipient sees the called number of the current session (i.e. the number the user dialed to reach you).
You cannot override the sender per request.
## Delivery Semantics
* SMS sending is **fire-and-forget**: the call is not blocked waiting for delivery confirmation.
* There is no delivery receipt in the event stream. Use your own monitoring if you need per-message confirmation.
* A failed send does **not** interrupt the call — the agent can still speak, hang up, or transfer. You receive an `sms_failed` event to react conversationally (e.g. apologize, retry, or collect a corrected number).
## Examples
### Python
```python
@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json
    if event['type'] == 'user_speak':
        text = event['text'].lower()
        if 'send me the code' in text:
            return jsonify({
                'type': 'send_sms',
                'session_id': event['session']['id'],
                'phone_number': event['session']['from_phone_number'],
                'message': 'Your confirmation code is 4242.',
            })
    return ('', 204)
```
### Node.js
```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;
  if (event.type === 'user_speak' && /code/i.test(event.text)) {
    return res.json({
      type: 'send_sms',
      session_id: event.session.id,
      phone_number: event.session.from_phone_number,
      message: 'Your confirmation code is 4242.',
    });
  }
  res.sendStatus(200);
});
```
### Go
```go
if strings.Contains(strings.ToLower(text), "code") {
    action := map[string]interface{}{
        "type":         "send_sms",
        "session_id":   session["id"],
        "phone_number": session["from_phone_number"].(string),
        "message":      "Your confirmation code is 4242.",
    }
    json.NewEncoder(w).Encode(action)
}
```
### Phone Number Format
Align with the rest of the AI Flow API:
* ✅ `4915112345678` (preferred; E.164 without `+`)
* ✅ `+4915112345678` (accepted; `+` is stripped before delivery)
* ❌ `+49 151 1234 5678` (separators such as spaces or dashes are rejected)
* ❌ `0151 1234 5678` (national format rejected)
## Handling Failure — the `sms_failed` Event
When sending fails, the AI Flow emits an `sms_failed` event to your webhook / WebSocket. Handle it to keep the conversation natural:
```json
{
  "type": "sms_failed",
  "session": { "id": "550e8400-e29b-41d4-a716-446655440000", "...": "..." },
  "recipient": "4915112345678",
  "reason": "sender_not_allowed",
  "message": "SMSC returned faultCode 403"
}
```
`reason` is one of:
| Value | Meaning |
|------------------------|---------------------------------------------------------------------------------------|
| `sender_not_allowed` | Your configured sender number isn't verified for SMS — fix in account settings. |
| `insufficient_balance` | Account has insufficient credits for the send. |
| `no_sms_extension` | No SMS extension is provisioned for this account — contact sipgate support. |
| `smsc_unavailable` | Transient infrastructure issue; safe to retry later. |
| `unknown` | Any other failure; check the optional `message` field for details. |
See **[Events Reference](/api/events)** for the full event schema.
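In code, these reasons can map to spoken fallbacks so the conversation stays natural. A sketch (the apology texts are ours; only the `reason` values come from the table above):

```python
# Map sms_failed reasons to a spoken fallback.
APOLOGIES = {
    "sender_not_allowed": "Sorry, I'm not able to send text messages right now.",
    "insufficient_balance": "Sorry, I couldn't send the text message.",
    "no_sms_extension": "Sorry, I couldn't send the text message.",
    "smsc_unavailable": "Sorry, sending the text failed. I'll try again shortly.",
    "unknown": "Sorry, I couldn't send the text message.",
}

def handle_sms_failed(event):
    """Return a speak action that acknowledges the failed SMS."""
    reason = event.get("reason", "unknown")
    return {
        "type": "speak",
        "session_id": event["session"]["id"],
        "text": APOLOGIES.get(reason, APOLOGIES["unknown"]),
    }
```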
## Best Practices
1. **Ask for consent before sending.** Announce over the call that you'll send an SMS.
2. **Use E.164 without a leading `+`.** Always normalize user-provided numbers before passing them in.
3. **Keep messages short.** Each SMS segment is billed; long messages split into multiple segments silently.
4. **Handle `sms_failed`.** Have a fallback (speak an apology, retry with a corrected number, or skip the SMS and continue).
5. **Don't loop.** A single SMS per session is usually enough — sending multiple in quick succession can look spammy.
## Next Steps
* **[Event Types](/api/events)** — including the `sms_failed` event schema
* **[Speak Action](/api/actions/speak)** — acknowledge the SMS over the call
* **[Hangup Action](/api/actions/hangup)** — wrap up after the SMS is queued
---
---
url: /sipgate-ai-flow-api/api/tts-providers.md
---
# TTS Providers
Configure text-to-speech providers for different voices and languages.
## Overview
The AI Flow service supports multiple TTS providers. Configure them per action in the `tts` field.
## Supported Providers
* **Azure Cognitive Services** - 400+ voices in 140+ languages
* **ElevenLabs** - Ultra-realistic conversational voices
## Azure Cognitive Services
### Configuration
```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "Hello!",
  "tts": {
    "provider": "azure",
    "language": "en-US",
    "voice": "en-US-JennyNeural"
  }
}
```
### Popular Voices
| Language | Voice Name | Gender | Description |
| -------- | ------------------ | ------ | ---------------------- |
| en-US | en-US-JennyNeural | Female | Friendly, professional |
| en-US | en-US-GuyNeural | Male | Clear, neutral |
| en-GB | en-GB-SoniaNeural | Female | British, professional |
| en-GB | en-GB-RyanNeural | Male | British, friendly |
| de-DE | de-DE-KatjaNeural | Female | Professional, clear |
| de-DE | de-DE-ConradNeural | Male | Deep, authoritative |
**Full Voice List:** See [Azure TTS documentation](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support)
## ElevenLabs
### Configuration
```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "Hello!",
  "tts": {
    "provider": "eleven_labs",
    "voice": "21m00Tcm4TlvDq8ikWAM"
  }
}
```
::: tip Voice IDs
The `voice` field is optional and accepts the ElevenLabs voice ID as a string. For example, `"21m00Tcm4TlvDq8ikWAM"` for "Rachel". If omitted, the default voice (sipgate) is used.
:::
**Minimal Configuration (uses default voice):**
```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "Hello!",
  "tts": {
    "provider": "eleven_labs"
  }
}
```
### Available Voices
| Voice Name | ID | Description |
| ------------ | -------------------- | ------------------------------------------------------------------------ |
| sipgate | dSu12TX3MEDQXAarG4s6 | Clean male voice used by sipgate for system announcements (default). |
| Rachel | 21m00Tcm4TlvDq8ikWAM | Matter-of-fact, personable woman. Great for conversational use cases. |
| Drew | 29vD33N1CtxCmqQRPOHJ | - |
| Clyde | 2EiwWnXFnvU5JabPnv8n | Great for character use-cases |
| Paul | 5Q0t7uMcjvnagumLfvZi | - |
| Aria | 9BWtsMINqrJLrRacOk9x | Middle-aged female with African-American accent. Calm with hint of rasp. |
| Domi | AZnzlk1XvdvUeBnXmlld | - |
| Dave | CYw3kZ02Hs0563khs1Fj | - |
| Roger | CwhRBWXzGAHq8TQ4Fs17 | Easy going and perfect for casual conversations. |
| Fin | D38z5RcWu1voky8WS1ja | - |
| Sarah | EXAVITQu4vr4xnSDxMaL | Young adult woman with confident, warm tone. Reassuring and professional. |
| Antoni | ErXwobaYiN019PkySvjV | - |
| Laura | FGY2WhTYpPnrIDTdsKH5 | Young adult female with sunny enthusiasm and quirky attitude. |
| Thomas | GBv7mTt0atIp3Br8iCZE | Soft and subdued male voice, optimal for narrations or meditations |
| Charlie | IKne3meq5aSn9XLyUdCD | Young Australian male with confident and energetic voice. |
| George | JBFqnCBsd6RMkjVDRZzb | Warm resonance that instantly captivates listeners. |
| Emily | LcfcDJNUP1GQjkzn1xUU | - |
| Elli | MF3mGyEYCl7XYWbV9V6O | - |
| Callum | N2lVS1w4EtoT3dr4eOWO | Deceptively gravelly, yet unsettling edge. |
| Patrick | ODq5zmih8GrVes37Dizd | - |
| River | SAz9YHcvj6GT2YYXdXww | Relaxed, neutral voice ready for narrations or conversational projects. |
| Harry | SOYHLrjzK2X1ezoPC6cr | An animated warrior ready to charge forward. |
| Liam | TX3LPaxmHKxFdv7VOQHJ | Young adult with energy and warmth - suitable for reels and shorts. |
| Dorothy | ThT5KcBeYPX3keUQqHPh | - |
| Josh | TxGEqnHWrfWFTfGW9XjX | - |
| Arnold | VR6AewLTigWG4xSOukaG | - |
| Charlotte | XB0fDUnXU5powFXDhCwa | Sensual and raspy, ready to voice your temptress in video games. |
| Alice | Xb7hH8MSUJpSbSDYk0k2 | Clear and engaging British woman, suitable for e-learning. |
| Matilda | XrExE9yKIg1WjnnlVkGX | Professional woman with pleasing alto pitch. Suitable for many use cases. |
| James | ZQe5CZNOzWyzPSCn5a3c | - |
| Joseph | Zlb1dXrM653N07WRdFW3 | - |
| Will | bIHbv24MWmeRgasZH58o | Conversational and laid back. |
| Jeremy | bVMeCyTHy58xNoL34h3p | - |
| Jessica | cgSgspJ2msm6clMCkdW9 | Young and playful American female, perfect for trendy content. |
| Eric | cjVigY5qzO86Huf0OWal | Smooth tenor pitch from man in his 40s - perfect for agentic use cases. |
| Michael | flq6f7yk4E4fJM5XTYuZ | - |
| Ethan | g5CIjZEefAph4nQFvHAz | - |
| Chris | iP95p4xoKVk53GoZ742B | Natural and real, down-to-earth voice great across many use-cases. |
| Gigi | jBpfuIE2acCO8z3wKNLl | - |
| Freya | jsCqWAovK2LkecY7zXl4 | - |
| Brian | nPczCjzI2devNBz1zQrb | Middle-aged man with resonant and comforting tone. Great for narrations. |
| Grace | oWAxZDx7w5VEj9dCyTzz | - |
| Daniel | onwK4e9ZLuTAKqWW03F9 | Strong voice perfect for professional broadcast or news story. |
| Lily | pFZP5JQG7iQjIQuC4Bku | Velvety British female voice delivers news with warmth and clarity. |
| Serena | pMsXgVXv3BLzUgSXRplE | - |
| Adam | pNInz6obpgDQGcFmaJgB | - |
| Nicole | piTKgcLEGmPE4e6mEKli | - |
| Bill | pqHfZKP75CvOlQylNhV4 | Friendly and comforting voice ready to narrate your stories. |
| Jessie | t0jbNlBVZ17f02VDIeMI | - |
| Sam | yoZ06aMxZJJ28mfd3POQ | - |
| Glinda | z9fAnlkpzviPz146aGWa | - |
| Giovanni | zcAOhNBS3c14rBihAFp1 | - |
| Mimi | zrHiDhphv9ZnVXBqCLjz | - |
## Choosing a Provider
### Use Azure when:
* You need many languages (140+)
* You want consistent quality
* You need regional accents
* Budget is a concern
### Use ElevenLabs when:
* You need the most natural voices
* Conversational quality is critical
* You're working with English/European languages
* You want distinct personalities
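These criteria can be encoded as a small helper that picks the `tts` object per call. A sketch (the mapping is a deliberate simplification we made up; tune it to your own voice preferences, and optionally pin a specific Azure voice as well):

```python
def pick_tts(language: str) -> dict:
    """Return a tts config dict for the given BCP-47 language code."""
    if language.startswith("en"):
        # ElevenLabs for the most natural conversational English voices.
        return {"provider": "eleven_labs", "voice": "21m00Tcm4TlvDq8ikWAM"}  # Rachel
    # Azure for broad language coverage and regional accents.
    return {"provider": "azure", "language": language}
```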
## Examples
### Python
```python
# Azure voice
action = {
    'type': 'speak',
    'session_id': session_id,
    'text': 'Hello!',
    'tts': {
        'provider': 'azure',
        'language': 'en-US',
        'voice': 'en-US-JennyNeural'
    }
}

# ElevenLabs voice
action = {
    'type': 'speak',
    'session_id': session_id,
    'text': 'Hello!',
    'tts': {
        'provider': 'eleven_labs',
        'voice': '21m00Tcm4TlvDq8ikWAM'  # Rachel
    }
}
```
## Next Steps
* **[Speak Action](/api/actions/speak)** - How to use TTS
* **[Barge-In Configuration](/api/barge-in)** - Control interruptions
---
---
url: /sipgate-ai-flow-api/api/barge-in.md
---
# Barge-In Configuration
Control how users can interrupt the assistant while speaking.
::: tip Looking for the `barge_in` action?
This page covers the `barge_in` **configuration object** attached to `speak` / `audio` actions — it decides whether and how the **caller** may interrupt. The top-level `barge_in` **action**, which lets **your application** interrupt the current playback, has its own page: [Barge-In Action](/api/actions/barge-in).
:::
## Overview
Barge-in allows users to interrupt the assistant's speech. Configure it per action using the `barge_in` field.
## Configuration
```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "Hello!",
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 3,
    "allow_after_ms": 500
  }
}
```
## Strategies
### `none`
Disables barge-in completely. Audio plays fully without interruption.
```json
{
  "barge_in": {
    "strategy": "none"
  }
}
```
**Use cases:**
* Critical information
* Legal disclaimers
* Emergency instructions
### `manual`
Allows manual barge-in via API only (no automatic detection).
```json
{
  "barge_in": {
    "strategy": "manual"
  }
}
```
**Use cases:**
* Custom interruption logic
* Button-triggered interruption
* External event-based interruption
### `minimum_characters`
Automatically detects barge-in when user speech exceeds character threshold.
```json
{
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 5,
    "allow_after_ms": 500
  }
}
```
```
**Use cases:**
* Natural conversation flow
* Customer service scenarios
* Interactive voice menus
### `immediate` ⚡ NEW
**Most responsive option** - Interrupts immediately when user starts speaking, using Voice Activity Detection (VAD).
```json
{
  "barge_in": {
    "strategy": "immediate",
    "allow_after_ms": 500
  }
}
```
**How it works:**
* **Azure/Deepgram**: Uses VAD (Voice Activity Detection) - triggers before any text is recognized
* **ElevenLabs**: Uses first partial transcript
* **Latency**: 20-100ms (2-4x faster than `minimum_characters`)
* **No text required**: Interrupts on voice detection, not transcription
**Use cases:**
* High-priority conversations requiring instant responsiveness
* Natural dialogue where interruptions should feel seamless
* Customer service where quick response matters
* Urgent or time-sensitive interactions
**Best practices:**
* Use `allow_after_ms: 500-1000` to prevent accidental interruptions at start
* Test with real users to find optimal `allow_after_ms` value
* Consider network latency in production environments
**Comparison with `minimum_characters`:**
| Feature | `immediate` | `minimum_characters` |
|---------|-------------|---------------------|
| **Trigger** | Voice Activity (VAD) | Text recognition (3+ characters) |
| **Latency** | 20-100ms | 50-200ms |
| **User Experience** | Instant interruption | Slight delay |
| **Accuracy** | May trigger on noise | More reliable (text-based) |
## Configuration Options
### `minimum_characters`
Minimum number of characters before barge-in triggers.
* **Default**: `3`
* **Range**: `1` to `100`
* **Higher values**: Require more speech before interruption
### `allow_after_ms`
Delay in milliseconds before barge-in is allowed (protection period).
* **Default**: `0` (immediate)
* **Range**: `0` to `10000` (10 seconds)
* **Use**: Prevent interruption during critical information
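The two options can be combined with any strategy. As an illustration, here is a small builder that keeps values inside the documented ranges; this helper is not part of the SDK, and its name and clamping behaviour are assumptions for the sketch:

```typescript
// Illustrative helper (not part of the API): build a barge_in config,
// clamping values to the documented ranges
// (minimum_characters 1–100, allow_after_ms 0–10000 ms).
type BargeInStrategy = "none" | "manual" | "minimum_characters" | "immediate";

interface BargeInConfig {
  strategy: BargeInStrategy;
  minimum_characters?: number;
  allow_after_ms?: number;
}

function bargeIn(
  strategy: BargeInStrategy,
  opts: { minimumCharacters?: number; allowAfterMs?: number } = {}
): BargeInConfig {
  const clamp = (v: number, lo: number, hi: number) =>
    Math.min(hi, Math.max(lo, Math.round(v)));
  const config: BargeInConfig = { strategy };
  if (opts.minimumCharacters !== undefined) {
    config.minimum_characters = clamp(opts.minimumCharacters, 1, 100);
  }
  if (opts.allowAfterMs !== undefined) {
    config.allow_after_ms = clamp(opts.allowAfterMs, 0, 10000);
  }
  return config;
}
```

The result can be attached as the `barge_in` field of a `speak` or `audio` action.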
## Examples
### Natural Conversation
```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "I can help you with billing, support, or sales.",
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 3
  }
}
```
### Critical Information
```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "Your verification code is 1-2-3-4-5-6.",
  "barge_in": {
    "strategy": "none"
  }
}
```
### Protected Announcement
```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "Your account number is 1234567890.",
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 10,
    "allow_after_ms": 2000
  }
}
```
### Instant Response (Immediate) ⚡
```json
{
  "type": "speak",
  "session_id": "session-123",
  "text": "I can help you with your order, account, or technical support. What would you like to know?",
  "barge_in": {
    "strategy": "immediate",
    "allow_after_ms": 500
  }
}
```
**Result**: Assistant stops speaking the moment user starts talking (20-100ms latency), providing the most natural conversation experience.
## Best Practices
1. **Use `none` sparingly** - Only for truly critical information
2. **Choose the right strategy**:
* `immediate` - For most natural, responsive conversations
* `minimum_characters` - For balance between responsiveness and reliability
* `manual` - For custom logic
* `none` - For critical announcements only
3. **Set protection periods** - Use `allow_after_ms: 500-1000` to prevent cutting off an important intro
4. **Test with users** - Find the right balance for your use case
5. **Consider noise** - `immediate` may trigger on background noise; use `allow_after_ms` as buffer
## Next Steps
* **[Speak Action](/api/actions/speak)** - How to use barge-in
* **[User Speak Event with Barge-In Flag](/api/events/user-speak)** - Handle interruptions
---
---
url: /sipgate-ai-flow-api/api/vad.md
---
# VAD (Voice Activity Detection) Configuration
Advanced setting that lets you tune how long the system waits in silence before
treating the caller's turn as finished. Useful for call flows where the caller
is expected to pause (think aloud, list items, spell things out) or where
you want a snappier turn-taking rhythm.
::: warning Optional advanced setting
The default behaviour is tuned for typical conversations. Only set `vad` when
you have a concrete use case where the system's default end-of-turn timing
is too eager or too patient. When omitted, the system default applies.
:::
## Where to set it
VAD config is accepted in two places:
* **Per `speak` action** — applies to the caller's reply that follows. The
setting persists until overridden by another `speak.vad` or by
`configure_transcription.vad`.
* **On `configure_transcription`** — sets the value for the rest of the
session (until overridden again).
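These override rules can be pictured as a tiny state holder; the following is a sketch of the described behaviour, not SDK code:

```typescript
// Sketch of the persistence rule: the most recent vad value wins,
// whether it came from a speak action or from configure_transcription,
// and stays in effect until overridden again.
interface VadConfig {
  end_of_turn_silence_ms: number;
}

class VadState {
  private current: VadConfig | undefined; // undefined = system default

  // Apply a speak or configure_transcription payload.
  apply(action: { vad?: VadConfig }): void {
    if (action.vad !== undefined) this.current = action.vad;
  }

  effective(): VadConfig | undefined {
    return this.current;
  }
}
```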
## Schema
```json
{
  "vad": {
    "end_of_turn_silence_ms": 1200
  }
}
```
| Field | Type | Recommended range | Description |
|--------------------------|--------|-------------------|--------------------------------------------------------------------------------------------------------------|
| `end_of_turn_silence_ms` | number | 150–2000 | Milliseconds of silence after the caller stops speaking before their turn is considered finished. |
Lower values yield faster turn-taking; higher values tolerate longer pauses.
## Lenient validation
If you send an out-of-range, non-integer, or otherwise invalid value, the value
is **silently ignored** — the system default takes over and the rest of your
action is processed normally. This avoids breaking call flows over a typo.
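A sketch of this lenient handling, under the assumption that the recommended range doubles as the accepted range:

```typescript
// Assumed validation behaviour: invalid or out-of-range values are dropped
// so the system default applies; valid values are used as-is.
function sanitizeEndOfTurnSilence(value: unknown): number | undefined {
  if (typeof value !== "number" || !Number.isInteger(value)) return undefined;
  if (value < 150 || value > 2000) return undefined; // recommended range
  return value; // valid: use the caller's value
}
```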
## Example: tolerate long pauses (e.g. spelling)
```json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Please spell your last name, letter by letter.",
  "vad": {
    "end_of_turn_silence_ms": 1500
  }
}
```
## Example: snappy back-and-forth (e.g. yes/no questions)
```json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Did you mean account number 1234?",
  "vad": {
    "end_of_turn_silence_ms": 250
  }
}
```
## Example: set once for the whole session
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "vad": {
    "end_of_turn_silence_ms": 1000
  }
}
```
## Notes
* The setting takes effect immediately — the assistant's speech plays before the caller can reply, so any internal reconfiguration completes before the system needs to listen again.
* VAD tuning and [barge-in](/api/barge-in) are related but distinct: `vad`
governs *when the caller's turn is considered finished*, while `barge_in`
governs *whether and how the caller may interrupt the assistant while it is
speaking*. Both can be set on the same `speak` action.
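For example, a prompt that tolerates long pauses while refusing interruptions could set both fields on one action (text and values are illustrative):

```typescript
// Illustrative speak action combining both settings: vad controls
// end-of-turn timing, barge_in controls interruption of this playback.
const action = {
  type: "speak",
  session_id: "550e8400-e29b-41d4-a716-446655440000",
  text: "Please spell your last name, letter by letter.",
  vad: { end_of_turn_silence_ms: 1500 }, // tolerate pauses between letters
  barge_in: { strategy: "none" },        // don't let noise cut the prompt off
};
```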
---
---
url: /sipgate-ai-flow-api/sdk.md
---
# SDK Guide
Welcome to the sipgate AI Flow SDK documentation! This guide will help you build powerful AI-powered voice assistants with real-time speech processing capabilities.
## What is the SDK?
The `@sipgate/ai-flow-sdk` is a TypeScript SDK that provides a simple, event-driven interface for building voice assistants. It handles the complexity of real-time speech processing, event management, and action responses, so you can focus on building great conversational experiences.
## Key Concepts
### Event-Driven Architecture
The SDK uses an event-driven model where your assistant responds to events from the AI Flow service:
* **Session Start** - When a new call begins
* **User Speak** - When the user says something
* **User Barge In** - When the user interrupts the assistant
* **Assistant Speak** - After your assistant speaks
* **Session End** - When the call ends
### Simple Response Model
Event handlers can return:
* **Simple strings** - Automatically converted to speech
* **Action objects** - For advanced control (speak, transfer, hangup, etc.)
* **null/undefined** - No response needed
### Easy Integration
The SDK provides built-in middleware for:
* **Express.js** - `assistant.express()` middleware
* **WebSocket** - `assistant.ws(ws)` message handler
* **Custom** - `assistant.onEvent(event)` for any integration
## Quick Example
```typescript
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const assistant = AiFlowAssistant.create({
  onSessionStart: async (event) => {
    return "Hello! How can I help you today?";
  },
  onUserSpeak: async (event) => {
    const userText = event.text;
    console.log(`User said: ${userText}`);
    return `You said: ${userText}`;
  },
  onSessionEnd: async (event) => {
    console.log(`Session ${event.session.id} ended`);
  },
});

// Use with Express
app.post("/webhook", assistant.express());
```
## What's Next?
* **[Installation](/sdk/installation)** - Install the SDK and set up your project
* **[Quick Start](/sdk/quick-start)** - Build your first voice assistant
* **[Core Concepts](/sdk/core-concepts)** - Learn about events and responses
* **[API Reference](/sdk/api-reference)** - Complete API documentation
## For AI-Assisted Development
Using AI coding assistants like **Claude Code**, **ChatGPT**, or **Cursor**? We publish two auto-generated files following the [llms.txt spec](https://llmstxt.org/):
* **[`/llms.txt`](/llms.txt)** — short index, auto-discovered by AI tooling.
* **[`/llms-full.txt`](/llms-full.txt)** — full documentation corpus in a single file, ideal for pasting into an LLM context.
---
---
url: /sipgate-ai-flow-api/sdk/installation.md
---
# Installation
Install the sipgate AI Flow SDK to start building voice assistants.
## Package Managers
```bash
npm install @sipgate/ai-flow-sdk
```
```bash
yarn add @sipgate/ai-flow-sdk
```
```bash
pnpm add @sipgate/ai-flow-sdk
```
## Requirements
* **Node.js** >= 22.0.0
* **TypeScript** 5.x (recommended)
## TypeScript Setup
The SDK is written in TypeScript and includes full type definitions. No additional `@types` package is needed.
If you're using TypeScript, make sure your `tsconfig.json` includes:
```json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ESNext",
    "moduleResolution": "bundler",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true
  }
}
```
## Verify Installation
You can verify the installation by importing the SDK:
```typescript
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";
console.log("SDK installed successfully!");
```
## Next Steps
* **[Quick Start](/sdk/quick-start)** - Build your first voice assistant
* **[API Reference](/sdk/api-reference)** - Explore the complete API
---
---
url: /sipgate-ai-flow-api/sdk/quick-start.md
---
# Quick Start
Get up and running with your first voice assistant in minutes.
## Basic Assistant
Here's a minimal example that responds to user speech:
```typescript
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const assistant = AiFlowAssistant.create({
  debug: true,
  onSessionStart: async (event) => {
    console.log(`Session started for ${event.session.phone_number}`);
    return "Hello! How can I help you today?";
  },
  onUserSpeak: async (event) => {
    const userText = event.text;
    console.log(`User said: ${userText}`);
    // Process user input and return response
    return `You said: ${userText}`;
  },
  onSessionEnd: async (event) => {
    console.log(`Session ${event.session.id} ended`);
  },
  onUserBargeIn: async (event) => {
    console.log(`User interrupted with: ${event.text}`);
    return "I'm listening, please continue.";
  },
});
```
## Express.js Integration
The easiest way to get started is with Express.js:
```typescript
import express from "express";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const app = express();
app.use(express.json());

const assistant = AiFlowAssistant.create({
  onSessionStart: async (event) => {
    return "Welcome! How can I help you today?";
  },
  onUserSpeak: async (event) => {
    // Your conversation logic here
    return processUserInput(event.text);
  },
  onSessionEnd: async (event) => {
    await cleanupSession(event.session.id);
  },
});

// Webhook endpoint
app.post("/webhook", assistant.express());

// Health check
app.get("/health", (req, res) => {
  res.json({ status: "ok" });
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`AI Flow assistant running on port ${PORT}`);
});
```
## WebSocket Integration
For WebSocket-based integrations:
```typescript
import WebSocket from "ws";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const wss = new WebSocket.Server({
  port: 8080,
  perMessageDeflate: false,
});

const assistant = AiFlowAssistant.create({
  onUserSpeak: async (event) => {
    return "Hello from WebSocket!";
  },
});

wss.on("connection", (ws, req) => {
  console.log("New WebSocket connection");

  ws.on("message", assistant.ws(ws));

  ws.on("error", (error) => {
    console.error("WebSocket error:", error);
  });

  ws.on("close", () => {
    console.log("WebSocket connection closed");
  });
});

console.log("WebSocket server listening on port 8080");
```
## Response Types
You can return different types of responses:
```typescript
// 1. Simple string (automatically converted to speak action)
return "Hello, how can I help?";

// 2. Action object (for advanced control)
return {
  type: "speak",
  session_id: event.session.id,
  text: "Hello!",
  barge_in: { strategy: "minimum_characters" },
};

// 3. null/undefined (no response needed)
return null;
```
## Next Steps
* **[Core Concepts](/sdk/core-concepts)** - Learn about events and responses in detail
* **[API Reference](/sdk/api-reference)** - Explore the complete API
* **[Integration Guides](/sdk/integrations/express)** - See more integration examples
---
---
url: /sipgate-ai-flow-api/sdk/core-concepts.md
---
# Core Concepts
Understanding the event-driven architecture and response model.
## Event-Driven Architecture
The SDK uses an event-driven model where your assistant responds to events from the AI Flow service:
1. **Session Start** - Called when a new call session begins
2. **User Speak** - Called when the user says something (after speech-to-text)
3. **User Barge In** - Called when the user interrupts the assistant
4. **Assistant Speak** - Called after your assistant starts speaking (may be omitted by some text-to-speech models)
5. **Assistant Speech Ended** - Called when the assistant's speech playback ends
6. **Session End** - Called when the call ends
### Event Flow
```
┌─────────────────┐
│ session_start │──> Respond with speak/audio or do nothing
└─────────────────┘
┌─────────────────┐
│ user_speak │──> Respond with speak/audio/transfer/hangup
│ (barged_in?) │ Check barged_in flag for interruptions
└─────────────────┘
┌─────────────────┐
│ assistant_speak │──> Optional: track metrics, trigger next action
└─────────────────┘
┌─────────────────┐
│ session_end │──> Cleanup only, no actions accepted
└─────────────────┘
```
## Response Types
Event handlers can return three types of responses:
### 1. Simple String
The simplest way to respond - just return a string:
```typescript
onUserSpeak: async (event) => {
  return "Hello, how can I help?";
}
```
This is automatically converted to a `speak` action.
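Conceptually, the SDK normalizes the return value along these lines (a sketch, not the actual implementation):

```typescript
// Sketch: how a string return value becomes a speak action.
interface SpeakAction {
  type: "speak";
  session_id: string;
  text: string;
}

function normalizeResponse(
  response: string | SpeakAction | null | undefined,
  sessionId: string
): SpeakAction | null {
  if (response === null || response === undefined) return null;
  if (typeof response === "string") {
    // Simple strings are wrapped in a speak action for the current session.
    return { type: "speak", session_id: sessionId, text: response };
  }
  return response; // already an action object
}
```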
### 2. Action Object
For advanced control, return an action object:
```typescript
onUserSpeak: async (event) => {
  return {
    type: "speak",
    session_id: event.session.id,
    text: "Hello!",
    barge_in: {
      strategy: "minimum_characters",
      minimum_characters: 3
    },
  };
}
```
Available action types:
* `speak` - Text-to-speech response
* `audio` - Play pre-recorded audio
* `hangup` - End the call
* `transfer` - Transfer to another number
* `barge_in` - Manually interrupt playback
### 3. No Response
Return `null` or `undefined` when no response is needed:
```typescript
onAssistantSpeak: async (event) => {
  // Track metrics, no response needed
  trackMetrics(event);
  return null;
}
```
## Session Information
All events include session information:
```typescript
interface SessionInfo {
  id: string;           // UUID of the session
  account_id: string;   // Account identifier
  phone_number: string; // Phone number for this flow session
  direction?: "inbound" | "outbound";
  from_phone_number: string;
  to_phone_number: string;
}
```
## Best Practices
### 1. Handle All Events
Even if you don't need to respond, it's good practice to handle all events:
```typescript
const assistant = AiFlowAssistant.create({
  onSessionStart: async (event) => {
    // Initialize session state
    initializeSession(event.session.id);
    return "Welcome!";
  },
  onUserSpeak: async (event) => {
    // Main conversation logic
    return processUserInput(event.text);
  },
  onSessionEnd: async (event) => {
    // Cleanup
    cleanupSession(event.session.id);
  },
});
```
### 2. Use Type Safety
The SDK provides full TypeScript types:
```typescript
import type {
  AiFlowEventUserSpeak,
  AiFlowAction
} from "@sipgate/ai-flow-sdk";

onUserSpeak: async (event: AiFlowEventUserSpeak) => {
  // event is fully typed
  const text: string = event.text;
  const sessionId: string = event.session.id;

  return {
    type: "speak",
    session_id: sessionId,
    text: `You said: ${text}`,
  } as AiFlowAction;
}
```
### 3. Error Handling
Always handle errors gracefully:
```typescript
onUserSpeak: async (event) => {
  try {
    return await processUserInput(event.text);
  } catch (error) {
    console.error("Error processing user input:", error);
    return "I'm sorry, I encountered an error. Please try again.";
  }
}
```
## Next Steps
* **[API Reference](/sdk/api-reference)** - Complete API documentation
* **[Event Types](/sdk/events)** - Detailed event reference
* **[Action Types](/sdk/actions)** - All available actions
---
---
url: /sipgate-ai-flow-api/sdk/response-types.md
---
# Response Types
Learn about the different ways to respond to events.
## Overview
Event handlers can return these response types:
1. **Simple string** - Automatically converted to a speak action
2. **Action object** - For advanced control
3. **Array of actions** - Execute multiple actions in sequence
4. **null/undefined** - No response needed
## Simple String Response
The simplest way to respond is to return a string:
```typescript
onUserSpeak: async (event) => {
  return "Hello, how can I help?";
}
```
This is automatically converted to:
```typescript
{
  type: "speak",
  session_id: event.session.id,
  text: "Hello, how can I help?",
}
```
## Action Object Response
For advanced control, return an action object directly:
```typescript
onUserSpeak: async (event) => {
  return {
    type: "speak",
    session_id: event.session.id,
    text: "Hello!",
    barge_in: {
      strategy: "minimum_characters",
      minimum_characters: 3,
    },
  };
}
```
### Available Action Types
* **[Speak Action](/sdk/actions#speak-action)** - Text-to-speech response
* **[Audio Action](/sdk/actions#audio-action)** - Play pre-recorded audio
* **[Hangup Action](/sdk/actions#hangup-action)** - End the call
* **[Transfer Action](/sdk/actions#transfer-action)** - Transfer to another number
* **[Barge-In Action](/sdk/actions#barge-in-action)** - Manually interrupt playback
## No Response
Return `null` or `undefined` when no response is needed:
```typescript
onAssistantSpeak: async (event) => {
  // Track metrics, no response needed
  trackMetrics(event);
  return null;
}
```
## Type Safety
The SDK provides TypeScript types for all responses:
```typescript
import type {
  InvocationResponseType,
  AiFlowAction
} from "@sipgate/ai-flow-sdk";

// InvocationResponseType is a union of:
// string | AiFlowAction | null | undefined

onUserSpeak: async (event): Promise<InvocationResponseType> => {
  // You can return any of these types
  return "Hello"; // string
  // or
  return { type: "speak", ... }; // AiFlowAction
  // or
  return null; // null/undefined
}
```
## Examples
### Conditional Response
```typescript
onUserSpeak: async (event) => {
  if (event.text.toLowerCase().includes("goodbye")) {
    return {
      type: "hangup",
      session_id: event.session.id,
    };
  }
  return "How can I help you?";
}
```
```
### Multiple Actions
You can return an array of actions to execute them in sequence:
```typescript
onUserSpeak: async (event) => {
  return [
    {
      type: "barge_in",
      session_id: event.session.id,
    },
    {
      type: "speak",
      session_id: event.session.id,
      text: "Sorry, let me correct that.",
    },
  ];
}
```
```
Actions in the array are executed one after another in order.
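Putting the response types together, a uniform normalization step might look like this — a sketch under the assumption (stated above) that strings become `speak` actions and single actions behave like one-element arrays:

```typescript
// Sketch: flatten any allowed return value into an ordered list of actions.
type Action = { type: string; session_id: string; [key: string]: unknown };

function toActionList(
  response: string | Action | Action[] | null | undefined,
  sessionId: string
): Action[] {
  if (response === null || response === undefined) return []; // no response
  if (typeof response === "string") {
    return [{ type: "speak", session_id: sessionId, text: response }];
  }
  // Preserve the order: actions run one after another.
  return Array.isArray(response) ? response : [response];
}
```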
Alternatively, you can chain actions across events using the `onAssistantSpeak` event:
```typescript
const sessionState = new Map();
onUserSpeak: async (event) => {
// Store what we want to do next
sessionState.set(event.session.id, "play_audio");
return "Please listen to this message.";
},
onAssistantSpeak: async (event) => {
const nextAction = sessionState.get(event.session.id);
if (nextAction === "play_audio") {
sessionState.delete(event.session.id);
return {
type: "audio",
session_id: event.session.id,
audio: base64AudioData,
};
}
return null;
}
```
## Next Steps
* **[Action Types](/sdk/actions)** - Complete reference for all actions
* **[API Reference](/sdk/api-reference)** - Full API documentation
---
---
url: /sipgate-ai-flow-api/sdk/api-reference.md
---
# API Reference
Complete API documentation for the `AiFlowAssistant` class.
## AiFlowAssistant
The main class for creating AI voice assistants.
### `AiFlowAssistant.create(options)`
Creates a new assistant instance.
**Options:**
```typescript
interface AiFlowAssistantOptions {
  // Bearer token for outbound call API requests
  token?: string;

  // Base URL of the sipgate API (default: "https://api.sipgate.com")
  baseUrl?: string;

  // Enable debug logging
  debug?: boolean;

  // Event handlers
  onSessionStart?: (
    event: AiFlowEventSessionStart
  ) => Promise<InvocationResponseType>;
  onUserSpeak?: (
    event: AiFlowEventUserSpeak
  ) => Promise<InvocationResponseType>;
  onAssistantSpeak?: (
    event: AiFlowEventAssistantSpeak
  ) => Promise<InvocationResponseType>;
  onAssistantSpeechEnded?: (
    event: AiFlowEventAssistantSpeechEnded
  ) => Promise<InvocationResponseType>;
  onUserInputTimeout?: (
    event: AiFlowEventUserInputTimeout
  ) => Promise<InvocationResponseType>;
  onSessionEnd?: (
    event: AiFlowEventSessionEnd
  ) => Promise<InvocationResponseType>;

  // DEPRECATED: Use onUserSpeak instead
  onUserBargeIn?: (
    event: AiFlowEventUserBargeIn
  ) => Promise<InvocationResponseType>;
}

type InvocationResponseType = AiFlowAction | string | null | undefined;
```
**Example:**
```typescript
const assistant = AiFlowAssistant.create({
  debug: true,
  token: process.env.API_KEY,
  onSessionStart: async (event) => {
    return "Welcome!";
  },
  onUserSpeak: async (event) => {
    return "Hello!";
  },
});
```
```
### Instance Methods
#### `assistant.express()`
Returns an Express.js middleware function for handling webhook requests.
```typescript
app.post("/webhook", assistant.express());
```
**Usage:**
```typescript
import express from "express";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const app = express();
app.use(express.json());

const assistant = AiFlowAssistant.create({
  onUserSpeak: async (event) => {
    return "Hello!";
  },
});

app.post("/webhook", assistant.express());
```
#### `assistant.ws(websocket)`
Returns a WebSocket message handler.
```typescript
wss.on("connection", (ws) => {
  ws.on("message", assistant.ws(ws));
});
```
**Usage:**
```typescript
import WebSocket from "ws";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const wss = new WebSocket.Server({ port: 8080 });

const assistant = AiFlowAssistant.create({
  onUserSpeak: async (event) => {
    return "Hello!";
  },
});

wss.on("connection", (ws) => {
  ws.on("message", assistant.ws(ws));
});
```
#### `assistant.call(params)`
Initiates an outbound call. Requires `token` to be set in options.
```typescript
await assistant.call({
  aiFlowId: string;      // ID of the AI flow
  billingDevice: string; // Billing device suffix (provided during onboarding)
  toPhoneNumber: string; // Target number in E.164 format
});
```
Returns a `Promise`. Throws on API errors (e.g. flow not found, missing phone number configuration).
See **[Outbound Calls](/sdk/outbound-calls)** for a full guide.
#### `assistant.onEvent(event)`
Manually process an event (useful for custom integrations).
```typescript
const action = await assistant.onEvent(event);
```
**Usage:**
```typescript
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const assistant = AiFlowAssistant.create({
  onUserSpeak: async (event) => {
    return "Hello!";
  },
});

// Custom integration
app.post("/custom-webhook", async (req, res) => {
  const event = req.body;
  const action = await assistant.onEvent(event);

  if (action) {
    res.json(action);
  } else {
    res.status(204).send();
  }
});
```
## Options Reference
### `token?: string`
Bearer token for authenticating outbound call API requests. Required when using `assistant.call()`.
### `baseUrl?: string`
Base URL of the sipgate API. Defaults to `"https://api.sipgate.com"`. Override for custom environments.
### `debug?: boolean`
Enable debug logging. When `true`, the SDK will log all events and actions to the console.
```typescript
const assistant = AiFlowAssistant.create({
  debug: true, // Logs all events and actions
  // ...
});
```
### Event Handlers
All event handlers are optional and follow the same pattern:
```typescript
onEventName?: (event: EventType) => Promise<InvocationResponseType>
```
See the [Event Types](/sdk/events) documentation for details on each event.
## Type Definitions
### `InvocationResponseType`
The return type for all event handlers:
```typescript
type InvocationResponseType =
  | AiFlowAction // Action object
  | string       // Simple string (converted to speak action)
  | null         // No response
  | undefined;   // No response
```
## Error Handling
The SDK handles errors gracefully. If an event handler throws an error, it will be logged and the SDK will continue processing other events.
```typescript
const assistant = AiFlowAssistant.create({
  onUserSpeak: async (event) => {
    try {
      return await processUserInput(event.text);
    } catch (error) {
      console.error("Error:", error);
      return "I'm sorry, I encountered an error.";
    }
  },
});
```
## Next Steps
* **[Event Types](/sdk/events)** - Complete event reference
* **[Action Types](/sdk/actions)** - All available actions
* **[Integration Guides](/sdk/integrations/express)** - Integration examples
---
---
url: /sipgate-ai-flow-api/sdk/events.md
---
# Event Types
Complete reference for all events in the SDK.
## Overview
Events are triggered by the AI Flow service and handled by your assistant. All events include session information and are typed with TypeScript.
## Base Event Structure
All events extend a base structure with session information:
```typescript
interface SessionInfo {
  id: string;           // UUID of the session
  account_id: string;   // Account identifier
  phone_number: string; // Phone number for this flow session
  direction?: "inbound" | "outbound";
  from_phone_number: string;
  to_phone_number: string;
}
```
```
## Event Types
### SessionStart Event
Triggered when a new call session begins.
```typescript
interface AiFlowEventSessionStart {
  type: "session_start";
  session: {
    id: string;           // UUID of the session
    account_id: string;   // Account identifier
    phone_number: string; // Phone number for this flow session
    direction?: "inbound" | "outbound"; // Call direction
    from_phone_number: string; // Phone number of the caller
    to_phone_number: string;   // Phone number of the callee
  };
}
```
```
**Example:**
```typescript
onSessionStart: async (event) => {
  // Log session details
  console.log(
    `${event.session.direction} call from ${event.session.from_phone_number} to ${event.session.to_phone_number}`
  );

  // Return greeting
  return "Welcome to our service!";
};
```
### UserSpeechStarted Event
Triggered when the user's speech is first detected, before the full transcript is available. Uses Voice Activity Detection (VAD) and fires 20–120 ms after the user starts speaking.
> **WebSocket only** — this event is not delivered to HTTP webhook handlers.
```typescript
interface AiFlowEventUserSpeechStarted {
  type: "user_speech_started";
  session: SessionInfo;
}
```
**Notes:**
* Fires at most once per speech turn; resets after `user_speak` is received
* No return value is expected; returning an action has no effect
**Example:**
```typescript
onUserSpeechStarted: async (event) => {
  console.log('User started speaking, session:', event.session.id);
  // No return value needed
},
```
### UserSpeak Event
Triggered when the user speaks and speech-to-text completes.
```typescript
interface AiFlowEventUserSpeak {
  type: "user_speak";
  text: string; // Recognized speech text
  session: SessionInfo;
}
```
```
**Example:**
```typescript
onUserSpeak: async (event) => {
  const intent = analyzeIntent(event.text);

  if (intent === "help") {
    return "I can help you with billing, support, or sales.";
  }

  return processUserInput(event.text);
};
```
```
### AssistantSpeak Event
Triggered after the assistant starts speaking. Event may be omitted for some text-to-speech models.
```typescript
interface AiFlowEventAssistantSpeak {
  type: "assistant_speak";
  text?: string;             // Text that was spoken
  ssml?: string;             // SSML that was used (if applicable)
  duration_ms: number;       // Duration of speech in milliseconds
  speech_started_at: number; // Unix timestamp (ms) when speech started
  session: SessionInfo;
}
```
**Example:**
```typescript
onAssistantSpeak: async (event) => {
  console.log(`Spoke for ${event.duration_ms}ms`);

  // Track conversation metrics
  trackMetrics({
    sessionId: event.session.id,
    duration: event.duration_ms,
    text: event.text,
  });
};
```
### AssistantSpeechEnded Event
Triggered after the assistant finishes speaking.
```typescript
interface AiFlowEventAssistantSpeechEnded {
  type: "assistant_speech_ended";
  session: SessionInfo;
}
```
**Example:**
```typescript
onAssistantSpeechEnded: async (event) => {
  console.log(`Finished speaking for session ${event.session.id}`);

  // Trigger next action if needed
  await triggerNextAction(event.session.id);
};
```
### UserInputTimeout Event
Triggered when no user speech is detected within the configured timeout period after the assistant finishes speaking.
```typescript
interface AiFlowEventUserInputTimeout {
  type: "user_input_timeout";
  session: SessionInfo;
}
```
**When Triggered:**
1. A `speak` action includes a `user_input_timeout_seconds` field
2. The assistant finishes speaking (`assistant_speech_ended` event fires)
3. The specified timeout period elapses without any user speech detected
**Example:**
```typescript
onUserInputTimeout: async (event) => {
  console.log(`No user input received for session ${event.session.id}`);

  // Retry the question
  return {
    type: "speak",
    session_id: event.session.id,
    text: "Are you still there? Please say yes or no.",
    user_input_timeout_seconds: 5
  };
};
```
```
**Configuring Timeout:**
Set `user_input_timeout_seconds` in the speak action:
```typescript
onSessionStart: async (event) => {
  return {
    type: "speak",
    session_id: event.session.id,
    text: "What is your account number?",
    user_input_timeout_seconds: 5 // Wait 5 seconds for response
  };
};
```
**Common Use Cases:**
```typescript
// Hangup after multiple timeouts
const timeoutCounts = new Map();

onUserInputTimeout: async (event) => {
  const sessionId = event.session.id;
  const count = (timeoutCounts.get(sessionId) || 0) + 1;
  timeoutCounts.set(sessionId, count);

  if (count >= 3) {
    return {
      type: "hangup",
      session_id: sessionId
    };
  }

  return {
    type: "speak",
    session_id: sessionId,
    text: `I didn't hear anything. Please respond. Attempt ${count} of 3.`,
    user_input_timeout_seconds: 5
  };
};

// Transfer to agent after timeout
onUserInputTimeout: async (event) => {
  return {
    type: "speak",
    session_id: event.session.id,
    text: "Let me connect you with a live agent who can help you."
    // Follow with transfer action
  };
};
```
### DtmfReceived Event
Triggered when the user presses a key on their phone keypad.
```typescript
interface AiFlowEventDtmfReceived {
  type: "dtmf_received";
  digit: string; // The key pressed: "0"–"9", "*", or "#"
  session: SessionInfo;
}
```
**Example:**
```typescript
onDtmfReceived: async (event) => {
  console.log(`User pressed: ${event.digit}`);
  if (event.digit === "1") {
    return {
      type: "transfer",
      session_id: event.session.id,
      target_phone_number: "49211100200",
      caller_id_name: "Support",
      caller_id_number: "49211100200"
    };
  }
  return {
    type: "speak",
    session_id: event.session.id,
    text: `You pressed ${event.digit}.`
  };
};
```
**Notes:**
* All standard DTMF tones are supported: `0`–`9`, `*`, `#`
* Each key press triggers a separate event
* DTMF events can occur at any point during the call
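Because each key press arrives as its own event, multi-digit input (such as an account number) has to be accumulated by your application. A minimal sketch, assuming `#` is used as the end-of-entry key (the helper and the terminator choice are illustrative, not part of the API):

```typescript
// Per-session digit buffers; "#" is treated as the (hypothetical) end-of-entry key
const digitBuffers = new Map<string, string>();

/** Returns the collected digits once "#" is pressed, otherwise null while collecting. */
function collectDigit(sessionId: string, digit: string): string | null {
  if (digit === "#") {
    const entry = digitBuffers.get(sessionId) ?? "";
    digitBuffers.delete(sessionId); // entry complete, reset the buffer
    return entry;
  }
  digitBuffers.set(sessionId, (digitBuffers.get(sessionId) ?? "") + digit);
  return null;
}
```

Inside `onDtmfReceived`, call `collectDigit(event.session.id, event.digit)` and only return a `speak` action once it yields a complete entry.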
### SessionEnd Event
Triggered when the call session ends.
```typescript
interface AiFlowEventSessionEnd {
  type: "session_end";
  session: SessionInfo;
}
```
**Example:**
```typescript
onSessionEnd: async (event) => {
  // Save conversation history
  await saveConversation(event.session.id);
  // Send analytics
  await trackSessionEnd(event.session);
};
```
### Barge-In Detection
User interruptions are detected via the `barged_in` flag in `user_speak` events:
```typescript
interface AiFlowEventUserSpeak {
  type: "user_speak";
  text: string;
  barged_in?: boolean; // true if user interrupted
  session: SessionInfo;
}
```
**Example:**
```typescript
onUserBargeIn: async (event) => {
  // Called automatically when event.barged_in === true
  console.log(`User interrupted with: ${event.text}`);
  return "I'm listening, please continue.";
};
```
## Event Flow
```
┌─────────────────┐
│ session_start │──> Respond with speak/audio or do nothing
└─────────────────┘
┌─────────────────┐
│ user_speak │──> Respond with speak/audio/transfer/hangup
│ (barged_in?) │ Check barged_in flag for interruptions
└─────────────────┘
┌─────────────────┐
│ assistant_speak │──> Optional: track metrics, trigger next action
└─────────────────┘
┌─────────────────┐
│ session_end │──> Cleanup only, no actions accepted
└─────────────────┘
```
## Event Summary Table
| Event Type | Transport | Description | When Triggered | Can Return Action? |
| ----------------------- | ------------------ | --------------------------- | ------------------------------------------ | ------------------- |
| `session_start` | HTTP + WebSocket | Call session begins | When a new call is initiated | ✅ Yes |
| `user_speech_started` | **WebSocket only** | Speech onset detected | When VAD detects the user starting to speak | ❌ No |
| `user_speak` | HTTP + WebSocket | User speech detected | After speech-to-text completes (includes `barged_in` flag) | ✅ Yes |
| `dtmf_received` | HTTP + WebSocket | DTMF digit pressed | When the user presses a phone key | ✅ Yes |
| `assistant_speak` | HTTP + WebSocket | Assistant spoke a reply | When the assistant's reply is spoken | ✅ Yes |
| `assistant_speech_ended`| HTTP + WebSocket | Assistant finished speaking | After speech playback ends | ✅ Yes |
| `user_input_timeout` | HTTP + WebSocket | User input timeout reached | When no speech detected after timeout | ✅ Yes |
| `session_end` | HTTP + WebSocket | Call session ends | When the call terminates | ❌ No |
## Type Safety
All events are fully typed. Import types from the SDK:
```typescript
import type {
  AiFlowEventSessionStart,
  AiFlowEventUserSpeechStarted,
  AiFlowEventUserSpeak,
  AiFlowEventDtmfReceived,
  AiFlowEventAssistantSpeak,
  AiFlowEventAssistantSpeechEnded,
  AiFlowEventUserInputTimeout,
  AiFlowEventSessionEnd,
  AiFlowEventUserBargeIn,
} from "@sipgate/ai-flow-sdk";

onSessionStart: async (event: AiFlowEventSessionStart) => {
  // event is fully typed
  const sessionId: string = event.session.id;
  // ...
};
```
## Next Steps
* **[Action Types](/sdk/actions)** - Learn how to respond to events
* **[API Reference](/sdk/api-reference)** - Complete API documentation
---
---
url: /sipgate-ai-flow-api/sdk/actions.md
---
# Action Types
Complete reference for all actions you can return from event handlers.
## Overview
Actions are responses that tell the AI Flow service what to do next. All actions require a `session_id` and `type` field.
## Base Action Structure
```typescript
interface BaseAction {
  session_id: string; // UUID from the event's session.id
  type: string; // Action type identifier
}
```
## Action Summary
| Action Type | Description | Primary Use Case |
| -------------- | --------------------------- | --------------------------------------- |
| `speak` | Speak text or SSML | Respond to user with synthesized speech |
| `audio` | Play pre-recorded audio | Play hold music, pre-recorded messages |
| `mix_audio` | Loop a background sound mixed into speech | Add ambient noise (café, office, train station) under the agent |
| `hangup` | End the call | Terminate conversation |
| `transfer` | Transfer to another number | Route to human agent or department |
| `barge_in` | Manually interrupt playback | Stop current audio immediately |
| `configure_transcription` | Change STT language(s) mid-call | Switch recognition language without hanging up |
## Speak Action
Speaks text or SSML to the user.
```typescript
interface AiFlowActionSpeak {
  type: "speak";
  session_id: string;
  // Either text OR ssml (not both)
  text?: string; // Plain text to speak
  ssml?: string; // SSML markup for advanced control
  // Optional configurations
  tts?: TtsConfig; // TTS provider settings
  barge_in?: BargeInConfig; // Barge-in behavior
  user_input_timeout_seconds?: number; // Wait this long for the caller to start
  vad?: VadConfig; // Tune end-of-turn silence (advanced — see /sdk/vad)
}
```
**Examples:**
```typescript
// Simple text
return {
  type: "speak",
  session_id: event.session.id,
  text: "Hello, how can I help you?",
};

// With SSML
return {
  type: "speak",
  session_id: event.session.id,
  ssml: `<speak>
    <emphasis level="strong">Please listen carefully.</emphasis>
    <break time="500ms"/>
    Your account balance is $42.50.
  </speak>`,
};

// With custom TTS provider
return {
  type: "speak",
  session_id: event.session.id,
  text: "Hello in a different voice",
  tts: {
    provider: "azure",
    language: "en-US",
    voice: "en-US-JennyNeural",
  },
};
```
## Audio Action
Plays pre-recorded audio to the user.
```typescript
interface AiFlowActionAudio {
  type: "audio";
  session_id: string;
  audio: string; // Base64 encoded WAV (16kHz, mono, 16-bit)
  barge_in?: BargeInConfig;
}
```
**Example:**
```typescript
// Play hold music or a pre-recorded message
return {
  type: "audio",
  session_id: event.session.id,
  audio: base64EncodedWavData,
  barge_in: {
    strategy: "minimum_characters",
    minimum_characters: 3,
  },
};
```
**Audio Format Requirements:**
* **Format**: WAV
* **Sample Rate**: 16kHz
* **Channels**: Mono
* **Bit Depth**: 16-bit PCM
* **Encoding**: Base64
## Mix Audio Action
Play a looping background sound (e.g. train station, café, office) under the call. The loop plays continuously for the rest of the session — both during the assistant's TTS turns and during silences. Sending `mix_audio` again replaces the active loop; sending with `stop: true` removes it. The loop is dropped automatically when the session ends.
```typescript
interface AiFlowActionMixAudio {
  type: "mix_audio";
  session_id: string;
  /** Base64-encoded WAV (16 kHz, mono, 16-bit PCM). Required unless stop=true. */
  audio?: string;
  /** Mix volume for the background loop, 0.0–1.0. Defaults to 0.5. */
  volume?: number;
  /** When true, removes the active background loop. */
  stop?: boolean;
}
```
**Example — start an ambient loop alongside the greeting:**
```typescript
import { readFileSync } from "node:fs";

// Load and base64-encode the loop once at startup
const AMBIENT_AUDIO = readFileSync("./cafe.wav").toString("base64");

onSessionStart: async (event) => {
  return [
    {
      type: "mix_audio",
      session_id: event.session.id,
      audio: AMBIENT_AUDIO,
      volume: 0.3,
    },
    {
      type: "speak",
      session_id: event.session.id,
      text: "Welcome, how can I help you?",
    },
  ];
};
```
**Example — stop the ambient before hanging up:**
```typescript
onUserSpeak: async (event) => {
  if (event.text.toLowerCase().includes("goodbye")) {
    return [
      { type: "mix_audio", session_id: event.session.id, stop: true },
      { type: "speak", session_id: event.session.id, text: "Goodbye!" },
      { type: "hangup", session_id: event.session.id },
    ];
  }
};
```
**Audio Format Requirements:** identical to the `audio` action — WAV, 16 kHz, mono, 16-bit PCM, base64-encoded. Same FFmpeg conversion command applies.
**Best practice — keep ambient quiet.** Background loops should sit *under* the agent's voice. Start around `volume: 0.3` and adjust from there. Loudness-normalize source files to about `-30 LUFS` so different presets stay comparable at a given volume value.
## Hangup Action
Ends the call.
```typescript
interface AiFlowActionHangup {
  type: "hangup";
  session_id: string;
}
```
**Example:**
```typescript
onUserSpeak: async (event) => {
  if (event.text.toLowerCase().includes("goodbye")) {
    return {
      type: "hangup",
      session_id: event.session.id,
    };
  }
};
```
## Transfer Action
Transfers the call to another phone number. Pass an optional `timeout` to
enable **transfer fallback** — if the target doesn't pick up (or rejects /
hangs up), the service re-emits `session_start` with the same `session.id`
so the agent can handle the call again.
```typescript
interface AiFlowActionTransfer {
  type: "transfer";
  session_id: string;
  target_phone_number: string; // E.164 format, without the leading "+" (recommended)
  caller_id_name: string;
  caller_id_number: string;
  /** Optional transfer timeout in seconds (5–120). Enables transfer fallback. */
  timeout?: number;
}
```
**Example:**
```typescript
// Transfer to sales department — fall back to the agent after 30s of no answer
return {
  type: "transfer",
  session_id: event.session.id,
  target_phone_number: "1234567890",
  caller_id_name: "Sales Department",
  caller_id_number: "1234567890",
  timeout: 30,
};
```
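With `timeout` set, a failed transfer re-emits `session_start` for the same session, so `onSessionStart` fires a second time. One way to tell the two cases apart is to remember which sessions you transferred (the set and the helper are illustrative, not part of the SDK):

```typescript
// Sessions we have attempted to transfer; a second session_start for one of
// these IDs means the transfer fell back to the agent.
const transferredSessions = new Set<string>();

/** Returns true exactly once, when a session_start is a transfer fallback. */
function isTransferFallback(sessionId: string): boolean {
  if (transferredSessions.has(sessionId)) {
    transferredSessions.delete(sessionId);
    return true;
  }
  return false;
}
```

Add `transferredSessions.add(event.session.id)` when returning the transfer action, then branch on `isTransferFallback(event.session.id)` in `onSessionStart` to apologize ("no one is available") instead of replaying the greeting.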
## Barge-In Action
Manually triggers barge-in (interrupts current playback).
```typescript
interface AiFlowActionBargeIn {
  type: "barge_in";
  session_id: string;
}
```
**Example:**
```typescript
// Manually interrupt current playback
return {
  type: "barge_in",
  session_id: event.session.id,
};
```
## Configure Transcription Action
Change the STT (Speech-to-Text) provider and/or recognition language(s) during an active call session without hanging up.
```typescript
import { TranscriptionProvider } from "@sipgate/ai-flow-sdk";

interface AiFlowActionConfigureTranscription {
  type: "configure_transcription";
  session_id: string;
  provider?: TranscriptionProvider; // "AZURE" | "DEEPGRAM" | "ELEVEN_LABS" — omit to keep current
  languages?: string[]; // BCP-47 codes, 1-4 entries — omit to reset to provider default
  custom_vocabulary?: string[]; // Words/phrases to boost STT recognition
  vad?: VadConfig; // Session-wide VAD tuning — see /sdk/vad
}
```
At least one of `provider`, `languages`, `custom_vocabulary`, or `vad` should be provided; sending none is a no-op.
All fields use **full replace** semantics — they never merge with existing settings.
**Examples:**
```typescript
// Switch to German
return {
  type: "configure_transcription",
  session_id: event.session.id,
  languages: ["de-DE"],
};

// Multi-language detection (German + English)
return {
  type: "configure_transcription",
  session_id: event.session.id,
  languages: ["de-DE", "en-US"],
};

// Switch STT provider to Deepgram
return {
  type: "configure_transcription",
  session_id: event.session.id,
  provider: "DEEPGRAM",
};

// Switch provider AND language in one step
return {
  type: "configure_transcription",
  session_id: event.session.id,
  provider: "DEEPGRAM",
  languages: ["en-US"],
};

// Reset to provider default (automatic detection)
return {
  type: "configure_transcription",
  session_id: event.session.id,
};
```
**Audio gap during restart:** Any change requires the transcription engine to restart. Audio during the restart (~100–500 ms for language-only change, ~200–800 ms for provider switch) is dropped.
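Since audio dropped during the restart can swallow the start of the caller's next utterance, one option is to cover the gap with a short filler utterance in the same response. A sketch returning an action array, mirroring the `mix_audio` examples (the builder function and wording are illustrative):

```typescript
// Speak a brief filler, then switch providers while it plays out
function buildProviderSwitch(sessionId: string) {
  return [
    { type: "speak", session_id: sessionId, text: "One moment, please." },
    { type: "configure_transcription", session_id: sessionId, provider: "DEEPGRAM" },
  ];
}
```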
**Multi-language support depends on the active STT provider:**
* **Azure**: up to 4 languages, all used for simultaneous Language Identification (LID)
* **Deepgram**: multilingual auto-detection across all supplied languages
* **ElevenLabs**: single language only — only the **first** entry is used; additional entries are silently ignored
**Barge-in latency after provider switch** (for `immediate` strategy):
* **Azure**: ~20–80 ms
* **Deepgram**: ~20–100 ms
* **ElevenLabs**: ~30–120 ms
## Type Safety
All actions are fully typed. Import types from the SDK:
```typescript
import type {
  AiFlowAction,
  AiFlowActionSpeak,
  AiFlowActionAudio,
  AiFlowActionMixAudio,
  AiFlowActionHangup,
  AiFlowActionTransfer,
  AiFlowActionBargeIn,
  AiFlowActionConfigureTranscription,
} from "@sipgate/ai-flow-sdk";
import { TranscriptionProvider } from "@sipgate/ai-flow-sdk";

onUserSpeak: async (event) => {
  const action: AiFlowActionSpeak = {
    type: "speak",
    session_id: event.session.id,
    text: "Hello!",
  };
  return action;
};
```
## Next Steps
* **[TTS Providers](/sdk/tts-providers)** - Configure text-to-speech voices
* **[Barge-In Configuration](/sdk/barge-in)** - Control interruption behavior
* **[API Reference](/sdk/api-reference)** - Complete API documentation
---
---
url: /sipgate-ai-flow-api/sdk/tts-providers.md
---
# TTS Providers
Configure text-to-speech providers for different voices and languages.
## Overview
The SDK supports both Azure Cognitive Services and ElevenLabs for high-quality voice synthesis. You can configure TTS providers per action or use default settings.
## Azure Cognitive Services
Azure provides a wide range of neural voices across many languages and regions.
```typescript
interface TtsProviderConfigAzure {
  provider: "azure";
  language?: string; // BCP-47 format (e.g., "en-US", "de-DE")
  voice?: string; // Voice name (e.g., "en-US-JennyNeural")
}
```
**Examples:**
```typescript
// English (US) - Female
tts: {
  provider: "azure",
  language: "en-US",
  voice: "en-US-JennyNeural"
}

// English (GB) - Female
tts: {
  provider: "azure",
  language: "en-GB",
  voice: "en-GB-SoniaNeural"
}

// German - Male
tts: {
  provider: "azure",
  language: "de-DE",
  voice: "de-DE-ConradNeural"
}

// Spanish - Female
tts: {
  provider: "azure",
  language: "es-ES",
  voice: "es-ES-ElviraNeural"
}
```
### Popular Azure Voices
| Language | Voice Name | Gender | Description |
| -------- | ------------------ | ------ | ---------------------- |
| en-US | en-US-JennyNeural | Female | Friendly, professional |
| en-US | en-US-GuyNeural | Male | Clear, neutral |
| en-GB | en-GB-SoniaNeural | Female | British, professional |
| en-GB | en-GB-RyanNeural | Male | British, friendly |
| de-DE | de-DE-KatjaNeural | Female | Professional, clear |
| de-DE | de-DE-ConradNeural | Male | Deep, authoritative |
**Full Voice List:** See [Azure TTS documentation](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support) for complete list of 400+ voices in 140+ languages.
## ElevenLabs
ElevenLabs provides ultra-realistic AI voices optimized for conversational use cases.
```typescript
interface TtsProviderConfigElevenLabs {
  provider: "eleven_labs";
  voice?: string; // Voice ID (e.g., "21m00Tcm4TlvDq8ikWAM") - optional, uses default if omitted
}
```
**Example:**
```typescript
// With specific voice
tts: {
  provider: "eleven_labs",
  voice: "21m00Tcm4TlvDq8ikWAM" // Rachel
}

// With default voice
tts: {
  provider: "eleven_labs"
}
```
### Available ElevenLabs Voices
| Voice Name | ID | Description | Verified Locales |
| ----------- | -------------------- | ------------------------------------------------------------------------- | ---------------------------------- |
| **sipgate** | dSu12TX3MEDQXAarG4s6 | Clean male voice used by sipgate for system announcements (default). | de-DE |
| **Rachel** | 21m00Tcm4TlvDq8ikWAM | Matter-of-fact, personable woman. Great for conversational use cases. | en-US |
| **Sarah** | EXAVITQu4vr4xnSDxMaL | Young adult woman with a confident and warm, mature quality. | en-US, fr-FR, cmn-CN, hi-IN |
| **Laura** | FGY2WhTYpPnrIDTdsKH5 | Young adult female delivers sunny enthusiasm with quirky attitude. | en-US, fr-FR, cmn-CN, de-DE |
| **George** | JBFqnCBsd6RMkjVDRZzb | Warm resonance that instantly captivates listeners. | en-GB, fr-FR, ja-JP, cs-CZ |
| **Thomas** | GBv7mTt0atIp3Br8iCZE | Soft and subdued male, optimal for narrations or meditations. | en-US |
| **Roger** | CwhRBWXzGAHq8TQ4Fs17 | Easy going and perfect for casual conversations. | en-US, fr-FR, de-DE, nl-NL |
| **Eric** | cjVigY5qzO86Huf0OWal | Smooth tenor pitch from a man in his 40s - perfect for agentic use cases. | en-US, fr-FR, de-DE, sk-SK |
| **Brian** | nPczCjzI2devNBz1zQrb | Middle-aged man with resonant and comforting tone. | en-US, cmn-CN, de-DE, nl-NL |
| **Jessica** | cgSgspJ2msm6clMCkdW9 | Young and playful American female, perfect for trendy content. | en-US, fr-FR, ja-JP, cmn-CN, de-DE |
| **Liam** | TX3LPaxmHKxFdv7VOQHJ | Young adult with energy and warmth - suitable for reels and shorts. | en-US, de-DE, cs-CZ, pl-PL, tr-TR |
| **Alice** | Xb7hH8MSUJpSbSDYk0k2 | Clear and engaging, friendly British woman suitable for e-learning. | en-GB, it-IT, fr-FR, ja-JP, pl-PL |
| **Daniel** | onwK4e9ZLuTAKqWW03F9 | Strong voice perfect for professional broadcast or news. | en-GB, de-DE, tr-TR |
| **Lily** | pFZP5JQG7iQjIQuC4Bku | Velvety British female delivers news with warmth and clarity. | it-IT, de-DE, cmn-CN, cs-CZ, nl-NL |
| **River** | SAz9YHcvj6GT2YYXdXww | Relaxed, neutral voice ready for narrations or conversational projects. | en-US, it-IT, fr-FR, cmn-CN |
| **Charlie** | IKne3meq5aSn9XLyUdCD | Young Australian male with confident and energetic voice. | en-AU, cmn-CN, fil-PH |
| **Aria** | 9BWtsMINqrJLrRacOk9x | Middle-aged female with African-American accent. Calm with hint of rasp. | en-US, fr-FR, cmn-CN, tr-TR |
| **Matilda** | XrExE9yKIg1WjnnlVkGX | Professional woman with pleasing alto pitch. Suitable for many use cases. | en-US, it-IT, fr-FR, de-DE |
| **Will** | bIHbv24MWmeRgasZH58o | Conversational and laid back. | en-US, fr-FR, de-DE, cmn-CN, cs-CZ |
| **Chris** | iP95p4xoKVk53GoZ742B | Natural and real, down-to-earth voice great across many use-cases. | en-US, fr-FR, sv-SE, hi-IN |
| **Bill** | pqHfZKP75CvOlQylNhV4 | Friendly and comforting voice ready to narrate stories. | en-US, fr-FR, cmn-CN, de-DE, cs-CZ |
**Note:** 50+ voices available in total. The full list with samples is in the [API reference](/api/tts-providers#available-voices). The SDK includes full TypeScript type definitions for all voice IDs and names.
## Choosing a TTS Provider
### Use Azure when:
* You need support for many languages (140+ languages available)
* You want consistent quality across all locales
* You need specific regional accents or dialects
* Budget is a primary concern
### Use ElevenLabs when:
* You need the most natural, human-like voices
* Conversational quality is critical (phone calls, virtual assistants)
* You're primarily working with English or common European languages
* You want voices with distinct personalities
## Usage Examples
### Per-Action Configuration
```typescript
onUserSpeak: async (event) => {
  return {
    type: "speak",
    session_id: event.session.id,
    text: "Hello in a different voice",
    tts: {
      provider: "azure",
      language: "en-US",
      voice: "en-US-JennyNeural",
    },
  };
}
```
### Using ElevenLabs
```typescript
onUserSpeak: async (event) => {
  return {
    type: "speak",
    session_id: event.session.id,
    text: "Hello from ElevenLabs!",
    tts: {
      provider: "eleven_labs",
      voice: "21m00Tcm4TlvDq8ikWAM", // Rachel
    },
  };
}
```
## Next Steps
* **[Barge-In Configuration](/sdk/barge-in)** - Control interruption behavior
* **[Action Types](/sdk/actions)** - Complete action reference
---
---
url: /sipgate-ai-flow-api/sdk/barge-in.md
---
# Barge-In Configuration
Control how users can interrupt the assistant while speaking.
## Overview
Barge-in allows users to interrupt the assistant's speech. You can configure barge-in behavior for each `speak` or `audio` action.
## Configuration
```typescript
interface BargeInConfig {
  strategy: "none" | "manual" | "minimum_characters" | "immediate";
  minimum_characters?: number; // Default: 3 (only for minimum_characters)
  allow_after_ms?: number; // Delay before allowing interruption
}
```
## Strategies
### `none`
Disables barge-in completely. Audio plays fully without interruption.
```typescript
barge_in: {
  strategy: "none"
}
```
**Use cases:**
* Critical information that must be heard
* Legal disclaimers
* Emergency instructions
**Example:**
```typescript
return {
  type: "speak",
  session_id: event.session.id,
  text: "This is important information. Please listen carefully.",
  barge_in: {
    strategy: "none",
  },
};
```
### `manual`
Allows manual barge-in via API only (no automatic detection).
```typescript
barge_in: {
  strategy: "manual"
}
```
**Use cases:**
* Custom interruption logic
* Button-triggered interruption
* External event-based interruption
**Example:**
```typescript
return {
  type: "speak",
  session_id: event.session.id,
  text: "Press a button to interrupt.",
  barge_in: {
    strategy: "manual",
  },
};
```
### `minimum_characters`
Automatically detects barge-in when user speech exceeds character threshold.
```typescript
barge_in: {
  strategy: "minimum_characters",
  minimum_characters: 5, // Trigger after 5 characters
  allow_after_ms: 500 // Wait 500ms before allowing interruption
}
```
**Use cases:**
* Natural conversation flow
* Customer service scenarios
* Interactive voice menus
**Example:**
```typescript
return {
  type: "speak",
  session_id: event.session.id,
  text: "How can I help you today?",
  barge_in: {
    strategy: "minimum_characters",
    minimum_characters: 3,
  },
};
```
### `immediate` ⚡ NEW
**Most responsive option** - Interrupts immediately when user starts speaking using Voice Activity Detection (VAD).
```typescript
barge_in: {
  strategy: "immediate",
  allow_after_ms: 500 // Optional: protect first 500ms
}
```
**How it works:**
* **Azure/Deepgram**: Uses Voice Activity Detection (VAD) - triggers before any text is recognized
* **ElevenLabs**: Uses first partial transcript
* **Latency**: 20-100ms (2-4x faster than `minimum_characters`)
* **No text required**: Interrupts on voice detection, not transcription
**Use cases:**
* High-priority conversations requiring instant responsiveness
* Natural dialogue where interruptions should feel seamless
* Customer service where quick response matters
* Urgent or time-sensitive interactions
**Example:**
```typescript
onUserSpeak: async (event) => {
  return {
    type: "speak",
    session_id: event.session.id,
    text: "I can help you with billing, support, or sales. What would you like?",
    barge_in: {
      strategy: "immediate",
      allow_after_ms: 500, // Protect first 500ms from accidental noise
    },
  };
}
```
**Comparison:**
| Strategy | Trigger | Latency | Use Case |
|----------|---------|---------|----------|
| `immediate` | Voice Activity (VAD) | 20-100ms | Most natural, instant response |
| `minimum_characters` | Text recognition | 50-200ms | Balanced reliability |
| `manual` | API call | N/A | Custom logic |
| `none` | Never | N/A | Critical info only |
**Best practices:**
* Use `allow_after_ms: 500-1000` to prevent accidental interruptions
* Test with real users to find optimal settings
* Consider background noise in your environment
### Protection Period
You can add a protection period to prevent interruption during critical parts of speech:
```typescript
return {
  type: "speak",
  session_id: event.session.id,
  text: "Your account number is 1234567890. Please write this down.",
  barge_in: {
    strategy: "minimum_characters",
    minimum_characters: 10, // Require substantial speech
    allow_after_ms: 2000, // Protect first 2 seconds
  },
};
```
## Configuration Options
### `minimum_characters`
The minimum number of characters the user must speak before barge-in is triggered.
* **Default**: `3`
* **Range**: `1` to `100`
* **Use**: Higher values require more speech before interruption
### `allow_after_ms`
Delay in milliseconds before barge-in is allowed. This creates a "protection period" at the start of speech.
* **Default**: `0` (immediate)
* **Range**: `0` to `10000` (10 seconds)
* **Use**: Prevent interruption during critical information
## Examples
### Natural Conversation
```typescript
onUserSpeak: async (event) => {
  return {
    type: "speak",
    session_id: event.session.id,
    text: "I can help you with billing, support, or sales. What would you like?",
    barge_in: {
      strategy: "minimum_characters",
      minimum_characters: 3,
    },
  };
}
```
### Critical Information
```typescript
onUserSpeak: async (event) => {
  return {
    type: "speak",
    session_id: event.session.id,
    text: "Your verification code is 1-2-3-4-5-6. Please write this down.",
    barge_in: {
      strategy: "none", // Don't allow interruption
    },
  };
}
```
### Protected Announcement
```typescript
onSessionStart: async (event) => {
  return {
    type: "speak",
    session_id: event.session.id,
    text: "Welcome! Your call may be recorded for quality assurance.",
    barge_in: {
      strategy: "minimum_characters",
      minimum_characters: 5,
      allow_after_ms: 3000, // Protect first 3 seconds
    },
  };
}
```
## Best Practices
1. **Use `none` sparingly** - Only for truly critical information
2. **Choose the right strategy**:
* `immediate` - For most natural, responsive conversations
* `minimum_characters` - For balance between responsiveness and reliability
* `manual` - For custom logic
* `none` - For critical announcements only
3. **Set protection periods** - Use `allow_after_ms: 500-1000` to prevent cutting off important intro
4. **Test with users** - Find the right balance for your use case
5. **Consider noise** - `immediate` may trigger on background noise; use `allow_after_ms` as buffer
## Related: VAD Configuration
Barge-in controls *whether the caller may interrupt the assistant while it is
speaking*. The related [VAD Configuration](/sdk/vad) controls *how long the
caller may pause before their turn is considered finished*. Both can be set on
the same `speak` action.
## Next Steps
* **[Action Types](/sdk/actions)** - Complete action reference
* **[VAD Configuration](/sdk/vad)** - Tune end-of-turn silence
* **[API Reference](/sdk/api-reference)** - Full API documentation
---
---
url: /sipgate-ai-flow-api/sdk/vad.md
---
# VAD (Voice Activity Detection) Configuration
Optional advanced setting that lets you tune how long the system waits in
silence before treating the caller's turn as finished. When omitted, the
system default applies.
::: warning Optional advanced setting
Only set `vad` when you have a concrete use case where the system's default
end-of-turn timing is too eager or too patient.
:::
## Type
```typescript
interface VadConfig {
  /**
   * Milliseconds of silence after the caller stops speaking before their turn
   * is considered finished. Recommended range 150–2000.
   * Lower values yield faster turn-taking; higher values tolerate longer pauses.
   */
  end_of_turn_silence_ms?: number;
}
```
## Where to set it
`VadConfig` is accepted on two action types:
* **`speak.vad`** — applies to the caller's reply that follows. Persists until
overridden.
* **`configure_transcription.vad`** — applies for the rest of the session.
## Lenient validation
If you send an out-of-range, non-integer, or otherwise invalid value, the field
is **silently ignored** — the system default takes over. Your action still runs
normally; only the bad VAD value is dropped. This avoids breaking call flows
over a typo.
## Example: tolerate long pauses (e.g. spelling)
```typescript
return {
  type: "speak",
  session_id: event.session.id,
  text: "Please spell your last name, letter by letter.",
  vad: {
    end_of_turn_silence_ms: 1500,
  },
};
```
## Example: snappy back-and-forth
```typescript
return {
  type: "speak",
  session_id: event.session.id,
  text: "Did you mean account number 1234?",
  vad: {
    end_of_turn_silence_ms: 250,
  },
};
```
## Example: set once for the whole session
```typescript
return {
  type: "configure_transcription",
  session_id: event.session.id,
  vad: {
    end_of_turn_silence_ms: 1000,
  },
};
```
## VAD vs Barge-In
`vad` and [`barge_in`](/sdk/barge-in) are related but distinct:
* **`vad`** governs *when the caller's turn is considered finished*.
* **`barge_in`** governs *whether and how the caller may interrupt the
assistant while it is speaking*.
Both can be set on the same `speak` action.
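For instance, a prompt that lets the caller interrupt immediately but still tolerates dictation pauses can set both on one action (the builder function is just a convenience for illustration):

```typescript
// Immediate barge-in plus a patient 1.2 s end-of-turn window
function buildDictationPrompt(sessionId: string) {
  return {
    type: "speak",
    session_id: sessionId,
    text: "Please read out your customer number.",
    barge_in: { strategy: "immediate", allow_after_ms: 500 },
    vad: { end_of_turn_silence_ms: 1200 },
  };
}
```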
## Next Steps
* **[Action Types](/sdk/actions)** - Complete action reference
* **[Barge-In Configuration](/sdk/barge-in)** - Control caller interruptions
* **[API Reference](/sdk/api-reference)** - Full API documentation
---
---
url: /sipgate-ai-flow-api/sdk/integrations/express.md
---
# Express.js Integration
Complete guide for integrating the SDK with Express.js.
## Basic Setup
The simplest way to use the SDK with Express.js:
```typescript
import express from "express";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const app = express();
app.use(express.json());

const assistant = AiFlowAssistant.create({
  onSessionStart: async (event) => {
    return "Welcome! How can I help you today?";
  },
  onUserSpeak: async (event) => {
    return processUserInput(event.text);
  },
  onSessionEnd: async (event) => {
    await cleanupSession(event.session.id);
  },
});

// Webhook endpoint
app.post("/webhook", assistant.express());

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`AI Flow assistant running on port ${PORT}`);
});
```
## Complete Example
Here's a complete example with error handling and logging:
```typescript
import express from "express";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";

const app = express();
app.use(express.json());

const assistant = AiFlowAssistant.create({
  debug: process.env.NODE_ENV !== "production",
  onSessionStart: async (event) => {
    console.log(`Session started: ${event.session.id}`);
    return "Welcome! How can I help you today?";
  },
  onUserSpeak: async (event) => {
    try {
      return await processUserInput(event.text);
    } catch (error) {
      console.error("Error processing input:", error);
      return "I'm sorry, I encountered an error. Please try again.";
    }
  },
  onSessionEnd: async (event) => {
    await cleanupSession(event.session.id);
  },
});

// Webhook endpoint
app.post("/webhook", assistant.express());

// Health check
app.get("/health", (req, res) => {
  res.json({ status: "ok" });
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`AI Flow assistant running on port ${PORT}`);
});
```
## Error Handling
The `express()` middleware automatically handles errors, but you can add custom error handling:
```typescript
app.post("/webhook", (req, res, next) => {
  assistant.express()(req, res).catch(next);
});

app.use((err, req, res, next) => {
  console.error("Error:", err);
  res.status(500).json({ error: "Internal server error" });
});
```
## Authentication
Add authentication middleware before the webhook:
```typescript
app.post("/webhook", authenticate, assistant.express());

function authenticate(req, res, next) {
  const apiKey = req.headers["x-api-key"];
  if (apiKey !== process.env.API_KEY) {
    return res.status(401).json({ error: "Unauthorized" });
  }
  next();
}
```
## Multiple Endpoints
You can use multiple assistants for different endpoints:
```typescript
const salesAssistant = AiFlowAssistant.create({
  onUserSpeak: async (event) => {
    return "Welcome to sales!";
  },
});

const supportAssistant = AiFlowAssistant.create({
  onUserSpeak: async (event) => {
    return "Welcome to support!";
  },
});

app.post("/webhook/sales", salesAssistant.express());
app.post("/webhook/support", supportAssistant.express());
## Next Steps
* **[WebSocket Integration](/sdk/integrations/websocket)** - WebSocket integration guide
* **[Examples](/sdk/examples)** - More integration examples
---
---
url: /sipgate-ai-flow-api/sdk/integrations/websocket.md
---
# WebSocket Integration
A complete guide to integrating the SDK over a WebSocket connection.
## Basic Setup
```typescript
import WebSocket from "ws";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";
const wss = new WebSocket.Server({
port: 8080,
perMessageDeflate: false,
});
const assistant = AiFlowAssistant.create({
onUserSpeak: async (event) => {
return "Hello from WebSocket!";
},
});
wss.on("connection", (ws, req) => {
console.log("New WebSocket connection");
ws.on("message", assistant.ws(ws));
ws.on("error", (error) => {
console.error("WebSocket error:", error);
});
ws.on("close", () => {
console.log("WebSocket connection closed");
});
});
console.log("WebSocket server listening on port 8080");
```
## Complete Example
Here's a complete example with error handling and connection management:
```typescript
import WebSocket from "ws";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";
const wss = new WebSocket.Server({
port: 8080,
perMessageDeflate: false,
});
const assistant = AiFlowAssistant.create({
debug: true,
onSessionStart: async (event) => {
console.log(`Session started: ${event.session.id}`);
return "Welcome!";
},
onUserSpeak: async (event) => {
return processUserInput(event.text);
},
onSessionEnd: async (event) => {
console.log(`Session ended: ${event.session.id}`);
},
});
wss.on("connection", (ws, req) => {
console.log("New WebSocket connection from", req.socket.remoteAddress);
  // Handle messages (create the handler once per connection, not per message)
  const handleMessage = assistant.ws(ws);
  ws.on("message", async (data) => {
    try {
      await handleMessage(data);
    } catch (error) {
      console.error("Error processing message:", error);
      ws.send(JSON.stringify({ error: "Internal server error" }));
    }
  });
// Error handling
ws.on("error", (error) => {
console.error("WebSocket error:", error);
});
// Connection cleanup
ws.on("close", (code, reason) => {
console.log(`Connection closed: ${code} - ${reason}`);
});
// Send welcome message
ws.send(JSON.stringify({ type: "connected" }));
});
console.log("WebSocket server listening on port 8080");
```
## Message Format
The SDK expects messages in JSON format:
```typescript
{
"type": "session_start",
"session": {
"id": "uuid",
"account_id": "account-id",
"phone_number": "1234567890",
// ...
}
}
```
## Custom Message Handling
You can handle messages manually:
```typescript
wss.on("connection", (ws) => {
ws.on("message", async (data) => {
try {
const event = JSON.parse(data.toString());
const action = await assistant.onEvent(event);
if (action) {
ws.send(JSON.stringify(action));
}
} catch (error) {
console.error("Error:", error);
}
});
});
```
## Connection Management
Track active connections:
```typescript
import { randomUUID } from "node:crypto";

const connections = new Map<string, WebSocket>();

wss.on("connection", (ws, req) => {
  const connectionId = randomUUID();
  connections.set(connectionId, ws);
  ws.on("close", () => {
    connections.delete(connectionId);
  });
});
```
## Next Steps
* **[Express.js Integration](/sdk/integrations/express)** - Express.js integration guide
* **[Examples](/sdk/examples)** - More integration examples
---
---
url: /sipgate-ai-flow-api/sdk/outbound-calls.md
---
# Outbound Calls
Initiate outbound calls directly from your assistant using `assistant.call()`.
::: warning Access Required
Outbound calls are **only available upon request** and after a positive review by sipgate support. Please contact support to request access before using this feature.
:::
## Setup
Pass `token` when creating the assistant. `baseUrl` is optional and defaults to `https://api.sipgate.com`.
```typescript
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";
const assistant = AiFlowAssistant.create({
token: process.env.SIPGATE_TOKEN,
onSessionStart: async (event) => {
if (event.session.direction === "outbound") {
return "Hello! This is an automated call. Do you have a moment?";
}
return "Hello! How can I help you today?";
},
onUserSpeak: async (event) => {
return processWithLLM(event.text);
},
});
```
## Initiating a Call
```typescript
await assistant.call({
aiFlowId: "e3670012-96a3-4ae5-ac42-87abe22015c3",
billingDevice: "e2", // provided by sipgate support during onboarding
toPhoneNumber: "4915790000687", // E.164 format without leading +
});
```
| Parameter | Type | Description |
|-----------------|--------|---------------------------------------------------|
| `aiFlowId` | string | ID of the AI flow to use for the call |
| `billingDevice` | string | Billing device suffix, provided during onboarding |
| `toPhoneNumber` | string | Target phone number in E.164 format without leading + |
`call()` resolves when the call has been successfully initiated (`201 Created`). It throws if the request fails.
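Because `call()` throws when the request fails, you may want to wrap it with error handling and an optional retry. A sketch under the assumption that the initiator is any async function that rejects on failure (e.g. `() => assistant.call({ ... })`):

```typescript
// Sketch: initiate a call with basic error handling and optional retries.
// `initiate` stands in for () => assistant.call({ ... }); any async
// function that rejects on failure works.
async function callWithRetry(
  initiate: () => Promise<void>,
  retries = 1,
): Promise<boolean> {
  for (let attempt = 0; ; attempt++) {
    try {
      await initiate();
      return true; // call was accepted (201 Created)
    } catch (error) {
      if (attempt >= retries) {
        console.error("Failed to initiate call:", error);
        return false;
      }
    }
  }
}
```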
## Handling the Session
Once the recipient answers, the normal event flow begins. Your existing handlers (`onSessionStart`, `onUserSpeak`, etc.) are called exactly as for inbound calls.
Check `event.session.direction` to distinguish outbound from inbound sessions:
```typescript
onSessionStart: async (event) => {
if (event.session.direction === "outbound") {
// Your assistant placed this call
return "Hi, I'm calling from Example Corp regarding your appointment.";
}
// Inbound call
return "Hello! How can I help you?";
},
```
The `direction` field is available on the `session` object of every event.
## Complete Example
```typescript
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";
import express from "express";
const assistant = AiFlowAssistant.create({
token: process.env.SIPGATE_TOKEN,
onSessionStart: async (event) => {
if (event.session.direction === "outbound") {
return "Hello! This is an automated reminder from Example Corp. Your appointment is tomorrow at 10am. Press 1 to confirm or say 'cancel' to cancel.";
}
return "Hello! How can I help you?";
},
onUserSpeak: async (event) => {
const text = event.text.toLowerCase();
if (text.includes("confirm") || text.includes("1")) {
return [
{ type: "speak", session_id: event.session.id, text: "Great, your appointment is confirmed. Goodbye!" },
{ type: "hangup", session_id: event.session.id },
];
}
if (text.includes("cancel")) {
await cancelAppointment(event.session.id);
return [
{ type: "speak", session_id: event.session.id, text: "Your appointment has been cancelled. Goodbye!" },
{ type: "hangup", session_id: event.session.id },
];
}
return "I didn't catch that. Say 'confirm' to confirm or 'cancel' to cancel your appointment.";
},
});
// Webhook server (receives events when calls connect)
const app = express();
app.use(express.json());
app.post("/webhook", assistant.express());
app.listen(3000);
// Initiate the call
await assistant.call({
aiFlowId: process.env.AI_FLOW_ID!,
billingDevice: "e2",
toPhoneNumber: "4915790000687",
});
```
## Next Steps
* **[API Reference](/sdk/api-reference)** — `call()` method and all options
* **[Outbound Calls (API)](/api/guides/outbound-calls)** — raw HTTP reference
* **[Event Types](/sdk/events)** — complete event reference
---
---
url: /sipgate-ai-flow-api/sdk/examples.md
---
# Examples
Real-world examples and use cases.
## Customer Service Bot
A complete customer service bot with state management and routing:
```typescript
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";
import express from "express";
// Session state management
const sessions = new Map();
const assistant = AiFlowAssistant.create({
debug: true,
onSessionStart: async (event) => {
// Initialize session state
sessions.set(event.session.id, {
state: "greeting",
data: { attempts: 0 },
});
return {
type: "speak",
session_id: event.session.id,
text: "Welcome to customer support. How can I help you today? You can ask about billing, technical support, or sales.",
barge_in: {
strategy: "minimum_characters",
minimum_characters: 3,
},
};
},
onUserSpeak: async (event) => {
const session = sessions.get(event.session.id);
if (!session) return null;
const text = event.text.toLowerCase();
// Intent routing
if (text.includes("billing") || text.includes("invoice")) {
return {
type: "transfer",
session_id: event.session.id,
target_phone_number: "1234567890",
caller_id_name: "Billing Department",
caller_id_number: "1234567890",
};
}
if (text.includes("goodbye") || text.includes("bye")) {
return {
type: "speak",
session_id: event.session.id,
text: "Thank you for calling. Have a great day!",
barge_in: { strategy: "none" }, // Don't allow interruption
};
}
if (text.includes("technical") || text.includes("support")) {
session.state = "technical_support";
return "I'll connect you with our technical support team. Please describe your issue.";
}
// Default response
session.data.attempts++;
if (session.data.attempts > 2) {
return "I'm having trouble understanding. Let me transfer you to a representative.";
}
return "I can help with billing, technical support, or sales. Which would you like?";
},
onUserBargeIn: async (event) => {
console.log(`User interrupted: ${event.text}`);
return "Yes, I'm listening.";
},
onSessionEnd: async (event) => {
// Cleanup session state
sessions.delete(event.session.id);
console.log(`Session ${event.session.id} ended`);
},
});
const app = express();
app.use(express.json());
app.post("/webhook", assistant.express());
app.listen(3000, () => {
console.log("Customer service bot running on port 3000");
});
```
## Multi-Language Support
Switch languages based on user preference:
```typescript
const sessions = new Map();
const assistant = AiFlowAssistant.create({
onSessionStart: async (event) => {
sessions.set(event.session.id, { language: "en" });
return "Welcome! Say 'deutsch' for German or 'english' for English.";
},
onUserSpeak: async (event) => {
const session = sessions.get(event.session.id);
if (!session) return null;
const text = event.text.toLowerCase();
if (text.includes("deutsch") || text.includes("german")) {
session.language = "de";
return {
type: "speak",
session_id: event.session.id,
text: "Willkommen! Wie kann ich Ihnen helfen?",
tts: {
provider: "azure",
language: "de-DE",
voice: "de-DE-KatjaNeural",
},
};
}
if (text.includes("english") || text.includes("englisch")) {
session.language = "en";
return "Welcome! How can I help you?";
}
// Continue in selected language
if (session.language === "de") {
return {
type: "speak",
session_id: event.session.id,
text: "Wie kann ich Ihnen helfen?",
tts: {
provider: "azure",
language: "de-DE",
voice: "de-DE-KatjaNeural",
},
};
}
return "How can I help you?";
},
});
```
## User Input Timeout Handling
Handle scenarios where users don't respond within a specified time period:
```typescript
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";
import express from "express";
// Track timeout counts per session
const timeoutCounts = new Map();
const assistant = AiFlowAssistant.create({
debug: true,
onSessionStart: async (event) => {
// Initialize timeout counter
timeoutCounts.set(event.session.id, 0);
return {
type: "speak",
session_id: event.session.id,
text: "Welcome to our automated assistant. What can I help you with today?",
user_input_timeout_seconds: 8 // Wait 8 seconds for initial response
};
},
onUserSpeak: async (event) => {
// Reset timeout counter on successful user input
timeoutCounts.set(event.session.id, 0);
const text = event.text.toLowerCase();
if (text.includes("account") || text.includes("balance")) {
return {
type: "speak",
session_id: event.session.id,
text: "Please tell me your account number.",
user_input_timeout_seconds: 10 // Give more time for account number
};
}
if (text.includes("speak") || text.includes("agent") || text.includes("human")) {
return {
type: "speak",
session_id: event.session.id,
text: "Let me transfer you to a live agent. Please hold."
// Follow with transfer action
};
}
return {
type: "speak",
session_id: event.session.id,
text: "I can help you with account information, billing questions, or connect you to an agent. What would you like?",
user_input_timeout_seconds: 8
};
},
onUserInputTimeout: async (event) => {
const sessionId = event.session.id;
const count = (timeoutCounts.get(sessionId) || 0) + 1;
timeoutCounts.set(sessionId, count);
console.log(`Timeout #${count} for session ${sessionId}`);
// After 3 timeouts, offer to transfer to agent
if (count >= 3) {
return {
type: "speak",
session_id: sessionId,
text: "I'm having trouble hearing you. Let me transfer you to a live agent who can better assist you."
// Could follow with transfer action
};
}
// After 2 timeouts, give clearer instructions
if (count === 2) {
return {
type: "speak",
session_id: sessionId,
text: "I still haven't heard your response. Please speak clearly after the beep. Say 'agent' if you'd like to speak to a person.",
user_input_timeout_seconds: 10 // Give more time
};
}
// First timeout - gentle prompt
return {
type: "speak",
session_id: sessionId,
text: "I didn't catch that. Are you still there? Please let me know how I can help you.",
user_input_timeout_seconds: 8
};
},
onSessionEnd: async (event) => {
// Cleanup timeout counters
timeoutCounts.delete(event.session.id);
console.log(`Session ${event.session.id} ended`);
},
});
const app = express();
app.use(express.json());
app.post("/webhook", assistant.express());
app.listen(3000, () => {
console.log("Timeout-aware assistant running on port 3000");
});
```
### Advanced Timeout Strategy
Context-aware timeout handling with different strategies based on the conversation state:
```typescript
interface SessionData {
  state: "greeting" | "collecting_info" | "confirming" | "completing";
  timeouts: number;
  data: Record<string, string>;
}

const sessions = new Map<string, SessionData>();
const assistant = AiFlowAssistant.create({
onSessionStart: async (event) => {
sessions.set(event.session.id, {
state: "greeting",
timeouts: 0,
data: {}
});
return {
type: "speak",
session_id: event.session.id,
text: "Hello! To help you with your order, I'll need some information. What's your order number?",
user_input_timeout_seconds: 10
};
},
onUserSpeak: async (event) => {
const session = sessions.get(event.session.id);
if (!session) return null;
// Reset timeout counter on successful input
session.timeouts = 0;
const text = event.text;
if (session.state === "greeting") {
session.data.orderNumber = text;
session.state = "collecting_info";
return {
type: "speak",
session_id: event.session.id,
text: `Thank you. Order number ${text} received. Can you verify your email address?`,
user_input_timeout_seconds: 10
};
}
if (session.state === "collecting_info") {
session.data.email = text;
session.state = "confirming";
return {
type: "speak",
session_id: event.session.id,
text: `Perfect. Let me look up order ${session.data.orderNumber} for ${text}. One moment please.`,
user_input_timeout_seconds: 5 // Shorter timeout for confirmation
};
}
return "How else can I help you?";
},
onUserInputTimeout: async (event) => {
const session = sessions.get(event.session.id);
if (!session) return null;
session.timeouts++;
// Different strategies based on conversation state
switch (session.state) {
case "greeting":
if (session.timeouts >= 2) {
return {
type: "speak",
session_id: event.session.id,
text: "I'm having trouble hearing your order number. Let me transfer you to someone who can help.",
// Follow with transfer
};
}
return {
type: "speak",
session_id: event.session.id,
text: "I didn't hear your order number. Please say or spell it out for me.",
user_input_timeout_seconds: 12 // Give extra time
};
case "collecting_info":
return {
type: "speak",
session_id: event.session.id,
text: "I need your email address to proceed. Please provide it now, or say 'skip' to continue without it.",
user_input_timeout_seconds: 10
};
case "confirming":
// Just continue with the process
session.state = "completing";
return {
type: "speak",
session_id: event.session.id,
text: "I found your order. Your package is scheduled for delivery tomorrow. Is there anything else I can help with?",
user_input_timeout_seconds: 8
};
default:
if (session.timeouts >= 3) {
return {
type: "hangup",
session_id: event.session.id
};
}
return {
type: "speak",
session_id: event.session.id,
text: "Are you still there?",
user_input_timeout_seconds: 5
};
}
},
onSessionEnd: async (event) => {
sessions.delete(event.session.id);
},
});
```
## Next Steps
* **[Integration Guides](/sdk/integrations/express)** - Detailed integration guides
* **[API Reference](/sdk/api-reference)** - Complete API documentation
---
---
url: /sipgate-ai-flow-api/sdk/advanced/direct-integration.md
---
# Working Without the Assistant Wrapper
If you prefer to work directly with the SDK's event and action system without using the `AiFlowAssistant` wrapper, you can manually handle events and construct actions.
## Direct Event Handling
Here's how to handle events and construct actions without the assistant wrapper:
```typescript
import express from "express";
import { AiFlowEventType, AiFlowActionType } from "@sipgate/ai-flow-sdk";
const app = express();
app.use(express.json());
app.post("/webhook", async (req, res) => {
const event = req.body;
let action = null;
switch (event.type) {
case "session_start":
action = {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "Welcome to our service!",
barge_in: {
strategy: "minimum_characters",
minimum_characters: 5,
},
};
break;
case "user_speak":
// Check if user interrupted (barge-in)
if (event.barged_in) {
console.log(`User interrupted with: ${event.text}`);
action = {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "I'm listening, go ahead.",
};
break;
}
// Normal user speech handling
if (event.text.toLowerCase().includes("transfer")) {
action = {
type: AiFlowActionType.TRANSFER,
session_id: event.session.id,
target_phone_number: "1234567890",
caller_id_name: "Support",
caller_id_number: "1234567890",
};
} else if (event.text.toLowerCase().includes("goodbye")) {
action = {
type: AiFlowActionType.HANGUP,
session_id: event.session.id,
};
} else {
action = {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: `You said: ${event.text}`,
};
}
break;
case "assistant_speak":
console.log(`Spoke for ${event.duration_ms}ms`);
// Optional: track metrics, no action needed
break;
case "session_end":
console.log(`Session ${event.session.id} ended`);
// Cleanup logic, no action needed
break;
}
// Return action if one was created
if (action) {
res.json(action);
} else {
res.status(204).send();
}
});
app.listen(3000, () => {
console.log("Webhook server listening on port 3000");
});
```
## Benefits of Direct Integration
* **Full Control** - Complete control over event handling
* **Custom Logic** - Easier to implement complex routing logic
* **No Abstraction** - Direct access to events and actions
* **Flexibility** - Can integrate with any framework or system
## When to Use Direct Integration
Use direct integration when:
* You need custom event processing logic
* You're integrating with a non-standard framework
* You want to implement your own state management
* You need fine-grained control over responses
## Next Steps
* **[Complete Event Reference](/sdk/advanced/events)** - All event types
* **[Complete Action Reference](/sdk/advanced/actions)** - All action types
* **[Validation with Zod](/sdk/advanced/validation)** - Runtime validation
---
---
url: /sipgate-ai-flow-api/sdk/advanced/events.md
---
# Complete Event Reference
Complete reference for all events in the SDK.
## Base Event Structure
All events extend the base event structure:
```typescript
interface BaseEvent {
session: {
id: string; // UUID of the session
account_id: string; // Account identifier
phone_number: string; // Phone number for this flow session
direction?: "inbound" | "outbound"; // Call direction
from_phone_number: string; // Phone number of the caller
to_phone_number: string; // Phone number of the callee
};
}
```
## All Event Types
| Event Type | Description | When Triggered |
| ----------------- | --------------------------- | ------------------------------------------ |
| `session_start` | Call session begins | When a new call is initiated |
| `user_speak` | User speech detected | After speech-to-text completes (includes `barged_in` flag) |
| `assistant_speak` | Assistant finished speaking (includes text and timing metadata) | After TTS playback completes |
| `assistant_speech_ended` | Assistant speech audio ended on the call | After speech playback ends |
| `session_end` | Call session ends | When the call terminates |
## Event Type Definitions
### session\_start
```typescript
interface AiFlowEventSessionStart {
type: "session_start";
session: {
id: string;
account_id: string;
phone_number: string; // Phone number for this flow session
direction?: "inbound" | "outbound"; // Call direction
from_phone_number: string;
to_phone_number: string;
};
}
```
### user\_speak
```typescript
interface AiFlowEventUserSpeak {
type: "user_speak";
text: string; // Recognized speech text
barged_in?: boolean; // true if user interrupted assistant
session: SessionInfo;
}
```
The `barged_in` flag is set to `true` when the user interrupts the assistant mid-speech.
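A handler can branch on this flag to acknowledge the interruption. A minimal sketch; the event shape follows the interface above:

```typescript
// Sketch: respond differently when the user barged in mid-speech.
function replyFor(event: { text: string; barged_in?: boolean }): string {
  if (event.barged_in) {
    return "Go ahead, I'm listening."; // user interrupted the assistant
  }
  return `You said: ${event.text}`;
}
```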
### assistant\_speak
```typescript
interface AiFlowEventAssistantSpeak {
type: "assistant_speak";
text?: string; // Text that was spoken
ssml?: string; // SSML that was used (if applicable)
duration_ms: number; // Duration of speech in milliseconds
speech_started_at: number; // Unix timestamp (ms) when speech started
session: SessionInfo;
}
```
### assistant\_speech\_ended
```typescript
interface AiFlowEventAssistantSpeechEnded {
type: "assistant_speech_ended";
session: SessionInfo;
}
```
### session\_end
```typescript
interface AiFlowEventSessionEnd {
type: "session_end";
session: SessionInfo;
}
```
## Type Safety
All events are fully typed. Import types from the SDK:
```typescript
import type {
AiFlowEventSessionStart,
AiFlowEventUserSpeak,
AiFlowEventAssistantSpeak,
AiFlowEventAssistantSpeechEnded,
AiFlowEventSessionEnd,
} from "@sipgate/ai-flow-sdk";
```
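Because every event carries a literal `type` field, a `switch` narrows the union inside handlers. A sketch using a structural stand-in for the SDK's event union:

```typescript
// Sketch: `type` acts as the discriminant for narrowing.
type EventLike =
  | { type: "user_speak"; text: string; barged_in?: boolean }
  | { type: "assistant_speak"; duration_ms: number }
  | { type: "session_end" };

function summarizeEvent(event: EventLike): string {
  switch (event.type) {
    case "user_speak":
      return `user said: ${event.text}`; // narrowed: text is available here
    case "assistant_speak":
      return `assistant spoke for ${event.duration_ms}ms`;
    case "session_end":
      return "call ended";
  }
}
```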
## Next Steps
* **[Complete Action Reference](/sdk/advanced/actions)** - All action types
* **[Direct Integration](/sdk/advanced/direct-integration)** - Working without the wrapper
---
---
url: /sipgate-ai-flow-api/sdk/advanced/actions.md
---
# Complete Action Reference
Complete reference for all actions in the SDK.
## Base Action Structure
All actions require a `session_id` and `type` field:
```typescript
interface BaseAction {
session_id: string; // UUID from the event's session.id
type: string; // Action type identifier
}
```
## All Action Types
| Action Type | Description | Primary Use Case |
| -------------- | --------------------------- | --------------------------------------- |
| `speak` | Speak text or SSML | Respond to user with synthesized speech |
| `audio` | Play pre-recorded audio | Play hold music, pre-recorded messages |
| `mix_audio` | Loop a background sound mixed into speech | Add ambient noise (café, office, train station) under the agent |
| `hangup` | End the call | Terminate conversation |
| `transfer` | Transfer to another number | Route to human agent or department |
| `barge_in` | Manually interrupt playback | Stop current audio immediately |
| `configure_transcription` | Change STT language(s) mid-call | Switch recognition language without hanging up |
## Action Type Definitions
### speak - Text-to-speech response
```typescript
interface AiFlowActionSpeak {
type: "speak";
session_id: string;
// Provide either text OR ssml (not both)
text?: string;
ssml?: string;
// Optional TTS configuration
tts?: {
provider: "azure";
language?: string; // e.g., "en-US", "de-DE"
voice?: string; // Azure voice name
} | {
provider: "eleven_labs";
voice?: string; // ElevenLabs voice ID (optional, uses default if omitted)
};
barge_in?: {
strategy: "none" | "manual" | "minimum_characters";
minimum_characters?: number; // Default: 3
allow_after_ms?: number; // Delay before allowing interruption
};
}
```
### audio - Play pre-recorded audio
```typescript
interface AiFlowActionAudio {
type: "audio";
session_id: string;
audio: string; // Base64 encoded WAV (16kHz, mono, 16-bit PCM)
barge_in?: {
strategy: "none" | "manual" | "minimum_characters";
minimum_characters?: number;
allow_after_ms?: number;
};
}
```
### mix\_audio - Loop a background sound under outbound speech
```typescript
interface AiFlowActionMixAudio {
type: "mix_audio";
session_id: string;
audio?: string; // Base64 WAV (16 kHz, mono, 16-bit PCM); required unless stop=true
volume?: number; // 0.0–1.0, default 0.5
stop?: boolean; // true to remove the active loop
}
```
The loop plays continuously for the rest of the call — under TTS during turns and on its own during silences. Sending `mix_audio` again replaces the loop. The loop is dropped automatically when the session ends.
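Small helpers can build the start and stop payloads. A sketch; `audioBase64` must already be the base64-encoded 16 kHz mono 16-bit PCM WAV described above (e.g. `fs.readFileSync("cafe.wav").toString("base64")`):

```typescript
// Sketch: start a background loop at low volume, e.g. cafe ambience.
function startMixAudio(sessionId: string, audioBase64: string, volume = 0.5) {
  return {
    type: "mix_audio" as const,
    session_id: sessionId,
    audio: audioBase64,
    volume,
  };
}

// Stopping needs no audio payload; the active loop is removed.
function stopMixAudio(sessionId: string) {
  return { type: "mix_audio" as const, session_id: sessionId, stop: true };
}
```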
### hangup - End call
```typescript
interface AiFlowActionHangup {
type: "hangup";
session_id: string;
}
```
### transfer - Transfer call
```typescript
interface AiFlowActionTransfer {
type: "transfer";
session_id: string;
target_phone_number: string; // E.164 format recommended
caller_id_name: string;
caller_id_number: string;
/**
* Optional transfer timeout in seconds (5–120). When set, a failed transfer
* returns the call to the agent via a new `session_start` event for the
* same session id (transfer fallback). Omit for legacy behavior where a
* failed transfer ends the call.
*/
timeout?: number;
}
```
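A transfer with a fallback window might look like this (a sketch; the caller ID values are placeholders):

```typescript
// Sketch: transfer with a 30-second fallback. If the target does not
// answer in time, the call returns to the assistant via a new
// session_start event for the same session id.
function transferWithFallback(sessionId: string, target: string) {
  return {
    type: "transfer" as const,
    session_id: sessionId,
    target_phone_number: target,
    caller_id_name: "Support",
    caller_id_number: target,
    timeout: 30, // seconds; must be within 5-120
  };
}
```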
### barge\_in - Manual interrupt
```typescript
interface AiFlowActionBargeIn {
type: "barge_in";
session_id: string;
}
```
### configure\_transcription - Change STT language mid-call
```typescript
interface AiFlowActionConfigureTranscription {
type: "configure_transcription";
session_id: string;
provider?: "AZURE" | "DEEPGRAM" | "ELEVEN_LABS"; // Omit to keep current provider.
languages?: string[]; // BCP-47 codes, 1-4 entries. Omit to reset to provider default.
}
```
> **Multi-language support:** Azure uses all supplied language codes for simultaneous detection (up to 4). Deepgram performs multilingual auto-detection across the supplied languages. ElevenLabs accepts only a single language — when multiple codes are provided, only the **first** is used and the rest are silently ignored.
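For example, switching recognition to German and English mid-call (a sketch following the interface above):

```typescript
// Sketch: enable simultaneous German/English detection on Azure.
function switchLanguages(sessionId: string, languages: string[]) {
  return {
    type: "configure_transcription" as const,
    session_id: sessionId,
    provider: "AZURE" as const, // Azure detects across all codes (up to 4)
    languages,
  };
}
```

Usage from a handler: `return switchLanguages(event.session.id, ["de-DE", "en-US"]);`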
## Type Safety
All actions are fully typed. Import types from the SDK:
```typescript
import type {
AiFlowAction,
AiFlowActionSpeak,
AiFlowActionAudio,
  AiFlowActionMixAudio,
AiFlowActionHangup,
AiFlowActionTransfer,
AiFlowActionBargeIn,
AiFlowActionConfigureTranscription,
} from "@sipgate/ai-flow-sdk";
```
## Next Steps
* **[Complete Event Reference](/sdk/advanced/events)** - All event types
* **[Direct Integration](/sdk/advanced/direct-integration)** - Working without the wrapper
---
---
url: /sipgate-ai-flow-api/sdk/advanced/validation.md
---
# Validation with Zod
The SDK exports Zod schemas for runtime validation of events and actions.
## Event Validation
Validate incoming events to ensure they match the expected format:
```typescript
import { AiFlowEventSchema } from "@sipgate/ai-flow-sdk";
import { z } from "zod";
app.post("/webhook", async (req, res) => {
try {
// Validate incoming event
const event = AiFlowEventSchema.parse(req.body);
// event is now type-safe and validated
const action = await assistant.onEvent(event);
if (action) {
res.json(action);
} else {
res.status(204).send();
}
} catch (error) {
if (error instanceof z.ZodError) {
console.error("Invalid event:", error.errors);
res.status(400).json({
error: "Invalid event format",
details: error.errors
});
} else {
console.error("Error:", error);
res.status(500).json({ error: "Internal server error" });
}
}
});
```
## Action Validation
Validate outgoing actions before sending:
```typescript
import { AiFlowActionSchema } from "@sipgate/ai-flow-sdk";
import { z } from "zod";
onUserSpeak: async (event) => {
const action = {
type: "speak",
session_id: event.session.id,
text: "Hello!",
};
try {
// Validate action before returning
const validatedAction = AiFlowActionSchema.parse(action);
return validatedAction;
} catch (error) {
if (error instanceof z.ZodError) {
console.error("Invalid action:", error.errors);
// Return a safe fallback
return "I encountered an error. Please try again.";
}
throw error;
}
}
```
## Custom Validation
You can extend the schemas for custom validation:
```typescript
import { AiFlowEventSchema } from "@sipgate/ai-flow-sdk";
import { z } from "zod";
// Extend the schema with custom validation
const CustomEventSchema = AiFlowEventSchema.extend({
session: z.object({
id: z.string().uuid(),
account_id: z.string().min(1),
// Add custom validation
}),
});
app.post("/webhook", async (req, res) => {
try {
const event = CustomEventSchema.parse(req.body);
// Process validated event
} catch (error) {
// Handle validation errors
}
});
```
## Benefits
* **Runtime Safety** - Catch malformed data at runtime, complementing compile-time types
* **Better Error Messages** - Zod provides detailed error information
* **Data Integrity** - Ensure events and actions match expected format
* **Debugging** - Easier to identify malformed data
## Next Steps
* **[Direct Integration](/sdk/advanced/direct-integration)** - Working without the wrapper
* **[API Reference](/sdk/api-reference)** - Complete API documentation
---
---
url: /sipgate-ai-flow-api/sdk/troubleshooting.md
---
# Troubleshooting
Common issues and solutions.
## Common Issues
### WebSocket Connection Errors
If you encounter WebSocket connection issues:
```typescript
wss.on("connection", (ws, req) => {
ws.on("error", (error) => {
console.error("WebSocket error:", error);
});
ws.on("close", (code, reason) => {
console.log(`Connection closed: ${code} - ${reason}`);
});
ws.on("message", assistant.ws(ws));
});
```
**Common causes:**
* Network connectivity issues
* Firewall blocking WebSocket connections
* Incorrect WebSocket URL or protocol
### Event Validation Errors
Use Zod schemas to validate incoming events:
```typescript
import { AiFlowEventSchema } from "@sipgate/ai-flow-sdk";
app.post("/webhook", async (req, res) => {
try {
const event = AiFlowEventSchema.parse(req.body);
const action = await assistant.onEvent(event);
if (action) {
res.json(action);
} else {
res.status(204).send();
}
} catch (error) {
console.error("Invalid event:", error);
res.status(400).json({ error: "Invalid event format" });
}
});
```
### Debug Mode
Enable debug logging to see all events and actions:
```typescript
const assistant = AiFlowAssistant.create({
debug: true, // Logs all events and actions
// ... your handlers
});
```
### Audio Format Issues
When using the audio action, ensure your audio is in the correct format:
* **Format**: WAV
* **Sample Rate**: 16kHz
* **Channels**: Mono
* **Bit Depth**: 16-bit PCM
* **Encoding**: Base64
```typescript
// Example: Convert audio file to correct format
import fs from "fs";
const audioBuffer = fs.readFileSync("audio.wav");
const base64Audio = audioBuffer.toString("base64");
return {
type: "audio",
session_id: event.session.id,
audio: base64Audio,
};
```
## TypeScript Issues
### Type Errors
Make sure you're importing types correctly:
```typescript
import type {
AiFlowEventUserSpeak,
AiFlowAction,
} from "@sipgate/ai-flow-sdk";
```
### Module Resolution
If you encounter module resolution errors, check your `tsconfig.json`:
```json
{
"compilerOptions": {
"moduleResolution": "bundler",
"esModuleInterop": true,
"skipLibCheck": true
}
}
```
## Performance Issues
### Slow Response Times
* Check your event handler performance
* Use async/await properly
* Avoid blocking operations
* Consider caching for frequently accessed data
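A minimal in-memory TTL cache is often enough for data fetched inside event handlers (a sketch; not suitable for multi-process deployments, where a shared store is needed):

```typescript
// Sketch: cache fetched values for ttlMs milliseconds to keep
// event handlers fast on repeated lookups.
const cache = new Map<string, { value: string; expires: number }>();

async function cached(
  key: string,
  ttlMs: number,
  fetcher: () => Promise<string>,
): Promise<string> {
  const hit = cache.get(key);
  if (hit && hit.expires > Date.now()) return hit.value; // fresh hit
  const value = await fetcher();
  cache.set(key, { value, expires: Date.now() + ttlMs });
  return value;
}
```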
### Memory Leaks
* Clean up session state in `onSessionEnd`
* Remove event listeners
* Clear timers and intervals
## Integration Issues
### Express Middleware
If the Express middleware isn't working:
```typescript
// Make sure express.json() is used
app.use(express.json());
// Check the route order
app.post("/webhook", assistant.express());
```
### WebSocket Handler
If WebSocket messages aren't being processed:
```typescript
// Ensure the message handler is set up correctly
ws.on("message", assistant.ws(ws));

// Alternatively, log the raw payload while processing it
// (register one of these handlers, not both):
ws.on("message", (data) => {
  console.log("Received:", data.toString());
  assistant.ws(ws)(data);
});
```
## Next Steps
* **[API Reference](/sdk/api-reference)** - Complete API documentation
* **[Examples](/sdk/examples)** - More examples and use cases
---
---
url: /sipgate-ai-flow-api/changelog.md
---
# Changelog
Release notes for the sipgate AI Flow API and SDK. Only customer-visible changes are listed here.
***
## Preview — May 2026
### End-to-End Voice-to-Voice Mode (Preview)
You can now connect your assistant to an end-to-end speech-to-speech model. With the new `configure_voice_to_voice` action the assistant bypasses the standard STT → text → TTS pipeline: caller audio flows directly into the model and the model's spoken response is sent straight back to the caller. Conversations feel snappier and more natural, with first-byte response latencies typically in the 200–600 ms range.
User turns are still surfaced as `user_speak` events so call traces and logs keep working — you only need to send a single `configure_voice_to_voice` action on `session_start`. To revert to the standard pipeline mid-call, send a `configure_transcription` action.
This is a preview feature, available on request after sipgate support review.
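Enabling the mode on `session_start` might look like the sketch below. Only the action `type` and the usual `session_id` field are taken from this note — consult the action reference for the full schema:

```typescript
// Respond to session_start with a configure_voice_to_voice action to
// switch the call to the speech-to-speech pipeline. Any model-selection
// fields are deliberately omitted here, as they are not documented in
// this release note.
function onSessionStart(sessionId: string) {
  return {
    type: "configure_voice_to_voice",
    session_id: sessionId,
  };
}
```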
***
## v1.9.0 — May 2026
### Per-Action VAD Configuration
You can now configure Voice Activity Detection (VAD) individually for each `speak` action. This lets you fine-tune how sensitive barge-in detection is depending on the context — for example, using stricter VAD during important announcements and more permissive settings during open-ended questions.
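As a sketch, a stricter configuration for an important announcement could look like this — note that the `vad` field name and its options are placeholder assumptions, not the documented schema:

```typescript
// Hypothetical shape: a speak action with stricter barge-in detection.
// `vad` and `sensitivity` are assumed names — check the speak action
// reference for the real per-action VAD fields.
const announcement = {
  type: "speak",
  session_id: "550e8400-e29b-41d4-a716-446655440000",
  text: "Please listen carefully, our terms have changed.",
  vad: { sensitivity: "low" }, // stricter: harder to interrupt
};
```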
***
## Improvements — April 2026
### Faster, More Natural Conversation Turns
Upgraded to a next-generation transcription backend with significantly improved end-of-utterance detection. The assistant responds faster at natural sentence endings and is less likely to cut in while the caller is still speaking.
### Background Audio Looping (`mix_audio`)
The `mix_audio` action now supports looping — play hold music or ambient sound continuously in the background while the assistant speaks, without gaps or manual re-triggering.
### Transfer with Timeout Fallback
The `transfer` action accepts an optional timeout. If the transfer destination does not answer within the configured time, the call returns to your assistant automatically, allowing you to handle the fallback gracefully.
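A transfer with a fallback might be shaped like this — `destination` and `timeout_seconds` are assumed field names used for illustration; see the transfer action reference for the actual schema:

```typescript
// Sketch of a transfer action with a fallback timeout (field names assumed).
const transferWithFallback = {
  type: "transfer",
  session_id: "550e8400-e29b-41d4-a716-446655440000",
  destination: "+4921112345678",
  timeout_seconds: 20, // call returns to the assistant if unanswered after 20 s
};
```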
### Send SMS During a Call (`send_sms`)
A new `send_sms` action lets your assistant send an SMS to the caller while the call is still active — useful for sending confirmation links, reference numbers, or follow-up information in real time.
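A confirmation message could be sent with something like the following — the release note only documents the action type, so the `text` field (and any recipient field) is an assumption:

```typescript
// Sketch of a send_sms action sent mid-call (field names assumed).
const confirmationSms = {
  type: "send_sms",
  session_id: "550e8400-e29b-41d4-a716-446655440000",
  text: "Your booking reference is ABC-123.",
};
```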
### Keypad (DTMF) Input Support
Your assistant can now react to keypad presses during a call. DTMF digits are delivered as events, enabling menu navigation, PIN entry, and other touch-tone interactions. User input timeouts also reset correctly when the caller presses a key.
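Menu navigation on keypad input might be handled roughly like this — the event type name (`"dtmf"`) and the `digit` field are assumptions; check the events reference for the exact shape:

```typescript
// Sketch: route a caller based on a keypad press (event shape assumed).
function handleDtmf(event: { type: string; digit?: string; session: { id: string } }) {
  if (event.type !== "dtmf") return undefined;
  const menu: Record<string, string> = {
    "1": "Connecting you to sales.",
    "2": "Connecting you to support.",
  };
  const text = menu[event.digit ?? ""] ?? "Sorry, that is not a valid option.";
  return { type: "speak", session_id: event.session.id, text };
}
```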
### Consistent E.164 Phone Numbers in All Events
Caller and callee phone numbers in all events are now consistently formatted as E.164 (e.g. `+4921112345678`). If you were normalising numbers on your side, this step is no longer necessary.
***
## v1.5.1 — March 2026
### Outbound Calls
Initiate AI-powered calls programmatically via `POST /ai-flows/:aiFlowId/call`. Your assistant handles the call as soon as the recipient picks up — the same event-driven flow as inbound calls. Available on request after a review by sipgate support.
### Real-Time Speech Start Event (`user_speech_started`)
A new `user_speech_started` event is sent the moment the caller begins speaking — before transcription completes. Use it to interrupt the assistant or trigger visual feedback without waiting for the full transcript.
### Faster ElevenLabs Voices
ElevenLabs voices now use the latest `eleven_flash_v2_5` model by default, delivering noticeably lower latency for generated speech.
### ElevenLabs EU Data Residency
ElevenLabs voices now route through the EU endpoint by default, keeping audio data within the European Union.
***
## Improvements — February 2026
### Immediate Barge-In Strategy
A new `immediate` barge-in strategy detects speech using Voice Activity Detection (VAD) the moment a caller starts talking — typically 20–100 ms before the first word is transcribed. Conversations feel as natural as talking to a real person.
### Mid-Call Language and Provider Switching (`configure_transcription`)
A new `configure_transcription` action lets your assistant switch the transcription language or provider in the middle of a call — for example, after detecting that the caller speaks a different language, or to adapt recognition parameters dynamically. Supported languages follow BCP-47 tags and work across Azure, Deepgram, and ElevenLabs.
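Switching to German mid-call might look like this — the BCP-47 tag and the provider list come from this note, but the `language` and `provider` field names are assumptions; see the action reference for the real schema:

```typescript
// Sketch of a configure_transcription action (field names assumed).
const switchToGerman = {
  type: "configure_transcription",
  session_id: "550e8400-e29b-41d4-a716-446655440000",
  language: "de-DE",
  provider: "deepgram",
};
```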
***
## Improvements — January 2026
### SSML Support in Speak Actions
The `speak` action now accepts SSML (Speech Synthesis Markup Language) in addition to plain text. Use SSML to control pronunciation, pauses, emphasis, and speaking rate for fine-tuned voice output.
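A speak action carrying SSML might look like the sketch below — whether SSML is passed in the existing `text` field or a dedicated one is an assumption; check the speak action reference:

```typescript
// Sketch: SSML markup in a speak action (placement in `text` assumed).
const ssmlSpeak = {
  type: "speak",
  session_id: "550e8400-e29b-41d4-a716-446655440000",
  text:
    '<speak>Your order number is <say-as interpret-as="digits">4711</say-as>.' +
    '<break time="400ms"/>Is there anything else?</speak>',
};
```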
***
## Early Access — November–December 2025
### Multi-Provider Transcription
Deepgram and ElevenLabs are now available as speech-to-text providers alongside Azure. Select the provider that best fits your use case — each offers different strengths in accuracy, latency, and supported languages.
### Phone Number Routing
AI flows can now be associated with specific phone numbers directly through the API, making it easier to build multi-flow routing logic without external IVR configuration.
### SDK Launch
The `@sipgate/ai-flow-sdk` TypeScript SDK is now publicly available on npm. It provides fully typed event handlers and action builders, removing the need to manage raw WebSocket or HTTP webhook payloads manually.
***
> **Note:** The AI Flow API follows continuous delivery — not all improvements correspond to an SDK version bump. Check this page regularly for the latest changes.
---
---
url: /sipgate-ai-flow-api/README.md
---
# Documentation
This directory contains the documentation for sipgate AI Flow SDK, built with [VitePress](https://vitepress.dev/).
## Quick Links
* **[API Reference](./api/)** - Language-agnostic HTTP/WebSocket API documentation
* **[TypeScript SDK](./sdk/)** - TypeScript SDK documentation
* **LLM-friendly docs** — `/llms.txt` (index) and `/llms-full.txt` (full corpus) are auto-generated on build by `vitepress-plugin-llms` and follow the [llms.txt spec](https://llmstxt.org/). Linked from [the homepage](./index.md#for-ai-assisted-development).
## Development
```bash
# Install dependencies
pnpm install
# Start dev server
pnpm dev
# Build for production
pnpm build
# Preview production build
pnpm preview
```
## Structure
* `index.md` - Homepage
* `sdk/` - SDK documentation
* `.vitepress/` - VitePress configuration
* `config.ts` - Main configuration
* `theme/` - Custom theme and styles
## Deployment
The documentation is automatically deployed to GitHub Pages when changes are pushed to the `main` branch. The deployment is handled by the `.github/workflows/docs.yml` workflow.
## Base URL
The documentation is configured to be served from `/sipgate-ai-flow-api/` on GitHub Pages. If you need to change this, update the `base` option in `.vitepress/config.ts`.
---
---
url: /sipgate-ai-flow-api/SETUP.md
---
# Documentation Setup Guide
This guide will help you set up and deploy the documentation to GitHub Pages.
## Prerequisites
* Node.js 22+
* pnpm 10+
## Local Development
1. **Install dependencies:**
```bash
cd docs
pnpm install
```
2. **Start the development server:**
```bash
pnpm dev
```
The documentation will be available at `http://localhost:5173`
3. **Build for production:**
```bash
pnpm build
```
4. **Preview production build:**
```bash
pnpm preview
```
## GitHub Pages Setup
### 1. Enable GitHub Pages
1. Go to your repository settings on GitHub
2. Navigate to **Pages** in the left sidebar
3. Under **Source**, select **GitHub Actions**
4. Save the changes
### 2. Configure Base URL
The documentation is configured to be served from `/sipgate-ai-flow-api/` on GitHub Pages.
If your repository name is different, update the `base` option in `.vitepress/config.ts`:
```typescript
export default defineConfig({
base: '/your-repo-name/', // Update this
// ...
})
```
### 3. Deploy
The documentation will automatically deploy when you:
1. Push changes to the `main` branch that affect files in the `docs/` folder
2. Manually trigger the workflow from the **Actions** tab
### 4. Access Your Documentation
Once deployed, your documentation will be available at:
```
https://sipgate.github.io/sipgate-ai-flow-api/
```
(Replace `sipgate` with your GitHub username/organization and `sipgate-ai-flow-api` with your repository name)
## Customization
### Adding a Logo
1. Add your logo file (e.g., `logo.svg`) to the `docs/public/` folder
2. Uncomment the logo line in `.vitepress/config.ts`:
```typescript
themeConfig: {
logo: '/logo.svg',
// ...
}
```
### Changing Colors
Edit `.vitepress/theme/custom.css` to change the color scheme:
```css
:root {
--vp-c-brand: #6366f1; /* Primary brand color */
--vp-c-brand-light: #818cf8;
/* ... */
}
```
### Adding Pages
1. Create a new `.md` file in the appropriate directory
2. Add it to the sidebar in `.vitepress/config.ts`
3. Add navigation links if needed
## Troubleshooting
### Build Fails
* Check that all dependencies are installed: `pnpm install`
* Verify Node.js version is 22+
* Check for syntax errors in markdown files
### Pages Not Updating
* Ensure GitHub Pages is enabled in repository settings
* Check the Actions tab for workflow errors
* Verify the base URL matches your repository name
### Links Not Working
* Ensure all internal links use relative paths
* Check that the base URL is correctly configured
* Verify file paths match the actual file structure
## Support
For issues or questions:
* Check the [VitePress documentation](https://vitepress.dev/)
* Review the workflow logs in GitHub Actions
* Contact the development team
---
---
url: /sipgate-ai-flow-api/api/events/user-input-timeout.md
---
# User Input Timeout Event
Sent when no user speech is detected within the configured timeout period after the assistant finishes speaking.
## Event Structure
```json
{
"type": "user_input_timeout",
"session": {
"id": "550e8400-e29b-41d4-a716-446655440000",
"account_id": "account-123",
"phone_number": "+4921112345678",
"direction": "inbound",
"from_phone_number": "+4917612345678",
"to_phone_number": "+4921112345678"
}
}
```
## When Triggered
This event is sent when:
1. A `speak` action includes a `user_input_timeout_seconds` field
2. The assistant finishes speaking (`assistant_speech_ended` event fires)
3. The specified timeout period elapses without any user speech detected
## Response
You can respond with any action:
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "I didn't hear anything. Let me repeat the question."
}
```
## Use Cases
### Retry Question
```javascript
app.post('/webhook', (req, res) => {
const event = req.body;
if (event.type === 'user_input_timeout') {
return res.json({
type: 'speak',
session_id: event.session.id,
text: 'Are you still there? Please say yes or no.',
user_input_timeout_seconds: 5
});
}
});
```
### Escalate to Human
```javascript
app.post('/webhook', (req, res) => {
const event = req.body;
if (event.type === 'user_input_timeout') {
return res.json({
type: 'speak',
session_id: event.session.id,
text: 'Let me transfer you to a human agent.',
// Follow with transfer action
});
}
});
```
### Hangup After Multiple Timeouts
```javascript
const timeoutCounts = new Map();

app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_end') {
    // Clean up per-session state so the map does not grow forever
    timeoutCounts.delete(event.session.id);
    return res.end();
  }

  if (event.type === 'user_input_timeout') {
    const sessionId = event.session.id;
    const count = (timeoutCounts.get(sessionId) || 0) + 1;
    timeoutCounts.set(sessionId, count);

    if (count >= 3) {
      timeoutCounts.delete(sessionId);
      return res.json({
        type: 'hangup',
        session_id: sessionId
      });
    }

    return res.json({
      type: 'speak',
      session_id: sessionId,
      text: `I didn't hear anything. Please respond. Attempt ${count} of 3.`,
      user_input_timeout_seconds: 5
    });
  }
});
```
## Configuration
The timeout is configured in the `speak` action:
```json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "What is your account number?",
"user_input_timeout_seconds": 5
}
```
See [Speak Action](/api/actions/speak#user-input-timeout) for details.
## Behavior
* **Timer starts**: When `assistant_speech_ended` event fires
* **Timer cleared**: When any user speech is detected (STT events)
* **Event sent**: When timeout period elapses without speech
* **New speak action**: Clears any existing timeout and sets a new one (if specified)
## Examples
### Python
```python
from flask import request, jsonify

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json
    if event['type'] == 'user_input_timeout':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': "I didn't hear you. Please try again."
        })
    return ('', 204)
```
### Go
```go
if event["type"] == "user_input_timeout" {
	session := event["session"].(map[string]interface{})
	action := map[string]interface{}{
		"type":       "speak",
		"session_id": session["id"],
		"text":       "I didn't hear you. Please try again.",
	}
	json.NewEncoder(w).Encode(action)
}
```
### Ruby
```ruby
post '/webhook' do
event = JSON.parse(request.body.read)
if event['type'] == 'user_input_timeout'
content_type :json
{
type: 'speak',
session_id: event['session']['id'],
text: 'I didn\'t hear you. Please try again.'
}.to_json
end
end
```
## Best Practices
1. **Set reasonable timeouts** - 5-10 seconds is typical for most interactions
2. **Provide feedback** - Let users know why they're being prompted again
3. **Limit retries** - After 2-3 timeouts, consider escalating or hanging up
4. **Use context** - Different questions may need different timeout durations
5. **Handle gracefully** - Don't frustrate users with immediate hangups
## Related
* **[Speak Action](/api/actions/speak)** - Configure timeout
* **[Assistant Speech Ended](/api/events/assistant-speech-ended)** - When timer starts
* **[User Speak](/api/events/user-speak)** - Clears timeout