Streaming LLM Responses: Sentence-by-Sentence Best Practices
When integrating Large Language Models (LLMs) like OpenAI's GPT, Anthropic's Claude, or similar services with sipgate AI Flow, how you stream responses significantly impacts the naturalness of synthesized speech. This guide shows you how to achieve smooth, natural-sounding voice output by sending complete sentences rather than individual tokens.
The Problem: Token-by-Token Streaming
LLMs stream responses token-by-token (small text fragments). Sending each token directly to the TTS provider creates choppy, unnatural speech:
```typescript
// ❌ BAD: Sends every token immediately
for await (const chunk of llmStream) {
  await sendAction({
    type: 'speak',
    session_id: sessionId,
    text: chunk.content, // Individual tokens: "Hello", ", ", "how", " ", "can", " ", "I"...
    tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
  })
}
```

Why this sounds bad:
- Each TTS call treats the text as a complete utterance with sentence-ending prosody (falling intonation, longer pauses)
- Results in robotic, choppy speech: "Hello↘️ [pause] how↘️ [pause] can↘️ [pause] I↘️ [pause] help↘️ [pause]"
- TTS providers optimize for complete sentences, not fragments
The Solution: Sentence Segmentation
✅ Best Practice: Buffer LLM tokens and send complete sentences to the TTS provider.
Benefits:
- Natural prosody and intonation
- Appropriate pauses between sentences
- Better voice quality from TTS providers
- Maintains low latency (sentences typically complete within 1-2 seconds)
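The buffering idea itself fits in a few lines. The sketch below is a deliberately naive illustration (a hypothetical makeSentenceBuffer helper using a punctuation heuristic, with speak standing in for your TTS call) - the rest of this guide builds the robust version with Intl.Segmenter.

```typescript
// Naive sketch: buffer tokens, emit once a sentence terminator is followed by whitespace.
// `speak` stands in for the TTS call; production code should use Intl.Segmenter instead.
function makeSentenceBuffer(speak: (sentence: string) => void) {
  let buffer = ''
  return {
    push(token: string) {
      buffer += token
      // Everything up to the last '.', '!' or '?' followed by whitespace is "complete"
      const match = buffer.match(/^(.*[.!?])\s+(.*)$/s)
      if (match) {
        speak(match[1].trim())
        buffer = match[2]
      }
    },
    flush() {
      // Stream ended: emit whatever is left
      if (buffer.trim()) speak(buffer.trim())
      buffer = ''
    },
  }
}
```

Pushing the tokens "Hello", ". ", "How are you?" and then flushing produces two speak calls, "Hello." and "How are you?", instead of five choppy fragments.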
Prompting LLMs for Voice Output
Critical: Instruct your LLM to avoid abbreviations that interfere with speech synthesis and sentence detection.
The Problem with Abbreviations
Abbreviations like "Dr.", "bzw.", "z.B.", "etc." cause two issues:
1. Incorrect sentence segmentation - Intl.Segmenter detects periods as sentence boundaries:

```typescript
// "Dr. Smith will help you."
// Incorrectly splits into:
// Sentence 1: "Dr."
// Sentence 2: "Smith will help you."
```

2. Poor TTS pronunciation - Text-to-speech may mispronounce abbreviations:
- "Dr." → "D R" or "Doctor point" instead of "Doctor"
- "bzw." → "B Z W" instead of "beziehungsweise"
- "z.B." → "Z B" instead of "zum Beispiel"
System Prompt Guidelines
Add these instructions to your LLM system prompt:
```typescript
const systemPrompt = `You are a voice assistant. Follow these rules strictly:

VOICE OUTPUT RULES:
- Write out all abbreviations fully (e.g., "Doctor" not "Dr.", "for example" not "e.g.")
- Use complete words instead of shortened forms
- Avoid punctuation-based abbreviations that end with periods
- Use natural, spoken language as if talking to someone in person

Examples:
❌ "Dr. Smith can help you with that."
✅ "Doctor Smith can help you with that."
❌ "You can use method A, B, or C, e.g., the first one."
✅ "You can use method A, B, or C, for example the first one."
❌ "This is available Mon.-Fri."
✅ "This is available Monday through Friday."

Your responses will be converted to speech, so write exactly how you would say it out loud.`
```

Language-Specific Examples
English:
```typescript
const englishVoiceRules = `
- "Dr." → "Doctor"
- "Mr." → "Mister"
- "Mrs." → "Missus"
- "e.g." → "for example"
- "i.e." → "that is"
- "etc." → "and so on" or "etcetera"
- "vs." → "versus"
- "approx." → "approximately"
`
```

German:
```typescript
const germanVoiceRules = `
- "Dr." → "Doktor"
- "bzw." → "beziehungsweise"
- "z.B." → "zum Beispiel"
- "usw." → "und so weiter"
- "ca." → "circa"
- "etc." → "et cetera" or "und so weiter"
- "inkl." → "inklusive"
- "ggf." → "gegebenenfalls"
- "evtl." → "eventuell"
`
```

Complete OpenAI Example
```typescript
const stream = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [
    {
      role: 'system',
      content: `You are a voice assistant for customer service.

CRITICAL VOICE RULES:
- Never use abbreviations with periods (Dr., e.g., etc.)
- Write everything as you would speak it out loud
- Use complete words: "Doctor" not "Dr.", "for example" not "e.g."
- Your responses will be synthesized to speech

Be helpful, concise, and conversational.`
    },
    {
      role: 'user',
      content: userMessage
    }
  ],
  stream: true,
})
```

Complete Anthropic Example
```typescript
const stream = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  system: `You are a voice assistant. Your responses will be converted to speech.

VOICE OUTPUT REQUIREMENTS:
- Write all abbreviations in full (Doctor, not Dr.)
- Avoid period-based abbreviations (e.g., i.e., etc.)
- Use natural spoken language
- Write numbers as words when it sounds more natural

Examples:
Wrong: "Dr. Schmidt can help you, e.g., with billing."
Right: "Doctor Schmidt can help you, for example with billing."
Wrong: "Available Mon.-Fri., 9 a.m.-5 p.m."
Right: "Available Monday through Friday, 9 AM to 5 PM."`,
  messages: [
    {
      role: 'user',
      content: userMessage
    }
  ],
  stream: true,
})
```

Testing Your Prompt
Verify your LLM follows voice rules by testing with edge cases:
```typescript
const testCases = [
  "Tell me about Dr. Smith",
  "What are the benefits, e.g., cost savings?",
  "This applies to companies like IBM, Microsoft, etc.",
  "Available Mon.-Fri.",
]

// Expected responses should have NO abbreviations with periods
```

Common Mistake
Don't rely on post-processing to fix abbreviations. LLMs are excellent at following voice guidelines when properly instructed. Post-processing is fragile and language-dependent.
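A simple detection helper keeps testing honest without drifting into post-processing - it only flags violations, it never rewrites text. The abbreviation list below is illustrative (drawn from the language examples above); extend it for your languages:

```typescript
// Illustrative spot-check for voice-rule violations in LLM output.
// Detection only - fix the system prompt, don't post-process the text.
const KNOWN_ABBREVIATIONS = [
  'Dr.', 'Mr.', 'Mrs.', 'e.g.', 'i.e.', 'etc.', 'vs.', 'approx.', // English
  'bzw.', 'z.B.', 'usw.', 'ca.', 'inkl.', 'ggf.', 'evtl.',        // German
]

function findAbbreviations(text: string): string[] {
  return KNOWN_ABBREVIATIONS.filter((abbr) => text.includes(abbr))
}
```

Running this over the test-case responses makes violations visible at a glance: a response like "Dr. Smith can help, e.g., with billing." is flagged with ['Dr.', 'e.g.'], while the fully written-out version passes clean.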
Using JavaScript's Built-in Sentence Segmenter
JavaScript provides Intl.Segmenter - a native API for text segmentation, including sentence detection. It's available in Node.js ≥16.
Basic Example
```typescript
// Create a sentence segmenter (do this once, reuse for performance)
const sentenceSegmenter = new Intl.Segmenter('en', { granularity: 'sentence' })

function* extractSentences(text: string): Generator<string> {
  const segments = sentenceSegmenter.segment(text)
  for (const segment of segments) {
    yield segment.segment.trim()
  }
}

// Usage
const text = "Hello, how can I help? I'm here to assist you today."
for (const sentence of extractSentences(text)) {
  console.log(sentence)
  // Output:
  // "Hello, how can I help?"
  // "I'm here to assist you today."
}
```

Streaming with OpenAI
```typescript
import OpenAI from 'openai'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const segmenter = new Intl.Segmenter('en', { granularity: 'sentence' })

async function streamOpenAIResponse(
  sessionId: string,
  userMessage: string,
  sendAction: (action: any) => Promise<void>
) {
  let buffer = ''
  let lastSentenceEnd = 0

  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  })

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content
    if (!content) continue

    // Add token to buffer
    buffer += content

    // Check for complete sentences
    const segments = Array.from(segmenter.segment(buffer))

    // Find complete sentences (all but possibly the last incomplete one)
    for (let i = lastSentenceEnd; i < segments.length - 1; i++) {
      const sentence = segments[i].segment.trim()
      if (sentence) {
        // Send complete sentence to TTS
        await sendAction({
          type: 'speak',
          session_id: sessionId,
          text: sentence,
          tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
        })
      }
      lastSentenceEnd = i + 1
    }
  }

  // Send any remaining text as final sentence
  const remainingSegments = Array.from(segmenter.segment(buffer))
  for (let i = lastSentenceEnd; i < remainingSegments.length; i++) {
    const sentence = remainingSegments[i].segment.trim()
    if (sentence) {
      await sendAction({
        type: 'speak',
        session_id: sessionId,
        text: sentence,
        tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
      })
    }
  }
}
```

Streaming with Anthropic Claude
```typescript
import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })
const segmenter = new Intl.Segmenter('en', { granularity: 'sentence' })

async function streamClaudeResponse(
  sessionId: string,
  userMessage: string,
  sendAction: (action: any) => Promise<void>
) {
  let buffer = ''
  let lastSentenceEnd = 0

  const stream = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  })

  for await (const event of stream) {
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      const content = event.delta.text

      // Add token to buffer
      buffer += content

      // Check for complete sentences
      const segments = Array.from(segmenter.segment(buffer))

      // Send complete sentences
      for (let i = lastSentenceEnd; i < segments.length - 1; i++) {
        const sentence = segments[i].segment.trim()
        if (sentence) {
          await sendAction({
            type: 'speak',
            session_id: sessionId,
            text: sentence,
            tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
          })
        }
        lastSentenceEnd = i + 1
      }
    }
  }

  // Send remaining text
  const remainingSegments = Array.from(segmenter.segment(buffer))
  for (let i = lastSentenceEnd; i < remainingSegments.length; i++) {
    const sentence = remainingSegments[i].segment.trim()
    if (sentence) {
      await sendAction({
        type: 'speak',
        session_id: sessionId,
        text: sentence,
        tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
      })
    }
  }
}
```

Reusable Helper Class
For production use, extract this logic into a reusable helper:
```typescript
export class SentenceStreamBuffer {
  private buffer = ''
  private lastSentenceEnd = 0
  private segmenter: Intl.Segmenter

  constructor(locale: string = 'en') {
    this.segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' })
  }

  /**
   * Add a token/chunk to the buffer and return any complete sentences.
   * @returns Array of complete sentences ready to be sent to TTS
   */
  push(chunk: string): string[] {
    this.buffer += chunk
    const segments = Array.from(this.segmenter.segment(this.buffer))
    const completeSentences: string[] = []

    // Extract complete sentences (all but possibly the last incomplete one)
    for (let i = this.lastSentenceEnd; i < segments.length - 1; i++) {
      const sentence = segments[i].segment.trim()
      if (sentence) {
        completeSentences.push(sentence)
      }
      this.lastSentenceEnd = i + 1
    }

    return completeSentences
  }

  /**
   * Flush remaining buffer as final sentence(s).
   * Call this when the stream ends.
   */
  flush(): string[] {
    const segments = Array.from(this.segmenter.segment(this.buffer))
    const remainingSentences: string[] = []

    for (let i = this.lastSentenceEnd; i < segments.length; i++) {
      const sentence = segments[i].segment.trim()
      if (sentence) {
        remainingSentences.push(sentence)
      }
    }

    // Reset state
    this.buffer = ''
    this.lastSentenceEnd = 0

    return remainingSentences
  }

  /**
   * Reset the buffer (useful for error handling or conversation resets)
   */
  reset(): void {
    this.buffer = ''
    this.lastSentenceEnd = 0
  }
}
```

Using the Helper
```typescript
async function streamLLMToVoice(
  sessionId: string,
  llmStream: AsyncIterable<string>,
  sendAction: (action: any) => Promise<void>,
  locale: string = 'en'
) {
  const buffer = new SentenceStreamBuffer(locale)

  try {
    // Process streaming tokens
    for await (const token of llmStream) {
      const sentences = buffer.push(token)

      // Send each complete sentence to TTS
      for (const sentence of sentences) {
        await sendAction({
          type: 'speak',
          session_id: sessionId,
          text: sentence,
          tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
        })
      }
    }

    // Send any remaining text when stream completes
    const finalSentences = buffer.flush()
    for (const sentence of finalSentences) {
      await sendAction({
        type: 'speak',
        session_id: sessionId,
        text: sentence,
        tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
      })
    }
  } catch (error) {
    buffer.reset() // Clean up on error
    throw error
  }
}
```

Multi-Language Support
Intl.Segmenter supports multiple languages out of the box:
```typescript
// German
const germanSegmenter = new Intl.Segmenter('de', { granularity: 'sentence' })

// Spanish
const spanishSegmenter = new Intl.Segmenter('es', { granularity: 'sentence' })

// French
const frenchSegmenter = new Intl.Segmenter('fr', { granularity: 'sentence' })

// Reusable buffer with language detection
function createStreamBuffer(languageCode: string): SentenceStreamBuffer {
  return new SentenceStreamBuffer(languageCode)
}
```

Handling Edge Cases
Short Responses
For very short responses (single sentence or fragment), the buffer approach still works:
```typescript
// LLM response: "Hello!"
buffer.push("Hello!") // Returns: []
buffer.flush()        // Returns: ["Hello!"]
```

Incomplete Sentences During Interruption
If the user interrupts (barge-in), you may have incomplete sentences in the buffer:
```typescript
// Handle barge-in event
function handleBargeIn(sessionId: string) {
  const buffer = sessionBuffers.get(sessionId)
  if (buffer) {
    // Option 1: Discard incomplete sentence
    buffer.reset()

    // Option 2: Send incomplete sentence as-is (for context)
    const remaining = buffer.flush()
    // Log or store for context but don't send to TTS
  }
}
```

Very Long Sentences
Sometimes LLMs generate very long sentences. Consider adding a character limit:
```typescript
class SentenceStreamBuffer {
  private maxSentenceLength = 500 // characters

  push(chunk: string): string[] {
    this.buffer += chunk

    // Force break on very long buffers
    if (this.buffer.length > this.maxSentenceLength && this.buffer.includes(' ')) {
      const lastSpace = this.buffer.lastIndexOf(' ', this.maxSentenceLength)
      const forcedSentence = this.buffer.substring(0, lastSpace).trim()
      this.buffer = this.buffer.substring(lastSpace).trim()
      this.lastSentenceEnd = 0
      return [forcedSentence]
    }

    // Normal sentence detection...
    const segments = Array.from(this.segmenter.segment(this.buffer))
    const completeSentences: string[] = []
    for (let i = this.lastSentenceEnd; i < segments.length - 1; i++) {
      const sentence = segments[i].segment.trim()
      if (sentence) {
        completeSentences.push(sentence)
      }
      this.lastSentenceEnd = i + 1
    }
    return completeSentences
  }

  // ... rest of class
}
```

Performance Considerations
Buffer Management
For production deployments with many concurrent sessions:
```typescript
// Store buffers per session
const sessionBuffers = new Map<string, SentenceStreamBuffer>()

function getOrCreateBuffer(sessionId: string, locale: string): SentenceStreamBuffer {
  if (!sessionBuffers.has(sessionId)) {
    sessionBuffers.set(sessionId, new SentenceStreamBuffer(locale))
  }
  return sessionBuffers.get(sessionId)!
}

// Clean up on session end
function handleSessionEnd(sessionId: string) {
  sessionBuffers.delete(sessionId)
}
```

Timeout Protection
Add a timeout so text is never buffered indefinitely:
```typescript
class SentenceStreamBufferWithTimeout extends SentenceStreamBuffer {
  private lastPushTime = Date.now()
  private timeout = 5000 // 5 seconds

  push(chunk: string): string[] {
    this.lastPushTime = Date.now()
    return super.push(chunk)
  }

  hasTimedOut(): boolean {
    return Date.now() - this.lastPushTime > this.timeout
  }

  flushIfTimedOut(): string[] {
    if (this.hasTimedOut()) {
      return this.flush()
    }
    return []
  }
}
```

Complete Example: Express.js Integration
```typescript
import express from 'express'
import OpenAI from 'openai'

const app = express()
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const sessionBuffers = new Map<string, SentenceStreamBuffer>()

app.use(express.json())

app.post('/webhook', async (req, res) => {
  const event = req.body

  switch (event.type) {
    case 'user_speak':
      // Don't await - respond immediately to avoid timeout
      handleUserSpeak(event).catch(console.error)
      return res.status(204).send()

    case 'session_end':
      sessionBuffers.delete(event.session.id)
      return res.status(204).send()

    default:
      return res.status(204).send()
  }
})

async function handleUserSpeak(event: any) {
  const sessionId = event.session.id
  const userText = event.text

  // Get or create buffer for this session
  const buffer = sessionBuffers.get(sessionId) || new SentenceStreamBuffer('en')
  sessionBuffers.set(sessionId, buffer)

  // Stream LLM response
  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: userText }],
    stream: true,
  })

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content
    if (!content) continue

    const sentences = buffer.push(content)
    for (const sentence of sentences) {
      await sendAction({
        type: 'speak',
        session_id: sessionId,
        text: sentence,
        tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
      })
    }
  }

  // Flush remaining text
  const finalSentences = buffer.flush()
  for (const sentence of finalSentences) {
    await sendAction({
      type: 'speak',
      session_id: sessionId,
      text: sentence,
      tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
    })
  }
}

async function sendAction(action: any) {
  // Send action via WebSocket or HTTP to sipgate AI Flow
  // Implementation depends on your integration method
  await fetch('https://your-aiflow-endpoint/actions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(action),
  })
}

app.listen(3000, () => console.log('Server running on port 3000'))
```

Fallback for Older Node.js Versions
If you're using Node.js <16, you can use a simple regex-based fallback:
```typescript
// Simple fallback for environments without Intl.Segmenter
function splitSentencesSimple(text: string): string[] {
  // Basic sentence splitting (not as robust as Intl.Segmenter)
  // Matches sentence endings followed by whitespace
  return text
    .split(/(?<=[.!?])\s+/)
    .map(s => s.trim())
    .filter(s => s.length > 0)
}

// Use in SentenceStreamBuffer as fallback
class SentenceStreamBufferLegacy {
  private buffer = ''

  push(chunk: string): string[] {
    this.buffer += chunk
    const sentences = splitSentencesSimple(this.buffer)
    if (sentences.length > 1) {
      // Keep last sentence in buffer (might be incomplete)
      const complete = sentences.slice(0, -1)
      this.buffer = sentences[sentences.length - 1]
      return complete
    }
    return []
  }

  flush(): string[] {
    const sentence = this.buffer.trim()
    this.buffer = ''
    return sentence ? [sentence] : []
  }
}
```

Regex Limitations
The regex fallback is less robust than Intl.Segmenter and may incorrectly split on abbreviations (Dr., e.g., etc.). If using the fallback, it's even more critical to follow the LLM prompting guidelines to avoid abbreviations.
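For example, the fallback splits on the period in "Dr." (splitSentencesSimple repeated here so the snippet stands alone):

```typescript
// Same regex fallback as above, repeated for a self-contained demo
function splitSentencesSimple(text: string): string[] {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
}

const result = splitSentencesSimple('Dr. Smith is here. He can help.')
// The period in "Dr." is treated as a sentence ending:
// ["Dr.", "Smith is here.", "He can help."]
```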
Best Practices Summary
- Prompt LLMs to avoid abbreviations - Instruct your LLM to write out "Doctor" not "Dr.", "for example" not "e.g." to prevent incorrect segmentation and poor pronunciation
- Always segment sentences - Never send individual tokens to TTS, always buffer and send complete sentences
- Use Intl.Segmenter - Native, robust, multi-language support (Node.js ≥16)
- Buffer per session - Keep separate buffers for concurrent conversations
- Clean up on session end - Delete buffers to prevent memory leaks
- Handle timeouts - Flush buffer if no new tokens arrive within 5 seconds
- Support multiple languages - Pass correct locale to Intl.Segmenter
- Handle barge-in - Reset or discard incomplete sentences on interruption
- Limit sentence length - Force breaks for very long sentences (500+ characters)
Token Accumulation Speed
In practice, sentences complete quickly (typically 1-2 seconds with modern LLMs). Users won't notice the buffering delay, but they will notice the dramatic improvement in speech quality.
Related Documentation
- Speak Action - Complete reference for the speak action
- TTS Providers - Azure and ElevenLabs configuration
- Barge-In Best Practices - Handling interruptions during speech
- Async Hold Pattern - Managing long-running LLM requests