Streaming with Twilio

Integrating Play3.0 TTS with Twilio for Phone-Based AI Interactions

This guide walks through integrating PlayHT's Play3.0 Text-to-Speech (TTS) voice model with Twilio to build a phone-based AI interaction system. It uses ChatGPT to generate responses, but you can adapt it to work with other LLMs.

Prerequisites

  • Node.js installed on your system
  • An OpenAI API key
  • A PlayHT API key and User ID
  • A Twilio account with a phone number
  • ngrok for exposing your local server to the internet (for development)

Step 1: Set Up Your Project

  1. Create a new directory for your project and navigate to it:

    mkdir twilio-playht-ai-phone
    cd twilio-playht-ai-phone
    
  2. Initialize a new Node.js project:

    npm init -y
    
  3. Open the package.json file and add the following line to enable ES modules:

    {
      ...
      "type": "module",
      ...
    }
    
  4. Install the required dependencies:

    npm install openai playht dotenv express twilio
    

Step 2: Set Up Environment Variables

Create a .env file in your project root and add your API keys:

OPENAI_API_KEY=your_openai_api_key_here
PLAYHT_API_KEY=your_playht_api_key_here
PLAYHT_USER_ID=your_playht_user_id_here
TWILIO_ACCOUNT_SID=your_twilio_account_sid_here
TWILIO_AUTH_TOKEN=your_twilio_auth_token_here
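A missing or empty key tends to surface later as an authentication error in the middle of a call, so it can help to check the variables once at startup. The following is a minimal sketch of such a check; `findMissingEnvVars` and `REQUIRED_ENV_VARS` are hypothetical names introduced here, not part of any SDK.

```javascript
// Variable names taken from the .env file above.
const REQUIRED_ENV_VARS = [
  'OPENAI_API_KEY',
  'PLAYHT_API_KEY',
  'PLAYHT_USER_ID',
  'TWILIO_ACCOUNT_SID',
  'TWILIO_AUTH_TOKEN',
];

// Hypothetical helper: returns the names that are unset or blank in `env`.
function findMissingEnvVars(env, names = REQUIRED_ENV_VARS) {
  return names.filter(
    (name) => typeof env[name] !== 'string' || env[name].trim() === ''
  );
}

// Example with a partially filled environment object.
const missing = findMissingEnvVars({ OPENAI_API_KEY: 'sk-example', PLAYHT_API_KEY: '' });
console.log(missing);
```

In index.js, one option is to call `findMissingEnvVars(process.env)` right after `dotenv.config()` and exit with a clear message if the returned list is not empty.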

Step 3: Create the AI Response Generation Function

Create a file named generateAIResponse.js with the following content:

import OpenAI from 'openai';
import dotenv from 'dotenv';

dotenv.config();

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

export async function generateAIResponse(prompt) {
  try {
    const completion = await openai.chat.completions.create({
      messages: [{ role: "user", content: prompt }],
      model: "gpt-3.5-turbo",
    });

    return completion.choices[0].message.content;
  } catch (error) {
    console.error('Error generating AI response:', error);
    return "I'm sorry, I couldn't generate a response at this time.";
  }
}
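Spoken replies generally work better when they are short. One way to steer the model is to prepend a system message to the `messages` array passed to `openai.chat.completions.create`. The sketch below assumes a hypothetical `buildMessages` helper and an illustrative system prompt; neither comes from the OpenAI SDK.

```javascript
// Illustrative system prompt (an assumption; tune it for your use case).
const PHONE_SYSTEM_PROMPT =
  'You are a voice assistant on a phone call. Answer in one or two short sentences.';

// Hypothetical helper: builds the `messages` array for a chat completion request.
function buildMessages(prompt, systemPrompt = PHONE_SYSTEM_PROMPT) {
  return [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: String(prompt).trim() },
  ];
}

console.log(buildMessages('  What is the capital of France?  '));
```

Inside `generateAIResponse`, `messages: buildMessages(prompt)` would then take the place of the inline array.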

Step 4: Create the Text-to-Speech Function

Create a file named textToSpeech.js with the following content:

import * as PlayHT from 'playht';
import dotenv from 'dotenv';

dotenv.config();

PlayHT.init({
  apiKey: process.env.PLAYHT_API_KEY,
  userId: process.env.PLAYHT_USER_ID,
});

export async function textToSpeech(text) {
  try {
    const response = await PlayHT.generate(text, {
      voiceId: "s3://voice-cloning-zero-shot/801a663f-efd0-4254-98d0-5c175514c3e8/jennifer/manifest.json",
      voiceEngine: "Play3.0",
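      // 8 kHz mu-law matches the audio format used on telephone networks.
      outputFormat: 'mulaw',
      sampleRate: 8000,
    });

    // URL of the generated audio, later passed to Twilio's <Play> verb.
    return response.audioUrl;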
  } catch (error) {
    console.error('Error generating speech:', error);
    throw error;
  }
}
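With `outputFormat: 'mulaw'` and `sampleRate: 8000`, each audio sample occupies one byte, so one second of speech is roughly 8,000 bytes. The sketch below turns that arithmetic into a helper for estimating clip length. `mulawDurationSeconds` is a hypothetical name, and the calculation assumes headerless, single-channel mu-law data.

```javascript
// Hypothetical helper: estimates playback time of raw mono mu-law audio.
// mu-law stores one byte per sample, so duration = bytes / samples-per-second.
function mulawDurationSeconds(byteLength, sampleRate = 8000) {
  if (!Number.isFinite(byteLength) || byteLength < 0) {
    throw new RangeError('byteLength must be a non-negative number');
  }
  return byteLength / sampleRate;
}

console.log(mulawDurationSeconds(40000)); // prints 5
```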

Step 5: Create the Main Application

Create a file named index.js with the following content:

import express from 'express';
import dotenv from 'dotenv';
import twilio from 'twilio';
import { generateAIResponse } from './generateAIResponse.js';
import { textToSpeech } from './textToSpeech.js';

dotenv.config();

const app = express();
app.use(express.urlencoded({ extended: true }));

const twilioClient = twilio(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);

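// Answer the incoming call and collect the caller's speech.
app.post('/voice', (req, res) => {
  const twiml = new twilio.twiml.VoiceResponse();

  // <Gather input="speech"> posts the transcript to `action` as SpeechResult.
  // (<Record> transcriptions arrive asynchronously, after the call flow has
  // moved on, and TwiML returned to a transcribeCallback is ignored.)
  const gather = twiml.gather({
    input: 'speech',
    action: '/process-speech',
    speechTimeout: 'auto',
  });
  gather.say('Welcome to the AI Phone Assistant. Please ask your question.');

  // Reached only if the caller says nothing: start over.
  twiml.redirect('/voice');

  res.type('text/xml');
  res.send(twiml.toString());
});

// Turn the transcript into an AI reply and play it back to the caller.
app.post('/process-speech', async (req, res) => {
  const twiml = new twilio.twiml.VoiceResponse();
  const speech = req.body.SpeechResult;

  if (speech) {
    try {
      const aiResponse = await generateAIResponse(speech);
      const audioUrl = await textToSpeech(aiResponse);

      twiml.play(audioUrl);
      twiml.say('Thank you for using the AI Phone Assistant. Goodbye!');
    } catch (error) {
      // textToSpeech rethrows on failure; end the call gracefully instead of hanging.
      console.error('Error processing speech:', error);
      twiml.say("I'm sorry, something went wrong. Goodbye!");
    }
    twiml.hangup();
  } else {
    twiml.say("I'm sorry, I couldn't understand that. Please try again.");
    twiml.redirect('/voice');
  }

  res.type('text/xml');
  res.send(twiml.toString());
});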

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`Server is running on port ${PORT}`);
});
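The `/process-speech` handler waits on two network calls (the LLM request and the TTS request) before it replies, and Twilio gives up on a webhook that takes too long to answer. A common pattern is to race the slow work against a timer and fall back to a canned reply. The sketch below shows that pattern; `withTimeout` is a hypothetical helper, and the durations in the example are arbitrary.

```javascript
// Hypothetical helper: resolves with `fallback` if `promise` has not settled within `ms`.
// A rejection from `promise` is passed through unchanged.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Whichever settles first wins; clear the timer afterwards so it does not linger.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Example: a slow task loses the race, so the fallback is returned.
const slowTask = new Promise((resolve) => setTimeout(() => resolve('late answer'), 200));
withTimeout(slowTask, 50, 'fallback answer').then((result) => console.log(result));
// prints "fallback answer"
```

In the handler, the call to `generateAIResponse` could be wrapped the same way, for example `await withTimeout(generateAIResponse(text), 10000, "I'm sorry, that took too long.")`, where the 10-second budget is an assumption to adjust for your setup.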

Step 6: Set Up ngrok

  1. Install ngrok globally:

    npm install -g ngrok
    
  2. Start your Express server:

    node index.js
    
  3. In a new terminal window, start ngrok:

    ngrok http 3000
    
  4. Note the HTTPS URL provided by ngrok (e.g., https://your-ngrok-subdomain.ngrok.io).

Step 7: Configure Twilio

  1. Log in to your Twilio account.
  2. Navigate to the Phone Numbers section and select your Twilio phone number.
  3. In the Voice & Fax section, set the "A Call Comes In" webhook to:
    • Webhook: https://your-ngrok-subdomain.ngrok.io/voice
    • HTTP Method: POST
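The webhook value is the ngrok HTTPS URL from Step 6 joined with the `/voice` route defined in index.js. Because the ngrok URL may change when you restart ngrok, it can be convenient to compose the value in code. The sketch below uses a hypothetical `buildWebhookUrl` helper; rejecting non-HTTPS URLs is a choice made for this example, mirroring the HTTPS address noted in Step 6.

```javascript
// Hypothetical helper: joins the ngrok base URL with a route from index.js.
function buildWebhookUrl(baseUrl, path = '/voice') {
  const url = new URL(path, baseUrl);
  if (url.protocol !== 'https:') {
    throw new Error(`Expected an https URL, got ${url.protocol}`);
  }
  return url.toString();
}

console.log(buildWebhookUrl('https://your-ngrok-subdomain.ngrok.io'));
// prints "https://your-ngrok-subdomain.ngrok.io/voice"
```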

Step 8: Test Your Integration

  1. Call your Twilio phone number.
  2. After the greeting, speak your question or prompt.
  3. You should receive an AI-generated response spoken using the Play3.0 TTS voice.

Customization

  • To use a different voice, change the voiceId in the textToSpeech.js file.
  • To use a different LLM, modify the generateAIResponse function in generateAIResponse.js.
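Swapping the LLM is easier if the rest of the application depends only on an async function that maps a prompt to text. The sketch below wraps any such function with the same fallback message used in generateAIResponse.js. `createResponder` is a hypothetical factory, and the echo generator is a stand-in for a real LLM call.

```javascript
// Hypothetical factory: wraps any async text generator with the guide's fallback behavior.
function createResponder(
  generate,
  fallback = "I'm sorry, I couldn't generate a response at this time."
) {
  return async function respond(prompt) {
    try {
      const text = await generate(prompt);
      // Treat empty or non-string output as a failure.
      return typeof text === 'string' && text.trim() !== '' ? text : fallback;
    } catch (error) {
      console.error('Error generating AI response:', error);
      return fallback;
    }
  };
}

// Stand-in generator used only for this example; a real one would call your LLM's SDK.
const echoResponder = createResponder(async (prompt) => `You said: ${prompt}`);
echoResponder('hello').then((reply) => console.log(reply)); // prints "You said: hello"
```

With this shape, index.js would keep calling a single function while the provider-specific code stays inside the `generate` argument.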

Conclusion

You've now integrated PlayHT's Play3.0 TTS voice model with Twilio for phone-based AI interactions. With this setup, callers can talk to an AI system over the phone, with responses generated by ChatGPT and spoken using PlayHT's TTS.