Realtime audio streaming using our Python SDK
Get started in two minutes with streaming through our Python SDK. You can also try out streaming with the Node.js SDK.
Model Version
For the lowest realtime latency, you must pass voice_engine="PlayHT2.0-turbo" when calling the Python SDK.
PlayHT2.0-turbo is the latest version of our conversational text-to-voice model, called Gargamel. It has improved quality and reliability, especially when handling acronyms, emails, phone numbers, addresses, etc.
Get your Credentials
To use the API, you will need an API key and a user ID. You can easily generate both; check this guide for a how-to.
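Once you have both values, one common way to keep them out of source code is to read them from environment variables. The sketch below is only an illustration: the variable names PLAY_HT_USER_ID and PLAY_HT_API_KEY and the load_credentials helper are assumptions made for this example, not something the SDK defines.

```python
import os
from typing import Mapping, Tuple


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (user_id, api_key) read from the given environment mapping.

    The variable names are hypothetical; use whatever names suit your setup.
    """
    user_id = env.get("PLAY_HT_USER_ID", "").strip()
    api_key = env.get("PLAY_HT_API_KEY", "").strip()

    # Collect the names of any variables that are unset or empty.
    missing = [
        name
        for name, value in (
            ("PLAY_HT_USER_ID", user_id),
            ("PLAY_HT_API_KEY", api_key),
        )
        if not value
    ]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return user_id, api_key
```

The returned pair can then be passed to the client in place of the placeholder strings, for example Client(*load_credentials()).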
Set Up
First, install the pyht SDK package using pip:
pip install pyht
Let's Stream
Let's stream some sentences and see how quickly we get that first chunk of data for each one. Heads up: you'll need your credentials for PlayHT. Swap out the placeholders in the code.
asyncio and Flask
This example uses the sync client, but our SDK also supports asyncio. Check this demo for an example.
# import the PlayHT SDK
from pyht import Client, TTSOptions, Format

# initialize the PlayHT client with your credentials
client = Client("<YOUR_PLAY_HT_USER_ID>", "<YOUR_PLAY_HT_API_KEY>")

# configure your stream
options = TTSOptions(
    # this voice id can be one of our prebuilt voices or your own voice clone id;
    # refer to the `listVoices()` method for a list of supported voices
    voice="s3://voice-cloning-zero-shot/d9ff78ba-d016-47f6-b0ef-dd630f59414e/female-cs/manifest.json",
    # you can pass any value between 8000 and 48000; 24000 is the default
    sample_rate=44_100,
    # the generated audio encoding, supports 'raw' | 'mp3' | 'wav' | 'ogg' | 'flac' | 'mulaw'
    format=Format.FORMAT_MP3,
    # playback rate of generated speech
    speed=1,
)

# start streaming!
text = "Hey, this is Jennifer from Play. Please hold on a moment, let me just pull up your details real quick."

# you must use the turbo voice engine for the best latency
for chunk in client.tts(text=text, voice_engine="PlayHT2.0-turbo", options=options):
    # do whatever you want with each chunk: save it to a file, stream it in
    # realtime to a browser or app, or send it to a telephony system
    pass
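To actually see how quickly the first chunk arrives, and to do something with the audio instead of discarding it, the loop can be wrapped in a small helper. The following is a minimal sketch, not part of the SDK: FirstChunkTimer and save_stream are hypothetical names, and the code only assumes that the stream is an iterable of bytes objects.

```python
import time
from typing import Iterable, Iterator, Optional


class FirstChunkTimer:
    """Wraps an iterable of audio chunks and records basic stream statistics."""

    def __init__(self, chunks: Iterable[bytes]) -> None:
        self._chunks = chunks
        # Seconds between the start of iteration and the first chunk.
        self.first_chunk_seconds: Optional[float] = None
        # Running total of bytes seen so far.
        self.total_bytes = 0

    def __iter__(self) -> Iterator[bytes]:
        start = time.perf_counter()
        for chunk in self._chunks:
            if self.first_chunk_seconds is None:
                self.first_chunk_seconds = time.perf_counter() - start
            self.total_bytes += len(chunk)
            yield chunk


def save_stream(chunks: Iterable[bytes], path: str) -> FirstChunkTimer:
    """Write every chunk to `path` and return the timer with its statistics."""
    timer = FirstChunkTimer(chunks)
    with open(path, "wb") as audio_file:
        for chunk in timer:
            audio_file.write(chunk)
    return timer
```

With the client from the example above, a call such as save_stream(client.tts(text=text, voice_engine="PlayHT2.0-turbo", options=options), "output.mp3") would write the audio to disk, after which first_chunk_seconds holds the measured delay. Note that the clock starts when iteration begins, so any work the client does before returning the stream is not included in the measurement.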
Input Streaming
The .tts method accepts either a single string or a List[str] as input, so you can pass it the output of any LLM, such as ChatGPT, as you can see in this example.
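LLM output usually arrives as small text fragments rather than whole sentences, so it can help to regroup the fragments before handing them to .tts as a List[str]. The sketch below shows one naive way to do that. The group_into_sentences helper is hypothetical, and splitting on ".", "!" and "?" is a simplifying assumption that will mis-split abbreviations and decimal numbers.

```python
from typing import Iterable, List

# Characters treated as the end of a sentence (a deliberately simple rule).
SENTENCE_ENDINGS = (".", "!", "?")


def group_into_sentences(tokens: Iterable[str]) -> List[str]:
    """Concatenate streamed text fragments and split them into sentences."""
    sentences: List[str] = []
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush the buffer whenever it ends with sentence punctuation.
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            sentences.append(buffer.strip())
            buffer = ""
    # Keep any trailing text that never reached a sentence ending.
    if buffer.strip():
        sentences.append(buffer.strip())
    return sentences
```

The resulting list can be passed as the text argument, for example client.tts(text=group_into_sentences(tokens), voice_engine="PlayHT2.0-turbo", options=options).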
Demos
Stream audio with ChatGPT
Stream audio to a local file
Streaming Options
The full list of options you can use to control the generated audio.
Parameter | Type | Values | Description |
---|---|---|---|
voice_engine | string | 'PlayHT2.0-turbo' or 'PlayHT2.0' | Specifies the voice model to be used for speech synthesis. You must use 'PlayHT2.0-turbo' for the lowest latency. |
voice | string | _ | Identifier for the voice to be used to synthesize the text. Refer to the /voices HTTP endpoint for a list of all prebuilt voices, or cloned-voices/instant/ for your instant cloned voices. |
sample_rate | number | A number greater than or equal to 8000 and less than or equal to 48000. | Sample rate for the output audio. |
format | string | 'raw', 'mp3', 'wav', 'ogg', 'flac', or 'mulaw' | The format in which the output audio should be generated. Defaults to 'mp3'. |
speed | number | A number greater than 0 and less than or equal to 5.0. | Controls how fast the generated audio should be. |
temperature | number | A floating point number between 0, inclusive, and 2, inclusive. | Controls variance. Lower temperatures result in more predictable results. Higher temperatures allow each run to vary more, creating voices that sound less like the baseline. |
voice_guidance | number | A number between 1 and 6. | Use lower numbers to reduce how unique your chosen voice will be compared to other voices. Higher numbers will maximize its individuality. |
text_guidance | number | A floating point number between 0 and 2. | This number influences how closely the generated speech adheres to the input text. Use lower values to create more fluid speech, but with a higher chance of deviating from the input text. Higher numbers will make the generated speech more accurate to the input text, ensuring that the words spoken align closely with the provided text. |
seed | number | An integer number greater than or equal to 0. If equal to null or not provided, a random seed will be used every time. | Controls the reproducibility of the generated audio. Assuming all other properties didn't change, a fixed seed will generate the exact same audio file given the same text and voiceId. |
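The ranges in the table can be checked before a request is sent. The following is a rough sketch of such a check, written against a plain dictionary rather than the SDK's own option type. validate_options is a hypothetical helper, and treating the bounds of voice_guidance and text_guidance as inclusive is an assumption, since the table only says "between".

```python
from typing import Any, Callable, Dict, List

VOICE_ENGINES = {"PlayHT2.0-turbo", "PlayHT2.0"}
FORMATS = {"raw", "mp3", "wav", "ogg", "flac", "mulaw"}


def _number_between(low: float, high: float, include_low: bool = True) -> Callable[[Any], bool]:
    """Build a check for a real number in (low, high] or [low, high]."""
    def ok(value: Any) -> bool:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        above = value >= low if include_low else value > low
        return above and value <= high
    return ok


def _non_negative_int(value: Any) -> bool:
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


# One (check, human-readable rule) pair per parameter in the table.
RULES: Dict[str, Any] = {
    "voice_engine": (lambda v: v in VOICE_ENGINES, "'PlayHT2.0-turbo' or 'PlayHT2.0'"),
    "sample_rate": (_number_between(8000, 48000), "8000 <= value <= 48000"),
    "format": (lambda v: v in FORMATS, "one of " + ", ".join(sorted(FORMATS))),
    "speed": (_number_between(0, 5.0, include_low=False), "0 < value <= 5.0"),
    "temperature": (_number_between(0, 2), "0 <= value <= 2"),
    "voice_guidance": (_number_between(1, 6), "1 <= value <= 6 (inclusive assumed)"),
    "text_guidance": (_number_between(0, 2), "0 <= value <= 2 (inclusive assumed)"),
    "seed": (_non_negative_int, "an integer >= 0"),
}


def validate_options(opts: Dict[str, Any]) -> List[str]:
    """Return one message per out-of-range option; missing or None values pass."""
    errors: List[str] = []
    for name, (ok, rule) in RULES.items():
        value = opts.get(name)
        if value is not None and not ok(value):
            errors.append(f"{name}={value!r}: expected {rule}")
    return errors
```

An empty list means every supplied option falls inside the documented range. Options that are omitted are skipped so that the service defaults still apply.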
That is all you need to get started with realtime streaming through our Python SDK. If you need support, reach out to us at [email protected] or join us on Discord.