Text to Speech

Headers

api-subscription-key

string

default:""

Your unique subscription key for authenticating requests to the Sarvam AI Text-to-Speech API. Here are the steps to get your API key.

Body

application/json

target_language_code

enum<string>

required

The language of the text, in BCP-47 format.

Available options: bn-IN, en-IN, gu-IN, hi-IN, kn-IN, ml-IN, mr-IN, od-IN, pa-IN, ta-IN, te-IN

text

string | null

The text(s) to be converted into speech.

Features:

Each text should be no longer than 1500 characters
Supports code-mixed text (English and Indic languages)

Important Note:

For numbers with more than 4 digits, use commas (e.g., '10,000' instead of '10000').
This ensures the number is pronounced as a whole number.

Required string length: 1 - 1500
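The length limit and the comma rule above can be applied before a request is sent. The following is a minimal sketch, not part of the API: the helper names `add_thousands_separators` and `prepare_text` are hypothetical, and the sketch assumes that every standalone run of five or more digits without a leading zero is a quantity (it will also reformat IDs or phone numbers that happen to look like one).

```python
import re

MAX_TEXT_LENGTH = 1500  # from "Required string length: 1 - 1500"

# A run of 5+ digits that does not start with 0, is not glued to other
# digits, and is not part of a decimal or an already comma-grouped number.
_LONG_NUMBER = re.compile(r"(?<![\d.,])[1-9]\d{4,}(?!\d)(?![.,]\d)")


def add_thousands_separators(text: str) -> str:
    """Rewrite e.g. '10000' as '10,000' (hypothetical helper)."""
    return _LONG_NUMBER.sub(lambda m: f"{int(m.group(0)):,}", text)


def prepare_text(text: str) -> str:
    """Add separators, then enforce the documented 1-1500 character limit."""
    prepared = add_thousands_separators(text)
    # Check the length after rewriting, because the commas add characters.
    if not 1 <= len(prepared) <= MAX_TEXT_LENGTH:
        raise ValueError(
            f"text must be 1-{MAX_TEXT_LENGTH} characters, got {len(prepared)}"
        )
    return prepared
```

Longer inputs would have to be split into several requests by the caller; how to split them is outside what this section specifies.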

speaker

enum<string> | null

default:anushka

The speaker voice to be used for the output audio.

Default: Anushka

Model Compatibility (Speakers compatible with respective models):

bulbul:v1 (will be deprecated soon):
- Female: Diya, Maya, Meera, Pavithra, Maitreyi, Misha
- Male: Amol, Arjun, Amartya, Arvind, Neel, Vian
bulbul:v2:
- Female: Anushka, Manisha, Vidya, Arya
- Male: Abhilash, Karun, Hitesh

Note: Speaker selection must match the chosen model version.

Available options: meera, pavithra, maitreyi, arvind, amol, amartya, diya, neel, misha, vian, arjun, maya, anushka, abhilash, manisha, vidya, arya, karun, hitesh
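Because the speaker must match the model version, a client can check the pairing locally before calling the API. This is a sketch only: the table is transcribed from the compatibility list above, `check_speaker` is a hypothetical helper, and lowercasing the name is an assumption based on the lowercase spellings in the available options.

```python
# Speaker names per model, transcribed from the compatibility list above.
SPEAKERS_BY_MODEL = {
    "bulbul:v1": {
        "diya", "maya", "meera", "pavithra", "maitreyi", "misha",  # female
        "amol", "arjun", "amartya", "arvind", "neel", "vian",      # male
    },
    "bulbul:v2": {
        "anushka", "manisha", "vidya", "arya",  # female
        "abhilash", "karun", "hitesh",          # male
    },
}


def check_speaker(model: str, speaker: str) -> str:
    """Return the lowercase speaker name, or raise if it does not fit the model."""
    if model not in SPEAKERS_BY_MODEL:
        raise ValueError(f"unknown model: {model!r}")
    name = speaker.lower()
    if name not in SPEAKERS_BY_MODEL[model]:
        allowed = ", ".join(sorted(SPEAKERS_BY_MODEL[model]))
        raise ValueError(f"{speaker!r} is not available for {model}; use one of: {allowed}")
    return name
```

For example, `check_speaker("bulbul:v2", "Anushka")` returns `"anushka"`, while `check_speaker("bulbul:v2", "meera")` raises a `ValueError`.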

pitch

number | null

default:0

Controls the pitch of the audio. Lower values result in a deeper voice, while higher values make it sharper. The suitable range is between -0.75 and 0.75. Default is 0.0.

Required range: -1 <= x <= 1

pace

number | null

default:1

Controls the speed of the audio. Lower values result in slower speech, while higher values make it faster. The suitable range is between 0.5 and 2.0. Default is 1.0.

Required range: 0.3 <= x <= 3

loudness

number | null

default:1

Controls the loudness of the audio. Lower values result in quieter audio, while higher values make it louder. The suitable range is between 0.3 and 3.0. Default is 1.0.

Required range: 0.1 <= x <= 3
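pitch, pace, and loudness each have a required range and a narrower (or equal) suitable range. The sketch below shows one way to treat the two differently: reject values outside the required range and only warn about values outside the suitable range. The function name `check_prosody` and the warn-versus-reject policy are assumptions for illustration, not API behavior.

```python
from typing import List

# (low, high) bounds copied from the pitch, pace, and loudness fields above.
REQUIRED_RANGE = {"pitch": (-1.0, 1.0), "pace": (0.3, 3.0), "loudness": (0.1, 3.0)}
SUITABLE_RANGE = {"pitch": (-0.75, 0.75), "pace": (0.5, 2.0), "loudness": (0.3, 3.0)}


def check_prosody(name: str, value: float) -> List[str]:
    """Raise if value is outside the required range; return warnings otherwise."""
    low, high = REQUIRED_RANGE[name]
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside the required range {low} to {high}")
    warnings = []
    s_low, s_high = SUITABLE_RANGE[name]
    if not s_low <= value <= s_high:
        warnings.append(f"{name}={value} is outside the suitable range {s_low} to {s_high}")
    return warnings
```

With these tables, a pitch of 0.9 is accepted with one warning, and a loudness of 5 is rejected.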

speech_sample_rate

enum<integer> | null

default:22050

Specifies the sample rate of the output audio. Supported values are 8000, 16000, 22050, 24000 Hz. If not provided, the default is 22050 Hz.

Available options: 8000, 16000, 22050, 24000

enable_preprocessing

boolean

default:false

Controls whether normalization of English words and numeric entities (e.g., numbers, dates) is performed. Set to true for better handling of mixed-language text. Default is false.

model

enum<string>

Specifies the model to use for text-to-speech conversion. Default is bulbul:v2.

Available options: bulbul:v1, bulbul:v2
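Putting the header and body fields together, a request could be assembled as sketched below with Python's standard library. The endpoint URL is not given in this section, so it is passed in as a parameter, and the URL in the usage note is a placeholder. `build_tts_request` is a hypothetical helper, and the use of POST is an assumption based on the JSON body. Optional fields set to `None` are left out so that the documented defaults apply.

```python
import json
import urllib.request


def build_tts_request(endpoint_url: str, api_key: str, text: str,
                      target_language_code: str, **options) -> urllib.request.Request:
    """Assemble (but do not send) a JSON request for the text-to-speech endpoint."""
    body = {"text": text, "target_language_code": target_language_code}
    # Optional fields: speaker, pitch, pace, loudness, speech_sample_rate,
    # enable_preprocessing, model. Skip any that were passed as None.
    body.update({key: value for key, value in options.items() if value is not None})
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "api-subscription-key": api_key,  # header name from the Headers section
            "Content-Type": "application/json",
        },
        method="POST",  # assumption: the section does not state the HTTP method
    )
```

A request built as `build_tts_request("https://example.invalid/text-to-speech", key, "Hello", "en-IN", model="bulbul:v2", speaker="anushka")` can then be sent with `urllib.request.urlopen(request)` and the response parsed with `json.load`.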

Response

Successful Response

request_id

string | null

required

audios

string[]

required

The output audio files in WAV format, encoded as base64 strings. Each string corresponds to one of the input texts.
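Since each entry in audios is a base64-encoded WAV file, the response can be turned back into audio bytes with the standard library. The sketch below assumes the response has already been parsed into a dict and that the WAV data is uncompressed PCM, which is what Python's `wave` module can read. `decode_audios` and `wav_info` are hypothetical helper names.

```python
import base64
import io
import wave
from typing import List, Tuple


def decode_audios(response_json: dict) -> List[bytes]:
    """Decode every base64 string in 'audios' into raw WAV file bytes."""
    return [base64.b64decode(encoded) for encoded in response_json["audios"]]


def wav_info(wav_bytes: bytes) -> Tuple[int, float]:
    """Return (sample rate in Hz, duration in seconds) for one WAV clip."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as clip:
        rate = clip.getframerate()
        return rate, clip.getnframes() / rate
```

Each decoded clip can be written straight to disk, for example with `open("output_0.wav", "wb").write(clips[0])`, and the sample rate reported by `wav_info` can be compared with the requested speech_sample_rate.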
