Headers
Your unique subscription key for authenticating requests to the Sarvam AI Text-to-Speech API. Here are the steps to get your API key.
Body
The language of the text, in BCP-47 format.
bn-IN, en-IN, gu-IN, hi-IN, kn-IN, ml-IN, mr-IN, od-IN, pa-IN, ta-IN, te-IN
The text(s) to be converted into speech.
Features:
- Each text should be no longer than 1500 characters
- Supports code-mixed text (English and Indic languages)
Important Note:
- For numbers larger than 4 digits, use commas (e.g., '10,000' instead of '10000')
- This ensures proper pronunciation as a whole number
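The two text rules above (the 1500-character limit and commas in numbers longer than 4 digits) can be checked before a request is sent. The following is a minimal sketch, not part of the API: the helper names `add_thousands_separators` and `check_text_length` are hypothetical, the regex assumes every bare run of five or more digits is a whole number, and the length check assumes the limit counts Unicode code points.

```python
import re

MAX_TEXT_CHARS = 1500  # per-text limit stated in this section

# Assumption: a bare run of 5+ digits that is not part of a word, a decimal
# fraction, or an already comma-separated number is a whole number.
_LONG_NUMBER = re.compile(r"(?<![\w.,])[1-9]\d{4,}(?!\w)")


def add_thousands_separators(text: str) -> str:
    """Rewrite e.g. '10000' as '10,000' so it is read as a whole number."""
    return _LONG_NUMBER.sub(lambda m: f"{int(m.group()):,}", text)


def check_text_length(text: str) -> str:
    """Raise ValueError if the text is empty or longer than the limit."""
    if not 1 <= len(text) <= MAX_TEXT_CHARS:
        raise ValueError(
            f"text must be 1 to {MAX_TEXT_CHARS} characters, got {len(text)}"
        )
    return text


prepared = check_text_length(add_thousands_separators("Pay 10000 rupees"))
```

Digit runs that are not quantities (phone numbers, PIN codes) would also be rewritten by this sketch, so it may need to be applied selectively.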
1 - 1500
The speaker voice to be used for the output audio.
Default: Anushka
Model Compatibility (Speakers compatible with respective models):
- bulbul:v1 (will be deprecated soon):
  - Female: Diya, Maya, Meera, Pavithra, Maitreyi, Misha
  - Male: Amol, Arjun, Amartya, Arvind, Neel, Vian
- bulbul:v2:
  - Female: Anushka, Manisha, Vidya, Arya
  - Male: Abhilash, Karun, Hitesh
Note: Speaker selection must match the chosen model version.
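Because the speaker must match the model version, a client can check the pairing locally before calling the API. This is a sketch under assumptions: the mapping is copied from the compatibility list in this section, the lowercase identifiers follow the allowed-values list, and `check_speaker` is a hypothetical helper rather than anything the API provides.

```python
# Speakers per model, taken from the compatibility list in this section.
SPEAKERS_BY_MODEL = {
    "bulbul:v1": {
        "diya", "maya", "meera", "pavithra", "maitreyi", "misha",  # female
        "amol", "arjun", "amartya", "arvind", "neel", "vian",      # male
    },
    "bulbul:v2": {
        "anushka", "manisha", "vidya", "arya",  # female
        "abhilash", "karun", "hitesh",          # male
    },
}


def check_speaker(model: str, speaker: str) -> str:
    """Return the lowercase speaker id, or raise if it does not fit the model."""
    if model not in SPEAKERS_BY_MODEL:
        raise ValueError(f"unknown model: {model!r}")
    speaker_id = speaker.lower()
    if speaker_id not in SPEAKERS_BY_MODEL[model]:
        allowed = ", ".join(sorted(SPEAKERS_BY_MODEL[model]))
        raise ValueError(
            f"{speaker!r} is not available for {model}; choose one of: {allowed}"
        )
    return speaker_id
```

For example, `check_speaker("bulbul:v2", "Anushka")` returns `"anushka"`, while `check_speaker("bulbul:v2", "meera")` raises `ValueError` because Meera is listed only under bulbul:v1.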
meera, pavithra, maitreyi, arvind, amol, amartya, diya, neel, misha, vian, arjun, maya, anushka, abhilash, manisha, vidya, arya, karun, hitesh
Controls the pitch of the audio. Lower values result in a deeper voice, while higher values make it sharper. The suitable range is between -0.75 and 0.75. Default is 0.0.
-1 <= x <= 1
Controls the speed of the audio. Lower values result in slower speech, while higher values make it faster. The suitable range is between 0.5 and 2.0. Default is 1.0.
0.3 <= x <= 3
Controls the loudness of the audio. Lower values result in quieter audio, while higher values make it louder. The suitable range is between 0.3 and 3.0. Default is 1.0.
0.1 <= x <= 3
Specifies the sample rate of the output audio. Supported values are 8000, 16000, 22050, 24000 Hz. If not provided, the default is 22050 Hz.
8000, 16000, 22050, 24000
Controls whether normalization of English words and numeric entities (e.g., numbers, dates) is performed. Set to true for better handling of mixed-language text. Default is false.
Specifies the model to use for text-to-speech conversion. Default is bulbul:v2.
bulbul:v1, bulbul:v2
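The body parameters described in this section can be assembled and range-checked in one place. The sketch below is illustrative only: this section does not name the JSON fields, so every key in the returned dictionary (`text`, `target_language_code`, `speaker`, `pitch`, `pace`, `loudness`, `speech_sample_rate`, `enable_preprocessing`, `model`) is an assumption to be checked against the API reference, and `build_tts_payload` is a hypothetical helper. The limits and defaults are the ones stated above. No request is sent.

```python
import json

LANGUAGES = {
    "bn-IN", "en-IN", "gu-IN", "hi-IN", "kn-IN", "ml-IN",
    "mr-IN", "od-IN", "pa-IN", "ta-IN", "te-IN",
}
SAMPLE_RATES = {8000, 16000, 22050, 24000}
MODELS = {"bulbul:v1", "bulbul:v2"}


def _check_range(name, value, low, high):
    """Raise ValueError when value falls outside the inclusive range."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def build_tts_payload(text, language, speaker="anushka", pitch=0.0, pace=1.0,
                      loudness=1.0, sample_rate=22050, preprocessing=False,
                      model="bulbul:v2"):
    """Validate the documented limits and return a request body as a dict."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    _check_range("pitch", pitch, -1, 1)
    _check_range("pace", pace, 0.3, 3)
    _check_range("loudness", loudness, 0.1, 3)
    # Key names below are assumptions; the section does not list them.
    return {
        "text": text,
        "target_language_code": language,
        "speaker": speaker,
        "pitch": pitch,
        "pace": pace,
        "loudness": loudness,
        "speech_sample_rate": sample_rate,
        "enable_preprocessing": preprocessing,
        "model": model,
    }


payload = build_tts_payload("Hello, how are you?", "en-IN", pace=1.2)
body = json.dumps(payload)  # JSON string ready to send as the request body
```

Validating locally keeps out-of-range values such as `pitch=1.5` from reaching the API. Note that the hard limits checked here are wider than the "suitable" ranges given in the parameter descriptions.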