OpenAI Whisper API: Hands-on Guide

This guide walks you through getting started with the OpenAI Whisper API. Whisper is an AI model from OpenAI that converts speech to text.

Get your OpenAI API Key

First, you need an OpenAI account. If you don’t have one, here is a guide on how to get OpenAI API access.

Then, log in to your OpenAI account and get your API key from the Your API keys page.

Keep the API key somewhere safe.

Setup

Install the OpenAI Python module. You need Python 3.x and version 0.27.0 or above of the OpenAI Python module.

You can install it using the command below:

pip install openai
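
To confirm which version you have installed, you can print the package metadata:

pip show openai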

Call OpenAI Whisper API

You can use the code below to make your first call to the OpenAI Whisper API and transcribe a file. Make sure to replace the text “your-api-key” with your actual OpenAI API key.

import openai

openai.api_key = "your-api-key"

# Open the audio file in binary mode
audio_file = open("/path/to/file/audio.mp3", "rb")

# Transcribe the audio with the whisper-1 model
transcript = openai.Audio.transcribe("whisper-1", audio_file)

print(transcript)

Make sure to pass the path to your actual audio file in the open() call above. Once you run it, you will get the transcription of the audio, like this:

{ "text": "Imagine the wildest idea that you've ever had, and you're curious. .... }

Languages Supported by OpenAI Whisper API

The OpenAI Whisper API supports the following languages:

Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

Pricing for OpenAI Whisper API

The Whisper API costs $0.006 per minute of audio, rounded to the nearest second. For example, a 10-minute recording costs about $0.06.

OpenAI Whisper API Options

file (string, required): The audio file to transcribe, in one of these formats: mp3, mp4, mpeg, mpga, m4a, wav, or webm.

model (string, required): ID of the model to use. Only whisper-1 is currently available.

prompt (string, optional): Text to guide the model’s style or continue a previous audio segment. The prompt should match the audio language.

response_format (string, optional, defaults to json): The format of the transcript output: json, text, srt, verbose_json, or vtt.

temperature (number, optional, defaults to 0): The sampling temperature, between 0 and 1. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. If set to 0, the model uses log probability to automatically increase the temperature until certain thresholds are hit.

language (string, optional): The language of the input audio. Supplying the input language in ISO-639-1 format (e.g., en for English) will improve accuracy and latency.
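
As a rough sketch of how these options fit together, the optional parameters are passed as keyword arguments to openai.Audio.transcribe (reusing the audio_file handle from the example above):

transcript = openai.Audio.transcribe(
    "whisper-1",
    audio_file,
    response_format="verbose_json",  # include timing and segment details
    temperature=0,                   # deterministic output
    language="en",                   # ISO-639-1 code for the audio language
)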

Prompting

In the same way that you can use a prompt to influence the output of OpenAI’s language models, you can also use one to improve the quality of the transcripts generated by the Whisper API. The model will try to match the style of the prompt, so it is more likely to use capitalization and punctuation if the prompt does too. Here are some examples of how prompting can help in different scenarios:

Prompts can be very helpful for correcting specific words or acronyms that the model often misrecognizes in the audio. For example, the following prompt improves the transcription of the words DALL·E and GPT-3, which were previously written as “GDP 3” and “DALI”.

The transcript is about OpenAI which makes technology like DALL·E, GPT-3, and ChatGPT with the hope of one day building an AGI system that benefits all of humanity
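
To apply this, you would pass the text above through the prompt parameter, something like:

transcript = openai.Audio.transcribe(
    "whisper-1",
    audio_file,
    prompt="The transcript is about OpenAI which makes technology like DALL·E, GPT-3, and ChatGPT with the hope of one day building an AGI system that benefits all of humanity",
)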

To preserve the context of a file that was split into segments, you can prompt the model with the transcript of the preceding segment. This will make the transcript more accurate, as the model will use the relevant information from the previous audio. The model will only consider the final 224 tokens of the prompt and ignore anything earlier.
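
A minimal sketch of this pattern, assuming the audio has already been split into files named part1.mp3 and part2.mp3 (hypothetical names) and that openai is imported and configured as in the earlier example:

# Hypothetical segment files; replace with your actual split audio
segment_files = ["part1.mp3", "part2.mp3"]

previous_text = ""
for path in segment_files:
    with open(path, "rb") as f:
        # Feed the previous segment's transcript as the prompt;
        # only its final 224 tokens are considered
        result = openai.Audio.transcribe("whisper-1", f, prompt=previous_text)
    previous_text = result["text"]
    print(previous_text)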

Sometimes the model might skip punctuation in the transcript. You can avoid this by using a simple prompt that includes punctuation:

Hello, welcome to my lecture.

The model may also leave out common filler words in the audio. If you want to keep the filler words in your transcript, you can use a prompt that contains them:

Umm, let me think like, hmm… Okay, here’s what I’m, like, thinking.

Some languages can be written in different ways, such as simplified or traditional Chinese. The model might not always use the writing style that you want for your transcript by default. You can improve this by using a prompt in your preferred writing style.
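
For example, to nudge the output toward simplified Chinese, you could pass a short prompt written in simplified characters (the prompt text here is just an illustration):

transcript = openai.Audio.transcribe(
    "whisper-1",
    audio_file,
    prompt="以下是普通话的句子。",  # "The following are sentences in Mandarin," in simplified Chinese
)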