How can I simulate real-time streaming transcription using OpenAI API? #2307

Unanswered
Santoshchodipilli asked this question in Q&A

I'm working on a project where I want to convert speech to text in real-time using OpenAI's Whisper model. I see that Whisper's hosted API (whisper-1) currently only supports batch mode — sending a full audio file and receiving the full transcript.

I'm trying to achieve a streaming-like transcription experience, where I can start receiving partial transcriptions as audio is still being recorded or uploaded.

Is there a way to simulate streaming transcription using Whisper?

I'm using Python.

I considered chunking the audio into small parts and sending them sequentially.

Is that the best approach, or is there a better method?

Also, is there any public roadmap or timeline for when the official OpenAI Whisper API might support real-time streaming transcription?

Thanks in advance!




You're on the right track emulating streaming transcription with Whisper; that's the best workaround available at the moment, since OpenAI's whisper-1 API only supports batch processing, not streaming.

Let me describe the approach:

  • Emulating Streaming with Whisper (Python)
    Breaking the audio into small segments (e.g., 2-5 seconds each) and passing them to Whisper one chunk at a time is the most common and practical way to simulate real-time transcription. Here's what you can do:

Method: Chunked Streaming Simulation
1. Record or read audio in chunks (e.g., from a microphone stream).
2. Store each chunk in a temporary buffer or file.
3. Send the chunk to the Whisper (whisper-1) API, or run it through the open-source model locally.
4. Display the transcription incrementally.
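
For a pre-recorded file, a minimal sketch of these steps could look like the following. It assumes pydub for splitting (which needs ffmpeg) and the legacy pre-1.0 openai SDK that exposes `openai.Audio.transcribe`; the file name `recording.mp3` is a placeholder.

```python
# A minimal sketch of the chunked approach for a pre-recorded file.
# Assumptions: pydub installed (pip install pydub, needs ffmpeg), the
# legacy pre-1.0 openai SDK, OPENAI_API_KEY set in the environment,
# and a placeholder file name "recording.mp3".
import io

import openai
from pydub import AudioSegment

CHUNK_MS = 5_000  # 5-second chunks, per the 2-5 s suggestion above

audio = AudioSegment.from_file("recording.mp3")  # len(audio) is in ms

for start in range(0, len(audio), CHUNK_MS):
    chunk = audio[start:start + CHUNK_MS]

    # Export the chunk to an in-memory buffer; the API infers the format
    # from the file name, so we attach one to the BytesIO object.
    buf = io.BytesIO()
    chunk.export(buf, format="mp3")
    buf.name = "chunk.mp3"
    buf.seek(0)

    result = openai.Audio.transcribe("whisper-1", buf)
    print(result["text"], end=" ", flush=True)  # show text incrementally
```

Each iteration prints its partial transcript as soon as the API call returns, which gives the streaming feel; what happens to words cut at chunk boundaries is discussed in the replies below.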

  • Python Utilities You May Use:
    pyaudio or sounddevice — for recording microphone audio in chunks.
    queue.Queue — for handling audio chunks asynchronously.
    openai.Audio.transcribe("whisper-1", audio_file) — for transcribing a chunk via the OpenAI API (legacy SDK; in openai>=1.0 the equivalent is client.audio.transcriptions.create).
    Or the open-source whisper package — for local inference with faster turnaround.
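
For the microphone case, here is a rough sketch tying those utilities together: sounddevice records fixed-length chunks, queue.Queue hands them to a worker thread, and the open-source whisper package transcribes locally. The chunk length, model size, and ten-chunk demo loop are illustrative choices, not requirements.

```python
# A rough sketch combining the utilities above: sounddevice for capture,
# queue.Queue for hand-off, and local open-source whisper for inference.
# Requires: pip install sounddevice openai-whisper
import queue
import threading

import sounddevice as sd
import whisper

SAMPLE_RATE = 16_000   # whisper expects 16 kHz mono float32 audio
CHUNK_SECONDS = 3      # illustrative chunk length
chunks: queue.Queue = queue.Queue()
model = whisper.load_model("base")  # illustrative model size

def worker() -> None:
    # Pull chunks off the queue and print each partial transcript.
    while True:
        audio = chunks.get()
        if audio is None:  # sentinel: no more chunks coming
            break
        result = model.transcribe(audio, fp16=False)
        print(result["text"], end=" ", flush=True)

t = threading.Thread(target=worker)
t.start()

# Capture ten 3-second chunks (about 30 s total) for the demo.
for _ in range(10):
    audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()                      # block until this chunk is recorded
    chunks.put(audio.squeeze())    # 1-D float32 array, as whisper expects

chunks.put(None)  # tell the worker to finish
t.join()
```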

If you're interested, I can help you set up a full real-time transcription pipeline.

@Santoshchodipilli

It's not about a microphone stream; I'm expecting a streaming response for a pre-recorded audio file, which is static.
So, as you said, if I chunk the audio into 2-5 second segments and a word falls exactly on a chunk boundary, that chunk's transcription will be wrong.

@pravakarp98

You can use an offset or overlap so that the last word of chunk A also appears at the start of chunk B, then merge the duplicated words at the seam.
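
To make that concrete, here is a minimal sketch of overlapping chunks with pydub, plus a naive merge that drops duplicated words at the seam. The 1-second overlap, 5-word matching window, and file name are all illustrative assumptions.

```python
# A minimal sketch of the overlap idea. Each chunk starts before the
# previous one ends, so a word cut at one boundary appears whole in the
# next chunk; merge() then drops the duplicated words at the seam.
# Assumptions: pydub installed, placeholder file name, naive word matching.
from pydub import AudioSegment

CHUNK_MS = 5_000     # 5-second chunks
OVERLAP_MS = 1_000   # each chunk repeats the last 1 s of the previous one
STEP_MS = CHUNK_MS - OVERLAP_MS

audio = AudioSegment.from_file("recording.mp3")
chunks = [audio[start:start + CHUNK_MS]
          for start in range(0, len(audio), STEP_MS)]

def merge(prev: str, nxt: str, max_words: int = 5) -> str:
    # Find the longest run of words shared by the tail of prev and the
    # head of nxt, and keep it only once.
    p, n = prev.split(), nxt.split()
    for k in range(min(max_words, len(p), len(n)), 0, -1):
        if [w.lower() for w in p[-k:]] == [w.lower() for w in n[:k]]:
            return " ".join(p + n[k:])
    return " ".join(p + n)  # no duplicate run found; just concatenate

# Transcribe each chunk (as in the earlier sketches) and fold the partial
# texts together:
#   transcript = ""
#   for text in partial_texts:
#       transcript = merge(transcript, text) if transcript else text
```

The naive word matching here will miss duplicates that differ in punctuation or casing beyond lowercasing; a real implementation might normalize tokens first, or use Whisper's word-level timestamps to cut at silence instead.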
