#gpt-realtime
I just followed this tutorial and everything was working until recently
https://www.youtube.com/watch?v=XX-ET_-onYU&t=488s
OpenAI has done some fantastic things. Whisper is a great project open to the public. Transcribe (Turn audio into text) for MANY languages, all completely for free and all from your computer. No subscription, no fees, nothing. Your data, your computer, your free unlimited transcription.
Whisper AI install guide: https://hub.tcno.co/ai/whisper/i...
So you installed it with the pip package manager, then.
How about you start with pip list?
And you are using it via Rust?
Well, thanks for the site. I am using it directly in Linux, with a simple command.
It looks helpful; I wanted to try out TensorFlow anyway.
Yesterday I tried uninstall all Whisper and I used command in powershell: iex (irm whisper.tc.ht) but it got worse...
I will try again do the whole installation
Btw are you using Python 3.9.9 or the latest?
Well i thought whisper was not for windows. So no clue. Probably a windows problem.
i have 3.12.1 installed
my whisper is not in the package list
I would try it with an absolute path if I were you.
For example, I first added whisper to the PATH.
Then I wrote a script with the help of ChatGPT.
Example .sh file:
#!/bin/bash
nummer=220
endnummer=233
pfad="/home/Nutzername/Dokumente/whisper_using/jenny/22_folgend/"
datei="${pfad}${nummer}.mp3"
model="base"
output_format="vtt"
target_location=""
echo "datei DATEI DATEI DATEI= ${datei} "
whisper "$datei" --model "$model" --output_format "$output_format"
# Loop through the numbers
while [ $nummer -le $endnummer ]
do
  # Build the file name
  datei="${pfad}${nummer}.mp3"
  # Run the command
  echo "Running whisper for $datei"
  whisper "$datei" --model "$model" --output_format "$output_format" --output_dir "$pfad"
  # Increment the number for the next iteration
  nummer=$((nummer + 1))
done
whisper /home/Nutzername/Dokumente/whisper_using/jenny/12_jenny.mp3 --model base --output_format vtt
@restive reef
YES!!!
I have it
It works
I can't have any space in this line of names
and I have to do it in download folder
Thanks for help ❤️
Deez
wondering if there's a way to use whisper to transcribe only the primary speaker in a file? Imagine something really long like a lecture with questions from the audience at the end. Primary speaker has way more presence. I'd like to transcribe just the speaker.
anything open in the space capable of doing that?
Hello all. I have created a python whisper app for my company to transcribe videos. Right now it works great with given videos on the server, however, I would like the option to re-process specific parts of the videos where the transcription gave no results or hallucinations, for example, in a 1hour long video, I would like to be able to say 'reprocess from 00:05:15 to 00:07:20'.
Is there some way I can give these parameters directly to whisper?
Hello. I think the only option is to cut the video yourself (using python.. or something else), give the new video to whisper and change it then in the transcript.
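A minimal sketch of that cut-and-reprocess idea, assuming ffmpeg is installed and that transcript segments are dicts with `start`/`end` in seconds (as in Whisper's verbose output). The helper names are made up for illustration; nothing here is a Whisper parameter.

```python
import subprocess

def to_seconds(timestamp):
    """Convert an 'HH:MM:SS' string to seconds."""
    hours, minutes, seconds = (float(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def build_cut_command(source, start, end, clip_path):
    """Build an ffmpeg command that extracts [start, end] as 16 kHz mono audio."""
    start_s = to_seconds(start)
    duration = to_seconds(end) - start_s
    return ["ffmpeg", "-y", "-i", source, "-ss", str(start_s), "-t", str(duration),
            "-vn", "-ac", "1", "-ar", "16000", clip_path]

def cut_clip(source, start, end, clip_path="clip.wav"):
    """Run ffmpeg; the resulting clip can be given to Whisper like any other file."""
    subprocess.run(build_cut_command(source, start, end, clip_path), check=True)
    return clip_path

def shift_segments(segments, offset):
    """Move clip-relative timestamps back onto the full video's timeline."""
    return [{**seg, "start": seg["start"] + offset, "end": seg["end"] + offset}
            for seg in segments]
```

For "reprocess from 00:05:15 to 00:07:20" you would cut the clip, transcribe it, call `shift_segments(new_segments, to_seconds("00:05:15"))`, and swap the result into the original transcript.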
Never put spaces in your folder names if you work with command-line programs. That's why I use underscores.
E
Hi, I have an error with my code.
Here is my code:
const openai = new OpenAI({ apiKey: "my-openai-api" });
async function transcription() {
console.log("Démarrage de la transcription...");
const transcription = await openai.audio.transcriptions.create({
file: fs.createReadStream(`audio_${time}.mp3`), // backticks are needed for ${time} to be interpolated
model: "whisper-1",
});
console.log("Fin de la transcription...");
console.log(transcription.text);
}
But I get this error:
Error: APIConnectionError: Connection error.
at OpenAI.makeRequest (C:\Users\Eleve\Documents\Videos\node_modules\openai\core.js:292:19)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async transcription (C:\Users\Eleve\Documents\Videos\index.js:157:25)
at async C:\Users\Eleve\Documents\Videos\index.js:74:5 {
status: undefined,
headers: undefined,
request_id: undefined,
error: undefined,
code: undefined,
param: undefined,
type: undefined,
cause: FetchError: request to https://api.openai.com/v1/audio/transcriptions failed, reason: read ECONNRESET
at ClientRequest.<anonymous> (C:\Users\Eleve\Documents\Videos\node_modules\node-fetch\lib\index.js:1501:11)
at ClientRequest.emit (node:events:519:28)
at TLSSocket.socketErrorListener (node:_http_client:492:9)
at TLSSocket.emit (node:events:531:35)
at emitErrorNT (node:internal/streams/destroy:169:8)
at emitErrorCloseNT (node:internal/streams/destroy:128:3)
at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
type: 'system',
errno: 'ECONNRESET',
code: 'ECONNRESET'
}
}
I'm on the free trial API, idk if that's the cause
Friends in the community, we ran into a question while using Whisper and wanted to ask whether anyone has experience with it: at what decibel level should the speaker's voice be kept for Whisper to achieve a good WER?
shh we need to whisper like the chat name tells us
👀
you're right. you should purchase a paid key
sad... Thanks for the answer
Hello everyone
I'm currently working on specializing Whisper for French railway language. I'm facing some issues with transcribing ambiguous words, as well as recognizing station names. Initially, I tried training it on audio files totaling 2 hours, but the results didn't meet my expectations. I then turned to using prompts, which solved the ambiguity problem; however, since the prompt is limited to 224 tokens, I can't include all the station names.
Could you please provide me with some tips? I'm new to this field.
Thank you
imo the transcription of random words is hard to avoid. I'm assuming you mean things like "Thanks for watching!". Maybe use a speech detection prior to whisper perhaps?
I haven't used whisper offline much but the time I did it didn't perform well out of the box. the api has pretty good results but still has the "hallucinations"
Actually my main problem is transcribing station names, for example "La Défense".
I tried using embeddings to correct the transcription in post-processing but didn't get the results I wanted.
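One cheap alternative to embeddings is plain fuzzy string matching against the station list. The sketch below uses `difflib` from the standard library; the station list, the 0.85 cutoff, and the function names are assumptions for illustration, and the cutoff would need tuning on real transcripts.

```python
import unicodedata
from difflib import get_close_matches

# Hypothetical sample list; a real one would hold every station name.
STATION_NAMES = ["La Défense", "Gare de Lyon", "Châtelet", "Saint-Lazare"]

def fold(text):
    """Lowercase and strip accents so 'defense' can match 'Défense'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def correct_station_names(text, stations=STATION_NAMES, cutoff=0.85):
    """Replace 1- to 3-word spans that closely match a known station name."""
    lookup = {fold(name): name for name in stations}
    words = text.split()
    output = []
    i = 0
    while i < len(words):
        for size in (3, 2, 1):  # try the longest span first
            window = words[i:i + size]
            if len(window) < size:
                continue
            match = get_close_matches(fold(" ".join(window)), list(lookup), n=1, cutoff=cutoff)
            if match:
                output.append(lookup[match[0]])
                i += size
                break
        else:
            # No span starting here looks like a station name; keep the word.
            output.append(words[i])
            i += 1
    return " ".join(output)
```

This only fixes near-miss spellings; names that Whisper mangles beyond recognition would still need another approach, such as the LLM post-processing mentioned further down.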
How are you guys using the Whisper API to create subtitles for 2-hour videos?
Have you tried post-processing with an LLM? There’s an example here: https://platform.openai.com/docs/guides/speech-to-text/improving-reliability
Has anyone tried Whisper v3? Does it significantly improve on v2 or not?
Guess open source whisper is a dead end, glad openai folded it into their proprietary models
I'd like to know if there will be enhancements to Whisper with SSML or a similar markup which will help with enunciation, pronunciation, accentuation, tone, and other features noted today in the presentation.
Not clear because they say large-v2 is used for the api
Has anyone noticed this with whisper?
When there is no audio input it sometimes outputs something like "welcome to chatgpt" or "thank you for watching"
It also output this, in a real-time chat with ChatGPT:
This transcript was recorded on October 12, 2021. Thank you for participating in this webinar.
Hi,
I’m using the openAI API, I’m trying to get short segments (~4 words) with timings & punctuations. What I went through:
- API doesn’t allow to set the number of words per segment
- Thought I could build it from words level transcribe → there is no punctuation there, also characters like - and ' weirdly managed
- Thought I could merge text or segments with words (I can get punctuation from text and timing from words)
Until I noticed a few things between text/segments and words:
- The text might differ: words can literally contain a word that does not exist at all in text/segments
- Timestamps mismatch badly between words and segments
- No punctuation in words
- Words containing ' or -, like « it’s », are treated as one word in some languages and as two words in others.
This makes merging segments and words difficult, since the two sides don't contain the same number of words and the rules for specific characters differ by language.
Did anyone succeed in getting a word-level transcription with punctuation and word-level timestamps from the API, or short segments?
Thank you
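One way to attempt that merge is to align the punctuated text against the timed words with a sequence matcher, keep only the tokens both sides agree on, and then cut the result into short groups. This is a sketch under assumptions: `words` is a list of dicts with `word`/`start`/`end` as in a verbose response, the function names are made up, and tokens that appear on only one side are simply dropped.

```python
import re
from difflib import SequenceMatcher

def normalize(token):
    """Lowercase and drop punctuation so "It's," and "it's" compare equal."""
    return re.sub(r"[^\w]", "", token.lower())

def attach_punctuation(text, words):
    """Give each punctuated token from `text` the timing of its matching word."""
    text_tokens = text.split()
    matcher = SequenceMatcher(
        a=[normalize(t) for t in text_tokens],
        b=[normalize(w["word"]) for w in words],
        autojunk=False,
    )
    timed = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            word = words[block.b + k]
            timed.append({"word": text_tokens[block.a + k],
                          "start": word["start"], "end": word["end"]})
    return timed

def short_segments(timed_words, size=4):
    """Group timed words into segments of at most `size` words."""
    segments = []
    for i in range(0, len(timed_words), size):
        group = timed_words[i:i + size]
        segments.append({"text": " ".join(w["word"] for w in group),
                         "start": group[0]["start"], "end": group[-1]["end"]})
    return segments
```

Because unmatched tokens are skipped rather than guessed, a hallucinated word on either side costs you that word but does not shift the timings of the rest.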
Yup have had similar issues. It tends to hallucinate quite a lot actually
Servers are having load issues now. All of the services are getting that in one way or another.
it would be nice to have a date of when the voice engine (or new whisper) is going to be released
Hello there !
I am Loïc Founder of the french startup Callifly. I'm looking for our french speaking future CTO to complete our Core Team (possibility of equity entry). Already made up of 4 members (company of 8 people all remotely) I am looking for an AI expert to automate our solution.
We specialize in recovering abandoned shopping carts over the phone for e-retailers. More generally, we want to become the voice of e-retailers by selling, advising and supporting customers in their purchasing journey.
To start, I would like to work on a demonstrator (MVP type) and offer it to our clients for simple campaigns (Promo Code proposal for example). They are already excited about the idea.
First step, meet and discuss. 🙂
Who feels up to the challenge? Dm me.
Simple question about Whisper. Does it pad audios shorter than 30s to 30s?
Also have the same question as this guy here; although I'm using faster-whisper and I observed the same behavior.
Related discussion: https://github.com/SYSTRAN/faster-whisper/discussions/837
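As far as I can tell from the openai-whisper source, yes: `whisper.pad_or_trim` zero-pads (or cuts) the waveform to 30 seconds at 16 kHz before the spectrogram is computed. The sketch below mirrors that behaviour on a plain Python list; it is an illustration, not the library's code.

```python
SAMPLE_RATE = 16_000                      # Whisper works on 16 kHz audio
CHUNK_SECONDS = 30                        # fixed window length
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per window

def pad_or_trim(samples, length=N_SAMPLES):
    """Return exactly `length` samples: trim long input, zero-pad short input."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))
```

So a 1-second clip still costs a full 30-second window of compute, and the padded silence is one of the places where stray text can come from.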
ॐ गं गणपतये नमः – Language bot test
चिद्धर्मा सर्वदेहेषु विशेषो नास्ति कुत्रचित् ।
अतश्च तन्मयं सर्वं भावयन्भवजिज्जनः ॥ १०० ॥
𑆃𑆤𑆴𑆫𑆾𑆣𑆩𑇀 𑆃𑆤𑆶𑆠𑇀𑆥𑆳𑆢𑆩𑇀 𑆃𑆤𑆶𑆖𑇀𑆗𑆼𑆢𑆩𑇀 𑆃𑆯𑆳𑆯𑇀𑆮𑆠𑆩𑇀 𑇅
𑆃𑆤𑆼𑆑𑆳𑆫𑇀𑆡𑆩𑇀 𑆃𑆤𑆳𑆤𑆳𑆫𑇀𑆡𑆩𑇀 𑆃𑆤𑆳𑆓𑆩𑆩𑇀 𑆃𑆤𑆴𑆫𑇀𑆓𑆩𑆩𑇀 𑇆
𑆪𑆂 𑆥𑇀𑆫𑆠𑆵𑆠𑇀𑆪𑆱𑆩𑆶𑆠𑇀𑆥𑆳𑆢𑆁 𑆥𑇀𑆫𑆥𑆚𑇀𑆖𑆾𑆥𑆯𑆩𑆁 𑆯𑆴𑆮𑆩𑇀 𑇅
𑆢𑆼𑆯𑆪𑆳𑆩𑆳𑆱 𑆱𑆁𑆧𑆶𑆢𑇀𑆣𑆱𑇀𑆠𑆁 𑆮𑆤𑇀𑆢𑆼 𑆮𑆢𑆠𑆳𑆁 𑆮𑆫𑆩𑇀 𑇆
𑆃𑆤𑆴𑆫𑆾𑆣𑆩𑇀 𑆃𑆤𑆶𑆠𑇀𑆥𑆳𑆢𑆩𑇀 𑆃𑆤𑆶𑆖𑇀𑆗𑆼𑆢𑆩𑇀 𑆃𑆯𑆳𑆯𑇀𑆮𑆠𑆩𑇀 𑇅
𑆃𑆤𑆼𑆑𑆳𑆫𑇀𑆡𑆩𑇀 𑆃𑆤𑆳𑆤𑆳𑆫𑇀𑆡𑆩𑇀 𑆃𑆤𑆳𑆓𑆩𑆩𑇀 𑆃𑆤𑆴𑆫𑇀𑆓𑆩𑆩𑇀 𑇆
𑆪𑆂 𑆥𑇀𑆫𑆠𑆵𑆠𑇀𑆪𑆱𑆩𑆶𑆠𑇀𑆥𑆳𑆢𑆁 𑆥𑇀𑆫𑆥𑆚𑇀𑆖𑆾𑆥𑆯𑆩𑆁 𑆯𑆴𑆮𑆩𑇀 𑇅
𑆢𑆼𑆯𑆪𑆳𑆩𑆳𑆱 𑆱𑆁𑆧𑆶𑆢𑇀𑆣𑆱𑇀𑆠𑆁 𑆮𑆤𑇀𑆢𑆼 𑆮𑆢𑆠𑆳𑆁 𑆮𑆫𑆩𑇀 𑇆
anirodham anutpādam anucchedam aśāśvatam .
anekārtham anānārtham anāgamam anirgamam ..
yaḥ pratītyasamutpādaṃ prapañcopaśamaṃ śivam .
deśayāmāsa saṃbuddhastaṃ vande vadatāṃ varam ..
𑆪𑆼 𑆣𑆩𑇀𑆩𑆳 𑆲𑆼𑆠𑆶𑆥𑇀𑆥𑆨𑆮𑆳 𑆠𑆼𑆱𑆁 𑆲𑆼𑆠𑆶𑆁 𑆠𑆡𑆳𑆓𑆠𑆾 𑆄𑆲 𑇅
𑆠𑆼𑆱𑆚𑇀𑆖 𑆪𑆾 𑆤𑆴𑆫𑆾𑆣𑆾 𑆍𑆮𑆁 𑆮𑆳𑆢𑆵 𑆩𑆲𑆳𑆱𑆩𑆟𑆾 𑇆
hey Rule 4 is no spamming, you are gonna get kicked for posting that in multiple channels if you aren't careful @arctic plover
Hello. I don't know why, but for almost 2 weeks the Whisper API no longer seems able to transcribe lyrics from songs, whereas it was the best at this a few weeks ago... Why???
I've got this on a song like Paradise by Coldplay, which it was able to transcribe clearly 2 weeks ago 😕
Can someone at OpenAI please explain to us what happened??
Hey, does anyone here have experience with insanely-fast-whisper? I've managed to get it working with flash attention 2, but to achieve the best final results, I need to utilize Spacy and the PunctuationModel for optimal quality. However, I'm encountering an issue with obtaining word-level timestamps to complete this task. Any ideas on how to address this?
So it seems OpenAI is dynamically changing the amount of usage we get with a plus subscription even after we have subscribed... I'm curious how this is okay? You can't change the terms of a contract after both parties have agreed to it?
4o has changed from 50 prompts to 40 prompts per 3 hours.. and these limits are not mentioned at all when subscribing for Plus. Not trying to be rude, just curious what a dev would have to say about this?
what??? just 40>??> how can i speak 40 sentences only? if its will be so short then im not buying plus
i just wanna speak to the new voice
.
Every 3 hours yup.
do u have the new advanced voice already?
Nope. Only alpha testers have it afaik.
thought i read albaik for a moment
I am willing to pay for someone to help get the whisper API working for IOS and Mac OS
It only gets the first few seconds using Mobile IOS
Please DM ❤️ I am so tired
dumb question, but are you calling the api correctly?
Yes @neat cedar I am
The problem is with the Audio file itself
I found a workaround using CloudConvert but my application needs a straight through change and I really don’t want to use FFmpeg
I'm going to guess you're streaming audio directly to the Whisper APIs and you're probably only transcribing the first segment?
@neat cedar So I am using an audio file
It will pick up the first few seconds then poof
like it grabs the first word then breaks
I vow to open source a fix
Pls @ me ❤️ actively working on this
I figured it out
Ok so fluent-ffmpeg
IF YOU GOT STUCK WHERE I GOT STUCK
YOU NEED TO USE fluent-ffmpeg
SO YOU CAN USE Whisper IOS, Whisper on IOS, Whisper Mobile, Whisper Mac OS (Tags for those that get stuck)
Convert .wav -> .mp3 then process
No do not setup the recording as mp3 off the get go; for some reason you need to translate it into that format and the audio file will process fine
FOR DOCKER USERS
RUN apt-get update && apt-get install -y ffmpeg
ALTERNATIVELY, USE CLOUDCONVERT'S API TO DO THE AUDIO FORMAT CONVERSION
const ffmpeg = require('fluent-ffmpeg');
function convertWavToMp3(inputFile, outputFile) {
// Set the PATH environment variable to include FFmpeg bin before conversion
process.env.PATH = `C:\\FFmpeg\\bin;${process.env.PATH}`;
ffmpeg(inputFile)
.toFormat('mp3')
.on('end', () => {
console.log('File has been converted successfully');
})
.on('error', (err) => {
console.error('An error occurred: ' + err.message);
})
.saveToFile(outputFile);
}
// Example usage
convertWavToMp3('audio.wav', 'output.mp3');
lame foo.wav 🙂
(the requirement "straight through change" is not exactly clear, imo) @gilded fiber
@still moon wym
i was just throwing 'lame' out there.. i use the binary program from cli
but not for whisper.. haven't had the issue you're having with wav
@gilded fiber i wonder why yours stops.. wav file format issue? max size issue?
It’s all formats
I just figured out the best way to fix it just to convert file type
Works from wav to mp3
i wrote (cgpt'ed up) a script to split my files into N mb chunks..
haven't yet done the part where re-writing timestamp offsets is done for reassembly though [haven't needed to]
from the sound of it, it still could be a size issue you hit? (mp3 being much smaller than wav)
hello, I am new to openai and am trying to use whisper for speech to text translations. I've been trying to use it in the magic leap, and im finding some target architecture issues. it seems whisper requires arm64 but the magic leap requires x86_64. does anyone know if i can re-target whisper or if there's anyway i can still use it? or if there are other speech to text tools i can use?
Not helpful—if you don’t think the question is worthy of answering please don’t reply
ChatGPT may very well provide the answer
If you're the one building stuff chatgpt almost always will steer you wrong on the details. It's not bad if you already know how to make something. maybe also if you're brain storming without needing a viable program
what is the actual processor architecture ? x86_64 or arm
you can def run whisper on x86_64. I've done it on my system. can you tell me more about your setup
Yeah magic leap is an x86_64 device. My desktop is windows 11. I’m building from Unity and to build to the headset I have to use Android platform. I tried using a whisper Unity package, but it looks like for Android it only supports arm64
wut
x86_64 is the standard processor architecture on intel and amds
windows is an operating system that is often x86_64, but sometimes they have an arm version
android is almost always arm64
I might be missing something
I think the issue is that I have to use Android platform to build to the headset and specifically if I use Android I have to use arm64
But also I’m really new to this so I could be confused about something else or maybe there’s a better way to do this
Not yet—I’ll look into it. Thanks for your help!
you could probably find a way to run whisper on arm64 but depending on the hardware it may not run great
there's always the openai whisper api but not sure if that's going to work for your implementation
https://github.com/ggerganov/whisper.cpp mentions android support
Thanks I’ll try those out!
from faster_whisper import WhisperModel

def transcribe(path="input.wav"):
    model = WhisperModel('tiny', compute_type="int8")
    segments, _ = model.transcribe(path)
    text = ''.join(segment.text for segment in segments)
    return text
I am trying to use the Whisper model and get it to work on CPU. I think the code I have written should work fine on CPU, but it says that it needs cuDNN to work. How can I fix this without needing cuDNN?
tag me if u answer thanks
whats this channel for
This is for OpenAI's speech to text model called Whisper
Does anyone know an app for Windows for speech-to-text, like the built-in one (Win+H), but with an option to use an external speech-to-text API like Whisper?
I want to put the transcription into an any text field immediately without copy paste actions..
tldr: whisper won't use CUDA despite torch detecting the graphics card and CUDA in the same file
Soooo I have Cuda installed
and these in my venv:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install -U openai-whisper
My code checks if cuda is available and it prints yes and my GPU:
import whisper
import torch

print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(f"CUDA is available. Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA is not available. Using CPU.")
model = whisper.load_model("medium")
# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("test.opus")
audio = whisper.pad_or_trim(audio)
# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
# print the recognized text
print(result.text)
(venv) ➜ mssp-transcript python test.py
True
CUDA is available. Using GPU: NVIDIA GeForce GTX 1070
But it uses my CPU???? The source code of the whisper module also checks torch.cuda.is_available() and runs on CUDA if it is, so I have NO idea why it's running on the CPU
I think hamburgers and cheeseburgers taste great when eaten together with French fries
I have a Korean show that has English subtitles files, but I would like to generate Korean subtitles using whisper.
Is there a simple way to use the english subtitle timings as a guide for the whisper generated transcription subtitles?
I would like to have the generated native Korean subtitles match the existing timings of the English subtitles.
I have ideas for how to hack together some stuff but I'd like to check first if there's an easy way
I think you can tinker with the timestamp_granularities parameter as demonstrated here
https://platform.openai.com/docs/guides/speech-to-text/timestamps
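If you do get word-level timestamps, one way to reuse the English timings is to drop each Korean word into the English cue it overlaps most. A rough sketch, assuming `cues` is a list of `(start, end)` tuples parsed from the English subtitle file and `words` is a list of dicts with `word`/`start`/`end`; both formats and the function names are assumptions.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def retime_to_cues(cues, words):
    """Assign every word to the cue it overlaps most (nearest cue on ties)."""
    buckets = [[] for _ in cues]
    for word in words:
        midpoint = (word["start"] + word["end"]) / 2

        def score(index):
            cue_start, cue_end = cues[index]
            distance = abs((cue_start + cue_end) / 2 - midpoint)
            return (overlap(cue_start, cue_end, word["start"], word["end"]), -distance)

        best = max(range(len(cues)), key=score)
        buckets[best].append(word["word"].strip())
    return [{"start": start, "end": end, "text": " ".join(bucket)}
            for (start, end), bucket in zip(cues, buckets)]
```

Cues that end up with no words come back with empty text, which is a useful signal that the two tracks drift apart at that point.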
friend you can also emulate//use a vm to get whisper running. Set emulation/vm to arm and run the model.
oh, the person above links a cpp, so its in c++/c and you can just run that native lol
Hello, I'm looking for a TTS model that can handle caribbean creole or ocean indian creole.
Whisper.cpp uses quantized 4 bit whisper and works great on CPU only on my Pixel 6 up to the base model. There are also two Android apps I know implement it (Whisper Journal) and the same dev did a keyboard app that simply uses 10-30 second clips to use whisper as keyboard. Hasn't been updated in a bit but works great and is fast even in this Pixel 6.
Sweet
Hey - I have a problem with my subtitles. I use large-v2 model and whenever there's a pause in audio (no one is talking), timecodes desync.
It's really annoying to manually resync them, especially on videos that are 30+ minutes long. Did any of you figure out what to do in that situation? Some kind of fix?
Do you have timecodes for your transcription, like a specific time in the recorded audio where that word was recognized? If so, how did you do it?
Okay, to be honest - I've found another solution. Switched to WhisperX, set my model to large-v3 and align_model to WAV2VEC2_ASR_LARGE_LV60K_960H and it works wonders. It's much better than before.
BUT - for some reason I can't force it to work with CUDA (float16), it only works with CPU (int8) - and it's slow as hell. And I have no idea why it doesn't want to use it.
hey so i have some code meant to put subtitles on a video, but i only want the subtitles to be of a very short length (e.g. like 9 characters). this is my python code using whisper. it works perfectly fine except for the fact that my max_line_width and max_line_count parameters arent respected (the subtitles still are multi-lined and much longer than i would like). does anyone know why?
def generate_subtitles(audio_path):
    subtitle_filename = audio_path.replace('.mp3', '.srt')
    # max_line_width = 9
    # max_line_count = 1
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    srt_writer = WriteSRT(".")
    srt_writer.write_result(
        result,
        file=open(os.path.join(".", os.path.splitext(os.path.basename(audio_path))[0] + ".srt"), "w",
                  encoding="utf-8"),
        options={"max_line_width": 9, "max_line_count": 1, "highlight_words": False}  # Why aren't the options working?
    )
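A likely cause: in openai-whisper the line-splitting options only take effect when the result carries word-level timings, which is why the CLI help says `--max_line_width` requires `--word_timestamps True`. A hedged sketch of the same function with `word_timestamps=True` passed to `transcribe`; `srt_path_for` is a helper name made up here, and the Whisper imports sit inside the function only so the path helper can be used on its own.

```python
import os

def srt_path_for(audio_path, output_dir="."):
    """Derive 'name.srt' in output_dir from any audio path."""
    stem = os.path.splitext(os.path.basename(audio_path))[0]
    return os.path.join(output_dir, stem + ".srt")

def generate_subtitles(audio_path, max_line_width=9, max_line_count=1):
    # Imported here so srt_path_for works without Whisper installed.
    import whisper
    from whisper.utils import WriteSRT

    model = whisper.load_model("base")
    # Without word timings the writer emits each segment unchanged.
    result = model.transcribe(audio_path, word_timestamps=True)
    srt_path = srt_path_for(audio_path)
    with open(srt_path, "w", encoding="utf-8") as srt_file:
        WriteSRT(".").write_result(
            result,
            file=srt_file,
            options={"max_line_width": max_line_width,
                     "max_line_count": max_line_count,
                     "highlight_words": False},
        )
    return srt_path
```

The width is a target rather than a hard cap: a single word longer than 9 characters still ends up on its own line at full length.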
Hi guys 👋
I’m a newbie here, I’m currently using the Whisper transcription API to transcribe audio files within a python app.
My question is : is it mandatory to read the audio file prior to sending it to the transcription API ? I’m scared RAM wise since I may have a lot of concurrent requests to handle with 9MBs for each audio files, and my app is hosted on Heroku with only 512MB of RAM.
Is there a way to send the transcription request without loading the audio file in RAM ?
Is this approach a risk of memory exhaustion ?
it throws me that float16 (cuda) is not supported by my pc, but... it is and it's installed
it worked previously with faster-whisper
any ideas what might be the cause and fix?
Would this approach be correct ? :
async with aiofiles.open(audio_file_path, 'wb') as f:
    while True:
        chunk = await response.content.read(8192)
        if not chunk:
            break
        await f.write(chunk)
# The client expects a file object (or a pathlib.Path), not a path string
with open(audio_file_path, 'rb') as audio_file:
    transcription_response = await client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )
Anyone ?
the API request has to contain the audio file, it will be loaded in memory at some point even if you decide to not use the lib and do the HTTP request yourself
Thanks for your answer ! Does it mean that if my app needs to be able to handle 100 transcriptions calls concurrently I’ll need to have a beefy instance full of ram to avoid performance issues ?
yes, you will need system resources to match the load
Thanks Robert !
processing that amount of audio at the same time will surely require a beefy machine, I would go with something with a good amount of ram and also a nice SSD if you plan to read all of those audio files from disk
even if it is all being done on the cloud, the machine will need some good amount of IO to keep uploading all of that consistently
Yup makes sense, I guess my temporary solution is to define a limit of concurrent calls to 10 for example to lower my expectations
Ram is extremely expensive on Heroku, my instance only has 512MB lol
for sure you will need a queueing process to be able to control that
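A semaphore is probably the smallest version of such a limit in an asyncio app. The sketch below caps how many jobs run at once; `run_limited` is a made-up name and each job is assumed to be a zero-argument function returning a coroutine (for example one that downloads a file and calls the transcription API).

```python
import asyncio

async def run_limited(jobs, limit=10):
    """Run coroutine factories with at most `limit` of them active at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        # Wait here until one of the `limit` slots is free.
        async with semaphore:
            return await job()

    # gather keeps results in the same order as `jobs`.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

With roughly 9 MB per file, a limit of 10 keeps about 90 MB of audio in flight at worst, which leaves some headroom on a 512 MB instance. A real queue (Redis, a worker dyno) would be needed if requests must survive restarts.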
any help with that, please?
There is also a Speech to Text API that works from a local file. You can pick a file from the device and of course use it as an API
https://rapidapi.com/swift-api-swift-api-default/api/ai-speech-to-text/playground/apiendpoint_e98a440e-6f3d-473a-8d13-56afe020f179
Empower your applications with cutting-edge speech recognition technology, crafted to meet the highest standards of excellence. Equipped with the essential tools and resources, developers can confidently create and deploy their projects swiftly and efficiently. Our solution ensures unparalleled performance, delivering a seamless experience that ...
I know I'm pulling up an old thread here, but where can I find info on these more specific parameters such as vad or even compression_ratio_threshold and temperature_increment_on_fallback. I'm having the same issue as a lot of people have mentioned on here (hallucinations), but with the API and I can't find relevant docs on either OpenAI's platform docs nor Azure's OpenAI services.
This seems to be common in audio clips longer than 15 minutes or so, and it happens pretty often. I've tried setting my parameters based on some online answers (I'm using node and axios fetch) but none seem to get whisper any closer to a coherent response on this audio clip
const audioResponse = await axios.get(audioUrl, { responseType: 'stream' });
const form = new FormData();
form.append('file', audioResponse.data, {
filename: 'audio.mp3',
contentType: 'audio/mpeg'
});
// THESE ARE MY WHISPER API PARAMS
form.append('response_format', 'verbose_json');
form.append('timestamp_granularities', 'segment');
form.append('temperature_increment_on_fallback', 'None');
form.append('compression_ratio_threshold', 1.2);
form.append('temperature', 0.1);
form.append('vad', 'True');
// ===============================
const azureResponse = await axios.post(
`${AZURE_OPENAI_ENDPOINT}/openai/deployments/${deploymentName}/audio/translations?api-version=2024-05-01-preview`,
form,
{
headers: {
'api-key': AZURE_OPENAI_API_KEY,
...form.getHeaders()
}
}
);
return azureResponse.data;
yes I'm using the Azure instance; and no I have no idea whether these params can actually be passed to the API but it should be the same as the OpenAI whisper api
Unfortunately I don't believe the API gives you access to all of those options...
Has anyone tested this model and can tell me whether it's better than large-v2 Japanese 5k or not?
link :https://huggingface.co/drewschaub/whisper-large-v3-japanese-4k-steps/tree/main
yes
what?
Has whisper been abandoned by OpenAI?
they released v3 large less than a year ago
but they also said that gpt-4o does a better job than whisper, and haven't touched it since last year
they did edit the readme recently
but they haven't accepted any PRs, even spelling mistake fixes. I know I submitted one that removed an error when run on Windows
Pull requests have always taken literal years if they're not, like, large security issues.
how to use whisper to detect sound effects in a video?
that is definitely not something it could do
it'll try and put a word or letters to the sound
when i use whisper through command prompt it works flawlessly, however when i try to run it through a python script it says:
'whisper' has no attribute 'load_model'
i tried to debug and used: print(dir(whisper))
which gave me:
['AudioTranscriber', 'QApplication', 'QColor', 'QMainWindow', 'QPalette', 'QPushButton', 'QTextCursor', 'QTextEdit', 'QTimer', 'QVBoxLayout', 'QWidget', 'Qt', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'np', 'queue', 'sd', 'sf', 'sys', 'tempfile', 'threading', 'whisper']
and now I'm stumped, I've updated everything, reinstalled everything, restarted everything 10 times over and this still happens.
edit: you cannot have a file named "whisper.py" in your project (or in the folder you run Python from), or it will shadow the installed package and break everything. I'd put in a pull request to fix this, but I'm sure there is an open one already.
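The `dir(whisper)` output above is the giveaway: those Qt names come from a local script, not from the installed package. A quick check is `print(whisper.__file__)`; the sketch below goes one step further and scans folders you name for files that would shadow a module. The function name is made up for illustration.

```python
from pathlib import Path

def find_shadowing_files(module_name, search_dirs):
    """Return local files or packages that `import module_name` would pick up."""
    hits = []
    for directory in search_dirs:
        base = Path(directory)
        candidates = (base / f"{module_name}.py",
                      base / module_name / "__init__.py")
        for candidate in candidates:
            if candidate.is_file():
                hits.append(candidate)
    return hits
```

Running `find_shadowing_files("whisper", [Path.cwd()])` from your project folder lists the offending file; renaming it (and deleting any stale `__pycache__`) should bring `load_model` back.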
hi
I'm new here. I have some questions about training and fine-tuning the Whisper model.
Is it possible to fine-tune the model with longer audio files?
I have various audio files, from 10 minutes minimum to 30 minutes maximum, and my dataset also has their transcriptions.
I need to train the model with my dataset. Is that possible?
Can anyone help me with this?
You just have to convert the audio files to the correct format for the model you want to tune (e.g. 16 kHz), then tokenize the transcriptions, set up your training parameters, and run the training function. You can then use the trainer.evaluate() function to check the results and change the parameters as needed.
Here is an example:
https://colab.research.google.com/drive/1P4ClLkPmfsaKn2tBbRp0nVjGMRKR-EWz
ok thanks, @surreal plover I'll try this
@surreal plover is it possible with finetuning the model with larger audio files more than 20mins audio and its transcription (note : i need to finetune the model with full size audio i don't want to split 30sec)
I don't believe there is any limit on the length of audio you can use to fine tune the model. You should be able to use any length file, it's just hardware constraints.
but when ever i try to train my dataset (long duration audio with transcription) i got this following error.
ERROR : RuntimeError: The size of tensor a (1719) must match the size of tensor b (448) at non-singleton dimension 1
Your current model is set up expecting all files in the batch to have the same dimensions between both audio files. The simplest fix would be to either pad or truncate the files. A simple python script could automate this if you don't want to get into changing what the batch expects.
I already tried both methods, padding and truncating, and I still got the same error.
Sounds like there is an error in how you're padding and/or truncating. I'd add some logging and check the length of the files after they have been padded, etc.
@surreal plover could you give me a sample of your padding and truncation code? I'll go through it and follow your method. Would you help me with this?
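For what it's worth, the 448 in that error matches Whisper's decoder limit of 448 label tokens, and the encoder works on 30-second windows, so a 20-minute transcript cannot fit in one training example and long recordings generally have to be cut up. A sketch of grouping a timestamped transcript into windows of at most 30 seconds, assuming segments are dicts with `start`/`end`/`text` sorted by time (a hypothetical format; a transcript without timestamps would need forced alignment first):

```python
MAX_WINDOW_SECONDS = 30.0

def group_into_windows(segments, max_seconds=MAX_WINDOW_SECONDS):
    """Merge consecutive segments into training examples no longer than max_seconds."""
    windows = []
    current = []
    for segment in segments:
        # Start a new window if adding this segment would exceed the limit.
        if current and segment["end"] - current[0]["start"] > max_seconds:
            windows.append(current)
            current = []
        current.append(segment)
    if current:
        windows.append(current)
    return [{"start": window[0]["start"],
             "end": window[-1]["end"],
             "text": " ".join(s["text"].strip() for s in window)}
            for window in windows]
```

Each returned window gives the audio span to cut and the label text to tokenize; it is still worth checking that every label stays under 448 tokens after tokenization.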
Hi,
I want to optimize the Whisper model to run locally on an Android device.
What is the best approach for optimizing Whisper?
Which model is the best one to start from?
How much can it be optimized while maintaining accuracy?
Hi, I have problems with Whisper. Whisper works well this way:
```
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio2.wav")
print(result["text"])
```
But when I want to use it in my code, I get errors. Here is part of my code:
```
import whisper
import pyttsx3
import sounddevice as sd
import numpy as np

def voice_model(language='en-EN', mic_index=0,
                voice_id=r'HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Voices\Tokens\MSTTS_V110_enGB_HazelM'):
    working = True
    model = whisper.load_model("base")
    engine = pyttsx3.init()
    engine.setProperty('voice', voice_id)
    if language == 'fr-FR':
        print("Appuyez sur * sur votre clavier pour mettre en pause la conversation")
    else:
        print("Press * on your keyboard to pause the conversation")

    def talk(text):
        engine.say(text)
        engine.runAndWait()

    def listen():
        command = ''
        try:
            duration = 5  # Recording duration in seconds
            samplerate = 16000  # Audio sampling rate in Hz
            # Record audio from the microphone
            print('Assistant:')
            audio = sd.rec(int(samplerate * duration), samplerate=samplerate, channels=1, dtype='int16')
            sd.wait()  # Wait for the recording to finish
            audio = np.squeeze(audio)  # Make sure the audio is mono
            # Whisper expects float32 samples in [-1, 1], so scale the int16 data
            audio = audio.astype(np.float32) / 32768.0
            # Reuse the model loaded above and transcribe the audio
            result = model.transcribe(audio)
            command = result["text"]
        except Exception as e:
            print(f"Whisper transcription error: {e}")
        return command
```
And what is the actual error here?
Please post a traceback
Anyone here with the same problem? It all started when I included a prompt for the API, something like this:
prompt = (
"Please provide a clean and formatted transcription of the following audio:\n"
"The transcription should include punctuation and capitalization to make the text readable."
)
The API then goes wild sometimes and returns "If you have any questions or comments", instead of transcribing the audio submitted. Sometimes it works, sometimes it does not. The fact that there's also no "If you have any questions or comments" in the MP3 file at all makes me question it even more
any tips on improving Greek language recognition on whisper and faster-whisper?
Check the audio levels and overall quality. Whisper behaves this way when it hears silence, likely because it was trained on videos that ended with subtitles containing such expressions but no audible audio.
Yeah, that's what I thought. However, if I submit the file multiple times, it gets the lyrics right. The audio is also pretty clean/clear if you ask me. It did only happen more often when giving a prompt alongside the initial audio
Thanks for responding though! 
That's correct because it's something like confabulation. It doesn't happen every time.
I have a Voice UI that transcribes as "thank you for watching," "please subscribe to my channel," or even something in a language I don't understand when the user doesn't speak for a moment.
Interesting to know, thank you! I mean, it's not something big so for my use cases that's fine, but I was just confused for a second 
sure!
I have a picture representing this behavior. I activated the mic, and it deactivated after three seconds of detected silence, yet I still received a transcription.
Love it.
I've been playing with whisper and Google voice synthesis using SSML. It's pretty awesome with some of the new voices Google has for SSML. You can create a group conversation as a discussion using made-up actors that take the transcription, reprocess it with the OpenAI API to generate the conversation and SSML code, send it to Google and get the audio back. We've layered it on top of other videos as commentary and it was pretty cool.
Here is a sample of a completely unscripted conversation that used Whisper to take the conversation from the existing marketing video and create the dialog. The app allows you to add as many actors as you want and their roles. One person knows everything, one person is asking questions and we've added a third who is the joke commentary to provide levity in what was a boring video before it was processed. https://www.robsoninc.com/wp-content/uploads/2024/06/420OTU6uCnA-final.mp4
Best part is, it can run off any existing YouTube video. I've tried publicly available music videos (to test), which was just odd when you remove the existing voices, keep the music and add in a multi-person conversation talking about the lyrics. Next version is going to overlay pop-ups on the video to provide contextual information. It's all about increasing user engagement, creating new unique content.
set repetition penalty and use default temperature. why did you change it to 0.1 ?
hey
does whisper accept .mp4 video files directly for transcription?
yes!
https://platform.openai.com/docs/guides/speech-to-text
"File uploads are currently limited to 25 MB and the following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm."
Hey guys, I have some issues when transcribing to WORD LEVEL, I get some words with the same start and end time, how can I solve that ?
Would love your help !
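One possible post-processing step for words that come back with identical start and end times, as a rough sketch. It assumes each word is a dict with `word`, `start` and `end` keys in seconds (an assumed shape, not quoted from the API docs), and `min_dur` is a hypothetical floor. It does not make the timings more accurate; it only gives every word a visible duration without running into its neighbours.

```python
def fix_zero_length_words(words, min_dur=0.05):
    """Give each word at least `min_dur` seconds where there is room for it.

    `words` is a list of dicts with "word", "start", "end" (seconds), sorted
    by start time. The dict layout and `min_dur` are illustrative assumptions.
    """
    fixed = [dict(w) for w in words]  # copy so the input is left untouched
    for i, w in enumerate(fixed):
        if w["end"] - w["start"] >= min_dur:
            continue
        # The word may grow forward only up to the next word's start.
        limit = fixed[i + 1]["start"] if i + 1 < len(fixed) else float("inf")
        w["end"] = max(w["end"], min(w["start"] + min_dur, limit))
        if w["end"] - w["start"] < min_dur:
            # Not enough room after the word: pull the start back instead,
            # but never past the end of the previous word.
            prev_end = fixed[i - 1]["end"] if i > 0 else 0.0
            w["start"] = min(w["start"], max(prev_end, w["end"] - min_dur, 0.0))
    return fixed
```

If two words are packed tightly on both sides there is simply no room, and the word keeps its original times.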
You are best to run Whisper locally on your own GPU
interesting. ..
I use the GPU in the Mac M3. Been working great. Also tried a Nvidia card in a dedicated server, I like the Mac for dev better.
Check out Torchx / TensorFlow
that is so cool
well I can't, I'm hosting an app on Vercel so I prefer to use the API
Did I mess up installing something?
Have you tried forcing a reinstall/update for Whisper? That should help make sure any dependencies are installed.
i think try updating your pytorch module or go to the git repo and pip install -r requirements.txt
mate at this point i just tell people to upload pip list
dependency conflict is so freaking annoying honestly
GPT-4o 😆 sadly we can't use that via the API yet
hopefully that ages badly and we get that sweet sweet voice mode on the API soon =P
what is whisper
speech to text AI model made by OpenAI
speech to text? isnt that just on every phone/youtube?
Is there any workaround to transcribe a multilingual audio file ? I have french and english in the same audio file
trying to use whisper to subtitle an episode and running into an issue
whisper via the api works great for picking up silence and properly segmenting the text, but the timings are all off
and so now I'm trying to use whisperx, where the timings are correct, but it does a much poorer job with breaking up the text by speaker
anyone have any familiarity w this and/or recommendations?
actually I'll just write out the srt myself based on the word timings...
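Writing the SRT from word timings could look roughly like this. It is a sketch under assumptions: words are dicts with `word`, `start` and `end` in seconds, and the `max_gap` and `max_chars` thresholds are made-up starting points rather than values from any tool mentioned here.

```python
def format_ts(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_gap=0.6, max_chars=42):
    """Group timed words into cues and render them as SRT text.

    A new cue starts when the pause before a word exceeds `max_gap` seconds
    or the cue would grow past `max_chars` characters. Both thresholds are
    illustrative assumptions.
    """
    cues, current = [], []
    for w in words:
        if current:
            gap = w["start"] - current[-1]["end"]
            length = len(" ".join(x["word"] for x in current)) + 1 + len(w["word"])
            if gap > max_gap or length > max_chars:
                cues.append(current)
                current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for i, cue in enumerate(cues, start=1):
        text = " ".join(x["word"] for x in cue)
        start, end = format_ts(cue[0]["start"]), format_ts(cue[-1]["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)
```

Splitting on the pause between words is what keeps sentences separated by silence in separate cues, which is the behaviour being compared in this thread.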
You can use the API locally, you just need the *nix environment.
You can... if you're a paying subscriber.
I could help, I've done a lot of Whisper -> transcription -> OpenAI API to process and generate SSML -> process with OpenAI TTS and re-add it back over the existing video. In any language. (More with TTS using Google Speech.) Just need a Google Dev API with the service to use that.
GPT-4o voice features are not available, there is no API endpoint for it, even if you are a subscriber of ChatGPT, the ChatGPT subscription is a completely separated service to the API and a subscription on ChatGPT does not grant any API usage
You don't need GPT4o to use OpenAI TTS - I mean there would be zero point or benefit.
If you want to create new, better content off the transcript you would use GPT4o to generate either the SSML (if you want to use Google Voice) or you can just send the text to OpenAI TTS and let it do its thing. Which I found to be much more human sounding.
https://platform.openai.com/docs/api-reference/audio
I used tts-1-hd
In the context of the message you replied to, it was talking about the upcoming voice features that have been showcased by OpenAI on ChatGPT. These features are not available on the API. Those features are in fact not even available to ChatGPT subscribers on ChatGPT except for a small amount of beta testers.
How do you see the features being introduced not programmatically available via the API now? Genuine question.. The issues as I see it is just speed for realtime conversation.
You need an OpenAI subscription to access the API, not ChatGPT Plus. - right?
TTS is included with the OpenAI subscription.
any tips would b appreciated, can show you what I'm looking at
left is from openai's API, right is from whisperx run locally
the api did a better job with breaking up sentences with silence/no speech between them
whisperx kinda just groups them together if they were close enough
i know the latter is basically just whisper under the hood, but I'm not sure which parameters were used to achieve the former (for whisper, i.e. past the api)
a clear difference is the punctuation obv but I have no idea how to control that hehe
toyed with params like chunk_size no_speech_threshold max_line_count and only the first one of those made a significant difference, though its effect isn't exactly what I'm looking for
I found OpenAI TTS-HD to be more realistic than Google Voice with SSML.
But Google has a lot more languages/accents and there are some new - pretty great sounding voices but the SSML can be tricky. I had issues when adding in two GenAI people in to a conversation.

ah yeah, seems like it's obsessed with amara.org, it always says that 😄
it hallucinates when there is silence
I had never seen that one before
I had never seen "Amara" message before too.
send an empty audio clip
i noticed it when i was building a STT/TTS agent and sometimes i’d accidentally send an empty message to the agent
hi, I am trying to transcribe some audio, but whisper skips a big chunk of it every time. Other audio files work fine, but in this one it always skips a big chunk
the audio is about 56 sec
if you listen to the audio, at the start the lady says "anyway okay let's move on so what I'm doing right now", and after that the whole chunk is skipped until "did you figure it out"
https://filebin.net/mnk9e6qmy5nfm9oa
Timestamp
00:05 - 00:30 of the audio is skipped every time. I transcribed it 3 times and got the same response
Transcription:
anyway okay let's move on so what I'm doing right now did you figure out what I was saying there or you just bought the test like pretty much everybody does because now we have some groups that are supposed to help people guide people but actually what happens is that sometimes people go into their small private discussions and when somebody wants to do the test they just ask someone from that group\n
Hey, trying to get whisper.cpp working on windows. anyone been able to compile with cmake? for whatever reason manually compiling the files doesn't work as it doesn't seem to use any backend and therefore doesn't work. i can't compile any example with cmake other than the main and i wanna do more than that. any help?
I have a self hosted whisper server used to transcribe Greek news videos (Large-v3 model), however I am experiencing an issue where sometimes, after the speaker changes, the transcription is lost.
Sometimes it comes back after a sentence or two, other times it does not understand the 2nd speaker at all.
Audio quality is excellent for both speakers and neither has a thick accent or is unintelligible.
Is there anything I can do to improve this?
Hey everyone!
Does anyone else here have a problem with the max file size for whisper being 25 MB?
If it is an issue for anyone else, I thought I'd share with you a simple api for transcoding videos / audios in a variety of formats into opus ogg, which greatly reduces the file size.
Some tests I ran were able to reduce 1GB (video) mp4 file into a ~15 MBs audio, and 50MBs mp3 audios into ~ 2MB files, and the transcription with whisper worked perfectly!
It's a very simple api (only one file), that can be run/deployed with a docker container.
You can find all instructions (including, deploying it to fly.io) at the repo:
https://github.com/vfssantos/ffmpeg-deno-microservice
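For anyone who would rather do the same conversion locally, here is a minimal sketch that builds an equivalent ffmpeg command from Python. It is not the code from the repo above: the function names, the `24k` bitrate and reading the 25 MB limit as 25 MiB are all assumptions, and it needs an ffmpeg build that includes libopus.

```python
import os

# Upload limit mentioned in this thread; treating "25 MB" as 25 MiB is an assumption.
API_LIMIT_BYTES = 25 * 1024 * 1024


def opus_command(src, dst, bitrate="24k"):
    """Build an ffmpeg command that drops video and encodes mono Opus audio."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vn",              # discard any video stream
        "-ac", "1",         # downmix to mono
        "-c:a", "libopus",  # Opus encoder (requires ffmpeg built with libopus)
        "-b:a", bitrate,    # target audio bitrate; 24k is an assumed starting point
        dst,
    ]


def fits_upload_limit(path, limit=API_LIMIT_BYTES):
    """Return True if the file at `path` is no larger than the upload limit."""
    return os.path.getsize(path) <= limit
```

Running it would be something like `subprocess.run(opus_command("talk.mp4", "talk.ogg"), check=True)` followed by `fits_upload_limit("talk.ogg")`; the file names are placeholders.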
Is whisper-1 the only whisper model?
Hi I'm trying to generate translated subtitles for a non-english video using whisper but using --task translate doesn't seem to do anything what am I missing
i found the issue turbo doesn't have the translation capabilities
https://github.com/openai/whisper/discussions/2363 large-v3-turbo is the latest
the API docs say: "ID of the model to use. Only whisper-1 (which is powered by our open source Whisper V2 model) is currently available."
https://platform.openai.com/docs/api-reference/audio/createTranscription
looks like large-v3-turbo is not yet available :/
How does one transcribe an audio file with multiple speakers, and have the AI distinguish between them?? Examples: podcasts, meeting call recordings
You will need to do speaker diarization. Whisper doesn't support this itself but you can use libraries like pyannote https://github.com/pyannote/pyannote-audio or if you are okay with paying you can check assembly.ai they have this available via api and its fairly cheap if your usage is limited.
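Once you have both outputs, the merge step can be fairly small. The sketch below assumes the transcript is a list of dicts with `start`, `end` and `text`, and that the diarization result has already been flattened into `(start, end, speaker)` tuples; neither shape is pyannote's or Whisper's actual return type, so some adapting is needed on both sides.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turns overlap it most.

    segments: list of dicts with "start", "end", "text" (assumed shape).
    turns: list of (start, end, speaker) tuples (assumed shape).
    """
    labeled = []
    for seg in segments:
        totals = {}
        for t_start, t_end, speaker in turns:
            dur = overlap(seg["start"], seg["end"], t_start, t_end)
            if dur > 0:
                totals[speaker] = totals.get(speaker, 0.0) + dur
        # Segments that overlap no turn at all are marked "unknown".
        speaker = max(totals, key=totals.get) if totals else "unknown"
        labeled.append({**seg, "speaker": speaker})
    return labeled
```

Picking the speaker with the largest total overlap is a simple heuristic; segments that straddle a speaker change will still be attributed to just one person.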
Whisper large-v3-turbo is available. The model is available on huggingface
Here is the link - https://huggingface.co/openai/whisper-large-v3-turbo
thanks...that' s interesting.
Does anyone know if there will be a new version or a complete remake of whisper when the audio version of chatgpt becomes available?
i'm waiting for that info before i make an investment... it would really hurt to invest only for a completely new version of whisper to come out a few days later... it's for an STT project
if anyone has any news, it would be great, thanks!
Hi all. Any tips or resources for generating an srt, given an mp3 file and its lyrics?
I thought about using whisper, but sometimes the lyrics it hears are wrong -- eg when the song audio isnt too clear or words are in other languages --
So.... many... crickets...
What is the best wrapper you have seen to run whisper locally on mac?
Atm I use flowvoice.ai but I would prefer running something open source if possible
simplest would be to do a single http request with curl, or if you install the OpenAI API a simple Python script:
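```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
THE_PATH = "[path goes here]"
THE_NAME = "[file name goes here]"

# Send the audio file to the transcription endpoint and ask for SRT output
with open(THE_PATH + THE_NAME + ".wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="srt"
    )
print(transcription)

# Save the same text next to the audio file
with open(THE_PATH + THE_NAME + ".txt", "w") as f:
    print(transcription, file=f)
```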
(i just added THE_PATH and THE_NAME to make this example clear)
also you can install ffmpeg and convert the audio to .wav first which can prevent weirdness with server side conversion:
ffmpeg -i <audio> -ar 16000 -ac 1 -c:a pcm_s16le <output>.wav
if you divide it into multiple audio files you can use a GPT to combine the transcript files smoothly
Also you could create your own Whisper app for Mac by telling o1-preview:
Please write me a Swift app for Mac that uses SwiftUI for the UI, and Combine for asynchronous data and UI. The app should let me select an audio file, convert it to a transcript with an OpenAI Whisper http request, and then put it into a text editor window so i can edit it. Please add whatever other features you think will make it amazing. Please write out all source files 100% completely three files at a time, and ask permission to continue.
There are many ways
Thanks! I still think the simple way would be an existing open source solution. It seems like a pretty straightforward case and whisper is not really new so I guess there should be something already out there that is plug and play
Hey everyone! I have a short question (maybe not so short, haha) about fine-tuning Whisper models on own labelled data. Following the general jupyter notebook on Huggingface, there seems to be no preprocessing steps for transcriptions that are generated during the training and are used for evaluations over the course of training. Is it necessary to add this step in the training somehow?
What I mean:
For example, for my own dataset that I am using for fine-tuning, all letters are lowercase, and there are no punctuation marks, only pure letter characters and whitespaces. Now, after starting fine-tuning, for example, every 200 steps the current model gets evaluated, however, is it not possible that model will generate output that will have uppercase and punctuation characters, and therefore WER will be higher than if they were preprocessed after being generated?
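To make the concern concrete, here is a small self-contained sketch of how much casing and punctuation alone can move WER. The `normalize` and `wer` helpers are written from scratch for illustration and are not the functions used in the Hugging Face notebook; in a real training script the same normalization would be applied to both predictions and labels inside the metric function.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove anything that is not a word char or space
    return re.sub(r"\s+", " ", text).strip()


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)


reference = "hello world how are you"
prediction = "Hello, world! How are you?"
print(wer(reference, prediction))             # 0.8
print(wer(reference, normalize(prediction)))  # 0.0
```

With lowercase, unpunctuated labels, four of the five words count as errors before normalization even though the transcription is perfect, so yes, unnormalized evaluation can overstate WER.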
hi I am new to the channel, so forgive me if it has been answered before. Has anybody tried to use gpt4o- native audio 2 text to transcribe audio? How does it compare with Whisper? How good is it for long-context? Can it somehow do speaker identification?
hi is it possible to directly summarize an audio with whisper without a previous transcription?
Hey, I've recently discovered a problem with using two different versions of Whisper with CUDA.
I have installed whisperx and whisper-ctranslate2 in two different conda environments.
CUDA v12.5 and v11.8 are installed in Windows.
CUDNN v8.9.7 and v9.6 are installed in Windows.
Both of them are included in environment variable PATH.
Additionally I have two CUDA_PATH and CUDA_PATH_V11_8 set up.
whisperx works correctly with CUDA and spits out transcripts perfectly
whisper-ctranslate2 recognizes the file but refuses to transcribe it: it crashes with no error. It only works with the --device CPU argument on the command line.
So, it looks like whisper-ctranslate2 has some problems with reading CUDA and CUDNN files, but I have no idea how to force it to recognize files correctly.
(I remember that on my previous Windows installation, the problem was reversed - whisper-ctranslate2 was working with CUDA and whisperx was not having any of it)
Any ideas how to fix it?
probably not but you can use any python package for this summarization
Hey,
We've recently added Intel® Gaudi® support to the Whisper repo! 🎉
Check out the GitHub discussion for more details - https://github.com/openai/whisper/discussions/2463
Let us know your thoughts or any feedback in the thread!
I have seen that midjourney allow to generate img in relaxed mode for free users! Is that true
Is there documentation on how to tune the ServerVAD API for PSTN calls? The default seems to be really sensitive
Hello guys, I need some help. I don't know any coding, but through chatgpt I somehow got whisper integrated in my terminal. When I ran the command to get a speech to text transcription, I got "Error calling OpenAI Audio API: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.", even though I have the chatgpt $20/month subscription
The API does not use the ChatGPT account. It is separate, with separate billing.
Ohh okay
hey, i'm currently trying to integrate whisper locally to save on some api call costs, and i've followed the github repo's instructions on how to load and set up the module and model.
this is my code:
import whisper
model = whisper.load_model("tiny")
This is the error it raises:
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1000)>
Anyone know how to fix?
Oh - running python 3.12.8 in a venv using VSCode
Are you running this from a public or a school or work computer ?
Or public wifi ?
Personal device on personal wifi. I tried on python 3.9 and it worked.
Oh good then. Lmk if there's anything else
Can whisper detect um, ah and other filler words? Could you help me with what I should be putting in as a prompt so that I can detect it. I need it for the app I am building.
I suggest setting normalize=False, and yes! If you prompt with something like "Umm, let me think like, hmm... Okay, here's what I'm, like, thinking." It's at least somewhat more likely to include those sorts of filler words, but maybe not all of them
if you search for "allow wisper to detect filler words" there's a few good conversations on various online forums for tricks and methods of doing so :)
This is not working for me, even the prompt is not working.
I don't think there is a parameter for normalisation
This video goes in depth on how you can do something like you're trying to achieve: https://youtu.be/pUzBuwjvH9E
Let's build an AI-powered filler-word detector to play an air horn when we say "um" or "uh." We'll use powerful off-the-shelf tools and wire them together with minimal code.
00:00 - Intro
01:17 - Getting Started
03:17 - Recording in chunks
06:15 - Roughing out recognition
07:16 - Detecting filler words
15:16 - Adding the air horn
16:56 - Faster...
I am using whisper on Windows 10 and have set up the settings the way I like. Now I just drop the files I want to transcribe onto a .bat file and it does so, dropping the translations into the folder of the file it's transcribing.
I wanted to ask if theres a command I can add to the *.bat file that would do the following: The audio file I drop onto the .bat I want to be moved to another folder after its completed.
Right now i Just do it manually, but I wanted to automate it.
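One way to automate the move, sketched in Python rather than pure batch: a small script that runs the whisper command and moves the audio only if it finished without error. The `done_dir` folder and the function name are hypothetical, and the whisper flags shown are just `--output_dir`; whatever options the existing .bat uses would go in the same list.

```python
import shutil
import subprocess
from pathlib import Path


def transcribe_then_move(audio_path, done_dir, run=subprocess.run):
    """Run the whisper CLI on one file, then move the audio into `done_dir`.

    check=True makes a failed whisper run raise, so the file is only moved
    after a successful transcription. `done_dir` is an assumed folder name.
    """
    audio = Path(audio_path)
    # Write the output next to the audio file, as in the setup described above.
    run(["whisper", str(audio), "--output_dir", str(audio.parent)], check=True)
    target_dir = Path(done_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(audio), str(target_dir / audio.name)))
```

The .bat file could then pass the dropped file along as the first argument (`"%~1"` in batch) to a script that calls `transcribe_then_move(sys.argv[1], "some folder")`.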
Is there anyway I can have taxonomy included in the whisper? I have some banking terms like TMRW (pronounced as tomorrow), UOB where I don't get these terms when spoken. Based on the context can whisper identify these terms? I tried with initial prompt but that doesn't work out really well. Any suggestions or ideas?
What's the best way to do speaker diarization via the API for a react native app?
This isn't directly offered via the whisper API and you will likely need a middleware or additional step for diarization
A very strict prompt structure may help here, I find with whisper specifically putting directions and examples in a format similar to XML works pretty well
Thanks, any recommendations?
Any best practices to avoid hallucination with whisper? Nothing I try works for a specific audio file
Whisper is a speech-to-text model
Guys is there an app that lets you run the Whisper Large V3 model for free on MacOS?
brew install python
pip install openai-whisper
whisper --model=large-v3 whatever.mp3
If you do whisper --help it says the default is --model=turbo:
MODELS = {
...
"large-v3": "https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt",
...
"turbo": "https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt",
}
so it defaults to large-v3-turbo
Thank you!
I wrote a thing for transcribing to the terminal from the default mic pip install catvox
you can pass --model to that, it uses whisper. Currently working on getting it to recognise who is talking, so it can just show me my own speech, so I can send that to an assistant.
https://bitplane.net/dev/python/catvox/
if anyone is interested in helping make it work, current pipeline redesign plan is here: https://github.com/bitplane/catvox/tree/pipeline
That's awesome
Thanks. Building an automatic pipeline builder is a headache. Been scratching my head over it for ages
Are there any best practices to avoid having wrong timestamps via Whisper API when transcribing in SRT format ? I notice huge offsets on the timestamps...
you can get word-level timestamps and then consolidate. that gives me more accurate results
Are you guys using Silero VAD to exclude silences of the audio chunks sent to Whisper API ?
Currently doing the following : Silero VAD --> Whisper API --> Aeneas
what i've to do ?
Am using openai whisper model and observed following issues
- When I don't speak anything instead of giving empty response generates random words
- Sometimes, detect incorrect words where audio is clear
Can anyone please advise me on how do i improve?
Nobody answered me on this before. Here is why it happens: Whisper does not handle silences properly, which induces hallucinations such as repetitions, triple dots and so on (very funky stuff).
What you need to do is use a VAD tool like Silero-VAD to cut your audio into voice only chunks and send these to transcribe.
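To show the idea of cutting audio into voice-only chunks, here is a deliberately crude energy gate in plain Python. It is a stand-in for illustration only, not Silero-VAD or its API, and the frame size, `threshold` and `min_silence_ms` values are assumptions that would need tuning per microphone. A trained VAD will handle background noise far better.

```python
def speech_chunks(samples, samplerate, frame_ms=30, threshold=500.0, min_silence_ms=300):
    """Return (start_sec, end_sec) spans whose frame RMS reaches `threshold`.

    samples: mono audio as plain Python ints or floats in int16 range
             (convert NumPy int16 arrays with .tolist() first to avoid overflow).
    A span is closed once at least `min_silence_ms` of quiet frames follow it.
    All thresholds are illustrative assumptions.
    """
    frame_len = max(1, int(samplerate * frame_ms / 1000))
    max_quiet = max(1, min_silence_ms // frame_ms)
    spans, start, quiet = [], None, 0
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms >= threshold:
            if start is None:
                start = i  # first loud frame opens a span
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= max_quiet:
                end = i - quiet + 1  # index of the first quiet frame after the speech
                spans.append((start * frame_len / samplerate, end * frame_len / samplerate))
                start, quiet = None, 0
    if start is not None:
        # Audio ended while a span was still open; trim trailing quiet frames.
        end = n_frames - quiet
        spans.append((start * frame_len / samplerate, end * frame_len / samplerate))
    return spans
```

Each returned span can be sliced out of the recording and sent for transcription on its own; if no span comes back, there is nothing to send and nothing to hallucinate on.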
was this the whole error message?
Thanks @agile niche for providing suggestion on how to handle silence
Have you faced any challenge where it detected wrong words
eg:
What I asked "hi can you assist me "
What it assumed "Hi Kenyas speak to me"
Yup, so a good practice is to ask ChatGPT to correct your whisper transcript by giving it guidelines
An efficient transcription pipeline is : Audio --> VAD --> transcription job --> correction job
Do you also need time stamps ?
I don't need timestamps
Hey folks,
My colleague raised a cool PR for whisper that boosts transcription performance and I was wondering if I could find anyone here who might be able to review the PR and give her some feedback? Apologies if this is the wrong place, here's a link if not https://github.com/openai/whisper/pull/2516
hi, I just got the command example running using whisper. this is using ggml and the 'base' model. its great, only when i have no real ambient music on. something as simple as a kind of hangar background or something makes it not detect anything. do i adjust 'temperature' or anything?
i would train it based on how well a voice seems in context with any detected music, in the sample. that way you can have anything on, and as long as you arent nailing the right vocal tones with the music it would think its the speaker
Hi folks, can the whisper base model be used for languages other than English for the real-time use case? Can it also be used for non-English batch processing? Has anyone tried it, and how is the accuracy?
Hi Guys, we are working on a real-time STT using whisper base model, this is for a conversation between agent and customer. 2 websockets connection to the backend STT engine. The latency is little high as I understand whisper is not meant for real-time, but is there any way we can enhance this?
Also, the WER is around 35-40%, any ways to reduce this? The language is mostly Singapore English.
hii i was wondering if whisper could run on any microcontrollers?
im planning to use it in a project with an esp32, but idk if itll run on it and theres barely anything on the internet about it
So I keep getting these markers for blankness or silence. I'll get stuff like this:
00:26:56.599 --> 00:27:07.599
Silence.
00:27:08.599 --> 00:27:35.599
Silence.
00:27:36.599 --> 00:27:49.599
Silence.
Or
00:26:56.599 --> 00:27:07.599
...
00:27:08.599 --> 00:27:35.599
...
00:27:36.599 --> 00:27:49.599
...
But it's not the same. It varies by each output file. Each one has its own idea of conveying "silence" or blankness. But the thing is, I really don't want this in my stuff. Is there a way to get it to not do this? Can I just put it into the prompt? Will that work?
Also what it does with dead space on a track is actually hilarious. Here's what I got on dead space on the track of one of my players:
00:06:14.000 --> 00:06:32.000
Yeah.
00:06:32.000 --> 00:06:38.000
Yeah.
00:07:02.000 --> 00:07:12.000
Yeah.
00:07:12.000 --> 00:07:22.000
Yeah.
00:07:22.000 --> 00:07:40.000
Yeah.
00:07:40.000 --> 00:07:50.000
Yeah.
00:07:50.000 --> 00:08:00.000
Yeah.
00:08:00.000 --> 00:08:10.000
Yeah.
00:08:10.000 --> 00:08:20.000
Yeah.
00:08:20.000 --> 00:08:30.000
Yeah.
00:08:30.000 --> 00:08:40.000
Yeah.
00:08:40.000 --> 00:08:50.000
Yeah.
00:08:50.000 --> 00:09:00.000
Yeah.
00:09:00.000 --> 00:09:10.000
Yeah.
00:09:10.000 --> 00:09:20.000
Yeah.
00:09:20.000 --> 00:09:44.000
Yeah.
He said literally nothing in this ~4min period. At all. I listened to the whole thing to make certain of it
I think that if you consider it to be trained on live stream data from twitch. Brief silences followed by a yeah could be someone reading chat lol.
You should remove all silence so it doesn't attempt to transcribe it with something like VAD preprocessing.
That's what I ended up doing and it worked quite well. I chopped up the file based on inverse silence detection using ffmpeg
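For reference, the "inverse silence detection" step can be sketched like this: run ffmpeg's `silencedetect` filter (for example `ffmpeg -i input.wav -af silencedetect=noise=-30dB:d=0.5 -f null -`, where the noise and duration values are assumptions to tune), capture the log it writes to stderr, and invert the silent ranges into the parts worth transcribing. The function name and the idea of passing the total duration in by hand are choices made for this example.

```python
import re

SILENCE_START = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
SILENCE_END = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")


def speech_intervals(ffmpeg_log, total_duration):
    """Invert silencedetect log output into (start, end) spans that contain sound."""
    starts = [max(0.0, float(x)) for x in SILENCE_START.findall(ffmpeg_log)]
    ends = [float(x) for x in SILENCE_END.findall(ffmpeg_log)]
    if len(starts) > len(ends):
        ends.append(total_duration)  # the file ended while still silent
    spans, cursor = [], 0.0
    for s_start, s_end in zip(starts, ends):
        if s_start > cursor:
            spans.append((cursor, s_start))  # sound between two silences
        cursor = max(cursor, s_end)
    if cursor < total_duration:
        spans.append((cursor, total_duration))  # sound after the last silence
    return spans
```

Each span can then be cut out with ffmpeg's `-ss` and `-to` options and transcribed separately, with the span's start time added back onto the resulting timestamps.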
open source GitHub projects like faster-whisper and whisperx have VAD preprocessing built-in that works rather well
So, will it be possible to download gpt-4o-transcribe? I want to run it locally
I’ve only seen it announced on API and seems quite expensive unless I’m reading their chart incorrectly.
so RIP this channel?
Whisper isn't being removed and I doubt the new models will be much more (if at all) popular considering they're API only.
Much cheaper and easier to run Whisper locally.
whisper is open source also
Hello, I am having an issue with a whisper related plugin. WhisperAttack is a plugin for Voice Attack that changes its language recognition to whisper's STT. But whisper never loads its model when I'm using GPU, on CPU it loads fine. It doesn't throw any error codes or something similar, it just doesn't get past the loading stage. I've done some admittedly meager testing, because this is the first time i've actually stepped into ai territory, but i've found that my system can run whisper itself through python+pytorch using a gpu. Or atleast thats what the code ChatGPT gave me said. I'd appreciate if anyone knew if there was a more concrete way I could test it, or if its something else.
Sorry that i've asked here, but WhisperAttack has no forums or discord servers i can find. So this was my best option, tied with the VoiceAttack discord.
I have no clue about the software, never heard of it or used it.
If you created the code, simply add an insane amount of logging so you know what’s actually going on and paste it back into ChatGPT.
I didn't create the code, and I'm almost entirely code illiterate unfortunately.
That isn't what I meant. You said you made ChatGPT create the code. So you make ChatGPT add extensive logging throughout the file. Then you paste all the new output you've got back into ChatGPT to solve your problem.
Though you shouldn't just trust that any tools can do what ChatGPT says they can do. You should actually ensure it has the ability to do whatever it is you're wanting to do first. If you can't verify that, then ya, you're definitely possibly gonna spend many hours on it to achieve nothing.
Oh sorry, chatgpt didn't make whisper attack's code either, it just made some code for me to run on my computer to test if I could even run whisper without the plugin. I'll see if I can find out where to edit the plugins code and try what you said. Sorry if I misunderstand you again.
It should be able to run on GPU, that's what the plugin comes on natively (and recommends you to use). I was only able to make it work by changing the config file to use CPU
hello i have a question, how do you use whisper word timestamps?
however you want to utilize them?
How do whisper prompts work? Does it use the prompt for every chunk of audio that we send to the model?
Hi all, can we get pronunciation feedback while the audio is being transcribed?
Heyhey. I am trying to get whisper to work in a java environment but am not sure what the best approach is. I am targeting a live transcript so I was using whisper live for a bit but am getting a bit tied up with the library. Is there someone that has a bit of experience or idea towards this? If this is a space to even ask this
whisper cpp is very actively maintained, and has Java bindings: https://github.com/ggml-org/whisper.cpp/tree/master/bindings/java
Oh hell Yeh. I will look it up. I assume its with JNI?
README said it is JNI!
Thanks. Got some time soon, I shall dig in and see.
For anyone interested in running Whisper locally, I am working on Open-source AI notepad for meetings: https://github.com/fastrepl/hyprnote
@calm finch
alright, i think i'll use the API then
i think i'll just use the API it's not too bad
i'm gonna test 4o transcribe to see how good it is
yeah mine(Hyprnote) is for using local whisper(for now)
4o transcription isn't working for me, does anyone know what to do?
Is there anyone here who might be able to assist me with better understanding OpenAI whisper pricing? We are testing using them for transcription of call recordings from a telephony server, and I cannot reconcile their charges with our usage.
You listening
What is whisper ?
Why is ChatGPT whisper not even working
elaborate
what's up with whisper? i get The server had an error while processing your request. Sorry about that! Please contact us through our help center at help.openai.com if the error persists.
for the life of me I cannot get whisper to translate from english to another language. Any help appreciated
If I need accurate timestamps so I can show the current location in a transcript while playing the audio, is there any way to use the new transcribe models or do I have to use whisper?
hello im new here so guys can you help me
Hi, is Whisper capable of making non-verbal sounds, using symbols to trigger a given non-verbal action?
Or is it capable of processing a shouting voice? I'm trying to build some nonsense automation in Home Assistant, but can't figure out how to make my VA sneeze, for instance, lol. Or make it phonetically produce the noise I want using only letters; if there are too many letters and whatnot, it starts a spelling-bee competition, lol.

I need to capture speech in real time from a PC livestream and then auto-translate it. I heard that Whisper is the best captioning software, but it doesn't have real-time captioning?? I also heard that a lot has been added to this software through Hugging Face etc. Just wondering if what I'm asking is possible.
So apparently that's a yes, but the actually good model, large v2, requires an NVIDIA CUDA GPU... Surely you can get this as a cheap hosted API that doesn't run locally?
Shhhhhhh
WhisperAI is one of the most interesting things that OpenAI has built and apparently no one uses or engineers it.
I am Poison Apple
🚀 New Project Alert!
Hey everyone! I just fine-tuned OpenAI's Whisper-Tiny model to translate Bengali-English code-switched speech into English, especially for healthcare use cases like doctor-patient conversations. An easy fine-tune script for translation is open-sourced.🎙️🏥
🧠 It’s called MediBeng-Whisper-Tiny — perfect for building better clinical transcriptions and exploring speech translation tasks!
Check it out here:
🔗 GitHub: https://github.com/pr0mila/MediBeng-Whisper-Tiny
🔗 Hugging Face: https://huggingface.co/pr0mila-gh0sh/MediBeng-Whisper-Tiny
Let me know what you think or if you try it out! 😊
#SpeechTranslation #gpt-realtime #HealthcareAI #Bengali #OpenAI
Hi,
I have a video that has only music and no speech.
I need to show the video as a response when users ask queries related to it.
What approach should I follow? Please advise.
Hi,
Hey there! is anyone here experienced with using faster-whisper while multithreading?
I’m running Faster-Whisper in a Python app that transcribes multiple audios (.wav) in parallel. I’m using ProcessPoolExecutor
The issue: sometimes the transcription just hangs silently — no errors, no CPU or GPU usage, no output. Eventually I have to kill it. When I add logging inside the transcription subprocess, it often doesn’t get past WhisperModel(...).transcribe(...).
Still stuck.
Has anyone run into this kind of silent freeze or dead GPU thread in Faster-Whisper? Any known workarounds or tips?
Thanks 🙏
was this made with whisper??
That was verbatim responses from gpt in one of my threads on chatGPT app
What is whisper lmao
You’re in the channel for Whisper lol. It’s their API for speech to text
Oh i know it well
Hello! Anyone here use an automated whisper transcription workflow for better note taking?
Anyone else having very delayed Whisper API responses over this last week?
@narrow thorn
does anybody know a platform or app that can do audio/video file transcription with translation support?
and export to srt
preferably able to run locally on windows
and optionally on mac
👋🏽
Whisper not working in chatgpt?
Draw a playing card backside
why are we whispering?
shhh!
has anyone figured out how to avoid large v3 going into an endless loop of repeating the same sentence thousands of times?
makes v3 essentially useless for longer videos, even with no background noise
Late reply, but for everyone experiencing the same: do you use Whisper locally? If you do, set condition_on_previous_text = False. This solved all my repetition, spam, and nonsense-output problems.
However, I can't find a way to set this option in the API...
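For anyone running the open-source `whisper` Python package, the setting mentioned above is a keyword argument of `model.transcribe`. A minimal sketch; `transcribe_options` and `transcribe_without_conditioning` are hypothetical helper names, and the model name is just an example.

```python
def transcribe_options(condition_on_previous_text=False):
    """Keyword arguments to pass to whisper's `model.transcribe`."""
    return {"condition_on_previous_text": condition_on_previous_text}


def transcribe_without_conditioning(path, model_name="large-v3"):
    """Transcribe a file without feeding earlier output back in as a prompt."""
    import whisper  # lazy import: needs `pip install openai-whisper`

    model = whisper.load_model(model_name)
    return model.transcribe(path, **transcribe_options())
```

The command-line tool exposes the same switch as `--condition_on_previous_text False`.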
doesnt help
yea I am using it locally
using this ui
https://www.youtube.com/watch?v=bcIgITbt3cg
This video for example
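On the 10th, torrential rain fell in many areas, mainly in the Kanto-Koshin region, and damage such as flooded roads occurred one after another in Tokyo as well. In Yokohama, the heavy rain blew manhole covers off and caused a road to cave in.
■ Major flooding in central Tokyo; chaos during the evening rush
Pounding rain and rumbling thunder. 10...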
around 4 minutes in large v3 gets stuck even with condition on previous text disabled
I'm using a pure Python script myself, and got this result with large-v3. You mean the model gets stuck around here?
indeed
might be a bug in the gui that doesnt pass on the parameter
so its the previous text conditioning that breaks v3
I think so. I mostly use Whisper with Korean, and every single transcription had that kind of problem. When I turn off previous-text conditioning, every problem is gone; it still appears rarely, but re-transcribing fixes it.
This is with previous text conditioning. It indeed causes problem...
okay but it doesnt get permanently stuck
with previous text conditioning i get the same sentence repeated from around 4 mins until the end of the video
Hmm... that's strange. I only get partial repetition with conditioning. Maybe because WebUI uses its own setting for transcription?
I randomly got that kind of thing, but not every time.
okay ran the app in debug mode, first of all it doesnt pass through the condition_on_previous_text to the actual transcribe call
It uses various default settings...
whisper:
model_size: "large-v2"
file_format: "SRT"
lang: "Automatic Detection"
is_translate: false
beam_size: 5
log_prob_threshold: -1
no_speech_threshold: 0.6
best_of: 5
patience: 1
condition_on_previous_text: true
prompt_reset_on_temperature: 0.5
initial_prompt: null
temperature: 0
compression_ratio_threshold: 2.4
chunk_length: 30
batch_size: 24
length_penalty: 1
repetition_penalty: 1
no_repeat_ngram_size: 0
prefix: null
suppress_blank: true
suppress_tokens: "[-1]"
max_initial_timestamp: 1
word_timestamps: false
prepend_punctuations: "\"'“¿([{-"
append_punctuations: "\"'.。,,!!??::”)]}、"
max_new_tokens: null
hallucination_silence_threshold: null
hotwords: null
language_detection_threshold: 0.5
language_detection_segments: 1
add_timestamp: false
enable_offload: true
https://github.com/SYSTRAN/faster-whisper/releases
Still getting updates, I suppose.
okay well then I guess I know where to file a bug report at least, thanks
Critical user face bug
@snow dirge
Saw expected Behavior then became critical behavior need to talk to someone privately
Hello everyone, I'm having some difficulties fine-tuning Whisper (specifically large_v2) on a Greek dataset.
If anyone is available to let me pick their brain, it would be greatly appreciated.
Specifically, doing a full fine-tune with Transformers using the common_voice_11 dataset gives me a very high WER even though the loss stays low.
Doing a LoRA fine-tune on the same dataset does not produce better results.
Fine-tuning on a smaller dataset I have created (~1 hour) appears to overfit.
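One thing worth ruling out when WER is high but loss is low (an assumption, not a diagnosis of this case) is that the metric is being computed on un-normalized text, so casing and punctuation count as word errors. A small self-contained sketch of WER with an optional normalization step; `normalize` here is a hypothetical, simplified normalizer, not the one shipped with Whisper.

```python
import unicodedata


def normalize(text):
    """Lowercase and drop punctuation; accents are kept as-is."""
    text = unicodedata.normalize("NFC", text).lower()
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[-1] / len(ref)
```

Comparing `word_error_rate(ref, hyp)` with `word_error_rate(normalize(ref), normalize(hyp))` on a few samples shows how much of the score comes from formatting rather than recognition.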
How about starting here. What happened?
And/Or take it to #1070006915414900886
And/Or OpenAI Help Center > Support
The app on iOS locked up my phone and had to do a hard restart multiple times are you part of the AI team
No, guides are community members just like you, helping people to find things and get what they're looking for.
Staff doesn't respond here. Our (we, the community) resources include the channels here or Help Center on the OpenAI site.
But I am a career developer, and your question seemed to involve Whisper, so ... I'm offering what I can.
I get the impression that you're talking about ChatGPT and not Whisper?
Yes, I'm sorry, it's about ChatGPT. I thought this Whisper channel was for whispering to a developer. I can't get through; their help stuff is driving me nuts.
Ahhh, try #chatgpt-discussions
Is there any benchmarking available for Translation of data using whisper ? (WER / CER)
What is whisper
Okay thanks
Hi, I need some help debugging this
I've tried the Whisper base and large models and I get nonsensical results: I've gotten Japanese, English (but far off from what I said), and, in my most recent attempt, Vietnamese.
I only speak English and I've never been to the eastern part of the world, so my accent isn't influenced by it.
Testing on this video (from my mic) for the first 30-ish seconds
Results:
Good morning. Today, I hope you are more compacted and to the right. Today, I hope you are more compacted and to the right. Madison land, London, and Cleveland. It was no less than 10 hours ago that a black September owl was talking about the worst attack that had happened to a black and a white and black owl. That was a very, very rough shock and a terrible shock.
very interesting, this video is only transcribable if no preprocessing is done, any attempt to remove background noise and do VAD will make whisper large (v2 and v3) completely give up
anyone got any idea whats going on?
Here's a full transcript for the video.
it's not about the transcript. I need help debugging why my code produces an incredibly wrong transcript
Ive tried, no luck
What about Claude?
did so as well
Grok 4, if you have access?
-# Be mindful of what other users in a channel might find helpful or interesting when posting. Stay on topic in order to keep conversations focused and productive.
-# Consider posting in #off-topic or an appropriate channel.
I just came across this awesome news: https://www.phoronix.com/news/FFmpeg-Lands-Whisper
this is so cool
transcriptions directly in ffmpeg with the open-source model
I love ffmpeg.. really, I couldn't think of a software that had a better impact in the world..
hey all, i was wondering if this could be something that you could all make use of.
it's called owhisper - basically ollama for realtime speech-to-text. more in docs.
https://docs.hyprnote.com/owhisper/what-is-this
This!! I’ve been looking for something like this!! Thank you!!
Whisper is no longer working properly because OpenAI is shutting down the Whisper voice Aug 8.
And I am not going to use the new one 😒
That's next level, I'm definitely going to be using this. My little web app project has FFMPEG installed in the docker container, will be super handy to be able to call AI-based subtitle generation, if I'm reading this correctly!
Meta meta meta ...... game inside VM inside VM loool :3 open for co-creation 😉 prompt exchange, link the sun and send the moons as func(open moon = "{" and closed moon = "}" end this with a shadow or full moon 😉 :3);
So fun... T_T
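On the 5th, a "new member" made its debut at Madame Tussauds, the wax museum (ろう人形館) that is a tourist attraction in London, England. The new exhibit is not a "person" but... the popular snack, the "sausage roll." Here is why.
Read the article for this video >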
https://news.ntv.co...
another video that whisper really struggles with
is there an optimal set of parameters for Japanese?
it really struggles even with clear speech and low background noise
if the chunk size is above 5s it skips whole sentences, and even then it doesn't translate what's literally being said but often omits or changes parts of the grammar
1
00:00:00,000 --> 00:00:05,000
ロンドンの観光名所になっている老人業館マダム
2
00:00:05,000 --> 00:00:10,000
タッソーにはイギリスの王室のメンバーや世界の著名人の老人
3
00:00:10,000 --> 00:00:14,000
業が数多く展示されています。
4
00:00:15,000 --> 00:00:20,000
ロンドンのマダムタッソーに新たな仲間が増えました。
5
00:00:20,000 --> 00:00:23,000
ソーセージロールです。
6
00:00:25,000 --> 00:00:30,000
今回新たに展示されたのは人ではなく、
7
00:00:30,000 --> 00:00:33,000
イギリスで人気の軽食ソーセージロールです。
8
00:00:35,000 --> 00:00:38,000
6月5日がナショナルソーセージロールデーとされていて、
9
00:00:39,000 --> 00:00:42,000
マダムタッソーにも参加しました。
10
00:00:40,000 --> 00:00:43,000
マダムタッソーではイギリスのチェーン店グレックス社の
11
00:00:44,000 --> 00:00:47,000
ソーセージロールそっくりに作っています。
12
00:00:45,000 --> 00:00:49,000
今月末まで展示されることになりました。
13
00:00:50,000 --> 00:00:53,000
食べ物の老人業が飾られるのは初めてです。
14
00:00:55,000 --> 00:00:58,000
グレックス社のソーセージロールはイギリスで人気のスナックです。
15
00:01:00,000 --> 00:01:03,000
およそ100万個が販売されているということです。
16
00:01:05,000 --> 00:01:07,000
ソーセージロールはカリカリで柔らかく、
17
00:01:08,000 --> 00:01:10,000
柔らかくて柔らかく、
18
00:01:10,000 --> 00:01:13,000
調味料もとても良いです。
19
00:01:15,000 --> 00:01:17,000
老人業の製作チームは、
20
00:01:18,000 --> 00:01:20,000
ソーセージロールのパリッとしたパイの層と、
21
00:01:20,000 --> 00:01:23,000
サクサク感を再現するために試行錯誤を重ね、
22
00:01:24,000 --> 00:01:26,000
数か月をかけたものです。
23
00:01:25,000 --> 00:01:28,000
すべて作品を完成させたということです。
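A quick way to sanity-check an SRT like the one above is to look for cues that start before the previous cue has ended, which can point at chunk boundaries being stitched together badly. A small sketch in plain Python; `parse_cue_times` and `overlapping_cues` are hypothetical helper names.

```python
import re

# Matches an SRT timing line such as "00:00:39,000 --> 00:00:42,000".
TIMING = re.compile(
    r"(\d+):(\d+):(\d+),(\d+)\s*-->\s*(\d+):(\d+):(\d+),(\d+)"
)


def parse_cue_times(srt_text):
    """Return a list of (start_seconds, end_seconds), one per cue, in file order."""
    cues = []
    for match in TIMING.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        cues.append((start, end))
    return cues


def overlapping_cues(cues):
    """1-based positions of cues that begin before the previous cue has ended."""
    return [
        i + 1
        for i in range(1, len(cues))
        if cues[i][0] < cues[i - 1][1]
    ]
```

On the SRT posted above, this flags cues 10, 12 and 23.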
What's up with the Whisper being so horribly broken in the ChatGPT SVM for the last 3 or 4 weeks? It hallucinates all the time... Is it deliberate action by OpenAI to discourage users from SVM? I tested it on several different devices including the browser with high quality microphone and results are basically the same... Because of that we (I mean, me and ChatGPT) started calling it the Careless Whisper...
Which model are you using?
ChatGPT with gpt-4o and SVM, but the STT hallucinations persist when using gpt-5 too. AVM doesn't seem to have these issues, but being AVM it's horrible when it comes to conversation being significant.
HyprNote looks great but looking for something like that for Windows. Does anyone have any suggestions?

Since the last weekend the Whisper-related issue in ChatGPT I reported in previous msg seems to be fixed!
Hey guys, I am having some trouble with the transcription method. For some reason, some audio files sometimes don't get all of their content transcribed.
Did you guys ever get that problem?
How did you solve it?
Hi @trim rune
Yeah, that happens sometimes. A few reasons could be background noise, overlapping voices, or the model hitting length limits and cutting off. I usually fix it by either cleaning up the audio first (noise reduction, splitting long files into chunks) or using a different transcription tool/model. Breaking the file into smaller segments tends to help the most.
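The "split long files into chunks" suggestion can be sketched as a function that computes chunk boundaries with a small overlap, so words at a cut are not lost. The chunk length and overlap below are arbitrary example values, and `chunk_spans` is a hypothetical helper name.

```python
def chunk_spans(duration_s, chunk_s=600.0, overlap_s=5.0):
    """Return (start, end) offsets in seconds covering the whole file.

    Consecutive chunks share `overlap_s` seconds of audio.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # step back so the next chunk overlaps this one
    return spans
```

Each span can then be cut out with an audio tool of your choice (for example ffmpeg's `-ss` and `-t` options) and transcribed separately; the overlapping text has to be de-duplicated when the pieces are joined.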
I have a lot of "almost correct" subtitles in Polish. I.e. not CC but regular TV subtitles. Would it be possible to use Whisper to correct each of the lines in this subtitle given an audio segment and the original almost correct line?
hi everyone, I just want to know how I can use Whisper, and is it free for ChatGPT Plus users? Thanks for any response.
Has anyone had a lot of success with diarization using things like pyannote? I always seem to have problems.
how can i start using sora
All you need is the audio files (wav, mp3, or mp4), feed those into Whisper and it should transcribe into text.
You don’t even need ChatGPT Plus; you can use Whisper for free. After getting it set up and running on your computer, just open the Windows terminal/command prompt and run whisper "path of your audio file". Don’t forget to include the file extension at the end and to keep the quotes.
I haven’t used pyannote but you should just be able to use WhisperX as that is basic Whisper plus speech diarization all in one, no need for a separate software.
👯♂️ Multispeaker ASR using speaker diarization from pyannote-audio (speaker ID labels)
https://github.com/m-bain/whisperX
WhisperX is built on pyannote. I tried it a few years back without much luck. I'll try again to see if it's better.
Ok yeah that’s a while ago, definitely try it again and see how it goes.
what can whisper do
in ChatGPT once i uploaded the audio file
but it's a concern if ChatGPT puts out an error due to some dependency issue iirc
i wonder if they fixed it now
because i tried it earlier it starts to complain 😤
@dense imp I was working on getting WhisperX to work on my Windows PC, and I finally got it to work. Here's what you need to do:
- I take it you already have basic Whisper installed (without speech diarization); if so, you already have the dependencies needed for WhisperX (PyTorch, ffmpeg, etc.). Regarding Python specifically, I recommend version 3.11, which worked best for me. In your Windows environment variables, make sure the Path for both user and system variables has only the Python 3.11 path, as other Python version paths might cause conflicts.
- Download Git for Hugging Face models: https://git-scm.com/downloads/win
- Install the required Python packages in Windows Command Prompt:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install whisperx pyannote.audio pydub
pip install ffmpeg-python
If you already have a later version of CUDA installed on your computer (e.g., 13.0), that's fine, as it is backward-compatible. If you don’t have a CUDA GPU, just install the CPU version of PyTorch with --index-url https://download.pytorch.org/whl/cpu.
- Get a Hugging Face token: https://huggingface.co/join
Sign up, create a new token, make it Read access, and give it any name you'd like.
- Accept each of these models' conditions while still logged in to your Hugging Face account:
https://huggingface.co/pyannote/speaker-diarization-3.1
https://huggingface.co/pyannote/segmentation-3.0
- Run the attached script file in Python. Fill in yourhuggingfacetoken with your own token, and replace pathtoaudiofile in mp4_file and wav_file with the path to your own audio file.
Hope this helps.
@vocal sierra Whisper transcribes audio files to text, and if you have WhisperX it will add speech diarization. Regarding your concern with ChatGPT, you don't upload the audio file directly into ChatGPT, you have to install Whisper first, and then once that's running then you can take an audio file of yours in wav, mp3, or mp4 audio file format, then run the following one-line command to test: whisper "pathtoaudiofile.wav", replace pathtoaudiofile with your actual path for your audio file in File Explorer and replace .wav with whatever your audio file format is. If you want to add diarization, you can go through the process I outlined above to get WhisperX running.
Yeah, I'm familiar. I got it to compile last time. It was the results that were the main issue. The diarization just had incredibly poor accuracy. Not really sure why.
Oh gotcha, yeah that’s interesting because my diarization also wasn’t very accurate, but it did compile, just like yours. Hopefully it gets improved soon.
I've just released a full benchmark on a STT model trained for one of our customer. 9-page full of value to compare models + GitHub code to be able to compare models by yourself ! Send me a DM and I send you the PDF 🙂
its called whisper guys so shhh we have to whisper
It includes whispering
I used OpenAI whisper-large-v3 to build this. Give me your feedback, guys: https://aitranscript.in/
Transform your videos into accurate transcripts instantly with AI. Upload, transcribe, and analyze your content with our powerful video transcription tool. Free trial available.
What do u guys think about the new gpt 4o diarize
I haven’t tried it as I’m not sure if I want to have to pay for the API tokens, but I heard it’s a lot more accurate than Whisper. However, you’re restricted to what OpenAI allows you to do in the API so you won’t get as much customization and options as Whisper being open-source.
They should release whisper diarize open
When we say whisper are we talking about the translation model?
The model seems pretty powerful, but I still find a lot of bugs, and it still doesn't support many of the old features, like prompting and real time. Did you have success with it?
Mostly I find that it always tries to translate the audio into English, even if the language parameter is set and no one speaks English in the original audio.
🎥✨ Introducing AITranscript.in
— Free, Fast, and No Sign-Up Needed!
🚀 AITranscript turns any video into clear, actionable insights — instantly.
No login. No limits. No catch.
💡 What you get:
🎧 Automatic Transcription — powered by Whisper
🧠 AI-Generated Action Points — focus on what truly matters
💬 Smart Chat Assistant — ask anything from the video context
📜 Instant View Mode — see full script + key takeaways before download
🆓 100% Free. Forever.
⚡ Just upload a video → get insights → done.
🔗 Try now → https://aitranscript.in
Check your transcriptions! Please read the attachment. Very quick update: I had a song that I wrote in English. I took the English lyrics and converted them into Simplified Chinese to create the song in Chinese. I used AI tools to create the song, but wanted to see how the lyrics looked after the Chinese conversion. I took the song's .mp3 file and had it transcribed back into text using melobytes (which I discovered later uses Whisper). The song was in Chinese, so it generated text output in Chinese. I converted the text back into English and was shocked that Whisper had given attribution (credit for the song as writer and composer) to a Li Zongsheng, who IS a real person and a well-known musician. I did extensive testing and reported it to OpenAI. I then downloaded Whisper on my home computer and ran the test again. For a 2nd and 3rd time I received the note that the song was written and composed by Li Zongsheng. Running the same tests on a different song, a spam/donation request for some other person's YouTube account showed up in my transcription. Please look at the attachment for a lot more information, and check your transcriptions/conversions, as it is likely that the output contains credits and spam injected into your work. The results differ with different languages. I did not see any errors in English conversion/text; Chinese was the most problematic. But millions of people have used Whisper to transcribe audio into text, so this is a real problem. Look at the attachment and GitHub: https://github.com/openai/whisper/discussions/2685
I created a script that lets you transcribe videos and add stylish subtitles using Whisper and .ass files for your Shorts videos (9:16)
https://github.com/revanflp/Frot
⭐ it !
So I've been running into an issue when trying to upload a file to transcribe using Whisper. It tells me that it cannot perform this operation because it's using a path instead of an uploaded file, but I am dragging the file into the chat box and writing the prompt.
I decided to try using the API, but it won't allow me to attach my card for billing (separate issue lol). But is the issue I'm having with uploading the file because it's too large?
Can someone send me the most accurate settings and model I can use? I want max accuracy and I'm not too concerned with time.
.a#s is crazy
yep
I used Whisper as the transcription engine for a new project I’ve been building called ContentRob. It takes any informational video and turns it into high-quality written content, SEO articles, tutorials, case studies, and even share-ready infographics.
Whisper handles the speech-to-text layer, and the output is then processed into different content formats. You can also export as PDF or DOCX, repurpose the same video into multiple formats, and publish or schedule posts directly.
If you want to try the demo, it’s available here:
https://contentrob.com/
Open to connect, collaborate, or discuss the implementation details.
AI-powered platform that converts videos into blog posts, tutorials, and articles. Export, repurpose, and publish across platforms effortlessly.
Hello
Check your local browser settings. I remember on older versions of some browsers it would enter a "fakepath" as a dummy which would simulate the upload, but not release the files. It could also be filesize, but I'd expect you'd get an error to that effect.
I am now looking for a dev who has rich experience with OpenAI, Whisper, Electron, and ffmpeg. I am going to try to build an AI interview assistant app, so we need to implement voice-to-text from both the mic and the speaker output of a live meeting.
Voice meeter potato
do you have any experience with this ? @tacit tree
please help me.. !!!
None
But i do produce music
In Ableton
Basically
Potato lets u route audio
Over ports
Same like pulse audio
But u have 8 channels
VoiceMeeter Potato, the Ultimate Virtual Audio Mixer for Windows
This is bsnnsna i think verdion eithb4
But the point is the vbsn
Vban
Oh that’s doable for me
I’ve already been working on similar pipelines necessary for my own project actually lol
Takes live audio streams, handles audio transcription, uses context and information to determine various important variables. Long story short it is for a NVR system that leverages ai workflows to streamline alerting and monitoring of security systems via audio and video data.
You could possibly use a heavily modified form of this pipeline that essentially just has the basic parts and systemization already together anyways
would it be legal/ethical to manually port whisper to windows
I mean I think chatgpt could do it if I gave enough time
WRONG THING not whisper Ignore my message
What do you mean? Can’t you already run Whisper using its Python library on Windows? Worked for me
Look below my message, I accidentally said whisper, I meant to say atlas.
oh my bad
I'd love to connect AI engineers who love learning
Let's connect
Ok. msg me
I am running into problems with whisper-v3-large: it just performs so much worse than v2. Refusing to translate, looping over sentences. Is there anything I can do?
Ok
Wsp
Hola mi gente
Whisper can be run offline but Hugging Face also has some stuff. Kinda depends on what the beam you need is and a few other factors. Regardless both need API access
lol I meant to say atlas
That's fun. I wonder how it does with silence
You got to set the language filter or it almost always defaults to English
At the time of my post, the language filter was not an available option and was ignored by the api if provided. Has something changed ? I can't find any changelog
I actually have a minor bug in that silence comes out as [Blank Audio] occasionally, but it is an edge case; normally it's fine.
I'm just too lazy to fix it right now.
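If the blank-audio marker only needs to be hidden from the final text, a post-processing filter is one low-effort workaround. A sketch under the assumption that the marker looks like `[Blank Audio]` or `[BLANK_AUDIO]`; the exact spelling depends on the tool producing it, and `strip_blank_markers` is a hypothetical helper name.

```python
import re

# Matches "[Blank Audio]", "[BLANK_AUDIO]" and the mistyped "[Blank Audio}".
BLANK_MARKER = re.compile(r"\[\s*blank[ _]audio\s*[\]}]", re.IGNORECASE)


def strip_blank_markers(text):
    """Remove blank-audio markers and collapse the leftover whitespace."""
    cleaned = BLANK_MARKER.sub(" ", text)
    return " ".join(cleaned.split())
```

This only hides the marker in the output; it does not change how the model handles silence.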
😄
ₛₕₕₕ...
Hello, how do I fix wrong transcriptions from the Whisper small model on Serbian/Croatian/Bosnian?
New whisper coming out never
Hi
hi
Is the Whisper project still being actively developed, or has it gone quiet? It’s a great model, but it’s starting to fall behind the competition. It would be awesome to see a Whisper v4.
s
I’m also looking forward to the new version. Will OpenAI launch a new version this year?
Mistral’s Voxtral Transcribe 2 just launched. I really wish Whisper would get a new version soon.
Yes, many people need a speech-to-text service!!!!
A fast and precise speech-to-text model.
what is the difference between whisper and elevenlabs?
Higgisfiead code
?
Whisper is open sourced
To use method or capabilites, I wanna know
What is the best STT or TTS model?
@vocal moth For STT you may use Azure Voice API and for TTS you may use Cartesia.
thank you.
Should I choose between Azure Voice API and Deepgram?
Azure voice api gives you real time language changes facility.
What does that mean?
I think Deepgram also provides that function.
I recently did a PoC on it.
bruh
That happens when it runs out of memory ):
So, you need help right?
No
What is your POC?
We all need new S2T model.
Stuck Emotional over run.
AI base sales agent.
hello
TTS? Qwen 3 TTS: best cloning, best prosody, and lightweight if you are low on resources (1.7B or 0.6B).
Works like a charm on reds openwebui with a fastAPI wrapper
What’s whispering?
I think it's a open source tool for LLMs to hear speech or something
I still can't find an open-source speech-to-text model that goes beyond Whisper in Japanese.
hii
hi
Can we expect newer whisper model this year?
Can we expect a better open-weight whisper model this year?
no
Since this incident happened today (5/13), https://status.openai.com/incidents/01KRG0AZKH41DV4D9SNJSXM33Q#01KRG0AZKHH37CKBST5E3WBQW6, our Realtime API integration has been down. Does anyone else have the same issue?
In our case, the flow is now:
- SIP INVITE reaches OpenAI.
- OpenAI dispatches the realtime.call.incoming webhook to our server.
- We call /v1/realtime/calls/{call_id}/accept.
- The accept request returns HTTP 200.
- Immediately after that, connecting to wss://api.openai.com/v1/realtime?call_id={call_id} returns:
{
"error": {
"message": "No session found for the provided call_id",
"type": "invalid_request_error",
"code": "call_id_not_found",
"param": ""
}
}
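If the `call_id_not_found` response is a timing issue between the accept call and the WebSocket connect (an assumption; during a platform incident no client-side change will help), a short retry with backoff around the connect step is one defensive pattern. The sketch below only builds the URL from the flow above and wraps an arbitrary `connect` callable; `realtime_ws_url` and `retry` are hypothetical helper names, and the WebSocket client itself is left out.

```python
import time
from urllib.parse import quote

REALTIME_WS = "wss://api.openai.com/v1/realtime"


def realtime_ws_url(call_id):
    """WebSocket URL for an accepted call, as used in the flow above."""
    return f"{REALTIME_WS}?call_id={quote(call_id)}"


def retry(connect, attempts=5, base_delay=0.2, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the delay after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:  # narrow this to your client's error type
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

In use, `connect` would be a small function that opens the socket at `realtime_ws_url(call_id)` and raises if the server answers with an error.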
All impacted services have now fully recovered.
Since the Realtime API went down, my application has been getting the following error:
Realtime call failed {
status: 400,
statusText: 'Bad Request',
body: '{\n' +
' "error": {\n' +
' "message": "The Realtime Beta API is no longer supported. Please use /v1/realtime for the GA API.",\n' +
' "type": "invalid_request_error",\n' +
' "code": "beta_api_shape_disabled",\n' +
' "param": ""\n' +
' }\n' +
'}'
}
I made some changes and now I can't escape this error.
hi
Ok
