Thanks for the suggestion! I don't see a direct/legitimate way to programmatically use this tool, so probably not (I guess I never specified that's one of my requirements, but I'm mostly hoping for general advice). I'm also not sure if this can actually remove the offending sections rather than just make them sound like speech, since that seems to be what it is doing.
#gpt-realtime
1 messages · Page 3 of 1
well it is in Beta right now.. maybe there will be an API later. So far i use it for realy bad interviews with a lot of external noise.. works fine for me in my case
tpacker Hi seem to be quite
i made a little tool so that my mom can use whisper at translate.mom (hopefully i'm not breaking any rules!)
Hey, I'm using the local version of whisper and was wondering if I compress an audio file from 1,411 to 96 Kbps, in general would there be much of a speedup in transcription time, and how much of a decrease in accuracy would I see?
I don't think there would be much speedup
translate.mom then voila what's happening?
Hello everyone. Wanted to know if whisper was handling diarization ?
looks pretty good
hello question how can you have whisper
I played around with the Lex Fridman interview with Mark Zuckerberg using Whisper and ChatGPT. This could be a really cool use case for processing interviews: https://lastmileai.dev/workbooks/clj9c2dxw01uzr0gvlk8rbx57
Hello everyone. I would like to know how or where to contact the marketing team.
In this step-by-step tutorial, learn how to transcribe speech into text using OpenAI's Whisper AI. Whisper AI is an AI speech recognition system that can transcribe and translate audio files in approximately 100 different languages.
📚 RESOURCES
- Install Python: https://www.python.org/
- Install PyTorch: https://pytorch.org/get-started/locally/...
Thanks for pre-emptively answering a question I had about using whisper to identify multiple voices (e.g. in a podcast or interview). Nice lateral thinking as well, I wouldn't have considered ChatGPT for picking up on the flow of communication! 🤗
If anyone is into AI development on a beginner scale hmu I got a project I need help on
hihi
Too loud.
Can you whisper?
Why
Can you pull the captions from video websites for training?
Hey, what's the required specification to run the Whisper Model on a VPS?
Should like to check, is anyone interested in - or has there been any discussion - on audio-to-text transcription that captures details of speaker identities (i.e. "speaker diarization")?
I'm following the guide here https://github.com/m-bain/whisperX and referring to https://github.com/m-bain/whisperX/blob/main/whisperx/transcribe.py for command-line options
I'm having some limited success, but seem to be constrained by sample sizes no more than a couple of minutes long ..
It is a GitHub Project
there is a whisper fork that tries speaker identification.. i am not on my computer rn butmaybe you can google it
Hi @cerulean flint many thanks for your suggestion, I was intending to give more detail earlier as I'm working on sample code from WhisperX and actually I think it looks promising
Not sure if this is a fork from Whisper .. if there are other projects I'm interested
I managed earlier to break down a 2-minute audio sample by speakers, labelled SPEAKER_00, SPEAKER_01, .. this seemed like a good start, except on inspection it wasn't very accurate
regarding the 2-minute sample (it was an export from the leading 2 minutes of a longer 30 minute segment) it produced a 15 line transcript which in terms of word accuracy was quite good, I printed the speaker labels next to the text segments like so
SPEAKER_02 And he is now suspended, ready for receiving his travel next week.
SPEAKER_02 And [name] and [name] are available.
SPEAKER_02 So the donor blood group is outposed, the recipient is outposed.
SPEAKER_02 It's a 0-1-1 mismatch.
SPEAKER_02 They had full cross-match on the 18th of the 5th, which was negative.
SPEAKER_05 Do you want any other antibody samples?
SPEAKER_02 No.
...
When I ran a 10 minute sample though against the same code, I only managed to get the first 70 lines of transcript, which corresponded to about 2/3 of the sample
here is the full source code I was using
import whisperx
device = "cpu"
language = "en"
audio_filename = "audio-sample.mp3"
# -- transcription
model = whisperx.load_model("large-v2", device, compute_type="int8", language=language)
audio = whisperx.load_audio(audio_filename)
result = model.transcribe(audio, batch_size=16)
# -- alignment
model_a, metadata = whisperx.load_align_model(language_code=language, device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
# -- diarization
YOUR_HF_TOKEN = "hf_xxx"
diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)
diarize_segments = diarize_model(audio_filename, min_speakers=4, max_speakers=7)
# -- speaker assignment
result = whisperx.assign_word_speakers(diarize_segments, result)
# -- print result
for i in range(len(result["segments"])):
print("{} {}".format(result["segments"][i]['speaker'], result["segments"][i]['text']))
Can update, the problem I had earlier - with audio length - is because the print method is broken .. !
sometimes a speaker isn't assigned to a text segment and if that happens then result["segments"][i]['speaker'] terminates the script early
I rewrote the printout, i.e. the part following result = whisperx.assign_word_speakers(diarize_segments, result) as
# -- print result
for i in range(len(result["segments"])):
speaker = result["segments"][i].get('speaker', 'SPEAKER-UNIDENTIFIED')
text = result["segments"][i]['text']
print("{} {}".format(speaker, text))
anyone might know why
const transcription = await openai.createTranscription({
file: buffer,
model: 'whisper-1',
response_format: 'json'
});
returns an error?: Error creating transcription: RequiredError: Required parameter model was null or undefined when calling createTranscription.
this is the openai configuration: ```js
const configuration = new Configuration({
apiKey: OPENAI_API_KEY,
organization: OPENAI_ORGANIZATION
});
export const openai = new OpenAIApi(configuration);
Why medium.en require more VRAM but have less speed than tiny.en ?
?
that is exactly how it should be?
large is bigger and the slowest
ohhh mb i tough it was better in term of performance and recognition
it is, but that takes more time 😉
How does using the api compare to running whisper.cpp locally? I'm using the api for my app but I'm wondering if it would be worth it to generate the transcripts locally for a performance boost.
Also, I'm using zh (chinese) for the language and getting errors for my requests. Works for french and english though.
Is this only for the whisper api?
I wanted to know other people's solution on how to get "real time" voice transcription with microphone
does the sample rate of an mp3 affect the usage cost?
Has anyone used fast-whisper?
I use Deepgram, it has Whisper model as well as their own models, it has fast accurate real time transcription
I wanted to know other people s solution
Hi, why im getting nothing when i enter the commands ?
whisper --model tiny.en "test.mp3"
like i get

at least not with whisper.. but google translate etc. can do it for sure?
yes it doing great but it would be good to automate
Anyone able to get half decent results with small whisper models running locally on OrangePi or RK3588 boards?
ллллллллллллллллллллл
єєєєєєєєєєєєєєєєєєєєєєєждлорнепаквіфівссапрролджє.дбьотипсіячсмитьбю.
Hello, I have a question about Whisper. I want to try incorporating it into a small Python program for a voice assistant. Can someone please help me?
whisper --model medium.en "audio.mp3" --output_format txt srt
I use this command and i only want to have the format **txt **and **srt **but it doesn't work
anyone can help me ?
open ai is a company that makes ai models and stuff like that
Apple should use this tech for their dictation function on the keyboard.
Prove me wrong
When does whisper come to code interpreter ?
So I can ask for transcript of my audio files directly from chatgpt
That’s a great idea. Maybe there will be a plug-in or something
Idk if that is even possible though
When I submit my recording , the output is error, instead of "transafjkl" that I typed.
How to solve this, Thank you.
Hey! My teacher told me to use this command line arguments from whisper, is he correct?
whisper --model medium --language Spanish --output_format {txt,vtt}
Hey My teacher told me to use this
I tried to post a link, but cannot. You can google for this: "Audio Course from HuggingFace". It uses Whisper.

how possible is it to have a video who you can talk to? sort of chatgpt in the form of a video?
how possible is it to have a video who
What is whisper?
try only one
Where do I access whisper?
Do you know Where do I access whisper?
Ok thx for your help by the way you can make images on Dall E 2 website or bing
An ai voice generator I think
Ok there is not available to create automatic With AI ?
Thanks 👍
/what was the beginning of the ford motors company
You don't, That's Dall-E
Whisper is voice to text, it's an API
Nope.
Might want to use google as there are no bots on this server for you to make a command to (Except I guess Modmail counts in a way)
Ok
Hello hello how are you
hey guys im starting a community based server VORA-AI for AI development/showcasing and i need mods/staff
Let's reduce the noise in this channel, shall we? Suggestions:
- Questions that Google and ChatGPT can easily answer, I expect to be ignored here.
- Nonsensical and off-topic statements like "/what was the beginning of the ford motors company" should not happen.
- Ask a complete question so the topic is clear. When answering a question, start a new thread on that question.
Are u a mod? Smh
Hi
So I have a folder with 45 audio files in MP3 format. I want to transcribe them to text using OpenAI Whisper API. I have an API key. After transcribing them, I want to take the entire output, and put it into a text file.
Please tell me the best way to do this.
server staff will have the "community volunteer" role
ChatGPT should be able to help you with, right?
Here's one of my whisper programs
import openai
from pprint import pprint
openai.api_key = os.getenv("OPENAI_API_KEY")
# Prompt for user input
file_path = input("Enter the audio file path: ")
prompt = input("Enter the prompt text (optional): ")
response_format = input("Enter the response format (optional, defaults to json): ")
language = input("Enter the language of the input audio (optional): ")
# Open the audio file in binary mode
audio_file = open(file_path, "rb")
# Set the model ID
model_id = "whisper-1"
# Set the temperature to 0 if it is not provided
temperature = 0
# Call the API with user-input variables
transcript = openai.Audio.transcribe(model_id, audio_file, prompt=prompt, response_format=response_format, temperature=temperature, language=language)
pprint(transcript)```
This just prints the transcription to the screen you can modify it to instead save it to a file.
I tried to make it write a script in code interpreter, but it didnt work. And im not a coder so I dunno how to
Is there any GUI app that can do this in bulk?
you can use Google Collab for it
Not to my knowledge but if you want, I can write one this afternoon.
Not to my knowledge but if you want I
how do i output as timestamps?
Is this available for plus subscribers yet,?
Whisper is an API, there is no waitlist, but you must write your own program to use it
please advice about community role please.
what roles are you asking about specifically?
psst
i ask the same part of the question, how do they look like and who are they..? showing exploration of neurological netwrking system.
which role Admin want to gave ..because i am Business information system and an engineer alieas plant manager.
I hate you
Hey! My teacher told me to use this command line arguments from whisper, is he correct?
whisper --model medium --language Spanish --output_format {txt,vtt}
Gay
Absolutely not, whisper is an api, it does not work that way.
(There is no whisper exec file)
what is exec file?
Working on my first python project. I put the code and the output in here: https://justpaste.it/a6j5p
The goal is to run Whisper to transcript the audio and save it as a .txt with the same name. Currently there is no txt file, even though the output says so. According to the error messages there might be something wrong with ffmpeg but when I run Whisper in CLI the transcription works so I don't think that's the problem
could it be a permission issue?
is whisper better than elevenlabs?
hi, ubuntu user
whisper is really good, definitely better than all other speech to text things
Why whisper when you can say it out loud and be proud?
try this if possible:
(change folder etc. ofc)
Hey so, I have never used AI before. However, just watched the Iron Man again movies and honestly I'm convinced I need to spend time and try and make my own for around my home. However, I genuinely have no idea where to start. I have experience in some coding, java and javascript. But, if a new language is needed I'm willing to invest. I think what I'd want the AI to be able to do proceeds as:
Control lights
Control Music
Regular ChatGPT for questions/talk
Book events
Control TV
Control Camera (not really sure how this would work, but, I could say turn on garage camera, and it'd pull up on the screen)
Those are just the ones I can think of right now, but, if AI can learn, could it learn who people are? For example, "Charlie has just walked into the house", or if I asked, where is the dog, they could say "The dog is outside". Because of inspiration, id want it to have a Jarvis voice (but who knows, maybe Ill want it different for my own style).
Another question I was curious about, is if prompted correctly could AI just code this for me? not sure how this works and half of this is even possible. But please lmk!
A.I is a helping tools, It's depends on the roots they enquired.
all of this stuff here is no AI per definition, learning algorithms that are very limited in adding knowledge above their initial data. But it is not self-aware and not yet able to "evolve"
python
also learn about prompt engineering
Wait you’re planning to have something like Jarvis in your house?
Heyy, just wanted to ask if there is a way to run faster-whisper without the python? Cause I'm using command prompt and am not quite sure on how to convert...
How does whisper handle multiple languages being spoken in a file? I'm trying to transcribe subs for my wedding video but my family and my wife's family speak different languages
From my first experiments it handles it rather poorly. Or am I forgetting some setting that improves it?
i think you mean proompt
I would probably try chirp if you have access
Chirp is a version of a Universal Speech Model that has over 2B parameters and can transcribe in over 100 languages in a single model. Chirp achieves state-of-the-art Word Error Rate (WER) on a variety of public test sets and languages.```
```Chirp is available through the Cloud Speech-to-Text API. The API lets you do inference for transcription against the Chirp model```
guys
can someone please help
it's kinda urgent
I am using Whisper API and trying to simply transcirbe a text from a voice message that is in English
but for a strange reason the text gets transcribed to Greek
meanwhile I have put NOWHERE for Greek
can someone help?
Maybe specify that the language is English through your model?
How we use Whisper I don't understand
import openai
from pprint import pprint
openai.api_key = os.getenv("OPENAI_API_KEY")
# Prompt for user input
file_path = input("Enter the audio file path: ")
prompt = input("Enter the prompt text (optional): ")
response_format = input("Enter the response format (optional, defaults to json): ")
language = input("Enter the language of the input audio (optional): ")
# Open the audio file in binary mode
audio_file = open(file_path, "rb")
# Set the model ID
model_id = "whisper-1"
# Set the temperature to 0 if it is not provided
temperature = 0
# Call the API with user-input variables
transcript = openai.Audio.transcribe(model_id, audio_file, prompt=prompt, response_format=response_format, temperature=temperature, language=language)
pprint(transcript)```
hello all,
are there any Ai researchers here? working on custom models in text or image generation?
File "docs.py", line 5, in <module>
transcript = openai.Audio.transcribe("whisper-1", audio_file)
AttributeError: module 'openai' has no attribute 'Audio'```
I am getting this error while trying to copying and pasting the basic api reference from the docs
🙂
That wasn't a different language... Why did it delete my message?
Can whisper gain IPA as a language?

я
Which version of openai package are you using? I'm on 0.27.0 and dir(openai.Audio) works fine
In my testing, the language setting has no effect. The API acted exactly like where is no language parameter. What's you experience?
One way might work is to feed some prompt in English into the call.
I have no means to answer that, I've not used it for anything multilingual
I'm facing an issue I'm new to this
import whisper
from typing import Annotated
from fastapi import FastAPI, File
app = FastAPI()
@app.post("/ta")
async def transcribe_audio(audio_file_upload: Annotated[bytes, File()]):
model = whisper.load_model("base")
result = model.transcribe(audio_file_upload, word_timestamps=True, fp16=True)
return {"res": result}
I'm getting this error
File "N:\audio-to-text\venv\Lib\site-packages\whisper\audio.py", line 131, in log_mel_spectrogram
audio = torch.from_numpy(audio)
^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected np.ndarray (got bytes)
I have been using the default free Google speech-to-text tool for a mobile app. Is there any advantage to using Whisper?
not sure
I'm only using it for short bits of text, so maybe I would run into issues with more extensive conversions
do open ai playground and azure playground give different responses for the same prompt?
I assume OpenAI think Whisper fills a niche that the Google version doesn't fill... Maybe the niche is other-than-mobile settings.
i transcribe multi-hour .mp3 with whisper, pretty good results in germand and english, and i can work several files one after another - works well for me
Is it expensive?
sup
do i need to pay to use whisper?
The api has a cost. Check the openai pricing page. I think it’s like 2 cents a minute? But don’t quote me I forget. Check the page
so i've been using whisper to get transcriptions from video clips i'm processing, and i've noticed an interesting quirk at the end of some of the transcriptions
[TRANSCRIPTION] It's the summer's biggest sensation. And now it all comes down to this. The Dancing with the Stars Grand Finale. Underdog Kelly Monaco The eye of the tiger, baby. Takes on favorite John O'Hurley. Be afraid. Be very afraid. The judges score and your vote decides the champion. Live. Plus, all the stars reunite for one final encore. It's the night America has been waiting for. Dancing with the Stars Grand Finale. Wednesday, 9, 8 Central. Only on ABC. Subs by www.zeoranger.co.uk
duh
Can I get the transcript in 30 sec batches, rather than the entire text, via whisper APIs?
It happened from time to time when I asked it to transcribe short audio clips with fuzzy voice. I think that's one built-in flaw of the machine learning approach Whisper is using. My understanding is that Whisper is using a variation of the generative AI similar to ChatGPT, not the usual voice recognition ML models used by other providers. It's pretty unique. But it could lead to the problem that it's actively trying to figure out "what's next" in the process, which led to the "next" being the somehow popular websites. At least that my explanation based on my experience. It's so much so that I built code logic to hadle this situation correctly.
Makes sense, especially if it was trained with publicly available subtitles. My code uses the transcription and other data to generate a summary of the video clip, so it hasn't been problematic for me
Interestingly enough that doesn't seem to be an active website
whisper needs actualization?
help meee!!!
why is my api key not working
today i finnaly made a project that i copied from youtube but my key is not working
it says you need tokens
whats that??
anyone?
maybe you need to setup the billing plan?
On the OpenAI API section
I went there and also went on the pricing part. It asks for my role and the company h work for
Hey guys, can someone tell me if this is possible with whisper?
I want to use whispers language ID on code-switching speech, that is switching between languages mid or between sentences. Ultimately, I would like to make timestamps of when languages are used. For example, (0,-1 seconds) English -> (1-2) Spanish ->.... etc
Can this be done?
Might OpenAI remove the Try In Playground link from the product page under Whisper? There is no playground option for Whisper. Why tease people into going to look for something that we know doesn't exist? Thanks.
Question; "Will Whisper need an update at some point, or does it remain unupdated?"
ok mods can someone explain why the timeout? I was writing in english...
Do you know why its say me error? whisper: error: argument --output_format/-f: invalid choice: '{txt,vtt}' (choose from 'txt', 'vtt', 'srt', 'tsv', 'json', 'all')
This is the code:
whisper Histologia_general_teo_sem_1 --model medium --language Spanish --output_format {txt,vtt}
try this: whisper .\Histologia_general_teo_sem_1 --model medium --language Spanish --output_format txt --output_format vtt --task transcribe
Do i have to upgrade to gptplus in order to use whisper?
Ok. I got the answer by reading previous chat. I have to update.
0.006 dollars/minute
yessss
to late to watch it
i resolved it later
but thanks man
is like
1 month
i didnd know how to do it
i only use; whisper .\Histologia_general_teo_sem_1 --model medium --language Spanish --output_format txt --output_format vtt
i didnt use --task transcribe and it worked
why should i need to use --task transcribe?
pd: sorry for the bad english
Is it possible for Whisper to distinguish different speakers that appear in the same audio file?
No. you need to use another model to do this task.
transcribe is to transcribe the video with the given language or auto detected language.
translate is to translate it to english.
seems like the default setting is transcribe
yes is the default
do you think is necesaraly?
it would improve the transcribtion?
or not?
btw
do you use--patience PATIENCE?
has anyone used Whisper to live-transcribe and translate a panel discussion IRL?
When whisper can identify different speakers is when I'll pay for it
bro is free 🤣
I suspect they'll bill for it when it can do more than translate voice to text
well thats a posible future
im wondering over the same thing
you could technically cut the live recorded mp3 in word pieces and upload it to the server to be transcribed
and do these couple of things simultaneously using multithreading
however i dont see it being efficient enough to be a viable option especially on mobile
so no clue what could be used instead
for example assebly.ai is an amazing api but also pricey as hell
does anyone know any relatively good apis for this but without some ridiculous prices that no client would stand?
I found a whisper app on the Google Play Store to dictate anything using a custom keyboard, is there is a similar app on the App Store for Apple phones?
Anyone familiar with CoreML?
Hey so does anyone here know where the bot channel is?
there is no chatbots on this discord
whisper --model medium --language Spanish --output_format txt --output_format vtt
i used this and the whisper olny gave me format vtt, you know why?
--output_format vtt 🤔
@tender rivet i think i didnt emphasise, but i want to give me --output_format txt --output_format vtt and only gave me --output_format vtt. Do you know why it dindt give me --output_format txt?
Hi guys, On iOS, chatgpt transcription feature through whisper translates my English to Russian sometimes automatically, possibly because my accent is Russian. Honestly the fact that it's possible is amazing, but I would prefer to control it. Does anyone know what happens and how to control it? Also, is there a dedicated app using whisper specifically for translation?
To regain control explore the settings or preferences within the app that uses Whisper for transcription.
it's official chatgpt app
Attention Off-topic chats have been cleared by a moderator.
seems like only one output_format is available as output.
you can use this GUI to save the output in 5 different file formats without the need to run it multiple times, also you can directly save the subtitle to the video hardcoded or as .mkv file
https://github.com/meeksqueal/OpenAI-Whisper-GUI
I'm running whisper on a GCP Invidia A100 GPU 40GB. This is a huge instance and the cost is immense. Fine, whatever. So I'm handling my first transcription and the transcription rate is as slow or slower than on my Mac. How can I confirm that the GPU is actually being used? The image I installed on top of the GPU is direct from GCP and is a special PyTorch-Cuda image so I'm extremely confused.
as long as you define cuda it should be fine
try using Collab free tier
I don't care about free tier, this is on the corpo dime.
i use it nearly everyday, works
How long would it take whisper to run through a 4GB file on an Nvidia A100 40GB?
not sure it can handle 4GB at all once
i part my large files into smaller ones
why would that matter, I'm just genuinely lost is all.
then i guess you should figure it out
no no I hear you -- smaller files. But why?
i don't know, all i can say is that i transcribed hundreds of .mp3 so far and Whisper has ad problems with bigger files (hours of interviews) one i parted it into smaller segments - ir runs smoothly
nope
I mean this is a $10K USD card?! It costs $5/hr to run in the cloud
That's why I'm confused. It's like running Whisper on a GPU doesn't matter or something. And the machine image I used was specifically from GCP with PyTorch-CUDA preinstalled for python 3.10.0. I mean pretty straightforward. Unless running it on a GPU doesn't actuall do anything for transcription 🤷♂️
that works for me
oh. this is from a notebook?
can you export the script and upload it here? I'm a low-tech kind of guy that tends to do things from cli's and what not.
please
maybe later, need to finish here & heading to work
I'm running whisper on a GCP Invidia A100 GPU 40GB. This is a huge instance and the cost is immense. Fine, whatever. So I'm handling my first transcription and the transcription rate is as slow or slower than on my Mac. How can I confirm that the GPU is actually being used? The image I installed on top of the GPU is direct from GCP and is a special PyTorch-Cuda image so I'm extremely confused.
🎙️ Whisper's Response to Blank Inputs - A Quirk?
Hi all! 👋
I noticed that when Whisper receives "blank" audio (no spoken words), it transcribes phrases like "Thank you for watching." Even OpenAI's ChatGPT mobile apps do the same.
Has anyone else seen this? Is it a known issue, or is there a workaround? It's intriguing but could be challenging in some scenarios.
Your insights would be super helpful!
anyone having issues with the API today? I'm having extra errors and also responses that are not the full file.
good question, i have a 3060 gpu laptop and only keep medium, its that ok?
yes, not only Thank you for watching it use anothers phrases too
and why you talk like chat gpt 😆
Interestingly, they're all "YouTube-isms", hinting at the data on which it was trained. 🍿
I have also seen "Thanks for watching", "Subtitles by...", and so on. I'm just waiting for it to say "Don't forget to like and subscribe!". 🤣
youre not going to believe me, in spanish says: suscribete!
suscribete!
suscribete!
suscribete! jajajja
now the mononeural of the bot is going to put me timeout
?
is the bot is with a mod
or is only a bot?
i want to report this to ai
guys whisper can use also gpu shared memory or only gpu dedicated?
I'm running whisper on a GCP Invidia A100 GPU 40GB. So I'm handling my first transcription and the transcription rate is as slow or slower than on my Mac. How can I confirm that the GPU is actually being used? The image I installed on top of the GPU is direct from GCP and is a special Cuda image meant precisely for my use-case. What am I missing here?
Python 3.11.5
:/home/container$ if [[ -d .git ]] && [[ "${AUTO_UPDATE}" == "1" ]]; then git pull; fi; if [[ ! -z "${PY_PACKAGES}" ]]; then pip install -U --prefix .local ${PY_PACKAGES}; fi; if [[ -f /home/container/${REQUIREMENTS_FILE} ]]; then pip install -U --prefix .local -r ${REQUIREMENTS_FILE}; fi; /usr/local/bin/python /home/container/${PY_FILE}
Collecting openai-whisper
Using cached openai-whisper-20230314.tar.gz (792 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting triton==2.0.0 (from openai-whisper)
Downloading triton-2.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━ 46.7/63.3 MB 47.8 MB/s eta 0:00:01ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━ 47.1/63.3 MB 35.0 MB/s eta 0:00:01
How much space does whisper need?
I have like 2gbs on my vm
But still error
[413] Maximum content size limit (26214400) exceeded (26247606 bytes read)
so 25 MB is the max limit? is there a way to upgrade this limit?
Hi! Anyone know how I can use Whisper to dictate sms messages on ios?
Whisper is speech to text, not text to speech
please use #1050184247920562316
oh, bummer..
anyone here that has used whisper... is it able to interpret background noise? for example, could it delineate between crowd cheering and crowd booing?
Hey, any good APIs for using whisper in c#?
Heys guys, I am trying to create a real time voice reader but my code is not working.
If you use python i can help
Language?
Hey guys this is a thanks from me and my team to any contributors to whisper. This library basically converted a month's work to a few days.
if u are using python i can help u. i have made a similar project
Can you put the repo
sorry but the project is for an upcoming hackathon so i cannot share the repo as of right now
When it will be?
it will take some time bro I believe at the end of this year
@humble hare bro I need your help with real time voice transcription
How can i help?
Can you share the idea?
i will send a screenshot in pm
You are in the right place, check the oai lib for js or python and the code examples on the docs
Hey guys, has anyone tried hosting whisper large model on firebase cloud functions.
Hmm..
Hey so I'm trying to use whisper-1 speech to text to generate a transcription from audio in Portuguese using an audio/ogg file as a NodeJS Buffer, but when hitting the endpoint I get an error message that reads as follows: {"message":"","type":"server_error","param":null,"code":null}. This does not really tell me what the issue is, so I came here to ask for help
attempting to use Blob instead of Buffer returns an error saying file parameter does not exist, but it does
also fix your automod cuz I got timed out for sending an image of code written in english
Made the request using built in fetch function instead of the openai library and that fixed it, so idk what the issue could be
Is there any way to get timestamp for each word and not only the segment?
is possible to identify different voices and have it tagged a different person?
This is possible, but you would need to modify the model because that behavior is not in it's orgins.
Do you have more info on how I can do that?
Hey the latest version of whisper requires option parameter with srt_writer. How do i get this parameter?
Hey so good news, turns out it has already been done by some folks before. Checkout this github project https://github.com/MahmoudAshraf97/whisper-diarization
has anyone used the offline whisper version?
it works but for some reason it refuses to use my gpu and uses my cpu instead
UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
any idea how to reconfigure it? (edited)
because of this it seems to take longer when transcribing
Has anybody seen information as to when we might see improvements to Whisper, particularly for smaller languages?
Hello are you using Python to run the whisper model? Or just the command line? If you're using Python you can set fp16 to False and it would use FP32 instead
here you go!
thanks!, ill give it a try
Any thoughts as to how OpenAI are getting such fast transcriptions in ChatGPT-V? ⚡
Are they using a new Whisper model not available to the public?
i have used whisper offline, make sure you have the right CUDA stuff installed so that it runs on your GPU (if you want that)
otherwise it won't be able to detect your GPU as a device
This is just a guess: They probably do not use Whisper. They probably just use a STT model provided by e.g.: Google
I would be shocked if they weren't using their own SOTA ASR for STT. 🧐
Should be easy enough to identify the model once we get our hands on it. Whisper has its own idiosyncratic quirks.
Anyone been lucky enough to receive the ChatGPT upgrade yet?
Whisper is anything but fast though. Maybe they just use a private faster model. I doubt that though, cause they would surely commercialise it.
Also Whisper would have to run on their backend. (That's extra cost too.) They could just use the STT model integrated in Android/ IOS.
That's what I'm saying, they're doing something differently with ChatGPT's ASR. My guess is it's a tuned or newer model of Whisper.
There have been open source reimplementations of Whisper that are 4x as fast as OpenAI's initial release, so it wouldn't be hard to make it faster.
Still, a lokally running model would be my go-to as a developer.
I'd probably use something like this: https://github.com/alphacep/vosk-android-demo
For sure, remote Whisper transcriptions would add cost and latency. But the accuracy gains are worth the cost, and they're printing money anyway.
Android/iOS STT doesn't give high enough accuracy for conversational AI, IMO.
And remote transcriptions with Whisper can be close enough to real time to work. I should know, I've done it. 🎤
Hey sorry to ping you but I've been trying to get the colab notebook to work but it's just riddled with errors. Do you happen to know how to fix it?
What are the errors you're getting?
So in the 17th block of code you'll get an error at the result_aligned = whisperx.align(... part.
After looking through the repo someone suggested you change the beam_size from 1 to 7.
Then after you fix that error you'll get stuck at the 21st block of code
https://github.com/MahmoudAshraf97/whisper-diarization/issues/69
The owner of the repo suggests adding pip uninstall nvidia-cudnn-cu11 but that's already been done at the very beginning of the notebook
Forgive me for my lack of knowledge regarding coding but I've been trying to debug it myself for quite some time now...
Hey sorry for the late response but someone already made a good suggestion on the github repo
!sudo apt upgrade
!sudo apt install cuda-11-8 -y
!sudo apt install nvidia-kernel-common-460 nvidia-driver-460```
Note: after testing this, it may take a while to work. But it will work regardless in the end
I am sending audio recordings to the OpenAI Whisper API and cannot get mobile recordings to accept past a few seconds of data, I have no idea why. Desktop audio recordings function perfectly fine but whenever I try on my phone the transcriptions only get a word or two
curl --request POST \
--url https://api.openai.com/v1/audio/transcriptions \
--header 'Authorization: BearerTOKEN' \
--header 'Content-Type: multipart/form-data' \
--form 'file=@C:\Users\katra\Desktop\71753801708__8C36058A-E077-4000-B93D-901529FBD0AE (1).mp4' \
--form model=whisper-1
I haven't had the time to actually test out the fix yet but thank you!
Btw how good is NeMo diarization anyway? I've been trying out the pyannote diarization and it's really janky tbh
Both NeMo and Pyannote diarization are similar because they use neural networks to segment and cluster audio recordings by speaker labels which you in the colab of whisper diarization where they prepare to convert the data into NeMo combatibility.
But NeMo diarization between the two gets more update and is backed by a huge corporation(Nvidia) and perhaps a wider community while Pyannote is relying on open-source, which many can finetune the model to be better at tasks.
Both overall, both are free to use.
That's interesting. Thanks for the explanation!
The best method I've found for transcribing (and translating) Japanese is using silero-VAD. It fixes Whisper's hallucinations over longer recordings and makes the timings far better than without it.
The thing is most repos/open source programs that have Whisper and utilize a VAD all use pyannote for diarization. I wish I knew how to code something with NeMo so I could test and see for myself if it would yield better results
Good insight, I will be sure to checkout silero-VAD. And I agree, we see many opensource projects using OpenAI Whisper model because it is pretty popular and known by many. Right now I am building a project using GPT4 and Whisper Tiny/Base model to transcribe youtube videos and summarize it right on your cpu.
In my opinion it performed pretty well on English videos that were over 2 hours long in length.
Not sure if I can send links in this server, but there's already some research talking about how much better diarization is when LLMs are added to the equation. I hope we'll get to see more of stuff easily accessible to the public soon
I hope your project works out! I've seen a small number of repos on Github already try to implement a GPT model to help with transcriptions, but I haven't tested them out myself yet
Plus one for silero-VAD with Whisper. It can accurately remove those non-speaking portions from your audio that might otherwise confuse Whisper. You may want to apply some post-processing to combine the resulting shorter transcripts.
The stuff I use automatically does it for me! I just assumed the VAD did it for me the whole time lol
ik this is prolly less but. Whisper is awesome ive been using it to get notes of my classes
i really like it
it really is excellent, best speech rec i've ever used - plugged it into my ttrpg table blended with eleven labs. https://www.youtube.com/watch?v=CXIrFXDQBFQ . They way it punctuates normal speech is outstanding.
Testing out a workflow for doing voices adhoc during play.
I would send a picture but this dumb### ai keeps deleting my image and timeouting me
has anyone tried transcribing using whisper in a thread? because it just seems to not halt for me which is weird.
Im working on live audio that I have a threaded function that reads mp3 frames and then given some volume threshold saves the audio and calls a new thread to transcribe it using whisper
but that thread just doesnt do any transcribing until I ctrl+C the program
then the rest of that threaded function executes and I get the transcription
pip install whisper
I'm interested in watching a YouTube video that's in another language. While there are English subtitles available through "auto-translate", I'm curious to know if OpenAI offers a tool that can analyze the foreign language audio and convert it into an English voiceover?
Can someone do this for me? ^ I have no idea how to use Whisper
I figured out the processing was just being slowed down quite a lot
especially the call to "FeatureExtractor" (I am using faster-whisper)
My plan is to use a process instead
Good evening, does anyone know if the Whisper API can be used to generate an .SRT file with the transcripts?
I know it can be done by running the model locally with the Whisper package, but I don't have that computational capability and I would like to use the API
I'm sure it's possible
maybe use ffmpeg to separate the audio to a usable streamtype then use python or whatever to make the api calls, and generate SRT from the transcript
Commission somebody if you're not a programmer. Or ask GPT to help your journey
@prisma sinew @worn tundra check this out
<.<
.>
oops, sorry for the ping tho =P
Neat. I've got something very similar set up, not a parrot so much as a two-way, but similar
I Loved the fact that it is integrated with foundry
let me get my macro for ya'll then
I still need to decide the best method of natural capture, like how long of a pause, and background noise ignoring
Right now it just waits for a 1 sec pause lol
chatgpt gave me a great technique
Okay:
main macro - https://gist.github.com/thebwt/8142c510c2d2ae31f0c8e6bbfb45016a
python fastapi endpoint to do the whisper stuff (I think I don't need this in the end... but haven't refactored yet) -
https://gist.github.com/thebwt/003f58a4454876c4e706b7d5875b6fbb
i'm lazy, so I just ssh port forwarded the running service to localhost.
source chat: https://chat.openai.com/share/8a8313a1-bc4e-4723-9aa5-702ed15b992a
okay actually that's literally what I do
mediaRecorder.stop();
}```
kek
Can whisper only be installed on Windows?
Has anyone installed whisper to a MacBook that has Parallel Desktop installed (Windows VM)
do you mean a local installation of the model or the api?
you can run whisper on anything that can run python (and has good enough hardware)
the OS is not a concern
Yeah local install I suppose. Is there an easy web version? It looks quite confusing to install and use.. is there an easy way to use with ?
Depends on what you want to use it for
it's likely somebody has built an app for that and whisper may not be ideal. It would probably involve installing programming tools and writing some code
You have a macbook so you have python. probably somebody could give you a couple terminal commands and it would start spitting out your transcript
If you have Chrome all you wanna do is transcribe, I have you. Here's a branch off another project I made. Save the index.html and run it in chrome.
https://github.com/danomation/chrome-gpt-assistant/tree/transcriber
Thank you!
It wont stay on due to privacy so that limits its use
Lol, whisper just came up with an advertisement for an existing webshop by transcribing a silent wav file...
Go to Beadaholique.com for all of your beading supplies needs!
Super weird
yep, it gets hallucinations all the time
I'm a novice when it comes to whisper, but the more I use it the more I notice I need some way of handling dubious inputs
Hi. Anyone knows of a windows dictation software based on whisper?
Does anyone know how many minutes can chatgpt listen and transcribe in app? Does plus customers has any advantage over this?
Hello, I'm facing some issues getting Whisper to transcribe acronyms in audios
E.g when a speaker says 'SDR', it should be transcribed as 'SDR', not 'as the are'.
I would like to include a custom dictionary where the user would pre-record how all these acronyms are pronounced + how it should be spelt. Does anyone know how I can do this?
I just need some quick and short fine-tuning so that the model can transcribe a list of 10-20 acronyms accurately for each audio. This list of acronyms varies for each audio hence fine-tuning the model every time, which would take 5-10h in the post below is not sustainable.
Thanks in advance!
https://discuss.huggingface.co/t/adding-custom-vocabularies-on-whisper/29311/2
Hugging Face Forums
Hey @sebasarango1180, If you have a corpus of paired audio-text data with examples of such terms/entities/acronyms, you could experiment with fine-tuning the Whisper model on this dataset and seeing whether this improves downstream ASR performance on this distribution of data. To do so, you can follow the blog post at Fine-Tune Whisper For Mult...
You might also try post-processing text with GPT4 or 3.5
oh yes, if its not possible to fine tune whisper with a custom dictionary then I'll go around it with llm post processing
thanks for the suggestion!
Hey, I'm reporting a very annoying bug when it comes to whisperemedium, it captures text great, but sometimes it loops and repeats one word over and over again, can you fix it please?
has anyone ever run whisper jax in a production environemnt on a tpu
How can I set up Whisper on my MacBook Pro to utilize its voices when using macOS' "Text to Speech" function for more realistic speech output, similar to the voices in ChatGPT's "Voice chat" feature?
Does whisper support Brazilian Portuguese vs Portugal Portuguese?
Imagine world scale on device STT data collection using phones
Major privacy concerns but possible
@autumn bolt Learn some basics with python or node.js. There's a lot of ways you can accomplish that goal. Python is my suggestion. You can use python modules to talk to apple's speech synthesis api.
What you'd do is import modules for whisper (bundled into openai) and macos speech. start a python file, create an api call to whisper (or the module for the self hosted model) , do whatever you want to with the code, then send it to macos's tts with this
https://pypi.org/project/macos-speech/
Probably possible to do this without any api calls using the opensource whisper models instead of burning api time
You dont have to be a hardcore programmer. There's libraries for just about everything with python
I asked gpt4 and it thinks you can just use "say" to accomplish the tts. Anyway... here's it's code that may or may not work. Good launching point. Get an openai api key to start. Add all this to a speech.py file, edit it, then start it in terminal
import sounddevice as sd
import scipy.io.wavfile
import requests
import openai
from subprocess import call
def main():
recording = sd.rec(int(10 * 44100), samplerate=44100, channels=2)
sd.wait()
scipy.io.wavfile.write("output.wav", 44100, recording)
response = requests.post("https://api.openai.com/v1/whisper/recognize/async",
headers={"Authorization": "Bearer <your OpenAI key>"},
data={"file": ("audio.wav", open("output.wav", "rb"), "audio/wav")})
transcribed_text = response.json()['task']['postprocessed']['utterances'][0]['postprocessed']
call(["say", transcribed_text])
if __name__ == "__main__":
main()
replace "<your OpenAI key>" with your actual OpenAI key
Requirements:
pip install sounddevice
pip install scipy
pip install requests
pip install openai
Use self hosted whisper if you want. Might require some more imports and config
Is the API server down?
I'm also seeing errors on the Whisper API.
Interesting that ChatGPT Voice is still operational (at least for me) while the public Whisper API is down. 
I would have expected that ChatGPT was dependent on that service too.
I feel so much better now... released a fun dictation thingy-mo-bobber using whisper today to some friends, tested thoroughly of course - and now I'm getting nothing but error reports and can't even get it to work myself:
Whisper Transcription Error: Object reference not set to an instance of an object.
==========================================================================```
So... you're all not alone, it's me too - and my friends online. Whisper has gone completely silent. 🤫
Seems to be back online and working now:
Hello, I wonder how is it possible that Whisper in ChatGPT during the voice conversation is so fast, but when using as API is slow as a turtle.
Seems to be back online and working now:
The time of return is directly related to the length of the audio file afaik - not sure if when you are using the Whisper API you are sending larger files than you would with a user input pre-processed and transcribed through Whisper for a subsequent ChatGPT API call?
Does anyone know if ChatGPT plus subscription allows recording for an hour using speech to text?
no, it doesn't
What’s the cutoff time? I get this error now after recording for 10 minutes. Any alternative?
where are you doing that?
Anyone know if the real-time Whisper API at OpenAI or Azure will allow fine-tuning any time soon?
Heh, neat. Transcribed a Flemish show with medium, then translated the same with large-v2 whisper, and then.. asked chatgpt-4 to translate sections of the english output to flemish and then the original flemish transcription output to english in the same convo -- nearly flawless english subtitles. Interesting how combining the two transcriptions gives just the right context to get a really clean output.
I'm planning to host hugging face whisper large-v2 model on Sagemaker. Let me know which instance should I use And I'm looking for transcription output within 5 minutes for a audio of length 30 - 45 minutes.
Hey everyone, I hope it is the correct channel to ask this question. I have a script and use stable-ts.
But it is telling me that the module "stable_whisper" is not found. I already installed stable-ts + whisper
This is perhaps a newbie observation, but I am used to voice typing and speaking various punctuation. I love whisper, but at first was kind of annoyed it would type out spoken punctuation. But I just realized when I emphasized the word THE, it typed it in all caps. I did not realize it listened to intonation and so I tested it out and depending on if I sound excited, it will put a period or exclamation point. Somehow, that was just very impressive to me. That is all. 🙂
I did this exact same thing 😂 we are just too used to how we HAD to do it, the future has caught up with us :)
Hello, everyone.
I am developing STT model now, But I am confusing how should I do for it at first. let you provide me some advise.
Hi, I'm using Whisper to generate subtitles for a music video. I use the 'max_line_width' and 'max_line_count' flags to format the output the way I want. Though, Whisper does not separate lines the way I need: on one line there is the end of a sentence and the beginning of the following sentence. Whisper does seem to detect that it is a different sentence as it generate an upper case letter. Do you have any insight on how to make whisper break lines at the end of sentences?
Can you paste some output example?
Sure, here are the first 3 lines. It is in french, but I hope this can still illustrate the point:
1
00:00:12,760 --> 00:00:14,900
Les Cyrânes Près du coufre, loin du2
00:00:14,900 --> 00:00:17,900
ciel Plus je souffre et moins je3
00:00:17,900 --> 00:00:21,180
sais Si ce que je crois est vrai
The upper case letters are good. What I would want to get:
1
00:00:12,760 --> 00:00:XX,XXX
Les Cyrânes2
00:00:XX,XXX --> 00:00:XX,XXX
Près du coufre, loin du ciel3
00:00:XX,XXX --> 00:00:XX,XXX
Plus je souffre et moins je sais4
00:00:XX,XXX --> 00:00:21,180
Si ce que je crois est vrai
This is with max_line_width=35 and max_line_count=1
When asking for the output as text, the line breaks are good:
Les Cyrânes
Près du coufre, loin du ciel
Plus je souffre et moins je sais
Si ce que je crois est vrai
Anyone has the perfect prompt/workflow to take Whisper output as text (ideally with timestamps), pass it through GPT-4 and get it formated in paragraphs according to the topic discussed?
When training/fine-tuning whisper locally, using:
https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py
..my load_model() failes (as does torch.load() on anything I try). The training doesn't create a .pt file that matches openai-whisper's .pt files (which are ZIP files). The actual files in an official .pt are like:
$ unzip -v tiny.pt | head -7
Archive: tiny.pt
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
19363 Stored 19363 0% 1980-00-00 00:00 49a57bdd archive/data.pkl
344064 Stored 344064 0% 1980-00-00 00:00 e2b0aff3 archive/data/0
1152000 Stored 1152000 0% 1980-00-00 00:00 72f960e2 archive/data/1
768 Stored 768 0% 1980-00-00 00:00 2a87d98d archive/data/10
Whereas I get a directory created with these files:
$ ls -1s
total 154444
4 README.md
36 added_tokens.json
4 all_results.json
4 config.json
4 eval_results.json
4 generation_config.json
484 merges.txt
147528 model.safetensors
52 normalizer.json
4 preprocessor_config.json
4 runs
2776 small.pt
4 special_tokens_map.json
2424 tokenizer.json
280 tokenizer_config.json
4 train_results.json
4 trainer_state.json
8 training_args.bin
816 vocab.json
anyone know how to get a proper .pt out of this thing? small.pt is not a proper zip:
$ file small.pt
small.pt: Zip archive data, at least v0.0 to extract, compression method=store
$ unzip -v small.pt
...End-of-central-directory signature not found. Either this file is not a zipfile, or...
(and obviously that 2mb small.pt file is not the full model anyway)
Hugging Face Forums
I fine-tuned whisper multilingual models for several languages. I have the checkpoints and exports through these: train_result = trainer.train(resume_from_checkpoint=maybe_resume) ... trainer.save_model(output_dir=EXPORT_DIR) Now I want to use these fine-tuned models in another script to test against a test set with whisper.transcribe(...) Wh...
Hey, guys. Do you know if it is possible to set prompt to whisperAi on .Net?
hey everyone, a couple of days ago I built a voice-cloned text to speech assistant capable of answering questions in meetings, or in any peer to peer environment. Check it out:
GitHub
automate answering in live meetings by generating readable scripts on the fly - GitHub - AndreSlavescu/Meeting-Buddy: automate answering in live meetings by generating readable scripts on the fly
Does anyone have experience resolving issues with ffmpeg on a Mac? have a Python file that uses whisper to transcribe some audio, but whenever I run it I get:
[Errno 20] Not a directory: 'ffmpeg'
I installed whisper using:
pip3 install -U openai-whisper
tried also using pip3 install ffmpeg-python, successfully installed ffmpeg. but running the file still didn’t work, so to further troubleshoot ran which ffmpeg. this gives ‘ffmpeg not found’
i then tried to add the ffmpeg executable to my .zprofile (as my terminal window is using zsh not bash) but this doesn't seem to resolve anything either, any suggestions?
Hi everyone, I'm noticing a small bug with the "client.audio.transcriptions.create" endpoint where I will send a message in french and it will transcribe it in english. Is anyone else experiencing this? I guess maybe the "audio.transcription" API is getting confused with the "audio.translation" API?
does whisper-1 point to large-v3 now or is it still large-v2
Last I heard and in the online docs, it isn't in the OpenAI API yet
How did you get whisper to work on VoiceAttack, and would it work to trigger commands?
install vlc
fork your own ffmpeg and put on whisper folder
TTS question. any voice that sounds decent in spanish?
on normal voice i use juniper idk about tts model sorry
I was told that Sky is good at Spanish compared to the others by the team
Would this also work if you were to feed it voice recordings from someone who passed away (my brother), and have it respond to questions using his voice?
Yes absolutely
You just need a short 20 second audio clip, even 2 second will work but it won’t be as good
I recommend 20 seconds since that’s what I found to work great
And it does a complete voice clone and will read the output texts to your questions in your brothers voice
If you have a powerful GPU the processing will be much faster, if you have cpu, it’ll take some time. But it’s completely on the fly
I'm looking to use whisper to transcript a podcast, is it capable of identifying speakers?
For example:
Speaker 1: "hello and welcome!"
Speaker 2: "To today's video!"
things might have changed since but iirc not natively, although you can mess around with the prompt to make it do something similar
I built a bot that does this. You just need to separate user audio streams and identify them. Then have the results tie per person
Is there any free website where we can use whisper to transcript our audio files using our API?
Or any chrome extension that transcripts videos run from any source?
I recommend reading the documentation for whisper. It has some very simple Python code where you input an audio file for transcript as an example. I don’t know of any such services as it would cost them and providing an api key is sketchy at best
Building your own little tool with help from ChatGPT could be a great learning experience too
Thanks, yes this is definitely on my todo list.
Something went wrong. If this issue persists please contact us through our help center at help.openai.com. How can we fix this problem?
My chartGPT account has been experiencing this problem since this afternoon
Folks whisper iis free to use?
Is this server error?
same for me
I am facing the same problem, I tried to build GPTs and when I tried to upload files all my OpenAI collapsed and now I am unable to chat, create and do anything with my main account. (This is the error I am recieving: Something went wrong. If this issue persists please contact us through our help center at help.openai.com.)
I've got the same issue: Something went wrong. If this issue persists please contact us through our help center at help.openai.com.
same issue here
Good morning! Are you having trouble accessing it?
I have been having trouble accessing all day
me too, no access via browser, however be able to access it through the app
Will this voice be in a new release?
This are all this voices available..
Experiment with different voices (alloy, echo, fable, onyx, nova, and shimmer) to find one that matches your desired tone and audience.
Ya the voices are different between the iOS ChatGPT and the api. Not sure about api voices for that sorry
i dont think large-v3 is even on the api yet, and the official implementation is really slow 
have been working on getting faster-whisper updated
Its not too slow in my experience but if you want instant maybe so. Do smaller batches
Official implementation? faster-whisper?
the openai/whisper python repo
faster-whisper is a ~4x faster reimplementation
is this integrated with ChatGPT?
is there a way to play around with this using ChatGPT?
Or this must be using the code?
Anyone is here?
You can use whisper and voice with ChatGPT on the app
But it’s not really a thing you can save or build without using api
I see. I'm trying to see the process of re-making the Korean song lol
I thought it was really damn cool
Hi! Do you konw whether the web app is built based on gpt-4 turbo now?
it is
I tried to create a script that:
- Converts MP4 video to MP3 audio using ffmpeg
- Transcribe the video using whisper-1
- Translates the video trying to maintain the timestamps
It does the 1. and 2. section very well, but when it comes to translate and maintain timestamps it starts bugging a bit.
For example:
1
00:00:00,000 --> 00:00:07,820
se torna difícil.
Well, here we have a samba in 11 by 8...
It has a non-translated phrase before the translated one.
Someone successfully created a script that could translate and maintain the same timestamps?
(I hope that's the right channel to ask)
You probably will want to track requests by timestamp yourself. Have the request file and response thread tied to a timestamp and then have a data object of some kind store timestamp to file to translation
whispererr on chatgpt when
I will try that, thanks 😃👍
Anyone have a good js voice detection package/api that can work well with whisper? Not a lot of info online
Hello, any updates in the official documentation?
Hi can report a error here in OpenAI will the openAI team attend the issues
you can report bugs in #1070006915414900886
I did that a while ago
that's the best I can suggest unfortunately
ive made music on ableton forever, i never looked at an ai music maker thingy. im really stoned(sedated thinking), how good of results does it give? can i get specific instruments to remix into a daw?
Whisper 3.0 amazing
best mac or web app that uses whisper so I can quickly speech to text?
whats the main difference between 3.0 and 2.0
1
Hows it going yall
😂 😂
Can I get text of audio file using whisper although that audio file is 250MB ?
Thanks in advance for your kind help.
Is that possible create near realtime translation (audio->text <> TTS/text->new-lang). Or translate recorded meetups to other langs? Did you see similar GitHub repo, example?
Hi everybody. This could be a silly question. I've been researching text to voice for an app I want to make. E.g.
User speaks into mic > Text appears
Most Browsers come with this functionality built in these days.
Is this likely how the ChatGPT app works OR is that one of the tings that whisper can do for me but better?
Hi All, based on the doc, seems Text-to-Speech only support python, not node, is this right?
Hi everybody. I have a question about speech to text. Is there any way to convert voice into text in real time? For example: I enter a voice file as soon as I click start, where will it translate the voice into text?
Chrome has a live transcribe accessibility option. I don’t know if that is what you want though 🤔
Is Whisper a in-real—time speech-to-text AI? I am an online student and looking for ways to attend classes and having subtitles being fed through the audio.
What I mean is that I'm looking into whether the open ai Whisper model's api has real-time transcription or not.
Oh, same then
Anyone know what is this "Whisper-1" model? Because i cannot find model in whisper that has this name?
Any advice? 🙂
Yes you can split that to chunks
Any ideas about this? chunks are working mp3 files so they don't seem corrupted
transcription works for the first chunk but getting that 400 for the rest
@simple hornet how large of a file are you working with? And what size are you chunking it too? The “Invalid File Format” is a weird one
i made the chunks 20mb, the first one works, 2nd and 3rd don't (sizes 20mb, 20mb, 12mb)
Maybe it’s a file integrity issue, but that also would also be weird since you seem to be chunking the first one I assume in the same way as the other ones. It might also just be a Whisper bug and not your code
hmm weird. thanks tho
hey there, can someone confirm only the large model is new v3 for whisper right. there is no insane new tiny v3 model?
cause i need a tiny one to run on android
Subtitle: Your open-source, self-hosted subtitle generator for seamless language translation
https://github.com/innovatorved/subtitle
looking for info about transcription for multiple speakers... i feel like it must exist, just not sure where to start
(I'm responding before seeing other potential replies): Have you seen the other whisper projects which attempt to provide more accurate alignment?
Got my own q: I'm using hf's seq2seq to fine-tune whisper models, but I'm preparing audio+transcription data for someone who breaths on a ventilator. I'm wondering if I can put "<|nospeech|>" tokens directly in my training text for the pauses amidst her sentences where the ventilator takes a breath, as well as to mark some areas just of ventilator noise so the model can learn, "this is NOT her speaking."
➜ subtitle git:(master) python3 subtitle.py example/story.mp4
/Users/ed/Library/Python/3.9/lib/python/site-packages/urllib3/init.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
warnings.warn(
Model base exists
ERROR:root:/bin/sh: ./binary/whisper: cannot execute binary file
ERROR:root:Error while transcribing
ERROR:app.core.app:An error occurred: Error while transcribing
ERROR:root:An error occurred: cannot unpack non-iterable NoneType object
On my M2 Macbook - it downloaded the model ok (after I pip3 installed ffmpeg and gdown) but then I ran into this
guessing I have to chmod whisper or something?
Currently binary is not compiled for Mac
I will add support for Mac within the next 24 hours.
ok no worries - I'm going through https://medium.com/gimz/how-to-install-whisper-on-mac-openais-speech-to-text-recognition-system-1f6709db6010 to install it globally, then I guess I can just symlink it
Hi! I'm new here. I'm devvng a chatbot in python.
ah ok, I'll play around a bit anyway and keep an eye on things 👍
hmm. Whisper - just learning about it.. looks interesting
whisper API question - any luck getting it to ignore silence in longer transcriptions (>30min)? it might be a bad prompt, but i either get something like Silence. Silence. Silence. ... or Yeah. Like. Yeah. Like. ... in between long pauses.
Whisper 3
Hello, If my pc is supporting gpu, can whisper generate text in short time?
With cpu, it took around 924 seconds to get text from 29mb audio.
Please advice me if you have experience.
https://github.com/openai/whisper/#available-models-and-languages
GitHub
Robust Speech Recognition via Large-Scale Weak Supervision - GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision
Anyone know how to use Whipser with the openai npm package in Node.js? It worked for me before but no longer works after the update to version 4.20.0
SH#
Ok I found the info now. There's a node example in this page https://platform.openai.com/docs/api-reference/audio/createTranscription?lang=node
You can use the input of your microphone and split that into chunks and feed them to the API. Real-Time is a word, as AI, which is quite flexy. Speaking of Terms like in embedded systems: No, its really absolutely not real-time. When you attend class and you have enough calculation power in your laptop: Its okay to attend with a short delay. Whisper meanwhile does not support that out-of-the-box. Separating in chunks has to be done by your software. There is in fact a module for python for that called "whisper_mic" by Blake Mallory which appeared to work fine for me.
I actually did, but did not find one
Anyone know if it is possible to add prompt to huggingface distil-whisper?
that models is focusing to detect text from audio, what kind of prompt are you talking?
Whisper is a transformer model with language modeling head on top which predicts next token just like GPT with the context of previous tokens. So because of that u can by hand append extra tokens for each chunk of audio while decoding. This can be used to provide or “prompt-engineer” a context for transcription, e.g. custom vocabularies or proper nouns to make it more likely to predict those words correctly, but I don't want to implement it myself was wondering if hf had it out of the box like openai api does:
https://platform.openai.com/docs/guides/speech-to-text/prompting
I think it would work therer as well.
transcribe(filepath, prompt="")
You can actually try in your new whisper model
anyone knows if they will also release a tiny v3?
diarization?
- naive chunking breaks transcribed words that are split on the chunk boundary, have to get creative
whisperx integrates everything but hasn't been updated for v3 yet
in case anyone didn't see, I have a question about the api here https://discord.com/channels/974519864045756446/1177642614795808849
it's really weird and i'm not really sure what to do
i got stuck for a while on this
Hey guys, can the API return an object with start time end end time in ms for each word ? is punctuation separated ?
@unkempt arrow doesn't seem like the API suports it from what I've seen, but take a look at WhisperX, whisper-timestamped, and faster-whisper on github
@simple hornet i'll check it thanks. do you know what the json or json-verbose are returning compared to the "text" ?
figured it out, needed to pass the correct mime type essentially
I think json verbose gives phrase timestamps maybe? I looked briefly but idk if I saved an example, it was very verbose and not what I was looking for at the time, but I wasn't looking for word level timestamps, I was looking for diarization.
Is there anyone who had set up local whisper using GPU?
I am having problem,and would be helpful if someonce can advice me
Hey, quick question about Whisper.
I'm working on a subtitles for my video. Whisper catches my speech perfectly, but most of the times it leaves a ton of sentences in one block in srt.
Is there a way to force Whisper to just throw only one sentence in one line? It would help a ton.
Is there anyone who had set up local
What I would do is either write some parsing algorithm that separates them evenly pondering the number of characters/length of the speech or use an external software to sync the subs with the text (basically you transcribe with whisper then you sync with something else). Look up for "thio joe youtube subtitles" on youtube, he mane a nice video and he explains how he syncs whisper with the actual video. The video is titled I Created Another App To REVOLUTIONIZE YouTube
I shared my GPTs publicly there are 8 comments on that but i cant able to see it how to see that
I want to code when whisper TTS changes voices. Basically I want to record a conversation but then have two ai's do the talking to each other for anonymity. Is it possible?
Whisper doesn't differentiate speakers. You could post process with GPT-4 to try to separate them based on semantics but that's not mega reliable.
If you have separate audio feeds for each user, then you can do that. I have a system that takes combined audio and then splits based on user before feeding to whisper. You get individual transcription that way and then feeding it to TTS is simple
Hello everyone!
I’m currently running Whisper locally in UE using the Runtime Speech Recognizer plugin. It works quite well, but I’m looking for faster and more accurate recognition. The small language models are very fast but unfortunately, they are extremely inaccurate in Hungarian, almost unusable. Only the large quantized language model begins to be usable, but it’s not precise enough and is incredibly slow. Can anyone help me with how to make a smaller language model more accurate using the available settings? I don’t fully understand what these parameters do. Also, is there a way to use Whisper with only Hungarian language libraries, since I only need Hungarian recognition? I’m guessing that could be smaller and maybe faster. Might be a silly question, but I’d appreciate any help. Thanks in advance!
Using the OpenAI API you can get faster responses since they have faster PCs
I've tried using whisper large on the OpenAI API and it's pretty fast
has anyone tried the TTS with non-English languages? i want to know if its able to read non-English texts well
I think the best way to know is to just test it
yeah, but im not fluent in any non-English language :/ wanted to know if anyone has any insights before i start reaching out to people who could test it for me!
Thanks for the suggestion, Carrot. However, I need to run Whisper locally, and I'm focusing on optimizing smaller models for speed and accuracy. Do you have any insights in understanding the parameters to improve their performance?
result = model.transcribe(filepath, language="pt", fp16=False, verbose=True)
You can specify using language parameter
In your case, hungarian, you could need to add language code for that, not "hungarian", haha
Prince, I don't quite understand what you wrote. In the settings options, I only have what I showed in the screenshot and of course, I know how to change between different Whisper language models. The large model works well but is very slow, and I need this for a real-time application, hence my question. Based on the parameters visible in the picture, what would you recommend as the most accurate configuration for a smaller language model? And yes, the language is set to Hungarian. 🙂
hi
has anyone figured out how to control (reduce) the duration each output segment to somewhere between 0.5 to 1 second ?
My output on average varies between 2 to 5 seconds
const res = await openai.audio.transcriptions.create({
file: fs.createReadStream(filePath),
model: "whisper-1",
prompt:
"the duration of each output segment must be between 1 to 2 seconds",
response_format: "verbose_json",
});
return res;
Question: Does whisper work reliably with German Audio?
has anyone tried the TTS with non-
Hey guys, did anyone have deployed Whisper on Android and inference via GPU?
Hello, I am pretty sure I made everything true but I am getting an error like this
Error loading "C:\Users\MSI-NB\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
how can I fix this cudnn_cnn_infer64_8.dll thing?
I need the AI really fast
i believe the file is missing from the directory, can you check ?
also make sure you have nVidia GPU Computing Toolkit installed
It is there let me send a screen shot again
ok
It is
winerror126 means the file is not there or it cannot be loaded due to missing dependency !
It is here
What can I do about it? I really need the AI ://
cudnn_cnn_infer64_8.dl is cudnn v8 , you need to check your pytorch installation ( check compatible cuda and cudnn)
I am really new at those things, haha
How can I check it?
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
used this code
@wispy thistle
here it is @wispy thistle
can you install cuda 11.8 and cudnn (https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installdriver-windows) and try again
NVIDIA Docs
This cuDNN 8.9.6 Installation Guide provides step-by-step instructions on how to install and check for correct operation of NVIDIA cuDNN on Linux and Microsoft Windows systems.
yea, ill send u
Hey guys I'm working on a whisper transcription app using NextJS however I'm getting:
error: {
message: 'Could not parse multipart form',
type: 'invalid_request_error',
param: null,
code: null
}
}```
Can someone please check my code to correct me where I'm wrong?
```import axios from "axios";
import { useRef, useEffect, useState } from "react";
const model = "whisper-1";
export default function UploadPage() {
const inputRef = useRef();
const [file, setFile] = useState();
const [response, setResponse] = useState(null);
const onChangeFile = () => {
setFile(inputRef.current.files[0]);
};
useEffect(() => {
const fetchAudioFile = async () => {
if (!file) {
return;
}
const formData = new FormData();
formData.append("model", model);
formData.append("file", file);
console.log(formData);
fetch("https://api.openai.com/v1/audio/transcriptions", {
method: "POST",
body: formData,
headers: {
"Content-Type": "multipart/form-data",
Authorization: `Bearer youropenaikey`,
},
})
.then((res) => res.json())
.then((data) => {
console.log(data);
setResponse(data);
})
.catch((err) => {
console.log(err);
});
};
fetchAudioFile();
}, [file]);
return (
<div
style={{
backgroundColor: "#f2f2f2",
padding: "20px",
borderRadius: "8px",
}}
>
Transcribe
<input
type="file"
ref={inputRef}
accept=".mp3"
onChange={onChangeFile}
style={{ display: "block", marginTop: "20px" }}
/>
{response && <div>{JSON.stringify(response, null, 2)}</div>}
</div>
);
}
Any help greatly appreciated!
Hello! I've got a problem with the whisper API not translating from french. Any one has an idea why?
How do I achieve diarization from the output of openAI whisper model?
Hello, with the new python update for openai, I am having some trouble running some code from the Spyder IDE for transcription. Here is the code:
client = OpenAI(api_key=keyishere)
model_id = 'whisper-1'
audio_file_path = "C:\......."
audio_file = open(audio_file_path, 'rb')
response = client.audio.transcriptions.create(
model=model_id,
file=audio_file,
)
transcription_text = response.text
print(transcription_text)```
The code executes, and gives these errors:
```File ~\anaconda3\Lib\site-packages\openai\_base_client.py:1096 in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
File ~\anaconda3\Lib\site-packages\openai\_base_client.py:856 in request
return self._request(
File ~\anaconda3\Lib\site-packages\openai\_base_client.py:894 in _request
return self._retry_request(
File ~\anaconda3\Lib\site-packages\openai\_base_client.py:966 in _retry_request
return self._request(
File ~\anaconda3\Lib\site-packages\openai\_base_client.py:894 in _request
return self._retry_request(
File ~\anaconda3\Lib\site-packages\openai\_base_client.py:966 in _retry_request
return self._request(
File ~\anaconda3\Lib\site-packages\openai\_base_client.py:908 in _request
raise self._make_status_error_from_response(err.response) from None
RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}```
Looks like it is attempting three times and then cutting it. I definitely have the space in my plan, I think (I have $18 in my free account). I am running this on the free version of ChatGPT. Anyone have an idea what could be wrong? ChatGPT hasn't figured it out yet lol.
I don't know if this is the right place but I have to complain
As much as I am delighted with Whisper and the fact that things that were once incomprehensible to me become understandable thanks to this tool, I am equally frustrated
I transcribe musicals in Italian and it is a very difficult and unsatisfying job. We know that Whisper has great potential and I know that one day it will be perfect, but why can't it be perfect now...?
Which issues are you experiencing?
various
Like?
Not being accurate, simulating, generating various nonsense, I would like to point out that I use tears in the subtitleediting program
has anyone tried out the whisper-large-v3 model? https://huggingface.co/openai/whisper-large-v3 had some questions about how to handle white noise
I haven’t but I’m quite interested. Thanks for sharing. That is not an official model tho right?
i figured it out. can use this mechanism to figure out if there's voice in the audio clips or not and where they are located: https://github.com/snakers4/silero-vad
so i basically wrote a small function using this:
import torch
SAMPLING_RATE = 16000 #48000
torch.set_num_threads(1)
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad', force_reload=True, onnx=False)
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils
def has_voice(file_name):
audio = read_audio(file_name, sampling_rate=SAMPLING_RATE)
speech_timestamps = get_speech_timestamps(audio, model, sampling_rate=SAMPLING_RATE)
if len(speech_timestamps)>0:
return True
else:
return False
How the hell I fixed the issue with the timestamps starting from 0 even that the speech is not? also this issue seem to related to further issue when the transcribe afterword's not aligned with the speech.
is there any way for me to figure out if a provider is using whisper as their transcriber? any specific quirks with whisper i can test?
I really need the high level top gear that already play with it. I finish installed and run and all work but I want to make some adjustments to make the output more as srt pro file and not just mouble jumple. sooooo who is with me?
my goal is that at the and maybe to use spacy to merge with logic cues and such... I am open to ideas
Is there an argument I can use to make the timestamps slightly longer? It cuts out too early.
not sure what you are trying to do?
you mean like at 30 seconds?\
Are we allowed to talk about whisper.cpp in this channel?
only in whispers..
you can ask for help with forks of the original project, but it is just less likely you will get an answer
Hey. Anyone ever tried to fine-tune Whisper while "boosting" certain words? Not necessarily while training, but could also be helpful during inference.
For example, target is "rotoescoliosis", but prediction ended up being "rot escolosis". I know that on inference time we have access to initial_prompt, but in my use-case, during inference time it's impossible for me to know what will be inserted in the initial_prompt.
tldr; I have a corpus specific to my context, and I want to "boost" the words in it during training or inference, without initial_prompt.
I'd be very thankful if someone could point me in the right direction!
Hey, is there anyone here who has worked with the whisper API in node.js? I'm trying to understand if the OpenAI node library supports it or not and so far, all I'm finding are a bunch of outdated code examples online. The official API documentation only seems to have a python example for sending audio to it... Am I missing something?
Yeah I've got it to work in node.js
But I've done it through the HTTP API, not the library
Here's how I've done it:
require('dotenv').config()
const axios = require('axios')
const { Blob } = require('buffer')
const buffer = Buffer.from(YOUR_FILE_DATA_HERE)
let data = new FormData()
data.append('file', new Blob([buffer]), 'audio.mp3')
data.append('model', 'whisper-1')
const req = {
method: 'POST',
url: 'https://api.openai.com/v1/audio/transcriptions',
headers: {
'Authorization': `Bearer ${process.env.API_KEY}`
},
data: data
}
let transcription = await axios.request(req)
transcription = transcription.data.text
console.log(transcription)
it is
Im not sure if an ios screenshot on safari a sufficient response (but I don’t know @jdo330 what IDE do you use)?
It was to tell that he can switch language example by clicking on the arrow.
Tamper with the background noise, split it into like thirty sec intervals using a while loop , use gpt api to read it and adjust for grammar
Yeah like s.k.6.9 said, you could use a chat model like GPT-3.5-turbo or GPT-4 to correct the grammar and give it certain words to correct
Has anyone here had any experience training a custom whisper model? I'd like to explore some stuff, for example training it on SDH subtitles for [EXPLOSION] type notation, or on music to better pick out lyrics out of songs
How would the data preparation work, etc
I really hope it's easy like loading audio + transcription in and have it go at it
I don't think so, it's not like you can "fine-tune" a whisper model right now, and I don't think that supporting fine tuning for whisper is OpenAI's priority
Update to this: I'm trying to make whispercpp's "grammar" function to work in my use-case: https://github.com/ggerganov/whisper.cpp/pull/1229
I thank all the suggestions you guys gave me. I thought about using a GPT model to correct the words, but I wanna try other options first.
It's possible: https://huggingface.co/blog/fine-tune-whisper
Just not "officially" through the OpenAI's API, as far as I know. You can use HuggingFace Transformers. And yeah it's easy.
Hmm it calls for an ASR dataset, which iirc is time-aligned per word. What I would have, realistically speaking, is an mp3 of a song and plaintext lyrics
Still doable?
oh nvm, it doesnt call for time aligned stuff
but it might still be a challenge to split the songs into manageable chunks, or could i train on the full 3-minute ish audios?
As far as I know, you need to split the audios to at most 30 seconds, I'm unsure if it does so automatically for longer audios
looks like i'm going to manually prepare the data
not sure how much data i need for a decent result though
considering i'd just be training it more on english, and its already rather good at picking out lyrics from songs
That's usually very hard to determine :/
Has anyone here had any experience
I don't think so, it's not like you can
not sure how much data i need for a
Is it possible to increase the volume (make the agent speak louder) when using text to speech?
or change the affect/feeling of the text?
I'm running whisper locally and am wondering how I can get it to segment more. I'm trying to write subtitles for videos, but the segments with text are too grouped up
like I got two segments of 30 seconds each
When you say run "whisper locally" you mean running it using the API right? So transcription still happens on OpenAI servers. Or can you do the computations on your computer as well?
computations are on my computer
I've got a 3090 so I can run two instances of large
Well, just barely run two
I can't do anything else
Can you please send a tutorial link that shows how to run whisper offline on your own computer?
Or is it this tutorial: https://github.com/openai/whisper/discussions/1463 ?
I just did pip install whisper
Ok. Thanks
when you use a model for the first time it will download it. that download will be slooooooow
Btw, what I understand is that you need to segment your subtitles but sometimes you end up with a lot of text which doesn't relate to the specific scene of the video you are trying to transcribe?
I would try this then:
- Acquire transcription of the video.
- Segment the transcription of the video based on the sentences which you just transcribed. For example, do string split "." to figure out where a sentence starts and ends.
- So you will now have segmented sentences. Using these sentences, go back to your audio to find out where in the audio file itself a sentence could be possibly starting an ending (short pauses maybe) or somehow calculate the approximate duration of a sentence using tts (or maybe text based ChatGPT itself). Using this duration, segment the audio file, and then split your audio file and send it back to your whisper model.
@woven pier
I think there would be specific methods to figure out where speech starts and ends.
How big is it? I am assuming it will be downloaded to APPDATA (Windows) and I dont have much space in C: But I have plenty of space in other drives.
it didn't download faster than like 70 kbps
and I have gigbit internet
Ahh, I miss those 2008 days where I used to have 70kbps internet
hey do you guys have any news on the opening of the GPT marketplace ?
All updates will be posted in #announcements 😄
Hey guys,
There is a way to transcribe an audio with 1hr+ with whisper API without cutting the audio into parts?
I already tried to compress btw, but it is still too big
(more than 26MB)
I compressed to 32k, monoaudio and 22khz and I got it to 25MB
But when I try to use whisper it loads infinitely
I stayed with the PC turned on for one day and it did not finish 
hey, am I allowed to share a link for my GPT here?
#1171489862164168774 is the place for it 
Ok, thank you
Does anyone know how to change the words per line for whisper's speech to text?
Split it with ffmpeg, send in batches, and return the text. You could even parallelize it
Are you concerned about edge cases where it splits during a word?
whisper token usage seems tom be fairly expensive compared to generations. I'm struggling to see the use case compared to just native text to speech for the cost. Am I doing something wrong?
Model Usage
Whisper $0.006 / minute (rounded to the nearest second)```
gpt-4 $0.03 / 1K tokens $0.06 / 1K tokens```
in theory it's far cheaper than gpt-4
turbo models are super cheap though
Model Input Output
gpt-3.5-turbo-1106 $0.0010 / 1K tokens $0.0020 / 1K tokens```
Also whisper is STT not TTS. Openai does have TTS models as well
Anybody can help me out, I'm using fastapi and I want to not write to file to transcribe my file-upload. I've spent hours but it I ended up going back to shutil cause in memory kept bugging out on filetype. Anybody have a snippet I could use?
Yes, and the timestamps bug out
I could just try to fix it manually, but I would like to do the process completely automatic
Probably stuck with splitting it up
If you split the ffmpeg bits into smaller pieces and split off transcription you'll already be closer
hey folks - i have a very simple Q - forgive the noobness -
assume I am recording a lecture during a classroom, while the professor is speaking - i want to trigger OpenAI to summarize the lecture so far.
so I want to trigger it by saying something like - "hey {assistant} can you summarize the notes so far?"
- how can i do so?
2.how do i remove that prompt as its not part of the lecture?
hey folks - i have a very simple Q -
hey, i'm having a bit of a problems with whisper-ctranslate2 (and normal whisper as well)
in the first minutes of a wav file, it spits out really long statements and after a while it starts to split it for some reason
i can't force it to do it one way or the other, any reason why it's happening and how to fix it?
Is whisper the same price when converting the audio from an mp4 to text and when converting like an audio recording
Does anyone know of a way to using TTS (text to speech), but instead using a plan text, using an SRT subtitle? So essentially dubbing?
Basically I have a videos with SRT transcription, but wanna generate a dub using OpenAI's text to speech. Given that SRT includes timestamps, I don't want the timestamps to be read out loud, but used as guides for timing.
@lone scaffold commercial thing?
I've done something similar for existing_text + whisper_transcription alignment in a recent project.
it worked, well, perfectly. (at least I encountered zero failures). there are timing issues, by the way, with whisper's timestamps, so use the json output with word timings. lmk if you need help.
Does anyone know if there’s any iOS app, that is basically a keyboard alternative, but with whisper functionality built-in?
it's for my own use actually, I found a really helpful course but it was in Chinese, but it had proper transcription, but I wanted to see if I could turn the SRT transcription with timing into a speech with OpenAI's TTS. How did you make TTS have delays that correspond with the SRT subtitles?
Sadly I don't but there is one on Android. It's pretty basic stuff and I can't find a GitHub repo but looks like "just some guy" free apps. I have used it for a good few months and also double checked security doing packet capture. Nothing was sent remote. The same guy has a great whisper powered notepad app too.
https://play.google.com/store/apps/dev?id=6674916867778158495
I'm the only one to have reviewed the note one, which is actually really nice. The keyboard is actually really good too. Just set the timeout to 30 seconds on the recordings because at the very end of each recording before transcription, there will be a little break in the audio that might make you miss a word. Really nice stuff. Works better than Google's voice typing even on my Pixel 6 accuracy wise. The note app has model selection from tiny.en up to base. I don't know what inference backend is used. You can also run Whisper.cpp in Termux on Android just installing through pip on Python 3. All you need is
pip install whisper-cpp-python
Found you one! Not local unfortunately
Edit -> While this isn't local, it does look to be trustworthy.
This looks cool, I’ll check it out. Thanks! 🙂 That said I was looking specifically for a keyboard like something that I could use as input to any text field in any app without having to switch between apps.
Not delays. just getting finer timings out of it using the word-based timings in the .json
that's from a court case.
(I'm kidding)
notice the word timings
I don't exactly know how accurate it is; you might want to evaluate it before you go too far.
Here's a script I wrote (sort of.. me and chatgpt since it's still a lot faster even for little things like this, imo) .. that converts the .json to audacity/tenacity labels.txt you can import (File -> Import -> Labels):
#!/usr/bin/env python3
import argparse
import json
def format_float(value, sig_figs):
return f"{value:.{sig_figs}f}"
def process_json(json_data, options):
labels = []
for segment in json_data['segments']:
if not options.no_full and segment['text']:
label_text = segment['text']
if options.probs:
label_text = f"({segment['avg_logprob']}/{segment['no_speech_prob']}) {label_text}"
if options.token_full:
label_text = f"[{len(segment['tokens'])}s] {label_text}"
labels.append(f"{format_float(segment['start'], options.sig_figs)}\t{format_float(segment['end'], options.sig_figs)}\t{label_text}\n")
if not options.no_words:
for word in segment['words']:
label_word = word['word']
if options.probs:
label_word = f"({word['probability']}) {label_word}"
if options.token_words:
label_word = f"[{word['token']}w] {label_word}"
labels.append(f"{format_float(word['start'], options.sig_figs)}\t{format_float(word['end'], options.sig_figs)}\t{label_word}\n")
return labels
def main():
parser = argparse.ArgumentParser(description='Convert Whisper JSON to Audacity Labels')
parser.add_argument('file', type=str, help='JSON file to process')
parser.add_argument('-o', '--output', type=str, default=None, help='Output filename')
parser.add_argument('-nf', '--no-full', action='store_true', help='Disable output of full text string label')
parser.add_argument('-nw', '--no-words', action='store_true', help='Disable output of individual word labels')
parser.add_argument('-sf', '--sig-figs', type=int, default=3, help='Max significant float digits')
parser.add_argument('-p', '--probs', action='store_true', help='Include probabilities in labels')
parser.add_argument('-tw', '--token-words', action='store_true', help='Include token info in word labels')
parser.add_argument('-tf', '--token-full', action='store_true', help='Include token info in full text labels')
args = parser.parse_args()
with open(args.file, 'r') as file:
json_data = json.load(file)
labels = process_json(json_data, args)
output_file = args.output or args.file.rsplit('.', 1)[0] + '.transx.txt'
with open(output_file, 'w') as file:
file.writelines(labels)
print(f"Labels written to {output_file}")
if __name__ == "__main__":
main()
use with -nf so it'll only output the word tokens (otherwise it outputs the long spans of text AND duplicates it all as individual words too)
(whisper-json-to-labels someaudio.json will make someaudio.transx.txt)
Hi, I was wondering what kinds of audio preprocessing do you guys do before sending the audio file to Whisper API to get the best results?
For example, currently I'm doing the following steps to preprocess the audio file:
def preprocess_audio(audio_clip: AudioSegment, audio_file_path: str, channels=1, bitrate=16000):
audio_clip = trim_silence_pyannote(audio_file_path, audio_clip)
audio_clip = effects.normalize(audio_clip)
audio_clip = audio_clip.set_channels(channels)
audio_clip = audio_clip.set_frame_rate(bitrate)
return audio_clip
anyone know if anyone has integrated whisper on a telephony platform, like voximplant has with dialogflow? or if someone is known to be working on that?
Good question… do you notice any difference in quality between different formats? I do merge the channels because diarization requires it to be one, but I don’t do and other processing yet
So far I haven't experimented with different audio formats yet because our inputs are pretty much just mp3 files.
I get all sorts of formats, journalists transcribing raw interviews from whatever device they use, sometimes it’s just the iphone‘s default record app. I’ll try to convert everything to a tbd common format
Other than changing bitrate and number of channels like you do, I normally don't do any preprocessing besides trimming silence since Whisper can be prone to hallucinations on silence.
I've noticed it hallucinates and sometimes repeats words on silences, overlapping speech, noise, anything that technically isn't human speech.
Hence I've been wondering if anybody managed to solve this via preprocessing the audio.
anyone know if anyone has integrated
Thought it'd be neat to share that I used Whisper-JAX to transcribe 428 hours of audio in just under 6
Hey, I'm currently using faster-whisper for TTS processing. However, my 15-second audio takes almost 2 seconds to process on my MacBook M2 with 16 GB RAM. Unfortunately, this is too long. I would like it to be closer to 1 second. Is there any way I can speed up the process? I would prefer not to use a model worse than 'small'. I will mainly be transcribing German audio.
from faster_whisper import WhisperModel
import datetime
model_size = "small"
model = WhisperModel(model_size, device="cpu", compute_type="int8")
start_time = datetime.datetime.now()
segments, info = model.transcribe("audio.mp3", beam_size=5)
end_time = datetime.datetime.now()
print(
"Detected language '%s' with probability %f"
% (info.language, info.language_probability)
)
for segment in segments:
print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
print("Transcription took: ", end_time - start_time)```
```Transcription took: 0:00:02.104246```
I plan on making my audio compression code available as an api.
wondering if there is an appetite for it (I use in my own app)
it usually returns an mp3 link that is 10% of original size.
UI version here:
shownotes.io/crush
If I get enough interest will make the api version.
how do i cap the amount of text/time per line in the transcript? for example I want
[2.0->4.0] consectetur adipiscing elit.```
but its giving
```[0.0->4.0] Lorem ipsum dolor sit amet, consectetur adipiscing elit.```
replacement question! is there a way to get time per individual word
nevermind
Yes isn’t there a parameter „timestamps_per_word“ or something like that?
Hello, we are having issues accessing the GPT-4 API for some reason it's not available on our account even though the account is paid and we have spent money on the GPT-3 API (and also, for some reason, we haven't been charged this month). We have been working on a SAAS project for real estate brokers for a long time, have made a custom plugin for GPT, etc., but unfortunately, we cannot use it. Can you help us with what to do?
import os
import pygame
import speech_recognition as sr
def speak(text):
voice = "en-US-ChristopherNeural"
command = f'edge-tts --voice "{voice}" --text "{text}" --write-media "MichealOutput.mp3"'
os.system(command)
pygame.init()
pygame.mixer.init()
try:
pygame.mixer.music.load("MichealOutput.mp3")
pygame.mixer.music.play()
while pygame.mixer.music.get_busy():
pygame.time.Clock().tick(10)
except Exception as e:
print(e)
finally:
pygame.mixer.music.stop()
pygame.mixer.quit()
def take_command():
r = sr.Recognizer()
with sr.Microphone() as source:
r.adjust_for_ambient_noise(source, duration=0.5)
print("Listening...")
r.pause_threshold = 1
audio = r.listen(source)
try:
print("Recognizing...")
query = r.recognize_sphinx(audio, language='en-us')
print("hi")
except sr.UnknownValueError:
print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
print("Could not request results from Google Speech Recognition service; {0}".format(e))
return query
speak("hello, i am Micheal, your virtual assistant; How can i help you today")
query = take_command()
print(query)
i want to implement whisper to this
this is a speech to text code
i am using sphinx
but sphinx isnt accurate
so can anyone help me
Happy Friday https://chat.openai.com/share/b6606aff-ce32-4117-8d80-7ba421321ad8 ```import os
import pyaudio
import wave
import openai
Set your OpenAI API key
openai.api_key = "your_openai_api_key"
def record_audio(filename, duration=5):
"""Records audio from the microphone and saves it as a WAV file."""
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
CHUNK = 1024
audio = pyaudio.PyAudio()
stream = audio.open(format=FORMAT, channels=CHANNELS,
rate=RATE, input=True,
frames_per_buffer=CHUNK)
print("Recording...")
frames = []
for _ in range(0, int(RATE / CHUNK * duration)):
data = stream.read(CHUNK)
frames.append(data)
print("Finished recording")
stream.stop_stream()
stream.close()
audio.terminate()
with wave.open(filename, 'wb') as wf:
wf.setnchannels(CHANNELS)
wf.setsampwidth(audio.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
def transcribe_audio(filename):
"""Sends the audio file to Whisper for transcription."""
with open(filename, "rb") as audio_file:
transcript = openai.Audio.transcribe("whisper-1", audio_file)
return transcript["text"]
def main():
audio_filename = "recorded_audio.wav"
record_audio(audio_filename)
transcription = transcribe_audio(audio_filename)
print("Transcription:", transcription)
if name == "main":
main()
I am so lost. Is there a way to use Whisper without downloading anything? They say is in the api, but I cant find the [whisper] model
From microphone to transcription: normalize the data, flatten it, make it 16000 with librosa, output it to a wave file to verify the audio was manipulated properly. If the audio is too slow tell librosa it's 2x faster than it is.
How can I use OpenAI Whisper in JavaScript/TypeScript?
I’m a subscriber, where’s my beta version with call option to other gpts and ability to locate and attach third party APIs? Can’t find it. GPT4 knows nothing of it…
click on "Explore GPTs" then on "Create"
I’ve created many already. I m seeking the new higher functions.
no idea of what you are talking about
the option to make API calls is under the "Actions" section of the custom gpt configuration
Yes, but there’s a dedicated Actions GPT that’s meant to help with that.
there are some forked implementations of whisper that claim to run faster than the original and on CPU
I never tested, but the idea sounds pretty cool
Hello !
So I have a Mac laptop from 2017 (MacOS Monterey 12.7.1). I have installed Python 3, Pytorch, ffmpeg, pip, and whisper via pip. Every one of these installation processes seem to have worked well and have been fully complted. But when I lauch "whisper [path/to/audio/file.mp3]" it says "zsh: permission denied: whisper". I have tried to run my terminal with root permission, but even then it says "-sh: whisper: command not found". Can anyone help please?
172.17.0.1 - - [14/Feb/2024 20:22:08] "POST /asr?task=translate&language=da&output=srt&encode=false HTTP/1.1" 404 -
Why ?
Bruh, there any paperwork on putting whisper into a executable pythons great and all but isn't liable to be easily distributed to less advanced computer users.
What do you mean
Can you give us more context?
You're wanting to run whisper as a commandline tool or gui?
I think there's a self hosted version but it has requirements
Check out github
I found the issue and already fixed it. When using pyinstaller, I had it installed via pip.
You have to install via conda for numpy to work.
Oh you literally meant creating an executable with Python
Yeahhhh
Anyone got any advice for bulk transcribing audio files? I was thinking renting a GPU server for a couple of hours and bulk processing with a self hosted model would be most cost effective. Also wondering if CPU models are catching up yet as these would be even cheaper.
I use the tiny model to transcribe live audio you can easily just use whisper locally and letting it run on a older laptop
Does this work on android? Disclaimer, voice is sent to OpenAI and temp stored until whisper transcribes http://switchmeme.com/lel2
That's great, how effective / accurate have you found this?
So so, I would use a larger model for it
Yeah, hence my problem I have to use the larger models for accuracy. Did you try tiny.en at all?
i did. i'm just finishing my .exe file to allow for someone to drag and drop audio into a program and getting a text document back.
a typical pc can run the small.en model
PC without latest and greatest GPU?
So I would think I got a 100 dolla windows notebook here and I could run it. it was short audio
but it ran.
plus all the whisper running i've done has been cpu
Ok, thanks. I suppose I need to try it 😀
anyone else using the javascript MediaRecorder API to record and send to whisper? on iOS the mediaRecorder.onstop gets triggered too soon
Is anyone having trouble recording audio from the browser on an iphone and sending the audio file to whisper api? It is only transcribing the first second of the recording and other times it will return random characters
Is the functionality to identify whether a user stops talking already provided in Whisper, like it is already in the ChatGPT app?
I'm new here👀 what is whisper?
Hi, is whisper capable of listening to the audio of a lecture recording and generating a clearer voice/new voice that can be dubbed over the video? My lecturer is very hard to understand due to his acent. It would be great if whisper can help!
I'm not a developer, but the code used by the program I use in Windows to use Whisper for dictation in windows look is very straightforward. I imagine you would need to implement something to stop each recording before the 25 MB limit is reached and automatically start a new recording.
ChatGPT could almost certainly ask you how to adapt https://github.com/braden-w/whispering/blob/main/apps/web-desktop-app/src/lib/transcribeAudioWithWhisperApi.ts for your needs.
The last time I looked in to natural sounding text to voice, they all had per unit pricing to use their APIs - but I have no idea how much it would cost etc
Hey,
Any idea how to get "whisper" to transcribe exactly what the text says?
For example, when the voice contains even a mild "insult" (You can shut up), whisper generates completely random text.
For a voice assistant, this is quite problematic. Any ideas? 🙂
While trying to create the processor, memory goes insanely up, even with a very short audio, like 3 sec
It seems to be a glitch, but I can't find it. If anyone has seen this in the past let me know 😅
I'll try to create a small repo reproducing it
Got it! I was using stereo audio. It needs to be mono 🤦
thanks for documenting your findings
<:book_icon:1171408210398289941> `` Rule 1 `` Be respectful.
Treat others the way you would like to be treated, and assume best intentions. Don’t harass or attack others, and don’t engage in hateful or generally malicious behavior (e.g. sexism, racism, homophobia, etc.). Keep the negativity to a minimum.
Why is whisper inference time not constant if the audio is always padded to 30s? For example, 30 seconds of input audio takes roughly 5x the inference time of 5 seconds of input audio.
Do you sir have a time machine? Because that would be an awesome way to make money through crypto!
I'm absolutely loving using Whisper through the API for dictation in Windows. However, I'm worried that using it for anything longer, like, is going to get expensive pretty quickly. Does anybody know if there are any other options just for Whisper in terms of cost? Because I kind of just was assuming that just using Whisper just for my own individual use was never really going to add up. But I think I might be wrong about that given I used nearly $1 on a particular day, and I still haven't' used it for any longer form writing yet. It might almost be worth it to use something like Tasker or an equivalent in Windows as some sort of hacky workaround to get it under the monthly subscription using ChatGPT?
The other possibility is that on that particular day I was getting used to using the Windows program and I might have been leaving a lot of dead spaces in between words and sentences? It's supposed to be about two cents a minute, correct?
There are locally run versions of whisper you could look into.
I can not vouch for the quality/speed or anything but have seen them on my "travels"
is this the proper place to ask about faster-whisper and implementations of it?
Yup 🙂
How to use whisper?
Heyy, you can either use whisper trought the openai api or directly on your machine since the model is open source: https://github.com/openai/whisper . Let me know if you have any other quesions.
Hello there, I have been having this issue lately, where the model, I am using base, would completely hallucinate stuff. My transcripts are quite long, and I had this issue where at some point, someone said "I'm hungry." and then the model started just repeating "I'm hungry." all over until the end of the transcript. If aynone could help it would be much appreciated.
is there an issue with the model requests? i keep getting request timeouts from my python code
im using my company's openai acc with its own subscription and key and everything and it worked fine in the past.
i just keep on getting erorr codes such as 503 or "request timeout" randomly
Has anybody else found that personal use of Whisper for dictation is extremely cheap? I was a bit worried about using it so much in Windows, given how easy it is to just, one hot key to start talking and one hot key to stop. And now that there's an Android keyboard, same issue, but I've been using it a lot. And I'm only at 93 cents for the month so far.
anyone unable to get whisper to generate any files after transcribing?
no .srt .txt or anything
Hello guys.
I am currently trying to get whisper up and running on my machine.
Cuda is avalable and the standard device the model chooses to run on.
I checked with whisper --help
While using the CPU (AMD Ryzen 9 7900X3D) works like a charm, using the GPU (AMD Radeon RX 7900 XTX) doesnt work at all.
The following code:
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(f' The text in video: \n {result["text"]}')
raises this error:
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Whisper:
While copying the parameter named "encoder.blocks.0.attn.query.weight", whose dimensions in the model are torch.Size([384, 384]) and whose dimensions in the checkpoint are torch.Size([384, 384]), an exception occurred : ('HIP error: invalid device function\nHIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing HIP_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_HIP_DSA` to enable device-side assertions.\n',).
I dont know how to fix this.
Are you on linux ?
That's a really specific error i'm sorry
1- what is HIP
2- why does it say kernel
3- i dont think its a hardware issue
Yes I am on Ubuntu 22.04.
HIP is ROCms C++ Dialect. Cool. https://pytorch.org/docs/stable/notes/hip.html
Apart from that I dont have any clue
If my phone can run the 'base' 74M model locally, sure my trusty GTX 750ti will make light work of the larger models? Right guys?
hey guys, does anyone know how to get the Whisper python API/module to add punctuation consistently to its .words output in verbose_json mode when transcribing? it's a short audio recording & i put in the exact script (including punctuation) for the audio as the prompt. transcription.text has punctuation as expected but transcription.words has none unless it was two words inter-connected by punctuation with no space in the script (ie, it's..very works but it's.. very with the space doesn't)
(in case it's unclear, I'm talking about using the OpenAI python client & API, not running Whisper fully locally)
have you tried passing in the suggested arguments? HIP_LAUNCH_BLOCKING=1 or TORCH_USE_HIP_DSA
I did, yes.
i haven't seen this before but what was the output when you did that?
Tbh. I just sold my XTX and bought a 4090. AMD for AI ist just stupid and ROCm a piece of garbage
I tried SO many things. Updating the kernel to 6.6 on 22.04 Ubuntu, passing every Local variable like in the fixes from other people.
I uninstalled and reinstalled a ROCm-compatible PyTorch and reinstalled ROCm drivers for my respective ubuntu and kernel version
Also adjusted the HIP debugging level but that didn’t tell me anything.
But thanks for the effort to reach out to me. I hope the new GPU fixes the problem. Just throwing money at it lol
guys, I really dont know which programming language to learn next, I am abu clueless about which one to learn, at this rate ai will become good at many of the languages, which is going to be very important in the future. And please no gpt type answsers
Well, learning Python is a good all-around language to start with IMO. JavaScript can also be good, specifically Typescript. My suggestion is to find a specific platform or dataset to work with as a project, then have GPT-4 teach you
I've been learning Python using it to create simple scripts to analyze Reddit data using PRAW, a Python Reddit API wrapper that has taught me what the heck an API even is
But it still feels bland, u feeling me? I want to learn something which is heavily used in the future cause at the rate of ai progress, by the time I master python I will become absolute
Hey everyone! 👋
I'm diving into the integration of the Whisper API for a project that necessitates handling highly sensitive audio files, meant to be installed and operational within a highly secure local network. This setup is critical as it mandates that all data remain strictly offline for security reasons. I have a couple of key inquiries on this front and would greatly value any guidance or shared experiences:
Token Pre-Purchase: Is there an option to pre-purchase tokens or credits for Whisper API usage? We're looking for payment models that provide the flexibility to budget and plan ahead without immediate consumption.
Data Privacy and Usage Logs: Privacy is a top priority for us. In the context of Whisper's operation, can anyone shed light on the specifics captured within the usage logs? Are there any references to the processed data itself, or do the logs strictly record metrics such as transcribed minutes and the models utilized? Any insights into the data management and logging practices would be invaluable.
Navigating the secure and efficient use of Whisper API, particularly for projects with a high degree of sensitivity, is my current challenge. If you've had experience or have knowledge regarding deploying OpenAI technologies in similar secure, offline environments, I'd be incredibly thankful for your insights.
What exactly is your goal to master Python? Money? Vanity? I think you might be missing the point on what programming is about. The world's best programmers are constantly learning new things every year, which means there is no "best programmer". It's like asking who the best athlete is
Maybe it's Michael Phelps, but he probably isn't very good at basketball is he?
This isn't Pokemon my friend
My goal is to create a lot of cool and helpful things using programming, I need a language which pretty much has everything to do everything….
Most programs that people actually use have more than one programming language. For example, the code structure is made from Python but the user interface is made from HTML and Javascript
<:book_icon:1171408210398289941> `` Rule 1 `` Be respectful.
Practice kindness and positive regard. Harassment, hate speech (such as sexism, racism, or homophobia), or other malicious conduct will not be tolerated. Maintain a respectful and positive environment.
1- Python tools and 2- html + js tools 
i have like 20 tools here all made by chatgpt
then I have no clue what to use to build what!
Then just save it as "clock.html" and run it 
Does anyone know how to split audio file into chunks to send to whisper api using nodejs running in Vercel Serverless Functions?
I need clues guys, CLUES
Focus on building complex systems, and the languages used for those systems. Like C#. The language isn't as important as your ability to solve complex problems.
Enterprise
I’m finetuning Whisper with my own English data and need clarity on standardizing transcripts, especially regarding punctuation and capitalization. I reviewed appendix C (pg 21) of the Whisper paper but I couldn't find details on capitalization and punctuation for English transcription. Any idea how text was standardized when training Whisper?
But the thing is, I cant spend too much time learning new categories. Something I can pick up fast and master it
Are you a professional programmer or are you only just getting into programming?
I am not so professional programmer, but I have done Java c plus and web dev with smart kit
You may have to approach OpenAI directly. Chinese language support could be not offered currently.
What is this?
I was trying to recognize a piece of audio and I got errors when I choose Cantonese
I alread said it cannot find the specified language... Cantonese in this case...
it's not finding it
I got it bro, just replying
So I got ROCm working with whisper on the integrated gpu on a Ryzen 5700g (lol), but it's basically the same speed as CPU. Is there any way to use this GPU to actually speed it up or was this just pointless?
I suspect it was pointless
Yea, this sucks. Keeps crashing the igpu requiring a reboot to clear.
@fathom ridge Try unplugging monitor from integrated port
I'm using it. It runs a dashboard for my home assistant.
I can try it later but I want that more, heh.
Yes I'm sure you are using it. But it is hogging resources
If it can't spare the resources to run a tab in firefox and do this then that configuration isn't suited to my needs.
Do you have only one port to plug a monitor into
Ryzen 5700g is equivalent to a RX 550 or GTX 560 -- light gaming
yea, it's a small mitx pc with no dedicated gpu
I have things that run whisper fine, I just wanted to see if I could get this PC to be a bit faster. I wasn't expecting much, honestly I was surprised the integrated gpu on it even supported ROCm
Some of the prominent capabilities of the processor include SSE 4.2 + AVX + AVX2 + AES + VAES + AMD SVM + FMA + RdRand + FSGSBASE + BMI2
AMD is pretty good man
I run full amd as well
I'm not getting bad performance, I was just trying to get the best possible.
What would you suggest for a desktop running Ryzen 7800X3D / Radeon 7800XT GPU / 32 GB Ram that isn't being used for an ai project? @fathom ridge
idk what the 7800xt is good for in relation to AI, I have nvidia.
So I'd suggest finding a good game
This 5700g is the only time I've played with ROCm
@fathom ridge yeah you're going to need a beefier GPU, APUs are not exactly a well supported scenario to begin with
And AMD is notorious for not properly documenting which GPU supports what ROCm features.
if you have around 400 $/€ burning a hole in your pocket, you could get an RX 6800.
But even with the new windows support its still a far cry from CUDA on nvidia.
I want to build a custom cheap but "can run llm pc" any ideas please?
Is there any kind of "next-gen" x99 Motherboard ?
hahaha
i know a lot about computer hardware but not when it pertains to this. I know you benefit from a threadripper-type processor!
i imagine fast hard drives also, like read/write speed. i cant answer for HD size.
for a LLM machine only, i think this is the kind of infrastructure you want
yeah i wanted something like that, but ddr5. thanks alot for the help
If I'm buying a dedicated card for it I'm going nvidia and getting something that'll run mixtral too, heh. Right now I'm just running both ollama and whisper on my gaming PC and leaving it on.
Wishlist: VLC plugin that adds subtitles to any video
<:book_icon:1171408210398289941> `` Rule 7 `` No self-promotion, soliciting, or advertising.
Do not post or direct message any members of this server to promote non-OpenAI services, products, or projects.
Exemption: API & Custom GPTs channels
It's complicated to do that.
You would need to either stream the audio to Whisper or upload the whole thing. And such proposal can cost a lot of tokens
Could just run it locally
is there a tutorial to learn how to use whisper? Is it difficult? I want to record a 20 minutes speech of myself
Hello Ross, we are also encountering this problem. Is there any more suitable solution at present?
你好Ross 我们目前也遇到了这个问题,目前有什么比较合适的解决方式吗?
None that I'm aware of, @candid siren .
The best solution I've found is to filter out those responses with GPT. Even this is difficult, given the variations in output. You might consider fine-tuning a model to recognise these "YouTube-isms".

ECONNRESET is killing me
What happens when you transcribe reversed speech, lol
(Made using ElevenLabs, which utilizes Whisper)
Has anyone figured out a way how to use whisper for real time transcription without transcribe into nonsense?
Hello!
I hope my question is relevant here. I am attempting to convert alignment results from WhisperX into a TextGrid file for the purpose of analyzing afterwards on Praat. I initially used Parselmouth to directly convert the WhisperX alignment, and I also tried to write the results to CSV in order to convert them into a TextGrid file, but it doesn't seem to work. Has anyone else done a similar thing before, or does anyone have suggestions on how to do this? Thank you!
Hi, I'd like to know if there's a built-in method to extract the confidence score for each word in the output. Additionally, is there a feature or an add-on available that allows these confidence scores to be color-coded directly in the results?
When will you support v3 in your API ? It's been almost one year
Can anyone help me use whisper here?
I want to make some code in python that utlizizes whisper to create a SRT file or a VTT that has timestamps but only single words per each I don't want sentences or paragraphs like these:
1
00:00:00,000 --> 00:00:05,000
Open AI has recently decided to open source.
2
00:00:05,000 --> 00:00:09,000
Their translation and transcription AI whisper.
3
00:00:09,000 --> 00:00:18,000
So now it is under an MIT license and that includes both the code that's here as well as the model weights that were used to train the AI.
4
00:00:18,000 --> 00:00:26,000
So if you want it to go and try and make your own speech transcription AI with that data, you are free to do so. ```
I want single words for each... Is there a way I could do this with whisper?
Hello, I am using the API for text voice, and I am struggling to get the model to pause long enough between paragraphs. I have tried inserting 3 dots (...); an hyphen, but nothing works. Any ideas?
Hello guys, after not using Whisper for 2 months, I got some issue.
The transcription is running well but at the end I don't have any text generated...
The key is (I guess): FileNotFoundError: [Errno 2] No such file or directory: '.\test.txt'
When I google it people have problem that after "No such file or directory:" they have ffmpeg, but I think my works well (I do reinstall), and there is nothing writte with ffmpego but '.\test.txt'.
i am using whisper just fine. The problem occurs if your audio file has a too long name with spaces in it.
Damn... I did reinstall (It had helped before) and I have the same issue...
Do you know any command for whole uninstall: whisper, cuda, pip, ffmpeg etc. (connected with whisper)?
Depends on your Linux. In easily installed it with apt
I am using windows 10
