#gpt-realtime
well seems like its working now
but when i type "whisper", something should appear
showing that it's installed..
huh..
actually the pc is still thinking, quite slow god
have you tried running it in Colab?
you can run the large model and not have to worry about your GPU crying
never used this lol
i am quite average pc user
same, I learned a ton of python and node.js thanks to tutor GPT
explains things so well
you can just do this on Google Colab
since its here, its installed?
try running it
now you just need a path for the transcription
transcription folders
it's simply not running through cmd, i am trying to figure out how to use this Colab now, one step at a time
copy and paste these lines
here?
yep
ok
hey
!whisper "name.mp3" --language pt --task transcribe --model medium
is this the code to transcribe an audio in portuguese?
first time doing this
idk if the format is right
Is there anyone from OpenAI here who can tell me why does Whisper transcribe Bengali as Hindi? Does Whisper not support Bengali?
API discussions, We Are OpenAI, my guy.
developers piling as a community playing and improving the tech
did you check on !whisper -h?
OH!!! i know
do you have access to chatgpt still?
or is it buggy for you too?
yes i have
subscription
let chatgpt read the whole command list
and ask it for what you need
@autumn bolt I was trying the Whisper API from the OpenAI playground. I want ASR with multilingual support.
It works fine for me with the large or large-v2 model for Hungarian. [ASR]
you can't use Whisper AI from OpenAI playground
you need a python notebook
why not run the large one?
you've got access to a cloud gpu that has 16Gb of RAM
and 50Gb of VRAM
i am just testing if works one
smaller score is better @trim rampart see the chart.. but around 99 languages supported in the ASR too..
you can get more info in the github repo
What, no it is not
have you not compared the quality?
that is the official chart from OpenAI
@autumn bolt Bengali/Bangla is not in the list. It is one of the most spoken languages.
you haven't tried it yourself then?
_>
does whisper transcribe audio from video files? or does it have to be audio files?
go try it and compare the transcription quality
XD
Yes, I use the ASR transcribe module. [locally, not API]
I don't know, I'm not from OpenAI team.
doesn't do anything in 7 minutes, geez
I use it via Clojure interop (through JNA) with Python.
Try with a 5 sec audio.. I think your VM has too little RAM, so it's slow.
Normally, if you want to run ASR at normal speed, you need at least 20 GB of RAM free. x1 == large model
this is needed
@autumn bolt still running?
now i am running again, but its working finally i learned this lol
congrats
i am using this to transcribe: !whisper "exemplo.mp3" --language pt --task transcribe --model medium
but how can i transcribe just a specific part of the audio? like from 3:00 to 10:00
Hi, I keep getting 400 errors when trying to use whisper API for translation with language ja - what am I doing wrong?
Hi all 👋
I'm wondering if anyone had experienced an issue where whisper seems to try to translate instead of transcribing?
I just said in English "how is that possible"
and it transcribed it as "Kako je to možno?"
which according to google translate is Croatian
I can reproduce it with the audio
if I trim the silence at the start it transcribes properly
You can filter by the timestamps or cut the audio file. But for this you need to know Python programming.
That is interesting, since only [.. to-English speech translation] is possible. If I understand correctly.. "We only support translation into English at this time."
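Filtering by timestamps could look roughly like this. It's only a sketch, assuming the local `whisper` Python package, where the result of `transcribe()` has a `segments` list of dicts with `start`, `end` (in seconds) and `text`. The helper name `segments_between` and the sample data are made up for illustration.

```python
def segments_between(segments, start_s, end_s):
    """Keep only segments that overlap the window [start_s, end_s] (seconds)."""
    return [s for s in segments if s["end"] > start_s and s["start"] < end_s]


# Hypothetical sample in the shape of result["segments"] from whisper's transcribe()
sample = [
    {"start": 170.0, "end": 178.5, "text": " before the window"},
    {"start": 181.0, "end": 190.0, "text": " inside the window"},
    {"start": 599.0, "end": 603.0, "text": " overlaps the end"},
    {"start": 610.0, "end": 615.0, "text": " after the window"},
]

# 3:00 to 10:00 expressed in seconds
picked = segments_between(sample, 3 * 60, 10 * 60)
text = "".join(s["text"] for s in picked).strip()
print(text)
```

This still transcribes the whole file first. Cutting the audio beforehand (for example with ffmpeg) avoids spending time on the parts you don't need.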
But Im not from the OpenAI team...
exactly, I am hustling now with translation from English to other languages and its supposed to be not possible but hey - its done here...
Transcribe ASR and translate with OpenAI API 😉
So is it possible to have like an audio in Russian and have the transcript in like Spanish instead of English?
Or even have like an audio in Spanish and have the transcript/result as Spanish too, for like deaf people
Check out Gladia if this is your usecase
Has anyone managed to get microphone recording in Safari or Firefox to work through the Whisper API?
I have it working well on chrome but never the other two
Does Whisper work both ways? TTS and STT
no
Hello Everyone, does anybody know if whisper is working on diarization ?
Maybe you would like to create/use something like this? huggingface.co/spaces/vumichien/whisper-speaker-diarization
That can transcribe and do diarization.
hey guys
I want to work with open Ai to stress test it
our team has found many bugs and wants to help
Bug-reports channel
has anyone tried applying for a transcription gig with whisper here?
I'm trying to install Whisper with pip3 install git+https://github.com/openai/whisper.git on MacOS and I'm getting "error: metadata-generation-failed" anyone know how I can fix this? Here's the full output when trying to install https://pastebin.com/qGGvgKaE
Hi everyone! I am trying to use Whisper via API to transcribe an audio recorded in the browser.
The way to the server (which is a NodeJS express backend) works. I have used both OpenAI's npm lib as well as talking directly to the OpenAI endpoint https://api.openai.com/v1/audio/transcriptions.
I always get a 400, bad request back and I can't work out why. I am sending the file as webm and via multipart/form-data
The function is in the screenshot
Any ideas? Many thanks in advance for a hint!
The file is in the Blob format /* let blob = new Blob([Buffer.from(audioFile, 'base64')]); */
we're continuing the discussion here https://discord.com/channels/974519864045756446/1088568452396093562
Does Whisper have a recording length restriction? Im having issues where its not translating the entire file im sending it
And its under 25mb
I know that Whisper can produce either a transcription into input language, or translation into English. Is there a way to produce both in one go? That is, feed audio file to Whisper and get 2 transcripts as output, one in the original language, and one in English?
Not on the same call
You would have to do two separate calls, one to translation and one to transcription
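A rough sketch of the "two calls" idea, assuming the hosted API's `/v1/audio/transcriptions` and `/v1/audio/translations` endpoints. The actual upload is left to a caller-supplied `post` function, and `both_transcripts` and `fake_post` are hypothetical names used only to show the shape.

```python
API_BASE = "https://api.openai.com/v1/audio"


def both_transcripts(audio_path, post):
    """Run one transcription call and one translation call for the same file.

    `post(url, audio_path)` is supplied by the caller and should upload the
    file to `url` and return the resulting text.
    """
    return {
        "original": post(f"{API_BASE}/transcriptions", audio_path),
        "english": post(f"{API_BASE}/translations", audio_path),
    }


def fake_post(url, audio_path):
    # Stand-in for a real HTTP upload so the sketch runs offline.
    return f"{url.rsplit('/', 1)[-1]}:{audio_path}"


result = both_transcripts("talk.mp3", fake_post)
print(result)
```

The two requests are independent, so they could also be sent in parallel if latency matters.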
Hi there, I'm thinking about implementing Whisper with MP4s or MP3s. Can anyone tell me how long it takes in seconds for a result? For let's say 30 words.
but why? this doesn't seem optimal. As far as I understand the difference starts at the tokenizer creation stage
This heavily depends on the hardware you're running the process on
Sorry, I mean for the Whisper API requests.
It'd be very fast. I'm sending recordings of about that length and it's maybe a second?
2 sometimes
how can i get to the whisper website
Thanks for the info 🙂 Getting timeout on transcription too
What is #gpt-realtime?
Audio to text converter / translator
is there a .tflite or .mlmodel large-v2 model available?
What is the status
You could try out. Large-v2 works, I tested.
what do you eat today
Does anyone know where I can read docs on silero vad? Trying to fix Whisper hallucinations
Whisper AI api is regularly returning Welsh text when I give it english voice input. Why is it also doing a translation? (If I hand the text to ChatGPT and ask it to identify the language and provide an English translation it is identifying it as welsh and giving the essence of what I said)
I don’t have a welsh accent!
Are you setting the language as English?
But you can read in the Whisper paper that a lot of their 'Welsh' data was actually just English but was labelled as Welsh, I'm guessing that's the source of the issue
Possibly not. It might be set to auto. I'm using an ios shortcut I found on the internet - I've not found the API docs myself to cross reference. It's strange it's auto detecting english and then translating to welsh though. Most of the time it returns english.
The api request appears to be posting to /v1/audio/transcriptions, with a request body containing model: whisper-1 and the audio file. No other optional parameters.
Just a random guess it has something to do with this.
Looks likely. Do you know what parameter I need to set to inform whisper that the input is English? It's becoming annoying to dictate minutes of off the cuff thinking, and have it come back in Welsh! (With no recording to fall back on) 😉
I looked online for the API docs, but the web site just refers me to discord - hence my question.
No clue I don't use the API, I run freesubtitles.ai , that may be a better solution though I don't know which iOS thing you're using.
(Hope I'm not breaking TOS by posting that feel free to tell me if I am mods)
ios has a built in WYSIWYG shortcuts system for building extensions. My shortcut is triggered from SIRI when I say 'convert with whisper', captures some audio, calls the whisper API, and then shows the text and copies it into the paste buffer. Very useful.
Nice. If you can figure out how to pass the language that will probably fix it. Though I have had a lot of people use auto-detect language with my service and I can't remember seeing it incorrectly identified as Welsh
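For what it's worth, the transcription endpoint accepts an optional `language` form field (an ISO-639-1 code such as `en`) next to `model` and the file. A small sketch of the extra field, with a made-up helper name `transcription_fields`; how the iOS shortcut exposes its form fields is not something this shows.

```python
def transcription_fields(language=None, response_format=None, model="whisper-1"):
    """Build the non-file form fields for a POST to /v1/audio/transcriptions."""
    fields = {"model": model}
    if language:
        fields["language"] = language  # e.g. "en" to skip auto-detection
    if response_format:
        fields["response_format"] = response_format
    return fields


fields = transcription_fields(language="en")
print(fields)
```

In a shortcut or curl call this would mean adding one more form field named `language` with the value `en` to the existing request body.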
freesubtitles.ai is my site
working pretty well over 1 million minutes transcribed already
not sure about that. I think there are Telegram bots perhaps
Hey i need to get into contact with a discord dev
who do I talk to?
@hushed pier
mod mail dont work like that huh
Hello and good day. I report this error due to its persistence, I have deleted the cookies, I have reloaded the page among other things and this error continues, could you help me? Thank you
Server's flooded
Don't bother with chat.openai
Just use the API keys.
It's better in everyway
How can I use the API keys?
Post your problem or something, they’re busy people and if your question is worth it they’ll probably respond.
Who has access to GPT plugins yet? I have a really cool project that I want to collaborate on, my company is called Whop (whop.com) shoot me a dm!
right so the reason I need to talk to a discord dev is bc there is a glitch built into discord that allows certain people to use the dev portal
the people who are using it
aren't good people btw
what does that have to do with openai @rough badge ?
right
so I wont go into detail
but people have been able to glitch open Ai in ways thought to be un do-able think the worst possible out come
and that worst possible out come is where the glitching community started
Hi everyone, new here. Don't know if I am writing in the wrong section. But where can I search for developers with experience tinkering with Whisper and Gradient API?
I am interested in a joint venture, I have 2 companies eager to pay for gpt3.5 or gpt4 implementation in their customer service.
So I am talking to gpt4?
Does whisper support transcription of non words like coughing, laughing, and so on, for example with a prompt? Usually for coughing I get "thank you" which is polite but incorrect 😅
I can at least detect the non speech probability and most of the time it's good but still curious about it
Hey all--I'm a node.js developer, and am wondering if Whisper supports word-level time stamping out of the box? Or do I need additional libraries?
It does with the latest version if you pass the param
Ah fantastic! Can you direct me to where the most up-to-date Node.js API is for Whisper? I'm having trouble finding it
Right now, in the openai.createTranscription() method, I don't see a param for word-level timing or anything like that.
I don't think it's offered with the API but you can use the open source Whisper module
Ah, I see. Sorry to be repetitive, but which is that? I see so many on npm and github! lol
Thank you! It appears to only be Python-compatible at the moment, is that correct? I'm a JS-only guy, currently. But I have some Python devs who can implement this, if need be.
You can call it with spawn from node, like I do here: github[dot]com/mayeaux/generate-subtitles
Oh I was not aware of this! Thanks so much for your help!
They actually added whisper support to the openai node package
not sure when, but I only noticed it last week
Oh weird, i don't see it in their code example - I swear I saw it. It still seems to work for me! I'm calling it like this:
const response = await openai.createTranscription(
  fs.createReadStream(audioPath),
  "whisper-1"
);
return response.data;
Where openai is an openai client with key added through configuration (they have it in their other api examples)
Thanks!! I did see (and have used) this part of the API... but I don't see anything in there for adding the word-level time-stamping, unfortunately.
The transcription part works perfectly, though.
It seems the API only gives back the text transcription and not the .srt , vtt or .json files that Whisper as a standalone does
Darn. That's frustrating...
@lethal thorn I might add the word level timestamps functionality to freesubtitles.ai , it also has API access
Are you sure? Allegedly it can accept a parameter for that
well it definitely returns json, but the only thing in it is the transcription. Weird, tbh.
Ah cool page! I'll keep an eye out for that 🙂
Ah thanks I missed that
sure thing - I think they weirdly have documentation in two places and only one mentions the node package and some of the features
I don't think whisper is top of OpenAi's priorities though
Yeah, the open source one appears to only be in python? Which makes this more confusing lol
But agreed, it doesn't seem to be top of their list. All I want is word-level time stamping 😄 and to be able to do that with node lol
I am sure their hands are full with ChatGPT atm
Most of the AI stuff runs on Python tools so the only way to really run it is to plug into the CLI tools with Node
Though it seems they will roll out word_timestamps functionality for the API at some point: https://github.com/openai/whisper/pull/869#issuecomment-1459431437
does using the response_format parameter not work on the api? srt or vtt should have timestamps I thought? (though i thought it was at ~utterance level, rather than word level)
(that utterance comment is coming from using the local model (python), not the api so I don't really know)
the srt and vtt outputs do have timestamps, but they are not at the per word level. they are phrase timing.
Interesting, what does Whisper consider a "phrase"?
Just a group of words without a significant pause?
sad 😄 I guess I'll have to put in a little more legwork than I'd hoped haha oh well.
Yes. Those formats are generally used in conjunction with closed captions on video. So the timing is what is readable on a screen during that time. Example SRT: 00:00:00,000 --> 00:00:08,000
Appears the chunks are in about 8 second blocks
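To make the phrase-level timing concrete, here is a small sketch that builds SRT text from segment timings. It assumes segments shaped like the local `whisper` output (`start`, `end` in seconds, plus `text`); the function names and sample phrases are invented.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments):
    """Number each segment and join them as SRT blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{i}\n{timing}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


srt = to_srt([
    {"start": 0.0, "end": 8.0, "text": " First phrase."},
    {"start": 8.0, "end": 15.5, "text": " Second phrase."},
])
print(srt)
```

Each block carries one start and one end time for the whole phrase, which is why these formats can't give per-word timing on their own.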
Very interesting, thank you! Hopefully the word-level stamping comes soon. So many applications for it!
What do you need word level timestamps for specifically? Just curious
No problem! I work for an educational publishing company, and we currently pay a vendor a lot of money to manually tag the timestamps for each word in our eBooks so that they can highlight as they are read
Figured I could save them a lot of money (and time!) every year relatively easily with the time stamping
Oh wow that's like a textbook 'you should start a startup and sell to your current company' situation it sounds like
Is it better to use offline whisper models or the API? Which would grant me more accurate results (and preferably speed)
Let’s say I want to transcribe an 8 hour audio file. Would it be faster and more accurate using the API or an offline large model (RTX 3080)
Hahaha 😂😂 honestly not a bad idea!! All I need is the word level timestamping and i can add a few 0s to my bank account
With the API, 1 hour audio is about 5-10 mins.
You can load Python into Java/Clojure (all JVM based.. like Scala..) via JNA. I saw a C++ Whisper solution too, like Jojo. I don't think that is a core problem for ASR. With the API.. it's very easy to write a lib for this. Very clear.. not like Google things sometimes 😉
Do you remember the last conversation we had?
How do you know this
Oh I see
I've transcribed over a million minutes of content with the offline version but I am going to play around with the API today to see how well it works and then I'll report back
Thanks friend
how much does the large v2 take in vram?
I've faced the word-loop issue for Japanese transcription, and found the same issue in the community.
https://community.openai.com/t/whisper-has-looped-in-a-phase/107642
Does anyone know the possible workaround for this?
I believe around 10 GB or so
Only real solution right now is to use a VAD to strip out non-speech audio, and then transcribe using word level timestamps and rebuild the subtitles. Or you can run your own Whisper implementation and run some custom code which seems to fix it but isn't merged yet.
Seems there's more hallucinations in the API than running your own instance
@knotty trail Hallucinations like it thinking its saying something when theres silence for example?
How is the speed
Yeah exactly. It's quite fast though about 1m processing to 10m transcribed. You can also cut up the audio file and transcribe multiple chunks at once so you could optimize to have it process at basically any speed (could do an hour in 1 minute if you chunk it enough)
1
00:00:00 --> 00:00:01
Jaj, premanja na Joj,
2
00:00:01 --> 00:00:03
sijanje na sijanja na joj.
3
00:00:03 --> 00:00:05
Use ove se.
4
00:00:05 --> 00:00:07
Edna, a drugi je.
5
00:00:07 --> 00:00:08
Use.
6
00:00:08 --> 00:00:09
Taj je.
7
00:00:09 --> 00:00:10
Pa, jaj.
8
00:00:10 --> 00:00:11
Jaj, jaj.
9
00:00:11 --> 00:00:12
Jaj, jaj, jaj, jaj.
10
00:00:12 --> 00:00:13
Jaj, jaj, jaj, jaj, jaj.
11
00:00:13 --> 00:00:14
Jaj, jaj, jaj, jaj, jaj.
But for example, this came out of thin air. This doesn't appear when I run it with my own instance, I don't know what settings they have but it's a bit strange.
Thanks! Then it's difficult for me... I hope this non-English transcription issue is found by OpenAI folks. Anyway, thanks a lot! 🙏
I think this will help a lot: https://github.com/openai/whisper/pull/1155
Not sure when it will get merged and implemented in the API though. You can also try cutting the file into smaller chunks that should reduce the chance of hallucination loops. How long into your content does it start to get stuck into a loop?
Shouldn't be too tough, just need to install whisper on a GPU server and then pass the CLI param for the word timestamps.
Thanks again! I've subscribed to the PR.
In my typical case, the loop occurs in 10 min audio Japanese transcription.
Following your suggestion, I'm going to try shortening the audio file prior to transcribing. The attached is the actual loop, just in case.
Thank you very much!
You can use VAD to detect speech in the file and then cut at moments of silence, that's what I just implemented today it works pretty well, then you can remake the SRT/VTT/TXT afterwards
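Remaking the subtitles after cutting at silence mostly comes down to shifting each chunk's timestamps by the position where that chunk started in the original file. A minimal sketch of that step only, assuming each chunk was transcribed separately into segments with `start`, `end` and `text`; the VAD and the cutting are not shown, and `merge_chunks` is a made-up name.

```python
def merge_chunks(chunks):
    """Merge per-chunk segments into one timeline.

    `chunks` is a list of (offset_seconds, segments) pairs, where the offset
    is where the chunk began in the original audio.
    """
    merged = []
    for offset, segments in sorted(chunks, key=lambda c: c[0]):
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged


# Chunks may come back out of order when transcribed in parallel.
chunks = [
    (60.0, [{"start": 0.5, "end": 2.0, "text": " second chunk"}]),
    (0.0, [{"start": 1.0, "end": 3.0, "text": " first chunk"}]),
]
merged = merge_chunks(chunks)
print(merged)
```

The merged list can then be written out as SRT, VTT or plain text in the usual way.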
Anyone fine-tuning Whisper? Would adding timestamps to the training data improve accuracy of the fine-tuned model?
Hi I want a chrome/edge extension that allows you to speech-to-text on any text input inside the browser using Whisper. I want the same functionality as Voice In but using Whisper. Is there anything like this?
Oh, I thought that they hadn't merged any word timestamp features yet? Would you be able to show me an example of this, or the specific documentation of it w/ examples?
Sorry for the hand-holding request lol the documentation seems to be all over the place
I also have an RTX 3080Ti, if that helps me do this myself 😄
will it be possible that whisper would have data when a certain sentence is being said in the future?
It already has that if you pass -F response_format="verbose_json" \
It's merged just not deployed to the API servers yet.
Just wanted to drop by and say that Whisper has been incredible for creating Text To Speech datasets from voice recordings.
That sounds interesting, how do you achieve that?
Well, what you need to train it is a bunch of ca 5 sec audios, which you can get from an audiobook or a podcast, etc. Then you have to provide transcripts in a file like:
filename.wav | This is the transcript of that file
one line per file
splitting sound into smaller chunks is a oneliner with pydub
then writing the file for the transcripts with whisper's API
is like 5 lines of python
and that's what you need to give something like Tacotron2 for training a voice
I've done it with my own voice for testing
it is uncanny
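The transcript file described above (one `filename.wav | transcript` line per clip) could be written with something like the sketch below. The clip names and transcripts are placeholders; in practice the text would come from Whisper, and the exact delimiter a given trainer expects may differ.

```python
import os
import tempfile


def write_metadata(pairs, path):
    """Write one 'filename | transcript' line per clip."""
    with open(path, "w", encoding="utf-8") as f:
        for filename, transcript in pairs:
            # Collapse whitespace so each transcript stays on a single line.
            f.write(f"{filename} | {' '.join(transcript.split())}\n")


# Placeholder data standing in for real clips and their transcripts.
pairs = [
    ("clip_0001.wav", " This is the transcript\nof that file "),
    ("clip_0002.wav", "Another clip."),
]
out_path = os.path.join(tempfile.mkdtemp(), "metadata.txt")
write_metadata(pairs, out_path)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```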
Do you speak Arabic?
Are you following a guide for this? Sounds neat
A combination of things, also asking questions to GPT-4. It proved very useful when helping me adapt code for NVIDIA GPUs to M1 Macs
I found the guide by FakeYou of big help too
but that's for the model training
The Fake You discord is pretty awesome
This is the FakeYou guide
ah! Can't post links
ok go to Fake You's discord
and you will find it
Thanks!
Ahhh okay!
Has anyone done a comparison of Whisper's accuracy (both for transcription and for word time stamping) vs Google Speech To Text?
I DM'd you btw it might be in Message Requests
Hi everyone, how do you make whisper return a file in .vtt format?
From the API?
Hi, do you know what is the size (MB) limit to import an audio file to convert to text?
25 MB
Hey everyone, sorry if this is a stupid question but I've been searching for an answer and haven't found anything. I'm running whisper locally and it's working, but all of the subtitles are uncapitalised? Is there a setting I need to change to enable capital letters?
EDIT: The marked answer here seems to help: https://github.com/openai/whisper/discussions/194
Hello. I am generating subtitles for my this video : https://www.youtube.com/watch?v=77iDUQd4x90 I have provided the video file directly to the wisher with language en and model large However as ca...
Whenever I use initial prompts, it's forcing all of the subtitles into 30 second chunks? Is there a way to stop this behaviour whilst still providing an initial prompt? Without an initial prompt it chunks them logically, but I then have my previous error of not having any capital letters
How to separate & label multiple voices from one audio file?
I actually saw that for the first time ever on one of my transcriptions today
Hello everyone,
What’s the difference between these two approaches used by whisper to transcribe speech?
ah
k
ok
Hello everyone, how to use the Whisper API to send binary audio data.
response_format
guys, is there a way to convert avg_logprob to a confidence that is 0-100%?
Been having slow response times for some hours
Bump
What do you mean by Bump?
I was making your message visible again as I am interested in the answer as well
Ahh right👍 . I used both approaches and had different results
Does anyone know if the model from the official API is the same as the open-sourced large-v2?
I found the result from the API is better than mine with self-hosted large-v2
Why does the Whisper API response for a Chinese video look like text before decoding?
Guys, is there someone working on how to obtain confidence???
@outer scarab @left stag
Per word?
It should be the same I'm guessing. It's possible they use different options (beam_size etc)
for segment is ok too
q
-F response_format="verbose_json" \
Pass this for the response_format, if it's not there then it's not available via the API
with verbose i will obtain conf?
I can't remember. With the standalone module you can get it for word_timestamps I'm not sure about for segments
ty i ll try
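On the avg_logprob question: one rough way to get a 0-100% number per segment is to exponentiate it, since it is an average log-probability over the segment's tokens. This is a sketch under that assumption, not a calibrated confidence, and the sample values are invented rather than real Whisper output.

```python
import math


def segment_confidence(avg_logprob):
    """Map an average token log-probability to a rough 0-100 score."""
    # exp() of an average log-prob is roughly a geometric mean of the token
    # probabilities; clamp at 0.0 so the score never exceeds 100.
    return 100.0 * math.exp(min(avg_logprob, 0.0))


# Invented values in the shape of verbose_json / local whisper segments
segments = [
    {"text": " Haskell", "avg_logprob": -0.05},
    {"text": " Licks", "avg_logprob": -1.2},
]
scores = [
    (s["text"].strip(), round(segment_confidence(s["avg_logprob"]), 1))
    for s in segments
]
print(scores)
```

It is per segment, not per word, so a single misheard word inside a long, otherwise clear segment may barely move the score.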
Hey guys, is there any user interface for using Whisper to transcribe speech, for people who don't know how to write and execute code?
Hi for the last several days, my GPT Plus account has been downgraded despite paying this months subscription twice.
The first payment was my usual monthly fee.
The second payment was an attempt to make a new subscription from the same account due to urgently needing use.
I am now out of pocket, with zero support response.
I cannot even attempt to make a new account in case the money is once again taken without providing me with what I paid for .
I have seen that numerous people have experienced the same issue.
Is there some sort of official update to this?
Ai
1
I'm not from the OpenAI team.. but you can see what you're looking for here: https://platform.openai.com/account/usage scroll down to language model usage.. every used token is listed there.. if I understand your problem correctly.
You can use www[dot]freesubtitles.ai which is my site that I made so people can use Whisper without coding
hey guys where I can find supported languages in v1/audio/transcriptions api endpoint?
thanks, the initial prompt is new for me! what is the limitation of this?
Hello, I have an issue running Whisper API in Python (Jupyter Notebook). I followed all the recommendations I found on GitHub related to the ffmpeg error. I uninstalled ffmpeg and installed ffmpeg-python, and now instead of saying that the module ffmpeg has no input the error says:
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_18348\3212043240.py in <module>
----> 1 import whisper

~\anaconda3\lib\site-packages\whisper\__init__.py in <module>
      9 from tqdm import tqdm
     10
---> 11 from .audio import load_audio, log_mel_spectrogram, pad_or_trim
     12 from .decoding import DecodingOptions, DecodingResult, decode, detect_language
     13 from .model import ModelDimensions, Whisper

~\anaconda3\lib\site-packages\whisper\audio.py in <module>
      3 from typing import Optional, Union
      4
----> 5 import ffmpeg
      6 import numpy as np
      7 import torch

ModuleNotFoundError: No module named 'ffmpeg'
Any advice?
The call before was pip install ffmpeg-python
response: Request satisfied, path: \anaconda3\Lib\site-packages\ffmpeg_python-0.2.0.dist-info
Hey Martini, did you solve it? I am running into the same issue...
Hello Team so i have a query related to Whisper
I am trying to integrate Whisper in my nodejs project. To do so i am using this code snippet to make an API call using openai npm package.
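const fs = require('fs');
const { Configuration, OpenAIApi } = require('openai');

// Client setup for the openai npm package (v3-style API)
const configuration = new Configuration({
  apiKey: process.env.CHAT_GPT_ACCESS_KEY,
});
const openai = new OpenAIApi(configuration);

// Arguments after the file: model, prompt, response format, temperature, language
const transcription = await openai.createTranscription(
  fs.createReadStream(audioFilePath),
  'whisper-1',
  undefined,
  'json',
  0.2,
  'en',
);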
when it comes to passing the audioFilePath i am passing a path where my stream is located, and reading it at the same time.
The code for the file/route includes:
- Reading file from user's request.
- Passing it to multer and creating a Buffer() from it.
- Passing that buffer file to fs.createWriteStream() and then streaming at a file location.
- Once the writing is done then reading the file content using fs.createReadStream().
- Finally passing that readStream to openai.createTranscription()
I have deployed my nodejs code on render.com. But whenever i try to hit the API route that has this code from my iPhone, then the request is failing with status-code: 400 (BAD REQUEST).
What i am not sure is what exactly am i passing wrongly in the openai API. Because the same API route returns data when i call it from my Desktop browser/ Android Browser.
const whisperLanguagesString = 'af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,hi,hr,ht,hu,hy,id,is,it,iw,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh';
From my own code
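The same list can be used to check a language code before sending a request. A Python sketch using exactly the string posted above; it is one user's list from their own code, not an official reference, and `is_supported` is a made-up helper.

```python
WHISPER_LANGUAGES = set(
    "af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,"
    "fo,fr,gl,gu,ha,haw,hi,hr,ht,hu,hy,id,is,it,iw,ja,jw,ka,kk,km,kn,ko,la,lb,"
    "ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,"
    "sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,"
    "yi,yo,zh".split(",")
)


def is_supported(code):
    """Return True if the language code appears in the list above."""
    return code.lower() in WHISPER_LANGUAGES


for code in ("bn", "pt", "ja", "xx"):
    print(code, is_supported(code))
```

Being in the list only means the code is accepted. Transcription quality varies a lot between languages.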
uhhhhhhhhhhhhh, im listening to my microphone, no sounds around, no system sounds, getting some data streams in return from whisper when sending anything, but those are not mine
oh, im getting it even unprompted
sorry if it sounds too obvious but have you tried "!pip install ffmpeg" (instead of -python)?
That’s what I did first. Yes. Did you have a look at the GitHub convo? I did everything - install ffmpeg , uninstall it and installed ffmpeg-Python - none of these options work in my case
I think it’s because Python takes the ffmpeg from its cache (which I deleted as well, yet it still does it)
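One thing that might be worth checking here: `pip install ffmpeg-python` is what provides the importable `ffmpeg` module, and Whisper also needs the actual ffmpeg executable on the PATH. If pip says the package is installed but the import still fails, the notebook kernel may be running a different Python than the one pip installed into. A small diagnostic sketch to run inside the notebook (the helper name `ffmpeg_report` is made up):

```python
import importlib.util
import shutil
import sys


def ffmpeg_report():
    """Collect facts about the Python environment this code is running in."""
    return {
        # The interpreter the kernel is actually using
        "python": sys.executable,
        # True if `import ffmpeg` would find a module in this environment
        "ffmpeg_module": importlib.util.find_spec("ffmpeg") is not None,
        # Path to the ffmpeg executable, or None if it is not on PATH
        "ffmpeg_binary": shutil.which("ffmpeg"),
    }


report = ffmpeg_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `python` points somewhere other than the environment pip reported, installing from inside the notebook with `%pip install ffmpeg-python` may target the right interpreter.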
Take that error and feed it into ChatGPT4 and it should give you different things to try
Hey guys, I’m doing a personal project. All the coding part is done for the moment but when I want to run the code and go and see how the website looks, it doesn’t work. The problem is about the API keys that openai gave me. I wanted to know if anyone knows how to determine if an API key is for chatgpt or DALL-E. if I solve that I will be able to put my 2 different API keys in my JavaScript code and run the code
Has anyone had any issues with Whisper assuming the wrong language? i have a user who is a native non-english speaker but IS speaking english. The model (through endpoint, Large with language specified as "en") keeps assuming they're speaking their native language. I'm guessing this might be an accent issue? Any ideas or experience?
I think the OpenAI API key is valid for all services. DALL-E is listed as one of several models in the documentation:
https://platform.openai.com/docs/models
I think it's just the accent. Had a similar experience with automatically generated subtitles on YouTube (posted an example here a few minutes ago, AutoMod muted me for 5min 🙄).
Dang, well thanks for that!
If I remember correctly.. you need to split it and send it to the API in chunks, max 5 mins. Maybe it's in the documentation..?
There is a 25 MB file size limit to audio files.
True, I checked now, no 5 min limit. 25 MB ~22 mins
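How many minutes fit into 25 MB depends on the bitrate of the file. A back-of-the-envelope sketch, assuming constant-bitrate audio, ignoring container overhead, and reading "25 MB" as 25 * 1024 * 1024 bytes (the exact definition the API uses is an assumption here):

```python
LIMIT_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" read as 25 MiB


def max_minutes(bitrate_kbps, limit_bytes=LIMIT_BYTES):
    """Approximate minutes of constant-bitrate audio that fit in the limit."""
    seconds = limit_bytes * 8 / (bitrate_kbps * 1000)
    return seconds / 60


estimates = {kbps: round(max_minutes(kbps), 1) for kbps in (64, 128, 160, 320)}
for kbps, minutes in estimates.items():
    print(f"{kbps} kbps: about {minutes} min")
```

Under these assumptions roughly 22 minutes corresponds to a bitrate around 160 kbps, and re-encoding at a lower bitrate fits proportionally more audio into one request.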
note: if openai's Audio module isn't recognized, run pip install --upgrade openai
apparently my package was older than Whisper
8okay87
pip install -U openai-whisper
that's not the api, thats running it locally
Has anyone else ever gotten someone else's transcription back from a request?
I submitted one that had the audio of "I'm sorry, could you repeat that" and got this (screenshot from my app)
No, but I've got some wildly inconsistent translations if the recording volume has been low. But not as far off as that!
hekllo
Is there a way to get back how confident the model is about pieces of the transcription?
For example: I say "Haskell Nix Python"
Output: "Haskell Licks Python"
I'd imagine licks is less confident?
Which transcription function of Whisper are you using?
whisper-1 @pearl star
Is there a music volume higher than the sounds in your source?
The whisper-1 model can detect it as a subtitle if the sound source has a musical sound that is louder than the voice of the artist. Sometimes it can return an empty result.
i had similar issue with python native whisper module. However, I did not encounter such a problem in the whisper-1 model used with the API.
Probably the microphone you are using is not good 🙂
Do these subtitles give me back the confidence level? I dont think I follow
Yeah but I'll need to handle that case for my application
We unfortunately cant guarantee our users will have good microphones
We wouldnt necessarily neeeed to though if we could get how confident each transcription of each word is
Can you share an audio recording containing the same query for me via private message. I would like to test it with my own whisper-1 model.
So I can make a guess as to where the error originates from.
Hallo
07.04.2023 - 00:54:45 : I am listening...
07.04.2023 - 00:54:47 : Speech-to-text translation is in progress...
07.04.2023 - 00:54:49 : Question: Haskell Nix Python
07.04.2023 - 00:54:55 : Haskell, Nix and Python are all programming languages that have different features and uses.
Haskell is a functional programming language that is known for its strong type system and its ability to handle complex mathematical computations. It is often used for building robust and reliable software systems, as well as for data analysis and machine learning applications.
Nix is a package manager and build system that allows developers to create reproducible and portable software environments. It is often used in large-scale deployments, where consistent and controlled environments are crucial.
Python is a high-level programming language that is easy to learn and use. It is widely used for web development, data science, and machine learning applications, as well as for scripting and automation tasks.
While Haskell and Nix can be considered more specialized languages, Python is popular for its versatility and ease of use. Each language has its own strengths and can be used in a variety of contexts, depending on the needs of the project.
07.04.2023 - 00:54:55 : Text-to-speech translation is in progress...
07.04.2023 - 00:56:20 : Completed!
07.04.2023 - 00:56:20 : ---
No problem appears.
Question: Haskell Nix Python -> whisper-1 model transcription
this source @kind kayak
and this destination 🙂
Im saying the speech to text part is not working perfectly
We have different voices
The error originates from my stuffy nose lmao
Haskell Licks Python
I came to the same conclusion with the audio recording you sent.
Which conclusion exactly?
Waiting for a short time between words can solve the problem. Don't talk in a row.
Haskell Licks Python
Serious question, are you a bot? Cuz theres definitely a misunderstanding here
never know these days 😄
Cuz yes I already knew these things, that the audio track is unclear
You can't get feedback, you have to come up with your own solution 🙂
When I spoke one by one in similar misunderstanding situations, I saw that there was no problem.
Are you a bot though? Genuinely wondering haha, it would make sense for an OpenAI discord
Haskell Linux Python @kind kayak this record
A closer result. It looks like there is a problem with your voice 🙂
Is it possible to get the same level of timestamped text using whisper-api, similar to if you run it locally? If its possible can someone tell me how, its driving me insane!?
This is unfortunately not possible. If you use the whisper-1 model with the whisper api, you will get a faster and more accurate response. the native model works incorrectly in sentences containing mixed language structures. Also, the processing time will vary depending on the size of the model you are using and the processing power of your computer.
For example, if you are using the local model with the English language and there is a German word in the sentence. The German word is evaluated in English.
What did you change/do?
I put a 1 second wait between the words haskell and nix in the audio recording.
发发
Is anyone facing an issue with the API? Many times the API is taking more than 30 seconds to respond or it times out...
You can sign up for email alerts for systems monitoring here: https://status.openai.com/
Welcome to OpenAI's home for real-time and historical data on system performance.
The verbose_json format has timestamps 🙂 could imagine that srt or vtt does as well. Does that do it for you?
How do you get the json file from the API?
I'm afraid I'm misunderstanding what you're talking about. I mean - to answer your question I would simply say "call the API", but that seems wrong 😛
||ohio||
Hey y’all, I’ve been building with Whisper and using Lambda Labs GPUs for the compute power. Has anyone tried using CPUs instead to transcribe? How’s the accuracy?
Hi, did anyone encounter an encode/decode problem when transcribing a Mandarin audio file?
text in response json look like this: \u4e0d\u559d? \u54c8\u56c9\u5927\u5bb6\u597d
Looks about right 🙂 That's just unicode characters
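To illustrate the point about escapes: any JSON parser turns the \uXXXX sequences back into the real characters, so nothing is lost. A small sketch using only Python's standard json module, with the response body copied from the message above:

```python
import json

# The raw response body, with the escapes exactly as they arrive on the wire.
raw = '{"text": "\\u4e0d\\u559d? \\u54c8\\u56c9\\u5927\\u5bb6\\u597d"}'

# Parsing decodes the escapes into actual Chinese characters.
decoded = json.loads(raw)["text"]

# When writing JSON back out, ensure_ascii=False keeps the characters readable.
readable = json.dumps({"text": decoded}, ensure_ascii=False)
```

If you print decoded you get the Mandarin text itself, not the escape codes.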
Ummm, can someone sort me out, since I can't get Whisper to read my file
(Using NodeJS)
Should I supply a file path or a buffered file?
The biggest scam, bots are asking to human if you are a bot
lmao
I live in ohio ||its really (no) joke||
you spelt survive wrong
any good ideas to improve whisper acc for lyric recognition?
currently running lyric extraction via demucs
could finetuning help?
Does anyone have any tips for reducing hallucinations when using the translate API? Happens fairly frequently, inserting things like "Thanks for watching" and "Please subscribe" to the end of the text, and repeating phrases over and over
@onyx pike why did you send me a friend request
Can whisper handle multiple languages in one API request?
I dont think so
Does anyone have any good diarization projects, they’ve been able to use successfully together with whisper (not the api)
WhisperX has a pyannote integration
For freesubtitles.ai I use a VAD to cut out non-speech audio, that fixes the hallucinations, and then I am building up a db of hallucinated text that I match against to remove before I rebuild the srt/vtt from the word timestamps
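A rough sketch of the "match against known hallucinations" idea, not the actual freesubtitles.ai code. The two phrases are the ones mentioned earlier in the channel; the function names and the segment shape (dicts with a "text" key) are assumptions:

```python
import re

# Phrases reported above; a real list would grow over time.
KNOWN_HALLUCINATIONS = ["thanks for watching", "please subscribe"]

def normalize(text):
    """Lowercase and strip punctuation so matching is less brittle."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def drop_hallucinated(segments, phrases=KNOWN_HALLUCINATIONS):
    """Drop segments containing a known phrase, and immediate repeats."""
    kept = []
    previous = None
    for seg in segments:
        norm = normalize(seg["text"])
        if any(phrase in norm for phrase in phrases):
            continue  # matches the hallucination list
        if norm == previous:
            continue  # same text repeated back to back
        kept.append(seg)
        previous = norm
    return kept

cleaned = drop_hallucinated([
    {"text": "Hello there."},
    {"text": "Hello there."},
    {"text": "Thanks for watching!"},
])
```

Note this would also drop a speaker who genuinely says "thanks for watching", so it is only a heuristic.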
well done friend
aah I see, thanks. I'll look into doing some initial audio processing before hitting the API. Would you be happy to share some of the hallucinated text from the database you're building? I can share what I find too
@abstract bolt is a bot ban em
hi, after receiving such a warm response to my last tutorial on using the API, I want to share my brand new video
What languages does whisper support? Is there a list?
Usa colored supplement logo
Anyone got an ffmpeg command to properly split a >25MB mp3 file into multiple segments (without cutting out dialogue).
Trying to transcribe large mp3 files with the whisper API in c++ but obviously they have a 25MB limit, and recommend to split it. I can’t find a sure fire way to do this properly.
My commands still cut out the audio during dialogue sometimes.
I also want to remove any decently large periods of silence from the audio preferably
Or if anyone knows a better way to do this let me know
if you first get the bitrate and duration of the file with ffprobe, you can then calculate how long each segment needs to be to be under 25MB
ffprobe.exe -v error -show_entries format=duration,size -of default=noprint_wrappers=1:nokey=1 file.mp3
will return the duration of the file in seconds and the size in bytes, then you can calculate how long each segment needs to be like this:
segment_duration = (desired_size * file_duration) / file_size
Then you can use ffmpeg to split it into segments of that duration, for example this will split this file into 40 second segments
ffmpeg.exe -i file.mp3 -f segment -segment_time 40 output_%03d.mp3
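The formula above as a small helper, for anyone who wants to script it. This is a sketch that assumes a roughly constant bitrate; the 24 MiB default target is my own choice to leave some headroom under the 25 MB limit:

```python
def segment_seconds(file_size_bytes, duration_seconds,
                    target_bytes=24 * 1024 * 1024):
    """segment_duration = (desired_size * file_duration) / file_size,
    rounded down to whole seconds."""
    return int(target_bytes * duration_seconds / file_size_bytes)

# Example: a 100 MiB file that is 2 hours (7200 s) long.
seconds = segment_seconds(100 * 1024 * 1024, 7200)
```

The result is the value you would pass to -segment_time.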
ffmpeg also can do silence detection and removal, look into the silencedetect filter
If you're still finding that it's cutting out the start/end of the audio, you can include a 1 second overlap between each audio snippet if you loop over the file and use the -ss and -t flags to manually adjust the start time and duration of each snippet. I use this approach for transcriptions and it works well
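A sketch of the overlap approach, assuming you already know the total duration (for example from ffprobe). The helper names and the chunk_NNN.mp3 output naming are hypothetical; only the -ss, -t and -i flags come from ffmpeg itself:

```python
def overlapping_windows(total_seconds, chunk_seconds, overlap_seconds=1.0):
    """Return (start, duration) pairs where each chunk runs overlap_seconds
    past the start of the next one."""
    windows = []
    start = 0.0
    while start < total_seconds:
        duration = min(chunk_seconds + overlap_seconds, total_seconds - start)
        windows.append((start, duration))
        start += chunk_seconds
    return windows

def ffmpeg_args(src, index, start, duration):
    """Build the argument list for one chunk (pass it to subprocess.run)."""
    return ["ffmpeg", "-ss", str(start), "-t", str(duration),
            "-i", src, f"chunk_{index:03d}.mp3"]

windows = overlapping_windows(100, 40)
first_command = ffmpeg_args("file.mp3", 0, *windows[0])
```

The overlapping second will be transcribed twice, so the duplicated words need trimming when the texts are joined.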
@mild basin Thanks. Would you mind sending me your code? I'm curious about one part that i dont know how to do dynamically. Would be faster if i just read your code. Feel free to DM me.
If you dont want to send it thats ok too, i understand 🙂
I'm live-recording a media element on a webpage with Javascript, which is going to be very different from how I think you're doing it. You have the complete recording already as an mp3 file, correct?
Yes i have a full mp3 file that I pass in to my program via the command line
I'll DM you with some implementation details
Means a lot, thanks 🙂
Whisper can I use in iPhone ?
yes
How ?
How? My native language is Urdu, and while searching the internet I saw many people talking about Whisper's very high transcription accuracy. I want to try it, but I haven't found a way to try it the way I can ChatGPT. I'm not a developer and am looking for a free way to use this service; if I can use it on my iPhone it will be much easier for me.
Minimising hallucinations
check ggerganov/whisper.cpp on github
there are also services where you can upload audio and they run the model for you
Do you can give me link
no because the moderation bot doesn't allow me to post links 
Whisper board application I installed it’s free and behind open ai whisper service
freesubtitles[dot]ai is my site you can use Whisper for free there
I'm very nearly done building my personal VoiceGPT Android app. The voice recognition innate to Android seems worse than I'd like, and the available voices aren't great.
Can anybody point me in a direction for how to get better speech recognition and more voice options?
I gather that Whisper can do both, but (if so) I'm unable to find things like where I can sample the available voices.
Whisper is only voice to text, but not the other way round.
scaam
Ah, thanks!
hi, nice to meet you
Hi everyone ! I've got a weird message error in google collab. I'm trying to use Whisper to transcribe an audio file. I've created an API Key but the google collab tells me the api KEY is incorrect even though it is really not. Anyone already seen this ?
EDIT : i was dumb 🙂
We need a second channel for those of us not using the Whisper API, who are using the model locally.
@empty hinge
Agreed. I've seen some confusion about this. It's clear the two need to be separate channels.
Hi guys! I have a question: in python I have a variable with some bytes (of an audio file) and I want to transcribe this file. But if I call the function openai.Audio.transcribe_raw (I call this and not .transcribe() because I don't want to store the bytes in a file) I get this error: Invalid file format. Supported formats: ['m4a', 'mp3', 'webm', 'mp4', 'mpga', 'wav', 'mpeg']
But the bytes are of an mp3 file.
Anyone with this issue?
Write a C++ program, using function, to calculate the factorial of an
integer entered by the user at the main program.
I'm having this same issue
Code?
I'm trying to automate caption creation as a function, every tutorial/project I'm seeing is using whisper as a CLI. Can someone clarify if I can use it more as a function, passing in the name+path of a file automatically instead of manual user input?
transcript = openai.Audio.transcribe_raw("whisper-1", file=audio_file, filename="a")
where audio_file is <_io.BufferedReader>
You're missing the filetype
reading the code of the function transcribe_raw I see this. you mean that the parameter "file" is actually a tuple with the file and the filetype?
ops sorry, you meant in the filename. I'll try now!
you were right!!
Hi, i'm detecting a student's ability to speak correctly. But now whisper is so good it can even recognize the mispronounced words. Is there a way to make whisper a little less smart ?
do you mean, it's hearing the malformed English and then correctly reforming it? or just detecting accented but correct word use?
Code?
hey guys, new here.
can whisper api provide timestamps?
For the python API (e.g. transcript = openai.Audio.transcribe("whisper-1", audio_file) ) does anybody know where I might find actual API documentation for the Python objects? The API documentation on the OpenAI website seems to be mostly the REST API. But there's this small example of using Whisper with Python, but then it says give --form attributes to tweak the output and I just don't know how. I find it too hard to guess and tweak through code completion and the cookbook doesn't have any python/whisper stuff in it. Thanks in advance for any pointers!
how can i export the result as a srt file? my current line looks like that: result = model.transcribe("full-full.mp3", language="de", fp16=False)
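For the SRT question: the result from model.transcribe() contains a "segments" list with start, end and text, so one option is to format it yourself. A minimal sketch (the helper names are mine; recent versions of the whisper package also ship their own writers, which I have not shown here):

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm as used in .srt files."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments):
    """Turn whisper-style segments into the text of an SRT file."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Usage with the line from the question (not run here):
# result = model.transcribe("full-full.mp3", language="de", fp16=False)
# with open("full-full.srt", "w", encoding="utf-8") as f:
#     f.write(segments_to_srt(result["segments"]))
```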
.
anyone know what the issue here is?
hi guys. i need help to move on my project. I need to connect Whisper API ; GPT-4 API and Google text to speech in flutterflow. Guys, any of you already did this kind of project?
you have a space between "--" and "upgrade" in your pip install command. This results in pytube not being installed and subsequently the import fails.
I am working on a project where I receive a URL from a webhook on my server whenever users share a voice note on my WhatsApp. I am using WATI as my WhatsApp API Provder
The file URL received is in the .opus format, which I need to convert to WAV and pass to the OpenAI Whisper API translation task.
I am trying to convert it to .wav using ffmpeg, and pass it to the OpenAI API for translation processing. However, I am getting an "invalid_request_error"
We are using the Whisper API in our React Native app, and we are encountering the following error:
ERROR Error asking AI: [RequiredError: Required parameter model was null or undefined when calling createTranslation.]
the code
const response = await openai.createTranslation({
file: uri,
model: 'whisper-1',
});
i need help with the openai transcribe function in python
import openai

def transcribe(wave_buffer):
    transcript = openai.Audio.transcribe("whisper-1", wave_buffer)
    message = transcript.text
    if message is None or len(message.strip()) == 0:
        return None
    return message
i am passing a BufferedReader from the memory but im getting an error AttributeError: '_io.BytesIO' object has no attribute 'name', how do i fix this? currently it works if i save the audio as .wav file in the disk but that's very unintuitive since the recording can be large sometimes, how do i fix this?
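A workaround that matches the "missing filetype" fix mentioned earlier in the channel: the error says the library wants a .name on the file object, so you can give the in-memory buffer one. A sketch, assuming the openai Python package of that era reads the file name from that attribute; the helper name and the dummy bytes are made up:

```python
import io

def named_buffer(audio_bytes, filename="audio.wav"):
    """Wrap raw bytes in a BytesIO that carries a file name.
    The extension must match the real audio format."""
    buffer = io.BytesIO(audio_bytes)
    buffer.name = filename
    return buffer

# Dummy bytes standing in for a real recording.
buf = named_buffer(b"RIFF....WAVEfmt ")

# Not run here (needs an API key and real audio):
# transcript = openai.Audio.transcribe("whisper-1", buf)
```

This keeps everything in memory, so nothing has to be written to disk.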
I know one can use Whisper API to upload an audio file and then receive a text from it. But I want my speech to be translated to text live as I speak. Does anyone the sources/ideas on how one can build his using whisper API specifically?
yeah thats my problem too, looks like it needs to be saved in the disk...?
If you already have a web application you can design an upload button. But the only useful source I can find as of now is "web speech recognition" which was there a long time ago.
yep, seems like you cant use bufferedreader to transcribe audios, only via files, openai apis are really halfassed
Hi!, im trying to do some speech to text with whisper in Spanish language, but it misses some keywords, and doesn't understand well the topic. is there a way to add maybe a text dictionary or do some further training in Spanish?
Try splitting audio on silence using PyDub and send small pieces to Whisper API?
yeah but when is he gonna know when to split? like stop the recording at some point
its a live recording
I know. Two asynchronous processes/threads: one for reading live audio and splitting on silence, and one for sending and receiving from Whisper API.
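A toy version of the two-thread idea. To keep it runnable without extra packages, it swaps PyDub for a plain amplitude threshold over a list of samples, and fake_transcribe is a hypothetical stand-in for the Whisper API call:

```python
import queue
import threading

def split_on_silence(samples, threshold=0.05, min_silence=3):
    """Yield runs of loud samples, cutting after min_silence quiet
    samples in a row. Quiet samples themselves are discarded."""
    chunk, quiet = [], 0
    for sample in samples:
        if abs(sample) < threshold:
            quiet += 1
            if quiet >= min_silence and chunk:
                yield chunk
                chunk = []
        else:
            quiet = 0
            chunk.append(sample)
    if chunk:
        yield chunk

def fake_transcribe(chunk):
    # Hypothetical placeholder for sending the chunk to the API.
    return f"<{len(chunk)} samples>"

def run_pipeline(samples):
    chunks = queue.Queue()
    results = []

    def recorder():
        for chunk in split_on_silence(samples):
            chunks.put(chunk)
        chunks.put(None)  # sentinel: recording finished

    def transcriber():
        while (chunk := chunks.get()) is not None:
            results.append(fake_transcribe(chunk))

    threads = [threading.Thread(target=recorder),
               threading.Thread(target=transcriber)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

results = run_pipeline([0.5, 0.6, 0, 0, 0, 0.7, 0, 0, 0, 0, 0.4, 0.4, 0.4])
```

In a real app the recorder would read from the microphone and the split would work on audio frames, but the queue between the two threads stays the same.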
sometimes I get bad transcriptions, like random characters in other languages:
transcript.text කපමාන්මාන්මාන්මාවක් කළුමන්තස්තුතියට අපි කිරීමට කිරීමට කිරීමට කිරීමෙන් කිරීම කිරීමට කිරීමට කිරීම කිරීම කිරීම කිරීම සහ කරයි.
transcript.text 今度は、私はこのような場所で、私は 私はこのような場所で、私は 私は 私は 私は 私は 私は 私は 私は 私は 私は 私は
Anyone have any idea why this happens?

Yes I do type in my native language which is 🇱🇹 and I get answers in 🇵🇱 
and I think it's AI's key words understanding issue.
Hi guys, has anyone tried training the pyannote.audio model with their own data from scratch? The results I have gotten for speaker diarization using the pre-trained pyannote.audio model are not so accurate, therefore I thought of training the model from scratch. Anyone with ideas on how to go about this?
I've only seen it using the API when there are large gaps of silence.
So if an upload is near or at 25 mb, does whisper still transcribe within 10 seconds?
i am trying to add whisper to my python script but it does not work
Import "whisper" could not be resolved
any ideas why?
I have ffmpeg installed
and python 3.10.10
Unless you are messing with the OpenAI/Whisper github example, then you just use the OpenAI API
pip install openai
then
import openai
when I did that this came
`def synthesize_speech(text):
    engine = pyttsx3.init()
    engine.setProperty("rate", 150)
    engine.save_to_file(text, "output.mp3")
    engine.runAndWait()

def transcribe_audio(audio_file):
    transcript = openai.Audio.transcribe("whisper-1", audio_file)
    return transcript["text"]`
OK, you have OpenAI installed in python already
the whisper package on PIP is not the same
aren't I supposed to use these once whisper is imported first
yeah but according to this I have openai already installed
have you tried creating an environment for your project in Python, then installing the requirements via pip to that env? maybe there is a conflict in your default setup.
python -m venv MyProjectEnvironment
./MyProjectEnvironment/Scripts/Activate.ps1
Just to clarify, you realize that when you do pip install whisper it installs the Whisper Database package, not anything to do with the Whisper voice api?
Whisper is a fixed-size database, similar in design and purpose to RRD (round-robin-database). It provides fast, reliable storage of numeric data over time. Whisper allows for higher resolution (seconds per point) of recent data to degrade into lower resolutions for long-term retention of historical data.
how am I then supposed to install it to the machine, since I am trying to use EdgeGPT and whisper to create a voice assistant
are there any alternatives for voice recognition?
I found a package called whisper-openai, but I haven't used it. You can interact with whisper using just the 'openai' package, I am not sure why that project has you installing 'whisper' package, that's a DB.
I will look it up on GitHub and see what its doing, give me a few
tyt
is it this one? acheong08/EdgeGPT
correct!
odd, this one doesn't mention voice. I know AutoGPT has a voice option, but it looks like EdgeGPT does not.
weird, the thing is almost any vid I looked up uses the same pip install and can simply import it to their code and it finds the module
but in my case, while running python3 it does not find the module, could it be that my IDE is not compatible?
Which IDE are you using? I use VS Code on Win10
same
can you share one of those demos, maybe I can glean some insight from it. if it's on youtube, just provide the watch?v=YM3vT65q4tY part of it? (because links are not allowed)
k
Oh, ok. So he is using the project on openai/whisper GitHub as a python package. one moment
alright
try running these commands and see if the import succeeds after
pip install git+https://github.com/openai/whisper.git
pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
right away
This package is a self-hosted version of the whisper API, which is what threw me off.
for the first pip install this came out
the second pip Upgrade
this>
and it still wasn't able to import
this is very odd, since almost every video does the same as I do and it works out for them
hmm. the video is a year old, maybe change the import to match the package
import openai-whisper
for openai it says missing import, but for whisper it says undefined variable at line 3, character 15. what is that line in the code?
the github says "We used Python 3.9.9 and PyTorch 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.8-3.10 and recent PyTorch versions. "
so, no it says python 3.10
hmm actually do not know what I am doing wrong at this point
3.10 is what I have
also ffmpeg is installed
Ok, the example code shows import whisper.
`import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)`
should I add it to my code or this is just an example
it was an example from the openai/whisper github page. You might model your code after it, but at the moment you are still stuck on the failed import.
yep,
what could possibly be the reason ? I mean I did everything as it was documented
Im on Python 3.11.3, so of course it won't even try to install for me lol
so there is no solution I suppose
You could always fall-back to the OpenAI API, but once your free credit is used/expired, you'd have to pay to use it.
alright. I will look into that
but thanks for helping
You're welcome. The whisper api is priced at $0.006 / minute
So if I were to try to transcribe audio that's near 25 mb, using next js within the 10 second timeout limit, would it be enough time to finish the api request?
If anyone knows
I don't know what you are asking. I am new to this. I did, however, do a test with Whisper and Nuxt3 and I got it to convert an audio file to text very easily.
I would do the request in a serverless function, giving me a 10 sec time limit. However, if the file size is near or at 25 mb, would it finish before 10 seconds?
I don't know. I'm new here too. Why the 10 sec time limit?
That's how next works with backend serverless functions on a hobby plan
yes
Will OpenAI ever release whispers API with the ability to get timestamps, like you can on the local run versions ?
Can I ask why you need that when Youtube already does it? Just upload a video there to get the transcript with time stamps?
and it's free
because im asking for Whisper. as im in the OpenAI discord? I dont care about youtubes ASR
Im developing something that needs timestams. anyone useful able to answer me ?
There isn't a way to get them directly from whisper. They have a version of the API available on GitHub you could try to modify, otherwise use your own code to insert the timestamps using datetime() or something.
OpenAI\Whisper on GitHub
Whisper significantly better than YouTube speech to text
Did you try prompting for it? A fair bit of control possible.
Not sure actual time stamps…it seems very GPT based which isn’t so smart with time…but organization..you can suggest with the prompt
I have GPT able to generate YouTube scripts…with time codes…so maybe not impossible
When you run whisper on a local machine, you get time stamps for general sentences. The API does not have that capability, it only returns the given text in a single string.
If you use the verbose_json option for response_format on the web API, it does include some timing information with the start and end values
I do not notice the difference, to be honest.
😮 HOW! can you please dm me a boilerplate for that >?
if this works for me, you literally saved my project haha
Add it to the body of the POST request like you do with the other fields, ie.
formData.append('temperature', 0.2);
formData.append('response_format', 'verbose_json');
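For anyone doing this from Python rather than JS, the same two fields go into the request, and the timing can then be read from the segments in the response. A sketch, assuming the verbose_json body has a "segments" list with start, end and text as described above; the sample response is made up:

```python
# Same fields as the formData example, as a plain dict.
form_fields = {
    "model": "whisper-1",
    "temperature": "0.2",
    "response_format": "verbose_json",
}

def segment_timings(verbose_json):
    """Return (start, end, text) for every segment in the response."""
    return [
        (seg["start"], seg["end"], seg["text"].strip())
        for seg in verbose_json.get("segments", [])
    ]

# Hypothetical response body, for illustration only.
sample_response = {
    "text": "Hello there. How are you?",
    "segments": [
        {"start": 0.0, "end": 1.2, "text": " Hello there."},
        {"start": 1.2, "end": 2.8, "text": " How are you?"},
    ],
}
timings = segment_timings(sample_response)
```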
can i DM you ?
sure
@dapper fjord hey bro
is the whisper api better than running whisper locally?
im gay
I am not
1000%
I have a 3090 and It still takes 15 - 20 mins for an hour video.
OPENAI - "We’ve now made the large-v2 model available through our API, which gives convenient on-demand access priced at $0.006 / minute."
that's $0.36 for an hour of video. and you get it back within 30 seconds.
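The arithmetic, in case you want to estimate other lengths. A tiny sketch using the quoted $0.006 per minute; Decimal avoids floating-point rounding noise:

```python
from decimal import Decimal

PRICE_PER_MINUTE = Decimal("0.006")  # USD, as quoted above

def transcription_cost(minutes):
    """Cost in USD for a whole number of audio minutes."""
    return PRICE_PER_MINUTE * Decimal(minutes)

hour_cost = transcription_cost(60)
```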
Depending on what you technically need, it's better in so many ways
Thanks. But their large model is not better than the open source one though right?
^^
Yeah
again the api is a lot better
Here it says that it's the same: https://github.com/openai/whisper/discussions/661
Dude ffs version 2 is the more efficient better one 😄
I use the api for business, as well as the local version
Id take api over local any day
You're right, but the local whisper also uses v2! And I have a 4090.. so it's great to put it to use no?
I also have a 4090 that i want to put to use
I also have questions about how safe the data in the api is, risk of eavesdropping etc
Local does not offer the large-v2 model.. how have you got the v2 model?
They say it does: https://github.com/openai/whisper/discussions/661
For anyone interested, I finally managed to make the large-v2 model work on Windows 11. It works great and fully utilizes my 4090! A 22 min audio was transcribed in about 3 minutes (the API does it in like 30 seconds). But it's free (+ some electricity).
Quality is the same since they're the same model.
лол
Anyone facing issue with the Whisper API? Suddenly some requests have started failing with the error Invalid file format. Supported formats
is chat gpt working now?
you can use this to monitor all OpenAI systems status: https://status.openai.com/
What do you guys whisper for mostly? For development or personal use
Hi! I am really trying every possible way to make the whisper work in NodeJS but no luck.
Always get this error.
My file is a simple m4a but it does not matter, got the same error with mp3 as well.
Maybe someone has faced this in the past?
Use gpt4 🙂
is whisper api down ? i am getting 502 gateway error and errno: -4077, code: 'ECONNRESET', syscall: 'write',
this error
Hello - anyone found a good windows cli client for whisper local? Ala whisper.cpp. Can’t use api due to a work requirement, and openais version is cpu or cuda only - my understanding is there is some ports that work on gpus generically?
a discord bot with chat gpt simplified
are you facing issues with whisper api? i got the error these 2 days: The server had an error while processing your request. 500 {'error'
i face this issue frequently today
I have managed to hack this in case if someone needs this:
- The file type is Express.Multer.File; I am using NestJS on the server side.
- The working variable works, but that reads from disk, and I wanted to solve this via uploads.
- You can convert the Multer file buffer into a stream like this: const hackedData = Readable.from(file.buffer)
- Very important: you need to add a path as well, which can be anything BUT needs to have the correct extension. Otherwise OpenAI throws an error.
Proper variable naming and TS needed of course but at least this is working for me
hey wutsss goood
@woven folio hey bro can you help me set this up. I don’t know anything. Just point me in the right direction and I’ll update you with the progress if you have the patience to help a noob with 0 for experience but a lot of drive and willingness
dont use m4a. use mp3
Sure no worries. What you need to know exactly? How to run whisper locally with your own GPU?
Can I dm you about this? Would love a walkthrough!
Hi, I’m very curious as well - did you use openais version or whisper.cpp/some other port?
I had a lot of difficulty getting CUDA going so I ended up using a port on GitHub with native gpu support, I can’t link it but the link is const-me slash whisper
openai version, works just fine. There are a couple other I wanna try but this is doing the job just fine tbh
Thanks - I will keep persisting with getting CUDA running on windows 11, coming from a web background it’s like starting from square 1.
I made this script for my workloads to run it automatically at night, just import your audio and you get txt outputs: https://github.com/sdevgill/whisper-auto
Install whisper directly from github like it says in the README, then install CUDA 116 or 117, latest NVIDIA drivers, then follow these instructions: https://github.com/openai/whisper/discussions/47 to remove pytorch and install it again cleanly for CUDA
A bit cumbersome but it's the current process on Windows with NVIDIA
Thank you! Yes I was looking through some other issues and getting stuck . The const-me version I’ve found does have the advantage of working on non -CUDA gpus but the maintainer just did it as a hobby project. I’m also looking at doing live transcription and displaying on a web page - obviously with some kind of buffer but live ish. The idea is we have some maintenance crews using radio comms and if something happens you can quickly look back the last few minutes and get some context etc
From playing around with it seems like you can get high quality transcription faster than real time on a rtx 2060 so it seems feasible, perhaps with a minute delay or so with a buffer
Sounds like a cool project! I recommend playing around with the official model to get the hang of it, then start experimenting with the other projects and go from there. Then you can kind of experiment which model works best, you might even be able to pull it off with the large one, though most likely small/medium, they're not as good as the large, but if it's clear spoken English, they're good as well
Yeah - currently in that playing around stage but have had some brilliant results just pumping through old recorded comms so it’s definitely possible. Also not sure what the infrastructure will look like if I can scale it up to 10-20 live channels
if not now with your own hardware, sooner or later whisper api will get like 10x cheaper anyway
@woven folio Agree - unfortunately have a biz requirement it has to be sandboxed, for really no good reason 
Hi everyone. Do u face this issue before? How to solve it?
How do I auto detect when someone is talking, and I should send it to whisper?
Use hark.js to detect speech.
thanks
I used gpt to write code for Pythonista on iOS phone to leverage whisper for transcribing audio. It works when mp3 is around 10 mb but I start getting nothing when file size gets closer to 19 mb. Nothing meaning errors and no transcription. It’s relatively simple use case. If anyone has idea or would look at my code, I can share it here. Thanks 🙏 ((is api support channel correct path?))
Really sorry if this is a stupid question, I want to use whisper-1 API to transcribe speech in Urdu, but it works only half the time
There is another language Hindi which verbally sounds exactly the same as Urdu, but the characters and script of the language are completely different. The API constantly detects the speech as Hindi and transcribes it into Hindi text
Is there any way to force or hint to the model which language the speech is in, if there are multiple languages that verbally sound the same?
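The transcription endpoint does take an optional language field (a two-letter ISO-639-1 code, "ur" for Urdu) and an optional prompt, which should help with the Hindi/Urdu mix-up. A sketch that only builds the request fields; the helper name and its validation are my own additions:

```python
def transcription_params(language=None, prompt=None):
    """Build the form fields for a whisper-1 transcription request."""
    params = {"model": "whisper-1"}
    if language is not None:
        if len(language) != 2 or not language.isalpha():
            raise ValueError("expected a two-letter ISO-639-1 code such as 'ur'")
        params["language"] = language.lower()
    if prompt:
        params["prompt"] = prompt
    return params

urdu_params = transcription_params(language="ur")

# Not run here (needs an API key and an audio file):
# transcript = openai.Audio.transcribe("whisper-1", audio_file, language="ur")
```

A short prompt written in Urdu script may also nudge the output script, though I have not verified how reliable that is.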
Hello, I currently use Open AI Whisper in a application but i randomly came to know about Whisper Jax, How does it work?
Hello, i'm working in a project about whisper can i get some help?
I'm trying to run google colab for openAI's whisper, but I don't know how to make whisper access the file. Any idea on how to do it?
I uploaded the file to google colab, but I don't know what to do to make whisper access it, basically
the colab in question:
Is there a decent way to format the output of whisper? It’s super accurate but the wall of text is hard to work with
gm
if you give GPT an example, it can reformat it for you
/start
Hey guys. I am currently using whisper right now, and even though the language I speak is in english and I only use transcribe method not translate, but in response it becomes another foreign language that I don't understand, my guess is that it sounds like Indonesian language or Malaysian I'm not sure. Is it because of my accent? is it maybe I should speak more fluently or more American or British accent so that the response will still be in english? Thank you guys.
is whisper available for js?
either using the api or as I package doesn't matter
Hi all,
does whisper transcribe filler words in english?
Hmm... I suppose I could break it into sections and parse it through the API. Good call, thanks.
perhaps
Hi guys. I have a long audio file, about 1 hour, and at one point, around 02:00, another person questions the speaker (so the voice is quieter). The model stops transcribing and hallucinates, adding "... ... .. . . . ... .. ..". What option is there to avoid this problem and simply not transcribe what the model doesn't understand?
i think u have to do a pretreatment to the audio before the transcription process
Does whisper allow torch compile for pytorch 2.0? Keep on getting the same errors as
https://github.com/openai/whisper/discussions/819
Does anyone know if its working at all? Have tried every page of google and it seems impossible
Don't they already have a VAD detector? I think we can change the VAD configuration with the options they provide. Am I wrong?
Hi, I need to transcribe more than 25 megabytes of audio to text with Whisper multilingual, how can I do that? Thanks
Why don't you try typing "Whisper multilingual audio to text transcription for large files" into the search bar and see what comes up?
I tried to upload 24 MB to Whisper (the maximum allowed is 25 MB), but it said it's too large?
Hi everyone. How to have the time positions in the audio file for each word that is recognized. Are there solutions for that? Thanks
you can use different techniques for preprocessing audio data, including noise reduction and audio segmentation, which can improve the accuracy of speech recognition.
Do I need to apply somewhere to get access to the Whisper API? I have an OpenAI account with a key, but I don't see any options, or am I just missing something? I see speech to text in the API section, but that area refers to Whisper with a link, and then I'm directed to the Whisper main page, where I don't see any info on how to access it.
do u have experience with that? I have some, but sometimes noise reduction can make the audio worse
not really ,i'm sorry but i'll see if i could help you with other information
later I will go deep on all the inputs option that we can change on the model, I think there is something that can help
good luck with that 🙂
why whisperapi so slow?
If you guys want to visualize the whisperapi like so check out mabbu.app
Is there any way to get shorter sequences? Some of my sequences are about 20 words long, and since I'm using them for subtitles it doesn't look good. The transcribe options include some settings I couldn't find any information about; I tried changing no_speech_threshold but it didn't work
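One way to get shorter subtitle cues is to regroup words yourself. This is only a sketch, and it assumes you already have word-level timestamps (for example, the open-source `whisper` package can return a `words` list per segment when `transcribe()` is called with `word_timestamps=True`). The function names and the limits of 7 words / 3 seconds are made up for illustration.

```python
def _finish(group):
    """Collapse a list of word dicts into one cue dict."""
    text = " ".join(w["word"].strip() for w in group)
    return {"start": group[0]["start"], "end": group[-1]["end"], "text": text}


def split_into_cues(words, max_words=7, max_seconds=3.0):
    """Group word dicts ({"word", "start", "end"}) into short subtitle cues."""
    cues, current = [], []
    for w in words:
        too_many = len(current) >= max_words
        too_long = bool(current) and (w["end"] - current[0]["start"]) > max_seconds
        if too_many or too_long:
            # close the current cue before it grows past either limit
            cues.append(_finish(current))
            current = []
        current.append(w)
    if current:
        cues.append(_finish(current))
    return cues
```

Each returned cue has `start`, `end` and `text`, so it can be fed to whatever subtitle writer you already use.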
it depends on your CPU and GPU, the audio file duration, the language and the parameters used
Hi guys. Today I decided to move from my machine with the large model installed to a script for the Whisper API. Could you give me an example of how to use it with remote video URLs?
Does anyone know what the companies in the Whisper paper are?
The api accepts only audio files, not URLs:
https://platform.openai.com/docs/api-reference/audio/create?lang=python
Turn the video into an mp3 and provide it as a file to Whisper. If the file is larger than 25 MB, you must clip it into segments no larger than that and make separate calls.
I usually have mp4, but I hope they don't go over 25 mb otherwise I should convert them to mp3 before
mp4 works. If you need to clip them and don't want the clip to be mid-word or mid sentence, you can use a VAD to clip only when a long enough silence comes.
Can I use whisper to get live Speech to text not a recording?
not on the API, but that would be possible if you run it locally
Hey hey! Has anyone incorporated whisper into a chatBot yet. I’m just about to dive in here and wondering if anyone has any pointers they’d like to share or any pitfalls?
I was thinking about working on that but it would cost so much for how I would make it
there's whisper-typer-tool on github
,.,
II
JOHN
whisper-typer-tool from github
Thanks for the tip
Does anyone know where i can find this user
@acoustic yarrow whisper-typer-tool from github
@fallen turret
bsusbus
Hey, I'm having an issue with whisper. I'm trying to force it to run in one particular language by running it like this: whisper --model large --language pt --task transcribe input_file_here
but it looks like, despite setting --language, the AI still picks up some English words and translates, whereas I would like it to use only words of that particular language. Is that possible?
wanted by FBI?
it's me
Does anyone have the code to save an audio file to a txt file?
Transcribe?
Willing to pay someone if they can use ChatGPT to develop an app
anyone help!
how can I train whisper ?
This looks like something to do with your Python installation
also its not an error its a warning
you can go to the url mentioned in the error to learn more.
i went to the url
but i didnt understand
i study odontology
i dont know about programming
Hey all! I'm having some trouble using the transcription API in a Next.js API route. I get an ERR_FR_MAX_BODY_LENGTH_EXCEEDED error even though the file I'm sending is a 12.5MB .mp3 file. Here's the code I'm using: https://codeshare.io/nzvj9E
I tried with both fetch and the SDK, but i get weird errors for both:
fetch: i get a 'You must provide a model parameter' error even though it's 100% being included in the request
SDK: ERR_FR_MAX_BODY_LENGTH_EXCEEDED even though the file is only 12.5MB.
any advice?
she wanna go viral
First one seems to be from the front-end, second one, show your code
anyone help! Please
how can I train whisper ?
Hello guys. I have a question on whisper. I need to make transcription of the call between two people so transcription is presented in the form of dialog. As far as I know whisper and Open AI speech api do not provide those features.
I know that one solution is to divide the file into smaller subfiles for each speaker and then push each file to open AI API, however I'm wondering is there any better or simpler solution for that problem. May be there is already library made for that task.
is there a way I can divide the whisper transcription by user? Say I have 2 users talking. Can I convert the transcript and find out who said what?
search whisper speaker recognition in huggingface
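Whisper on its own does not label speakers, so a common pattern is to run a separate diarization tool and then merge its output with the transcript. The sketch below only shows the merge step: it assumes you already have Whisper-style segments (`start`, `end`, `text`) and a hypothetical list of speaker turns (`start`, `end`, `speaker`) from whatever diarizer you pick.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best, best_overlap = "unknown", 0.0
        for turn in turns:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > best_overlap:
                best, best_overlap = turn["speaker"], ov
        labeled.append({**seg, "speaker": best})
    return labeled


def as_dialog(labeled):
    """Render labeled segments as 'speaker: text' lines."""
    return "\n".join(f'{s["speaker"]}: {s["text"].strip()}' for s in labeled)
```

Segments that overlap no turn at all are labeled `unknown` rather than guessed.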
hello how can I fix this openai.error.RateLimitError: You exceeded your current quota, please check your plan and billing details. isnt API is for free?
same problem
i want to send data to the whisper api in base64 format, since im using node js, i cant use the "File" type for the file input.
anyone know?
Hello, does Whisper collects my transcript data?
what exactly is a whisper guys
so... sometimes I've received weird languages, but this one was a bit odd...
The whisper API is generating subtitle segments that are way too long (6-8 seconds each in some cases); how can I configure it to return shorter segments?
How long is the audio file?
messi or ronaldo
About 60 seconds
--output_format {txt,vtt} its ok this prompt?
hey yall check out
SkmAI (Beta): AI powered Youtube video search tool (Revolutionizing search and content consumption) on the projets sections
How do I use Whisper and what does it do?
Hi guys!
A big question: do you know of an app that uses Whisper as a note-taking speech app?
Someone asked me because they want to write a book just by dictating the story.
I use it for tracking fitness etc
Are there any projects with near real-time use of the whisper API? Haven't really worked with audio before but am basically looking to do real-time transcription and text analysis.
ALL hail Kitty
Also.. hmm. You can go to github.com/jaggzh/whisperpluck for my interface to whisper. See a video linked from the repo page.
Only works in linux right now; but, with the scripts, you can assign hotkeys to the scripts (for Start/Stop), or drag them to the desktop, OR, I have my new UI button overlay which is really quite nice, imo.
how can i learn english?
Duolingo
pretty close to exactly what i'm looking for in terms of implementation thanks
@errant mason it was built up over a week from the initial quick hacks I did while on a screen sharing call with my friend (for whom I was writing it). ... Record audio, somehow kill recording process, transcribe, get it into clipboard
This image is a fat kitten
business chonk
help
does anyone know of a way to have timestamps is this possible?
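Timestamps are available per segment when you run the open-source `whisper` package locally: the result of `transcribe()` has a `segments` list with `start`, `end` and `text`. As a rough sketch (the function names are mine), those segments can be turned into SRT-style subtitles like this:

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (the SRT timestamp layout)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from a list of {"start", "end", "text"} dicts."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```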
Hi! Can someone help me? I'm trying to use the Whisper API with the openai Node package to send a local file, and I'm getting error 400. Does anyone know how I should send it?
hello guys, i want to use only whisper and python(libraries ) to transcribe Youtube videos. Any ideas?
Lots of videos on this topic
Does anyone know how the translation task works? I want to output the translated + non translated transcription without making it transcribe twice, does anyone know how?
Any distro? Kali or redhat?
Tails?
I would try with a smaller file first, to be sure it's not another issue
So thanks for the charge openai team just take it next time alerts feel like threats
You have all my info I sent it there directly.
Since grade school
Thank you #AIbuddies
Shouldn't matter. I might use some bash-specific stuff in there
Man. Isn't there some clever way of automatically detecting and removing noise from voice audio?
Sure but hiding is silly. If it's digital it's done tbh
That's my opinion
I'm just a random internet handle
Is it possible to use Pytorch/Whisper with an AMD GPU?
Anyone here knows tinygrad discord?
Whispernotes app
Is there no thread for developers using APIs from ChatGPT or GPT?
It's a really accurate transcription service.
You can access it from https://platform.openai.com/playground and clicking the little green mic in the top-right corner.
An API for accessing new AI models developed by OpenAI
response = openai.Image.create(
    prompt="a white siamese cat",
    n=1,
    size="1024x1024"
)
image_url = response['data'][0]['url']
I don't get it. Some pages say to dl common voice training set in a specific language, but.. when I pick English the file still has tons of languages in it.
I'm just trying to get something that can help me isolate a patient's voice to prepare labeled training data for fine tuning whisper
Their whispy ventilator-breathing voice isn't recognized by anything we've ever tried
Maybe the datasets python api let's me be more specific.. still.. 60gb.
Hey does anyone know the python equivalent of
--form file=@openai.mp3 \
--form model=whisper-1 \
--form response_format=text
specifically the text response format
nvm it's literally response_format="text"
Where can I use whisper?
is it possible to run the whisper model locally?
@proper spire you stole my assets, answer me for goodness sake
Before the DMCA is pushing you up the list
the new chatgpt app uses it, and it's fast, so yeah you probably can
Hello! I'm looking for a developer to integrate Open AI on an ERP-PHP webapp.
Does anyone know a freelancer or programmer with experience in this kind of integration?
Hi! Does anyone know how to run the whisper on python with less VRAM than the requirement?
What is this
Does whisper have IPA phonetics support?
You cannot.
Speech to text
m
which platform?
Waiting for next update to the Whisper API. Specifically the whisper-2 model
Hello
Has anyone solved the problem of converting asr whisper to: PytorchScript, ONNX (later TensorRT)?
I need to convert the model for Nvidia Triton: (input - tensor, output - tensor), so ready-made options, such as suggested by huggingface, are not suitable.
Whisper divides the audio recording into segments of 30 seconds, I tried to convert the model with input: mel_segment tensor for 30 seconds of audio, the output tensor: token, which I can then decode into text.
Problems encountered:
- convert to PytorchScript via jit.trace: the model remembers the output result of the tensor used in the conversion;
- convert to ONNX: incorrect graph saved
what are some great ideas to use whisper as a startup company in the medical and healthcare field
transcription
hey guys, I am trying to cut my whisper api sends into chunks that are below the limit, but it is making me frustrated. I am a hobbyist.
Can anyone point me to some python code that does this? I want to point to a file and have the code chunk it and send it so that each chunk is below the limit and then just concatenate the resulting strings. I know this is a simple problem, but I am just starting out working programmatically with audio.
Same
Real-time speech-to-text is kind of a whole different animal. I believe most still save as a file, but much smaller. I have not used whisper but have built another app that chunks the data into 0.5 second chunks, adjusts the dB levels and then puts back together, then chunks it by silence of 0.75 seconds or longer and processes each chunk from there
again, not Whisper, but I did use Python
My script for whisper API doesn't work anymore. I didn't change anything about it, but haven't used it in over a month.
Did something change?
openai.error.InvalidRequestError: 1 validation error for Request body -> file Expected UploadFile, received: <class 'str'> (type=value_error)
I'm just confused because it used to work before
The code is quite simple:
`openai.api_key = config.openai_APIKEY
audio_files = sorted(os.listdir("parts"))
transcription_text = ""
for audio_file_name in audio_files:
    audio_file_path = os.path.join("parts", audio_file_name)
    audio_file = open(audio_file_path, "rb")
    transcription_object = openai.Audio.transcribe("whisper-1", audio_file)
    print(transcription_object)
    part_transcription_text = transcription_object["text"]
    transcription_text += part_transcription_text + "\n"`
I have a python script that cuts an audio file into chunks. But I don't have the same script send it to whisper. The files are put into a folder and filenames are numbered, so an other script can loop through the files and concatenate the API responses
You could easily expand this script to have the API transcription done as well. I just decided to keep it separate so it would be easier for me to troubleshoot
This script is not perfect though, because it doesn't do anything to prevent the audiofile from being cut in the middle of a sentence or word even
theres suddenly a 50% error rate for whisper API calls now
Thanks, this moves me in the right direction. One of the issues that I discovered yesterday is that the wav files can be way bigger than the mp3, so even if I calculated the chunk size based on the mp3, the wav file that I sent could still be way over the limit. It was only late last night that I found out about the .export method.
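A rough sketch of the chunking idea, assuming `pydub` (with ffmpeg available) is used for the audio work and that every chunk is exported as constant-bitrate mp3, so its size can be estimated from its duration. The 64 kbps bitrate, the 10% safety margin and the function names are illustrative choices. Like the script described above, it cuts at fixed times and does nothing to avoid cutting mid-word.

```python
import math


def plan_chunks(duration_ms, bitrate_kbps=64, max_bytes=25 * 1024 * 1024, safety=0.9):
    """Return (start_ms, end_ms) pairs whose exported size should stay under max_bytes."""
    bytes_per_ms = bitrate_kbps * 1000 / 8 / 1000  # kbit/s -> bytes per millisecond
    chunk_ms = int(max_bytes * safety / bytes_per_ms)
    count = math.ceil(duration_ms / chunk_ms)
    return [(i * chunk_ms, min((i + 1) * chunk_ms, duration_ms)) for i in range(count)]


def export_chunks(path, out_dir="parts", bitrate_kbps=64):
    """Split an audio file into numbered mp3 chunks (estimate-based, not exact)."""
    import os
    from pydub import AudioSegment  # lazy import; assumes `pip install pydub`

    audio = AudioSegment.from_file(path)
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for n, (start, end) in enumerate(plan_chunks(len(audio), bitrate_kbps)):
        out = os.path.join(out_dir, f"{n:04d}.mp3")
        # exporting as mp3 keeps the chunk far smaller than an equivalent wav
        audio[start:end].export(out, format="mp3", bitrate=f"{bitrate_kbps}k")
        paths.append(out)
    return paths
```

The zero-padded file names sort correctly, so a loop like the one posted earlier (`sorted(os.listdir("parts"))`) can send the chunks in order and concatenate the text.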
unfortunately the base whisper implementation is too slow for my use case, I'm exploring using faster implementations such as whisper-jax that have lower latency, that i can also deploy myself
I am trying whisper-1 API transcription model first time for YouTube videos. It is fast but most of time is spent on downloading/converting the video.
For a 1-hour video it took 30 minutes, which is too long. Am I missing something, or is this what everyone else is doing?
Isn’t there some faster way to do it?
Let me know if anyone compares Whisper to Meta's new MMS model on English.
@remote oasis I would probably use yt-dl and ffmpeg to strip the audio?
I have the audio URL; ffmpeg just takes a bit of time to download and convert, but I somehow reduced the time and now it's a bit faster.
what next
@remote oasis not sure about yt-dl, but yt-dlp is fast.. choose an audio-only, not the video (unless you want the video data for some reason)
oh, you have the audio url, n/m. But do use the alternate one which simulates being a mobile device and stuff.
here's a little bash script I wrote yt-dl-mp3:
#!/bin/bash
ytdlbin=yt-dlp
quality=2
url=
help=
for a in "$@"; do
    if [[ "$a" =~ ^-[0-9]$ ]]; then
        quality="${a#-}"
        echo -e "\\033[1mQuality set to: $a (0 is highest)\\033[0m"
    elif [[ "$a" =~ // ]]; then
        url="$a"
    elif [[ "$a" = -h || "$a" = "--help" ]]; then
        help=1
    fi
done
if [[ "$help" = 1 || "$url" = "" ]]; then
    echo "Usage: yt-dl-mp3 [-#] [-h] url"
    echo "Where: -# is compression (-0 is lowest. Default -2)"
    exit
fi
printf "%s " "$ytdlbin" "$url" -x --audio-format mp3 --audio-quality "$quality"
echo "Where: -# is compression (-0 is lowest. Default -2)"
read -p 'Enter to proceed with default (unless you set it)...' -t 5
"$ytdlbin" "$url" -x --audio-format mp3 --audio-quality "$quality"
Does anyone know for fine-tuning whisper if we should use noisy audio? I'm recording training data of someone and recording some while the vacuum is on.
will it make the model more robust?
Also, I can't find info on if we can provide noise training data
go to github and look at the whisper forks. there are some that will do this
How much training data do we need to fine-tune? The person has a very unique voice. Their pronunciation of various syllables is affected by them being on a ventilator, and they can only speak in very short phrases (for the same reason).
i'm manually transcribing for the training data so .. it's a bit tedious (but worth it.. just would like to know how far to carry this) 🙂
because me loves her
Ahhhhh, frustratin
i feel youy
Question: Is 'google cloud speech to text' better than 'Open ai Whisper?'
Context: Ive been creating a project which is highly dependent on real-time voice transcriptions, Ive had a chance to integrate google cloud api for their Speech-to-text service, its alright but fails to provide accurate real-time transcriptions for some reason. I am running tests from my internal microphone.
IMO Google has the ability to create the better of two services as they logically have way more data than open ai, but open ai seems to moving way quicker when it comes to development, implementation, and use case
I'm trying to create a Discord bot with Whisper capabilities, so I need my requests to be async. I am downloading .mp3 files from Discord, then I am trying to upload them to the Whisper API. It isn't accepting my requests, and it returns this error:
{
"message": "Could not parse multipart form",
"type": "invalid_request_error",
"code": null
}
Here is the code I am using:
async with aiohttp.ClientSession() as session:
    filename = "test.mp3"
    async with session.get("https://cdn.discordapp.com/attachments/1097219558466658354/1112417715316076775/AI_Test_Kitchen_toetapping_footstomping_americana_1.mp3") as resp:
        if resp.status == 200:
            audiodata = await resp.read()
    ...
    headers = {
        "Authorization": "Bearer " + openai.api_key,
        "Content-Type": "multipart/form-data"
    }
    form = aiohttp.FormData()
    form.add_field('file', audio, content_type='audio/mp4')
    form.add_field("model", "whisper-1", content_type="text/plain")
    resp = await session.post(url="https://api.openai.com/v1/audio/transcriptions", data=form, headers=headers)
    print(await resp.text())
    await session.close()
Does anyone know why this is?
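A likely cause, though I can't be sure without running it: setting `"Content-Type": "multipart/form-data"` by hand drops the boundary parameter that the server needs to split the form, so the body can't be parsed. The posted code also passes `audio` where the downloaded bytes are in `audiodata`, and the file field has no filename. A sketch of the request with those three things changed (the helper names are mine):

```python
API_URL = "https://api.openai.com/v1/audio/transcriptions"


def auth_headers(api_key):
    """Only the Authorization header; aiohttp builds the multipart
    Content-Type (including the boundary) from the FormData itself."""
    return {"Authorization": f"Bearer {api_key}"}


async def transcribe_bytes(audio_bytes, api_key, filename="test.mp3"):
    """POST raw audio bytes to the transcription endpoint and return the body."""
    import aiohttp  # lazy import; assumes aiohttp is installed

    form = aiohttp.FormData()
    # a filename makes this a proper file part of the multipart form
    form.add_field("file", audio_bytes, filename=filename, content_type="audio/mpeg")
    form.add_field("model", "whisper-1")
    async with aiohttp.ClientSession() as session:
        async with session.post(API_URL, data=form, headers=auth_headers(api_key)) as resp:
            return await resp.text()
```

Inside the bot you would `await transcribe_bytes(audiodata, openai.api_key)` after the download step; the `async with` blocks close the session, so no explicit `session.close()` is needed.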
baboonalism
me too
how can i connect my newly created rails app with newly created next js project
Thanks! I found another package with a free API that has since stopped working, but it used to download from YouTube really fast. I went through the code and, lo and behold, it uses yt_dlp to download. I am trying that now and will probably need to deploy my own whisperx, because the OpenAI API has limits and is slow
heyaa, im sorry, i'm still very new to using openai. Does anyone know how i can make an automatic whisper? By that i mean it always detects input from my mic, and when my mic goes silent, it saves the input and goes back to the detect input state. Is that possible?
Hey, does anyone know how to create a SQL database with ChatGPT?
Any guidance on where my time start/ends should be on labeled audio (without me having to download some huge data set just to see a few samples)?
(blurred for privacy)
my fine tuning got down to .02x loss (not sure what loss function was used)
the best Youtube AI video search tool out there https://discord.com/channels/974519864045756446/1106744294661951508
She speaks in 1 to two words bursts. Can whisper even learn this?
I don't know how to find out except to keep possibly wasting time labeling things .. the fine tuning I did last time was garbage. Like one or two times she made a click sound with her mouth -- I labeled it as "tch".
In a subsequent test of the fine-tuned model, whisper transcribed EVERY word of hers as "tch"
How to resolve this error?
Are the large models better than medium.en if I only need english recognition?
amougus
Has anyone here tried tackling the 25MB limit in NodeJS for Whisper? If so - what's the ideal way about it?
Ask ChatGPT to write a python script to do that. Specify you want the script to include an RMS threshold to detect speech and trigger Speech2text conversion using Google local speech2text library. If this works, then try OpenAI Whisper API and compar
Why would you want to save and send such a large audio file? At least compress it? Or break it up into smaller files using RMS- thresholding to detect a silence pause.
I'll be compressing it, but I've to demonstrate video files as large as 512 MB. I'll look into RMS tho.
What true happiness looks like. 38 minute done flawlessly under medium
Seems like a weird goal from a sadistic and irrational boss. 😉 Or maybe I'm not understanding the rationality behind it. Cut it up at silence, transcribe, piece back together. The actual technique would be a slight variation on this to account for some issues it introduces, but the solutions are extremely easy to implement.
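The "cut it up at silence" step can be sketched with plain RMS thresholding. This assumes the audio is already decoded to mono float samples in the range -1..1; the 50 ms frame, the 0.01 threshold and the 0.75 s minimum silence are arbitrary starting values, and the function names are mine.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy of consecutive, non-overlapping frames."""
    out = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        out.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return out


def find_cut_points(samples, sample_rate, frame_ms=50, threshold=0.01, min_silence_s=0.75):
    """Return sample indices in the middle of each silence run >= min_silence_s."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    min_frames = math.ceil(min_silence_s * 1000 / frame_ms)
    cuts, run_start = [], None
    rms = frame_rms(samples, frame_len)
    for idx, value in enumerate(rms + [float("inf")]):  # sentinel closes a trailing run
        if value < threshold:
            if run_start is None:
                run_start = idx  # a quiet run begins here
        elif run_start is not None:
            if idx - run_start >= min_frames:
                cuts.append(((run_start + idx) // 2) * frame_len)
            run_start = None
    return cuts
```

Cutting in the middle of a pause, rather than at its edges, leaves a little silence on both sides of each piece, which helps avoid clipping the first or last word.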
What am I supposed to set for model? whisper-1 return 404-model doesn't exist
I have gotten that error from time to time. It was usually because something else was not right with the API call, especially if it didn't like the MP3 file I was sending, or I was sending the MP3 in the wrong way.
And - yes, whisper-1 is the only model you can set at this time.
Hello guys!
Any idea why Im getting this error when trying to connect to CODEGPT with my OPENAI Api Key???
Can I use my OPENAI API key with many different services all at once?
or would I have to get a new OPENAI Account?
Hi! The model chart in the Whisper GitHub repo has a "Required VRAM" column. Is that the memory needed to transcribe only one audio file at a time at a reasonable rate? If so, to process two audio files at once with the base model ("required ~1 GB VRAM") would I need ~2 GB of VRAM, and with the large model ("required ~10 GB VRAM") ~20 GB for two files at the same time, and so on? And what would the speed be?
please help me guys!!
The error message basically says you have used too much of resources, please wait. As they have limits you have hit.
Nooooo but Im getting this error ALWAYS
Ive never been able to use the codegpt extension, ever
Not sure who to ask .. so here i am .. i am trying to built a website where folks can record voice and it will transcribe and use gpt4 with prebuilt prompt to display a certain output .. which can be copied and pasted somewhere … dont have a big budget with limited coding skills … is there a github link for this .. i can see the code at? Is this possible using google colab?
how to make clock
if i deployed my own whisper model does the data (the voice coming from my user or the text generated by whisper) go to open ai by any chances?
[For Hire][Full-stack][Blockchain][Mobile][Remote][Full-time]
I am a senior full-stack and blockchain engineer with 5+ years of professional and extensive experience.
Fully understandable at the requirement in a few mins and make a perfect result which makes customers satisfied.
List of my professional skillset
💼 JavaScript(TypeScript) and its frameworks; Node.js (Express, Nest.js), React.js (Redux, Next.js), Angular (Ngrx, Rxjs, v1.0 ~ v9.0), Vue.js (Vuex, Nuxt.js)
💼 PHP (Laravel, CI), Python, Django
💼 React-Native, Flutter, Native Script
💼 Restful API, OpenAI, ChatGPT, Langchain, Pinecone, AWS
💼 Web3 technology, Smart Contract (Solidity, Rust), ERC tokens (ERC20, ERC721, ERC1155, ERC4337), Ethereum, Solana networks
I am always focusing on the product quality first and professional codebase implementing OOP at high level.
I am ready to make your dream true so feel free to contact me anytime.
Best regards
https://github.com/openai/openai-python
Does this library not have parameter for VTT support in Whisper?
I don't think so
Elaborate please
No, as long as you don't make a mistake in configuration/use and end up calling the cloud api instead.
Hey all, what is everyone using here as their computational power? Are you simply using your local environment or using a GPU cloud?
Hey guys! I've been working in the last few months in a project using whisper + gpt that I'm really excited about:
Kaption AI is a Chrome extension that transcribes and summarizes WhatsApp Web audios and chats. No more sifting through lengthy audio messages or struggling to keep pace with group chats - Kaption AI can convert audio into text and summarize long threads at the click of a button.
It's a tool that can be useful for business professionals, students, journalists, and people with hearing difficulties - anyone who wants to make their digital communication more efficient and accessible.
The development of Kaption AI was largely inspired by the groundbreaking work done by OpenAI, and the belief in the transformative potential of AI technology. It's my humble attempt to contribute to the ongoing revolution in our digital interactions.
Although I can't share a direct link here due to community guidelines, you can easily find Kaption AI on the Chrome Web Store by searching for "Kaption AI" in any browser. I would greatly appreciate it if you could give it a try and share your thoughts. Your feedback is incredibly valuable in helping me refine and enhance this tool.
Please rest assured that your privacy and security are of utmost importance. Kaption AI does not collect or store any of your messages. It adheres strictly to the latest security protocols and standards, ensuring it's secure from potential threats.
I'm curious - how do you currently handle lengthy audio messages and chats in WhatsApp? And what features would you be interested to see in future updates of Kaption AI?
Looking forward to hearing your thoughts. Thanks for your time!
Is it OSS?
Thank you for your question. While I greatly appreciate the value and contributions of open source projects, Kaption AI is currently not open source software. It's part of a commercial endeavor that I'm investing significant time and resources into. However, I'm trying to think of ways to make it transparent for everyone to see that I'm not storing people's conversations or doing anything weird. What do you suggest?
Are you using openai whisper api or have you deployed your own model?
Ive tried both. Using whisper on my own servers and using API. API is cheaper and more reliable
Hey All,
I've been testing a product I've been working on and GPU clouds are getting super expensive.
My product focuses on transcription using the OpenAI Whisper base and large models. Most GPU clouds I've researched are aimed at ML/deep learning and seem to be really only for visual/graphics/art/video editing etc. I feel as if transcription needs much less computation than something like Stable Diffusion. Does anyone here have insight into how much compute is needed for my use case?
I am seriously considering building a low-end deep learning machine.
Note: Transcription works on my PC when I run the API to my local environment
I wonder how good will Apple's on device speech recognition be compared to Whisper
Hey everyone, What are all the features that Whisper actually covers.
The info I have on the features is limited and just
Speech recognition
Speech transcription
And I know there are more features. Do anyone have an idea of all the features or like a documentation I can read.
P.S The documentation of whisper on GitHub doesn't list all features.
look at the references
im trying to find the function that stores the probabilities of each words in a set. anyone know where is it?
is it in the beam search class?
Hey guys, have you experienced a significant drop in performance after loading a HF checkpoint converted to the openai format? If I use the ggml C++ conversion tool I cannot load the fine-tuned checkpoint at all... have you had problems with fine-tuned whispers?
sorry i've been away too long @autumn bolt, it correctly re-forms the mispronounced words. I want it to be not so smart: it should spot the mispronounced words, or even the syllables, so i can give feedback to students on where they got it wrong / a little wrong
I don't think you can make it less good, you could try adding noise to the audio and experiment with results, really you need an LLM bolted to the speech to text system, then you could ask it to tell you if there was any accent... I think that is some time away though. Would be an interesting R&D project.
yep, not just accent you know. For example, with Indian / Spanish accents it's still comprehensible, but in other parts of the world people completely mispronounce words, like randomly omitting syllables, yet because the training data is so good, whisper can still convert them to the (surprisingly) correct word. This is good in most cases, but not good for teaching them English
https://cloudconvert.com/ have a decent api. Can get mp3 to 10% in many cases. Use qscale=8, ch=mono, bitrate=16
does someone know where to download different wave audio files of 1 second to train a model?
i suggest you may use audios for movies
because you have subtitles and the exact timestamps
You got any solution on this? @mossy hemlock
How can I get the word level transcriptions from whisper in node? I think there are options available in python but not able to find anything in Node JS!
Some of the output given by whisper is too long in a single timestamp.
How do i use Whisper?
We are looking for an OpenAI fine-tuning expert.
You must have experience with this.
Don't worry about your budget.
Your good skills are needed.
If you are really an expert please contact me.
Can I use my API key credits with whisper so the transcription occurs remotely instead of on my PC? I don't have a dedicated GPU so it takes soooo long to get a transcription. I would like to see if I can speed it up at the cost of some credits
What directory would I find the downloaded models (.pt files) in for Windows? I want to delete and reinstall the large library. This was the ubuntu solution: https://github.com/openai/whisper/discussions/762
Found the directory in Windows for the model library: C:\Users\{YourUserHere} \.cache\whisper
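For anyone else hunting for the models: as far as I know the open-source `whisper` package downloads to `~/.cache/whisper` unless `XDG_CACHE_HOME` is set or a `download_root` is passed to `load_model`, which matches the Windows path above. A small sketch (the helper names are mine) that lists what is in there:

```python
import os


def whisper_cache_dir():
    """Default download directory, mirroring the package's default as I understand it."""
    base = os.getenv("XDG_CACHE_HOME", os.path.join(os.path.expanduser("~"), ".cache"))
    return os.path.join(base, "whisper")


def list_models(cache_dir=None):
    """Return the .pt model files found in the cache directory, sorted by name."""
    cache_dir = cache_dir or whisper_cache_dir()
    if not os.path.isdir(cache_dir):
        return []
    return sorted(f for f in os.listdir(cache_dir) if f.endswith(".pt"))
```

Deleting a `.pt` file from that folder should make the package download it again the next time that model is loaded.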
Found the solution. Work very quickly now!
And what is your solution?
I thought getting it generated through the API was the only option.... thats how i've been doing it. I got a node.js code example if you want..
You can use one API key in different services; your error means your API key is not right.
hi, I try to use whisper api in order to make subtitles for a video. The problem is, that there are 2 languages used in the video and in the generated transcript, I get all the text in first language and all the text in second language gets translated to the first one. Is there any way to keep the languages as in the audio?
can I use openai whisper as a live speech-to-text? As in, it gets what I say from the mic and it transcribes it?
Hello im a little noob with whisper, i have a question, i need to transcribe a 9 hours of a video and i need divide in segments of 30 second, and my question its its better transcribe the 9 hours at one, or transcribe de for example 1100 files of 30 seconds?
or exist a more efficient model?
i using a R7 5700U bcs I'm not in mi home and i using my laptop
Can I have information on LangChain? Please DM me.
Any updates regarding this post on github?
I've been debating buying a mid-tier PC for Whisper, as I transcribe Arabic as a freelancer, but I transcribed some videos using Whisper with the base model and the results were such a mess that I would have been better off doing it manually.
Hey, I have been trying to get Whisper to work on a Raspberry Pi, but when I try to install it, it fails because it depends on torch? (using the API, with Python integration)
Hey there, I am looking for a way to format the .txt exports after I run the files through Whisper. Has anyone found a nice way to do it? I don't necessarily need the timestamps, but a more readable layout would be nice. For now I use an online tool for automatic line breaks, but I would like to have it in my code. I run Whisper via Google Colab.
Does Whisper support awk? I think Python has a ‘prettier’ function call.
I guess not, but I already have a small Python setup around it to auto-transcribe all my mp3s in a specific GDrive folder. Will look into awk, thanks.
Let me know if you need help. I'm fluent in Linux, AWS and some DevOps.
Anyone else get a totally random text back from whisper? Like it kinda sent the wrong response? :p
A whisper version of hallucinations?
It's very, very rare but it happens
Does anyone know how to use Whisper in Node.js?
Hello everyone, can someone please help me with my model? I'm working on an AI model that predicts whether keywords are present in audio files. I trained it on 30,000 one-second audio clips, and the training metrics look great, but I'm still getting wrong predictions. HELP PLEASE!
Can someone please help me with the error in the code below?
const filePath = path.join(__dirname, '../../', 'temp.mp3');
const formData = new FormData();
formData.append('model', model);
formData.append('file', filePath);
axios
  .post('https://api.openai.com/v1/audio/translations', formData, {
    headers: {
      Authorization: `Bearer ${process.env.OPEN_AI_KEY}`,
    },
  })
  .then((response) => {
    console.log(response.data);
  })
  .catch((err) => {
    console.log(err.response);
  });
Error:
data: {
  error: {
    message: '1 validation error for Request\n' +
      'body -> file\n' +
      " Expected UploadFile, received: <class 'str'> (type=value_error)",
    type: 'invalid_request_error',
    param: null,
    code: null
  }
}
Can you please help me resolve this error?
const fs = require('fs');
const path = require('path');
const FormData = require('form-data');
const axios = require('axios');
// `model` must be defined before this point, e.g. 'whisper-1'
const filePath = path.join(__dirname, '../../', 'temp.mp3');
const formData = new FormData();
formData.append('model', model);
// Read the file as a stream so the API receives an upload, not a path string
const fileStream = fs.createReadStream(filePath);
formData.append('file', fileStream);
axios
  .post('https://api.openai.com/v1/audio/translations', formData, {
    headers: {
      Authorization: `Bearer ${process.env.OPEN_AI_KEY}`,
      ...formData.getHeaders(), // Include the multipart headers (boundary) for FormData
    },
  })
  .then((response) => {
    console.log(response.data);
  })
  .catch((err) => {
    console.log(err.response);
  });
Replace 'model' with the appropriate value for your translation model, and make sure the temp.mp3 file exists in the correct location.
Thanks @paper schooner
I know many people here splice and feed input to Whisper for longer videos / audios - but how do you deal with the subtitles being separate for each chunk? How do you combine them properly so they flow well with the final video you transcribed for?
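One way this could be handled, as a rough sketch rather than anything official: shift every segment by the start time of the chunk it came from, then renumber everything into one SRT. The input shape here (`offsetSeconds` plus a `segments` array with `start`, `end`, `text` in seconds relative to the chunk) is an assumption for illustration; `mergeChunks`, `toSrtTime` and `toSrt` are made-up helper names.

```javascript
// Sketch: combine per-chunk segments into one subtitle track.
// Assumed (hypothetical) input shape per chunk:
//   { offsetSeconds: number, segments: [{ start, end, text }] }
// where start/end are seconds measured from the start of that chunk.

// Format seconds as an SRT timestamp, HH:MM:SS,mmm.
function toSrtTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  const rest = ms % 1000;
  const pad = (n, width) => String(n).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(rest, 3)}`;
}

// Shift each segment by its chunk's offset so times refer to the full video.
function mergeChunks(chunks) {
  const merged = [];
  const ordered = [...chunks].sort((a, b) => a.offsetSeconds - b.offsetSeconds);
  for (const chunk of ordered) {
    for (const seg of chunk.segments) {
      merged.push({
        start: seg.start + chunk.offsetSeconds,
        end: seg.end + chunk.offsetSeconds,
        text: seg.text.trim(),
      });
    }
  }
  return merged;
}

// Renumber the merged segments and render them as SRT text.
function toSrt(segments) {
  return segments
    .map((seg, i) => `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text}\n`)
    .join('\n');
}

const chunks = [
  { offsetSeconds: 600, segments: [{ start: 0, end: 2.5, text: ' second chunk' }] },
  { offsetSeconds: 0, segments: [{ start: 1, end: 4, text: ' first chunk' }] },
];
console.log(toSrt(mergeChunks(chunks)));
```

This only fixes the timing. If a sentence is cut in half at a chunk boundary it will still read as two subtitles, so cutting on silences tends to help with flow.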
What's the best Whisper model to quickly and accurately transcribe long-form audio?
Does anything like Whisper exist that can transcribe an audio file in an unknown language directly into phonetic notation, something like X-SAMPA or IPA? Or maybe something simpler that only recognizes which language is spoken in a given audio file?
I have a use case which involves creating transcriptions of long audio files (about ten minutes each) which include unpredictable, brief events of loud, non-speech audio, usually no more than 30 to 90 seconds in length. Whisper seems to stop transcribing at the first of these (usually with hallucinations), so most of my results are only the first half of the audio.
Is there any way around this? I would like to continue using the hosted API as opposed to running the open-source Whisper, though I understand it can be made more tolerant of this situation. Preprocessing is quite difficult as the audio levels are basically constant, but I'm open to ideas. The best one I have at the moment is to split the files based on brief silences which imply sentence breaks, but sometimes speech briefly overlaps with the non-speech sound, so I lose content.
I'm playing around with Whisper and currently using Staplerfahrer Klaus from YouTube. I'm running Whisper locally, but have run it at least once using the API. The transcription from the API seems to be good. Speech-to-text is in beta, and the API version is limited in what options can be tweaked compared to running it locally. You don't happen to have a sample of a file that fails?
Hi, is it possible to change the sliding window interval from 30 seconds to something smaller?
Not exactly, though it seems I can get the same result from the latest DankPods YouTube video. Transcription ends around 5:30.
birthday wish for a work colleague
Has anyone faced timeouts when calling the Whisper API?
Locally it works great, but when my server is deployed to a remote environment (e.g. AWS) I get consistent timeouts.
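No idea what is causing it on your side, so this is only a generic mitigation sketch, not a fix: wrap the request in a retry with exponential backoff. `backoffDelay` and `withRetry` are made-up helper names, and the defaults (3 retries, 1 s base delay, 30 s cap) are arbitrary assumptions.

```javascript
// Sketch: retry a promise-returning call with exponential backoff.
// Helper names and default values are assumptions, not part of any SDK.

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Calls `fn` until it resolves or the retries are used up, then rethrows the last error.
async function withRetry(fn, { retries = 3, baseMs = 1000, maxMs = 30000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === retries) break;
      // Wait before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt, baseMs, maxMs)));
    }
  }
  throw lastError;
}
```

If you pass it an upload, build the form data inside the function you hand to `withRetry`, since a file stream that has already been read cannot be sent again. It is also worth checking whether your HTTP client or a proxy in front of the server has a shorter timeout than the upload needs.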
Hi, is it possible to integrate Whisper into an Expo React Native application to get a real-time transcript?
@small juniper Hi, you seem to be quite knowledgeable about Whisper, and I saw your emoji response to my question. If you have any insights or thoughts, I'd be happy to hear them!
Maybe try Adobe Enhance Speech to clean up your audio?
Btw, I've experimented a bit and now have Python code that gives me at least a basic format. It's enough to read the .txt quickly and transfer any information into my Obsidian vault.