#gpt-realtime

restive reef
#

Just don't use Windows 10 for Whisper.

If you did get it working, I'm interested in how.

unreal kraken
#

I just followed this tutorial and everything was working until recently
https://www.youtube.com/watch?v=XX-ET_-onYU&t=488s

OpenAI has done some fantastic things. Whisper is a great project open to the public. Transcribe (Turn audio into text) for MANY languages, all completely for free and all from your computer. No subscription, no fees, nothing. Your data, your computer, your free unlimited transcription.

Whisper AI install guide: https://hub.tcno.co/ai/whisper/i...

restive reef
#

It looks helpful. I wanted to try out TensorFlow anyway.

unreal kraken
#

Yesterday I tried uninstalling everything related to Whisper and then ran this command in PowerShell: iex (irm whisper.tc.ht), but it got worse...

#

I will try redoing the whole installation.

#

Btw, are you using Python 3.9.9 or the latest?

restive reef
#

My Whisper is not in the package list.

restive reef
# unreal kraken Yesterday I tried uninstall all Whisper and I used command in powershell: iex (i...

I would try it with an absolute path if I were you.
For example, I first added whisper to the PATH.
Then I wrote a script with the help of ChatGPT.

Example .sh file:

#!/bin/bash

# First and last file numbers to process
nummer=220
endnummer=233

pfad="/home/Nutzername/Dokumente/whisper_using/jenny/22_folgend/"

model="base"
output_format="vtt"

# Loop through the numbers
while [ "$nummer" -le "$endnummer" ]
do
    # Build the file name
    datei="${pfad}${nummer}.mp3"

    # Run the command
    echo "Running whisper for $datei"
    whisper "$datei" --model "$model" --output_format "$output_format" --output_dir "$pfad"

    # Increment the number for the next iteration
    nummer=$((nummer + 1))
done

whisper /home/Nutzername/Dokumente/whisper_using/jenny/12_jenny.mp3 --model base --output_format vtt

unreal kraken
#

@restive reef
YES!!!
I have it
It works
I can't have any spaces in the file names
and I have to run it from the Downloads folder
Thanks for help ❤️
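
A minimal sketch of one way to avoid the "no spaces in file names" restriction, assuming the `whisper` command is on the PATH as described above. The helper names `build_whisper_command` and `transcribe_folder` are hypothetical; only the `--model`, `--output_format` and `--output_dir` flags come from the script in this thread. Passing the arguments as a list means no shell ever splits the path on spaces.

```python
import subprocess
from pathlib import Path


def build_whisper_command(audio_path, model="base", output_format="vtt"):
    """Build the argument list for one `whisper` CLI call.

    Passing arguments as a list (no shell) keeps a path such as
    "my lecture 01.mp3" intact as a single argument.
    """
    audio_path = Path(audio_path)
    return [
        "whisper",
        str(audio_path),
        "--model", model,
        "--output_format", output_format,
        "--output_dir", str(audio_path.parent),
    ]


def transcribe_folder(folder, pattern="*.mp3"):
    """Run the whisper CLI once per matching file in `folder`."""
    for audio_path in sorted(Path(folder).glob(pattern)):
        subprocess.run(build_whisper_command(audio_path), check=True)
```

Calling `transcribe_folder("/path/to/folder")` would then process every mp3 in that folder, wherever it is located.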

dense jetty
#

Deez

drifting echo
#

wondering if there's a way to use whisper to transcribe only the primary speaker in a file? Imagine something really long like a lecture with questions from the audience at the end. Primary speaker has way more presence. I'd like to transcribe just the speaker.

anything open in the space capable of doing that?

gilded oasis
#

Hello all. I have created a python whisper app for my company to transcribe videos. Right now it works great with given videos on the server, however, I would like the option to re-process specific parts of the videos where the transcription gave no results or hallucinations, for example, in a 1hour long video, I would like to be able to say 'reprocess from 00:05:15 to 00:07:20'.
Is there some way I can give these parameters directly to whisper?
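
One possible approach with the open-source `whisper` package, sketched under assumptions: load the audio once, slice out the requested range by sample index, transcribe only that slice, then shift the timestamps back to video time. The helper names (`hms_to_seconds`, `sample_range`, `reprocess`) are hypothetical. Recent openai-whisper releases also appear to accept a `clip_timestamps` argument to `transcribe()`, which is worth checking in your installed version.

```python
SAMPLE_RATE = 16000  # whisper.load_audio resamples to 16 kHz mono


def hms_to_seconds(stamp):
    """Convert 'HH:MM:SS' (or 'MM:SS') to seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def sample_range(start, end, sample_rate=SAMPLE_RATE):
    """Return (first_sample, last_sample_exclusive) for a time range."""
    first = int(hms_to_seconds(start) * sample_rate)
    last = int(hms_to_seconds(end) * sample_rate)
    if last <= first:
        raise ValueError("end must be after start")
    return first, last


def reprocess(model, path, start, end):
    """Transcribe only [start, end] and shift timestamps back to video time."""
    import whisper  # third-party, imported lazily so the helpers work without it

    audio = whisper.load_audio(path)
    first, last = sample_range(start, end)
    result = model.transcribe(audio[first:last])
    offset = hms_to_seconds(start)
    for segment in result["segments"]:
        segment["start"] += offset
        segment["end"] += offset
    return result
```

Usage would look like `reprocess(model, "video.mp4", "00:05:15", "00:07:20")`, where `model` comes from `whisper.load_model(...)`.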

thick agate
restive reef
dense jetty
#

E

carmine scroll
#

Hi, I have an error with my code.

Here is my code:

const fs = require("fs");
const OpenAI = require("openai");

const openai = new OpenAI({ apiKey: "my-openai-api" });

async function transcription() {
  console.log("Starting transcription...");
  const transcription = await openai.audio.transcriptions.create({
    // Backticks are required for ${time} to be interpolated into the file name
    file: fs.createReadStream(`audio_${time}.mp3`),
    model: "whisper-1",
  });
  console.log("Transcription finished...");
  console.log(transcription.text);
}

But I get this error:

Error: APIConnectionError: Connection error.
    at OpenAI.makeRequest (C:\Users\Eleve\Documents\Videos\node_modules\openai\core.js:292:19)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async transcription (C:\Users\Eleve\Documents\Videos\index.js:157:25)
    at async C:\Users\Eleve\Documents\Videos\index.js:74:5 {
  status: undefined,
  headers: undefined,
  request_id: undefined,
  error: undefined,
  code: undefined,
  param: undefined,
  type: undefined,
  cause: FetchError: request to https://api.openai.com/v1/audio/transcriptions failed, reason: read ECONNRESET
      at ClientRequest.<anonymous> (C:\Users\Eleve\Documents\Videos\node_modules\node-fetch\lib\index.js:1501:11)
      at ClientRequest.emit (node:events:519:28)
      at TLSSocket.socketErrorListener (node:_http_client:492:9)
      at TLSSocket.emit (node:events:531:35)
      at emitErrorNT (node:internal/streams/destroy:169:8)
      at emitErrorCloseNT (node:internal/streams/destroy:128:3)
      at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
    type: 'system',
    errno: 'ECONNRESET',
    code: 'ECONNRESET'
  }
}
#

I'm on the free trial API, I don't know if that's the cause

candid siren
#

Friends in the community, we ran into a question while using Whisper and wanted to ask whether anyone has experience with it. At what decibel level should the speaker's voice be kept for Whisper to achieve its best WER?

hazy grotto
karmic raft
carmine scroll
#

sad... Thanks for the answer

fallow narwhal
#

Hello everyone
I'm currently working on specializing Whisper for French railway language. I'm facing some issues with transcribing ambiguous words, as well as recognizing station names. Initially, I tried training it with audio files totaling 2 hours, but the results didn't meet my expectations. I then turned to using prompts, which solved the ambiguity problem; however, since the prompt is limited to 224 tokens, I can't include all the station names.

Could you please provide me with some tips? I'm new to this field.
Thank you

brave kestrel
#

I haven't used Whisper offline much, but the one time I did, it didn't perform well out of the box. The API has pretty good results but still has the "hallucinations".

fallow narwhal
upbeat totem
#

How are you guys using the Whisper API to create subtitles for 2-hour videos?

rapid spindle
#

Has anyone tried Whisper v3? Does it significantly improve on v2 or not?

ruby agate
#

Guess open source whisper is a dead end, glad openai folded it into their proprietary models

loud plinth
#

I'd like to know if there will be enhancements to Whisper with SSML or a similar markup which will help with enunciation, pronunciation, accentuation, tone, and other features noted today in the presentation.

brave kestrel
fair pecan
#

Has anyone noticed this with whisper?
When there is no audio input, it sometimes outputs something like "welcome to chatgpt" or "thank you for watching"

#

It also output this during a real-time chat with ChatGPT:

#

This transcript was recorded on October 12, 2021. Thank you for participating in this webinar.
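
A hedged sketch of one common mitigation for these silence hallucinations: drop segments that the model itself flags as probably not speech. It assumes each segment is a dict carrying the `no_speech_prob` and `avg_logprob` fields that openai-whisper attaches (the API's `verbose_json` format returns similar fields). The function name and the default thresholds are illustrative choices, not an official recipe.

```python
def drop_silent_segments(segments, no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Keep only segments that look like real speech.

    A segment is treated as silence when the model is fairly sure there
    was no speech AND it was not confident in the text it produced.
    """
    kept = []
    for segment in segments:
        likely_silence = (
            segment["no_speech_prob"] > no_speech_threshold
            and segment["avg_logprob"] < logprob_threshold
        )
        if not likely_silence:
            kept.append(segment)
    return kept
```

Running a voice activity detector before transcription is another option people use, at the cost of an extra dependency.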

static sage
#

Hi,

I’m using the OpenAI API and trying to get short segments (~4 words) with timings and punctuation.
What I went through:

  • The API doesn’t allow setting the number of words per segment
  • I thought I could build segments from the word-level transcription → there is no punctuation there, and characters like - and ' are handled inconsistently
  • I thought I could merge text or segments with words (punctuation from text, timing from words)

Until I noticed a few things between text/segments and words:

  • The text might differ: words can contain a word that does not exist at all in text/segments
  • Timestamps are badly mismatched between words and segments
  • There is no punctuation in words
  • Words containing ' or - like « it’s » are treated as one word in some languages and as two words in others.
    This makes merging segments and words difficult, since the two sides don't have the same number of words and the rules for specific characters differ by language.

Did anyone succeed in getting a word-level transcription with punctuation and word-level timestamps from the API, or short segments?

Thank you
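
One way the merge described above could be attempted is a fuzzy alignment rather than a one-to-one zip, so that extra or missing words on either side do not shift everything after them. This is only a sketch using the standard library's `difflib`. The function names and the `{"word", "start", "end"}` dict shape are assumptions, and tokens without a match are left with `None` timings for the caller to interpolate.

```python
import difflib
import re


def normalize(token):
    """Lowercase and strip punctuation so both sides compare equal."""
    return re.sub(r"[^\w]", "", token.lower())


def attach_timings(text, words):
    """Pair punctuated tokens from `text` with timings from `words`.

    `words` is a list of {"word", "start", "end"} dicts. Tokens with no
    matching word keep start/end of None.
    """
    tokens = text.split()
    matcher = difflib.SequenceMatcher(
        a=[normalize(t) for t in tokens],
        b=[normalize(w["word"]) for w in words],
        autojunk=False,
    )
    timed = [{"token": t, "start": None, "end": None} for t in tokens]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            word = words[block.b + k]
            timed[block.a + k]["start"] = word["start"]
            timed[block.a + k]["end"] = word["end"]
    return timed
```

Short segments of about four words can then be cut from the timed token list. Languages that split words like « it’s » differently would need their own `normalize` rule.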

vocal veldt
loud plinth
#

Servers are having load issues now. All of the services are getting that in one way or another.

rough sequoia
#

it would be nice to have a date of when the voice engine (or new whisper) is going to be released

crystal yew
#

Hello there !
I am Loïc Founder of the french startup Callifly. I'm looking for our french speaking future CTO to complete our Core Team (possibility of equity entry). Already made up of 4 members (company of 8 people all remotely) I am looking for an AI expert to automate our solution.
We specialize in recovering abandoned shopping carts over the phone for e-retailers. More generally, we want to become the voice of e-retailers by selling, advising and supporting customers in their purchasing journey.
To start, I would like to work on a demonstrator (MVP type) and offer it to our clients for simple campaigns (Promo Code proposal for example). They are already excited about the idea.
First step, meet and discuss. 🙂
Who feels up to the challenge? Dm me.

alpine timber
#

Simple question about Whisper. Does it pad audios shorter than 30s to 30s?

urban parcel
#

ॐ गं गणपतये नमः – Language bot test

#

चिद्धर्मा सर्वदेहेषु विशेषो नास्ति कुत्रचित् ।
अतश्च तन्मयं सर्वं भावयन्भवजिज्जनः ॥ १०० ॥

#

𑆃𑆤𑆴𑆫𑆾𑆣𑆩𑇀 𑆃𑆤𑆶𑆠𑇀𑆥𑆳𑆢𑆩𑇀 𑆃𑆤𑆶𑆖𑇀𑆗𑆼𑆢𑆩𑇀 𑆃𑆯𑆳𑆯𑇀𑆮𑆠𑆩𑇀 𑇅
𑆃𑆤𑆼𑆑𑆳𑆫𑇀𑆡𑆩𑇀 𑆃𑆤𑆳𑆤𑆳𑆫𑇀𑆡𑆩𑇀 𑆃𑆤𑆳𑆓𑆩𑆩𑇀 𑆃𑆤𑆴𑆫𑇀𑆓𑆩𑆩𑇀 𑇆
𑆪𑆂 𑆥𑇀𑆫𑆠𑆵𑆠𑇀𑆪𑆱𑆩𑆶𑆠𑇀𑆥𑆳𑆢𑆁 𑆥𑇀𑆫𑆥𑆚𑇀𑆖𑆾𑆥𑆯𑆩𑆁 𑆯𑆴𑆮𑆩𑇀 𑇅
𑆢𑆼𑆯𑆪𑆳𑆩𑆳𑆱 𑆱𑆁𑆧𑆶𑆢𑇀𑆣𑆱𑇀𑆠𑆁 𑆮𑆤𑇀𑆢𑆼 𑆮𑆢𑆠𑆳𑆁 𑆮𑆫𑆩𑇀 𑇆

#

𑆃𑆤𑆴𑆫𑆾𑆣𑆩𑇀 𑆃𑆤𑆶𑆠𑇀𑆥𑆳𑆢𑆩𑇀 𑆃𑆤𑆶𑆖𑇀𑆗𑆼𑆢𑆩𑇀 𑆃𑆯𑆳𑆯𑇀𑆮𑆠𑆩𑇀 𑇅
𑆃𑆤𑆼𑆑𑆳𑆫𑇀𑆡𑆩𑇀 𑆃𑆤𑆳𑆤𑆳𑆫𑇀𑆡𑆩𑇀 𑆃𑆤𑆳𑆓𑆩𑆩𑇀 𑆃𑆤𑆴𑆫𑇀𑆓𑆩𑆩𑇀 𑇆
𑆪𑆂 𑆥𑇀𑆫𑆠𑆵𑆠𑇀𑆪𑆱𑆩𑆶𑆠𑇀𑆥𑆳𑆢𑆁 𑆥𑇀𑆫𑆥𑆚𑇀𑆖𑆾𑆥𑆯𑆩𑆁 𑆯𑆴𑆮𑆩𑇀 𑇅
𑆢𑆼𑆯𑆪𑆳𑆩𑆳𑆱 𑆱𑆁𑆧𑆶𑆢𑇀𑆣𑆱𑇀𑆠𑆁 𑆮𑆤𑇀𑆢𑆼 𑆮𑆢𑆠𑆳𑆁 𑆮𑆫𑆩𑇀 𑇆

#

anirodham anutpādam anucchedam aśāśvatam .
anekārtham anānārtham anāgamam anirgamam ..
yaḥ pratītyasamutpādaṃ prapañcopaśamaṃ śivam .
deśayāmāsa saṃbuddhastaṃ vande vadatāṃ varam ..

urban parcel
#

𑆪𑆼 𑆣𑆩𑇀𑆩𑆳 𑆲𑆼𑆠𑆶𑆥𑇀𑆥𑆨𑆮𑆳 𑆠𑆼𑆱𑆁 𑆲𑆼𑆠𑆶𑆁 𑆠𑆡𑆳𑆓𑆠𑆾 𑆄𑆲 𑇅
𑆠𑆼𑆱𑆚𑇀𑆖 𑆪𑆾 𑆤𑆴𑆫𑆾𑆣𑆾 𑆍𑆮𑆁 𑆮𑆳𑆢𑆵 𑆩𑆲𑆳𑆱𑆩𑆟𑆾 𑇆

amber schooner
#

hey, Rule 4 is no spamming, you are gonna get kicked for posting that in multiple channels if you aren't careful @arctic plover

steep shadow
#

Hello. I don't know why, but for almost 2 weeks now the Whisper API no longer seems able to transcribe lyrics from songs, when it was the best at this a few weeks ago... Why???

#

I get this on a song like Paradise by Coldplay, whose lyrics it was able to transcribe clearly 2 weeks ago 😕

steep shadow
#

Can someone at OpenAI please explain to us what happened??

weary skiff
#

Hey, does anyone here have experience with insanely-fast-whisper? I've managed to get it working with flash attention 2, but to achieve the best final results, I need to utilize Spacy and the PunctuationModel for optimal quality. However, I'm encountering an issue with obtaining word-level timestamps to complete this task. Any ideas on how to address this?

bright flax
#

So it seems OpenAI is dynamically changing the amount of usage we get with a plus subscription even after we have subscribed... I'm curious how this is okay? You can't change the terms of a contract after both parties have agreed to it?

#

4o has changed from 50 prompts to 40 prompts per 3 hours, and these limits are not mentioned at all when subscribing to Plus. Not trying to be rude, just curious what a dev would have to say about this?

nimble pumice
#

i just wanna speak to the new voice

dapper bobcat
#

.

nimble pumice
bright flax
dapper bobcat
gilded fiber
#

I am willing to pay for someone to help get the Whisper API working for iOS and macOS
It only gets the first few seconds when used on mobile iOS
Please DM ❤️ I am so tired

neat cedar
gilded fiber
#

Yes @neat cedar I am

#

The problem is with the Audio file itself

#

I found a workaround using CloudConvert, but my application needs a straight-through change and I really don’t want to use FFmpeg

neat cedar
#

I'm going to guess you're streaming audio directly to the Whisper API and you're probably only transcribing the first segment?

gilded fiber
#

@neat cedar So I am using an audio file

#

It will pick up the first few seconds then poof

#

like it grabs the first word then breaks

#

I vow to open source a fix

#

Pls @ me ❤️ actively working on this

gilded fiber
#

I figured it out

gilded fiber
#

Ok so fluent-ffmpeg

#

IF YOU GOT STUCK WHERE I GOT STUCK

YOU NEED TO USE fluent-ffmpeg

SO YOU CAN USE Whisper IOS, Whisper on IOS, Whisper Mobile, Whisper Mac OS (Tags for those that get stuck)

Convert .wav -> .mp3, then process

No, do not set up the recording as mp3 from the get-go; for some reason you need to convert it into that format, and then the audio file will process fine

FOR DOCKER USERS

RUN apt-get update && apt-get install -y ffmpeg

ALTERNATIVELY, USE CLOUDCONVERT'S API TO DO THE AUDIO FORMAT CONVERSION

const ffmpeg = require('fluent-ffmpeg');

function convertWavToMp3(inputFile, outputFile) {
    // Set the PATH environment variable to include FFmpeg bin before conversion
    process.env.PATH = `C:\\FFmpeg\\bin;${process.env.PATH}`;

    ffmpeg(inputFile)
        .toFormat('mp3')
        .on('end', () => {
            console.log('File has been converted successfully');
        })
        .on('error', (err) => {
            console.error('An error occurred: ' + err.message);
        })
        .saveToFile(outputFile);
}

// Example usage
convertWavToMp3('audio.wav', 'output.mp3');

still moon
#

lame foo.wav 🙂

#

(the requirement "straight through change" is not exactly clear, imo) @gilded fiber

gilded fiber
#

@still moon wym

gilded fiber
#

Or conversion

still moon
#

i was just throwing 'lame' out there.. i use the binary program from cli

#

but not for whisper.. haven't had the issue you're having with wav

#

@gilded fiber i wonder why yours stops.. wav file format issue? max size issue?

gilded fiber
#

It’s all formats

#

I just figured out the best way to fix it just to convert file type

#

Works from wav to mp3

still moon
#

i wrote (cgpt'ed up) a script to split my files into N mb chunks..

#

haven't yet done the part where re-writing timestamp offsets is done for reassembly though [haven't needed to]
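
For the reassembly step mentioned here, a minimal sketch of rewriting the timestamp offsets. It assumes each chunk was transcribed separately and that you know where each chunk starts in the original file. `merge_chunk_segments` is a hypothetical name, and the segment dicts mimic the `start`/`end`/`text` fields Whisper returns. This is also one way to handle long videos with the hosted API, which limits upload size (25 MB at the time of writing, as far as I know).

```python
def merge_chunk_segments(chunks):
    """Merge per-chunk segments into one timeline.

    `chunks` is a list of (chunk_start_seconds, segments) pairs, where each
    segment is a dict with "start", "end" and "text" relative to its chunk.
    """
    merged = []
    for chunk_start, segments in sorted(chunks, key=lambda pair: pair[0]):
        for segment in segments:
            merged.append({
                "start": segment["start"] + chunk_start,
                "end": segment["end"] + chunk_start,
                "text": segment["text"],
            })
    return merged
```

Cutting chunks at silences rather than at fixed byte sizes tends to avoid splitting a word across two chunks.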

#

from the sound of it, it still could be a size issue you hit? (mp3 being much smaller than wav)

rare summit
#

Hello, I am new to OpenAI and am trying to use Whisper for speech-to-text. I've been trying to use it on the Magic Leap, and I'm running into some target architecture issues. It seems Whisper requires arm64 but the Magic Leap requires x86_64. Does anyone know if I can re-target Whisper or if there's any way I can still use it? Or are there other speech-to-text tools I can use?

lofty aurora
rare summit
# lofty aurora

Not helpful—if you don’t think the question is worthy of answering please don’t reply

lofty aurora
brave kestrel
#

If you're the one building stuff chatgpt almost always will steer you wrong on the details. It's not bad if you already know how to make something. maybe also if you're brain storming without needing a viable program

brave kestrel
#

you can def run whisper on x86_64. I've done it on my system. can you tell me more about your setup

rare summit
brave kestrel
#

wut

#

x86_64 is the standard processor architecture on intel and amds

#

windows is an operating system that is often x86_64, but sometimes they have an arm version

#

android is almost always arm64

#

I might be missing something

rare summit
#

I think the issue is that I have to use Android platform to build to the headset and specifically if I use Android I have to use arm64

brave kestrel
#

gotcha

#

Have you looked at the android's apis for voice

rare summit
#

But also I’m really new to this so I could be confused about something else or maybe there’s a better way to do this

#

Not yet—I’ll look into it. Thanks for your help!

brave kestrel
#

you could probably find a way to run whisper on arm64 but depending on the hardware it may not run great

#

there's always the openai whisper api but not sure if that's going to work for your implementation

rare summit
#

Thanks I’ll try those out!

opaque willow
#

model = WhisperModel('tiny', compute_type="int8")
segments, _ = model.transcribe("input.wav")
text = ''.join(segment.text for segment in segments)
return text

I am trying to use the Whisper model on CPU, and I think the code I have written should work fine on CPU, but it says that it needs cuDNN to work. How can I fix this without needing cuDNN?

#

Tag me if you answer, thanks
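
A possible fix, sketched on the assumption that the snippet above uses the faster-whisper package: `WhisperModel` takes a `device` argument that defaults to automatic selection, so on a machine with a visible GPU it may try CUDA even when `compute_type="int8"` is set. Pinning `device="cpu"` should avoid the cuDNN requirement. The wrapper name `transcribe_on_cpu` is hypothetical.

```python
def transcribe_on_cpu(path, model_size="tiny"):
    """Transcribe `path` with faster-whisper pinned to the CPU."""
    # Third-party dependency (pip install faster-whisper), imported lazily.
    from faster_whisper import WhisperModel

    # device="cpu" overrides the default automatic device selection,
    # which may pick CUDA (and then need cuDNN) when a GPU is visible.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path)
    return "".join(segment.text for segment in segments)
```

For example, `print(transcribe_on_cpu("input.wav"))`.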

gilded fiber
clever blaze
#

whats this channel for

livid mauve
gentle furnace
#

Does anyone know a Windows app for speech-to-text like the built-in one (Win+H), but with an option to use an external speech-to-text API like Whisper?

I want the transcription to go into any text field immediately, without copy and paste.

static stirrup
#

tldr: whisper won't use CUDA despite torch detecting the graphics card and CUDA in the same file

Soooo I have Cuda installed

and these in my venv:

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install -U openai-whisper

My code checks if cuda is available and it prints yes and my GPU:

import torch
import whisper

print(torch.cuda.is_available())

if torch.cuda.is_available():
    print(f"CUDA is available. Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA is not available. Using CPU.")

model = whisper.load_model("medium")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("test.opus")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)

(venv) ➜  mssp-transcript python test.py
True
CUDA is available. Using GPU: NVIDIA GeForce GTX 1070

But it uses my CPU???? The source code of the whisper module also checks torch.cuda.is_available() and runs on CUDA if it is available, so I have NO idea why it's running on the CPU

lunar mirage
#

I think hamburgers and cheeseburgers taste great when eaten with french fries

lone island
#

I have a Korean show that has English subtitles files, but I would like to generate Korean subtitles using whisper.

Is there a simple way to use the english subtitle timings as a guide for the whisper generated transcription subtitles?

I would like to have the generated native Korean subtitles match the existing timings of the English subtitles.

#

I have ideas for how to hack together some stuff but I'd like to check first if there's an easy way
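
A rough sketch of one such hack, under assumptions: transcribe the Korean audio with word-level timestamps, read the cue timings from the English .srt, and put each Korean word into the cue whose time range contains the word's midpoint. The function names are hypothetical, and the `{"word", "start", "end"}` dict shape mirrors what Whisper returns when word timestamps are enabled.

```python
import re

CUE_TIME = re.compile(
    r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)"
)


def parse_srt_timings(srt_text):
    """Return [(start_seconds, end_seconds), ...] for every cue in an SRT file."""
    timings = []
    for match in CUE_TIME.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        timings.append((start, end))
    return timings


def words_per_cue(timings, words):
    """Assign each timed word to the cue containing its midpoint."""
    cues = [[] for _ in timings]
    for word in words:
        midpoint = (word["start"] + word["end"]) / 2
        for index, (start, end) in enumerate(timings):
            if start <= midpoint <= end:
                cues[index].append(word["word"])
                break
    return ["".join(cue).strip() for cue in cues]
```

Words that fall between cues are dropped in this sketch. Attaching them to the nearest cue instead would be a reasonable refinement.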

vocal veldt
outer musk
outer musk
halcyon trail
#

Hello, I'm looking for a TTS model that can handle Caribbean Creole or Indian Ocean Creole.

torpid pine
#

Shhh ya'll too loud

glad seal
brave kestrel
#

Sweet

dapper bridge
#

Hey - I have a problem with my subtitles. I use the large-v2 model, and whenever there's a pause in the audio (no one is talking), the timecodes desync.
It's really annoying to manually resync them, especially on videos that are 30+ minutes long. Did any of you figure out what to do in that situation? Some kind of fix?

torn bramble
dapper bridge
#

Okay, to be honest - I've found another solution. I switched to WhisperX, set my model to large-v3 and align_model to WAV2VEC2_ASR_LARGE_LV60K_960H, and it works wonders. It's much better than before.
BUT - for some reason I can't get it to work with CUDA (float16); it only works on CPU (int8), and it's slow as hell. I have no idea why it doesn't want to use the GPU.

worn vector
#

Hey, so I have some code meant to put subtitles on a video, but I only want the subtitles to be very short (e.g. about 9 characters). This is my Python code using Whisper. It works perfectly fine, except that my max_line_width and max_line_count parameters aren't respected (the subtitles are still multi-line and much longer than I would like). Does anyone know why?

def generate_subtitles(audio_path):
    subtitle_filename = audio_path.replace('.mp3', '.srt')
    # max_line_width = 9
    # max_line_count = 1

    model = whisper.load_model("base")
    result = model.transcribe(audio_path)

    srt_writer = WriteSRT(".")
    srt_writer.write_result(
        result,
        file=open(os.path.join(".", os.path.splitext(os.path.basename(audio_path))[0] + ".srt"), "w",
                  encoding="utf-8"),
        options={"max_line_width": 9, "max_line_count": 1, "highlight_words": False}   # Why aren't the options working?
    )
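
A likely explanation, offered as a sketch rather than a confirmed diagnosis: as far as I can tell from the writer code in `whisper/utils.py`, `max_line_width` and `max_line_count` are only applied when the result contains word-level timestamps, which means calling `transcribe` with `word_timestamps=True`. The helper `srt_path_for` is a hypothetical name; the other calls mirror the code above.

```python
import os


def srt_path_for(audio_path, output_dir="."):
    """Return '<output_dir>/<audio basename>.srt'."""
    stem = os.path.splitext(os.path.basename(audio_path))[0]
    return os.path.join(output_dir, stem + ".srt")


def generate_subtitles(audio_path, max_line_width=9, max_line_count=1):
    """Write an SRT file with short lines next to the current directory."""
    import whisper  # third-party, imported lazily
    from whisper.utils import WriteSRT

    model = whisper.load_model("base")
    # Word-level timestamps are what the writer uses to re-split lines.
    result = model.transcribe(audio_path, word_timestamps=True)

    options = {
        "max_line_width": max_line_width,
        "max_line_count": max_line_count,
        "highlight_words": False,
    }
    with open(srt_path_for(audio_path), "w", encoding="utf-8") as srt_file:
        WriteSRT(".").write_result(result, file=srt_file, options=options)
```

The `with` block also closes the file handle, which the inline `open(...)` call leaves open.
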
agile niche
#

Hi guys 👋
I’m a newbie here. I’m currently using the Whisper transcription API to transcribe audio files within a Python app.
My question is: is it mandatory to read the audio file before sending it to the transcription API? I’m worried about RAM, since I may have a lot of concurrent requests to handle at 9 MB per audio file, and my app is hosted on Heroku with only 512 MB of RAM.

#

Is there a way to send the transcription request without loading the audio file into RAM?

agile niche
#

Does this approach risk memory exhaustion?

dapper bridge
agile niche
#

Would this approach be correct?

async with aiofiles.open(audio_file_path, 'wb') as f:
    while True:
        chunk = await response.content.read(8192)
        if not chunk:
            break
        await f.write(chunk)

transcription_response = await client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file_path
)

agile niche
#

Anyone ?

tender rivet
#

the API request has to contain the audio file, so it will be loaded in memory at some point even if you decide not to use the lib and do the HTTP request yourself

agile niche
lofty aurora
agile niche
tender rivet
#

even if it is all being done on the cloud, the machine will need some good amount of IO to keep uploading all of that consistently

agile niche
#

Yup makes sense, I guess my temporary solution is to define a limit of concurrent calls to 10 for example to lower my expectations
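
The concurrency cap mentioned here can be sketched with `asyncio.Semaphore`. `run_limited` is a hypothetical helper, and each job is assumed to be a zero-argument coroutine function wrapping one transcription request. With files of about 9 MB and a limit of 10, roughly 90 MB of audio would be in flight at most, plus library overhead.

```python
import asyncio


async def run_limited(jobs, limit=10):
    """Run coroutine functions with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        # Wait for a free slot before starting the request.
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Results come back in the same order as `jobs`. A real queue with a worker process is still the sturdier option if requests can pile up faster than they complete.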

#

Ram is extremely expensive on Heroku, my instance only has 512MB lol

tender rivet
#

for sure you will need a queueing process to be able to control that

eager ember
#

There is also a Speech to Text API that works from a local file. You can pick a file from the device and of course use it as an API
https://rapidapi.com/swift-api-swift-api-default/api/ai-speech-to-text/playground/apiendpoint_e98a440e-6f3d-473a-8d13-56afe020f179

Empower your applications with cutting-edge speech recognition technology, crafted to meet the highest standards of excellence. Equipped with the essential tools and resources, developers can confidently create and deploy their projects swiftly and efficiently. Our solution ensures unparalleled performance, delivering a seamless experience that ...

frigid shale
#

I know I'm pulling up an old thread here, but where can I find info on these more specific parameters such as vad or even compression_ratio_threshold and temperature_increment_on_fallback. I'm having the same issue as a lot of people have mentioned on here (hallucinations), but with the API and I can't find relevant docs on either OpenAI's platform docs nor Azure's OpenAI services.

#

This seems to be common in audio clips longer than 15 minutes or so, and it happens pretty often. I've tried setting my parameters based on some online answers (I'm using node and axios fetch) but none seem to get whisper any closer to a coherent response on this audio clip

const audioResponse = await axios.get(audioUrl, { responseType: 'stream' });
const form = new FormData();
form.append('file', audioResponse.data, {
  filename: 'audio.mp3',
  contentType: 'audio/mpeg'
});

// THESE ARE MY WHISPER API PARAMS
form.append('response_format', 'verbose_json');
form.append('timestamp_granularities', 'segment');
form.append('temperature_increment_on_fallback', 'None');
form.append('compression_ratio_threshold', 1.2);
form.append('temperature', 0.1);
form.append('vad', 'True');
// ===============================

const azureResponse = await axios.post(
  `${AZURE_OPENAI_ENDPOINT}/openai/deployments/${deploymentName}/audio/translations?api-version=2024-05-01-preview`,
  form,
  {
    headers: {
      'api-key': AZURE_OPENAI_API_KEY,
      ...form.getHeaders()
    }
  }
);

return azureResponse.data;
#

yes I'm using the Azure instance; and no I have no idea whether these params can actually be passed to the API but it should be the same as the OpenAI whisper api
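
Some context that may help, with the caveat that this is my reading of the public docs and not something confirmed in this thread: `compression_ratio_threshold` and `temperature_increment_on_fallback` are options of the open-source `whisper` package and its CLI, and `vad` looks like a local-tooling option too. The hosted transcription endpoint documents a much smaller set of fields, and the timestamp field is spelled `timestamp_granularities[]`. The sketch below separates a parameter dict into documented and undocumented fields. The field list is an assumption and should be checked against the current API reference.

```python
# Request fields documented for the hosted transcription endpoint.
# Assumption: this list reflects the public API reference at time of writing.
SUPPORTED_FIELDS = {
    "file", "model", "language", "prompt",
    "response_format", "temperature", "timestamp_granularities[]",
}


def split_fields(params):
    """Separate request fields into (supported, unsupported) dicts."""
    supported = {k: v for k, v in params.items() if k in SUPPORTED_FIELDS}
    unsupported = {k: v for k, v in params.items() if k not in SUPPORTED_FIELDS}
    return supported, unsupported
```

Note also that the request above posts to `/audio/translations`, which translates into English. For same-language transcription the path would be `/audio/transcriptions`.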

shrewd rose
icy phoenix
real zealot
#

yes

manic plover
woven pier
#

Has whisper been abandoned by OpenAI?

austere halo
#

they released v3 large less than a year ago

woven pier
#

they did edit the readme recently

#

but they haven't accepted any PRs, even spelling mistake fixes. I know I submitted one that removed an error when run on Windows

surreal plover
#

Pull requests have always taken literal years if they're not, like, large security issues.

autumn bolt
#

how to use whisper to detect sound effects in a video?

woven pier
#

it'll try and put a word or letters to the sound

surreal plover
#

when i use whisper through command prompt it works flawlessly, however when i try to run it through a python script it says:

'whisper' has no attribute 'load_model'

i tried to debug and used: print(dir(whisper))
which gave me:

['AudioTranscriber', 'QApplication', 'QColor', 'QMainWindow', 'QPalette', 'QPushButton', 'QTextCursor', 'QTextEdit', 'QTimer', 'QVBoxLayout', 'QWidget', 'Qt', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'np', 'queue', 'sd', 'sf', 'sys', 'tempfile', 'threading', 'whisper']

and now I'm stumped. I've updated everything, reinstalled everything, restarted everything 10 times over and this still happens.

#

edit: you cannot name a file "whisper.py" anywhere Python will find it before the real package (for example next to your script), or it will shadow the package and break everything. I'd put in a pull req to fix this but I'm sure there is an open one already.
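
A small sketch for diagnosing this kind of shadowing with only the standard library. The function names are hypothetical. The idea is that the real `whisper` is a package (a directory), so if the name resolves to a single .py file, a local script is being imported instead.

```python
import importlib.util


def module_origin(name):
    """Return the file Python would load for `import name`, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None


def looks_shadowed(name):
    """True when `name` resolves to a lone module instead of a package.

    Only meaningful for names that are supposed to be packages.
    """
    spec = importlib.util.find_spec(name)
    return spec is not None and spec.submodule_search_locations is None
```

Printing `module_origin("whisper")` from the failing script shows which file is being picked up.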

willow current
#

hi

#

I'm new here. I have some questions about training and fine-tuning the Whisper model.

#

Is it possible to fine-tune the model with longer audio files?

#

I have various audio files, from 10 minutes minimum to 30 minutes maximum, and my dataset also has their transcriptions.

#

I need to train the model with my dataset. Is that possible?

#

Can anyone help me with this?

surreal plover
willow current
willow current
#

@surreal plover is it possible to fine-tune the model with longer audio files (more than 20 minutes) and their transcriptions? (Note: I need to fine-tune the model with full-length audio; I don't want to split it into 30-second chunks.)

surreal plover
willow current
surreal plover
#

Your current model is set up expecting all files in the batch to have the same dimensions between both audio files. The simplest fix would be to either pad or truncate the files. A simple python script could automate this if you don't want to get into changing what the batch expects.

willow current
surreal plover
#

Sounds like there is an error in how you're padding and/or truncating. I'd add some logging and check the length of the files after they have been padded, etc.

willow current
#

@surreal plover could you give me some sample code for your padding and truncating logic? I'll go through it and follow your method. Will you help me with this?
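
A generic sketch of the pad-or-truncate idea discussed above. This is not the code from this conversation: the function names are hypothetical and plain Python lists stand in for audio arrays. For Whisper the usual target length is 30 seconds at 16 kHz, which is 480000 samples. The check at the end plays the role of the suggested logging.

```python
def pad_or_truncate(sequence, length, pad_value=0.0):
    """Cut `sequence` to `length` items, or right-pad it with `pad_value`."""
    clipped = list(sequence[:length])
    return clipped + [pad_value] * (length - len(clipped))


def collate(batch, length, pad_value=0.0):
    """Force every sequence in `batch` to the same length and verify it."""
    fixed = [pad_or_truncate(seq, length, pad_value) for seq in batch]
    lengths = {len(seq) for seq in fixed}
    if lengths and lengths != {length}:
        raise ValueError(f"unexpected lengths after padding: {sorted(lengths)}")
    return fixed
```

Truncating a 20-minute file this way throws most of it away, so longer recordings would still need to be split into windows with matching transcript pieces.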

covert kraken
#

Hi,
I want to optimize the Whisper model to run locally on an Android device.
What is the best approach for optimizing Whisper?
Which model is the best base to start from?
How much can it be optimized while maintaining accuracy?

surreal wing
#

Hi, I have problems with Whisper. It works well this way:

```
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio2.wav")
print(result["text"])
```

But when I want to use it in my code, I get errors. Here is part of my code:

```
import whisper
import sounddevice as sd
import numpy as np
import pyttsx3
import torch  # Make sure PyTorch is imported


def voice_model(language='en-EN', mic_index=0,
                voice_id=r'HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Voices\Tokens\MSTTS_V110_enGB_HazelM'):
    working = True
    model = whisper.load_model("base")
    engine = pyttsx3.init()
    engine.setProperty('voice', voice_id)

    if language == 'fr-FR':
        print("Appuyez sur * sur votre clavier pour mettre en pause la conversation")
    else:
        print("Press * on your keyboard to pause the conversation")

    def talk(text):
        engine.say(text)
        engine.runAndWait()

    def listen():
        command = ''
        try:
            duration = 5  # Recording duration in seconds
            samplerate = 16000  # Audio sampling frequency

            # Record microphone audio
            print('Assistant:')
            audio = sd.rec(int(samplerate * duration), samplerate=samplerate, channels=1, dtype='int16')
            sd.wait()  # Wait for recording to finish
            audio = np.squeeze(audio)  # Make sure audio is mono

            # Convert audio to a floating-point PyTorch tensor
            audio_tensor = torch.tensor(audio).float()

            # Load whisper model and transcribe audio
            model = whisper.load_model("base")
            result = model.transcribe(audio_tensor)
            command = result["text"]
        except Exception as e:
            print(f"Whisper transcription error: {e}")
        return command
```
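
Without a traceback this is only a guess, but one likely issue is the audio format: `whisper.load_audio` returns float32 samples scaled to the range -1 to 1 (it divides the decoded int16 data by 32768), whereas the recording above is int16 cast to float without scaling. A minimal sketch of the conversion, with a hypothetical function name:

```python
INT16_SCALE = 32768.0  # magnitude of the most negative int16 value


def int16_to_whisper_input(recording):
    """Convert a sounddevice int16 recording to mono float32 in [-1, 1]."""
    import numpy as np  # third-party, imported lazily

    mono = np.squeeze(np.asarray(recording))
    return mono.astype(np.float32) / INT16_SCALE
```

The call would then be `result = model.transcribe(int16_to_whisper_input(audio))`. Recording with `dtype='float32'` in `sd.rec` should make the conversion unnecessary.
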
rugged narwhal
#

Please post a traceback

#

Anyone here with the same problem? It all started when I included a prompt for the API, something like this:

prompt = (
                "Please provide a clean and formatted transcription of the following audio:\n"
                "The transcription should include punctuation and capitalization to make the text readable."
            )

The API then goes wild sometimes and returns "If you have any questions or comments", instead of transcribing the audio submitted. Sometimes it works, sometimes it does not. The fact that there's also no "If you have any questions or comments" in the MP3 file at all makes me question it even more

gilded oasis
#

any tips on improving Greek language recognition on whisper and faster-whisper?

celest wolf
rugged narwhal
#

Thanks for responding though! Love

celest wolf
rugged narwhal
celest wolf
split acorn
#

Love it.

I've been playing with whisper and Google voice synthesis using SSML. It's pretty awesome wiht some of the new voices google has for SSML. You can create a group conversation as a discussion using made up actors that take the transcription, reprocess it with OpenAI API to genereate the conversation and SSML code - send to Google and get the audio back. We've layered it on top of other videos as commentary and it was pretty cool.

Here is a sample of a completely unscripted conversation that used Whisper to take the conversation from the existing marketing video and create the dialog. The app allows you to add as many actors as you want, each with their own role. One person knows everything, one person is asking questions, and we've added a third who provides joke commentary to add levity to what was a boring video before it was processed. https://www.robsoninc.com/wp-content/uploads/2024/06/420OTU6uCnA-final.mp4

split acorn
#

Best part is, it can run off any existing YouTube video. I've tried publicly available music videos (to test), which was just odd when you remove the existing voices, keep the music and add in a multi-person conversation talking about the lyrics. Next version is going to overlay pop-ups on the video to provide contextual information. It's all about increasing user engagement, creating new unique content.

sudden aspen
# frigid shale

set repetition penalty and use default temperature. why did you change it to 0.1 ?

subtle vortex
#

hey
does whisper accept .mp4 video files directly for transcription?

late wave
alpine lion
#

Hey guys, I have some issues when transcribing at word level: I get some words with the same start and end time. How can I solve that?

#

Would love your help !
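
One possible post-processing workaround, sketched under assumptions: the word entries are dicts with `word`, `start` and `end` keys (the shape openai-whisper returns with `word_timestamps=True`), and `fix_zero_length_words` and `min_duration` are hypothetical names chosen for this example. It does not fix the alignment itself; it only pads zero-length words so downstream tools do not choke on them.

```python
def fix_zero_length_words(words, min_duration=0.05):
    """Return a copy of `words` in which very short words get a small duration.

    Assumes each item is a dict with "word", "start" and "end" (seconds).
    """
    fixed = []
    for i, w in enumerate(words):
        start, end = w["start"], w["end"]
        if end - start < min_duration:
            # Pad the word, but try not to run into the next word.
            end = start + min_duration
            if i + 1 < len(words):
                next_start = words[i + 1]["start"]
                if next_start > start:
                    end = min(end, next_start)
        fixed.append({**w, "end": end})
    return fixed


if __name__ == "__main__":
    sample = [
        {"word": " hi", "start": 1.0, "end": 1.0},
        {"word": " there", "start": 1.02, "end": 1.5},
    ]
    for w in fix_zero_length_words(sample):
        print(w)
```

The padding value is arbitrary; for karaoke-style highlighting a slightly larger value may look better.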

abstract viper
#

You are best to run Whisper locally on your own GPU

split acorn
feral haven
#

that is so cool

alpine lion
turbid bay
#

Did I mess up installing something?

stuck blaze
#

Have you tried forcing a reinstall/update for Whisper? That should help make sure any dependencies are installed.

honest rampart
#

i think you should try updating your pytorch module, or go to the git repo and run pip install -r requirements.txt

#

mate at this point i just tell people to upload pip list

#

dependency conflict is so freaking annoying honestly

sudden aspen
#

whats the best tts ai ?

#

tts1-hd not so good

tender rivet
sudden aspen
#

😛

tender rivet
#

hopefully that ages badly and we get that sweet sweet voice mode on the API soon =P

buoyant olive
#

what is whisper

tender rivet
buoyant olive
#

speech to text? isnt that just on every phone/youtube?

agile niche
#

Is there any workaround to transcribe a multilingual audio file ? I have french and english in the same audio file

latent sand
#

trying to use whisper to subtitle an episode and running into an issue

#

whisper via the api works great for picking up silence and properly segmenting the text, but the timings are all off

#

and so now I'm trying to use whisperx, where the timings are correct, but it does a much poorer job with breaking up the text by speaker

#

anyone have any familiarity w this and/or recommendations?

latent sand
#

actually I'll just write out the srt myself based on the word timings...
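
A minimal sketch of that idea, assuming word timings are available as dicts with `word`, `start` and `end` keys. The function names, the `max_gap` silence threshold and the `max_chars` line limit are all made up for illustration and would need tuning per show.

```python
def fmt_ts(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_gap=0.6, max_chars=42):
    """Group words into cues, starting a new cue after a pause or a long line."""
    cues, current = [], []
    for w in words:
        text_len = sum(len(x["word"].strip()) + 1 for x in current)
        gap = w["start"] - current[-1]["end"] if current else 0.0
        too_long = text_len + len(w["word"].strip()) > max_chars
        if current and (gap > max_gap or too_long):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for i, cue in enumerate(cues, 1):
        text = " ".join(x["word"].strip() for x in cue)
        start, end = fmt_ts(cue[0]["start"]), fmt_ts(cue[-1]["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        {"word": " Hello", "start": 0.0, "end": 0.4},
        {"word": " world", "start": 0.5, "end": 0.9},
        {"word": " Bye", "start": 2.5, "end": 2.9},
    ]
    print(words_to_srt(demo))
```

Splitting on the gap between words is what recovers the "break at silence" behaviour; it does nothing about speaker changes.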

split acorn
split acorn
split acorn
tender rivet
# split acorn You can... if you're a paying subscriber.

GPT-4o voice features are not available, there is no API endpoint for it, even if you are a subscriber of ChatGPT, the ChatGPT subscription is a completely separated service to the API and a subscription on ChatGPT does not grant any API usage

split acorn
# tender rivet GPT-4o voice features are not available, there is no API endpoint for it, even i...

You don't need GPT4o to use OpenAI TTS - I mean there would be zero point or benefit.

If you want to create new, better content off the transcript you would use GPT4o to generate either the SSML (if you want to use Google Voice) or you can just send the text to OpenAI TTS and let it do its thing. Which I found to be much more human sounding.

https://platform.openai.com/docs/api-reference/audio

I used tts-1-hd

tender rivet
split acorn
#

How do you see the features being introduced as not programmatically available via the API now? Genuine question. The issue as I see it is just speed for realtime conversation.

You need an OpenAI subscription to access the API, not ChatGPT Plus. - right?

TTS is included with the OpenAI subscription.

latent sand
#

left is from openai's API, right is from whisperx run locally

#

the api did a better job with breaking up sentences with silence/no speech between them

#

whisperx kinda just groups them together if they were close enough

#

i know the latter is basically just whisper under the hood, but I'm not sure which parameters were used to achieve the former (for whisper, i.e. past the api)

#

a clear difference is the punctuation obv but I have no idea how to control that hehe

#

toyed with params like chunk_size no_speech_threshold max_line_count and only the first one of those made a significant difference, though its effect isn't exactly what I'm looking for

split acorn
#

I found OpenAI TTS-HD to be more realistic than Google Voice with SSML.

#

But Google has a lot more languages/accents and there are some new - pretty great sounding voices but the SSML can be tricky. I had issues when adding in two GenAI people in to a conversation.

woven pier
prime prism
junior briar
woven pier
pastel cradle
#

I had never seen "Amara" message before too.

junior briar
#

i noticed it when i was building a STT/TTS agent and sometimes i’d accidentally send an empty message to the agent

deep cradle
#

hi, I am trying to transcribe some audio but Whisper skips a big chunk of it every time. Other audio works fine, but in this one a big chunk gets skipped.

the audio is about 56 sec

if you listen to the audio, at the start the lady says "anyway okay let's move on so what I'm doing right now", and after that the whole chunk is skipped until "did you figure it out"

https://filebin.net/mnk9e6qmy5nfm9oa

TimeStamp

The audio from 00:05 to 00:30 is skipped every time. I transcribed it 3 times and got the same response.

transcriptions

anyway okay let's move on so what I'm doing right now did you figure out what I was saying there or you just bought the test like pretty much everybody does because now we have some groups that are supposed to help people guide people but actually what happens is that sometimes people go into their small private discussions and when somebody wants to do the test they just ask someone from that group\n
flint temple
#

Hey, trying to get whisper.cpp working on Windows. Has anyone been able to compile with cmake? For whatever reason manually compiling the files doesn't work, as it doesn't seem to use any backend. I can't compile any example with cmake other than main, and I want to do more than that. Any help?

gilded oasis
#

I have a self hosted whisper server used to transcribe Greek news videos (Large-v3 model), however I am experiencing an issue where sometimes, after the speaker changes, the transcription is lost.
Sometimes it comes back after a sentence or two, other times it does not understand the 2nd speaker at all.
Audio quality is excellent for both speakers and neither has a thick accent or is unintelligible.
Is there anything I can do to improve this?

rustic ginkgo
#

Hey everyone!

Does anyone else here find the 25MB max file size for whisper a problem?
If it is an issue for anyone else, I thought I'd share with you a simple api for transcoding videos / audios in a variety of formats into opus ogg, which greatly reduces the file size.

Some tests I ran were able to reduce 1GB (video) mp4 file into a ~15 MBs audio, and 50MBs mp3 audios into ~ 2MB files, and the transcription with whisper worked perfectly!

It's a very simple api (only one file), that can be run/deployed with a docker container.
You can find all instructions (including, deploying it to fly.io) at the repo:
https://github.com/vfssantos/ffmpeg-deno-microservice

GitHub

Contribute to vfssantos/ffmpeg-deno-microservice development by creating an account on GitHub.
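
To see why the transcode helps, here is a rough back-of-the-envelope sketch. The helper names and the 24 kbps default are assumptions for illustration, not taken from the linked repo; the estimate ignores container overhead, so leave some headroom. The ffmpeg flags used (`-vn`, `-ac`, `-c:a libopus`, `-b:a`) are standard ones.

```python
def estimated_size_mb(duration_s, bitrate_kbps):
    """Approximate audio size in MB for a constant bitrate (no container overhead)."""
    return duration_s * bitrate_kbps * 1000 / 8 / 1_000_000


def max_bitrate_kbps(duration_s, limit_mb=25, headroom=0.9):
    """Highest bitrate (kbps) that keeps a file of this length under the limit."""
    return int(limit_mb * headroom * 8 * 1000 / duration_s)


def opus_ffmpeg_args(src, dst, bitrate_kbps=24):
    """Argument list for ffmpeg: drop video, downmix to mono, encode as Opus."""
    return [
        "ffmpeg", "-i", src,
        "-vn",                  # ignore any video stream
        "-ac", "1",             # mono is enough for speech
        "-c:a", "libopus",
        "-b:a", f"{bitrate_kbps}k",
        dst,
    ]


if __name__ == "__main__":
    one_hour = 3600
    print(estimated_size_mb(one_hour, 24))   # about 10.8 MB for an hour of speech
    print(max_bitrate_kbps(one_hour))
    print(" ".join(opus_ffmpeg_args("talk.mp4", "talk.ogg")))
```

The argument list can be passed to `subprocess.run(..., check=True)` if ffmpeg is installed locally.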

sleek valve
#

Is whisper-1 the only whisper model?

rocky ruin
#

Hi I'm trying to generate translated subtitles for a non-english video using whisper but using --task translate doesn't seem to do anything what am I missing

rocky ruin
#

i found the issue turbo doesn't have the translation capabilities

neon silo
unique veldt
#

looks like large-v3-turbo is not yet available :/

manic cliff
#

How does one transcribe an audio file with multiple speakers, and have the AI distinguish between them?? Examples: podcasts, meeting call recordings

hollow orchid
#

You will need to do speaker diarization. Whisper doesn't support this itself but you can use libraries like pyannote https://github.com/pyannote/pyannote-audio or if you are okay with paying you can check assembly.ai they have this available via api and its fairly cheap if your usage is limited.

GitHub

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding - GitHub - pyannote/pyannote-audio: Neural build...
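
Once you have speaker turns from a diarization tool, combining them with Whisper segments can be as simple as picking the speaker with the largest time overlap. This is a sketch under assumptions: `segments` are dicts with `start`, `end`, `text`, and `turns` are `(start, end, speaker)` tuples that you have already extracted from whatever diarizer you use; `assign_speakers` is a hypothetical helper name.

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(seg["end"], t_end) - max(seg["start"], t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        # `best` stays None if no turn overlaps the segment at all.
        labeled.append({**seg, "speaker": best})
    return labeled


if __name__ == "__main__":
    segments = [
        {"start": 0.0, "end": 4.0, "text": "Welcome to the show."},
        {"start": 4.0, "end": 9.0, "text": "Thanks for having me."},
    ]
    turns = [(0.0, 4.5, "SPEAKER_00"), (4.5, 10.0, "SPEAKER_01")]
    for seg in assign_speakers(segments, turns):
        print(seg["speaker"], seg["text"])
```

Segments that straddle a speaker change get a single label here; word-level timestamps would allow a finer split.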

rapid spindle
#

Does anyone know if there will be a new version or a complete remake of Whisper when the audio version of ChatGPT becomes available?

#

i'm waiting for that info before i make an investment....it would be really painful to invest only for a completely new version of whisper to come out a few days later....it's for an STT project

#

if anyone has any news, it would be great, thanks!

upbeat marten
#

Hi all. Any tips or resources for generating an srt, given an mp3 file and its lyrics?

I thought about using whisper, but sometimes the lyrics it hears are wrong -- eg when the song audio isnt too clear or words are in other languages --

neat cave
#

What is the best wrapper you have seen to run whisper locally on mac?

Atm I use flowvoice.ai but I would prefer running something open source if possible

rocky dawn
# neat cave What is the best wrapper your have seen to run whisper locally on mac? Atm I us...

simplest would be to do a single http request with curl, or if you install the OpenAI API a simple Python script:

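```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

THE_PATH = "[path goes here]"
THE_NAME = "[file name goes here]"

# Send the audio file to the transcription endpoint and ask for SRT output
with open(THE_PATH + THE_NAME + ".wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="srt",
    )

print(transcription)
# Save the same SRT text next to the audio file
with open(THE_PATH + THE_NAME + ".txt", "w") as f:
    print(transcription, file=f)
```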
#

(i just added THE_PATH and THE_NAME to make this example clear)

#

also you can install ffmpeg and convert the audio to .wav first which can prevent weirdness with server side conversion:
ffmpeg -i <audio> -ar 16000 -ac 1 -c:a pcm_s16le <output>.wav

#

if you divide it into multiple audio files you can use a GPT to combine the transcript files smoothly

rocky dawn
#

Also you could create your own Whisper app for Mac by telling o1-preview:
Please write me a Swift app for Mac that uses SwiftUI for the UI, and Combine for asynchronous data and UI. The app should let me select an audio file, convert it to a transcript with an OpenAI Whisper http request, and then put it into a text editor window so i can edit it. Please add whatever other features you think will make it amazing. Please write out all source files 100% completely three files at a time, and ask permission to continue.

#

There are many ways

neat cave
lethal shoal
#

Hey everyone! I have a short question (maybe not so short, haha) about fine-tuning Whisper models on own labelled data. Following the general jupyter notebook on Huggingface, there seems to be no preprocessing steps for transcriptions that are generated during the training and are used for evaluations over the course of training. Is it necessary to add this step in the training somehow?
What I mean:
For example, for my own dataset that I am using for fine-tuning, all letters are lowercase, and there are no punctuation marks, only pure letter characters and whitespaces. Now, after starting fine-tuning, for example, every 200 steps the current model gets evaluated, however, is it not possible that model will generate output that will have uppercase and punctuation characters, and therefore WER will be higher than if they were preprocessed after being generated?
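
To illustrate the concern, here is a small self-contained sketch of normalizing both the references and the model output before scoring. `normalize` and `wer` are hand-rolled for this example (not the functions used in the Hugging Face notebook), and the normalization rule (lowercase, strip everything that is not a letter, digit, underscore or whitespace) is an assumption that should match how your labels were cleaned.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def wer(reference, hypothesis):
    """Word error rate via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)


if __name__ == "__main__":
    label = "hello world how are you"
    prediction = "Hello, world! How are you?"
    print(wer(label, prediction))                        # penalized for casing and punctuation
    print(wer(normalize(label), normalize(prediction)))  # scored on the words only
```

In a training loop, the same normalization would be applied to the decoded predictions and labels inside the metric function before the WER is computed. Note that this particular rule also removes apostrophes.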

naive plinth
#

hi I am new to the channel, so forgive me if it has been answered before. Has anybody tried to use gpt4o- native audio 2 text to transcribe audio? How does it compare with Whisper? How good is it for long-context? Can it somehow do speaker identification?

crystal jetty
#

hi is it possible to directly summarize an audio with whisper without a previous transcription?

dapper bridge
#

Hey, I've recently discovered a problem with using two different versions of Whisper with CUDA.
I have installed whisperx and whisper-ctranslate2 in two different conda environments.

CUDA v12.5 and v11.8 are installed in Windows.
CUDNN v8.9.7 and v9.6 are installed in Windows.
Both of them are included in environment variable PATH.
Additionally I have two CUDA_PATH and CUDA_PATH_V11_8 set up.

whisperx works correctly with CUDA and spits out transcripts perfectly
whisper-ctranslate2 recognizes the file but refuses to transcribe it: it crashes with no error. It only works with the --device CPU argument on the command line.
So, it looks like whisper-ctranslate2 has some problems with reading CUDA and CUDNN files, but I have no idea how to force it to recognize files correctly.

(I remember that on my previous Windows installation, the problem was reversed - whisper-ctranslate2 was working with CUDA and whisperx was not having any of it)

Any ideas how to fix it?

misty cove
golden sage
#

Hey,

We've recently added Intel® Gaudi® support to the Whisper repo! 🎉
Check out the GitHub discussion for more details - https://github.com/openai/whisper/discussions/2463

Let us know your thoughts or any feedback in the thread!

GitHub

Introduction We are excited to announce that we have opened a pull request (#2450) on the Whisper GitHub repository to add support for Intel Gaudi. This enhancement aims to improve performance and ...

whole tangle
#

I have seen that midjourney allow to generate img in relaxed mode for free users! Is that true

crystal copper
#

Is there documentation on how to tune the ServerVAD API for PSTN calls? The default seems to be really sensitive

worldly lantern
#

Hello guys, I need some help. I don't know any coding, but through ChatGPT I somehow got Whisper integrated in my terminal. When I ran the prompt to get a speech-to-text transcription, I got "Error calling OpenAI Audio API: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.", even though I have the ChatGPT $20/month subscription

ornate river
worldly lantern
#

Ohh okay

frozen spoke
#

hey, i'm currently trying to integrate whisper locally to save on some api call costs, and i've followed the github repo's instructions on how to load and set up the module and model.

this is my code:

import whisper
model = whisper.load_model("tiny")

This is the error it raises:
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1000)>

Anyone know how to fix?

#

Oh - running python 3.12.8 in a venv using VSCode

shrewd rose
#

Or public wifi ?

frozen spoke
shrewd rose
umbral palm
#

Can whisper detect um, ah and other filler words? Could you help me with what I should be putting in as a prompt so that I can detect it. I need it for the app I am building.

surreal dragon
#

if you search for "allow wisper to detect filler words" there's a few good conversations on various online forums for tricks and methods of doing so :)
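
Whether the fillers make it into the transcript at all depends on the model; a commonly suggested trick is to pass a prompt that itself contains fillers, which may or may not help. Once they are in the text, counting them is straightforward. A minimal sketch, where the filler list and the `count_fillers` name are assumptions for illustration:

```python
import re
from collections import Counter

# Each pattern tolerates elongated spellings such as "umm" or "uhh".
FILLER_PATTERNS = {"um": r"u+m+", "uh": r"u+h+", "ah": r"a+h+", "hmm": r"h+m+"}

_FILLER_RE = re.compile(
    r"\b(?:"
    + "|".join(f"(?P<{name}>{pat})" for name, pat in FILLER_PATTERNS.items())
    + r")\b",
    re.IGNORECASE,
)


def count_fillers(transcript):
    """Count filler words in a transcript, keyed by canonical filler name."""
    return Counter(m.lastgroup for m in _FILLER_RE.finditer(transcript))


if __name__ == "__main__":
    text = "Um, so I was, uh, thinking... umm, ah yes. The drum hummed."
    print(count_fillers(text))
```

The word boundaries keep words like "drum" or "hummed" from being counted.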

umbral palm
#

I don't think there is a parameter for normalisation

surreal dragon
# umbral palm This is not working for me, even the prompt is not working.

This video goes in depth on how you can do something like you're trying to achieve: https://youtu.be/pUzBuwjvH9E

Let's build an AI-powered filler-word detector to play an air horn when we say "um" or "uh." We'll use powerful off-the-shelf tools and wire them together with minimal code.

00:00 - Intro
01:17 - Getting Started
03:17 - Recording in chunks
06:15 - Roughing out recognition
07:16 - Detecting filler words
15:16 - Adding the air horn
16:56 - Faster...

mighty beacon
#

I am using whisper on windows10 and set up the settings the way I like now I just drop the files I want to transcribe onto it and it does so dropping the translations into the folder of the file its transcribing.

I wanted to ask if theres a command I can add to the *.bat file that would do the following: The audio file I drop onto the .bat I want to be moved to another folder after its completed.

Right now i Just do it manually, but I wanted to automate it.
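
In a .bat file the built-in `move` command placed after the whisper line would be the direct route. As an alternative, here is the same idea as a small Python sketch, which swaps the batch file for a script. `transcribe_then_move` and the `done` folder are hypothetical, and the `run` parameter exists only so the function can be exercised without Whisper installed.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def transcribe_then_move(audio_path, done_dir, run=subprocess.run):
    """Run the whisper CLI on a file, then move the file into `done_dir`."""
    audio = Path(audio_path)
    # Write the transcript next to the audio file, as in the current setup.
    run(["whisper", str(audio), "--output_dir", str(audio.parent)], check=True)
    # Only reached if whisper exited successfully (check=True raises otherwise).
    target_dir = Path(done_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(audio), str(target_dir / audio.name)))


if __name__ == "__main__":
    # Files dropped onto the script arrive as command-line arguments.
    for dropped in sys.argv[1:]:
        print("moved to", transcribe_then_move(dropped, Path(dropped).parent / "done"))
```

Because the move happens only after a successful exit, a failed transcription leaves the audio where it was.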

paper obsidian
#

Is there anyway I can have taxonomy included in the whisper? I have some banking terms like TMRW (pronounced as tomorrow), UOB where I don't get these terms when spoken. Based on the context can whisper identify these terms? I tried with initial prompt but that doesn't work out really well. Any suggestions or ideas?

prime escarp
#

What's the best way to do speaker diarization via the API for a react native app?

surreal dragon
surreal dragon
magic forge
#

Any best practices to avoid hallucination with whisper? Nothing I try works for a specific audio file

quiet condor
#

what is whisper?

#

i dont think ive heard of this

#

or maybe just not realised

tardy sonnet
quiet condor
#

ty

quaint spire
#

Guys is there an app that lets you run the Whisper Large V3 model for free on MacOS?

queen valve
queen valve
# queen valve ``` brew install python pip install openai-whisper whisper --model=large-v3 what...

If you do whisper --help it says the default is --model=turbo:

MODELS = {
...
    "large-v3": "https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt",
...
    "turbo": "https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt",
}

so it defaults to large-v3-turbo

quaint spire
#

Thank you!

queen valve
#

I wrote a thing for transcribing to the terminal from the default mic pip install catvox

#

you can pass --model to that, it uses whisper. Currently working on getting it to recognise who is talking, so it can just show me my own speech, so I can send that to an assistant.

quaint spire
#

That's awesome

queen valve
#

Thanks. Building an automatic pipeline builder is a headache. Been scratching my head over it for ages

agile niche
#

Are there any best practices to avoid having wrong timestamps via Whisper API when transcribing in SRT format ? I notice huge offsets on the timestamps...

queen valve
#

you can get word-level timestamps and then consolidate. that gives me more accurate results

agile niche
#

Are you guys using Silero VAD to exclude silences of the audio chunks sent to Whisper API ?

#

Currently doing the following : Silero VAD --> Whisper API --> Aeneas

last ridge
#

what i've to do ?

fast drift
#

Am using openai whisper model and observed following issues

  • When I don't speak anything instead of giving empty response generates random words
  • Sometimes, detect incorrect words where audio is clear

Can anyone please advise me on how do i improve?
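
For the "random words on silence" case, one option is to filter segments after transcription using the confidence fields Whisper reports. A minimal sketch, assuming each segment is a dict with `no_speech_prob` and `avg_logprob` (as in openai-whisper's `result["segments"]`); the function name and both thresholds are assumptions to tune on your own audio.

```python
def drop_probable_silence(segments, no_speech_max=0.5, logprob_min=-1.0):
    """Keep only segments that look like real speech.

    A segment is dropped when the model itself thinks it is probably
    silence, or when its average token log-probability is very low.
    """
    kept = []
    for seg in segments:
        if seg["no_speech_prob"] > no_speech_max or seg["avg_logprob"] < logprob_min:
            continue
        kept.append(seg)
    return kept


if __name__ == "__main__":
    segments = [
        {"text": "Hello there.", "no_speech_prob": 0.02, "avg_logprob": -0.2},
        {"text": "Thanks for watching!", "no_speech_prob": 0.9, "avg_logprob": -0.4},
    ]
    print(" ".join(s["text"] for s in drop_probable_silence(segments)))
```

This does not help with misheard words in clear audio; it only removes text that was likely invented over silence.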

agile niche
amber schooner
fast drift
agile niche
#

An efficient transcription pipeline is : Audio --> VAD --> transcription job --> correction job

#

Do you also need time stamps ?

fast drift
#

I don't need timestamps

jaunty loom
#

Hey folks,
My colleague raised a cool PR for whisper that boosts transcription performance and I was wondering if I could find anyone here who might be able to review the PR and give her some feedback? Apologies if this is the wrong place, here's a link if not https://github.com/openai/whisper/pull/2516

GitHub

Implements a suite of optimizations focusing on memory efficiency, tensor initialization, and model loading functionality. These changes improve performance, code clarity, and model handling flexib...

hollow silo
#

hi, I just got the command example running using whisper. this is using ggml and the 'base' model. its great, only when i have no real ambient music on. something as simple as a kind of hangar background or something makes it not detect anything. do i adjust 'temperature' or anything?

#

i would train it based on how well a voice seems in context with any detected music, in the sample. that way you can have anything on, and as long as you arent nailing the right vocal tones with the music it would think its the speaker

paper obsidian
#

Hi Folks, can the whisper base model be used for languages other than English for the real-time use case? Also, can it be used for non-English batch processing? Has anyone tried it, and how is the accuracy?

paper obsidian
#

Hi Guys, we are working on a real-time STT using whisper base model, this is for a conversation between agent and customer. 2 websockets connection to the backend STT engine. The latency is little high as I understand whisper is not meant for real-time, but is there any way we can enhance this?

Also, the WER is around 35-40%, any ways to reduce this? The language is mostly Singapore English.

wanton girder
#

hii i was wondering if whisper could run on any microcontrollers?

#

im planning to use it in a project with an esp32, but idk if itll run on it and theres barely anything on the internet about it

manic cliff
#

So I keep getting these markers for blankness or silence. I'll get stuff like this:

00:26:56.599 --> 00:27:07.599
Silence.

00:27:08.599 --> 00:27:35.599
Silence.

00:27:36.599 --> 00:27:49.599
Silence.

Or

00:26:56.599 --> 00:27:07.599
...

00:27:08.599 --> 00:27:35.599
...

00:27:36.599 --> 00:27:49.599
...

But it's not the same. It varies by each output file. Each one has its own idea of conveying "silence" or blankness. But the thing is, I really don't want this in my stuff. Is there a way to get it to not do this? Can I just put it into the prompt? Will that work?

#

Also what it does with dead space on a track is actually hilarious. Here's what I got on dead space on the track of one of my players:

00:06:14.000 --> 00:06:32.000
Yeah.

00:06:32.000 --> 00:06:38.000
Yeah.

00:07:02.000 --> 00:07:12.000
Yeah.

00:07:12.000 --> 00:07:22.000
Yeah.

00:07:22.000 --> 00:07:40.000
Yeah.

00:07:40.000 --> 00:07:50.000
Yeah.

00:07:50.000 --> 00:08:00.000
Yeah.

00:08:00.000 --> 00:08:10.000
Yeah.

00:08:10.000 --> 00:08:20.000
Yeah.

00:08:20.000 --> 00:08:30.000
Yeah.

00:08:30.000 --> 00:08:40.000
Yeah.

00:08:40.000 --> 00:08:50.000
Yeah.

00:08:50.000 --> 00:09:00.000
Yeah.

00:09:00.000 --> 00:09:10.000
Yeah.

00:09:10.000 --> 00:09:20.000
Yeah.

00:09:20.000 --> 00:09:44.000
Yeah.

He said literally nothing in this ~4min period. At all. I listened to the whole thing to make certain of it
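
A prompt is unlikely to suppress this reliably, so a cleanup pass over the cues may be more dependable. A rough heuristic sketch, assuming the cues have already been parsed into `(start, end, text)` tuples; the placeholder list, `clean_cues` and `max_repeats` are all assumptions for illustration.

```python
PLACEHOLDERS = {"silence.", "...", ""}


def clean_cues(cues, max_repeats=2):
    """Drop silence placeholders and long runs of the same repeated text."""
    cleaned, run_text, run_len = [], None, 0
    for start, end, text in cues:
        key = text.strip().lower()
        if key in PLACEHOLDERS:
            continue
        # Track how many times in a row the same text has appeared.
        run_len = run_len + 1 if key == run_text else 1
        run_text = key
        if run_len > max_repeats:
            continue
        cleaned.append((start, end, text))
    return cleaned


if __name__ == "__main__":
    cues = [("00:06:14.000", "00:06:32.000", "Yeah.")] * 5
    cues.append(("00:26:56.599", "00:27:07.599", "Silence."))
    cues.append(("00:27:50.000", "00:27:53.000", "Okay, let's continue."))
    for cue in clean_cues(cues):
        print(cue)
```

The first `max_repeats` copies of a repeated line still get through, so this trims the damage rather than removing it; VAD preprocessing addresses the cause.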

inner quarry
sleek bane
manic cliff
sleek bane
#

open source GitHub projects like faster-whisper and whisperx have VAD preprocessing built-in that works rather well

versed slate
#

So, will it be possible to download gpt-4o-transcribe? I want to run it locally

sleek bane
#

I’ve only seen it announced on API and seems quite expensive unless I’m reading their chart incorrectly.

hot belfry
#

so RIP this channel?

sleek bane
#

Whisper isn't being removed and I doubt the new models will be much more (if at all) popular considering they're API only.

#

Much cheaper and easier to run Whisper locally.

finite jetty
#

whisper is open source also

safe apex
#

Hello, I am having an issue with a whisper related plugin. WhisperAttack is a plugin for Voice Attack that changes its language recognition to whisper's STT. But whisper never loads its model when I'm using GPU, on CPU it loads fine. It doesn't throw any error codes or something similar, it just doesn't get past the loading stage. I've done some admittedly meager testing, because this is the first time i've actually stepped into ai territory, but i've found that my system can run whisper itself through python+pytorch using a gpu. Or atleast thats what the code ChatGPT gave me said. I'd appreciate if anyone knew if there was a more concrete way I could test it, or if its something else.

#

Sorry that i've asked here, but WhisperAttack has no forums or discord servers i can find. So this was my best option, tied with the VoiceAttack discord.

sleek bane
#

I have no clue about the software, never heard of it or used it.

If you created the code, simply add an insane amount of logging so you know what’s actually going on and paste it back into ChatGPT.

safe apex
sleek bane
#

Though you shouldn't just trust that any tools can do what ChatGPT says they can do. You should actually ensure it has the ability to do whatever it is you're wanting to do first. If you can't verify that, then ya, you're definitely possibly gonna spend many hours on it to achieve nothing.

safe apex
safe apex
gleaming lake
#

hello i have a question, how do you use whisper word timestamps?

sleek bane
#

however you want to utilize them?

paper obsidian
#

How do whisper prompts work? Does it use the prompt for every chunk of audio that we send to the model?

finite hawk
#

Hi all, can we get pronunciation feedback while the audio is being transcribed?

willow delta
#

Heyhey. I am trying to get whisper to work in a java environment but am not sure what the best approach is. I am targeting a live transcript so I was using whisper live for a bit but am getting a bit tied up with the library. Is there someone that has a bit of experience or idea towards this? If this is a space to even ask this

willow delta
harsh merlin
willow delta
harsh merlin
vestal belfry
#

@calm finch

calm finch
vestal belfry
#

yujonglee seems to have something meant for your meeting stuff

#

above me

calm finch
#

i think i'll just use the API it's not too bad

#

i'm gonna test 4o transcribe to see how good it is

harsh merlin
#

yeah mine(Hyprnote) is for using local whisper(for now)

native storm
#

4o transcription isn't working for me, does anyone know what to do?

cerulean tusk
#

Is there anyone here who might be able to assist me with better understanding OpenAI whisper pricing? We are testing using them for transcription of call recordings from a telephony server, and I cannot reconcile their charges with our usage.

serene gull
#

You listening

strange mauve
#

What is whisper ?

hushed forge
#

Why is ChatGPT whisper not even working

left bluff
cedar moss
#

what's up with whisper? i get The server had an error while processing your request. Sorry about that! Please contact us through our help center at help.openai.com if the error persists.

quaint patrol
#

for the life of me I cannot get whisper to translate from english to another language. Any help appreciated

pure swift
#

If I need accurate timestamps so I can show the current location in a transcript while playing the audio, is there any way to use the new transcribe models or do I have to use whisper?

distant gazelle
#

hello im new here so guys can you help me

bitter bough
#

Not the discord user for clarification.

lone edge
#

Hoi, is whisper capable of making non verbal sounds by using symbols to activate X non-verbal action?
Or is it capable of processing a voice shouting? As i'm trying to make some nonsense automation in home assistant, but can't seem to figure out how to make my VA sneeze for instance lol. Or make it phonetically make the noise i want by using only letters., as if there's too many letters and whatnot, it will start a spelling-bee competition lol

muted axleBOT
#

success @oceangrover muted

Reason: Possible spam: Excessive mentions in a short period.
Expiration: 56 seconds
proofverifiedProof: @winter island

atomic totem
placid falcon
#

I need to be able to get language realtime captured off of a PC livestream and then auto translated. I heard that whisper is the best captioning software but it doesnt have realtime captioning?? I also heard there has been a lot of things added to this software through huggingface etc. Just wondering if what im asking is possible

placid falcon
#

So apparently thats a yes but the actual good model large v2 requires a nvidia cuda gpu.... Surely you can get this service as a cheap api key that doesnt run locally?

verbal walrus
#

Shhhhhhh

placid falcon
#

WhisperAI is one of the most interesting things that OpenAI has built and apparently no one uses or engineers it.

plucky dune
#

I am Poison Apple

flint crow
#

🚀 New Project Alert!
Hey everyone! I just fine-tuned OpenAI's Whisper-Tiny model to translate Bengali-English code-switched speech into English, especially for healthcare use cases like doctor-patient conversations. An easy fine-tune script for translation is open-sourced.🎙️🏥

🧠 It’s called MediBeng-Whisper-Tiny — perfect for building better clinical transcriptions and exploring speech translation tasks!

Check it out here:
🔗 GitHub: https://github.com/pr0mila/MediBeng-Whisper-Tiny
🔗 Hugging Face: https://huggingface.co/pr0mila-gh0sh/MediBeng-Whisper-Tiny

Let me know what you think or if you try it out! 😊
#SpeechTranslation #gpt-realtime #HealthcareAI #Bengali #OpenAI

GitHub

MediBeng Whisper Tiny improves doctor-patient transcription by training the Whisper Tiny model to translate mixed Bengali-English speech into English, making it easier for analysis, record-keeping...

fast drift
#

Hi,
I have video which has only music/ no audio
i need to show video response for the user when they ask the queries related to video

what approach should i follow please advise

fast drift
#

Hi,

shut quartz
#

Hey there! is anyone here experienced with using faster-whisper while multithreading?

I’m running Faster-Whisper in a Python app that transcribes multiple audios (.wav) in parallel. I’m using ProcessPoolExecutor

The issue: sometimes the transcription just hangs silently — no errors, no CPU or GPU usage, no output. Eventually I have to kill it. When I add logging inside the transcription subprocess, it often doesn’t get past WhisperModel(...).transcribe(...).

Still stuck.

Has anyone run into this kind of silent freeze or dead GPU thread in Faster-Whisper? Any known workarounds or tips?

Thanks 🙏

rich idol
#

the voice is better indeed now

#

its the best thing for foreign language learners

pallid bay
hollow sandal
hollow sandal
pallid bay
hollow sandal
#

Oh i know it well

toxic moss
#

Hello! Anyone here use an automated whisper transcription workflow for better note taking?

sturdy spire
#

Anyone else having very delayed Whisper API responses over this last week?

velvet perch
#

@narrow thorn

fair pecan
#

does anybody know a platform or app that can transcribe audio/video files with translation support?

#

and export to srt

#

preferably able to run locally on windows

#

and optionally on mac

stone drift
#

👋🏽

pale hatch
#

Whisper not working in chatgpt?

stone drift
ember steppe
#

why are we whispering?

slow storm
#

shhh!

hollow wigeon
#

has anyone figured out how to avoid large v3 going into an endless loop of repeating the same sentence thousands of times?

#

makes v3 essentially useless for longer videos, even with no background noise

quiet terrace
hollow wigeon
#

doesnt help

#

yea I am using it locally

#

using this ui

#

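On the 10th, torrential rain fell in many areas, centered on the Kanto-Koshin region, and damage such as flooded roads was reported one after another in Tokyo as well. In Yokohama, the heavy rain blew off a manhole cover and caused a road to cave in.

■ Central Tokyo heavily flooded, evening rush hour in chaos

Pounding rain and rumbling thunder. 10...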

#

around 4 minutes in large v3 gets stuck even with condition on previous text disabled

quiet terrace
#

I'm using pure Python script myself, and got this result with large-v3 You mean model stuck around here?

hollow wigeon
#

indeed

#

might be a bug in the gui that doesnt pass on the parameter

#

so its the previous text conditioning that breaks v3

quiet terrace
#

I think so. I mostly use Whisper with Korean, and every single transcription had that kind of problem. When I turn off previous text conditioning, every problem gone; it rarely appears, but re-transcribe works.

#

This is with previous text conditioning. It indeed causes problem...

hollow wigeon
#

okay but it doesnt get permanently stuck

#

with previous text conditioning i get the same sentence repeated from around 4 mins until the end of the video
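
A loop like the one described here can at least be detected automatically, so a run can be flagged and re-transcribed with previous-text conditioning turned off. This is a minimal sketch that assumes the transcript is already available as a plain list of segment strings; the function name and the `min_repeats` threshold are made up for illustration and are not part of Whisper or faster_whisper.

```python
def find_repetition_loop(segments, min_repeats=5):
    """Return the index where a run of identical segments starts, or None.

    A run counts as a loop when the same text (ignoring surrounding
    whitespace) appears at least min_repeats times in a row.
    """
    run_start = 0
    for i in range(1, len(segments) + 1):
        # Close the current run at the end of the list or when the text changes.
        if i == len(segments) or segments[i].strip() != segments[run_start].strip():
            if i - run_start >= min_repeats:
                return run_start
            run_start = i
    return None
```

If this returns an index, everything from that segment on is suspect and worth re-running on just that part of the audio.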

quiet terrace
#

Hmm... that's strange. I only get partial repetition with conditioning. Maybe because WebUI uses its own setting for transcription?
I randomly got that kind of thing, but not every time.

hollow wigeon
#

okay ran the app in debug mode, first of all it doesnt pass through the condition_on_previous_text to the actual transcribe call

quiet terrace
#

It uses various default settings...

whisper:
  model_size: "large-v2"
  file_format: "SRT"
  lang: "Automatic Detection"
  is_translate: false
  beam_size: 5
  log_prob_threshold: -1
  no_speech_threshold: 0.6
  best_of: 5
  patience: 1
  condition_on_previous_text: true
  prompt_reset_on_temperature: 0.5
  initial_prompt: null
  temperature: 0
  compression_ratio_threshold: 2.4
  chunk_length: 30
  batch_size: 24
  length_penalty: 1
  repetition_penalty: 1
  no_repeat_ngram_size: 0
  prefix: null
  suppress_blank: true
  suppress_tokens: "[-1]"
  max_initial_timestamp: 1
  word_timestamps: false
  prepend_punctuations: "\"'“¿([{-"
  append_punctuations: "\"'.。,,!!??::”)]}、"
  max_new_tokens: null
  hallucination_silence_threshold: null
  hotwords: null
  language_detection_threshold: 0.5
  language_detection_segments: 1
  add_timestamp: false
  enable_offload: true
hollow wigeon
#

second, it uses the faster_whisper library

#

isnt faster_whisper being deprecated?

quiet terrace
hollow wigeon
#

okay well then I guess I know where to file a bug report at least, thanks

snow dirge
#

Critical user face bug

#

@snow dirge

#

Saw expected Behavior then became critical behavior need to talk to someone privately

gilded oasis
#

Hello everyone, I'm having some difficulties training whisper (specifically large_v2) with a Greek dataset.
If anyone is available to let me pick their mind it would be greatly appreciated.
Specifically, doing a full finetune with Transformers using the common_voice_11 dataset gives me very high WER% even if loss stays low
Doing a LoRA finetune on the same dataset does not produce better results
Finetuning on a smaller dataset I have created (~1 hour) appears to be overfitting

loud plinth
snow dirge
#

The app on iOS locked up my phone and had to do a hard restart multiple times are you part of the AI team

loud plinth
#

No, guides are community members just like you, helping people to find things and get what they're looking for.

#

Staff doesn't respond here. Our (we, the community) resources include the channels here or Help Center on the OpenAI site.

#

But I am a career developer, and your question seemed to involve Whisper, so ... I'm offering what I can.

#

I get the impression that you're talking about ChatGPT and not Whisper?

snow dirge
#

Yes, I'm sorry, it's about ChatGPT. I thought this Whisper channel was for whispering to a developer. I can't get through, and their help stuff is driving me nuts.

loud plinth
paper obsidian
#

Is there any benchmarking available for Translation of data using whisper ? (WER / CER)
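
For anyone who wants to measure this on their own data, WER and CER are both edit distance divided by reference length, over words and characters respectively. The sketch below is a plain-Python illustration with no text normalization (casing, punctuation); the function names are chosen here and do not come from any benchmark suite.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces."""
    ref = list(reference.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, list(hypothesis.replace(" ", ""))) / len(ref)
```

CER is usually the more meaningful of the two for languages written without spaces, since `split()` would treat a whole sentence as one word.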

green wolf
#

What is whisper

green wolf
pastel sparrow
#

I've tried whisper models base and large and I get nonsensical results, I've gotten Japanese, English(but far off from what i said), and in my most recent attempt, Viet

#

I only speak english and ive never been to the east of the world so my accent isn't influenced by there

#

Results:

Good morning. Today, I hope you are more compacted and to the right. Today, I hope you are more compacted and to the right. Madison land, London, and Cleveland. It was no less than 10 hours ago that a black September owl was talking about the worst attack that had happened to a black and a white and black owl. That was a very, very rough shock and a terrible shock.

hollow wigeon
#

anyone got any idea whats going on?

strong mango
pastel sparrow
# strong mango

it's not about the transcript. I need help debugging why my code produces an incredibly wrong transcript

strong mango
#

Ah.

#

I could probably help, then.

#

Either that, or you could use ChatGPT.

pastel sparrow
#

Ive tried, no luck

strong mango
#

What about Claude?

pastel sparrow
#

did so as well

strong mango
#

Grok 4, if you have access?

pastel sparrow
#

Haven

#

Haven't tried it

strong mango
#

Well, there you go.

#

It's very smart, so it should be able to easily help.

muted axleBOT
#
Rule 3: Stay on topic.

-# Be mindful of what other users in a channel might find helpful or interesting when posting. Stay on topic in order to keep conversations focused and productive.

-# Consider posting in #off-topic or an appropriate channel.

tender rivet
#

this is so cool

#

transcriptions directly in ffmpeg with the open source model

#

I love ffmpeg.. really, I couldn't think of a software that had a better impact in the world..

viscid kite
ebon cosmos
distant dome
#

Whisper is no longer working properly because OpenAI is shutting down the Whisper voice Aug 8.

#

And I am not going to use the new one 😒

distant dome
#

They retired the #

#

They will

wintry heron
wheat monolith
#

Meta meta meta ...... game inside VM inside VM loool :3 open for co-creation 😉 prompt exchange, link the sun and send the moons as func(open moon = "{" and closed moon = "}" end this with a shadow or full moon 😉 :3);

#

So fun... T_T

hollow wigeon
#

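On the 5th, a certain "new member" appeared at Madame Tussauds, the wax museum (ろう人形館) that is a tourist attraction in London, UK. The new exhibit is not a "person" but... the popular snack "sausage roll". Here is why.

Read the article for this video>
https://news.ntv.co...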

#

another video that whisper really struggles with

#

are there some optimal set of parameters for japanese?

#

it really struggles with clear, low-background noise language

#

if the chunk size is above 5s it skips whole sentences and even then it doesnt translate whats literally being said but often omits or changes parts of the grammar

#

1
00:00:00,000 --> 00:00:05,000
ロンドンの観光名所になっている老人業館マダム

2
00:00:05,000 --> 00:00:10,000
タッソーにはイギリスの王室のメンバーや世界の著名人の老人

3
00:00:10,000 --> 00:00:14,000
業が数多く展示されています。

4
00:00:15,000 --> 00:00:20,000
ロンドンのマダムタッソーに新たな仲間が増えました。

5
00:00:20,000 --> 00:00:23,000
ソーセージロールです。

6
00:00:25,000 --> 00:00:30,000
今回新たに展示されたのは人ではなく、

7
00:00:30,000 --> 00:00:33,000
イギリスで人気の軽食ソーセージロールです。

8
00:00:35,000 --> 00:00:38,000
6月5日がナショナルソーセージロールデーとされていて、

9
00:00:39,000 --> 00:00:42,000
マダムタッソーにも参加しました。

10
00:00:40,000 --> 00:00:43,000
マダムタッソーではイギリスのチェーン店グレックス社の

11
00:00:44,000 --> 00:00:47,000
ソーセージロールそっくりに作っています。

12
00:00:45,000 --> 00:00:49,000
今月末まで展示されることになりました。

13
00:00:50,000 --> 00:00:53,000
食べ物の老人業が飾られるのは初めてです。

14
00:00:55,000 --> 00:00:58,000
グレックス社のソーセージロールはイギリスで人気のスナックです。

15
00:01:00,000 --> 00:01:03,000
およそ100万個が販売されているということです。

16
00:01:05,000 --> 00:01:07,000
ソーセージロールはカリカリで柔らかく、

17
00:01:08,000 --> 00:01:10,000
柔らかくて柔らかく、

18
00:01:10,000 --> 00:01:13,000
調味料もとても良いです。

19
00:01:15,000 --> 00:01:17,000
老人業の製作チームは、

20
00:01:18,000 --> 00:01:20,000
ソーセージロールのパリッとしたパイの層と、

21
00:01:20,000 --> 00:01:23,000
サクサク感を再現するために試行錯誤を重ね、

22
00:01:24,000 --> 00:01:26,000
数か月をかけたものです。

23
00:01:25,000 --> 00:01:28,000
すべて作品を完成させたということです。
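
One thing that can be checked mechanically in output like this is overlapping cues: in the transcript above, cues 9 and 10, 11 and 12, and 22 and 23 overlap in time. Here is a small sketch that flags such pairs, assuming standard `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing lines; the function names are illustrative only.

```python
import re

TIME_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_ms(stamp):
    """Convert an SRT timestamp such as 00:01:05,000 to milliseconds."""
    h, m, s, ms = map(int, TIME_RE.fullmatch(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def find_overlaps(srt_text):
    """Return (n, n+1) pairs of 1-based cue positions whose times overlap."""
    cues = []
    for line in srt_text.splitlines():
        if "-->" in line:
            start, end = line.split("-->")
            cues.append((to_ms(start), to_ms(end)))
    return [
        (i + 1, i + 2)
        for i in range(len(cues) - 1)
        if cues[i + 1][0] < cues[i][1]  # next cue starts before this one ends
    ]
```

Overlaps do not prove text was dropped, but they mark the spots worth comparing against the audio first.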

ruby night
#

What's up with the Whisper being so horribly broken in the ChatGPT SVM for the last 3 or 4 weeks? It hallucinates all the time... Is it deliberate action by OpenAI to discourage users from SVM? I tested it on several different devices including the browser with high quality microphone and results are basically the same... Because of that we (I mean, me and ChatGPT) started calling it the Careless Whisper...

rocky light
#

Which model are you using?

ruby night
# rocky light Which model are you using?

ChatGPT with gpt-4o and SVM, but the STT hallucinations persist when using gpt-5 too. AVM doesn't seem to have these issues, but being AVM it's horrible when it comes to conversation being significant.

wind herald
#

HyprNote looks great but looking for something like that for Windows. Does anyone have any suggestions?

abstract orbit
ruby night
#

Since the last weekend the Whisper-related issue in ChatGPT I reported in previous msg seems to be fixed!

trim rune
#

Hey guys, I am having some trouble using the transcription method. For some reason, some audio files sometimes don't get all of their content transcribed.
Did you guys ever have that problem?
How did you solve it?

visual crypt
#

Hi @trim rune
Yeah, that happens sometimes. A few reasons could be background noise, overlapping voices, or the model hitting length limits and cutting off. I usually fix it by either cleaning up the audio first (noise reduction, splitting long files into chunks) or using a different transcription tool/model. Breaking the file into smaller segments tends to help the most.
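
To make the "split into chunks" idea concrete, here is a sketch that only computes the chunk boundaries, with a small overlap so a sentence cut at one boundary is still whole in the neighbouring chunk. The 30-second chunk and 2-second overlap defaults are arbitrary assumptions, not recommended values, and the actual cutting (for example with ffmpeg) is left out.

```python
def chunk_bounds(total_s, chunk_s=30.0, overlap_s=2.0):
    """Return (start, end) pairs in seconds covering 0..total_s with overlap."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break
        # Step back by the overlap so boundary speech appears in both chunks.
        start = end - overlap_s
    return bounds
```

Each pair can then be transcribed separately; the overlapping seconds will be transcribed twice and need de-duplicating when the pieces are joined.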

hollow wigeon
#

I have a lot of "almost correct" subtitles in Polish. I.e. not CC but regular TV subtitles. Would it be possible to use Whisper to correct each of the lines in this subtitle given an audio segment and the original almost correct line?

analog remnant
#

hi everyone i just want to know how can i use whisper and is it free for the chatgpt plus user ? Thank for any respons please

dense imp
#

Has anyone had a lot of success with diarization using things like pyannote? I always seemed to have problems.

worldly laurel
#

how can i start using sora

ripe stag
ripe stag
ripe stag
dense imp
# ripe stag I haven’t used pyannote but you should just be able to use WhisperX as that is b...

👯‍♂️ Multispeaker ASR using speaker diarization from pyannote-audio (speaker ID labels)
https://github.com/m-bain/whisperX

WhisperX is built on pyannote. I tried it a few years back without much luck. I'll try again to see if it's better.

GitHub

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization) - m-bain/whisperX

ripe stag
#

Ok yeah that’s a while ago, definitely try it again and see how it goes.

vocal sierra
#

what can whisper do

#

in ChatGPT once i uploaded the audio file

#

but it's a concern if ChatGPT puts out an error due to some dependency issue iirc

#

i wonder if they fixed it now

#

because i tried it earlier it starts to complain 😤

ripe stag
#

@dense imp I was working on getting WhisperX to work on my windows pc, and I finally got it to work. Here's what you need to do:

  1. I take it you already have basic Whisper installed (without speaker diarization). If so, you already have the dependencies WhisperX needs (PyTorch, ffmpeg, etc.). For Python, I recommend version 3.11, which worked best for me. In your Windows environment variables, make sure the paths for both user and system variables include only the Python 3.11 path, as other Python version paths might cause conflicts.
  2. Download Git for HuggingFace models: https://git-scm.com/downloads/win
  3. Install required Python packages in Windows Command Prompt:
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
    pip install whisperx pyannote.audio pydub
    pip install ffmpeg-python
    If you already have a later version of CUDA installed on your computer (e.g., 13.0), that's fine, as it is backward-compatible. If you don’t have a CUDA GPU, just install the CPU version of PyTorch with --index-url https://download.pytorch.org/whl/cpu.
  4. Get Hugging Face token: https://huggingface.co/join
    Sign up, create a new token, make it Read access, and give it any name you'd like.
  5. Accept each of these model's conditions while still logged in your Hugging Face account:
    https://huggingface.co/pyannote/speaker-diarization-3.1
    https://huggingface.co/pyannote/segmentation-3.0
  6. Run the script file attached in Python. Fill in yourhuggingfacetoken with your own token and pathtoaudiofile in mp4_file and wav_file with the path to your own audio file.

Hope this helps.

#

@vocal sierra Whisper transcribes audio files to text, and WhisperX adds speaker diarization on top. Regarding your concern with ChatGPT: you don't upload the audio file directly into ChatGPT. You have to install Whisper first. Once that's running, take an audio file in wav, mp3, or mp4 format and run this one-line command to test: whisper "pathtoaudiofile.wav". Replace pathtoaudiofile with the actual path to your audio file in File Explorer, and replace .wav with whatever your audio file format is. If you want to add diarization, you can go through the process I outlined above to get WhisperX running.

dense imp
ripe stag
#

Oh gotcha, yeah that’s interesting because my diarization also wasn’t very accurate, but it did compile, just like yours. Hopefully it gets improved soon.

manic holly
#

I've just released a full benchmark on a STT model trained for one of our customer. 9-page full of value to compare models + GitHub code to be able to compare models by yourself ! Send me a DM and I send you the PDF 🙂

marble crater
#

its called whisper guys so shhh we have to whisper

glossy bone
#

It includes whispering

slow hound
solar dew
#

What do u guys think about the new gpt 4o diarize

ripe stag
# solar dew What do u guys think about the new gpt 4o diarize

I haven’t tried it as I’m not sure if I want to have to pay for the API tokens, but I heard it’s a lot more accurate than Whisper. However, you’re restricted to what OpenAI allows you to do in the API so you won’t get as much customization and options as Whisper being open-source.

solar dew
#

They should release whisper diarize open

marsh canopy
#

When we say whisper are we talking about the translation model?

worthy fable
#

Mostly I find that it always tries to translate the audio into English, even if the language parameter is set and no one speaks English in the original audio

slow hound
#

🎥✨ Introducing AITranscript.in
— Free, Fast, and No Sign-Up Needed!

🚀 AITranscript turns any video into clear, actionable insights — instantly.
No login. No limits. No catch.

💡 What you get:
🎧 Automatic Transcription — powered by Whisper
🧠 AI-Generated Action Points — focus on what truly matters
💬 Smart Chat Assistant — ask anything from the video context
📜 Instant View Mode — see full script + key takeaways before download

🆓 100% Free. Forever.
⚡ Just upload a video → get insights → done.

🔗 Try now → https://aitranscript.in

AI Transcriber

Transform your videos into accurate transcripts instantly with AI. Upload, transcribe, and analyze your content with our powerful video transcription tool. Free trial available.

outer pawn
#

Check your transcriptions! Please read the attachment. Very quick update: I had a song that I wrote in English. I took the English lyrics and translated them into Simplified Chinese to create the song in Chinese. I used AI tools to create the song, but wanted to see how the lyrics looked after the Chinese conversion, so I took the song's .mp3 file and had it transcribed back into text using melobytes (which I later discovered uses Whisper). The song was in Chinese, so it generated text output in Chinese. I converted the text back into English and was shocked that Whisper had given attribution (credit for the song as writer and composer) to a Li Zongsheng, who IS a real person and a well-known musician. I did extensive testing and reported it to OpenAI. I then downloaded Whisper on my home computer and ran the test again. For a 2nd and 3rd time I received the note that the song was written and composed by Li Zongsheng. Running the same tests on a different song, some spam/donation request for some other person's YouTube account was in my transcription. Please look at the attachment for a lot more information, and check your transcriptions/conversions, as it is likely that the output contains credit and spam injected into your work. The results differ with different languages. I did not see any errors in English conversion/text; Chinese was the most problematic. But millions of people have used Whisper to transcribe audio into text, so this is a real problem. Look at the attachment and GitHub: https://github.com/openai/whisper/discussions/2685

GitHub

Critical Bug: Whisper Fabricates False Copyright Attribution (Li Zongsheng) on Original Chinese Music ### 2. In the "Add a body" field (the large text box), paste this entire text: ST...

wheat pond
main current
#

So Ive been running into an issue when trying to upload a file to transcribe using Whisper. It tells me that it cannot perform this operation because its using a path instead of being uploaded but I am dragging the file into the chat box and creating the prompt

#

I decided to try using the API, but it wont allow me to attach my card for billing (separate issue lol) But is this issue im having with uploading the file because its too large?

jolly forge
#

can someone send me the most accurate settings and model possible i can use for max accuracy im not too concerned with time

wheat pond
spring umbra
oblique thunder
#

I used Whisper as the transcription engine for a new project I’ve been building called ContentRob. It takes any informational video and turns it into high-quality written content, SEO articles, tutorials, case studies, and even share-ready infographics.

Whisper handles the speech-to-text layer, and the output is then processed into different content formats. You can also export as PDF or DOCX, repurpose the same video into multiple formats, and publish or schedule posts directly.

If you want to try the demo, it’s available here:
https://contentrob.com/

Open to connect, collaborate, or discuss the implementation details.

AI-powered platform that converts videos into blog posts, tutorials, and articles. Export, repurpose, and publish across platforms effortlessly.

amber marsh
#

Hello

remote crow
#

Um

#

Yur

proud venture
brisk flicker
#

I am now looking for a dev with rich experience in OpenAI, Whisper, Electron, and ffmpeg. I am trying to build an AI interview assistant app, so we need to implement voice-to-text for both the mic and the speaker output of a live meeting.

tacit tree
#

Voice meeter potato

tacit tree
#

D bus the mixer then speech zo text

brisk flicker
tacit tree
#

But i do produce music

#

In Ableton

#

Bassicly

#

Potato lets u route audio

#

Over ports

#

Same like pulse audio

#

But u have 8 channels

#

This is bsnnsna i think verdion eithb4

#

But the point is the vbsn

#

Vban

hollow dock
wet wren
#

I’ve already been working on similar pipelines necessary for my own project actually lol

#

Takes live audio streams, handles audio transcription, uses context and information to determine various important variables. Long story short it is for a NVR system that leverages ai workflows to streamline alerting and monitoring of security systems via audio and video data.

#

You could possibly use a heavily modified form of this pipeline that essentially just has the basic parts and systemization already together anyway

severe solar
#

would it be legal/ethical to manually port whisper to windows

#

I mean I think chatgpt could do it if I gave enough time

#

WRONG THING not whisper Ignore my message

worn latch
severe solar
sour cloud
#

I'd love to connect AI engineers who love learning

sonic cedar
sour cloud
swift karma
#

I am running into problems with whisper-v3-large: it just performs so much worse than v2. Refusing to translate, looping over sentences. Is there anything I can do?

lofty mesa
#

Ok

small swallow
#

Wsp

tacit cairn
#

Hello, my people

marsh canopy
marsh canopy
marsh canopy
worthy fable
hollow dock
#

I'm just too lazy to fix it right now.

#

😄

crude axle
#

ₛₕₕₕ...

lilac cairn
#

Hello, how do I fix wrong transcriptions from the Whisper small model for Serbian/Croatian/Bosnian?

exotic cape
#

New whisper coming out never

oak sapphire
#

Hi

glacial ferry
#

hi

mellow ingot
#

Is the Whisper project still being actively developed, or has it gone quiet? It’s a great model, but it’s starting to fall behind the competition. It would be awesome to see a Whisper v4.

dusk notch
#

s

keen magnet
keen magnet
#

Mistral’s Voxtral Transcribe 2 just launched. Really wish Whisper would get a new version soon.

left matrix
#

yes, many guys need a speech to text service!!!!

#

fast, and precise speech to text model

visual crypt
#

what is the difference between whisper and elevenlabs?

blissful parrot
#

Higgisfiead code

visual crypt
#

?

gloomy sable
visual crypt
#

To use method or capabilites, I wanna know

visual crypt
#

@gloomy sable

#

Can you hear me now @gloomy sable

vocal moth
#

What is the best STT or TTS model?

wary night
#

@vocal moth For STT you may use Azure Voice API and for TTS you may use Cartesia.

vocal moth
wary night
#

Azure voice api gives you real time language changes facility.

pine shuttle
#

help my whisper-large-v3 is tweaking

#

why is it doing that?

vocal moth
#

I think Deepgram also provides that function.

wary night
#

I recently did a PoC on it.

maiden sentinel
#

bruh

whole venture
#

ai

#

am i right guys

gusty fjord
visual crypt
wary night
#

No

visual crypt
#

What is your POC?

keen magnet
hybrid robin
wary night
dark shore
#

hello

silver wraith
lean patrol
#

What’s whispering?

void bramble
tawny horizon
#

I still can't find an open-source speech-to-text model that goes beyond Whisper for Japanese.

narrow viper
#

hii

visual crypt
#

hi

keen magnet
#

Can we expect newer whisper model this year?

keen magnet
#

Can we expect a better open-weight whisper model this year?

scenic condor
#

no

strong orchid
#

Since this incident started today (5/13) https://status.openai.com/incidents/01KRG0AZKH41DV4D9SNJSXM33Q#01KRG0AZKHH37CKBST5E3WBQW6 our Realtime API integration has been down. Is anyone else seeing the same issue?

In our case, the flow is now:

  1. SIP INVITE reaches OpenAI.
  2. OpenAI dispatches the realtime.call.incoming webhook to our server.
  3. We call /v1/realtime/calls/{call_id}/accept.
  4. The accept request returns HTTP 200.
  5. Immediately after that, connecting to wss://api.openai.com/v1/realtime?call_id={call_id} returns:

{
"error": {
"message": "No session found for the provided call_id",
"type": "invalid_request_error",
"code": "call_id_not_found",
"param": ""
}
}

tender escarp
#

Since the realtime api went down my application has been getting the following error:

#

Realtime call failed {
status: 400,
statusText: 'Bad Request',
body: '{\n' +
' "error": {\n' +
' "message": "The Realtime Beta API is no longer supported. Please use /v1/realtime for the GA API.",\n' +
' "type": "invalid_request_error",\n' +
' "code": "beta_api_shape_disabled",\n' +
' "param": ""\n' +
' }\n' +
'}'
}
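
When debugging failures like these, it can help to pull the `code` field out of the error body and branch on it instead of matching on message text. The sketch below only knows the two codes quoted in this thread (`call_id_not_found` and `beta_api_shape_disabled`); the helper names and the hint wording are assumptions for illustration, not anything from OpenAI's SDK.

```python
import json


def error_code(body):
    """Return the error.code string from a JSON error body, or None."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    err = payload.get("error") if isinstance(payload, dict) else None
    return err.get("code") if isinstance(err, dict) else None


# Hints are this sketch's own wording, based only on the messages quoted above.
HINTS = {
    "beta_api_shape_disabled": "request still uses the beta API shape",
    "call_id_not_found": "no session exists for this call_id",
}


def describe(body):
    """Map an error body to a short hint, falling back to the raw code."""
    code = error_code(body)
    return HINTS.get(code, f"unrecognized error: {code}")
```

Logging `describe(body)` next to the call_id makes it easier to tell a client-side request-shape problem apart from a missing session.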

#

i made some changes and now i cant escape this error

hardy snow
#

hi

worthy glen
#

Ok