I put HA on Proxmox and now I want to offload Whisper to my desktop. I'm guessing the only way is Docker Desktop atm. I have almost zero experience with Docker, so what I'm curious about is: would I need to set up the NVIDIA Container Toolkit etc. to get this container working, or can I just put this in a container and be fine?
#linuxserver/faster-whisper on docker Desktop
would it be better to run linux in a vm on windows instead? open to thoughts as i kind of have a grasp but not by much lol
for `volumes: - /path/to/faster-whisper/data:/config`, would i point that to where i want whisper to be installed?
im just lost what to put there for the path
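For reference, a compose volume entry has the form `host path:container path`. A minimal sketch, assuming a Windows host folder like the one that appears later in this thread; the folder is where the container keeps its persistent data, not where the image itself gets installed:

```yaml
services:
  faster-whisper:
    image: lscr.io/linuxserver/faster-whisper
    volumes:
      # Left side: a folder on the host (this Windows path is only an example).
      # Right side: the path inside the container, which stays /config.
      - C:\Users\Viper\faster-whisper\data:/config
```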
ok i tried searching for the linux faster-whisper and ran it.. only to get these errors constantly
after pulling the linux image for the faster whisper ive got these options. any idea what i'd do here?
You should use compose. You also need to use the gpu tag and follow this: https://docs.docker.com/desktop/features/gpu/
i got whisper working in docker and have pointed ha to it. it works but i definitely need to figure out how to pass the gpu in
do i need Nvidia Container Toolkit?
I told you all that you need.
i have the latest Game Ready driver yet it fails when installing the compose
Please share the compose file you use.
---
services:
  faster-whisper:
    image: lscr.io/linuxserver/faster-whisper:gpu
    container_name: faster-whisper
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Etc/UTC
      - WHISPER_MODEL=tiny-int8
      - WHISPER_BEAM=1 #optional
      - WHISPER_LANG=en #optional
    volumes:
      - C:\Users\Viper\faster-whisper\data:/config
    ports:
      - 10300:10300
    restart: unless-stopped
The nvidia/gpu section is missing.
Can add the following to the environment vars:
That should make it use the nvidia container toolkit
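The snippet that went with this message did not survive in the thread. Going by the compose that ends up working further down, the missing pieces were probably along these lines. This is a sketch assembled from that later compose, not a verified config:

```yaml
services:
  faster-whisper:
    image: lscr.io/linuxserver/faster-whisper:gpu
    environment:
      # Expose the GPUs and driver capabilities to the container.
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    deploy:
      resources:
        reservations:
          devices:
            # Request one NVIDIA GPU for this service.
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```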
Yeah just bear in mind if you set the GPU reservation in compose, it takes the entire GPU blocking other pods from utilizing it 🙂
Not my experience.
might be a quirk of pure docker maybe. I'm using k8s so it's probably more strict 😄
I'm using the same GPU like this for whisper, ollama and piper. Just normal docker and compose on linux.
Ok ill try it when i get home. Info is so scattered about this. Read countless sites and you get bits and pieces of the puzzle. For someone not experienced in docker its a pain wrapping your head around all the little details and how to enable them
Is that linux compose still a good one for this or is it out of date?
that linux compose
What are you referring to specifically?
I mean the image from that linux server. -dumb windows guy-
Sure.
so i run it now and i get this and it restarts
services:
  faster-whisper:
    image: lscr.io/linuxserver/faster-whisper:gpu
    container_name: wyoming-whisper
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Etc/UTC
      - WHISPER_MODEL=large-v3
      - WHISPER_BEAM=1 #optional
      - WHISPER_LANG=en #optional
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - 10300:10300
    volumes:
      - C:\Users\Viper\faster-whisper\data:/config
    restart: unless-stopped
thats my compose
WAIT
HOLY
SHIT I GOT IT WORKING
2025-04-28 18:37:36.296 | [ls.io-init] done.
2025-04-28 18:40:08.884 | INFO:faster_whisper:Processing audio with duration 00:04.330
2025-04-28 18:40:09.923 | INFO:wyoming_faster_whisper.handler: Close the blinds.
2025-04-28 18:41:54.479 | INFO:faster_whisper:Processing audio with duration 00:03.620
2025-04-28 18:41:55.024 | INFO:wyoming_faster_whisper.handler: What time is it?
thats large v3 so i already know its gpu just because of the response time
wonder if i should switch to turbo or leave it here.
now to put piper on there
services:
  faster-whisper:
    image: lscr.io/linuxserver/faster-whisper:gpu
    container_name: wyoming-whisper
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Etc/UTC
      - WHISPER_MODEL=large-v3
      - WHISPER_BEAM=1 #optional
      - WHISPER_LANG=en #optional
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    ports:
      - 10300:10300
    volumes:
      - C:\Users\Viper\faster-whisper\data:/config
    restart: unless-stopped
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities:
                - gpu
                - utility
                - compute
would a piper compose look roughly the same for gpu?
Yep.
sorry for nagging you guys but docker has been a real pain to get. first time using it
Kind of like that with everything in technology. Everything is connected to other things, some more complicated than others. Docker involves storage, networking, containers, in your case a VM and desktop, GPU, drivers, toolkits, etc.
TLDR: I get it.
thinking i'll keep ollama without control and just use it for tts automations so it doesn't feel boring and repetitive. not sinking a mountain of money in making it run nearly as quick.
yea it is. had to learn linux, docker and proxmox for what i wanted to do. been a pain because it's easy to overload yourself with info and get scatterbrained
The HA "LLM" alone is too strict for my liking.
for me i don't have enough knowledge with llm to get it working the way i want. it was hit and miss and hallucinations.
a guy at work has one set up and he literally runs many cards for it in a server just to learn it. he's the IT guy and i suspect he's getting the cards from work for free. shrug..not making waves
only thing i'm not sure about is how would i set up a custom voice with piper in docker? in ha it's easy
yea ha doesnt see custom voices in the data file for what ever reason
i take that back it sees it. it just doesnt play the file
yea it pulls from hugging face i guess when a new voice is added and if it doesnt find it in the piper repo it errors out. damn that sucks
Which image do you use?
Set it via the PIPER_VOICE variable. It should download the language during start.
See languages here: https://rhasspy.github.io/piper-samples/
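As a rough illustration of the PIPER_VOICE suggestion: the sketch below assumes the linuxserver piper image (the thread never confirms which image is in use), a hypothetical host folder, and `en_US-lessac-medium` as a placeholder voice name taken from the samples page.

```yaml
services:
  piper:
    image: lscr.io/linuxserver/piper   # assumed image, not confirmed in the thread
    container_name: piper
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Etc/UTC
      - PIPER_VOICE=en_US-lessac-medium   # placeholder voice name
    volumes:
      - C:\Users\Viper\piper\data:/config   # hypothetical host folder
    ports:
      - 10200:10200   # usual Wyoming piper port
    restart: unless-stopped
```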
well im adding a custom voice i found from hugging face. tried putting it in the data folder with the rest of the voices
i had set the voice in the compose but i think it looks on hugging face for that voice instead of the data folder
unless im missing something i guess the only way to do custom voices is via the addon and share folder
but that would mean running piper on ha
i'll look in to it tmrw, but honestly not sure if it matters putting piper on the gpu. with whisper on it, it's still pretty damn quick
Realistically not, no. I do it because I can.
Yeah, Piper is fast
If you have room left in your GPU's VRAM, kokoro TTS sounds a lot better than piper IMHO.
Nevertheless Kokoro has pretty funky ways to be installed in Docker...
I have a slightly off-topic question. Right now I only want to use an LLM for rephrasing text. For instance, I have it let me know when someone is on the camera, but it changes the wording each time. I don't think qwen2.5 is right for this, since that's a 7b model and a bit big for basically rephrasing things. What's a more suitable model? I have no local control enabled atm, and my prompt basically keeps things simple.
With ollama windows server and faster whisper large-v3 turbo i'm only utilizing 6gb of 12gb in vram
which makes me want to ask. since apparently everything windows sucks, would ollama in docker run any quicker vs the windows app
speaking of which, large-v3 turbo vs large-v3 makes a significant difference in speed, and i have yet to notice any loss of accuracy
is kokoro any slower, in theory i'd have 6gb to throw at it but if there is a 2 second difference i'd stick with piper. i tried some of the voices out on their site and they are better.
For memory intensive things like TTS and LLMs, I prefer Linux server (Ubuntu). It doesn't have a GUI and uses substantially less resources, leaving more for the fun stuff. Kokoro is instant for me, even with long phrases.
What gpu do you use?
I use two in my "AI Box". Both are 3060s with 12GB VRAM. One runs the LLMS and the other runs Whisper and Kokoro.
is there a way to speed ollama up. i'm not giving it control and the context is 8192. using llama3.2 but for instance for 12 words more or less to be said it runs 5-6 seconds before the response. gpu only spikes to 60%
Hard to say without seeing logs, tokens/s, full usage statistics, and so on. What does ollama ps say?
still learning but my prompt is stupid simple and i know i could be doing something wrong
`actions:
  - action: conversation.process
    metadata: {}
    data:
      agent_id: conversation.llama3_2
      text: >-
        Rephrase the following text and impersonate jarvis from iron man in your
        response: notify the owner that a person has been spotted on the back
        deck security camera.
    response_variable: response
  - action: assist_satellite.announce
    metadata: {}
    data:
      message: "'{{ response.response.speech.plain.speech }}'"
      preannounce: true
    target:
      entity_id: assist_satellite.home_assistant_voice_095f33_assist_satellite
mode: single`
simple automation. took 6 seconds to play
That looks okay.
Can you run model locally with --verbose and ask it to write 100-word poem or something?
`Softly falls the evening dew,
A gentle hush, a calm anew.
The stars appear, like diamonds bright,
A celestial show, in all its light.
The world is still, in quiet sleep,
Dreams dance upon, the darkness deep.
The moon's pale glow, illuminates the night,
A silver path, where shadows take flight.
In this serene and peaceful place,
I find my heart, a calm and peaceful space.
Where worries fade, and love resides,
And in the stillness, I am free to glide.
total duration: 2.9474056s
load duration: 1.6026694s
prompt eval count: 31 token(s)
prompt eval duration: 93.5146ms
prompt eval rate: 331.50 tokens/s
eval count: 110 token(s)
eval duration: 1.2502285s
eval rate: 87.98 tokens/s`
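A quick check of what those numbers imply. The 20-token figure for a roughly 12-word reply is an assumption, not something measured in the thread:

```latex
\begin{aligned}
\text{eval rate} &= \frac{110\ \text{tokens}}{1.2502285\ \text{s}} \approx 87.98\ \text{tokens/s} \\
\text{load} + \text{prompt eval} + \text{eval} &\approx 1.603 + 0.094 + 1.250 = 2.947\ \text{s} \\
t_{\text{reply}} &\approx \frac{20\ \text{tokens}}{87.98\ \text{tokens/s}} \approx 0.23\ \text{s}
\end{aligned}
```

More than half of the 2.95 s total is load duration, which is only paid when the model is not already in memory. At this eval rate, generating a short announcement should take well under a second, so a 5-6 second delay is likely coming from somewhere else in the chain.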
well i'm not even sure it's ollama. i was just guessing. so here is what i have going: via yaml i have the HA Voice PE play out to an external media player, a "sonos one". could it be that handshake there that's causing the delay?
im using the tts uri method from another post in here
i'm also using a custom jarvis voice "high" from hugging face
and piper is being run in ha not on gpu
my assumption was the model itself might be too big for what i'm using it for?
llama3.2 seems quicker than qwen2.5 for sure
tried walking in front of a camera..counted to 10 and only ever got the announce sound from the speaker. it seems hit and miss. not seeing anything in the logs about it. sounds like it's timing out
If you go into settings / voice assistants, choose debug from the three dot menu by the assistant. It will show you exactly how long each step is taking (speech to text, natural language processing, and text to speech). Then you can see if something else is slowing you down. Maybe ollama's snappy at 2 - 3 seconds and tts is adding 3 more seconds or something like that.
thing is, the debug shows the last time i spoke to voice assistant. it doesn't show when i run it in a conversation.process in an automation
and i use home assistant as the conversation agent for speaking to assist. llm was messing things up. but when i use it in automations, i point them to the ollama integration and the voice model i use
but here is one for when i opened the blinds. it took ha longer to act than faster whisper
but i get it, faster whisper has nothing to do with my issue since it's the llm im using there
there just seems to be a bottleneck somewhere in between. not sure if it's maybe the sonos api
In your screenshot 2.68 s was processed locally, so it's not LLM, it's Hassil.
that was just referencing local assist when spoken to. i dont know of a way to see how llm is processed when used in automations
right now when someone went on the deck it started responding then cut off after partially playing the audio
my context is still 8192
You can trace any automation too. Go to trace and check the timeline.
i traced the automation and it showed everything running
nothing errored out
restarted ollama and it's still cutting off the audio
plays like 2 seconds then stops
could it be the pe doing it?
You may click on TTS URL and hear what your TTS generated. Then it will be clear.
If it's PE, check if it's not underpowered. Use good power adapter.
tts files play all the way through
so the pe is cutting it off..
Error while executing automation automation.test_voice_pe: Error calling SonosMediaPlayerEntity._play_media on media_player.foyer_speaker: UPnP Error 714 received: Illegal MIME-Type from
just got this
wonder if its the sonos integration at the root
Sorry, I didn't read all the thread. Do you play TTS on PE or Sonos?
sonos. pe kinda sucks for speaker
almost like the type of file passing on to sonos it doesnt like
Well then it's definitely not PE to blame, right? 🙂
Looks like yes, Sonos doesn't like MIME.
got bored and gave it another shot and finally got the custom voice to work in piper container. had some initial dl error in the log but the voice works fine and verified via resources it calls to the gpu