#GPU vs CPU

turbid forum
#

At the moment I run HA on a small Optiplex system, with a 10th gen Intel CPU.

I keep wanting to get into voice assistants, but am not sure if I need a system with a ‘proper’ GPU or if what I have will suffice. My understanding is that speech-to-text, and thus Whisper, is the step where faster hardware would do the most to speed up voice processing.

So the options the way I see it:

  1. My current system has plenty of CPU capacity, with very low load numbers. It has 16GB now, but I could increase this to 64GB. This would use CPU for processing, but I could be using large models.

  2. Use another small system I have for which I could buy a low-profile GPU with 6GB VRAM.

To make the right decision, I am wondering if someone has any comments regarding the following:

A) Can the Whisper add-on use either a CPU or a GPU, and is this user-configurable?

B) Using for example the ~5GB medium model, is there a massive difference between CPU and GPU in terms of how fast speech gets converted to text?

C) Most people won’t have a GPU on their HA server. Am I overcomplicating this and is CPU-based STT plenty fast enough?

crystal geyser
#

My 2 cents is as follows:

A) I don't believe the HA Whisper add-on lets the user choose between CPU and GPU. I am running faster-whisper on my old Windows computer, repurposed as a Linux server. There is some configuration involved to get the rhasspy/wyoming-whisper Docker container to use the GPU. This server also has a refurbished RTX 3090 that I purchased when prices were more reasonable. I wanted a GPU with 24 GB of VRAM because I also run my AI (LLM) for HA on it.

B) With my GPU Whisper setup it takes 0.13 seconds to convert the sentence "What is the garage temperature?" to text. That is with the large model. When I use the HA version of Whisper it takes 2.84 seconds with the small-int8 model and 12.75 seconds with the medium model. My Home Assistant is running on a mini computer with an N100 processor and 16GB of RAM.

C) The response time of the small-int8 model is tolerable at 2.84 seconds in my opinion, but of course that model is not as accurate as the larger models. If you want to get into the AI part of things plus the STT, I would recommend a used/refurbished NVIDIA GPU with 12 to 16GB of VRAM. I would recommend a used GTX 1070 for TTS only.

turbid forum
#

Thanks for the comments! Food for thought.

#

If you don’t mind: the 0.13s vs 2.84s …. Were both of these using the GPU, just with a different version of Whisper?

crystal geyser
#

I apologize for not being clear. I also misspoke. The 0.13 seconds was on my remote server, which runs the Willow Inference Server Docker container with a model size that changes based on the length of the speech. It uses the RTX 3090 GPU. I tested the same sentence with the rhasspy/wyoming-whisper Docker container on the same remote server and RTX 3090, where it only uses the faster-distil-whisper-large-v3 model, and the STT took **0.28** seconds. The 2.84 seconds was using the Whisper add-on for Home Assistant, which runs on my N100 mini computer (the Home Assistant host) with no GPU, with the small-int8 Whisper model. The 12.75 seconds was also run on the N100 (no GPU) but with the Whisper medium model. I am pretty sure that the Home Assistant Whisper add-on also uses the rhasspy/wyoming-whisper Docker container, but without a GPU. A more comprehensive comparison of the speed improvement from running Whisper on a GPU versus a CPU can be found at https://github.com/toverainc/willow-inference-server
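
For a rough sense of scale, the latencies quoted in this thread can be turned into ratios. This is only a sketch of the arithmetic: the four figures come from different models on different hardware, so the ratios are indicative rather than a like-for-like benchmark. The dictionary labels are shorthand invented here.

```python
# STT latencies reported in this thread for the same short sentence (seconds).
# The labels are shorthand for the setups described above.
timings = {
    "Willow Inference Server on RTX 3090": 0.13,
    "wyoming-whisper (distil-large-v3) on RTX 3090": 0.28,
    "HA Whisper add-on (small-int8) on N100": 2.84,
    "HA Whisper add-on (medium) on N100": 12.75,
}


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times faster the fast setup is than the slow one."""
    return slow_seconds / fast_seconds


baseline = timings["wyoming-whisper (distil-large-v3) on RTX 3090"]
for name, seconds in timings.items():
    ratio = speedup(seconds, baseline)
    print(f"{name}: {seconds:.2f}s ({ratio:.1f}x the GPU wyoming-whisper time)")
```

On these numbers the CPU-only small-int8 run takes roughly ten times as long as the GPU container, and the CPU-only medium run roughly forty-five times as long.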

turbid forum
#

I was already asleep when you replied 🙂

Basically .... yes, a GPU is dramatically faster than a CPU .... extrapolating (wildly unscientifically) with my current CPU running the medium model and a 3.8s input, it would take 4.3s for the result ... that seems uncomfortably long.

So I guess I'll need to spend money on a new server to which I can add a GPU. Sigh. I was hoping to avoid that!
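
One way to express an extrapolation like the one above is a real-time factor (processing time divided by audio length). The sketch below assumes latency scales linearly with the length of the audio, which is only a rough approximation. The function names are made up for illustration, and the 3.8s and 4.3s figures are the ones quoted in the message.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


def estimate_latency(audio_seconds: float, rtf: float) -> float:
    """Estimate processing time for a clip, assuming linear scaling with length."""
    return audio_seconds * rtf


# Figures from the message above: a 3.8s input estimated to take 4.3s.
rtf = real_time_factor(4.3, 3.8)
print(f"real-time factor: {rtf:.2f}")
print(f"estimated latency for a 2.0s command: {estimate_latency(2.0, rtf):.2f}s")
```

A real-time factor above 1.0 means the result arrives later than the time it took to say the command, which is why it feels slow.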

crystal geyser
#

There is also the Nabu Casa Home Assistant cloud which is very accurate and fast, but that defeats the purpose of not depending on the cloud.

turbid forum
#

Thanks for your comments, guys.

jade trellis
#

Buying more is more suitable and faster--- NVIDIA🚀

ebon pumice
#

fwiw, I have a somewhat similar setup. My home server has a 12th gen 1220P processor. 10 cores, 12 threads. I tried running faster whisper inside and outside home assistant and the conclusion I got is that without a GPU the best you can expect to use is the small-int8 model, which for me takes ~2.5s to process each voice command. Jumping to the medium model it takes around 10 seconds, and the large model I don't even know but a lot.
Sub 3s per command is the limit of what feels okay to use.
That said, the small model is okay to use. It is not perfect, but it does understand you pretty well.
There's also the alternative of running Vosk, which is way faster than whisper.

#

For whisper you probably don't need a super-beefy GPU. Most likely a modern AMD APU that supports ROCm would suffice for faster-than-realtime whisper. LLMs tho are a different story

turbid forum
#

Thanks for sharing. That’s useful.

glossy nexus
#

Most AI stuff is leveraging CUDA, and ROCm sadly isn't as easy to use as CUDA yet from what I have read. You can get it to work, but it's a bit of a pain, just a heads up. It's getting there though, hopefully in the next year it gets better with all the AI focus AMD has now. 🙂

#

That said, I'm running on a 4060ti using distil-large-v2 and my response times from faster-whisper are around 250ms

#

pretty sure you could get a used gtx 1070 on ebay and get pretty decent performance as well though 🙂

#

Also to add, I'm not sure the wyoming faster-whisper container would support ROCm as it is now, even to get CUDA to work there's a bit of work you have to do to get the container to recognize the CUDA device and load the proper libraries to utilize it.
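
Outside the Wyoming container, the underlying faster-whisper Python library lets you pick the device directly, which is a quick way to check whether a GPU is usable before wiring up the container. This is a minimal sketch, assuming `faster-whisper` is installed and, for the GPU path, a working CUDA setup. The model choices and the helper names `pick_settings` and `transcribe_timed` are illustrative, not something prescribed in this thread.

```python
import time


def pick_settings(use_gpu: bool) -> dict:
    """Return faster-whisper constructor arguments for a CUDA or CPU run."""
    if use_gpu:
        # float16 on CUDA; needs the NVIDIA libraries visible to the process.
        return {
            "model_size_or_path": "distil-large-v3",
            "device": "cuda",
            "compute_type": "float16",
        }
    # int8 quantization keeps CPU inference tolerable for small models.
    return {"model_size_or_path": "small", "device": "cpu", "compute_type": "int8"}


def transcribe_timed(wav_path: str, use_gpu: bool) -> tuple[str, float]:
    """Transcribe one file and return (text, seconds spent transcribing)."""
    from faster_whisper import WhisperModel  # pip install faster-whisper

    model = WhisperModel(**pick_settings(use_gpu))
    start = time.perf_counter()
    segments, _info = model.transcribe(wav_path, beam_size=5)
    # segments is a generator: the work happens while it is consumed.
    text = "".join(segment.text for segment in segments)
    return text.strip(), time.perf_counter() - start
```

Calling `transcribe_timed("command.wav", use_gpu=False)` and then again with `use_gpu=True` on the same recording (the file name is a placeholder) gives a like-for-like comparison on your own hardware.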

turbid forum
#

I'm not sure what CUDA and ROCm are 🙂

glossy nexus
#

CUDA is the set of NVIDIA libraries used for GPGPU compute, i.e. the processing used by things like STT inference and LLMs. ROCm is the AMD version of that. 🙂

turbid forum
#

Ah ok. Thx 🙂

#

It’s clear what I was hoping for isn’t going to work.

#

I like the compact little system I run HA on right now (one of those ‘1 liter’ pcs) and I was hoping I could get away with my 10th gen i5 and lots of RAM. But it’s pretty clear that’s not going to give satisfactory results using the medium model or bigger.

#

So I will have to rethink my setup. I will probably have to spend a chunk of cash to buy a GPU-capable system.

glossy nexus
#

Yeah, if you want something on par with Google/Alexa, you'd need a decent GPU. That said, you can also leverage the Nabu Casa cloud, as others have mentioned 🙂

#

Other thing is that HA itself can't use the GPU, it isn't exposed

turbid forum
#

Yeah but the point has always been to keep it local.

glossy nexus
#

most people leveraging GPU with wyoming faster-whisper are running that container on a separate server running docker with the GPU exposed.

#

Or if you are like me, a k8s cluster 😄

turbid forum
#

I run HA in a Proxmox vm, so I’d create a second vm or lxc with the GPU.

glossy nexus
#

Yeah, that's what I have.

#

HA VM, controlplane VM, worker VM with GPU passed to it.

turbid forum
#

Not sure I’m ready to drop $1000 on a system with a GPU, just so I can turn the kitchen lights on through voice 🙂

#

Especially as I went through the trouble of creating voice commands on my phone, that do anything I want. Obviously that’s not as versatile and scalable as using voice assistant, but it’s been working fine for the most part.

#

But I appreciate your comments. Useful insight added.

#

And hey ….. maybe I will convince myself I need a system to “play with AI” on 😉

glossy nexus
#

Well you could get a gtx 1070 used for under $100 and just add that to your existing Proxmox server if you wanted 🙂

#

But either way, you have all the informational pieces now 🙂

turbid forum
#

I can’t. The case has no space for a card. Not even a single slot low profile one.

#

But yeah … thanks 🙂

ebon pumice
#

I'm in the same camp. I have an Intel NUC so a proper GPU won't fit.

#

Also, idle power consumption matters to me. I don't like the idea of having a computer that runs 24/7 and draws 50 W idle. That's why I'm eyeing the new AMD APUs that are starting to appear (Strix Point / Ryzen AI). The GPU is decent, most stuff works with ROCm if you tinker with it a bit, and mini PCs with those APUs use ~7-8 W idle.
It is unclear to me when or if the NPUs in modern CPUs will be helpful. I've seen a few demos of small LLMs running purely on the NPU, which is nice because they use very little power compared with a GPU, but they are so new that even Linux doesn't have drivers for them built in.

#

modern AMD APUs have NPUs with 50 TOPS, and modern Intel/Snapdragon processors are at 45-48 TOPS, which is actually decent, but without drivers it's just a pretty number.

#

also, even if you have the fastest GPU on the market, I've noticed that there are missing pieces for getting local voice control to be as good as commercial products like Alexa. The biggest one is voice lock, IMO. Alexa learns your voice and can tell you apart from other family members. Also, if you issue a voice command while someone is speaking on the TV or other people are speaking nearby, Alexa can pick out your voice and ignore the ones that did not trigger the wake word. There isn't anything like that yet for local voice control

#

speech recognition right now is good but dumb

glossy nexus
#

So that last part is what XMOS works to solve, using AI to filter and target your voice, along with various technologies like BSS, Beam Forming, AEC, etc.

#

As for Voice Identification, that is also something I think Nabu is looking into. That actually doesn't occur at the server side technically, it occurs at the wake side I think

#

so something that would happen in MWW

ebon pumice
#

@glossy nexus what is XMOS? A technology or a person in the community?

glossy nexus
#

A technology 😉

#

The voice satellite being worked on by futureproofhomes.net, the ReSpeaker Lite that just came out, and the HA Voice Satellite hardware Nabu is working on all make use of the XMOS chip

ebon pumice
#

I'd love to read more about that. I assumed all the work was done by a regular ESP32-S3 on the board

#

I can say that the Home Assistant Voice speakers are working much better than my old Onju Home v3

glossy nexus
#

the ESP32-S3 is the core microcontroller, the XMOS chip is in front of that. So it would process the audio and such, then hand that off to MWW for instance

#

MWW would be running on the ESP32-S3

tribal delta
#

I guess the only XMOS solution you can try right now is the ReSpeaker Lite. I have a couple (well, 4, and 2 more ordered), and they're decent, although they still lack noise suppression...

turbid forum
#

What about the Coral TPUs? Those plug into an M.2 E-key. I assume they require their own alternative to the aforementioned CUDA/ROCm. Is there a Whisper implementation that uses those and do they have enough juice to be useful for this specific workflow?

glossy nexus
#

I think i saw someone got whisper to work on tpu somewhere, but it wasn't super fast

#

Maybe something that took 5 seconds went down to 3 for instance, wasn't the sub-one second performance one would want, and don't think they were using a large model like distil-large

turbid forum
#

Hm ok so not an option then. Thx!

turbid forum
#

So …. CUDA and ROCm are …. APIs? Architectures?

#

The Coral TPUs will have their own, I imagine.

#

I was wondering …. the Apple silicon graphics cores and ‘neural engines’ … what API do they use?

#

And then collectively, all 4 of these … (cuda/rocm/coral/apple) … how would I refer to this? “GPGPU architectures/apis”???

#

Sorry, a bit off topic but you guys seem to have a better understanding than I do. Hope you don’t mind.

glossy nexus
#

CUDA and ROCm are a combination of Architecture and Frameworks. CUDA is facilitated by the CUDA Cores on the Nvidia GPU, and then there are CUDA libraries that allow code to tap into those specialized cores. ROCm is the same thing, just with AMD's architecture. These are libraries that allow tapping into special Processor Cores that are very good at things like Matrix Multiplication, Floating Point Calculations, Vector Calculations, etc. that are critical components of processing and running neural networks. 🙂

#

Can google and find out more, that's my rough high-level explanation 😄

#

For Nvidia there's also Tensor Cores which are specialized AI cores that can also be utilized if the code implementing the framework supports it as well as the AI model itself. That's what Nvidia's Chat with RTX makes use of.
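
To make "matrix multiplication" concrete, here is a minimal pure-Python sketch of the operation those cores accelerate, plus a count of the multiply-add steps one such product needs. The layer sizes in the last line are illustrative placeholders, not measurements of any particular model. Real frameworks hand this work to CUDA or ROCm kernels instead of Python loops.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def multiply_adds(rows: int, inner: int, cols: int) -> int:
    """Number of multiply-add steps in a (rows x inner) @ (inner x cols) product."""
    return rows * inner * cols


weights = [[1, 2], [3, 4]]
inputs = [[5, 6], [7, 8]]
print(matmul(weights, inputs))

# Illustrative sizes only: one projection over 1500 positions of width 1024.
print(f"{multiply_adds(1500, 1024, 1024):,} multiply-adds")
```

Every one of those multiply-adds is independent of the others in the same product, which is why hardware with thousands of small cores finishes them so much faster than a handful of CPU cores.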

turbid forum
#

Fantastic brief. Thanks so much for taking the time to educate me this past week. Very much appreciated.

#

Next time you’re in Bangkok, drinks are on me 🙂

mild gazelle
#

I'm guessing this is old, but I'm wondering how people are calling a Whisper engine on a server from their Home Assistant. I'd like to run a Whisper large model using GPU compute on my server, and then have my Home Assistant, running on a Raspberry Pi 4, call the server for speech-to-text.
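
For context, Home Assistant talks to a remote Whisper server over the Wyoming protocol, a TCP stream of JSON-line events. Before pointing Home Assistant at the server, you can check that it answers. The sketch below sends a `describe` event and reads the reply. It assumes the server listens on port 10300 (commonly used for wyoming-whisper, but check your own setup), and the host name in the usage note is a placeholder.

```python
import json
import socket


def encode_event(event_type: str) -> bytes:
    """Serialize a data-less Wyoming event as a single JSON header line."""
    return (json.dumps({"type": event_type}) + "\n").encode("utf-8")


def read_event(stream) -> dict:
    """Read one event: header line, then optional extra JSON data and payload."""
    header = json.loads(stream.readline().decode("utf-8"))
    data = dict(header.get("data") or {})
    data_length = header.get("data_length") or 0
    if data_length:
        # Larger data blocks follow the header as a separate JSON document.
        data.update(json.loads(stream.read(data_length).decode("utf-8")))
    payload_length = header.get("payload_length") or 0
    payload = stream.read(payload_length) if payload_length else b""
    return {"type": header["type"], "data": data, "payload": payload}


def describe(host: str, port: int = 10300, timeout: float = 5.0) -> dict:
    """Ask a Wyoming server what it offers; a healthy one replies with 'info'."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(encode_event("describe"))
        with sock.makefile("rb") as stream:
            return read_event(stream)
```

Something like `describe("gpu-server.local")` should return an event of type `info` listing the available speech-to-text models. If it does, the same host and port are what you would enter when adding the server to Home Assistant.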

eager oyster
#

also, I think you win "necro thread of the week"

mild gazelle
#

haha thanks, I'm trying to search instead of create a new thread 🙂

#

your container, does it work on Windows? I know there's WSL but I'm not sure if it's a direct equivalent...

#

Looking at the Home Assistant forums, I stumbled upon the Speaches and wyoming-openai projects, but I'm not sure whether they'll support the architecture I'm trying to achieve

mild gazelle
#

ah k

#

I'm trying to centralize the models on the server, and then have them called by different end points

#

like ollama can handle the models, whereby Home assistant and OpenWebUI can call to Ollama for the LLM

#

similarly, I'd like openwebui and homeassistant to call whisper-large and kokoro via their respective mechanisms

eager oyster
#

you don't run whisper in ollama. you run it separately with a wyoming wrapper

mild gazelle
#

right, but i'm thinking that with a wyoming wrapper, would openwebui be able to use it?

eager oyster
#

why would you need it to?

mild gazelle
#

well i locally host on my server, and i'd like my openwebui interface to have speech as well

#

kinda like having a speech specialists, that responds to open web and responds to home assistant... would that work in the wyoming wrapper?

eager oyster
#

if you run the pipeline through home assistant then you can select the model there.
if you want to be able to type in open web ui and get the response on speakers then you could maybe use MCP to send the command to HA to use piper to TTS to the speaker

mild gazelle
#

sorry, maybe to clarify: I'm not using openwebui with HA, they are different use cases. i.e. I'm having a conversation with my local LLM for work via openwebui. Then I flip to talking to the HA voice to turn down the heat.

eager oyster
#

these are separate things, HA will speak directly to ollama

mild gazelle
#

in that case I'd like the ASR and TTS to exist on the server local pipeline, but have the 2 tools access them independently

#

yeah exactly, but how do I deploy Whisper and Piper/kokoro to allow both to access?

#

I'm guessing that whisper with a wyoming protocol wrapper won't be accessible to openwebui? but perhaps I'm wrong on that one...

eager oyster
#

if you're using both whisper and piper to speak to an llm why do you need the webui at all?

mild gazelle
#

openwebui holds the interface, including voice interaction. Frequently need to send it text and other files for analysis and work

eager oyster
#

right, I see what you're getting at. it looks like you can add whisper to openwebui but it's not a setup that home assistant will be able to use. you will need a separate instance of it running with wyoming for use with home assistant

#

both of the things you want to do involve whisper but they are different deployment setups

dapper crane
# turbid forum So I will have to rethink my setup. I will probably have to spend a chunk of cas...

As soon as you fall into the LLM rabbit hole there is no stopping. I started with a Mac Mini M4 with 32 GB, and the TTS and STT stuff is still running on it. To run a capable LLM locally, there is no way around NVIDIA. In the meantime I have built a custom LLM rig with two RTX 3090s, which I use to serve Gemma 3 27B to HA. Yesterday I ordered an RTX 5090 to get even snappier LLM responses with high context. On my mini PC running HA I don’t run any voice or LLM stuff, because the i5 CPU is way too slow for this kind of task.

mild gazelle
#

yes it is endless. but also relatively accessible

mild gazelle
#

wyoming whisper uses the tiny model, and I'd like to increase the accuracy with a larger model running on a GPU to maintain speed. similarly for TTS

eager oyster
#

i use large-v3-turbo with mine just fine

mild gazelle
#

oh right, sorry i forgot

olive forum
#

EdgeTPUs have very limited on-chip SRAM and very limited bandwidth to host DRAM. If your model won’t fit on the on-chip SRAM, I doubt that the EdgeTPU is a reasonable option.

I believe most of the EdgeTPU die is actually the SRAM. It’s just that the off-chip memory bandwidth is very limited.
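
A back-of-the-envelope check makes the point. The sketch below compares approximate Whisper parameter counts (as published for the original OpenAI models) with an assumed on-chip budget of about 8 MB for the Edge TPU. Both sets of numbers come from outside this thread and should be treated as assumptions, and the int8 size estimate ignores activations and any compiler overhead.

```python
# Approximate parameter counts in millions for the original Whisper models.
WHISPER_PARAMS_MILLIONS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}

# Assumption: roughly 8 MB of on-chip SRAM available for model parameters.
EDGE_TPU_SRAM_MB = 8


def int8_size_mb(params_millions: float) -> float:
    """Weight size in MB at one byte per parameter (int8 quantization)."""
    return params_millions * 1.0


def fits_on_chip(model: str) -> bool:
    """True if the int8 weights alone would fit in the assumed SRAM budget."""
    return int8_size_mb(WHISPER_PARAMS_MILLIONS[model]) <= EDGE_TPU_SRAM_MB


for name in WHISPER_PARAMS_MILLIONS:
    size = int8_size_mb(WHISPER_PARAMS_MILLIONS[name])
    verdict = "fits" if fits_on_chip(name) else "does not fit"
    print(f"{name}: ~{size:.0f} MB of int8 weights, {verdict} in {EDGE_TPU_SRAM_MB} MB")
```

Under these assumptions even the tiny model is several times larger than the on-chip memory, so the weights would have to stream over the limited off-chip link on every inference.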

mild gazelle
#

Ok, so I've answered my own question. On the home assistant forums this is the answer:
https://community.home-assistant.io/t/gpu-support-in-wyoming-whisper-docker/862783/2

I was able to use GP123's link to Roryeckel's container to get everything I needed. I did install PyTorch and ffmpeg on bare metal and on the PATH first (not sure if it's necessary, but I did it anyway), and ran the requirements downloads within a virtual environment in the folder, as per Roryeckel's post.

Everything is working nicely. I can use openwebui to call kokoro, and so can home assistant

#

I can run Gemma3:27b, Speaches Whisper large, and Kokoro FastAPI all in 24 GB