#Whisper on GPU


thorny bane
#

I'm trying to get Whisper to run on the GPU on my QNAP NAS. I have a T400 4GB GPU, which I bought mostly for transcoding, but I figure I should be able to run a medium.en or small Whisper model as well.

That said, I haven't been able to tell whether the work is actually being offloaded to the GPU. When I run nvidia-smi, I only see 400 MB of memory usage and no processes, but GPU-Util goes up to 100%. That makes me think it's running on the GPU, but I would have expected more VRAM to be in use.

Am I missing something here? It could be a quirk of how QNAP works and their Docker configuration. I've had other strange issues in the past, so I wouldn't rule it out.
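
One way to make the check described above less ambiguous is to sample the GPU while idle and again during a transcription, then compare. The sketch below is only an illustration: it assumes nvidia-smi is on the PATH where the script runs, and the helper names (`parse_gpu_stats`, `read_gpu_stats`) are made up for this example.

```python
import subprocess

# Ask nvidia-smi for used memory (MiB) and GPU utilization (%) as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse lines like '400, 100' into (memory_used_mib, utilization_percent)."""
    stats = []
    for line in csv_text.strip().splitlines():
        mem, util = (field.strip() for field in line.split(","))
        stats.append((int(mem), int(util)))
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return one tuple per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)
```

Calling `read_gpu_stats()` once before sending audio and once while a request is in flight should show whether memory use rises along with utilization.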

vast dune
#

I ran v3-large-turbo on a 4gb 1650 without issue

#

v3-large-turbo on my current setup is using 1044mb apparently

#

change the model you're using and see if the gpu memory changes 😛

thorny bane
#

Hmm. I had seen some tables indicating model sizes, and they said medium was going to need 5GB, but maybe the newer turbo and distil models are just that much smaller.
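
A rough back-of-the-envelope check may help explain the gap between the table and the observed usage. The sketch below estimates weight storage only, using approximate parameter counts as I recall them from the OpenAI Whisper README; the function name is hypothetical. It ignores the CUDA context, activations, and decoding buffers, so real usage will be higher.

```python
# Approximate parameter counts in millions (assumed from the OpenAI Whisper README).
PARAMS_MILLIONS = {"small": 244, "medium": 769, "large": 1550, "turbo": 809}

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(model, dtype):
    """Weights-only footprint in MB (10**6 bytes): millions of params * bytes each."""
    return PARAMS_MILLIONS[model] * BYTES_PER_PARAM[dtype]


for name in PARAMS_MILLIONS:
    sizes = {dtype: weight_megabytes(name, dtype) for dtype in BYTES_PER_PARAM}
    print(name, sizes)
```

Under these assumptions, int8 weights for turbo come to roughly 0.8 GB, which is in the same ballpark as the ~1 GB figures reported in this thread, while the older ~5 GB figure for medium likely reflects full-precision weights plus runtime overhead.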

brave atlas
#

Which image do you use and how are you running it?

thorny bane
#

Docker Compose using Speaches:

  speaches:
    container_name: speaches
    image: ghcr.io/speaches-ai/speaches:latest-cuda
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      - enable_ui=True
      - log_level=debug
      - WHISPER__MODEL=Systran/faster-distil-whisper-medium.en
      - WHISPER__compute_type=int8_float32
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    volumes:
      - huggingface-hub:/home/ubuntu/.cache/huggingface/hub
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [ gpu ]
vast dune
#

i use a different image to be fair, but it looks like you're missing a couple of whisper options

#

moment

#
  faster-whisper:
    image: lscr.io/linuxserver/faster-whisper:gpu
    container_name: faster-whisper
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Etc/UTC
      - WHISPER_MODEL=cstr/whisper-large-v3-turbo-int8_float32
      - WHISPER_BEAM=5
      - WHISPER_LANG=en
    volumes:
      - ./config/:/config
    ports:
      - 10300:10300
    restart: unless-stopped
    devices:
      - /dev/dri:/dev/dri
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
thorny bane
#

Speaches sets some of those for me. I'm really only using Speaches because Kokoro TTS sounds nicer than Piper.

#

I did wonder about passing the device via /dev/dri

#

Upon doing another test, I can see memory usage go up to ~1GB, so I suppose it is running on the GPU. I was just expecting it to be higher and got worried.

vast dune
#

check beam is being set though. you don't want that defaulting to 1

thorny bane
#

Hm. I don't see an option to configure beam size. That's how many matches it considers before settling, right? So bigger = slower but probably better?

#

Beam size is not set in speaches, but it looks like it defaults to 5 in faster_whisper.
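
To illustrate the "bigger = slower but probably better" intuition, here is a toy beam search. It is not faster_whisper's decoder: the probability table is invented for this example, and `beam_search` is a hypothetical helper. It shows how a beam of 1 (greedy) can commit to a likely first word and end up with a worse overall phrase.

```python
import math

# Invented next-token probabilities, keyed by the tokens chosen so far.
TOY_MODEL = {
    (): {"burn": 0.5, "turn": 0.4, "tune": 0.1},
    ("burn",): {"of": 0.4, "off": 0.3, "on": 0.3},
    ("turn",): {"off": 0.9, "on": 0.1},
    ("tune",): {"on": 0.5, "off": 0.5},
}


def beam_search(model, steps, beam_size):
    """Return the best (tokens, log_probability) found with the given beam width."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            # Extend every surviving hypothesis by every possible next token.
            for token, prob in model[tokens].items():
                candidates.append((tokens + (token,), score + math.log(prob)))
        # Keep only the beam_size most probable hypotheses for the next step.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


print(beam_search(TOY_MODEL, steps=2, beam_size=1)[0])
print(beam_search(TOY_MODEL, steps=2, beam_size=5)[0])
```

With a beam of 1 the search keeps only "burn" after the first step and never sees the stronger "turn off" continuation, while a wider beam keeps both and picks the higher-probability phrase. The cost is that each step scores more candidates.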

vast dune
#

if you're transcribing massive audio files it would probably add meaningful amounts of time, but for commands to a VA it won't really be perceptible

thorny bane
#

That makes sense. I'll stick to 5.

craggy plank
#

I also use whisper-turbo and it uses ~1700mb of vram.

#

I got a used 12gb 3060 for AI stuff in home assistant and whisper is very fast, usually a sentence is processed in ~300-400ms

#

the LLM part is not that fast tho

#

the entire speak-response cycle using a gemma3 4B is ~4-5s

vast dune
#

either way, a 4gb card or above will be fine for anything whisper will throw at it

#

i have a 16gb card for whisper and llm and it's definitely the llm that's the bottleneck 😛

craggy plank
#

of those, 90% is the LLM

#

I need to experiment with running llama.cpp or vLLM directly instead of using ollama

#

in theory it can sometimes be up to 30% faster

vast dune
#

i have heard that but i am super new at messing with local llm and ollama is "easy"

craggy plank
#

same here. there's another STT project called sherpa-onnx that looks interesting, as it has more features than whisper (speaker identification in particular seems important: knowing who is speaking, besides what is being said, is also useful)

vast dune
#

speed with qwen3:14b is pretty good tbh

craggy plank
#

there's an addon with wyoming, yes

#

but I believe it only supports basic STT and TTS for now. TTS however supports kokoro, which has very high quality voices

vast dune
#

maybe ill give it a go on something

#

i have a pile of VA's on the desk for experimenting with

thorny bane
#

Yea, Qwen3:14b is pretty quick on my desktop (5700 Ti) but that runs intermittently. I want to run Whisper on my NAS, which is always on. So, this should be good. The biggest delay in my STT pipeline I think is model loading. That takes 4s, but the actual STT is very fast.

#

@vast dune could you share that ollama model?

#

Oh, to elaborate on the model loading... distil-medium.en loads in 4s, distil-large-v3 loads in 30s or something. Once loaded, both are fast, but I either need to set a TTL to prevent unloading or stick with the smaller models.
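
The trade-off described above can be sketched with a small idle-timeout cache. This is not how Speaches implements it: `ModelCache` and its `loader` argument are hypothetical. A real server would normally unload from a background timer, whereas this sketch only checks for expiry when a request arrives, to keep it short.

```python
import time


class ModelCache:
    """Keeps one loaded model until it has been idle for longer than ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader    # hypothetical callable that loads and returns a model
        self._ttl = ttl_seconds  # None means "never unload"
        self._clock = clock
        self._model = None
        self._last_used = None

    def get(self):
        now = self._clock()
        expired = (
            self._model is not None
            and self._ttl is not None
            and now - self._last_used > self._ttl
        )
        if expired:
            self._model = None            # drop the reference so memory can be freed
        if self._model is None:
            self._model = self._loader()  # the slow step (4s or 30s in this thread)
        self._last_used = now
        return self._model
```

A long or disabled TTL means only the first request pays the load time, at the cost of the model holding VRAM while idle.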

serene schooner
#

How does qwen3 work for you?
For me it always shits itself with the thinking stuff

#

I assume you patched that yourself

#

Ig you probably run that nginx hack lol

#

Currently using qwen2.5 7b on my 3060 12gb

thorny bane
#

Looks like he's using a "nothink" variant that suppresses the tags, so that's why I'm asking. 🙂 I did find a few nothink variants on ollama.com though. Maybe I'll just try one.

brave atlas
#

No need for a qwen3 variant, just add /no_think to the system prompt. IIRC you still get (empty) think tags though.
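
If the leftover tags are the only problem, one possible workaround is to strip them from the reply before it is spoken or displayed. This is a minimal sketch, assuming the model wraps its reasoning in literal `<think>` and `</think>` tags; `strip_think` is a made-up helper, and where you would hook it in depends on your setup.

```python
import re

# Match a <think>...</think> block (possibly empty or multi-line) plus trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text):
    """Remove any think blocks and return the remaining reply text."""
    return THINK_BLOCK.sub("", text).strip()


print(strip_think("<think>\n\n</think>\n\nTurned off the kitchen lights."))
```

The non-greedy `.*?` together with `re.DOTALL` keeps the match limited to each individual block even when the reasoning spans several lines.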

thorny bane
#

Yea, that's the problem. 😄

#

Based on that screenshot, it looks like the model that @vast dune is using has that suppressed.

viral crane
#

Think that's the no think variant

thorny bane
#

Thanks! That one doesn’t think, but it also doesn’t seem to want to call any tools or talk to my home. If I switch back to qwen3:8b, it all works but I get think tags. I guess I should try the regular qwen3:14b to rule out some issue with model size.

thorny bane
#

Well, that was my bad. I thought the assist checkbox was per pipeline, not per model. Doh!

#

This is great! Now all I gotta do is get HASS to send a Wake-on-LAN request to my desktop when my assist device goes to listening… would be easy, but of course I have them on separate VLANs and have to set up an HTTP service to proxy the request.
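
For the proxy idea, the piece that has to run on the desktop's VLAN is small. The sketch below builds and sends a standard Wake-on-LAN magic packet (six 0xFF bytes followed by the MAC address repeated 16 times). The function names, the broadcast address, and the port are assumptions for illustration, and the HTTP front end is left out.

```python
import socket


def build_magic_packet(mac):
    """Return the 102-byte magic packet for a MAC like 'aa:bb:cc:dd:ee:ff'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    return b"\xff" * 6 + bytes.fromhex(hex_digits) * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the packet over UDP on the local network segment."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Because broadcasts normally stay within one VLAN, a small service on the desktop's side would call `send_magic_packet` when it receives the HTTP request from Home Assistant.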

vast dune
#

it's not perfect, sometimes the thinking randomly appears, but for the most part it works great

serene schooner
#

no 7b tho which I might need

#
time=2025-05-09T13:44:22.361Z level=DEBUG source=memory.go:194 msg="gpu has too little memory to allocate any layers" id=GPU-264b5eed-c8bc-92b8-9c0b-dd2b56c4931f library=cuda variant=v12 compute=8.6 driver=12.2 name="NVIDIA GeForce RTX 3060" total="11.8 GiB" available="9.6 GiB" minimum_memory=479199232 layer_size="462.3 MiB" gpu_zer_overhead="0 B" partial_offload="8.3 GiB" full_offload="8.3 GiB"
#

yupp

#

seems like it just about works when I turn off faster-whisper lol

serene schooner
#
INFO:faster_whisper:Processing audio with duration 00:02.960
Unable to load any of {libcudnn_ops.so.9.1.0, libcudnn_ops.so.9.1, libcudnn_ops.so.9, libcudnn_ops.so}
Invalid handle. Cannot load symbol cudnnCreateTensorDescriptor
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<frozen posixpath>", line 181, in dirname
TypeError: expected str, bytes or os.PathLike object, not NoneType
INFO:__main__:Ready
Connection to localhost (127.0.0.1) 10300 port [tcp/*] succeeded!

time to run that on the cpu ig

#

maybe not the best idea! with the v3-large model

brave atlas
#

Load Average: 347.67
👀

brave atlas
#

Yeah going into ballooning and swap territory is not good.

serene schooner
#

luckily I don't have ballooning enabled since I do GPU passthrough

brave atlas
#

KSM starts at 80% by default too and tries its best to "free" some memory as well. I think it can use up to one core for that.

serene schooner
#

the whole VM is locked up

#

nice

craggy plank
#

coming back to this, so far the model that I like the most for home assistant use is still Gemma3-QAT

#

i find it very performant and quite good at understanding smart home commands

thorny bane
#

That's interesting. It isn't listed as having tool support. It works fine though?

outer crescent
#

There are variants of Gemma 3 with activated tools support

craggy plank
#

That is the non QAT one btw

#

There’s another one with qat and tools

#

To put it simply, it kind of makes the q4 version of a model roughly as good as the q6 version for free

#

Which in turn is just a tiny bit worse than the unquantized model

#

So it’s pretty good

jagged linden
#

Hey quick question.
I'm currently using GroqCloud Whisper and wondering if it would be a good upgrade if I go get a GPU and do it locally?

thorny bane
#

Depends. If it’s Whisper, it’s probably about the same tech. It all depends on the size of the model you’re using, which is going to depend on how powerful a GPU you get.