#Wyoming Whisper on NVidia DGX Spark

1 messages ยท Page 1 of 1 (latest)

real ginkgo
#

I've run several models of Wyoming whisper successfully on a dgx spark computer using the rhasspy/wyoming-whisper image.
However this image does not use the GPU. To run with the GPU on this ARM computer, I tried the abtools/wyoming-whisper-cuda image using the docker compose file (I adapted the volumes entry for my local setup) given in https://hub.docker.com/r/abtools/wyoming-whisper-cuda
The image seems to run, but the voice system from home assistant does not seem to get a reply. Instead it seems to time out without any error feedback.

Has anyone run this image or have an image that works on a dgx spark?
Any suggestion on how to investigate this further is also welcome.

sterile estuary
real ginkgo
real ginkgo
#

Should I try to run it on the command line maybe to see any error message, before making the docker compose? The image I mentioned in my OP was running, but I don't know how to see any log from it. Would running as mentioned in the docker-cli show some logs?

sterile estuary
real ginkgo
#

Trying the docker compose gave me an error.

#
 โ ธ Container faster-whisper-linux  Starting                                                                                                                                                                                      0.3s 
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
sterile estuary
real ginkgo
#
michel@Super-DGX:~/Docker/HomeAssistant/whisper$ sudo apt search nvidia-container-toolkit
Sorting... Done
Full Text Search... Done
nvidia-container-toolkit/unknown,now 1.18.1-1 arm64 [installed,automatic]
  NVIDIA Container toolkit

nvidia-container-toolkit-base/unknown,now 1.18.1-1 arm64 [installed,automatic]
  NVIDIA Container Toolkit Base
#

Looks like I do.

sterile estuary
real ginkgo
#
services:
  faster-whisper:
    image: lscr.io/linuxserver/faster-whisper:latest
    container_name: faster-whisper-linux
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Etc/UTC
      - WHISPER_MODEL=tiny-int8
      - LOCAL_ONLY= #optional
      - WHISPER_BEAM=1 #optional
      - WHISPER_LANG=en #optional
    volumes:
      - ./whisper_data:/config
    ports:
      - 10300:10300
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
real ginkgo
#

I am not sure if it is relevant or not, but the dgx-spark uses CUDA 13 by default.

sterile estuary
#

the linuxserver image wont work by the looks of it

#

its gpu support is amd64 version only not arm64

#

just noticed in the docs

real ginkgo
#

Ah ๐Ÿ™

sterile estuary
#

although its docker image looks to be based on an older cuda image though. not sure if the dgx spark will be compatible or not.

real ginkgo
#

Let me try.

#

If the issue is a cuda version, I could create a derived image in which the necessary CUDA is installed?

#

It's amd64

#

But Maybe I can use the github repo and modify it.

sterile estuary
#

maybe there is a arm version of that image

#

if it would build is anyone's guess though.

real ginkgo
#

Yeah I see the cuda 12.3.2. I've used the 12.9..1 successfully for other work, it could also get that version without trouble.
Changing to 12.9.1 should likely work. Switching to 13 is very unlikely to. There are breaking API changes from CUDA 12 to CUDA 13.

#

I'm building the image.
Spending a lot of time in apt-get update and followup install

sterile estuary
#

I use the addon version of that on my setup but I set it up a long time ago when was first in dev and I havent really touched it since but it works. i did run the stand alone before the addon version exisited though.

real ginkgo
#

The build failed....

 cannot stat '/app/lib/python3.10/site-packages/nvidia/cudnn/lib/lib*.so.*': No such file or directory

I think I am going to go to the DGX Spark help and see how to get that to work....

sterile estuary
real ginkgo
rugged hollow
#

@real ginkgo I might be wrong, but I was under the impression that switching to onnx-asr (within rhasspy/wyoming-whisper) would already provide CUDA support. Isn't that the case?
However, using "--stt-library","onnx-asr","--model","istupakov/parakeet-tdt-0.6b-v3-onnx" is just so fast that I am not sure it's worth looking for extra optimizations ๐Ÿ˜‰

I have edited rhasspy/wyoming-whisper to launch faster-whisper with those params on my DGX Spark

real ginkgo
#

@rugged hollow, thanks. Glad to hear I am not alone using a DGX Spark. I got an error message when I tried to add CUDA to rhasspy/wyoming-whisper. Do you have a docker compose file that attaches the GPU to the container?
Running tiny-8 works fast, but it makes too many mistakes. It often transcribes off as up and other common and annoying errors. I tried to use other models, going all the way to turbo (alias for large-v3-turbo???). But with that model, the voice to text takes about 8s, clearly not using the GPU. Could you give that model a try and let us know if you get a quick answer?

#

PS: I am trying the arguments you suggested and so far it seems good. If it makes much fewer mistakes than tiny-8, I might keep that setting. Nevertheless, from a theoretical point of view, I'd still like to be able to run big models with CUDA on the DGX Spark.

real ginkgo
#

@rugged hollow That parakeet model is great. So far it hasn't made a single mistake in English. I also tried French and it made no mistake. One mistake in German only ๐Ÿ™‚

#

And it is fast.

rugged hollow
rugged hollow
#

@real ginkgo what model do you use for processing afterwards?

real ginkgo
rugged hollow
real ginkgo
real ginkgo
#

@rugged hollow I installed unsloth/Nemotron-3-Nano-30B-A3B-GGUF:latest but even though "think before responding" is off and even if I add a prompt "detailed thinking off" I get long thinking responses... What am I missing?

rugged hollow
# real ginkgo <@1262353931464605736> I installed unsloth/Nemotron-3-Nano-30B-A3B-GGUF:latest b...

it depends on what engine you are using. with llama.cpp (the one I am using) you need to append an extra JSON field to the API request and that will tell the model to not do any thinking at all. To talk to llama.cpp I use this add-on https://github.com/skye-harris/hass_local_openai_llm and I just posted a PR to add a config knob to enable/disable thinking via that JSON field

GitHub

Home Assistant LLM integration for local OpenAI-compatible services (llamacpp, vllm, etc) - skye-harris/hass_local_openai_llm

rugged hollow
real ginkgo
#

I have started with the ollama integration. You had mentioned Llama.cpp but I didn't know how to integrate it. I won't have much time in the coming day, but I hope I can get it working before the end of next weekend.

rugged hollow
# real ginkgo I have started with the ollama integration. You had mentioned Llama.cpp but I di...

I suggest you read the feature request linked to my PR. The maintainer has evolved that code into something slightly more generic and there are some bits regarding ollama too....
To integrate llama.cpp you can just use the add-on I linked. It's basically a simple OpenAI API integrator.

For getting started with llama.cpp I suggest reading the playbook on https://build.nvidia.com/spark
It's reasonably easy and for now I just keep it running in a tmux window

Choose how you would like to connect to your DGX Spark.

real ginkgo
rugged hollow
#

or at least in a systemd unit, for the same reason

rugged hollow
rugged hollow
rugged hollow
rugged hollow
# rugged hollow darn the server-cuda image is for amd64 only - so no DGX Spark support
real ginkgo
#

@rugged hollow I got the Llama cpp running and I can interrogate it via curl on port 30000, following the NVidia Nemotron playbook. But when I try to give the URL to the Local OpenAI Server, I get an error Could not communicate with the LLM server.
The URL I put is http:<dgx-ip-address>:30000; I have also tried adding v1/chat and v1/chat/completion

rugged hollow
real ginkgo
#

Oh ends in v1 only

#

That works.

rugged hollow
real ginkgo
#

You mentioned adding a json to disable thinking. How do you do that? Your PR was merged if I understand correctl.

rugged hollow
#

alternatively, you can overwrite the addon code with that you find in the master branch on github

real ginkgo
#

Ooops. Trying to add a conversation agent gave me config flow could not be loaded: 500 Internal Server Error Server got itself in trouble

#

I am using the code in the master git branch.

rugged hollow
#

ah ok

#

the conversation agent config panel should have some extra knob then

real ginkgo
rugged hollow
#

ouch

#

@real ginkgo doesn't it open the panel when you click "+ conversation agent" (or whatever the text in English is - I have italian UI here)

real ginkgo
#

I'm looking at the log.

#

It opens an error dialog.

rugged hollow
#

ok weird - maybe something is wrong in the latest code then

#

although the guy has been testing this as far as I know

#

if you find any error, you could report it to the GH ticket

real ginkgo
#

Yeah. I'll try to look into it.

#

Maybe I'll try the release version. The log gives an error:
File "/config/custom_components/local_openai/config_flow.py", line 402, in get_schema
return vol.Schema(schema)
^^^^^^
UnboundLocalError: cannot access local variable 'schema' where it is not associated with a value

rugged hollow
#

yap

real ginkgo
#

Stable version works. The problem seems to be with the github main branch.

#

It's working but slow. I presume the thinking is what takes so long. How do you disable it with the stable version?

rugged hollow
#

I edited the python code and added the chat_template_kw_args thing

#

I think my feat request and PR should allow you to see that

#

you can also just copy/paste my PR into your running python code

#

that works as I am still using it

#

my PR will add a checkbox to the conversation agent panel

real ginkgo
#

Commit 1bd1bca ? Can I grab it from your fork?

#

I checked out that commit.

rugged hollow
#

yup, try that

real ginkgo
#

Let's hope it won't give me the same error as the latest commit.

#

That gave me the option to disable thinking.

#

How long does it take to get an answer for you?

#

A simple request "Turn off the light in the office" took 5 seconds for the action to happen and 8 to get a text answer.

rugged hollow
#

this way using HA intents/sentences it is still super fast

#

there is a knob for that

#

when you configure the voice assistant, not in the agent

#

but if you use your own sentences it takes some time, yes. because the HA context is not small and the LLM may need to do an extra call to get context

#

ok, it seems faster here

#

1.8s total - this is pretty good

#

not using stt or tts here. just sending text and checking response time. tts may have its own delay, but I'd tackle one problem at a time

real ginkgo
#

I read a post somewhere that llama cpp caches prompts in a greedy way. And the poster said that the made the response time twice as fast by reordering the prompts. I.e. before his change, there was a prompt giving the current time early in the flow. Let's say it's prompt A. The series of system prompts were A B C D and then the first user prompt was U and gave an answer R. The problem is that at the next iteration, the prompts are A' B C D U R and a new user prompt V. But since A != A', llama cpp was not caching anything. By putting the time prompt at the end of the sequence, then llama cpp can cache a lot:
First sequence: B C D A U -> R
Second sequence: B C D U R A' V -> S
llama cpp can cache B C D. For the next sequence, it will have cached B C D U R.

#
real ginkgo
#

Actually, forget about that other integration. I cannot get it working. But I am doing a new conversation with "Local OpenAI LLM" with thinking turned off and I am getting replies very quickly after the first one.

#

Problem: I asked it: "Can you turn on the office light?" and it turned on ALL the lights in the house.

real ginkgo
rugged hollow
#

yeah, sometimes I have these issues as well

#

to be frank, I think using HA intents is much more reliable and faster

#

using these pseudo commands just for the fun of it can be quite unpredictable

#

enjoying the agent "personality" is a bit tedious if you ask me. but let me know if you succeed ๐Ÿ™‚

#

regarding the time prompt, I don't think it is being sent at the beginning, thus we should not have that problem. the llama.cpp output seems to be saying that caching is working

rugged hollow
#

if you just disable the "Assist" functionality, it'll become muuuuch faster. hence my conclusion that the context is what is making the agent slower.
For this reason I have created a second agent (to be activated with a second wakeword) which has thinking enabled, but assist disabled. this way I can use it to talk things

#

bu not being super useful so far

rugged hollow
#

I got the same error as you, so it's confirmed it's a problem in the new version

rugged hollow
rugged hollow
#

ok, the guy pulled it in 1.3.1

real ginkgo
#

Hello. Long time no see. I'm now using unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_XL with Llama.cpp and Local OpenAI LLM with standard configuration. It's working pretty well and I get answers fairly quickly. One thing I still need to fix is that I don't get weather information.

sterile estuary
real ginkgo
#

@sterile estuary Thanks. I finally got it to work. What threw me off was the fact that the weather was exposed to the OpenAI LLM agent but the home assistant agent did not see it. So more complex questions such as "What is the weather forcast for tomorrow?" would work, while "What's the weather like?" would give me "none none" as an answer. The former question was sen to OpenAI LLM while the latter stayed with the local HA agent.

#

It's a bit strange that there is such a distinction.

sterile estuary
real ginkgo
#

I got it to work. I had to separately expose the weather provider (Pirate Weather) to the HA assistant.

#

I had it exposed to the OpenAI LLM agent only.

#

@sterile estuary How do I view the assist pipeline trace?

sterile estuary
real ginkgo
#

I think I see now how OpenAI LLM did get the weather exposed while HA Assistant did not: OpenAILLM uses the Tools for Assist integration, where the weather is setup.

#

You have a DGX Spark?

sterile estuary
real ginkgo
#

OK ๐Ÿ™‚ Just curious. Looking to see what setup people use.