If anyone is struggling with the context dump limitations with voice assists, I built myself a tool-calling alternative that instead performs a chain of MCP calls. I thought I'd share it with the community if is of help to anyone else. It's on HACS or Github - https://github.com/mike-nott/mcp-assist
#MCP Assist - Alternative conversation agent integration with own MCP server
1 messages · Page 1 of 1 (latest)
✌️
That looks great, I'll play with it in a bit, any chance you can get it to work with llama-server?
I have this running w/LM Studio. It is very nice to have smooth multiple tool calls and Web-search with an LLM.
This should already work using the LM Studio profile setup, but I'll add it as another option in the UI.
@snow vector - I've added llama-server support to latest release (v0.9.1) - I don't run it myself so cant do a full test, so would be great if you can check it all works ok?
@maku I responded in the FutureProof Homes thread also. With llama-server Brave search is quirky. Sometimes it works, other times it returns 'Im processing your request' and then the voice pipline closes.
I want to add the speed increase is substantial over LM Studio.
it works perfectly!
For the record here's how i'm starting llama-server: llama-server -hf unsloth/Qwen3-4B-Instruct-2507-GGUF:Q8_K_XL --host 0.0.0.0 --port 10600 -fa 1 -ngl 99 --ctx-size 16384 --jinja --cache-ram -1
Thank you for your work
@maku Updates are fast and furious 🙂 Update to 0.10 seems to be working well.
I switched to Qwen3-VL-8B Q6. Attached are some sample times: 1 is direct from the model, 2 is a request to turn off 2 lights. 3 is a web search for thhe weather in NYC this afternoon. (I am in Wisconsin). Performance and accuracy are very good.
What were you originally using?
I was using Qwen3-VL-30B. It was overkill; wasn't much slower, but was pointless, and used too much GPU that could be used for other stuff on my server.
@visual vortex - I've made changes to the whole follow-up and end conversation system that you raised the issue for. Hopefully it improves things, but do let me know once you get a chance to test.
Key thing is you must manually update the technical instructions prompt to replace the previous sections with the new variable.
I opted to delete my llama-server profile and re-add it. It really only took a minute or two. Everything seems to work correctly. Increasing the Max Tool Iteration value completely removed the 'processing your request' hang, and the Qwen models no longer ramble on. So far so good.
Using MCP Assist I was finally able to get responses with Ollama similsar to those I was getting with LM Studio. Makes me wonder what vLLM will look like🔥
Yeah, vLLM is my ultimate goal as I want to use full tensor parallel across 2 GPUs. 😎
Problem is I need to finish with all these bugfixes and ideas for the integration first 😆
You’re making great strides.
I tried it out and got nowhere with it past some basic things however I removed it to try another day. However I can not remove it from my system. I deleted the conversation agent but the system default is still there and I can’t remove it. I tried in HACS to remove and it said I couldn’t until I remove the configuration. There’s no delete here.
@balmy thunder I'm really sorry about that. I've released an update that fixes it - https://github.com/mike-nott/mcp-assist/releases/tag/v0.11.2
You'll need to update the integration, then recreate a dummy profile which you' then delete. The System Settings will then be auto-deleted along with it.
@balmy thunder That is so opposite my experience. I am wondering what hardware and LLM models you are trying it on. The default profile it creates is superflous; it doesn't seem to have any influence on creating other profiles in my system. I have a profile for Ollama, LM Studio, and llama.cpp. The llama.cpp works best and very much makes my system perform on almost the same level as an Amazon Alexa device. Ollama and LM Studio respond slower, but it isn't horrible, by any means.
It’s got nothing to do with my model or anything. The integration itself did not have a way to remove the integration. I had to restore to before installed it to get rid of it. I installed it, tried it, found it needed some work and tried to uninstall it. Could not do so. I deleted the folder and HA then complained it was missing config. HACS would not remove it because it had a config.
As far as the functionality, I run Ollama on a M4 Pro Mac Mini with 64 GB ram and used qwen3:4b-instruct-2507-q4_K_M model as well as gpt-oss:120b-cloud which both work well in other integrations, one local and the other cloud based. They could not do some basic things and gave incorrect data on asking to tell temperature in a room, just made up crap. I followed the setup guide for temperature and context. So I thought I’d give it some time for things to get better and could not remove the integration.
@balmy thunder Thats interesting because it points out how uneven performance is from system to system for AI can be. Ollama had been almost unusable on my Ubuntu server with a 3090 even with the Qwen3 4b instruct model, whereas LM Studio was dramtically better. Ollama got much better for me when I added the MCP piece.
llama-server is even faster than lm studio
llama-server --host 0.0.0.0 --port 10600 -fa 1 -ngl 99 --ctx-size 16384 --jinja --cache-ram -1 --threads -1 --temp 0.7 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0 -hf unsloth/Qwen3-4B-Instruct-2507-GGUF:Q8_K_X
llama-server is indeed faster than LM Studio. We are not Alexa speeds, but we are approaching it.
I shall give this a go on my Thor AGX running ollama. Running stock ollama integration with neoali/qwen3-4k works surprisingly well but I’m sure it can be better. (No idea why I’m using that particular model actually)
Not sure if this is a big in this or what, but I'm using a weather blueprint with custom triggers and it seems like the mcp doesn't pick it up at all
It worked with home-llm
https://github.com/TheFes/ha-blueprints/blob/main/documentation%2Fweather%2F1_voice_weather_forecast_local.md this blueprint to be specific
@sweet fern
Is the automation it creates exposed to voice assistants?
Yeah, it worked with home llm
Wait hold on
It is exposed
Try adding a line to the technical instructions, telling the LLM for any weather queries, use xxxxxxx
Sorry can you give me an example
I don't know the exact entity IDs from your system. But literally just try adding a "For any weather related queries, use the sensor.weather_blueprint entities" line to Technical Instructions.
it worked!, however it's returning the temperature in "C
also another bug, hitting enter in the box (to add a new line), closes the dialog
hmm no doesn't work for other days
set let me try home-llm against to make sure it works
You can tell it to use fahrenheit in the prompt etc. With all this, it's all about exposing the right entities, then finding the best instruction for each different LLM to know how to find it and what to do with it.
I have been exxperimenting with some different models with MCP-Assist using llama.cpp server. So far the most reliable with out extra chit chat and the occasional hallucination has been gpt-oss-20B. I have settled on that. I am using the lmstudio-community model gpt-oss-20b-MXFP4.gguf.
I don't see a reason to try anything else. I will just wait for vLLM. Parallel tensor cores should rock.
vLLM now works with MCP-Assist. Works pretty well with a 3090. Intructions for gpt-oss-20b vLLM install can be found on Hugging Face.
@visual vortex any chance you could have it support AI tasks like the local llm integration
@OneOfOne
Think this was meant for @maku. He is the developer. However, it is unclear to me what MCP-Assist doesn't do that local llm integration does. On my system this gives me everything that local llm integration was giving me, and responds faster as well. FWIW I am back to using Qwen3-VL-8B-Instruct because I had to modify the code to get gpt-oss-20b to play nice with a 3090 with this:
--max-model-len 32768
--gpu-memory-utilization 0.75 \
Oops, it doesn't support AI tasks
https://www.home-assistant.io/integrations/ai_task/
Thanks. This stuff is such a deep rabbit hole . . . Without actually having had read this, AI Task provides a way to take an AI carried out instruction and turn it into an entity, correct?
Yep, for now using home-llm just for that and this for the Assistant
I use an AI task for the doorbell camera. Mostly out of interest, I wouldn’t call it useful. But anyway I imagine for tasks latency is a lot less important than with a conventional agent.