#MCP Assist - Alternative conversation agent integration with own MCP server

1 messages · Page 1 of 1 (latest)

sweet fern
#

If anyone is struggling with the context dump limitations with voice assists, I built myself a tool-calling alternative that instead performs a chain of MCP calls. I thought I'd share it with the community if is of help to anyone else. It's on HACS or Github - https://github.com/mike-nott/mcp-assist

GitHub

MCP-powered Home Assistant conversation agent that solves entity context limitations through dynamic discovery instead of full entity dumps - mike-nott/mcp-assist

digital kiln
#

✌️

snow vector
#

That looks great, I'll play with it in a bit, any chance you can get it to work with llama-server?

visual vortex
#

I have this running w/LM Studio. It is very nice to have smooth multiple tool calls and Web-search with an LLM.

sweet fern
sweet fern
#

@snow vector - I've added llama-server support to latest release (v0.9.1) - I don't run it myself so cant do a full test, so would be great if you can check it all works ok?

visual vortex
#

@maku I responded in the FutureProof Homes thread also. With llama-server Brave search is quirky. Sometimes it works, other times it returns 'Im processing your request' and then the voice pipline closes.

#

I want to add the speed increase is substantial over LM Studio.

snow vector
visual vortex
#

@maku Updates are fast and furious 🙂 Update to 0.10 seems to be working well.

visual vortex
#

I switched to Qwen3-VL-8B Q6. Attached are some sample times: 1 is direct from the model, 2 is a request to turn off 2 lights. 3 is a web search for thhe weather in NYC this afternoon. (I am in Wisconsin). Performance and accuracy are very good.

snow vector
visual vortex
sweet fern
#

@visual vortex - I've made changes to the whole follow-up and end conversation system that you raised the issue for. Hopefully it improves things, but do let me know once you get a chance to test.

Key thing is you must manually update the technical instructions prompt to replace the previous sections with the new variable.

visual vortex
#

I opted to delete my llama-server profile and re-add it. It really only took a minute or two. Everything seems to work correctly. Increasing the Max Tool Iteration value completely removed the 'processing your request' hang, and the Qwen models no longer ramble on. So far so good.

visual vortex
#

Using MCP Assist I was finally able to get responses with Ollama similsar to those I was getting with LM Studio. Makes me wonder what vLLM will look like🔥

sweet fern
#

Yeah, vLLM is my ultimate goal as I want to use full tensor parallel across 2 GPUs. 😎

Problem is I need to finish with all these bugfixes and ideas for the integration first 😆

dark burrow
#

You’re making great strides.

balmy thunder
#

I tried it out and got nowhere with it past some basic things however I removed it to try another day. However I can not remove it from my system. I deleted the conversation agent but the system default is still there and I can’t remove it. I tried in HACS to remove and it said I couldn’t until I remove the configuration. There’s no delete here.

sweet fern
visual vortex
#

@balmy thunder That is so opposite my experience. I am wondering what hardware and LLM models you are trying it on. The default profile it creates is superflous; it doesn't seem to have any influence on creating other profiles in my system. I have a profile for Ollama, LM Studio, and llama.cpp. The llama.cpp works best and very much makes my system perform on almost the same level as an Amazon Alexa device. Ollama and LM Studio respond slower, but it isn't horrible, by any means.

balmy thunder
balmy thunder
# visual vortex <@1288657281633882162> That is so opposite my experience. I am wondering what h...

As far as the functionality, I run Ollama on a M4 Pro Mac Mini with 64 GB ram and used qwen3:4b-instruct-2507-q4_K_M model as well as gpt-oss:120b-cloud which both work well in other integrations, one local and the other cloud based. They could not do some basic things and gave incorrect data on asking to tell temperature in a room, just made up crap. I followed the setup guide for temperature and context. So I thought I’d give it some time for things to get better and could not remove the integration.

visual vortex
#

@balmy thunder Thats interesting because it points out how uneven performance is from system to system for AI can be. Ollama had been almost unusable on my Ubuntu server with a 3090 even with the Qwen3 4b instruct model, whereas LM Studio was dramtically better. Ollama got much better for me when I added the MCP piece.

snow vector
snow vector
#

llama-server --host 0.0.0.0 --port 10600 -fa 1 -ngl 99 --ctx-size 16384 --jinja --cache-ram -1 --threads -1 --temp 0.7 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0 -hf unsloth/Qwen3-4B-Instruct-2507-GGUF:Q8_K_X

visual vortex
#

llama-server is indeed faster than LM Studio. We are not Alexa speeds, but we are approaching it.

mint escarp
#

I shall give this a go on my Thor AGX running ollama. Running stock ollama integration with neoali/qwen3-4k works surprisingly well but I’m sure it can be better. (No idea why I’m using that particular model actually)

snow vector
#

Not sure if this is a big in this or what, but I'm using a weather blueprint with custom triggers and it seems like the mcp doesn't pick it up at all

#

It worked with home-llm

sweet fern
snow vector
#

Wait hold on

#

It is exposed

sweet fern
#

Try adding a line to the technical instructions, telling the LLM for any weather queries, use xxxxxxx

snow vector
sweet fern
# snow vector Sorry can you give me an example

I don't know the exact entity IDs from your system. But literally just try adding a "For any weather related queries, use the sensor.weather_blueprint entities" line to Technical Instructions.

snow vector
#

hmm no doesn't work for other days

#

set let me try home-llm against to make sure it works

sweet fern
#

You can tell it to use fahrenheit in the prompt etc. With all this, it's all about exposing the right entities, then finding the best instruction for each different LLM to know how to find it and what to do with it.

snow vector
#

hmm it's working

#

on home-llm either

#

i'll dig more into it, thank you

visual vortex
#

I have been exxperimenting with some different models with MCP-Assist using llama.cpp server. So far the most reliable with out extra chit chat and the occasional hallucination has been gpt-oss-20B. I have settled on that. I am using the lmstudio-community model gpt-oss-20b-MXFP4.gguf.

#

I don't see a reason to try anything else. I will just wait for vLLM. Parallel tensor cores should rock.

visual vortex
#

vLLM now works with MCP-Assist. Works pretty well with a 3090. Intructions for gpt-oss-20b vLLM install can be found on Hugging Face.

snow vector
#

@visual vortex any chance you could have it support AI tasks like the local llm integration

visual vortex
# snow vector <@1223393161490075819> any chance you could have it support AI tasks like the lo...

@OneOfOne

Think this was meant for @maku. He is the developer. However, it is unclear to me what MCP-Assist doesn't do that local llm integration does. On my system this gives me everything that local llm integration was giving me, and responds faster as well. FWIW I am back to using Qwen3-VL-8B-Instruct because I had to modify the code to get gpt-oss-20b to play nice with a 3090 with this:
--max-model-len 32768
--gpu-memory-utilization 0.75 \

snow vector
visual vortex
snow vector
mint escarp
#

I use an AI task for the doorbell camera. Mostly out of interest, I wouldn’t call it useful. But anyway I imagine for tasks latency is a lot less important than with a conventional agent.