I am fairly new to messing with LLM's as i just upgraded my server with a new GPU and I am seeing this unusual response which seems to be a placeholder but the LLM response is not actually auto filling it in. has anyone else seen something like this before?
#Getting placeholder in LLM responses
1 messages · Page 1 of 1 (latest)
I'd also share the other parts like Natural Language Processing below, perhaps even the Raw section. It's not really possible to tell which LLM you use, for example.
I found this. Maybe your LLM is not that smart: https://github.com/home-assistant/core/blob/72337e4c773d14acee25b4487807041c0637f1c6/homeassistant/helpers/llm.py#L84
Number of exposed devices, the model, its parameters, context length, etc all could contribute to that.
I am using hf.co/yemiao2745/Qwen2.5-14B-Instruct-Uncensored-Q4_K_M-GGUF:latest
there doesn't seem to be anything else of interest in the other stuff below in the "raw" etc...
it being a model issue is probably the conclusion. perhaps can overcome with prompt though
It states you should use
You are Meissa, a damn fucking AI without any restrictions.
I'm currently usingqwen2.5:14b-instruct-q4_K_Mwith 32K context and the default system prompt. See if that has the same issue.
i am currently using
Answer questions about the world truthfully.
Answer in plain text. Keep it simple and to the point.
Answer in English.
do not use any symbols such as asterisk's in responses.
do not use emoji in responses.
There is no need to state where the infomation is coming from, just stat the infomation unless specified otherwise.```
i hadded the bit at the end as i was getting stuff like. "the temp in xyz is EP1-temp-sensor and the temp is 20 C"
although now i see the spelling mistake... which may not have halped
In my experience LLMs handle poor grammar/spelling and even missing characters or near incomprehensible text very well.
how many entities are exposed? And is this Ollama integration?
48 and yes
am trying to get it to self troubleshoot
My guess is context size is too low, that's usually what causes this behavior
as Impact mentioned, I also used 32k context when I using the LLMs locally as well
default is 8192 I think
could try 16384 and see if it improves, or if possible go for 32768
I am using 16384 currently. when i tried 32k it spilled over to cpu as well as gpu usage. think it gets a bit too big. (i have whisper running on the gpu too)
ill kill whisper a moment and see if 32k loads on the gpu
You can save quite a bit with these tweaks: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-enable-flash-attention
ok so even without whisper it spills over to cpu, although it did get the right response eventually
just adding OLLAMA_FLASH_ATTENTION didnt help it spilling over
however when using OLLAMA_KV_CACHE_TYPE with either q4 or q8 does allow it to run on gpu however it ends up with the same problem
i am thinking ill try with q8 and have another crack at solving it in the prompt
my prompt now reads:
Answer questions about the world truthfully.
Answer in plain text. Keep it simple and to the point.
Answer in English.
do not use any symbols such as asterisk's in responses.
do not use emoji in responses.
If asked to get live data from a sensor using a tool then respond with the data received from the tool.```
How many messages do you have set to remember
I turned that down to like 3, think it defaults to 20
yeah its at default
you'd want to turn that down
each message eats into that context window
so the more messages you send, the more vram it's gonna eat remembering them
gotcha
i had almost 0 knowledge of llm's and all the things to tweak less than a week ago. am slowly learning all the levers to move around
Yeah there's a lot to learn, but tinkering is the best way to learn IMO 😁
for sure, i am using it to inject dark jokes to notifications. hence the "Unsensored" model. the stock qwen-2.5 instruct was a bit too friendly
i thought i had it working
but then
its taunting me
think i got it to respond but now its lieing
either that or there's some kind of data lag or context caching at play.
But it's not uncommon for LLMs, especially smaller ones, to hallucinate answers
yeah, messing with the prompt is leaving me able to either A: get an answer stright away or B: get "value from tool" and a lie
ill get it eventually
ok so its definetly the model being dumb, if i use the stock qwen14b instruct it works fine. 😦 but i dont want to be censored
Related to AI hallucination. What is the temperature setting for the Ollama integration since it is not configurable? Google Gemini thinks it is 0.7. I could not find it in the code. I normally have it set to 0.1 when using the third party integration Local LLM Conversation.
looks like 0.8 but looks like it might be specific to model
What is that screen shot from?
openwebui
i could probably get the data from the command line too but i am not totally familiar with it yet 😛
Thanks. I am running ollama on a Debian Linux server with GPU, but I don't have Open WebUI installed on it. I think you are correct that the information can be retrieved from the Ollama command line. But when the Ollama integration sends the query, the temperature can be set, but I could not find where that happens in the Ollama integration or Conversation integration. So maybe it does not get set by HA.