#Bug with gemma model

supple token
#

I have an issue using gemma3:270m. I'm using Ollama to connect to it.

I began with the llama 3.2 model, but it is way too slow to respond. I have to wait a few minutes. But it works.

My server just isn't powerful enough for AI, so I decided to use a very light model.

When I run gemma directly on Ollama's shell, it works

supple token
#

I don't see anything wrong in Ollama's 🪵🪵🪵

supple token
#

qwen3:0.6b works, but I have to wait a long time

#

A simple question takes 104.34s to process.

qwen3:0.6b is still too heavy for my server

hearty cave
#

You have a huge context window, for starters. Reduce it to 2048, set keep alive to -1 to hold the model in memory, and you will see an improvement.
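For reference, when talking to Ollama's HTTP API directly instead of through the Home Assistant integration, those two settings correspond to `options.num_ctx` and `keep_alive` in the request body. A minimal Python sketch, assuming the default address `localhost:11434`; `build_request` and `ask` are made-up helper names, and the model tag is simply one mentioned in this thread:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default address


def build_request(model, prompt, num_ctx=2048, keep_alive=-1):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # one JSON reply instead of a stream
        "options": {"num_ctx": num_ctx},  # context window in tokens
        "keep_alive": keep_alive,         # -1 keeps the model loaded
    }


def ask(model, prompt, timeout=300):
    """Send the request; this part needs a running Ollama server."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Printing the payload needs no server and shows what would be sent.
print(json.dumps(build_request("qwen3:0.6b", "Say hi"), indent=2))
```

Calling `ask("qwen3:0.6b", "Say hi")` on the machine that runs Ollama would send it. In Home Assistant itself the same two values are set in the integration's options rather than in code.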

#

And since you're using control, all the entities and tools you have exposed to Assist are packed into each message. So reduce the count. Or untick that "Assist" box and leave the LLM for general questions.

supple token
#

Same thing after reducing the context to 2048 and setting keep alive to -1.

No improvements for gemma.

#

I thought it was set to a few seconds to release resources as soon as it's done

#

That means I should leave "assist" checked for AI to control devices?

#

With Ollama it doesn't work, but directly without AI it does work

#

Yup, after unchecking Assist for a test, it does work much faster.

But I still want AI to control my IoTs

dim bear
#

The normal gemma doesn't support tools, does it?

supple token
#

No idea

#

What other very light models can I use?

dim bear
supple token
#

Aaaah Okayyy. Eye sea!

supple token
#

I need a very light model. Only CPU. No GPU.

It doesn't support legacy GPUs such as the Quadro K600 that I have.

I spent days trying to hack TrueNAS. No luck

#

Always goes down

supple token
#

Who can beat me?

supple token
#

I got timeout even with just 7 exposed entities

dim bear
#

Check the debug for it. Without a GPU you will not have a great time.

supple token
#

I nearly wanted to ping someone. I held back because I feel that's inappropriate.

This is what I always do.

Yeah I know. I only have one machine that does it all. It's 10+ years old, with PCIe 3 and DDR3 support.

I have a Quadro K600 but TrueNAS's NVIDIA drivers do not support it. I tried for days to hack it into being supported. Impossible.

What I'm longing to know is whether there is a very light model that is usable.

#

This one only answers, but nothing is done

dim bear
#

Can you create a VM and pass the GPU to it or something like that so you have control over the driver?

supple token
#

Oh shoot. Assist is unchecked

supple token
dim bear
#

Limited performance
In what way?

#

I don't really see another good option. Personally I'd have virtualized TrueNAS inside of PVE if you absolutely need it.
TrueNAS can run LXC too nowadays.

supple token
#

I just have one 4C/4T CPU and 32 GB of RAM

dim bear
#

All you lose is some memory.

supple token
#

I did the opposite previously. TN under PVE.

#

Changed to bare metal

dim bear
#

Why run wordpress in PVE rather than on TrueNAS?

#

How about this. Rather than using PVE just for one CT you could run a debian VM which you can give your GPU and run ollama on it and wordpress as a docker container in it.

supple token
#

I also had an issue with my 3 TB pool getting corrupted. That was before I had an HBA.

Forgot to power down a VM before booting up TrueNAS.

#

I used the qm set VMID command with the disks by ID, something like that, to pass the disks through

supple token
#

Thanks!

Now I prefer to use TrueNAS as bare metal and use their virtualisation instead.

I also got the hang of Docker. That's a good thing. I run apps out of Docker. That replaces some LXCs

supple token
dim bear
#

Look, I absolutely prefer PVE, but in your case I don't see the point of using it.

supple token
#

Yeah, everyone has their own opinions.

I also experimented by making the move to bare-metal TrueNAS, and I gained a lot in performance.

#

For some apps, I include a custom, tailor-made compose.yml to suit my needs
It's a mix of official and custom apps

supple token
# dim bear Check the debug for it. Without a GPU you will not have a great time.

Here are the details with exposed entities for assist

stage: error
run:
  pipeline: 01j65rs63npecxs8gymqv4pmqb
  language: en
  conversation_id: 01K5K7F23FKB4QVDRJA7R7QM8S
  runner_data:
    stt_binary_handler_id: null
    timeout: 300
events:
  - type: run-start
    data:
      pipeline: 01j65rs63npecxs8gymqv4pmqb
      language: en
      conversation_id: 01K5K7F23FKB4QVDRJA7R7QM8S
      runner_data:
        stt_binary_handler_id: null
        timeout: 300
    timestamp: "2025-09-20T09:32:13.807534+00:00"
  - type: intent-start
    data:
      engine: conversation.ollama_conversation
      language: en
      intent_input: set kitchen lights to 3200k
      conversation_id: 01K5K7F23FKB4QVDRJA7R7QM8S
      device_id: null
      prefer_local_intents: true
    timestamp: "2025-09-20T09:32:13.807571+00:00"
  - type: run-end
    data: null
    timestamp: "2025-09-20T09:37:13.812968+00:00"
  - type: error
    data:
      code: timeout
      message: Timeout running pipeline
    timestamp: "2025-09-20T09:37:13.817209+00:00"
intent:
  engine: conversation.ollama_conversation
  language: en
  intent_input: set kitchen lights to 3200k
  conversation_id: 01K5K7F23FKB4QVDRJA7R7QM8S
  device_id: null
  prefer_local_intents: true
  done: false
error:
  code: timeout
  message: Timeout running pipeline
#

Now I got a successful response after I don't know how many attempts.

But still no physical change

dim bear
#

Check with the Ollama CLI how long a normal query takes first. Take care of HA later. As long as it's over 7s there's no need to continue. Get your GPU to work.

dim bear
#

For example

# ollama run llama3.1:8b "Explain what can you do in 20 words or less" --verbose
I can provide information, answer questions, and assist with tasks through text-based conversations.

total duration:       251.344113ms
load duration:        86.836349ms
prompt eval count:    22 token(s)
prompt eval duration: 520.088µs
prompt eval rate:     42300.53 tokens/s
eval count:           18 token(s)
eval duration:        163.411121ms
eval rate:            110.15 tokens/s
supple token
#

on my end

# ollama run llama3.1:8b "Explain what can you do in 20 words or less" --verbose
I can answer questions, provide information, and engage in conversations on a wide range of topics.

total duration:       18.881727143s
load duration:        8.613023707s
prompt eval count:    22 token(s)
prompt eval duration: 3.046990865s
prompt eval rate:     7.22 tokens/s
eval count:           20 token(s)
eval duration:        7.219001573s
eval rate:            2.77 tokens/s
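Those rates go a long way towards explaining the 300 s pipeline timeout in the Assist debug output. A rough back-of-the-envelope estimate in Python: the two rates come from the run above, while the 2000-token prompt is only a guess at what a prompt with exposed entities and tool definitions might add up to, not a measured figure.

```python
def estimate_seconds(prompt_tokens, reply_tokens, prompt_rate, eval_rate):
    """Rough response time: prompt processing plus reply generation."""
    return prompt_tokens / prompt_rate + reply_tokens / eval_rate


PROMPT_RATE = 7.22       # tokens/s, "prompt eval rate" from the run above
EVAL_RATE = 2.77         # tokens/s, "eval rate" from the run above
PIPELINE_TIMEOUT = 300   # seconds, from the Assist debug output

# 2000 prompt tokens and a 20-token reply are assumed values.
seconds = estimate_seconds(2000, 20, PROMPT_RATE, EVAL_RATE)
print(f"estimated {seconds:.0f}s of a {PIPELINE_TIMEOUT}s budget")
```

With those assumed sizes the estimate lands at about 284 s, so nearly the whole budget goes on reading the prompt before the first word of the reply is generated.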
#

Well, I can take another server to experiment with.

That will add to the juice cost as well
Cha ching🤑

#

The thing is that the K600 does not have tensor cores, which are crucial for AI

#

I should change server anyway.

I tried to boot TrueNAS with the GPU in the actual machine, and it doesn't boot most of the time.

Meaning that using LXC under TrueNAS is complicated

supple token
#
# ollama run llama3.1:8b "Explain what can you do in 20 words or less" --verbose
I can provide information, answer questions, offer suggestions, and engage in conversation on a wide range of topics.

total duration:       55.90515925s
load duration:        41.99840976s
prompt eval count:    22 token(s)
prompt eval duration: 3.601500155s
prompt eval rate:     6.11 tokens/s
eval count:           23 token(s)
eval duration:        10.303274367s
eval rate:            2.23 tokens/s
# ollama run llama3.1:8b "Explain what can you do in 20 words or less" --verbose
I can provide information, answer questions, and engage in conversations on a wide range of topics.

total duration:       7.569800943s
load duration:        103.960109ms
prompt eval count:    22 token(s)
prompt eval duration: 411.139995ms
prompt eval rate:     53.51 tokens/s
eval count:           20 token(s)
eval duration:        7.054140214s
eval rate:            2.84 tokens/s
# ollama run llama3.1:8b "Explain what can you do in 20 words or less" --verbose
I provide general information, answer questions, and assist with tasks such as text summarization and language translation.

total duration:       8.345894523s
load duration:        102.151913ms
prompt eval count:    22 token(s)
prompt eval duration: 310.380232ms
prompt eval rate:     70.88 tokens/s
eval count:           22 token(s)
eval duration:        7.932756997s
eval rate:            2.77 tokens/s

The first run is usually slower. I think it's because it needs to gather resources first

#

The initialisation takes time
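To put numbers on that, the `--verbose` output can be split into load time and generation time. A minimal Python sketch, assuming the field names stay as they appear in the runs above; `parse_duration` and `parse_verbose` are made-up helpers, and the sample text is the first of the three runs:

```python
import re

# Go-style duration units as printed by Ollama, converted to seconds.
UNIT_SECONDS = {"ns": 1e-9, "µs": 1e-6, "us": 1e-6, "ms": 1e-3,
                "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text):
    """Convert a string such as '103.960109ms' or '1m2.5s' to seconds."""
    parts = re.findall(r"([\d.]+)(ns|µs|us|ms|s|m|h)", text)
    return sum(float(value) * UNIT_SECONDS[unit] for value, unit in parts)


def parse_verbose(output):
    """Pick the duration and count fields out of `ollama run --verbose` output."""
    stats = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # the model's answer and blank lines have no "key: value"
        key, value = key.strip(), value.strip()
        if key.endswith("duration"):
            stats[key] = parse_duration(value)
        elif key.endswith("count"):
            stats[key] = int(value.split()[0])
    return stats


SAMPLE = """\
total duration:       55.90515925s
load duration:        41.99840976s
prompt eval count:    22 token(s)
prompt eval duration: 3.601500155s
eval count:           23 token(s)
eval duration:        10.303274367s
"""

stats = parse_verbose(SAMPLE)
load_share = stats["load duration"] / stats["total duration"]
eval_rate = stats["eval count"] / stats["eval duration"]
print(f"load share: {load_share:.0%}, eval rate: {eval_rate:.2f} tokens/s")
```

For that first run roughly three quarters of the total is load time, which is what keeping the model in memory avoids. The eval rate itself barely changes between the cold and warm runs.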

supple token
#

@dim bear

Let's Gooooo!

#

Passing the GPU completely through for the VM

#

Oh! It detects the GPU and is installing drivers

dim bear
supple token
dim bear
#

That looks a lot better.

supple token
#

Indeed! Cause it's dedicated for that

dim bear
supple token
#

I can even pump the RAM up to >25 GB

dim bear
#

Does the whole model fit in VRAM? Check ollama ps.

#

In my experience performance will be pretty bad in relation to pure GPU if even 1% of it is on the CPU.
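The PROCESSOR column of `ollama ps` is the thing to read here. A small Python sketch that parses the table, assuming columns are separated by two or more spaces; `parse_ollama_ps` and `fully_on_gpu` are made-up helper names, and the sample row is only an illustration of the layout:

```python
import re


def parse_ollama_ps(output):
    """Turn the table printed by `ollama ps` into a list of dicts."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    # Columns are separated by runs of two or more spaces.
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = []
    for line in lines[1:]:
        cells = re.split(r"\s{2,}", line.strip())
        rows.append(dict(zip(header, cells)))
    return rows


def fully_on_gpu(row):
    """True only when the PROCESSOR column reads exactly '100% GPU'."""
    return row.get("PROCESSOR", "").strip() == "100% GPU"


SAMPLE = """\
NAME           ID              SIZE      PROCESSOR    CONTEXT    UNTIL
llama3.1:8b    46e0c10c039e    5.6 GB    100% CPU     4096       4 minutes from now
"""

for row in parse_ollama_ps(SAMPLE):
    print(row["NAME"], row["SIZE"], row["PROCESSOR"], fully_on_gpu(row))
```

Anything other than `100% GPU` in that column means at least part of the model is being run by the CPU.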

supple token
#

Enable discard
Which I did. It's to prolong the SSD's lifespan

dim bear
#

No. Nothing to do with your SSD in this case as we work with a virtual disk. See link.

supple token
#

I always thought so

dim bear
#

It is if you do fstrim on bare-metal, for example on your node, but not in a VM. Here the fstrim has to do with the LVM-Thin provisioning.

supple token
#

ollama ps

dim bear
#

Hmm. I expected this to run on the GPU since the performance is much better.

#

What does nvidia-smi say?

supple token
#

Uh oh!

dim bear
#

Does the GPU show up in lspci -nnk?

#

I see you enabled secure boot for the VM. Maybe you didn't add the MOK keys for the driver.

supple token
#

Yup! It does show up

dim bear
#

Will be kinda hard to do with Primary GPU.

#

I recommend you disable secure boot unless you need it and re-install the driver.

supple token
#

Yeah, I don't need secure boot

#

I'll just reinstall the VM. Can do

dim bear
#

Right now it uses nouveau. It should be nvidia.

supple token
#

I feel comfortable with a VM with PCIe passthrough. That's why I went this route.

Thanks!

#

I could just use BIOS. Right?

dim bear
#

UEFI is generally recommended if you do passthrough. It also allows you to use GPT partitioning which is very much recommended.

#

Are you starting completely fresh?

supple token
#

yeah, redoing the VM

#

Disabled in the BIOS

#

I haven't thought to disable it

#

I'll make a backup first

dim bear
#

A snapshot should be good enough here too.

supple token
#

So that I don't need to go through some process

dim bear
supple token
#

Nevermind...

#

I did both

#

@dim bear
It still uses nouveau even though I disabled secure boot and pre-enroll keys

dim bear
#

Did you re-install the driver?

supple token
#

It installs during the OS installation

dim bear
#

I don't use ubuntu so I can't help that much with specifics.

#

I linked docs on how to install nvidia drivers on ubuntu earlier. Please follow that.

supple token
#

okii

#

It begins badly

#

I can change OS though.

dim bear
#

Maybe you'll have more luck with debian.

supple token
#

Yeah I thought about debian

supple token
#

bookworm?

dim bear
#

I use trixie.

#

You can also just keep going with ubuntu for now and use nvidia's .run file if you want.

#

You might want to purge existing nvidia and cuda packages first though.

supple token
#

I tried this with TrueNAS. Didn't work.

#

Enabled dev mode

dim bear
#

Try again 🙂

supple token
#

So I use wget?

dim bear
supple token
dim bear
#

./NVIDIA... --dkms just like in the link.

supple token
#

thanks!

#
 ERROR: The Nouveau kernel driver is currently in use by your system.  This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding.  Please consult the NVIDIA driver README and your Linux distribution's documentation for details
         on how to correctly disable the Nouveau kernel driver.
dim bear
#

A reboot afterwards shall fix that.

#

Otherwise try to blacklist nouveau and reboot. You need to disable Primary GPU or noVNC will not work and you see nothing.

supple token
#

Yeah, I unchecked Primary GPU

dim bear
#

Like I said, CTs are simpler 😄

supple token
#

Not familiar😅

#

for pass through

#
root@llm-server:/home/sysadmin# bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
root@llm-server:/home/sysadmin# bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf"

root@llm-server:/home/sysadmin# update-initramfs -u
update-initramfs: Generating /boot/initrd.img-6.8.0-83-generic
root@llm-server:/home/sysadmin#

On the right track. Thanks!

supple token
#

YAY!🥳

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK107GL [Quadro K600] [10de:0ffa] (rev a1)
        Subsystem: NVIDIA Corporation GK107GL [Quadro K600] [10de:094b]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
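The `Kernel driver in use` line above is the one that shows whether nouveau or nvidia owns the card. A minimal Python sketch that pulls it out of `lspci -nnk` output; `kernel_driver_in_use` is a made-up helper, and it assumes device entries start at column 0 with their details indented, as in the paste above:

```python
import re


def kernel_driver_in_use(lspci_output, device_pattern="NVIDIA"):
    """Return the 'Kernel driver in use' value for the first matching device."""
    in_device = False
    for line in lspci_output.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new device entry.
            in_device = re.search(device_pattern, line) is not None
        elif in_device:
            match = re.match(r"\s*Kernel driver in use:\s*(\S+)", line)
            if match:
                return match.group(1)
    return None


SAMPLE = """\
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK107GL [Quadro K600] [10de:0ffa] (rev a1)
        Subsystem: NVIDIA Corporation GK107GL [Quadro K600] [10de:094b]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
"""

print(kernel_driver_in_use(SAMPLE))
```

On a live system the text could come from `subprocess.run(["lspci", "-nnk"], capture_output=True, text=True).stdout`. A result of `nouveau` would mean the blacklist has not taken effect yet.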
dim bear
#

Now let's hope ollama can use this thing.

supple token
#

🤞 🤞 🤞

#

so far so good

dim bear
#

Hmm. Does this GPU only have 1G of VRAM?

supple token
#

yes

#

Just checked

dim bear
#

Oof. I should have looked into that sooner. Ancient low spec GPU.

supple token
#

No tensor cores

dim bear
#

But I guess you now have a system which you can just plug another GPU in and it should work. Assuming you use resource mappings so it doesn't fail to boot when the original GPU goes AWOL.

supple token
#

I can. PCIe version is limited

#

Hard to find supported ones

dim bear
#

Limited in what way?

supple token
#

It's 3.0 or 3.1 at most. Definitely not 4.0

#
root@llm-server:/home/sysadmin# ollama run llama3.1:8b "Explain what can you do in 20 words or less" --verbose
Answer questions, provide information, engage in conversation, and assist with tasks to the best of my ability.

total duration:       9.389095498s
load duration:        116.067642ms
prompt eval count:    22 token(s)
prompt eval duration: 2.766631948s
prompt eval rate:     7.95 tokens/s
eval count:           22 token(s)
eval duration:        6.505379216s
eval rate:            3.38 tokens/s
root@llm-server:/home/sysadmin# ollama ps
NAME           ID              SIZE      PROCESSOR    CONTEXT    UNTIL
llama3.1:8b    46e0c10c039e    5.6 GB    100% CPU     4096       4 minutes from now
root@llm-server:/home/sysadmin#
dim bear
#

That doesn't matter much for this case.

supple token
#

Maybe a P1000

dim bear
#

You could plop a PCI(e) 5 GPU in there without issues.

supple token
#

Well, I have a RTX 3050 from my main rig

dim bear
#

That should work too. I'd recommend at least 8G of VRAM though.

#

I started with a used RTX 3060.

supple token
#

Nice!

#

I'll try that

#

But wait, it needs external power. It cannot draw juice directly from the PCIe slot

#

I think it should be decent enough now for HAOS

dim bear
#

If you want to live dangerously you can also use MOLEX/SATA to PCI(e) power adapters if your PSU doesn't have them.

supple token
#

It's proprietary. A prebuilt server. No Molex

PowerEdge T20.

I did once experience a Molex-to-SATA adapter that smoked

#

I know the reason

dim bear
#

If you use the right connectors and adhere to the power limits it "should" be fine but it was just meant as a disclaimer.

supple token
#

Now I need to figure out what port number ollama uses or if I need to open it

dim bear
#

Check ss -lntp.

supple token
#

Yuup. 11434

#

LISTEN 0 4096 127.0.0.1:11434 0.0.0.0:*
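Worth noting about that line: the local address is 127.0.0.1, so the socket only accepts connections from the same machine. A small Python sketch that reads the bind address out of an `ss -lnt` row, assuming the column order State, Recv-Q, Send-Q, Local Address:Port; `listen_address` and `loopback_only` are made-up helper names:

```python
import ipaddress


def listen_address(ss_line):
    """Extract (host, port) from the local-address column of an `ss -lnt` row."""
    local = ss_line.split()[3]  # State, Recv-Q, Send-Q, Local Address:Port
    host, _, port = local.rpartition(":")
    # Drop IPv6 brackets and any interface suffix such as "%lo".
    host = host.strip("[]").split("%")[0]
    return host, int(port)


def loopback_only(ss_line):
    """True when the socket is bound to a loopback address such as 127.0.0.1."""
    host, _ = listen_address(ss_line)
    if host in ("*", ""):
        return False  # wildcard bind, reachable on all interfaces
    return ipaddress.ip_address(host).is_loopback


line = "LISTEN 0 4096 127.0.0.1:11434 0.0.0.0:*"
print(listen_address(line), "loopback only:", loopback_only(line))
```

A loopback-only result means other hosts on the network, Home Assistant included, cannot reach the port no matter what the firewall allows.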

dim bear
#

Test locally before you add it to HA.

#

Do the same test we did earlier.

supple token
#

I tested it

dim bear
#

And what was the result of it? What does ollama ps say?

supple token
#

what you can do in 20 words?

supple token
dim bear
#

:<

supple token
#

I wrote that it is decent enough

dim bear
#

Can you access your ollama on its port in the browser?

supple token
#

neither

dim bear
#

UFW might interfere then.

supple token
#

Add the rule into FW

#

UFW it is. I used that command

#

a while ago

#

ufw allow

#

Still nothin

#

the command passed though

dim bear
#

Which commands?

supple token
#
sudo ufw allow 11434
[sudo] password for sysadmin:
Rules updated
Rules updated (v6)
dim bear
#

What does curl -v http://localhost:11434 say?

supple token
#
curl -v localhost:11434
* Host localhost:11434 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:11434...
* connect to ::1 port 11434 from ::1 port 58968 failed: Connection refused
*   Trying 127.0.0.1:11434...
* Connected to localhost (127.0.0.1) port 11434
> GET / HTTP/1.1
> Host: localhost:11434
> User-Agent: curl/8.5.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< Date: Mon, 22 Sep 2025 14:18:12 GMT
< Content-Length: 17
<
* Connection #0 to host localhost left intact
dim bear
#

You cut off the last part.

supple token
#

huh?

dim bear
#

So basically it's reachable from the host itself. Either you used the wrong address in your browser or the firewall is still blocking it.

supple token
#

should be the fw

dim bear
#

Try ufw disable.

supple token
dim bear
#

This already worked. Check in the browser.

supple token
#

to show you that I point to the correct IP

#

Still nothin

#

fw disabled

#

I reboot to see

dim bear
#

Which URL do you use in the browser?

supple token
dim bear
#

Can you try curl -v http://10.0.10.130:11434 from the PVE node?

supple token
#
curl -v http://10.0.10.130:11434
*   Trying 10.0.10.130:11434...
* connect to 10.0.10.130 port 11434 from 0.0.0.0 port 50797 failed: Connection refused
* Failed to connect to 10.0.10.130 port 11434 after 2016 ms: Could not connect to server
* closing connection #0
curl: (7) Failed to connect to 10.0.10.130 port 11434 after 2016 ms: Could not connect to server
dim bear
#

Hmm. But ping 10.0.10.130 works?

supple token
#

yuup

dim bear
#

You didn't enable the PVE firewall, right?

#

Try this in the CT

ufw disable
iptables -L
supple token
#

My my, think so

#

pve FW disabled

dim bear
supple token
#

nope haven't followed.

currently doin it

#

everything is commented by default

#

When I save after it goes away

dim bear
#

Add this in there

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
supple token
#

what I did

#

it doesn't save

dim bear
#

Then restart ollama and it should now work.
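A quick way to confirm it from another machine is to fetch the root URL, the same thing the curl tests do. A minimal Python sketch; `looks_like_ollama` and `ollama_reachable` are made-up helpers, and the banner text "Ollama is running" is an assumption about what the root endpoint returns (it matches the 17-byte reply in the curl output earlier):

```python
import urllib.error
import urllib.request

BANNER = "Ollama is running"  # assumed reply body of GET /


def looks_like_ollama(body):
    """True when the response body contains the expected banner."""
    return BANNER in body


def ollama_reachable(base_url, timeout=5):
    """Return True if GET <base_url> succeeds and returns the banner."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and HTTP errors end up here.
        return False
    return looks_like_ollama(body)
```

From the PVE node, `ollama_reachable("http://10.0.10.130:11434")` should turn from `False` to `True` once the service listens on 0.0.0.0 and has been restarted.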

supple token
#

used cat after and doesn't show up

#

Which I did

dim bear
#

cat what?

supple token
#

cat the file

#

or use nano back again

#

I used Ctrl+X then Y to save it. When I go back again, the lines are gone

dim bear
#

I can't really tell anything without seeing the whole process. For me this works fine.

supple token
dim bear
supple token
#

my my!

#

😅

#

numnut

#

Haven't paid attention to that. I have never seen that

#

finally!

#

Thanks so much!

#

( ͡^ ͜ʖ ͡^ )

supple token
dim bear
#

The model has to be loaded into VRAM.

supple token
dim bear
#

Depends where the bottleneck is. Faster disks, faster VRAM, more memory to keep it in cache, etc.

supple token
#

Still insanely faster than with my previous system though