#Prometheus integration causes 100% CPU when lots of entities reload.


normal cobalt
#

When I reload the Unifi integration (or any integration with lots of entities, or enable a lot of entities at once, etc.), my HA completely locks up. After SSH'ing in and running sudo docker stats, I can see that the homeassistant container is using 100% CPU.

It remains at 100% CPU for about 4-5 minutes before coming back down, and then HA is responsive again.

During this time HA's web UI is completely down and it's completely deadlocked. Feels like some sort of thread locking issue.

I am running on an Intel N200 with 16 GB of RAM, so I have quite a bit of power behind this.

Core: 2024.12.2
Supervisor: 2024.11.4
Operating System: 14.0

The logs don't appear to be helpful. During this window I'll see a bunch of network timeout log entries, but I know the networking is not the problem (and SSH remains responsive).

#

After turning on debug logging I see it make this request:

2024-12-12 12:55:00.863 DEBUG (MainThread) [aiounifi.interfaces.connectivity] sending (to https://192.168.40.1:443/api/auth/login) post, {'username': [REDACTED], 'password': [REDACTED], 'rememberMe': True}, True

Then it hangs there for a while, and then:

2024-12-12 12:56:09.462 DEBUG (MainThread) [aiounifi.interfaces.connectivity] sending (to https://192.168.40.1:443) get, None, False

And then finally:

2024-12-12 12:56:13.474 DEBUG (MainThread) [aiounifi.interfaces.connectivity] received (from https://192.168.40.1:443) 200 text/html <ClientResponse(https://192.168.40.1) [200 OK]>

If you check the timing there, it's very weird: it's clearly slow, and it's intermixed with a bunch of other timeout errors all over.

normal cobalt
#

I'm seeing these 100% CPU usage bursts happen at other times too, so I don't know that it's related just to reloading Unifi Network, but I'm not sure where to start debugging why it might be happening.

#

But reloading Unifi makes it happen 100% of the time.

#

I might need to start disabling integrations to find out

torpid pagoda
#

Any chance you could have a routing loop going on? They can show up as CPU load triggered by the network. Don't think they usually stop until something falls over though.

cinder raven
#

Be sure to try in safe mode

#

Can you move this to an issue on GitHub?

#

I would suggest adding the profiler integration, executing profiler.start, and then reloading
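For anyone unfamiliar with what a profile gives you, here is a rough sketch using Python's standard-library cProfile. This is not the profiler integration itself, just the same idea in miniature, and `slow_reload` and `profile_call` are hypothetical names made up for the illustration.

```python
import cProfile
import io
import pstats


def slow_reload(entity_count=200):
    """Hypothetical stand-in for an expensive reload: rescans a list per entity."""
    metrics = [f"sensor.device_{i}" for i in range(entity_count)]
    for entity_id in list(metrics):
        # Rebuild the whole list just to drop one entry.
        metrics = [m for m in metrics if m != entity_id]
    return len(metrics)


def profile_call(func, *args):
    """Run func under cProfile and return a text report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_call(slow_reload, 200))
```

The report lists which functions the time went to, which is the kind of information needed to narrow a CPU spike down to one integration.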

normal cobalt
#

Same thing in safe mode.

normal cobalt
#

We tracked this down to the prometheus integration

normal cobalt
#

Prometheus integration causes 100% CPU when lots of entities reload.

#

Comment this morning: https://github.com/home-assistant/core/issues/133157#issuecomment-2542717119

Ok, I've reproduced this issue with a dev version of home assistant. The issue is the _remove_labelsets method iterates over all metrics just to delete 1, and now metrics are deleted when they become unavailable. I'll look at making this faster, possibly by caching metrics within the component.
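To illustrate why "iterates over all metrics just to delete 1" hurts so much when thousands of entities reload at once, here is a minimal sketch. It is not the actual Prometheus integration code: `SlowRegistry`, `IndexedRegistry`, and `simulate` are hypothetical names, and the per-entity index is only one possible way to do the caching mentioned in the comment.

```python
from collections import defaultdict


class SlowRegistry:
    """Keeps samples in one flat list, so every removal scans all of them."""

    def __init__(self):
        self.samples = []  # list of (metric_name, entity_id)
        self.visited = 0   # counts how many samples removals had to look at

    def add(self, metric_name, entity_id):
        self.samples.append((metric_name, entity_id))

    def remove_entity(self, entity_id):
        kept = []
        for sample in self.samples:
            self.visited += 1
            if sample[1] != entity_id:
                kept.append(sample)
        self.samples = kept


class IndexedRegistry:
    """Keeps a per-entity index, so removal only touches that entity's samples."""

    def __init__(self):
        self.by_entity = defaultdict(set)
        self.visited = 0

    def add(self, metric_name, entity_id):
        self.by_entity[entity_id].add(metric_name)

    def remove_entity(self, entity_id):
        for _ in self.by_entity.pop(entity_id, ()):
            self.visited += 1


def simulate(registry, entity_count, metrics_per_entity=3):
    """Register every entity, then remove them all, as a mass reload would."""
    for i in range(entity_count):
        for m in range(metrics_per_entity):
            registry.add(f"metric_{m}", f"sensor.device_{i}")
    for i in range(entity_count):
        registry.remove_entity(f"sensor.device_{i}")
    return registry.visited


if __name__ == "__main__":
    print("flat scan:", simulate(SlowRegistry(), 100))
    print("indexed:  ", simulate(IndexedRegistry(), 100))
```

With the flat list, the work grows roughly with the square of the entity count, while the indexed version grows linearly. That would be consistent with a multi-minute stall shrinking to seconds, though the real fix may be structured differently.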

cinder raven
#

@normal cobalt Apparently a bugfix PR has been opened, can you try it out using the script I gave you? You now have to replace the number after -p with 133219

#

I think it would be nice to gather another callgrind from the profiler when you test this by reloading the Unifi integration, to see how the impact changes

normal cobalt
#

Okay so it still has 100% CPU, but it only lasts for about 10-20 seconds. Taking a profile now.

#

But that's probably expected for 3600 entities

#

callgrind uploaded to issue

cinder raven
#

Yep, saw that

#

awesome

#

Like, I believe it should still not be possible to cause a complete lock up

normal cobalt
#

At least it didn't cause the websocket to disconnect like it would before.

cinder raven
#

Oh definitely

#

If I had to guess, I think it's because the websocket had no traffic in between, so it assumed the client was offline
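A small asyncio sketch of that suspicion: when CPU-bound work runs on the event loop thread, nothing else scheduled on that loop gets to run, including whatever would keep a websocket alive. This is only an illustration under that assumption, not Home Assistant code; `heartbeat`, `blocking_reload`, and `run_demo` are hypothetical names.

```python
import asyncio
import time


async def heartbeat(gaps, interval=0.01, beats=5):
    """Record the time between beats; stands in for websocket keepalives."""
    last = time.monotonic()
    for _ in range(beats):
        await asyncio.sleep(interval)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def blocking_reload(seconds):
    """Busy-loop without awaiting, so the event loop cannot run other tasks."""
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        pass


async def run_demo(block_for=0.2):
    gaps = []
    task = asyncio.create_task(heartbeat(gaps))
    await asyncio.sleep(0.02)         # let the heartbeat start ticking
    await blocking_reload(block_for)  # hogs the loop thread
    await task
    return max(gaps)


if __name__ == "__main__":
    print(f"longest heartbeat gap: {asyncio.run(run_demo()):.3f}s")
```

The longest gap comes out at least as long as the blocking call, even though the heartbeat asked to run every 10 ms. Scale the block up to minutes and a client on the other end would reasonably give up on the connection.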