# High CPU usage with Postgres


jolly pewter
#

Hello, my backstage instance has recently started consuming a large amount of database CPU on our Postgres instance. I am working to troubleshoot what direct change may have caused this.

In particular, when monitoring pg_stat_activity, I see a cycle of transactions being opened, some selects, and some updates, which also seems to trigger autovacuum on the refresh_state and refresh_state_references tables. On the surface this cycle could be the cause: both appear to be relatively large tables, so I could imagine autovacuum and those queries at such frequency driving up CPU usage.

Does anyone have insight on the cause of this or potential mitigation?

#

A follow up: once a refresh is complete, should the refresh_state table be holding data still? What is the point of storing the unprocessed & processed entity there rather than just the processed entity in final_entities? I am curious if some large refresh got cancelled or interrupted causing that data to be stuck in the table.

Admittedly I haven't investigated much into how the Backstage ingestion process works behind the scenes yet.

eternal dagger
#

From my understanding: an entity provider creates an unprocessed entity in the refresh state table. Each entity provider works on its own schedule.

Then a set of processors runs in sequence on the unprocessed entity, creating an intermediate state (per processor step) in the processed entity field in the refresh state table. Then stitching creates the entity in the final entities table.

Processors are run in the "processing loop" on a schedule separate from the entity providers.

https://backstage.io/docs/features/software-catalog/life-of-an-entity#ingestion

jolly pewter
#

Context: the history of our CPU credit usage seems to follow the growth of our database. We saw one spike when we began ingesting a certain provider, and another spike as we added a new (much larger) provider.

eternal dagger
#

There is a schedule. If you look at the refresh state table, there is a field called next_update_at, which shows when that entity is next due for the processing loop.

jolly pewter
#

From what I observe in pg_stat_activity, it keeps setting next_update_at to 100 seconds in the future on a presumably large number of rows. I haven't been able to get the values of the prepared statement it's executing yet.

eternal dagger
#

You can configure the processing interval with "catalog.processingInterval"

jolly pewter
#

Awesome, I will try that out and see if it lowers usage, which will help us narrow down the problem and tune to our needs. For my follow-up question, if you know the answer: should entities keep living in refresh state? Or should they be removed once all processors have run and the entity is stitched?

eternal dagger
#

They stay there until the entity provider removes them.

jolly pewter
#

And in the case that for some reason the entity provider does not remove them, they are stuck in the processing interval - which in the case of a large dataset (~90k entities) the work of the processing loop could theoretically drive up CPU? (I know this feels a little circular, just trying to make sure I have the right understanding)

#

Would the processing loop also output errors to logs to confirm that is not a potential issue?

eternal dagger
#

Yeah, basically. But there is a limit to the number of entities that are processed at a time, so CPU usage should plateau at some level.

We generally monitor the oldest next_update_at, to ensure the queue is not growing. ... if that makes sense.
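
One way to picture that check is as a single `MIN(next_update_at)` query compared against the current time. The sketch below is only an illustration: it uses an in-memory SQLite table as a stand-in for the real Postgres database, and everything except the `refresh_state` table name and the `next_update_at` column (both mentioned in this thread) is an assumption, including the `entity_ref` column and the ISO-8601 text timestamps.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def oldest_due_lag(conn: sqlite3.Connection, now: datetime) -> timedelta:
    """Return how far in the past the oldest next_update_at is (zero if none is overdue)."""
    row = conn.execute("SELECT MIN(next_update_at) FROM refresh_state").fetchone()
    if row[0] is None:  # empty table: nothing is waiting
        return timedelta(0)
    oldest = datetime.fromisoformat(row[0])
    return max(now - oldest, timedelta(0))


def demo_connection(now: datetime) -> sqlite3.Connection:
    """Build a tiny stand-in table (hypothetical schema, not the real one)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE refresh_state (entity_ref TEXT PRIMARY KEY, next_update_at TEXT)"
    )
    rows = [
        ("user:default/a", (now - timedelta(seconds=60)).isoformat()),  # overdue
        ("user:default/b", (now + timedelta(seconds=40)).isoformat()),  # not yet due
    ]
    conn.executemany("INSERT INTO refresh_state VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(oldest_due_lag(demo_connection(now), now))
```

If the lag stays roughly constant over time, the loop is keeping up; if it keeps growing, the queue is falling behind.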

jolly pewter
#

Yes we do see a limit - our instance is capping around 60-80% so it's not crashing out on full CPU usage. But it is at 60-80% 24/7 which is concerning lol. Just checked: my refresh_state table has 108k rows and my final_entities has 107k rows. We have not had a significant entity provider run recently, so that is not adding up. The oldest next_update_at keeps moving and stays about 1 minute in the past.

eternal dagger
#

It's a large catalog you have there. Maybe you need a bigger DB?

jolly pewter
#

Perhaps, but does the size of refresh_state not concern you? Is there any reason why I shouldn't manually delete data in that table if all the rows have no errors with an associated final_entity? With no next_stitch as well?

The majority of our catalog is coming from the LDAP entity provider, we are a large org syncing around 40k users and 50k groups... been searching for a way to narrow that down to only data we need but we're still in early stages and it was the easiest way to guarantee access and visibility for all users for now.

#

We're running on 2 vCPU & 8 GB memory for PG17 on RDS

eternal dagger
#

You could delete from the refresh state, but I think the LDAP provider will create the rows again.

jolly pewter
#

Gotcha, I'll try to increase the processing interval and continue to work on getting our catalog size down. But I am still honestly confused by the amount of (what seems to be) data duplication across tables for the entity provider/processor process.

lucid creek
#

Hang on

#

Don't do surgery 🙂

#

As you'll see in the linked "life of an entity" article, stitching happens by design over and over. And it always starts from the raw data - not from the previous output.

#

So there's constant re-evaluation of unprocessed -> processed -> stitched

#

This seems wasteful but it has safeguards in terms of hashing that only does the later parts of the work if things actually changed
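
To make the hashing idea concrete, here is a minimal sketch of the general technique: hash the processed result and skip the later, more expensive step when the hash is unchanged. This is not the catalog's actual implementation; the `Stitcher` class, `entity_hash`, and the choice of SHA-256 over canonical JSON are all assumptions made for illustration.

```python
import hashlib
import json


def entity_hash(entity: dict) -> str:
    """Hash a canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(entity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Stitcher:
    """Hypothetical stand-in for the 'later parts of the work'."""

    def __init__(self) -> None:
        self.last_hash: dict[str, str] = {}
        self.stitch_count = 0

    def process(self, entity_ref: str, processed_entity: dict) -> bool:
        """Stitch only if the processed entity changed; return True if work was done."""
        new_hash = entity_hash(processed_entity)
        if self.last_hash.get(entity_ref) == new_hash:
            return False  # unchanged: skip the expensive part
        self.last_hash[entity_ref] = new_hash
        self.stitch_count += 1  # placeholder for the real stitching work
        return True


if __name__ == "__main__":
    stitcher = Stitcher()
    user = {"kind": "User", "metadata": {"name": "alice"}}
    print(stitcher.process("user:default/alice", user))  # first pass does the work
    print(stitcher.process("user:default/alice", user))  # second pass is skipped
```

The loop still has to read each row and compute the hash on every pass, which is why a short interval on a large table can keep the database busy even when nothing changes.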

#

You can absolutely not wipe things from refresh_state; that's the master table and everything else hangs off of it with cascade delete foreign keys. If you delete a refresh_state row, ALL other data related to it will be deleted too and the entity will be gone
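
The cascade behaviour described here can be reproduced in miniature. The sketch below uses an in-memory SQLite database with a deliberately simplified, assumed schema (the real tables have more columns and run on Postgres); only the table names `refresh_state` and `final_entities` come from this thread.

```python
import sqlite3

ALLOWED_TABLES = {"refresh_state", "final_entities"}


def build_demo_db() -> sqlite3.Connection:
    """Create two tables linked by an ON DELETE CASCADE foreign key (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
    conn.execute(
        "CREATE TABLE refresh_state (entity_id TEXT PRIMARY KEY, entity_ref TEXT NOT NULL)"
    )
    conn.execute(
        """CREATE TABLE final_entities (
               entity_id TEXT PRIMARY KEY
                   REFERENCES refresh_state(entity_id) ON DELETE CASCADE,
               final_entity TEXT NOT NULL
           )"""
    )
    conn.execute("INSERT INTO refresh_state VALUES ('1', 'user:default/alice')")
    conn.execute("INSERT INTO final_entities VALUES ('1', '{\"kind\": \"User\"}')")
    return conn


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Count rows in one of the demo tables."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


if __name__ == "__main__":
    conn = build_demo_db()
    conn.execute("DELETE FROM refresh_state WHERE entity_id = '1'")
    # The dependent final_entities row is removed by the cascade.
    print(count_rows(conn, "final_entities"))
```

Deleting the parent row removes the dependent row without any explicit delete on the child table, which is the effect warned about above.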

jolly pewter
#

Good to know, no surgery then. The processing interval config change has fixed our CPU usage for now. 45 minutes seems reasonable, but in the event we want to scale it back down: any guidance on how to tune the processing loop with such a large catalog or is the answer indeed “bigger database”?
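
A rough back-of-the-envelope calculation shows why the interval change has such a large effect. The sketch assumes every row in refresh_state is picked up once per interval and uses the figures quoted in this thread (about 108k rows, roughly 100 seconds observed between updates, 45 minutes after the change). It ignores batching limits and the hash-based skipping, so treat it as an upper-bound estimate rather than a measurement.

```python
def required_rate(entity_count: int, interval_seconds: float) -> float:
    """Entities per second the loop must handle to revisit every row once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return entity_count / interval_seconds


if __name__ == "__main__":
    rows = 108_000  # refresh_state row count quoted in the thread
    before = required_rate(rows, 100)      # ~100 s between updates, as observed
    after = required_rate(rows, 45 * 60)   # 45-minute processing interval
    print(f"before: {before:.0f}/s, after: {after:.0f}/s")
```

Under these assumptions the loop goes from about 1080 entities per second to about 40, a 27x reduction in row reads and updates.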

lucid creek
#

For reference, ours is set to 24 hours. But that's because we have set things up with event driven updates (based on webhooks) so we get instant updates anyway.

jolly pewter
#

For prod I think this is acceptable. The team has some concerns around speed of spinning up feature and dev environments when we will have to already wait for such a large LDAP sync and processing time. There’s some optimizing to do there in terms of the LDAP query and we’re thinking of a solution that picks up from where our long standing staging instance DB is. But just want to consider all possibilities

#

We haven't experimented much with event-driven updates; not sure how many of the providers we're using support them.

lucid creek
#

The thing about the processors running repeatedly is that it unlocks some really powerful refinement abilities, where you just add processors that, for example, grab contextual data from other systems and massage it "automagically".

That's really powerful and nicely composable.

But. That being said. That first-generation pattern is ready to be challenged and we have plans for improving things significantly so that it's all much more cleanly reactive

#

Yeah, local dev is a different beast. It's not a generally solvable problem space. First, not all data sources are necessarily reachable by devs (for network or auth reasons). Second, reading them may actually take significant time. Third, their data may be so large that there's no way around the fact that mashing that much data into the database and processing it will take a long time and eat tons of resources.

#

So generally you'd probably not fetch prod data into local dev

jolly pewter
#

Yea, we will need some creative solutions 🤷‍♂️. Very cool for the event driven work though, excited to see what comes from it. Thanks @lucid creek & @eternal dagger for the help & info.

lucid creek
#

By the way - ours is on the order of half a million entities, last I checked. Just for some scale reference. And the way we have it set up, there's usually a sub-second delay between source code changes and them being reflected, and both service and database are at very low resource usage levels, mostly cruising.