# How does Backstage work with horizontal pod scaling?


manic wave
#

Hi all, I'm reading through the documentation and I wanted to know more about how Backstage handles entity ingestion and processing when scaled horizontally on Kubernetes.

I ran a very simple test where I triggered ingestion through the app, and both of my existing pods seemed to be ingesting. One pod stopped ingesting when the other one finished. I could only tell from the logs, where the bursts have different durations and timestamps. If my conclusion is wrong, let me know.

I want to know how I should approach this kind of setup, its best practices, and anything else I need to know.

river void
#

All of the builtin things in Backstage are built with horizontal scaling ability as part of their design. We leverage that heavily ourselves. All of the nodes collaborate to produce results.

#

How, precisely, they do that depends on the plugin/module at hand.

#

The catalog backend itself has a highly efficient processing queue implementation that dynamically spreads load among workers to handle the processing of entities. They do not overlap, and scaling up does not cause issues; it just makes things go faster
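To make the "they do not overlap" property concrete, here is a minimal in-memory sketch in TypeScript. It is an illustration only, not the catalog backend's actual implementation, which coordinates pods through shared state rather than process memory. `SharedQueue`, `claim`, `simulate`, the batch size, and the refresh interval are all hypothetical names and values chosen for the example.

```typescript
// Hypothetical stand-in for a shared processing queue.
// In a real deployment this state would be shared between pods, not held in memory.
type QueueItem = { entityRef: string; nextUpdateAt: number };

class SharedQueue {
  private items: QueueItem[];

  constructor(entityRefs: string[]) {
    this.items = entityRefs.map(entityRef => ({ entityRef, nextUpdateAt: 0 }));
  }

  // Claim up to `batchSize` due items and push their next-update time forward,
  // so that no other worker can pick up the same items in the meantime.
  claim(now: number, batchSize: number, intervalMs: number): string[] {
    const due = this.items
      .filter(item => item.nextUpdateAt <= now)
      .slice(0, batchSize);
    for (const item of due) {
      item.nextUpdateAt = now + intervalMs;
    }
    return due.map(item => item.entityRef);
  }
}

// Let `workerCount` workers take turns claiming batches until nothing is due.
// Returns which entity refs each worker ended up processing.
function simulate(workerCount: number, entityCount: number): Record<string, string[]> {
  const refs: string[] = [];
  for (let i = 0; i < entityCount; i++) {
    refs.push(`component:default/svc-${i}`);
  }
  const queue = new SharedQueue(refs);
  const processedBy: Record<string, string[]> = {};
  const now = 1000;
  let claimedSomething = true;
  while (claimedSomething) {
    claimedSomething = false;
    for (let w = 0; w < workerCount; w++) {
      const worker = `pod-${w}`;
      const batch = queue.claim(now, 2, 60000);
      if (batch.length > 0) {
        claimedSomething = true;
        processedBy[worker] = (processedBy[worker] || []).concat(batch);
      }
    }
  }
  return processedBy;
}

console.log(simulate(2, 10));
```

Because claiming an item also moves its next-update time into the future, adding more workers only changes how the items are divided, never how often each one is processed.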

#

Individual providers may choose to use the scheduler service, which likewise is a globally collaborative one that makes sure that hosts take turns running tasks without overlap
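The "take turns without overlap" idea can be sketched as a lease on a shared task record. The TypeScript below is a simplified in-memory illustration, not the scheduler service's real code or API. `SharedTaskTable`, `tryClaim`, `release`, `runHosts`, and the task id `refresh-provider` are hypothetical names made up for this example.

```typescript
// Hypothetical stand-in for task state that all hosts can see.
type TaskLease = { nextRunAt: number; lockedUntil: number };

class SharedTaskTable {
  private leases: Record<string, TaskLease> = {};

  // Returns true only if the task is due and no other host holds the lock.
  tryClaim(taskId: string, now: number, frequencyMs: number, timeoutMs: number): boolean {
    const lease = this.leases[taskId] || { nextRunAt: 0, lockedUntil: 0 };
    if (now < lease.nextRunAt || now < lease.lockedUntil) {
      return false; // not due yet, or another host is still running it
    }
    this.leases[taskId] = {
      nextRunAt: now + frequencyMs,
      lockedUntil: now + timeoutMs,
    };
    return true;
  }

  // Called when the run finishes, so the lock does not linger until timeout.
  release(taskId: string): void {
    const lease = this.leases[taskId];
    if (lease) {
      lease.lockedUntil = 0;
    }
  }
}

// Every host polls at every tick; the returned list records who actually ran.
function runHosts(hosts: string[], ticks: number[]): string[] {
  const table = new SharedTaskTable();
  const runs: string[] = [];
  for (const now of ticks) {
    for (const host of hosts) {
      if (table.tryClaim('refresh-provider', now, 1000, 500)) {
        runs.push(`${host}@${now}`);
        table.release('refresh-provider');
      }
    }
  }
  return runs;
}

console.log(runHosts(['pod-a', 'pod-b'], [0, 500, 1000, 1500, 2000]));
```

However many hosts poll, the task runs at most once per frequency interval, which is why adding pods does not multiply provider runs.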

#

We could never use Backstage ourselves at Spotify unless this was solidly in place

manic wave
#

I see, that's a good affirmation. Thank you!
I was worried about what I found, but maybe that also explains the consistent resource usage across the pods. I thought it was supposed to run on only one pod and then create more pods when requests start coming in, so that the load balancer can do its job, seeing that one pod is handling all of the ingestion and processing.

river void
#

We do that. It is a huge benefit on big installations but not something to worry about when starting out.

#

It lets your API serving nodes be offloaded to do nothing but just that, leading to lower latency and better predictability

manic wave
#

oh wow I am not aware of this

#

this is something to consider for us, since we have a large-scale catalog now

river void
#

Alright! As the article mentions, this is a little cumbersome right now, we wish it were fewer steps. So we'll want to frameworkify that a bit some time down the line

#

We have 2-3 writer nodes (depending on autoscaling) in general, in one region close to the database master. And then a couple of reader nodes in every region globally, obviously fluctuating by load

#

I spoke a bit about this at kubecon if you want to hear about our journey https://youtu.be/anqWhSnN7sA?si=nddeUzVZa4t_Yy1P

#

20-something minutes in

manic wave
river void
#

Are you on a recent version of it? We also made a bunch of speedups lately. Including in this very last release!

manic wave
river void
#

Ah right. Yeah I fixed that. Weird compression middleware ☠️

#

It REALLY helped our latency metrics to have that streaming. It takes a shocking amount of time to JSON-serialize large payloads
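As a rough illustration of why streaming helps, the TypeScript sketch below writes a JSON array one element at a time instead of building a single huge string with one `JSON.stringify` call. It is a generic sketch, not the catalog's actual response code. `writeJsonArray` and the `write` callback are hypothetical names; in a real server `write` would typically forward each chunk to the response stream.

```typescript
// Serialize an array of objects chunk by chunk.
// Assumes every item is a plain JSON-serializable object.
function writeJsonArray(items: object[], write: (chunk: string) => void): void {
  write('[');
  items.forEach((item, index) => {
    // Only one element is serialized and held as a string at a time.
    write((index === 0 ? '' : ',') + JSON.stringify(item));
  });
  write(']');
}

// Usage: collect the chunks here; a server would send them as they are produced.
const chunks: string[] = [];
writeJsonArray(
  [
    { kind: 'Component', name: 'svc-a' },
    { kind: 'Component', name: 'svc-b' },
  ],
  chunk => chunks.push(chunk),
);
console.log(chunks.join(''));
```

The concatenated chunks form the same JSON document that a single `JSON.stringify` of the array would produce, but no single string ever has to hold the full payload.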

#

And you start hitting the 750 MB limit after a while

pure vault
river void
#

Sure, do explore all of the otel metrics that it exports

#

for this one you may want to look at catalog_processing_queue_delay* and catalog_stitching_queue_delay*

river void
#

Those should both be at fractions of a second, except in exceptional circumstances

#

If they ever go high and stay high, you're in trouble because your system is underpowered and can't keep up
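One way to turn "go high and stay high" into an automated check is sketched below in TypeScript. It assumes you have already collected samples of a queue delay metric from your own monitoring system. The `DelaySample` type, `isSustainedHigh`, the 1 second threshold, and the 10 minute window are hypothetical choices for illustration, not values recommended by Backstage.

```typescript
// One observation of a queue delay metric (shape is an assumption).
type DelaySample = { timestampMs: number; delaySeconds: number };

// Returns true if the delay stayed above `thresholdSeconds` for an unbroken
// stretch of at least `windowMs`. Samples must be sorted by time.
function isSustainedHigh(
  samples: DelaySample[],
  thresholdSeconds: number,
  windowMs: number,
): boolean {
  let runStart: number | undefined;
  for (const sample of samples) {
    if (sample.delaySeconds > thresholdSeconds) {
      if (runStart === undefined) {
        runStart = sample.timestampMs;
      }
      if (sample.timestampMs - runStart >= windowMs) {
        return true;
      }
    } else {
      runStart = undefined; // the delay recovered, so start over
    }
  }
  return false;
}

// Usage: one sample per minute, delay stuck at 30 seconds throughout.
const minute = 60000;
const stuck: DelaySample[] = [];
for (let i = 0; i <= 15; i++) {
  stuck.push({ timestampMs: i * minute, delaySeconds: 30 });
}
console.log(isSustainedHigh(stuck, 1, 10 * minute));
```

Resetting the run whenever the delay drops back under the threshold keeps a brief spike from triggering the check.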

#

It can be OK for short bursts (like when ingesting very large new datasets etc) but under normal operation