#Getting intermittent connection errors on all services connected to my uptime kuma.


woeful stump
#

Network errors? They seem to be temporary but consistent. Every minute or so. Uptime kuma can't connect to my app server and my n8n servers can't connect to uptime kuma.

upbeat barnBOT
#

Project ID: N/A


drifting sirenBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

woeful stump
#

n/a

#

Project of Uptimekuma: 7d1c4d0a-d9b2-4143-aa29-f775a92a5c6e

twilit crater
#

Do you have app sleeping on by any chance?

woeful stump
#

Where would I check that? AFAIK, our app servers shouldn't ever sleep.

obsidian junco
#

I came here to report the same. We're on FastAPI on a single instance (no app sleeping), and have been getting errors all morning (for the last hour and a half):

upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: Cannot assign requested address

I've only been able to replicate it on a few occasions, and retries almost always fix it.

They are intermittent. We had no code changes to our app server. If helpful our project ID is: 1d3ab213-7e4f-45f6-a136-1326d80d0606
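Since retries almost always succeed here, a client-side retry with exponential backoff and jitter is one way to ride out short blips like this. The following is only a minimal sketch under stated assumptions: `TransientUpstreamError` and `retry_with_backoff` are hypothetical names invented for illustration, and the caller is assumed to raise that exception when it sees a retryable failure such as the upstream connect error above.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientUpstreamError(Exception):
    """Hypothetical marker for a retryable failure (e.g. an upstream connect error)."""


def retry_with_backoff(
    call: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying transient failures with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except TransientUpstreamError:
            # Out of attempts: surface the last error to the caller.
            if attempt == attempts - 1:
                raise
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError("attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays. This only hides brief failures from callers; it does not help during a sustained outage.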

woeful stump
#

This issue only started happening about 2 hours ago, and happens on all services at once.

twilit crater
#

that's weird 🤔

#

are you guys using private network?

woeful stump
#

No

twilit crater
#

I remember this exact same issue in a thread not too long ago

#

#🚨|incidents message

#

well

#

maybe it's this?

obsidian junco
#

We don't use private networking (unless there is something automatically configured for us)

woeful stump
#

Our uptime kuma connections all use the public railway urls.

obsidian junco
#

We were alerted by Checkly, which we use to monitor our public customer-facing API endpoints. We thought maybe something was wrong on their end, but I have since replicated it myself on my machine.

I came here when that happened, and wondered if it was networking that we didn't control, since we haven't pushed any code to our affected service in the last week.

charred geyser
#

Same. We are using private network

fresh brook
#

seems like #🚨|incidents is the cause

#

nothing we can do but wait for the team to solve it

plush beacon
charred geyser
#

@molten linden

#

@unkempt matrix

twilit crater
#

There's an incident going on, wait for it to get resolved

plush beacon
#

Would appreciate a follow up

#

at least

twilit crater
charred geyser
#

I think I joined the channel before they had that readme. Didn't know

twilit crater
plush beacon
fresh brook
#

It is urgent. The team is working on it

#

locking this thread now, don’t ping the team. You’re distracting them from solving the issue

maiden cove
#

You all should be noticing connections restore.

maiden cove
woeful stump
#

Still pinging down

maiden cove
#

Yep- new flood.

#

Updating

plush beacon
maiden cove
#

okay- seeing connections restore on our end

#

will be 5-15 minutes till DNS resolves

maiden cove
#

@woeful stump - looking good now?

#

also @obsidian junco

woeful stump
#

yes my pager has stopped flipping.

obsidian junco
#

We haven't detected any alerts since 11:50am MST (18 min ago)

maiden cove
#

Our proxy cpu usage is now back to normal- either the DDoS stopped or we swung the hammer enough times to teach them (bad actors) a lesson

#

Okay

#

Good, will update to monitoring

woeful stump
#

Ik we can’t help DDoS attacks from happening but there has to be a way to improve reliability here

#

I can not and will not continue to keep getting questions like this

plush beacon
#

Same, we're gearing up to likely switch to Porter because we are losing customer trust over things like this that we can't control

#

I'd rather our company get DDoS'd directly as we can do something about it

woeful stump
#

And yes, I responded to that query with only great things to say about railway, and how much time the infrastructure saves me. I am on your team.

maiden cove
#

Understood, and these are not empty words: it pains me as well when I know that this impacts your business. I am jumping on a call with the infrastructure team in 5 minutes and we are going to conduct a full retrospective. I will share what went down, and how we plan to avoid something like this in the future.

With that said: I want you to do what is best for your business and not what is best for Railway. If you need to migrate, we can help you with that, or help you configure your environments so that your production is hardened. In Toma's case, you can invite me to your Slack and we can talk next steps.

plush beacon
#

Sounds good, will do. Really appreciate your hard work

merry cosmos
#

(Background: Envoy powers all the HTTP requests. It's being removed in favor of a more resilient proxy we built in house)

plush beacon
#

Yeah definitely, those are my reservations about moving into K8s ^. We're a one man dev team atm (just me) so whatever guarantees a combination of minimum downtime and ease of use is what we'll go with

merry cosmos
plush beacon
#

Thanks for the transparency too. We're also on your side and in 95% of cases, Railway has worked great for us. It's just that this came at the worst timing for us as a business and now we're a bit shell-shocked

maiden cove
#

Even if you aren't we need to know how critical your workloads are so we can best plan for that as well. I appreciate you sharing this.

woeful stump
#

Pinging down

maiden cove
woeful stump
#

looks like it’s resolved now.

merry cosmos
#

Alrighty. Here's the response:

We had a user on a custom domain create a mammoth amount of traffic which overwhelmed a small subset of boxes (aka 1)

Unfortunately y'all were the unlucky folks on that box

We've put up a PR + a monitor which will immediately page, per instance/domain/etc

This will wake someone up automatically. They now have a 1 line way to fix this, so, even in the rare rare event that this happens again, it will be resolved in less than 5 minutes

We're rebuilding the proxying layer to allow us to do fully domain-based RPS configurations, set per domain. This will be live in the next month or so
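To make the idea of per-domain RPS limits concrete, here is a rough sketch of a token-bucket limiter keyed by domain. It is an illustration only and not Railway's proxy: the `DomainRateLimiter` class, its parameters, and the example domains are all assumptions made up for this example.

```python
import time
from typing import Callable, Dict, Tuple


class DomainRateLimiter:
    """Hypothetical per-domain token bucket: each domain gets its own request budget."""

    def __init__(
        self,
        default_rps: float,
        burst: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.default_rps = default_rps
        self.burst = burst
        self.clock = clock
        self.overrides: Dict[str, float] = {}
        # domain -> (tokens remaining, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def set_limit(self, domain: str, rps: float) -> None:
        """Override the refill rate for a single domain."""
        self.overrides[domain] = rps

    def allow(self, domain: str) -> bool:
        """Return True if a request for `domain` fits within its budget."""
        now = self.clock()
        rate = self.overrides.get(domain, self.default_rps)
        tokens, last = self._buckets.get(domain, (self.burst, now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * rate)
        if tokens >= 1.0:
            self._buckets[domain] = (tokens - 1.0, now)
            return True
        self._buckets[domain] = (tokens, now)
        return False
```

Because each domain has its own bucket, one domain flooding the proxy exhausts only its own budget, and requests for other domains on the same box keep being admitted.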

merry cosmos
woeful stump
#

Thank you for resolving the issue quickly by the way.

plush beacon
#

Sg, thanks for the postmortem. Is there a way to get notified when the new proxy layer is productionalized?

maiden cove
#

The changelog: every Friday we are open about all infra changes under the hood and any incoming impact. If that's not enough, we can talk about a communication plan and how to keep you all in the loop.

merry cosmos
#

!remind me to update this thread in 690 hours

molten sparrowBOT
merry cosmos
#

!remind me to update this thread in 1358 hours

molten sparrowBOT