Getting intermittent connection errors on all services connected to my uptime kuma. | Railway

woeful stump Jan 18, 2024, 4:52 PM

#

Network errors? They seem to be temporary but consistent. Every minute or so. Uptime kuma can't connect to my app server and my n8n servers can't connect to uptime kuma.

upbeat barnBOT Jan 18, 2024, 4:52 PM

#

Project ID: N/A

woeful stump Jan 18, 2024, 4:53 PM

#

n/a

#

Project of Uptimekuma: 7d1c4d0a-d9b2-4143-aa29-f775a92a5c6e

twilit crater Jan 18, 2024, 5:06 PM

#

Do you have app sleeping on by any chance?

woeful stump Jan 18, 2024, 5:07 PM

#

Where would I check that? AFAIK, our app servers shouldn't ever sleep.

obsidian junco Jan 18, 2024, 5:09 PM

#

I came here to report the same. We're on FastAPI on a single instance (no app sleeping), and have been getting errors all morning (for the last hour and a half):

upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: Cannot assign requested address

I've only been able to replicate it on a few occasions, and retries almost always fix it.

They are intermittent. We had no code changes to our app server. If helpful our project ID is: 1d3ab213-7e4f-45f6-a136-1326d80d0606
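
Since retries almost always succeed here, one client-side stopgap is to retry with exponential backoff so short blips don't surface as failures. This is only a minimal sketch, assuming the failure shows up as a `ConnectionError` in the caller; `call_with_retries` and its parameters are hypothetical names for illustration, not part of FastAPI or any Railway API.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on ConnectionError with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries, surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt)
            # Add up to 50% random jitter so clients don't retry in lockstep
            sleep(delay + random.uniform(0, delay / 2))
    # Only reached when attempts < 1
    raise RuntimeError("attempts must be at least 1")
```

The `sleep` argument is injectable so the backoff schedule can be checked without actually waiting. Retrying hides brief resets but does nothing for a sustained outage, so it complements monitoring rather than replacing it.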

woeful stump Jan 18, 2024, 5:10 PM

#

This issue only started happening about 2 hours ago, and happens on all services at once.

twilit crater Jan 18, 2024, 5:10 PM

#

that's weird 🤔

#

are you guys using private network?

woeful stump Jan 18, 2024, 5:11 PM

#

No

twilit crater Jan 18, 2024, 5:11 PM

#

I remember this exact same issue in a thread not too long ago

#

#🚨｜incidents message

#

well

#

maybe it's this?

obsidian junco Jan 18, 2024, 5:11 PM

#

We don't use private networking (unless there is something automatically configured for us)

woeful stump Jan 18, 2024, 5:12 PM

#

Our uptime kuma connections all use the public railway urls.

obsidian junco Jan 18, 2024, 5:14 PM

#

We were alerted by Checkly, which we use to monitor our public customer-facing API endpoints. We thought maybe something was wrong on their end, but I have since replicated it myself on my machine.

I came here when that happened, and wondered if it was networking that we didn't control, since we haven't pushed any code to our affected service in the last week.

charred geyser Jan 18, 2024, 5:14 PM

#

Same. We are using private network

fresh brook Jan 18, 2024, 5:14 PM

#

seems like #🚨｜incidents is the cause

#

nothing we can do but wait for the team to solve it

plush beacon Jan 18, 2024, 5:25 PM

#

Jumping in here as well (x-posted here: #1197588560564469760 message), we lost 2 leads due to broken demos this morning bc of this

charred geyser Jan 18, 2024, 5:25 PM

#

@molten linden

#

@unkempt matrix

twilit crater Jan 18, 2024, 5:26 PM

#

charred geyser <@1031562283274473552>

#🛂｜readme 5) Don't ping team or conductors

#

There's an incident going on, wait for it to get resolved

plush beacon Jan 18, 2024, 5:26 PM

#

Would appreciate a follow up

#

at least

twilit crater Jan 18, 2024, 5:27 PM

#

plush beacon Would appreciate a follow up

Yeah, let's just wait for the incident to get resolved

charred geyser Jan 18, 2024, 5:27 PM

#

I think I joined the channel before they had that readme. Didn't know

twilit crater Jan 18, 2024, 5:27 PM

#

twilit crater Yeah, let's just wait for the incident to get resolved

And see if your issue persists

twilit crater Jan 18, 2024, 5:27 PM

#

charred geyser I think I joined the channel before they had that readme. Didn't know

np, just be careful

plush beacon Jan 18, 2024, 5:37 PM

#

Why is this not more urgent, our whole site is down https://www.toma.so/

fresh brook Jan 18, 2024, 5:38 PM

#

It is urgent. The team is working on it

#

locking this thread now, don’t ping the team. You’re distracting them from solving the issue

maiden cove Jan 18, 2024, 6:06 PM

#

You all should be noticing connections restore.

maiden cove Jan 18, 2024, 6:06 PM

#

plush beacon Why is this not more urgent, our whole site is down https://www.toma.so/

Check now for restoration.

woeful stump Jan 18, 2024, 6:15 PM

#

Still pinging down

maiden cove Jan 18, 2024, 6:16 PM

#

Yep- new flood.

#

Updating

plush beacon Jan 18, 2024, 6:21 PM

#

maiden cove Check now for restoration.

Yeah still down

maiden cove Jan 18, 2024, 6:37 PM

#

okay- seeing connections restore on our end

#

will be 5-15 minutes till DNS resolves

maiden cove Jan 18, 2024, 7:05 PM

#

@woeful stump - looking good now?

#

also @obsidian junco

woeful stump Jan 18, 2024, 7:07 PM

#

yes my pager has stopped flipping.

obsidian junco Jan 18, 2024, 7:08 PM

#

We haven't detected any alerts since 11:50am MST (18 min ago)

maiden cove Jan 18, 2024, 7:08 PM

#

Our proxy CPU usage is now back to normal. Either the DDoS stopped or we swung the hammer enough times to teach them (bad actors) a lesson

#

Okay

#

Good, will update to monitoring

woeful stump Jan 18, 2024, 7:08 PM

#

Ik we can’t stop DDoS attacks from happening, but there has to be a way to improve reliability here

#

I can not and will not continue to keep getting questions like this

plush beacon Jan 18, 2024, 7:13 PM

#

Same, we're gearing up to likely switch to Porter because we are losing customer trust over things like this that we can't control

#

I'd rather our company get DDoS'd directly as we can do something about it

woeful stump Jan 18, 2024, 7:16 PM

#

And yes, I responded to that query with only great things to say about railway, and how much time the infrastructure saves me. I am on your team.

maiden cove Jan 18, 2024, 7:16 PM

#

Understood, and these are not empty words: it pains me as well when I know that this impacts your business. I am jumping on a call with the infrastructure team in 5 minutes and we are going to conduct a full retrospective. I will share what went down, and how we plan to avoid something like this in the future.

With that said: I want you to do what is best for your business and not what is best for Railway. If you need to migrate, we can help you with that, or help you configure your environments so that your production is hardened. In Toma's case, you can invite me to your Slack and we can talk next steps.

plush beacon Jan 18, 2024, 7:18 PM

#

Sounds good, will do. Really appreciate your hard work

merry cosmos Jan 18, 2024, 7:19 PM

#

plush beacon Same, we're gearing up to likely switch to Porter because we are losing customer...

FWIW I can almost guarantee you even more pain and equivalent outages from managing a Kubernetes cluster

Make no mistake, we're not saying the above is acceptable. Just grass is greener type stuff

We'll provide a retro on this in about 15 minutes (writing stuff up)

#

(Background: Envoy powers all the HTTP requests. It's being removed in favor of a more resilient proxy we built in house)

plush beacon Jan 18, 2024, 7:20 PM

#

Yeah definitely, those are my reservations about moving into K8s ^. We're a one man dev team atm (just me) so whatever guarantees a combination of minimum downtime and ease of use is what we'll go with

merry cosmos Jan 18, 2024, 7:21 PM

#

plush beacon Yeah definitely, those are my reservations about moving into K8s ^. We're a one ...

Sounds good. We'll talk traffic steering with the team and like I said, give us 15m to get together to retro

30 minutes to get back to you

plush beacon Jan 18, 2024, 7:21 PM

#

Thanks for the transparency too. We're also on your side and in 95% of cases, Railway has worked great for us. It's just that this came at the worst timing for us as a business and now we're a bit shell-shocked

maiden cove Jan 18, 2024, 7:28 PM

#

Even if you aren't, we need to know how critical your workloads are so we can best plan for that as well. I appreciate you sharing this.

woeful stump Jan 18, 2024, 8:21 PM

#

Pinging down

maiden cove Jan 18, 2024, 8:27 PM

#

woeful stump Pinging down

again- gotcha

woeful stump Jan 18, 2024, 8:40 PM

#

looks like it’s resolved now.

merry cosmos Jan 18, 2024, 8:44 PM

#

Alrighty. Here's the response:

We had a user on a custom domain create a mammoth amount of traffic which overwhelmed a small subset of boxes (aka 1)

Unfortunately y'all were the unlucky folks on that box

We've put up a PR + a monitor which will immediately page, per instance/domain/etc

This will wake someone up automatically. They now have a 1 line way to fix this, so, even in the rare rare event that this happens again, it will be resolved in less than 5 minutes

We're rebuilding the proxying layer to allow us to do fully domain-based RPS configurations. This will be live in the next month or so
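
To illustrate what a per-domain RPS limit could look like in principle, here is a rough token-bucket sketch keyed by domain. It is not Railway's proxy or its configuration format, which the thread does not describe; `DomainRateLimiter`, `TokenBucket`, and the example domains are all hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class TokenBucket:
    rate: float      # tokens (requests) refilled per second
    capacity: float  # maximum burst size
    tokens: float    # tokens currently available
    updated: float   # clock reading at the last refill


class DomainRateLimiter:
    """Token-bucket limiter with one bucket per domain (illustrative only)."""

    def __init__(
        self,
        default_rps: float,
        limits: Optional[Dict[str, float]] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.default_rps = default_rps
        self.limits = dict(limits or {})  # per-domain overrides
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, domain: str) -> bool:
        """Return True if a request for this domain fits within its budget."""
        now = self.clock()
        bucket = self.buckets.get(domain)
        if bucket is None:
            # New domains start with a full bucket sized to their RPS limit
            rps = self.limits.get(domain, self.default_rps)
            bucket = TokenBucket(rate=rps, capacity=rps, tokens=rps, updated=now)
            self.buckets[domain] = bucket
        # Refill in proportion to elapsed time, capped at the burst capacity
        elapsed = now - bucket.updated
        bucket.tokens = min(bucket.capacity, bucket.tokens + elapsed * bucket.rate)
        bucket.updated = now
        if bucket.tokens >= 1:
            bucket.tokens -= 1
            return True
        return False
```

Because each domain draws from its own bucket, one domain exhausting its budget gets rejected without consuming capacity that requests for other domains rely on, which is the isolation property the postmortem is after.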

merry cosmos Jan 18, 2024, 8:45 PM

#

merry cosmos Alrighty. Here's the response: We had a user on a custom domain create a mammot...

Lmk if you have any questions on the above @plush beacon, @woeful stump, or anybody else

woeful stump Jan 18, 2024, 8:54 PM

#

Thank you for resolving the issue quickly by the way.

plush beacon Jan 18, 2024, 10:07 PM

#

Sg, thanks for the postmortem. Is there a way to get notified when the new proxy layer is productionalized?

maiden cove Jan 18, 2024, 10:58 PM

#

Changelog, as we are open every Friday about all infra changes under the hood and their incoming impact, but if that's not enough we can talk about a communication plan and how to keep you all in the loop.

merry cosmos Jan 19, 2024, 1:23 AM

#

!remind me to update this thread in 690 hours

molten sparrowBOT Jan 19, 2024, 1:23 AM

#

merry cosmos !remind me to update this thread in 690 hours

Got it, I will remind you to update this thread at Fri, 16 Feb 2024 19:23:21 GMT

merry cosmos Jan 19, 2024, 1:23 AM

#

!remind me to update this thread in 1358 hours

molten sparrowBOT Jan 19, 2024, 1:24 AM

#

merry cosmos !remind me to update this thread in 1358 hours

Got it, I will remind you to update this thread at Fri, 15 Mar 2024 15:23:59 GMT