#Urgent! All our workers are not working! Any network issues?


severe hull
#

Please take a look at our workers in endpoint h16kk1hi79s3t0 or kn0n8ry69jj1t7
All the workers are stuck on something!!

quiet raven
#

also, what's your template?

severe hull
#

we're using our custom docker image

#

how could I create a support ticket?

severe hull
#

yes, we've been running these for months without a problem

severe hull
#

yes, sure, I'll paste it here

quiet raven
#

nice that would help identify the problem

severe hull
#

two different worker logs. as far as I can see, there's definitely some kind of network problem.

These templates have been running for months without any changes.

#

for the first screenshot, after our logic is done the worker is just not doing anything.
for the second, we make some requests in our docker logic, and it seems these network requests are all failing

severe hull
#

yes, all stuck in running state

quiet raven
#

network requests failing? what does the error look like?

#

wow, that's a huge number of workers

severe hull
#

I don't know. I'm just guessing there's a network problem in runpod now.
We've been using runpod heavily for months and this is quite urgent

#

These templates have been running without any problem, but since just a few hours ago this problem started happening

severe hull
#

here's our requests graph.

severe hull
#

yeah I can paste the last line, but I don't think this will help you. it's just our docker logic.

2024-05-30T02:45:08.562489094Z exception in main_handler in validation check: <class 'requests.exceptions.ConnectionError'>: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
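
A `Connection reset by peer` error like the one above is often transient, so one common mitigation while the root cause is investigated is to retry the failing call with exponential backoff. The following is only a minimal sketch under stated assumptions: `call_with_retries` and its parameters are hypothetical names, not part of the handler shown in this thread, and it would not fix an actual network outage.

```python
import random
import time


def call_with_retries(fn, retry_on=(ConnectionError,), attempts=4,
                      base_delay=0.5, sleep=time.sleep):
    """Call fn(); if it raises one of retry_on, back off and try again.

    Hypothetical helper for illustration. The last failure is re-raised
    once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == attempts:
                raise
            # Exponential backoff (0.5s, 1s, 2s, ...) plus a little jitter
            # so many workers do not retry at the same instant.
            delay = base_delay * 2 ** (attempt - 1)
            delay += random.uniform(0, delay / 2)
            print(f"attempt {attempt} failed ({exc!r}); retrying in {delay:.2f}s")
            sleep(delay)
```

The built-in `ConnectionError` covers `ConnectionResetError`, but `requests.exceptions.ConnectionError` is a separate class, so code using `requests` would need to pass `retry_on=(requests.exceptions.ConnectionError,)` explicitly.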

#

but please look into it asap.. 🙂

quiet raven
#

I'm not really the guy who can look deep into your account, but I can help on the technical side

#

What region is this btw?

severe hull
#

we use all the regions. is this what you mean?

severe hull
#

we send a request to amazon s3 to store our image

severe hull
#

yes, but we tried sending a request to amazon s3 locally, and that works 😦

quiet raven
#

the connection to AWS Rekognition could be failing

severe hull
#

we checked that, it works on my computer

#

the serious thing is, here when it prints "push_output_image" that means our docker logic is done.
normally after that, it should fetch the next runpod job to start, but it's just stuck here

quiet raven
#

okay i saw another user just posted this

"exception in main_handler: <class 'requests.exceptions.ConnectionError'>: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))"

#

seems like there is a problem in runpod's network somewhere

severe hull
#

I think so too.
Would really appreciate it if you could take a look

quiet raven
#

I can't access runpod's infra atm, I'm sorry 😦

#

but I'm sure there are internal people working on this

severe hull
#

oh no..

quiet raven
#

For now what you can do is create a support ticket, and once you have one, you can send me the ticket ID

severe hull
#

that would take too long.. I'm just DMing the RunPod members we talked to when we first started using RunPod a year ago.
Thank you

severe hull
#

but they're not responding.. are they all off work right now?

quiet raven
#

btw, is your service a public one?

severe hull
#

yes

severe hull
#

oh, we're in Korea and I guess it's the middle of the night in the US..

quiet raven
#

Hmm korea huh

#

the guy that reported the error also seems to be from Korea (see the message in #💻|technology)

severe hull
#

possibly, this is urgent..

quiet raven
#

well, there's nothing I can do for now hahah, but if you want you can try deploying to individual regions (like 1 per endpoint) and see which ones fail
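
To compare the per-region endpoints suggested above, a small connectivity probe could be run at the start of a job on each one. This is a rough sketch under stated assumptions, not a RunPod tool: `probe_tcp` and `probe_all` are hypothetical helper names, and the target list (for example `("s3.amazonaws.com", 443)`) is only a guess at which hosts the workers call.

```python
import socket
import time


def probe_tcp(host, port=443, timeout=5.0):
    """Try to open a TCP connection to host:port; return (ok, detail)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
            return True, f"connected in {elapsed:.3f}s"
    except OSError as exc:
        # Covers DNS failures, timeouts, refused and reset connections.
        return False, f"{type(exc).__name__}: {exc}"


def probe_all(targets, timeout=5.0):
    """Probe every (host, port) pair and return the results keyed by 'host:port'."""
    return {f"{host}:{port}": probe_tcp(host, port, timeout)
            for host, port in targets}
```

Logging the result of `probe_all([("s3.amazonaws.com", 443)])` from each endpoint would show which regions cannot even open a connection. A successful connect does not rule out resets later in a request, so this only narrows things down.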