#Serverless endpoint requests failing to reach workers.

azure cosmos
#

Hello there. On January 23rd, at approximately 5pm, my serverless endpoint was reachable here: https://kviznxeq34txwt.api.runpod.ai (i.e. it was receiving requests and spinning up workers accordingly and then sending responses once running.)

Shortly thereafter, it suddenly stopped sending requests to workers. This morning, after forcing a worker to stay online permanently, my request did land on that worker and it spun up my services, but my health checks were not returning anything to my app server. (i.e. I could see from within the RunPod user interface that all of my services had started up successfully, but the worker was unable to respond.)

I believe this endpoint has become corrupted. I would have no problem spinning up a new endpoint in the same region so I can attach my network volume to it, but I worry I won't be able to have the GPUs I need allocated to it. Is there any way the RunPod team could either force a reset of these workers or allocate me the GPUs I would be forgoing if I terminated this endpoint? (I have terminated workers and done everything I can on my end to try to force a reset. The new workers are successfully pulling in new Docker images; they are just continually failing to respond to health checks.)

stark urchin
#

I'll have to look into why this happened a little more intently, but upfront just want to explain what I see.

It looks like for a period on the 23rd (and for a while, this morning, Jan 26th) your endpoint reported that it didn't have access to the healthChecking URL. This would be why workers stopped receiving jobs, but I can see that workers still spun up, even if they just took a while because they needed to be pulled from GHCR. Remaking the endpoint probably won't help this, and there's generally nothing we can do to force GPU allocation to a specific endpoint.

azure cosmos
#

So the workers refused to spin up until I forced a minimum of one worker to be online 24-7. At that point, I pinged the endpoint and it started the services. What's confusing is that after pinging the endpoint to start the services, all of my subsequent pings did not trigger a response.

And if I do not have a worker running, my requests to spin up workers are not received. (i.e. I can't spin up workers unless I do so from the RunPod UI.)

#

@stark urchin

stark urchin
#

Let me check your config in the UI, the solution may be easier than we expect.

#

Firstly, I've granted you a few more max workers up from 5 - let me know if you'll need more.

We should be spinning up the worker when you send a request. If you have a request ID or a worker ID that didn't behave for you, let me know and I can figure out why. A lot of the log spam I saw about not being able to healthcheck your workers could be caused by you scaling down to zero, but that shouldn't be blocking you.

azure cosmos
#

I have terminated and been reallocated workers a handful of times since this issue first started cropping up, and I just retried my request and no workers were spun up.

Are you able to see the failed request metrics? I would prefer not to leave a worker running 24/7, but for the sake of debugging, I would be happy to do so.

Looking at my metrics historically, it seems inevitable that some of the requests made to this endpoint will fail. It's just that since January 23rd, all of my requests have failed. Unfortunately, I cannot even give you a request ID because the endpoint is not responding. (Although this used to be expected behavior when I had no workers online: it would take about 40 seconds after the initial spin-up request until the worker started responding to health checks.)

#

Sorry! The reason request failures were historically expected is that I would send subsequent requests before allowing the worker enough time to spin up and send a response.
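
A minimal sketch of the client-side wait that avoids those early failures: probe the endpoint on an interval and only send real requests once a probe succeeds, giving up after a deadline. This is an illustration under assumptions, not the code my app server runs; `http_probe`, `wait_until_healthy`, the 120-second deadline, and the 5-second interval are all hypothetical, and the health URL has to be whatever path the worker actually serves.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200 (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError/HTTPError, timeouts, and connection resets all count as "not ready".
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    deadline_s: float = 120.0,   # assumed budget; cold start was ~40 s in this thread
    interval_s: float = 5.0,     # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call `probe` every `interval_s` seconds until it succeeds or the deadline passes."""
    start = clock()
    while True:
        if probe():
            return True
        # Stop if another full interval would overshoot the deadline.
        if clock() - start + interval_s > deadline_s:
            return False
        sleep(interval_s)
```

Usage would look like `wait_until_healthy(lambda: http_probe(health_url))`, where `health_url` is a placeholder for the endpoint's real health-check address. A `False` result after the full deadline is the case worth reporting, since it rules out "the worker just needed more time".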

Unfortunately, my current requests are not failing for that reason: they fail even when I give the worker plenty of time to finish spinning up.
@stark urchin

azure cosmos
#

I also just created a new serverless endpoint with the same Docker image and mirrored settings, and I was able to reach it without error.