#Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2...

shadow umbra
#

Hi, I have a serverless endpoint. The job completes successfully, but the results are never returned and the job times out.

Any ideas how to resolve this? There are a few threads about this, but the conversation always drifted to another topic. I also submitted a ticket yesterday and have had no response. I am using this in production, and my whole website is down because of it. I'm seriously concerned about relying on RunPod, as this is probably the fourth time everything has stopped working for one reason or another.

Total progress: 100%|██████████| 38/38 [00:09<00:00, 4.15it/s]
2024-07-12T17:16:46.296682863Z INFO: 127.0.0.1:60656 - "POST /sdapi/v1/img2img HTTP/1.1" 200 OK
2024-07-12T17:16:48.343583857Z {"requestId": "7c46852a-a3b1-46a8-a0dc-2fe8c006287a-e1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/2lpmcp0nczozlm/job-done/nbvo20j26ji2mh/7c46852a-a3b1-46a8-a0dc-2fe8c006287a-e1?gpu=NVIDIA+RTX+A5000&isStream=false", "level": "ERROR"}
2024-07-12T17:16:48.343613177Z {"requestId": "7c46852a-a3b1-46a8-a0dc-2fe8c006287a-e1", "message": "Finished.", "level": "INFO"}

jolly sage
#

Are you using /runsync?

shadow umbra
#

No, it is a /run endpoint, and then I check the job status afterwards.
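
For readers following along, the submit-then-check flow mentioned here could be sketched roughly as below. This only builds the two HTTP requests and sends nothing. The `/run` and `/status/{job_id}` paths, the `{"input": ...}` body shape, and the `Bearer` header are assumptions about the endpoint API; `build_run_request` and `build_status_request` are hypothetical helper names, and the endpoint ID and key are placeholders.

```python
import json
import urllib.request

API_BASE = "https://api.runpod.ai/v2"  # host taken from the error log above


def build_run_request(endpoint_id, api_key, job_input):
    """Build (but do not send) the async job submission request."""
    body = json.dumps({"input": job_input}).encode("utf-8")  # assumed body shape
    return urllib.request.Request(
        f"{API_BASE}/{endpoint_id}/run",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_status_request(endpoint_id, api_key, job_id):
    """Build (but do not send) the request that checks a submitted job."""
    return urllib.request.Request(
        f"{API_BASE}/{endpoint_id}/status/{job_id}",  # assumed status path
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )
```

Passing either request to `urllib.request.urlopen(req, timeout=...)` would send it; the status request is then repeated until the job reaches a final state.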

shadow umbra
#

Five images at 512x512, 768x768, or 1024x1024. Is this too much?

jolly sage
#

Hmm, the max payload is 10 MB per /run request, i.e. per job.
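
If payload size is the suspect, the handler's output can be measured before it is returned. A rough sketch, taking the 10 MB figure quoted above at face value and assuming the limit applies to the JSON-serialized output (neither is confirmed in this thread); `payload_size_bytes` and `fits_in_limit` are hypothetical helpers, not part of any RunPod SDK.

```python
import json

# Assumption: the 10 MB per-job figure mentioned in this thread.
MAX_RUN_PAYLOAD_BYTES = 10 * 1024 * 1024


def payload_size_bytes(output):
    """Size in bytes of the output once serialized to UTF-8 JSON."""
    return len(json.dumps(output).encode("utf-8"))


def fits_in_limit(output, limit=MAX_RUN_PAYLOAD_BYTES):
    """True if the serialized output is no larger than the limit."""
    return payload_size_bytes(output) <= limit
```

Calling `fits_in_limit({"images": list_of_base64_strings})` just before the handler returns would show whether five base64-encoded images come close to the limit. Base64 inflates binary data by roughly a third, so the encoded size is what matters.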

#

Or it's just a RunPod error.

#

Try reporting this to RunPod via the contact button.

jolly sage
#

Yeah, then maybe it's on RunPod's side.

chilly pawn
#

What do you have set for Execution Timeout(s)?

shadow umbra
#

120 seconds, but the job usually completes after 10. This message appears on completion (after 10 seconds).

shy bloom
#

I'm experiencing the same issue.

When the Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/{endpoint-id}/job-done/... error occurs, the job remains stuck in IN_PROGRESS indefinitely, even though the log indicates that the job is completed. Plus, no webhook is sent from RunPod. It appears that RunPod fails to mark the job as completed due to this internal HTTP request failure.

Some people have suggested that this issue might be caused by a large payload returned from the handler. However, in my case, the output size is only a few KBs, as it is just a JSON containing a URL to the output file.
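
Until the root cause is fixed, a client can at least avoid waiting forever on a job stuck in IN_PROGRESS by polling with its own deadline. This is a minimal sketch, not a fix for the underlying error: `get_status` is a hypothetical callable that returns the job's current status string (for example by calling the status endpoint), and the set of final status names other than IN_PROGRESS is an assumption.

```python
import time

# Assumption: status names that mean the job will not change any more.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


class JobStuckError(Exception):
    """Raised when a job has not reached a final status before the deadline."""


def wait_for_job(get_status, deadline_s=120.0, poll_interval_s=1.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until a final status is seen or the deadline passes."""
    start = clock()
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() - start >= deadline_s:
            # The job may have finished on the worker without being marked done.
            raise JobStuckError(f"job still {status} after {deadline_s} seconds")
        sleep(poll_interval_s)
```

On `JobStuckError` the caller can resubmit or cancel the job and surface an error to the user rather than hanging. The `sleep` and `clock` parameters exist only so the loop can be tested without real waiting.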

jolly sage
#

Try reporting it to RunPod.

daring robin
#

Same here

daring robin
#

Seems to be fixed now.
I did however clone the endpoint into a new one, just in case.

jolly sage
#

Hope so too hahah

shy bloom
#

The problem still exists, it just occurred again.

#

Yes, it only happens sometimes, not consistently. It seems like the internal webhook connection on RunPod isn't stable.

#

I hope this issue gets fixed ASAP because it causes production jobs to get stuck indefinitely. Even worse, the stuck jobs might continue to drain credits.

jolly sage
#

Hey, try increasing the container disk space on the endpoint.

gentle notch
#

Having the same issue here too. It only started happening in the past couple of days, but I haven't modified the handler at all in weeks.

2024-07-20T23:41:26.892920681Z {"requestId": "f2d7a4d6-8bde-40d2-8ac9-d86e4134c165-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/...", "level": "ERROR"}

I'm using the async endpoint.
Edit: scaling workers to 0 and then back up seems to have fixed it for now.

somber gulch
#

Same issues
