I'm using a serverless endpoint with 24 GB of VRAM, and it is far too unstable. About 90% of the time, the "QUEUE" phase takes a couple of seconds and inference with the Omniparser v2 model then takes 3-8 seconds.
This result is reproducible in Google Colab and on other GPUs. Nonetheless, every once in a while Runpod takes more than 40 seconds, or even minutes, to process a request. This happens when a specific worker gets into a bad state and multiple requests are then routed through it. The worker misbehaves for no apparent reason and takes multiple minutes to do a job that should take seconds. It only happens with some workers, and only when the same worker is used multiple times. It makes no sense, and Runpod charges you for multiple minutes of DELAY TIME. Sometimes the request does not even go through: the status stays at "IN_PROGRESS", as seen in the image, for multiple minutes without finishing, while Runpod charges you for every second.
In any other environment, and normally on Runpod too, this process takes seconds, and the "IN_PROGRESS" status is printed only 3-8 times. This makes the endpoint highly unstable and far too expensive for a model that does not even use half of the VRAM.
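
Until the root cause is found, one client-side mitigation is to stop waiting on a job that stays at "IN_PROGRESS" far longer than the usual 3-8 polls and to cancel it. Below is a minimal sketch of that idea, not an official Runpod client. The names `poll_with_deadline` and `runpod_callbacks` are hypothetical, the 30-second deadline is an arbitrary choice, and the `/status/{job_id}` and `/cancel/{job_id}` URLs and the status strings reflect my understanding of the serverless REST API, so check them against the current documentation before relying on them.

```python
import json
import time
import urllib.request

# Job states assumed to be terminal (taken from the serverless job states as I understand them).
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def poll_with_deadline(fetch_status, cancel, deadline_s=30.0, interval_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll a job until it reaches a terminal state or the deadline passes.

    fetch_status: callable returning the job status as a dict with a "status" key.
    cancel: callable that asks the service to cancel the job.
    On deadline the job is cancelled so a stuck worker stops holding the request.
    """
    start = clock()
    polls = 0
    while True:
        payload = fetch_status()
        polls += 1
        state = payload.get("status")
        if state in TERMINAL:
            return {"state": state, "polls": polls, "cancelled": False, "last": payload}
        if clock() - start >= deadline_s:
            cancel()
            return {"state": state, "polls": polls, "cancelled": True, "last": payload}
        sleep(interval_s)


def runpod_callbacks(endpoint_id, job_id, api_key):
    """Build (fetch_status, cancel) callables. URL layout is an assumption."""
    base = f"https://api.runpod.ai/v2/{endpoint_id}"
    headers = {"Authorization": f"Bearer {api_key}"}

    def _call(method, action):
        req = urllib.request.Request(f"{base}/{action}/{job_id}",
                                     method=method, headers=headers)
        with urllib.request.urlopen(req, timeout=10) as resp:
            return json.load(resp)

    return (lambda: _call("GET", "status")), (lambda: _call("POST", "cancel"))
```

Usage would look like `fetch, cancel = runpod_callbacks(endpoint_id, job_id, api_key)` followed by `poll_with_deadline(fetch, cancel)`. The returned `last` payload is kept so that, if it carries a worker identifier, the slow worker can be logged and reported to support. This only limits how long a stuck request is waited on; it does not fix the worker itself.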