I was debugging latency on my test pods and found that pods running on the same physical machine lag badly on IO access.
After profiling, I got these results.
Example:
Initial test on a pod:
- running on a single pod, load time for a 6 GB model is 2 sec
- when I pulled 2 GPUs from the same server, model load time increased to 40 sec
Even inference is affected. Is RAM leaking?
On Serverless:
- the same GPU (4090) gets different inference and load times as well
- loading takes anywhere from 4 sec to 30 sec depending on the machine
- inference is non-uniform as well: 20 sec on some machines and 10 sec on others
All are running the same Docker image and the same scripts with the same libraries.
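For anyone who wants to reproduce the comparison, here is a minimal sketch of how the load and inference times could be collected on each machine. It is not the exact profiling script used above; `time_calls`, `summarize`, and `fake_load` are hypothetical names, and `fake_load` is only a placeholder for the real model load or inference call.

```python
import statistics
import time


def time_calls(fn, repeats=5):
    """Call fn() `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return the min, median, and max of a list of durations."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Placeholder workload (assumption): swap in the real model load or inference call.
    def fake_load():
        time.sleep(0.01)

    print(summarize(time_calls(fake_load, repeats=3)))
```

Running the same script on each pod or serverless worker and comparing the min and max should show whether the spread comes from the machine rather than from the code.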
Do we have any work in place to ensure uniformity across hardware?
Are we requiring servers to have a separate SSD / NVMe for each GPU, with a separate IO path for each?
I need some idea of whether this is a persistent issue. I'm pretty sure the Mbps figures in the descriptors do not reflect reality at all.
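To compare the advertised figures with what a pod actually delivers, a rough sequential read test along these lines could be run on each machine. This is only a sketch under assumptions: `read_throughput_mb_s` is a hypothetical helper, it reports megabytes per second (1 MB/s = 8 Mbps), and it measures a plain sequential read of one file, which may not match how the model loader reads.

```python
import time


def read_throughput_mb_s(path, chunk_size=16 * 1024 * 1024):
    """Read the file at `path` sequentially; return (bytes_read, MB per second)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 skips Python's own buffer; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    mb_s = (total / 1e6) / elapsed if elapsed > 0 else float("inf")
    return total, mb_s
```

Pointing it at the model file itself gives the most relevant number. A file that was read recently may be served from the page cache in RAM, so only the first read after a fresh start reflects the disk.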
EDIT: I'm using the US region now; on Global the problem is worse.