Typical requests to my service end up with a response payload of about 1.5MB. They end up taking 20-30 seconds to complete. Actual time the GPU is crunching data is only about 1 second. Are runpod workers bandwidth limited? I'm trying to figure out what's causing the delay. When running the service locally it's near instantaneous
#requests are very slow
30 messages · Page 1 of 1 (latest)
Without further context, that sounds like model fetch/load time.
Consider catching a more-detailed profiling/log to see what those 20-30 seconds are spent on.
If you adhered to the suggestion to load the model outside of the handler, consider testing whether this is only cold-start and subsequent requests to the same hot worker complete as fast as expected.
Also don't forget that your local internet connection could be the culprit during the upload/request transmission stage, depending on what you're doing.
here the code for my handler:
import runpod
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List, Optional, Tuple, Dict
import base64
import json
import time
from functools import wraps
Text = str
Hash = str
B64Text = str
FileChunks = Tuple[Hash, List[Text]]
B64FileChunks = Tuple[Hash, List[B64Text]]
ChunkEmbedding = np.ndarray
FileEmbeddings = Tuple[Hash, ChunkEmbedding]
B64FileEmbeddings = Tuple[Hash, str]
InData = List[B64FileChunks]
OutData = List[B64FileEmbeddings]
EMBEDDING_MODEL = "thenlper/gte-base"
model = SentenceTransformer(EMBEDDING_MODEL)
def generator_timer_decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
start_time = time.time()
print(f"Generator {func.__name__} started at {start_time}")
generator = func(*args, **kwargs)
try:
yield from generator
finally:
end_time = time.time()
print(f"Generator {func.__name__} closed at {end_time}")
print(f"Total time taken: {end_time - start_time} seconds")
return wrapper
def encode_array_to_base64(array: np.ndarray) -> str:
return base64.b64encode(array.tobytes()).decode("utf-8")
def decode_base64_texts(b64_texts: List[str]) -> List[str]:
texts = [base64.b64decode(t).decode("utf-8") for t in b64_texts]
return texts
def get_embeddings(texts: List[str]) -> np.ndarray:
return model.encode(texts)
def processing(input_data: InData):
print("Begin processing")
tstart = time.time()
# collect all texts
texts = []
for _, b64texts in input_data:
texts.extend(decode_base64_texts(b64texts))
# get embeddings for all texts
embeddings = get_embeddings(texts)
# split embeddings by file
results = []
start = 0
for hash, b64texts in input_data:
# Calculate the number of texts in the current file
num_texts = len(b64texts)
# Slice the embeddings for the current file
file_embeddings = embeddings[start : start + num_texts, :]
# Append the results
results.append((hash, encode_array_to_base64(file_embeddings)))
# Update start index for the next file
start += num_texts
print(f"Processed {len(texts)} texts in {time.time() - tstart} seconds")
for result in results:
yield result
@generator_timer_decorator
def generator_handler(job: dict):
input_data = job["input"]["data"]
for result in processing(input_data):
yield result
if __name__ == "__main__":
runpod.serverless.start(
{
"handler": generator_handler,
"return_aggregate_stream": True,
}
)
I captured some logs for consecutive invocations to make sure the worker was hot:
2023-12-20T00:39:19.756560445Z SentenceTransformer.py:66 2023-12-20 00:39:19,756 Load pretrained SentenceTransformer: thenlper/gte-base
2023-12-20T00:39:20.210118511Z SentenceTransformer.py:105 2023-12-20 00:39:20,209 Use pytorch device: cuda
2023-12-20T00:39:20.210148021Z --- Starting Serverless Worker | Version 1.3.3 ---
...
2023-12-20T00:39:49.275224622Z Generator generator_handler started at 1703032789.2751427
2023-12-20T00:39:49.275254092Z Begin processing
2023-12-20T00:39:49.887406947Z
Batches: 0%| | 0/10 [00:00<?, ?it/s]
Batches: 10%|█ | 1/10 [00:00<00:01, 7.30it/s]
Batches: 30%|███ | 3/10 [00:00<00:00, 11.63it/s]
Batches: 50%|█████ | 5/10 [00:00<00:00, 14.06it/s]
Batches: 70%|███████ | 7/10 [00:00<00:00, 15.27it/s]
Batches: 100%|██████████| 10/10 [00:00<00:00, 16.63it/s]
2023-12-20T00:39:49.890978879Z Processed 309 texts in 0.6157608032226562 seconds
2023-12-20T00:40:01.637560841Z Generator generator_handler closed at 1703032801.637435
2023-12-20T00:40:01.637593471Z Total time taken: 12.362292289733887 seconds
2023-12-20T00:40:01.637595341Z INFO | sync-4a3ea96b-d4f5-4024-913c-32f9d63ea40b-u1 | Finished
2023-12-20T00:40:01.978754304Z INFO | sync-4a3ea96b-d4f5-4024-913c-32f9d63ea40b-u1 | Finished
The log Processed 309 texts in 0.6157608032226562 seconds indicates the time the processing function was doing work.
The log Total time taken: 12.362292289733887 seconds indicates the time it took the handler to finish running. I'm not sure what's taking up all this time. The request was initiated by a cloudflare worker, which I don't believe has a slow network connection.
Here's the log from my cloudflare worker's perspective that was timing how long it took to post to my handler's runsync endpoint:
"logs": [
{
"message": [
"Request contained 309 chunks"
],
"level": "log",
"timestamp": 1703032990665
},
{
"message": [
"Request completed in 12879ms"
],
"level": "log",
"timestamp": 1703033003544
},
…
I'd love some help understanding what could be causing this delay.
for result in results: in processing happens after the prcessing time capture.
Looks like it might be slowed down by the yielding of the results there?
(notably, involved round-trips or direct reads out of GPU memory by python and such would easily explain severe slowdown.)
round-trip as in GPU-CPU
could be. I could try using a normal handler instead of a generator. I don't think it's GPU-to-CPU transfer delay. That would be happening during the processing function, which finishes in under a second.
If you can, I'd throw it against teh non-serverless runner setup or see to use a normal python stack capturing profiler against it in the serverless if I can, to actually see how it manages to spend all that time
hmmm, does it? hmm, I guess the encode_array_to_base64 isn't lazy, though.
Ahh, yeah, https://github.com/benfred/py-spy , I remember.
I have the same gpu on a linux box at home. When I run the handler on that machine along with a local cloudflare worker it doesn't have the extra delay. That's why my initial hunch was that the runpod worker was severely bandwidth limited. The response payload from the runsync invocation is a couple MB
you just run it against the python processe's pid
the speedscope interface for interpreting the profile results is IMO a rather comfy flamegraph analyzing interface style (don't go for the weird svg, that's ancient flamegraph visualization style from like 15 years ago or so)
you said about 1.5 MB earlier
that's 30 Mbit
so you'd need to slow down to a single Mbit/s to fill 30 seconds of delay with that.
and this many seconds even tcp slow-start can't really explain
It depends on how many text chunks are being converted to embeddings. In this example it was 309 text chunks. 309 vectors is about 1MB. Each embedding vector is 3KB.
hmm, yeah, but at this size, it's not really explaining the long delay, because that'd imply extremely low bandwidth
(1-2 orders of magnitude lower than I'd expect them to have at the bare minimum)
py-spy attempts to only include stack traces from threads that are actively running code, and exclude threads that are sleeping or otherwise idle.
You can disable this functionality by setting the
--idleflag, which will include frames that py-spy considers idle.
(I'd look at both, or rather, also look at the other if the first one doesn't obviously show clear information.)
I can make a special worker that tests the bandwidth. It will serve 1MB worth of pre-calculated stuff. That way I can rule out what's going on in the processing function.
That's an interesting looking tool. I'll give it a try
oh, it looks like docker might block a syscall critical for py-spy to work; if that blocks it on RunPod serverless, ask the officials.
beware of network stream compression at the webserver/http API gateway layer.
is this something I could disable with request headers?
hadn't thought of that before but it could be another factor
in theory, yes; in practice, the gateway might ignore your declaration of not supporting comrpression and throw comrpession at you anyways.