#requests are very slow

30 messages · Page 1 of 1 (latest)

vagrant meadow
#

Typical requests to my service end up with a response payload of about 1.5MB. They end up taking 20-30 seconds to complete. Actual time the GPU is crunching data is only about 1 second. Are runpod workers bandwidth limited? I'm trying to figure out what's causing the delay. When running the service locally it's near instantaneous

obtuse perch
#

Without further context, that sounds like model fetch/load time.
Consider catching a more-detailed profiling/log to see what those 20-30 seconds are spent on.
If you adhered to the suggestion to load the model outside of the handler, consider testing whether this is only cold-start and subsequent requests to the same hot worker complete as fast as expected.

Also don't forget that your local internet connection could be the culprit during the upload/request transmission stage, depending on what you're doing.

vagrant meadow
#

here the code for my handler:

import runpod
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List, Optional, Tuple, Dict
import base64
import json
import time
from functools import wraps

Text = str
Hash = str
B64Text = str
FileChunks = Tuple[Hash, List[Text]]
B64FileChunks = Tuple[Hash, List[B64Text]]
ChunkEmbedding = np.ndarray
FileEmbeddings = Tuple[Hash, ChunkEmbedding]
B64FileEmbeddings = Tuple[Hash, str]

InData = List[B64FileChunks]
OutData = List[B64FileEmbeddings]

EMBEDDING_MODEL = "thenlper/gte-base"
model = SentenceTransformer(EMBEDDING_MODEL)


def generator_timer_decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        print(f"Generator {func.__name__} started at {start_time}")

        generator = func(*args, **kwargs)
        try:
            yield from generator
        finally:
            end_time = time.time()
            print(f"Generator {func.__name__} closed at {end_time}")
            print(f"Total time taken: {end_time - start_time} seconds")

    return wrapper


def encode_array_to_base64(array: np.ndarray) -> str:
    return base64.b64encode(array.tobytes()).decode("utf-8")


def decode_base64_texts(b64_texts: List[str]) -> List[str]:
    texts = [base64.b64decode(t).decode("utf-8") for t in b64_texts]
    return texts


def get_embeddings(texts: List[str]) -> np.ndarray:
    return model.encode(texts)


def processing(input_data: InData):
    print("Begin processing")
    tstart = time.time()
    # collect all texts
    texts = []
    for _, b64texts in input_data:
        texts.extend(decode_base64_texts(b64texts))

    # get embeddings for all texts
    embeddings = get_embeddings(texts)

    # split embeddings by file
    results = []
    start = 0
    for hash, b64texts in input_data:
        # Calculate the number of texts in the current file
        num_texts = len(b64texts)
        # Slice the embeddings for the current file
        file_embeddings = embeddings[start : start + num_texts, :]
        # Append the results
        results.append((hash, encode_array_to_base64(file_embeddings)))
        # Update start index for the next file
        start += num_texts

    print(f"Processed {len(texts)} texts in {time.time() - tstart} seconds")

    for result in results:
        yield result


@generator_timer_decorator
def generator_handler(job: dict):
    input_data = job["input"]["data"]
    for result in processing(input_data):
        yield result


if __name__ == "__main__":
    runpod.serverless.start(
        {
            "handler": generator_handler,
            "return_aggregate_stream": True,
        }
    )
#

I captured some logs for consecutive invocations to make sure the worker was hot:

2023-12-20T00:39:19.756560445Z SentenceTransformer.py:66   2023-12-20 00:39:19,756 Load pretrained SentenceTransformer: thenlper/gte-base
2023-12-20T00:39:20.210118511Z SentenceTransformer.py:105  2023-12-20 00:39:20,209 Use pytorch device: cuda
2023-12-20T00:39:20.210148021Z --- Starting Serverless Worker |  Version 1.3.3 ---
...
2023-12-20T00:39:49.275224622Z Generator generator_handler started at 1703032789.2751427
2023-12-20T00:39:49.275254092Z Begin processing
2023-12-20T00:39:49.887406947Z 
Batches:   0%|          | 0/10 [00:00<?, ?it/s]
Batches:  10%|█         | 1/10 [00:00<00:01,  7.30it/s]
Batches:  30%|███       | 3/10 [00:00<00:00, 11.63it/s]
Batches:  50%|█████     | 5/10 [00:00<00:00, 14.06it/s]
Batches:  70%|███████   | 7/10 [00:00<00:00, 15.27it/s]
Batches: 100%|██████████| 10/10 [00:00<00:00, 16.63it/s]
2023-12-20T00:39:49.890978879Z Processed 309 texts in 0.6157608032226562 seconds
2023-12-20T00:40:01.637560841Z Generator generator_handler closed at 1703032801.637435
2023-12-20T00:40:01.637593471Z Total time taken: 12.362292289733887 seconds
2023-12-20T00:40:01.637595341Z INFO   | sync-4a3ea96b-d4f5-4024-913c-32f9d63ea40b-u1 | Finished 
2023-12-20T00:40:01.978754304Z INFO   | sync-4a3ea96b-d4f5-4024-913c-32f9d63ea40b-u1 | Finished

The log Processed 309 texts in 0.6157608032226562 seconds indicates the time the processing function was doing work.

The log Total time taken: 12.362292289733887 seconds indicates the time it took the handler to finish running. I'm not sure what's taking up all this time. The request was initiated by a cloudflare worker, which I don't believe has a slow network connection.

Here's the log from my cloudflare worker's perspective that was timing how long it took to post to my handler's runsync endpoint:

"logs": [
    {
      "message": [
        "Request contained 309 chunks"
      ],
      "level": "log",
      "timestamp": 1703032990665
    },
    {
      "message": [
        "Request completed in 12879ms"
      ],
      "level": "log",
      "timestamp": 1703033003544
    },
…

I'd love some help understanding what could be causing this delay.

obtuse perch
#

(notably, involved round-trips or direct reads out of GPU memory by python and such would easily explain severe slowdown.)

#

round-trip as in GPU-CPU

vagrant meadow
#

could be. I could try using a normal handler instead of a generator. I don't think it's GPU-to-CPU transfer delay. That would be happening during the processing function, which finishes in under a second.

obtuse perch
#

If you can, I'd throw it against teh non-serverless runner setup or see to use a normal python stack capturing profiler against it in the serverless if I can, to actually see how it manages to spend all that time

obtuse perch
vagrant meadow
#

I have the same gpu on a linux box at home. When I run the handler on that machine along with a local cloudflare worker it doesn't have the extra delay. That's why my initial hunch was that the runpod worker was severely bandwidth limited. The response payload from the runsync invocation is a couple MB

obtuse perch
#

you just run it against the python processe's pid

#

the speedscope interface for interpreting the profile results is IMO a rather comfy flamegraph analyzing interface style (don't go for the weird svg, that's ancient flamegraph visualization style from like 15 years ago or so)

obtuse perch
#

that's 30 Mbit

#

so you'd need to slow down to a single Mbit/s to fill 30 seconds of delay with that.

#

and this many seconds even tcp slow-start can't really explain

vagrant meadow
#

It depends on how many text chunks are being converted to embeddings. In this example it was 309 text chunks. 309 vectors is about 1MB. Each embedding vector is 3KB.

obtuse perch
#

hmm, yeah, but at this size, it's not really explaining the long delay, because that'd imply extremely low bandwidth

#

(1-2 orders of magnitude lower than I'd expect them to have at the bare minimum)

#

py-spy attempts to only include stack traces from threads that are actively running code, and exclude threads that are sleeping or otherwise idle.

You can disable this functionality by setting the --idle flag, which will include frames that py-spy considers idle.

#

(I'd look at both, or rather, also look at the other if the first one doesn't obviously show clear information.)

vagrant meadow
#

I can make a special worker that tests the bandwidth. It will serve 1MB worth of pre-calculated stuff. That way I can rule out what's going on in the processing function.

#

That's an interesting looking tool. I'll give it a try

obtuse perch
#

oh, it looks like docker might block a syscall critical for py-spy to work; if that blocks it on RunPod serverless, ask the officials.

obtuse perch
vagrant meadow
#

hadn't thought of that before but it could be another factor

obtuse perch