RunPod•3mo ago
kip

Execution time discrepancy

I built a custom text embedding worker. When I time the request on the pod, it takes about 20ms to process from start to finish, but the request itself takes a lot longer (about 1.5 seconds), and RunPod returns an executionTime of 1088ms in the response object. Do you know where this discrepancy might come from? As it is, it's really limiting the throughput of my worker, and there isn't much point in using a GPU if it's so heavily bottlenecked. Thanks in advance! Also happy to share the code or any logs if it'd help diagnose what's up.
42 Replies
yhlong00000
yhlong00000•3mo ago
Usually you'd want to run multiple tests, and there are network delays depending on where you test from. The first request can take a bit longer because of cold start, but subsequent requests should respond quickly.
kip
kipOP•3mo ago
Hey @yhlong00000, thanks for getting back to me 🙂 I'm not so worried about the network delays; the GPUs are in America and I'm not, so that's pretty expected. The difference between the few ms it takes to process the request and RunPod saying it takes a full second is what's annoying. I'm running many back-to-back requests to rule out cold start or FlashBoot, and the worker doesn't shut down between tests; it just takes significantly longer than it should.
nerdylive
nerdylive•3mo ago
Hmm, you used the same GPU, same config, same code in both? Yeah, I think this might have to do with RunPod's serverless then.
yhlong00000
yhlong00000•3mo ago
Here is my testing endpoint: it just sleeps for 5 seconds and returns some text. As far as execution time goes, it looks pretty accurate to me.
[screenshots attached]
nerdylive
nerdylive•3mo ago
What about running for 0.2 secs? Maybe try the inference on some models, serverless vs pods.
yhlong00000
yhlong00000•3mo ago
Maybe share some logs? I’m not sure exactly which time is inaccurate.
kip
kipOP•3mo ago
I'm just logging times from within the inference pod, so the setup is the same. I'll send some logs, gimme a minute. So my inference code is really just:
import torch
import runpod
from model import load_model
from rp_schemas import INPUT_SCHEMA
from runpod.serverless.utils.rp_validator import validate
import time

device = torch.device("cuda:0")

# Model is loaded once at import time, before the handler starts taking jobs.
model, tokenizer = load_model("~/.cache/model_tensors", device)

async def process_request(job):
    start_time = time.time()

    # My business logic is here
    ...

    total_time = time.time() - start_time

    print(f"Total processing time: {total_time:.4f} seconds")

    return {"total_time": total_time}

def main():
    runpod.serverless.start({
        "handler": process_request,
    })

if __name__ == "__main__":
    main()
(I've removed some of the logic, as I don't think it can be related.) And the responses I get back from the RunPod worker look like:
{
    "delayTime": 2627,
    "executionTime": 1132,
    "id": "<removed>",
    "output": {
        "total_time": 0.05837368965148926
    },
    "status": "COMPLETED"
}
So to be clear, the execution time is correct in the sense that that's how long the request occupies the pod, but why is it so much longer than the actual execution of my code? Also, in the example above I'm the only one making queries to the server.
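For reference, this is roughly how I'm comparing the numbers (a minimal sketch, not my actual client code; the endpoint ID, API key, and input payload are placeholders):

import time
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = "your-api-key"          # placeholder

url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
headers = {"Authorization": f"Bearer {API_KEY}"}
payload = {"input": {"text": "hello world"}}  # whatever the handler expects

client_start = time.time()
resp = requests.post(url, json=payload, headers=headers, timeout=60).json()
client_ms = (time.time() - client_start) * 1000

print(f"client round-trip : {client_ms:.0f} ms")             # network + queue + execution
print(f"delayTime         : {resp.get('delayTime')} ms")      # time before the worker picked the job up
print(f"executionTime     : {resp.get('executionTime')} ms")  # how long the request occupied the pod
print(f"handler total_time: {resp['output']['total_time'] * 1000:.0f} ms")  # the actual forward pass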
yhlong00000
yhlong00000•3mo ago
Oh, you mean why the delay time is so long? If the delay time is long, it's usually due to a cold start. As you can see from my screenshot, I hadn't used my endpoint for a while, so the worker went to sleep, and waking it up takes some time. Additionally, for some users, waking up a worker also involves loading the model into the GPU's VRAM, which can take even longer depending on the size of the model. However, if you have workers actively processing requests, the delay time should be significantly reduced.
kip
kipOP•3mo ago
It's not the cold start time. The cold start is something like 2.5 seconds, but even after sending a few requests and increasing the idle timeout to ensure it didn't spin down, execution consistently takes more than a second. I really can't figure out why it's taking more than the ~60ms of the actual workload.
kip
kipOP•3mo ago
[screenshot attached]
nerdylive
nerdylive•3mo ago
Like, the "total_time": 0.05837368965148926 is different from the executionTime in your example response from the endpoint?
kip
kipOP•3mo ago
The time it takes to do the actual inference call is pretty quick, but the execution takes 1 second. I can make a minimal repo tomorrow if it helps debug.
nerdylive
nerdylive•3mo ago
Maybe that's because of FlashBoot, it still loads the model into VRAM again but quicker, not sure...
kip
kipOP•3mo ago
The time it takes to load the model is more than 1 second, so I'd be surprised. I was hoping I'd done something stupid and there was a quick fix.
nerdylive
nerdylive•3mo ago
It probably isn't, and that's how serverless should work. Where's the other time difference here? The gap between your total_time and executionTime should be from the booting, right?
kip
kipOP•3mo ago
Sorry, I'm not quite sure what you mean.
nerdylive
nerdylive•3mo ago
Oh, this probably isn't a fault in your code, it looks right to me.
kip
kipOP•3mo ago
Yeah, the total execution time is actually much higher than the time it takes to perform the inference call. Would it help if I made a minimal example? I'll probably do so anyway to make sure I haven't made a mistake.
nerdylive
nerdylive•3mo ago
Sure, yeah, if you don't mind open-sourcing your code.
kip
kipOP•3mo ago
I'll probably just make a standalone thing with transformers. Out of curiosity, do you know if there are any existing RunPod templates that take less than a second to execute the inference call?
nerdylive
nerdylive•3mo ago
Idk, whisper? Let me try a quick test after this if I can.
kip
kipOP•3mo ago
Maybe, one of the faster transformers for sure. I'll do this tomorrow anyway though, I'll ping you with the code then.
nerdylive
nerdylive•3mo ago
I'll be sending whisper time tests.
First one (cold start, load models, start, etc.; it threw an error so no inference time):
{
    "delayTime": 18697,
    "executionTime": 63,
    "id": "4c64b8ad-4411-427e-9989-a48ba53f5308-e1",
    "status": "COMPLETED"
}
No inference again:
{
    "delayTime": 1705,
    "executionTime": 61,
    "id": "be8ac9cc-0e28-4e41-9c55-d2b52eea1574-e1",
    "status": "COMPLETED"
}
With inference:
{
    "delayTime": 1367,
    "executionTime": 846,
    ...
}
One more before this (I think this is the real one): exec time 2.6s (the request expired so I'm typing this out).
Encyrption
Encyrption•3mo ago
Each request is being proxied. There is going to be some overhead in that. 1 second? That is likely just overhead in proxying your request.
How are you handling your models? Are they stored on Network Storage, baked into the image, or loaded at run time? What region are your workers in? What is your active and max workers set to?
kip
kipOP•3mo ago
@nerdylive @yhlong00000 I put together a small example that has the issue: https://github.com/kipgparker/clip-runpod The response from the worker is the embedding as well as the time it takes to perform the forward pass, and the execution time is always significantly higher than the time it takes to perform the forward pass.
Each request is being proxied. There is going to be some overhead in that. 1 second? That is likely just overhead in proxying your request.
I assume you mean by whatever scheduler RunPod uses? I would expect that to be included in the delay time. Also, if the request is being proxied, that shouldn't block new requests from getting to the pod, but for me it seems to be blocking. And one full second is a lot of time.
Encyrption
Encyrption•3mo ago
Are you using RUN (async) or RUNSYNC (synchronous)?
kip
kipOP•3mo ago
How are you handling your models?
I've given an example in the GitHub repo I sent, but generally I just load them at the top of the handler file.
Are they stored on Network Storage, baked into the image, or loaded at run time?
The models are baked into the image; I think I saw that this was the recommended way of doing it on RunPod.
What region are your workers in?
Global. I'm not sure how to see the region of a specific pod though, and I don't see how this can really impact execution time.
What is your active and max workers set to?
One for now, just to test runsync
Encyrption
Encyrption•3mo ago
Have you done any testing locally? If you have the hardware to do so, it gives you the ability to connect with bash/sh and run your handler directly, bypassing the RunPod proxy. This might provide more details about where the latency you are experiencing is happening.
kip
kipOP•3mo ago
I've tested running the handler locally with --rp_serve_api and the time to perform the forward pass is something like 20-50ms.
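(Roughly how I time it against the local test server, sketched under the assumption that the API started by --rp_serve_api listens on its default port 8000 and exposes /runsync; adjust if your SDK version differs.)

import time
import requests

payload = {"input": {"text": "hello world"}}  # whatever the handler expects

start = time.time()
resp = requests.post("http://localhost:8000/runsync", json=payload, timeout=30).json()
elapsed_ms = (time.time() - start) * 1000

# Locally this lands in the tens of milliseconds end to end, which is why
# the ~1 second executionTime on the hosted endpoint stands out.
print(f"local round-trip: {elapsed_ms:.0f} ms")
print(resp)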
Encyrption
Encyrption•3mo ago
So you only see the 1 second delay when running on RunPod?
kip
kipOP•3mo ago
Correct, yeah
Encyrption
Encyrption•3mo ago
So, you will have some ms of delay because your traffic is traversing the Internet. When you hit any of the RunPod URLs (runsync, run, status, cancel, and health), behind the scenes your traffic is being proxied. Your image doesn't have the code built in to respond to API calls, so RunPod proxies them for you. My guess is those two things combined are causing your 1 second delay. I am not sure it would be accounted for in delayTime; I think delay is more the proxy waiting on your image to load. You could ping your worker IP to get a rough idea of how many ms the trip over the Internet is taking.
kip
kipOP•3mo ago
Yeah, I don't think that accounts for the 1 second delay, though maybe the proxy polls to see when the job is done or something and this adds some delay. I don't really mind how long it takes the request to reach the RunPod scheduler; that part isn't blocking. The delay has to be happening sometime after it hits the scheduler, and I'm just not sure what to do to decrease this hidden delay. The worker should be able to handle 20-ish requests a second, but at the moment it can't even handle 1 request per second.
Encyrption
Encyrption•3mo ago
Is your worker running as an active worker?
kip
kipOP•3mo ago
I tried keeping 1 worker active, didn't make a difference
Encyrption
Encyrption•3mo ago
If not, you cannot guarantee that your worker will respond to more than one request. If your worker is running in active mode, or a job is waiting in the queue, or one arrives shortly after your worker has responded to a request and you have FlashBoot enabled, it will respond to them without shutting down. Otherwise, each time your worker responds to a request it will shut down. Active workers run all the time and respond the fastest (after their first response). As you get more traffic and have more workers running, the higher the chance FlashBoot will kick in. I suggest setting max workers to 30 until you are sure what you will need.
yhlong00000
yhlong00000•3mo ago
Thanks for sharing, let me give it a try and let you know.
kip
kipOP•3mo ago
Thanks for taking a look 🙂 Have you had a chance to look yet?
yhlong00000
yhlong00000•3mo ago
I took a look, and it seems the issue is related to billing. Since serverless is billed by the second, we can't charge for usage of less than a second; any usage under a second is rounded up to the nearest second. I need a bit more time to look deeper.
kip
kipOP•3mo ago
Ahh, I kind of suspected that this might be the case; the usage was always suspiciously just over a second. I'm surprised you haven't run into that before though.
yhlong00000
yhlong00000•3mo ago
Yeah, for your case it's better to use a concurrent handler, so you can maximize the usage of the GPU, something like the sketch below.
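A minimal sketch of that pattern using the SDK's concurrency_modifier hook (the handler body and the limit of 8 are placeholders, not the code from this thread):

import runpod

async def process_request(job):
    # Same embedding logic as the handler shown earlier; elided here.
    ...
    return {"total_time": 0.0}

def adjust_concurrency(current_concurrency: int) -> int:
    # Let a single worker pick up several jobs at once so the GPU
    # isn't idle while each sub-second request is billed as a full second.
    return 8

runpod.serverless.start({
    "handler": process_request,
    "concurrency_modifier": adjust_concurrency,
})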
kip
kipOP•3mo ago
Yeah, I had issues with this too, which I think are also tied to the billing. I had it concurrent first and then moved to synchronous to try and debug; I'd prefer not to share the code for that one though. Is there a plan to fix this, or not yet?