RunPod•4mo ago
kip

Execution time discrepancy

I built a custom text embedding worker. When I time the request on the pod it takes about 20ms to process from start to finish, but the request itself takes a lot longer (about 1.5 seconds), and RunPod returns an executionTime of 1088ms in the response object. Do you know where this discrepancy might come from? As it is, it's really limiting the throughput of my worker, and there isn't much point in using a GPU if it's so heavily bottlenecked. Thanks in advance! Also happy to share the code or any logs if it'd help diagnose what's up.
42 Replies
yhlong00000
yhlong00000•4mo ago
Usually you'd want to run multiple tests, and there are network delays depending on where you're testing from. The first request can take a bit longer because of the cold start, but subsequent requests should respond quickly.
kip
kipOP•4mo ago
Hey @yhlong00000 Thanks for getting back to me 🙂
I'm not so worried about the network delays; the GPUs are in America and I'm not, so that part is pretty expected. The difference between the few ms it takes to process the request and RunPod saying it takes a full second is what's annoying.
I'm running many back-to-back requests to ensure it's not cold start or FlashBoot, and the worker doesn't shut down between tests; it just takes significantly longer than it should.
nerdylive
nerdylive•4mo ago
Hmm, you used the same GPU, same config, same code in both?
Yeah, I think this might have to do with RunPod's serverless then
yhlong00000
yhlong00000•4mo ago
Here is my testing endpoint: it just sleeps for 5 seconds and returns some text. As far as execution time goes, it looks pretty accurate to me.
[screenshots of the test endpoint's requests attached]
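(For reference, a sleep-and-return test worker like that is only a few lines with the runpod SDK. A minimal sketch under that assumption, not the exact code from the screenshots:)

import time
import runpod

def handler(job):
    # Simulate 5 seconds of work, then return some text.
    time.sleep(5)
    return {"message": "slept for 5 seconds"}

runpod.serverless.start({"handler": handler})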
nerdylive
nerdylive•4mo ago
What about running for 0.2 secs?
Maybe do the inference on some models and compare serverless vs pods
yhlong00000
yhlong00000•4mo ago
Maybe share some logs? I’m not sure exactly which time is inaccurate.
kip
kipOP•4mo ago
I'm just logging times from within the inference pod, so the setup is the same.
I'll send some logs, gimme a minute.
So my inference code is really just:
import torch
import runpod
from model import load_model
from rp_schemas import INPUT_SCHEMA
from runpod.serverless.utils.rp_validator import validate
import time

device = torch.device("cuda:0")

model, tokenizer = load_model("~/.cache/model_tensors", device)

async def process_request(job):
    start_time = time.time()

    # My business logic is here
    ...

    total_time = time.time() - start_time

    print(f"Total processing time: {total_time:.4f} seconds")

    return {"total_time": total_time}

def main():
    runpod.serverless.start({
        "handler": process_request,
    })

if __name__ == "__main__":
    main()
(I've removed some of the logic, as I don't think it can be related.)
And the responses I get back from the RunPod worker look like:
{
    "delayTime": 2627,
    "executionTime": 1132,
    "id": "<I removed this>",
    "output": {
        "total_time": 0.05837368965148926
    },
    "status": "COMPLETED"
}
So to be clear, the executionTime is correct, as in that's how long the request occupies the pod, but why is this so much more than the actual execution of my code?
Also, in the example above, I'm the only one making queries to the server.
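(For reference, the different numbers can be compared from the client side with a small script. A sketch, assuming the standard https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync URL, an API key in a RUNPOD_API_KEY environment variable, and a placeholder payload:)

import os
import time
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"  # placeholder
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

start = time.time()
resp = requests.post(URL, headers=HEADERS, json={"input": {"text": "hello"}}).json()
round_trip_ms = (time.time() - start) * 1000

# Compare the client-side round trip, RunPod's reported timings, and the handler's own timer.
print(f"client round trip:  {round_trip_ms:.0f} ms")
print(f"delayTime:          {resp.get('delayTime')} ms")
print(f"executionTime:      {resp.get('executionTime')} ms")
print(f"handler total_time: {resp['output']['total_time'] * 1000:.0f} ms")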
yhlong00000
yhlong00000•4mo ago
Oh, you mean why the delay time is so long? If the delay time is long, it's usually due to a cold start. As you can see from my screenshot, I hadn't used my endpoint for a while, so the worker went to sleep, and waking it up takes some time. Additionally, for some users, waking up a worker also involves loading the model into the GPU's VRAM, which can take even longer depending on the size of the model. However, if you have workers actively processing requests, the delay time should be significantly reduced.
kip
kipOP•4mo ago
It's not the cold start time. The cold start time is something like 2.5 seconds, but even after sending a few requests and increasing the idle timeout to ensure it didn't spin down, execution consistently takes more than a second.
I really can't figure out why it's taking more than the ~60ms of the actual workload.
kip
kipOP•4mo ago
[screenshot attached]
nerdylive
nerdylive•4mo ago
Like the "total_time": 0.05837368965148926 being different from the executionTime in your example response from the endpoint?
kip
kipOP•4mo ago
The time it takes to do the actual inference call is pretty quick, but it takes 1 second to execute.
I can make a minimal repo tomorrow if it helps debug.
nerdylive
nerdylive•4mo ago
Maybe that's because of FlashBoot, which still loads the model into VRAM again but quicker, not sure..
kip
kipOP•4mo ago
The time it takes to load the model is more than 1 second, so I'd be surprised.
I was hoping I'd done something stupid and there was a quick fix.
nerdylive
nerdylive•4mo ago
It probably isn't, and that's how serverless should work. Where's the other time difference coming from here? The gap between your total_time and the executionTime should be from the booting, right?
kip
kipOP•4mo ago
Sorry, I'm not quite sure what you mean
nerdylive
nerdylive•4mo ago
Oh, it probably isn't a fault in your code, it looks right to me
kip
kipOP•4mo ago
Yeah, the total execution time is actually much higher than the time it takes to perform the inference call.
Would it help if I made a minimal example? I'll probably do so anyway to ensure I haven't made a mistake.
nerdylive
nerdylive•4mo ago
Sure, yeah, if you don't mind open-sourcing your code
kip
kipOP•4mo ago
I'll probably just make a standalone thing with transformers.
Out of curiosity, do you know if there are any existing RunPod templates that take less than a second to execute the inference call?
nerdylive
nerdylive•4mo ago
Idk, whisper? Let me try a quick test after this if I can
kip
kipOP•4mo ago
Maybe, one of the faster transformers for sure.
I'll do this tomorrow anyway though, I'll ping you with the code then.
nerdylive
nerdylive•4mo ago
I'll be sending whisper test times.

Cold start (load models, start, etc.), no inference (it threw an error):
{
    "delayTime": 18697,
    "executionTime": 63,
    "id": "4c64b8ad-4411-427e-9989-a48ba53f5308-e1",
    "status": "COMPLETED"
}

No inference again:
{
    "delayTime": 1705,
    "executionTime": 61,
    "id": "be8ac9cc-0e28-4e41-9c55-d2b52eea1574-e1",
    "status": "COMPLETED"
}

With inference:
{
    "delayTime": 1367,
    "executionTime": 846
}

One more before this (I think this is the real one): exec time 2.6s (the request expired so I'm typing this out).
Encyrption
Encyrption•4mo ago
Each request is being proxied. There is going to be some overhead in that. 1 second? That is likely just overhead in proxying your request.
How are you handling your models? Are they stored on Network Storage, baked into the image, or loaded at run time? What region are your workers in? What are your active and max workers set to?
kip
kipOP•4mo ago
@nerdylive @yhlong00000 I put together a small example that has the issue: https://github.com/kipgparker/clip-runpod
The response from the worker is the embedding as well as the time it takes to perform the forward pass, and the executionTime is always significantly higher than the time it takes to perform the forward pass.
Each request is being proxied. There is going to be some overhead in that. 1 second? That is likely just overhead in proxying your request.
I assume you mean by whatever scheduler RunPod uses? I would expect that to be included in the delay time. Also, if the request is being proxied, that shouldn't be blocking any new requests getting to the pod, but for me it seems to be blocking. And one full second is a lot of time.
Encyrption
Encyrption•4mo ago
Are you using RUN (async) or RUNSYNC (synchronous)?
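(For anyone following along, the difference in practice looks roughly like this. A sketch, assuming the usual /run, /runsync and /status routes under https://api.runpod.ai/v2/<ENDPOINT_ID>/ and a Bearer API key; the payload and endpoint ID are placeholders:)

import os
import time
import requests

BASE = "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID"  # placeholder endpoint ID
HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}
payload = {"input": {"text": "hello"}}

# RUNSYNC: blocks until the job completes and returns the output directly.
sync_result = requests.post(f"{BASE}/runsync", headers=HEADERS, json=payload).json()

# RUN: returns a job id immediately; the client polls /status until the job finishes.
job = requests.post(f"{BASE}/run", headers=HEADERS, json=payload).json()
while True:
    status = requests.get(f"{BASE}/status/{job['id']}", headers=HEADERS).json()
    if status["status"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(0.5)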
kip
kipOP•4mo ago
How are you handling your models?
I've given an example in the GitHub repo I sent, but generally I just load them at the top of the handler file
Are they stored on Network Storage, baked into the image, or loaded at run time?
The models are baked into the image; I think I saw that this was the recommended way of doing it on RunPod
What region are your workers in?
Global. I'm not sure how to see the region of a specific pod though, and I don't see how this can really impact execution time
What are your active and max workers set to?
One for now, just to test runsync
Encyrption
Encyrption•4mo ago
Have you done any testing locally? If you have the hardware to do so, it gives you the ability to connect with bash/sh and run your handler directly, bypassing the RunPod proxy. This might provide more details about where the latency you are experiencing is happening.
kip
kipOP•4mo ago
I've tested running the handler locally with --rp_serve_api, and the time to perform the forward pass is something like 20-50ms
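(For anyone reproducing this, the local test server started by --rp_serve_api can be timed the same way as the hosted endpoint. A sketch; the port 8000 and /runsync path below are assumed SDK defaults:)

# In one terminal: python handler.py --rp_serve_api
import time
import requests

start = time.time()
resp = requests.post(
    "http://localhost:8000/runsync",  # assumed default host/port/path of the local test server
    json={"input": {"text": "hello"}},
).json()
print(f"local round trip: {(time.time() - start) * 1000:.0f} ms")
print(resp)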
Encyrption
Encyrption•4mo ago
So you only see the 1 second delay when running on RunPod?
kip
kipOP•4mo ago
Correct, yeah
Encyrption
Encyrption•4mo ago
So, you will have some ms of delay because your traffic is traversing the Internet. When you hit any of the RunPod URLs (runsync, run, status, cancel, and health), behind the scenes your traffic is being proxied. Your image doesn't have the code built in to respond to API calls, so RunPod proxies it for you. My guess is those two things combined are causing your 1 second delay. I am not sure it would be accounted for in delayTime; I think delay is more the proxy waiting on your image to load. You could ping your worker IP to get a rough idea of how many ms the trip over the Internet is taking.
kip
kipOP•4mo ago
Yeah, I don't think that accounts for the 1 second delay, though maybe the proxy polls to see when the job is done or something and that adds some delay.
I don't really mind how long it takes the request to reach the RunPod scheduler; that isn't blocking. The delay has to be happening sometime after it hits the scheduler.
I'm just not sure what to do to decrease this hidden delay. The worker should be able to handle 20-ish requests a second, but at the moment it can't even handle 1 request per second.
Encyrption
Encyrption•4mo ago
Is your worker running as an active worker?
kip
kipOP•4mo ago
I tried keeping 1 worker active, didn't make a difference
Encyrption
Encyrption•4mo ago
If not, you cannot guarantee that your worker will respond to more than 1 request. If your worker is running in active mode, or a job is waiting in the queue or arrives shortly after your worker has responded to a request and you have FlashBoot enabled, it will keep responding without shutting down. Otherwise, each time your worker responds to a request it will shut down. Active workers run all the time and respond the fastest (after their first response). As you get more traffic and have more workers running, the higher the chance FlashBoot will kick in. I suggest setting max workers to 30 until you are sure what you will need.
yhlong00000
yhlong00000•4mo ago
Thanks for sharing, let me give it a try and let you know.
kip
kipOP•4mo ago
Thanks for taking a look 🙂
Had a chance to have a look yet?
yhlong00000
yhlong00000•4mo ago
I took a look, and it seems the issue is related to billing. Since serverless is billed by the second, we can't charge for usage of less than a second. I need a bit more time to look deeper.
I mean, any usage under a second is rounded up to the nearest second.
kip
kipOP•4mo ago
Ahh, I kind of suspected that this might be the case; the usage was always suspiciously just over a second.
I'm surprised you haven't run into that before though.
yhlong00000
yhlong00000•4mo ago
Yeah, for your case it's better to use a concurrent handler, so you can maximize the usage of the GPU
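(For reference, a concurrent handler is an async handler plus a concurrency_modifier entry in the start() config, as I understand the runpod SDK. A minimal sketch; the limit of 8 is an arbitrary example value:)

import asyncio
import runpod

async def process_request(job):
    # Real embedding logic would go here; awaiting lets several jobs share the worker.
    await asyncio.sleep(0)
    return {"ok": True}

def concurrency_modifier(current_concurrency):
    # Allow up to 8 jobs in flight on this worker at once (arbitrary example value).
    return 8

runpod.serverless.start({
    "handler": process_request,
    "concurrency_modifier": concurrency_modifier,
})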
kip
kipOP•4mo ago
Yeah, I had issues with this too, which I think is also tied to the billing. I had it concurrent first and then moved to synchronous to try and debug; I'd prefer not to share the code for that one though.
Is there a plan to fix this, or not yet?