Execution time discrepancy
I built a custom text embedding worker. When I time the request on the pod itself, it takes about 20ms to process from start to finish.
The request itself takes a lot longer (about 1.5 seconds), and RunPod returns an executionTime of 1088ms in the response object.
Do you know where this discrepancy might come from? As it is, it's really limiting the throughput of my worker, and there isn't much point in using a GPU if it's this heavily bottlenecked.
Thanks in advance! Also happy to share the code or any logs if it'd help diagnose what's up.
Usually you'd want to run multiple tests, and there are network delays depending on where you test from. The first request can take a bit longer because of cold start, but subsequent requests should come back quickly.
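If it helps, a quick way to compare the wall-clock round trip with what RunPod reports is something like this (the endpoint ID, API key, and input payload are placeholders):

import time

import requests

# Placeholders: substitute your own endpoint ID and API key.
URL = "https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync"
HEADERS = {"Authorization": "Bearer <API_KEY>"}

# Fire several back-to-back requests so cold start only affects the first one,
# then compare wall-clock time against what RunPod reports for each job.
for i in range(10):
    start = time.time()
    resp = requests.post(URL, headers=HEADERS, json={"input": {"text": "hello"}}, timeout=60).json()
    wall_ms = (time.time() - start) * 1000
    print(f"req {i}: wall={wall_ms:.0f}ms delay={resp.get('delayTime')}ms exec={resp.get('executionTime')}ms")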
Hey @yhlong00000 Thanks for getting back to me 🙂
I'm not so worried about the network delays; the GPUs are in America and I'm not, so that part is pretty expected.
The difference between the few ms it takes to process the request and RunPod saying it takes a full second is the annoying part.
I'm running many back-to-back requests to rule out cold start or FlashBoot, and the worker doesn't shut down between tests; it just takes significantly longer than it should.
Hmm, you used the same GPU, same config, same code in both?
Yeah, I think this might have to do with RunPod's serverless then.
Here is my testing endpoint. It just sleeps for 5 seconds and returns some text; as far as execution time goes, it looks pretty accurate to me.
What about running for 0.2 secs
Maybe do the inference on some models on serverless vs on pods?
Maybe share some logs? I’m not sure exactly which time is inaccurate.
I'm just logging times from within the inference pod, so the setup is the same
I'll send some logs, gimme a minute
So my inference code is really just:
(removing some of the logic, as I don't think it can be related)
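Roughly this shape; the model name and the exact pre/post-processing here are illustrative, not the real code:

import time

import runpod
import torch
from transformers import CLIPModel, CLIPProcessor

# Model is loaded once at module import, so it stays in VRAM between requests.
MODEL_ID = "openai/clip-vit-base-patch32"  # illustrative, not necessarily the real model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained(MODEL_ID).to(device).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)


def handler(job):
    text = job["input"]["text"]

    start = time.time()
    with torch.no_grad():
        inputs = processor(text=[text], return_tensors="pt", padding=True).to(device)
        embedding = model.get_text_features(**inputs)
    total_time = time.time() - start  # this is the ~20-60ms I'm logging on the pod

    return {"embedding": embedding[0].tolist(), "total_time": total_time}


runpod.serverless.start({"handler": handler})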
And the responses I get back from the RunPod worker look like:
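Something like this (the id and delayTime are elided here, and the embedding truncated):

{
  "delayTime": ...,
  "executionTime": 1088,
  "id": "...",
  "status": "COMPLETED",
  "output": {
    "embedding": [...],
    "total_time": 0.05837368965148926
  }
}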
So to be clear, the execution time is correct, as in that's how long the request occupies the pod, but why is it so much longer than the actual execution of my code?
Also in the example above, I’m the only one making queries to the server
Oh, you mean why the delay time is so long?
If the delay time is long, it’s usually due to a cold start. As you can see from my screenshot, I hadn’t used my endpoint for a while, so the worker went to sleep, and waking it up takes some time. Additionally, for some users, waking up a worker also involves loading the model into the GPU’s VRAM, which can take even longer depending on the size of the model. However, if you have workers actively processing requests, the delay time should be significantly reduced.
It's not the cold start time
The cold start time is something like 2.5 seconds, but even after sending a few requests and increasing the idle timeout to ensure it didn't spin down, execution consistently takes more than a second
I really can't figure out why it's taking more than the ~60ms of the actual workload
like the "total_time": 0.05837368965148926 is different from executionTime in your example response from endpoint?
The time it takes to do the actual inference call is pretty quick, but the request takes 1 second to execute.
I can make a minimal repo tomorrow if it helps debug
Maybe that's because of FlashBoot, which still loads the model into VRAM again but quicker, not sure...
The time it takes to load the model is more than 1 second, so I'd be surprised.
I was hoping i'd done something stupid and there was a quick fix
It probably isn't, and that's how serverless should work. So where is the rest of the time going, between your total_time and the executionTime?
It should be from the booting, right?
Sorry, i'm not quite sure what you mean
this
Oh, it probably isn't a fault in your code, it looks right to me.
Yeah the total execution time is actually much higher than the time it takes to perform the inference call
Would it help if I made a minimal example? I'll probably do so anyway to ensure I haven't made a mistake.
sure ya
If you don't mind open-sourcing your code.
I'll probably just make a standalone thing with transformers.
Out of curiosity, do you know if there are any existing RunPod templates that take less than a second to execute the inference call?
Idk, Whisper? Let me try a quick test after this if I can.
Maybe, one of the faster transformers for sure.
I'll do this tomorrow anyway though, I'll ping you with the code then
I'll send the Whisper timing tests:
{
  "delayTime": 18697,   <- cold start, load models, start, etc.
  "executionTime": 63,  <- threw an error, so no inference time
  "id": "4c64b8ad-4411-427e-9989-a48ba53f5308-e1",
  "status": "COMPLETED"
}
No inference again:
{
  "delayTime": 1705,
  "executionTime": 61,
  "id": "be8ac9cc-0e28-4e41-9c55-d2b52eea1574-e1",
  "status": "COMPLETED"
}
With inference:
{
  "delayTime": 1367,
  "executionTime": 846,
One more before this (I think this is the real one):
exec time 2.6s (the request expired, so I'm typing this out)
Each request is being proxied. There is going to be some overhead in that. 1 second? That is likely just overhead in proxying your request.
How are you handling your models? Are they stored on Network Storage, baked into the image, or loaded at runtime? What region are your workers in? What are your active and max workers set to?
@nerdylive @yhlong00000 I put together a small example that has the issue https://github.com/kipgparker/clip-runpod
The response from the worker is the embedding plus the time it takes to perform the forward pass, and the reported execution time is always significantly higher than that forward-pass time.
"Each request is being proxied. There is going to be some overhead in that. 1 second? That is likely just overhead in proxying your request." I assume you mean by whatever scheduler RunPod uses? I would expect that to be included in the delay time. Also, if the request is being proxied, that shouldn't be blocking any new requests from getting to the pod, but for me it seems to be blocking. And one full second is a lot of time.
Are you using RUN (async) or RUNSYNC (synchronous)?
"How are you handling your models?" I've given an example in the GitHub repo I sent, but generally I just load them at the top of the handler file.
"Are they stored on Network Storage, baked into the image, or loaded at runtime?" The models are baked into the image; I think I saw that this was the recommended way of doing it on RunPod.
"What region are your workers in?" Global. I'm not sure how to see the region of a specific pod though, and I don't see how this can really impact execution time.
"What are your active and max workers set to?" One for now, just to test runsync.
Have you done any testing locally? If you have the hardware to do so, it gives you the ability to connect with bash/sh and run your handler directly, bypassing the RunPod proxy. This might provide more details about where the latency you are experiencing is happening.
I've tested running the handler locally with
--rp_serve_api
and the time to perform the forward pass is something like 20-50ms
So you only see the 1 second delay when running on RunPod?
Correct, yeah
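For reference, the local test is basically just this (the test payload is illustrative):

# Serve the handler locally using the flag above (from the runpod SDK):
python handler.py --rp_serve_api

# Or run a single job directly, no server involved:
python handler.py --test_input '{"input": {"text": "hello"}}'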
So, you will have some ms of delay because your traffic is traversing the Internet. When you hit any of the RunPod URLs (runsync, run, status, cancel, and health), behind the scenes your traffic is being proxied. Your image doesn't have the code built in to respond to API calls, so RunPod proxies them for you. My guess is those two things combined are causing your 1 second delay. I am not sure it would be accounted for in delayTime; I think that is more the proxy waiting on your image to load.
You could PING your worker IP to get a rough idea of how many ms the trip on the Internet is taking
Yeah, I don't think that accounts for the 1 second delay; maybe the proxy polls to see when the job is done or something and that adds some delay though.
I don't really mind how long it takes the request to reach the RunPod scheduler; that isn't blocking.
The delay has to be happening sometime after it hits the scheduler.
I'm just not sure what to do to decrease this hidden delay. The worker should be able to handle 20-ish requests a second, but at the moment it can't even handle 1 request per second.
Is your worker running as an active worker?
I tried keeping 1 worker active, didn't make a difference
If not, you cannot guarantee that your worker will respond to more than one request. Unless your worker is running in active mode, or you have FlashBoot enabled and a job is waiting in the queue (or one arrives shortly after your worker has responded to a request), it will shut down after each request instead of staying up for the next one. Active workers run all the time and respond the fastest (after their first response).
The more traffic you get and the more workers you have running, the higher the chance FlashBoot will kick in. I suggest setting max workers to 30 until you are sure what you will need.
Thanks for sharing, let me give it a try and let you know.
Thanks for taking a look 🙂
Had a chance to have a look yet?
I took a look, and it seems the issue is related to billing. Since serverless is billed by the second, we can’t charge for usage less than a second. I need a bit more time to look deeper.
I mean any usage under a second is rounded up to the nearest second.
Ahh, I kind of suspected that this might be the case, the usage was always suspiciously just over a second
I’m surprised you haven’t run into that before though
Yeah, for your case it's better to use a concurrent handler, so you can maximize the usage of the GPU.
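Something like this pattern, assuming an async handler; embed() here just stands in for your existing forward pass and the concurrency cap is a placeholder:

import asyncio

import runpod


def embed(text):
    # Stand-in for the real synchronous forward pass.
    return [0.0] * 512


async def handler(job):
    # Async handler: while this job's forward pass runs in a thread,
    # the worker can pick up additional jobs instead of sitting idle.
    text = job["input"]["text"]
    embedding = await asyncio.to_thread(embed, text)
    return {"embedding": embedding}


def concurrency_modifier(current_concurrency):
    # Let the worker take several jobs at once; tune to what the GPU can handle.
    return 8  # placeholder cap


runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})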
Yeah, I had issues with this too, which I think is also tied to the billing
I had it concurrent first and then moved to synchronous to try and debug, I’d prefer not to share the code for that one though
Is there a plan to fix this, or not yet?