Flashboot not working
I have flashboot enabled on my workers, but it appears all of them are running off a cold boot every time, for some reason.
delay times:
Usually it's not like this. When did it start?
I'm having the same problem. This seems to have started yesterday.
Very inconsistent, and these are all sequential requests to the same worker
@1AndOnlyPika
Escalated To Zendesk
The thread has been escalated to Zendesk!
I just rolled back to RunPod 1.6.2 (from 1.7.1, since I updated it yesterday) in my Docker image and that seems to have fixed it. I'll run some more tests to confirm.
It did.
is it the same as this https://discord.com/channels/912829806415085598/1291355541599420467
@1AndOnlyPika @Rodka
No, it did not time out
no, i rolled back to 1.6.2 as well but that didn't fix the issue
after some more testing, it appears the delay time has decreased a bit
~3s, compared to the 6-7s cold boots before
downgraded further to 1.6.0 and it looks like that made it a little bit better, weird
Has it always been at a 1-second idle timeout? There’s a bug in 1.7.1 that affects tasks running longer than the idle timeout. That’s getting fixed in 1.7.2, which is releasing soon. See PR https://github.com/runpod/runpod-python/pull/362
fix: pings were missing requestIds since the last big refactor by d...
Distinguish JobsQueue(asyncio.Queue) and JobsProgress(set) and reference them appropriately
Cleaned up JobScaler.process_job --> rp_job.handle_job
Graceful cleanup when worker is killed
More tests
yep, my tasks take 40 seconds and come in bursty batches so i have it set to 1s so that as soon as it's done it will shut off
Alright. That should be fixed with the 1.7.2 release. I’ll let you know when it’s out.
thanks. is it safe to install from the git repo directly?
Yes. You can do that from the main branch if you’d like to test it out.
Override the Container Start Command with something like
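A sketch of what that could look like (handler.py is just a placeholder here; swap in whatever your worker actually runs):

```bash
bash -c "pip install --upgrade git+https://github.com/runpod/runpod-python.git@main && python -u handler.py"
```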
Thankfully I found this thread. I was about to invest a whole day optimizing my container image because I thought a change I made broke Flashboot! Will wait for the new release
@deanQ Out of curiosity, do you have a sense of when 1.7.2 will be released?
Today. I’m just running some final testing.
Awesome! Thanks @deanQ
FYI: v1.7.2 is on pre-release while I do some final tests https://github.com/runpod/runpod-python/releases/tag/1.7.2
Release 1.7.2 · runpod/runpod-python
What's Changed
Corrected job_take_url by @deanq in #359
Update cryptography requirement from <43.0.0 to <44.0.0 by @dependabot in #353
fix: pings were missing requestIds since the last b...
does not appear to have fixed flashboot
downgraded all the way to 1.5.3 and flashboot is a bit more consistent now
@1AndOnlyPika What view is that? What are the columns?
Requests
On serverless endpoints
Oh I see, mine don't seem to be showing up, I guess because they're older than 30 min?
Anyway, the metric I've been using to evaluate is the Cold start time P70.
You can see that prior to deploying a new image (presumably with a newer version of the pip runpod library), our P70 cold start time was around <150ms. After deploying it was up to >5,000ms, and after redeploying with runpod 1.6.2 it's back down to <700ms, but still higher than before.
I'm not using delay time as that seems to also factor in queue times.
I'm assuming Flashboot impacts the Cold start time and that is the correct metric to evaluate, yeah?
Yep maybe the requests expired already
I'm not sure if that's right or not. To compare, I think you should filter the data to a specific time range, or maybe use another endpoint.
So after you downgraded the python library for the worker, it became slower for the cold start?
Well, downgrading to 1.6.2 did improve things quite a bit.
Not as good as before, I think, but maybe I just need to wait and see if things get faster with more usage.
Is Flashboot performance related at all to image size or container disk size? For example, should the image fit in the Container Disk Size specified?
Not sure how Flashboot works, so hard to know what's happening.
Oh idk, I'm not sure how much it matters, but last time I heard it's fine to have big container images
I guess not
With a setup like this, you will face cold start issues. For example, if you have a burst of consecutive jobs coming in, workers will stay alive and take those jobs. The moment there's a gap of a second or two without a job, your workers will go to sleep. Any job that comes in after that will have to wait in the queue until a worker is ready, and by ready I mean flash-booted or fully booted as a new worker. An extra few seconds of idle timeout will not cost you much more, and will guarantee quick job takes between the gaps. Incurring cold start and boot times will end up costing you more time in total.
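To put rough numbers on that (purely illustrative assumptions, not measured RunPod figures):

```python
# Illustrative back-of-the-envelope sketch, not measured numbers:
# a burst of 5 jobs, each running 40 s, where the next job arrives
# about 2 s after the previous one finishes, and a cold boot costs ~6 s.
jobs, run_s, gap_s, cold_boot_s = 5, 40, 2, 6
follow_ups = jobs - 1

# Idle timeout of 1 s: the worker falls asleep inside every 2 s gap,
# so each follow-up job pays the cold boot again.
added_delay_1s_timeout = follow_ups * cold_boot_s   # 24 s of extra waiting

# Idle timeout of, say, 5 s: the worker stays warm across the gaps;
# the cost is a few billed-but-idle seconds per gap instead.
added_idle_5s_timeout = follow_ups * gap_s          # 8 s of idle billing

print(added_delay_1s_timeout, added_idle_5s_timeout)
```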
My gaps are 10 minutes long, so i only want workers to boot up, take one job, and then be done
The jobs must complete within one minute, including the delay/cold start time
Which is why the longer delay times are a problem for me
This is exactly where flash-boot should help. I’ll investigate what I can about this.
Thank you, my endpoint id is 8ba6bkaiosbww6
most of the time, half of the max workers work with flashboot and start in 2s, but lots of them take 15s+
Sometimes I cannot get all of them even in 45s
Btw 1andonlypika, just wondering what model you are running in serverless, and maybe if you use any specific applications to run it?
it's for bittensor, just running out of a python file
Ooh so no setup scripts before? What happens before the start() call?
nope it just directly starts the worker and waits for requests
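basically just something like this (a rough sketch with a placeholder handler, not the actual file):

```python
import runpod

def handler(job):
    # job["input"] carries whatever payload was sent to the endpoint
    payload = job["input"]
    # ... the ~40 s of actual work would happen here ...
    return {"output": payload}

# No setup before this: the worker starts and just waits for requests
runpod.serverless.start({"handler": handler})
```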
Ic
What might be the downsides of using an earlier version of the library (e.g. 1.5.3)? I'm finding this version yields much quicker startups.
no downsides that i've noticed so far
you'd probably lose a few features from the new versions, but i don't know if that'd matter so much