Flashboot not working

I have Flashboot enabled on my workers, but all of them appear to be running off a cold boot every time for some reason.
[Attachment: no description]
1AndOnlyPika (OP), 4mo ago
[Attachment: no description]
1AndOnlyPika (OP), 4mo ago
Delay times:
[Attachment: no description]
nerdylive, 4mo ago
It's usually not like this? When did it start?
Rodka, 4mo ago
I'm having the same problem. This seems to have started yesterday.
[Attachment: no description]
Rodka, 4mo ago
Very inconsistent, and these are all sequential requests to the same worker.
Poddy, 4mo ago
@1AndOnlyPika
Escalated To Zendesk
The thread has been escalated to Zendesk!
Rodka, 4mo ago
I just rolled back to runpod 1.6.2 (from 1.7.1, which I updated to yesterday) in my Docker image and it seems to have fixed it. I'll run some more tests to confirm.
It did.
nerdylive, 4mo ago
Rodka, 4mo ago
No, it did not time out.
1AndOnlyPika (OP), 4mo ago
No, I rolled back to 1.6.2 as well, but that did not fix the issue.
After some more testing, it appears the delay time has decreased a bit, to ~3s, compared to the 6-7s cold boots before.
I downgraded further to 1.6.0 and it looks like that made it a little bit better. Weird.
deanQ, 4mo ago
Has it always been at a 1-second idle timeout? There's a bug in 1.7.1 that affects tasks running longer than the idle timeout. That's getting fixed in 1.7.2, which is releasing soon. See PR https://github.com/runpod/runpod-python/pull/362
GitHub
fix: pings were missing requestIds since the last big refactor by d...
Distinguish JobsQueue(asyncio.Queue) and JobsProgress(set) and reference them appropriately; cleaned up JobScaler.process_job --> rp_job.handle_job; graceful cleanup when worker is killed; more tests
1AndOnlyPika (OP), 4mo ago
Yep, my tasks take 40 seconds and come in bursty batches, so I have it set to 1s so that the worker shuts off as soon as it's done.
deanQ, 4mo ago
Alright. That should be fixed with the 1.7.2 release. I’ll let you know when it’s out.
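A rough sketch of the condition deanQ describes (runpod 1.7.1 combined with jobs that run longer than the idle timeout). The helper names are hypothetical, and treating exactly 1.7.1 as the only affected version is an assumption taken from this thread, not from the library's changelog.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption from this thread: only 1.7.1 is named as having the bug.
AFFECTED_VERSION = "1.7.1"


def installed_runpod_version():
    """Return the installed runpod package version, or None if it is not installed."""
    try:
        return version("runpod")
    except PackageNotFoundError:
        return None


def hits_idle_timeout_bug(sdk_version, job_seconds, idle_timeout_seconds):
    """True when the setup matches the reported bug: 1.7.1 and a job longer than the idle timeout."""
    return sdk_version == AFFECTED_VERSION and job_seconds > idle_timeout_seconds


# The setup described above: 40-second jobs with a 1-second idle timeout.
print(hits_idle_timeout_bug("1.7.1", job_seconds=40, idle_timeout_seconds=1))
```

Inside a worker image, `hits_idle_timeout_bug(installed_runpod_version(), 40, 1)` would run the same check against whatever version is actually installed.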
1AndOnlyPika (OP), 4mo ago
Thanks. Is it safe to install from the git repo directly?
deanQ, 4mo ago
Yes. You can do that from the main branch if you'd like to test it out. Override the Container Start Command with something like:
/bin/bash -c "apt-get update && \
apt-get install -y git && \
pip install git+https://github.com/runpod/runpod-python && \
<insert Dockerfile CMD here>"
Arjun, 4mo ago
Thankfully I found this thread. I was about to invest a whole day optimizing my container image because I thought a change I made broke Flashboot! Will wait for the new release.
@deanQ Out of curiosity, do you have a sense of when 1.7.2 will be released?
deanQ, 4mo ago
Today. I’m just running some final testing.
Arjun, 4mo ago
Awesome! Thanks @deanQ
deanQ, 4mo ago
FYI: v1.7.2 is in pre-release while I do some final tests: https://github.com/runpod/runpod-python/releases/tag/1.7.2
GitHub
Release 1.7.2 · runpod/runpod-python
What's Changed Corrected job_take_url by @deanq in #359 Update cryptography requirement from <43.0.0 to <44.0.0 by @dependabot in #353 fix: pings were missing requestIds since the last b...
1AndOnlyPika (OP), 4mo ago
It does not appear to have fixed Flashboot.
1AndOnlyPika (OP), 4mo ago
[Attachment: no description]
1AndOnlyPika (OP), 4mo ago
[Attachment: no description]
1AndOnlyPika (OP), 4mo ago
I downgraded all the way to 1.5.3 and Flashboot is a bit more consistent now.
Arjun, 4mo ago
@1AndOnlyPika What view is that? What are the columns?
nerdylive, 4mo ago
The Requests view on serverless endpoints.
Arjun, 4mo ago
Oh I see, mine don't seem to be showing up, I guess because they're older than 30 min?
Anyway, the metric I've been using to evaluate is the Cold start time P70. You can see that prior to deploying a new image (presumably with a newer version of the runpod pip library), our P70 start time was under 150ms. After deploying, it went above 5,000ms, and after redeploying with runpod 1.6.2 it's back down to under 700ms, which is still higher than before.
I'm not using delay time, as that seems to also factor in queue times. I'm assuming Flashboot impacts the Cold start time and that this is the correct metric to evaluate, yeah?
[3 attachments: no description]
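For anyone wanting to reproduce a figure like P70 from their own request logs, here is a minimal sketch. How RunPod computes its "Cold start time P70" is not stated in this thread, so the nearest-rank method used here is an assumption, and the sample values are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical cold start times in milliseconds, not real measurements.
cold_start_ms = [90, 100, 110, 120, 130, 140, 145, 400, 800, 2000]
print(percentile(cold_start_ms, 70))
```

A P70 is less sensitive to a few very slow starts than a mean would be, so it is worth comparing over the same time window before and after a deploy.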
nerdylive, 4mo ago
Yep, maybe the requests expired already.
I'm not sure if it's right or not, and I think to compare you should filter the data to a specific time range, or maybe use another endpoint.
So after you downgraded the Python library for the worker, did the cold start become slower?
Arjun, 4mo ago
Well, downgrading to 1.6.2 did improve things quite a bit. It's still not as good as before, I think, but maybe I just need to wait and see if things get faster with more usage.
Is Flashboot performance related at all to image size or container disk size? For example, should the image fit in the Container Disk Size specified? I'm not sure how Flashboot works, so it's hard to know what's happening.
nerdylive, 4mo ago
Oh, I don't know. I'm not sure how much it matters. Last time I heard, it's fine to have big container images, but I guess not.
deanQ, 4mo ago
With a setup like this, you will face cold start issues. For example, if you have a burst of consecutive jobs coming in, workers will stay alive and take those jobs. The moment there is a gap of a second or two without a job, your workers will go to sleep. Any job that comes in after that will have to wait in the queue until a worker is ready. And by ready I mean flash-booted or fully booted as a new worker. A few extra seconds of idle timeout will not cost you more, and will guarantee quick job takes between the gaps. Incurring cold start and boot times will end up costing you more time in total.
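A toy model of the trade-off described above, assuming a single worker, fixed-length jobs, and a fixed cold start penalty. This is not how RunPod's scheduler actually works; the function name and all numbers are illustrative, with the 6s cold start borrowed from the cold boots reported earlier in the thread.

```python
def simulate(arrivals, job_seconds, idle_timeout, cold_start_seconds):
    """Return (cold_starts, total_wait) for one worker handling jobs in arrival order."""
    cold_starts = 0
    total_wait = 0
    busy_until = None  # when the worker finishes its current job; None means never started

    for arrival in sorted(arrivals):
        if busy_until is None or arrival > busy_until + idle_timeout:
            # The worker has gone to sleep (or never ran), so this job pays a cold start.
            cold_starts += 1
            start = arrival + cold_start_seconds
        else:
            # The worker is still alive: start right away, or as soon as it is free.
            start = max(arrival, busy_until)
        total_wait += start - arrival
        busy_until = start + job_seconds

    return cold_starts, total_wait


jobs = [0, 20, 40]  # arrival times in seconds
print(simulate(jobs, job_seconds=10, idle_timeout=1, cold_start_seconds=6))
print(simulate(jobs, job_seconds=10, idle_timeout=10, cold_start_seconds=6))
```

In this toy run, the 1s idle timeout pays three cold starts and 18s of total waiting, while the 10s idle timeout pays one cold start and 6s of waiting.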
1AndOnlyPika (OP), 4mo ago
My gaps are 10 minutes long, so I only want workers to boot up, take one job, and then be done.
The jobs must complete within one minute, including the delay/cold start time.
Which is why the longer delay times are a problem for me.
deanQ, 4mo ago
This is exactly where flash-boot should help. I’ll investigate what I can about this.
1AndOnlyPika (OP), 4mo ago
Thank you, my endpoint id is 8ba6bkaiosbww6.
Most of the time, half of the max workers work with Flashboot and start in 2s, but lots of them take 15s or more.
Sometimes I cannot get all of them even in 45s.
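To make the constraint concrete, here is the arithmetic using the figures from this thread: 40-second jobs and a one-minute limit that includes the delay. The function names are hypothetical, and the sketch assumes the job itself always takes exactly 40 seconds.

```python
def delay_budget(deadline_seconds, job_seconds):
    """Seconds left for queueing and cold start once the job's own run time is reserved."""
    return deadline_seconds - job_seconds


def meets_deadline(delay_seconds, deadline_seconds=60, job_seconds=40):
    """True if delay plus execution fits within the deadline."""
    return delay_seconds + job_seconds <= deadline_seconds


print(delay_budget(60, 40))
for delay in (2, 15, 45):  # delay times mentioned above, in seconds
    print(delay, meets_deadline(delay))
```

Under these assumptions the budget for delay is 20 seconds, so a 2s flash-booted start fits easily, a 15s start leaves only 5s of slack, and a 45s start misses the deadline.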
nerdylive, 4mo ago
Btw 1AndOnlyPika, just wondering, what model are you running on serverless, and do you use any specific application to run it?
1AndOnlyPika (OP), 4mo ago
It's for Bittensor, just running out of a Python file.
nerdylive, 4mo ago
Ooh so no setup scripts before? What happens before the start() call?
1AndOnlyPika (OP), 4mo ago
Nope, it just directly starts the worker and waits for requests.
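For context on the question about what runs before the start() call, a worker of the kind described might look roughly like the sketch below. The handler body and the timing print are hypothetical and not the OP's actual code; `runpod.serverless.start({"handler": ...})` is the SDK's entry point, imported inside `main()` here so the handler can be exercised without the SDK installed.

```python
import time

# Anything executed at import time (for example loading a model) happens
# before start() and therefore counts toward a full cold boot.
_LOADED_AT = time.monotonic()


def handler(job):
    """Hypothetical handler: echo the job input back as the result."""
    return {"echo": job["input"]}


def main():
    import runpod  # requires the runpod package inside the worker image

    print(f"setup before start(): {time.monotonic() - _LOADED_AT:.2f}s")
    runpod.serverless.start({"handler": handler})
```

In a real worker file, `main()` would be called at the bottom of the script that the container's start command runs.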
nerdylive, 4mo ago
Ic
Arjun, 4mo ago
What might be the downsides of using an earlier version of the library (e.g. 1.5.3)? I'm finding that this version yields much quicker startups.
1AndOnlyPika (OP), 4mo ago
No downsides that I've noticed so far.
You'd probably lose a few features from the newer versions, but I don't know if that would matter much.
