Flashboot not working
I have flashboot enabled on my workers, but it appears all of them are running off a cold boot every time, for some reason.
delay times:
Usually it's not like this. When did it start?
I'm having the same problem. This seems to have started yesterday.
Very inconsistent, and these are all sequential requests to the same worker
@1AndOnlyPika
Escalated To Zendesk
The thread has been escalated to Zendesk!
I just rolled back to RunPod 1.6.2 (from 1.7.1, since I updated it yesterday) in my Docker image and that seems to have fixed it. I'll run some more tests to confirm.
It did.
is it the same as this https://discord.com/channels/912829806415085598/1291355541599420467
@1AndOnlyPika @Rodka
No, it did not time out
no, i rolled back to 1.6.2 as well but that didn't fix the issue
after some more testing, it appears the delay time has decreased a bit
~3s, compared to the 6-7s cold boots before
downgraded further to 1.6.0 and it looks like that made it a little bit better, weird
Has it always been at a 1-second idle timeout? There’s a bug in 1.7.1 that affects tasks running longer than the idle timeout. That’s getting fixed in 1.7.2, which is releasing soon. See PR https://github.com/runpod/runpod-python/pull/362
fix: pings were missing requestIds since the last big refactor by d...
Distinguish JobsQueue(asyncio.Queue) and JobsProgress(set) and reference them appropriately
Cleaned up JobScaler.process_job --> rp_job.handle_job
Graceful cleanup when worker is killed
More tests
yep, my tasks take 40 seconds and come in bursty batches so i have it set to 1s so that as soon as it's done it will shut off
Alright. That should be fixed with the 1.7.2 release. I’ll let you know when it’s out.
thanks. is it safe to install from the git repo directly?
Yes. You can do that from the main branch if you’d like to test it out.
Override the Container Start Command with something like
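A sketch of what that could look like (handler.py is just a placeholder here; swap in whatever your worker actually runs):

```bash
bash -c "pip install --upgrade git+https://github.com/runpod/runpod-python.git@main && python -u handler.py"
```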
Thankfully I found this thread. I was about to invest a whole day optimizing my container image because I thought a change I made broke Flashboot! Will wait for the new release
@deanQ Out of curiosity, do you have a sense of when 1.7.2 will be released?
Today. I’m just running some final testing.
Awesome! Thanks @deanQ
FYI: v1.7.2 is on pre-release while I do some final tests https://github.com/runpod/runpod-python/releases/tag/1.7.2
Release 1.7.2 · runpod/runpod-python
What's Changed
Corrected job_take_url by @deanq in #359
Update cryptography requirement from <43.0.0 to <44.0.0 by @dependabot in #353
fix: pings were missing requestIds since the last b...
does not appear to have fixed flashboot
downgraded all the way to 1.5.3 and flashboot is a bit more consistent now
@1AndOnlyPika What view is that? What are the columns?
Requests
On serverless endpoints
Oh I see, mine don't seem to be showing up, I guess because they're older than 30 min?
Anyway, the metric I've been using to evaluate is the Cold start time P70.
You can see that prior to deploying a new image (presumably with a newer version of the pip runpod library), our P70 cold start time was around <150ms. After deploying it was up to >5,000ms, and after redeploying with runpod 1.6.2 it's back down to <700ms, but still higher than before.
I'm not using delay time as that seems to also factor in queue times.
I'm assuming Flashboot impacts the Cold start time and that is the correct metric to evaluate, yeah?
Yep maybe the requests expired already
I'm not sure if that's right or not. To compare, I think you should filter the data to a specific time range, or maybe use another endpoint.
So after you downgraded the python library for the worker, it became slower for the cold start?
Well, downgrading to 1.6.2 did improve things quite a bit.
Not as good as before, I think, but maybe I just need to wait and see if things get faster with more usage.
Is Flashboot performance related at all to image size or container disk size? For example, should the image fit in the Container Disk Size specified?
Not sure how Flashboot works, so hard to know what's happening.
Oh idk, I'm not sure how much it matters, but last time I heard it's fine to have big container images
I guess not
With a setup like this, you will face cold start issues. For example, if you have a burst of consecutive jobs coming in, workers will stay alive and take those jobs. The moment there's a gap of a second or two without a job, your workers will go to sleep. Any job that comes in after that will have to wait in the queue until a worker is ready, and by ready I mean flash-booted or fully booted as a new worker. An extra few seconds of idle timeout will not cost you much more, and will guarantee quick job takes between the gaps. Incurring cold start and boot times will end up costing you more time in total.
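To put rough numbers on that (purely illustrative assumptions, not measured RunPod figures):

```python
# Illustrative back-of-the-envelope sketch, not measured numbers:
# a burst of 5 jobs, each running 40 s, where the next job arrives
# about 2 s after the previous one finishes, and a cold boot costs ~6 s.
jobs, run_s, gap_s, cold_boot_s = 5, 40, 2, 6
follow_ups = jobs - 1

# Idle timeout of 1 s: the worker falls asleep inside every 2 s gap,
# so each follow-up job pays the cold boot again.
added_delay_1s_timeout = follow_ups * cold_boot_s   # 24 s of extra waiting

# Idle timeout of, say, 5 s: the worker stays warm across the gaps;
# the cost is a few billed-but-idle seconds per gap instead.
added_idle_5s_timeout = follow_ups * gap_s          # 8 s of idle billing

print(added_delay_1s_timeout, added_idle_5s_timeout)
```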
My gaps are 10 minutes long, so i only want workers to boot up, take one job, and then be done
The jobs must complete within one minute, including the delay/cold start time
Which is why the longer delay times are a problem for me
This is exactly where flash-boot should help. I’ll investigate what I can about this.
Thank you, my endpoint id is 8ba6bkaiosbww6
most of the time, half of the max workers work with flashboot and start in 2s, but lots of them take 15s+
Sometimes I cannot get all of them even in 45s
Btw 1andonlypika, just wondering what model you are running in serverless, and maybe if you use any specific applications to run it?
it's for bittensor, just running out of a python file
Ooh so no setup scripts before? What happens before the start() call?
nope it just directly starts the worker and waits for requests
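basically just something like this (a rough sketch with a placeholder handler, not the actual file):

```python
import runpod

def handler(job):
    # job["input"] carries whatever payload was sent to the endpoint
    payload = job["input"]
    # ... the ~40 s of actual work would happen here ...
    return {"output": payload}

# No setup before this: the worker starts and just waits for requests
runpod.serverless.start({"handler": handler})
```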
Ic
What might be the downsides of using an earlier version of the library (e.g. 1.5.3)? I'm finding this version yields much quicker startups.
no downsides that i've noticed so far
you'd probably lose a few features from the new versions, but i don't know if that'd matter so much