Stuck on "loading container image from cache"
Hi, I have updated my serverless endpoint release version, but some of my workers are stuck on "loading container image from cache" even though it's a new version that shouldn't exist in the cache to begin with.
Any advice on how to solve this issue?
Use another tag, re-push using new image tag
Thanks for such a quick response! I am using a brand new tag; that's why I had to increment the release version accordingly. Some of the workers are pulling the image as expected, but some are just "loading container image from cache"... :/
This doesn't fix the issue, but you can quickly reload to the new version by setting max workers to 0 and then, once all workers have stopped, putting it back to the desired value.
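For anyone who wants to script that workaround, here is a rough sketch of the scale-to-zero-then-restore sequence. It does not call any real platform API: `set_max_workers` and `count_active_workers` are hypothetical callables you would have to supply yourself (for example, wrappers around whatever console or API access you have), and the polling interval and timeout are arbitrary assumptions.

```python
import time
from typing import Callable


def cycle_workers(
    set_max_workers: Callable[[int], None],    # hypothetical: applies a new max-workers value
    count_active_workers: Callable[[], int],   # hypothetical: returns how many workers still exist
    desired_max: int,
    poll_seconds: float = 5.0,                 # assumed polling interval
    timeout_seconds: float = 600.0,            # assumed upper bound on the wait
) -> bool:
    """Set max workers to 0, wait for workers to stop, then restore desired_max.

    Returns True if all workers stopped before the timeout, False otherwise.
    The desired value is restored in both cases so the endpoint is not left at 0.
    """
    set_max_workers(0)
    deadline = time.monotonic() + timeout_seconds
    drained = False
    while time.monotonic() < deadline:
        if count_active_workers() == 0:
            drained = True
            break
        time.sleep(poll_seconds)
    set_max_workers(desired_max)
    return drained
```

As noted in the message above, this only forces workers onto the new version faster; it does not address why some workers sit on "loading container image from cache".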
also, might want to block out EU- regions as there is an active network issue there.
Hmm now EU too?
By "some", do you mean the stale workers or the latest ones?
oh gosh, looks like you are right. The workers that are stuck on loading from cache are on EU.
Oh they said OR... I was noticing same in EU.
Thanks!
That's kind of weird
I found that when that happens, if left alone, they will eventually pull the image and start, but it does take some time.
Yup, I disabled EU workers and all my workers are pulling as expected. Thanks a lot, guys!
@nerdylive Should I mention elsewhere that this same issue seems to be impacting EU?
That should be the normal behavior for rolling in production, yes.
Hmm yeah if you have experienced network related issues feel free to open a ticket
I keep getting these:
and when this happens then it NEVER loads! Wish this would get fixed. 😦
Do you have an endpoint ID or job ID for me to take a look?
I don't. I just deleted it and removed the region from my selection.
Yeah, it would be hard to look up without any ID. Save the ID along with the error next time. 🙏🏻
I am getting this issue again. My endpoint ID is lzpelslkrkfml2. This was trying to load on worker ID n1jxzk00as5yk0 in CA-MTL-1. When this happens it never loads; it just hangs in init forever.
It did finally load. It took 28 minutes and 17 seconds.
From the logs, it looks like you first sent a request that took a while, then you canceled it. After that, a new worker was created and terminated by you, and then the n1jxz worker was deployed. How large is your Docker image? Downloading the image can take some time. You might want to set your max workers to 2 or 3. This way, multiple workers can initialize at the same time, instead of just waiting on one.
81.5 GB image size currently. I plan to set more max workers when I go into production, but I tend to keep it at max 1 during development to reduce the delay between versions. I think I was being thrown off since it was not giving any updates during the entire 28 minutes. I did not expect it to ever finish. I plan to keep baking more and more models in until something breaks. The more I can load on a single endpoint, the fewer workers I need, and it should help with Flashboot.
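As a rough sanity check on those numbers, the figures quoted in this thread (an 81.5 GB image and a 28 minute 17 second wait) can be turned into an average transfer rate. This is only a back-of-the-envelope sketch: it assumes decimal gigabytes and that the whole wait was spent loading the image, which may not be true since extraction and scheduling also take time.

```python
def average_pull_rate_mb_per_s(image_gb: float, minutes: int, seconds: int) -> float:
    """Average rate in MB/s, assuming 1 GB = 1000 MB and that the
    entire wait was spent transferring the image (an assumption)."""
    total_seconds = minutes * 60 + seconds
    return image_gb * 1000 / total_seconds


rate = average_pull_rate_mb_per_s(81.5, 28, 17)
print(f"{rate:.1f} MB/s, about {rate * 8:.0f} Mbit/s")
# prints: 48.0 MB/s, about 384 Mbit/s
```

At roughly that rate, a wait of close to half an hour is consistent with a slow but working pull of an image this size rather than a true hang, which matches the worker eventually loading.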