RunPod 3mo ago
Hello

Stuck on "loading container image from cache"

Hi, I have updated my serverless endpoint release version, but some of my workers are stuck on "loading container image from cache" even though it's a new version that shouldn't exist in the cache to begin with. Any advice on how to solve this issue?
20 Replies
nerdylive 3mo ago
Use another tag: re-push the image under a new tag.
Hello OP 3mo ago
Thanks for such a quick response! I am using a brand-new tag; that's why I had to increment the release version accordingly. Some of the workers are pulling the image as expected, but some are just "loading container image from cache"... :/
Encyrption 3mo ago
This doesn't fix the issue, but you can quickly reload to the new version by setting max workers to 0 and then, once all workers have stopped, putting it back to the desired value. Also, you might want to block out EU- regions, as there is an active network issue there.
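The scale-to-zero workaround described above could be scripted roughly as follows. This is a minimal sketch, not RunPod's API: `set_max_workers` and `count_active_workers` are hypothetical callables that you would have to implement yourself against whatever interface you use to manage the endpoint, and the polling interval and timeout are arbitrary assumptions.

```python
import time
from typing import Callable


def cycle_workers(
    set_max_workers: Callable[[int], None],   # hypothetical: applies a new max-workers value
    count_active_workers: Callable[[], int],  # hypothetical: returns how many workers still exist
    desired_max: int,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> None:
    """Scale to zero, wait until every worker has stopped, then restore the maximum."""
    set_max_workers(0)
    deadline = time.monotonic() + timeout_seconds
    while count_active_workers() > 0:
        if time.monotonic() > deadline:
            raise TimeoutError("workers did not stop before the timeout")
        time.sleep(poll_seconds)
    # Only restore the original value once no old workers remain.
    set_max_workers(desired_max)
```

Passing the two operations in as callables keeps the sketch independent of any particular client library; on timeout it raises instead of restoring the value, so max workers stays at 0 until you intervene.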
nerdylive 3mo ago
Hmm, now EU too? By "some", do you mean stale workers, or the latest ones?
Hello OP 3mo ago
Oh gosh, looks like you are right. The workers that are stuck on loading from cache are in EU.
Encyrption 3mo ago
Oh, they said OR... I was noticing the same in EU.
Hello OP 3mo ago
Thanks!
nerdylive 3mo ago
That's kind of weird
Encyrption 3mo ago
I've found that when that happens, if left alone, they will eventually pull the image and start, but it does take some time.
Hello OP 3mo ago
Yup, I disabled EU workers and all my workers are pulling as expected. Thanks a lot, guys!
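Excluding a whole region family, as done above, amounts to filtering the endpoint's allowed regions by prefix. A minimal sketch in plain Python: the function name is hypothetical, and apart from `CA-MTL-1`, which appears later in this thread, the region IDs in the usage line are illustrative placeholders rather than a verified list.

```python
from typing import Iterable, List


def exclude_regions(regions: Iterable[str], blocked_prefixes: Iterable[str]) -> List[str]:
    """Return the regions whose ID does not start with any blocked prefix (case-insensitive)."""
    prefixes = tuple(p.upper() for p in blocked_prefixes)
    return [r for r in regions if not r.upper().startswith(prefixes)]


# Illustrative region IDs only; substitute the ones your endpoint actually offers.
allowed = exclude_regions(["EU-RO-1", "CA-MTL-1", "US-OR-1"], ["EU-"])
```

The resulting list would then be applied to the endpoint's region selection by hand or through whatever tooling you already use.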
Encyrption 3mo ago
@nerdylive Should I mention elsewhere that this same issue seems to be impacting EU?
nerdylive 3mo ago
That should be the normal behavior for rolling out in production, yes. Hmm, yeah, if you have experienced network-related issues, feel free to open a ticket.
Encyrption 3mo ago
I keep getting these:
2024-09-08T14:19:06Z loading container image from cache
2024-09-08T14:19:06Z loading container image from cache
and when this happens then it NEVER loads! Wish this would get fixed. 😦
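Since the worker log only repeats the same line, one way to tell "slow" from "stuck" is to measure how long the log has shown nothing else. The sketch below assumes the log format is exactly a UTC timestamp, a space, and the message, as in the two lines quoted above; the function names and the `now` value in the usage example are hypothetical.

```python
from datetime import datetime, timezone

CACHE_MESSAGE = "loading container image from cache"


def parse_line(line):
    """Split 'YYYY-MM-DDTHH:MM:SSZ message' into (aware datetime, message)."""
    stamp, _, message = line.strip().partition(" ")
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return when, message


def seconds_stuck(log_lines, now):
    """Seconds since the trailing run of cache messages began, or None if the log ends otherwise."""
    entries = [parse_line(line) for line in log_lines if line.strip()]
    if not entries or entries[-1][1] != CACHE_MESSAGE:
        return None
    first = entries[-1][0]
    for when, message in reversed(entries):
        if message != CACHE_MESSAGE:
            break
        first = when
    return (now - first).total_seconds()


log = [
    "2024-09-08T14:19:06Z loading container image from cache",
    "2024-09-08T14:19:06Z loading container image from cache",
]
# Hypothetical "current time" chosen for illustration only.
now = datetime(2024, 9, 8, 14, 47, 23, tzinfo=timezone.utc)
stuck_for = seconds_stuck(log, now)
```

A monitoring script could compare `stuck_for` against a threshold sized for the image before deciding to recycle the worker.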
yhlong00000 3mo ago
Do you have an endpoint ID or job ID for me to take a look at?
Encyrption 3mo ago
I don't. I just deleted it and removed the region from my selection.
yhlong00000 3mo ago
Yeah, it would be hard to look up without any ID. Save the ID with the error next time. 🙏🏻
Encyrption 3mo ago
I am getting this issue again. My endpoint ID is lzpelslkrkfml2. This was trying to load on worker ID n1jxzk00as5yk0 in CA-MTL-1. When this happens, it never loads; it just hangs in init forever.
Encyrption 3mo ago
It did finally load. It took 28 minutes and 17 seconds.
yhlong00000 3mo ago
From the logs, it looks like you first sent a request that took a while, then you canceled it. After that, a new worker was created and terminated by you, and then the n1jxz worker was deployed. How large is your Docker image? Downloading the image can take some time. You might want to set your max workers to 2 or 3. This way, multiple workers can initialize at the same time, instead of just waiting on one.
Encyrption 3mo ago
81.5 GB image size currently. I plan to set more max workers when I go into production, but I tend to keep max at 1 during development to reduce the delay between versions. I think I was being thrown off since it was not giving any updates during the entire 28 minutes. I did not expect it to ever finish. I plan to keep baking more and more models in until something breaks. The more I can load on a single endpoint, the fewer workers I need, and it should help with FlashBoot.
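For a sense of scale, the two numbers reported in this thread (an 81.5 GB image and an init time of 28 minutes 17 seconds) imply a modest average transfer rate. The arithmetic below is a rough sketch that assumes the whole 81.5 GB was transferred during that window and that GB means 10^9 bytes; in practice compressed layers and extraction time would change the figure.

```python
def average_throughput(image_gb, minutes, seconds):
    """Return (elapsed seconds, MB/s, Gbit/s) for pulling image_gb in the given time."""
    elapsed = minutes * 60 + seconds
    mb_per_s = image_gb * 1000 / elapsed   # decimal megabytes per second
    gbit_per_s = image_gb * 8 / elapsed    # gigabits per second
    return elapsed, mb_per_s, gbit_per_s


elapsed, mb_per_s, gbit_per_s = average_throughput(81.5, 28, 17)
print(f"{elapsed} s, {mb_per_s:.1f} MB/s, {gbit_per_s:.2f} Gbit/s")
# prints "1697 s, 48.0 MB/s, 0.38 Gbit/s"
```

Under those assumptions, a pull of roughly half an hour for an image this size is slow but plausible rather than a hang, which fits the advice above about initializing several workers in parallel.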