RunPod•8mo ago

Stuck on "loading container image from cache"

Hi, I have updated my serverless endpoint release version, but some of my workers are stuck on "loading container image from cache" even though it's a new version that shouldn't exist in the cache to begin with. Any advice on how to solve this issue?

20 Replies

Jason•8mo ago

Use another tag: re-push the image under a new image tag.

HelloOP•8mo ago

Thanks for such a quick response! I am using a brand new tag; that's why I had to increment the release version accordingly. Some of the workers are pulling the image as expected, but some are just "loading container image from cache"... :/

Encyrption•8mo ago

This doesn't fix the issue, but you can quickly reload to the new version by setting max workers to 0 and then, once all workers have stopped, putting it back to the desired value. Also, you might want to block out EU- regions, as there is an active network issue there.
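The reload trick described above (set max workers to 0, wait for all workers to stop, then restore the old value) can be sketched as a small helper. This is only a minimal sketch and not RunPod's API: `set_max_workers` and `count_workers` are hypothetical callables that you would have to wire up to whatever you use to manage the endpoint.

```python
import time
from typing import Callable


def cycle_workers(
    set_max_workers: Callable[[int], None],  # hypothetical: applies a new max-workers value
    count_workers: Callable[[], int],        # hypothetical: returns how many workers still exist
    desired_max: int,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> bool:
    """Scale to 0, wait for workers to stop, then restore desired_max.

    Returns True if all workers stopped before the timeout, False otherwise.
    The max-workers value is restored in both cases.
    """
    set_max_workers(0)
    deadline = time.monotonic() + timeout_seconds
    drained = False
    try:
        while True:
            if count_workers() == 0:
                drained = True
                break
            if time.monotonic() >= deadline:
                break
            time.sleep(poll_seconds)
    finally:
        # Always put the endpoint back, even if polling raised an error.
        set_max_workers(desired_max)
    return drained
```

Restoring the value in a `finally` block is a deliberate choice here, so that the endpoint is not left at 0 workers if the wait times out or fails.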

Jason•8mo ago

Hmm, now EU too? By "some", do you mean stale workers, or the latest ones?

HelloOP•8mo ago

Oh gosh, looks like you are right. The workers that are stuck on loading from cache are in EU.

Encyrption•8mo ago

Oh they said OR... I was noticing same in EU.

HelloOP•8mo ago

Thanks!

Jason•8mo ago

That's kind of weird

Encyrption•8mo ago

I found that when that happens, if left alone, they will eventually pull the image and start, but it does take some time.

HelloOP•8mo ago

Yup, I disabled EU workers and all my workers are pulling as expected. Thanks a lot, guys!

Encyrption•8mo ago

@nerdylive Should I mention elsewhere that this same issue seems to be impacting EU?

Jason•8mo ago

That should be the normal behavior for rolling out in production, yes. Hmm, yeah, if you have experienced network-related issues, feel free to open a ticket.

Encyrption•8mo ago

I keep getting these:

2024-09-08T14:19:06Z loading container image from cache

2024-09-08T14:19:06Z loading container image from cache

and when this happens then it NEVER loads! Wish this would get fixed. 😦
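To put a number on how long a worker has been sitting at this message, the log lines above can be parsed for their timestamps. This is a minimal sketch that assumes every line has the shape shown above (a UTC timestamp ending in `Z`, a space, then the message); the function name `seconds_stuck` is made up for illustration.

```python
from datetime import datetime, timezone

MARKER = "loading container image from cache"


def seconds_stuck(log_lines, now):
    """Seconds between the earliest marker line and `now`, or None if no marker line exists."""
    stamps = []
    for line in log_lines:
        # Split "2024-09-08T14:19:06Z message..." into timestamp and message.
        stamp, _, message = line.strip().partition(" ")
        if message == MARKER:
            parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
            stamps.append(parsed.replace(tzinfo=timezone.utc))
    if not stamps:
        return None
    return (now - min(stamps)).total_seconds()
```

Passing `datetime.now(timezone.utc)` as `now` gives the time elapsed so far, which is handy to attach to a support ticket together with the endpoint and worker IDs.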

yhlong00000•8mo ago

Do you have an endpoint ID or job ID for me to take a look at?

Encyrption•8mo ago

I don't. I just deleted it and removed the region from my selection.

yhlong00000•8mo ago

Yeah, it would be hard to look up without any ID. Please save the ID along with the error next time. 🙏🏻

Encyrption•8mo ago

I am getting this issue again. My endpoint ID is lzpelslkrkfml2. This was trying to load on worker ID n1jxzk00as5yk0 in CA-MTL-1. When this happens, it never loads; it just hangs in init forever.

Encyrption•8mo ago

It did finally load. It took 28 minutes and 17 seconds.

yhlong00000•8mo ago

From the logs, it looks like you first sent a request that took a while, then you canceled it. After that, a new worker was created and terminated by you, and then the n1jxz worker was deployed. How large is your Docker image? Downloading the image can take some time. You might want to set your max workers to 2 or 3. This way, multiple workers can initialize at the same time, instead of just waiting on one.

Encyrption•8mo ago

81.5 GB image size currently. I plan to set more max workers when I go into production, but I tend to keep it at max 1 during development to reduce the delay between versions. I think I was thrown off because it gave no updates during the entire 28 minutes, so I did not expect it to ever finish. I plan to keep baking more and more models in until something breaks. The more I can load on a single endpoint, the fewer workers I need, and it should help with FlashBoot.
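As a rough sanity check on those numbers, the effective pull rate can be worked out from the image size and the load time. This is a back-of-the-envelope sketch under loud assumptions: that the whole 28 minutes and 17 seconds was spent transferring data, that the full 81.5 GB was transferred (compressed layers may be smaller), and that GB means 10^9 bytes.

```python
def effective_throughput_mb_s(image_gb: float, minutes: int, seconds: int) -> float:
    """Average transfer rate in MB/s, assuming the full image size moved over the whole duration."""
    total_seconds = minutes * 60 + seconds
    return image_gb * 1000 / total_seconds


rate = effective_throughput_mb_s(81.5, 28, 17)
print(f"{rate:.1f} MB/s, {rate * 8:.0f} Mbit/s")  # prints "48.0 MB/s, 384 Mbit/s"
```

Under those assumptions the pull averaged roughly 48 MB/s, so a long wait is plausible for an image of this size even when nothing is actually stuck.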
