Runpod occasionally fails to pull from ECR
Every now and again I have issues starting a pod as it fails to pull from AWS ECR. Nothing in my setup changes.
6 Replies
I also have a bunch of entries in Container Registry Auth and I can't delete them.
I'm under the impression ECR tokens entirely expire after 12 hours. This came up as a feature request for us to streamline this process last night.
They do. However, with dstack, I automatically take care of this process and make sure I have a valid ECR password every time.
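For anyone following along, here is a rough sketch of what "a valid ECR password every time" involves. ECR's GetAuthorizationToken call returns a base64 string that decodes to `AWS:<password>`, along with an expiry timestamp. The helper names and the 30-minute refresh margin below are my own assumptions for illustration, not dstack's actual code.

```python
import base64
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def decode_ecr_token(authorization_token: str) -> Tuple[str, str]:
    """Split a base64 ECR authorization token into (username, password)."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    # The decoded form is "AWS:<password>"; split on the first colon only.
    username, _, password = decoded.partition(":")
    return username, password


def needs_refresh(
    expires_at: datetime,
    now: Optional[datetime] = None,
    margin: timedelta = timedelta(minutes=30),  # assumed safety margin
) -> bool:
    """Return True once the token is expired or within `margin` of expiring."""
    now = now or datetime.now(timezone.utc)
    return now >= expires_at - margin


# With boto3, the inputs would come from something like:
#   data = boto3.client("ecr").get_authorization_token()["authorizationData"][0]
#   username, password = decode_ecr_token(data["authorizationToken"])
#   stale = needs_refresh(data["expiresAt"])
```

Checking `needs_refresh` right before each pod start, rather than on a timer, avoids handing the registry a password that expires mid-pull.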
Looking into it, it looks like dstack creates Container Registry Auth entries with the username and password and then links the pull command to the correct entry. This could be an issue on their end, as it's not cleaning up old entries. I now have 6 of them (with the docs stating I can only have a max of 4) and I can't manually delete any of them. 😦
Update: I was able to delete my entries via API.
We worked with dstack and believe the issue is due to GraphQL not having an input for registry auth on pod creation/start, so you have to edit the pod afterwards and assign the registry auth... and we think it's a weird timing issue.
Looks like the REST API has this input, so they will change their automation to use REST to see if it resolves this issue.
As per this message, it looks like the containerRegistryAuthId input field for the REST API isn't working properly. We will have to wait for a fix on Runpod's side before we can test if that's the actual issue or not.
@nathaniel This one's all you ^
debugging this now
will update you when fix is found
Thank you!