No active workers after deploying New Release
Had 5 active workers. Deployed new release, which was quickly pulled. Shortly afterwards all workers went to "Initializing" state, fully shutting down the endpoint.
Would expect some workers to stay active so the endpoint can handle requests.
This is not the first time that this happened. As of now it is not stable to use this feature on production pods.
@Papa Madiator
Did you set active workers?
Afterwards, yes, to see if it changes anything. Usually 0.
If you click on a worker, you should be able to see whether it is, for example, pulling the Docker image.
Are you using version tags?
This kind of behavior usually happens when you don't use proper version tags for your images.
10/11 have the new image. It has a version tag, although I'm not sure what the proper one would be.
And if you push to the same tag, this will also happen. Each deployment should have a different tag.
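A minimal sketch of the "different tag per deployment" advice, assuming you build the tag yourself before pushing. The function name `build_image_ref`, the repository name, and the idea of appending a build ID (for example a short commit hash) are illustrative assumptions, not part of any RunPod or Docker API:

```python
import re


def build_image_ref(repository: str, version: str, build_id: str) -> str:
    """Return an image reference whose tag is unique per build.

    Hypothetical helper: combines a version with a build ID so that
    two deployments never reuse the same tag.
    """
    tag = f"{version}-{build_id}"
    # Docker tags may contain letters, digits, '_', '.' and '-',
    # must not start with '.' or '-', and are at most 128 characters.
    if not re.fullmatch(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}", tag):
        raise ValueError(f"invalid Docker tag: {tag!r}")
    return f"{repository}:{tag}"


# Example with made-up values:
print(build_image_ref("myorg/worker", "1.4.0", "a1b2c3d"))
```

Each release then gets its own reference (for example `myorg/worker:1.4.0-a1b2c3d`), so a new deployment never pushes over an existing tag.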
The two images have two different tags yes
Hmm, then this should not happen
I would set all workers to 0, wait for all of them to be deleted, and then spawn new ones.
happens consistently.
deploy new release -> all workers shut down (effectively shutting down the endpoint for 15-20 min)
scary to use this feature for production
@Papa Madiator @haris
endpoint shutdown, workers active (billed for) and downloading the new image, queue is not being handled
this is really bad tbh
Another solution would be to manually kill X number of workers
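A rough sketch of what killing X workers at a time could look like as a plan, assuming you do the actual killing by hand in the console. It only splits worker IDs into batches so that some workers always stay up; `replacement_batches`, `max_unavailable`, and the worker IDs are hypothetical names, and nothing here calls a RunPod API:

```python
from typing import Iterator, List


def replacement_batches(worker_ids: List[str], max_unavailable: int) -> Iterator[List[str]]:
    """Yield groups of workers to replace, at most max_unavailable at once."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    if worker_ids and max_unavailable >= len(worker_ids):
        # Replacing every worker in one batch is exactly the outage to avoid.
        raise ValueError("max_unavailable would take all workers offline")
    for start in range(0, len(worker_ids), max_unavailable):
        yield worker_ids[start:start + max_unavailable]


# Example with made-up worker IDs: replace 5 workers, 2 at a time.
for batch in replacement_batches(["w1", "w2", "w3", "w4", "w5"], 2):
    print("kill and wait for replacement:", batch)
```

The idea is to wait until each batch is back and serving requests before starting the next one, so the queue keeps being handled while the new image is pulled.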
new release, again all workers shut down, can't deploy new images on same endpoint.
Did you use the same tag?
nope
I didn't even know this was a thing, was so used to the queue building up for every new Docker pull for 20 minutes 🤣
@Papa Madiator Hi, this issue is still present, shuts down all workers.
Is there a chance this will be looked into by devs, or is there a way to reach out otherwise? In support chat I was told this will be escalated (~2wks ago), no status update yet.
What's the ticket ID?
Didn't get such a thing
Could you show the logs of an initializing worker?
Click on the box of a worker; it will have a logs button of its own inside.
i have the same problem (without any update to workers), everything stuck in initializing.. we're planning a marketing push for our app in the coming weeks. Sorta scary this happens in production :/
hmm, the GPU i had is now unavailable. so that explains it
Hi @pazanchick, would you be able to show me the configuration for your endpoint?
Hi @haris anything specific? Docker image ~90GB
@Alpay Ariyak they usually log the Docker image download progress bar.
Issues that I encounter are either:
- Worker is set active and billed for while downloading the image
- All workers initialize at the same time, shutting down the endpoint
Could you show the logs next time it happens, please?
I guess it doesn't matter anymore.
used new release feature.
all workers start updating at same time, shutting down the endpoint.
@Alpay Ariyak @Papa Madiator
I'm not sure if mail would be the way to reach out to RunPod. Is there some other way? I tried via website chat and was told that it would be "escalated" maybe a month ago. No status update yet.
Do you have a ticket ID?
4474 (I guess. That number was in the mail, but it didn't mention being a ticket ID)