Long wait time for Serverless deployments
Hi, perhaps someone can help. We've got various workloads running on Runpod. We deploy to Runpod using SST. Updating a template with the new image to deploy works great in our CI (GitHub Actions). Once we've deployed our updated code to Runpod, we want to validate that our application is working, so we run some tests by invoking the endpoint and asserting on various outputs produced. We do this with preview environments (per pull request) and in a staging environment on the way out to production. We realised a while ago that our tests would often be running against an older version of the code, since Runpod hadn't had a chance to pull the newer image from Docker Hub.
Our solution to this was to add a job in the CI that would repeatedly call the /runsync endpoint until the SHA (baked into the image) matched the one the CI was currently running against, and only then move on to the testing stage, once we were certain we would be testing the latest version of the code. This mostly worked, with an occasional timeout here and there. Our configuration was:
Recently, however, we made a change to set minWorkers: 0, as we were spending a bit much on endpoints left lying around, and instead opted for a generous idleTimeout of 10 minutes for all non-production endpoints.
The problem is that since setting minWorkers to 0, our "wait-for-runpod" job is timing out more often than it succeeds. Sometimes a second run will go through, but sometimes not. Deleting the stale workers for the endpoint in the Runpod Dashboard also seems to get our CI moving again. The stall often coincides with an "image pull pending" message on the "stuck" worker. We have 3 endpoints that we need to wait for before running tests. All images are between 7.5 GB and 8.5 GB.
My questions:
1. How do I know reliably when an endpoint will start processing requests on workers that are running the latest image?
2. Is there a better way to do what I'm doing?
For clarity, this is a simplified code sample to better illustrate what our "wait-for-runpod" job does. We run this code in a GitHub Action for every endpoint, right after updating its template.
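A minimal Python sketch of such a polling loop (an illustration, not the job's actual code) could look like this. It assumes the handler echoes the baked-in SHA back as output["sha"], which is a hypothetical response shape, and that the endpoint is reached through the https://api.runpod.ai/v2/<endpoint id>/runsync route with a bearer API key. The names runsync and wait_for_sha, the timeouts, and the polling interval are all made up for the example.

```python
import json
import time
import urllib.request


def runsync(endpoint_id, api_key, payload, timeout_s=90.0):
    """POST to the endpoint's /runsync route and return the decoded JSON body."""
    req = urllib.request.Request(
        f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        data=json.dumps({"input": payload}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)


def wait_for_sha(invoke, expected_sha, timeout_s=900.0, interval_s=15.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Call invoke() until its output reports expected_sha.

    Returns True on a match and False once the time budget is used up.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            body = invoke()
            output = body.get("output")
            # Assumption: the handler returns the image's SHA as output["sha"].
            if (body.get("status") == "COMPLETED"
                    and isinstance(output, dict)
                    and output.get("sha") == expected_sha):
                return True
        except (OSError, ValueError):
            # Network/HTTP errors (OSError) and malformed JSON (ValueError)
            # are treated as "not ready yet" and retried.
            pass
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

In CI this might be called as wait_for_sha(lambda: runsync(endpoint_id, api_key, {}), os.environ["GITHUB_SHA"]), exiting non-zero when it returns False. The HTTP call is passed in as a callable so the loop can be exercised without touching a real endpoint.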