Created by Vix on 4/28/2024 in #⚡|serverless
Expose S3 boto client retry config for endpoints (Dreambooth, etc.)
Note: posting this here since I can't post in the "feature requests" section.
I'm currently experiencing rate-limiting issues (and have in the past) when trying to upload the result of a job to my S3 bucket.
In my case, it is with Backblaze B2 and the infamous "ServiceUnavailable, no tomes available" error, but this has also happened with other providers.
Problem: When using the Dreambooth endpoint with an S3 bucket configured to receive the trained model, the S3 service sometimes fails due to rate limiting and the whole job ends in error, losing all of the training. AFAIK, this is expected behavior on the provider's side, and the caller is supposed to retry the request with exponential backoff.
Possible solution or mitigation: Add another field to the s3 config object that lets you set the number of retries. Currently, the "runpod-python" lib uses the standard 3-retry setup, but the option to increase this number (since boto already applies an exponential backoff algorithm between attempts) might allow an S3 upload to complete even when the service is overloaded; see the sketch below. Hopefully, this would help prevent unnecessary failed jobs.
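For reference, boto3 already exposes this knob through botocore's Config object, so the change would mostly be plumbing a user-supplied value through. A minimal sketch of what I mean (the make_s3_client name and its parameters are my own illustration, not runpod-python's actual API; the botocore retry settings themselves are real):

```python
# Sketch: build an S3 client with a configurable retry count.
# NOTE: make_s3_client and its signature are hypothetical; only the
# boto3/botocore calls below reflect the real library API.
import boto3
from botocore.config import Config

def make_s3_client(endpoint_url, access_id, access_secret, max_attempts=3):
    """Return an S3 client whose built-in retry handler uses exponential backoff."""
    retry_config = Config(
        retries={
            "max_attempts": max_attempts,  # raise this when the provider rate-limits
            "mode": "standard",  # "standard" mode retries with exponential backoff
        }
    )
    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_id,
        aws_secret_access_key=access_secret,
        config=retry_config,
    )
```

The endpoint's s3 config could then carry an optional retries field (defaulting to the current 3) that gets passed through as max_attempts, so a transient "ServiceUnavailable" is absorbed instead of failing the whole job.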