Vix · 3mo ago

Expose S3 boto client retry config for endpoints (Dreambooth, etc)

Note: posting this here since I can't post in the "feature requests" section.

I'm currently experiencing (and have in the past) rate-limiting issues when uploading the result of a job to my S3 bucket. In my case it's Backblaze B2 and the infamous "ServiceUnavailable, no tomes available" error, but this has also happened with other providers.

Problem: When using the Dreambooth endpoint with an S3 bucket set up to receive the trained model, the S3 service sometimes fails due to rate limiting and the whole job ends in error, losing all the training. AFAIK this is expected behavior on the provider's side, and the caller is supposed to retry the request with exponential backoff.

Possible solution or mitigation: Add another field to the s3 object that lets you set the number of retries. The runpod-python lib currently uses the standard 3-retry setup, but being able to raise that number (since boto uses an exponential backoff algorithm) might allow the S3 upload to complete even when the service is overloaded. Hopefully this helps prevent unnecessary failed jobs.
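To illustrate what I mean (a minimal sketch, not the actual runpod-python code): boto3 already exposes this knob via botocore's Config, so the change would mostly be plumbing a user-supplied value through. The max_retries variable, B2 endpoint URL, and credentials below are placeholders; the retries options themselves are documented botocore behavior.

```python
import boto3
from botocore.config import Config

max_retries = 10  # hypothetical value taken from the job's s3 config

# "standard" retry mode retries throttling/5xx errors with exponential
# backoff and jitter, up to max_attempts total attempts (default is 3).
retry_config = Config(
    retries={
        "max_attempts": max_retries,
        "mode": "standard",
    }
)

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-000.backblazeb2.com",  # placeholder B2 endpoint
    aws_access_key_id="<ACCESS_KEY_ID>",
    aws_secret_access_key="<SECRET_ACCESS_KEY>",
    config=retry_config,
)

# A ServiceUnavailable from B2 would now be retried up to max_retries
# times with backoff before the upload (and the job) fails.
s3.upload_file("model.safetensors", "my-models", "model.safetensors")
```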
2 Replies
Madiator2011 · 3mo ago
For that you'd probably want to submit an issue on the runpod-python repo.
Vix · 3mo ago
I was thinking of doing that, but the Dreambooth endpoint would also need to be updated to expose the option in its API, something like the sketch below.
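For reference, a hedged sketch of what the request could look like if that field existed. The "maxRetries" key is the proposed addition and does not exist in the API today; the endpoint ID is an example, and the other s3Config keys are my best recollection of the serverless S3 upload schema:

```python
import requests

payload = {
    "input": {
        # ... usual Dreambooth training parameters ...
    },
    "s3Config": {
        "bucketName": "my-models",
        "endpointUrl": "https://s3.us-west-000.backblazeb2.com",
        "accessId": "<ACCESS_KEY_ID>",
        "accessSecret": "<SECRET_ACCESS_KEY>",
        "maxRetries": 10,  # proposed field: forwarded to the boto client's retry config
    },
}

resp = requests.post(
    "https://api.runpod.ai/v2/dream-booth-v1/run",  # example endpoint ID
    headers={"Authorization": "Bearer <RUNPOD_API_KEY>"},
    json=payload,
)
print(resp.json())
```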