delay time
I have a serverless endpoint configured with 15 max workers. However, I notice that only about three of them are actually usable. My workload is configured to time out if it takes longer than a minute to process.
The other workers randomly have issues, such as timing out when attempting to return job data or failing to run entirely and having to be retried on a different worker, which leads to delay/execution times of over 2-3 minutes.
When I execute 6 different jobs, they all have very different delay times. Some worker IDs consistently have low delay times, but some randomly take forever. Is there anything I can do to reduce this randomness? Also, can I delete/blacklist the workers that perform poorly?
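One way to see which workers are the slow ones is to submit a few test jobs and record what the endpoint's status API reports for each. Below is a rough Python sketch, not an official recipe: it assumes a placeholder ENDPOINT_ID, an API key in the RUNPOD_API_KEY environment variable, a made-up {"prompt": ...} input shape, and that the /status response includes workerId, delayTime and executionTime fields.

```python
import os
import time
import requests

# Placeholder values -- swap in your own endpoint ID and API key.
ENDPOINT_ID = "your_endpoint_id"
API_KEY = os.environ["RUNPOD_API_KEY"]
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def submit_job(payload):
    # Queue a job asynchronously and return its job id.
    resp = requests.post(f"{BASE}/run", json={"input": payload}, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()["id"]

def wait_for_job(job_id, poll_seconds=2):
    # Poll /status until the job reaches a terminal state, then return the status blob.
    while True:
        resp = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS)
        resp.raise_for_status()
        status = resp.json()
        if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
            return status
        time.sleep(poll_seconds)

# Submit a handful of jobs and log which worker served each one,
# along with its queue delay and execution time (reported in ms),
# so consistently slow worker IDs stand out.
job_ids = [submit_job({"prompt": f"test {i}"}) for i in range(6)]
for job_id in job_ids:
    status = wait_for_job(job_id)
    print(
        job_id,
        status.get("workerId"),
        status.get("delayTime"),
        status.get("executionTime"),
        status.get("status"),
    )
```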
7 Replies
You can terminate the ones that are behaving badly, but unfortunately there's no way to blacklist them. I have also experienced similar behavior 😦
My execution time is usually around 40s and my timeout is 5 mins, and that 5 min timeout was hit pretty recently.
I suggest logging a ticket for this on the website. I logged a ticket and I'm still waiting for RunPod to get back to me.
lmk how it went
Maybe it's just the cold start time, if those are the same workers.
How does the cold start time differ so much between workers, though?
It's already been like 3 days and just crickets from RunPod as usual 😢
😦
This error happens randomly and causes the job to fail:
{"requestId": "e617c5c9-b14c-42c6-886e-ec35f1b05bc9-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/rtqb8oacytm879/job-done/oh8mcc8cdhv1cx/e617c5c9-b14c-42c6-886e-ec35f1b05bc9-u1?gpu=NVIDIA+GeForce+RTX+4090&isStream=false", "level": "ERROR"}
This is a different issue, log a support ticket on the website for it.