RunPod (4mo ago)
yasyf

524 Timeouts when waiting for new serverless messages

After my async Python serverless handler finishes one request, I start getting these errors on that box:
2024-09-26T22:11:55.344188433Z {"requestId": null, "message": "Failed to get job, status code: 524", "level": "ERROR"}
2024-09-26T22:11:55.344188433Z {"requestId": null, "message": "Failed to get job, status code: 524", "level": "ERROR"}
This seemingly prevents the auto-shutdown after N seconds from happening, so our runners stay up forever. One example is zpatg26htp69og.
yhlong00000 (4mo ago)
After reviewing the log, it looks like your worker remains active for a short period after completing the task. I assume you have an idle timeout configured? Each of your requests finishes quickly, and once the worker completes the task, it checks the queue for new tasks. The issue you mentioned might be due to a temporary network problem. Have you been seeing this error frequently? Most of the errors I observed occur when you’re checking the job result after 30 minutes. By that time, the results are no longer stored in our system, so you’ll need to retrieve them a bit sooner.
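For the point about fetching results before they expire, here is a minimal polling sketch. It is not from the thread: the status URL shape, the terminal status names, and the helper names `fetch_status` and `wait_for_result` are assumptions to check against the RunPod API docs, and the 25-minute deadline is just an illustrative margin under the 30 minutes mentioned above.

```python
import json
import time
import urllib.request

# Assumed URL shape for the serverless status endpoint; verify against the docs.
STATUS_URL = "https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}"

# Assumed names of the statuses after which a job will not change again.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def fetch_status(endpoint_id, job_id, api_key):
    """Hypothetical helper: GET the job status and return it as a dict."""
    request = urllib.request.Request(
        STATUS_URL.format(endpoint_id=endpoint_id, job_id=job_id),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_result(fetch, poll_seconds=5.0, deadline_seconds=25 * 60,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until the job is terminal or the deadline has passed."""
    start = clock()
    while True:
        status = fetch()
        if status.get("status") in TERMINAL:
            return status
        if clock() - start >= deadline_seconds:
            raise TimeoutError("job did not finish before the polling deadline")
        sleep(poll_seconds)
```

Usage would look something like `wait_for_result(lambda: fetch_status(endpoint_id, job_id, api_key))`, started right after submitting the job rather than half an hour later.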
yasyf (OP, 4mo ago)
yeah, I understand all of that, but the 524 happens very reproducibly and very frequently, so I don't think it's a temporary network problem. The result is that the idle timeout is not respected and the worker stays alive longer than it should.
flash-singh (4mo ago)
are you using LLMs? we have new SDK releases planned to reduce the amount of traffic for workers and reduce 524s from Cloudflare
yasyf (OP, 4mo ago)
yes, using LLMs. ok cool, will keep an eye out for that. anything else to do in the interim?
flash-singh (4mo ago)
you can reduce the concurrency, what's the value for that?
yasyf (OP, 4mo ago)
it's 4 right now. what's the recommended value?
yhlong00000 (4mo ago)
you mean this value is 4?
[attached screenshot, no description]
yasyf (OP, 4mo ago)
oh I'm not using vLLM, I meant the concurrency_modifier
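For context, a rough sketch of how a lower `concurrency_modifier` might be wired into an async handler with the Python SDK. The thread never gives a recommended value, so `MAX_CONCURRENCY = 2` is a hypothetical cap picked only to be below the 4 mentioned above; the handler body and the `start_worker` name are placeholders as well.

```python
import asyncio

# Hypothetical cap; the thread does not state a recommended value.
MAX_CONCURRENCY = 2


async def handler(job):
    """Placeholder async handler; the real one would call the LLM here."""
    await asyncio.sleep(0)  # stand-in for awaiting the model
    return {"echo": job["input"]}


def concurrency_modifier(current_concurrency):
    """Return how many jobs this worker may process at the same time."""
    return MAX_CONCURRENCY


def start_worker():
    """Start the serverless loop; call this from the worker entry point."""
    import runpod  # imported here so the sketch loads without the SDK installed

    runpod.serverless.start({
        "handler": handler,
        "concurrency_modifier": concurrency_modifier,
    })
```

Returning a constant keeps the behavior easy to reason about while debugging the 524s; a modifier that scales with load could replace it once the idle timeout works again.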
yhlong00000 (4mo ago)
Ok, in any case, try the new version of the SDK, 1.7.1; it improves batch requests. If you’re still seeing the issue, feel free to record a quick video and share it here.
