RunPod•6d ago
andypotato

Async workers not running

When using the /run endpoint I will receive the usual response:
{
"id": "d0e6d88c-8274-4554-bb6a-0a469361ae20-e1",
"status": "IN_QUEUE"
}
However, the job never gets processed, despite there being available workers. Some observations:
- A worker will spin up and go into "running" status, but rp_handler.py is never executed
- When I check the status of the job via /status/<jobId>, the job will immediately start running
- I can reproduce the exact same behavior with the local test version, so this is not limited to cloud usage
- Running the exact same worker with /runsync works without problems

Using the runpod sdk 1.7.7. How can I solve this issue?
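For reference, the flow described above boils down to something like the following sketch: submit a job via /run, then poll /status/<jobId>. The endpoint ID, API key, and input payload here are placeholders, not the actual worker's schema.

```python
import time
import requests

# Placeholders - substitute your own endpoint ID and API key.
ENDPOINT_ID = "YOUR_ENDPOINT_ID"
API_KEY = "YOUR_RUNPOD_API_KEY"
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit the job asynchronously via /run; this returns immediately with IN_QUEUE.
run_resp = requests.post(
    f"{BASE_URL}/run",
    headers=HEADERS,
    json={"input": {"prompt": "test"}},  # illustrative payload
)
job_id = run_resp.json()["id"]
print("submitted:", job_id)

# Poll /status/<jobId> until the job reaches a terminal state.
while True:
    status_resp = requests.get(f"{BASE_URL}/status/{job_id}", headers=HEADERS)
    status = status_resp.json()["status"]
    print("status:", status)
    if status in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(2)

print(status_resp.json().get("output"))
```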
7 Replies
Dj
Dj•6d ago
Can you share your endpoint id? I'd be happy to take a look here
andypotato
andypotatoOP•6d ago
Here is my endpoint ID: os1z7gv7hgacgd
yhlong00000
yhlong00000•6d ago
The worker 2bl9oiybd9v2yk has a machine issue. Do you mind terminating it and trying again?
andypotato
andypotatoOP•6d ago
@yhlong00000 I don't think this is a single-machine issue. As noted in the observations I shared in my report, the issue has occurred on every worker, including on my own local machine when testing the container. The only reason I deployed this container to RunPod was to check whether the issue could be reproduced on the cloud, and it can.

@yhlong00000 @Dj I have tried the same endpoint again and spawned a worker atjrfyd1c9zzgz. The result is exactly the same: the worker starts running and simply wastes credits without ever executing rp_handler.py.

This is a serious issue because it completely breaks running workers asynchronously. I really hope you can look into this as soon as possible. If you need any support from my end with testing, I am happy to provide it.
yhlong00000
yhlong00000•5d ago
Sorry for the misunderstanding earlier. I checked the logs and initially thought it was a machine issue because some containers were failing to start. However, in your case it turns out to be a CUDA version issue: your program requires CUDA 12.6 or newer to run. I tested your endpoint with CUDA 12.6+, and it seems to fix the issue. You can take a look and verify.
yhlong00000
yhlong00000•5d ago
(screenshot attached, no description)
andypotato
andypotatoOP•5d ago
Hey, that's interesting. Thanks for pointing that out. I wasn't even aware there was an option for "allowed CUDA versions". I can confirm this is now working as expected, thank you for that 🫶

I will further investigate this issue on my local system; it could be related to CUDA versions there too. If that's the case, maybe a note in the docs or an error message to the user would be helpful. It is otherwise pretty much impossible to debug this issue, as no logs are generated on the worker console.
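As a rough sketch of the kind of early error message suggested above, the handler script could log the CUDA runtime the worker actually sees before starting the serverless loop, so a version mismatch shows up in the worker console instead of failing silently. This assumes PyTorch is installed in the image; the check itself and the threshold are illustrative, not RunPod's actual behavior.

```python
import runpod


def check_cuda():
    """Log the CUDA runtime visible inside the container at startup."""
    try:
        import torch
        if not torch.cuda.is_available():
            print("WARNING: CUDA is not available inside this container.")
        else:
            print(f"CUDA runtime (torch build): {torch.version.cuda}, "
                  f"device: {torch.cuda.get_device_name(0)}")
    except ImportError:
        print("torch not installed; skipping CUDA check.")


def handler(job):
    # Actual job logic would go here; this just echoes the input.
    return {"echo": job.get("input")}


check_cuda()
runpod.serverless.start({"handler": handler})
```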
