RunPod•6d ago
andypotato

Async workers not running

When using the /run endpoint I will receive the usual response:
{
"id": "d0e6d88c-8274-4554-bb6a-0a469361ae20-e1",
"status": "IN_QUEUE"
}
However, the job never gets processed, despite there being available workers. Some observations:
- A worker will spin up and go into "running" status, but rp_handler.py is never executed
- When I check the status of the job via /status/<jobId>, the job will immediately start running
- I can reproduce the exact same behavior with the local test version, so this is not limited to cloud usage
- Running the exact same worker with /runsync works without problems

Using the runpod sdk 1.7.7. How can I solve this issue?
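For reference, the flow described above boils down to something like the following sketch: submit a job via /run, then poll /status/<jobId>. The endpoint ID, API key, and input payload here are placeholders, not the actual worker's schema.

```python
import time
import requests

# Placeholders - substitute your own endpoint ID and API key.
ENDPOINT_ID = "YOUR_ENDPOINT_ID"
API_KEY = "YOUR_RUNPOD_API_KEY"
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit the job asynchronously via /run; this returns immediately with IN_QUEUE.
run_resp = requests.post(
    f"{BASE_URL}/run",
    headers=HEADERS,
    json={"input": {"prompt": "test"}},  # illustrative payload
)
job_id = run_resp.json()["id"]
print("submitted:", job_id)

# Poll /status/<jobId> until the job reaches a terminal state.
while True:
    status_resp = requests.get(f"{BASE_URL}/status/{job_id}", headers=HEADERS)
    status = status_resp.json()["status"]
    print("status:", status)
    if status in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(2)

print(status_resp.json().get("output"))
```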
7 Replies
Dj
Dj•6d ago
Can you share your endpoint id? I'd be happy to take a look here
andypotato
andypotatoOP•6d ago
Here is my endpoint ID: os1z7gv7hgacgd
yhlong00000
yhlong00000•6d ago
The worker 2bl9oiybd9v2yk has a machine issue. Do you mind terminating it and trying again?
andypotato
andypotatoOP•6d ago
@yhlong00000 I don't think this is a single-machine issue. As noted in the observations I shared in my report, the issue has occurred on every worker, including on my own local machine when testing the container. The only reason I deployed this container to RunPod was to check whether the issue could be reproduced on the cloud, and it can.

@yhlong00000 @Dj I have tried the same endpoint again and spawned a worker atjrfyd1c9zzgz. The result is exactly the same: the worker starts running and simply wastes credits without ever executing rp_handler.py.

This is a serious issue because it completely breaks running workers asynchronously. I really hope you can look into this as soon as possible. If you need any support from my end with testing, I am happy to provide it.
yhlong00000
yhlong00000•5d ago
Sorry for the misunderstanding earlier. I checked the logs and initially thought it was a machine issue because some containers were failing to start. However, in your case it turns out to be a CUDA version issue: your program requires CUDA 12.6 or newer to run. I tested your endpoint with CUDA 12.6+, and it seems to fix the issue. You can take a look and verify.
yhlong00000
yhlong00000•5d ago
(screenshot attached, no description)
andypotato
andypotatoOP•5d ago
Hey, that's interesting. Thanks for pointing that out. I wasn't even aware there was an option for "allowed CUDA versions". I can confirm this is now working as expected, thank you for that 🫶

I will further investigate this issue on my local system; it could be related to CUDA versions there too. If that's the case, maybe a note in the docs or an error message to the user would be helpful. It is otherwise pretty much impossible to debug this issue, as no logs are generated on the worker console.
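As a rough sketch of the kind of early error message suggested above, the handler script could log the CUDA runtime the worker actually sees before starting the serverless loop, so a version mismatch shows up in the worker console instead of failing silently. This assumes PyTorch is installed in the image; the check itself and the threshold are illustrative, not RunPod's actual behavior.

```python
import runpod


def check_cuda():
    """Log the CUDA runtime visible inside the container at startup."""
    try:
        import torch
        if not torch.cuda.is_available():
            print("WARNING: CUDA is not available inside this container.")
        else:
            print(f"CUDA runtime (torch build): {torch.version.cuda}, "
                  f"device: {torch.cuda.get_device_name(0)}")
    except ImportError:
        print("torch not installed; skipping CUDA check.")


def handler(job):
    # Actual job logic would go here; this just echoes the input.
    return {"echo": job.get("input")}


check_cuda()
runpod.serverless.start({"handler": handler})
```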
