RunPod•3w ago
chuunizzz

Intermittent Pod Issues: CUDA Errors and Pod Unresponsiveness

I'm experiencing intermittent but frequent issues with my pod running on the runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 image. The pod becomes unresponsive in a way that resembles a crash, but without actually showing as down in the dashboard.

Problem Description:
- When the issue occurs, Jupyter Lab opens but shows no folders/files
- ComfyUI fails to start with CUDA errors (logs below)
- Basic commands like nvidia-smi don't work
- Restarting the pod temporarily resolves the issue
- This happens frequently, despite no changes to ComfyUI or plugins

Error logs when trying to run ComfyUI:
[2025-04-05 22:23:38.296] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS
The nginx logs also show:
2025/04/05 20:09:35 [error] 322#322: *13468 upstream timed out (110: Unknown error) while connecting to upstream, client: ****, server: _, request: "POST /prompt HTTP/1.1", upstream: "****", host: "****"
This began occurring recently, even though my setup was previously stable. There have been no changes to ComfyUI or its plugins.

Questions:
1. What might be causing this issue? Is it related to CUDA, GPU allocation, or something else?
2. Are there any logs I should check to better diagnose the problem?
3. Is there anything I can do to prevent these failures or make the pod more stable?
4. Is this a known issue with the PyTorch 2.4.0 image?

Any help would be greatly appreciated as this is disrupting my workflow significantly. Thank you!
4 Replies
Jason•3w ago
Maybe it's a problem with the server. Try reporting here. The nginx logs probably just show that ComfyUI is down or erroring. Wait, maybe it's a driver mismatch. Did you filter the driver to 12.4 and up when creating the pod?
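One way to check for the driver mismatch suggested here is to compare the CUDA version the host driver reports with the 12.4 that the image targets. This is a minimal sketch, assuming `nvidia-smi` is on the PATH and prints a `CUDA Version: X.Y` field in its header, as recent drivers do. The helper names `parse_cuda_version` and `driver_supports` are made up for illustration. When the pod is in the broken state described in this thread, `nvidia-smi` may not run at all, and the check then returns None rather than an answer.

```python
import re
import shutil
import subprocess

# CUDA version targeted by the image named in this thread (cuda12.4.1).
REQUIRED = (12, 4)


def parse_cuda_version(smi_output):
    """Extract (major, minor) from the 'CUDA Version: X.Y' header field, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_supports(required=REQUIRED):
    """Return True/False when the driver's CUDA version is readable, else None."""
    if shutil.which("nvidia-smi") is None:
        return None  # command not found in this container
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # nvidia-smi hung or could not be executed
    version = parse_cuda_version(result.stdout)
    if version is None:
        return None  # output did not contain the expected field
    return version >= required


if __name__ == "__main__":
    print("driver supports CUDA 12.4 or newer:", driver_supports())
```

A result of False would point toward a host whose driver is older than the image expects. None only says the check could not be completed, which by itself matches the symptom of `nvidia-smi` not working.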
chuunizzz (OP)•3w ago
No, there might be a misunderstanding. I'm not using serverless, I'm using an on-demand pod with a savings plan. Maybe I should migrate my server to another pod. Would that help?
Jason•3w ago
Sorry, yes, you can also select the CUDA version when creating a pod. Yes, try it.
chuunizzz (OP)•3w ago
weird... after I reported this issue, it never happened again..😂
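Since the thread ends without a root cause, a small health check run when the pod misbehaves could at least record what PyTorch sees at that moment. The sketch below is hypothetical, not something posted in the thread: `cuda_health` is an illustrative name, and it assumes PyTorch may or may not be importable in the environment where it runs. Depending on the PyTorch version, the `cudaGetDeviceCount()` failure quoted above may show up as a warning with `is_available()` returning False rather than as an exception, so both cases are handled.

```python
def cuda_health():
    """Return a dict describing whether PyTorch can use the GPU right now."""
    report = {
        "torch_installed": False,
        "cuda_available": False,
        "device_count": 0,
        "error": None,
    }
    try:
        import torch
    except (ImportError, OSError) as exc:
        report["error"] = f"torch not importable: {exc}"
        return report
    report["torch_installed"] = True
    try:
        report["cuda_available"] = bool(torch.cuda.is_available())
        report["device_count"] = int(torch.cuda.device_count())
        if report["cuda_available"]:
            # A tiny allocation forces CUDA initialization, which is where
            # a RuntimeError like the one in the logs would be raised.
            torch.zeros(1, device="cuda")
    except RuntimeError as exc:
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in cuda_health().items():
        print(f"{key}: {value}")
```

Saving this output along with the time of each failure would give support something concrete to correlate with host-side logs.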
