RunPod
Created by chuunizzz on 4/7/2025 in #⛅|pods-clusters
Intermittent Pod Issues: CUDA Errors and Pod Unresponsiveness
I'm experiencing intermittent but frequent issues with my pod running on the runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 image. The pod becomes unresponsive in a way that resembles a crash, but without actually showing as down in the dashboard.

Problem Description:
- When the issue occurs, Jupyter Lab opens but shows no folders/files
- ComfyUI fails to start with CUDA errors (logs below)
- Basic commands like nvidia-smi don't work
- Restarting the pod temporarily resolves the issue
- This happens frequently, despite no changes to ComfyUI or its plugins

Error logs when trying to run ComfyUI:
[2025-04-05 22:23:38.296] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS
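For reference, a minimal check like the one below (just a sketch using the PyTorch bundled with the image) isolates the CUDA runtime from ComfyUI; torch.cuda.device_count() goes through the same cudaGetDeviceCount() call shown in the traceback above, so presumably it hits the same Error 304 while the pod is in this state.

# Minimal CUDA sanity check run inside the pod (sketch; assumes the image's bundled PyTorch).
# torch.cuda.device_count() uses the same cudaGetDeviceCount() call that ComfyUI is
# failing on, so it separates a broken GPU/driver state from a ComfyUI-specific problem.
import torch

try:
    print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
    print("cuda available:", torch.cuda.is_available())
    print("device count:", torch.cuda.device_count())
    if torch.cuda.is_available():
        print("device 0:", torch.cuda.get_device_name(0))
except RuntimeError as e:
    # Presumably raises the "Error 304: OS call failed or operation not supported
    # on this OS" error shown above when the pod is in the broken state.
    print("CUDA runtime error:", e)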
The nginx logs also show:
2025/04/05 20:09:35 [error] 322#322: *13468 upstream timed out (110: Unknown error) while connecting to upstream, client: ****, server: _, request: "POST /prompt HTTP/1.1", upstream: "****", host: "****"
This began occurring recently, even though my setup was previously stable. There have been no changes to ComfyUI or its plugins.

Questions:
1. What might be causing this issue? Is it related to CUDA, GPU allocation, or something else?
2. Are there any logs I should check to better diagnose the problem?
3. Is there anything I can do to prevent these failures or make the pod more stable?
4. Is this a known issue with the PyTorch 2.4.0 image?

Any help would be greatly appreciated, as this is disrupting my workflow significantly. Thank you!
14 replies