RunPod•12mo ago
lewington

reproducible: pods crash 50% of the time

I am trying to build an API that lets people without big GPUs run Google's weather forecasting model, GraphCast. I have this code:
import json
import runpod

# Load the RunPod API key from a local credentials file
with open("credentials.json", "r") as f:
    credentials = json.load(f)

runpod.api_key = credentials["RUNPOD_KEY"]

pod = runpod.create_pod(
    cloud_type="SECURE",  # or else someone might snoop your session and steal your AWS/CDS credentials
    name="easy-graphcast1",
    image_name="runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04",
    gpu_type_id="NVIDIA A100 80GB PCIe",
    container_disk_in_gb=30,
)
About 20% of the time this works perfectly well.
The other 80% of the time we get the following in the pod logs, and it just keeps cycling like that:
2023-12-31T07:03:04Z create pod network
2023-12-31T07:03:04Z create container lewingtonpitsos/easy-graphcast:latest
2023-12-31T07:03:04Z latest Pulling from lewingtonpitsos/easy-graphcast
2023-12-31T07:03:04Z Digest: sha256:0840e41d8649381afb4e8e15b364c772dad1127f626e00680297eeb5c1f71df5
2023-12-31T07:03:04Z Status: Image is up to date for lewingtonpitsos/easy-graphcast:latest
2023-12-31T07:03:04Z start container
2023-12-31T07:03:04Z error starting container: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown
2023-12-31T07:03:20Z start container...
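The key line in the log above is the nvidia-container-cli requirement error: the image asks for cuda>=12.1, and the driver on the host that received the pod does not provide it. Below is a minimal sketch, assuming the error text keeps the format shown above, of how a script could pick that condition out of pod logs and compare it with a host's CUDA version. The names `parse_cuda_requirement` and `host_satisfies` are hypothetical helpers, not part of the runpod SDK.

```python
import re
from typing import Optional, Tuple

# Matches the condition in lines such as:
#   nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, ...
_REQUIREMENT = re.compile(r"unsatisfied condition: cuda>=(\d+)\.(\d+)")


def parse_cuda_requirement(log_text: str) -> Optional[Tuple[int, int]]:
    """Return the (major, minor) CUDA version the image requires, or None."""
    match = _REQUIREMENT.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def host_satisfies(host_cuda: str, required: Tuple[int, int]) -> bool:
    """Compare a host CUDA version string such as '12.0' with the requirement."""
    parts = host_cuda.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0  # treat '12' as 12.0
    return (major, minor) >= required


log_line = (
    "nvidia-container-cli: requirement error: unsatisfied condition: "
    "cuda>=12.1, please update your driver to a newer version, "
    "or use an earlier cuda container: unknown"
)
required = parse_cuda_requirement(log_line)
print(required)
if required is not None:
    print(host_satisfies("12.0", required))
    print(host_satisfies("12.2", required))
```

If some hosts for a given GPU type run older drivers than others, a check like this would explain why the same create_pod call only succeeds part of the time.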

Notably, when I switch image_name to runpod/stack, it works 100% of the time. This is very confusing to me.
3 Replies
lewingtonOP•12mo ago
switching to "NVIDIA A100-SXM4-80GB" fixes the issue... still a very strange issue
JM•12mo ago
Hey @lewington! Use the filter option to use only CUDA 12.1+ 🙂
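For pods created from the Python SDK rather than the web console, the same filter could be sketched roughly as below. This assumes a recent runpod SDK whose create_pod accepts an `allowed_cuda_versions` list. That parameter is an assumption here and should be checked against the installed SDK version. The helper `build_pod_kwargs` and the version list passed to it are hypothetical and only for illustration.

```python
from typing import Dict, List


def build_pod_kwargs(allowed_cuda_versions: List[str]) -> Dict[str, object]:
    """Collect create_pod arguments, including a CUDA version filter."""
    return {
        "cloud_type": "SECURE",
        "name": "easy-graphcast1",
        "image_name": "runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04",
        "gpu_type_id": "NVIDIA A100 80GB PCIe",
        "container_disk_in_gb": 30,
        # ASSUMPTION: the installed runpod SDK supports this keyword.
        # Verify against the create_pod signature before relying on it.
        "allowed_cuda_versions": allowed_cuda_versions,
    }


# Only accept hosts whose driver reports one of these CUDA versions
# (example values chosen to satisfy the image's cuda>=12.1 requirement).
kwargs = build_pod_kwargs(["12.1", "12.2"])
print(kwargs["allowed_cuda_versions"])

# With the SDK installed and runpod.api_key set, the pod would then be
# created with:
#   import runpod
#   pod = runpod.create_pod(**kwargs)
```

Keeping the arguments in a plain dict makes it easy to inspect what will be sent before any pod is actually started.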
lewingtonOP•12mo ago
Thanks I'll try that 😊