RunPod · 2mo ago
sake

URGENT: Multiple H100 instances critical error - ICML deadline tomorrow

🚨 Critical Issue:
- 3x H100 SXM pods simultaneously received critical error messages and experiments terminated
- IDs: Iv8utoj2mozzp6 (1x H100), afinjwp2ryg3ub (2x H100)
- Time: ~02:27 KST, Jan 30
- Image: runpod/pytorch:2.2.0-py3.10-cuda12.1-devel-ubuntu22.04

⏰ Context:
- ICML submission deadline: Jan 31st afternoon KST (tomorrow)
- Multiple critical experiments terminated unexpectedly
- Need urgent resolution to meet conference deadline

🙏 Requesting:
1. Immediate investigation
2. Priority restoration of instances
3. Prevention of recurrence for next 24hrs

Can someone from the support team please help ASAP? This is severely impacting our conference submission timeline.
1 Reply
sake (OP) · 2mo ago
Container Creation Failures:
• Image: runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04
• Error: "layer does not exist"
• Multiple failed attempts since 03:03 KST
• Image shows as up to date but fails to create container
