landingpagelover24
RunPod
Created by n8tzto on 1/19/2024 in #⚡|serverless
Intermittent Slow Performance Issue with GPU Workers
We have exactly the same problem here. On a 4090, the exact same request/workload is sometimes fast (~10 it/s) and sometimes slow (~5 it/s), which doubles the cost. Has the problem been solved for you in the meantime, @n8tzto?
8 replies
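To make the cost claim in the post above concrete, here is a minimal sketch of the arithmetic, assuming the worker is billed per second of execution time. The iteration count and the price are hypothetical placeholders, not values from the thread.

```python
def cost_per_job(total_iterations: int, its_per_second: float, price_per_second: float) -> float:
    """Return the cost of one job under per-second billing."""
    # Run time is inversely proportional to throughput.
    seconds = total_iterations / its_per_second
    return seconds * price_per_second


# Hypothetical job: 1000 iterations at an assumed price of 0.0005 per second.
fast = cost_per_job(1000, 10.0, 0.0005)  # ~10 it/s
slow = cost_per_job(1000, 5.0, 0.0005)   # ~5 it/s

# Ratio of slow-run cost to fast-run cost.
print(slow / fast)
```

Halving the throughput doubles the run time, so the cost ratio comes out at 2 regardless of the price chosen.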
RunPod
Created by landingpagelover24 on 10/2/2024 in #⚡|serverless
Job suddenly restarts and fails after one retry.
No description
10 replies
RunPod
Created by landingpagelover24 on 10/2/2024 in #⚡|serverless
Job suddenly restarts and fails after one retry.
I think I can rule out OOM because it runs on my local machine with 12GB of VRAM (with exactly the same settings) while the worker GPU has 24GB. Also, RunPod UI shows that only a fraction of worker resources are being used (this is something I should definitely optimize after fixing the current issue lol).
10 replies
RunPod
Created by landingpagelover24 on 10/2/2024 in #⚡|serverless
Job suddenly restarts and fails after one retry.
Hey @nerdylive! I cannot spot any errors in the attached logs – you can see how it suddenly restarts at epoch 4. The same request with fewer training images (provided for testing) runs fine. Testing it locally with curl (via /runsync – because /run doesn't work due to a known bug) also works fine.
10 replies
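For anyone who wants to reproduce the local /runsync test from the post above in Python instead of curl, a rough sketch follows. The base URL `http://localhost:8000` and the payload field `training_images` are assumptions for illustration and do not come from the thread; the `{"input": ...}` wrapper is the usual RunPod request shape.

```python
import json
import urllib.request


def build_runsync_request(base_url: str, job_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request to the /runsync route."""
    # Wrap the job parameters in an "input" object and encode as JSON.
    body = json.dumps({"input": job_input}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/runsync",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 600.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumed local test server address and a hypothetical payload.
req = build_runsync_request("http://localhost:8000", {"training_images": ["img_001.png"]})
# result = send(req)  # uncomment when a local worker is actually running
```

Keeping the request construction separate from sending makes it easy to replay the same payload with a smaller or larger image list when narrowing down where the restart happens.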