Bell Chen
RunPod
•Created by Bell Chen on 7/5/2024 in #⚡|serverless
Is there any way to disable retrying after a crash?
2 replies
RunPod
•Created by Bell Chen on 6/8/2024 in #⚡|serverless
PyTorch Lightning training with DDP strategy crashes with no error caught on a multi-GPU worker
It looks like the serverless worker crashes when the handler spawns new processes. It crashes right after the first process is spawned, at "Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2". The same code works fine in the web terminal of a multi-GPU pod.
5 replies
RunPod
•Created by Bell Chen on 5/28/2024 in #⚡|serverless
A new release re-pulls the entire image.
Previously, only the new top layers were pulled. Now every layer is pulled again, which makes testing slow.
10 replies
RunPod
•Created by Bell Chen on 2/19/2024 in #⚡|serverless
"Failed to return job results. | 400, message='Bad Request', url=URL('https://api.runpod.ai/v2/gg3lo
{
  "dt": "2024-02-19 02:45:23.347011",
  "endpointid": "gg3lo31p6vvlb0",
  "level": "error",
  "message": "Failed to return job results. | 400, message='Bad Request', url=URL('https://api.runpod.ai/v2/gg3lo31p6vvlb0/job-done/3plkb7uehbwit0/83aac4d7-36c5-45ce-8b43-8189a65a855f-u1?gpu=NVIDIA+L40&isStream=false')",
  "workerId": "3plkb7uehbwit0"
}
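For anyone reading the log entry above, here is a minimal sketch that splits the URL from the error message into its parts, using only the Python standard library. The path layout (`v2/<endpoint>/job-done/<worker>/<job>`) is an assumption inferred from this one log line, where the second and fourth segments match the `endpointid` and `workerId` fields. The helper name `split_job_done_url` is hypothetical and not part of any RunPod SDK.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied verbatim from the "message" field of the log entry.
URL = (
    "https://api.runpod.ai/v2/gg3lo31p6vvlb0/job-done/3plkb7uehbwit0/"
    "83aac4d7-36c5-45ce-8b43-8189a65a855f-u1?gpu=NVIDIA+L40&isStream=false"
)


def split_job_done_url(url):
    """Hypothetical helper: break a job-done URL into labeled parts.

    Assumes the path is v2/<endpoint>/job-done/<worker>/<job>, which is
    inferred from the log line and not from official documentation.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    version, endpoint_id, action, worker_id, job_id = segments
    # parse_qs returns lists of values; keep the first value of each key.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "version": version,
        "endpoint_id": endpoint_id,
        "action": action,
        "worker_id": worker_id,
        "job_id": job_id,
        "gpu": query.get("gpu"),
        "is_stream": query.get("isStream"),
    }


info = split_job_done_url(URL)
for name, value in info.items():
    print(f"{name}: {value}")
```

Splitting the URL this way makes it easier to check that the endpoint and worker in the failing request are the ones you expect before reporting the 400 response.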
6 replies
RunPod
•Created by Bell Chen on 2/17/2024 in #⚡|serverless
The worker's log is not updating in real time. It is only pulled every 5 minutes.
Endpoint: 0bd8xndlkfo6oj
2 replies
RunPod
•Created by Bell Chen on 2/15/2024 in #⚡|serverless
L40 and 6000 Ada serverless worker not spawning
It is not spawning
13 replies