Bell Chen Posts - Answer Overflow

Bell Chen


RunPod

•Created by Bell Chen on 7/5/2024 in #⚡｜serverless

Is there any way to disable retrying after a crash?

2 replies

RunPod

•Created by Bell Chen on 6/8/2024 in #⚡｜serverless

PyTorch Lightning training with the DDP strategy crashes with no error caught on a multi-GPU worker

It looks like the serverless worker crashes when the handler spawns new processes. It crashes right after the first process is spawned, at "Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2". The same code works fine in the web terminal of a multi-GPU pod.
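One thing that might help narrow this down is to launch the training script as its own child process and report how it exited, so a crash in a spawned process shows up as an explicit error instead of silently killing the worker. The sketch below uses only the Python standard library. The function name `run_training`, the shape of the returned dict, and the idea that this avoids the crash are all assumptions for illustration, not something the thread confirms.

```python
import subprocess
import sys


def run_training(cmd):
    """Run `cmd` as a child process and summarize how it ended.

    `cmd` is a hypothetical argument list, e.g. [sys.executable, "train.py"].
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # A non-zero code means the child failed; on POSIX a negative
        # code means it was killed by a signal (e.g. -9 for SIGKILL).
        return {
            "error": f"training exited with code {proc.returncode}",
            "stderr_tail": proc.stderr[-2000:],
        }
    # Keep only the tail of stdout so the result stays small.
    return {"output": proc.stdout[-2000:]}


# Demo with a trivial child process standing in for the training script.
print(run_training([sys.executable, "-c", "print('ok')"]))
```

A handler could call `run_training` and return its result, which at least makes the exit code and the last lines of stderr visible in the job output.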

5 replies

RunPod

•Created by Bell Chen on 5/28/2024 in #⚡｜serverless

New release will re-pull the entire image.

It used to pull only the new top layers, but now it pulls everything again, which makes testing slow.

13 replies

RunPod

•Created by Bell Chen on 3/18/2024 in #⛅｜pods-clusters

How to start TensorBoard from the pod?

3 replies

RunPod

•Created by Bell Chen on 2/19/2024 in #⚡｜serverless

"Failed to return job results. | 400, message='Bad Request', url=URL('https://api.runpod.ai/v2/gg3lo

{
  "dt": "2024-02-19 02:45:23.347011",
  "endpointid": "gg3lo31p6vvlb0",
  "level": "error",
  "message": "Failed to return job results. | 400, message='Bad Request', url=URL('https://api.runpod.ai/v2/gg3lo31p6vvlb0/job-done/3plkb7uehbwit0/83aac4d7-36c5-45ce-8b43-8189a65a855f-u1?gpu=NVIDIA+L40&isStream=false')",
  "workerId": "3plkb7uehbwit0"
}
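For anyone triaging many of these entries, here is a minimal sketch that pulls the HTTP status, IDs, and GPU type out of a log record like the one above. It uses only the standard library. The function name `summarize` is hypothetical, and the meaning of each URL path segment is an assumption inferred from the segments that match the record's own `endpointid` and `workerId` fields.

```python
import json
import re
from urllib.parse import parse_qs, urlsplit

RAW = """{
  "dt": "2024-02-19 02:45:23.347011",
  "endpointid": "gg3lo31p6vvlb0",
  "level": "error",
  "message": "Failed to return job results. | 400, message='Bad Request', url=URL('https://api.runpod.ai/v2/gg3lo31p6vvlb0/job-done/3plkb7uehbwit0/83aac4d7-36c5-45ce-8b43-8189a65a855f-u1?gpu=NVIDIA+L40&isStream=false')",
  "workerId": "3plkb7uehbwit0"
}"""


def summarize(raw):
    """Extract status code, IDs, and GPU from one JSON log record."""
    record = json.loads(raw)
    message = record["message"]
    status = re.search(r"\|\s*(\d{3}),", message)
    url = re.search(r"URL\('([^']+)'\)", message)
    if status is None or url is None:
        raise ValueError("message does not match the expected shape")
    parts = urlsplit(url.group(1))
    # Assumed layout: /v2/<endpoint>/job-done/<worker>/<job>
    segments = parts.path.strip("/").split("/")
    query = parse_qs(parts.query)  # decodes '+' as a space
    return {
        "status": int(status.group(1)),
        "endpoint_id": segments[1],
        "worker_id": segments[3],
        "job_id": segments[4],
        "gpu": query["gpu"][0],
    }


print(summarize(RAW))
```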

6 replies

RunPod

•Created by Bell Chen on 2/17/2024 in #⚡｜serverless

The worker's log is not updating in real time. It is only pulled every 5 minutes.

Endpoint: 0bd8xndlkfo6oj

2 replies

RunPod

•Created by Bell Chen on 2/15/2024 in #⚡｜serverless

L40 and 6000 Ada serverless workers not spawning

They are not spawning.

13 replies
