RunPod•10mo ago

Stuck in the initialization

Seems that I'm stuck in the intiialization loop e.g.

2024-06-24T10:47:39Z worker is ready
2024-06-24T10:49:04Z loading container image from cache
2024-06-24T10:49:33Z The image runpod/worker-vllm:stable-cuda12.1.0 already exists, renaming the old one with ID sha256:08d4ab2735bbe3528acdd1a11322c570347bcf3b77c9779e9886e78b647818bd to empty string
2024-06-24T10:49:33Z Loaded image: runpod/worker-vllm:stable-cuda12.1.0
2024-06-24T10:49:35Z stable-cuda12.1.0 Pulling from runpod/worker-vllm
2024-06-24T10:49:35Z Digest: sha256:2d1b1ea50cfbf291800375956f71791bc69dd074a7531e5992d216355a817cc7
2024-06-24T10:49:35Z Status: Downloaded newer image for runpod/worker-vllm:stable-cuda12.1.0
2024-06-24T10:49:35Z worker is ready
...
2024-06-24T11:15:04Z loading container image from cache
2024-06-24T11:15:33Z The image runpod/worker-vllm:stable-cuda12.1.0 already exists, renaming the old one with ID sha256:08d4ab2735bbe3528acdd1a11322c570347bcf3b77c9779e9886e78b647818bd to empty string
2024-06-24T11:15:33Z Loaded image: runpod/worker-vllm:stable-cuda12.1.0
2024-06-24T11:15:35Z stable-cuda12.1.0 Pulling from runpod/worker-vllm
2024-06-24T11:15:35Z Digest: sha256:2d1b1ea50cfbf291800375956f71791bc69dd074a7531e5992d216355a817cc7
2024-06-24T11:15:35Z Status: Downloaded newer image for runpod/worker-vllm:stable-cuda12.1.0
2024-06-24T11:15:35Z worker is ready
2024-06-24T11:17:04Z loading container image from cache

2024-06-24T10:47:39Z worker is ready
2024-06-24T10:49:04Z loading container image from cache
2024-06-24T10:49:33Z The image runpod/worker-vllm:stable-cuda12.1.0 already exists, renaming the old one with ID sha256:08d4ab2735bbe3528acdd1a11322c570347bcf3b77c9779e9886e78b647818bd to empty string
2024-06-24T10:49:33Z Loaded image: runpod/worker-vllm:stable-cuda12.1.0
2024-06-24T10:49:35Z stable-cuda12.1.0 Pulling from runpod/worker-vllm
2024-06-24T10:49:35Z Digest: sha256:2d1b1ea50cfbf291800375956f71791bc69dd074a7531e5992d216355a817cc7
2024-06-24T10:49:35Z Status: Downloaded newer image for runpod/worker-vllm:stable-cuda12.1.0
2024-06-24T10:49:35Z worker is ready
...
2024-06-24T11:15:04Z loading container image from cache
2024-06-24T11:15:33Z The image runpod/worker-vllm:stable-cuda12.1.0 already exists, renaming the old one with ID sha256:08d4ab2735bbe3528acdd1a11322c570347bcf3b77c9779e9886e78b647818bd to empty string
2024-06-24T11:15:33Z Loaded image: runpod/worker-vllm:stable-cuda12.1.0
2024-06-24T11:15:35Z stable-cuda12.1.0 Pulling from runpod/worker-vllm
2024-06-24T11:15:35Z Digest: sha256:2d1b1ea50cfbf291800375956f71791bc69dd074a7531e5992d216355a817cc7
2024-06-24T11:15:35Z Status: Downloaded newer image for runpod/worker-vllm:stable-cuda12.1.0
2024-06-24T11:15:35Z worker is ready
2024-06-24T11:17:04Z loading container image from cache

Don't understand how to debug that. Any recommendations?

Solution:

I've cloned my endpoint and deleted the original one. The cloned one seems to work just fine.

Jump to solution

61 Replies

Jason•10mo ago

Try to create new endpoint @flash-singh mighr wanna check this also @Alpay Ariyak

Solution

yarcat•10mo ago

I've cloned my endpoint and deleted the original one. The cloned one seems to work just fine.

Jason•10mo ago

Yeah keep this discord thread on so staffs can check this issue

haris•10mo ago

@yarcat In the future would you be able to provide an endpoint ID so we could investigate further

yarcatOP•10mo ago

absolutely, sorry I didn't do it this time

haris•10mo ago

No worries!

flash-singh•10mo ago

@yarcat is this resolved? I know we had a bug regarding this few weeks ago and it was fixed, not sure if it fixed the issue for you

Jason•10mo ago

yeah he did this

yarcatOP•10mo ago

Sorry for the delay replying. I've cloned the pod, as pointed out above. No issues since then

yarcatOP•10mo ago

Don't want to create a new thread, so I'm kinda resurrecting this one. I've noticed that lately my i3a4tvypolo9bp serverless workers were restarting quite frequently (the flash boot is enabled). Today it was quite horrible -- we had a demo session, and in 90 mins of that session I had a feeling that it spent time restarting more than actually serving.

yarcatOP•10mo ago

I'm using this endpoint for almost a month, and it never behaved like that. Over the last few days I've noticed it was restarting kinda noticably frequently. But today it is just horrible.

Alpay Ariyak•10mo ago

Container disk size is super small, make it at least 50 Also all machines are now 12.1+ so you can use that worker

yarcatOP•10mo ago

Thanks!!!

Alpay Ariyak•10mo ago

Of course! Did that solve it?

Jason•10mo ago

Hey alpay is the new vllm version for vllm worker coming out soon

yarcatOP•9mo ago

I'd like to understand that pricing a bit. I have two serverless instances, each of them has 1 active worker: 1. vllm-4t9gk2df0h0pds runs 2xA40 2x0.0002*60*60*24=34.56 $/day 2. vllm-2513y3iyxauhz2 runs 4xA40 4x0.0002*60*60*24=69.12 $/day Which gives us around 104 $/day Both instances use 50GB local disk storage (as it was recommend above). However, my dashboard shows 123+$/day for exactly this setup, and at some point it was above 200 $/day, when we accidentally configured 200GB disk space per instance I'd like to understand, if there is a way for me to optimize it. 160+ $/day is kinda over my budget. And I'm trying to understand how did it happen that my deprecating on demand instances in favor of serverless, I actually started to pay more.

flash-singh•9mo ago

reach out to support with details

denr•9mo ago

Anyone using serverless? I've tried to implememt a custom handler directly from the runPod docs but when i deploy endpoint worker stucks in Initializing status forewer. EdpointID: 68qrx2ls5p4v23 Logs

2024-08-16T11:07:22Z loading container image from cache
2024-08-16T11:07:27Z Loaded image ID: sha256:ec4b30ce3184aa6f4a35049d9c55878d1013e0aa2c09b35fe2e61c2547ccfa5c
2024-08-16T11:07:28Z docker.io/denrykhlov/test@sha256:02c2127a224cd546286efcb458d3da560abf4123242c948b96b70bd604cec682 Pulling from denrykhlov/test
2024-08-16T11:07:28Z Digest: sha256:02c2127a224cd546286efcb458d3da560abf4123242c948b96b70bd604cec682
2024-08-16T11:07:28Z Status: Downloaded newer image for denrykhlov/test@sha256:02c2127a224cd546286efcb458d3da560abf4123242c948b96b70bd604cec682
2024-08-16T11:07:28Z worker is ready

2024-08-16T11:07:22Z loading container image from cache
2024-08-16T11:07:27Z Loaded image ID: sha256:ec4b30ce3184aa6f4a35049d9c55878d1013e0aa2c09b35fe2e61c2547ccfa5c
2024-08-16T11:07:28Z docker.io/denrykhlov/test@sha256:02c2127a224cd546286efcb458d3da560abf4123242c948b96b70bd604cec682 Pulling from denrykhlov/test
2024-08-16T11:07:28Z Digest: sha256:02c2127a224cd546286efcb458d3da560abf4123242c948b96b70bd604cec682
2024-08-16T11:07:28Z Status: Downloaded newer image for denrykhlov/test@sha256:02c2127a224cd546286efcb458d3da560abf4123242c948b96b70bd604cec682
2024-08-16T11:07:28Z worker is ready

Dockerfile:

FROM python:3.10-slim

WORKDIR /
RUN pip install --no-cache-dir runpod
COPY rp_handler.py /

# Start the container
CMD ["python3", "-u", "rp_handler.py"]

FROM python:3.10-slim

WORKDIR /
RUN pip install --no-cache-dir runpod
COPY rp_handler.py /

# Start the container
CMD ["python3", "-u", "rp_handler.py"]

- rp_handler.py

import runpod

def process_input(input):
    name = input['name']
    greeting = f'Hello {name}'

    return {
        "greeting": greeting
    }

def handler(event):
    return process_input(event['input'])


if __name__ == '__main__':
    runpod.serverless.start({'handler': handler})

import runpod

def process_input(input):
    name = input['name']
    greeting = f'Hello {name}'

    return {
        "greeting": greeting
    }

def handler(event):
    return process_input(event['input'])


if __name__ == '__main__':
    runpod.serverless.start({'handler': handler})

Gaming

Programming

Stuck in the initialization

Did you find this page helpful?