RunPod•8mo ago
yarcat

Stuck in the initialization

Seems that I'm stuck in an initialization loop, e.g.
2024-06-24T10:47:39Z worker is ready
2024-06-24T10:49:04Z loading container image from cache
2024-06-24T10:49:33Z The image runpod/worker-vllm:stable-cuda12.1.0 already exists, renaming the old one with ID sha256:08d4ab2735bbe3528acdd1a11322c570347bcf3b77c9779e9886e78b647818bd to empty string
2024-06-24T10:49:33Z Loaded image: runpod/worker-vllm:stable-cuda12.1.0
2024-06-24T10:49:35Z stable-cuda12.1.0 Pulling from runpod/worker-vllm
2024-06-24T10:49:35Z Digest: sha256:2d1b1ea50cfbf291800375956f71791bc69dd074a7531e5992d216355a817cc7
2024-06-24T10:49:35Z Status: Downloaded newer image for runpod/worker-vllm:stable-cuda12.1.0
2024-06-24T10:49:35Z worker is ready
...
2024-06-24T11:15:04Z loading container image from cache
2024-06-24T11:15:33Z The image runpod/worker-vllm:stable-cuda12.1.0 already exists, renaming the old one with ID sha256:08d4ab2735bbe3528acdd1a11322c570347bcf3b77c9779e9886e78b647818bd to empty string
2024-06-24T11:15:33Z Loaded image: runpod/worker-vllm:stable-cuda12.1.0
2024-06-24T11:15:35Z stable-cuda12.1.0 Pulling from runpod/worker-vllm
2024-06-24T11:15:35Z Digest: sha256:2d1b1ea50cfbf291800375956f71791bc69dd074a7531e5992d216355a817cc7
2024-06-24T11:15:35Z Status: Downloaded newer image for runpod/worker-vllm:stable-cuda12.1.0
2024-06-24T11:15:35Z worker is ready
2024-06-24T11:17:04Z loading container image from cache
Don't understand how to debug that. Any recommendations?
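One way to put a number on the loop is to measure how long each worker stays "ready" before the image is loaded again. The sketch below is only an illustration: it assumes the log messages match the strings in the excerpt above exactly, and the helper names are hypothetical, not part of any RunPod tooling.

```python
from datetime import datetime

# Three lines copied from the log excerpt posted above.
LOG_EXCERPT = """\
2024-06-24T11:15:04Z loading container image from cache
2024-06-24T11:15:35Z worker is ready
2024-06-24T11:17:04Z loading container image from cache
"""

READY = "worker is ready"
RELOAD = "loading container image from cache"


def parse_line(line):
    """Split a log line into (timestamp, message)."""
    stamp, _, message = line.partition(" ")
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ"), message


def seconds_ready_before_reload(log_text):
    """Return, for each ready -> reload pair, the seconds in between."""
    gaps = []
    ready_at = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        when, message = parse_line(line)
        if message == READY:
            ready_at = when
        elif message == RELOAD and ready_at is not None:
            gaps.append((when - ready_at).total_seconds())
            ready_at = None
    return gaps


print(seconds_ready_before_reload(LOG_EXCERPT))  # [89.0]
```

A short, steady gap like this one suggests the worker is being recycled on a timer rather than failing at random, which is useful detail to include when reporting the endpoint to staff.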
Solution:
I've cloned my endpoint and deleted the original one. The cloned one seems to work just fine.
61 Replies
nerdylive
nerdylive•8mo ago
Try creating a new endpoint. @flash-singh might wanna check this, also @Alpay Ariyak
Solution
yarcat
yarcat•8mo ago
I've cloned my endpoint and deleted the original one. The cloned one seems to work just fine.
nerdylive
nerdylive•8mo ago
Yeah, keep this Discord thread open so staff can check this issue
haris
haris•8mo ago
@yarcat In the future, would you be able to provide an endpoint ID so we could investigate further?
yarcat
yarcatOP•8mo ago
absolutely, sorry I didn't do it this time
haris
haris•8mo ago
No worries!
flash-singh
flash-singh•7mo ago
@yarcat is this resolved? I know we had a bug regarding this a few weeks ago and it was fixed, not sure if it fixed the issue for you
nerdylive
nerdylive•7mo ago
yeah he did this
yarcat
yarcatOP•7mo ago
Sorry for the delay replying. I've cloned the pod, as pointed out above. No issues since then
yarcat
yarcatOP•7mo ago
Don't want to create a new thread, so I'm kinda resurrecting this one. I've noticed that lately my i3a4tvypolo9bp serverless workers have been restarting quite frequently (flash boot is enabled). Today it was quite horrible -- we had a demo session, and over the 90 mins of that session it felt like it spent more time restarting than actually serving.
[two attached images]
yarcat
yarcatOP•7mo ago
I've been using this endpoint for almost a month, and it never behaved like that. Over the last few days I've noticed it restarting noticeably often. But today it is just horrible.
Alpay Ariyak
Alpay Ariyak•7mo ago
Container disk size is super small, make it at least 50 GB. Also, all machines are now 12.1+, so you can use that worker.
yarcat
yarcatOP•7mo ago
Thanks!!!
Alpay Ariyak
Alpay Ariyak•7mo ago
Of course! Did that solve it?
nerdylive
nerdylive•7mo ago
Hey alpay, is the new vllm version for the vllm worker coming out soon?
yarcat
yarcatOP•6mo ago
I'd like to understand the pricing a bit. I have two serverless instances, each of them has 1 active worker:
1. vllm-4t9gk2df0h0pds runs 2xA40: 2 x 0.0002 * 60 * 60 * 24 = 34.56 $/day
2. vllm-2513y3iyxauhz2 runs 4xA40: 4 x 0.0002 * 60 * 60 * 24 = 69.12 $/day
Which gives us around 104 $/day. Both instances use 50GB of local disk storage (as was recommended above). However, my dashboard shows 123+ $/day for exactly this setup, and at some point it was above 200 $/day, when we accidentally configured 200GB of disk space per instance.
I'd like to understand if there is a way for me to optimize it. 160+ $/day is kinda over my budget. And I'm trying to understand how it happened that, by deprecating on-demand instances in favor of serverless, I actually started to pay more.
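The arithmetic in this message can be reproduced with a short sketch. It assumes, as the message does, a flat rate of 0.0002 $ per GPU per second and one always-on worker per endpoint. The function name is hypothetical, and the 123 $/day figure is simply the dashboard number quoted above, not something the code derives.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def daily_gpu_cost(gpu_count, rate_per_gpu_second=0.0002, active_workers=1):
    """GPU-only cost in $/day for workers that run around the clock."""
    return gpu_count * rate_per_gpu_second * SECONDS_PER_DAY * active_workers


# Endpoint name -> number of A40 GPUs per worker, as listed in the message.
endpoints = {
    "vllm-4t9gk2df0h0pds": 2,
    "vllm-2513y3iyxauhz2": 4,
}

total = 0.0
for name, gpus in endpoints.items():
    cost = daily_gpu_cost(gpus)
    total += cost
    print(f"{name}: {cost:.2f} $/day")

print(f"expected GPU total: {total:.2f} $/day")

# Difference between the dashboard figure quoted above and the GPU-only estimate.
dashboard_per_day = 123.0
print(f"not explained by GPU time: {dashboard_per_day - total:.2f} $/day")
```

The remaining difference is what support would need to account for; the sketch only isolates it and does not say where it comes from.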
flash-singh
flash-singh•6mo ago
reach out to support with details
denr
denr•6mo ago
Anyone using serverless? I've tried to implement a custom handler directly from the RunPod docs, but when I deploy the endpoint, the worker gets stuck in the Initializing status forever. Endpoint ID: 68qrx2ls5p4v23
Logs:
2024-08-16T11:07:22Z loading container image from cache
2024-08-16T11:07:27Z Loaded image ID: sha256:ec4b30ce3184aa6f4a35049d9c55878d1013e0aa2c09b35fe2e61c2547ccfa5c
2024-08-16T11:07:28Z docker.io/denrykhlov/test@sha256:02c2127a224cd546286efcb458d3da560abf4123242c948b96b70bd604cec682 Pulling from denrykhlov/test
2024-08-16T11:07:28Z Digest: sha256:02c2127a224cd546286efcb458d3da560abf4123242c948b96b70bd604cec682
2024-08-16T11:07:28Z Status: Downloaded newer image for denrykhlov/test@sha256:02c2127a224cd546286efcb458d3da560abf4123242c948b96b70bd604cec682
2024-08-16T11:07:28Z worker is ready
Dockerfile:
FROM python:3.10-slim

WORKDIR /
RUN pip install --no-cache-dir runpod
COPY rp_handler.py /

# Start the container
CMD ["python3", "-u", "rp_handler.py"]
rp_handler.py:
import runpod

def process_input(input):
    name = input['name']
    greeting = f'Hello {name}'

    return {
        "greeting": greeting
    }

def handler(event):
    return process_input(event['input'])


if __name__ == '__main__':
    runpod.serverless.start({'handler': handler})
Encyrption
Encyrption•6mo ago
All that looks fine. What happens when you submit an API request? Are you using Run or RunSync? What is the JSON you are POSTING?
denr
denr•6mo ago
I'm trying to run request directly from the WebUI
[attached image]
Encyrption
Encyrption•6mo ago
Does the status change if you query it again? Does it stay in the QUEUE or does it go to processing?
denr
denr•6mo ago
Yes, it is in the queue. My previous deployment with the same template was in the queue for a few hours before I deleted it.
Encyrption
Encyrption•6mo ago
Looks like your code is looking for input['name'] which you have not provided. Can you show me a screen shot of your serverless endpoint settings?
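For reference, a request body that satisfies `input['name']` could look like the sketch below. The URL pattern and the Bearer header are assumptions based on RunPod's public serverless API and should be checked against the current docs. The endpoint ID and API key are placeholders, and the helper name is hypothetical. The code only builds the request and does not send it.

```python
import json
import urllib.request


def build_runsync_request(endpoint_id, api_key, name):
    """Build (but do not send) a RunSync POST whose input carries 'name'."""
    payload = {"input": {"name": name}}
    return urllib.request.Request(
        # Assumed URL pattern; verify against the RunPod API documentation.
        url=f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_runsync_request("YOUR_ENDPOINT_ID", "YOUR_API_KEY", "World")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it. A worker that is still initializing will leave the job in the queue regardless of the payload, which matches what is reported in this thread.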
denr
denr•6mo ago
Agree with you regarding "name", but even with name it is stuck. Tried right now.
denr
denr•6mo ago
the logs
[attached image]
Encyrption
Encyrption•6mo ago
Click Container logs... that will give you details of the application not just the container
denr
denr•6mo ago
nothing in container logs, I also tried to look into that, 1 sec
denr
denr•6mo ago
[attached image]
Encyrption
Encyrption•6mo ago
your serverless endpoint is likely wrong... what do you have for these settings?
[attached image]
denr
denr•6mo ago
[attached image]
Encyrption
Encyrption•6mo ago
which GPU did you select?
denr
denr•6mo ago
[attached image]
Encyrption
Encyrption•6mo ago
That all looks fine.. not sure what is causing your problem 😦
denr
denr•6mo ago
Same thing. But anyway, thanks for trying. I've just sent a request to support; hopefully they have more detailed debug logs.
Encyrption
Encyrption•6mo ago
wait i think I found problem
denr
denr•6mo ago
listening
nerdylive
nerdylive•6mo ago
Maybe it's this: if __name__ == '__main__': that's the default Python entry point, but your filename isn't main, I guess, so it doesn't get executed. No jobs get picked up, so the workers won't start processing. Just a possible explanation, I guess.
Encyrption
Encyrption•6mo ago
Instead of calling
def handler(event):
You should call your handler directly. Like this:
import runpod

''' RunPod minimal handler '''
def handler(job):
    job_input = job['input']
    name = job_input.get('name', 'Joe')
    greeting = f'Hello {name}'
    return { 'greeting': greeting }

if __name__ == "__main__":
    runpod.serverless.start({'handler': handler})
Can you try that code and see if you run into the same issue?
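Since both handlers in this thread are plain functions, their logic can be checked locally without the RunPod SDK by calling them with a dict shaped like a job. The sketch below copies the two variants posted above (renamed `strict_handler` and `lenient_handler` for illustration only) and shows the one behavioral difference: a missing `name` raises `KeyError` in the original and falls back to a default in the rewrite.

```python
def strict_handler(event):
    """Logic of the original rp_handler.py: 'name' is required."""
    name = event["input"]["name"]
    return {"greeting": f"Hello {name}"}


def lenient_handler(job):
    """Logic of the minimal handler: 'name' falls back to 'Joe'."""
    name = job["input"].get("name", "Joe")
    return {"greeting": f"Hello {name}"}


with_name = {"input": {"name": "World"}}
without_name = {"input": {}}

print(strict_handler(with_name))      # {'greeting': 'Hello World'}
print(lenient_handler(without_name))  # {'greeting': 'Hello Joe'}

try:
    strict_handler(without_name)
except KeyError as missing:
    print(f"strict handler rejects the job, missing key: {missing}")
```

Either variant only runs once a worker actually picks the job up, so this check cannot explain a worker that never leaves the Initializing state.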
denr
denr•6mo ago
will try
denr
denr•6mo ago
I'm sorry, but that is not the problem; it is expected for Python. The code works locally.
[attached image]
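On the `if __name__ == '__main__':` question raised above: Python sets `__name__` to `'__main__'` for whichever file is run directly, whatever that file is called, so the guard in rp_handler.py does execute under `python3 -u rp_handler.py`. A minimal check, using a throwaway temporary file (the helper name is hypothetical):

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def module_name_when_run_directly(filename):
    """Write a one-line script under `filename`, run it, return its __name__."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / filename
        script.write_text("print(__name__)\n")
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()


print(module_name_when_run_directly("rp_handler.py"))  # __main__
```

The filename only matters when the file is imported as a module, in which case `__name__` would be `rp_handler` and the guarded block would be skipped.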
Encyrption
Encyrption•6mo ago
Looking at your code it should work... just trying to eliminate anything that is complicating it to see if that helps reveal the issue.
denr
denr•6mo ago
still the same
2024-08-16T11:37:21Z 3763ad45abf5 Extracting [==================================================>] 284B/284B
2024-08-16T11:37:21Z 3763ad45abf5 Extracting [==================================================>] 284B/284B
2024-08-16T11:37:21Z 3763ad45abf5 Pull complete
2024-08-16T11:37:21Z Digest: sha256:1210452dc40c2cd7615d0112b9a1c8e96196097d0a64ec1eb9d0d211e4d419dc
2024-08-16T11:37:21Z Status: Downloaded newer image for denrykhlov/test@sha256:1210452dc40c2cd7615d0112b9a1c8e96196097d0a64ec1eb9d0d211e4d419dc
2024-08-16T11:37:21Z worker is ready
still initializing
Encyrption
Encyrption•6mo ago
What happens when you submit a request?
denr
denr•6mo ago
You mean the REST request? It is waiting in the queue.
[attached image]
Encyrption
Encyrption•6mo ago
Can you try to run it with RunSync? It is such a small image with so little code that it should have no problem with that.
denr
denr•6mo ago
it just waits, waits, waits.... nothing is happening, the Requests count just goes up by 1...
Encyrption
Encyrption•6mo ago
I've tried all I know.. it should be working. 😦
denr
denr•6mo ago
just wondering if my account is misconfigured and whatever....
Encyrption
Encyrption•6mo ago
Have you opened a ticket with RunPod? You can do so by going to help->Contact on the RunPod site.
denr
denr•6mo ago
yep, already did about an hour ago, waiting
Encyrption
Encyrption•6mo ago
I find it generally takes ~24 hours or so to get a response
denr
denr•6mo ago
yep, that's what they are saying on the website. Will see if they can find something.
Encyrption
Encyrption•6mo ago
good luck
denr
denr•6mo ago
thx man!
nerdylive
nerdylive•6mo ago
did it work or what?
denr
denr•6mo ago
No, still not working. Support just replied: they are saying there is no available pod for my serverless container, even though I've selected Global for GPU with "High availability".
flash-singh
flash-singh•6mo ago
how many gpus per worker?
denr
denr•6mo ago
1
yhlong00000
yhlong00000•6mo ago
I took a look at your pod, and it seems to be stuck in the initializing stage. When you build your image, are you on a Mac with an Apple chip? Try this when you build:
docker build --platform=linux/amd64 -t your-image-name .
I pushed my image to Docker Hub.
[two attached images]
denr
denr•6mo ago
Thanks for trying. Yes, it is stuck in the initialization stage. I'm building the image on an Ubuntu virtual machine and, yes, I'm passing the --platform=linux/amd64 flag as well.
Interesting story: just now I tried to create a new serverless endpoint with your @yhlong00000 image and with mine, and it went well for both! I'm able to send a request and receive the expected results. The worker now shows state "Idle" instead of "Initializing". Looks like the RunPod team did something and now it works.
Hopefully serverless workers will be more stable in the future. Anyway, thanks to everyone for participating. It's good to see the RunPod community is quite active!
yhlong00000
yhlong00000•6mo ago
Glad everything worked! 🥳
