Stuck in the initialization
Seems that I'm stuck in the initialization loop, e.g.
Don't understand how to debug that.
Any recommendations?
Solution:
I've cloned my endpoint and deleted the original one. The cloned one seems to work just fine.
Try to create new endpoint
@flash-singh might wanna check this also @Alpay Ariyak
Yeah, keep this Discord thread open so staff can check this issue
@yarcat In the future would you be able to provide an endpoint ID so we could investigate further
absolutely, sorry I didn't do it this time
No worries!
@yarcat is this resolved? I know we had a bug regarding this a few weeks ago and it was fixed; not sure if that fixed the issue for you
yeah he did this
Sorry for the delay replying. I've cloned the pod, as pointed out above. No issues since then
Don't want to create a new thread, so I'm kinda resurrecting this one. I've noticed that lately my i3a4tvypolo9bp serverless workers have been restarting quite frequently (flash boot is enabled). Today it was quite horrible: we had a 90-minute demo session, and it felt like the endpoint spent more time restarting than actually serving.
I've been using this endpoint for almost a month, and it never behaved like that. Over the last few days I noticed it restarting noticeably more often, but today it is just horrible.
Container disk size is super small; make it at least 50 GB
Also, all machines are now 12.1+, so you can use that worker
Thanks!!!
Of course! Did that solve it?
Hey alpay is the new vllm version for vllm worker coming out soon
I'd like to understand that pricing a bit. I have two serverless instances, each with 1 active worker:
1. vllm-4t9gk2df0h0pds runs 2xA40: 2 × $0.0002 × 60 × 60 × 24 = $34.56/day
2. vllm-2513y3iyxauhz2 runs 4xA40: 4 × $0.0002 × 60 × 60 × 24 = $69.12/day
Which gives us around $104/day.
Both instances use 50 GB of local disk storage (as was recommended above).
However, my dashboard shows $123+/day for exactly this setup, and at some point it was above $200/day, when we accidentally configured 200 GB of disk space per instance.
I'd like to understand if there is a way for me to optimize it. $160+/day is kinda over my budget, and I'm trying to understand how, by deprecating on-demand instances in favor of serverless, I actually started to pay more.
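For reference, the cost math above can be sketched as below (this assumes the rates quoted in the thread: $0.0002 per GPU-second for an always-active A40 worker; it intentionally excludes storage and scaling workers, which is likely where the gap between ~$104 and the $123+ dashboard figure comes from):

```python
# Sketch of the poster's cost math; rate taken from the thread,
# not from official pricing pages.
RATE_PER_GPU_SECOND = 0.0002  # $/GPU-second, assumed A40 active-worker rate
SECONDS_PER_DAY = 60 * 60 * 24  # 86,400

def daily_cost(num_gpus, rate=RATE_PER_GPU_SECOND):
    """Daily cost of one always-active worker with num_gpus GPUs."""
    return num_gpus * rate * SECONDS_PER_DAY

two_a40 = daily_cost(2)    # vllm-4t9gk2df0h0pds: ~$34.56/day
four_a40 = daily_cost(4)   # vllm-2513y3iyxauhz2: ~$69.12/day
print(two_a40, four_a40, two_a40 + four_a40)  # ~103.68 total
```

That matches the "around $104/day" figure, so the extra charge on the dashboard must be coming from something outside this calculation (container/volume storage, extra workers spinning up, etc.).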
reach out to support with details
Anyone using serverless? I've tried to implement a custom handler directly from the RunPod docs, but when I deploy the endpoint, the worker gets stuck in Initializing status forever.
Endpoint ID: 68qrx2ls5p4v23
Logs
Dockerfile:
- rp_handler.py
All that looks fine. What happens when you submit an API request? Are you using Run or RunSync? What is the JSON you are POSTing?
I'm trying to run the request directly from the WebUI
Does the status change if you query it again? Does it stay in the QUEUE or does it go to processing?
yes, it is in the queue
my previous deployment with the same template was in the queue for a few hours before I deleted it
Looks like your code is looking for input['name'] which you have not provided.
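For what it's worth, if the handler reads `input['name']`, the request body submitted from the WebUI (or POSTed to the endpoint) has to wrap the payload in a top-level `"input"` object; the `"test"` value below is made up for illustration:

```python
import json

# RunPod delivers the request body's "input" object to the handler,
# so a handler reading input["name"] needs a body shaped like this:
payload = {"input": {"name": "test"}}

body = json.dumps(payload)
print(body)  # {"input": {"name": "test"}}
```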
Can you show me a screen shot of your serverless endpoint settings?
Agree with you regarding "name".
But even with "name" it is stuck; tried it just now
the logs
Click Container logs... that will give you details of the application not just the container
nothing in container logs, also tried to look into it, 1 sec
your serverless endpoint is likely wrong... what do you have for these settings?
which GPU did you select?
That all looks fine.. not sure what is causing your problem 😦
same thing
but anyways, thx for trying
i've just sent request to support, hopefully they have more detailed debug logs
wait i think I found problem
listening
maybe this
if __name__ == '__main__':
that's the default Python entrypoint, but yeah, your filename isn't main, I guess
so it doesn't get executed, no jobs get picked up, and the workers won't start processing, I guess
Possible explanation: instead of calling
You should call your handler directly. Like this:
Can you try that code and see if you run into the same issue?
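For context, a minimal handler along the lines of the RunPod docs looks roughly like this (the function and payload names here are illustrative, not the poster's actual code; the commented-out lines show how the worker loop is started on RunPod):

```python
# Hypothetical minimal RunPod-style handler for local testing.
def handler(job):
    """RunPod passes a job dict with the request payload under 'input'."""
    name = job["input"].get("name", "World")
    return {"greeting": f"Hello, {name}!"}

# Local smoke test, mimicking the payload RunPod would deliver:
print(handler({"input": {"name": "yarcat"}}))  # {'greeting': 'Hello, yarcat!'}

# On RunPod itself, the worker loop is started with the SDK:
# import runpod
# runpod.serverless.start({"handler": handler})
```

Whether `runpod.serverless.start(...)` sits at module level or under an `if __name__ == '__main__':` guard shouldn't matter as long as the container actually executes the file, which is consistent with the conclusion reached later in the thread.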
will try
I'm sorry, but that's not a problem; it is expected for Python
the code works locally
Looking at your code it should work... just trying to eliminate anything that is complicating it to see if that helps reveal the issue.
still the same
2024-08-16T11:37:21Z 3763ad45abf5 Extracting [==================================================>] 284B/284B
2024-08-16T11:37:21Z 3763ad45abf5 Extracting [==================================================>] 284B/284B
2024-08-16T11:37:21Z 3763ad45abf5 Pull complete
2024-08-16T11:37:21Z Digest: sha256:1210452dc40c2cd7615d0112b9a1c8e96196097d0a64ec1eb9d0d211e4d419dc
2024-08-16T11:37:21Z Status: Downloaded newer image for denrykhlov/test@sha256:1210452dc40c2cd7615d0112b9a1c8e96196097d0a64ec1eb9d0d211e4d419dc
2024-08-16T11:37:21Z worker is ready
still initializing
What happens when you submit a request?
you mean a REST request?
it is awaiting in the queue
Can you try to run it with RunSync? It is such a small image with limited code, it should have no problem with that.
it just waits, waits, waits... nothing is happening; the Requests counter just increases by one more request...
I've tried all I know.. it should be working. 😦
just wondering if my account is misconfigured or whatever...
Have you opened a ticket with RunPod? You can do so by going to help->Contact in RunPod site.
yep, tried already ~ 1 hour ago
waiting
I find it generally takes ~24 hours or so to get a response
yep, that's what they are saying on the website
will see if they can find something
good luck
thx man!
did it work or what?
no, still not working
Support just replied; they say there is no available pod for my serverless container
even though I've selected Global for GPU with "High availability"
how many gpus per worker?
1
I took a look at your pod, and it seems to be stuck in the initializing stage. When you build your image, are you on an Apple Silicon Mac? Try this when you build:
docker build --platform=linux/amd64 -t your-image-name .
I pushed my image to docker hub
thx for trying; yes, it's stuck in the initialization stage.
I'm building the image on an Ubuntu virtual machine, and yes, I'm providing the --platform=linux/amd64 flag as well.
Interesting story: just now I tried to create a new serverless endpoint with your image, @yhlong00000, and with mine, and it went well for both! I'm able to send requests and receive the expected results.
The worker now shows state "Idle" instead of "Initializing".
Looks like the RunPod team did something and now it works.
Hopefully in the future serverless workers will be more stable.
Anyways, thanks everyone for participating.
It's good to see the RunPod community is quite active!
Glad everything worked! 🥳