When serverless is used, does the machine reboot if it is executed consecutively? Currently seeing issues with the last execution affecting the next.
There's been a problem with this worker ID: alf4z19ubk8n71
v8lkcxxh6wjd6q k5xlystwyzjbm3
FlashBoot isn't guaranteed. It's influenced by how many workers you have and whether you're sending a constant flow of requests or not.
I don't know what you mean by the machine rebooting though. Machines don't reboot. The container starts to handle a request, then shuts down again once the idle timeout is reached and it isn't processing any further requests.
I assume you're referring to flash boot though and nothing to do with rebooting.
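For context, a minimal sketch of that lifecycle in handler terms (the handler body here is illustrative, not your code):

import runpod  # RunPod serverless SDK

def handler(job):
    # Called once per request. After it returns, the worker stays warm until
    # the idle timeout expires, then the container shuts down; the next
    # request cold-starts it again unless FlashBoot / active workers soften that.
    job_input = job.get("input", {})
    # ... your ComfyUI call would go here ...
    return {"echo": job_input}

runpod.serverless.start({"handler": handler})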
I don't have FlashBoot turned on. Because of my ComfyUI setup, I need to start one service; if the old environment is not cleared, will the service startup conflict at that point?
You have to enable flash boot, otherwise you have a cold start on every single request. I use flash boot with comfyui and it works perfectly. Depends on which comfyui worker you're using though.
Thank you very much for the advice, I'll try it later. But my problem at the moment isn't that it takes a long time, it's that it often reports errors. I think serverless should be a clean environment every time it executes.
I'm getting tons of errors on my service for this reason
What errors? Without logs, nobody can advise.
This is the error I got; I didn't post it earlier because I didn't feel it was generalizable.
It feels like the execution reported an error, and the ComfyUI service had already stopped by the time the results were fetched.
Looks like you are trying to connect to the ComfyUI API before it's ready to start taking requests
You need to do something like this to wait for the service to become ready before sending requests:
https://github.com/ashleykleynhans/runpod-worker-comfyui/blob/main/rp_handler.py#L33-L49
import time      # needed for the retry delay
import requests  # needed for the HTTP check

def check_server(url, retries=50, delay=500):
    """
    Check if a server is reachable via HTTP GET request

    Args:
    - url (str): The URL to check
    - retries (int, optional): The number of times to attempt connecting to the server. Default is 50
    - delay (int, optional): The time in milliseconds to wait between retries. Default is 500

    Returns:
    bool: True if the server is reachable within the given number of retries, otherwise False
    """
    for i in range(retries):
        try:
            response = requests.get(url)

            # If the response status code is 200, the server is up and running
            if response.status_code == 200:
                print(f"runpod-worker-comfy - API is reachable")
                return True
        except requests.RequestException as e:
            # If an exception occurs, the server may not be ready
            pass

        # Wait for the specified delay before retrying
        time.sleep(delay / 1000)

    print(
        f"runpod-worker-comfy - Failed to connect to server at {url} after {retries} attempts."
    )
    return False
I did wait for the service to start before executing the workflow.
Here's the log from one of the errors; it looks like the ComfyUI service was started, but with an exception.
This is the log of normal operation
It looks like Comfyui's service was shut down during the execution.
Does RunPod serverless automatically close ports?
No
So, there should be some errors that are stopping it
There is no error message here, but the port is closed; an error during ComfyUI execution does not cause the ComfyUI port to be closed.
So the port is still listening?
try gimme the output of
lsof -i -P -n
or ss -tulw
Here's another error log, looks like the port may have been closed at an arbitrary time
Huh copy that error type
What information do I need to provide
That error exception there
oh sorry this is serverless yeah
what ports did you mean is closed?
comfyui port 8188
Wait, yeah, I think digigoblin is right. Then how did you call
def check_server(url, retries=50, delay=500):
?
Serverless doesn't open ports to the public internet
This log looks like another kind of error, let's finish looking at the above one first
Yeah sure just copy that log and send it here
I don't need it to be public, I just need it to be internal.
I think this one is what happened during prompt execution.
well then it doesn't have to do with the port closing, might be the app closing
No, my logs show a port closure, which causes a request to port 8188 to be rejected and then an error.
well then your comfyui isn't listening and up yet.
that is why
Back to my question: how did you call the check_server function?
No, the logs show ComfyUI's workflow is up and running.
check_server(
    f"http://{COMFY_HOST}",
    COMFY_API_AVAILABLE_MAX_RETRIES,
    COMFY_API_AVAILABLE_INTERVAL_MS,
)
resBucketFileName = new_predict_turn_video_style(input_file, uid, task_id, pre_workflow_api)
That's it, it's wrong. Use another endpoint to check, not the root one.
Try just applying this, and look at how ashleyk's code calls that function.
Don't make this so hard for yourself, just use what is working if you're not sure what's going on...
I don't think that's the problem at all
it's because the root is ready before the whole comfy backend is ready
Hm, then it's your extension, if that's not it
try deactivating all custom nodes
and then activate them one by one, and see which causes that
I'm 80% sure the call that isn't making sure all the backends are ready is causing the workflow execution error
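Something like this is what I mean, a rough sketch that polls an actual ComfyUI API route instead of the root page (I'm assuming /system_stats is available in your ComfyUI build, check yours):

import time
import requests

def wait_for_comfy(host="127.0.0.1:8188", retries=50, delay_ms=500):
    # Poll a ComfyUI API route rather than the root HTML page, so we only
    # report "ready" once the backend itself responds, not just the web server.
    url = f"http://{host}/system_stats"  # assumed to exist in your ComfyUI build
    for _ in range(retries):
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass  # not accepting connections yet
        time.sleep(delay_ms / 1000)
    return False

Same idea as ashleyk's check_server, just pointed at an API route instead of /.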
This problem comes up intermittently, not always.
Why is that? I run the workflow after the check_server function, so why would it not be fully ready?
Hmm, not sure what your code looks like
consult ComfyUI's code for more info
There are also logs of the workflow running, indicating that ComfyUI has finished starting up.
So it's like this, if I'm not wrong
try active workers if you want to keep ComfyUI running all the time
As I said, that's a different one; let's go back to my initial couple of logs.
this log
Yep so, the process might die between jobs
these logs
Is it this error?
yk i can't expand screenshots so
yes
and it might need initializing again
and you're waiting on the wrong endpoint, as I said here
Can you be more specific about this situation?
when your worker isn't active, after it becomes idle once it finishes running (the grey workers), the process might exit
so it needs to be initialized again
and you're not checking the right endpoint for whether it's ready or not (my guess, since you didn't apply the fix yet)
Do you mean that currently, after that check_server of mine, the service may be shut down?
No, before check_server (after the old job finished) it might exit, so it needs to be re-initialized
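so in the handler you can't assume the ComfyUI process from the old job is still alive, something like this sketch (install path and launch command are just my assumptions):

import subprocess

comfy_proc = None  # module-level handle to the ComfyUI process

def ensure_comfy_running():
    # Relaunch ComfyUI if it was never started or has exited between jobs
    # (e.g. it died on an OOM during the previous request).
    global comfy_proc
    if comfy_proc is None or comfy_proc.poll() is not None:
        comfy_proc = subprocess.Popen(
            ["python", "main.py", "--listen", "127.0.0.1", "--port", "8188"],
            cwd="/ComfyUI",  # assumed install path, adjust to your image
        )
    # then wait for the API (check_server / wait_for_comfy) before queueing the prompt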
Is this log due to that reason?
The log indicates that the service has been started
The second problem is also due to the first execution error: the port was closed, and when subsequent requests came in the service was not started, resulting in check_server failing.
I looked at all the error messages; there are 2 causes:
1. The port was closed during execution
2. Since the port was closed last time, the next request came in without the service being started, which caused continuous errors
The second problem is caused by the first problem.
I would like to troubleshoot the 1st problem first
The first problem is definitely not caused by checking the wrong location; it is obvious that the port is closed during execution. But a general error will not cause the ComfyUI port to be closed, so I would like to check whether this is system behavior.
One possible reason I can think of is GPU OOM
Can you guys see the logs related to GPU OOM?
yeah there should be some logs for that
Depends whether ComfyUI logs it or not. I don't think RunPod has access to logs that you don't log yourself, and all your logs should be available under the logs tab for your endpoint, so you shouldn't need RunPod to check the logs for you.
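If you want OOM-related info in there, log GPU memory yourself from the handler, e.g. a rough sketch like:

import subprocess

def log_gpu_memory(tag=""):
    # Print current GPU memory usage so it ends up in the worker's own logs
    # (RunPod's logs tab only shows what your container prints).
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print(f"gpu-memory {tag}: {out.stdout.strip()}")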
GPU OOM is a lack of system resources. Don't you have a separate record for that?
I've reproduced the problem, and it's due to GPU OOM.
@digigoblin @nerdylive
Let's move on to the next question: what does the serverless environment look like after the old serverless task has finished running and a new task comes in? (Without FlashBoot or active workers enabled.)
Because I've found that it causes chained problems here: only 1 task actually caused the OOM, but many ended up failing.
It's an error you should catch hahah
if the job fails it'll retry
well it's like idle
but it's faster
Will the GPU be emptied? Won't there be any effects from the last job?
Depends on your code
usually ComfyUI moves the model into RAM, if I'm not wrong
So if I don't have an action in my code that will clear the GPU, it could be affected yes?
yeah
Hm, then move to a GPU with more VRAM?
What can I do so that my serverless worker handles each request with a completely new environment and is not affected by old requests?
Yes, and I will optimize the code to be more GPU efficient for different user inputs
refresh worker to make sure it restarts
or unload from VRAM in code
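for the unload-in-code route, a rough sketch like this between jobs (the /free route only exists in recent ComfyUI versions, so check yours before relying on it):

import requests

def unload_comfy_models(host="127.0.0.1:8188"):
    # Ask the ComfyUI server to drop loaded models and free cached VRAM
    # between jobs. If your ComfyUI is too old to have /free this will 404,
    # in which case fall back to refresh_worker instead.
    try:
        requests.post(
            f"http://{host}/free",
            json={"unload_models": True, "free_memory": True},
            timeout=5,
        )
    except requests.RequestException:
        pass  # server already gone; a worker refresh will clean up anyway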
Actually, it's not just VRAM but also the ComfyUI server that I'd like to restart on every request, so I wish there was a way to handle it simply by just clearing the environment.
How does this work exactly?
Huh why tho
read runpod's docs
Additional controls | RunPod Documentation
Send progress updates during job execution using the runpod.serverless.progress_update function, and refresh workers for long-running or complex jobs by returning a dictionary with a 'refresh_worker' flag in your handler.
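i.e. a minimal sketch of what that doc describes (run_comfy_workflow is a stand-in for your own pipeline call):

import runpod

def handler(job):
    try:
        output = run_comfy_workflow(job["input"])  # stand-in for your own pipeline call
        return {"output": output}
    except Exception as e:
        # Tell RunPod to tear this worker down after the job, so the next
        # request gets a freshly started container instead of a broken one.
        return {"error": str(e), "refresh_worker": True}

runpod.serverless.start({"handler": handler})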
If you just used one of the tried and tested repos on Github instead of trying to roll your own, you would not have all these issues, for example:
https://github.com/ashleykleynhans/runpod-worker-comfyui/blob/main/rp_handler.py#L250-L256
Okay, I'll take a look.
Because I have a lot of customization, that alone isn't enough, but I should still learn a lot from the code here, thanks
Try ComfyUI's CLI args too, if they support it
Okay, thank you both. I'll come back if I have any other questions later.