Restarting without error message

I'm deploying some code to serverless and the process seems to crash and restart without an error message. The logs only show that it has restarted; I can tell from my own startup logging. In the end I got it working by using a specific version of CUDA and a specific version of a dependency, but I would like to know why it crashes so I can fix it. Everything works fine locally with nvidia-docker...
7 Replies
lucasavila00 · 6mo ago
I have a custom template that can reproduce the issue. I deleted the broken workers and logs.
ashleyk · 6mo ago
It's impossible to tell unless you add error logging to your handler. Then you can view the error logs in your logs tab.
lucasavila00 · 6mo ago
I have error logging, but it shows nothing. It prints the model path, then restarts.
llama2 = None
try:
    if not IS_STUB:
        with open("path.txt", "r") as f:
            model_path = f.read()
        print(model_path)  # prints up to here
        llama2 = models.LlamaCpp(
            model_path, n_gpu_layers=-1, n_ctx=8192, echo=False
        )
except Exception as e:
    print(e)
    print("failed to load model")
    # sleep for 5s
    time.sleep(5)
    raise e
print("loaded model")
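A crash inside a native extension (such as the shared library behind models.LlamaCpp) kills the process before the except block ever runs, so nothing is printed. One way to still get a traceback on hard crashes (a sketch, not part of the original code) is Python's standard faulthandler module, enabled at the top of the worker script:

```python
import faulthandler
import sys

# Enable the fault handler as early as possible, before importing any
# native extension. On a hard crash (SIGSEGV, SIGABRT, ...) Python dumps
# the tracebacks of all threads to stderr before the process dies.
faulthandler.enable(file=sys.stderr, all_threads=True)
```

Combined with unbuffered output (the -u flag already in the CMD), this usually pins down which native call the process died in.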
ashleyk · 6mo ago
Best to test it on GPU Cloud to determine what the issue is then; maybe it can't find the path.txt file or something.
lucasavila00 · 6mo ago
I can fix it by downgrading https://github.com/abetlen/llama-cpp-python/releases to v0.2.23. The path etc. work correctly; I'm testing it locally with nvidia-docker too. To me it feels like a bug in the serverless UI: it seems it can't report logs if the Python process crashes. I did not try to reproduce this with another process. The Docker command is CMD python3.11 -u /handler.py
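For reproducibility, the workaround above amounts to pinning the known-good dependency version in the image build (only the llama-cpp-python version comes from the thread; everything else in a real Dockerfile is an assumption):

```shell
# Pin the dependency version that is known to work with this worker.
pip install llama-cpp-python==0.2.23
```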
ashleyk · 6mo ago
It can only report exceptions once you actually call runpod.serverless.start(); it is not aware of any exceptions raised before that is called, so it's not a bug.
lucasavila00 · 6mo ago
It is very weird because logging works: I can print "about to call the LlamaCpp constructor" and that message shows up in the logs in the UI. But it doesn't show the error; it just shows the NVIDIA CUDA Version 11.8.0 banner etc. that appears when the Docker image starts.
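Since a hard crash can take buffered output with it, flushing each log line explicitly (in addition to the -u flag already in the CMD) makes the "last message before the restart" reliable. A minimal sketch (the log helper is an assumption, not from the thread):

```python
import sys

def log(msg: str) -> None:
    # Write and flush immediately so the line reaches the log collector
    # even if the process is killed right after this call.
    print(msg, file=sys.stderr, flush=True)

log("about to call the LlamaCpp constructor")
```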