Restarting without error message

I'm deploying some code to serverless and the process seems to crash and restart without an error message. The logs only show that it has restarted; I can tell from my own startup logging. In the end I got it working by using a specific version of CUDA and a specific version of a dependency, but I would like to know why it crashes so I can fix it. Everything works fine locally with nvidia-docker...
7 Replies
lucasavila00 · 6mo ago
I have a custom template that can reproduce the issue. I deleted the broken workers and logs.
ashleyk · 6mo ago
It's impossible to tell unless you add error logging to your handler. Then you can view the error logs in your logs tab.
lucasavila00 · 6mo ago
I have error logging, but it shows nothing. It prints the model path, then restarts.
llama2 = None
try:
    if not IS_STUB:
        with open("path.txt", "r") as f:
            model_path = f.read()
        print(model_path)  # prints up to here
        llama2 = models.LlamaCpp(
            model_path, n_gpu_layers=-1, n_ctx=8192, echo=False
        )
except Exception as e:
    print(e)
    print("failed to load model")
    # sleep for 5s
    time.sleep(5)
    raise e
print("loaded model")
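A crash inside a native extension (such as the shared library behind models.LlamaCpp) kills the process before the except block ever runs, so nothing is printed. One way to still get a traceback on hard crashes (a sketch, not part of the original code) is Python's standard faulthandler module, enabled at the top of the worker script:

```python
import faulthandler
import sys

# Enable the fault handler as early as possible, before importing any
# native extension. On a hard crash (SIGSEGV, SIGABRT, ...) Python dumps
# the tracebacks of all threads to stderr before the process dies.
faulthandler.enable(file=sys.stderr, all_threads=True)
```

Combined with unbuffered output (the -u flag already in the CMD), this usually pins down which native call the process died in.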
ashleyk · 6mo ago
Best to test it on GPU Cloud to determine what the issue is then; maybe it can't find the path.txt file or something.
lucasavila00 · 6mo ago
I can fix it by downgrading https://github.com/abetlen/llama-cpp-python/releases to v0.2.23. The path etc. work correctly; I'm testing it locally with nvidia-docker too. To me it feels like a bug in the serverless UI: it seems it can't report logs if the Python process crashes. I did not try to reproduce this with another process. The Docker command is CMD python3.11 -u /handler.py
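For reproducibility, the workaround above amounts to pinning the known-good dependency version in the image build (only the llama-cpp-python version comes from the thread; everything else in a real Dockerfile is an assumption):

```shell
# Pin the dependency version that is known to work with this worker.
pip install llama-cpp-python==0.2.23
```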
ashleyk · 6mo ago
It can only report exceptions once you actually call runpod.serverless.start(); it is not aware of any exceptions raised before that is called, so it's not a bug.
lucasavila00 · 6mo ago
It is very weird because logging works: I can print "about to call the LlamaCpp constructor" and that message shows up in the logs in the UI. But it doesn't show the error; it just shows the NVIDIA CUDA Version 11.8.0 banner etc. that appears when the Docker image starts.
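Since a hard crash can take buffered output with it, flushing each log line explicitly (in addition to the -u flag already in the CMD) makes the "last message before the restart" reliable. A minimal sketch (the log helper is an assumption, not from the thread):

```python
import sys

def log(msg: str) -> None:
    # Write and flush immediately so the line reaches the log collector
    # even if the process is killed right after this call.
    print(msg, file=sys.stderr, flush=True)

log("about to call the LlamaCpp constructor")
```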