Migrated from RO to IS
Hello, I got a message from you saying that RO would be deprecated, so I decided to move everything to IS, since it has the highest availability of 4090s.
Everything is working, but sometimes my serverless tasks die without any output or logs. They just die all of a sudden. Is anyone else having the same issue?
Doesn't look like an out-of-memory issue, since it worked perfectly before. I've also tried an A100, just for the sake of it...
The worker disappears, and the request keeps counting in inProgress, but the worker is gone.
I think you mean NO-1
FUCK xD
you are correct
Nevertheless, I think those EUR-IS datacenters are not behaving normally.
I will go back to RO.
I would not use RO, as it's very heavily used.
But my workers are constantly dying in IS, while in RO I had a stable deployment.
I am not sure why, since no output is provided in IS.
What did not work on EUR-IS? Seems like there should be plenty of 4090s available.
I am not sure yet. I am making sure it is not an issue on my code's end.
If you have an endpoint ID or request ID, we can take a look.
hello.
I will share the request ids that are failing + the endpoint.
endpoint: 9px4kz08hlv944
failing request example with silent kill:
- 793b8adf-0954-4a70-8664-0a522bd4ba59-u1
I've rerun it, and it eventually worked on this request:
- 9eff2b0e-4908-47c8-ad38-7f26f40eea59-u1
There is something odd lurking that I am not able to figure out.
The request with ID 793b8adf-0954-4a70-8664-0a522bd4ba59-u1 ran for 2 minutes before the worker failed to respond to our health check. The job was returned to the queue and picked up by a different worker.
Job Start Time: 10/22 at 08:16:45 Eastern Time
Health Check Failure: No response after 08:17:36
If you have access to the container logs, maybe review them to identify where the process might have stalled.
Hmm, okay, let me check. But it looked stochastic. Nevertheless, I am keeping the hypothesis that I am the one at fault.
To be honest, I am running:
tail -f /dev/null
I jump into the machine using web SSH, in serverless, to replicate the same context.
I run the original script manually: python something.py
Sometimes it dies after the Backblaze downloads.
Sometimes during the ComfyUI workflows.
Very odd.
I have nvtop and htop spinning, and I also constantly refresh the RunPod stats page. Then: connection closed.
Any upload/download caps?
I am doing multiple runs with the same input, holding the machine via SSH + tail -f /dev/null to get some live logging from the web terminal.
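For context, a minimal sketch of that debug setup as described above, assuming the base image quoted just below and with something.py standing in for the real job script:

# Debug-only sketch: keep the worker container idle so you can web-SSH in,
# run the job by hand, and watch nvtop/htop while it runs.
FROM runpod/pytorch:3.10-2.0.0-117 AS base

# The script that normally runs as the serverless job.
COPY something.py /something.py

# Replace the normal entrypoint with a no-op so the container just stays alive.
CMD ["tail", "-f", "/dev/null"]

# Then, from the web terminal / SSH session, run it manually:
#   python /something.py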
Could it be that my image version is jurassic?
FROM runpod/pytorch:3.10-2.0.0-117 AS base
Before this, I had this deployment running for months.
I changed some disks and already rolled back to RO.
Old disks, old code.
The fourth time it was successful... after failing 3 times on those Backblaze downloads. I am running it all manually using tail -f /dev/null to allow some manual monitoring. Some of these failed requests stay in the inProgress counters for a while, even though no worker is working on them.
Streak:
- 44638736-1794-4bb8-b90e-931dec6b4464-u1
- b66ea0a0-6aa6-42a1-b763-0978a4eaedcd-u1
- 8c0b5f64-b467-4c36-878b-f62951428456-u1
- b2ff4de8-e000-459c-9999-7610fb027099-u1
- 18b73afc-20c0-4b70-9f4b-6c018297b19d-u1
- 028058da-ce4a-41b6-ad63-41ad501b644e-u1
Some of those are still ongoing in the requests tab.
I don't rule out Backblaze issues, but killing the process silently like this, despite try/catch, looks like it comes from "above".
Happened again:
- 1st 6bafca8d-0517-423a-90ed-c9e51be2be72-u1
- 2nd e69f7942-72b2-490b-a951-d14a0ca11015-u1
- 3rd 2580c326-2de0-453e-9f13-6cc7510b9ead-u1
- 4th 475edfb2-2c8a-4546-8bc7-d94e595a6fcb-u1 ✅ Worked
Just sharing the storyline so far.
- I misread the email: NO is being deprecated, but RO is fine.
- I had disks in NO as backups, but I was not using them, so I migrated those to IS (more 4090s according to the dashboard stats).
- I had the RO disks working with serverless with this code for some months now, at 99.9% SLA.
- I tested my serverless with IS (instead of the main RO region disks).
- I started having some tasks fail stochastically, as if I were being throttled on download/upload rates.
- I went back to my old, untouched RO disks. Started having the same issues as in IS.
Since the 25th of July, not one failure. (Sorry, one, heheh.)
22 since the weekend (after doing my IS tests).
Thank you for the patience 🙏
This endpoint (another one affected) had 12k requests without any hiccups.
Since the weekend, 109 failed.
Any ideas? I'm a bit clueless about what to do next.
I reviewed a few of the requests you shared above, and it appears to be an issue with our 1.7.3 SDK. Could you try downgrading to 1.6.2? Alternatively, 1.7.2 should work as well, though it might have a slight delay on the first request.
Okay, you mean the runpod Python package version?
On it, sir! Thank you for the feedback.
Gotcha, it is indeed the runpod pip package version.
Changing this.
This worked like a charm.
You are the real MVP
Doing some more tests, but the outlook is great.
You can close this. Highest kudos to you guys
👌🏻 Glad everything works. BTW, which version did you choose in the end? Again, sorry for the inconvenience.
1.6.2
I wanted to try 1.7.2
but I was short on time, and the client was impatient.
You guys can send this thread to oblivion.
Ah, many were experiencing issues with the SDK.
The thread can be kept open for others to see.
So EUR-IS was fine, and versions should just be locked if you want to use them? Definitely over long periods, so the code does not change behaviour.
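For reference, a minimal sketch of locking the SDK version in the image, assuming the base image quoted earlier in the thread and with something.py standing in for the actual handler script:

# Sketch: pin the runpod pip package so a rebuild cannot silently pull 1.7.3.
FROM runpod/pytorch:3.10-2.0.0-117 AS base

# 1.6.2 is what ended up being used here; 1.7.2 was suggested as an alternative.
RUN pip install --no-cache-dir runpod==1.6.2

# something.py stands in for the real serverless handler script.
COPY something.py /something.py
CMD ["python", "-u", "/something.py"]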