RunPod2mo ago
solid

Migrated from RO to IS

Hello, I got a message from you saying that RO would be deprecated, so I decided to move everything to IS, since it has the highest availability of 4090s. Everything is working, but sometimes my serverless tasks die without any output or logs. They just die all of a sudden. Is anyone else having the same issue?
25 Replies
solid
solidOP2mo ago
Doesn't look like an out-of-memory issue, since it worked perfectly before. I've also tried an A100 just for the sake of it... the worker disappears, and the request keeps counting in inProgress, but the worker is gone.
Madiator2011
Madiator20112mo ago
I think you mean NO-1
solid
solidOP2mo ago
FUCK xD you are correct. Nevertheless, I think those EUR-IS datacenters are not normal. I will go back to RO.
Madiator2011
Madiator20112mo ago
I would not use RO as it's very heavily used.
solid
solidOP2mo ago
But my workers are constantly dying in IS, while in RO I had a stable deployment. I am not sure why, since no output is provided in IS.
mitchken
mitchken2mo ago
What did not work on EUR-IS? It seems like there should be plenty of 4090s available.
solid
solidOP2mo ago
I am not sure yet. I am making sure it is not on my code's end.
yhlong00000
yhlong000002mo ago
If you have an endpoint ID or request ID, we can take a look.
solid
solidOP2mo ago
Hello. I will share the request IDs that are failing, plus the endpoint.
Endpoint: 9px4kz08hlv944
Failing request example with silent kill:
- 793b8adf-0954-4a70-8664-0a522bd4ba59-u1
I reran it, and it eventually worked on request:
- 9eff2b0e-4908-47c8-ad38-7f26f40eea59-u1
We have some odd latent issue here; I am not able to figure out what it is.
yhlong00000
yhlong000002mo ago
The request with ID 793b8adf-0954-4a70-8664-0a522bd4ba59-u1 ran for 2 minutes before the worker failed to respond to our health check. The job was returned to the queue and picked up by a different worker.
Job start time: 10/22 at 08:16:45 Eastern Time
Health check failure: no response after 08:17:36
If you have access to the container logs, maybe review them to identify where the process might have stalled.
solid
solidOP2mo ago
Hmm, okay, let me check. But it looked stochastic. I am still keeping the hypothesis that I am the one at fault, though. To be honest, I am running tail -f /dev/null and jumping into the machine using web SSH, in serverless, to replicate the same context, then running the original script manually: python something.py. It dies sometimes after the Backblaze downloads, sometimes during the ComfyUI workflows. Very odd. I have nvtop and htop spun up, and I constantly refresh the RunPod stats page as well.
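(A rough sketch of the manual-reproduction setup described above, for reference; the commands and the placeholder script name something.py come straight from the message, while running them as a debug-only override of the worker's normal start command is an assumption.)

# Debug-only override: keep the serverless container alive so it can be
# inspected over web SSH instead of running the normal handler.
tail -f /dev/null &

# In separate web-SSH sessions, watch GPU and CPU/memory while reproducing the job:
#   nvtop   -> GPU utilization and VRAM
#   htop    -> CPU and RAM

# Re-run the original job script by hand to see where it stalls
# (after the Backblaze downloads vs. during the ComfyUI workflow).
python something.py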
solid
solidOP2mo ago
(image attached, no description)
solid
solidOP2mo ago
Connection closed. Are there any upload/download caps? I am doing multiple runs with the same input, capturing the machine via SSH + tail -f /dev/null to get some live logging from the web terminal. Could it be that my image version is jurassic?
FROM runpod/pytorch:3.10-2.0.0-117 AS base
Before this, I had this deployment running for months. I changed some disks and have already rolled back to RO: old disks, old code. The fourth time it was successful... after failing 3 times on those Backblaze downloads. I'm running it all manually with tail -f /dev/null to allow some manual monitoring.
solid
solidOP2mo ago
(image attached, no description)
solid
solidOP2mo ago
Some of these failing machines stay in the inProgress counters for a while, even though no worker is working.
Streak:
- 44638736-1794-4bb8-b90e-931dec6b4464-u1
- b66ea0a0-6aa6-42a1-b763-0978a4eaedcd-u1
- 8c0b5f64-b467-4c36-878b-f62951428456-u1
- b2ff4de8-e000-459c-9999-7610fb027099-u1
- 18b73afc-20c0-4b70-9f4b-6c018297b19d-u1
- 028058da-ce4a-41b6-ad63-41ad501b644e-u1
Some of those are still ongoing in the requests tab. I don't rule out Backblaze issues, but killing the process silently like this despite try/catch looks like it comes from "above".
Happened again:
- 1st 6bafca8d-0517-423a-90ed-c9e51be2be72-u1
- 2nd e69f7942-72b2-490b-a951-d14a0ca11015-u1
- 3rd 2580c326-2de0-453e-9f13-6cc7510b9ead-u1
- 4th 475edfb2-2c8a-4546-8bc7-d94e595a6fcb-u1 ✅ Worked
Just sharing the story line so far:
- I misread the email, so NO is being deprecated but RO is fine.
- I had disks in NO as backups, but I was not using them, so I migrated those to IS (more 4090s according to the dashboard stats).
- I have had the RO disks working with serverless with this code for some months now, SLA 99.9%.
- I tested my serverless with IS (instead of the main RO region disks).
- I started having some tasks with stochastic failures, as if I were being throttled on down/up rates.
- I went back to my old and untouched RO disks. Started having the same issues as in IS.
solid
solidOP2mo ago
(image attached, no description)
solid
solidOP2mo ago
Since the 25th of July, not one failure. (Sorry, one heheh.) 22 since the weekend (after doing my IS tests). Thank you for the patience 🙏
solid
solidOP2mo ago
This endpoint (another one affected) had 12k without any hiccups.
(image attached, no description)
solid
solidOP2mo ago
Since the weekend, 109 have failed. Any ideas? I'm a bit clueless about what to do next.
yhlong00000
yhlong000002mo ago
I reviewed a few of the requests you shared above, and it appears to be an issue with our 1.7.3 SDK. Could you try downgrading to 1.6.2? Alternatively, 1.7.2 should work as well, though it might have a slight delay on the first request.
solid
solidOP2mo ago
Okay, you mean the runpod Python package version? On it, sir! Thank you for the feedback. Gotcha, it is indeed the runpod pip package version; changing this now. This worked like a charm. You are the real MVP. Doing some more tests, but the outlook is great. You can close this. Highest kudos to you guys.
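(For anyone hitting the same silent worker deaths later: a minimal sketch of the fix discussed above, assuming the runpod SDK is installed with pip inside the worker image. 1.6.2 is the version that ended up being used here; 1.7.2 was suggested as an alternative.)

# Pin the RunPod Python SDK to a known-good release instead of floating to the latest,
# e.g. in the image build or a requirements file:
pip install "runpod==1.6.2"

# Verify the installed version before rebuilding / redeploying the worker image:
pip show runpod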
yhlong00000
yhlong000002mo ago
👌🏻 Glad everything works. Btw, which version did you choose in the end? Again, sorry for the inconvenience.
solid
solidOP2mo ago
1.6.2. I wanted to try 1.7.2, but I was short on time and the client was impatient. You guys can send this thread into oblivion.
nerdylive
nerdylive2mo ago
Ah, many were experiencing issues with the SDK. The thread can be kept open for others to see.
mitchken
mitchken2mo ago
So EUR-IS was fine, and versions should just be locked if you want to use them? Definitely over long periods, so the code does not change behaviour.