Migrated from RO to IS
Hello, I got a message from you saying that RO would be deprecated, so I decided to move everything to IS, since it has the highest availability of 4090s.
Everything is working, but sometimes my serverless tasks die without any output or logs. They just die all of a sudden. Is anyone else having the same issue?
Doesn't look like an out-of-memory issue, since it worked perfectly before. I've also tried an A100, just for the sake of it...
The worker disappears, and the request keeps counting in inProgress, but the worker is gone.
I think you mean NO-1
FUCK xD
you are correct
Nevertheless, I think those EUR-IS datacenters are not behaving normally.
I will go back to RO.
I would not use RO, as it's very heavily used.
But my workers are constantly dying in IS, while in RO I had a stable deployment.
I am not sure why, since no output is provided in IS.
What did not work on EUR-IS? Seems like there should be plenty of 4090s available.
I am not sure yet. I am making sure it is not an issue on my code's end.
If you have an endpoint ID or request ID, we can take a look.
hello.
I will share the request ids that are failing + the endpoint.
endpoint: 9px4kz08hlv944
failing request example with silent kill:
- 793b8adf-0954-4a70-8664-0a522bd4ba59-u1
I've rerun it, and it eventually worked on this request:
- 9eff2b0e-4908-47c8-ad38-7f26f40eea59-u1
There is something odd lurking that I am not able to figure out.
The request with ID 793b8adf-0954-4a70-8664-0a522bd4ba59-u1 ran for 2 minutes before the worker failed to respond to our health check. The job was returned to the queue and picked up by a different worker.
Job Start Time: 10/22 at 08:16:45 Eastern Time
Health Check Failure: No response after 08:17:36
If you have access to the container logs, maybe review them to identify where the process might have stalled.
Hmm, okay, let me check. But it looked stochastic. Nevertheless, I am keeping the hypothesis that I am the one at fault.
To be honest, I am running:
tail -f /dev/null
I jump into the machine using web SSH, in serverless, to replicate the same context.
I run the original script manually: python something.py
Sometimes it dies after the Backblaze downloads.
Sometimes during the ComfyUI workflows.
Very odd.
I have nvtop and htop spinning, and I also constantly refresh the RunPod stats page. Then: connection closed.
Any upload/download caps?
I am doing multiple runs with the same input, holding the machine via SSH + tail -f /dev/null to get some live logging from the web terminal.
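For context, a minimal sketch of that debug setup as described above, assuming the base image quoted just below and with something.py standing in for the real job script:

# Debug-only sketch: keep the worker container idle so you can web-SSH in,
# run the job by hand, and watch nvtop/htop while it runs.
FROM runpod/pytorch:3.10-2.0.0-117 AS base

# The script that normally runs as the serverless job.
COPY something.py /something.py

# Replace the normal entrypoint with a no-op so the container just stays alive.
CMD ["tail", "-f", "/dev/null"]

# Then, from the web terminal / SSH session, run it manually:
#   python /something.py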
Could it be that my image version is jurassic?
FROM runpod/pytorch:3.10-2.0.0-117 AS base
Before this, I had this deployment running for months.
I changed some disks and already rolled back to RO.
Old disks, old code.
The fourth time it was successful... after failing 3 times on those Backblaze downloads. I am running it all manually using tail -f /dev/null to allow some manual monitoring. Some of these failed requests stay in the inProgress counters for a while, even though no worker is working on them.
Streak:
- 44638736-1794-4bb8-b90e-931dec6b4464-u1
- b66ea0a0-6aa6-42a1-b763-0978a4eaedcd-u1
- 8c0b5f64-b467-4c36-878b-f62951428456-u1
- b2ff4de8-e000-459c-9999-7610fb027099-u1
- 18b73afc-20c0-4b70-9f4b-6c018297b19d-u1
- 028058da-ce4a-41b6-ad63-41ad501b644e-u1
Some of those are still ongoing in the requests tab.
I don't rule out Backblaze issues, but killing the process silently like this, despite try/catch, looks like it comes from "above".
Happened again:
- 1st 6bafca8d-0517-423a-90ed-c9e51be2be72-u1
- 2nd e69f7942-72b2-490b-a951-d14a0ca11015-u1
- 3rd 2580c326-2de0-453e-9f13-6cc7510b9ead-u1
- 4th 475edfb2-2c8a-4546-8bc7-d94e595a6fcb-u1 ✅ Worked
Just sharing the storyline so far.
- I misread the email: NO is being deprecated, but RO is fine.
- I had disks in NO as backups, but I was not using them, so I migrated those to IS (more 4090s according to the dashboard stats).
- I had the RO disks working with serverless with this code for some months now, at 99.9% SLA.
- I tested my serverless with IS (instead of the main RO region disks).
- I started having some tasks fail stochastically, as if I were being throttled on download/upload rates.
- I went back to my old, untouched RO disks. Started having the same issues as in IS.
Since the 25th of July, not one failure. (Sorry, one, heheh.)
22 since the weekend (after doing my IS tests).
Thank you for the patience 🙏
This endpoint (another one affected) had 12k requests without any hiccups.
Since the weekend, 109 failed.
Any ideas? I'm a bit clueless about what to do next.
I reviewed a few of the requests you shared above, and it appears to be an issue with our 1.7.3 SDK. Could you try downgrading to 1.6.2? Alternatively, 1.7.2 should work as well, though it might have a slight delay on the first request.
Okay, you mean the runpod Python package version?
On it, sir! Thank you for the feedback.
Gotcha, it is indeed the runpod pip package version.
Changing this.
This worked like a charm.
You are the real MVP
Doing some more tests, but the outlook is great.
You can close this. Highest kudos to you guys
👌🏻 Glad everything works. BTW, which version did you choose in the end? Again, sorry for the inconvenience.
1.6.2
I wanted to try 1.7.2
but I was short on time, and the client was impatient.
You guys can send this thread to oblivion.
Ah, many were experiencing issues with the SDK.
The thread can be kept open for others to see.
So EUR-IS was fine, and versions should just be locked if you want to use them? Definitely over long periods, so the code does not change behaviour.
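For reference, a minimal sketch of locking the SDK version in the image, assuming the base image quoted earlier in the thread and with something.py standing in for the actual handler script:

# Sketch: pin the runpod pip package so a rebuild cannot silently pull 1.7.3.
FROM runpod/pytorch:3.10-2.0.0-117 AS base

# 1.6.2 is what ended up being used here; 1.7.2 was suggested as an alternative.
RUN pip install --no-cache-dir runpod==1.6.2

# something.py stands in for the real serverless handler script.
COPY something.py /something.py
CMD ["python", "-u", "/something.py"]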