Feb 20 - Serverless Issues Mega-Thread
Many people seem to be running into the following issue:
Workers show as "running" but they're not working on any requests, and requests just sit queued for 10+ minutes without anything happening.
I think there is an issue with how requests are getting assigned to workers: there are a number of idling workers and a number of queued requests, and they both stay in that state for many minutes without any requests getting picked up by workers!
It's not my settings (e.g. idle timeout, queue delay), because they've been constant and I haven't touched them at all. The issue started happening a few days ago.
For those links, can you update them to be the link you get from pressing this icon?

Sure
Updated
It's being looked into :) Can I get your endpoint id?
zvhg9gcnqmkugx and xku09ayt4686ma
Thank you
Getting up to 15min delays on requests, despite "running" workers not working on any requests
@jim -
You would likely benefit from Flash Boot on
zvhg9gcnqmkugx
and updating your SDK version on both endpoints.
@Felipe Fontana -
You may also want to update your SDK version.
This is all I know right now; we're looking into a correlation between old SDK versions and similar queue delays. If we find out more, I'll see what we can do about helping the overall community by sending notifications suggesting an upgrade.
@Dj Both of these endpoints already have Flash Boot enabled. They worked fine with this SDK version before, did something change on the platform?
In the meantime, I'll upgrade to latest SDK version to see if it helps
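For anyone else double-checking what their workers are actually running, a minimal sketch like this printed at worker startup shows the deployed SDK version in the logs (standard library only, nothing RunPod-specific assumed):
```python
# Quick check of which runpod SDK version is actually installed in the worker image.
# Uses importlib.metadata (standard library), so no RunPod-specific API is assumed.
from importlib.metadata import version

print("runpod SDK:", version("runpod"))
# To upgrade inside the image: pip install --upgrade runpod
```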
I can't personally view your account, I was just told a lot of your running workers were not flashbooted, so I made sure it was enabled
Ah maybe that's a hint at the cause of the issue?
I started at RunPod yesterday, so I'm not 100% sure if anything's changed, but we're still looking into the root cause
I just want to make sure you're doing everything you can in the meantime 😄
These are my settings for the zv... endpoint

Very interesting that it's not flash booting, despite it being enabled
Sounds like that may be part of the cause: very slow boots, with workers showing as "running" while they're still booting
I'm on Flash Boot and the latest version of the package, 1.6.2.
GitHub: Releases · runpod/runpod-python — 🐍 Python library for RunPod API and serverless worker SDK
I have 20+ different services, and I can't do it in a reasonable time.
You do not have to upgrade; it's completely okay if you want to wait for a resolution. I can't say for certain that upgrading your SDK will definitely help you.
@Dj I removed the servers from CA, and now it looks better. I don't know if the problem is related to that.
@jim
I've confirmed 1.7.2 to have a delay issue. It affects the first message sent to the queue, and with many workers you really feel that more. One request spins up a worker, it's "busy" starting, you get another request - another worker is spun up, it's "busy", etc. (This is my interpretation). It's less of an issue at high traffic, but an issue nonetheless.
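In case it helps anyone narrow this down, here's a rough sketch (assuming the standard runpod-python handler pattern, placeholder work inside) for timestamping the worker so the logs show how long a worker was up before it picked up its first job:
```python
# Rough sketch: timestamp the worker so logs show how much of a request's latency
# was worker startup vs. time spent queued before this worker even existed.
# The handler body is a placeholder; assumes the standard runpod-python handler pattern.
import time

import runpod

WORKER_BOOTED_AT = time.time()  # captured at import, i.e. when the container starts

def handler(job):
    waited = time.time() - WORKER_BOOTED_AT
    print(f"first work on this worker after {waited:.1f}s of uptime")
    return {"echo": job["input"]}  # placeholder for the real workload

runpod.serverless.start({"handler": handler})
```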
@Felipe Fontana
I'm not sure either. For you, I'm told that if you're having long-term issues with workloads similar to vLLM, upgrading to the 1.7 versions (1.7.2+) will help. If it's fine right now, it's fine though.
@Dj Does this explain why workers are "running" for over 10 minutes while not picking up any queued-up requests?
The fact that a lot of my workers were not flash booted, despite having flash boot enabled seems a little sus too
@jim Try removing the CA ones, that solved it here for now.
Ah yeah, all of mine are CA
All the other zones don't have a lot of H100 availability tho, but let's see
@jim Flash boot isn't deterministic atm. We may evict the paused worker if we need the capacity for another request. Consider it "best effort" as currently implemented.
I see. Has there been a significant increase in the % of cold boots in the recent few days (esp. for the endpoints that have flash boot enabled) due to increasingly scarce capacity?
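For what it's worth, since flash boot is best effort, the usual worker-side mitigation is to do heavy initialization at module import so only true cold boots pay for it. A minimal sketch (the load_model here is just a placeholder, not an SDK call):
```python
# Minimal sketch of the usual worker-side mitigation when flash boot is best effort:
# do heavy initialization once at module import, so only true cold boots pay for it
# and any reused worker serves requests immediately. load_model() is a placeholder,
# not part of the RunPod SDK.
import runpod

def load_model():
    # stand-in for real initialization (model weights, tokenizer, CUDA context, ...)
    return lambda prompt: f"echo: {prompt}"

MODEL = load_model()  # runs once per cold boot, not once per request

def handler(job):
    # every request handled by this worker reuses the already-initialized MODEL
    return {"output": MODEL(job["input"].get("prompt", ""))}

runpod.serverless.start({"handler": handler})
```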
@Felipe Fontana US-GA-2 has the same issue for me unfortunately
Facing the same issue, example endpoint id 7obbfxqjmmama2
@Dj Any updates?
Happening on H100s, A100s, and A40s all day today
Reported, thank you
Confirming that SDK 1.7.7 does not solve this @Dj. US-GA-2 has fewer issues than CA
Yes, 1.7.7 does not solve it; all of our endpoints are on 1.7.7
just got permission, first order of business was to fix that
This is being looked into actively, in case anyone sees this now. I plan on mentioning everyone active when I receive a resolution
@jim @Milad @TristenHarr @Felipe Fontana @Saqib Zia -
On Call Engineering was paged to investigate the issue. The root cause has been identified and a patch should be ready to go within a day (Friday, February 20th). In the meantime, please bear with us; you should be okay, but I'll announce more when I can. I'll be back tomorrow with more information.
Great, thanks so much! 🙂 No hard feelings on my side, I appreciate the info and support. There are always a few kinks to work out with new and exciting things like serverless GPUs, so no problems here. I just hope this truly is fixed and patched, because known issues are fine as long as they're documented; it's when you expect everything to work and it doesn't that things are a problem!
@Dj You mentioned the patch should be ready within a day, but mentioned Friday. I assume you mean Thursday (today is the 20th)?
Yes, sorry!
Misremembered the day of the week
All good 🙂 thanks. Any update to share?
Nothing substantial. I can see the PR I was tracking is merged, but I don't know which version we're running in prod network-wide at this time.
I see. An ETA on the rollout would help in our planning if you can get one
@jim @Milad @TristenHarr @Felipe Fontana @Saqib Zia -
Hello again! At this time the issue should be mostly resolved. Our on-call engineering team identified an issue that would effectively take down a single host. While technically it was possible to work around it by selecting another datacenter for your deployments, you still had to get lucky and not land on an affected host within that datacenter. We've reviewed logs for the endpoint IDs provided to us and everything seemed good to go on our end.
Thank you for working with us during this time and being so patient. It took a little longer than I'd like to start the incident but we're learning together and having a dedicated Community Manager (me!) should help a lot. Please let us know if you experience any more issues, we'll be happy to take a look.