RunPod3d ago
jim

Feb 20 - Serverless Issues Mega-Thread

Many people seem to be running into the following issue: workers are "running" but not working on any requests, and requests just sit queued for 10m+ without anything happening. I think there is an issue with how requests are getting assigned to workers: there are a number of idle workers and a number of queued requests, and both stay in that state for many minutes without any requests getting picked up by workers! It's not my settings (e.g. idle timeout, queue delay); they've been constant and I haven't touched them at all. The issue started happening a few days ago.
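A quick way to quantify this mismatch is to poll the endpoint's health route and log queued jobs against worker states. A minimal sketch, assuming the public `https://api.runpod.ai/v2/{endpoint_id}/health` route and its documented `jobs`/`workers` counters; the endpoint id and polling interval are placeholders:

```python
import os
import time

import requests

ENDPOINT_ID = "zvhg9gcnqmkugx"  # placeholder: one of the affected endpoints
API_KEY = os.environ["RUNPOD_API_KEY"]

def poll_health(interval_s: int = 30) -> None:
    """Log queued jobs vs. worker states to spot stuck assignment."""
    url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"
    headers = {"Authorization": f"Bearer {API_KEY}"}
    while True:
        data = requests.get(url, headers=headers, timeout=10).json()
        jobs = data.get("jobs", {})
        workers = data.get("workers", {})
        # The symptom above is inQueue staying > 0 while workers sit
        # in "idle"/"running" and inProgress never moves.
        print(
            f"inQueue={jobs.get('inQueue')} inProgress={jobs.get('inProgress')} "
            f"idle={workers.get('idle')} running={workers.get('running')}"
        )
        time.sleep(interval_s)

if __name__ == "__main__":
    poll_health()
```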
39 Replies
Dj
Dj3d ago
For those links, can you update them to be the link you get from pressing this icon?
[image attached]
jim
jimOP3d ago
Sure. Updated
Dj
Dj3d ago
It's being looked into :) Can I get your endpoint id?
jim
jimOP3d ago
zvhg9gcnqmkugx and xku09ayt4686ma. Thank you. Getting up to 15 min delays on requests, despite "running" workers not working on any requests
Dj
Dj3d ago
@jim - You would likely benefit from Flash Boot on zvhg9gcnqmkugx and from updating your SDK version on both endpoints. @Felipe Fontana - You may also want to update your SDK version. This is all I know right now; we're looking into a correlation between old SDK versions and similar queue delays. If we find out more, I'll see what we can do about helping the overall community by sending notifications suggesting an upgrade.
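For anyone unsure which SDK their workers are actually running, one low-effort check is to log the installed version at worker startup so it appears in the endpoint's worker logs. A minimal sketch using the standard Python handler pattern (the echo handler is a placeholder, not anyone's real workload):

```python
from importlib.metadata import version

import runpod

# Log the installed SDK version at startup so every worker reports
# exactly which release it is running.
print(f"runpod SDK version: {version('runpod')}")

def handler(job):
    # ... your existing job logic; this echo is a placeholder ...
    return {"echo": job["input"]}

# Standard entrypoint for the Python serverless SDK.
runpod.serverless.start({"handler": handler})
```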
jim
jimOP3d ago
@Dj Both of these endpoints already have Flash Boot enabled. They worked fine with this SDK version before; did something change on the platform? In the meantime, I'll upgrade to the latest SDK version to see if it helps
Dj
Dj3d ago
I can't personally view your account; I was just told a lot of your running workers were not flash booted, so I made sure it was enabled
jim
jimOP3d ago
Ah maybe that's a hint at the cause of the issue?
Dj
Dj3d ago
I started at RunPod yesterday, so I'm not 100% sure if anything's changed, but we're still looking into the root cause. I just want to make sure you're doing everything you can in the meantime 😄
jim
jimOP3d ago
These are my settings for the zv... endpoint
[image attached]
jim
jimOP3d ago
Very interesting that it's not flash booting, despite it being enabled. Sounds like that may be part of the cause: very slow boots, during which workers sit in the "running" state
Felipe Fontana
I'm on Flash Boot and the latest version of the package, 1.6.2.
jim
jimOP3d ago
@Felipe Fontana I believe 1.7.7 is the latest (https://github.com/runpod/runpod-python/releases)
Felipe Fontana
I have 20+ different services, and I can't do it in a reasonable time.
Dj
Dj3d ago
You don't have to upgrade; it's completely okay if you want to wait for a resolution, but upgrading your SDK will definitely help you.
Felipe Fontana
@Dj I removed servers from CA, and now it looks better. I don't know if the problem is related.
Dj
Dj3d ago
@jim I've confirmed 1.7.2 to have a delay issue. It affects the first message sent to the queue, and with many workers you really feel that more. One request spins up a worker, it's "busy" starting, you get another request, another worker is spun up, it's "busy", etc. (This is my interpretation.) It's less of an issue at high traffic, but an issue nonetheless. @Felipe Fontana I'm not sure either; for you, I'm told that if you're having long-term issues with workloads similar to vLLM, upgrading to the 1.7 versions (1.7.2+) will help. If it's fine right now, it's fine though.
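That interpretation is easy to sanity-check with a toy model: if a booting worker counts as busy and is never offered the request that triggered it, every early request at low traffic pays the full boot time, while a scheduler that hands queued requests to the first worker to become ready recovers quickly. A self-contained sketch; the boot and job durations are invented numbers, not measured platform behavior:

```python
import heapq

BOOT_S = 300.0  # assumed cold-boot time (illustrative only)
JOB_S = 10.0    # assumed per-job runtime
ARRIVALS = [0.0, 30.0, 60.0, 90.0, 120.0]  # request arrival times (s)

def simulate(reuse_pool: bool) -> list[float]:
    """Per-request queue delay under a toy scheduler.

    reuse_pool=False models the cascade described above: every request
    spawns its own cold worker and waits out the full boot.
    reuse_pool=True lets a request take the earliest worker to free up
    whenever that beats booting a fresh one.
    """
    delays: list[float] = []
    free_at: list[float] = []  # min-heap of times workers become free
    for t in ARRIVALS:
        cold_start = t + BOOT_S
        if reuse_pool and free_at and free_at[0] <= cold_start:
            start = max(t, heapq.heappop(free_at))
        else:
            start = cold_start
        delays.append(start - t)
        heapq.heappush(free_at, start + JOB_S)
    return delays

print("cascade:", simulate(reuse_pool=False))  # every delay == BOOT_S
print("pooled :", simulate(reuse_pool=True))   # later requests improve
```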
jim
jimOP3d ago
@Dj Does this explain why workers are "running" for over 10 minutes while not picking up any queued requests? The fact that a lot of my workers were not flash booted, despite having Flash Boot enabled, seems a little sus too
Felipe Fontana
@jim try removing the CA ones; that solved it here for now.
jim
jimOP3d ago
Ah yeah, all of mine are in CA. The other zones don't have a lot of H100 availability though, but let's see
Josh-Runpod
Josh-Runpod3d ago
@jim Flash boot isn't deterministic atm. We may evict the paused worker if we need the capacity for another request. Consider it "best effort" as currently implemented.
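Given that best-effort behavior, a client can at least bound its exposure by submitting asynchronously, polling, and resubmitting anything that sits in the queue too long. A rough sketch against the Python SDK's `Endpoint.run` interface; the thresholds and payload are placeholders, and details like `.cancel()` and the exact status strings should be checked against your SDK version:

```python
import time

import runpod

runpod.api_key = "YOUR_API_KEY"  # placeholder

def run_with_requeue(endpoint_id: str, payload: dict,
                     max_queue_s: float = 120.0, retries: int = 2):
    """Submit a job; cancel and resubmit if it stays IN_QUEUE too long."""
    endpoint = runpod.Endpoint(endpoint_id)
    for attempt in range(retries + 1):
        job = endpoint.run(payload)
        deadline = time.time() + max_queue_s
        while time.time() < deadline:
            if job.status() != "IN_QUEUE":  # picked up, finished, or failed
                return job.output(timeout=600)
            time.sleep(5)
        job.cancel()  # stuck in queue past the threshold; try again
        print(f"attempt {attempt}: queued past {max_queue_s}s, resubmitting")
    raise TimeoutError("job never left the queue")

# Hypothetical usage:
# result = run_with_requeue("zvhg9gcnqmkugx", {"input": {"prompt": "hi"}})
```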
jim
jimOP2d ago
I see. Has there been a significant increase in the % of cold boots over the past few days (esp. for endpoints with Flash Boot enabled) due to increasingly scarce capacity? @Felipe Fontana US-GA-2 has the same issue for me, unfortunately
Milad
Milad2d ago
Facing the same issue, example endpoint id 7obbfxqjmmama2
jim
jimOP2d ago
@Dj Any updates?
Milad
Milad2d ago
Happening on H100s, A100s, and A40s all day today
Dj
Dj2d ago
Reported, thank you
jim
jimOP2d ago
Confirming that SDK 1.7.7 does not solve this @Dj. US-GA-2 has fewer issues than CA
Milad
Milad2d ago
Yes, 1.7.7 does not solve it; all of our endpoints are on 1.7.7
Dj
Dj2d ago
Just got permission; first order of business was to fix that. This is being looked into actively, in case anyone sees this now. I plan on mentioning everyone active when I receive a resolution. @jim @Milad @TristenHarr @Felipe Fontana @Saqib Zia - On-call engineering was paged to investigate the issue. The root cause has been identified, and a patch should be ready to go within a day (Friday, February 20th). In the meantime, please bear with us; you should be okay, but I'll announce more when I can. I'll be back tomorrow with more information.
TristenHarr
TristenHarr2d ago
Great, thanks so much! 🙂 No hard feelings on my side; I appreciate the info and support. There are always a few kinks to work out with new and exciting things like serverless GPUs, so no problems here. I just hope this truly is fixed and patched: known issues are fine so long as they're documented; it's when you expect everything to work and it doesn't that things are a problem!
jim
jimOP2d ago
@Dj You mentioned the patch should be ready within a day, but mentioned Friday. I assume you mean Thursday (today is the 20th)?
Dj
Dj2d ago
Yes, sorry! Misremembered the day of the week
jim
jimOP2d ago
All good 🙂 thanks. Any update to share?
Dj
Dj2d ago
Nothing substantial. I can see the PR I was tracking is merged, but I don't know the version we're running in prod network-wide at this time.
jim
jimOP2d ago
I see. An ETA on the rollout would help our planning, if you can get one
Dj
Dj2d ago
@jim @Milad @TristenHarr @Felipe Fontana @Saqib Zia - Hello again! At this time the issue should be mostly resolved. Our on-call engineering team identified an issue that would effectively take down a single host. While it was technically possible to work around it by selecting another datacenter for your deployments, you still had to get lucky and not land on an affected host within that datacenter. We've reviewed logs for the endpoint ids provided to us, and everything seems good to go on our end. Thank you for working with us during this time and for being so patient. It took a little longer than I'd like to start the incident, but we're learning together, and having a dedicated Community Manager (me!) should help a lot. Please let us know if you experience any more issues; we'll be happy to take a look.