Feb 20 - Serverless Issues Mega-Thread
Many people seem to be running into the following issue:
Workers show as "running" but they're not working on any requests, and requests just sit queued for 10+ minutes without anything happening.
I think there is an issue with how requests are getting assigned to workers: there are a number of idling workers and a number of queued requests, and they both stay in that state for many minutes without any requests getting picked up by workers!
It's not my settings (e.g. idle timeout, queue delay), because they've been constant and I haven't touched them at all. The issue started happening a few days ago.
For those links, can you update them to be the link you get from pressing this icon?

Sure
Updated
It's being looked into :) Can I get your endpoint id?
zvhg9gcnqmkugx and xku09ayt4686ma
Thank you
Getting up to 15min delays on requests, despite "running" workers not working on any requests
@jim -
You would likely benefit from Flash Boot on
zvhg9gcnqmkugx
and updating your SDK version on both endpoints.
@Felipe Fontana -
You may also want to update your SDK version.
This is all I know right now; we're looking into a correlation between old SDK versions and similar queue delays. If we find out more, I'll see what we can do about helping the overall community by sending notifications suggesting an upgrade.
@Dj Both of these endpoints already have Flash Boot enabled. They worked fine with this SDK version before, did something change on the platform?
In the meantime, I'll upgrade to latest SDK version to see if it helps
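For anyone else double-checking what their workers are actually running, a minimal sketch like this printed at worker startup shows the deployed SDK version in the logs (standard library only, nothing RunPod-specific assumed):
```python
# Quick check of which runpod SDK version is actually installed in the worker image.
# Uses importlib.metadata (standard library), so no RunPod-specific API is assumed.
from importlib.metadata import version

print("runpod SDK:", version("runpod"))
# To upgrade inside the image: pip install --upgrade runpod
```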
I can't personally view your account, I was just told a lot of your running workers were not flashbooted, so I made sure it was enabled
Ah maybe that's a hint at the cause of the issue?
I started at RunPod yesterday, so I'm not 100% sure if anything's changed, but we're still looking into the root cause
I just want to make sure you're doing everything you can in the meantime 😄
These are my settings for the zv... endpoint

Very interesting that it's not flash booting, despite it being enabled
Sounds like that may be part of the cause: very slow boots, with workers showing as "running" while they're still booting
I'm on Flash Boot and the latest version of the package, 1.6.2.
GitHub: Releases · runpod/runpod-python — 🐍 Python library for RunPod API and serverless worker SDK
I have 20+ different services, and I can't do it in a reasonable time.
You do not have to upgrade; it's completely okay if you want to wait for a resolution. I can't say for certain that upgrading your SDK will definitely help you.
@Dj I removed the servers from CA, and now it looks better. I don't know if the problem is related to that.
@jim
I've confirmed 1.7.2 to have a delay issue. It affects the first message sent to the queue, and with many workers you really feel that more. One request spins up a worker, it's "busy" starting, you get another request - another worker is spun up, it's "busy", etc. (This is my interpretation). It's less of an issue at high traffic, but an issue nonetheless.
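In case it helps anyone narrow this down, here's a rough sketch (assuming the standard runpod-python handler pattern, placeholder work inside) for timestamping the worker so the logs show how long a worker was up before it picked up its first job:
```python
# Rough sketch: timestamp the worker so logs show how much of a request's latency
# was worker startup vs. time spent queued before this worker even existed.
# The handler body is a placeholder; assumes the standard runpod-python handler pattern.
import time

import runpod

WORKER_BOOTED_AT = time.time()  # captured at import, i.e. when the container starts

def handler(job):
    waited = time.time() - WORKER_BOOTED_AT
    print(f"first work on this worker after {waited:.1f}s of uptime")
    return {"echo": job["input"]}  # placeholder for the real workload

runpod.serverless.start({"handler": handler})
```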
@Felipe Fontana
I'm not sure either. For you, I'm told that if you're having long-term issues with workloads similar to vLLM, upgrading to the 1.7 versions (1.7.2+) will help. If it's fine right now, it's fine though.
@Dj Does this explain why workers are "running" for over 10 minutes while not picking up any queued-up requests?
The fact that a lot of my workers were not flash booted, despite having flash boot enabled seems a little sus too
@jim Try removing the CA ones, that solved it here for now.
Ah yeah, all of mine are CA
All the other zones don't have a lot of H100 availability tho, but let's see
@jim Flash boot isn't deterministic atm. We may evict the paused worker if we need the capacity for another request. Consider it "best effort" as currently implemented.
I see. Has there been a significant increase in the % of cold boots in the recent few days (esp. for the endpoints that have flash boot enabled) due to increasingly scarce capacity?
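For what it's worth, since flash boot is best effort, the usual worker-side mitigation is to do heavy initialization at module import so only true cold boots pay for it. A minimal sketch (the load_model here is just a placeholder, not an SDK call):
```python
# Minimal sketch of the usual worker-side mitigation when flash boot is best effort:
# do heavy initialization once at module import, so only true cold boots pay for it
# and any reused worker serves requests immediately. load_model() is a placeholder,
# not part of the RunPod SDK.
import runpod

def load_model():
    # stand-in for real initialization (model weights, tokenizer, CUDA context, ...)
    return lambda prompt: f"echo: {prompt}"

MODEL = load_model()  # runs once per cold boot, not once per request

def handler(job):
    # every request handled by this worker reuses the already-initialized MODEL
    return {"output": MODEL(job["input"].get("prompt", ""))}

runpod.serverless.start({"handler": handler})
```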
@Felipe Fontana US-GA-2 has the same issue for me unfortunately
Facing the same issue, example endpoint id 7obbfxqjmmama2
@Dj Any updates?
Happening on H100s, A100s, and A40s all day today
Reported, thank you
Confirming that SDK 1.7.7 does not solve this @Dj. US-GA-2 has fewer issues than CA
Yes, 1.7.7 does not solve it; all of our endpoints are on 1.7.7
just got permission, first order of business was to fix that
This is being looked into actively, in case anyone sees this now. I plan on mentioning everyone active when I receive a resolution
@jim @Milad @TristenHarr @Felipe Fontana @Saqib Zia -
On Call Engineering was paged to investigate the issue. The root cause has been identified and a patch should be ready to go within a day (Friday, February 20th). In the meantime, please bear with us; you should be okay, but I'll announce more when I can. I'll be back tomorrow with more information.
Great, thanks so much! 🙂 No hard feelings on my side, I appreciate the info and support. There are always a few kinks to work out with new and exciting things like serverless GPUs, so no problems here. I just hope this truly is fixed and patched, because known issues are fine as long as they're documented; it's when you expect everything to work and it doesn't that things are a problem!
@Dj You mentioned the patch should be ready within a day, but mentioned Friday. I assume you mean Thursday (today is the 20th)?
Yes, sorry!
Misremembered the day of the week
All good 🙂 thanks. Any update to share?
Nothing substantial. I can see the PR I was tracking is merged, but I don't know which version we're running in prod network-wide at this time.
I see. An ETA on the rollout would help in our planning if you can get one
@jim @Milad @TristenHarr @Felipe Fontana @Saqib Zia -
Hello again! At this time the issue should be mostly resolved. Our on-call engineering team identified an issue that would effectively take down a single host. While technically it was possible to work around it by selecting another datacenter for your deployments, you still had to get lucky and not land on an affected host within that datacenter. We've reviewed logs for the endpoint IDs provided to us and everything seemed good to go on our end.
Thank you for working with us during this time and being so patient. It took a little longer than I'd like to start the incident but we're learning together and having a dedicated Community Manager (me!) should help a lot. Please let us know if you experience any more issues, we'll be happy to take a look.