Has anyone experienced issues with serverless /run callbacks since December?

We've noticed that response bodies are empty when using /run endpoints with callbacks in the RunPod serverless environment (occurring sometime after December 2nd). Additional context: - /runsync endpoints are working normally - Response JSON format appears correct in the "Requests" tab of RunPod console under Status - Our last deployment to this endpoint was two months ago Could anyone confirm if there have been any releases since December that might have introduced this issue? We haven't made any changes to our deployment since two months ago, but are now seeing empty response bodies with callbacks. Thanks in advance! 🙏
66 Replies
nerdylive
nerdylive2mo ago
@yhlong00000
wuxmes
wuxmes2mo ago
same here
No description
wuxmes
wuxmes2mo ago
getting a barrage of 520 and 415 since 4 am
Iggy
Iggy2mo ago
same here
Coderik
Coderik2mo ago
Same here with two clients of mine simultaneously. Status request immediately after the job's completion returns the correct result. But no webhook
Iggy
Iggy2mo ago
getting JSON strings sometimes empty responses badly formatted string
sdhiman15
sdhiman152mo ago
i am not getting any responses
Iggy
Iggy2mo ago
it started happening today
sdhiman15
sdhiman152mo ago
getting empty object in request.body and undefined in request.rawBody Yes
Iggy
Iggy2mo ago
everything is breaking 😢
sdhiman15
sdhiman152mo ago
yes No one is responding too. my entire flow is disrupted due to this.
Moritur
Moritur2mo ago
Same here, we're only getting the input data, but not the request id anymore so we can't fetch the output of the job
eliasb
eliasb2mo ago
from what I found, they're missing the "content-type" header the webhook POST is missing the content-type header, if you can fix it, just set it to application/json (I am using a middleware in rails)
Ahmadzafar176
Ahmadzafar1762mo ago
same here, getting initial "IN_QUEUE" status but not receiving any response , runsync works ok anyone has a fix or know what's wrong?
Moritur
Moritur2mo ago
I think we need to wait for them to wake up, it's currently ~2am in san francisco
wuxmes
wuxmes2mo ago
for me I did a manual fix by adding a middleware by directly setting all webhook requests as contenttype of json
def __call__(self, request):
if request.method == "POST" and "webhook_runpod" in request.path:
request.META["CONTENT_TYPE"] = "application/json"
return self.get_response(request)
def __call__(self, request):
if request.method == "POST" and "webhook_runpod" in request.path:
request.META["CONTENT_TYPE"] = "application/json"
return self.get_response(request)
smth like this fixed it for me in python
sdhiman15
sdhiman152mo ago
Yes, If using nodejs, set the request header to application/json manually and use express json parser to get the parsed body
app.post('/webook', (req, res) => {
req.headers['content-type'] = "application/json";
express.json()(req, res, ()=> {
console.log(req.body);
// Your code goes here

res.sendStatus(200);
})
})
app.post('/webook', (req, res) => {
req.headers['content-type'] = "application/json";
express.json()(req, res, ()=> {
console.log(req.body);
// Your code goes here

res.sendStatus(200);
})
})
Ahmadzafar176
Ahmadzafar1762mo ago
unfortunately my company uses bubble as a front end service, do i have to make a middleware on my end to solve this?
juergengunz
juergengunz2mo ago
i think someone pushed an update before they went to sleep
nerdylive
nerdylive2mo ago
Kek
Nikita
Nikita2mo ago
happens to everyone to be honest
nerdylive
nerdylive2mo ago
Anyone can take a screenshot I wanna see? Does it happen on all region? What or what region do you guys use
juergengunz
juergengunz2mo ago
i only use europe regions
No description
Nikita
Nikita2mo ago
No description
Nikita
Nikita2mo ago
fails on every datacenter
nerdylive
nerdylive2mo ago
The job fails or it only returns empty output?
Nikita
Nikita2mo ago
it returns a valid output (can be seen in requests tab of an endpoint), but the webhook post request body doesn't seem to be a serialized json event setting content type manually didnt work for me trying to serialize it manually from chunks sent
nerdylive
nerdylive2mo ago
So it errors in webhook only? But the /status and /run output works well?
Nikita
Nikita2mo ago
Yup For those who work with node.js and didn't resolve it with manually setting the content-type, here's the custom serializer from chunks into json (express)
public async handler(req: Request, res: Response, next: NextFunction) {
req.headers["content-type"] = "application/json";

let buffer = "";
req.setEncoding("utf8");

req.on("data", (chunk) => {
buffer += chunk;
});

req.on("end", () => {
try {
req.body = JSON.parse(buffer);
} catch (err) {
console.error("Error parsing JSON:", err);
}
next();
});
}
public async handler(req: Request, res: Response, next: NextFunction) {
req.headers["content-type"] = "application/json";

let buffer = "";
req.setEncoding("utf8");

req.on("data", (chunk) => {
buffer += chunk;
});

req.on("end", () => {
try {
req.body = JSON.parse(buffer);
} catch (err) {
console.error("Error parsing JSON:", err);
}
next();
});
}
furkan.huudle
furkan.huudle2mo ago
same here 🙌🏻 Sometimes I get empty body (with binary data). Is there any explanations from runpod?
Nikita
Nikita2mo ago
not yet, i see this happens since 03:00 at night UTC
Moritur
Moritur2mo ago
Is there any way to escalate issues like these to runpod staff, especially if it happens in the middle of the night for them?
Nikita
Nikita2mo ago
i guess discord is the only place
nerdylive
nerdylive2mo ago
Contact button on the website What tis it like? Any picture?
Moritur
Moritur2mo ago
Is that actually escalating it? I'm guessing they're all sleeping right now
nerdylive
nerdylive2mo ago
Well no, it creates a new support request
Moritur
Moritur2mo ago
Ok, I guess they will notice either way once they wake up. But I guess a company like runpod should also have monitoring set up that screams at them when suddenly a huge amount of webhooks across all customers fail.
Nikita
Nikita2mo ago
We've recently had an sdk version that didn't work properly for about 2 weeks straight so...
furkan.huudle
furkan.huudle2mo ago
Im using n8n for webhook, I can provide screenshot but there is no programming things, so I don't sure if it's fit or understand other developers But I think runpod should increase their focus on support or deployment management. Because 1 or 2 months ago, runpod sdk was broken and I couldn't see if I checked discord but anyway, Im sharing screnshots @nerdylive
furkan.huudle
furkan.huudle2mo ago
The right one is request from runpod
No description
furkan.huudle
furkan.huudle2mo ago
And Runpod respond as binary, I guess
No description
furkan.huudle
furkan.huudle2mo ago
And that's the binary data
No description
nerdylive
nerdylive2mo ago
Ohh because of the header data type is missing or wrong type I guess yeah.
furkan.huudle
furkan.huudle2mo ago
Yes :/ When they will fix u think?
nerdylive
nerdylive2mo ago
When they wake up and working Maybe like few more hours
yhlong00000
yhlong000002mo ago
Hey, sorry about this! We’re aware of the issue and will have a hotfix in next hour. The response is currently missing the application/json header. As a workaround, you can update your code to parse the body as JSON even if the header is missing.
Nikita
Nikita2mo ago
Thanks for quick respond 💪
furkan.huudle
furkan.huudle2mo ago
thanks meow meow 🚀
Iggy
Iggy2mo ago
@yhlong00000 any updates? 👀
nerdylive
nerdylive2mo ago
Don't worry they will announce it as soon as it's fixed either here , or in #📢|announcements
Iggy
Iggy2mo ago
was in the middle of a huge refactor to make this work but if you are working on it I'll just wait
yhlong00000
yhlong000002mo ago
We’re running the final tests now, it should be ready soon. We’re pushing the change now; it will take about 15 minutes. I’ll keep you posted. The release is almost complete, and my testing shows the response looks good. Could you verify it on your end and let me know if you still encounter any issues?
Ahmadzafar176
Ahmadzafar1762mo ago
seems to be working now thanks
Moritur
Moritur2mo ago
It works now. Are there any steps you are taking to prevent issues like this from happening in the future? The biggest issue for us was that there was no official reaction at all for more than 5 hours of our regular working day.
yhlong00000
yhlong000002mo ago
Yes, we will reflect on this incident internally and implement additional safeguards and necessary changes to prevent this from happening again in the future. We’re truly sorry for the inconvenience!
Iggy
Iggy2mo ago
All systems operational now, back to normal Even with priority support this was quite unnerving, would be great if you could have support team for the midnight san francisco hours
yhlong00000
yhlong000002mo ago
Thanks for the feedback! We’re currently short on support staff but will work towards providing 24/7 support in the future.
kazuph(かずふ)🍙
I'd like to confirm that our application has recovered from the above issue. Thank you.
CosMix
CosMix2mo ago
I am having issues right now... Cannot create pods through the python package despite having GPUs available. Keep getting the "There are no longer any instances available with the requested specifications. Please refresh and try again" but if I try to create it with exactly the same settings from the Web UI it all works out...
nerdylive
nerdylive2mo ago
Maybe the machine just freed up? the supply is real time, so when its really free it will be available to rent in around that time
CosMix
CosMix2mo ago
Its been happening since last week. Is I try to hit the button on the webui at the same time with triggering the script, it doesn't work with the runpod python package but it works with the Web UI.
nerdylive
nerdylive2mo ago
@yhlong00000 https://discord.com/channels/912829806415085598/1307856048173879327 i think these guys are experiencing the same problem
MaxFrax
MaxFrax2mo ago
Are any of you still experiencing issues with serverless vllm? I cannot manage to release a working endpoint. I keep getting 500s and even some 502 bad gateway from cloudflare. I don't even know how to further describe my issues, it's days that I'm banging my head on this problem and I'm losing sanity. I tried to rollback to runpod/worker-v1-vllm:v1.6.0stable-cuda12.1.0, without any luck. Lucklily it seems that my old endpoints created in the past few months are not experiencing visible issues 502 are coming in strong now and my in progress requests seems to be multiplying according to inprogress counter (without aparent reason)
yhlong00000
yhlong000002mo ago
The UI refreshes periodically to display the latest GPU availability. When you click the deploy button, the system checks the real-time availability of the GPU. If availability is low and many users are renting or releasing GPUs, it’s possible the UI shows a GPU as available, but by the time you deploy, it’s already taken due to the refresh delay. maybe try to record a video, screenshot, logs, endpointIds, current settings, those will be useful to figure out the issue.
MaxFrax
MaxFrax2mo ago
Didn't manage to collect all the material yet, however it seems related to constraining the generation with: extra_body={"guided_json": json_schema} https://docs.vllm.ai/en/latest/usage/structured_outputs.html
CosMix
CosMix2mo ago
That also happens with instances that show "Medium" or "High" availability. Is a general issue with Runpod.

Did you find this page helpful?