abush · 3d ago

Issue with WebSocket latency over serverless HTTP proxy since RunPod outage

We have a RunPod serverless endpoint that streams frames over a direct one-to-one WebSocket connection. There are two versions: a lightweight diagnostic version that streams simple test images, and a production version that streams AI-generated frames. Both are configured to stream at 18 fps to produce an animation.

Both versions now fail to stream frames at a reasonable rate, hovering around 1 fps. The lightweight diagnostic frames take virtually no time to generate, and we have confirmed via logging that the AI-generated frames in the production version are not generating any slower and should still be able to meet the 18 fps target. However, the time to send each frame over the WebSocket is now on the order of 1 s per frame and is very unstable. The log snippet below shows fast image generation times but slow WebSocket send times:
Performance: 1.12 FPS | Avg times: generate=0.102s, encode=0.003s, send=0.679s, sleep=0.000s
Performance: 0.70 FPS | Avg times: generate=0.103s, encode=0.002s, send=1.311s, sleep=0.000s
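For context, log lines of this shape come from timing each stage of the frame loop separately, so send latency can be isolated from generation and encoding. A minimal sketch of such an instrumented loop (the function names `generate`, `encode`, and `send` are placeholders for our actual pipeline stages, not RunPod APIs):

```python
import asyncio
import time


async def stream_frames(send, generate, encode, target_fps=18,
                        n_frames=36, report_every=18):
    """Stream frames at target_fps, timing each stage per frame and
    emitting a summary line every `report_every` frames."""
    interval = 1.0 / target_fps
    totals = {"generate": 0.0, "encode": 0.0, "send": 0.0, "sleep": 0.0}
    reports = []
    window_start = time.monotonic()
    for i in range(1, n_frames + 1):
        t0 = time.monotonic()
        frame = generate()                 # produce the raw frame
        t1 = time.monotonic()
        payload = encode(frame)            # e.g. JPEG-encode the frame
        t2 = time.monotonic()
        await send(payload)                # push over the WebSocket
        t3 = time.monotonic()
        # Sleep off whatever time remains in this frame's 1/18 s budget.
        remaining = interval - (t3 - t0)
        if remaining > 0:
            await asyncio.sleep(remaining)
        t4 = time.monotonic()
        totals["generate"] += t1 - t0
        totals["encode"] += t2 - t1
        totals["send"] += t3 - t2
        totals["sleep"] += t4 - t3
        if i % report_every == 0:
            elapsed = time.monotonic() - window_start
            fps = report_every / elapsed
            avg = {k: v / report_every for k, v in totals.items()}
            reports.append(
                f"Performance: {fps:.2f} FPS | Avg times: "
                f"generate={avg['generate']:.3f}s, "
                f"encode={avg['encode']:.3f}s, "
                f"send={avg['send']:.3f}s, sleep={avg['sleep']:.3f}s")
            totals = {k: 0.0 for k in totals}
            window_start = time.monotonic()
    return reports
```

With near-instant stages the loop reports close to 18 FPS; when `send` alone takes around a second, the reported rate collapses toward 1 FPS even though `generate` stays fast, which is exactly the signature in the logs above.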
Compare this to the attached screenshot of a previously working version, where the logs show many more than one frame received within a one-second window. We only started seeing this issue after RunPod came back up from today's outage. We have been testing this setup in a variety of configurations over the last two weeks and only began seeing the problem today, after the outage occurred. We would very much appreciate some attention on this issue @Dj; it is very impactful at the moment for our org. Could you let us know whether there are other tests we could run on our end that would provide helpful data to assess the root cause and identify a solution? Thanks very much for your help. Tagging @huemin for visibility.
2 Replies
Dj · 3d ago
The timing of the outage and this specific bug (assuming you're seeing the issue I think you are) are unrelated. We deliver traffic to/from your pod through the RunPod Proxy, which is roughly six servers deployed in the US and EU. We know the actual IP of your host and tunnel that traffic through whichever server should be fastest. It is interesting that I only started seeing reports of this issue around the last outage affecting serverless, and that more users are affected after another serverless outage. Those events may be related, but since I'm not certain I won't confirm that yet. Do you also see the issue with the proxy when testing locally? If so, can you help me by grabbing an mtr to the URL shown as "WebSocket URL" in your screenshot? You don't have to share the mtr output here; you can DM me.
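For anyone following along, a non-interactive report-mode run is usually what's wanted here: it probes the path a fixed number of times and prints per-hop latency and loss. The hostname below is a placeholder; substitute the actual host from the "WebSocket URL" in the screenshot.

```shell
# 100 probe cycles, printed as a one-shot report rather than the live UI.
# "your-endpoint.proxy.runpod.net" is a placeholder hostname.
mtr --report --report-cycles 100 your-endpoint.proxy.runpod.net
```

Because the proxy traffic rides over TCP/443, an ICMP trace can differ from the real path; `mtr --tcp --port 443` probes with TCP SYNs instead if the ICMP results look inconclusive.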
abush (OP) · 3d ago
DM'ed! Thanks
