Network connections are very slow, Failed to return job results.
Hello, I'm starting to see this error for all my jobs:
2024-07-03T16:16:03.367164993Z {"requestId": "c1ffb9c9-970e-4602-906a-b64838a17309-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/m6fixbtjutzbvo/job-done/ap10mhw4d9tubz/c1ffb9c9-970e-4602-906a-b64838a17309-u1?gpu=NVIDIA+RTX+A6000&isStream=false", "level": "ERROR"}
It seems like all network requests are taking much longer, like 8-10s to post a 1m image to an AWS server. Has anyone else seen this? Thanks!
I restarted the workers and now the problem is fixed. I hope someone can take a look at those workers.
For longer jobs you'll want to use the run method and not the runsync method.
Problem seems to be a connection timeout to api.runpod.ai, I've also seen this in my endpoint logs from time to time. The connection shouldn't time out.
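For reference, here's a minimal sketch of the async pattern with the run method, assuming the standard serverless REST API; the endpoint ID, input payload, and RUNPOD_API_KEY environment variable are placeholders, not values from this thread:
```python
# Minimal sketch of calling a serverless endpoint via /run and polling /status,
# instead of holding one long HTTP connection open with /runsync.
# ENDPOINT_ID and the input payload are placeholders.
import os
import time
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = "your_endpoint_id"
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# /run returns a job ID immediately, so a long job isn't tied to one request's timeout.
job = requests.post(f"{BASE}/run", headers=HEADERS,
                    json={"input": {"prompt": "example"}}, timeout=30).json()

# Poll until the job reaches a terminal state.
while True:
    status = requests.get(f"{BASE}/status/{job['id']}", headers=HEADERS, timeout=30).json()
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(2)

print(status)
```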
It is the run method, it seems like RP has some internal network issues
yeah i've been seeing it too in my endpoints
Thanks for pointing this out, it is a little concerning. I was preparing to deploy a 7B LLM on serverless for my project - I hope they can fix this issue to ensure proper uptime. I've used RunPod for over a year for regular pods and would love to continue using their service to host the LLM as an endpoint, but can't risk having downtime.
Same here. Any solutions out there?
I have a production deployment running, and we have just launched a big update, so things like this just suck big time. Have you tried contacting someone from RunPod?
@flash-singh seems multiple people are experiencing this issue.
I see it in my endpoint logs as well.
Maybe they need to scale up api.runpod.ai or something
We are working on optimizing the worker to use Rust to handle the HTTP connections, get better visibility into these errors, and optimize the connection timeouts.
Hope it's stable soon
Running into this issue as well, tons of failed jobs today and yesterday, what's the resolution?
What do you see in the dashboard? Does it retry jobs or fail them?
Fail
Thanks, we are currently in the middle of updating the SDK to be better, will share some updates.
Hey, I think it would be great if we could also customize retries from the SDK.
Also seeing some major changes to our serverless execution. Super slow response times for simple requests, and my feeling is that it might have to do with slow network volume mount times.
I believe this is causing slow cold start times, which is greatly increasing costs.
Is the time that it takes to mount the network volumes being billed as part of the run time?
Yes, the time to mount the network volume will be billed
The time to mount the network volume is fast, but loading data like a model into VRAM will vary based on the model. The best approach is always to put the model in the container image if possible. How are you loading models?
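If it helps, this is roughly what a handler looks like when the weights live in the image rather than on a network volume; the path and the pipeline choice are just an example, assuming a diffusers-style model:
```python
# Sketch of a RunPod serverless handler that loads a model once per worker.
# Assumes the weights were copied into the image at /models during the docker build;
# the path and pipeline here are hypothetical examples.
import runpod
import torch
from diffusers import StableDiffusionPipeline

MODEL_PATH = "/models/my-sd-model"

# Loading at module import time means the cost is paid once at cold start,
# not on every job.
pipe = StableDiffusionPipeline.from_pretrained(
    MODEL_PATH, torch_dtype=torch.float16
).to("cuda")

def handler(job):
    prompt = job["input"]["prompt"]
    image = pipe(prompt).images[0]
    image.save("/tmp/out.png")
    return {"output_path": "/tmp/out.png"}

runpod.serverless.start({"handler": handler})
```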
you will see cold start and execution times as different metrics in your dashboard
You would set it higher or to 0, what's your goal?
Oh there is an option for that already?
yeah true
No goal yet... but it would be great for customizing it for different use cases
It is not an option yet, but we can plan to add it as a configuration.
oh ok
We need 3 or so models readily available + SD libraries etc., so would this mean having a 28GB or so container image? Would that load slowly?
We are using https://github.com/ashleykleynhans/runpod-worker-a1111
It uses a network volume to store Stable Diffusion, the Python env + models etc. and keeps the main container pretty lightweight. But it sounds like we will want to modify it to keep everything in the container image itself.
one container image and endpoint for all those models?
I'm not sure what the best way to architect it would be. Multiple endpoints, one for each model? The problem is how to gauge and distribute our worker quota across endpoints.
yep
keep it under 15GB if you want it to be fast enough
How does a job determine which model to use, or whether to use all of them? Do you load / unload models per job?
Or use a big enough GPU, and just select which one to use
I'm not sure if that works
Use an if condition, then only call the model selected from the inputs, you know, like the whisper endpoint or faster-whisper.
But if the SD models are SDXL, yeah, they will be loaded/unloaded I guess.
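Something like this is what the if-condition / input-selection approach could look like, if all the models fit in VRAM at once; the model names and paths are made up for illustration:
```python
# Sketch: one endpoint serving several preloaded models, selected per job input.
# Model names and paths are hypothetical.
import runpod
import torch
from diffusers import StableDiffusionPipeline

MODEL_PATHS = {
    "model_a": "/models/model_a",
    "model_b": "/models/model_b",
}

# Load every model once at worker start; they all have to fit in VRAM together,
# hence the "big enough GPU" suggestion above.
PIPELINES = {
    name: StableDiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16).to("cuda")
    for name, path in MODEL_PATHS.items()
}

def handler(job):
    name = job["input"].get("model", "model_a")
    if name not in PIPELINES:
        return {"error": f"unknown model: {name}"}
    image = PIPELINES[name](job["input"]["prompt"]).images[0]
    image.save("/tmp/out.png")
    return {"model": name, "output_path": "/tmp/out.png"}

runpod.serverless.start({"handler": handler})
```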