network connections are very slow, Failed to return job results.

Hello, I've started seeing this error for all my jobs:
2024-07-03T16:16:03.367164993Z {"requestId": "c1ffb9c9-970e-4602-906a-b64838a17309-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/m6fixbtjutzbvo/job-done/ap10mhw4d9tubz/c1ffb9c9-970e-4602-906a-b64838a17309-u1?gpu=NVIDIA+RTX+A6000&isStream=false", "level": "ERROR"}
It seems like all network requests are taking much longer, like 8-10s to post a 1m image to an AWS server. Has anyone else been seeing this? Thanks!
29 Replies
ssssteven (OP) 7mo ago
I restarted the workers and now the problems are fixed. I hope someone can take a look at those workers.
PatrickR 7mo ago
For longer jobs you'll want to use run and not the runsync method.
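
For context, a minimal sketch of the asynchronous pattern this refers to: submit the job, then poll for its status instead of holding one long request open. The host and the `/v2/<endpoint id>/` prefix come from the log line in the first post; the `/run` and `/status/<job id>` paths, the terminal status names, and the injectable `http` callable are assumptions for illustration and should be checked against the RunPod docs.

```python
import time
from typing import Any, Callable, Dict, Optional

BASE_URL = "https://api.runpod.ai/v2"  # host taken from the log line in the first post
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}  # assumed status names

# http(method, url, payload) -> parsed JSON dict. Hypothetical signature: in real
# use, wrap your HTTP client of choice and add the Authorization header there.
Http = Callable[[str, str, Optional[Dict[str, Any]]], Dict[str, Any]]


def run_and_wait(http: Http, endpoint_id: str, job_input: Dict[str, Any],
                 poll_seconds: float = 2.0, max_polls: int = 150) -> Dict[str, Any]:
    """Submit a job asynchronously, then poll until it reaches a terminal state."""
    job = http("POST", f"{BASE_URL}/{endpoint_id}/run", {"input": job_input})
    status_url = f"{BASE_URL}/{endpoint_id}/status/{job['id']}"
    for _ in range(max_polls):
        status = http("GET", status_url, None)
        if status.get("status") in TERMINAL:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job['id']} did not finish after {max_polls} polls")
```

Passing the HTTP function in keeps the sketch independent of any particular client library and makes the polling logic easy to exercise with a fake.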
digigoblin 7mo ago
Problem seems to be a connection timeout to api.runpod.ai, I've also seen this in my endpoint logs from time to time. The connection shouldn't time out.
ssssteven (OP) 7mo ago
It is the run method. It seems like RunPod has some internal network issues.
nerdylive 7mo ago
yeah i've been seeing it too in my endpoints
Phenomenology 7mo ago
Thanks for pointing this out, it is a little concerning. I was preparing to deploy a 7B LLM on serverless for my project. I hope they can fix this issue to ensure proper uptime. I've used Runpod for over a year for regular pods and would love to continue using their service to host the LLM as an endpoint, but I can't risk downtime.
luckedup. 7mo ago
Same here. Any solutions out there? I have a production deployment running; in fact, we have just launched a big update, and things like this just suck big time. Have you tried contacting someone from Runpod?
digigoblin 7mo ago
@flash-singh seems multiple people are experiencing this issue.
digigoblin 7mo ago
I see it in my endpoint logs as well.
digigoblin 7mo ago
Maybe need to scale up api.runpod.ai or something
flash-singh 7mo ago
We are working on optimizing the worker to use Rust to handle the HTTP connections, to get better visibility into these errors, and to optimize the connection timeouts.
nerdylive 7mo ago
Hope it's stable soon
Tony! 7mo ago
Running into this issue as well: tons of failed jobs today and yesterday. What's the resolution?
flash-singh 7mo ago
What do you see in the dashboard? Does it retry jobs or fail them?
Tony! 7mo ago
Fail
flash-singh 7mo ago
Thanks. We are currently in the middle of improving the SDK and will share some updates.
nerdylive 7mo ago
Hey, I think it would be great if we could customize retries from the SDK too.
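
A retry wrapper in user code is one way to get this kind of control over a flaky network call, for example an upload to an external server from inside a handler. A minimal sketch with a configurable retry count and exponential backoff; `with_retries` and its parameters are hypothetical names for illustration and are not part of the RunPod SDK.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], max_retries: int = 3, base_delay: float = 0.5,
                 retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(); on a listed exception, wait and retry up to max_retries times.

    max_retries=0 disables retrying: the first failure is raised immediately.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except retry_on:
            if attempt >= max_retries:
                raise
            # Exponential backoff with jitter: 0.5x to 1x of base_delay * 2**attempt.
            delay = base_delay * (2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
            attempt += 1
```

Setting `max_retries` higher tolerates longer outages at the cost of longer (billed) execution time, while 0 fails fast.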
Arjun 7mo ago
Also seeing some major changes to our serverless execution: super slow response times for simple requests. My feeling is that it might have to do with slow network volume mount times. I believe this is increasing cold start times, which is greatly increasing costs. Is the time that it takes to mount the network volumes being billed as part of the run time?
nerdylive 7mo ago
Yes, the time to mount the network volume will be billed
flash-singh 7mo ago
Time to mount the network volume is fast, but the time to load data like a model into VRAM will vary based on the model. The best approach is always to put the model in the container image if possible. How are you loading models? You will see cold start and execution times as different metrics in your dashboard.
Would you set it higher or to 0? What's your goal?
nerdylive 7mo ago
Oh, there is an option for that already?
yeah true
No goal yet, but it would be great for customizing it for different use cases.
flash-singh 7mo ago
It is not an option yet; we can plan to add it as a configuration.
nerdylive 7mo ago
oh ok
Arjun 7mo ago
We need 3 or so models readily available, plus SD libraries etc., so would this mean just having a 28 GB or so container image? Would that load slowly?
We are using https://github.com/ashleykleynhans/runpod-worker-a1111
It uses a network volume to store Stable Diffusion, the Python env, models, etc., and keeps the main container pretty lightweight. But it sounds like we will want to modify it to keep everything in the container image itself.
flash-singh 7mo ago
one container image and endpoint for all those models?
Arjun 7mo ago
I'm not sure what the best way to architect it would be. Multiple endpoints, one for each model? The problem is how to gauge and distribute our worker quota across the endpoints.
nerdylive 7mo ago
yep keep it under 15 if you want it to be fast enough
flash-singh 7mo ago
How does a job determine which model to use, or does it use all of them? Do you load/unload models per job?
nerdylive 7mo ago
Or use a big enough GPU, and just select which one to use.
I'm not sure if that works.
Use an if condition, then only call the model selected from the inputs, you know, like the whisper endpoint or faster whisper.
But if the SD models are SDXL, yeah, they will be loaded/unloaded ig