Network connections are very slow; "Failed to return job results"

Hello, I've started seeing this error for all my jobs:

2024-07-03T16:16:03.367164993Z {"requestId": "c1ffb9c9-970e-4602-906a-b64838a17309-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/m6fixbtjutzbvo/job-done/ap10mhw4d9tubz/c1ffb9c9-970e-4602-906a-b64838a17309-u1?gpu=NVIDIA+RTX+A6000&isStream=false", "level": "ERROR"}

It seems like all network requests are taking much longer, e.g. 8-10s to post a 1 MB image to an AWS server. Has anyone else seen this? Thanks!
ssssteven
sssstevenOP6mo ago
I restarted the workers and the problem is fixed for now. I hope someone can take a look at those workers.
PatrickR
PatrickR6mo ago
For longer jobs you'll want to use run and not the runsync method.
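For reference, a minimal sketch of the two call styles against the public /run and /runsync endpoints (the endpoint ID, payload, and auth header here are illustrative placeholders, not taken from the thread):
```python
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"   # placeholder
ENDPOINT_ID = "YOUR_ENDPOINT_ID"  # placeholder
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
payload = {"input": {"prompt": "example"}}

# /runsync blocks until the job finishes (or the gateway gives up),
# so it only suits short jobs.
print(requests.post(f"{BASE}/runsync", json=payload, headers=HEADERS, timeout=90).json())

# /run returns a job ID immediately; poll /status/<id> for longer jobs.
job = requests.post(f"{BASE}/run", json=payload, headers=HEADERS, timeout=30).json()
print(requests.get(f"{BASE}/status/{job['id']}", headers=HEADERS, timeout=30).json())
```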
digigoblin
digigoblin6mo ago
The problem seems to be a connection timeout to api.runpod.ai; I've also seen this in my endpoint logs from time to time. The connection shouldn't time out.
ssssteven
sssstevenOP6mo ago
It is the run method; it seems like RunPod has some internal network issues.
nerdylive
nerdylive6mo ago
yeah i've been seeing it too in my endpoints
Phenomenology
Phenomenology6mo ago
Thanks for pointing this out, it is a little concerning. I was preparing to deploy a 7B LLM model on serverless for my project - I hope they can fix this issue to ensure proper uptime. I've used Runpod for over a year for regular pods and would love to continue using their service to host the LLM as an endpoint, but I can't risk having downtime.
luckedup.
luckedup.6mo ago
Same here. Any solutions out there? I have a production deployment running; we just launched a big update, and things like this just suck big time. Have you tried contacting someone from RunPod?
digigoblin
digigoblin6mo ago
@flash-singh seems multiple people are experiencing this issue.
digigoblin
digigoblin6mo ago
I see it in my endpoint logs as well.
digigoblin
digigoblin6mo ago
Maybe need to scale up api.runpod.ai or something
flash-singh
flash-singh6mo ago
We are working on optimizing the worker to use Rust to handle the HTTP connections, get better visibility into these errors, and optimize the connection timeouts.
nerdylive
nerdylive6mo ago
Hope it's stable soon
Tony!
Tony!6mo ago
Running into this issue as well, tons of failed jobs today and yesterday, what's the resolution?
flash-singh
flash-singh6mo ago
what do you see in dashboard? does it retry jobs or fail jobs?
Tony!
Tony!6mo ago
Fail
flash-singh
flash-singh6mo ago
Thanks, we are currently in the middle of updating the SDK to be better; will share some updates.
nerdylive
nerdylive6mo ago
Hey, I think it would be great if we could customize retries from the SDK too.
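Until something like that lands in the SDK, a client-side wrapper is one workaround; a rough sketch (retry counts and backoff values are arbitrary, not RunPod defaults):
```python
import time
import requests

def submit_with_retries(url, payload, headers, max_retries=3, backoff=2.0):
    """Retry a /run submission on timeouts or 5xx responses (illustrative only)."""
    for attempt in range(max_retries + 1):
        try:
            resp = requests.post(url, json=payload, headers=headers, timeout=30)
            if resp.status_code < 500:
                return resp.json()
        except requests.exceptions.RequestException:
            pass  # connection errors / timeouts fall through to a retry
        if attempt < max_retries:
            time.sleep(backoff * (2 ** attempt))  # exponential backoff between attempts
    raise RuntimeError("Job submission failed after retries")
```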
Arjun
Arjun6mo ago
Also seeing some major changes to our serverless execution. Super slow response times for simple requests, and my feeling is that it might have to do with slow network volume mount times. I believe this is increasing cold start times, which is greatly increasing costs. Is the time that it takes to mount the network volumes being billed as part of the run time?
nerdylive
nerdylive6mo ago
Yes, the time to mount the network volume will be billed
flash-singh
flash-singh6mo ago
Time to mount the network volume is fast, but loading data like a model into VRAM will vary based on the model. The best approach is always to put the model in the container image if possible. How are you loading models? You will see cold start and execution times as different metrics in your dashboard. You would set it higher or to 0; what's your goal?
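On the cold-start side, the usual pattern in a Python worker is to load the model once at import time, outside the handler, so warm workers reuse it. A rough sketch, assuming a Stable Diffusion model baked into the image at a placeholder path:
```python
import runpod
import torch
from diffusers import StableDiffusionPipeline

# Loaded once per worker process (not per job); the path is a placeholder for
# weights baked into the container image rather than read from a network volume.
PIPELINE = StableDiffusionPipeline.from_pretrained(
    "/models/sd-v1-5", torch_dtype=torch.float16
).to("cuda")

def handler(job):
    prompt = job["input"]["prompt"]
    image = PIPELINE(prompt).images[0]
    image.save("/tmp/out.png")
    return {"image_path": "/tmp/out.png"}

runpod.serverless.start({"handler": handler})
```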
nerdylive
nerdylive6mo ago
Oh, there is an option for that already? Yeah, true. No goal yet, but it would be great for customizing it for different use cases.
flash-singh
flash-singh6mo ago
It is not an option yet; we can plan to add it as a configuration.
nerdylive
nerdylive6mo ago
oh ok
Arjun
Arjun6mo ago
We need 3 or so models readily available plus SD libraries etc., so would this mean having a ~28 GB container image? Would that load slowly? We are using https://github.com/ashleykleynhans/runpod-worker-a1111, which uses a network volume to store Stable Diffusion, the Python env, models, etc. and keeps the main container pretty lightweight. But it sounds like we will want to modify it to keep everything in the container image itself.
flash-singh
flash-singh6mo ago
one container image and endpoint for all those models?
Arjun
Arjun6mo ago
I'm not sure what the best way to architect it would be. Multiple endpoints, one for each model? The problem is how to gauge and distribute our worker quota across endpoints.
nerdylive
nerdylive6mo ago
yep keep it under 15 if you want it to be fast enough
flash-singh
flash-singh6mo ago
how does a job determine which model to use or if all? do you load / unload models per job?
nerdylive
nerdylive6mo ago
Or use a big enough GPU and just select which one to use; I'm not sure if that works. Use an if condition and only call the model selected from the inputs, you know, like the whisper or faster-whisper endpoint. But if the SD models are SDXL, yeah, they will be loaded/unloaded, I guess.
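A rough sketch of that routing pattern, where the job input names the model (the loader helpers and model names here are hypothetical placeholders):
```python
import runpod

# Hypothetical loader helpers; each model is loaded lazily on first use and
# cached so it stays in VRAM for later jobs on the same worker.
LOADERS = {
    "sd15": lambda: load_sd15("/models/sd15"),
    "sdxl": lambda: load_sdxl("/models/sdxl"),
}
CACHE = {}

def handler(job):
    name = job["input"].get("model", "sd15")  # model chosen per request, like the whisper endpoints
    if name not in LOADERS:
        return {"error": f"unknown model: {name}"}
    if name not in CACHE:
        CACHE[name] = LOADERS[name]()
    return {"output": CACHE[name](job["input"]["prompt"])}

runpod.serverless.start({"handler": handler})
```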