RunPod•9mo ago

Urgent! all our workers not working! Any network issues?

Please take a look at our workers in endpoint h16kk1hi79s3t0 or kn0n8ry69jj1t7. All the workers are stuck on something!!

43 Replies

Maybe create a support ticket. Meanwhile, I can help you check what's going on by looking at the logs. Also, what's your template?

giantsolOP•9mo ago

We're using our custom Docker image. How can I create a support ticket?

nerdylive•9mo ago

Hmm, did it ever work before? You can create a ticket on the site, via the contact button.

giantsolOP•9mo ago

Yes, we've been running these for months without a problem.

nerdylive•9mo ago

Oh, can you see the logs of the worker when it's stuck?

giantsolOP•9mo ago

yes, sure, I'll paste it here

nerdylive•9mo ago

nice that would help identify the problem

giantsolOP•9mo ago

Two different worker logs. As far as I can see, there's definitely some kind of network problem. These templates have been running for months without any changes.

giantsolOP•9mo ago

For the first screenshot, after our logic is done the worker just doesn't do anything. For the second, we make some requests in our Docker logic, and it seems these network requests are all failing.

nerdylive•9mo ago

Is it in the running state?

giantsolOP•9mo ago

yes, all stuck in running state

nerdylive•9mo ago

Network requests failing? What does the failure look like? Wow, that's a huge number of workers.

giantsolOP•9mo ago

I don't know. I'm just guessing there's a network problem in RunPod right now. We've been using RunPod heavily for months and this is quite urgent. These templates have been running without any problem, but this started happening just a few hours ago.

nerdylive•9mo ago

Can you copy the last line, the exception, in text form?

giantsolOP•9mo ago

here's our requests graph.

nerdylive•9mo ago

i c

giantsolOP•9mo ago

Yeah, I can paste the last line, but I don't think it will help you. It's just our Docker logic:

2024-05-30T02:45:08.562489094Z exception in main_handler in validation check: <class 'requests.exceptions.ConnectionError'>: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

But please look into it asap.. 🙂
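While the root cause is being investigated, a handler can soften transient "Connection reset by peer" errors like the one in the log line above by retrying the failing call with exponential backoff. The following is only a minimal sketch, not the OP's actual handler code: `call_with_retries` and its parameters are hypothetical names, and it relies on the fact that `requests.exceptions.ConnectionError` derives from `OSError`.

```python
import random
import time


def call_with_retries(func, attempts=4, base_delay=0.5,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call func(); on a retryable error, wait and try again.

    Hypothetical helper. requests.exceptions.ConnectionError derives
    from OSError, so the default retry_on covers that exception type.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff (0.5s, 1s, 2s, ...) plus random jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

Retries only help with short-lived resets. If every attempt fails, the error is re-raised so the job is reported as failed instead of hanging.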

nerdylive•9mo ago

I'm not someone who can dig deep into your account, but I can help on the technical side. What region is this, btw?

giantsolOP•9mo ago

we use all the regions. is this what you mean?

nerdylive•9mo ago

What's this doing?

giantsolOP•9mo ago

We send a request to Amazon S3 to store our image.

nerdylive•9mo ago

Yeah, it might be a network outage in one of those regions, or an error on your side when calling another external service. Oh, but it says validation check.

giantsolOP•9mo ago

Yes, but we tried sending a request to Amazon S3 locally, and that works 😦 Oh yeah, it's not only that, we do other things too. Validation check means.. as far as I remember, we use the Amazon Rekognition service to check for NSFW photos.

nerdylive•9mo ago

Can you check that service? The connection to AWS Rekognition could be failing.
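A check like this is most informative when it runs from inside a stuck worker, not from a local machine. The following is a minimal sketch of a TCP reachability probe using only the standard library; `can_connect` is a hypothetical helper, and which hosts to probe depends on what the handler actually calls.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return (True, None) if a TCP connection opens, else (False, error text)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:  # covers DNS failures, refusals, resets, timeouts
        return False, f"{type(exc).__name__}: {exc}"
```

For example, calling `can_connect("s3.amazonaws.com")` at the start of a job and logging the result would show whether the worker can reach the service at all. A successful TCP connection does not prove the HTTPS request itself will succeed, so this only narrows the problem down.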

giantsolOP•9mo ago

We checked that, it works on my computer.

giantsolOP•9mo ago

The serious thing is, when it prints "push_output_image" here, that means our Docker logic is done. Normally after that, it should fetch the next RunPod job and start it, but it's just stuck here.

nerdylive•9mo ago

Okay, I saw another user just posted this: "exception in main_handler: <class 'requests.exceptions.ConnectionError'>: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))" Seems like there is a problem somewhere in RunPod's network.

giantsolOP•9mo ago

I think so too. Would really appreciate it if you could take a look

nerdylive•9mo ago

I can't access RunPod's infra atm, I'm sorry 😦 but I'm sure there are internal people working on this.

giantsolOP•9mo ago

oh no..

nerdylive•9mo ago

For now, what you can do is create a support ticket, and once you have one, maybe you can send me the ticket ID.

giantsolOP•9mo ago

That would take too long.. I'm just DMing the RunPod members we talked to when we first started using RunPod a year ago. Thank you

nerdylive•9mo ago

Hahaha

giantsolOP•9mo ago

But they're not responding.. are they all off hours?

nerdylive•9mo ago

Btw, is your service a public one?

giantsolOP•9mo ago

yes

nerdylive•9mo ago

No, they're mainly online during US hours.

giantsolOP•9mo ago

Oh, we're in Korea and I guess it's sleeping time in the US..

nerdylive•9mo ago

Hmm, Korea, huh. The guy who reported the other error seems to be from Korea too: https://discord.com/channels/912829806415085598/953341208871194654/1245591192456921220

giantsolOP•9mo ago

possibly, this is urgent..

nerdylive•9mo ago

What is the name? Well, there's nothing I can do for now, hahah, but if you want, you can try deploying to separate regions (like one region per endpoint) and see which one fails.
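With one region per endpoint, the job outcomes can be tallied per region to see where failures concentrate. The following is a minimal sketch that assumes you collect `(region, succeeded)` pairs yourself, for example from your own request logs; `failure_rate_by_region` is a hypothetical helper and the region names in the usage note are placeholders, not real RunPod region IDs.

```python
from collections import defaultdict


def failure_rate_by_region(results):
    """Map each region to its failure rate.

    results: iterable of (region, succeeded) pairs, where succeeded is a bool.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for region, succeeded in results:
        totals[region] += 1
        if not succeeded:
            failures[region] += 1
    return {region: failures[region] / totals[region] for region in totals}
```

For instance, feeding it `[("region-a", True), ("region-b", False)]` would report a failure rate of 0.0 for `region-a` and 1.0 for `region-b`. A region whose rate stands far above the others is the one to mention in the support ticket.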

giantsolOP•9mo ago

thanks
