Urgent! None of our workers are working! Any network issues?
Please take a look at our workers in endpoint h16kk1hi79s3t0 or kn0n8ry69jj1t7
All the workers are stuck at something!!
Maybe create a support ticket. Meanwhile, I can help you check what's going on by looking at the logs.
Also, what's your template?
We're using our custom Docker image.
How can I create a support ticket?
Hmm, did it ever work before?
On the site, via the contact button.
Yes, we've been running these for months without a problem.
Oh, can you see the logs of the worker when it's stuck?
yes, sure, I'll paste it here
nice that would help identify the problem
Two different worker logs. As far as I can see, there's definitely some kind of network problem.
These templates have been running for months without any changes.
for the first screenshot, after our logic is done the worker is just not doing anything.
for the second, we do some requests in our docker logic, and it seems these network requests are all failing
Is it in the running state?
yes, all stuck in running state
Network requests failing? What does the failure look like?
Wow, that's a huge number of workers.
I don't know. I'm just guessing there's a network problem in RunPod now.
We've been using RunPod heavily for months and this is quite urgent.
These templates have been running without any problem, but since just a few hours ago this problem started happening
Can you copy the last line, the exception, in text form?
here's our requests graph.
I see.
yeah I can paste the last line, but I don't think this will help you. it's just our docker logic.
2024-05-30T02:45:08.562489094Z exception in main_handler in validation check: <class 'requests.exceptions.ConnectionError'>: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
but please look into it asap.. 🙂
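For transient resets like the `ConnectionResetError(104, 'Connection reset by peer')` above, one common mitigation is to retry the outbound call with exponential backoff. The following is only a minimal sketch under stated assumptions: `call_with_retries` and `flaky_request` are hypothetical names, not part of the image discussed here, and the real upload or validation call would take the place of `flaky_request`. It catches `OSError`, which the built-in `ConnectionResetError` derives from and which, as far as I know, `requests.exceptions.ConnectionError` derives from as well.

```python
import time


def call_with_retries(fn, attempts=4, base_delay=0.5, retry_on=(OSError,), sleep=time.sleep):
    """Call fn(); on a retryable error, wait and try again with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...


# Demo with a stand-in for the real network call (hypothetical).
calls = {"n": 0}


def flaky_request():
    # Fails twice with the error seen in the log, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionResetError(104, "Connection reset by peer")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky_request, sleep=delays.append)
```

Retries only paper over short blips; if every attempt fails, the error is still raised so the job can be marked as failed rather than hanging.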
I'm not really someone who can access your account in depth, but I can help on the technical side.
What region is this btw?
we use all the regions. is this what you mean?
What's this doing?
We send a request to Amazon S3 to store our image.
Yeah, it might be a network outage in one of those regions, or an error on your side connecting to another external service.
Oh but it says validation check
Yes, but we tried sending a request to Amazon S3 locally, and that works 😦
oh yeah, not only that, we have other things we do.
validation check means.. as far as I remember, we use Amazon Rekognition service to check for nsfw photos
can you check that service?
The connection to AWS Rekognition could be failing.
We checked that, it works on my computer.
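A check from a local computer says little about connectivity from inside the workers, so it can help to run the same kind of probe within the container. Below is a rough sketch that only tests whether a TCP connection can be opened; it does not authenticate or call any AWS API. The hostnames in `HOSTS` are assumptions (the Rekognition region in particular is a guess), and `can_connect` and `check_hosts` are hypothetical helper names.

```python
import socket


def can_connect(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, resets, and timeouts
        return False


def check_hosts(hosts, probe=can_connect):
    """Map each host to whether it was reachable from this machine."""
    return {host: probe(host) for host in hosts}


# Assumed hostnames; swap in whatever endpoints the image actually calls.
HOSTS = ["s3.amazonaws.com", "rekognition.us-east-1.amazonaws.com"]

# Inside a worker, something like this could be logged at startup:
# print(check_hosts(HOSTS))
```

If the hosts are reachable locally but not from a worker, that points at the network path from the worker rather than at the AWS services themselves.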
the serious thing is, here when it prints "push_output_image" that means our docker logic is done.
normally after that, it should fetch the next runpod job to start, but it's just stuck here
okay i saw another user just posted this
"exception in main_handler: <class 'requests.exceptions.ConnectionError'>: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))"
Seems like there is a problem in RunPod's network somewhere.
I think so too.
Would really appreciate it if you could take a look
I can't access RunPod's infra at the moment, I'm sorry 😦
But I'm sure there are internal people working on this.
oh no..
For now, what you can do is create a support ticket, and if you have one, maybe you can send me the ticket ID.
That would take too long.. I'm just DMing the RunPod members we talked to when we first started using RunPod a year ago.
Thank you
Hahaha
But they're not responding.. are they all off hours?
btw, is your service a public one?
yes
No, they're mainly online in US hours
oh, we're in Korea and I guess it's sleeping time in US..
Hmm, Korea, huh.
The guy who reported the error also seems to be from Korea: https://discord.com/channels/912829806415085598/953341208871194654/1245591192456921220
possibly, this is urgent..
what is the name ?
Well, there's nothing I can do for now, haha, but if you want to, you can try deploying to individual regions (like one region per endpoint) and see which ones fail.
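The one-region-per-endpoint idea could be scripted roughly as follows. This is a sketch under assumptions, not RunPod's API: `probe` is a hypothetical callable you would write yourself (for example, submit a tiny test job to the given endpoint and raise if it fails or times out), and the region names and endpoint IDs are placeholders.

```python
from collections import defaultdict


def failing_regions(endpoints, probe, attempts=5, threshold=0.5):
    """Probe each region's endpoint several times.

    Returns the regions whose failure rate is at or above the threshold.
    """
    failures = defaultdict(int)
    for region, endpoint_id in endpoints.items():
        for _ in range(attempts):
            try:
                probe(endpoint_id)
            except Exception:  # any probe error counts as one failure
                failures[region] += 1
    return sorted(r for r in endpoints if failures[r] / attempts >= threshold)


# Placeholder mapping: one single-region endpoint per region (hypothetical IDs).
endpoints = {"region-a": "endpoint-a", "region-b": "endpoint-b"}


def fake_probe(endpoint_id):
    # Stand-in for a real test job; pretends region-b's endpoint is broken.
    if endpoint_id == "endpoint-b":
        raise ConnectionError("Connection aborted.")


bad = failing_regions(endpoints, fake_probe)
```

Probing each endpoint several times and using a threshold keeps a single flaky request from marking a healthy region as failing.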
thanks