RunPod•6mo ago
giantsol

Urgent! All our workers are not working! Any network issues?

Please take a look at our workers in endpoint h16kk1hi79s3t0 or kn0n8ry69jj1t7. All the workers are stuck at something!!
43 Replies
nerdylive•6mo ago
Maybe create a support ticket. Meanwhile, I can help you check what's going on by looking at the logs. Also, what's your template?
giantsolOP•6mo ago
We're using our custom Docker image. How could I create a support ticket?
nerdylive•6mo ago
Hmm, did it ever work before? You can create a ticket on the site, via the contact button.
giantsolOP•6mo ago
Yes, we've been running these for months without problems.
nerdylive•6mo ago
[Image attachment, not preserved]
nerdylive•6mo ago
Oh, can you see the logs of the worker when it's stuck?
giantsolOP•6mo ago
yes, sure, I'll paste it here
nerdylive•6mo ago
nice that would help identify the problem
giantsolOP•6mo ago
Two different worker logs. As far as I can see, there's definitely some kind of network problem. These templates have been running for months without any changes.
[Two screenshots of worker logs, not preserved]
giantsolOP•6mo ago
For the first screenshot, after our logic is done, the worker just doesn't do anything. For the second, we make some requests in our Docker logic, and it seems these network requests are all failing.
nerdylive•6mo ago
Is it in the running state?
giantsolOP•6mo ago
Yes, all stuck in the running state.
[Screenshot, not preserved]
nerdylive•6mo ago
Network requests failing? What does the failure look like? Wow, that's a huge number of workers.
giantsolOP•6mo ago
I don't know. I'm just guessing there's a network problem in RunPod now. We've been using RunPod heavily for months and this is quite urgent. These templates have been running without any problem, but this started happening just a few hours ago.
nerdylive•6mo ago
Can you copy the last line, the exception (in text form)?
giantsolOP•6mo ago
Here's our requests graph.
[Image: requests graph, not preserved]
nerdylive•6mo ago
i c
giantsolOP•6mo ago
Yeah, I can paste the last line, but I don't think this will help you. It's just our Docker logic.
2024-05-30T02:45:08.562489094Z exception in main_handler in validation check: <class 'requests.exceptions.ConnectionError'>: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
But please look into it ASAP.. 🙂
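The `ConnectionResetError(104, 'Connection reset by peer')` in the log line above is the kind of transient failure a handler can retry instead of failing on the first attempt. A minimal sketch follows, assuming the handler's outbound calls can be wrapped in a callable. `call_with_retries` and its parameters are hypothetical names, not part of the RunPod SDK or of `requests`. It catches `OSError`, which `ConnectionResetError` subclasses; `requests.exceptions.ConnectionError` is also expected to derive from it via `RequestException`, but that is worth verifying against the installed version.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=0.5,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call fn(); on a retryable error, back off and try again.

    Hypothetical helper for illustration. The last failure is re-raised
    so the caller still sees the original exception.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Usage (not executed here): wrap whatever performs the network call, e.g.
#   call_with_retries(lambda: upload_image(data))  # upload_image is hypothetical
```

Retries only help with brief blips. If every attempt is reset, the last exception is re-raised so the job fails visibly instead of hanging.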
nerdylive•6mo ago
I'm not really the person who can access your account in depth, but I can help on the technical side. What region is this, btw?
giantsolOP•6mo ago
We use all the regions. Is this what you mean?
[Screenshot, not preserved]
nerdylive•6mo ago
What's this doing?
giantsolOP•6mo ago
We send a request to Amazon S3 to store our image.
nerdylive•6mo ago
Yeah, it might be a network outage in one of those regions, or an error on your side with another external service. Oh, but it says validation check.
giantsolOP•6mo ago
Yes, but we tried sending a request to Amazon S3 locally, and that works 😦 Oh yeah, not only that, we have other things we do. Validation check means.. as far as I remember, we use the Amazon Rekognition service to check for NSFW photos.
nerdylive•6mo ago
Can you check that service? The connection to AWS Rekognition could be failing.
giantsolOP•6mo ago
We checked that; it works on my computer.
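A check that passes on a local machine does not show whether the same hosts are reachable from inside the worker. One way to test that is a small TCP probe run in the container itself. This is a sketch only: `can_connect` is a hypothetical helper, and the hostnames in the usage comment are placeholders for whichever S3 and Rekognition endpoints the image actually calls.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return (True, None) if a TCP connection opens, else (False, error text).

    Hypothetical helper for illustration; it only tests that a connection
    can be opened, not that a full HTTPS request succeeds.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"


# Usage inside the worker (hostnames are assumptions, adjust to your setup):
#   for host in ("s3.amazonaws.com", "rekognition.us-east-1.amazonaws.com"):
#       print(host, can_connect(host))
```

A successful TCP connect does not rule out resets later in the request, so this narrows the problem down rather than proving the network is healthy.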
giantsolOP•6mo ago
The serious thing is, when it prints "push_output_image" here, that means our Docker logic is done. Normally after that, it should fetch the next RunPod job to start, but it's just stuck here.
[Screenshot, not preserved]
nerdylive•6mo ago
Okay, I saw another user just posted this:
"exception in main_handler: <class 'requests.exceptions.ConnectionError'>: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))"
Seems like there is a problem in RunPod's network somewhere.
giantsolOP•6mo ago
I think so too. Would really appreciate it if you could take a look
nerdylive•6mo ago
I can't access RunPod's infra atm, I'm sorry 😦 But I'm sure there are internal people working on this.
giantsolOP•6mo ago
oh no..
nerdylive•6mo ago
For now, what you can do is create a support ticket, and if you have one, maybe you can send me the ticket ID.
giantsolOP•6mo ago
That would take too long.. I'm just DMing RunPod members from back when we first started using RunPod a year ago. Thank you.
nerdylive•6mo ago
Hahaha
giantsolOP•6mo ago
But they're not responding.. are they all off duty?
nerdylive•6mo ago
Btw, is your service a public one?
giantsolOP•6mo ago
yes
nerdylive•6mo ago
No, they're mainly online in US hours
giantsolOP•6mo ago
Oh, we're in Korea and I guess it's sleeping time in the US..
nerdylive•6mo ago
Hmm, Korea, huh. The person who reported the error also seems to be from Korea: https://discord.com/channels/912829806415085598/953341208871194654/1245591192456921220
giantsolOP•6mo ago
possibly, this is urgent..
nerdylive•6mo ago
What is the name? Well, there's nothing I can do for now, haha, but if you want, you can try deploying to individual regions (like one region per endpoint) and see which one fails.
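The one-region-per-endpoint suggestion amounts to a simple isolation experiment: send the same test job to each single-region endpoint and compare failure rates. Below is a sketch of the tallying step, assuming results are collected as `(region, succeeded)` pairs. The function name and the region labels are hypothetical, and how the jobs are submitted is left out.

```python
from collections import defaultdict


def failure_rate_by_region(results):
    """Map each region to the fraction of its test jobs that failed.

    `results` is an iterable of (region, succeeded) pairs; this shape is
    an assumption made for illustration.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for region, succeeded in results:
        totals[region] += 1
        if not succeeded:
            failures[region] += 1
    return {region: failures[region] / totals[region] for region in totals}


# Example with made-up region labels:
sample = [("region-a", True), ("region-a", False),
          ("region-b", True), ("region-b", True)]
print(failure_rate_by_region(sample))
```

A region whose rate stands well above the others is the first candidate to exclude from the endpoint while the underlying problem is investigated.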
giantsolOP•6mo ago
thanks