Issues in SE region causing a massive amount of jobs to be retried

The issues in the screenshot are causing 10% of my jobs to be retried in the SE region. Please fix this; it's not happening in the CA region.
20 Replies
digigoblin
digigoblin2mo ago
Obviously I am referring to the "Connection timeout" errors, which cause the job results to fail to be returned, and not the single exception among them.
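(For context: when the status call itself hits a connection timeout, the client never sees the result even though the job may have completed. A minimal sketch of client-side retry with exponential backoff against RunPod's documented serverless status endpoint; the endpoint ID, job ID, and API key below are placeholders.)

```python
import time
import requests

# Placeholders for illustration; substitute your own values.
ENDPOINT_ID = "your-endpoint-id"
JOB_ID = "your-job-id"
API_KEY = "your-runpod-api-key"

STATUS_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}"

def fetch_status(max_attempts: int = 5) -> dict:
    """Fetch the job status, retrying on connection timeouts with backoff."""
    for attempt in range(max_attempts):
        try:
            resp = requests.get(
                STATUS_URL,
                headers={"Authorization": f"Bearer {API_KEY}"},
                timeout=10,  # seconds before giving up on the request
            )
            resp.raise_for_status()
            return resp.json()
        except (requests.ConnectionError, requests.Timeout):
            # Back off exponentially before retrying: 1s, 2s, 4s, ...
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Gave up after {max_attempts} attempts")
```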
Madiator2011
Madiator20112mo ago
@digigoblin DO YOU MIND SUBMITTING A TICKET ON THE WEBSITE? EASIER TO ESCALATE
digigoblin
digigoblin2mo ago
No need to shout but sure 😁
Madiator2011
Madiator20112mo ago
oops, sorry for caps
digigoblin
digigoblin2mo ago
Ticket number is 4208
Madiator2011
Madiator20112mo ago
done
digigoblin
digigoblin2mo ago
Thank you
nerdylive
nerdylive2mo ago
hahaha wait, SE? My jobs work fine btw
digigoblin
digigoblin2mo ago
You probably didn't try and send 1000 jobs today
nerdylive
nerdylive2mo ago
Yes yes
digigoblin
digigoblin2mo ago
I said 10% are retried NOT ALL 🤦‍♂️
nerdylive
nerdylive2mo ago
I'm using dev on SE. Ooh, so 10% are expected to fail?
digigoblin
digigoblin2mo ago
They are retried, they don't fail.
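(To make the distinction concrete: a retried job just re-enters the queue, so a client polling the status endpoint should only treat it as failed once it reports a terminal state. A minimal sketch; the terminal status names are the ones RunPod documents for serverless jobs, hedged here as a set you can adjust.)

```python
import time
import requests

# Statuses treated as terminal; a retried job re-enters IN_QUEUE /
# IN_PROGRESS instead of landing in one of these.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}

def wait_for_job(endpoint_id: str, job_id: str, api_key: str,
                 poll_s: float = 2.0) -> dict:
    """Poll until the job reaches a terminal status and return it."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        state = requests.get(url, headers=headers, timeout=10).json()
        if state.get("status") in TERMINAL:
            return state
        time.sleep(poll_s)
```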
nerdylive
nerdylive2mo ago
well, good luck with your problem
digigoblin
digigoblin2mo ago
RunPod needs to check it out. I switched to CA in the meantime and it works fine.
nerdylive
nerdylive2mo ago
yeah great to hear
digigoblin
digigoblin2mo ago
I was using CA but switched to SE because my jobs were failing. It turned out that wasn't a RunPod issue at all; my own Redis server was hitting out-of-memory (OOM) errors. So I upgraded my ElastiCache instance on AWS from cache.t3.medium to cache.m4.large and now it's fine.
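(Side note: one way to spot that condition before jobs start failing is to watch Redis memory usage against its maxmemory limit. A minimal sketch using the redis-py client; the hostname is a placeholder for your ElastiCache endpoint.)

```python
import redis

# Placeholder connection details; point this at your ElastiCache endpoint.
r = redis.Redis(host="my-cache.abc123.use1.cache.amazonaws.com", port=6379)

info = r.info("memory")
used = info["used_memory"]
limit = info.get("maxmemory", 0)  # 0 means no explicit limit configured

print(f"used: {info['used_memory_human']}")
if limit:
    pct = 100 * used / limit
    print(f"maxmemory: {info['maxmemory_human']} ({pct:.1f}% used)")
    if pct > 90:
        print("Warning: near maxmemory; writes may start failing (OOM)")
```

The node-type upgrade itself can be done from the ElastiCache console or with the AWS CLI's modify commands (the --apply-immediately flag controls whether it waits for the next maintenance window).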
nerdylive
nerdylive2mo ago
Wow, you use ElastiCache? Why not self-hosted Redis?
digigoblin
digigoblin2mo ago
Because it's a cluster, not a single instance.
nerdylive
nerdylive2mo ago
oh ic