RunPod•4mo ago
Ercan

URGENT! Network Connection issues

Hi, looks like there is a general issue in all pods and all of them are suffering network connection issues. Can someone look into this?
79 Replies
Ercan
ErcanOP•4mo ago
@Elder Papa Madiator sorry for tagging but no one is looking into this still
Madiator2011
Madiator2011•4mo ago
If you submitted a ticket on the site, it will be checked on Monday. I have not seen any issues.
Foozle
Foozle•4mo ago
Been using RunPod for 6 weeks and kinda know what I'm doing. Big issues with network connections for me too: extremely slow speeds when setting up and downloading Pods, and Hugging Face and Civitai model downloads are just stopping and timing out.
danielfriis
danielfriis•4mo ago
I've had the same issue. Just started using RunPod. First, it worked perfectly, but for the past few days I haven't been able to deploy. Works today. A bit worrisome if I have to rely on the service going forward
Magriux
Magriux•4mo ago
same here
nerdylive
nerdylive•4mo ago
Open up a ticket guys, from the site
NERDDISCO
NERDDISCO•4mo ago
@Ercan @Foozle @danielfriis @markolo would you mind telling me in which regions this was happening? And if you have them, the pod IDs + time / date, so that we can check out what is happening.
danielfriis
danielfriis•4mo ago
Happening again now. @Tim aka NERDDISCO ID: moz6awaptgyvrj, time: right now (14.52 WEST)
Ercan
ErcanOP•4mo ago
Still happening from time to time. It first started happening at "08/17/2024 4:50 PM EST". Pod IDs: jv5qo2ua1o0ku7, r0qtnl9w5pd9ye, 0370lv0blxuph5, 8pbdspc5cfi8nw
NERDDISCO
NERDDISCO•4mo ago
thank you, passed all of this to the team!
nevermind
nevermind•4mo ago
We also experience temporary network issues such as read timeouts, dropped connections, and slow network.
NERDDISCO
NERDDISCO•4mo ago
@nevermind can you please tell me ids of the pods where this has happened? @danielfriis @Ercan for your systems it seems we have some issues with some specific machines in our datacenters, we are trying to get this sorted.
nevermind
nevermind•4mo ago
k5svp2kqw0rh7s - I've faced ReadTimeout: (ReadTimeoutError("HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443) with this one. We also encountered other network timeouts, but we've erased those logs. It likely happened at [2024-08-16 02:58:40,236] +- 10 mins
danielfriis
danielfriis•4mo ago
Anything I can do about it if/when it happens again?
shashank
shashank•4mo ago
Hello @Tim aka NERDDISCO, please can you fix my pod as well? It has been down for 12 hours. Pod ID: s584tny3154kqg. It won't start; it has been downloading a 400 MB Docker layer for 3 hours now. I have raised a ticket, but got no response. I took a savings plan (shouldn't have), so now I'm locked in, paying for a GPU that I cannot use. Please someone help! 🧭 Any acknowledgement from @Finley / @Justin Merrell / @nerdylive / the RunPod team that a paid service is not operational?
nerdylive
nerdylive•4mo ago
Is there some kind of notice in your pod that it is down?
shashank
shashank•4mo ago
Yes I cannot connect to it
shashank
shashank•4mo ago
It has been in this state for last 6 hours
[Image attachment]
shashank
shashank•4mo ago
@nevermind can someone from your org please resolve this? I am stuck in the savings plan; once this is fixed, I am gonna migrate ASAP. There is no communication or resolution. @Tim aka NERDDISCO please can you pass on my pod ID to the team? I don't really know what to do here. My pod, which I have already paid for, isn't running, and I cannot start a new pod without paying again... I am in a catch-22. Pod ID: s584tny3154kqg
generatethings
generatethings•4mo ago
my pods are not connecting either
shashank
shashank•4mo ago
Please someone help. My pod is stuck in
2024-08-21T11:33:18Z create container ghcr.io/ai-dock/comfyui:latest
2024-08-21T11:33:18Z create container: still fetching image ghcr.io/ai-dock/comfyui:latest
2024-08-21T11:33:42Z create container ghcr.io/ai-dock/comfyui:latest
2024-08-21T11:33:42Z create container: still fetching image ghcr.io/ai-dock/comfyui:latest
2024-08-21T11:33:59Z create container ghcr.io/ai-dock/comfyui:latest
2024-08-21T11:33:59Z create container: still fetching image ghcr.io/ai-dock/comfyui:latest
for the last 2 hours. Can someone from the RunPod team please respond?
Encyrption
Encyrption•4mo ago
How big is your image?
shashank
shashank•4mo ago
1.5 GB. It was working fine for the last 15 days. I actually switched to a savings plan and now I am stuck with a non-functional pod. This started happening 16 hours ago.
Encyrption
Encyrption•4mo ago
How was it built? Were you on an X86 PC or a Mac?
shashank
shashank•4mo ago
I am sorry, I am lost. It is a template available on RunPod.
Encyrption
Encyrption•4mo ago
oh ok
shashank
shashank•4mo ago
I just started using an existing template. It worked fine after restarts for the last 15 days. Then, after converting to a committed node and locking in, it won't respond after a restart.
Encyrption
Encyrption•4mo ago
Can you share the template's link here?
shashank
shashank•4mo ago
ghcr.io/ai-dock/comfyui:latest
Encyrption
Encyrption•4mo ago
That is the image name not the template link.
shashank
shashank•4mo ago
[Image attachment]
shashank
shashank•4mo ago
Tell me what you need and I will screenshot it.
Encyrption
Encyrption•4mo ago
When you view Templates in the Explore menu, the template should have boxes at the top right. If you click that, it will copy the template link to your clipboard. I have highlighted the area to click in the screencap.
[Image attachment]
shashank
shashank•4mo ago
That's my template screen.
shashank
shashank•4mo ago
[Image attachment]
shashank
shashank•4mo ago
I just used a Docker image, and then I added the env vars. I created the config when making the deployment.
shashank
shashank•4mo ago
This is how I created the template
[Image attachment]
shashank
shashank•4mo ago
My guess is that there is a networking issue with my pod. Here are the screenshots of your app ^.
Encyrption
Encyrption•4mo ago
You are using someone else's image? You didn't build it?
shashank
shashank•4mo ago
No. ghcr.io/ai-dock/comfyui:latest. Here is the image name.
shashank
shashank•4mo ago
GitHub
GitHub - ai-dock/comfyui: ComfyUI docker images for use in GPU clou...
ComfyUI docker images for use in GPU cloud and local environments. Includes AI-Dock base for authentication and improved user experience. - GitHub - ai-dock/comfyui: ComfyUI docker images for use ...
Encyrption
Encyrption•4mo ago
Ugh I wouldn't trust that! I would either use a template or build my own.
shashank
shashank•4mo ago
Okay, great. I am paying for the GPU. Please fix the pod. I don't think you understand open source tech. It was working fine for the last 15 days. It stopped working in the last 16 hours. Can your devops team just kill my pod and fix the issue?
Encyrption
Encyrption•4mo ago
Just FYI: I am not a RunPod employee. I am just a RunPod user trying to help other users. If you want official support, I suggest you go to Help/Contact on RunPod's main site.
shashank
shashank•4mo ago
That I understood that you are a troll
Encyrption
Encyrption•4mo ago
and now I am done with you... Good luck!
NERDDISCO
NERDDISCO•4mo ago
@shashank sorry for the problems you are facing. But that doesn't mean that you can call someone a troll when they are trying to help you. Please stop with this behavior, as this is a community and we try to help each other. To get this straight: your pod with ID s584tny3154kqg is in a state where you cannot connect to it, and you have already raised a ticket, is this correct? We talked about this via a DM and the issue was resolved.
nerdylive
nerdylive•4mo ago
whats up
NERDDISCO
NERDDISCO•4mo ago
For everyone else: There was maintenance happening in the EU-IS-1 datacenter, which caused the network performance issues.
nerdylive
nerdylive•4mo ago
Ah, unannounced maintenance? Maybe next time, if there are issues or downtime, RunPod should post about it, including possible disruptions too.
NERDDISCO
NERDDISCO•4mo ago
@danielfriis are you using the EUR-IS-1 datacenter by any chance? @nerdylive yes totally, we have to inform the community about this! Sorry that this did not happen in this case! @here the EUR-IS-1 datacenter replaced a bad switch, which caused network issues. Everything should have been back to normal since 12:27 PM PT.
Peter Peter
Peter Peter•4mo ago
The problem still persists, EU-IS-1 docker pull is terribly slow. Changed to EU-RO-1 and the pull was instant.
NERDDISCO
NERDDISCO•4mo ago
Sorry for this, we are looking into this already!
yekta
yekta•4mo ago
I'm having the same problem. These instances are having issues pulling images. Is there any way to select another instance? We try to get an instance from someone else, but RunPod keeps selecting this particular one in the end.
[Image attachment]
yekta
yekta•4mo ago
Is there some Vast-like way to select a specific one? Even if we change the filters, it's still giving us those instances, I'm guessing because it thinks they are better. Except they don't even launch currently. We would prefer being able to select.
NERDDISCO
NERDDISCO•4mo ago
Also in the EUR-IS-1 region?
yekta
yekta•4mo ago
USA community instance. Is there any way whatsoever to select a specific 4090, instead of RunPod selecting it for you?
NERDDISCO
NERDDISCO•4mo ago
so you have selected the "community cloud" with "US" and now want to select the dedicated machine that runs a 4090?
yekta
yekta•4mo ago
those instances are unusable currently
[Image attachment]
NERDDISCO
NERDDISCO•4mo ago
can you please give me the pod id?
yekta
yekta•4mo ago
I didn't select US, I applied some other filters. What I want is to be able to select which instance matching those filters I get, instead of RunPod picking it for me. Basically like Vast. This auto-select is okay when things are working. When they aren't working, we are stuck with unusable instances. omvzojeiev6fkw, here is the ID. I'm guessing all those instances are from the same person/group and that's the problem. We can't see whether there are others matching those filters, because we can only select a single thing even though I'm guessing there are many instances.
yekta
yekta•4mo ago
[Image attachment]
yekta
yekta•4mo ago
There doesn't seem to be a simple way to report the instance either, or to say "I don't want instances from this provider anymore".
yekta
yekta•4mo ago
It even auto-selects those instances when it displays some other instance on the page.
[Image attachment]
[Image attachment]
yekta
yekta•4mo ago
Again, same provider, same problem. @Tim aka NERDDISCO are you looking into it? If not, I'm going to delete that pod, as it's spending money while not working.
yekta
yekta•4mo ago
seeing this still
[Image attachment]
NERDDISCO
NERDDISCO•4mo ago
Yes, you can delete it. Sorry for the delay. It looks like this specific machine has some problems, and there is no way for you in the UI to avoid selecting that specific server. The only way for you as a user is to select another region or go into the secure cloud. For us it means getting this "broken" machine out of the selection and talking with the owner to get this sorted. If you could define a feature here, what would it look like, so that you as the user can resolve this situation without help from anyone at RunPod? A filter where you can enter the ID of the machine and just make sure that this one is not selected anymore? Some kind of "negative list"?
yekta
yekta•4mo ago
A simple list where I pick a machine meeting all the filters I set, instead of a single button to "select" it. As I said, it doesn't need to be visible right away; it could be a dropdown, or a menu hidden behind a button. Vast lets you do this already. I get the upside of selection being so simple (you select 4090 and it's done). However, I would rather have the option, to be able to continue using the platform, since I literally couldn't pick a 4090 matching my criteria because RunPod kept selecting this broken set of instances on my behalf. Maybe there were other instances matching my criteria at the time, maybe not. I simply don't know, because RunPod doesn't let me know or pick. A negative list is also fine; however, a dropdown is much simpler to implement and more flexible, since RunPod obviously already knows which instances match my criteria. It can simply list them and let me pick (if I want to; the default can still be auto-selection). Given the inconsistency of community instances anywhere (RunPod, Vast, etc.), I'd consider this a must-have. Otherwise a bunch of users will get served broken instances by RunPod itself (even though there are non-broken instances behind the scenes). Blacklisting the ID wouldn't work in this case, by the way, because this particular provider of 4090s probably had the same problem on all machines (I tried 3 different ones). If it were a list, I could simply deduce the "owner" based on machine specs and location (since all of them matched). Those instances still have network issues and RunPod still continues to auto-select them, by the way. It's been hours.
shashank
shashank•4mo ago
@Tim aka NERDDISCO My pod is again not working. The GPU is unusable; the utilization is constantly at 0. Same thing happened to me; I suggest just terminating the pod. But just FYI, the new pod that I created is also not working now. Availability of RunPod is really poor, and there is no clear way to resolve it. I have open tickets over email and conversations on Discord, but haven't received any resolution. Can someone from the RunPod team, @Justin Merrell or anyone else really, help please?
shashank
shashank•4mo ago
I think the RunPod machine has been hacked / compromised
[Image attachment]
shashank
shashank•4mo ago
I am not able to run my application code, but CPU usage is through the roof, and the GPU which I am paying for is not working on the pod. Guys, I think I am paying for someone else's compute. Can someone from the RunPod team give me an exit / refund? I figured out that one of the issues is that I am unable to create a websocket connection with the pod. I think this is the same networking problem. I am also not able to copy my network volume to a different bucket. Please, if someone from RunPod can copy my network volume, I will not use the EU region.
NERDDISCO
NERDDISCO•4mo ago
I'm so sorry for the issues you both are having here. I dm'ed both of you so that we can hopefully resolve these issues
Ercan
ErcanOP•4mo ago
Hi @Tim aka NERDDISCO, my assumption is that the RunPod proxy has some serious issues. Someone suggested to me yesterday that we use the pod's public IP address instead of the RunPod proxies, and there are no more connection issues. We are currently using pod IP addresses, and all requests to our server in the pods are working fine with no connection issues.
nerdylive
nerdylive•4mo ago
What issues? Disconnecting after some time? I believe RunPod uses cloudflared, and you should check its limitations.
Ercan
ErcanOP•4mo ago
That is not the issue. It cannot even connect to our server through the RunPod proxy, and because of that Cloudflare times out and throws a timeout error. So the issue starts with not being able to connect to the RunPod proxies, and because of that Cloudflare keeps throwing timeout errors until it connects.
nerdylive
nerdylive•4mo ago
Ohh, I see. If you can, please report them at https://contact.runpod.io/hc/en-us/requests/new with your pod ID.
Ercan
ErcanOP•4mo ago
I think we already did, multiple times. RunPod email support, tickets, and any other communication are either slow or unresponsive. That is why I am bringing it up here, so that at least the devs here see these issues directly.
Madiator2011
Madiator2011•4mo ago
It is usually better to submit a ticket on the website; it is faster because the devs have direct access.
NERDDISCO
NERDDISCO•3mo ago
@Ercan I'm very sorry that you have these problems. Can you please send me the ticket id via DM, then I can find out what the current status is.