URGENT! Network Connection issues
Hi, it looks like there is a general issue: all pods are suffering network connection problems. Can someone look into this?
@Elder Papa Madiator sorry for tagging, but no one has looked into this yet
If you submitted a ticket on the page, we will check it on Monday
Not seen any issues
Been using RunPod for 6 weeks and I kinda know what I'm doing. Big issues with network connections for me too: extremely slow speeds setting up pods, and Hugging Face and Civitai model downloads are just stopping and timing out.
I've had the same issue. Just started using RunPod. First, it worked perfectly, but for the past few days I haven't been able to deploy. Works today. A bit worrisome if I have to rely on the service going forward
same here
Open up a ticket guys, from the site
@Ercan @Foozle @danielfriis @markolo would you mind telling me in which regions this was happening? And if you have, the pod ids + time / date, so that we can check out what is happening.
Happening again now. @Tim aka NERDDISCO ID: moz6awaptgyvrj, time: right now (14:52 WEST)
Still happening from time to time; it first started around "08/17/2024 4:50 PM EST"
jv5qo2ua1o0ku7, r0qtnl9w5pd9ye, 0370lv0blxuph5, 8pbdspc5cfi8nw
thank you, passed all of this to the team!
We also experience intermittent network issues such as read timeouts, dropped connections, and slow throughput.
@nevermind can you please tell me the IDs of the pods where this has happened?
@danielfriis @Ercan for your systems it seems we have issues with some specific machines in our datacenters; we are trying to get this sorted.
k5svp2kqw0rh7s
- I've faced ReadTimeout: (ReadTimeoutError("HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443)
with this one
Also encountered other network timeouts, but we've erased these logs
It happened likely at [2024-08-16 02:58:40,236] +- 10 mins
Anything I can do about it if/when it happens again?
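Not a RunPod fix, but if these CDN timeouts come back, one generic mitigation is to wrap the download in a retry loop with backoff. A minimal sketch, assuming the model comes from the Hugging Face Hub via huggingface_hub; the repo id and filename below are placeholders, not anything from this thread:

```python
import time

from huggingface_hub import hf_hub_download
from requests.exceptions import ConnectionError, ReadTimeout


def download_with_retries(repo_id: str, filename: str, max_attempts: int = 5) -> str:
    """Retry a Hugging Face Hub download with exponential backoff on network errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return hf_hub_download(repo_id=repo_id, filename=filename)
        except (ConnectionError, ReadTimeout) as err:
            if attempt == max_attempts:
                raise
            wait = 2 ** attempt  # 2s, 4s, 8s, ...
            print(f"attempt {attempt} failed ({err!r}); retrying in {wait}s")
            time.sleep(wait)


# Example call with placeholder repo/file names:
# path = download_with_retries("some-org/some-model", "model.safetensors")
```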
Hello
@Tim aka NERDDISCO please, can you fix my pod as well?
It has been down for 12 hours.
Pod Id: s584tny3154kqg
It won't start; it's been downloading a 400 MB Docker layer for 3 hours now
I have raised a ticket, but no response. I took a savings plan (shouldn't have), so now I'm locked in, paying for a GPU that I cannot use
Please someone help!
Any acknowledgement from @Finley / @Justin Merrell / @nerdylive / the RunPod team that a paid service is not operational?
is there some kind of notice in your pod?
that it is down?
Yes
I cannot connect to it
It has been in this state for last 6 hours
@nevermind can someone from your org please resolve this ?
I am stuck in the savings plan; once this is fixed, I am gonna migrate asap. There is no communication or resolution.
@Tim aka NERDDISCO please can you pass on my pod id to the team?
I don't really know what to do here. My pod which I have already paid for isn't running, and I cannot start a new pod without paying again... I am in a catch-22
Pod Id: s584tny3154kqg
my pods are not connecting either
Please someone help
My pod has been stuck in the same state for the last 2 hours
Can someone from the RunPod team please respond?
How big is your image?
1.5 GB
It was working fine for the last 15 days
I actually switched to a savings plan and now I am stuck with a non-functional pod
This started happening 16 hours ago
How was it built? Were you on an X86 PC or a Mac?
I am sorry I am lost
It is a template available on runpod
oh ok
I just started using an existing template. It worked fine after restarts for the last 15 days
Then after converting to a committed node and locking in, it won't respond after a restart.
Can you share the templates link here?
ghcr.io/ai-dock/comfyui:latest
That is the image name not the template link.
Tell me what you need I will screenshot it
When you see Templates in the Explore menu, the template should have a boxes icon in the top right; if you click that, it will copy the template link to your clipboard. I have highlighted the area to click in the screencap.
That's my template screen
I just used a docker image
And then I added the env vars
I created the config when making the deployment
This is how I created the template
There is a networking issue with my pod, is what I can guess.
Here are the screenshots of your app ^.
You are using someone else's image? You didn't build it?
No
ghcr.io/ai-dock/comfyui:latest
Here is the image name.
GitHub - ai-dock/comfyui: ComfyUI docker images for use in GPU cloud and local environments. Includes AI-Dock base for authentication and improved user experience.
Ugh I wouldn't trust that! I would either use a template or build my own.
Okay
Great.
I am paying for the GPU
Please fix the pod
I don't think you understand open source tech.
It was working fine for the last 15 days. It stopped working in the last 16 hours
Can your devops team just kill my pod? And fix the issue.
Just FYI: I am not a RunPod employee. I am just a RunPod user trying to help other users. If you want official support, I suggest you go to Help/Contact on RunPod's main site.
That much I understood: you are a troll
and now I am done with you... Good luck!
@shashank sorry for the problems you are facing. But that doesn't mean that you can tell someone that they are a troll if they try to help you. Please stop with this behavior, as this is a community and we try to help each other.
To get this straight: your pod with ID s584tny3154kqg is in a state where you cannot connect to it, and you already raised a ticket, is this correct?
We talked about this via a DM and the issue was resolved.
what's up
For everyone else: there was maintenance happening in the EU-IS-1 datacenter, which caused the network performance issues.
Ah, unannounced maintenance? Maybe next time if there are issues or downtime, RunPod should post it, even possible disruptions too
@danielfriis are you using the EUR-IS-1 datacenter by any chance?
@nerdylive yes totally, we have to inform the community about this!
Sorry that this was not happening in this case!
@here the EUR-IS-1 datacenter replaced a bad switch, which caused network issues. Everything should be back to normal since 12:27 PM PT.
The problem still persists, EU-IS-1 docker pull is terribly slow. Changed to EU-RO-1 and the pull was instant.
Sorry for this, we are looking into this already!
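For anyone who wants to check whether a pod's network (rather than the image or registry) is the slow part, a rough throughput measurement from inside the pod can help. A minimal sketch; the test URL is an arbitrary public speed-test endpoint and nothing RunPod-specific:

```python
import time

import requests

# Arbitrary public speed-test endpoint (assumption), ~100 MB download.
TEST_URL = "https://speed.cloudflare.com/__down?bytes=100000000"

start = time.time()
downloaded = 0
with requests.get(TEST_URL, stream=True, timeout=30) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
        downloaded += len(chunk)

elapsed = time.time() - start
print(f"downloaded {downloaded / 1e6:.1f} MB in {elapsed:.1f}s "
      f"({downloaded * 8 / (elapsed * 1e6):.1f} Mbit/s)")
```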
I'm having the same problem. These instances are having issues pulling images. Is there any way to select another instance? We try to get an instance from someone else, but RunPod keeps selecting this particular one in the end.
Is there some Vast-like way to select a specific one? Even if we change the filters, it's still giving us those instances, because I'm guessing it thinks they are better. Except they don't even launch currently. We would prefer being able to select one ourselves.
Also in the EUR-IS-1 region?
USA
community instance
Is there any way whatsoever to select a specific 4090, instead of Runpod selecting it for you
so you have selected the "community cloud" with "US" and now want to select the dedicated machine that runs a 4090?
those instances are unusable currently
can you please give me the pod id?
I didn't select US, I applied some other filters. What I want is to be able to select whichever instance matches them, instead of RunPod picking one for me
Basically like Vast. This auto-select is okay when things are working. When they aren't working we are stuck with unusable instances
omvzojeiev6fkw
here is the ID
I'm guessing all those instances are from the same person/group and that's the problem.
We can't see if there are any others matching those filters, because we can only select a single one, even though I'm guessing there are many instances.
there doesn't seem to be a simple way to report the instance either
or say "I don't want instances from this provider anymore"
it even auto-selects those instances when it displays some other instance on the page
Again same provider, same problem
@Tim aka NERDDISCO are you looking into it? I'm going to delete that pod if not, as it's spending money while not working
seeing this still
yes you can delete it
sorry for the delay
it looks like this specific machine has some problems, and there is no way for you in the UI to avoid selecting that specific server. The only way for you as a user is to select another region or go into the secure cloud.
For us it means getting this "broken" machine out of the selection and talking with the owner to get this sorted
if you could define a feature here, what would it look like, so that you as the user can resolve this situation without help from anyone at RunPod?
a filter where you can enter the ID of the machine and just make sure that this one is not selected anymore? Some kind of "negative list"?
A simple list where I pick a machine meeting all the filters I set instead of a single button to “select” it
As I said it doesn’t need to be visible right away, it could be a dropdown, a menu hidden behind a button
Vast lets you do this already. I get the upside of selection being so simple (you select a 4090 and it's done). However, I would rather have the option to be able to continue using the platform, since I literally couldn't pick a 4090 matching my criteria while RunPod kept selecting this broken set of instances on my behalf
Maybe there were other instances matching my criteria at the time, maybe not; I simply don't know, because RunPod doesn't let me know/pick
Negative list is also fine, however a dropdown is much simpler to implement and flexible since RunPod obviously already knows which instances are matching my criteria. It can simply list them and let me pick (if I want to, default can be autoselection still)
Given the inconsistency of community instances anywhere (RunPod, Vast etc), I’d consider this a must have. Because otherwise a bunch of users will get served broken instances by RunPod itself (even though there are non-broken instances behind the scenes).
The blacklisting of the ID wouldn't work in this case btw, because this particular provider of 4090s probably had the same problem on all machines (I tried 3 different ones). If it was a list, I could simply deduce the "owner" based on machine specs and location (since all of them were matching).
Those instances still have network issues and RunPod still continues to auto-select them by the way. It's been hours
@Tim aka NERDDISCO My pod is again not working
GPU is unusable, the utilization is constantly at 0
Same thing happened to me; I suggest just terminating the pod. But just FYI, the new pod that I created is also not working now. Availability of :runpod: is really poor
And there's no clear way to resolve it; I have open tickets via email and conversations on Discord, but haven't received any resolution.
Can someone from the RunPod team @Justin Merrell or anyone else really help please?
I think the RunPod machine has been hacked / compromised
I am not able to run my application code but CPU usage is through the roof
And the GPU which I am paying for is not working on the pod.
Guys I think I am paying for someone else's compute
Can someone from the RunPod team give me an exit / refund?
I figured out one of the issues is that I am unable to create a websocket connection with the pod
I think this is the same networking problem.
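If it helps narrow things down, a standalone handshake check can show whether the websocket endpoint is reachable at all. A minimal sketch using the websockets package; the URL format follows RunPod's HTTP proxy naming as I understand it, and the pod id, port, and path are placeholders:

```python
import asyncio

import websockets  # pip install websockets

# Placeholder URL (assumption): swap in your pod id, port, and websocket path,
# or use the pod's public IP / mapped TCP port to bypass the proxy entirely.
POD_WS_URL = "wss://<pod-id>-8188.proxy.runpod.net/ws"


async def check() -> None:
    try:
        async with websockets.connect(POD_WS_URL, open_timeout=10):
            print("websocket handshake succeeded")
    except Exception as err:
        print(f"websocket connection failed: {err!r}")


asyncio.run(check())
```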
I am not able to copy my network volume to a different bucket.
Please, if someone from RunPod can copy my network volume, I will stop using the EU region
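While waiting on support, one self-serve way to get data off a network volume is to upload it to your own S3-compatible bucket from inside a working pod. A minimal sketch with boto3; the mount path, bucket name, and credentials setup are assumptions, not anything confirmed in this thread:

```python
import os

import boto3  # pip install boto3; credentials via the usual AWS env vars

VOLUME_PATH = "/workspace"       # typical network-volume mount point (assumption)
BUCKET = "my-backup-bucket"      # placeholder bucket name

s3 = boto3.client("s3")

# Walk the volume and upload every file, preserving relative paths as keys.
for root, _dirs, files in os.walk(VOLUME_PATH):
    for name in files:
        local_path = os.path.join(root, name)
        key = os.path.relpath(local_path, VOLUME_PATH)
        print(f"uploading {local_path} -> s3://{BUCKET}/{key}")
        s3.upload_file(local_path, BUCKET, key)
```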
I'm so sorry for the issues you both are having here.
I dm'ed both of you so that we can hopefully resolve these issues
Hi @Tim aka NERDDISCO, I suspect the RunPod proxy has some serious issues. Someone suggested to me yesterday to use the pod's public IP address instead of the RunPod proxies, and there are no more connection issues.
We are currently using pod IP addresses, all requests to our server in the pods are working fine, and there are no connection issues.
What issues? Disconnecting after some time? I believe RunPod uses cloudflared, and you should check its limitations
That is not the issue: it cannot even connect to our server through the RunPod proxy, and because of the cloudflared timeout, it throws a timeout error.
So the issue starts with not being able to connect to the RunPod proxies, and because of that Cloudflare throws a timeout error until it connects
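To illustrate the workaround being described: hit the same endpoint once through the proxy URL and once through the pod's public IP and exposed TCP port, and compare. A minimal sketch; both URLs are placeholders and the /health path is just an example endpoint:

```python
import requests

# Placeholders (assumptions): fill in your pod id, the public IP, and the
# external port shown in the pod's TCP port mappings.
PROXY_URL = "https://<pod-id>-8000.proxy.runpod.net/health"
DIRECT_URL = "http://<public-ip>:<external-port>/health"

for label, url in [("proxy", PROXY_URL), ("direct", DIRECT_URL)]:
    try:
        resp = requests.get(url, timeout=10)
        print(f"{label}: HTTP {resp.status_code} "
              f"in {resp.elapsed.total_seconds():.2f}s")
    except requests.RequestException as err:
        print(f"{label}: failed ({err!r})")
```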
Ohh i see
if you can, please report them to https://contact.runpod.io/hc/en-us/requests/new with your pod id
I think we already did, multiple times; RunPod email support/tickets and other communication are slow or unresponsive
that is why I am bringing it up here, so at least the devs here see these issues directly
usually it is better to submit a ticket on the website, and it's faster because the devs have direct access
@Ercan I'm very sorry that you have these problems. Can you please send me the ticket id via DM, then I can find out what the current status is.