URGENT! Network Connection issues
Hi, it looks like there is a general issue: all pods are suffering network connection problems. Can someone look into this?
@Elder Papa Madiator sorry for tagging, but no one has looked into this yet
If you submitted a ticket on the page, we will check it on Monday
Not seen any issues
Been using RunPod for 6 weeks and I kinda know what I'm doing. Big issues with network connections for me too: extremely slow speeds setting up pods, and Hugging Face and Civitai model downloads are just stopping and timing out.
I've had the same issue. Just started using RunPod. First, it worked perfectly, but for the past few days I haven't been able to deploy. Works today. A bit worrisome if I have to rely on the service going forward
same here
Open up a ticket guys, from the site
@Ercan @Foozle @danielfriis @markolo would you mind telling me in which regions this was happening? And if you have, the pod ids + time / date, so that we can check out what is happening.
Happening again now. @Tim aka NERDDISCO ID: moz6awaptgyvrj, time: right now (14:52 WEST)
Still happening from time to time; it first started around "08/17/2024 4:50 PM EST"
jv5qo2ua1o0ku7, r0qtnl9w5pd9ye, 0370lv0blxuph5, 8pbdspc5cfi8nw
thank you, passed all of this to the team!
We also experience intermittent network issues such as read timeouts, dropped connections, and slow throughput.
@nevermind can you please tell me the IDs of the pods where this has happened?
@danielfriis @Ercan for your systems it seems we have issues with some specific machines in our datacenters; we are trying to get this sorted.
k5svp2kqw0rh7s
- I've faced ReadTimeout: (ReadTimeoutError("HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443)
with this one
Also encountered other network timeouts, but we've erased these logs
It happened likely at [2024-08-16 02:58:40,236] +- 10 mins
Anything I can do about it if/when it happens again?
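Not a RunPod fix, but if these CDN timeouts come back, one generic mitigation is to wrap the download in a retry loop with backoff. A minimal sketch, assuming the model comes from the Hugging Face Hub via huggingface_hub; the repo id and filename below are placeholders, not anything from this thread:

```python
import time

from huggingface_hub import hf_hub_download
from requests.exceptions import ConnectionError, ReadTimeout


def download_with_retries(repo_id: str, filename: str, max_attempts: int = 5) -> str:
    """Retry a Hugging Face Hub download with exponential backoff on network errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return hf_hub_download(repo_id=repo_id, filename=filename)
        except (ConnectionError, ReadTimeout) as err:
            if attempt == max_attempts:
                raise
            wait = 2 ** attempt  # 2s, 4s, 8s, ...
            print(f"attempt {attempt} failed ({err!r}); retrying in {wait}s")
            time.sleep(wait)


# Example call with placeholder repo/file names:
# path = download_with_retries("some-org/some-model", "model.safetensors")
```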
Hello
@Tim aka NERDDISCO please, can you fix my pod as well?
It has been down for 12 hours.
Pod Id: s584tny3154kqg
It won't start; it's been downloading a 400 MB Docker layer for 3 hours now
I have raised a ticket, but no response. I took a savings plan (shouldn't have), so now I'm locked in, paying for a GPU that I cannot use
Please someone help!
Any acknowledgement from @Finley / @Justin Merrell / @nerdylive / the RunPod team that a paid service is not operational?
is there some kind of notice in your pod?
that it is down?
Yes
I cannot connect to it
It has been in this state for last 6 hours
@nevermind can someone from your org please resolve this ?
I am stuck in the savings plan; once this is fixed, I am gonna migrate asap. There is no communication or resolution.
@Tim aka NERDDISCO please can you pass on my pod id to the team?
I don't really know what to do here. My pod which I have already paid for isn't running, and I cannot start a new pod without paying again... I am in a catch-22
Pod Id: s584tny3154kqg
my pods are not connecting either
Please someone help
My pod has been stuck in the same state for the last 2 hours
Can someone from the RunPod team please respond?
How big is your image?
1.5 GB
It was working fine for the last 15 days
I actually switched to a savings plan and now I am stuck with a non-functional pod
This started happening 16 hours ago
How was it built? Were you on an X86 PC or a Mac?
I am sorry I am lost
It is a template available on runpod
oh ok
I just started using an existing template. It worked fine after restarts for the last 15 days
Then after converting to a committed node and locking in, it won't respond after a restart.
Can you share the templates link here?
ghcr.io/ai-dock/comfyui:latest
That is the image name not the template link.
Tell me what you need I will screenshot it
When you see Templates in the Explore menu, the template should have a boxes icon in the top right; if you click that, it will copy the template link to your clipboard. I have highlighted the area to click in the screencap.
That's my template screen
I just used a docker image
And then I added the env vars
I created the config when making the deployment
This is how I created the template
There is a networking issue with my pod, is what I can guess.
Here are the screenshots of your app ^.
You are using someone else's image? You didn't build it?
No
ghcr.io/ai-dock/comfyui:latest
Here is the image name.
GitHub - ai-dock/comfyui: ComfyUI docker images for use in GPU cloud and local environments. Includes AI-Dock base for authentication and improved user experience.
Ugh I wouldn't trust that! I would either use a template or build my own.
Okay
Great.
I am paying for the GPU
Please fix the pod
I don't think you understand open source tech.
It was working fine for the last 15 days. It stopped working in the last 16 hours
Can your devops team just kill my pod? And fix the issue.
Just FYI: I am not a RunPod employee. I am just a RunPod user trying to help other users. If you want official support, I suggest you go to Help/Contact on RunPod's main site.
That much I understood: you are a troll
and now I am done with you... Good luck!
@shashank sorry for the problems you are facing. But that doesn't mean that you can tell someone that they are a troll if they try to help you. Please stop with this behavior, as this is a community and we try to help each other.
To get this straight: your pod with ID s584tny3154kqg is in a state where you cannot connect to it, and you already raised a ticket, is this correct?
We talked about this via a DM and the issue was resolved.
what's up
For everyone else: there was maintenance happening in the EU-IS-1 datacenter, which caused the network performance issues.
Ah, unannounced maintenance? Maybe next time if there are issues or downtime, RunPod should post it, even possible disruptions too
@danielfriis are you using the EUR-IS-1 datacenter by any chance?
@nerdylive yes totally, we have to inform the community about this!
Sorry that this was not happening in this case!
@here the EUR-IS-1 datacenter replaced a bad switch, which caused network issues. Everything should be back to normal since 12:27 PM PT.
The problem still persists, EU-IS-1 docker pull is terribly slow. Changed to EU-RO-1 and the pull was instant.
Sorry for this, we are looking into this already!
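For anyone who wants to check whether a pod's network (rather than the image or registry) is the slow part, a rough throughput measurement from inside the pod can help. A minimal sketch; the test URL is an arbitrary public speed-test endpoint and nothing RunPod-specific:

```python
import time

import requests

# Arbitrary public speed-test endpoint (assumption), ~100 MB download.
TEST_URL = "https://speed.cloudflare.com/__down?bytes=100000000"

start = time.time()
downloaded = 0
with requests.get(TEST_URL, stream=True, timeout=30) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
        downloaded += len(chunk)

elapsed = time.time() - start
print(f"downloaded {downloaded / 1e6:.1f} MB in {elapsed:.1f}s "
      f"({downloaded * 8 / (elapsed * 1e6):.1f} Mbit/s)")
```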
I'm having the same problem. These instances are having issues pulling images. Is there any way to select another instance? We try to get an instance from someone else, but RunPod keeps selecting this particular one in the end.
Is there some Vast-like way to select a specific one? Even if we change the filters, it's still giving us those instances, because I'm guessing it thinks they are better. Except they don't even launch currently. We would prefer being able to select one ourselves.
Also in the EUR-IS-1 region?
USA
community instance
Is there any way whatsoever to select a specific 4090, instead of Runpod selecting it for you
so you have selected the "community cloud" with "US" and now want to select the dedicated machine that runs a 4090?
those instances are unusable currently
can you please give me the pod id?
I didn't select US, I applied some other filters. What I want is to be able to select whichever instance matches them, instead of RunPod picking one for me
Basically like Vast. This auto-select is okay when things are working. When they aren't working we are stuck with unusable instances
omvzojeiev6fkw
here is the ID
I'm guessing all those instances are from the same person/group and that's the problem.
We can't see if there are any others matching those filters, because we can only select a single one, even though I'm guessing there are many instances.
there doesn't seem to be a simple way to report the instance either
or say "I don't want instances from this provider anymore"
it even auto-selects those instances when it displays some other instance on the page
Again same provider, same problem
@Tim aka NERDDISCO are you looking into it? I'm going to delete that pod if not, as it's spending money while not working
seeing this still
yes you can delete it
sorry for the delay
it looks like this specific machine has some problems, and there is no way for you in the UI to avoid selecting that specific server. The only way for you as a user is to select another region or go into the secure cloud.
For us it means getting this "broken" machine out of the selection and talking with the owner to get this sorted
if you could define a feature here, what would it look like, so that you as the user can resolve this situation without help from anyone at RunPod?
a filter where you can enter the ID of the machine and just make sure that this one is not selected anymore? Some kind of "negative list"?
A simple list where I pick a machine meeting all the filters I set instead of a single button to “select” it
As I said it doesn’t need to be visible right away, it could be a dropdown, a menu hidden behind a button
Vast lets you do this already. I get the upside of selection being so simple (you select a 4090 and it's done). However, I would rather have the option to be able to continue using the platform, since I literally couldn't pick a 4090 matching my criteria while RunPod kept selecting this broken set of instances on my behalf
Maybe there were other instances matching my criteria at the time, maybe not; I simply don't know, because RunPod doesn't let me know/pick
Negative list is also fine, however a dropdown is much simpler to implement and flexible since RunPod obviously already knows which instances are matching my criteria. It can simply list them and let me pick (if I want to, default can be autoselection still)
Given the inconsistency of community instances anywhere (RunPod, Vast etc), I’d consider this a must have. Because otherwise a bunch of users will get served broken instances by RunPod itself (even though there are non-broken instances behind the scenes).
The blacklisting of the ID wouldn't work in this case btw, because this particular provider of 4090s probably had the same problem on all machines (I tried 3 different ones). If it was a list, I could simply deduce the "owner" based on machine specs and location (since all of them were matching).
Those instances still have network issues and RunPod still continues to auto-select them by the way. It's been hours
@Tim aka NERDDISCO My pod is again not working
GPU is unusable, the utilization is constantly at 0
Same thing happened to me; I suggest just terminating the pod. But just FYI, the new pod that I created is also not working now. Availability of :runpod: is really poor
And there's no clear way to resolve it; I have open tickets via email and conversations on Discord, but haven't received any resolution.
Can someone from the RunPod team @Justin Merrell or anyone else really help please?
I think the RunPod machine has been hacked / compromised
I am not able to run my application code but CPU usage is through the roof
And the GPU which I am paying for is not working on the pod.
Guys I think I am paying for someone else's compute
Can someone from the RunPod team give me an exit / refund?
I figured out one of the issues is that I am unable to create a websocket connection with the pod
I think this is the same networking problem.
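If it helps narrow things down, a standalone handshake check can show whether the websocket endpoint is reachable at all. A minimal sketch using the websockets package; the URL format follows RunPod's HTTP proxy naming as I understand it, and the pod id, port, and path are placeholders:

```python
import asyncio

import websockets  # pip install websockets

# Placeholder URL (assumption): swap in your pod id, port, and websocket path,
# or use the pod's public IP / mapped TCP port to bypass the proxy entirely.
POD_WS_URL = "wss://<pod-id>-8188.proxy.runpod.net/ws"


async def check() -> None:
    try:
        async with websockets.connect(POD_WS_URL, open_timeout=10):
            print("websocket handshake succeeded")
    except Exception as err:
        print(f"websocket connection failed: {err!r}")


asyncio.run(check())
```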
I am not able to copy my network volume to a different bucket.
Please, if someone from RunPod can copy my network volume, I will stop using the EU region
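While waiting on support, one self-serve way to get data off a network volume is to upload it to your own S3-compatible bucket from inside a working pod. A minimal sketch with boto3; the mount path, bucket name, and credentials setup are assumptions, not anything confirmed in this thread:

```python
import os

import boto3  # pip install boto3; credentials via the usual AWS env vars

VOLUME_PATH = "/workspace"       # typical network-volume mount point (assumption)
BUCKET = "my-backup-bucket"      # placeholder bucket name

s3 = boto3.client("s3")

# Walk the volume and upload every file, preserving relative paths as keys.
for root, _dirs, files in os.walk(VOLUME_PATH):
    for name in files:
        local_path = os.path.join(root, name)
        key = os.path.relpath(local_path, VOLUME_PATH)
        print(f"uploading {local_path} -> s3://{BUCKET}/{key}")
        s3.upload_file(local_path, BUCKET, key)
```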
I'm so sorry for the issues you both are having here.
I dm'ed both of you so that we can hopefully resolve these issues
Hi @Tim aka NERDDISCO, I suspect the RunPod proxy has some serious issues. Someone suggested to me yesterday to use the pod's public IP address instead of the RunPod proxies, and there are no more connection issues.
We are currently using pod IP addresses, all requests to our server in the pods are working fine, and there are no connection issues.
What issues? Disconnecting after some time? I believe RunPod uses cloudflared, and you should check its limitations
That is not the issue: it cannot even connect to our server through the RunPod proxy, and because of the cloudflared timeout, it throws a timeout error.
So the issue starts with not being able to connect to the RunPod proxies, and because of that Cloudflare throws a timeout error until it connects
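To illustrate the workaround being described: hit the same endpoint once through the proxy URL and once through the pod's public IP and exposed TCP port, and compare. A minimal sketch; both URLs are placeholders and the /health path is just an example endpoint:

```python
import requests

# Placeholders (assumptions): fill in your pod id, the public IP, and the
# external port shown in the pod's TCP port mappings.
PROXY_URL = "https://<pod-id>-8000.proxy.runpod.net/health"
DIRECT_URL = "http://<public-ip>:<external-port>/health"

for label, url in [("proxy", PROXY_URL), ("direct", DIRECT_URL)]:
    try:
        resp = requests.get(url, timeout=10)
        print(f"{label}: HTTP {resp.status_code} "
              f"in {resp.elapsed.total_seconds():.2f}s")
    except requests.RequestException as err:
        print(f"{label}: failed ({err!r})")
```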
Ohh i see
if you can, please report them to https://contact.runpod.io/hc/en-us/requests/new with your pod id
I think we already did, multiple times; RunPod email support/tickets and other communication are slow or unresponsive
that is why I am bringing it up here, so at least the devs here see these issues directly
usually it is better to submit a ticket on the website, and it's faster because the devs have direct access
@Ercan I'm very sorry that you have these problems. Can you please send me the ticket id via DM, then I can find out what the current status is.