RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


ngc tritonserver container image not usable?

I tried to create a pod on a server with CUDA >= 12.2 using this image: nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3. It loads correctly, but the resulting server is not usable: I cannot connect over SSH (the window closes immediately after I type the passphrase). The same image works fine on servers from vast.ai. What's the issue?...

"Too many open files in system"

I am using many cpu3c-2-4 in RO region, all working off of the same volume and keep running into "Too many open files" error. Error only happens in CPU pods, and only when many different pods are working with many different files, such as large apt-get installs and large tar gzips. I have tried setting ulimit -n [LARGE_NUMBER], but this does not fix the error. Any ideas?...
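If the full message is the one in the title, "Too many open files in system", that is the Linux error text for ENFILE, the system-wide file table limit. `ulimit -n` only changes the per-process limit (EMFILE, reported as plain "Too many open files"), which may be why raising it had no effect. Whether the shared host limit is really the cause here is an assumption. A minimal sketch for comparing the two from inside a pod, assuming Linux with `/proc` readable (the helper names are hypothetical):

```python
import resource
from pathlib import Path


def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr: allocated handles, unused handles, system-wide max."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


def describe_limits(file_nr_path="/proc/sys/fs/file-nr"):
    """Report the per-process limit (what `ulimit -n` changes) next to system-wide usage."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    report = {"per_process_soft": soft, "per_process_hard": hard}
    path = Path(file_nr_path)
    if path.exists():  # only present on Linux
        report["system"] = parse_file_nr(path.read_text())
    return report


if __name__ == "__main__":
    print(describe_limits())
```

If `allocated` sits close to `max` while the per-process limits are high, the bottleneck is likely the host-wide table rather than anything `ulimit -n` can reach from inside the pod.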

What the fuck is going on again with US - 1 x H100 80GB SXM5

"We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime." I have been using runpod and every fucking day is something wrong!? ID: x1vidmyoiu3a06...

GPU runpod critical error detected

"We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime." ID: pris741sxxrz2d...

stable diffusion - how do I view the active log?

When you launch Stable Diffusion locally, you get a DOS window with a log that shows every action taken. On RunPod, however, the terminal shows the initialization sequence and then stops recording after the first model is loaded. Why is that, and how can I see the current state? I need it right now to work out how to open files on my network storage, but it's a useful tool in general that I know I'll need a lot later....
Solution:
Read the README for instructions on how to see the logs. You can't really use launch.py because the pod already starts it, and it looks like you also didn't activate the venv first. If you want to do that, you have to set the DISABLE_AUTOLAUNCH environment variable; again, see the README.

Pod using CPU instead of GPU

Title. Trying to run deforum on Ashley's ultimate template. However, normal txt2img works fine....

After trying the service for the first time, out of funds because of a stale pod after disconnecting

Hello. As per the title. I'm a professional comic book artist working with Krita and trying to stay competitive in a difficult market. After finding out about Krita's AI plugin and its capabilities to assist with coloring and finishing sketches and drawings, I decided to try it in preparation for a big project. Lacking a powerful PC at the moment, I followed the recommendations from the Krita team and tried your services. After a bit of hassle with the setup I signed up with you, added 10 euros in...

pod does not show public ip & ports

We have a template with a TCP port configured. When we deploy that template, either on Community Cloud with the public IP filter set to true or on Secure Cloud, we do not get a pod with a public IP and port. It just shows:...
Solution:
Okay, it turns out the pod was not running because no entrypoint/command was configured. It was not transparent that the pod was not actually running, since it did not show as exited. However, solved by providing such...

Pod is unable to find/use GPU in python

Hi, I'm trying to connect to this pod: RunPod Pytorch 2.2.10 ID: zgel6p985mjmmn...
Solution:
@Dhruv Mullick I don't think it has to do with the image... If you select it from the RunPod website, there is a filter button at the top and then a drop-down menu where you can select 12.2 under "Allowed CUDA Versions". As @ashleyk pointed out earlier, 'the machine is running CUDA 12.3 which is not production ready'. If I select 12.2, it works....
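A quick way to confirm from inside the pod whether Python can see the GPU is sketched below. It assumes PyTorch is preinstalled, as it would be in a PyTorch template; `gpu_report` is a hypothetical helper name, not part of any RunPod tooling.

```python
import os


def gpu_report():
    """Collect basic facts about GPU visibility from inside the pod."""
    report = {"cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch  # assumed to be preinstalled in the PyTorch template
    except ImportError:
        report["torch_installed"] = False
        return report
    report["torch_installed"] = True
    # CUDA version PyTorch was built against (None for CPU-only builds)
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False on a GPU pod, recreating the pod with the "Allowed CUDA Versions" filter set as described above is the fix reported in this thread.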

Pod is stuck in a loop and does not finish creating

Hi, I'm trying to start a 1 x V100 SXM2 32GB with additional disk space (40 GB). It worked fine until yesterday. Now, when I try to create it, it gets stuck in this loop: ```...

Runpodctl in container receiving 401

Over the past few days, I have sometimes been getting a 401 response when attempting to stop pods with runpodctl stop pod $RUNPOD_POD_ID at the end of my jobs. This is causing the container to restart on exit rather than stop. Do the credentials passed to the container expire?
Solution:
ok. so any pods created before the migration will fail when stopping via runpodctl
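Because a failed stop leaves the container to restart on exit, it can help to detect the failure explicitly at the end of a job. The sketch below wraps the command quoted in the question; it assumes `runpodctl` exits with a non-zero status when it receives a 401, which the thread does not confirm, and `stop_pod` is a hypothetical helper name.

```python
import os
import subprocess


def stop_pod(command=None):
    """Run the stop command and return (succeeded, combined output)."""
    if command is None:
        # Command taken from the post; RUNPOD_POD_ID is expected to be set inside the pod.
        command = ["runpodctl", "stop", "pod", os.environ.get("RUNPOD_POD_ID", "")]
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError as exc:  # runpodctl is not on PATH
        return False, str(exc)
    return result.returncode == 0, result.stdout + result.stderr


if __name__ == "__main__":
    ok, output = stop_pod()
    if not ok:
        # Surface the failure instead of exiting silently and being restarted.
        print("Stopping the pod failed:", output)
```

Passing the command as a parameter keeps the helper testable without `runpodctl` installed.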

Cannot establish connection for web terminal using Stable Diffusion pod

I'm able to connect to the WebUI HTTP client, I can connect via SSH from my local machine, and I can connect to the Jupyter notebook with no problem. But when I start the web terminal and attempt to connect, it brings up the black screen and then immediately says "connection closed".

Runpod errors, all pods having same issue this morning. Important operation

I got this error on all my pods today: "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime."

Hi, I have a problem with two of my very important services, and I received the following message

Hi, I have a problem with two of my very important services, and I received the following message: "This server has recently suffered a network outage and may have spotty network connectivity. We aim to restore connectivity soon, but you may have connection issues until it is resolved. You will not be charged during any network downtime." ID: 24v2dmaqcpzk05...

Error while using vLLM on RTX A6000

2024-02-22T11:19:46.009303238Z /usr/bin/python3: Error while finding module specification for 'vllm.entrypoints.openai.api_server' (ModuleNotFoundError: No module named 'vllm') I'm using an RTX A6000. I have been using this GPU for the last 4-5 days without any errors, but today I'm facing this issue. Could anyone help me out with why it is happening?...
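The log line means the interpreter at `/usr/bin/python3` cannot find a package named `vllm` at all. One way to check this from inside the pod is sketched below, using only the standard library; `module_available` is a hypothetical helper name, and the sketch says nothing about why the package went missing.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the top-level package of `name` can be found by this interpreter."""
    top_level = name.split(".")[0]  # avoid importing parent packages of dotted names
    return importlib.util.find_spec(top_level) is not None


if __name__ == "__main__":
    # Print the interpreter path too: a package installed into a different
    # Python (for example a venv) is invisible to /usr/bin/python3.
    print("interpreter:", sys.executable)
    print("vllm available:", module_available("vllm.entrypoints.openai.api_server"))
```

Comparing the printed interpreter path with the one in the log shows whether the check ran under the same Python that failed.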

502 error when trying to connect to SD Pod HTTP Service on Runpod

I've been following along with this tutorial - everything was going smoothly until it came time to connect to A1111 (steps 10-11 in the link). Rather than asking for my username and password, it produced a 502 error, for which the error page is set to the ReadMe. The ReadMe says to wait for the GPU utilization to come down to 0% before connecting, otherwise you'll get a 502 error. I double-checked and it definitely was at 0%. I restarted the pod - same issue. I stopped/exited the pod and restarted it - same issue. Has anyone run into this or have any idea what it may be?...

correct way to call jupyter in template

I'm trying to learn how to create a template. I'm using FROM runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04 as a base, which I believe already comes with Jupyter. As such, I am trying to run Jupyter in my start.sh file using the command below:...

Too many failed requests

Hello. I've tried to run casperhansen/mixtral-instruct-awq (https://huggingface.co/casperhansen/mixtral-instruct-awq) on A100 80 GB and A100 SXM 80GB GPUs, sending 10 requests per second using this script: https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py. However, most of the requests failed with an "Aborted request" log from vLLM. This issue didn't occur on another platform with the same GPU and the same code, so I'm not sure if the problem is with vLLM or with RunPod's internal processing. Could anyone provide guidance on what the cause might be?...
Solution:
Why are you using GPU Cloud for this? If you want to handle many concurrent requests, you need to use Serverless, not GPU Cloud. https://github.com/runpod-workers/worker-vllm...

Community pod: very bad download speed from GitHub.

Since yesterday I have been experiencing very slow download speeds from GitHub (cloning repos), while downloading from other sources works OK. It is still happening today. Do you have any idea why? I am using the web terminal.