RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

Runpodctl in container receiving 401

Over the past few days, I have sometimes been getting a 401 response when attempting to stop pods with runpodctl stop pod $RUNPOD_POD_ID at the end of my jobs. This is causing the container to restart on exit rather than stop. Do the credentials passed to the container expire?
Solution:
OK, so any pods created before the migration will fail when stopping via runpodctl.
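A minimal sketch of how a job script might detect a failed stop instead of exiting silently. It wraps the `runpodctl stop pod` command quoted in the question. The helper name `stop_pod` and the injectable `run` parameter are hypothetical, and the sketch assumes that runpodctl exits with a non-zero status when it reports the 401, which the thread does not confirm.

```python
import subprocess
import sys


def stop_pod(pod_id, run=subprocess.run):
    """Run `runpodctl stop pod <pod_id>` and report whether it succeeded.

    Assumption: a failed stop (such as the 401 in this thread) yields a
    non-zero exit code. `run` is injectable so the helper can be tested
    without calling the real CLI.
    """
    result = run(
        ["runpodctl", "stop", "pod", pod_id],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface whatever the CLI printed so the failure shows up in the logs.
        message = (result.stderr or result.stdout or "").strip()
        print(f"stop pod failed: {message}", file=sys.stderr)
        return False
    return True
```

At the end of a job this could be called as `stop_pod(os.environ["RUNPOD_POD_ID"])` (after `import os`), with the return value deciding whether to log the failure or alert someone before the container exits.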

Cannot establish connection for web terminal using Stable Diffusion pod

I'm able to connect to the WebUI HTTP client. I can also connect via SSH from my local machine and I can connect to the Jupyter notebook with no problem, but when I start the web terminal and attempt to connect, it brings up the black screen and then immediately says "connection closed".

Runpod errors, all pods having same issue this morning. Important operation

I got this error on all my pods today: "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime."

Hi, I have a problem with two of my very important services, and I received the following message

Hi, I have a problem with two of my very important services, and I received the following message: "This server has recently suffered a network outage and may have spotty network connectivity. We aim to restore connectivity soon, but you may have connection issues until it is resolved. You will not be charged during any network downtime." ID: 24v2dmaqcpzk05...

Error while using vLLM on RTX A6000

2024-02-22T11:19:46.009303238Z /usr/bin/python3: Error while finding module specification for 'vllm.entrypoints.openai.api_server' (ModuleNotFoundError: No module named 'vllm') I'm using an RTX A6000. I have been using this GPU for the last 4-5 days without any errors, but today I'm facing this issue. Could anyone help me figure out why this is happening?...

502 error when trying to connect to SD Pod HTTP Service on Runpod

I've been following along with this tutorial. Everything was going smoothly until it came time to connect to A1111 (steps 10-11 in the link). Rather than asking for my username and password, it produced a 502 error, for which the error page is set to the ReadMe. The ReadMe says to wait for the GPU utilization to come down to 0% before connecting, otherwise you'll get a 502 error. I double-checked and it definitely was at 0%. I restarted the pod: same issue. I stopped/exited the pod and restarted it: same issue. Has anyone run into this or have any idea what it may be?...

Correct way to call Jupyter in template

I'm trying to learn how to create a template. I'm using FROM runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04 as a base, which I believe already comes with Jupyter. As such, I am trying to run Jupyter in my start.sh file using the command below:...

Too many failed requests

Hello. I've tried to run casperhansen/mixtral-instruct-awq (https://huggingface.co/casperhansen/mixtral-instruct-awq) on A100 80 GB and A100 SXM 80GB GPUs, sending 10 requests per second using this script https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py. However, most of the requests failed with an "Aborted request" log from vLLM. This issue didn't occur on another platform with the same GPU and the same code, so I'm not sure if the problem is with vLLM or with RunPod's internal processing. Could anyone provide guidance on what the cause might be?...
Solution:
Why are you using GPU cloud for this? If you want to handle many concurrent requests, you need to use Serverless not GPU cloud. https://github.com/runpod-workers/worker-vllm...

Community pod: very bad download speed from github.

Since yesterday I have been experiencing very slow download speeds from GitHub (cloning repos), while downloading from other sources works fine. It is still happening today. Do you have any idea why? I am using the web terminal.

Skypilot + Runpod: No resource satisfying the request

Hi team. I'm trying to use Skypilot + vLLM + RunPod to serve a custom-trained LLM. I cannot get Skypilot to launch a resource. I get the following error: I 02-22 00:16:32 optimizer.py:1206] No resource satisfying <Cloud>({'NVIDIA RTX A6000': 1}, ports=['8888']) on RunPod. sky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request: ...

`runpodctl stop pod $RUNPOD_POD_ID` failing with 401

I used to end my long-running jobs with this command, but it has failed the last several times with a 401: runpodctl stop pod $RUNPOD_POD_ID Error: statuscode 401...

Stuck pod instance

I have a problem with a community pod (id: xbyhioflerw8pz). It has been inaccessible for a really long time and is stuck at launch. It keeps trying to deploy without ever updating the status and only shows "Waiting for logs". Live chat support is silent. I appreciate any help.

Start container pod error

PodID: iovxdnrsop9fz1 Region: NO Error log message: 2024-02-21T09:47:48Z error starting container: Error response from daemon: driver failed programming external connectivity on endpoint iovxdnrsop9fz1-0 (7047bce4c334cf194763afae5e0b7e6f1ce041721666df277facb429a82f9d9b): Error starting userland proxy: listen tcp4 0.0.0.0:40448: bind: address already in use...

Pod doesn't recognize my SSH key

Hi, my pod crashed, and after a restart it doesn't let me connect with my SSH key, which is set in the RunPod settings. I can connect to other pods without problems using my private key. The web terminal works. Restarting the pod doesn't help. I need to download a log file from this pod before I destroy it. Can someone please help?
Solution:
Yes, you can use runpodctl, croc, etc.

Run Lorax on Runpod (Serverless)

I created a Docker image for Lorax similar to (https://github.com/runpod-workers/worker-tgi/blob/main/src/entrypoint.sh), but inside the Docker image I am getting connection refused. Could you please check it?...

What is the difference between secure cloud and Community Cloud?

What is the difference between secure cloud and Community Cloud?

Urgent Prod Issue

Pod is stuck, and not restarting

CUDA version filter

Hi team, I wanted to check: is the CUDA filter on the page the minimum version that the hardware supports, or is it the only CUDA version that the hardware supports? I was trying to run 2xL40 but hit a weird CUDA index assertion error, whereas the exact same code ran fine on 4090s, which got me wondering whether L40s only permit CUDA 12.0 (based on the page).
Solution:
So if a machine is 12.2, it supports images with 12.1, 11.8 and so on
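The rule in the answer above can be sketched as a simple version comparison: an image is expected to run when its CUDA version is less than or equal to the version the machine reports. This is only an illustration of that stated rule. The function names are hypothetical and not part of any RunPod API.

```python
def parse_cuda_version(version):
    """Turn a string such as "12.1" or "12.1.1" into a (major, minor) tuple."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def image_supported(image_cuda, machine_cuda):
    """Return True if an image built for image_cuda should run on a machine
    whose filter shows machine_cuda, per the rule quoted in the answer."""
    return parse_cuda_version(image_cuda) <= parse_cuda_version(machine_cuda)


# A 12.2 machine supports 12.1 and 11.8 images, as stated in the answer.
print(image_supported("12.1", "12.2"))
print(image_supported("11.8", "12.2"))
# A 12.1 image on a 12.0 machine falls outside the stated rule.
print(image_supported("12.1", "12.0"))
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating "12.10" as older than "12.2".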

Maximum length for value of environment variables

Since I set some environment variables via the GraphQL API when starting a pod, I was wondering what the maximum length restriction is. The GraphQL API spec only mentions that the value should be a UTF-8 String.

Enquiry about pod ID oi3rnyumuzvp2s

Hello, is it possible to search the history for a pod ID? We cannot see anything in the audit log, and the feedback is that the pod has somehow vanished. Can we please check oi3rnyumuzvp2s? Thank you....