RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

MI300X in RO cannot be created

Creating the pod is failing because:
```
2024-11-17T12:20:23Z Status: Image is up to date for runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04
2024-11-17T12:20:23Z error creating container: container: create: container create: Error response from daemon: layer does not exist...
```

Pods getting erased/terminated

Finally decided to give RunPod a try, deposited some credit, and deployed on a spot node with a network volume. Several seconds after it starts running, it gets erased automatically. I thought it was because it was on spot, so I tried to deploy on-demand, and it was gone too. Now when I try to access my account, the RunPod website just keeps loading. TBH not a really good first experience. Any help?...

Hosting RunPod as an API endpoint

I have hosted the workflow from the runpod pods service. Is there any way to host it as an api endpoint or work as a script based on the user input?...

accessing nginx server on my local machine

Hello, can anyone help me access a server hosted on localhost:8000 on my pod from my PC's web browser? Or can you provide a basic nginx setup and explain how to access it from my local machine? I have gone through the documentation on exposing ports, and it didn't work well...
Solution:
Try to break it down into smaller problems: 1. HTTP server. 2. Connectivity. 3. Nginx config. ...
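For steps 1 and 2 of that breakdown, a minimal Python sketch: it serves a directory on port 8000 bound to 0.0.0.0 (a server bound only to 127.0.0.1 is typically unreachable from outside the container) and prints an address to try from the browser. The `proxy_url` helper and the pod ID `abc123` are hypothetical, and the `https://<pod-id>-<port>.proxy.runpod.net` pattern is an assumption to verify against the pod's Connect menu; the port also has to be listed among the pod's exposed HTTP ports.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def proxy_url(pod_id: str, port: int) -> str:
    """Build the assumed RunPod HTTP proxy address for an exposed port."""
    return f"https://{pod_id}-{port}.proxy.runpod.net"


def make_server(port: int = 8000, directory: str = ".") -> ThreadingHTTPServer:
    """Create a static file server that listens on all interfaces."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("0.0.0.0", port), handler)


if __name__ == "__main__":
    server = make_server()
    # "abc123" is a hypothetical pod ID; substitute the real one.
    print("Try from your browser:", proxy_url("abc123", 8000))
    server.serve_forever()
```

Once this plain server is reachable, nginx can be put in front of it and debugged separately.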

Does the Kohya_ss template support FLUX?

I want to use Kohya to train Flux DreamBooth models. Just curious whether it works with your settings, or whether I would have to upload my own install of Kohya to do so.

Network volume permissions

Is there a way to change permissions for files/directories on a network volume? I’d like to save Postgres data to a network drive but the directory needs permissions 700 or 750 rather than 777. I haven’t been able to find a way to change any permissions for any file/directory on a network volume. Permissions for container volumes can be modified no problem with chmod....
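One way to confirm whether a mount honours permission changes is to apply `chmod` and read the mode back. A minimal sketch; `chmod_sticks` is a hypothetical helper and `/workspace/pgdata` is only an assumed path on the network volume.

```python
import os
import stat


def chmod_sticks(path: str, mode: int = 0o700) -> bool:
    """Apply `mode` to `path` and report whether the filesystem kept it."""
    try:
        os.chmod(path, mode)
    except OSError:
        return False  # missing path, or the filesystem rejected the change
    return stat.S_IMODE(os.stat(path).st_mode) == mode


if __name__ == "__main__":
    print(chmod_sticks("/workspace/pgdata"))  # assumed path
```

Running it once against a directory on the container volume and once against the network volume shows whether the two mounts behave differently.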

How to migrate serverless endpoint to a pod?

I have a strange use case: I have a functional serverless endpoint that must run on AMD hardware (for non-technical reasons). Everything is set up and working, currently running on NVIDIA hardware. AMD hardware is not yet available for serverless. Can I recreate the serverless behaviour using a pod?...
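One possible approach, sketched under assumptions: keep the existing handler function and put a small HTTP server in front of it on the pod. The `handler(job)` shape with a `job["input"]` dict mirrors the usual serverless handler convention; the `/runsync` route, port 8000, and the echo body are placeholders, and queueing, autoscaling, and scale-to-zero are not reproduced.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def handler(job: dict) -> dict:
    """Placeholder standing in for the existing serverless handler."""
    return {"echo": job["input"]}


def run_job(raw_body: bytes) -> dict:
    """Decode a request body, call the handler, and wrap the result."""
    payload = json.loads(raw_body or b"{}")
    output = handler({"input": payload.get("input", {})})
    return {"status": "COMPLETED", "output": output}


class SyncHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/runsync":  # assumed route name
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.dumps(run_job(self.rfile.read(length))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8000), SyncHandler).serve_forever()
```

Clients would then POST `{"input": {...}}` to the pod's exposed port instead of the serverless endpoint URL.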

Ollama on Runpod

After following all instructions in the following article: https://docs.runpod.io/tutorials/pods/run-ollama#:~:text=Set%20up%20Ollama%20on%20your%20GPU%20Pod%201,4%3A%20Interact%20with%20Ollama%20via%20HTTP%20API%20 I am able to set up Ollama on a pod. However, after a few inferences, I get a 504 (sometimes 524) error in response. I have been running inference against Ollama on a RunPod pod for the past few months and never faced this issue, so it's definitely recent. Any thoughts on what might be going on?...
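504 and 524 are both gateway timeout codes (524 is Cloudflare's "a timeout occurred"), so one thing worth ruling out is a long generation exceeding a proxy time limit. A hedged sketch that asks Ollama's `/api/generate` endpoint for a streamed response so that bytes start flowing early; the base URL and model name are placeholders, and whether the proxy is actually the cause here is an unverified assumption.

```python
import json
import urllib.request


def collect_stream(lines) -> str:
    """Join the `response` fragments of a newline-delimited JSON stream."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(base_url: str, model: str, prompt: str, timeout: float = 300.0) -> str:
    """POST a streaming generate request and return the full text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # The response object yields one line (one JSON chunk) at a time.
        return collect_stream(resp)
```

If streamed requests succeed where non-streamed ones time out, that points at the time limit rather than at Ollama itself.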

My pod is down, and won't restart

After this log, the pod is down and won't restart. I tried restarting, stopping, and resetting the pod, but nothing works.

A100 PCIe is not working with EU-RO-1 storage.

I have created storage (a network volume based in EU-RO-1), and A100 PCIe is shown as available. But I am getting an error while deploying a RunPod instance: "There are no longer any instances available with the requested specifications. Please refresh and try again." What's wrong?...
Solution:
Hmm, maybe the GPU is taken or low on stock, and there are none currently available.

Error when syncing with Backblaze

I'm getting "Something went wrong!" most of the time when syncing with Backblaze. It sometimes works so doesn't seem to be an issue with credentials. No other info in the error popup.

Import backup from volume disk to Network volume

Hello! Right now I have a pod with Stable Diffusion installed. All my files are in Jupyter and I am using a normal volume disk. I would like to transfer everything from this volume to a network volume (for price reasons). What would be the best way to do it?...

Pod is stuck on network outage message, no changes for quite a while.

Our pod has been having network issues for a while now (saw it first yesterday afternoon). Also, I recently purchased a savings plan for this pod (id nx9twh8ikfjru8), so I am not sure what will happen if I try to recreate this pod. There is also probably some data outside of the /workspace directory (I know, not a good idea..). Any way to check what is going wrong here?...

authorized_keys not working on runpod

I've deployed a RunPod server and added an SSH key in the user settings; that key works for SSH. But when I add a new public key to ~/.ssh/authorized_keys directly in the terminal, that key pair does not work for SSH....
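Two common causes are worth checking, sketched below: the file name (sshd reads `authorized_keys`, so a misspelled file is silently ignored) and permissions, since sshd with its default `StrictModes` setting ignores key files that are writable by group or others. `check_ssh_dir` is a hypothetical helper, not a RunPod tool, and it does not cover every check sshd performs (ownership, for example).

```python
import os
import stat


def check_ssh_dir(ssh_dir: str) -> list[str]:
    """Return a list of likely problems with an ~/.ssh directory."""
    problems = []
    key_file = os.path.join(ssh_dir, "authorized_keys")
    if not os.path.isfile(key_file):
        problems.append("authorized_keys is missing (check the spelling)")
        return problems
    for path in (ssh_dir, key_file):
        mode = stat.S_IMODE(os.stat(path).st_mode)
        # Group- or world-writable paths are rejected under StrictModes.
        if mode & (stat.S_IWGRP | stat.S_IWOTH):
            problems.append(f"{path} is writable by group/others (mode {oct(mode)})")
    return problems


if __name__ == "__main__":
    for problem in check_ssh_dir(os.path.expanduser("~/.ssh")):
        print(problem)
```

An empty result means neither of these two issues applies, and the sshd log on the pod is the next place to look.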

Runpod is not utilizing GPU and Showing zero GPUs

I am currently running a RunPod pod with A40 GPUs using the PyTorch template. When I check for GPUs in the Jupyter notebook using list(range(torch.cuda.device_count())) or print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU'))), it shows zero GPUs. I would also like to know whether there is a template for TensorFlow 2; I couldn't find one, so I am currently using the first PyTorch template. It would be very helpful if someone could help me with this issue. I am stuck in the middle and need to finish my project ASAP....
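A small diagnostic sketch can help separate "the container has no GPU attached" from "the framework cannot see the GPU". `gpu_report` is a hypothetical helper; it only uses `shutil.which` and, if PyTorch is installed, `torch.cuda.is_available()` and `torch.cuda.device_count()`.

```python
import shutil


def gpu_report() -> dict:
    """Collect basic facts about GPU visibility inside the container."""
    report = {"nvidia_smi_on_path": shutil.which("nvidia-smi") is not None}
    try:
        import torch  # optional: only present on PyTorch images
    except ImportError:
        report["torch_installed"] = False
        return report
    report["torch_installed"] = True
    report["torch_cuda_available"] = torch.cuda.is_available()
    report["torch_device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `nvidia-smi` is present but PyTorch reports zero devices, the installed framework build (for example a CPU-only wheel installed over the template's version) is a more likely suspect than the pod itself.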

Stuck on "Waiting for logs"

I've tried everything: switching GPU, switching regions, creating new storage, changing browsers, switching accounts, clearing history, restarting WiFi. Nothing has worked. What's weird is that my colleagues can connect on the shared account and are able to initialize a pod in this same manner, which I can then access. Obviously asking my colleague to start up a pod every time is not sustainable. I'm pretty sure the log is being initialized but I can't access it as I don't have the access code....

Multiple Pods SSH Resolving to Same Machine

I'm trying to connect to multiple community cloud pods simultaneously through VSCode. They happen to have the same public IP with different ports, but they seem to be sharing resources (e.g. storage and GPU). It seems like it could be an issue with VSCode, since connecting from a generic SSH terminal doesn't have this problem, but I'm wondering if there is a known workaround.

Network errors in Secure Cloud

Hello, I am using secure cloud to serve inference for an LLM, can someone explain what these messages mean? Is this the infra’s fault or mine? Is there any roadmap for improving reliability of network?...

Pod with Comfy (flux + stable diffusion)

Hello, Right now I have a pod with stable-diffusion:web-ui-10.2.1 and I want to have only 1 pod where I can choose whether to use flux dev version or stable-diffusion:web-ui-10.2.1 , I heard about comfy that allows both but I am not clear, can you recommend me the best template according to my requirements? I don't know if in my current pod with stable diffusion I can add comfy, if I create another pod I will have to move all my files to the new stable diffusion and it will be long 😦...

Changed Log output on the Runpod website

We are using FastAPI in one of our applications on your RunPod pods. For the past couple of days, the FastAPI log output has not been displayed in the website's log window. To see the log output I now have to start FastAPI via the terminal. Have there been recent changes to the way log files are displayed on the RunPod website?...