RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

Issues with connecting/initializing custom docker image

I've created a custom Docker image for quick OCR training: https://hub.docker.com/repository/docker/jeffchen23/paddleocr-image/general The problem is that everything downloads properly, but then I am unable to connect. When trying to connect, I get Permission denied (publickey), but permissions are not an issue for any of my other pods. I think it is because the pod fails to initialize correctly, as it constantly spams Start container messages. Can anyone help me pin down this issue? It works on my local machine when I pull it from the web. My local docker command is as follows: docker run -it --runtime nvidia --shm-size 2g --gpus all -v paddleocr-volume:/PaddleOCR paddleocr-image bash It doesn't look like I have any direct control over the Docker command from RunPod (from what I can tell), so I'm a little lost....

Error occurred when executing STMFNet VFI: No module named 'cupy'

I'm running ComfyUI on RunPod and hit this error. Can someone provide the steps to install or update CuPy? Much appreciated!

My pod starts very slowly

It takes 10 minutes for my port 5000 to become ready. Any help, please?

Template sharing in a team doesn't work

We have a RunPod Team with several people in it. Other users can't access our custom Template from the GraphQL API using their own API keys, but they can see it in the UI (so the UI and API are not consistent). We get the following error:
Error: {'errors': [{'message': 'Template not found', 'path': ['podFindAndDeployOnDemand'], 'extensions': {'code': 'RUNPOD'}}], 'data': {'podFindAndDeployOnDemand': None}}
...
Solution:
Yes, there is: they need to use an API key from the team account, not from their own account. Note that API keys are not scoped, so anyone holding the key can do anything they want with it.
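The error in this thread is an ordinary GraphQL response with an `errors` list next to an empty `data` field. A minimal Python sketch for surfacing those messages before trusting `data` might look like the following. The helper name `graphql_error_messages` is hypothetical, and the sample dict is copied from the error quoted in the question; no request is sent here.

```python
from typing import Any


def graphql_error_messages(response: dict[str, Any]) -> list[str]:
    """Return the messages from a GraphQL response's `errors` list (empty if none)."""
    errors = response.get("errors") or []
    return [str(err.get("message", "unknown error")) for err in errors]


# Sample response copied from the error reported in this thread.
sample = {
    "errors": [
        {
            "message": "Template not found",
            "path": ["podFindAndDeployOnDemand"],
            "extensions": {"code": "RUNPOD"},
        }
    ],
    "data": {"podFindAndDeployOnDemand": None},
}

messages = graphql_error_messages(sample)
if messages:
    print("Deploy failed:", "; ".join(messages))
```

Seeing "Template not found" from a personal API key while the template is visible in the UI is consistent with the solution above: the key, not the template, is what needs to change.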

ComfyUI not launching

I've tried running ComfyUI using the runpod community template (ai-dock/comfyui:latest) and now both buttons in the "Connect" modal point to the "Service" endpoint even though the 8188 port should open the web interface. Clicking that link (Connect to HTTP Service on port 8188) opens the service logs which are stuck with "Waiting for workspace mamba sync..." repeating. I would expect ComfyUI to open on this port.

I can't shut down my pod?

There is just no button on the interface to shut down my pod? I can only terminate it... ID: oeyqtrae2ex5tv...
Solution:
You can instead just terminate it completely 🙂 and always spin up new ones. Stopping pushes a pod to an idle state, but that is mainly for persistent storage.

LocalAI Deployment

Hello RunPod Team, I'm considering your platform for deploying an AI model and have some questions. My project involves using LocalAI (https://localai.io/ https://github.com/mudler/LocalAI), and it's crucial for the deployed model to support JSON-formatted responses; this is the main reason I chose LocalAI. Could you guide me on how to set up this functionality on your platform? Is there a feature on RunPod that allows the server or the LLM to automatically shut down or enter a low-resource state if it doesn't receive requests for a certain period, say 15 minutes? This is to optimize costs when the model is not in use....
Solution:
What you are looking for is RunPod serverless. You can read the documentation, but the TL;DR is that you can use an official RunPod template as a base, then build on it with your own handler.py. You must be able to build a Docker image. Build whatever model you want into the Docker image so it isn't constantly downloaded at runtime...
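As a rough illustration of the handler.py mentioned above, here is a minimal sketch that returns a JSON-serializable dict for each job. It assumes the `runpod` Python package is installed in the image. `generate_reply` is a hypothetical placeholder for the real model call, and the `prompt` input field is an assumption made for this example.

```python
import json
from typing import Any


def generate_reply(prompt: str) -> dict[str, Any]:
    """Hypothetical stand-in for the model call; replace with real inference."""
    return {"prompt": prompt, "reply": f"echo: {prompt}"}


def handler(job: dict[str, Any]) -> dict[str, Any]:
    """Read the job's `input` and return a JSON-serializable dict."""
    job_input = job.get("input") or {}
    prompt = job_input.get("prompt")  # `prompt` is an assumed field name
    if not isinstance(prompt, str) or not prompt:
        return {"error": "input.prompt must be a non-empty string"}
    result = generate_reply(prompt)
    json.dumps(result)  # raises TypeError early if the result is not JSON-serializable
    return result


def main() -> None:
    """Start the serverless worker; needs the `runpod` package in the image."""
    import runpod

    runpod.serverless.start({"handler": handler})
```

In a real handler.py the file would end by calling `main()`. Keeping `handler` a plain function means it can be tested locally, for example with `handler({"input": {"prompt": "hi"}})`, before building the Docker image.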

Jupyter notebook (in Chrome tab) consistently crashing after 20 hours

My JupyterLab notebook Chrome tab has crashed in the middle of 22 hours of training a model. How do I know if it's still training, if it has stopped, or if it is just running without doing anything? This has happened to me 3 times in a row, and this time I would like to know what is happening. The GPU usage is going up and down, which suggests it is training and simply not showing in the notebook, but I would like to make sure.

Extremely slow sync speed

I'm syncing a pod to Dropbox and the speed is extremely slow, maxing out at 80kb/s and dropping as low as a few b/s at times.

How can I remove a network volume?

Hi, I'd like to know how I can remove a network volume I created? Tried looking through your docs but couldn't find info on it, could you please help?
Solution:
You can delete it under the network volume section in GPU Cloud.

Can I remove a GPU & resize my storage after I've created a pod?

I'd like to create a pod with two GPUs. However, I won't need two forever, so I would like to know if I can remove one after I'm done with it. I would also like to know if I can resize my pod's persistent storage after I've created it (either by shrinking it or adding more).
Solution:
I'm not sure you can resize, but your best bet is probably to have network storage to always store to; then you can terminate and spin back up as needed 🙂

Need to update Auto1111 to 1.7.0

I want to enable SDXL inpainting, and git pull doesn't seem to work. I understand that some other files need to be altered as well, and sometimes things don't work as expected on RunPod (like updating an extension). Could I have some help in getting this to work?
Solution:
My template is already updated to 1.7.0 😎

How can I clean up storage in my network volume?

Hello, I'm using the Stable Diffusion template with a network volume. I noticed that even though I clean up files in Jupyter, space is not freed up in my volume. I suspect files go to the trash but are not removed completely. I searched a lot but could not find the trash folder. Does anybody know where I can find it, or any other way of cleaning up my storage space properly?
Solution:
Alright, using ncdu I found that the trash path is /workspace/.Trash-0, so I ran rm -rf /workspace/.Trash-0 to remove it. All good now. Storage space is freed up....
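The same cleanup can be sketched in Python for anyone who prefers to check sizes before deleting. This is only an illustration of the ncdu and rm -rf steps above, assuming the trash lives in `.Trash-*` directories directly under the volume root, as in this thread. The function names are hypothetical, and deletion is off by default.

```python
import shutil
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Total size of regular files under `path`, skipping symlinks."""
    return sum(
        p.stat().st_size
        for p in path.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


def find_trash_dirs(root: Path) -> list[tuple[Path, int]]:
    """List `.Trash-*` directories directly under `root` with their sizes."""
    return [
        (d, dir_size_bytes(d))
        for d in sorted(root.glob(".Trash-*"))
        if d.is_dir()
    ]


def empty_trash(root: Path, dry_run: bool = True) -> int:
    """Report trash directories under `root`; delete them only if dry_run is False.

    Returns the number of bytes found in trash directories.
    """
    total = 0
    for trash_dir, size in find_trash_dirs(root):
        total += size
        print(f"{trash_dir}: {size / 1e6:.1f} MB")
        if not dry_run:
            shutil.rmtree(trash_dir)
    return total
```

Calling `empty_trash(Path("/workspace"))` only reports what it finds; passing `dry_run=False` removes the directories, which is the equivalent of the rm -rf in the solution.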

Is there a way to get the SSH Terminal address for a pod using GraphQL api?

After creating a pod using GraphQL, I want to access this value. Is it possible? I can get a portion of it, but there is a random value that I can't see in the response.

Help deploying LLaVA Flask API

I'm trying to create a LLaVA endpoint I can use in my project so I can assess 5 million photos with a Node script, similar to what I'm currently doing locally with Ollama. I'm looking to deploy the 7b model on an RTX 4000, on GPU Cloud rather than serverless to keep costs down. My priorities are speed as well as cost, so I'd ideally like to process multiple images at once; any advice is welcome. After speaking to the author of the LLaVA RunPod template, he's recommended I use the Flask method below, but I'm not sure how I'd go about getting this deployed as I'm new to backend. Is anybody able to help with some initial steps? https://github.com/ashleykleynhans/LLaVA/tree/main?tab=readme-ov-file#flask-api-inference...

Does RunPod support H100 confidential computing?

Months ago, another user mentioned H100 confidential computing in this Discord: https://discord.com/channels/912829806415085598/1131816583065505853/1131816583065505853. Does RunPod support it now? More information about NVIDIA confidential computing: https://www.nvidia.com/en-us/data-center/solutions/confidential-computing/

Restricting the kinds of pods dev accounts can launch

Hello, I'm an admin of a research team. I would like to give researchers the ability to launch a pod, but I would like to restrict the kinds of pods which they can launch (cost <= community server pods RTX4000). Is there a way to do this?...

ssh2 with Node doesn't work correctly?

Hello, I am trying to connect to the GPU cloud using ssh2 via the [email protected] using an SSH key. It works using ssh.shell but not ssh.exec (it asks for a PTY, and when one is set, it doesn't send any command). I don't know what to do: I face this problem with RunPod, yet I can connect using my Linux terminal instead of going through my script. ...

Error starting the container

error starting container: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknow...
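The part of this message that matters is `unsatisfied condition: cuda>=12.1`: the image requires CUDA 12.1 or newer, and the host driver does not satisfy that. As a small sketch, the requirement can be pulled out of such a message with a regular expression. The function name `required_cuda_version` is hypothetical, and the pattern is only assumed to fit messages shaped like the one quoted here.

```python
import re
from typing import Optional


def required_cuda_version(stderr: str) -> Optional[tuple[int, int]]:
    """Extract (major, minor) from an 'unsatisfied condition: cuda>=X.Y' message."""
    match = re.search(r"unsatisfied condition:\s*cuda>=(\d+)\.(\d+)", stderr)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Excerpt copied from the error reported in this thread.
message = (
    "nvidia-container-cli: requirement error: unsatisfied condition: "
    "cuda>=12.1, please update your driver to a newer version, "
    "or use an earlier cuda container"
)

print(required_cuda_version(message))
```

As the message itself says, the two ways out are a host with a newer driver or an image built on an earlier CUDA version.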

Are the EU-CZ-1 servers down?

I can spin up the instances and the logs are fine, but I can't connect to any of the services (sdwebui, jupyter, ssh). Thank you...