Hello
RunPod
•Created by Hello on 9/16/2024 in #⚡|serverless
worker exited with exit code 137
My serverless worker seems to get the error,
worker exited with exit code 137
after multiple consecutive requests (around 10 or so). It seems like the container is running out of memory. Does anyone know what the issue could be? The script already runs gc.collect() to free up resources, but the problem persists.
4 replies
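For reference, a minimal cleanup sketch worth trying between requests, assuming a PyTorch-based pipeline (the torch calls are guarded so the helper also runs without a GPU):

```python
import gc

def free_memory():
    """Release Python garbage and, when available, cached CUDA memory."""
    collected = gc.collect()  # reclaim unreferenced Python objects
    try:
        import torch
        if torch.cuda.is_available():
            # gc.collect() alone does not return GPU memory to the driver;
            # PyTorch's caching allocator holds it until explicitly emptied.
            torch.cuda.empty_cache()
    except ImportError:
        pass  # torch not installed; nothing GPU-side to release
    return collected
```

Note that 137 is 128 + SIGKILL, which usually means the kernel OOM killer stopped the container — so it may be system RAM rather than VRAM that is running out.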
RunPod
•Created by Hello on 9/15/2024 in #⚡|serverless
Speeding up loading of model weights
Hi guys, I have set up my serverless Docker image to contain all my required model weights. My handler script also loads the weights using the diffusers library's
.from_pretrained
with local_files_only=True
so we are loading everything locally. I notice that during cold starts, loading those weights still takes around 25 seconds before the logs display --- Starting Serverless Worker | Version 1.6.2 ---.
Does anyone have experience optimising the time needed to load weights? Could we pre-load them into RAM or something (I may be totally off)?
7 replies
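One pattern that helps here is making sure the load happens exactly once per worker, at module import, rather than inside the handler. A sketch of that load-once pattern — load_pipeline is a hypothetical stand-in for the diffusers .from_pretrained(..., local_files_only=True) call:

```python
# Load weights once at module import, not inside the handler, so the cost
# is paid a single time per worker rather than on every request.
LOAD_CALLS = 0

def load_pipeline():
    """Stand-in for the real .from_pretrained(..., local_files_only=True)."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    return object()  # placeholder for the real pipeline object

PIPELINE = load_pipeline()  # runs during cold start, before any request

def handler(job):
    # Every request reuses the already-loaded pipeline.
    return {"ok": PIPELINE is not None}
```

If the weights live on a network volume rather than local disk, the 25 seconds may mostly be I/O rather than deserialisation; safetensors files generally load faster than pickle-based checkpoints.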
RunPod
•Created by Hello on 9/5/2024 in #⚡|serverless
Offloading multiple models
Hi guys, does anyone have experience with an inference pipeline that uses multiple models? I am wondering how best to manage loading models whose combined size exceeds a worker's VRAM if everything is kept on the GPU. Any best practices / examples on how to keep model load time as minimal as possible?
Thanks!
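One common approach: keep every model resident in system RAM and swap only the one you need onto the GPU before each stage, so each swap costs a host-to-device copy instead of a full reload from disk. A tiny sketch — to_gpu/to_cpu are hypothetical stand-ins for .to("cuda")/.to("cpu") in a real pipeline:

```python
class GpuSlot:
    """Keep at most one model resident on the GPU at a time."""

    def __init__(self, to_gpu, to_cpu):
        self._to_gpu = to_gpu    # e.g. lambda m: m.to("cuda")
        self._to_cpu = to_cpu    # e.g. lambda m: m.to("cpu")
        self._resident = None

    def use(self, model):
        # Evict the previously resident model back to system RAM first.
        if self._resident is not None and self._resident is not model:
            self._to_cpu(self._resident)
        if self._resident is not model:
            self._to_gpu(model)
            self._resident = model
        return model
```

For diffusers pipelines specifically, enable_model_cpu_offload() automates a similar scheme per sub-model, at some latency cost per stage.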
3 replies
RunPod
•Created by Hello on 9/4/2024 in #⚡|serverless
Stuck on "loading container image from cache"
Hi, I have updated my serverless endpoint release version, but some of my workers are stuck on "loading container image from cache" even though it's a new version that shouldn't exist in the cache to begin with.
Any advice on how to solve this issue?
24 replies
RunPod
•Created by Hello on 9/1/2024 in #⚡|serverless
How to deal with multiple models?
Does anyone have a good deployment flow for serverless endpoints with multiple large models? Asking because building and pushing a Docker image with the model weights takes forever.
2 replies
RunPod
•Created by Hello on 7/12/2024 in #⚡|serverless
Failed to return job results
I keep getting "Failed to return job results" errors on 16GB serverless endpoints. After terminating one of the workers it worked, but now my other workers keep getting the same errors as well.
4 replies
RunPod
•Created by Hello on 3/19/2024 in #⚡|serverless
No module "runpod" found
Hi, I am trying to run a serverless RunPod instance with a Docker image.
This is my Dockerfile:
When the handler runs,
import runpod
errors out with ModuleNotFoundError: No module named 'runpod'.
Has anyone experienced this before?
4 replies
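The usual cause of this error is that the image never installs the SDK. A minimal Dockerfile sketch — the base image, handler path, and CMD are assumptions, since the original Dockerfile isn't shown above:

```dockerfile
# Hypothetical example; base image and paths are placeholders.
FROM python:3.11-slim

# The handler does `import runpod`, so the SDK must be installed in the image.
RUN pip install --no-cache-dir runpod

COPY handler.py /handler.py
CMD ["python", "-u", "/handler.py"]
```

If the Dockerfile uses a requirements.txt instead, check that `runpod` is listed there and that the `pip install` step actually runs for the final image stage.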
RunPod
•Created by Hello on 2/16/2024 in #⚡|serverless
Safetensor safeopen OS Error device not found
Running inference on a serverless endpoint, and this line of code:
Throws
OSError: No such device (os error 19)
Running on an RTX 5000 with a network volume attached. The path used in safe_open
leads to a safetensor file at /runpod-volume/example.safetensor.
Has anyone got this error before?
10 replies
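Errno 19 is ENODEV; with safetensors it often points at the network volume not actually being mounted on that worker, or at mmap failing on the underlying filesystem. A sketch that rules out the missing mount before opening the file — the mount point matches the path from the post, and the safe_open usage mirrors the safetensors API:

```python
import os

VOLUME = "/runpod-volume"  # network volume mount point from the post

def load_tensors(path):
    # OSError 19 (ENODEV) frequently means the volume isn't mounted on
    # this worker, so fail with a clearer message before touching the file.
    if not os.path.ismount(VOLUME):
        raise RuntimeError(f"{VOLUME} is not mounted on this worker")
    from safetensors import safe_open  # imported here so the check runs first
    tensors = {}
    with safe_open(path, framework="pt", device="cpu") as f:
        for key in f.keys():
            tensors[key] = f.get_tensor(key)
    return tensors
```

If the mount is present and the error persists, copying the file to local disk before opening it is a quick way to test whether mmap on the network filesystem is the culprit.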