Problem with RunPod cuda base image. Jobs stuck in queue forever
Hello, I'm trying to send a request to a serverless endpoint that uses this base image in its Dockerfile:
FROM runpod/base:0.4.0-cuda11.8.0
I want the server side to run the input_fn function when I send the request. This is part of the server-side code:
If I use the CUDA base image, it does not run input_fn. I only see the debug prints from model_fn, and then the job stays in the queue forever (photo).
The thing is that if I use this base image:
FROM python:3.11.1-buster
it does run both input_fn and model_fn.
So my questions are:
- Why does the problem happen with the CUDA base image?
- What are the implications of using the 2nd base image? Are there CUDA or PyTorch dependencies missing from it?
- What base image should I use? What do I do?
13 Replies
I have no problem with the CUDA base image. What does your Dockerfile look like?
How did you run the Python script?
Is input_fn called when it's running?
FROM runpod/base:0.4.0-cuda11.8.0
FROM python:3.11.1-buster
# Python dependencies
COPY builder/requirements.txt /requirements.txt
RUN python3.11 -m pip install --upgrade pip && \
    python3.11 -m pip install --upgrade -r /requirements.txt --no-cache-dir && \
    rm /requirements.txt
# Add src files (Worker Template)
COPY src /app/src
# Ensure the checkpoints directory exists and copy the checkpoint file
RUN mkdir -p /app/src/tapnet/checkpoints
COPY src/tapnet/checkpoints/bootstapir_checkpoint.pt /app/src/tapnet/checkpoints/bootstapir_checkpoint.pt
# Set working directory
WORKDIR /app
# Set AWS credentials. DEBUG, later move to env or
ENV AWS_ACCESS_KEY_ID=...
ENV AWS_SECRET_ACCESS_KEY=...
ENV AW...
ENV PYTHONPATH=/app
CMD ["python3.11", "-u", "src/inference.py"]
If I use the CUDA image, input_fn does not run. If I use the other image, it runs: it gets the video and everything.
Sorry for the bad formatting of the Dockerfile, but it's just the typical thing, I guess.
Hmm, try the CUDA image from NGC
would need to see error message
There are no errors really, it's just that input_fn isn't running.
Where can I find a link or something to that?
I'm not sure what's wrong there, but I'd suggest using another image if this one is problematic
Search Google for Nvidia NGC
It's Nvidia's domain
Also, I think Python 3.11 isn't installed on the CUDA 11 image
Okay, I'll try doing that. I guess that using python:3.11.1-buster won't work, right?
Wait, I thought it works, no?
What template does it work with?
It works with that one, meaning that it gets inside input_fn, but there are going to be dependencies missing or something needed to run on the GPU.
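A minimal sketch of how the "dependencies missing to run the GPU" worry could be checked from inside the container, assuming PyTorch is the framework in use (the thread mentions PyTorch dependencies but never shows requirements.txt). The function name gpu_report is hypothetical; shutil.which, importlib.util.find_spec and torch.cuda.is_available are the only library calls used.

```python
import importlib.util
import shutil


def gpu_report():
    """Collect basic facts about GPU support in the current environment.

    Assumption: the worker uses PyTorch. If it is not installed,
    cuda_available stays None instead of raising.
    """
    report = {
        # nvidia-smi is normally exposed by the NVIDIA container runtime.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": None,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception as exc:  # a broken install should not crash the check
            report["torch_error"] = repr(exc)
    return report


if __name__ == "__main__":
    # flush=True so the lines show up in the worker logs right away.
    for key, value in gpu_report().items():
        print(f"{key}: {value}", flush=True)
```

Running this once at the top of src/inference.py in each image would show whether the python:3.11.1-buster build can see the GPU at all.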
Solution
Hmm, yeah, I guess Python 3.11 is missing from that RunPod base image..
You just have to install it yourself or use templates from NGC
Yeah, okay, I'll try both things. Thank you so much!
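To confirm or rule out the guess that python3.11 is missing, something like the sketch below could be run inside each image. It only reports which interpreters are on PATH and which one is executing; the helper names find_interpreters and describe_runtime are hypothetical, and the result for the RunPod image is not known from this thread.

```python
import shutil
import sys


def find_interpreters(names=("python3.11", "python3", "python")):
    """Map each interpreter name to its path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


def describe_runtime():
    """Return a short report of the running interpreter and what is on PATH."""
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    lines = [f"running: {sys.executable} ({version})"]
    for name, path in find_interpreters().items():
        lines.append(f"{name}: {path or 'NOT FOUND'}")
    return "\n".join(lines)


if __name__ == "__main__":
    # flush=True so the output is not held back by buffering in the container.
    print(describe_runtime(), flush=True)
```

If python3.11 shows as NOT FOUND, the CMD line in the Dockerfile cannot start the worker as written, so the interpreter would need to be installed or the command pointed at one that exists.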
Np lmk how it goes