Can't use GPU with Jax in serverless endpoint

Hi, I'm trying to run a serverless worker to perform point tracking on a video. It works, but I think it is running on the CPU. I read that the telemetry in the UI isn't reliable, but the container logs indicate the same thing. I've attached an image of what the logs say. It finds the NVIDIA GPU, but I think there are problems with JAX. I use the function in the first image to check the device, and the output I get is in the second image.

In my Dockerfile, I'm setting this as the base image:
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

I'm running this command to install the JAX version that is supposed to work with CUDA 11.8:
RUN pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

Then I install requirements.txt (I don't install JAX again here) and do some other setup. Finally, I do this to set the library path for CUDA:
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH

I still can't get it to work on the GPU. If someone could tell me where the problem could be, it would be extremely helpful. Thank you.
[Image 1: device-check function (image not available)]
[Image 2: log output of the device check (image not available)]
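The device-check function was posted as an image and is not reproduced in the thread. As a rough idea of what such a check can look like, here is a minimal sketch. It is not the poster's function: the name `check_jax_device`, the logger name, and the set of platform strings treated as "GPU" are assumptions made for illustration. It relies on `jax.devices()`, `jax.default_backend()`, and the `platform` attribute of a device.

```python
import logging

logger = logging.getLogger("inference")

# Assumption: platform names that should count as a GPU backend.
GPU_PLATFORMS = {"gpu", "cuda", "rocm"}


def is_gpu_platform(platform: str) -> bool:
    """Return True if a JAX platform string refers to a GPU backend."""
    return platform.lower() in GPU_PLATFORMS


def check_jax_device() -> bool:
    """Log the devices JAX can see and report whether any of them is a GPU."""
    try:
        import jax  # imported lazily so the check degrades gracefully
    except ImportError:
        logger.warning("jax is not installed")
        return False

    devices = jax.devices()
    for device in devices:
        logger.info("Found device: %s (platform=%s)", device, device.platform)
    logger.info("JAX backend: %s", jax.default_backend())

    # Compare the platform attribute rather than the printed device name,
    # since the printed name (for example "cuda:0") may not contain "gpu".
    return any(is_gpu_platform(device.platform) for device in devices)
```

Calling `check_jax_device()` once at worker start-up and logging the result makes it easy to see in the container logs which backend JAX actually selected.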
38 Replies
nerdylive (2mo ago)
Hey, before running the code, try setting this env variable:
export CUDA_VISIBLE_DEVICES=0,1
Run that command in a CLI. Let me know if that works or not.
Madiator2011 (Work)
Try adding this to your Dockerfile:
ENV PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV NVIDIA_VISIBLE_DEVICES=all
nerdylive (2mo ago)
Yeah, I guess so, but it has been included in the newest image tag.
Madiator2011 (Work)
I'm also wondering why you use CUDA 11.8 rather than 12.1:
pip install -U "jax[cuda12]"
galakurpismo3 (2mo ago)
Yeah I tried both cuda 12 and 11.8
Madiator2011 (Work)
@galakurpismo3 what's your use case? I might try to make a better JAX template, though I would need to understand how you test it.
galakurpismo3 (2mo ago)
Do I have to run this command in a shell inside the worker container? Or how does it work?
Madiator2011 (Work)
You would probably need to add it in the Docker container.
galakurpismo3 (2mo ago)
but the container is running on the serverless endpoint right?
Madiator2011 (Work)
workers are basically pods
galakurpismo3 (2mo ago)
ok I'll run that command from the python code in the beginning and add your suggestion too
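Setting the variable from Python only has an effect if it happens before the CUDA runtime is initialised, which in practice means before JAX is imported or first used. A minimal sketch of that ordering follows; the helper name `configure_gpu_env` and the default value `"0"` are assumptions for illustration (the thread suggested `0,1`, which presumes two GPUs are attached to the worker).

```python
import os


def configure_gpu_env(visible_devices: str = "0") -> dict:
    """Set CUDA_VISIBLE_DEVICES unless the platform has already set it.

    Must be called before jax is imported, otherwise the CUDA runtime
    may already have read the environment.
    """
    # setdefault keeps any value injected by the container runtime.
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", visible_devices)
    return {"CUDA_VISIBLE_DEVICES": os.environ["CUDA_VISIBLE_DEVICES"]}


if __name__ == "__main__":
    print(configure_gpu_env())
    # Import jax only after the environment has been configured, e.g.:
    # import jax
```

Using `setdefault` rather than overwriting is a design choice here: if the serverless runtime already restricts the visible devices, the handler does not silently replace that value.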
Madiator2011 (Work)
Have you tried running:
pip install --upgrade "jax[cuda12_local]"
galakurpismo3 (2mo ago)
okay, in the dockerfile, right?
Madiator2011 (Work)
GitHub - NVIDIA/JAX-Toolbox: JAX-Toolbox
galakurpismo3 (2mo ago)
Okay, I'll try, yes. Thank you.

Hi, I think that it worked, but there is a new error now, related to cuDNN I think. These are the logs:

Starting Serverless Worker |  Version 1.6.0 ---
{"requestId": "cbeb73b4-8679-43d1-aaa0-8c68101e76ac-e1", "message": "Started.", "level": "INFO"}
Get inside input_fn
xla_bridge.py       :889  Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
xla_bridge.py       :889  Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
inference.py        :172  Found device: cuda:0
inference.py        :176  JAX is not using the GPU. Check your JAX installation and environment configuration.
inference.py        :177  JAX backend: gpu
inference.py        :182  CUDA_VISIBLE_DEVICES: 0,1
inference.py        :183  LD_LIBRARY_PATH: /opt/venv/lib/python3.9/site-packages/cv2/../../lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
inference.py        :187  libcudart.so loaded successfully.
inference.py        :189  libcudnn.so loaded successfully.
inference.py        :143  Read and resized video, number of frames: 107
E0716  cuda_dnn.cc:535 Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E0716  cuda_dnn.cc:539 Memory usage: 84536328192 bytes free, 84986691584 bytes total.
E0716  cuda_dnn.cc:535 Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E0716  cuda_dnn.cc:539 Memory usage: 84536328192 bytes free, 84986691584 bytes total.
inference.py        :162  Error during processing: FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details.
{"requestId": "cbeb73b4-8679-43d1-aaa0-8c68101e76ac-e1", "message": "Finished.", "level": "INFO"}

I've tried with a 24GB GPU and an 80GB GPU. I'm using this base image:
FROM nvidia/cuda:12.0.0-cudnn8-devel-ubuntu20.04
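The logs show that libcudnn.so loads, yet creating a cuDNN handle fails even with plenty of free memory. One possible cause, not confirmed anywhere in this thread, is that the cuDNN version found on LD_LIBRARY_PATH differs from the one the installed jaxlib build expects. A minimal sketch for logging the cuDNN version that actually gets loaded is below. It calls cuDNN's `cudnnGetVersion()` through `ctypes`; the helper names and the list of candidate library file names are assumptions for illustration.

```python
import ctypes
from typing import Optional, Tuple

# Assumption: shared-library names to try, in order.
CANDIDATE_LIBS = ("libcudnn.so.9", "libcudnn.so.8", "libcudnn.so")


def loaded_cudnn_version() -> Optional[int]:
    """Return the raw cudnnGetVersion() value, or None if cuDNN is not found."""
    for name in CANDIDATE_LIBS:
        try:
            lib = ctypes.CDLL(name)
            get_version = lib.cudnnGetVersion
        except (OSError, AttributeError):
            continue  # library or symbol missing, try the next candidate
        get_version.restype = ctypes.c_size_t
        return int(get_version())
    return None


def decode_cudnn_version(raw: int) -> Tuple[int, int, int]:
    """Split the raw version integer into (major, minor, patch).

    cuDNN 8 and earlier encode major*1000 + minor*100 + patch;
    cuDNN 9 and later encode major*10000 + minor*100 + patch.
    """
    scale = 10000 if raw >= 90000 else 1000
    major, rest = divmod(raw, scale)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


if __name__ == "__main__":
    raw = loaded_cudnn_version()
    if raw is None:
        print("cuDNN shared library not found on the loader path")
    else:
        print("cuDNN version:", ".".join(map(str, decode_cudnn_version(raw))))
```

Comparing the printed version with the cuDNN requirement listed for the installed JAX release would show whether the base image and the pip-installed jaxlib agree.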