RunPod•13mo ago

NGC containers

Has anyone gotten NGC containers running on RunPod? I see it as an option, but I think it doesn't work because you need to install the SSH libraries on top. I need this to use FP8 on H100s, since the PyTorch NGC container includes Transformer Engine for FP8. Building Transformer Engine manually takes a long time (it requires downloading a cuDNN tarball from the NVIDIA website).

40 Replies

Madiator2011•13mo ago

Yes

SantoshOP•12mo ago

Are there any docs or a quick example on how to use it? Any update on this? CC @mmoy

Jason•12mo ago

Not in the RunPod docs, but I'm sure the way to use them on RunPod is to create a template (dockerize the apps you need) and then run it on pods or serverless. Creating RunPod templates is documented (as a general example, not specifically for NGC containers). Use the NGC containers as base images, add whatever you need to the image, and use the result as a template.

SantoshOP•12mo ago

Yeah, I got that. I guess I was just too lazy to add the required SSH libs and create that template. I also didn't understand why RunPod PyTorch NGC containers are available in the dropdown selection if the limitations are known. Maybe I'm just not using it correctly?

Madiator2011•12mo ago

I can always take commissions

Jason•12mo ago

What limitations?

SantoshOP•12mo ago

How do you use a container and "Connect" if there's no SSH access? There's also no option to SSH into the host and use the container interactively, so I'm not sure what you can do after deploying a "RunPod PyTorch NGC" template.

Madiator2011•12mo ago

If you run the bare image, you might need to set the container command to:

bash -c 'sleep infinity'


SantoshOP•12mo ago

But how would I SSH into the container? Or is the SSH command for the host machine with Docker access? Anyway, I think I can create a template to fix it with my remaining few dollars of credits 😅
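
A minimal sketch of what a container start script could do to make a bare image reachable over SSH and keep it alive, in place of a plain `sleep infinity`. Everything here is an assumption rather than RunPod's documented behavior: the environment variable name `PUBLIC_KEY`, the helper names, and the premise that `openssh-server` was already installed when the image was built.

```python
import os
import subprocess
import time
from pathlib import Path


def install_public_key(public_key: str, home: Path) -> Path:
    """Append public_key to <home>/.ssh/authorized_keys if it is not already there."""
    ssh_dir = home / ".ssh"
    ssh_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    auth_file = ssh_dir / "authorized_keys"
    key = public_key.strip()
    existing = auth_file.read_text() if auth_file.exists() else ""
    if key not in existing.splitlines():
        with auth_file.open("a") as fh:
            # Make sure the new key starts on its own line.
            if existing and not existing.endswith("\n"):
                fh.write("\n")
            fh.write(key + "\n")
    auth_file.chmod(0o600)
    return auth_file


def run_entrypoint() -> None:
    """Hypothetical entrypoint: install the key, start sshd, then block forever."""
    # PUBLIC_KEY is an assumed variable name; check what your platform injects.
    key = os.environ.get("PUBLIC_KEY", "")
    if key:
        install_public_key(key, Path.home())
        # Assumes openssh-server is present in the image (Debian/Ubuntu service name).
        subprocess.run(["service", "ssh", "start"], check=False)
    # Keep the container running, same effect as `sleep infinity`.
    while True:
        time.sleep(3600)
```

The container command would then call `run_entrypoint()` from a script baked into the image. The key-handling part is kept separate so it can be tried against a scratch directory first.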

Jason•12mo ago

There is an SSH command under the Connect button: after you press it, you press SSH. Oh wait, this is not pods. Why would you want to SSH into serverless containers?

SantoshOP•12mo ago

This is for pods. A pod still runs a container; a pod doesn't give you access to the host machine.

Madiator2011•12mo ago

Give me about 1h and I'll build a container for you

Jason•12mo ago

wow rare moment

Madiator2011•12mo ago

any specific docker image as base? @sbhavani

SantoshOP•12mo ago

latest container from a few days ago: nvcr.io/nvidia/pytorch:24.04-py3

Madiator2011•12mo ago

I think it should work. Note that volume storage won't be at /workspace. @sbhavani btw, you wanted a template for pods? Note that the image requires a host with CUDA 12.4. @sbhavani do you have any code for testing?

SantoshOP•12mo ago

yes template for pods, I guess it depends on the driver version for the host too
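
A rough way to check the host side from inside a pod is to read the "CUDA Version" field that `nvidia-smi` prints in its header, which reflects what the installed driver supports. The sketch below assumes that header format; the function names are made up for illustration, and the result is only a quick sanity check, not a full compatibility test.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_cuda_version(smi_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def host_supports(required: Tuple[int, int], smi_output: str) -> bool:
    """True if the driver-reported CUDA version is at least `required`."""
    found = parse_cuda_version(smi_output)
    return found is not None and found >= required


def read_nvidia_smi() -> str:
    """Return nvidia-smi's text output, or an empty string if the tool is missing."""
    if shutil.which("nvidia-smi") is None:
        return ""
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=False)
    return result.stdout
```

For the 24.04 image discussed here, the call would be `host_supports((12, 4), read_nvidia_smi())`.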

Madiator2011•12mo ago

I got the template done; I just need to run some tests. If you have any small code to test 8-bit quant, let me know.

SantoshOP•12mo ago

https://github.com/NVIDIA/TransformerEngine?tab=readme-ov-file#pytorch - sample code here!

Madiator2011•12mo ago

what kinda output shall I get from it?

SantoshOP•12mo ago

Hmm, actually that code is more of a functional test. I don't have anything readily available to test perf/speedup. I can clean up this repo and add an HF Llama-2/3 example comparing BF16 and FP8 throughput: https://github.com/sbhavani/h100-performance-tests
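
A BF16 vs FP8 comparison like the one mentioned needs some timing loop. The sketch below is a generic harness, not code from the linked repo: `measure_throughput` and its parameters are illustrative names. The optional `sync` hook matters on GPUs because kernels launch asynchronously, so the clock should only be read after the device has finished.

```python
import time
from typing import Callable, Optional


def measure_throughput(
    step: Callable[[], object],
    items_per_step: int,
    n_iters: int = 20,
    warmup: int = 3,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return items processed per second for repeated calls to step()."""
    # Warm-up iterations are excluded from timing (caches, autotuning, lazy init).
    for _ in range(warmup):
        step()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(n_iters):
        step()
    if sync is not None:
        sync()  # wait for queued GPU work before stopping the clock
    elapsed = max(time.perf_counter() - start, 1e-9)
    return items_per_step * n_iters / elapsed
```

With PyTorch on a GPU, one would pass `sync=torch.cuda.synchronize` and call the function twice, once with a BF16 forward/backward step and once with the same step under FP8, then compare the two rates.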

Madiator2011•12mo ago

I ran it and I'm not getting anything, neither output nor an error

SantoshOP•12mo ago

then sounds like it works! if you publish to the community I'll test it out as well
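
The README sample prints nothing on success, which is why a silent run counts as a pass. A variant that reports something visible could look like the sketch below. It follows the Transformer Engine README example (`te.Linear`, `recipe.DelayedScaling`, `te.fp8_autocast`); the wrapper function, its return strings, and the skip logic are additions for illustration, and it has not been run on the template from this thread.

```python
def fp8_smoke_test(in_features: int = 768, out_features: int = 3072, batch: int = 2048) -> str:
    """Run one FP8 forward/backward pass and return a short status string."""
    try:
        import torch
        import transformer_engine.pytorch as te
        from transformer_engine.common import recipe
    except (ImportError, OSError):
        return "skipped: torch or transformer_engine is not available"
    if not torch.cuda.is_available():
        return "skipped: no CUDA device visible"

    try:
        model = te.Linear(in_features, out_features, bias=True)
        inp = torch.randn(batch, in_features, device="cuda")
        # Delayed-scaling recipe with the E4M3 format, as in the README sample.
        fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)
        with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
            out = model(inp)
        out.sum().backward()
    except Exception as exc:  # e.g. a GPU without FP8 support
        return f"failed: {exc}"
    return f"ok: output shape {tuple(out.shape)}, dtype {out.dtype}"


print(fp8_smoke_test())
```

On a machine without a GPU or without Transformer Engine, the function reports that it skipped instead of raising.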

Madiator2011•12mo ago

It should be cached in the H100 PCIe CA region on secure cloud at least. @sbhavani https://runpod.io/console/deploy?template=lc5dch2fuv&ref=vfker49t The template name is pytorch-ngc-runpod, the password for Jupyter is RunPod, and volume storage is mounted at /vol. Btw @sbhavani, let me know if it worked for you. It's not that rare; I'm happy to help build templates, but not if you ask me to add 50 models from Civitai.

SantoshOP•12mo ago

thanks! I'll test it out on friday!

Jason•12mo ago

How about 20

Madiator2011•12mo ago

I can build you a container that would block access to Civitai

Jason•12mo ago

Sure please share the dockerfile for me

Madiator2011•12mo ago

FROM runpod/pytorch:2.2.1-py3.10-cuda12.1.1-devel-ubuntu22.04
RUN echo "127.0.0.1 civitai.com" >> /etc/hosts


Jason•12mo ago

lol

Geri•12mo ago

I'm looking for a PyTorch Docker container without RunPod. Can I just do a docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.10-py3?

Geri•12mo ago

I want to use PyTorch with Sentence Transformers from Hugging Face (https://github.com/huggingface/setfit), do a torch.compile, and run predictions.

Jason•12mo ago

Sure, use the right docker command. Well, add your code to use the models.

Geri•12mo ago

has someone tried torch.compile?

Jason•12mo ago

Not me, I haven't tried it

Geri•12mo ago

Where can I find which torch-tensorrt version is compatible with CUDA, torch, etc.? Is it expected that pip install torch-tensorrt==2.2.0 installs both nvidia-cuda-runtime-cu11 and nvidia-cuda-runtime-cu12, and likewise nvidia-cudnn-cu11 and nvidia-cudnn-cu12, plus some other NVIDIA packages? And does torch-tensorrt work with an older GPU like the one in a g4dn.xlarge?
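
To see whether an environment really ended up with both cu11 and cu12 wheels, the installed distributions can be grouped by their CUDA suffix. This is only a diagnostic sketch: it assumes the `nvidia-<name>-cu<NN>` naming pattern quoted above, the function names are made up, and it does not say whether the mix is expected or harmful.

```python
import re
from collections import defaultdict
from importlib import metadata
from typing import Dict, Iterable, List


def group_by_cuda_suffix(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group package names like 'nvidia-cudnn-cu12' by their 'cuNN' suffix."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        normalized = name.lower().replace("_", "-")
        match = re.fullmatch(r"nvidia-.+-(cu\d+)", normalized)
        if match:
            groups[match.group(1)].append(name)
    return {suffix: sorted(pkgs) for suffix, pkgs in groups.items()}


def installed_nvidia_packages() -> Dict[str, List[str]]:
    """Apply the grouping to every distribution installed in this environment."""
    names = (dist.metadata["Name"] for dist in metadata.distributions())
    return group_by_cuda_suffix(n for n in names if n)


for suffix, pkgs in sorted(installed_nvidia_packages().items()):
    print(suffix, ", ".join(pkgs))
```

More than one suffix in the printed result means the environment holds wheels built for different CUDA major versions.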

Jason•12mo ago

I don't know about that. Sure, try it, just select the right driver. But AWS has its own support; it's best to ask there for help with their products.

SantoshOP•12mo ago

@Geri Take a look at the versions used in https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html. That should give you an idea of compatibility across torch and nvidia packages

Jason•12mo ago

Oh, there's that matrix. Thanks for sharing it.

Geri•12mo ago

Hi, does anyone know how to configure a config.pbtxt for ONNX or PyTorch?
