RunPod•8mo ago
sbhavani

NGC containers

Has anyone gotten NGC containers running on RunPod? I see it as an option, but I think it doesn't work because you need to install the SSH libraries on top. I need this to use FP8 on H100s, since the PyTorch NGC container includes Transformer Engine for FP8. Building Transformer Engine manually takes a long time (it requires downloading a cuDNN tarball from the NVIDIA website).
40 Replies
Madiator2011
Madiator2011•8mo ago
Yes
sbhavani
sbhavaniOP•7mo ago
Are there any docs or a quick example on how to use it? Any update on this? CC @mmoy
nerdylive
nerdylive•7mo ago
Not in the RunPod docs, but I'm sure the way to use them on RunPod is to create a template (dockerize the apps you need), then run it on pods or serverless. Creating RunPod templates is documented (as a general example, not specifically for NGC containers). Use the NGC images as base images, add whatever you need to the image, and use the result as a template.
sbhavani
sbhavaniOP•7mo ago
yeah I got that, I guess I was just too lazy to add the required SSH libs and create that template. I also didn't understand why RunPod PyTorch NGC containers are available in the dropdown selection if the limitations are known. Maybe I'm just not using it correctly?
Madiator2011
Madiator2011•7mo ago
I can always take commissions
nerdylive
nerdylive•7mo ago
What limitations?
sbhavani
sbhavaniOP•7mo ago
how do you use a container and "Connect" if there's no SSH access? There's also no option to SSH into the host and use the container interactively. So I'm not sure what you can do after deploying a "RunPod PyTorch NGC" template
Madiator2011
Madiator2011•7mo ago
if you run a bare image you might need to set the container command to
bash -c 'sleep infinity'
sbhavani
sbhavaniOP•7mo ago
but how would I SSH into the container? or is the SSH command for the host machine with Docker access? anyways, I think I can create a template to fix it with my remaining few dollars of credits 😅
nerdylive
nerdylive•7mo ago
there is an ssh command in the connect button: after you press it, you press ssh. oh wait, this is not pods? why would you want to ssh into serverless containers?
sbhavani
sbhavaniOP•7mo ago
this is for pods. a pod still runs a container; a pod doesn't give you access to the host machine
Madiator2011
Madiator2011•7mo ago
Give me like 1h, I will build a container for you
nerdylive
nerdylive•7mo ago
wow rare moment
Madiator2011
Madiator2011•7mo ago
any specific docker image as base? @sbhavani
sbhavani
sbhavaniOP•7mo ago
latest container from a few days ago: nvcr.io/nvidia/pytorch:24.04-py3
Madiator2011
Madiator2011•7mo ago
I think it should work. Note that volume storage won't be at /workspace. @sbhavani btw, you wanted a template for pods? Note that the image requires a host with CUDA 12.4. @sbhavani so do you have any code to test with?
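A quick way to see which CUDA version a pod's host driver reports is to read the header that nvidia-smi prints. The sketch below is only a rough check under stated assumptions: the helper names `parse_cuda_version` and `host_supports` are made up for illustration, and the 12.4 threshold is taken from the message above rather than from any NVIDIA compatibility rule.

```python
import re
import shutil
import subprocess


def parse_cuda_version(smi_text):
    """Return (major, minor) from nvidia-smi header text, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def host_supports(required, smi_text):
    """True if the reported CUDA version is at least `required` (a tuple)."""
    found = parse_cuda_version(smi_text)
    return found is not None and found >= required


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; run this inside a GPU pod")
    else:
        text = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True
        ).stdout
        print("CUDA version reported by driver:", parse_cuda_version(text))
        # (12, 4) is the requirement mentioned in the thread for this image.
        print("OK for a 12.4 image:", host_supports((12, 4), text))
```

Running it inside a pod before pulling a large image can save a failed deploy, although it says nothing about other driver or library constraints.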
sbhavani
sbhavaniOP•7mo ago
yes template for pods, I guess it depends on the driver version for the host too
Madiator2011
Madiator2011•7mo ago
I got the template done, I just need to run some tests. If you have any small code to test 8-bit quant, let me know
sbhavani
sbhavaniOP•7mo ago
GitHub
GitHub - NVIDIA/TransformerEngine: A library for accelerating Trans...
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilizatio...
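For reference, a minimal FP8 smoke test along the lines of the quickstart in the linked Transformer Engine repository might look like the sketch below. It assumes the `fp8_autocast` and `DelayedScaling` API as shipped around the 24.04 NGC container; the wrapper function `fp8_smoke_test`, the tensor sizes, and the status strings are illustrative only. It is a functional check, so on a working setup it produces no errors and says nothing about speed.

```python
def fp8_smoke_test():
    """Run one FP8 forward/backward pass and return a short status string."""
    try:
        import torch
        import transformer_engine.pytorch as te
        from transformer_engine.common import recipe
    except ImportError as exc:
        return f"skipped: {exc}"
    if not torch.cuda.is_available():
        return "skipped: no CUDA device visible"

    # Arbitrary sizes, all multiples of 16 to suit FP8 GEMMs.
    in_features, out_features, batch = 768, 3072, 2048
    model = te.Linear(in_features, out_features, bias=True)
    inp = torch.randn(batch, in_features, device="cuda")

    # Delayed-scaling recipe with the E4M3 format.
    fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        out = model(inp)
    out.sum().backward()
    torch.cuda.synchronize()
    return f"ok: output shape {tuple(out.shape)}"


if __name__ == "__main__":
    print(fp8_smoke_test())
```

On a GPU without FP8 support the call inside `fp8_autocast` is expected to raise, which is itself a useful signal when validating a template.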
Madiator2011
Madiator2011•7mo ago
what kind of output should I get from it?
sbhavani
sbhavaniOP•7mo ago
hmm, actually that code is more functional testing, I don't have anything readily available to test perf/speedup. I can clean up this repo and add a HF Llama-2/3 example comparing BF16 and FP8 throughput: https://github.com/sbhavani/h100-performance-tests
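A BF16 versus FP8 comparison of the kind described here usually comes down to timing the same training or inference step under both settings. The harness below is a generic sketch and not the code from the linked repository: `measure_throughput`, `speedup`, and their parameters are hypothetical names, and the `sync` hook is where a caller would pass something like `torch.cuda.synchronize` so that asynchronous GPU work is included in the timing.

```python
import time


def measure_throughput(step, items_per_step, warmup=3, iters=10, sync=None):
    """Return items processed per second over `iters` calls to step()."""
    for _ in range(warmup):
        step()  # warm-up calls are not timed
    if sync is not None:
        sync()  # drain pending GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        step()
    if sync is not None:
        sync()  # make sure timed work has actually finished
    elapsed = max(time.perf_counter() - start, 1e-9)
    return items_per_step * iters / elapsed


def speedup(baseline, candidate):
    """Ratio of candidate throughput to baseline throughput."""
    return candidate / baseline


if __name__ == "__main__":
    # Stand-in workload; replace with a real BF16 or FP8 step function.
    rate = measure_throughput(lambda: sum(range(10_000)), items_per_step=1)
    print(f"{rate:.1f} steps/s")
```

Measuring both configurations with the same batch size and sequence length, then passing the two rates to `speedup`, gives a single number to report.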
Madiator2011
Madiator2011•7mo ago
I kinda ran it and I'm not getting anything, no output or error
sbhavani
sbhavaniOP•7mo ago
then sounds like it works! if you publish to the community I'll test it out as well
Madiator2011
Madiator2011•7mo ago
It should be cached on H100 PCIe in the CA region on secure cloud at least.
@sbhavani https://runpod.io/console/deploy?template=lc5dch2fuv&ref=vfker49t
Template name: pytorch-ngc-runpod. The password for Jupyter is RunPod. Volume storage is mounted at /vol.
btw @sbhavani let me know if it worked for you.
Not that rare, I'm happy to help build templates, but not if you ask me to add 50 models from Civitai.
sbhavani
sbhavaniOP•7mo ago
thanks! I'll test it out on friday!
nerdylive
nerdylive•7mo ago
How about 20
Madiator2011
Madiator2011•7mo ago
I can build you a container that would block access to Civitai
nerdylive
nerdylive•7mo ago
Sure, please share the Dockerfile with me
Madiator2011
Madiator2011•7mo ago
FROM runpod/pytorch:2.2.1-py3.10-cuda12.1.1-devel-ubuntu22.04
RUN echo "127.0.0.1 civitai.com" >> /etc/hosts
nerdylive
nerdylive•7mo ago
lol
Geri
Geri•6mo ago
I'm looking for a PyTorch Docker container without RunPod. Can I just do docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.10-py3?
Geri
Geri•6mo ago
I want to use PyTorch with sentence transformers from Hugging Face (https://github.com/huggingface/setfit), do a torch.compile, and run predictions
GitHub
GitHub - huggingface/setfit: Efficient few-shot learning with Sente...
Efficient few-shot learning with Sentence Transformers - huggingface/setfit
nerdylive
nerdylive•6mo ago
Sure, use the right docker command. Well, add your code to use the models
Geri
Geri•6mo ago
has someone tried torch.compile?
nerdylive
nerdylive•6mo ago
Not me, I haven't tried it
Geri
Geri•6mo ago
Where can I find which torch-tensorrt version is compatible with CUDA, torch, etc.? Is it expected that pip install torch-tensorrt==2.2.0 installs both nvidia-cuda-runtime-cu11 and nvidia-cuda-runtime-cu12, the same for nvidia-cudnn-cu11 and nvidia-cudnn-cu12, and some other nvidia packages? And does torch-tensorrt work with an older GPU like a g4dn.xlarge?
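To see at a glance whether an environment has ended up with both cu11 and cu12 builds of the same NVIDIA wheel, the installed distributions can be grouped by their `-cuNN` suffix. This is a sketch using only the standard library; the function names `group_by_cuda_major` and `mixed_cuda_packages` are invented for illustration, and the script only reports what is installed without saying whether the mix is expected for a given torch-tensorrt release.

```python
import re
from collections import defaultdict
from importlib import metadata

# Matches names such as "nvidia-cudnn-cu12" -> ("nvidia-cudnn", "12").
_CUDA_SUFFIX = re.compile(r"^(nvidia-.+)-cu(\d+)$")


def group_by_cuda_major(package_names):
    """Map each nvidia-* base package name to the sorted CUDA majors found."""
    groups = defaultdict(set)
    for name in package_names:
        match = _CUDA_SUFFIX.match(name.lower().replace("_", "-"))
        if match:
            groups[match.group(1)].add(int(match.group(2)))
    return {base: sorted(majors) for base, majors in groups.items()}


def mixed_cuda_packages(package_names):
    """Return only the packages installed for more than one CUDA major."""
    grouped = group_by_cuda_major(package_names)
    return {base: majors for base, majors in grouped.items() if len(majors) > 1}


if __name__ == "__main__":
    installed = [
        dist.metadata["Name"]
        for dist in metadata.distributions()
        if dist.metadata["Name"]
    ]
    for base, majors in sorted(mixed_cuda_packages(installed).items()):
        print(f"{base}: installed for CUDA majors {majors}")
```

Running it right after the pip install shows exactly which packages were pulled in twice, which is useful detail to include when asking the maintainers.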
nerdylive
nerdylive•6mo ago
I don't know about that. Sure, try it, just select the right driver. But AWS has its own support; it's best to ask there for the best support on their products
sbhavani
sbhavaniOP•6mo ago
@Geri Take a look at the versions used in https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html. That should give you an idea of compatibility across torch and nvidia packages
nerdylive
nerdylive•6mo ago
Oh, there's that matrix, thanks for sharing it
Geri
Geri•6mo ago
hi, does someone know how to configure a config.pbtxt for ONNX or PyTorch?
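Assuming the question is about the model configuration file used by NVIDIA Triton Inference Server, a minimal config.pbtxt names the model, the platform, the maximum batch size, and the input and output tensors. The sketch below renders such a file from Python; the helper names, model name, tensor names, data types, and dims are all placeholders that must be replaced with the values of the actual exported model.

```python
def render_tensor(name, data_type, dims):
    """Render one input/output entry of a Triton-style config.pbtxt."""
    dims_text = ", ".join(str(d) for d in dims)
    return (
        "  {\n"
        f'    name: "{name}"\n'
        f"    data_type: {data_type}\n"
        f"    dims: [ {dims_text} ]\n"
        "  }"
    )


def render_config(model_name, platform, max_batch_size, inputs, outputs):
    """Return the text of a minimal config.pbtxt."""
    lines = [
        f'name: "{model_name}"',
        f'platform: "{platform}"',
        f"max_batch_size: {max_batch_size}",
        "input [",
        ",\n".join(render_tensor(*tensor) for tensor in inputs),
        "]",
        "output [",
        ",\n".join(render_tensor(*tensor) for tensor in outputs),
        "]",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values; use "pytorch_libtorch" for a TorchScript model,
    # where tensors are conventionally named INPUT__0, OUTPUT__0, and so on.
    print(render_config(
        "my_model", "onnxruntime_onnx", 8,
        inputs=[("input_ids", "TYPE_INT64", [-1])],
        outputs=[("logits", "TYPE_FP32", [2])],
    ))
```

The generated text would be saved as config.pbtxt next to the model's version directory in the model repository.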