How to deploy ModelsLab/Uncensored-llama3.1-nemotron?

I have tried to deploy this model: https://huggingface.co/ModelsLab/Uncensored-llama3.1-nemotron. I am hitting a CUDA out-of-memory error (I have tried 24 GB and 48 GB GPUs) and it does not work. How do I fix this?
9 Replies
openmind (OP) · 4d ago
@haris, can you please check and give advice?
openmind (OP) · 4d ago
"message":"engine.py :115 2024-12-16 19:24:32,150 Error initializing vLLM engine: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacity of 47.50 GiB of which 130.31 MiB is free. Process 4026461 has 47.37 GiB memory in use. Of the allocated memory 46.87 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n"
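The traceback itself suggests an allocator setting for fragmentation. As a sketch: it is set as an environment variable before launching the server, but note that it only mitigates fragmentation and will not help when the model genuinely exceeds GPU capacity, which appears to be the case here.

```shell
# From the error message's own suggestion; mitigates allocator
# fragmentation only, not a model that is simply too large.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```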
kaj · 4d ago
70B Llama models typically need a little over 48 GB; try 80 GB VRAM GPUs.
openmind (OP) · 4d ago
Tried 80 GB; same memory issue, by the way.
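The back-of-the-envelope arithmetic explains why even 80 GB fails: weights alone for a 70B-parameter model at 16-bit precision are around 130 GiB, before the KV cache and activations are counted. A minimal sketch of the math (weights only; serving overhead comes on top):

```python
# Rough VRAM needed just to hold 70B model weights at various precisions.
# Assumption: weights only; KV cache and activations add more on top.
PARAMS = 70e9

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

for precision, nbytes in BYTES_PER_PARAM.items():
    gib = PARAMS * nbytes / 1024**3
    print(f"{precision:>10}: ~{gib:.0f} GiB")
```

So a single 80 GB GPU can only hold the weights at int8 or below; fp16 needs either quantization or multiple GPUs, which is exactly what the next replies suggest.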
kaj · 4d ago
Are you quantizing, running at half precision (fp16/bf16), or running full fp32? You will probably need to do both.
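A hedged sketch of what "halving" and "quantizing" look like as vLLM launch commands. The `--quantization awq` flag only works with a checkpoint that was actually quantized in that format; whether an AWQ/GPTQ variant of this particular model exists is an assumption, so the second command uses a placeholder model name.

```shell
# Half precision (fp16) halves fp32 weight memory:
python -m vllm.entrypoints.openai.api_server \
    --model ModelsLab/Uncensored-llama3.1-nemotron \
    --dtype half \
    --gpu-memory-utilization 0.90

# If that still OOMs on a single GPU, serve a pre-quantized checkpoint.
# Assumption: an AWQ-quantized variant exists; the flag requires one.
python -m vllm.entrypoints.openai.api_server \
    --model <awq-quantized-variant> \
    --quantization awq
```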
nerdylive · 3d ago
Needs more VRAM, I think.
nerdylive · 3d ago
Use at least: [image attachment]
yhlong00000 · 3d ago
You can't use 3 or 6 GPUs; it has to be 2, 4, or 8 GPUs.
nerdylive · 3d ago
Oh ya, true, I forgot about that. And don't forget to set tensor parallelism to the number of GPUs in your setup; I think it's necessary.
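Putting the last two replies together, a sketch of a multi-GPU launch. Tensor parallelism splits each layer across GPUs, and vLLM requires the GPU count to divide the model's attention-head count evenly, which is why 2, 4, or 8 work but 3 or 6 do not. Two 80 GB GPUs (160 GB total) is an assumption about available hardware that would fit the ~130 GiB of fp16 weights plus cache.

```shell
# Assumption: 2x 80 GB GPUs; tensor-parallel-size must match the GPU count.
python -m vllm.entrypoints.openai.api_server \
    --model ModelsLab/Uncensored-llama3.1-nemotron \
    --dtype half \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90
```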