How to deploy ModelsLab/Uncensored-llama3.1-nemotron?
I have tried to deploy this model
https://huggingface.co/ModelsLab/Uncensored-llama3.1-nemotron
Btw I am facing a CUDA out-of-memory issue (I have tried 24 GB and 48 GB GPUs); it does not work. How do I fix this?
@haris, can you please check and advise?
"message":"engine.py :115 2024-12-16 19:24:32,150 Error initializing vLLM engine: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacity of 47.50 GiB of which 130.31 MiB is free. Process 4026461 has 47.37 GiB memory in use. Of the allocated memory 46.87 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n"
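The allocator hint from the traceback can be set before launching; note it only mitigates fragmentation, it won't help if the model simply doesn't fit:

```shell
# Allocator setting suggested by the error message; reduces fragmentation
# but will not fix a genuine capacity shortfall.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```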
70B llama models typically need a little over 48 GB; try 80 GB VRAM GPUs
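A rough back-of-the-envelope check, counting weights only (ignores KV cache and activations; assumes a 70B-parameter checkpoint at ~4/2/0.5 bytes per parameter):

```shell
# Weight memory only, ignoring KV cache/activations (assumed 70B params).
PARAMS_B=70
echo "fp32: $((PARAMS_B * 4)) GB"   # 280 GB
echo "fp16: $((PARAMS_B * 2)) GB"   # 140 GB - still too big for one 80 GB GPU
echo "int4: $((PARAMS_B / 2)) GB"   # 35 GB of weights, plus runtime overhead
```

So a single 80 GB GPU is not enough for fp16 weights alone; you need quantization, multiple GPUs, or both.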
Tried 80 GB; btw, same memory issue.
Are you quantizing or halving (fp16)? Or running full fp32? You will probably need to do both.
Needs more VRAM, I think.
Use at least:
You can't use 3 or 6 GPUs; it has to be 2, 4, or 8 GPUs.
Oh ya true I forgot about that
And don't forget to set the tensor parallelism to the number of your GPUs; I think it's necessary.
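Putting the advice in this thread together, a minimal launch sketch, assuming vLLM's OpenAI-compatible `vllm serve` CLI and a multi-GPU node (flag values are illustrative, adjust to your hardware):

```shell
# fp16 weights, sharded across 2 GPUs (tensor parallelism), with a capped
# context length to keep the KV cache small. For fp16 70B weights (~140 GB)
# you need the combined VRAM across shards to cover weights + KV cache.
vllm serve ModelsLab/Uncensored-llama3.1-nemotron \
  --dtype half \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```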