Running Llama 3.3 70B using vLLM and a 160 GB network volume
Hi, I want to check if 160 GB is enough for Llama 70B, and whether I can use a smaller network volume
Or if I need a larger network volume
Maybe try 200 GB first.
ok thanks a lot
If I'm not wrong it's around 170+ GB, need to check how much the HF files take up
And which GPU are you going to use?
I'm not sure atm, are the 24 GB VRAM options fine?
I think I'm going to use the suggested option (A6000, A40) and use AWQ quant
Oh using quantization
I set up with the vLLM template without quant for now, using A6000/A40 and a 210 GB volume in Canada. I posted an initial request. How long will this take to initialize, roughly?
Check the logs, I'm not sure how long; it also depends on the GPU setup you are using
Because I've never deployed that model with that specific quantization
It's past 20 mins now
with no quantization
How are the logs?
Any signs of out of memory?
"endpointId":"ikmbyelhctz06j"
"workerId":"2zeadzwvontveg"
"level":"error"
"message":"Uncaught exception | <class 'torch.OutOfMemoryError'>; CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacity of 44.45 GiB of which 444.62 MiB is free. Process 1865701 has 44.01 GiB memory in use. Of the allocated memory 43.71 GiB is allocated by PyTorch, and 1.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables); <traceback object at 0x7f0a94eff580>;"
"dt":"2024-12-11 05:47:39.26656704"
yes
something must be wrong with my setup
OOM
So you add GPUs
add workers?
Ahh ok! Is 3 enough?
Use more GPUs per worker
In edit endpoint, not max workers
ok!
thank you!
Yup
I'll try to calculate the VRAM requirement after this if I can
What quantization type do you use?
Gonna try AWQ but I am a noob, gonna do some research
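A rough back-of-envelope for the weights alone (just a sketch: ~70B parameters at 2 bytes each in FP16, ~0.5 bytes each in AWQ INT4; KV cache and runtime overhead come on top):

```python
# Rough VRAM estimate for the weights only; KV cache, activations and the
# CUDA context add more on top, so treat these as lower bounds.
params = 70e9

fp16_gb = params * 2 / 1e9        # ~140 GB -> needs several 48 GB GPUs
awq_int4_gb = params * 0.5 / 1e9  # ~35 GB  -> can fit on one 48 GB GPU, tightly

print(f"FP16 weights:     ~{fp16_gb:.0f} GB")
print(f"AWQ INT4 weights: ~{awq_int4_gb:.0f} GB")
```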
Use 4
And don't forget to change the tensor parallelism
Maybe just create a new endpoint from the quick deploy, because once you deploy it the config settings get converted into env variables, and if you don't update those env variables the new config won't be there
When creating a new endpoint, change the tensor parallelism to 4, or to however many GPUs you use
Must be a power of 2 I guess, it can't be 6
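If it helps, this is roughly what that maps to with the plain vLLM Python API (a sketch only; the serverless template sets the same thing through its env variables, and I'm assuming the stock meta-llama/Llama-3.3-70B-Instruct repo here):

```python
from vllm import LLM

# tensor_parallel_size has to divide the model's attention head count
# (64 for Llama 70B) evenly, so 2, 4 or 8 work but 6 does not.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,  # match the number of GPUs per worker
)
```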
ok I'll try creating a new endpoint
Can't seem to get any responses
no errors in the logs 😦
Is it still loading?
How are the logs now?
I also don't think that's how you use quantization; you need to have the quantized model first and then use that, not take the unquantized model and load it with quantization in vLLM
https://huggingface.co/ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
Maybe check this out, I found it just now; I'm not sure if it's safe, but it seems to be a quantized version of it
4096 max len
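For reference, a minimal sketch of loading that pre-quantized checkpoint with the plain vLLM Python API (the serverless worker takes the equivalent settings as env variables; the repo name is the one linked above):

```python
from vllm import LLM, SamplingParams

# The checkpoint is already AWQ INT4, so vLLM just loads it as-is;
# quantization="awq" makes that explicit.
llm = LLM(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    quantization="awq",
    max_model_len=4096,      # a small context keeps the KV cache within budget
    tensor_parallel_size=1,  # ~35 GB of INT4 weights fits (tightly) on one 48 GB GPU
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```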