RunPod•2w ago
Nickbkl

Running Llama 3.3 70B using vLLM and a 160 GB network volume

Hi, I want to check if 160 GB is enough for Llama 70B, and whether I can use a smaller network volume.
27 Replies
Nickbkl
NickbklOP•2w ago
Or if I need a larger network volume
nerdylive
nerdylive•2w ago
Maybe try 200 GB first.
Nickbkl
NickbklOP•2w ago
ok thanks a lot
nerdylive
nerdylive•2w ago
If I'm not wrong it's around 170+ GB; need to check how much the HF files take. And which GPU are you going to use?
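As a rough sanity check on that 170+ GB figure, a back-of-the-envelope sketch (assuming the usual bf16 safetensors release on Hugging Face, with actual shard sizes varying a little):
```python
# Rough disk-size estimate for Llama 3.3 70B weights in bf16 (2 bytes/param).
params = 70.6e9          # ~70.6B parameters for the 70B Llama models
bytes_per_param = 2      # bf16
weights_gb = params * bytes_per_param / 1e9
print(f"weights: ~{weights_gb:.0f} GB")   # ~141 GB, plus a few GB for tokenizer/config
# A 160 GB volume is tight once download caches are counted; 200+ GB leaves headroom.
```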
Nickbkl
NickbklOP•2w ago
I'm not sure atm, are the 24 GB VRAM options fine? I think I'm going to use the suggested option (A6000, A40) and use AWQ quant.
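For a rough feel of what AWQ buys (a sketch, not from the thread): INT4 weights take about a quarter of the bf16 footprint, but the KV cache and activations still need room on top.
```python
# Rough VRAM estimate for AWQ INT4 weights (~0.5 bytes/param), ignoring KV cache/activations.
params = 70.6e9
weights_gb = params * 0.5 / 1e9      # ~35 GB just for the quantized weights
print(f"AWQ INT4 weights: ~{weights_gb:.0f} GB")
# A single 24 GB card can't hold that; a 48 GB A6000/A40 can, but only with a modest
# max_model_len, so sharding across two 48 GB GPUs is a safer starting point.
```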
nerdylive
nerdylive•2w ago
Oh, using quantization?
Nickbkl
NickbklOP•2w ago
I set up with the vLLM template without quant for now, using A6000/A40 and a 210 GB volume in Canada. I posted an initial request. How long will this take to initialize, roughly?
Nickbkl
NickbklOP•2w ago
(screenshot attached, no description)
nerdylive
nerdylive•2w ago
Check the logs, I'm not sure how long; it depends on the GPU setup you are using too, because I've never deployed that model with that specific quantization.
Nickbkl
NickbklOP•2w ago
It's been over 20 mins now, with no quantization.
nerdylive
nerdylive•2w ago
How is the log? Any signs of out of memory?
Nickbkl
NickbklOP•2w ago
5 items "endpointId":"ikmbyelhctz06j" "workerId":"2zeadzwvontveg" "level":"error" "message":"Uncaught exception | <class 'torch.OutOfMemoryError'>; CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacity of 44.45 GiB of which 444.62 MiB is free. Process 1865701 has 44.01 GiB memory in use. Of the allocated memory 43.71 GiB is allocated by PyTorch, and 1.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables); <traceback object at 0x7f0a94eff580>;" "dt":"2024-12-11 05:47:39.26656704"
CUDA semantics — PyTorch 2.5 documentation
A guide to torch.cuda, a PyTorch module to run CUDA operations
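The OOM is expected here: an unquantized 70B model in bf16 needs roughly 140 GB for weights alone, far more than the 44.45 GiB card in the log. A rough per-GPU estimate with vLLM tensor parallelism (a sketch, not from the thread):
```python
# Per-GPU weight memory when vLLM shards the model across GPUs with tensor parallelism.
params, bytes_per_param = 70.6e9, 2          # bf16
for tp in (1, 2, 4):
    per_gpu_gb = params * bytes_per_param / 1e9 / tp
    print(f"tensor_parallel_size={tp}: ~{per_gpu_gb:.0f} GB of weights per GPU")
# tp=1 -> ~141 GB (OOM on 48 GB), tp=2 -> ~71 GB (still OOM), tp=4 -> ~35 GB (fits,
# with room left for the KV cache). Hence the advice below to use 4 GPUs per worker.
```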
Nickbkl
NickbklOP•2w ago
Yes, something must be wrong with my setup.
nerdylive
nerdylive•2w ago
OOM. So you should add GPUs.
Nickbkl
NickbklOP•2w ago
Add workers? Ahh, ok! Is 3 enough?
nerdylive
nerdylive•2w ago
Use more GPUs per worker (in Edit Endpoint), not max workers.
Nickbkl
NickbklOP•2w ago
ok! thank you!
nerdylive
nerdylive•2w ago
Yup. I'll try to calc the VRAM requirement after this if I can. What quantization type do you use?
Nickbkl
NickbklOP•2w ago
Gonna try AWQ but I am a noob, gonna do some research. I'm using 5 GPUs per worker and it keeps exiting with error code 1.
nerdylive
nerdylive•2w ago
Use 4. And don't forget to change the tensor parallelism. Maybe just create a new endpoint from the quick deploy, because once you deploy it the config settings are converted into env variables, and if you don't change the env the new value won't be there. When creating a new endpoint, change the tensor parallelism to 4 or the number of GPUs that you use. It must be a multiple of 2 I guess, but it can't be 6.
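For reference, a minimal sketch of what the endpoint effectively runs underneath. The RunPod template exposes these as env variables whose exact names may differ, so treat the parameters below as the underlying vLLM API, and the model id as an assumption (the official HF repo):
```python
from vllm import LLM, SamplingParams

# Minimal sketch: tensor_parallel_size must match the number of GPUs attached to the worker.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumed official HF repo id
    tensor_parallel_size=4,                      # 4 x 48 GB GPUs for unquantized bf16
    gpu_memory_utilization=0.90,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```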
Nickbkl
NickbklOP•2w ago
ok I'll try creating a new endpoint
Nickbkl
NickbklOP•2w ago
(screenshot attached, no description)
Nickbkl
NickbklOP•2w ago
Can't seem to get any responses, and no errors in the logs 😦
nerdylive
nerdylive•2w ago
Is it still loading? How are the logs now? I also don't think that's how you use quantization: you need to have the quantized model first and then use that, not take an unquantized model and load it with quantization in vLLM.
nerdylive
nerdylive•2w ago
https://huggingface.co/ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4 Maybe check this out, I found it just now. I'm not sure if it's safe, but it seems to be a quantized version of it.
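If that community checkpoint is used, the already-quantized weights are what vLLM loads. A hedged sketch of the equivalent call (the template's env-variable names may differ; the repo is the one linked above, not an official Meta release, so vet it first):
```python
from vllm import LLM

# Sketch: load a pre-quantized AWQ checkpoint instead of quantizing on the fly.
llm = LLM(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    quantization="awq",          # tells vLLM the weights are AWQ INT4
    tensor_parallel_size=2,      # ~35 GB of weights shards comfortably across 2 x 48 GB
    max_model_len=4096,          # matches the limit mentioned below
)
```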
nerdylive
nerdylive•2w ago
4096 max len
(screenshot attached, no description)
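That 4096 max length matters for memory too: vLLM pre-allocates KV cache up to max_model_len, so capping it helps keep an AWQ 70B within a couple of 48 GB cards. A very rough per-sequence estimate, assuming Llama-70B-style dimensions (80 layers, 8 KV heads via GQA, head dim 128, fp16 cache):
```python
# Rough KV-cache size for one sequence at the full max_model_len (assumed Llama 70B dims).
layers, kv_heads, head_dim, bytes_fp16 = 80, 8, 128, 2
max_model_len = 4096
kv_gb = 2 * layers * kv_heads * head_dim * bytes_fp16 * max_model_len / 1e9  # K and V
print(f"~{kv_gb:.1f} GB KV cache per full-length sequence")   # ~1.3 GB
```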