Running Llama 3.3 70B using vLLM and a 160 GB network volume
Hi, I want to check if 160 GB is enough for Llama 70B, and whether I can use a smaller network volume
Or if I need a larger network volume
Maybe try 200 GB first.
ok thanks a lot
If I'm not wrong it's around 170+ GB, need to check how much the HF files take up
And which GPU are you going to use?
I'm not sure atm, are the 24 GB VRAM options fine?
I think I'm going to use the suggested option (A6000, A40) and use AWQ quant
Oh using quantization
I set up with the vLLM template without quant for now, using A6000/A40 and a 210 GB volume in Canada. I posted an initial request. How long will this take to initialize, roughly?
Check the logs, I'm not sure how long; it also depends on the GPU setup you are using
Because I've never deployed that model with that specific quantization
It's past 20 mins now
with no quantization
How are the logs?
Any signs of out of memory?
"endpointId":"ikmbyelhctz06j"
"workerId":"2zeadzwvontveg"
"level":"error"
"message":"Uncaught exception | <class 'torch.OutOfMemoryError'>; CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacity of 44.45 GiB of which 444.62 MiB is free. Process 1865701 has 44.01 GiB memory in use. Of the allocated memory 43.71 GiB is allocated by PyTorch, and 1.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables); <traceback object at 0x7f0a94eff580>;"
"dt":"2024-12-11 05:47:39.26656704"
yes
something must be wrong with my setup
OOM
So you add GPUs
add workers?
Ahh ok! Is 3 enough?
Use more GPUs per worker
In edit endpoint, not max workers
ok!
thank you!
Yup
I'll try to calculate the VRAM requirement after this if I can
What quantization type do you use?
Gonna try AWQ but I am a noob, gonna do some research
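A rough back-of-envelope for the weights alone (just a sketch: ~70B parameters at 2 bytes each in FP16, ~0.5 bytes each in AWQ INT4; KV cache and runtime overhead come on top):

```python
# Rough VRAM estimate for the weights only; KV cache, activations and the
# CUDA context add more on top, so treat these as lower bounds.
params = 70e9

fp16_gb = params * 2 / 1e9        # ~140 GB -> needs several 48 GB GPUs
awq_int4_gb = params * 0.5 / 1e9  # ~35 GB  -> can fit on one 48 GB GPU, tightly

print(f"FP16 weights:     ~{fp16_gb:.0f} GB")
print(f"AWQ INT4 weights: ~{awq_int4_gb:.0f} GB")
```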
Use 4
And don't forget to change the tensor parallelism
Maybe just create a new endpoint from the quick deploy, because once you deploy it the config settings get converted into env variables, and if you don't update those env variables the new config won't be there
When creating a new endpoint, change the tensor parallelism to 4, or to however many GPUs you use
Must be a power of 2 I guess, it can't be 6
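If it helps, this is roughly what that maps to with the plain vLLM Python API (a sketch only; the serverless template sets the same thing through its env variables, and I'm assuming the stock meta-llama/Llama-3.3-70B-Instruct repo here):

```python
from vllm import LLM

# tensor_parallel_size has to divide the model's attention head count
# (64 for Llama 70B) evenly, so 2, 4 or 8 work but 6 does not.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,  # match the number of GPUs per worker
)
```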
ok I'll try creating a new endpoint
Can't seem to get any responses
no errors in the logs 😦
Is it still loading?
How are the logs now?
I also don't think that's how you use quantization; you need to have the quantized model first and then use that, not take the unquantized model and load it with quantization in vLLM
https://huggingface.co/ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
Maybe check this out, I found it just now; I'm not sure if it's safe, but it seems to be a quantized version of it
4096 max len
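For reference, a minimal sketch of loading that pre-quantized checkpoint with the plain vLLM Python API (the serverless worker takes the equivalent settings as env variables; the repo name is the one linked above):

```python
from vllm import LLM, SamplingParams

# The checkpoint is already AWQ INT4, so vLLM just loads it as-is;
# quantization="awq" makes that explicit.
llm = LLM(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    quantization="awq",
    max_model_len=4096,      # a small context keeps the KV cache within budget
    tensor_parallel_size=1,  # ~35 GB of INT4 weights fits (tightly) on one 48 GB GPU
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```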