RunPod
•Created by Thibaud on 8/8/2024 in #⚡|serverless
can't run 70b
The GPU instance is an H100 (14 vCPU, 80 GB VRAM).
RunPod Serverless settings:
MODEL = neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
MAX_MODEL_LEN = 131072
GPU_MEMORY_UTILIZATION = 0.99
2024-08-14T21:49:06.860520596Z tokenizer_name_or_path: neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8, tokenizer_revision: None, trust_remote_code: False
engine.py :113 2024-08-14 20:33:15,350 Error initializing vLLM engine: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (20736). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
I also tried:
KV_CACHE_DTYPE = fp8
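For context, a back-of-the-envelope calculation using the published Llama-3.1-70B shape (80 layers, 8 KV heads via GQA, head dim 128) shows why the cache comes up short: the FP8 weights alone occupy roughly 70 GB of the 80 GB card, so only a few GB are left for the KV cache. The fp16-KV-cache assumption below is mine; treat the numbers as rough.

```python
# Rough KV-cache sizing sketch for Llama-3.1-70B (80 layers, 8 KV heads, head dim 128).
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2                                  # fp16/bf16 KV cache (assumed default)

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V
print(f"KV cache per token: ~{kv_bytes_per_token / 1024:.0f} KiB")       # ~320 KiB

# The 20736-token limit from the error corresponds to roughly:
print(f"20736 tokens  -> ~{20736 * kv_bytes_per_token / 1e9:.1f} GB")    # ~6.8 GB

# A full 131072-token context would need:
print(f"131072 tokens -> ~{131072 * kv_bytes_per_token / 1e9:.1f} GB")   # ~43 GB
```

So even with GPU_MEMORY_UTILIZATION at 0.99, a single 80 GB H100 can't hold ~70 GB of weights plus ~43 GB of fp16 KV cache; an fp8 KV cache halves that, but it still doesn't fit at the full 131072-token length.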
I am using Neuralmagic's FP8 quantized model: https://huggingface.co/neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
Hi @Thibaud - I am having the same issue with max seq len (131072) being larger than the max number of tokens that fit in the KV cache... I'm on an H100 with 16 vCPU and 80 GB VRAM on RunPod, using the vLLM worker. I'm curious what you did to solve it? Thanks!!! 🙂
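Not sure what Thibaud ended up doing, but the error itself names the two knobs: lower MAX_MODEL_LEN or free up more memory for the KV cache. A minimal offline-vLLM sketch of the equivalent settings (the 8192 context length and 0.95 utilization are example values I picked, not known-good worker settings):

```python
from vllm import LLM, SamplingParams

# Example values only: a context length the single-GPU KV cache can hold,
# plus an fp8 KV cache to roughly double the token capacity.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8",
    max_model_len=8192,              # instead of the full 131072
    gpu_memory_utilization=0.95,
    kv_cache_dtype="fp8",            # halves KV-cache memory vs fp16
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

On the serverless worker these map to the MAX_MODEL_LEN, GPU_MEMORY_UTILIZATION, and KV_CACHE_DTYPE settings shown above; lowering MAX_MODEL_LEN is the change the error is actually asking for.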