RunPod
•Created by Thibaud on 8/8/2024 in #⚡|serverless
can't run 70b
The GPU instance is an H100 (14 vCPU, 80 GB VRAM).
RunPod Serverless settings:
MODEL = neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
MAX_MODEL_LEN = 131072
GPU_MEMORY_UTILIZATION = 0.99
2024-08-14T21:49:06.860520596Z tokenizer_name_or_path: neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8, tokenizer_revision: None, trust_remote_code: False
engine.py :113 2024-08-14 20:33:15,350 Error initializing vLLM engine: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (20736). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
I also tried:
KV_CACHE_DTYPE = fp8
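For context, a back-of-the-envelope calculation using the published Llama-3.1-70B shape (80 layers, 8 KV heads via GQA, head dim 128) shows why the cache comes up short: the FP8 weights alone occupy roughly 70 GB of the 80 GB card, so only a few GB are left for the KV cache. The fp16-KV-cache assumption below is mine; treat the numbers as rough.

```python
# Rough KV-cache sizing sketch for Llama-3.1-70B (80 layers, 8 KV heads, head dim 128).
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2                                  # fp16/bf16 KV cache (assumed default)

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V
print(f"KV cache per token: ~{kv_bytes_per_token / 1024:.0f} KiB")       # ~320 KiB

# The 20736-token limit from the error corresponds to roughly:
print(f"20736 tokens  -> ~{20736 * kv_bytes_per_token / 1e9:.1f} GB")    # ~6.8 GB

# A full 131072-token context would need:
print(f"131072 tokens -> ~{131072 * kv_bytes_per_token / 1e9:.1f} GB")   # ~43 GB
```

So even with GPU_MEMORY_UTILIZATION at 0.99, a single 80 GB H100 can't hold ~70 GB of weights plus ~43 GB of fp16 KV cache; an fp8 KV cache halves that, but it still doesn't fit at the full 131072-token length.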
I am using Neuralmagic's FP8 quantized model: https://huggingface.co/neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
Hi @Thibaud - I am having the same issue with max seq len (131072) being larger than the max number of tokens that fit in the KV cache... I'm on an H100 with 16 vCPU and 80 GB VRAM on RunPod, using the vLLM worker. I'm curious what you did to solve it? Thanks!!! 🙂
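Not sure what Thibaud ended up doing, but the error itself names the two knobs: lower MAX_MODEL_LEN or free up more memory for the KV cache. A minimal offline-vLLM sketch of the equivalent settings (the 8192 context length and 0.95 utilization are example values I picked, not known-good worker settings):

```python
from vllm import LLM, SamplingParams

# Example values only: a context length the single-GPU KV cache can hold,
# plus an fp8 KV cache to roughly double the token capacity.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8",
    max_model_len=8192,              # instead of the full 131072
    gpu_memory_utilization=0.95,
    kv_cache_dtype="fp8",            # halves KV-cache memory vs fp16
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

On the serverless worker these map to the MAX_MODEL_LEN, GPU_MEMORY_UTILIZATION, and KV_CACHE_DTYPE settings shown above; lowering MAX_MODEL_LEN is the change the error is actually asking for.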