Can't set up serverless vLLM for the model.

Please help solve the problem. When trying to make a request, these errors are logged:

2024-04-24 18:25:10.089 [hrkxm58yz2r504] [info] INFO 04-24 15:25:10 model_runner.py:680] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
2024-04-24 18:25:10.089 [hrkxm58yz2r504] [info] INFO 04-24 15:25:10 model_runner.py:676] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-04-24 18:25:08.884 [hrkxm58yz2r504] [info] INFO 04-24 15:25:08 llm_engine.py:337] # GPU blocks: 1199, # CPU blocks: 327

Configuration:
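The options named in the log above map onto vLLM engine arguments. A minimal, hedged sketch of dialing memory usage down with the plain vLLM Python API; the RunPod worker image configures these through environment variables instead, so check its README for the exact names:

```python
# Hedged sketch: the memory knobs mentioned in the log, expressed as vLLM
# engine arguments (plain vLLM Python API, not the RunPod worker's config).
from vllm import LLM

llm = LLM(
    model="TheBloke/MythoMax-L2-13B-GPTQ",
    quantization="gptq",
    enforce_eager=True,            # skip CUDA graph capture (the log's extra 1~3 GiB)
    gpu_memory_utilization=0.85,   # default is 0.90
    max_num_seqs=64,               # fewer concurrent sequences -> smaller KV cache
)
```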
Alpay Ariyak
What size is your GPU?
haris
cc: @Kostya ^^^
Kostya | Matrix One
@Alpay Ariyak @haris 24 GB GPU
nerdylive
I don't see an error there, BTW; those look like just info messages for the feature that's being used.
Kostya | Matrix One
@nerdylive @haris Could you please tell me if this model (https://huggingface.co/TheBloke/MythoMax-L2-13B-GPTQ) is compatible?
nerdylive
maybe it is
Alpay Ariyak
It is compatible. Like @nerdylive said, there's no error message, just warnings.
Kostya | Matrix One
This is very strange, because this model (https://huggingface.co/solidrust/Meta-Llama-3-8B-Instruct-hf-AWQ) works. What is the difference between them and how can I get MythoMax-L2-13B-GPTQ to work?
nerdylive
What's making it not work? Any errors?
digigoblin
AWQ and GPTQ are two different quantization methods, so you can't really compare an AWQ model with a GPTQ one. If you want a meaningful comparison, compare another GPTQ model against GPTQ rather than AWQ vs GPTQ.
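If you want to confirm which method a given repo actually uses before comparing, the quantization method is recorded in the model's config.json. A small, hedged sketch (the exact shape of quantization_config can vary across transformers versions, hence the dual handling):

```python
# Hedged sketch: read the quantization method recorded in each repo's config.json.
from transformers import AutoConfig

for repo in ("TheBloke/MythoMax-L2-13B-GPTQ",
             "solidrust/Meta-Llama-3-8B-Instruct-hf-AWQ"):
    cfg = AutoConfig.from_pretrained(repo)
    qcfg = getattr(cfg, "quantization_config", None)
    # quantization_config may be a plain dict or a config object depending on version
    method = qcfg.get("quant_method") if isinstance(qcfg, dict) else getattr(qcfg, "quant_method", None)
    print(repo, "->", method)   # expected: "gptq" and "awq" respectively
```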
Kostya | Matrix One
@nerdylive There are no errors in the logs; only informational messages are being displayed.
nerdylive
Click on the running worker, then there will be a log button.
digigoblin
Ran out of VRAM
Kostya | Matrix One
Could you please tell me how to increase VRAM?
digigoblin
Use the 48 GB tier instead of the 24 GB one.
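For context on why a 4-bit 13B model can still be tight on 24 GB, here is a rough, hedged back-of-the-envelope; the numbers are approximations and the pre-allocation behaviour described is standard vLLM, not anything RunPod-specific:

```python
# Rough, hedged estimate of where a 24 GB card's memory goes (approximate numbers only).
params = 13e9                       # MythoMax-L2-13B parameter count
weights_gb = params * 0.5 / 1e9     # ~6.5 GB of 4-bit GPTQ weights
cuda_graphs_gb = 3                  # upper bound quoted in the worker log above
# vLLM also pre-allocates gpu_memory_utilization (default 0.9, i.e. ~21.6 GB of a
# 24 GB card) for weights + KV cache before CUDA graph capture, so the extra
# 1~3 GiB for graphs can push a 24 GB worker over the limit; 48 GB leaves headroom.
print(f"weights ~= {weights_gb:.1f} GB, CUDA graphs up to {cuda_graphs_gb} GB extra")
```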
Kostya | Matrix One
Thank you very much, it worked. I have another question. We use two fields in requests to the /openai/v1/chat/completions route: messages and prompt. According to the documentation (https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#chat-completions) and the API response, we cannot use these two fields simultaneously. Is it really impossible to use both fields in the same request, or am I doing something wrong?
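For what it's worth, the chat completions route mirrors the upstream OpenAI API, where messages is the input to /chat/completions and prompt belongs to the separate /completions endpoint, so a single chat request can't carry both. A minimal, hedged sketch that sends only messages; the base URL pattern and environment variable names here are assumptions, so check the worker-vllm README for the exact values for your endpoint:

```python
# Hedged sketch: chat completion request using only `messages` (no `prompt`).
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],  # assumed env var name
    base_url=f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/openai/v1",
)

resp = client.chat.completions.create(
    model="TheBloke/MythoMax-L2-13B-GPTQ",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        # Anything you would have sent as `prompt` goes into a user message instead:
        {"role": "user", "content": "Write a short greeting."},
    ],
)
print(resp.choices[0].message.content)
```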