I can't set up serverless vLLM for the model. Please help me solve the problem. When I try to make a request, these are logged:
2024-04-24 18:25:10.089 [hrkxm58yz2r504] [info]
INFO 04-24 15:25:10 model_runner.py:680] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.

2024-04-24 18:25:10.089 [hrkxm58yz2r504] [info]
INFO 04-24 15:25:10 model_runner.py:676] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.

2024-04-24 18:25:08.884 [hrkxm58yz2r504] [info]
INFO 04-24 15:25:08 llm_engine.py:337] # GPU blocks: 1199, # CPU blocks: 327
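The knobs those log lines mention map to vLLM engine arguments. A minimal sketch, assuming plain vLLM's Python API rather than the RunPod worker image's own configuration; the values shown are only illustrative:

```python
# Sketch only: the engine arguments the log messages refer to.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/MythoMax-L2-13B-GPTQ",
    quantization="gptq",          # model is GPTQ-quantized
    gpu_memory_utilization=0.90,  # lower this if you run out of memory
    max_num_seqs=16,              # fewer concurrent sequences -> smaller KV cache
    enforce_eager=True,           # skip CUDA graph capture
)

out = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

Setting enforce_eager=True skips CUDA graph capture entirely, trading some throughput for the extra 1~3 GiB per GPU that the log warns about.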
What size is your GPU?
cc: @Kostya ^^^
@Alpay Ariyak @haris 24GB GPU
I don't see an error there, BTW; those look like just info logs for the feature being used there.
@nerdylive @haris Could you please tell me if this model (https://huggingface.co/TheBloke/MythoMax-L2-13B-GPTQ) is compatible?
maybe it is
It is compatible
Like @nerdylive said, there’s no error message, just warnings
This is very strange, because this model (https://huggingface.co/solidrust/Meta-Llama-3-8B-Instruct-hf-AWQ) works. What is the difference between them and how can I get MythoMax-L2-13B-GPTQ to work?
What's making it not work?
Any errors?
AWQ and GPTQ are two different quantization methods. You can't really compare an AWQ model with a GPTQ one; compare another GPTQ model against your GPTQ model rather than AWQ vs GPTQ.
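To illustrate that point, a small sketch assuming plain vLLM's Python API, where the quantization backend is chosen per model (vLLM can usually also infer it from the model's own quantization config):

```python
# Sketch only: each model ships with its own quantization scheme, so the
# matching backend has to be used when it is loaded.
from vllm import LLM

llm_gptq = LLM(model="TheBloke/MythoMax-L2-13B-GPTQ", quantization="gptq")
llm_awq = LLM(model="solidrust/Meta-Llama-3-8B-Instruct-hf-AWQ", quantization="awq")
# (In practice you would load one or the other, not both at once.)
```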
@nerdylive There are no errors in the logs, only informational logs are being displayed.
Press the running worker
Then there will be a log button
Ran out of VRAM
Could you please tell me how to increase VRAM?
Use 48GB instead of 24GB tier
Thank you very much, it worked. I have another question. We use two fields in the request to the /openai/v1/chat/completions route: messages and prompt. According to the documentation (https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#chat-completions) and the API response, we cannot use these two fields simultaneously. Is it really not possible to use both fields in one request, or am I doing something wrong?
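For reference, a sketch of a request that uses only messages against the chat-completions route; in the standard OpenAI schema, prompt belongs to the plain /completions route instead. The base_url pattern, endpoint ID, and key below are placeholders/assumptions:

```python
# Sketch using the official openai client against the worker's OpenAI-compatible route.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",  # placeholder
)

# /chat/completions takes `messages`; a plain-text `prompt` would go to
# client.completions.create(...) instead, not into the same request.
resp = client.chat.completions.create(
    model="TheBloke/MythoMax-L2-13B-GPTQ",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```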