Everything is crashing and burning today [SOLVED] + DEV image with beta 1.0.0preview feedback
Today's testing on Serverless vLLM has been a very bad experience. It is extremely unstable.
Out of the blue we started getting the error message:
We haven't changed the endpoint for days, and it was working. Why is this now happening? The only workaround is to change the gpu-memory-utilization from 0.95 to 0.99.
After that first hurdle it was working for 10 min until it started failing again. Jobs are waiting in the queue and doing nothing. (see screenshot)
I don't know if this is because of a specific data centre failing, but how would I know?
I see errors in the log tab like:
This doesn't seem stable enough for production. I don't understand.
Hi,
In terms of the size error, it is a result of your configuration, which has to be tailored to the amount of VRAM on your GPU. The documentation details that if you're running into OOM errors, you should first reduce max_model_length (usually the biggest cause of OOM on 24 GB GPUs, when the model has a 32k context by default) if your use case allows it, then increase the gpu memory utilization if that doesn't help or isn't an option. This is specific to vLLM, and not our worker or serverless.
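For illustration, here is what those two knobs look like in vLLM's own Python API (the serverless worker exposes them through environment variables instead; the model name and values below are placeholders, not a recommendation):
```python
from vllm import LLM

# Placeholder values: cap the context window first, and only raise the
# KV-cache memory budget if the model still does not fit.
llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",
    quantization="awq",
    max_model_len=4096,            # reduce this if your use case allows shorter contexts
    gpu_memory_utilization=0.95,   # raise toward 0.99 only if OOM persists
)
```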
For the second issue with server disconnect, that's a valid concern - have you experienced it since? Passing it on
@flash-singh check the disconnected issue out
Thanks. I'm just testing it again with the new image.
I'm using a 48 GB GPU with only one worker, on CUDA 12.1+.
The LLM I'm using is this: casperhansen/llama-3-70b-instruct-awq
The context of this model is only 8192. Based on your suggestion I have now set it in the environment as well.
But I get this error:
09T08:45:01.250406184Z [rank0]: ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (6864). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
----
Additionally, when I first started your new image with KV_CACHE_DTYPE set to fp8_e5m2, I also got this error:
2024-05-09T08:37:27.009451601Z ValueError: Unknown kv cache dtype: fp8_e5m2
----
I think the main question is: why does a model on a worker that was working before suddenly stop working and hit OOM?
Could it be that certain LLM requests require more memory than is allocated? What is the best setting for the worker to avoid OOM: 0.95 utilisation or 1.00 utilisation?
----
I removed model_length. Now the config looks like this:
And it is still failing:
09T08:54:12.301107051Z [rank0]: ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (6896). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
Why?
Now I'm trying at 0.99 utilisation, with KV_CACHE_DTYPE removed and max model length removed too.
Now I'm getting the attached error.
There is always an issue...
Now after another restart it's working.
I find the vLLM worker very unpredictable, as you can see from the issues above. You can see the logs for yourself on vllm-cekmcjiplskwpj. I have a feeling there could be differences between the A40 and RTX A6000 and how well they handle vLLM, even though both come with 48 GB.
[rank0]: RuntimeError: An error occurred while downloading using hf_transfer. Consider disabling HF_HUB_ENABLE_HF_TRANSFER for better error handling.
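As a side note, the error's own suggestion - disabling hf_transfer - looks roughly like this with huggingface_hub (a sketch only; the worker normally performs the download itself):
```python
import os

# Workaround suggested by the error message: disable the hf_transfer
# fast-download path so failures surface with ordinary, more descriptive errors.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"

from huggingface_hub import snapshot_download

snapshot_download("casperhansen/llama-3-70b-instruct-awq")
```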
Download error?
One error among so many...
Keep scrolling up 😄
Ooh wait sorry I'm on phone rn 🙂
Maybe this is an HF download error - try checking the docs to see if they cover it
Thanks! Will fix those
[for context for new readers of this thread, these issues are regarding the DEV image, not the stable image]
Once again though, the memory error is a vLLM issue, not a worker issue - in particular, your configuration isn't feasible at your GPU size; you need to decrease the max model length
Hf Transfer error fixed @nerdylive @houmie
My suggestion was to lower the max_model_len; vLLM already sets it to the model's full context length (8192 in this case) by default. The vLLM error states that with the gpu memory utilization you had, the max context length you could set is 6864.
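Concretely, that suggestion amounts to overriding the default with an explicit value at or below the reported capacity - a sketch against vLLM's Python API, with an arbitrary example value:
```python
from vllm import LLM

# Sketch: with the earlier gpu_memory_utilization the KV cache held at most
# 6864 tokens, so the context must be capped at or below that (or the
# utilization raised until the full 8192 fits).
llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",
    quantization="awq",
    max_model_len=6144,  # any value <= 6864 would satisfy the check
)
```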
For the kv cache error in the beta image, it seems they renamed fp8_e5m2 to fp8 - fixed that
Ah, it's from a template error? Great it's fixed
Yes, requirements.txt had hf_transfer instead of hf-transfer
Hahaha alright
1) Thanks for fixing fp8_e5m2 to fp8. I will test it later today to confirm.
2) Also good spot about hf-transfer. I will test again to see.
3) Can you please elaborate on max_model_len? Why is vLLM suggesting to set it to 6864? Maybe there is a configuration issue somewhere, because the context supported by Llama 3 is 8192.
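For what it's worth, a rough back-of-the-envelope sketch of where a figure like 6864 can come from, assuming Llama-3-70B's published shape (80 layers, 8 KV heads, head dim 128) and an fp16 KV cache; the exact number depends on how much VRAM is left after the weights are loaded:
```python
# Assumptions (not from the thread): Llama-3-70B has 80 layers, 8 KV heads and
# a head dim of 128; the KV cache is stored in fp16 (2 bytes per element).
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 80, 8, 128, 2

# Each token stores one K and one V vector per layer.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES
print(f"KV cache per token: {kv_bytes_per_token / 2**20:.2f} MiB")       # ~0.31 MiB

# vLLM hands whatever VRAM remains after loading the weights (within the
# gpu_memory_utilization budget) to the KV cache, so:
print(f"6864 tokens need {6864 * kv_bytes_per_token / 2**30:.2f} GiB")   # ~2.1 GiB
print(f"8192 tokens need {8192 * kv_bytes_per_token / 2**30:.2f} GiB")   # ~2.5 GiB
# If only ~2.1 GiB was left over, the full 8192-token context cannot fit,
# hence the suggestion to cap max_model_len or raise gpu_memory_utilization.
```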