RunPod9mo ago
houmie

Everything is crashing and burning today [SOLVED] + DEV image with beta 1.0.0preview feedback

Today the testing on Serverless vLLM has been a very bad experience. It is extremely unstable. Out of the blue we started getting the error message:
The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (7456). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
We haven't changed the endpoint for days, and it was working. Why is this happening now? The only workaround is to change gpu-memory-utilization from 0.95 to 0.99. After that first hurdle it worked for 10 minutes, then started failing again. Jobs are waiting in the queue and doing nothing (see screenshot). I don't know if this is because a specific data centre is failing, but how would I know? I see errors like these in the log tab:

2024-05-06 15:12:19.720
[d59ufubzugwuiz]
[error]
Failed to get job, status code: 502

2024-05-06 15:12:13.482
[d59ufubzugwuiz]
[error]
Failed to get job, status code: 502

[error]
Traceback: Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_job.py", line 55, in get_job async with session.get(_job_get_url()) as response: File "/usr/local/lib/python3.10/dist-packages/aiohttp/client.py", line 1194, in __aenter__ self._resp = await self._coro File "/usr/local/lib/python3.10/dist-packages/aiohttp/client.py", line 605, in _request await resp.start(conn) File "/usr/local/lib/python3.10/dist-packages/aiohttp/client_reqrep.py", line 966, in start message, payload = await protocol.read() # type: ignore[union-attr] File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 622, in read await self._waiter aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected

2024-05-06 15:01:25.855
[d59ufubzugwuiz]
[error]
Failed to get job. | Error Type: ServerDisconnectedError | Error Message: Server disconnected

This doesn't seem stable enough for production. I don't understand.
14 Replies
Alpay Ariyak
Alpay Ariyak9mo ago
Hi. The size error is a result of your configuration, which must match the amount of VRAM on your GPU. The documentation explains that if you run into OOM errors, you should first reduce max_model_len if your use case allows it (this is usually the biggest cause of OOM on 24 GB GPUs when the model has a 32k context by default), and then increase the GPU memory utilization if that doesn't help or isn't an option. This is specific to vLLM, not to our worker or to serverless.
The server disconnects in your second issue are a valid concern. Have you experienced them since? Passing it on: @flash-singh, please check the disconnect issue.
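To make the relationship between VRAM and max_model_len concrete, here is a rough sizing sketch. The model dimensions (80 layers, 8 KV heads, head size 128) are assumptions taken from the public Llama 3 70B config, the 16-token block size is assumed to be vLLM's default, and the helper names are hypothetical. It only estimates the KV cache; it does not reproduce vLLM's actual memory profiling.

```python
# Rough KV-cache sizing sketch. Model dimensions and block size are assumptions
# (public Llama 3 70B config, assumed vLLM default block size), not values
# confirmed in this thread. Helper names are hypothetical.

def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes needed to cache keys and values for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 2 = key + value


def max_cached_tokens(kv_budget_bytes: int, bytes_per_token: int, block_size: int = 16) -> int:
    """Tokens that fit in the budget, rounded down to whole cache blocks."""
    blocks = kv_budget_bytes // (bytes_per_token * block_size)
    return blocks * block_size


# 16-bit KV cache (2 bytes per value).
PER_TOKEN_FP16 = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2)

# Memory a full 8192-token context would need under these assumptions.
NEEDED_FOR_8192 = 8192 * PER_TOKEN_FP16

# Budget implied by the capacity vLLM reported in the error (6864 tokens).
REPORTED_BUDGET = 6864 * PER_TOKEN_FP16

if __name__ == "__main__":
    print(f"per token: {PER_TOKEN_FP16 / 1024:.0f} KiB")
    print(f"needed for 8192 tokens: {NEEDED_FOR_8192 / 2**30:.2f} GiB")
    print(f"budget implied by 6864 tokens: {REPORTED_BUDGET / 2**30:.2f} GiB")
```

Under these assumptions a full 8192-token context needs about 2.5 GiB of KV cache, while a reported capacity of 6864 tokens corresponds to roughly 2.1 GiB left over after weights and activations. A few hundred MiB of difference in free memory between workers would be enough to move the limit above or below 8192.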
houmie
houmieOP9mo ago
Thanks. I'm just testing it again with the new image. I'm using a 48 GB GPU with one worker only, on CUDA 12.1+. The LLM I'm using is casperhansen/llama-3-70b-instruct-awq. The context of this model is only 8192. Based on your suggestion I have now set it in the environment as well, but I get this error:
09T08:45:01.250406184Z [rank0]: ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (6864). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
----
Additionally, when I initially started your new image with KV_CACHE_DTYPE set to fp8_e5m2, I also got this error:
2024-05-09T08:37:27.009451601Z ValueError: Unknown kv cache dtype: fp8_e5m2
----
I think the main question is why a model on a worker that was working before suddenly stops working and goes OOM. Could it be that certain LLM requests require more memory than what is allocated? What is the best setting for the worker to avoid OOM: 0.95 utilisation or 1.00 utilisation?
houmie
houmieOP9mo ago
---- I removed model_length. Now the config looks like this:
No description
houmie
houmieOP9mo ago
And it is still failing:
09T08:54:12.301107051Z [rank0]: ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (6896). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
Why? Now I'm trying at 0.99 utilisation, with KV_CACHE_DTYPE removed and max model length removed too.
houmie
houmieOP9mo ago
Now I'm getting this error attached.
houmie
houmieOP9mo ago
There is always an issue... Now after another restart it's working.
I find the vLLM worker very unpredictable, as you can see from the issues above. You can see the logs for yourself on vllm-cekmcjiplskwpj. I have a feeling there could be differences between the A40 and the RTX A6000 in how well they handle vLLM, even though both come with 48 GB.
nerdylive
nerdylive9mo ago
[rank0]: RuntimeError: An error occurred while downloading using hf_transfer. Consider disabling HF_HUB_ENABLE_HF_TRANSFER for better error handling.
Download error?
houmie
houmieOP9mo ago
One error among so many.... Keep scrolling up 😄
nerdylive
nerdylive9mo ago
Ooh wait sorry I'm on phone rn 🙂 Maybe this is hf download error, try browsing and checking the docs if they have that
Alpay Ariyak
Alpay Ariyak9mo ago
Thanks! Will fix those. [For context for new readers of this thread: these issues concern the DEV image, not the stable image.]
Once again though, the memory error is a vLLM issue, not a worker issue: your configuration is not feasible on your GPU size, so you need to decrease the max model length.
The hf_transfer error is fixed, @nerdylive.
@houmie My suggestion was to lower max_model_len; vLLM already sets it to the model's full context length (8192 in this case) by default. The vLLM error states that, with the GPU memory utilization you had, the maximum context length you could set is 6864.
As for the KV cache error in the beta image, it seems they renamed fp8_e5m2 to fp8. Fixed that.
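The capacity figure can be read straight out of the vLLM error text. As a minimal sketch, the helper below (a hypothetical name, not part of vLLM or the RunPod worker) extracts both numbers from the message quoted in this thread and returns the largest max_model_len that would have fit on that particular start.

```python
import re
from typing import Optional

# Matches the wording of the vLLM error quoted in this thread; other vLLM
# versions may phrase it differently, so treat the pattern as an assumption.
KV_ERROR = re.compile(
    r"max seq len \((\d+)\) is larger than the maximum number of tokens "
    r"that can be stored in KV cache \((\d+)\)"
)


def suggest_max_model_len(error_text: str) -> Optional[int]:
    """Return the largest context length that fit, or None if no match."""
    match = KV_ERROR.search(error_text)
    if match is None:
        return None
    requested = int(match.group(1))  # context length vLLM tried to reserve
    capacity = int(match.group(2))   # tokens the KV cache could actually hold
    return min(requested, capacity)


if __name__ == "__main__":
    message = (
        "ValueError: The model's max seq len (8192) is larger than the maximum "
        "number of tokens that can be stored in KV cache (6864). Try increasing "
        "gpu_memory_utilization or decreasing max_model_len when initializing the engine."
    )
    print(suggest_max_model_len(message))
```

Because the reported capacity varied between starts in this thread (7456, 6864, 6896), a value with some headroom below the smallest observed figure is likely a safer choice than the exact number from a single run.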
nerdylive
nerdylive9mo ago
Ah it's from a template error? Great it's fixed
Alpay Ariyak
Alpay Ariyak9mo ago
Yes, requirements.txt had hf_transfer instead of hf-transfer
nerdylive
nerdylive9mo ago
Hahaha alright
houmie
houmieOP9mo ago
1) Thanks for fixing fp8_e5m2 to fp8. I will test it later today to confirm.
2) Also good spot about hf-transfer. I will test again to see.
3) Can you please elaborate on max_model_len? Why is vLLM suggesting to set it to 6864? Maybe there is a configuration issue somewhere, because the context supported by Llama 3 is 8192.
