RunPod2mo ago
houmie

Everything is crashing and burning today [SOLVED] + DEV image with beta 1.0.0preview feedback

Today's testing on Serverless vLLM has been a very bad experience - it is extremely unstable. Out of the blue we started getting this error message:
The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (7456). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
We haven't changed the endpoint for days, and it was working. Why is this now happening? The only workaround is to change the gpu-memory-utilization from 0.95 to 0.99. After that first hurdle it worked for 10 minutes until it started failing again. Jobs are waiting in the queue and doing nothing (see screenshot). I don't know if this is because a specific data centre is failing, but how would I know? I see errors in the log tab like:

2024-05-06 15:12:19.720
[d59ufubzugwuiz]
[error]
Failed to get job, status code: 502

2024-05-06 15:12:13.482
[d59ufubzugwuiz]
[error]
Failed to get job, status code: 502

[error]
Traceback: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_job.py", line 55, in get_job
    async with session.get(_job_get_url()) as response:
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/client.py", line 1194, in __aenter__
    self._resp = await self._coro
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/client.py", line 605, in _request
    await resp.start(conn)
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/client_reqrep.py", line 966, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 622, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected

2024-05-06 15:01:25.855
[d59ufubzugwuiz]
[error]
Failed to get job. | Error Type: ServerDisconnectedError | Error Message: Server disconnected

This doesn't seem stable enough for production. I don't understand.
(two screenshots attached)
14 Replies
Alpay Ariyak
Alpay Ariyak2mo ago
Hi,

In terms of the size error, it is a result of your configuration, which must be matched to the amount of VRAM you have on your GPU. The documentation explains that if you're running into OOM errors, you should first reduce max_model_len (usually the biggest cause of OOM on 24 GB GPUs, where a model may have a 32k context by default) if your use case allows it, and then increase gpu_memory_utilization if that doesn't help or isn't an option. This is specific to vLLM, not to our worker or serverless.

For the second issue with the server disconnect, that's a valid concern - have you experienced it since? Passing it on. @flash-singh check the disconnected issue out
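For concreteness, the two knobs that error message refers to map directly onto vLLM's engine arguments. A minimal sketch with plain vLLM (outside the worker - the values here are illustrative only, and on the serverless worker the same settings are exposed as environment variables such as MAX_MODEL_LEN and GPU_MEMORY_UTILIZATION in this thread's setup):

```python
# Sketch only - values are illustrative, not a recommendation for this endpoint.
from vllm import LLM

llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",
    quantization="awq",
    # Cap the context window if the full 8192 doesn't fit in the KV cache;
    # reducing this is usually the first lever to pull.
    max_model_len=6144,
    # Fraction of VRAM vLLM may claim for weights + activations + KV cache.
    gpu_memory_utilization=0.95,
)
```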
houmie
houmie2mo ago
Thanks. I'm just testing it again with the new image. I'm using a 48 GB GPU with one worker only, on CUDA 12.1+. The LLM I'm using is casperhansen/llama-3-70b-instruct-awq; its context is only 8192. Based on your suggestion I have now set it in the environment as well, but I get this error:

09T08:45:01.250406184Z [rank0]: ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (6864). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

Additionally, when I first started your new image with KV_CACHE_DTYPE set to fp8_e5m2, I also got this error:

2024-05-09T08:37:27.009451601Z ValueError: Unknown kv cache dtype: fp8_e5m2

I think the main issue is: why does a model on a worker that was working before suddenly stop working and go OOM? Could it be that certain LLM requests require more memory than what is allocated? What is the best setting for the worker to avoid OOM: 0.95 utilisation or 1.00 utilisation?
houmie
houmie2mo ago
I removed the max model length setting. Now the config looks like this:
(screenshot of the endpoint configuration)
houmie
houmie2mo ago
And it is still failing:

09T08:54:12.301107051Z [rank0]: ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (6896). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

Why? Now I'm trying 0.99 utilisation, with KV_CACHE_DTYPE removed and the max model length removed too.
houmie
houmie2mo ago
Now I'm getting this error attached.
houmie
houmie2mo ago
There is always an issue... Now after another restart it's working.
I find the vLLM worker very unpredictable, as you can see from the issues above. You can see the logs for yourself on vllm-cekmcjiplskwpj. I have a feeling there could be differences between the A40 and RTX A6000 in how well they handle vLLM, even though both come with 48 GB.
nerdylive
nerdylive2mo ago
[rank0]: RuntimeError: An error occurred while downloading using hf_transfer. Consider disabling HF_HUB_ENABLE_HF_TRANSFER for better error handling.

A download error?
houmie
houmie2mo ago
One error among so many.... Keep scrolling up 😄
nerdylive
nerdylive2mo ago
Ooh wait, sorry, I'm on my phone rn 🙂 Maybe this is an HF download error - try browsing and checking the docs to see if they cover it.
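For what it's worth, the error message itself suggests the quickest way to rule hf_transfer out: turn it off and let huggingface_hub fall back to its plain downloader. A rough sketch, assuming you can set environment variables on the worker or image (the repo name below is just the one from this thread):

```python
# Sketch only: disable the hf_transfer fast path so huggingface_hub uses its
# default downloader, which tends to give clearer errors on flaky connections.
import os

# Must be set before huggingface_hub is imported, since the flag is read at import time.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"

from huggingface_hub import snapshot_download

snapshot_download("casperhansen/llama-3-70b-instruct-awq")
```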
Alpay Ariyak
Alpay Ariyak2mo ago
Thanks! Will fix those. [For context for new readers of this thread: these issues are regarding the DEV image, not the stable image.]

Once again though, the memory error is a vLLM issue, not a worker issue - specifically, your configuration is not feasible on your GPU size, and you need to decrease the max model length.

The HF Transfer error is fixed @nerdylive @houmie

My suggestion was to lower max_model_len; vLLM already sets it to the model's full context length (8192 in this case) by default. The vLLM error states that with the gpu memory utilization you had, the max context length you could set is 6864.

For the KV cache error in the beta image, it seems they renamed fp8_e5m2 to fp8 - fixed that too.
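As a rough illustration of where a number like 6864 can come from (all figures below are ballpark assumptions, not measurements from this endpoint): vLLM first loads the weights, reserves some working memory, and whatever is left inside the gpu_memory_utilization budget becomes KV cache, which for Llama-3-70B costs roughly 320 KiB per token in fp16.

```python
# Back-of-envelope estimate only - weight size and overhead are rough guesses.
GIB = 1024**3

vram        = 48 * GIB    # A40 / RTX A6000
utilization = 0.95        # gpu_memory_utilization
weights     = 37 * GIB    # ~4-bit AWQ weights for a 70B model (approximate)
overhead    = 6.5 * GIB   # activations, CUDA graphs, fragmentation (rough guess)

# Per-token KV cache for Llama-3-70B: 80 layers, 8 KV heads (GQA),
# head_dim 128, fp16 keys and values (2 bytes each).
bytes_per_token = 80 * 2 * 8 * 128 * 2          # ~320 KiB per token

kv_budget = vram * utilization - weights - overhead
print(int(kv_budget // bytes_per_token))        # ~6.9k tokens, i.e. short of 8192
```

Small shifts in that non-KV overhead between restarts or GPU models would also explain why the reported capacity wobbles between values like 6864, 6896, and 7456, and why lowering max_model_len (or halving the KV cache with an fp8 dtype) frees up the needed room.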
nerdylive
nerdylive2mo ago
Ah it's from a template error? Great it's fixed
Alpay Ariyak
Alpay Ariyak2mo ago
Yes, requirements.txt had hf_transfer instead of hf-transfer
nerdylive
nerdylive2mo ago
Hahaha alright
houmie
houmie2mo ago
1) Thanks for fixing fp8_e5m2 to fp8. I will test it later today to confirm.
2) Also, good spot about hf-transfer. I will test again to see.
3) Can you please elaborate on max_model_len? Why is vLLM suggesting to set it to 6864? Maybe there is a configuration issue somewhere, because the context supported by Llama 3 is 8192.