Possible memory leak on Serverless
We're testing different Mistral models (cognitivecomputations/dolphin-2.6-mistral-7b and TheBloke/dolphin-2.6-mistral-7B-GGUF) and running into the same problem regardless of what size GPU we use. After 20 or so messages the model starts returning empty responses. We've been trying to debug this every way we know how, but it just doesn't make sense: the context size is around the same for each message, so it isn't due to an increasing number of prompt tokens. What I've noticed is that even when the worker isn't processing any requests, the GPU memory stays (nearly) maxed out. The only thing I can think of is that new requests don't have enough memory to be processed because the GPU is already full.
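For anyone who wants to check the same thing, here's a minimal sketch of how one can log GPU memory from inside the worker (illustrative only, not our exact code):

```python
import torch

def log_gpu_memory(tag: str) -> None:
    """Print used/total GPU memory so any creep between requests is visible."""
    free, total = torch.cuda.mem_get_info()  # both values are in bytes
    used_gib = (total - free) / 1024**3
    print(f"[{tag}] GPU memory used: {used_gib:.2f} / {total / 1024**3:.2f} GiB")

# Call this before and after each request the worker handles, e.g.:
log_gpu_memory("idle")
```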
GPU memory sitting near max usage is expected with vLLM; it pre-allocates most of the GPU memory up front for the model weights and the KV cache, even when the engine is idle.
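If you run the model with vLLM directly, the reserved fraction is controlled by gpu_memory_utilization (defaults to 0.9). A minimal sketch, using the model name from your post just as an example:

```python
from vllm import LLM, SamplingParams

# vLLM pre-allocates this fraction of GPU memory for the weights plus the
# KV-cache blocks, so nvidia-smi shows near-max usage even while idle.
llm = LLM(
    model="cognitivecomputations/dolphin-2.6-mistral-7b",
    gpu_memory_utilization=0.90,
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)
```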
As for the empty messages, that seems more like a problem with the model or vLLM itself. Have you tried sending 20+ messages with regular vLLM on a pod, or with any other inference engine?
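Something along these lines is what I'd run on a pod to narrow it down (a rough sketch; the prompt formatting is simplified and may not match the model's real chat template):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="cognitivecomputations/dolphin-2.6-mistral-7b")
params = SamplingParams(max_tokens=256, temperature=0.7)

# Feed 25 similar turns through the engine and watch for the point where
# completions start coming back empty.
history = ""
for turn in range(25):
    history += f"USER: This is test message {turn}. Please reply briefly.\nASSISTANT:"
    reply = llm.generate([history], params)[0].outputs[0].text
    history += reply + "\n"
    print(f"turn {turn}: {len(reply.strip())} chars")
    if not reply.strip():
        print("Got an empty completion, stopping here.")
        break
```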
It's a very popular model and I can't find anyone else complaining about this problem, so I can only assume it's related to the Docker container or the hardware. I'm using the runpod/worker-vllm:0.3.0-cuda11.8.0 image, which is also popular, and I haven't found anyone reporting empty messages after a certain number of requests with it either. It seems like it must be hardware related, since the messages being sent are all very similar to each other, yet after a while the model just starts returning "\r\n\r\n..." in a long string.
Following up here: I think I might be seeing the same issue. I see a slow creep in memory usage and eventually empty outputs along with CUDA OOM errors. Was there any resolution or progress on understanding this?
I'm pretty sure there was a leak in my own inference code. Switching wholesale over to vLLM resolved it, even though I didn't end up pinning down a root cause.
Hardware issues wouldn't affect the output tokens like that, so it's more likely vLLM-related.
Hi,
I'm not sure what you mean by "wholesale". Could you please elaborate?
Switching all of my inference code over to vLLM is what I meant by "wholesale". I did find out that the empty responses were caused by inputs that were too large. The memory leak seems to have been something in my own inference code, and it went away once I switched to vLLM.
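For anyone who hits this later, the practical fix on my side was making sure the prompt fits inside the model's context window before submitting it. A rough sketch (the token budget here is just an example; check your model's actual limit):

```python
from transformers import AutoTokenizer

# Example budget only: leave headroom for the completion inside the model's
# context window. Replace with the real limit for your model.
MAX_PROMPT_TOKENS = 3584

tokenizer = AutoTokenizer.from_pretrained(
    "cognitivecomputations/dolphin-2.6-mistral-7b"
)

def fit_prompt(prompt: str) -> str:
    """Keep only the most recent tokens so the prompt stays inside the budget."""
    ids = tokenizer.encode(prompt)
    if len(ids) <= MAX_PROMPT_TOKENS:
        return prompt
    return tokenizer.decode(ids[-MAX_PROMPT_TOKENS:], skip_special_tokens=True)
```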