RunPod•4d ago
jackson hole

How to monitor the LLM inference speed (generation token/s) with vLLM serverless endpoint?

I have gotten started with vLLM deployment; configuring it with my application was straightforward and it worked as well. My main concern is how to monitor the inference speed on the dashboard or on the "Metrics" tab. Currently I have to dig through the logs manually to find the average token generation speed printed by vLLM. Any neat solution to this?
5 Replies
nerdylive•4d ago
https://docs.vllm.ai/en/latest/serving/metrics.html you can expose this and monitor the metrics, that's the only way I know. To access it you need the pod id and the exposed port, then build the RunPod proxy link (like the one inside the Connect button in pods)
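As a concrete illustration of that suggestion, here is a minimal sketch that polls the Prometheus-style /metrics endpoint vLLM exposes and derives an average generation speed from the vllm:generation_tokens_total counter. The proxy URL pattern (https://{pod_id}-{port}.proxy.runpod.net), the placeholder pod id, and the port are assumptions for illustration, not values confirmed in this thread:

```python
# Sketch: estimate vLLM generation tokens/s by sampling the /metrics counter twice.
# POD_ID, PORT, and the proxy URL pattern below are assumed placeholders.
import time
import urllib.request

POD_ID = "your-pod-id"   # hypothetical pod id
PORT = 8000              # port the vLLM server is assumed to listen on
METRICS_URL = f"https://{POD_ID}-{PORT}.proxy.runpod.net/metrics"

def read_generation_tokens(url: str) -> float:
    """Sum the vllm:generation_tokens_total counter across all label sets."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode()
    total = 0.0
    for line in text.splitlines():
        # Skip HELP/TYPE comment lines; keep only the counter samples.
        if line.startswith("vllm:generation_tokens_total"):
            total += float(line.rsplit(" ", 1)[1])
    return total

# Sample the counter twice and divide the delta by the elapsed time.
t0, c0 = time.time(), read_generation_tokens(METRICS_URL)
time.sleep(30)
t1, c1 = time.time(), read_generation_tokens(METRICS_URL)
print(f"avg generation speed: {(c1 - c0) / (t1 - t0):.1f} tokens/s")
```

The same counter can of course be scraped by Prometheus/Grafana instead of polled by hand; this script is just the quickest way to check the number without reading logs.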
jackson holeOP•4d ago
Oh yeah, I thought RunPod had built-in support for this. Thanks
nerdylive•4d ago
You're welcome. If you want, you can also check with the staff via tickets (the contact button on the website)
jackson holeOP•4d ago
Absolutely, but I found Discord (and nerdylive support) quicker 😉
nerdylive•4d ago
haha
