jackson hole
RunPod
Created by jackson hole on 1/8/2025 in #⚡|serverless
Some basic confusion about the `handlers`
Hi everyone! 👋 I'm currently using RunPod's serverless option to deploy an LLM. Here's my setup:
- I've deployed vLLM behind a serverless endpoint (runpod.io/v2/<endpoint>/run).
- I built a FastAPI backend that forwards frontend requests to the RunPod endpoint.
- This works fine since FastAPI is async and handles requests efficiently.

However, I came across the Handler feature in the RunPod docs and am unsure if I should switch to using it. My questions are:
1. Is using the Handler feature necessary, or is it okay to stick with FastAPI as the middleware?
2. Are there any advantages to adopting Handlers, such as reduced latency or better scaling, compared to my current setup?
3. Would switching simplify my architecture, or am I overcomplicating things by considering it?

Basically, my architecture is:
1. Frontend
2. FastAPI (different endpoints and pre/post-processing -- async requests)
3. RunPod vLLM
4. FastAPI (final processing)
5. Return to frontend

I can't quite grasp the handler feature: is it a replacement for frameworks like FastAPI, or is it handled automatically on the RunPod side? Any advice or insights would be much appreciated! Thanks in advance. 😊
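For reference, a handler is a small Python function that runs inside the serverless worker itself: RunPod receives the request on its `/run` endpoint, queues the job, and invokes the handler for each job, so no web framework is needed inside the worker. A minimal sketch is below; the `generate_with_vllm` helper is a placeholder, and only `runpod.serverless.start` and the `handler(job)` shape come from the RunPod SDK.

```python
# handler.py -- minimal RunPod serverless handler sketch
import runpod


def generate_with_vllm(prompt: str) -> str:
    # Placeholder for the actual vLLM call (e.g. the engine preloaded
    # in the worker image). Not real inference code.
    return f"echo: {prompt}"


def handler(job):
    # job["input"] is the JSON payload you POST to /run or /runsync.
    prompt = job["input"].get("prompt", "")
    # Whatever is returned here becomes the job's output.
    return {"generated_text": generate_with_vllm(prompt)}


# Hands control to RunPod's job loop; no FastAPI/uvicorn inside the worker.
runpod.serverless.start({"handler": handler})
```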
5 replies
RunPod
Created by jackson hole on 1/7/2025 in #⚡|serverless
How do I monitor LLM inference speed (generation tokens/s) with a vLLM serverless endpoint?
I've gotten started with vLLM deployment, and wiring it into my application was straightforward and worked as well. My main concern is how to monitor the inference speed on the dashboard or in the "Metrics" tab, because currently I have to dig through the logs manually to find the average token generation speed printed by vLLM. Any neat solution to this?
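For reference, one rough client-side workaround is to time a `/runsync` call and divide a completion-token count by the elapsed time. This is only a sketch under assumptions: the base URL shape, the `sampling_params` input, and where (or whether) the token count appears in the response may differ between worker versions.

```python
# Approximate tokens/s by timing one synchronous request end to end.
import os
import time

import requests

ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]
API_KEY = os.environ["RUNPOD_API_KEY"]
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"  # assumed URL shape


def timed_generation(prompt: str) -> float:
    """Return an approximate generation speed in tokens/s for one request."""
    start = time.monotonic()
    resp = requests.post(
        URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": {"prompt": prompt, "sampling_params": {"max_tokens": 256}}},
        timeout=300,
    )
    elapsed = time.monotonic() - start
    resp.raise_for_status()

    output = resp.json().get("output")
    # The output layout differs between worker versions; this is illustrative.
    if isinstance(output, list) and output:
        output = output[0]
    usage = output.get("usage", {}) if isinstance(output, dict) else {}
    # Fall back to a crude word count if no token usage is reported.
    completion_tokens = usage.get("output") or len(str(output).split())

    return completion_tokens / elapsed


if __name__ == "__main__":
    print(f"~{timed_generation('Explain KV caching in one paragraph.'):.1f} tok/s")
```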
7 replies
RunPod
Created by jackson hole on 1/3/2025 in #⚡|serverless
How should the architecture be set up for serverless? (please give me a minute to explain myself)
We have been looking at LLM hosting services with autoscaling to make sure we meet demand -- but our main concern is the authentication architecture design.

The basic setup
Based on my understanding, there are the following layers:
1. The application on the user's device (sends the request).
2. A dedicated authentication server checks the user's authenticity (by API key, bearer token, etc.) and applies rate limits.
3. Our HTTP server takes that request, processes the data, and forwards it to the LLM server (to RunPod serverless).
4. RunPod returns the generated data, and finally the HTTP server post-processes it and sends it back to the user.

---

We want to:
- Make sure no unauthorized device can access our API to the LLM.
- Track each user's remaining quota and only let them send a limited number of requests, etc.

👉🏻 As you can see, we have certain authentication-related concerns -- but I need a more granular understanding of the standard practices when deploying LLMs for commercial use by real customers. Please guide. Thank you.
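To make layers 2 and 3 concrete, below is a minimal sketch of an API-key check plus a naive per-user quota sitting in front of the RunPod endpoint. The in-memory key store, quota numbers, endpoint URL, and `/generate` route are placeholders for illustration; in production the keys and counters would live in a database or Redis.

```python
# FastAPI gateway sketch: authenticate, enforce quota, then forward to RunPod.
import os

import httpx
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

# Placeholder key store -- replace with a real database / Redis.
API_KEYS = {"demo-key": {"user": "alice", "quota_left": 100}}

RUNPOD_URL = f"https://api.runpod.ai/v2/{os.environ.get('RUNPOD_ENDPOINT_ID', 'ENDPOINT')}/runsync"
RUNPOD_KEY = os.environ.get("RUNPOD_API_KEY", "")


def authenticate(x_api_key: str = Header(...)) -> dict:
    # Reject unknown keys and users who have exhausted their quota.
    account = API_KEYS.get(x_api_key)
    if account is None:
        raise HTTPException(status_code=401, detail="Unknown API key")
    if account["quota_left"] <= 0:
        raise HTTPException(status_code=429, detail="Quota exhausted")
    account["quota_left"] -= 1
    return account


@app.post("/generate")
async def generate(payload: dict, account: dict = Depends(authenticate)):
    # Only authenticated, in-quota requests ever reach the RunPod endpoint,
    # and the RunPod API key never leaves this server.
    async with httpx.AsyncClient(timeout=300) as client:
        resp = await client.post(
            RUNPOD_URL,
            headers={"Authorization": f"Bearer {RUNPOD_KEY}"},
            json={"input": payload},
        )
    resp.raise_for_status()
    # Post-process here before returning to the user's device.
    return {"user": account["user"], "result": resp.json().get("output")}
```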
20 replies