jackson hole
RunPod
Created by jackson hole on 1/8/2025 in #⚡|serverless
Some basic confusion about the `handlers`
Hi everyone! 👋 I'm currently using RunPod's serverless option to deploy an LLM. Here's my setup:
- I've deployed vLLM behind a serverless endpoint (runpod.io/v2/<endpoint>/run).
- I built a FastAPI backend that forwards frontend requests to the RunPod endpoint.
- This works fine since FastAPI is async and handles requests efficiently.

However, I came across the Handler feature in the RunPod docs and am unsure if I should switch to using it. My questions are:
1. Is using the Handler feature necessary, or is it okay to stick with FastAPI as the middleware?
2. Are there any advantages to adopting Handlers, such as reduced latency or better scaling, compared to my current setup?
3. Would switching simplify my architecture, or am I overcomplicating things by considering it?

Basically, my architecture is:
1. Frontend
2. FastAPI (different endpoints and pre/post-processing -- async requests)
3. RunPod vLLM
4. FastAPI (final processing)
5. Return to frontend

I can't quite grasp the handler feature: is it a replacement for frameworks like FastAPI, or is it handled automatically on the RunPod side? Any advice or insights would be much appreciated! Thanks in advance. 😊
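For reference, a handler is a small Python function that runs inside the serverless worker itself: RunPod receives the request on its `/run` endpoint, queues the job, and invokes the handler for each job, so no web framework is needed inside the worker. A minimal sketch is below; the `generate_with_vllm` helper is a placeholder, and only `runpod.serverless.start` and the `handler(job)` shape come from the RunPod SDK.

```python
# handler.py -- minimal RunPod serverless handler sketch
import runpod


def generate_with_vllm(prompt: str) -> str:
    # Placeholder for the actual vLLM call (e.g. the engine preloaded
    # in the worker image). Not real inference code.
    return f"echo: {prompt}"


def handler(job):
    # job["input"] is the JSON payload you POST to /run or /runsync.
    prompt = job["input"].get("prompt", "")
    # Whatever is returned here becomes the job's output.
    return {"generated_text": generate_with_vllm(prompt)}


# Hands control to RunPod's job loop; no FastAPI/uvicorn inside the worker.
runpod.serverless.start({"handler": handler})
```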
5 replies
RunPod
Created by jackson hole on 1/7/2025 in #⚡|serverless
How do I monitor LLM inference speed (generation tokens/s) with a vLLM serverless endpoint?
I've gotten started with vLLM deployment, and wiring it into my application was straightforward and worked as well. My main concern is how to monitor the inference speed on the dashboard or in the "Metrics" tab, because currently I have to dig through the logs manually to find the average token generation speed printed by vLLM. Any neat solution to this?
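For reference, one rough client-side workaround is to time a `/runsync` call and divide a completion-token count by the elapsed time. This is only a sketch under assumptions: the base URL shape, the `sampling_params` input, and where (or whether) the token count appears in the response may differ between worker versions.

```python
# Approximate tokens/s by timing one synchronous request end to end.
import os
import time

import requests

ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]
API_KEY = os.environ["RUNPOD_API_KEY"]
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"  # assumed URL shape


def timed_generation(prompt: str) -> float:
    """Return an approximate generation speed in tokens/s for one request."""
    start = time.monotonic()
    resp = requests.post(
        URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": {"prompt": prompt, "sampling_params": {"max_tokens": 256}}},
        timeout=300,
    )
    elapsed = time.monotonic() - start
    resp.raise_for_status()

    output = resp.json().get("output")
    # The output layout differs between worker versions; this is illustrative.
    if isinstance(output, list) and output:
        output = output[0]
    usage = output.get("usage", {}) if isinstance(output, dict) else {}
    # Fall back to a crude word count if no token usage is reported.
    completion_tokens = usage.get("output") or len(str(output).split())

    return completion_tokens / elapsed


if __name__ == "__main__":
    print(f"~{timed_generation('Explain KV caching in one paragraph.'):.1f} tok/s")
```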
7 replies
RunPod
Created by jackson hole on 1/3/2025 in #⚡|serverless
How should the architecture be set up for serverless? (please give me a minute to explain myself)
We have been looking at LLM hosting services with autoscaling to make sure we meet demand -- but our main concern is the authentication architecture design.

The basic setup
Based on my understanding, there are the following layers:
1. The application on the user's device (sends the request).
2. A dedicated authentication server checks the user's authenticity (by API key, bearer token, etc.) and applies rate limits.
3. Our HTTP server takes that request, processes the data, and forwards it to the LLM server (to RunPod serverless).
4. RunPod returns the generated data, and finally the HTTP server post-processes it and sends it back to the user.

---

We want to:
- Make sure no unauthorized device can access our API to the LLM.
- Track each user's remaining quota and only let them send a limited number of requests, etc.

👉🏻 As you can see, we have certain authentication-related concerns -- but I need a more granular understanding of the standard practices when deploying LLMs for commercial use by real customers. Please guide. Thank you.
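To make layers 2 and 3 concrete, below is a minimal sketch of an API-key check plus a naive per-user quota sitting in front of the RunPod endpoint. The in-memory key store, quota numbers, endpoint URL, and `/generate` route are placeholders for illustration; in production the keys and counters would live in a database or Redis.

```python
# FastAPI gateway sketch: authenticate, enforce quota, then forward to RunPod.
import os

import httpx
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

# Placeholder key store -- replace with a real database / Redis.
API_KEYS = {"demo-key": {"user": "alice", "quota_left": 100}}

RUNPOD_URL = f"https://api.runpod.ai/v2/{os.environ.get('RUNPOD_ENDPOINT_ID', 'ENDPOINT')}/runsync"
RUNPOD_KEY = os.environ.get("RUNPOD_API_KEY", "")


def authenticate(x_api_key: str = Header(...)) -> dict:
    # Reject unknown keys and users who have exhausted their quota.
    account = API_KEYS.get(x_api_key)
    if account is None:
        raise HTTPException(status_code=401, detail="Unknown API key")
    if account["quota_left"] <= 0:
        raise HTTPException(status_code=429, detail="Quota exhausted")
    account["quota_left"] -= 1
    return account


@app.post("/generate")
async def generate(payload: dict, account: dict = Depends(authenticate)):
    # Only authenticated, in-quota requests ever reach the RunPod endpoint,
    # and the RunPod API key never leaves this server.
    async with httpx.AsyncClient(timeout=300) as client:
        resp = await client.post(
            RUNPOD_URL,
            headers={"Authorization": f"Bearer {RUNPOD_KEY}"},
            json={"input": payload},
        )
    resp.raise_for_status()
    # Post-process here before returning to the user's device.
    return {"user": account["user"], "result": resp.json().get("output")}
```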
20 replies