RunPod • 3d ago
jackson hole

Some basic confusion about the `handlers`

Hi everyone! 👋 I'm currently using RunPod's serverless option to deploy an LLM. Here's my setup:

- I've deployed vLLM on a serverless endpoint (runpod.io/v2/<endpoint>/run).
- I built a FastAPI backend that forwards frontend requests to the RunPod endpoint.
- This works fine, since FastAPI is async and handles requests efficiently.

However, I came across the Handler feature in the RunPod docs and I'm unsure whether I should switch to it. My questions are:

1. Is using the Handler feature necessary, or is it okay to stick with FastAPI as the middleware?
2. Are there any advantages to adopting Handlers, such as reduced latency or better scaling, compared to my current setup?
3. Would switching simplify my architecture, or am I overcomplicating things by considering it?

Basically, my architecture is:

1. Frontend
2. FastAPI (different endpoints and pre/post-processing, async requests)
3. RunPod vLLM
4. FastAPI (final processing)
5. Return to frontend

I can't quite grasp the Handler feature. Is it a replacement for frameworks like FastAPI, or is it handled automatically on the RunPod side?

Any advice or insights would be much appreciated! Thanks in advance. 😊
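The five-step flow described in the question can be sketched as a plain asyncio pipeline. This is a minimal illustration under assumptions, not the poster's code: `preprocess`, `postprocess`, `handle_request` and `fake_model` are hypothetical names, the prompt template is invented for the example, and the RunPod call (step 3) is injected as an async function so a real client call can be swapped in later.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical type: any async function that sends a prompt to the model
# (e.g. the RunPod vLLM endpoint) and returns the generated text.
ModelCall = Callable[[str], Awaitable[str]]


def preprocess(request: dict) -> str:
    # Step 2: turn the frontend request into a prompt (illustrative template).
    return f"Write a short story about {request['story_topic']}."


def postprocess(text: str) -> dict:
    # Step 4: final processing before the response goes back to the frontend.
    return {"response": text.strip()}


async def handle_request(request: dict, call_model: ModelCall) -> dict:
    prompt = preprocess(request)        # step 2: FastAPI pre-processing
    raw = await call_model(prompt)      # step 3: RunPod vLLM (non-blocking await)
    return postprocess(raw)             # steps 4-5: post-process and return


async def fake_model(prompt: str) -> str:
    # Offline stand-in for the remote model so the sketch runs anywhere.
    await asyncio.sleep(0)
    return f"  [story for: {prompt}]  "


if __name__ == "__main__":
    print(asyncio.run(handle_request({"story_topic": "a lighthouse"}, fake_model)))
```

In this layout, steps 2 and 4 live in the FastAPI service; whatever runs on RunPod sits entirely behind the single `call_model` await.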
3WaD • 3d ago
Do you mean this Handler? That's what the serverless containers (including your vLLM one) run on. When you develop a container image for RunPod serverless, you use their SDK and put the execution code inside the handler function. So you are already using it; you just didn't have to write it yourself, since you're using someone else's image (the vLLM template).
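For reference, a worker written with the RunPod Python SDK looks roughly like the sketch below. It is a minimal illustration, not the vLLM template's actual code: the SDK passes each job to the handler as a dict whose `"input"` key holds the request's `input` object, and `runpod.serverless.start` runs the loop that pulls jobs and calls the handler. The `generate` stub and the `"topic"` field are hypothetical placeholders for the real inference code.

```python
# Minimal RunPod serverless worker sketch (assumes the `runpod` package is
# installed in the container image).


def generate(topic: str) -> str:
    # Hypothetical stand-in for real inference code (e.g. a vLLM engine).
    return f"A story about {topic}."


def handler(job: dict) -> dict:
    # RunPod delivers the request body's "input" object as job["input"].
    job_input = job["input"]
    topic = job_input.get("topic", "")
    # The returned value becomes the job's "output" in the API response.
    return {"story": generate(topic)}


if __name__ == "__main__":
    # Imported here so `handler` can be unit-tested without the SDK installed.
    import runpod

    runpod.serverless.start({"handler": handler})
```

A client would then POST a body like `{"input": {"topic": "dragons"}}` to the endpoint's `/run` or `/runsync` route. The handler is the code that runs inside the worker; it does not by itself replace a FastAPI layer sitting in front of the endpoint.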
jackson hole (OP) • 3d ago
Yes, that handler. So I'm basically using it already!? Wow. Actually, my code structure looks like this:
```python
from fastapi import FastAPI, Request

import functions  # the poster's own module that defines generate_story()

app = FastAPI()


@app.post("/get_response")
async def get_response(request: Request) -> dict:
    request_dict = await request.json()
    print("JSON:", request_dict)

    payload = {"topic": request_dict["story_topic"]}
    story = await functions.generate_story(payload)

    # FastAPI serializes the returned dict to a JSON response.
    return {"response": story}
```
And that calls the appropriate async function:
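```python
import os

from openai import AsyncOpenAI

# Assumption: API key and endpoint ID come from environment variables, and the
# vLLM worker exposes RunPod's OpenAI-compatible route under /openai/v1.
client = AsyncOpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{os.environ['RUNPOD_ENDPOINT_ID']}/openai/v1",
)


async def generate_story(payload):
    # Plain dict of OpenAI-style sampling arguments (vLLM's SamplingParams
    # object is not a mapping, so it cannot be unpacked with **).
    sampling_params = dict(temperature=0.42,
                           max_tokens=2048,
                           top_p=0.734,
                           stop=["Note", "note"])

    prompt = "<>"  # actual prompt elided in the original post

    ### the below will call the runpod openai endpoint ###
    # AsyncOpenAI + await keeps FastAPI's event loop free during generation.
    completion = await client.completions.create(
        model="iqbalamo93",
        prompt=prompt,
        # vLLM-specific sampling options are passed through extra_body.
        extra_body={"repetition_penalty": 1.0},
        **sampling_params)
    return completion.choices[0].text
```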
So basically, this FastAPI app is implemented asynchronously, and I guess it should do the job!
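One caveat on the async claim: inside an `async def` endpoint, a call made through a synchronous client (such as `openai.OpenAI`) blocks FastAPI's event loop for the whole generation, so concurrent requests queue behind each other. The sketch below simulates this with sleeps; `fake_sync_call` and `fake_async_call` are hypothetical stand-ins for a blocking and a non-blocking HTTP call, and `asyncio.to_thread` (Python 3.9+) is shown as a workaround when only a sync client is available.

```python
import asyncio
import time

DELAY = 0.2  # simulated model latency in seconds


def fake_sync_call() -> str:
    # Stand-in for a blocking client call (e.g. the sync OpenAI client).
    time.sleep(DELAY)
    return "done"


async def fake_async_call() -> str:
    # Stand-in for an awaited non-blocking call (e.g. AsyncOpenAI).
    await asyncio.sleep(DELAY)
    return "done"


async def blocking_endpoint() -> str:
    return fake_sync_call()  # freezes the event loop while it runs


async def async_endpoint() -> str:
    return await fake_async_call()  # yields to other requests while waiting


async def threaded_endpoint() -> str:
    # Offload the blocking call to a worker thread so the loop stays free.
    return await asyncio.to_thread(fake_sync_call)


async def time_two_concurrent(endpoint) -> float:
    # Simulate two simultaneous requests to the same endpoint.
    start = time.perf_counter()
    await asyncio.gather(endpoint(), endpoint())
    return time.perf_counter() - start


if __name__ == "__main__":
    for ep in (blocking_endpoint, async_endpoint, threaded_endpoint):
        elapsed = asyncio.run(time_two_concurrent(ep))
        print(f"{ep.__name__}: {elapsed:.2f}s")
```

Two concurrent requests should take roughly twice `DELAY` with the blocking version and roughly one `DELAY` with the other two, which is why awaiting a genuinely async client matters more than the `async def` keyword itself.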
