ngagefreak05
RunPod
•Created by ngagefreak05 on 6/24/2024 in #⚡|serverless
cannot stream openai compatible response out
But that is specifically for vLLM; I am trying to use llama-cpp.
7 replies
RunPod
•Created by Concept on 12/28/2023 in #⚡|serverless
Serverless Endpoint Streaming
Sorry, maybe I am reading the wrong README.md at https://github.com/runpod-workers/worker-vllm/blob/main/README.md. Can you please point me to the right section, or the right link if that one is wrong?
30 replies
RunPod
•Created by ngagefreak05 on 6/24/2024 in #⚡|serverless
cannot stream openai compatible response out
Any directions on how to use it?
7 replies
RunPod
•Created by Concept on 12/28/2023 in #⚡|serverless
Serverless Endpoint Streaming
Hi, could you please provide the URL of your fork? I need this too.
30 replies
RunPod
•Created by houmie on 4/30/2024 in #⚡|serverless
Is serverless cost per worker or per GPU?
They will utilize different GPUs.
8 replies
RunPod
•Created by houmie on 4/30/2024 in #⚡|serverless
Is serverless cost per worker or per GPU?
Yes, it is billed per GPU-second, so the cost is the number of GPUs in use multiplied by the number of seconds they run.
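For example, at a hypothetical rate of $0.0004 per GPU-second, a worker using 2 GPUs for 60 seconds would cost 2 × 60 × $0.0004 = $0.048; the actual per-second rate depends on the GPU type you choose.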
8 replies
RunPod
•Created by ngagefreak05 on 4/30/2024 in #⚡|serverless
openai compatible endpoint for custom serverless docker image
Although I could find a workaround in another thread:
What happens is, when you hit
https://api.runpod.ai/v2/<ENDPOINT ID>/openai/abc
the handler receives two new key-value pairs in the job input (job["input"]):
- "openai_route": everything in the URL you hit after /openai, so for the example case its value would be /abc; you would use this to tell the handler to do the logic for /v1/chat/completions, /v1/models, etc. (see the sketch after this list)
- "openai_input": the OpenAI request as a dictionary, with messages, etc.
If you don't have stream: true in your OpenAI request, then you just return the OpenAI completions/chat-completions/etc. object as a dict in the output
(returned here: https://github.com/runpod-workers/worker-vllm/blob/0a5b5bc095153363e8d45af1a2fa6f2d26425530/src/engine.py#L160)
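For example, a sketch of the non-streaming case (the field values are illustrative, following the OpenAI chat completion schema rather than worker-vllm's exact output):

```python
import time
import uuid

def build_chat_completion(text: str, model: str) -> dict:
    # A dict shaped like an OpenAI chat.completion object; the handler returns it directly.
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }
```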
If you have stream: true, then this will be an SSE stream, for which you would yield your output; but instead of yielding the dict directly, you would put it in SSE stream chunk string format, which is something like f"data: {your json output as string}\n\n"
(Stream code: https://github.com/runpod-workers/worker-vllm/blob/0a5b5bc095153363e8d45af1a2fa6f2d26425530/src/engine.py#L161)
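A sketch of the streaming case (the chunk fields follow the usual OpenAI chat.completion.chunk convention, and the trailing [DONE] sentinel is the standard OpenAI SSE terminator; both are my assumptions, not taken from worker-vllm):

```python
import json
import time

def stream_chat_completion(pieces, model: str):
    # Yield each piece as an SSE chunk string: f"data: {json}\n\n".
    for piece in pieces:
        chunk = {
            "id": "chatcmpl-stream",
            "object": "chat.completion.chunk",
            "created": int(time.time()),
            "model": model,
            "choices": [{"index": 0, "delta": {"content": piece}, "finish_reason": None}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    # Close the stream the way OpenAI clients expect.
    yield "data: [DONE]\n\n"
```
The handler would then yield from this generator instead of returning a dict.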
Most of the code is in this class in general: https://github.com/runpod-workers/worker-vllm/blob/0a5b5bc095153363e8d45af1a2fa6f2d26425530/src/engine.py#L109
Will work on documentation soon
5 replies
RunPod
•Created by ngagefreak05 on 4/30/2024 in #⚡|serverless
openai compatible endpoint for custom serverless docker image
But if I do this, then the agents will become complicated and will not use the generally accepted API.
5 replies