ngagefreak05
RunPod
Created by ngagefreak05 on 6/24/2024 in #⚡|serverless
cannot stream openai compatible response out
But that is specifically for vLLM; I am trying to use llama-cpp.
7 replies
RunPod
Created by Concept on 12/28/2023 in #⚡|serverless
Serverless Endpoint Streaming
Sorry, maybe I am reading the wrong README.md at https://github.com/runpod-workers/worker-vllm/blob/main/README.md. Can you please point me to the right section, or to the correct link if that one is wrong?
30 replies
RunPod
Created by ngagefreak05 on 6/24/2024 in #⚡|serverless
cannot stream openai compatible response out
Any directions on how to use it?
7 replies
RunPod
Created by Concept on 12/28/2023 in #⚡|serverless
Serverless Endpoint Streaming
Hi, could you please provide the URL of your fork? I need this too.
30 replies
RunPod
Created by houmie on 4/30/2024 in #⚡|serverless
Is serverless cost per worker or per GPU?
They will utilize different GPUs.
8 replies
RunPod
Created by houmie on 4/30/2024 in #⚡|serverless
Is serverless cost per worker or per GPU?
Yes, it is per GPU-second, so the cost is the number of GPUs in use multiplied by the per-second rate for each second they run.
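As a rough illustration of that billing model (the rate below is a placeholder, not RunPod's actual pricing; check the pricing page for real numbers):

```python
# Hypothetical example: 3 workers, each holding one GPU, each running for 10 s.
gpu_seconds = 3 * 10
rate_per_gpu_second = 0.0004          # placeholder rate in dollars
cost = gpu_seconds * rate_per_gpu_second
print(f"${cost:.4f}")                 # -> $0.0120
```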
8 replies
RunPod
Created by ngagefreak05 on 4/30/2024 in #⚡|serverless
openai compatible endpoint for custom serverless docker image
Although I could find a workaround in another thread. What happens is that when you hit https://api.runpod.ai/v2/<ENDPOINT ID>/openai/abc, the handler receives two new key-value pairs in the job["input"] dictionary:
- "openai_route": everything in the URL after /openai, so for the example above its value would be /abc. You use this to tell the handler which logic to run for /v1/chat/completions, /v1/models, etc.
- "openai_input": the OpenAI request as a dictionary, with messages, etc.

If you don't have stream: true in your OpenAI request, you just return the OpenAI completions/chat-completions/etc. object as a dict in the output (returned here: https://github.com/runpod-workers/worker-vllm/blob/0a5b5bc095153363e8d45af1a2fa6f2d26425530/src/engine.py#L160).

If you have stream: true, then this will be an SSE stream, for which you yield your output; but instead of yielding the dict directly, you wrap it in an SSE chunk string format, something like f"data: {your JSON output as a string}\n\n" (stream code: https://github.com/runpod-workers/worker-vllm/blob/0a5b5bc095153363e8d45af1a2fa6f2d26425530/src/engine.py#L161).

Most of the code is in this class in general: https://github.com/runpod-workers/worker-vllm/blob/0a5b5bc095153363e8d45af1a2fa6f2d26425530/src/engine.py#L109. Will work on documentation soon.
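Putting that mechanism together, here is a minimal sketch of a generator handler for a custom (non-vLLM) worker. The run_llm/stream_llm helpers and the "my-model" id are placeholders for your own backend (e.g. llama-cpp), and return_aggregate_stream is assumed from the runpod Python SDK; this is not taken from the worker-vllm code linked above.

```python
import json

import runpod


def run_llm(request):
    """Placeholder: call your backend (e.g. llama-cpp) and return a
    complete OpenAI-style ChatCompletion dict."""
    raise NotImplementedError


def stream_llm(request):
    """Placeholder: call your backend with streaming and yield
    OpenAI-style ChatCompletionChunk dicts."""
    raise NotImplementedError


def handler(job):
    job_input = job["input"]
    route = job_input.get("openai_route")        # e.g. "/v1/chat/completions"
    request = job_input.get("openai_input", {})  # the OpenAI-style request body

    if route == "/v1/models":
        yield {"object": "list", "data": [{"id": "my-model", "object": "model"}]}
    elif route == "/v1/chat/completions":
        if request.get("stream"):
            # Streaming: yield each chunk as an SSE "data: {...}\n\n" string.
            for chunk in stream_llm(request):
                yield f"data: {json.dumps(chunk)}\n\n"
            yield "data: [DONE]\n\n"
        else:
            # Non-streaming: return the full ChatCompletion object as a dict.
            yield run_llm(request)
    else:
        yield {"error": f"unsupported openai_route: {route}"}


runpod.serverless.start({"handler": handler, "return_aggregate_stream": True})
```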
5 replies
RunPod
Created by ngagefreak05 on 4/30/2024 in #⚡|serverless
openai compatible endpoint for custom serverless docker image
But if I do this, then the agents will become complicated and will not use the generally accepted API.
5 replies