RunPod
•Created by VidimusWolf on 10/14/2024 in #⚡|serverless
Streaming LLM output via a Google Cloud Function
Has anyone figured this out? User inputs go through a Google Cloud Function, which then calls the RunPod model's inference endpoint. This pipeline works, but I now want the output streamed back instead of waiting ages for the complete answer. So far my attempts to implement this have been unsuccessful, and Google's docs only cover streaming LLM output from their Vertex AI service, not the specific case I am dealing with.
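One possible approach, sketched below, is to submit the job to RunPod's asynchronous /run endpoint and then relay chunks from the /stream endpoint through a streaming HTTP response. This assumes a 2nd-gen function (backed by Cloud Run, which supports response streaming) and a serverless handler that yields partial results; ENDPOINT_ID and RUNPOD_API_KEY are placeholders, not values from the thread:

```python
# Hypothetical sketch: relay a RunPod serverless job's streamed output
# through a 2nd-gen Google Cloud Function. Assumes the RunPod handler
# yields partial results, otherwise /stream returns nothing incremental.
import os
import time

import functions_framework
import requests
from flask import Response

RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]   # placeholder
ENDPOINT_ID = os.environ["ENDPOINT_ID"]         # placeholder
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {RUNPOD_API_KEY}"}


@functions_framework.http
def stream_llm(request):
    prompt = (request.get_json(silent=True) or {}).get("prompt", "")

    # Submit the job asynchronously, then poll the /stream endpoint.
    job = requests.post(f"{BASE}/run", json={"input": {"prompt": prompt}},
                        headers=HEADERS, timeout=30).json()
    job_id = job["id"]

    def generate():
        while True:
            r = requests.get(f"{BASE}/stream/{job_id}",
                             headers=HEADERS, timeout=90).json()
            for chunk in r.get("stream", []):
                yield str(chunk.get("output", ""))
            if r.get("status") in ("COMPLETED", "FAILED",
                                   "CANCELLED", "TIMED_OUT"):
                break
            time.sleep(0.5)

    # Returning a generator lets Flask flush chunks as they arrive
    # instead of buffering the whole answer.
    return Response(generate(), mimetype="text/plain")
```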
2 replies
RunPod
•Created by VidimusWolf on 10/9/2024 in #⚡|serverless
Keeping Flashboot active?
My understanding is that Flashboot stays active only for "a while" after each request, after which the instance drops into a deeper sleep. Unfortunately, after a long idle period it takes a whopping 70-90 seconds to cold start (running llama-2-13b-chat-hf on the 48GB GPUs, e.g. A40). I don't know if I am doing something wrong there, since others on this forum report much faster start times. On consecutive jobs, however, the delay drops to 1-3 seconds. What is the minimum time between requests to keep Flashboot active? I assume this is some "secret", but would e.g. one job every 10 minutes do the trick?
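Since RunPod doesn't publish the Flashboot retention window, one workaround people try is a periodic keep-warm ping; note that setting a minimum number of active workers on the endpoint is the supported way to avoid cold starts entirely. Below is a minimal sketch where the 10-minute interval, the payload shape (which depends on your handler), and ENDPOINT_ID / RUNPOD_API_KEY are all assumptions:

```python
# Hypothetical keep-warm pinger: sends a tiny synchronous job every
# PING_INTERVAL seconds so the worker (hopefully) stays in the
# Flashboot window. Tune the interval against observed cold starts.
import os
import time

import requests

RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]   # placeholder
ENDPOINT_ID = os.environ["ENDPOINT_ID"]         # placeholder
PING_INTERVAL = 10 * 60  # seconds; assumed, not documented by RunPod

url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
headers = {"Authorization": f"Bearer {RUNPOD_API_KEY}"}

while True:
    # A one-token completion keeps GPU time (and cost) per ping small;
    # adjust the input schema to whatever your handler expects.
    r = requests.post(url,
                      json={"input": {"prompt": "ping",
                                      "sampling_params": {"max_tokens": 1}}},
                      headers=headers, timeout=120)
    print(time.strftime("%H:%M:%S"), r.json().get("status"))
    time.sleep(PING_INTERVAL)
```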
9 replies
RunPod
•Created by VidimusWolf on 10/9/2024 in #⚡|serverless
Hugging Face token not working
Hello! Has anyone had issues getting their Hugging Face token to work on a serverless vLLM instance? I have used Hugging Face before and my tokens work locally, but I keep getting "access denied" entries in the console logs when I send a request, even though I supply the token key...
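One way to narrow this down is to confirm, outside RunPod, that the exact token string pasted into the endpoint settings can see the (possibly gated) model repo. A sketch using huggingface_hub, where the repo ID is only an example and HF_TOKEN is a placeholder:

```python
# Hypothetical check: verify the same token given to the vLLM endpoint
# can actually access the gated model repo.
import os

from huggingface_hub import model_info, whoami

token = os.environ["HF_TOKEN"]            # the exact string used on RunPod
repo = "meta-llama/Llama-2-13b-chat-hf"   # example gated repo

print(whoami(token=token)["name"])        # confirms the token is valid at all
info = model_info(repo, token=token)      # raises if the account lacks
                                          # access to the gated repo
print(info.id, "accessible")
```

If whoami succeeds but model_info fails, the token is fine and the account simply hasn't been granted access to the gated model; if both succeed locally, the problem is likely how the token is passed to the endpoint (e.g. a truncated value or the wrong environment variable) rather than the token itself.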
7 replies