vLLM worker OpenAI stream timeout
The OpenAI client code from the tutorial (https://docs.runpod.io/serverless/workers/vllm/openai-compatibility#streaming-responses-1) is not reproducible.
I'm hosting a 70B model, which usually has a ~2 min delay per request.
Using the OpenAI client with stream=True times out after ~1 min and returns nothing. Any solutions?
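For reference, this is roughly the call I'm making, following the tutorial (a minimal sketch: the endpoint ID, API key env vars, and model name are placeholders I've filled in, and the base_url format is taken from the docs page linked above):

```python
import os
from openai import OpenAI

# Placeholders: RUNPOD_ENDPOINT_ID and RUNPOD_API_KEY stand in for my real values.
client = OpenAI(
    base_url=f"https://api.runpod.ai/v2/{os.environ['RUNPOD_ENDPOINT_ID']}/openai/v1",
    api_key=os.environ["RUNPOD_API_KEY"],
)

# MODEL_NAME is the same Hugging Face repo id the worker was deployed with.
stream = client.chat.completions.create(
    model=os.environ["MODEL_NAME"],
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

# The connection gets closed after ~1 min, before any chunks arrive.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```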
Did you set the model name?
Or did you leave it as MODEL_NAME?
MODEL_NAME is the Hugging Face link, as usual.
Basically, what I experience is that the server closes the connection after ~1 min when stream == True; non-streaming works fine.
Eh, isn't it just the model repo, like
meta-llama/llama3.3-70b
something like that
yes this is what I meant, sorry
I'm not sure how MODEL_NAME affects this problem at all
Maybe it's just the environment variable key name
Maybe it was only checking for that
But if you're not using stream, does it work?
Yes, this waits for the whole request to finish.
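To be concrete, the non-streaming version of the same call (same client and placeholders as in my sketch above) completes fine after the usual ~2 min wait:

```python
# Same client object as in the streaming sketch; no stream=True here.
response = client.chat.completions.create(
    model=os.environ["MODEL_NAME"],
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```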
Adding stream=True sends the request, which I can see in the dashboard, but it terminates the connection after ~1 min.
Oh hmm
And empty response? Nothing streamed back?
If you replicate your vLLM config in a pod, try whether streaming works there, and try active workers too
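Rough sketch of what I mean (assumptions on my side: you start vLLM's own OpenAI-compatible server inside the pod and reach it through the pod's HTTP proxy; the pod ID and model name below are placeholders):

```python
from openai import OpenAI

# Hypothetical pod setup, e.g. started inside the pod with:
#   vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000
# POD_ID is a placeholder; the proxy URL format is my assumption.
client = OpenAI(
    base_url="https://POD_ID-8000.proxy.runpod.net/v1",
    api_key="EMPTY",  # vLLM's server accepts any key unless --api-key is set
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # must match what the pod serves
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

If streaming works there but not through the serverless endpoint, that points at the serverless/proxy layer rather than vLLM itself.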
I'm guessing it might be the Cloudflare proxy limiting a request to 100s only
Nope
If you want, you can also create a ticket to explore this further
@Misterion
Escalated To Zendesk
The thread has been escalated to Zendesk!
Same issue here but even without streaming