Not using cached worker
What TTFT (time to first token) times should we be able to reach?
80GB GPUs totally unavailable
Not able to connect to the local test API server
What methods can I use to reduce cold start times and decrease latency for serverless functions?
Network volume vs. baking the model into the Docker image
Job stays In-Progress forever
How to get the progress of a processing job in serverless?
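A likely approach, as a minimal sketch: the runpod Python SDK ships a progress_update helper that attaches a free-form progress payload to the running job, which clients then see when polling /status. Treat the exact signature and the handler shape here as assumptions to check against your pinned SDK version.

```python
import runpod

def handler(job):
    total_steps = 10
    for step in range(total_steps):
        # ... do one chunk of work here ...
        # Attach a progress payload to the job; pollers of /status/{job_id}
        # will see it while the job is still IN_PROGRESS.
        runpod.serverless.progress_update(job, f"{step + 1}/{total_steps} steps done")
    return {"result": "done"}

runpod.serverless.start({"handler": handler})
```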
Why is runsync returning a status response instead of just waiting for the image response?
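As context for this question: /runsync only holds the connection open for a limited window; if the job is still running when that window closes, the API returns a status payload (with a job id) rather than the final output, and the client is expected to fall back to polling /status. A minimal sketch of that fallback, with placeholder endpoint id and API key:

```python
import time
import requests

API_KEY = "YOUR_API_KEY"        # placeholder
ENDPOINT_ID = "YOUR_ENDPOINT"   # placeholder
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# runsync returns the output directly if the job finishes in time,
# otherwise a status response such as {"id": ..., "status": "IN_PROGRESS"}.
resp = requests.post(f"{BASE}/runsync", headers=HEADERS,
                     json={"input": {"prompt": "a cat"}}).json()

# Fall back to polling /status until the job leaves the queue/progress states.
while resp.get("status") in ("IN_QUEUE", "IN_PROGRESS"):
    time.sleep(2)
    resp = requests.get(f"{BASE}/status/{resp['id']}", headers=HEADERS).json()

print(resp.get("output"))
```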
Worker keeps running after idle timeout
May I deploy the "ComfyUI with Flux.1 dev one-click" template to serverless?
What is the real Serverless price?
Can't find Juggernaut in the list of models to download in ComfyUI Manager
comfy
Incredibly long startup time when running 70B models via vLLM
cognitivecomputations/dolphin-2.9.1-llama-3-70b
I find it even weirder that the request ultimately succeeds. Logs and a screenshot of the endpoint and template config are attached; if anyone can spot an issue or knows how to deploy 70B models such that they reliably work, I would greatly appreciate it.
Some other observations:
- In support, someone told me that I need to manually set the env var BASE_PATH=/workspace, which I am now always doing.
- I sometimes, but not always, see this in the logs: AsyncEngineArgs(model='facebook/opt-125m', served_model_name=None, tokenizer='facebook/opt-125m'..., even though I am deploying a completely different model.
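The facebook/opt-125m line is vLLM's tiny default model, which suggests the worker is not picking up the intended model name from the environment. A minimal sketch of that failure mode, assuming a worker that reads MODEL_NAME (the variable names and defaults here are illustrative assumptions, not the worker's confirmed config):

```python
import os

# If MODEL_NAME is missing from the endpoint/template env, a worker built
# this way silently falls back to the tiny default model -- which would
# match the AsyncEngineArgs(model='facebook/opt-125m', ...) log above.
MODEL_NAME = os.environ.get("MODEL_NAME", "facebook/opt-125m")
BASE_PATH = os.environ.get("BASE_PATH", "/runpod-volume")  # weight cache dir

print(f"engine will load model={MODEL_NAME!r} with base path={BASE_PATH!r}")
```

If you deployed dolphin-2.9.1-llama-3-70b but the logs name facebook/opt-125m, it is worth double-checking that the model env var is set on the endpoint template itself, not only elsewhere in the stack.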
Mounting network storage at runtime - serverless
Serverless fails when workers aren't manually set to active
Chat completion (template) not working with vLLM 0.6.3 + Serverless
Qwen2.5 + vLLM + OpenWebUI