RunPod
•Created by Ardgon on 11/18/2024 in #⚡|serverless
vLLM: override OpenAI served model name
Overriding the served model name on the vLLM serverless pod doesn't seem to take effect. Configuring a new endpoint through the Explore page on RunPod's interface creates a worker with the env variable
OPENAI_SERVED_MODEL_NAME_OVERRIDE
but the name of the model on the OpenAI endpoint is still the hf_repo/model name.
The logs show: engine.py: AsyncEngineArgs(model='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', served_model_name=None...
and the endpoint returns Error with model object='error' message='The model 'model_name' does not exist.' type='NotFoundError' param=None code=404
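For context, a minimal sketch of how the overridden name would be used against the endpoint's OpenAI-compatible route; the endpoint ID, API key, and base URL format below are placeholders/assumptions, and "model_name" stands for the value set in the override:
```python
# Sketch only: querying the serverless vLLM worker through its OpenAI-compatible API.
# ENDPOINT_ID and RUNPOD_API_KEY are placeholders; the base_url format is an assumption
# about the endpoint's OpenAI-compatible route.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
    api_key="RUNPOD_API_KEY",
)

resp = client.chat.completions.create(
    model="model_name",  # the value set via OPENAI_SERVED_MODEL_NAME_OVERRIDE
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```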
Setting the env variable SERVED_MODEL_NAME
shows logs: engine.py: Engine args: AsyncEngineArgs(model='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', served_model_name='model_name'...
yet the endpoint still returns the same error message as above.
1 reply
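One way to check which name the endpoint actually exposes (and whether either env variable took effect) is to list the models on the OpenAI-compatible route; a minimal sketch, reusing the placeholder endpoint ID and key from above:
```python
# Sketch only: listing the model IDs the OpenAI-compatible route actually serves,
# to compare against the name set via SERVED_MODEL_NAME / the override variable.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",  # placeholder endpoint ID
    api_key="RUNPOD_API_KEY",  # placeholder key
)

for model in client.models.list():
    print(model.id)  # expected to show either the HF repo name or the override
```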
RunPod
•Created by Ardgon on 6/18/2024 in #⚡|serverless
Cancelling a job resets FlashBoot
For some reason, whenever we cancel a job, the next time the serverless worker cold boots it doesn't use FlashBoot and instead reloads the LLM model weights into the GPU from scratch. Any idea why cancelling jobs might be causing this? Is there a more graceful way to stop jobs early than the /cancel/{job_id} endpoint?
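For reference, a minimal sketch of the cancel call in question, assuming the serverless REST route named above; ENDPOINT_ID, JOB_ID, and the API key are placeholders:
```python
# Sketch only: cancelling an in-flight serverless job via the /cancel/{job_id} route.
# ENDPOINT_ID, JOB_ID, and RUNPOD_API_KEY are placeholders.
import requests

resp = requests.post(
    "https://api.runpod.ai/v2/ENDPOINT_ID/cancel/JOB_ID",
    headers={"Authorization": "Bearer RUNPOD_API_KEY"},
    timeout=30,
)
print(resp.status_code, resp.json())
```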
4 replies