Urgent: Issue with Runpod vLLM Serverless Endpoint
We are encountering a critical issue with the Runpod vLLM serverless endpoint. Specifically, when a network volume is attached, the following code fails:
response = client.completions.create(
    model="llama3-dumm/llm",
    prompt=["hello? How are you "],
    temperature=0.8,
    max_tokens=600,
)
But the code below is working:
response = client.chat.completions.create(
    model="llama3-dumm/llm",
    messages=[{'role': 'user', 'content': "hell0"}],
    max_tokens=100,
    temperature=0.9,
)
And this is the client object:
client = OpenAI(
    api_key=api_key,
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1",
)
This behavior is unusual and suggests there might be a bug. Given our tight deadline, could you please investigate this issue as soon as possible? Your prompt assistance would be greatly appreciated. Thank you very much for your help.
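(A possible stopgap, just an assumption and not something confirmed in this thread: since the chat route still responds, the plain prompt could be wrapped as a single user message until completions is fixed.)
# Hypothetical workaround sketch: send the same prompt through the working
# chat.completions route instead of the broken completions route.
response = client.chat.completions.create(
    model="llama3-dumm/llm",
    messages=[{"role": "user", "content": "hello? How are you "}],
    temperature=0.8,
    max_tokens=600,
)
print(response.choices[0].message.content)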
15 Replies
@naaviii thanks for reporting this. So you are saying that once you connect a network volume to your vLLM endpoint, then
client.completions.create
stops working?
Yes. In fact, I am now getting the error even without using a network volume:
from openai import OpenAI
api_key = "xxxxxxxxx"
endpoint_id = "vllm-xxxxx"
client = OpenAI(
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1",
    api_key=api_key,
)
# Create a completion
response = client.completions.create(
    model="microsoft/Phi-3.5-mini-instruct",
    prompt="Runpod is the best platform because",
    temperature=0,
    max_tokens=100,
)
print(response)
# Print the response
print(response.choices[0].text)
################Output###############################
{
"delayTime": 104,
"error": "handler: 'NoneType' object has no attribute 'headers' \ntraceback: Traceback (most recent call last):\n File \"/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_job.py\", line 192, in run_job_generator\n async for output_partial in job_output:\n File \"/src/handler.py\", line 13, in handler\n async for batch in results_generator:\n File \"/src/engine.py\", line 151, in generate\n async for response in self._handle_chat_or_completion_request(openai_request):\n File \"/src/engine.py\", line 179, in _handle_chat_or_completion_request\n response_generator = await generator_function(request, raw_request=None)\n File \"/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py\", line 129, in create_completion\n raw_request.headers):\nAttributeError: 'NoneType' object has no attribute 'headers'\n",
"executionTime": 1191,
"id": "sync-9c9ccd0f-7e42-4f6a-8c5d-d430004b399f-e1",
"status": "FAILED"
}
This is the basic code that I have used.
This was working fine a few days back; have there been major changes in the library versions?
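Reading the traceback above: the worker calls vLLM's create_completion with raw_request=None (/src/engine.py, line 179), while that vLLM code path reads raw_request.headers, so every plain completions request fails regardless of the model. A minimal sketch of that failure mode (not the worker's actual code, just an illustration):
# Illustration only: accessing .headers on a raw_request that is None
# raises the same AttributeError shown in the output above.
def create_completion(request, raw_request=None):
    return raw_request.headers  # fails when raw_request is None

try:
    create_completion({"prompt": "Runpod is the best platform because"})
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'headers'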
Please help, we need your support.
Can you also please provide the docker image version of the worker-vllm that you are using? And the endpoint ID (which is OK to share, as an API key is still needed to access it); you can also DM me the endpoint ID if you want!
I got the ID via DM, it looks like a problem in the worker-vllm. I asked our team to take a look at this. Will report back once I hear something.
Thanks a lot Tim, for the quick response 🙂
you are super welcome!
I hope we can get this sorted out quickly
Hi Tim, just checking in: is there any update?
Hi @naaviii, I have no update for you yet, but I will ping you the second I have something.
I just saw that there is already an issue for the problem: https://github.com/runpod-workers/worker-vllm/issues/104
I talked with the team and will try to help them resolve the issue.
Hello Tim, is there any update from the team regarding the issue?
Hello @Tim aka NERDDISCO, is there any update from the team regarding the issue?
@naaviii nope, I'm very sorry for this situation 😦
No worries @NERDDISCO, could you give me an ETA if possible, so that our team can plan accordingly?
@naaviii we have created a fix, can you please check the latest version:
runpod/worker-v1-vllm:v1.3.1dev-cuda12.1.0
this one is not available in the UI yet when using quick deploy, so you have to change the docker image yourself in the endpoint.
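If it helps anyone verifying the dev image, a minimal check (reusing the placeholder endpoint ID and API key from the repro above) is to re-run the completions call that was failing:
from openai import OpenAI

api_key = "xxxxxxxxx"       # placeholder, as in the repro above
endpoint_id = "vllm-xxxxx"  # placeholder endpoint ID

client = OpenAI(
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1",
    api_key=api_key,
)

# This call previously returned the 'NoneType' object has no attribute 'headers' error;
# on the patched image it should return a normal completion.
response = client.completions.create(
    model="microsoft/Phi-3.5-mini-instruct",
    prompt="Runpod is the best platform because",
    temperature=0,
    max_tokens=100,
)
print(response.choices[0].text)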
I encountered the same problem and swapped to the dev image mentioned above after finding this thread, and things are no longer crashing outright.
Do you guys know how to run Llama 3.1 70B? Can I use a quantized version with this? GGUF? Can't seem to find anything about it.