Urgent: Issue with Runpod vLLM Serverless Endpoint
We are encountering a critical issue with the Runpod vLLM serverless endpoint. Specifically, when a network volume is attached, the following code fails:
response = client.completions.create(
    model="llama3-dumm/llm",
    prompt=["hello? How are you "],
    temperature=0.8,
    max_tokens=600,
)
But the code below is working:
response = client.chat.completions.create(
    model="llama3-dumm/llm",
    messages=[{'role': 'user', 'content': "hell0"}],
    max_tokens=100,
    temperature=0.9,
)
And this is the client object:
client = OpenAI(
    api_key=api_key,
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1",
)
This behavior is unusual and suggests there might be a bug. Given our tight deadline, could you please investigate this issue as soon as possible? Your prompt assistance would be greatly appreciated. Thank you very much for your help.
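(A possible stopgap, just an assumption and not something confirmed in this thread: since the chat route still responds, the plain prompt could be wrapped as a single user message until completions is fixed.)
# Hypothetical workaround sketch: send the same prompt through the working
# chat.completions route instead of the broken completions route.
response = client.chat.completions.create(
    model="llama3-dumm/llm",
    messages=[{"role": "user", "content": "hello? How are you "}],
    temperature=0.8,
    max_tokens=600,
)
print(response.choices[0].message.content)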
15 Replies
@naaviii thanks for reporting this. So you are saying that once you connect a network volume to your vLLM endpoint, then
client.completions.create
stops working?
Yes. In fact, I am now getting the error even without using a network volume:
from openai import OpenAI
api_key = "xxxxxxxxx"
endpoint_id = "vllm-xxxxx"
client = OpenAI(
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1",
    api_key=api_key,
)
# Create a completion
response = client.completions.create(
    model="microsoft/Phi-3.5-mini-instruct",
    prompt="Runpod is the best platform because",
    temperature=0,
    max_tokens=100,
)
print(response)
# Print the response
print(response.choices[0].text)
################Output###############################
{
"delayTime": 104,
"error": "handler: 'NoneType' object has no attribute 'headers' \ntraceback: Traceback (most recent call last):\n File \"/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_job.py\", line 192, in run_job_generator\n async for output_partial in job_output:\n File \"/src/handler.py\", line 13, in handler\n async for batch in results_generator:\n File \"/src/engine.py\", line 151, in generate\n async for response in self._handle_chat_or_completion_request(openai_request):\n File \"/src/engine.py\", line 179, in _handle_chat_or_completion_request\n response_generator = await generator_function(request, raw_request=None)\n File \"/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py\", line 129, in create_completion\n raw_request.headers):\nAttributeError: 'NoneType' object has no attribute 'headers'\n",
"executionTime": 1191,
"id": "sync-9c9ccd0f-7e42-4f6a-8c5d-d430004b399f-e1",
"status": "FAILED"
}
This is the basic code that I have used.
This was working fine a few days back; have there been major changes in the library versions?
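Reading the traceback above: the worker calls vLLM's create_completion with raw_request=None (/src/engine.py, line 179), while that vLLM code path reads raw_request.headers, so every plain completions request fails regardless of the model. A minimal sketch of that failure mode (not the worker's actual code, just an illustration):
# Illustration only: accessing .headers on a raw_request that is None
# raises the same AttributeError shown in the output above.
def create_completion(request, raw_request=None):
    return raw_request.headers  # fails when raw_request is None

try:
    create_completion({"prompt": "Runpod is the best platform because"})
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'headers'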
Please help, we need your support.
Can you also please provide the docker image version of the worker-vllm that you are using? And the endpoint ID (which is OK to share, as an API key is still needed to access it); you can also DM me the endpoint ID if you want!
I got the ID via DM, it looks like a problem in the worker-vllm. I asked our team to take a look at this. Will report back once I hear something.
Thanks a lot Tim, for the quick response 🙂
you are super welcome!
I hope we can get this sorted out quickly
Hi Tim, just checking in: is there any update?
Hi @naaviii, I have no update for you yet, but I will ping you the second I have something.
I just saw that there is already an issue for the problem: https://github.com/runpod-workers/worker-vllm/issues/104
I talked with the team and will try to help them resolve the issue.
Hello Tim, is there any update from the team regarding the issue?
Hello @Tim aka NERDDISCO, is there any update from the team regarding the issue?
@naaviii nope, I'm very sorry for this situation 😦
No worries @NERDDISCO, could you give me an ETA if possible, so that our team can plan accordingly?
@naaviii we have created a fix, can you please check the latest version:
runpod/worker-v1-vllm:v1.3.1dev-cuda12.1.0
this one is not available in the UI yet when using quick deploy, so you have to change the docker image yourself in the endpoint.
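If it helps anyone verifying the dev image, a minimal check (reusing the placeholder endpoint ID and API key from the repro above) is to re-run the completions call that was failing:
from openai import OpenAI

api_key = "xxxxxxxxx"       # placeholder, as in the repro above
endpoint_id = "vllm-xxxxx"  # placeholder endpoint ID

client = OpenAI(
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1",
    api_key=api_key,
)

# This call previously returned the 'NoneType' object has no attribute 'headers' error;
# on the patched image it should return a normal completion.
response = client.completions.create(
    model="microsoft/Phi-3.5-mini-instruct",
    prompt="Runpod is the best platform because",
    temperature=0,
    max_tokens=100,
)
print(response.choices[0].text)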
I encountered the same problem and swapped to the dev image mentioned above after finding this thread, and things are no longer crashing outright.
Do you guys know how to run Llama 3.1 70B? Can I use a quantized version with this? GGUF? Can't seem to find anything about it.