RunPod•4mo ago
naaviii

Urgent: Issue with Runpod vllm Serverless Endpoint

We are encountering a critical issue with the RunPod vLLM serverless endpoint. Specifically, when attaching a network volume, the following code is failing:

```python
response = client.completions.create(
    model="llama3-dumm/llm",
    prompt=["hello? How are you "],
    temperature=0.8,
    max_tokens=600,
)
```

But the below is working:

```python
response = client.chat.completions.create(
    model="llama3-dumm/llm",
    messages=[{'role': 'user', 'content': "hell0"}],
    max_tokens=100,
    temperature=0.9,
)
```

And this is the client object:

```python
client = OpenAI(
    api_key=api_key,
    base_url=f"https://api.runpod.ai/v2/endpoint_id/openai/v1",
)
```
This behavior is unusual and suggests there might be a bug. Given our tight deadline, could you please investigate this issue as soon as possible? Your prompt assistance would be greatly appreciated. Thank you very much for your help.
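[Editor's note] For anyone trying to reproduce this, here is a minimal sketch combining the snippets above into one runnable script. The API key, endpoint ID, and model name are placeholders, not real values:

```python
# Minimal repro sketch -- api_key, endpoint_id, and model are placeholders.
from openai import OpenAI

api_key = "YOUR_RUNPOD_API_KEY"
endpoint_id = "YOUR_ENDPOINT_ID"

client = OpenAI(
    api_key=api_key,
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1",
)

# Completions path -- fails on the affected worker version:
try:
    completion = client.completions.create(
        model="llama3-dumm/llm",
        prompt=["hello? How are you "],
        temperature=0.8,
        max_tokens=600,
    )
    print(completion.choices[0].text)
except Exception as exc:
    print(f"completions failed: {exc}")

# Chat completions path -- works on the same endpoint:
chat = client.chat.completions.create(
    model="llama3-dumm/llm",
    messages=[{"role": "user", "content": "hello"}],
    max_tokens=100,
    temperature=0.9,
)
print(chat.choices[0].message.content)
```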
15 Replies
NERDDISCO
NERDDISCO•4mo ago
@naaviii thanks for reporting this. So you are saying that once you connect a network volume to your vllm endpoint, then client.completions.create stops working?
naaviii
naaviiiOP•4mo ago
yes, in fact now I am getting the error even without using a network volume:

```python
from openai import OpenAI

api_key = "xxxxxxxxx"
endpoint_id = "vllm-xxxxx"

client = OpenAI(
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1",
    api_key=api_key,
)

# Create a completion
response = client.completions.create(
    model="microsoft/Phi-3.5-mini-instruct",
    prompt="Runpod is the best platform because",
    temperature=0,
    max_tokens=100,
)
print(response)

# Print the response
print(response.choices[0].text)
```

Output:

```json
{
  "delayTime": 104,
  "error": "handler: 'NoneType' object has no attribute 'headers' \ntraceback: Traceback (most recent call last):\n  File \"/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_job.py\", line 192, in run_job_generator\n    async for output_partial in job_output:\n  File \"/src/handler.py\", line 13, in handler\n    async for batch in results_generator:\n  File \"/src/engine.py\", line 151, in generate\n    async for response in self._handle_chat_or_completion_request(openai_request):\n  File \"/src/engine.py\", line 179, in _handle_chat_or_completion_request\n    response_generator = await generator_function(request, raw_request=None)\n  File \"/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py\", line 129, in create_completion\n    raw_request.headers):\nAttributeError: 'NoneType' object has no attribute 'headers'\n",
  "executionTime": 1191,
  "id": "sync-9c9ccd0f-7e42-4f6a-8c5d-d430004b399f-e1",
  "status": "FAILED"
}
```

This is the basic code that I have used. It was working fine a few days back; have there been major changes to library versions? Please help, we need your support.
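[Editor's note] Reading the traceback: the worker's `engine.py` passes `raw_request=None` into vLLM's `create_completion`, which then dereferences `raw_request.headers`. Since only the completions path fails while chat completions works, it appears only `create_completion` reads `raw_request` unconditionally. Below is a simplified sketch of that failure pattern; the class is a stand-in, not the actual worker-vllm or vLLM source:

```python
import asyncio

# Stand-in for vLLM's serving_completion -- illustrative only.
class FakeServingCompletion:
    async def create_completion(self, request, raw_request):
        # The completions path reads headers off the raw HTTP request object,
        # which raises AttributeError when raw_request is None.
        return raw_request.headers

async def main():
    serving = FakeServingCompletion()
    try:
        # The serverless handler has no real HTTP request to forward,
        # so it passes raw_request=None.
        await serving.create_completion(request={"prompt": "hi"}, raw_request=None)
    except AttributeError as exc:
        print(f"handler: {exc}")  # 'NoneType' object has no attribute 'headers'

asyncio.run(main())
```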
NERDDISCO
NERDDISCO•4mo ago
Can you also please provide the Docker image version of the worker-vllm that you are using? And the endpoint ID (which is OK to share, as an API key is still needed to access it), but you can also DM me the endpoint ID if you want!

I got the ID via DM; it looks like a problem in the worker-vllm. I asked our team to take a look at this. Will report back once I hear something.
naaviii
naaviiiOP•4mo ago
Thanks a lot Tim, for the quick response 🙂
NERDDISCO
NERDDISCO•4mo ago
you are super welcome! I hope we can get this sorted out quickly
naaviii
naaviiiOP•4mo ago
Hi Tim , just checking for any update ?
NERDDISCO
NERDDISCO•4mo ago
Hi @naaviii, I have no update for you yet, but I will ping you the second I have something.
NERDDISCO
NERDDISCO•4mo ago
I just saw that there is already an issue for the problem: https://github.com/runpod-workers/worker-vllm/issues/104
GitHub
'NoneType' object has no attribute 'headers' (completions endpoint)...
When trying to use the completions endpoint (rather than chat_completions) on a vLLM runpod serverless instance I get a server error. This happens with all models that I've tried. The chat_comp...
NERDDISCO
NERDDISCO•4mo ago
I talked with the team and am trying to help them resolve the issue
naaviii
naaviiiOP•4mo ago
Hello Tim, is there any update from the team regarding the issue?

Hello @Tim aka NERDDISCO, is there any update from the team regarding the issue?
NERDDISCO
NERDDISCO•4mo ago
@naaviii nope, I'm very sorry for this situation 😦
naaviii
naaviiiOP•4mo ago
No worries @NERDDISCO, could you give me an ETA if possible, so that our team can plan accordingly?
NERDDISCO
NERDDISCO•4mo ago
@naaviii we have created a fix, can you please check the latest version: `runpod/worker-v1-vllm:v1.3.1dev-cuda12.1.0`

This one is not available in the UI yet when using quick deploy, so you have to change the Docker image yourself in the endpoint settings.
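[Editor's note] Once the endpoint is running the patched image, a quick way to verify the fix is to re-run the completions call from earlier in the thread. The API key and endpoint ID below are placeholders:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",  # placeholder
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
)

# On the patched image this should return generated text
# instead of the FAILED response with the headers AttributeError.
response = client.completions.create(
    model="microsoft/Phi-3.5-mini-instruct",
    prompt="Runpod is the best platform because",
    temperature=0,
    max_tokens=100,
)
print(response.choices[0].text)
```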
Charixfox
Charixfox•4mo ago
I encountered the same problem, swapped to the dev image mentioned above after finding this thread, and things are no longer crashing outright.
gnarley_farley.
gnarley_farley.•4mo ago
Do you guys know how to run Llama 3.1 70B? Can I use a quantized version with this? GGUF? Can't seem to find anything about it.