Concept
RunPod
Created by Concept on 2/28/2024 in #⚡|serverless
VLLM Error
2024-02-28T21:49:45.485567449Z The above exception was the direct cause of the following exception:
2024-02-28T21:49:45.485572406Z
2024-02-28T21:49:45.485576486Z Traceback (most recent call last):
2024-02-28T21:49:45.485580679Z   File "/handler.py", line 8, in <module>
2024-02-28T21:49:45.485636156Z     vllm_engine = vLLMEngine()
2024-02-28T21:49:45.485673099Z                   ^^^^^^^^^^^^
2024-02-28T21:49:45.485677772Z   File "/engine.py", line 37, in __init__
2024-02-28T21:49:45.485766458Z     self.tokenizer = Tokenizer(self.config["model"])
2024-02-28T21:49:45.485851807Z                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-28T21:49:45.485869204Z   File "/engine.py", line 13, in __init__
2024-02-28T21:49:45.485930650Z     self.tokenizer = AutoTokenizer.from_pretrained(model_name)
2024-02-28T21:49:45.486024250Z                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-28T21:49:45.486057129Z   File "/usr/local/lib/python3.11/dist-packages/transformers/models/auto/tokenization_auto.py", line 752, in from_pretrained
2024-02-28T21:49:45.486279358Z     config = AutoConfig.from_pretrained(
2024-02-28T21:49:45.486317144Z              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-28T21:49:45.486355404Z   File "/usr/local/lib/python3.11/dist-packages/transformers/models/auto/configuration_auto.py", line 1082, in from_pretrained
2024-02-28T21:49:45.486897074Z     config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, kwargs)
2024-02-28T21:49:45.486931567Z                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-28T21:49:45.486950020Z   File "/usr/local/lib/python3.11/dist-packages/transformers/configuration_utils.py", line 644, in get_config_dict
2024-02-28T21:49:45.487400710Z     config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, kwargs)
2024-02-28T21:49:45.487416173Z                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-28T21:49:45.487420923Z   File "/usr/local/lib/python3.11/dist-packages/transformers/configuration_utils.py", line 699, in _get_config_dict
2024-02-28T21:49:45.487425980Z     resolved_config_file = cached_file(
2024-02-28T21:49:45.487455286Z                            ^^^^^^^^^^^^
2024-02-28T21:49:45.487497700Z   File "/usr/local/lib/python3.11/dist-packages/transformers/utils/hub.py", line 429, in cached_file
2024-02-28T21:49:45.487650365Z     raise EnvironmentError(
2024-02-28T21:49:45.487658992Z OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like mistralai/Mistral-7B-Instruct-v0.1 is not the path to a directory containing a file named config.json.
2024-02-28T21:49:45.487664865Z Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'
Anyone else having this with the vLLM worker?
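For anyone who lands on this later: the traceback means the worker tried to fetch mistralai/Mistral-7B-Instruct-v0.1 from Hugging Face at cold start, had no route to huggingface.co, and had no cached copy. A minimal sketch of one workaround, pre-downloading the weights onto the network volume and pointing the worker at the local directory instead of the repo id (the /runpod-volume path and the worker setting you change are assumptions about your setup):
# Sketch: pre-fetch the model so the worker never needs huggingface.co at cold start.
from huggingface_hub import snapshot_download

# Assumption: /runpod-volume is where the network volume is mounted in the worker.
local_dir = snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.1",
    local_dir="/runpod-volume/models/Mistral-7B-Instruct-v0.1",
)
print(local_dir)  # configure the worker's model setting to this path instead of the repo id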
4 replies
RunPod
Created by Concept on 2/1/2024 in #⚡|serverless
VLLM Worker Error that doesn't time out.
2024-02-01T18:08:19.928745487Z {"requestId": null, "message": "Traceback: Traceback (most recent call last):\n File \"/usr/local/lib/python3.11/dist-packages/runpod/serverless/modules/rp_job.py\", line 55, in get_job\n async with session.get(_job_get_url()) as response:\n File \"/usr/local/lib/python3.11/dist-packages/aiohttp/client.py\", line 1187, in __aenter__\n self._resp = await self._coro\n ^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/dist-packages/aiohttp/client.py\", line 601, in _request\n await resp.start(conn)\n File \"/usr/local/lib/python3.11/dist-packages/aiohttp/client_reqrep.py\", line 965, in start\n message, payload = await protocol.read() # type: ignore[union-attr]\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/dist-packages/aiohttp/streams.py\", line 622, in read\n await self._waiter\naiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer\n", "level": "ERROR"}
2024-02-01T18:08:19.929440753Z {"requestId": null, "message": "Failed to get job. | Error Type: ClientOSError | Error Message: [Errno 104] Connection reset by peer", "level": "ERROR"}
Worker ran for 20 hours stuck on this error. Had to kill the worker and job. What causes this?
10 replies
RunPod
Created by Concept on 1/20/2024 in #⚡|serverless
Empty Tokens Using Mixtral AWQ
No description
6 replies
RunPod
Created by Concept on 1/15/2024 in #⚡|serverless
Request Format Runpod VLLM Worker
{
  "conversation": {
    "id": "some_conversation_id",
    "messages": [
      {
        "source": "USER",
        "content": "Previous messages in the conversation..."
      }
    ]
  },
  "message": {
    "content": "Tell me why RunPod is the best GPU provider",
    "source": "USER"
  }
}
I have been using the above format with the RunPod vLLM worker to make use of the chat-history functionality. Lately I've been getting an error that "input" is missing from the JSON request, whereas this format works:
{
  "input": {
    "prompt": "Tell me why RunPod is the best GPU provider",
    "sampling_params": { "max_tokens": 100 },
    "apply_chat_template": true,
    "stream": true
  }
}
Did the input format change recently?
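For reference, a minimal sketch of sending that working "input" payload to the endpoint from Python; ENDPOINT_ID and API_KEY are placeholders, and /runsync is used here only to keep the example synchronous:
import requests

ENDPOINT_ID = "your_endpoint_id"   # placeholder
API_KEY = "your_runpod_api_key"    # placeholder

payload = {
    "input": {
        "prompt": "Tell me why RunPod is the best GPU provider",
        "sampling_params": {"max_tokens": 100},
        "apply_chat_template": True,
        "stream": True,
    }
}

# POST the job and print whatever the worker returns.
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
print(resp.json())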
11 replies
RunPod
Created by Concept on 1/15/2024 in #⚡|serverless
Runpod VLLM CUDA Out of Memory
Hi, I've been using the default RunPod vLLM template with the Mixtral model loaded on the network volume. I'm encountering CUDA out of memory on cold starts. Here is the error log:
2024-01-15T20:32:13.726720287Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 47.54 GiB of which 16.75 MiB is free. Process 422202 has 47.51 GiB memory in use. Of the allocated memory 47.05 GiB is allocated by PyTorch, and 12.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
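Some context on why this happens: vLLM reserves a fixed fraction of VRAM for weights plus KV cache when the engine starts, and unquantized Mixtral-8x7B weights alone are well beyond 48 GB, so a 48 GB card only works with a quantized build and a bounded context length. A rough sketch of the relevant knobs using the plain vLLM API (the repo id is just an example of an AWQ build; the worker template presumably exposes equivalent settings through its own configuration):
# Sketch: the vLLM engine arguments that usually decide whether the model fits at load time.
from vllm import LLM

llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",  # example AWQ build; substitute your own
    quantization="awq",             # load 4-bit AWQ weights instead of fp16
    gpu_memory_utilization=0.90,    # fraction of VRAM vLLM pre-allocates for weights + KV cache
    max_model_len=8192,             # smaller context window -> smaller KV cache reservation
)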
76 replies
RunPod
Created by Concept on 1/9/2024 in #⚡|serverless
Runpod VLLM Context Window
Hi, I've been using this template in my serverless endpoint: https://github.com/runpod-workers/worker-vllm I'm wondering what my context window is and how it's handling chat history?
{
  "conversation": {
    "id": "some_conversation_id", // This should be the ID of the conversation
    "messages": [
      {
        "source": "USER",
        "content": "Previous messages in the conversation..."
      }
      // ... other previous messages
    ]
  },
  "message": {
    "content": "Tell me why RunPod is the best GPU provider",
    "source": "USER"
  }
}
I follow the above as my input to the endpoint.
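On the context-window side: a serverless worker is stateless between requests, so the effective context is however much history gets flattened into the prompt, capped by the max_model_len the engine was started with (Mistral-7B-Instruct-v0.1 is an 8k-context model). A hedged sketch of trimming history client-side before sending it, where trim_history is a made-up helper for illustration:
# Sketch: keep the flattened chat history under a token budget before sending the request.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
MAX_PROMPT_TOKENS = 7000  # leave headroom below the model's context window

def trim_history(messages, new_message):
    # messages: list of {"source": ..., "content": ...} dicts, oldest first
    kept = list(messages) + [new_message]
    while len(kept) > 1:
        flat = "\n".join(m["content"] for m in kept)
        if len(tokenizer.encode(flat)) <= MAX_PROMPT_TOKENS:
            break
        kept.pop(0)  # drop the oldest turn until the prompt fits
    return kept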
1 reply
RunPod
Created by Concept on 12/28/2023 in #⚡|serverless
Serverless Endpoint Streaming
I'm currently working with Llama.cpp for my inference and have set up my handler.py file to be similar to this guide: https://docs.runpod.io/docs/handler-generator My input and handler file look like this:
{
  "input": {
    "query": "Temp",
    "stream": true
  }
}
import runpod
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.prompts import PromptTemplate
from langchain.llms import LlamaCpp
from langchain.chains import LLMChain


def load_llm():
    n_gpu_layers = 200
    n_batch = 500

    # Make sure the model path is correct for your system!
    llm = LlamaCpp(
        model_path="starling-lm-7b-alpha.Q4_K_M.gguf",
        n_gpu_layers=n_gpu_layers,
        use_mlock=True,
        use_mmap=True,
        max_tokens=1024,
        stop=["Q:", "Disclaimer:", "</s>", "Source:", "Legal Inquiry:", "\n\n ", "Summary:"],
        n_batch=n_batch,
        temperature=0.5,
        n_ctx=8192,
        repeat_penalty=1.18,
    )
    print("LLM Loaded! ;)")
    return llm


def process_input(input):
    """
    Execute the application code
    """
    llm = load_llm()
    query = input['query']

    prompt = PromptTemplate(
        input_variables=["context"],
        template="""
GPT4 User: {context}

<|end_of_turn|>GPT4 Legal Assistant:
""",
    )

    llmchain = LLMChain(llm=llm, prompt=prompt)
    answer = llmchain.run(query)

    return {
        "answer": answer
    }


# ---------------------------------------------------------------------------- #
#                                 RunPod Handler                                #
# ---------------------------------------------------------------------------- #
def handler(event):
    """
    This is the handler function that will be called by RunPod serverless.
    """
    return process_input(event['input'])


if __name__ == '__main__':
    runpod.serverless.start({
        'handler': handler,
        "return_aggregate_stream": True
    })
My problem is that whenever I test this out in the Requests tab on the dashboard, it keeps saying the stream is empty. https://github.com/runpod-workers/worker-vllm
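One likely reason for the empty stream: the handler above returns a single dict, so /stream has nothing incremental to show until the job finishes. The generator-handler pattern from the linked guide wants the handler to yield chunks instead. A hedged sketch of that, reusing load_llm() from the snippet above; whether .stream() exists and yields plain text chunks depends on the LangChain version in use, so treat those lines as assumptions:
import runpod

def handler(event):
    # Yielding (instead of returning) is what populates the /stream endpoint.
    llm = load_llm()  # reuses load_llm() from the snippet above
    query = event["input"]["query"]
    prompt = f"GPT4 User: {query}\n\n<|end_of_turn|>GPT4 Legal Assistant:"
    for chunk in llm.stream(prompt):  # assumption: this LangChain version yields text chunks
        yield {"text": chunk}

if __name__ == "__main__":
    runpod.serverless.start({
        "handler": handler,
        "return_aggregate_stream": True,  # /run and /runsync still get the aggregated answer
    })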
30 replies