houmie
RunPod
Created by houmie on 6/28/2024 in #⚡|serverless
vLLM serverless throws 502 errors
I'm getting these errors out of the blue, does anyone know why?

2024-06-28 00:44:12.053 [71ncv12913w751] [error] Failed to get job, status code: 502
2024-06-28 00:41:33.874 [71ncv12913w751] [info] Finished.
2024-06-28 00:41:33.844 [71ncv12913w751] [info] Finished running generator.
2024-06-28 00:41:08.658 [71ncv12913w751] [error] Failed to get job, status code: 502
2024-06-28 00:40:40.032 [71ncv12913w751] [error] Failed to get job, status code: 502
...
2024-06-28 00:16:05.919 [71ncv12913w751] [error] Traceback: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_job.py", line 55, in get_job
    async with session.get(_job_get_url()) as response:
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/client.py", line 1194, in __aenter__
    self._resp = await self._coro
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/client.py", line 605, in _request
    await resp.start(conn)
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/client_reqrep.py", line 966, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 622, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
2024-06-28 00:16:05.919 [71ncv12913w751] [error] Failed to get job. | Error Type: ServerDisconnectedError | Error Message: Server disconnected
10 replies
RunPod
Created by houmie on 6/20/2024 in #⚡|serverless
How to download models for Stable Diffusion XL on serverless?
1) I created a new network storage of 26 GB for various models I'm interested in trying.
2) I created a Stable Diffusion XL endpoint on serverless, but couldn't attach the network storage.
3) After the deployment succeeded, I clicked on edit endpoint and attached that network storage to it. So far so good, I believe. But how exactly do I download various SDXL models into my network storage so that I can use them via Postman? Many thanks
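For reference, here is the kind of thing I'm hoping to do, as a minimal sketch: attach the volume to a temporary pod and download into it with huggingface_hub. The mount path and the repo id are just my assumptions:
```python
# Run inside a temporary pod with the network volume attached.
# Assumption: the volume is mounted at /workspace (the path may differ).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="stabilityai/stable-diffusion-xl-base-1.0",  # example SDXL model
    local_dir="/workspace/models/sdxl-base",             # example target path
)
```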
24 replies
RunPod
Created by houmie on 6/18/2024 in #⚡|serverless
RUNPOD_API_KEY and MAX_CONTEXT_LEN_TO_CAPTURE
We are also starting a vLLM project and I have two questions:
1) In the environment variables, do I have to set RUNPOD_API_KEY to my own secret key to access the final vLLM OpenAI endpoint?
2) Isn't MAX_CONTEXT_LEN_TO_CAPTURE now deprecated? Do we still need to provide it if MAX_MODEL_LEN is already set?
Thank you
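For context on question 2, here is my understanding expressed against vLLM's own Python API. This is only a sketch under my assumption that MAX_MODEL_LEN alone is enough now that the capture length is deprecated upstream; the model name is just an example:
```python
from vllm import LLM

# Assumption: setting max_model_len alone is sufficient; the deprecated
# max_context_len_to_capture (replaced upstream by max_seq_len_to_capture)
# should no longer need to be passed explicitly.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    max_model_len=8192,
)
```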
27 replies
RunPod
Created by houmie on 6/18/2024 in #⚡|serverless
Do I need to allocate extra container space for Flashboot?
I'm planning to use a Llama3 model that takes about 40 GB of space. I believe Flashboot takes a snapshot of the worker and keeps it on disk so it can be loaded within seconds when the worker becomes active. Do I need to allocate enough space on the container for this? Since I'm planning to select a 48 GB VRAM GPU, do I need to allocate 40 GB model + 48 GB snapshot + 5 GB extra = 93 GB of container space?
Thanks
5 replies
RunPod
Created by houmie on 6/18/2024 in #⛅|pods
How do savings plans work?
I don't quite understand how to enable it. Wouldn't it be better if I selected a savings plan for a given period, you showed me how much I need to pay, and I paid it? Instead, it seems you expect me to know the amount upfront, credit my account with it, and THEN activate the savings plan.
Right now I would like to pay for a 3-month savings plan. How do I proceed? Thanks
12 replies
RunPod
Created by houmie on 6/14/2024 in #⛅|pods
How to exclude servers in planned maintenance?
I'm preparing the production environment for our release this weekend. When I pick 4 x RTX 4000 Ada, I end up with a server that is flagged for maintenance in the coming days. Is there a way to exclude servers that have maintenance planned? Thanks
9 replies
RunPod
Created by houmie on 6/13/2024 in #⛅|pods
What is the recommended GPU_MEMORY_UTILIZATION?
All LLM frameworks, such as Aphrodite or Oobabooga, take a parameter that specifies how much of the GPU's memory should be allocated to the LLM.
1) What is the right value? By default, most frameworks use 90% (0.9) or 95% (0.95) of the GPU memory. What is the reason for not using the full 100%?
2) Is my assumption correct that increasing the allocation to 0.99 would improve performance, but at a slight risk of an out-of-memory error? That seems paradoxical: if the model doesn't fit into VRAM, I would expect the out-of-memory error at load time, yet I have noticed it's possible to get one even after the model has loaded at 0.99. Could it be that memory usage sometimes exceeds the allocation, so a bit of buffer room is needed?
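To make the question concrete, this is how I'm setting the parameter via vLLM's Python API (a minimal sketch; the model name is just an example):
```python
from vllm import LLM

# Reserve 95% of VRAM for the engine; the remainder is headroom for the
# CUDA context and temporary buffers outside the engine's allocation.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    gpu_memory_utilization=0.95,
)
```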
30 replies
RunPod
Created by houmie on 6/12/2024 in #⛅|pods
How can I install a Docker image on RunPod?
I had a chat with the maintainer of aphrodite-engine, and he said I shouldn't use the existing RunPod image as it's very old.
He said there is a Docker image I should use instead: https://github.com/PygmalionAI/aphrodite-engine?tab=readme-ov-file#docker
And here is the docker-compose file: https://github.com/PygmalionAI/aphrodite-engine/blob/main/docker/docker-compose.yml
Sorry if my question is very basic: how do I build a RunPod image from this Docker setup so that I can run it later on RunPod? I'm still learning Docker, so I would appreciate clear instructions. Many thanks
22 replies
RunPod
Created by houmie on 5/31/2024 in #⛅|pods
How do I raise a support ticket?
I cannot interact with the Email Support button on the website, and I have received no response on Discord either. I submitted feedback a week ago here: https://discord.com/channels/912829806415085598/1243604870074732595 We are scheduled to go live in about a week, and the general lack of support is very concerning.
5 replies
RunPod
Created by houmie on 5/6/2024 in #⛅|pods
How to deploy Llama3 on Aphrodite Engine (RunPod)
No description
28 replies
RunPod
Created by houmie on 5/6/2024 in #⚡|serverless
Everything is crashing and burning today [SOLVED] + DEV image with beta 1.0.0preview feedback
No description
27 replies
RunPod
Created by houmie on 5/4/2024 in #⚡|serverless
How to stream via OPENAI BASE URL?
Does the OPENAI BASE URL support Server-Sent Events (SSE) streaming? When I was working with Ooba previously, streaming worked fine. Since we switched to vLLM on Serverless, it no longer works. If this is not done via SSE, is there a tutorial you could recommend on how to achieve streaming, please?
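For reference, this is roughly what I'm attempting with the OpenAI Python client. A sketch under my assumptions that the RunPod API key works as the bearer token and that the endpoint id placeholder is filled in:
```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",  # assumption: RunPod API key as bearer token
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",  # placeholder id
)

# stream=True asks the server to stream tokens back (SSE under the hood).
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model name
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```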
27 replies
RunPod
Created by houmie on 5/1/2024 in #⚡|serverless
Which version of vLLM is installed on Serverless?
There is currently a bug in vLLM that causes Llama3 to not use its stop tokens correctly. This was fixed in v0.4.1. https://github.com/vllm-project/vllm/issues/4180#issuecomment-2074017550 I was wondering which version of vLLM is installed on serverless. Thanks
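In case it helps, the only way I know to check is from inside a running worker, assuming shell access or a log line in the handler:
```python
import vllm

# Prints the vLLM version baked into the worker image.
print(vllm.__version__)
```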
63 replies
RunPod
Created by houmie on 5/1/2024 in #⚡|serverless
When using vLLM on OpenAI endpoint, what is the point of runsync/run?
I just managed to create a flexible worker on serverless. It works great and I can do text completions via the openai/v1/completions endpoint. What I don't understand is the purpose of runsync and run. It's not as if I'm queuing jobs somewhere to pick up the results later, right? The openai endpoint returns results straight away, and if too many users hit openai/v1/completions, additional workers come to the rescue and handle them. So what's the point of the other endpoints? Would someone be so kind as to explain that to me? Maybe I'm missing something. Thank you
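For comparison, my understanding of the queue-based flow, sketched with the /run and /status routes (the endpoint id and API key are placeholders):
```python
import time
import requests

ENDPOINT = "https://api.runpod.ai/v2/<endpoint_id>"  # placeholder endpoint id
HEADERS = {"Authorization": "Bearer YOUR_RUNPOD_API_KEY"}

# /run queues the job and returns immediately with a job id...
job = requests.post(f"{ENDPOINT}/run", headers=HEADERS,
                    json={"input": {"prompt": "Hello"}}).json()

# ...which you poll until the job finishes.
while True:
    status = requests.get(f"{ENDPOINT}/status/{job['id']}", headers=HEADERS).json()
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(status)
```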
12 replies
RunPod
Created by houmie on 5/1/2024 in #⚡|serverless
Can we run aphrodite-engine on Serverless?
aphrodite-engine is a fork of vLLM that also supports the exl2 format, which gives it a huge advantage. Are there any plans to support aphrodite-engine on RunPod's serverless offering in the future? I believe aphrodite-engine is currently only supported as a single server on RunPod. Thanks
10 replies
RunPod
Created by houmie on 4/30/2024 in #⚡|serverless
Is serverless cost per worker or per GPU?
I'm looking at serverless GPU options, and a 48 GB GPU costs $0.00048/s. But is that per worker or per GPU?
For example, if I set max workers to 3, will I be charged 3 x $0.00048/s when all three are in use? That would get very expensive very quickly... Thanks
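Back-of-the-envelope, under my assumption that billing is per active worker:
```python
price_per_second = 0.00048  # quoted price for a 48 GB GPU
workers = 3                 # max workers, all busy

# If billing is per active worker, the worst case is:
cost_per_hour = price_per_second * workers * 3600
print(f"${cost_per_hour:.3f}/hour")  # ~$5.184/hour
```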
8 replies
RunPod
Created by houmie on 4/29/2024 in #⚡|serverless
Memory usage on serverless too high
I finally managed to get the serverless setup working.
I just sent a very simple POST with a minimal prompt, but it runs out of memory. I'm using this heavily quantised model, which should fit into a 24 GB GPU: Dracones/Midnight-Miqu-70B-v1.0_exl2_2.24bpw. I chose a 48 GB GPU, so there should be plenty of room; why is it running out of memory? Error message:
2024-04-29T18:12:32.121035837Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacty of 44.35 GiB of which 71.38 MiB is free. Process 2843331 has 44.27 GiB memory in use. Of the allocated memory 43.81 GiB is allocated by PyTorch, and 13.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
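The only workaround I can think of is the allocator hint from the error message itself. A sketch, assuming it applies to this setup; it must be set before CUDA initialises:
```python
import os

# Allocator hint from the error message: cap the split size to reduce
# fragmentation. Must be set before torch initialises CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
```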
6 replies
RunPod
Created by houmie on 4/29/2024 in #⚡|serverless
What is the meaning behind template on serverless?
I would like to understand how to create a serverless RunPod environment that runs vLLM to host an LLM. What is the purpose of a template? Although it's optional, it seems a template can be pre-created and reused. What is the meaning behind templates? Thanks
4 replies