Text-generation-inference on serverless endpoints
Hi, I don't have much experience with LLMs or with Python, so I always just use the image 'ghcr.io/huggingface/text-generation-inference:latest' and run my models on Pods. Now I want to try serverless endpoints, but I don't know how to launch text-generation-inference there. Can someone give me some tips, or point me to docs that could help?
GitHub - runpod-workers/worker-vllm: The RunPod worker template for serving our large language model endpoints. Powered by vLLM. https://github.com/runpod-workers/worker-vllm
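For context, a RunPod serverless worker is just a container that registers a handler function with the runpod Python SDK, and worker-vllm packages that pattern together with a vLLM engine. A minimal sketch of the general handler pattern (illustrative only, not worker-vllm's actual code):

import runpod

def handler(job):
    # job["input"] carries whatever JSON the client sent to the endpoint.
    prompt = job["input"].get("prompt", "")
    # A real worker would run model inference here; this one just echoes.
    return {"echo": prompt}

# Register the handler so the worker can receive serverless jobs.
runpod.serverless.start({"handler": handler})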
Thanks @ashleyk, I think this can help. I'll take a look at it 🙂
For now, everything works well! I managed to deploy llama-2-7b, but I have a few more questions:
1. How can I set the temperature or other fields when sending a request?
2. Why am I seeing this deprecation notice? Am I doing something wrong?
2024-03-06T11:04:57.237074977Z
2024-03-06T11:04:57.237196083Z ==========
2024-03-06T11:04:57.237225766Z == CUDA ==
2024-03-06T11:04:57.237320751Z ==========
2024-03-06T11:04:57.239764246Z
2024-03-06T11:04:57.239767808Z CUDA Version 12.1.0
2024-03-06T11:04:57.240376901Z
2024-03-06T11:04:57.240384025Z Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2024-03-06T11:04:57.240869637Z
2024-03-06T11:04:57.240873199Z This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2024-03-06T11:04:57.240876761Z By pulling and using the container, you accept the terms and conditions of this license:
2024-03-06T11:04:57.240879135Z https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
2024-03-06T11:04:57.240883884Z
2024-03-06T11:04:57.240886259Z A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
2024-03-06T11:04:57.249862363Z
2024-03-06T11:04:57.249875424Z **
2024-03-06T11:04:57.249882548Z DEPRECATION NOTICE!
2024-03-06T11:04:57.249979908Z **
2024-03-06T11:04:57.250003654Z THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
2024-03-06T11:04:57.250009591Z https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
2024-03-06T11:04:57.250039273Z
Why are you using the CUDA 12.1.0 base image and not 12.1.1? Use 12.1.1 instead.
Please refer to the Worker vLLM documentation; it goes into a lot of detail on usage.
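For the first question, a minimal sketch of a request that sets sampling fields, assuming the worker-vllm input schema where options like temperature sit under a sampling_params object (the endpoint ID, API key, and parameter values below are placeholders; check the Worker vLLM docs for the exact field names):

import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder

# /runsync blocks until the job finishes; use /run for async jobs.
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

payload = {
    "input": {
        "prompt": "Explain serverless endpoints in one sentence.",
        # Assumed schema: sampling options nested under sampling_params.
        "sampling_params": {
            "temperature": 0.7,
            "max_tokens": 128,
        },
    }
}

response = requests.post(
    url,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=120,
)
print(response.json())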
That’s the base image vLLM uses for their own Docker image, so the worker inherits it.
Thank you, Alpay. Yes, I've found my answers in the documentation 🙂 Sorry, I should've read it to the end!
No worries at all, let me know if anything else comes up