Text-generation-inference on serverless endpoints
Hi, I don't have much experience with LLMs or with Python, so I always just use the image 'ghcr.io/huggingface/text-generation-inference:latest' and run my models on Pods. Now I want to try serverless endpoints, but I don't know how to launch text-generation-inference there. Can someone give me some tips, or point me to docs that could help?
GitHub - runpod-workers/worker-vllm: The RunPod worker template for serving our large language model endpoints. Powered by vLLM. https://github.com/runpod-workers/worker-vllm
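For context, a RunPod serverless worker is just a container that registers a handler function with the runpod Python SDK, and worker-vllm packages that pattern together with a vLLM engine. A minimal sketch of the general handler pattern (illustrative only, not worker-vllm's actual code):

import runpod

def handler(job):
    # job["input"] carries whatever JSON the client sent to the endpoint.
    prompt = job["input"].get("prompt", "")
    # A real worker would run model inference here; this one just echoes.
    return {"echo": prompt}

# Register the handler so the worker can receive serverless jobs.
runpod.serverless.start({"handler": handler})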
Thanks @ashleyk, I think this can help. I'll take a look at it 🙂
For now, everything works well! I managed to deploy llama-2-7b, but I have a few more questions:
1. How can I set the temperature or other fields when sending a request?
2. Why am I seeing this deprecation notice? Am I doing something wrong?
2024-03-06T11:04:57.237074977Z
2024-03-06T11:04:57.237196083Z ==========
2024-03-06T11:04:57.237225766Z == CUDA ==
2024-03-06T11:04:57.237320751Z ==========
2024-03-06T11:04:57.239764246Z
2024-03-06T11:04:57.239767808Z CUDA Version 12.1.0
2024-03-06T11:04:57.240376901Z
2024-03-06T11:04:57.240384025Z Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2024-03-06T11:04:57.240869637Z
2024-03-06T11:04:57.240873199Z This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2024-03-06T11:04:57.240876761Z By pulling and using the container, you accept the terms and conditions of this license:
2024-03-06T11:04:57.240879135Z https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
2024-03-06T11:04:57.240883884Z
2024-03-06T11:04:57.240886259Z A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
2024-03-06T11:04:57.249862363Z
2024-03-06T11:04:57.249875424Z **
2024-03-06T11:04:57.249882548Z DEPRECATION NOTICE!
2024-03-06T11:04:57.249979908Z **
2024-03-06T11:04:57.250003654Z THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
2024-03-06T11:04:57.250009591Z https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
2024-03-06T11:04:57.250039273Z
Why are you using the CUDA 12.1.0 base image and not 12.1.1? Use 12.1.1 instead.
Please refer to the Worker vLLM documentation; it goes into a lot of detail on usage.
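For the first question, a minimal sketch of a request that sets sampling fields, assuming the worker-vllm input schema where options like temperature sit under a sampling_params object (the endpoint ID, API key, and parameter values below are placeholders; check the Worker vLLM docs for the exact field names):

import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder

# /runsync blocks until the job finishes; use /run for async jobs.
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

payload = {
    "input": {
        "prompt": "Explain serverless endpoints in one sentence.",
        # Assumed schema: sampling options nested under sampling_params.
        "sampling_params": {
            "temperature": 0.7,
            "max_tokens": 128,
        },
    }
}

response = requests.post(
    url,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=120,
)
print(response.json())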
That’s the base image vLLM uses for their own Docker image, so the worker inherits it.
Thank you, Alpay. Yes, I've found my answers in the documentation 🙂 Sorry, I should've read it to the end!
No worries at all, let me know if anything else comes up