rien19ma
Maintenance - only a Community Cloud issue?
They also mentioned that even with this solution, they still see some delay time (not 120 s, but still some). Some mention delay times in the ms range when the model is included, but you might be right; size matters...
I haven't tried it yet, but your solution just above was also recommended here: https://www.reddit.com/r/SillyTavernAI/comments/1app7gv/new_guides_and_worker_container_for/
The idea is to validate that a specific quantized llama3 70B is good enough for our use case (which involves non-technical teams playing with the model) before running it with vLLM in batching mode (therefore no serverless), with millions of prompts to pass every X.
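For reference, the batching phase would look roughly like this (just a sketch; the model id, parallelism, and sampling settings are placeholders):
```python
# Offline batched inference with vLLM (no serverless involved).
# Model id, tensor_parallel_size, and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/llama-3-70b-instruct-awq",  # placeholder AWQ repo id
    quantization="awq",
    tensor_parallel_size=2,  # depends on the GPUs we end up with
)

sampling = SamplingParams(temperature=0.2, max_tokens=256)

prompts = ["prompt 1", "prompt 2"]  # in reality: millions, fed in chunks
for out in llm.generate(prompts, sampling):
    print(out.outputs[0].text)
```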
Thanks, Justin,
I checked your repository before asking my question here.
For context, I need to create a working Docker image containing a quantized llama3 70B model. Currently (as other users mentioned in this channel), it's not financially viable due to the delay time (around 120 s) with the vLLM template: at $0.0025/s, that's roughly a $0.30 cost just to spin up the serverless endpoint. The Runpod docs and repositories affiliated with Runpod mention that including the model inside the image reduces the delay time (some users on r/LocalLLama claim impressive figures with this technique). To do so, and to test whether it works, I was renting an instance on GCP (A100) to build and test my Docker image, and I came here to ask if that was the right workflow 🙂
All of that to say, I'm not sure how your repository (and your workflow) fits my use case; wdyt?
Thanks 🙂
Thanks for the advice. I will give it a try.
Just to summarize my understanding of the workflow:
1. Build the image locally (in my case: vLLM, the model (~50 GB), and the required Runpod handler; a rough sketch of the handler is below)
2. Push it to Docker Hub
3. Deploy a pod/serverless endpoint on Runpod and give it a try
4. If the output is not as expected, restart from step 1
Am I correct?
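And for step 1, the handler I have in mind is roughly this (paths and params are made up; the model is supposed to be already baked into the image):
```python
# handler.py: rough sketch of the Runpod serverless handler for step 1.
# The model path is an example location inside the image, so nothing is downloaded at cold start.
import runpod
from vllm import LLM, SamplingParams

# Load once at module level so the engine survives across invocations.
llm = LLM(model="/models/llama-3-70b-awq", quantization="awq")

def handler(job):
    # Expect {"input": {"prompt": "..."}} from the serverless request.
    prompt = job["input"]["prompt"]
    params = SamplingParams(max_tokens=256)
    output = llm.generate([prompt], params)[0]
    return {"text": output.outputs[0].text}

runpod.serverless.start({"handler": handler})
```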
I will rephrase it to be more accurate. I'm using the HF tools to download the model within my image (AutoTokenizer.from_pretrained, AutoModelForCausalLM.from_pretrained), and they raise an error if they can't find a GPU. To be more specific, it's actually quantization_config.py which complains about not finding a GPU for AWQ. I don't think the GPU is involved at any point during downloading; I guess it's just a check to avoid unnecessary model downloads if you don't have a GPU. But I admit I have been lazy, and I will find a way to overcome that point (by downloading the model with another tool) ^^
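By "another tool" I mean something like huggingface_hub.snapshot_download, which just fetches the files and never instantiates the model, so the AWQ GPU check never runs (repo id and path are made up):
```python
# download_model.py: run at build time (e.g. from a RUN step in the Dockerfile)
# so the weights land inside the image; no GPU is needed for the download itself.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="some-org/llama-3-70b-instruct-awq",  # placeholder AWQ repo
    local_dir="/models/llama-3-70b-awq",          # example path baked into the image
)
```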
Hi,
Thanks for your prompt answer.
Yes, "GPU-poor" because my laptop has a really bad consumer-class GPU.
You might need GPU access even at build time if, for example, you download a model from HF with vLLM. But even if it's not needed at build time, I will need it at run time for testing, and as mentioned earlier, unfortunately, I can't do that locally.
So, what is the workflow you advised? I'm not sure I fully grasp your second point:
also another alternative you can also test on runpod too
?
Thanks a lot 🙂
Hi @Papa Madiator,
This is a weird question, but what if I'm too GPU-poor on my laptop to handle building and testing a Docker image (model included) for, say, a quantized llama3 70B?
(not exactly picked randomly; it's currently my need :))
I ended up popping an instance on GCP to build and test my image before pushing it to Docker Hub, making it accessible to Runpod. Is this the correct workflow?
Is there any chance we can exit the running container from inside the pod and reach the underlying Linux, where we could docker build . [...] && docker push?