I am trying to deploy a "meta-llama/Llama-3.1-8B-Instruct" model on Serverless vLLM
I do this with the maximum possible memory.
After setup, I try to run the "hello world" sample, but the request is stuck in the queue and I get "[error]worker exited with exit code 1" with no other error or message in the log.
Is it even possible to run this model?
What is the problem? Can this be resolved?
(for the record, I did manage to run a much smaller model using the same procedure as above)
what gpu did you use?
try setting the allow remote code
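("Allow remote code" corresponds to vLLM's `trust_remote_code` engine argument; on RunPod's vLLM worker it is typically set through an environment variable on the endpoint. A sketch of what the setting maps to, with the variable name assumed from the template docs:)

```python
# Illustration only: "allow remote code" maps to vLLM's trust_remote_code
# engine argument, which lets transformers execute custom modeling code
# shipped inside the Hugging Face repo.
engine_args = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "trust_remote_code": True,  # the setting the endpoint toggle controls
}
print(engine_args["trust_remote_code"])
```

(Llama-3.1 uses a standard architecture, so this flag is usually not the root cause for that model, but it is harmless to enable.)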
I tried all of them, I think. The strongest possible, for sure.
I did
Can I see your request example?
Is this choice OK?

maybe
you use the vllm template from runpod?
your inputs
ok
it's the default (I deleted the instance by now)
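For reference, the default "hello world" request on RunPod's serverless vLLM quick-start looks roughly like this (the field names inside `input` are assumptions based on the worker-vllm template, so double-check against the endpoint docs):

```python
import json

# Sketch of a typical RunPod serverless request body. The {"input": {...}}
# wrapper is RunPod's serverless convention; the inner fields follow
# vLLM-style sampling parameters (names assumed).
payload = {
    "input": {
        "prompt": "Hello World",
        "sampling_params": {
            "max_tokens": 100,
            "temperature": 0.7,
        },
    }
}

# This is the JSON you would POST to the endpoint's /run or /runsync URL.
print(json.dumps(payload, indent=2))
```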
yes
How can I choose a GPU? (there is no choice available in the setup process)
should I be using a different template?
ok let me try
I'm trying to run that model with RunPod's template now
the one in your screenshot works
my run just succeeded with an H100
did you have your Hugging Face token in the endpoint?
no
well, that's a problem
you need it
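On the RunPod side that means adding the token as an environment variable on the endpoint. A minimal sketch, assuming the vLLM worker reads it from a variable named `HF_TOKEN` (verify the exact name in the template):

```python
import os

# Normally set in the endpoint's environment-variable settings, not in code;
# shown here only to illustrate what the worker expects to find.
# The value below is a placeholder, not a real token.
os.environ["HF_TOKEN"] = "<your-read-only-hf-token>"

# Note: a gated model like meta-llama/Llama-3.1-8B-Instruct also requires
# that your Hugging Face account has been granted access to the repo;
# the token alone is not enough.
print("HF_TOKEN" in os.environ)
```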
How do I choose a GPU? Where do I even see which GPU I got?
This here
if you select one, that's the only GPU it will use
if you select two, you can see the workers in the other tab
Is a read-only token OK?
yeah, for downloading

there's a screenshot showing how to see your workers
I get an error that I need to ask for access to the model in huggingface
I applied and waiting for approval...
Thanks for your time.
Oh i see
okok
you're welcome
It works now. Thanks.
yay