Loading Llama 70B on the vLLM serverless template, it can't answer a simple question like "what is your name"
I am loading with 1 worker and 2 80GB GPUs.
But the model just can't perform at all; it gives gibberish answers to simple prompts like "what is your name".
are you using llama 3 70b?
I tried it and it works, just long load times
What config do you use? Just the default?
I am just setting bfloat16; the rest I leave blank/default.
When I load it with the web UI, I get completely different responses.
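For what it's worth, that setup roughly corresponds to the following in plain vLLM (a sketch only: the model id is an assumption, and the serverless template's field names may differ). A 70B model in bf16 doesn't fit on a single 80GB card, so it has to be tensor-parallel across both GPUs:

```python
# Minimal vLLM sketch (assumption: meta-llama/Meta-Llama-3-70B-Instruct,
# plain vLLM Python API rather than the RunPod serverless template fields).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    dtype="bfloat16",        # the one field set in the template
    tensor_parallel_size=2,  # shard the 70B weights across both 80GB GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["What is your name?"], params)
print(out[0].outputs[0].text)
```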
Oh you tried with pods too?
Last time I left it all blank, only filled in like the fields on the first page of the vLLM setup
And used a network volume
How do you make requests to this then?
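For reference, serverless endpoints are normally hit over the RunPod API. A rough sketch (the /runsync route and Bearer auth are standard RunPod; the exact `input` payload the vLLM worker expects, prompt vs. messages and the sampling fields, is an assumption, so check the template docs):

```python
# Hypothetical sketch of calling a RunPod serverless vLLM endpoint.
# ENDPOINT_ID / API_KEY are placeholders; the "input" shape is assumed
# and may differ per worker image.
import requests

ENDPOINT_ID = "your-endpoint-id"
API_KEY = "your-runpod-api-key"

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": "What is your name?",
            "sampling_params": {"temperature": 0.7, "max_tokens": 128},
        }
    },
    timeout=300,
)
print(resp.json())
```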
Is it Llama Instruct? I think I was told there was a difference between Llama 70B and Instruct
Instruct is more like an actual chat, it responds and answers,
while the base Llama 70B is like some weird completion thing. I had also gotten gibberish answers in the past,
which made me move to just using OpenLLM
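That base-vs-Instruct difference is often where the gibberish comes from: the base 70B model only does raw next-token completion, while the Instruct variant expects its chat template wrapped around the message. A rough sketch of producing a correctly formatted prompt, assuming the Hugging Face tokenizer for the Instruct model (the model id is an assumption):

```python
# Sketch: wrap a chat message in the model's chat template before sending it
# to a completion-style endpoint.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

messages = [{"role": "user", "content": "What is your name?"}]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the assistant header so the model answers
)
print(prompt)  # send this string as the raw prompt instead of the bare question
```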
Wait isn't it chat that's like that
Instruct is the completion thing?
Oh
Lol 👁️
I thought it was only instruct
Haha maybe im wrong and to use chat model
I didn't see the llama chat version hmm
Can you send the link here
I wanna see haha
Oof, I don't remember. Let me see if I can find my old post on this where I also asked about gibberish coming out of vLLM
What's OpenLLM?
It’s just another framework to run LLM models easily. I prefer it to RunPod’s vLLM solution, which I just don’t like; for some reason I could never get vLLM to work as nicely/easily as OpenLLM, I felt
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless
https://github.com/bentoml/OpenLLM
and also I could get OpenLLM to work, vs. Ollama which requires a whole background server etc.,
and I could never get Ollama to preload models properly
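Since OpenLLM serves models behind an OpenAI-compatible API, it can be queried with the standard openai client. A minimal sketch (the base URL, port, and model name are assumptions and depend on how the server was started):

```python
# Sketch: query an OpenLLM server through its OpenAI-compatible API.
# Base URL, port and model name are assumed; adjust to your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",  # assumed OpenLLM/BentoML server address
    api_key="not-needed-locally",
)

reply = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "What is your name?"}],
    max_tokens=128,
)
print(reply.choices[0].message.content)
```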
Hmm alright I should try that out haha
https://discord.com/channels/912829806415085598/1208252068238860329/1209740324004429844
oh oops. my previous question was around mistral being dumb 😅
Yeah! It's pretty good. I have the Docker images up for Mistral 7B, and obvs the repo. I didn't realize how big 70B models are xD and left it building on Depot and ended up with stupidly large 100GB+ images lmao
Which is basically unusable
Unusable? Try it out 😂
Thankfully Depot gave me free caches 🙏
But the subscription is paid, right?
xD I don't wanna wait an hour for a single serverless worker to load 😂
what are subs?
Oh yeah, Depot usually costs money
The plan you pay for Depot
But they gave me a sponsored account
So I use it for free lol
I see, that's cool 👍
So, what's the "gibberish" response like? @PatriotSoul
I'm using the Instruct version. It just feels like it's 10x quantized, like the model is very stupid.
Yeah, that's not normal
What does the response look like?