Mixtral Possible?
Wondering if it's possible to run AWQ Mixtral on serverless with good speed
I'm currently running this with decent speeds, but you'll need to set your min and max workers accordingly depending on the load you expect
what GPU do you use?
I have min workers set to at least 1 so that it doesn't spend time booting, which is where the majority of the latency will be
I use 48GB
don't select anything under that
kk thanks
I have been trying to run Mixtral AWQ but am not getting any results returned in the completed message. I had no trouble with Llama 2, but am struggling to get Mixtral working. Anyone else have this issue?
What repository are you running it with? Just wondering. A custom repo, or?
GitHub - runpod-workers/worker-vllm: The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
Ah, a great person to ask would be @Alpay Ariyak then! 🙂 I'll ping him into this thread so you can ask him more questions. He's RunPod staff and, it seems, the main one working on the vLLM worker.
Awesome! Thank you!
Just as a question, do you want to share your build command and the input you're sending? That might be helpful for debugging when he gets a chance to take a look. Or whatever steps you took and what you're getting.
I've set the environment variables MODEL_NAME=TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ and QUANTIZATION=awq. I've got no other custom commands.
What input are you sending in? And I think that's great. Hopefully Alpay will be able to chime in then 🙂 since he's the most knowledgeable on that repo.
prompt = "Tell me about AI"
prompt_template=f'''[INST] {prompt} [/INST]
'''
prompt = prompt_template.format(prompt=prompt)
payload = {
"input": {
"prompt": prompt,
"sampling_params": {
"max_tokens": 1000,
"n": 1,
"presence_penalty": 0.2,
"frequency_penalty": 0.7,
"temperature": 1.0,
}
}
}
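For reference, a payload like this would typically be posted to the endpoint along these lines (a minimal sketch assuming RunPod's standard runsync route; the endpoint ID and API key below are placeholders):

import requests

# Placeholders -- substitute your own endpoint ID and RunPod API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = "your-runpod-api-key"

# Send the payload to the synchronous run route and print the job result.
response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=600,
)
print(response.json())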
Hi, what do the logs show?
One suggestion I've seen with quants is turning trust_remote_code on, which can be done by setting TRUST_REMOTE_CODE to 1
Could you share the actual job outputs as well?
I don't know about the quantized models but even the non-quantized Mixtral model requires trust_remote_code to be enabled.
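For anyone loading the model with vLLM directly rather than through the worker, here's a minimal sketch of what that flag maps to (assuming vLLM's offline LLM API; trust_remote_code=True is the library-level counterpart of the worker's TRUST_REMOTE_CODE=1 env var):

from vllm import LLM, SamplingParams

# trust_remote_code=True mirrors setting TRUST_REMOTE_CODE=1 on the endpoint.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
    trust_remote_code=True,
)

# Same sampling parameters as in the payload above.
params = SamplingParams(
    max_tokens=1000,
    n=1,
    presence_penalty=0.2,
    frequency_penalty=0.7,
    temperature=1.0,
)

outputs = llm.generate(["[INST] Tell me about AI [/INST]"], params)
print(outputs[0].outputs[0].text)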
That’s good to know, thanks for pointing that out