Mixtral Possible?
Wondering if it's possible to run AWQ Mixtral on serverless with good speed
I'm currently running this with decent speeds, but you'll need to set your min and max workers accordingly depending on the load you expect
what GPU do you use?
I have min workers set to at least 1 so that it doesn't spend time booting, which is where the majority of the latency will be
I use 48GB
don't select anything under that
kk thanks
I have been trying to run Mixtral AWQ but am not getting any results returned in the completed message. I had no trouble with Llama 2, but am struggling to get Mixtral working. Anyone else have this issue?
What repository are you running it with? Just wondering. A custom repo, or?
GitHub - runpod-workers/worker-vllm: The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
Ah, a great person to ask would be @Alpay Ariyak then! 🙂 I'll ping him into this thread so you can ask him more questions. He's RunPod staff and, it seems, the main one working on the vLLM worker.
Awesome! Thank you!
Just as a question, do you want to share your build command and the input you're sending? That might be helpful for debugging when he gets a chance to take a look. Or whatever steps you took and what you're getting.
I've set the environment variables MODEL_NAME=TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ and QUANTIZATION=awq. I've got no other custom commands.
What input are you sending in? And I think that's great. Hopefully Alpay will be able to chime in then 🙂 since he's the most knowledgeable on that repo.
prompt = "Tell me about AI"
prompt_template=f'''[INST] {prompt} [/INST]
'''
prompt = prompt_template.format(prompt=prompt)
payload = {
"input": {
"prompt": prompt,
"sampling_params": {
"max_tokens": 1000,
"n": 1,
"presence_penalty": 0.2,
"frequency_penalty": 0.7,
"temperature": 1.0,
}
}
}
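For reference, a payload like this would typically be posted to the endpoint along these lines (a minimal sketch assuming RunPod's standard runsync route; the endpoint ID and API key below are placeholders):

import requests

# Placeholders -- substitute your own endpoint ID and RunPod API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = "your-runpod-api-key"

# Send the payload to the synchronous run route and print the job result.
response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=600,
)
print(response.json())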
Hi, what do the logs show?
One suggestion I've seen with quants is turning trust_remote_code on, which can be done by setting TRUST_REMOTE_CODE to 1
Could you share the actual job outputs as well?
I don't know about the quantized models but even the non-quantized Mixtral model requires trust_remote_code to be enabled.
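For anyone loading the model with vLLM directly rather than through the worker, here's a minimal sketch of what that flag maps to (assuming vLLM's offline LLM API; trust_remote_code=True is the library-level counterpart of the worker's TRUST_REMOTE_CODE=1 env var):

from vllm import LLM, SamplingParams

# trust_remote_code=True mirrors setting TRUST_REMOTE_CODE=1 on the endpoint.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
    trust_remote_code=True,
)

# Same sampling parameters as in the payload above.
params = SamplingParams(
    max_tokens=1000,
    n=1,
    presence_penalty=0.2,
    frequency_penalty=0.7,
    temperature=1.0,
)

outputs = llm.generate(["[INST] Tell me about AI [/INST]"], params)
print(outputs[0].outputs[0].text)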
That’s good to know, thanks for pointing that out