Llama-70B 3.1 execution and queue delay times are much larger than 3.0's. Why?
I deployed these two models, which seem to use the same techniques. I'm running both on the same machine (2x80GB), but the execution time and queue delay differ massively:
Queue delay:
Llama70B 3.0: 0.02 secs
Llama70B 3.1: 0.15 secs
Execution time:
Llama70B 3.0: 0.65 secs
Llama70B 3.1: 3 secs
Models:
Llama 70B 3.0: https://huggingface.co/failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5
Llama 70B 3.1: https://huggingface.co/mlabonne/Llama-3.1-70B-Instruct-lorablated
1 Reply
You can find sources online indicating that 3.0 is, on average, around three times as fast as 3.1. So while 3.1 is more accurate, 3.0 is faster.
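To set the timings from the question next to that rough threefold figure, here is a minimal sketch that computes the slowdown ratios. It uses only the four numbers quoted in the question; the `timings` dictionary and the `slowdown` helper are names made up for this illustration, not part of any serving framework.

```python
# Timings in seconds, copied from the question (not re-measured).
timings = {
    "queue_delay": {"3.0": 0.02, "3.1": 0.15},
    "execution": {"3.0": 0.65, "3.1": 3.0},
}


def slowdown(metric: str) -> float:
    """Return how many times slower 3.1 is than 3.0 for the given metric."""
    t = timings[metric]
    return t["3.1"] / t["3.0"]


for metric in timings:
    print(f"{metric}: 3.1 is {slowdown(metric):.1f}x slower than 3.0")
```

On these figures the queue delay is about 7.5x and the execution time about 4.6x, so both gaps are larger than the roughly 3x average.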