llama.cpp serverless endpoint
https://github.com/ggerganov/llama.cpp
llama.cpp is afaik the only setup that supports llava-1.6 quantized, which is why I use it. On some workers the Docker image works; on others it crashes with an "illegal instruction" error. https://github.com/ggerganov/llama.cpp/issues/537 I wonder if someone has already run into this, and whether there's a better fix than building multiple binaries with the right instruction sets and stuffing them all into one image so it runs anywhere. (I already tried building with LLAMA_NATIVE=0.) Appreciate any insights, thanks!
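For reference, the multi-binary workaround I mean would look roughly like this: one llama.cpp build per instruction-set level, plus a small entrypoint that checks the host CPU and execs the matching binary. This is just a sketch; the binary names and paths are made up.

```python
#!/usr/bin/env python3
"""Container entrypoint: pick the llama.cpp build matching the host CPU."""
import os
import sys


def cpu_flags():
    # Parse the "flags" line from /proc/cpuinfo (Linux only).
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


def pick_binary(flags):
    # Most capable build first, plain build as the last resort.
    if "avx512f" in flags:
        return "/app/server-avx512"
    if "avx2" in flags:
        return "/app/server-avx2"
    if "avx" in flags:
        return "/app/server-avx"
    return "/app/server-basic"


if __name__ == "__main__":
    binary = pick_binary(cpu_flags())
    # Replace this process with the chosen server, forwarding all arguments.
    os.execv(binary, [binary] + sys.argv[1:])
```

It works, but the image gets big and the build matrix is annoying to maintain, hence the question.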
Solution
I don't know why you would want to use llama.cpp; it's more for offloading onto CPU than for GPU. You can look at using this instead:
https://github.com/ashleykleynhans/runpod-worker-llava
If you really want to use llama.cpp, keep using the GitHub issue; it's clearly an issue with llama.cpp, so I don't know why you're posting it here. It is 100% not a RunPod issue.
You even posted a link to an issue on their GitHub, so it really makes no sense to create a post for support on RunPod.
RunPod is an infrastructure provider; it's not here to help you with bugs in the applications you're running. GitHub is there for that. Imagine AWS etc. trying to help every user with every bug in every application they want to run; it's simply not feasible. That's what GitHub issues are for.
Yeah, was just wondering if anyone had experience with this already, thanks.
llama.cpp supports quantized models; haotian-liu/LLaVA does not yet, afaik. 34B unquantized is just too big.
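To illustrate what I mean by quantized LLaVA support in the llama.cpp ecosystem (this isn't my worker code, just a minimal sketch via the llama-cpp-python bindings; the file paths are placeholders and I'm reusing the 1.5 chat handler here, the 1.6 setup may differ):

```python
# Load a quantized LLaVA GGUF plus its CLIP projector and ask about an image.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="/models/mmproj-llava.gguf")
llm = Llama(
    model_path="/models/llava-1.6-34b.Q4_K_M.gguf",  # quantized weights
    chat_handler=chat_handler,
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            {"type": "text", "text": "Describe the image."},
        ],
    }]
)
print(out["choices"][0]["message"]["content"])
```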