RunPod • 9mo ago
bigslam

How to run Ollama on RunPod Serverless?

As the title suggests, I'm trying to find a way to deploy Ollama on RunPod as a serverless application. Thank you
24 Replies
Solution
justin • 9mo ago
Ollama has a way to override where the models get downloaded. So you essentially create a network volume for your serverless endpoint (on serverless it gets mounted under /runpod-volume), and when your Ollama server starts through a background script on startup, you do whatever you want. Overall it's a bit of a pain.
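A minimal sketch of that approach in Python, assuming the worker image has the Ollama CLI and the RunPod Python SDK installed; the volume path, default model name, and timeouts below are illustrative assumptions rather than anything prescribed in this thread:

```python
# Sketch of a serverless start script: point Ollama's model store at the
# network volume, launch the server in the background, then hand off to
# the RunPod handler. Paths and model names are assumptions.
import os
import subprocess
import time

import requests
import runpod

# /runpod-volume is where a serverless network volume is mounted;
# OLLAMA_MODELS tells Ollama where to store and look for models.
os.environ["OLLAMA_MODELS"] = "/runpod-volume/ollama/models"

# Start the Ollama server as a background process (it inherits OLLAMA_MODELS).
subprocess.Popen(["ollama", "serve"])

# Wait until the local API answers before accepting jobs.
for _ in range(60):
    try:
        requests.get("http://127.0.0.1:11434", timeout=1)
        break
    except requests.exceptions.ConnectionError:
        time.sleep(1)


def handler(job):
    """Forward the prompt to the local Ollama API and return its response."""
    prompt = job["input"]["prompt"]
    model = job["input"].get("model", "mistral")  # default model is an assumption
    resp = requests.post(
        "http://127.0.0.1:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    return resp.json()


runpod.serverless.start({"handler": handler})
```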
justin • 9mo ago
I recommend using RunPod's vLLM worker if you're looking for a RunPod-supported method. Alpay can help, as he is a staff member working specifically on it.
justin • 9mo ago
GitHub: runpod-workers/worker-vllm - The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
justin • 9mo ago
Option 1: if you have any specific models in mind, @Alpay Ariyak can help. Or use a community option like what I built, which has everything baked into the Docker container, avoiding network volumes, since network volumes have some downsides like being locked into a region. Plus, I already have Docker images ready to go.
justin • 9mo ago
GitHub: justinwlin/Runpod-OpenLLM-Pod-and-Serverless - A repo for OpenLLM to run on a Pod and Serverless.
justin • 9mo ago
I even have client-side code examples for mine. These are not "Ollama", but I assume it would achieve your purpose of running your own LLM, maybe something like Mistral 7B.
bigslam (OP) • 8mo ago
I want to run quantized LLMs, @justin, e.g. GGUF.
JJonahJ • 8mo ago
vLLM supports AWQ quantization, but yeah, it would be nice to have other options for text inference. I keep seeing this 'grammars' thing mentioned around the place, but AFAIK vLLM doesn't support that either…
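As a rough illustration of that AWQ support, loading an AWQ checkpoint directly with vLLM's Python API looks something like this (the model repo name is only an example, not something used in this thread):

```python
# Minimal example of serving an AWQ-quantized model with vLLM's offline API.
from vllm import LLM, SamplingParams

# Any AWQ checkpoint works the same way; this repo name is only an example.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain the difference between AWQ and GGUF."], params)
print(outputs[0].outputs[0].text)
```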
justin • 8mo ago
I mean, as I said, if you want to run it, just attach a network volume and override where Ollama stores the models so they land on the network drive. The problem with Ollama is that it needs to start a background server and check whether the models are there; if not, it downloads them again. So the main thing is overriding the default check-path logic, so that when your worker starts up it checks the network volume if one already exists. For some reason I could never get it to work by manually copying the models into my Docker image (I don't know how their hash checking works), and I wanted everything built into my Docker image, so I just moved to using OpenLLM.
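A quick sketch of that "check the volume before downloading" step, assuming the Ollama server from the startup script is already running; the model tag and volume layout are assumptions for illustration:

```python
# Only pull the model if it is not already in the network-volume model store,
# so later cold starts in the same region can reuse the cached download.
import os
import subprocess

MODEL_TAG = "mistral:7b"                    # example tag, not from this thread
MODELS_DIR = "/runpod-volume/ollama/models"

os.environ["OLLAMA_MODELS"] = MODELS_DIR


def model_is_cached(tag: str) -> bool:
    """Ask the running Ollama server which models it already has on disk."""
    result = subprocess.run(["ollama", "list"], capture_output=True, text=True)
    return tag in result.stdout


if not model_is_cached(MODEL_TAG):
    # First run on a fresh volume: download into /runpod-volume.
    subprocess.run(["ollama", "pull", MODEL_TAG], check=True)
```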
bigslam (OP) • 8mo ago
It's easier to run Ollama on a GPU Pod, but I'm trying to save time and want a serverless implementation.
Armyk • 6mo ago
Any news on this? Did you manage to run Ollama in serverless? I need to run a GGUF model.
giannisan. • 6mo ago
I'm wondering the same thing; I'm having trouble with the serverless config for Ollama.
nerdylive • 6mo ago
Why not try it with vLLM? You can make the template yourself; check some of the worker handler code implementations on GitHub.
digigoblin • 6mo ago
Obviously because vLLM does NOT support GGUF.
nerdylive • 6mo ago
oh right
PatrickR • 6mo ago
We have a tutorial on this. It's for CPU, but you can run it on GPU too: https://docs.runpod.io/tutorials/serverless/cpu/run-ollama-inference
Run an Ollama Server on a RunPod CPU | RunPod Documentation
Learn to set up and run an Ollama server on RunPod CPU for inference with this step-by-step tutorial.
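Once an endpoint like the one in that tutorial is deployed, calling it from a client looks roughly like this; the endpoint ID placeholder, the RUNPOD_API_KEY environment variable, and the exact input schema all depend on your own deployment and handler:

```python
# Hedged example of calling a deployed serverless endpoint synchronously.
import os

import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder, replace with your endpoint ID
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    # The "input" payload below is hypothetical; it must match your handler.
    json={"input": {"prompt": "Why is the sky blue?", "model": "llama3"}},
    timeout=600,
)
print(resp.json())
```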
nerdylive • 6mo ago
Wow
digigoblin • 6mo ago
How come some of this stuff is in blog posts and some in the docs?
nerdylive • 6mo ago
Hahah, it's a tutorial, right? In my opinion, stuff like that for specific use cases should be in tutorials.
digigoblin • 6mo ago
Well, my point is that some tutorials are blog posts and others are docs. It would be nice to have some level of consistency so you know where to find things.
nerdylive • 6mo ago
What do you mean by "level of consistency"?
digigoblin • 6mo ago
Put everything that is a tutorial in the same place, not scattered all over. I don't want to search docs, blog posts, etc. to find something; I want to go to one place.
nerdylive • 6mo ago
Ohh, I see.
PatrickR • 6mo ago
@digigoblin It's a good point! Stuff in the tutorials is supported: updates will occur, and customer support can answer questions. Blog posts are kind of a snapshot in time; they don't always get updated and have less quality control. We have a ticket to go back and turn old blog posts into tutorials.