How to run Ollama on RunPod Serverless?
As the title suggests, I'm trying to find a way to deploy Ollama on RunPod as a serverless application. Thank you
Solution
Ollama has a way to override where the models get downloaded, so you essentially create a network volume for your serverless endpoint; on serverless workers it gets mounted at /runpod-volume.
Then, when your Ollama server starts through a background script at startup, you can do whatever you want. Overall it's a bit of a pain.
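For context, here's a minimal sketch of that background startup step, assuming the network volume is mounted at /runpod-volume as described above. OLLAMA_MODELS is Ollama's environment variable for overriding the model directory; the subdirectory name and the wait loop are my own choices, not anything specific from this thread.
```python
# Rough sketch of the startup step, assuming the network volume is mounted
# at /runpod-volume. OLLAMA_MODELS is Ollama's env var for overriding the
# model directory; the subfolder name and wait loop are my own choices.
import os
import subprocess
import time
import urllib.request

MODEL_DIR = "/runpod-volume/ollama"      # persists across workers / cold starts
os.makedirs(MODEL_DIR, exist_ok=True)
os.environ["OLLAMA_MODELS"] = MODEL_DIR  # tell Ollama to store models here

# Start the Ollama server in the background.
subprocess.Popen(["ollama", "serve"])

# Wait until the local API responds before taking jobs.
for _ in range(60):
    try:
        urllib.request.urlopen("http://127.0.0.1:11434", timeout=1)
        break
    except OSError:
        time.sleep(1)
```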
I recommend using RunPod's worker-vllm if you're looking for a RunPod-supported method; Alpay can help, as he's a staff member working specifically on it.
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
Option 1: if you have any specific models in mind, @Alpay Ariyak can help.
Option 2: use a community project like the one I built, which has everything baked into the Docker container. That avoids network volumes, since network volumes have some downsides like locking you into a region, and I already have Docker images ready to go.
GitHub
GitHub - justinwlin/Runpod-OpenLLM-Pod-and-Serverless: A repo for O...
A repo for OpenLLM to run pod. Contribute to justinwlin/Runpod-OpenLLM-Pod-and-Serverless development by creating an account on GitHub.
I even have client-side code examples for mine.
These aren't "Ollama", but I assume they'd achieve your purpose of running your own LLM, maybe something like Mistral 7B.
I want to run quantized LLMs, @justin, e.g. GGUF.
vLLM supports AWQ-quantized models, but yeah, it would be nice to have other options for text inference. I keep seeing this 'grammars' thing mentioned around the place, but AFAIK vLLM doesn't support that either…
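As an aside, here's a minimal sketch of loading an AWQ-quantized model with vLLM's offline Python API; the checkpoint name below is just an example AWQ repo, not a recommendation from this thread.
```python
# Minimal sketch: loading an AWQ-quantized model with vLLM's offline API.
# The checkpoint name is just an example of an AWQ repo, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain AWQ quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```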
I mean, as I said, if you want to run it, just attach a network volume and override where Ollama stores the models so they live on the network drive. The catch with Ollama is that it needs to start a background server and check whether the models are there; if not, it downloads them again. So the main thing is overriding the default storage path, so that when your worker starts up, it checks the network volume and reuses whatever is already there.
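To make that concrete, here's a hedged sketch of what the worker side could look like, assuming `ollama serve` was already started in the background with OLLAMA_MODELS pointing at the network volume (as in the earlier sketch). Pulling the model reuses what's already on the volume, and the prompt is forwarded to the local Ollama API via the runpod SDK. The default model tag and input shape are placeholders, not anything from the thread.
```python
# Hedged sketch of the worker side, assuming "ollama serve" was already started
# in the background with OLLAMA_MODELS pointing at the network volume (as above).
# The default model tag and input shape are placeholders, not from the thread.
import json
import subprocess
import urllib.request

import runpod  # RunPod serverless SDK

DEFAULT_MODEL = "mistral:7b"  # example tag

def handler(job):
    inp = job["input"]
    model = inp.get("model", DEFAULT_MODEL)
    prompt = inp.get("prompt", "")

    # "ollama pull" reuses the blobs already on the network volume and only
    # downloads if the model isn't there yet.
    subprocess.run(["ollama", "pull", model], check=True)

    # Forward the prompt to the local Ollama API and return the generated text.
    req = urllib.request.Request(
        "http://127.0.0.1:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

runpod.serverless.start({"handler": handler})
```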
For some reason, I could never get it to work by manually copying the models into my Docker image; I don't know how their hash checking works. I wanted everything built into my Docker image, so I just moved to using OpenLLM.
It's easier to run Ollama on a GPU Pod, but I'm trying to save time and want a serverless implementation.
Any news on this? Did you manage to run Ollama in serverless? I need to run a GGUF model.
I am wondering the same; I'm having trouble with the serverless config for Ollama.
Why not try it on vLLM?
You can make the template yourself; check some of the worker handler implementations on GitHub.
Obviously because vLLM does NOT support GGUF.
oh right
We have a tutorial on this.
It's for CPU, but you can run it on GPU too:
https://docs.runpod.io/tutorials/serverless/cpu/run-ollama-inference
Run an Ollama Server on a RunPod CPU | RunPod Documentation
Learn to set up and run an Ollama server on RunPod CPU for inference with this step-by-step tutorial.
Wow
How come some stuff is in blog posts and some in the docs?
Hahah, it's a tutorial, right?
In my opinion, stuff like that for specific use cases should be in tutorials.
Well, my point is that some tutorials are blog posts and others are docs. It would be nice to have some level of consistency, so you know where to find things.
What do you mean by "level of consistency"?
Put everything that is a tutorial in the same place, not scattered all over.
I don't want to search docs, blog posts, etc. to find something. I want to go to one place.
Ohh, I see.
@digigoblin It's a good point! Stuff in the tutorials is supported: it gets updated, and customer support can answer questions about it.
Blog posts are more like a snapshot in time; they don't always get updated and have less quality control.
We have a ticket to go back and turn old blog posts into tutorials.