How to run Ollama on RunPod Serverless?
As the title suggests, I'm trying to find a way to deploy Ollama on RunPod as a serverless application. Thank you
Solution
Ollama has a way to override where the models get downloaded, so you essentially create a network volume for your serverless endpoint; on serverless workers it gets mounted at /runpod-volume.
Then, when your Ollama server starts through a background script at startup, you can do whatever you want. Overall it's a bit of a pain.
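For context, here's a minimal sketch of that background startup step, assuming the network volume is mounted at /runpod-volume as described above. OLLAMA_MODELS is Ollama's environment variable for overriding the model directory; the subdirectory name and the wait loop are my own choices, not anything specific from this thread.
```python
# Rough sketch of the startup step, assuming the network volume is mounted
# at /runpod-volume. OLLAMA_MODELS is Ollama's env var for overriding the
# model directory; the subfolder name and wait loop are my own choices.
import os
import subprocess
import time
import urllib.request

MODEL_DIR = "/runpod-volume/ollama"      # persists across workers / cold starts
os.makedirs(MODEL_DIR, exist_ok=True)
os.environ["OLLAMA_MODELS"] = MODEL_DIR  # tell Ollama to store models here

# Start the Ollama server in the background.
subprocess.Popen(["ollama", "serve"])

# Wait until the local API responds before taking jobs.
for _ in range(60):
    try:
        urllib.request.urlopen("http://127.0.0.1:11434", timeout=1)
        break
    except OSError:
        time.sleep(1)
```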
I recommend using RunPod's worker-vllm if you're looking for a RunPod-supported method; Alpay can help, as he's a staff member working specifically on it.
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
Option 1: if you have any specific models in mind, @Alpay Ariyak can help.
Option 2: use a community project like the one I built, which has everything baked into the Docker container. That avoids network volumes, since network volumes have some downsides like locking you into a region, and I already have Docker images ready to go.
GitHub
GitHub - justinwlin/Runpod-OpenLLM-Pod-and-Serverless: A repo for O...
A repo for OpenLLM to run pod. Contribute to justinwlin/Runpod-OpenLLM-Pod-and-Serverless development by creating an account on GitHub.
I even have client-side code examples for mine.
These aren't "Ollama", but I assume they'd achieve your purpose of running your own LLM, maybe something like Mistral 7B.
I want to run quantized LLMs, @justin, e.g. GGUF.
vLLM supports AWQ-quantized models, but yeah, it would be nice to have other options for text inference. I keep seeing this 'grammars' thing mentioned around the place, but AFAIK vLLM doesn't support that either…
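As an aside, here's a minimal sketch of loading an AWQ-quantized model with vLLM's offline Python API; the checkpoint name below is just an example AWQ repo, not a recommendation from this thread.
```python
# Minimal sketch: loading an AWQ-quantized model with vLLM's offline API.
# The checkpoint name is just an example of an AWQ repo, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain AWQ quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```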
I mean, as I said, if you want to run it, just attach a network volume and override where Ollama stores the models so they live on the network drive. The catch with Ollama is that it needs to start a background server and check whether the models are there; if not, it downloads them again. So the main thing is overriding the default storage path, so that when your worker starts up, it checks the network volume and reuses whatever is already there.
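To make that concrete, here's a hedged sketch of what the worker side could look like, assuming `ollama serve` was already started in the background with OLLAMA_MODELS pointing at the network volume (as in the earlier sketch). Pulling the model reuses what's already on the volume, and the prompt is forwarded to the local Ollama API via the runpod SDK. The default model tag and input shape are placeholders, not anything from the thread.
```python
# Hedged sketch of the worker side, assuming "ollama serve" was already started
# in the background with OLLAMA_MODELS pointing at the network volume (as above).
# The default model tag and input shape are placeholders, not from the thread.
import json
import subprocess
import urllib.request

import runpod  # RunPod serverless SDK

DEFAULT_MODEL = "mistral:7b"  # example tag

def handler(job):
    inp = job["input"]
    model = inp.get("model", DEFAULT_MODEL)
    prompt = inp.get("prompt", "")

    # "ollama pull" reuses the blobs already on the network volume and only
    # downloads if the model isn't there yet.
    subprocess.run(["ollama", "pull", model], check=True)

    # Forward the prompt to the local Ollama API and return the generated text.
    req = urllib.request.Request(
        "http://127.0.0.1:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

runpod.serverless.start({"handler": handler})
```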
For some reason, I could never get it to work by manually copying the models into my Docker image; I don't know how their hash checking works. I wanted everything built into my Docker image, so I just moved to using OpenLLM.
It's easier to run Ollama on a GPU Pod, but I'm trying to save time and want a serverless implementation.
Any news on this? Did you manage to run Ollama in serverless? I need to run a GGUF model.
I am wondering the same; I'm having trouble with the serverless config for Ollama.
Why not try it on vLLM?
You can make the template yourself; check some of the worker handler implementations on GitHub.
Obviously because vLLM does NOT support GGUF.
oh right
We have a tutorial on this.
It's for CPU, but you can run it on GPU too:
https://docs.runpod.io/tutorials/serverless/cpu/run-ollama-inference
Run an Ollama Server on a RunPod CPU | RunPod Documentation
Learn to set up and run an Ollama server on RunPod CPU for inference with this step-by-step tutorial.
Wow
How come some stuff is in blog posts and some in the docs?
Hahah, it's a tutorial, right?
In my opinion, stuff like that for specific use cases should be in tutorials.
Well, my point is that some tutorials are blog posts and others are docs. It would be nice to have some level of consistency, so you know where to find things.
What do you mean by "level of consistency"?
Put everything that is a tutorial in the same place, not scattered all over.
I don't want to search docs, blog posts, etc. to find something. I want to go to one place.
Ohh, I see.
@digigoblin It's a good point! Stuff in the tutorials is supported: it gets updated, and customer support can answer questions about it.
Blog posts are more like a snapshot in time; they don't always get updated and have less quality control.
We have a ticket to go back and turn old blog posts into tutorials.