RunPod10mo ago
danlas

API to Text Generation Web UI

Hello! I want to upload a model using a serverless pod. It will be a Text Generation Web UI, and I know I will end up getting an endpoint. I would like to create a few "characters" on there, but how can I interact with the model from my code? Is there some type of POST endpoint I need to hit to use a certain character and run my model? I am currently using OpenAI GPT-4 for this but would like to switch over. Thanks!!
24 Replies
justin
justin10mo ago
By Text Generation Web UI, are you looking to host a web UI interface, or are you looking to use some sort of LLM model through an API? This will highly depend on the model you are using; usually most LLM models have some example Python code showing how to interface with them. If you can do it in Python, you can essentially write your handler to do it:
import runpod  # RunPod serverless SDK

def handler(job):
    model = load_model()  # placeholder: however you load your model
    return model.predict(job["input"]["something"])

runpod.serverless.start({"handler": handler})
danlas
danlasOP10mo ago
I have a model I want to use (I'm on my phone right now, but it's TheBloke 2.5, I think the latest one). Right now I am using the web UI to load it up, and I created characters to use with it. My end goal is just to host the LLM and invoke it with chat history + a certain context (character). Is there a certain type of template I should use for RunPod? @justin I have never hosted an LLM before, so this is all new to me
justin
justin10mo ago
I see. There is no particular "template". For RunPod, as long as you can run it in a Docker container, it will run. The hard part of what you're asking, I think, just requires more development knowledge, and that depends on you. It is certainly possible to host an LLM, but having context history etc. will require more development time. I also think it would require a lot of experimentation; personally I am not as familiar with LLMs, I'm just reading through their GitHub repository
justin
justin10mo ago
GitHub
GitHub - TheBlokeAI/dockerLLM: TheBloke's Dockerfiles
TheBloke's Dockerfiles. Contribute to TheBlokeAI/dockerLLM development by creating an account on GitHub.
justin
justin10mo ago
but I think you're probably looking more for something like
justin
justin10mo ago
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - GitHub - runpod-workers/worker-vllm: The RunPod worker template for serving our large language model en...
justin
justin10mo ago
But I know people in the server have had trouble with worker-vllm. If you're okay with slower response times:
justin
justin10mo ago
Generative Labs has great videos on doing exactly what you want: hosting your own LLM.
danlas
danlasOP10mo ago
Ok thanks! I'll take a peep at it tomorrow. Maybe I've got to create my own API endpoint inside the pod, then use my backend code to hit it (my guess). Awesome, thanks!
justin
justin10mo ago
Yeah, the only thing is that setups using network volumes, like Generative Labs', can have slower response times. It's not really "inside the pod". So RunPod has: GPU Pods, which are running Linux computers until you turn them off (you can think of them that way), and serverless, where you essentially define a function in Python; when the endpoint is pinged, RunPod turns on your Docker container in the same environment as a GPU pod to serve the request, and when the function is done, turns it off 🙂 So serverless is a bit different from just exposing a port on a persistent pod that runs until you turn it off.
danlas
danlasOP10mo ago
Yeah, I was leaning toward serverless to save on costs. My main question was how my current backend app can hit this new LLM I'm hosting and pass in context/history. Currently using GPT-4 and LangChain for it. But I'll take a look at the docs you sent tomorrow.
justin
justin10mo ago
Well, it's really just appending, haha; that is essentially what history is. All LLMs are stateless, so any history is cached on your backend, and when you make a request you're essentially appending it to the prompt. Or, if the appended history gets too long, you use the LLM to summarize it and append the summarization to the incoming new query. I'll also argue that if ChatGPT is too expensive
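A minimal sketch of that append-or-summarize pattern (all names here, like call_llm and the prompt format, are hypothetical placeholders, not from any particular library):

MAX_HISTORY_CHARS = 4000  # arbitrary cutoff; tune to your model's context window

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for whatever model/API call you use

def build_prompt(character: str, history: list[str], user_message: str) -> str:
    history_text = "\n".join(history)
    if len(history_text) > MAX_HISTORY_CHARS:
        # History too long: compress it with the LLM itself, as described above.
        history_text = call_llm("Summarize this conversation:\n" + history_text)
    return f"{character}\n{history_text}\nUser: {user_message}\nAssistant:"

def chat(character: str, history: list[str], user_message: str) -> str:
    reply = call_llm(build_prompt(character, history, user_message))
    # Cache the new turn on the backend; the model itself keeps no state.
    history.append(f"User: {user_message}")
    history.append(f"Assistant: {reply}")
    return reply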
justin
justin10mo ago
Mistral is a great alternative. You can certainly host your own LLMs, but I've never found it worth it compared to worrying about cold starts, technical difficulties, etc. That's why I tend to use RunPod more for other ML models; for LLMs, you can probably go down cheaper routes that are already managed for you before turning to managing it yourself, unless you have a specific use case like privacy concerns, and so on.
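If going the managed-API route, a rough sketch of calling Mistral's hosted chat endpoint (the model name and payload fields are assumptions based on their OpenAI-style API; check their docs for the current schema):

import os
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-small",  # assumed model name
        "messages": [
            {"role": "system", "content": "You are the character Luna."},  # hypothetical character context
            {"role": "user", "content": "Hi!"},
        ],
    },
)
print(resp.json()["choices"][0]["message"]["content"])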
danlas
danlasOP10mo ago
That looks like a good site. Maybe I can just use the API.
justin
justin10mo ago
Yup!
danlas
danlasOP10mo ago
Saves me from having to run my own
justin
justin10mo ago
RunPod has their own API for hosted Llama models: https://doc.runpod.io/reference/llama2-13b-chat An interesting idea I've heard is that
RunPod
Llama2 13B Chat
Retrieve Results & StatusNote: For information on how to check job status and retrieve results, please refer to our Status Endpoint Documentation.Streaming Token Outputs Make a POST request to the /llama2-13b-chat/run API endpoint.Retrieve the job ID.Make a GET request to /llama2-13b-chat/stream...
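Following the run/status flow that doc describes, a rough sketch of hitting the endpoint from backend code (the input schema, i.e. the "prompt" field, is an assumption; check the endpoint reference for the exact fields):

import os
import time
import requests

BASE = "https://api.runpod.ai/v2/llama2-13b-chat"
headers = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

# Submit the job; the response includes a job ID.
job = requests.post(f"{BASE}/run", headers=headers,
                    json={"input": {"prompt": "Hello!"}}).json()

# Poll the status endpoint until the job finishes.
while True:
    status = requests.get(f"{BASE}/status/{job['id']}", headers=headers).json()
    if status["status"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(1)

print(status.get("output"))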
justin
justin10mo ago
(Personally I think Llama 2 isn't that good without fine-tuning.) But you can use it for summarizations if RunPod is cheaper, and use Mistral / ChatGPT for the more complex stuff. But yeah.
danlas
danlasOP10mo ago
I'm trying to make a site like https://candy.ai and their model is really good. GPT-4 is very censored and doesn't work.
Candy.ai - Enjoy The Ultimate AI Girlfriend Experience
Engage with Candy.ai's virtual companions for immersive and personalized chats. Dive deep into intricate dialogues, and liberate your imagination. Experience adaptive AI-driven role-plays today.
justin
justin10mo ago
I see. I think Mistral is your best bet 🙂 I don't think they censor... I could be wrong. But I definitely think that's worth a shot first.
danlas
danlasOP10mo ago
As long as the API can take a context and chat history, then I'll be good! Thanks man!
JJonahJ
JJonahJ10mo ago
https://hub.docker.com/repository/docker/toxibunny/mixtral-8x7b-moe-rp-story-awq/general https://hub.docker.com/repository/docker/toxibunny/rpmacaronimaidapi/general Here are a couple of 'uncensored' RP-focused models, ready for use with the RunPod llama2 endpoint. I tested them a little and they seem to be working so far. It is just kind of like the OpenAI API, though, where you have to handle context/history yourself… Edit: though I forgot about the Assistants API, which handles all that for you 😅 It's been a while…