Serverless deepseek-ai/DeepSeek-R1 setup?
How can I configure a serverless endpoint for deepseek-ai/DeepSeek-R1?
18 Replies
does vLLM support that model?
if not, you can build a custom worker that runs inference for it
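something like this as a rough sketch (the handler layout and the model/parameter choices here are just placeholders, not your exact setup):
```python
# Rough sketch of a custom RunPod serverless worker (illustrative only).
import runpod
from vllm import LLM, SamplingParams

# Load the model once at cold start; the model ID here is just an example.
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

def handler(job):
    # RunPod passes the request payload under job["input"]
    prompt = job["input"]["prompt"]
    params = SamplingParams(max_tokens=256, temperature=0.7)
    outputs = llm.generate([prompt], params)
    return {"text": outputs[0].outputs[0].text}

runpod.serverless.start({"handler": handler})
```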
Basic config, with a GPU count of 2
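For reference, roughly what I configured (just an illustrative summary, the exact field names in the vLLM worker template may differ):
```python
# Illustrative summary of the endpoint settings I used (not an actual API payload).
endpoint_config = {
    "model": "deepseek-ai/DeepSeek-R1",  # Hugging Face model ID
    "gpu_count": 2,                      # workers with 2 GPUs each
    # everything else left at the template defaults
}
```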
Once it is running, I try the default hello world request and it just gets stuck IN_QUEUE for 8 minutes...
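This is more or less the request I'm sending (endpoint ID and API key are placeholders):
```python
# Minimal "hello world" request against the serverless endpoint.
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"
API_KEY = "YOUR_RUNPOD_API_KEY"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Submit the job asynchronously...
run = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers=headers,
    json={"input": {"prompt": "Hello, world!"}},
).json()

# ...then poll its status; this is where it just sits at IN_QUEUE.
status = requests.get(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{run['id']}",
    headers=headers,
).json()
print(status["status"])  # e.g. IN_QUEUE, IN_PROGRESS, COMPLETED
```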
Can you check the logs? Maybe it's still downloading
or OOM
wait... how big is the model?
seems like R1 is a really huge model, isn't it?
yes, but I even tried just following along with the YouTube tutorial here and got the same IN_QUEUE problem: https://youtu.be/0XXKK82LwWk?si=ZDCu_YV39Eb5Fn8A
Any logs?
in your workers or endpoint?
Oh, wait!! I just ran the 1.5B model and got this response:
When I tried running the larger model, I got errors about not enough memory
""Uncaught exception | <class 'torch.OutOfMemoryError'>; CUDA out of memory. Tried to allocate 3.50 GiB. GPU 0 has a total capacity of 44.45 GiB of which 1.42 GiB is free"
seems like you got an OOM, yeah...
So how do I configure it?
R1 is such a huge model, seems like you need 1TB+ of VRAM
I don't know how to calculate it exactly, but my estimate is maybe somewhere in the range of 700GB+ of VRAM
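rough back-of-the-envelope math, assuming ~671B total parameters and FP8 weights (~1 byte per parameter), ignoring KV cache and other runtime overhead:
```python
# Rough VRAM estimate for the full DeepSeek-R1 (assumptions: ~671B params, FP8 weights).
params = 671e9
bytes_per_param = 1                     # FP8; double this for FP16/BF16
weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")   # ~671 GB before KV cache/overhead

# Compare with the endpoint above: 2 GPUs x 44.45 GiB each
available_gb = 2 * 44.45 * 1.074        # GiB -> GB
print(f"available: ~{available_gb:.0f} GB")     # ~95 GB, nowhere near enough
```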
wow
so it's not really an option to deploy, then?
not sure, depends on your use case haha
I mean, DeepSeek offers their own API keys
I thought it could be more cost-effective to just run a serverless endpoint here, but...
only if you have enough volume, especially for bigger models imo
hmm.. I see
Thanks for your help
you're welcome bro