3WaD
Created by xnerhu on 3/25/2025 in #⚡|serverless
Serverless handler on Nodejs
It is. The thing running on the serverless worker can be whatever you want. You just have to expose it to RunPod via the handler function in the Python script. The minimal code required is:
import runpod

def handler(job):
    job_input = job["input"]  # request parameters sent to RunPod are exposed via this
    # Your code here.
    return "your-job-results"

runpod.serverless.start({"handler": handler})
You can then connect your Nodejs code, for example via an HTTP API, by running a local server and sending internal requests to it from the Python script:
import requests
import json

def handler(job):
    res = requests.post("http://localhost:3000", json=job["input"])
    return res.json()
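Since the handler only forwards requests, the local Node server has to be started and ready before RunPod starts sending jobs. Here's a minimal sketch of one way to do that (server.js and port 3000 are hypothetical placeholders):
import subprocess
import time

import requests
import runpod

# Start the local Node server once, when the worker boots (not inside the handler)
server_process = subprocess.Popen(["node", "server.js"])

# Wait until the server accepts connections before taking jobs
while True:
    try:
        requests.get("http://localhost:3000", timeout=1)
        break
    except requests.exceptions.RequestException:
        time.sleep(0.5)

def handler(job):
    res = requests.post("http://localhost:3000", json=job["input"])
    return res.json()

runpod.serverless.start({"handler": handler})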
Or by spawning a subprocess:
import subprocess
import json

def handler(job):
    result = subprocess.run(  # use .Popen() for streaming
        ["node", "script.js"],
        input=json.dumps(job["input"]),
        text=True,
        capture_output=True
    )
    return json.loads(result.stdout)
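And a rough sketch of the streaming variant hinted at in the comment above: with .Popen() and a generator handler you can yield partial results as they arrive (stream_results.js is a hypothetical script that prints one JSON object per line):
import subprocess
import json

def handler(job):
    proc = subprocess.Popen(
        ["node", "stream_results.js"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True
    )
    # Send the job input on stdin and close it so the script knows input is complete
    proc.stdin.write(json.dumps(job["input"]))
    proc.stdin.close()
    # Yield each JSON line as a partial result (generator handler)
    for line in proc.stdout:
        if line.strip():
            yield json.loads(line)
    proc.wait()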
Other methods like Pipes, WebSockets, PubSub or direct TCP exist, but the communication method depends on your needs.
6 replies
Created by xnerhu on 3/25/2025 in #⚡|serverless
Serverless handler on Nodejs
That's correct. You write the serverless handler in Python.
6 replies
Created by Xqua on 3/11/2025 in #⚡|serverless
Do you cache docker layers to avoid repulling ?
I would check EU-RO then. I've never experienced layer caching there. I didn't even know it was supported on RunPod until I was able to use EU-CZ and was pleasantly surprised it's there. On EU-RO this happened even with tiny, few-KB rewrites of the handler file: while Docker Hub cached the layers just fine on upload, the data centre always pulled the whole huge image again. It's definitely wasting your bandwidth there.
9 replies
Created by Anders on 3/15/2025 in #⚡|serverless
Anyone get vLLM working with reasonable response times?
I've spent a lot of time optimizing vLLM for that. But even though I'm pushing tokens/s above the official benchmarks for the model and hardware combination, there is some overhead I can't do anything about: frequent worker shifts causing cold starts, varying speeds across data centres and requests, and especially the delay time even when warm, which can be as long as the execution time itself. I think serverless is suitable for starting out or for smaller LLM projects. Go dedicated or self-host for big ones. But even then, the delay times are a few seconds for me. They should not be minutes as you say. Which region do you use?
6 replies
Created by Xqua on 3/11/2025 in #⚡|serverless
Do you cache docker layers to avoid repulling ?
As far as I know, some data centres do, and some don't. For example, on EU-RO I had to redownload the whole image every time. On EU-CZ it's pulling only the difference.
9 replies
Created by NexaS on 3/10/2025 in #⚡|serverless
Using serverless to train on a face
It is possible. But for longer-running tasks it would be cheaper to dynamically start and stop a pod.
2 replies
Created by hakankaan on 3/5/2025 in #⚡|serverless
Can't get Warm/Cold status
Yes, queue delay. Does the request count behave differently in deciding which worker to choose? I didn't think about that.
11 replies
Created by hakankaan on 3/5/2025 in #⚡|serverless
Can't get Warm/Cold status
They're prioritized but not guaranteed, right? In testing I regularly send requests to a warm worker, and after a few requests a cold start suddenly happens on a different one, even though the warm worker is still marked as idle and ready in the endpoint.
11 replies
Created by S1TH on 3/6/2025 in #⚡|serverless
How to deploy a custom model in runpod?
2 replies
Created by hakankaan on 3/5/2025 in #⚡|serverless
Can't get Warm/Cold status
You can't target specific workers. Jobs are dynamically assigned to them, and they're shifting around constantly. You're not guaranteed a warm worker even when you've recently run a job and should have one. The only solution currently is keeping as many of them warm as possible. If you want to check if the whole endpoint has a warm worker available at that exact moment before you send the job itself, you could theoretically edit the code a bit to return the worker info regardless of the state. However, I am not sure how reliable it would be due to the active nature of the job balancer.
# ------------------- #
#   RunPod Handler    #
# ------------------- #
engine = None

async def handler(job):
    global engine
    job_input = process_input(job["input"])

    # Check the warm status
    if job_input == "prewarm":
        yield {"warm": True if engine else False}
    else:  # normal request
        engine = initialize_engine()
        # ...

# ------------------ #
#   Entrypoint       #
# ------------------ #
if __name__ == "__main__":
    runpod.serverless.start({"handler": handler, "concurrency_modifier": concurrency_modifier, "return_aggregate_stream": True})
11 replies
Created by hakankaan on 3/5/2025 in #⚡|serverless
Can't get Warm/Cold status
A new feature that will automatically pre-warm your workers is in development/testing. Until it's released, I'm adding a prewarm method to my workers. If you initialize your app outside the handler (as recommended), it's very easy to add.
async def handler(job):
    job_input = process_input(job["input"])

    # Prewarm Flashboot request
    if job_input == "prewarm":
        yield {"warm": True}

if __name__ == "__main__":
    initialize_engine()  # init outside
    runpod.serverless.start({"handler": handler, "concurrency_modifier": concurrency_modifier, "return_aggregate_stream": True})
You can then send prewarm requests as needed or periodically to keep the workers warm.
"input": { "prewarm": true }
"input": { "prewarm": true }
11 replies
Created by codyman4488 on 3/4/2025 in #⚡|serverless
how to run a quantized model on server less? I'd like to run the 4/8 bit version of this model:
The author of the models states it's not required, but I believe you have to set the correct quantization for any type. The RunPod UI selection is very limited, only to AWQ, SqueezeLLM and GPTQ, while vLLM currently supports:
aqlm, awq, deepspeedfp, tpu_int8, fp8, ptpc_fp8, fbgemm_fp8, modelopt, marlin, gguf, gptq_marlin_24, gptq_marlin, awq_marlin, gptq, compressed-tensors, bitsandbytes, qqq, hqq, experts_int8, neuron_quant, ipex, quark
So, for most quantization types you have to set it yourself with the QUANTIZATION env variable. You can find all the info in the vLLM documentation and the template source repository.
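For example, a bitsandbytes checkpoint would presumably need QUANTIZATION=bitsandbytes in the endpoint's environment variables; the exact value has to match the format the model was actually quantized with.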
5 replies
Created by codyman4488 on 3/4/2025 in #⚡|serverless
how to run a quantized model on server less? I'd like to run the 4/8 bit version of this model:
vLLM does support GGUF, but it looks like the official RunPod template doesn't. You'll have to build the image yourself with quantization set to GGUF. Using a tokenizer from the unquantized model is also recommended. Keep in mind that GGUF is optimized for CPU inference. For GPU you probably want to use dynamic bitsandbytes.
5 replies
Created by rmnvc on 3/4/2025 in #⚡|serverless
Troubles with answers
Have you tried providing a list of messages instead of a prompt? The prompt, I believe, is used for standard text completion even in non-OpenAI requests, and many chat-finetuned models aren't at their best with that. Chat completion uses conversations like:
"messages": [
{"role": "system", "content": "You're helpful assistant"},
{"role": "user", "content": "How are you?"}
]
"messages": [
{"role": "system", "content": "You're helpful assistant"},
{"role": "user", "content": "How are you?"}
]
Output:
"choices": [{
"message": {
"role": "assistant",
"content": "I am feeling good, how may I assist you today?",
}
}]
"choices": [{
"message": {
"role": "assistant",
"content": "I am feeling good, how may I assist you today?",
}
}]
Standard text completion looks like:
"prompt": "How are you? I am"
"prompt": "How are you? I am"
Output:
"choices": [
{
"text": " feeling good today."
}
]
"choices": [
{
"text": " feeling good today."
}
]
You can also use the OpenAI-compatible endpoint and send requests to api.runpod.ai/v2/{RUNPOD_ENDPOINT_ID}/openai/v1/chat/completions.
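For reference, a minimal sketch of calling that route with the openai Python client (the endpoint ID, API key and model name are placeholders; the model must match what your worker serves):
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/RUNPOD_ENDPOINT_ID/openai/v1",  # placeholder endpoint ID
    api_key="RUNPOD_API_KEY"  # placeholder RunPod API key
)

response = client.chat.completions.create(
    model="your-model-name",  # whatever the vLLM worker serves
    messages=[
        {"role": "system", "content": "You're a helpful assistant"},
        {"role": "user", "content": "How are you?"}
    ]
)
print(response.choices[0].message.content)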
2 replies
Created by Aleksei Naumov on 2/23/2025 in #⚡|serverless
Keeping idle workers alive even without any requests.
It's just an unfortunate feature naming conflict. An idle worker, in the context of an endpoint, means an initialized worker without a currently active job, ready to accept one. Idle workers are not billed. Idle timeout, in the context of job execution, means how long the worker is kept running after a job finishes, with the ability to instantly execute another request. So yes, having idle workers is an expected and desired state because it means your endpoint has free workers for incoming requests. On the other hand, setting the idle timeout to 600 seconds will make your workers run for a very long time at full price. Either set it to something reasonable like 1-5 seconds, or consider using active workers, which are discounted, if you can't afford cold starts or big delay times.
5 replies
Created by zaid on 2/22/2025 in #⚡|serverless
do we get billed partially or rounded up to the second?
Delay time should not be billed as long as it's RunPod's delay (e.g., a job waiting in the queue) and not your code (cold start). It's good practice to put app initialization, such as loading AI models into VRAM, outside the RunPod serverless handler function. This is then marked as a delay in stats yet still billed as execution time.
4 replies
Created by 3WaD on 2/15/2025 in #⚡|serverless
[Solved] EU-CZ Datacenter not visible in UI
Thank you very much. It's working now. This was a small problem, probably just with my account, but you still took the time to fix it, and I appreciate it!
10 replies
Created by 3WaD on 2/15/2025 in #⚡|serverless
[Solved] EU-CZ Datacenter not visible in UI
Should I make a ticket? I thought it wouldn't be that hard to get an answer to this.
10 replies
Created by skeledrew on 2/18/2025 in #⚡|serverless
Workers stuck at initializing
So there's your error. It means it couldn't pull the image because either that repo/image/tag does not exist, or you're using a private repo and didn't set the registry credentials in the RunPod settings.
6 replies