Is there a programmatic way to activate servers for high-demand / peak-hour loads?
We are testing serverless for a production deployment next month.
I want to make sure we will have server capacity during peak hours.
We'll have some active servers, but we need to guarantee capacity for certain peak hours. Is there a way to programmatically activate the servers?
There is, let me find it..
https://github.com/ashleykleynhans/runpod-api/blob/main/serverless/update_min_workers.py
I've never gotten around to automating it, but I have manually tested that setting minimum workers does seem to give you some sort of stronger priority in their system.
So my thought was always to potentially use this wrapper around their GraphQL endpoint to programmatically toggle the minimum active workers. It isn't "instantly" on, but from my anecdotal testing in the past it's still probably in the 1-2 minute range for workers to go from a throttled state to an Active state and stay there. Maybe faster from idle > active.
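Something like this is the idea (a minimal sketch only; the `saveEndpoint` mutation shape below is an assumption on my part, so take the exact query and fields from the linked update_min_workers.py):

```python
import os

import requests

# Sketch: raise workersMin before a peak window, lower it afterwards.
# The saveEndpoint mutation shape is an assumption -- check the linked
# update_min_workers.py for the exact query/fields RunPod expects.
RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]
GRAPHQL_URL = f"https://api.runpod.io/graphql?api_key={RUNPOD_API_KEY}"


def set_min_workers(endpoint_id: str, min_workers: int) -> dict:
    query = f"""
    mutation {{
        saveEndpoint(input: {{ id: "{endpoint_id}", workersMin: {min_workers} }}) {{
            id
            workersMin
        }}
    }}
    """
    resp = requests.post(GRAPHQL_URL, json={"query": query}, timeout=30)
    resp.raise_for_status()
    return resp.json()


# e.g. run from a cron job shortly before peak hours, and again after to scale back down:
# set_min_workers("your-endpoint-id", 3)
```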
Yes, I need to avoid any throttling because the demand will be huge, but these are short tasks, ~15 s each.
@AC_pill If you are underutilizing the GPU on these short tasks,
you can add concurrency, if you're not already, btw.
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py
Here is an example from my OpenLLM handler. I have it set to 1 by default, but you can play with 2-3 to see if you get memory bottlenecked.
https://docs.runpod.io/serverless/workers/handlers/handler-concurrency
Here is their documentation on it.
Honestly, it wasn't until just 2-3 weeks ago that I really delved into concurrency in RunPod, because the docs didn't exist, but after pestering the staff xD they were able to help me out on it~ and they got the doc pushed.
https://discord.com/channels/912829806415085598/1200525738449846342
Here's the original thread on that if you're curious haha
But if you don't have concurrency already, it would allow a single worker to handle multiple jobs at a time. That means if you're not fully utilizing the GPU you can increase the concurrency, or even on a baseline GPU, if you know you can safely run some number of parallel jobs, it might be good.
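Roughly the shape from that doc (a sketch; the handler body and the MAX_CONCURRENCY value are stand-ins, not anything tuned for your model):

```python
import asyncio

import runpod

MAX_CONCURRENCY = 3  # tune against GPU memory (the 2-3 range mentioned above)


async def handler(job):
    # Must be async so a single worker can interleave several ~15 s jobs.
    prompt = job["input"].get("prompt", "")
    await asyncio.sleep(0)  # stand-in for your actual model call
    return {"echo": prompt}


def concurrency_modifier(current_concurrency: int) -> int:
    # Called by the SDK to decide how many jobs this worker may run at once.
    # Static here, but it could react to GPU memory or queue depth.
    return MAX_CONCURRENCY


runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```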
Thanks for the advice, I'll need to check. It's TurboXL, so I need to check the memory usage.
Dang, it's great to learn this exists, haha, sounds good.
Yeah, but it's a heavy GPU consumer with the new models. I'm pretty sure there will be memory leaks, but that can be a second line of research.
@justin [Not Staff] do you know if we can pull tasks from the Serverless queue?
No, unless you want to write your own circumvention logic or something
Not possible as far as I can tell
You could potentially hijack a worker at the end of a job, before it returns, to check some circumvention queue / cache, complete the extra work, and write the result back out.
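A sketch of that idea: the worker finishes its assigned job, then drains a few more tasks from your own queue before returning (Redis here is just a stand-in for whatever you host, and the key names are made up):

```python
import json
import os

import redis
import runpod

# Stand-in for infra you host yourself (e.g. an Upstash Redis instance);
# the REDIS_URL env var and key names are invented for this sketch.
r = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"))


def process(task: dict) -> dict:
    return {"done": task}  # placeholder for the real model call


def handler(job):
    result = process(job["input"])

    # "Hijack" the worker before it returns: drain a few extra tasks from
    # your own queue and persist those results yourself.
    for _ in range(3):
        raw = r.lpop("extra_tasks")
        if raw is None:
            break
        r.rpush("extra_results", json.dumps(process(json.loads(raw))))

    return result


runpod.serverless.start({"handler": handler})
```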
So probably I'll need to wrap tasks together in the same run (say, 4 tasks) for 1 queue slot.
Ah yeah, or do the concurrency stuff
and set it to 4
unless those jobs are specifically grouped together for other logical reasons
Yeah batching jobs together, concurrency, or a circumvention infrastructure
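On the batching option, the handler side could look something like this (a sketch; the `tasks` input key is just an assumed request shape):

```python
import runpod


def process(task: dict) -> dict:
    return {"done": task}  # placeholder for the real per-task work


def handler(job):
    # The client packs several tasks into one request, e.g.
    # {"input": {"tasks": [{...}, {...}, {...}, {...}]}},
    # so one queue slot / one worker invocation covers all four.
    tasks = job["input"].get("tasks", [job["input"]])
    return {"results": [process(t) for t in tasks]}


runpod.serverless.start({"handler": handler})
```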
I saw that handler script; the issue is my workflow network is complex and changes a lot, so it would be hard to let the handler do the work.
If it were the opposite and we could pull the JSON tasks ourselves, the async task handler would perform the best.
Thanks for the reply, that might help in the future.
Yeah, the complexity is higher, but what I tested before was sending empty requests to RunPod to spin up workers,
but having the worker find my own distributed queue on Upstash
to actually do the job-pulling logic,
and then I write the answer to PlanetScale lol.
But it's a bit of a crazy workaround, only worth it if you need such fine-grained control and want to host your own stuff.
To be honest, that can give you really fine-grained control, because then you can arbitrarily return a value, ending the process and controlling when you want the worker to "terminate"; but all the surrounding infrastructure is hosted by you.
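Roughly what that pattern looks like as a sketch (Redis stands in for the Upstash queue, and results go back into Redis here instead of PlanetScale, just to keep it self-contained):

```python
import json
import os
import time

import redis
import runpod

# Your own infra (sketch): a Redis-compatible queue such as Upstash holds the
# pending work; results would go to whatever store you run (PlanetScale in my
# case), but they go back into Redis here so the example stands alone.
r = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"))


def process(task: dict) -> dict:
    return {"done": task}  # placeholder for the real model call


def handler(job):
    # The RunPod request itself is basically empty -- it only exists to spin
    # up a worker. The real work comes from your own distributed queue.
    handled = 0
    deadline = time.time() + 60  # give up after a minute so the worker can scale down
    while time.time() < deadline:
        raw = r.lpop("pending_jobs")
        if raw is None:
            break
        r.rpush("finished_jobs", json.dumps(process(json.loads(raw))))
        handled += 1
    # Returning ends the job, so you decide when the worker "terminates".
    return {"handled": handled}


runpod.serverless.start({"handler": handler})
```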
Yeah, I need to be pragmatic with the complexity here. The team is small and mostly devoted to frontend, so backend maintenance will lag if it goes through the roof.
Makes sense~ gl!
You sound like you've got a cool project / business going.
Yes, it is, very AI-driven like 99% of apps now.
but it's cool
I'll post news on how it's moving
If it works, it could be a good case for RunPod.
@justin [Not Staff] yeap, not yet, memory leaks using ONNX models in concurrency: [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running FusedConv node. Name:'Conv_455' Status Message: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 2: out of memory ; GPU=0 ; hostname=c67b8afabaf8 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size);
And memory is only 70% full with 3 instances
In case you are using it too.
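One thing that might be worth trying for that OOM is capping each ONNX Runtime session's CUDA arena so concurrent sessions don't grow into each other's headroom (a sketch; the 6 GiB limit and model path are placeholders, not values from this thread):

```python
import onnxruntime as ort

# Sketch: bound how much CUDA memory each session's arena may take, so three
# concurrent sessions can't fight over the same headroom.
providers = [
    ("CUDAExecutionProvider", {
        "gpu_mem_limit": 6 * 1024 * 1024 * 1024,      # bytes; placeholder value
        "arena_extend_strategy": "kSameAsRequested",  # grow only as much as requested
    }),
    "CPUExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)
```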