RunPod • 12mo ago
AC_pill

Is there a programmatic way to activate servers for high demand / peak-hour load?

We are testing serverless for a production deployment next month. I want to ensure we will have server time during peak hours. We'll have some active servers, but we need to guarantee capacity for certain peak hours. Is there a way to programmatically activate the servers?
16 Replies
justin • 12mo ago
There is, let me find it.
justin • 12mo ago
https://github.com/ashleykleynhans/runpod-api/blob/main/serverless/update_min_workers.py I've never gotten around to it, but I have manually tested that setting minimum workers does seem to give you some sort of stronger priority in their system.
justin • 12mo ago
So my thought was always to use this wrapper around their GraphQL endpoint to programmatically toggle the minimum active workers. It isn't "instantly" on: from my anecdotal testing in the past, it takes roughly 1-2 minutes for workers to go from a throttled state to an Active state and stay there. It may be faster from idle to active.
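The toggle described above could be sketched roughly as follows. This is a minimal sketch, not the linked script: the `saveEndpoint` mutation, its `workersMin` field, and the `api_key` query parameter are assumptions to verify against `update_min_workers.py`, and the peak windows, worker counts, and warm-up margin are hypothetical values.

```python
import json
import urllib.request
from datetime import datetime, time, timedelta

# Assumption: RunPod GraphQL endpoint; check it against the linked script.
GRAPHQL_URL = "https://api.runpod.io/graphql"

# Hypothetical schedule (UTC) and worker counts; tune to your own traffic.
PEAK_WINDOWS = [(time(8, 0), time(11, 0)), (time(17, 0), time(21, 0))]
PEAK_MIN_WORKERS = 5
OFF_PEAK_MIN_WORKERS = 1
WARMUP_MINUTES = 5  # raise the floor early, since activation is not instant


def in_peak(t: time) -> bool:
    """True if clock time `t` falls inside any peak window."""
    return any(start <= t < end for start, end in PEAK_WINDOWS)


def desired_min_workers(now: datetime) -> int:
    """Return the minimum worker count wanted at `now`, including warm-up."""
    soon = now + timedelta(minutes=WARMUP_MINUTES)
    if in_peak(now.time()) or in_peak(soon.time()):
        return PEAK_MIN_WORKERS
    return OFF_PEAK_MIN_WORKERS


def build_payload(endpoint: dict, workers_min: int) -> dict:
    """Build the GraphQL request body (mutation and field names are assumed)."""
    fields = {**endpoint, "workersMin": workers_min}
    args = ", ".join(f"{key}: {json.dumps(value)}" for key, value in fields.items())
    query = f"mutation {{ saveEndpoint(input: {{ {args} }}) {{ id workersMin }} }}"
    return {"query": query}


def send(payload: dict, api_key: str) -> dict:
    """POST the payload; passing `data` makes urllib issue a POST request."""
    request = urllib.request.Request(
        f"{GRAPHQL_URL}?api_key={api_key}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Run from a cron job every few minutes, something like `send(build_payload(endpoint_fields, desired_min_workers(datetime.utcnow())), api_key)` would keep the floor in step with the schedule, where `endpoint_fields` holds whatever existing endpoint settings the mutation requires.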
AC_pill OP • 12mo ago
Yes, I need to avoid any throttling because the demand will be huge, but for short tasks (~15s).
justin • 12mo ago
@AC_pill If you are underutilizing the GPU on these short tasks, you can add concurrency, if you aren't already 🙂
By the way, here is an example from my OpenLLM worker: https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py I have it set to 1 by default, but you can play with 2-3 to see whether you get memory-bottlenecked.
Here is their documentation on it: https://docs.runpod.io/serverless/workers/handlers/handler-concurrency
Honestly, it wasn't until just 2-3 weeks ago that I really delved into concurrency in RunPod, because the docs didn't exist. After I pestered the staff xD they were able to help me out, and they got the doc published. Here is the original thread on that, if you're curious, haha: https://discord.com/channels/912829806415085598/1200525738449846342
If you don't have concurrency already, it allows a single worker to handle multiple jobs at a time. That means if you're not fully utilizing the GPU, you can increase the concurrency. Even on a baseline GPU, if you know you can safely run a certain number of parallel jobs, it might be good 🙂
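A minimal sketch of a concurrent handler, assuming the `concurrency_modifier` option described on the linked docs page. `MAX_CONCURRENCY` and the sleep that stands in for inference are hypothetical placeholders, and `main()` is left uncalled so the sketch loads without the RunPod SDK installed.

```python
import asyncio

# Hypothetical cap; start at 1 and raise it while watching GPU memory.
MAX_CONCURRENCY = 3


async def handler(job: dict) -> dict:
    """Async handler, so one worker can interleave several jobs."""
    prompt = job["input"].get("prompt", "")
    await asyncio.sleep(0.01)  # stand-in for the real inference call
    return {"output": prompt.upper()}


def concurrency_modifier(current_concurrency: int) -> int:
    """Tell the worker how many jobs it may run at once (fixed cap here)."""
    return MAX_CONCURRENCY


def main() -> None:
    """Worker entrypoint; imports the SDK lazily."""
    import runpod

    runpod.serverless.start(
        {"handler": handler, "concurrency_modifier": concurrency_modifier}
    )
```

In a real worker file, `main()` would be called at the bottom. A fixed cap is the simplest choice; the modifier could instead return a lower number when GPU memory is tight.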
AC_pill OP • 12mo ago
Thanks for the advice. I'll need to check; it's TurboXL, so I need to check the memory usage.
justin • 12mo ago
Dang, this is great to learn this exists, haha. Sounds good 🙂
AC_pill OP • 12mo ago
Yeah, but this is a heavy GPU consumer with the new models. I'm pretty sure there will be memory leaks, but that can be researched later.
@justin [Not Staff] do you know if we can pull tasks from the Serverless queue?
justin • 12mo ago
No, not as far as I can tell, unless you want to write your own circumvention logic or something. You could potentially hijack a worker at the end of a job: before it returns, have it check some circumvention queue / cache, complete that job, and write the result back out.
AC_pill OP • 12mo ago
So I'll probably need to wrap tasks together in the same run (say 4 tasks) as 1 queue entry.
justin • 12mo ago
Ah yeah, or do the concurrency stuff and set it to 4, unless those jobs are specifically grouped together for other logical reasons. So yeah: batching jobs together, concurrency, or a circumvention infrastructure.
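The batching option could look roughly like this: one queued request carries several tasks and the handler loops over them. The `tasks` input key and the `run_task` function are hypothetical, not part of any RunPod schema.

```python
from typing import Callable


def run_task(task: dict) -> dict:
    """Hypothetical single-task function; replace with the real workload."""
    return {"id": task["id"], "result": task["value"] * 2}


def batch_handler(job: dict, run: Callable[[dict], dict] = run_task) -> dict:
    """Process every task in one request, isolating per-task failures."""
    results, errors = [], []
    for task in job["input"]["tasks"]:
        try:
            results.append(run(task))
        except Exception as exc:  # one bad task should not sink the batch
            errors.append({"id": task.get("id"), "error": str(exc)})
    return {"results": results, "errors": errors}
```

The trade-off is that a batch finishes only when its slowest task does, so this fits best when the grouped tasks have similar run times.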
AC_pill OP • 12mo ago
I saw that handler script. The issue is that my workflow network is complex and changes a lot, so it would be hard to let the handler do the work. If it were the opposite and we could pull the JSON tasks, the async task handler would perform best. Thanks for the reply, that might help in the future.
justin • 12mo ago
Yeah, the complexity is higher. What I tested before was sending empty requests to RunPod to spin up workers, and having the worker pull the actual jobs from my own distributed queue on Upstash and then write the answer to PlanetScale, lol. It's a bit of a crazy workaround, only worth it if you need such fine-grained control and want to host your own stuff. To be honest, it can give you really fine-grained control, because you can return a value at an arbitrary point, ending the process and controlling when you want the worker to "terminate"; but all the surrounding infrastructure is hosted by you.
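A rough sketch of that pull pattern, with an in-memory `queue.Queue` and a plain dict standing in for the hosted queue and database. `make_pull_handler`, the time budget, and the task shape are all hypothetical; a real version would swap in clients for your own queue and store.

```python
import queue
import time
from typing import Callable


def make_pull_handler(
    jobs: queue.Queue,
    results: dict,
    process: Callable[[dict], object],
    max_seconds: float = 60.0,
) -> Callable[[dict], dict]:
    """Build a handler that ignores its (empty) request and drains `jobs`."""

    def handler(job: dict) -> dict:
        deadline = time.monotonic() + max_seconds
        processed = 0
        while time.monotonic() < deadline:
            try:
                task = jobs.get_nowait()  # pull from the self-hosted queue
            except queue.Empty:
                break  # nothing left: return so the worker can wind down
            results[task["id"]] = process(task)  # write the answer back out
            processed += 1
        return {"processed": processed}

    return handler
```

Returning as soon as the queue is empty, or when the time budget runs out, is what gives the caller control over when the worker stops.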
AC_pill OP • 12mo ago
Yeah, I need to be pragmatic about the complexity here. The team is small and mostly devoted to frontend, so backend maintenance will lag if the complexity goes through the roof.
justin • 12mo ago
Makes sense, gl! 🙂 👁️ You sound like you've got a cool project / business going.
AC_pill OP • 12mo ago
Yes, it is, very AI-driven like 99% of apps now 🙂 but it's cool. I'll post news on how it's going; if it works, it could be a good case for RunPod.
@justin [Not Staff] Yep, not yet. Memory leaks using ONNX models with concurrency:
[E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running FusedConv node. Name:'Conv_455' Status Message:
/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void]
/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void]
CUDA failure 2: out of memory ; GPU=0 ; hostname=c67b8afabaf8 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size);
And memory is only 70% full with 3 instances. In case you are using it too.
