One request = one worker
How can I configure my endpoint so that one request is equal to one worker, and one worker does not complete more than one request within a certain timeframe?
My workload is bursty and requires all of the workers to be available at once. However, my endpoint does not do that and takes a long time to start all the workers I need. In addition, workers are sometimes reused instead of a new instance being created, which I do not want.
I think if you disable Flashboot then after each request it will be a cold boot. But are you sure you want to do that? It will dramatically increase the time a job stays IN_QUEUE.
Only way I know to guarantee that you have X number of workers up ready to respond is to make them active.
I don't want it to be a cold boot
Is it possible to have x workers active for a two minute period and then turn it off?
I tried with active workers but it takes a long time to turn all the workers on
I need 50 workers to be available within 10 seconds
You can adjust your active and max workers... just click the hamburger icon on the endpoint and then click edit and you can adjust.
Have you tried using runpodctl or graphQL to provision your workers? Might have more control / speed with scripting rather than manual.
Beyond scripting, I think your boot up time will depend mainly on the size of your docker image.
Let me look into that, I didn't know serverless worked with the GraphQL API too
I couldn't find any info on doing it programmatically
Only serverless endpoints
serverless endpoint = worker
but my problem is the serverless endpoint puts requests into a queue
and doesn't always start a new worker
I think they will all have to be active
Do you have the ability to use 50 workers? I'm currently limited to 35.
I have requested further worker limit increases
sweet!
I mean I already have a limit of more than 50 workers
can't figure out how to utilize all of them at the same time is the problem
if you want all of them starting without putting in jobs, the easiest way is to change active workers to 50 and then reduce them after a few minutes; you can do this using our GraphQL API
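For anyone wanting to script that, here is a minimal sketch of the approach in Python. The mutation and field names (saveEndpoint, workersMin, workersMax) and the query-param auth are assumptions from memory and should be checked against the GraphQL docs before use:

```python
import time
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"      # placeholder
ENDPOINT_ID = "YOUR_ENDPOINT_ID"     # placeholder
GRAPHQL_URL = f"https://api.runpod.io/graphql?api_key={API_KEY}"

# Assumption: saveEndpoint accepts workersMin (active workers) on an existing endpoint.
MUTATION = """
mutation SetActiveWorkers($input: EndpointInput!) {
  saveEndpoint(input: $input) { id workersMin workersMax }
}
"""

def set_active_workers(count: int) -> None:
    """Set the number of always-on (active) workers on the endpoint."""
    resp = requests.post(
        GRAPHQL_URL,
        json={"query": MUTATION,
              "variables": {"input": {"id": ENDPOINT_ID, "workersMin": count}}},
        timeout=30,
    )
    resp.raise_for_status()

# Spin 50 active workers up for the burst, then scale back down.
set_active_workers(50)
time.sleep(120)   # the two-minute window mentioned above
set_active_workers(0)
```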
why don't you want the worker to take the next job if the worker is free?
so my job has two components: a preparation phase (my program needs to build kernel modules for the specific machine, which takes about 20-30s) and a deadline, after which the workers just shut down because they're no longer needed
I noticed that the way it currently works, once one worker finishes the preparation, it ends up taking all the requests in the queue
which instantly error out because the deadline has already passed and those jobs are done
and this kills the other workers that are still in the middle of building their kernel modules, because there's nothing left in the queue waiting for them
i also noticed this behavior:
where workers are started sequentially rather than in parallel
scale type is already set to a request count of 1, because I want one request to create a new worker
how can I make my workers start up in parallel?
I'm limited to about 50 workers starting within 20 seconds. Because my workload is bursty, I want to have 75+ workers start up asynchronously within 10-15 seconds
https://docs.runpod.io/serverless/workers/handlers/handler-concurrency
Would this work for you? It means a single worker would handle multiple jobs. I'm not sure how heavy each job is, though. Also, which GPU are you using right now?
Concurrent Handlers | RunPod Documentation
RunPod's concurrency functionality enables efficient task handling through asynchronous requests, allowing a single worker to manage multiple tasks concurrently. The concurrency_modifier configures the worker's concurrency level to optimize resource consumption and performance.
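For reference, a minimal sketch of what the linked concurrency_modifier setup looks like with the RunPod Python SDK; the handler body and the fixed concurrency of 4 are illustrative, not from this thread:

```python
import runpod

async def handler(job):
    # Illustrative: do the actual work for one job here.
    return {"output": job["input"]}

def adjust_concurrency(current_concurrency: int) -> int:
    # Return how many jobs this worker may run at once.
    # A real implementation could raise or lower this based on load.
    return 4

# concurrency_modifier lets a single worker pull multiple jobs concurrently.
runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": adjust_concurrency,
})
```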
I am using 4090s. This would not work; by the time a worker is done with one job, the window to do the next would have already passed
Is there some sort of super high priority flag for a job request that I can send for it to bypass the queue and instantly spin up the worker?
https://docs.runpod.io/serverless/endpoints/send-requests#execution-policies
You can set a job to low priority, not sure if this helps in your case
Send a request | RunPod Documentation
Learn how to construct a JSON request body to send to your custom endpoint, including optional inputs for webhooks, execution policies, and S3-compatible storage, to optimize job execution and resource management.
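For context, an execution policy goes in the request body alongside the input. A rough sketch follows; the field names (executionTimeout, ttl) and millisecond units are as I recall them from the linked page and should be double-checked, and the endpoint ID and input are placeholders:

```python
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"      # placeholder
ENDPOINT_ID = "YOUR_ENDPOINT_ID"     # placeholder

payload = {
    "input": {"prompt": "example"},      # your normal job input
    "policy": {
        "executionTimeout": 60_000,      # kill the job if it runs longer than 60s
        "ttl": 120_000,                  # drop the job if it sits in queue past 2 min
    },
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
print(resp.json())
```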
that's the opposite of what I want to do
so you're using request count to scale? and when you need to scale, you submit 50-100 requests very quickly and need the scaling to catch up fast? I'll review this logic for request count scaling
They want to spin up 50 workers as quickly as possible to process 50 requests and then shut them all down. They want to ensure that each worker processes only 1 request, so that all 50 run in parallel.
I think if you used the right scaling type on the endpoint and no concurrency settings on the handler it should do what you want, and maybe RunPod's scaling is a bit slow (not sure)
I did not make any adjustments to the concurrency settings. I have my workers scaling at a request count of 1
yes this is right, what I'm thinking is that request count scaling would mean that if I have 50 jobs in the queue then it would spin up 50 workers at once
but from what I observed, the current scaling, even with a request count scale of 1, takes time to start up
for 50 workers, it takes about 30 seconds
So the scale up is slow?
yea the scale up is slow
I want all 50 workers to be started up at the same time, so that 5 seconds after my batch of requests I have 50 workers ready
instead of 1 new worker every few seconds
is there a way to do it?
Yeah let's wait for staff here
I plan to review the scaling algo and will get back to you
You could call that algo batch_mode LOL š
thank you :TOCheer:
looks like the scaling algo for request count scale is serial, workers are started one after the other; I'll plan optimizations for this in the coming week
our queue delay scale method is async, you can try that out and set the queue delay scale time to the lowest possible, 1 second, which will spin up much faster; please provide some feedback if you decide to test that out @1AndOnlyPika
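If you want to flip that setting from a script rather than the console, it would presumably be the same saveEndpoint mutation sketched earlier; the scalerType/scalerValue field names and the "QUEUE_DELAY" value are my guesses and need to be verified against the schema:

```python
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"      # placeholder
ENDPOINT_ID = "YOUR_ENDPOINT_ID"     # placeholder
GRAPHQL_URL = f"https://api.runpod.io/graphql?api_key={API_KEY}"

# Assumption: scalerType/scalerValue select the scaling strategy;
# "QUEUE_DELAY" with a value of 1 would mean "scale up if a job waits > 1 second".
MUTATION = """
mutation SetScaler($input: EndpointInput!) {
  saveEndpoint(input: $input) { id scalerType scalerValue }
}
"""

requests.post(
    GRAPHQL_URL,
    json={"query": MUTATION,
          "variables": {"input": {"id": ENDPOINT_ID,
                                  "scalerType": "QUEUE_DELAY",
                                  "scalerValue": 1}}},
    timeout=30,
).raise_for_status()
```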
this odd situation happened:
there were requests sitting in the queue, but no workers being spun up to handle them
the first few workers do start up much faster, but it never reaches 1 request = 1 worker
well it does but not evenly, the times end up looking like this
can you pm me endpoint id
sent thanks
yay looks like the fix was pushed out, thanks runpod!
Nice, how fast is it now?
Getting all the workers I need started at once in < 2s if they're ready
Ooh
That's great! Are you using GraphQL?
Eh, GraphQL? What for?
For starting his 50 workers simultaneously
Oh, I guess not but let's see
yep, pushed the update an hour ago, it's for serverless using request count scale
no just the regular /run endpoint
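For completeness, a sketch of firing a burst of jobs at the plain /run endpoint in parallel, which is roughly the pattern described in this thread; the endpoint ID, input payload, and batch size of 50 are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"      # placeholder
ENDPOINT_ID = "YOUR_ENDPOINT_ID"     # placeholder
RUN_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def submit(i: int) -> str:
    """Submit one job and return its queued job id."""
    resp = requests.post(RUN_URL, headers=HEADERS,
                         json={"input": {"task_id": i}}, timeout=30)
    resp.raise_for_status()
    return resp.json()["id"]

# Fire 50 requests at once so the endpoint sees the whole burst immediately.
with ThreadPoolExecutor(max_workers=50) as pool:
    job_ids = list(pool.map(submit, range(50)))

print(job_ids)
```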