hanging after 500 concurrent requests
Hi, I loaded llama 8b in serverless with 1 active worker (A100) and 1 idle worker. I wanted to benchmark how many requests I can handle at the same time so I can go to production. But when I send 500 requests at the same time, the server just hangs and I don't get an error. What could be the issue? How do I know how much load 1 GPU can handle, and how do I optimize it for max concurrency?
3 Replies
@Alpay Ariyak any idea?
Hangs? What's it like?
No output, just running?
Yeah, could you please expand on how it hangs?
Also, Max Job Concurrency is 300 by default; you can change it with the `MAX_CONCURRENCY` env var.
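One way to avoid the hang while benchmarking is to cap in-flight requests client-side below the worker's concurrency limit. A minimal sketch using an `asyncio.Semaphore` (the `fake_send` stub is a stand-in for your real HTTP call to the endpoint; swap in an actual request function):

```python
import asyncio

MAX_IN_FLIGHT = 300  # keep at or below the worker's MAX_CONCURRENCY

async def bounded_benchmark(n_requests, limit, send_fn):
    """Fire n_requests total, but keep at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def one(i):
        async with sem:  # blocks when `limit` requests are already in flight
            return await send_fn(i)

    return await asyncio.gather(*(one(i) for i in range(n_requests)))

# Stub standing in for a real request; tracks peak concurrency for illustration.
active = 0
peak = 0

async def fake_send(i):
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0)  # yield to the event loop, simulating network latency
    active -= 1
    return i

results = asyncio.run(bounded_benchmark(500, MAX_IN_FLIGHT, fake_send))
print(f"completed={len(results)} peak_in_flight={peak}")
```

Raising `limit` in steps (50, 100, 200, ...) while watching latency and errors gives a rough idea of how much load one GPU handles before saturating.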