Serverless rate limits for OpenAI chat completions
I have set up an OpenAI chat completions endpoint on Runpod serverless with access to 8 GPUs. All 8 GPUs are running and show healthy logs, but in my load tests request throughput drops sharply after approximately 500 requests, to the point that it is slower than running on a single dedicated GPU pod.
The first ~500 requests are processed at a rate in line with expectations for 8 GPUs, but then throughput immediately falls off a cliff, dropping from ~150 req/s to ~15 req/s.
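For context, the test is roughly equivalent to the sketch below (simplified; the endpoint ID, API key, model name, and the concurrency values are placeholders, not my exact harness):

```python
# Rough sketch of the kind of load test I'm running (simplified; the real
# harness differs, and <endpoint-id>, the API key, model name, and the
# TOTAL/CONCURRENCY/WINDOW values below are placeholders).
import asyncio
import time
import httpx

URL = "https://api.runpod.ai/v2/<endpoint-id>/openai/v1/completions"
HEADERS = {"Authorization": "Bearer <RUNPOD_API_KEY>"}
PAYLOAD = {"model": "<served-model-name>", "prompt": "Hello", "max_tokens": 16}

TOTAL = 2000        # total requests to send
CONCURRENCY = 64    # requests kept in flight at once
WINDOW = 100        # print a throughput sample every WINDOW completions

async def worker(client: httpx.AsyncClient, state: dict) -> None:
    while True:
        async with state["lock"]:
            if state["sent"] >= TOTAL:
                return
            state["sent"] += 1
        await client.post(URL, headers=HEADERS, json=PAYLOAD, timeout=120)
        async with state["lock"]:
            state["done"] += 1
            if state["done"] % WINDOW == 0:
                now = time.monotonic()
                print(f"{state['done']:5d} done  "
                      f"{WINDOW / (now - state['t0']):6.1f} req/s")
                state["t0"] = now

async def main() -> None:
    state = {"sent": 0, "done": 0, "t0": time.monotonic(), "lock": asyncio.Lock()}
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(worker(client, state) for _ in range(CONCURRENCY)))

asyncio.run(main())
```

The per-window rate printed here is where I see the drop.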
I saw that Runpod has rate limits for the `/run` and `/runsync` endpoints, but do these limits also apply to the other endpoints, such as the OpenAI-compatible route? My endpoint is https://api.runpod.ai/v2/<endpoint-id>/openai/v1/completions