Serverless rate limits for OpenAI chat completions
I have set up an OpenAI chat completions endpoint on Runpod serverless with access to 8 GPUs. All 8 GPUs are running and show healthy logs, but in my load tests request throughput drops sharply after approximately 500 requests, to the point that it is slower than running on a single dedicated GPU pod.
The first ~500 requests are processed at a rate in line with expectations for 8 GPUs, but then throughput immediately falls off a cliff, dropping from ~150 req/s to ~15 req/s.
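For context, the test is roughly equivalent to the sketch below (simplified; the endpoint ID, API key, model name, and the concurrency values are placeholders, not my exact harness):

```python
# Rough sketch of the kind of load test I'm running (simplified; the real
# harness differs, and <endpoint-id>, the API key, model name, and the
# TOTAL/CONCURRENCY/WINDOW values below are placeholders).
import asyncio
import time
import httpx

URL = "https://api.runpod.ai/v2/<endpoint-id>/openai/v1/completions"
HEADERS = {"Authorization": "Bearer <RUNPOD_API_KEY>"}
PAYLOAD = {"model": "<served-model-name>", "prompt": "Hello", "max_tokens": 16}

TOTAL = 2000        # total requests to send
CONCURRENCY = 64    # requests kept in flight at once
WINDOW = 100        # print a throughput sample every WINDOW completions

async def worker(client: httpx.AsyncClient, state: dict) -> None:
    while True:
        async with state["lock"]:
            if state["sent"] >= TOTAL:
                return
            state["sent"] += 1
        await client.post(URL, headers=HEADERS, json=PAYLOAD, timeout=120)
        async with state["lock"]:
            state["done"] += 1
            if state["done"] % WINDOW == 0:
                now = time.monotonic()
                print(f"{state['done']:5d} done  "
                      f"{WINDOW / (now - state['t0']):6.1f} req/s")
                state["t0"] = now

async def main() -> None:
    state = {"sent": 0, "done": 0, "t0": time.monotonic(), "lock": asyncio.Lock()}
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(worker(client, state) for _ in range(CONCURRENCY)))

asyncio.run(main())
```

The per-window rate printed here is where I see the drop.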
I saw that Runpod has rate limits for the `/run` and `/runsync` endpoints, but do these limits also apply to the other endpoints, such as the OpenAI-compatible route? My endpoint is https://api.runpod.ai/v2/<endpoint-id>/openai/v1/completions