How should the architecture be set up for a serverless LLM deployment? (Please give me a minute to explain myself.)
We have been looking into LLM hosting services and autoscaling so we can meet demand, but our main concern is the authentication architecture design.
We have some authentication-related thoughts already (below), but I need a more granular understanding of the standard practices for deploying LLMs commercially, for real customers to use.
The basic setup
Based on my understanding, there are the following layers:
- The application on the user's device (sends the request)
- A dedicated authentication server verifies the user (API key, bearer token, etc.) and enforces rate limits
- Our HTTP server takes that request, processes the data, and forwards it to the LLM server (RunPod serverless)
- RunPod returns the generated output, and the HTTP server post-processes it and sends it back to the user (rough sketch of these middle layers below)
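For concreteness, here is a minimal sketch of what the HTTP layer in front of RunPod could look like, assuming a FastAPI server and RunPod's serverless `runsync` endpoint. The `/generate` route, `verify_api_key()`, and the environment variable names are my own placeholders, not anything you have described, and you should confirm the exact RunPod URL against their current docs.

```python
# Sketch: API-key check + forward to a RunPod serverless endpoint.
import os
import httpx
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]      # your RunPod account key
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]     # your serverless endpoint ID
# Check RunPod's docs for the exact URL; this is the documented sync-run form.
RUNPOD_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

def verify_api_key(key: str) -> bool:
    # Placeholder: in practice, look the key up via your auth server / database.
    return key in {"demo-key-123"}

@app.post("/generate")
async def generate(payload: dict, x_api_key: str = Header(...)):
    # 1. Reject requests that don't carry a valid customer API key.
    if not verify_api_key(x_api_key):
        raise HTTPException(status_code=401, detail="Invalid API key")

    # 2. Forward the prompt to the RunPod serverless endpoint.
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(
            RUNPOD_URL,
            headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
            json={"input": payload},
        )
    resp.raise_for_status()

    # 3. Post-process the generation before returning it to the user.
    return {"output": resp.json().get("output")}
```

The key point of this layout is that the RunPod key never leaves your backend; customer devices only ever see their own per-customer API key.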
We want to:
- Make sure no unauthorized device can access our LLM API
- Track each user's remaining quota and only allow them a limited number of requests (one common pattern is sketched below)
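For the quota side, one common pattern is a per-key counter in Redis that expires when the billing window resets. This is a sketch under my own assumptions; `MONTHLY_LIMIT`, the `quota:` key prefix, and `check_quota()` are illustrative names, not an existing API.

```python
# Sketch: per-user quota tracking with a Redis counter.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

MONTHLY_LIMIT = 1000            # example allowance per API key
WINDOW_SECONDS = 30 * 24 * 3600  # example reset window

def check_quota(api_key: str) -> bool:
    """Increment the caller's usage counter and report whether they are
    still under their limit; the counter expires when the window resets."""
    key = f"quota:{api_key}"
    used = r.incr(key)
    if used == 1:
        # First request in this window: start the expiry clock.
        r.expire(key, WINDOW_SECONDS)
    return used <= MONTHLY_LIMIT
```

The same counter doubles as usage reporting per customer, and a second short-lived counter (e.g. keyed per minute) can serve as the rate limiter your auth server enforces.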
Please guide. Thank you.