RunPod
•Created by Charixfox on 5/21/2024 in #⚡|serverless
Speed up cold start on large models
I'm trying to do some intermittent testing on a 70B LLM, but any time a vLLM worker does a cold start, it downloads the model from HF. That takes about 19 minutes, so costs add up and requests made to the API time out and fail. Once the model is loaded, things are fine, with inference running in 12-15 seconds. Is there any good solution for working with a model this large without keeping a worker active constantly, which would defeat the whole purpose of running it on serverless for very intermittent testing?
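For reference, the download that happens on every cold start is roughly equivalent to the sketch below. The model ID and paths are placeholders, and the persistent network volume mounted at /runpod-volume is an assumption about how the weights could be cached between cold starts, not how the worker is set up today.

```python
# Sketch only: pre-download the weights once to a persistent network volume so
# a later cold start can reuse them instead of pulling ~140 GB from HF again.
# MODEL_ID and the /runpod-volume path are placeholders, not my real config.
from huggingface_hub import snapshot_download

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder 70B model

snapshot_download(
    repo_id=MODEL_ID,
    local_dir="/runpod-volume/models/llama-3-70b",
)
```

If something like that works, I'd presumably point vLLM at the local path (e.g. `--model /runpod-volume/models/llama-3-70b`) so a cold start only has to load weights into GPU memory, but I don't know whether that's the intended pattern for the serverless vLLM worker.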
17 replies