Deploy BART on serverless
Hi!
Does anyone know how to deploy bart-large-mnli on serverless? I've been trying with the Hugging Face template (ghcr.io/huggingface/text-generation-inference) but I always get "Error: ShardCannotStart".
I already tried setting NUM_SHARD=1 in the env, but it's still failing.
Repo for reference: https://huggingface.co/facebook/bart-large-mnli
Let me know if you need further details!
Thank you 🙂
Why do you want to use that template? Just create your own handler and copy the code from Hugging Face into the handler function.
That Docker image is probably meant for a pod; it won't work on serverless without a RunPod serverless handler.
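Something like this, roughly (untested sketch, assuming the standard transformers zero-shot pipeline and the RunPod Python SDK in your image):

```python
# handler.py - minimal sketch of a RunPod serverless handler for bart-large-mnli
# (assumes runpod and transformers are installed in the worker image)
import runpod
from transformers import pipeline

# Load the model once at worker startup, not per request
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=0,  # run on the GPU
)

def handler(job):
    job_input = job["input"]
    # Expecting e.g. {"text": "...", "labels": ["sports", "politics"]}
    return classifier(
        job_input["text"],
        candidate_labels=job_input["labels"],
    )

runpod.serverless.start({"handler": handler})
```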
Ok, thank you! And do you think it can still handle concurrency? That's my main concern.
For example, if I write my own handler and send 10 requests simultaneously, do I have to wait for the previous one to complete? (Independently of the GPU I'm using, of course.)
Don't know much about how the transformers library works, but serverless can handle multiple concurrent requests depending on your max worker count.
You don't have to wait for the previous request to complete if you have multiple workers and a decent scaling policy configured on your endpoint
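If you ever need a single worker to take several jobs at once, I believe the Python SDK also has a concurrency_modifier option for async handlers, something along these lines (untested sketch, not specific to BART):

```python
# untested sketch: one worker handling multiple jobs concurrently,
# assuming the runpod SDK's concurrency_modifier option for async handlers
import runpod

async def handler(job):
    # the model call would go here; it needs to be async-friendly
    # for requests to actually overlap on one worker
    return {"echo": job["input"]}

def concurrency_modifier(current_concurrency):
    # allow up to 4 jobs in flight on this worker
    return 4

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```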
Ok, I'll try it. I ask because, for example, with the vLLM template I can handle more than one request at a time even with 1 worker, without scaling it.
But I will try deploying it with the custom handler.
thank you again!
Yeah, the vLLM engine can handle that, which is why the vLLM worker can do it too.
I see someone actually logged an issue for vllm to add support for it:
https://github.com/vllm-project/vllm/issues/5985