Reusing containers from the GitHub integration registry
Embeddings endpoints
Delay time even when there are many workers available
RunPod serverless for ComfyUI with custom nodes
How to deploy ModelsLab/Uncensored-llama3.1-nemotron?
Almost no 48GB Workers available in the EU
GitHub integration: "exporting to oci image format" takes forever.
vLLM worker OpenAI stream
ChatCompletion(id=None, choices=None, created=None, model=None, object='error', service_tier=None, system_fingerprint=None, usage=None, code=400, message='As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.', param=None, type='BadRequestError')
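The 400 above means the model's tokenizer does not define a chat template, which transformers v4.44+ no longer papers over with a default. A minimal sketch of how one might check for and attach a template before serving; the model IDs are placeholders, not anything from this thread:

```python
from transformers import AutoTokenizer

# Placeholder model IDs used only for illustration.
tok = AutoTokenizer.from_pretrained("my-org/my-base-model")

# If this prints None, the tokenizer defines no chat template, and the
# OpenAI-style chat/completions route will reject requests with the 400 above.
print(tok.chat_template)

# One workaround: borrow a template from a chat-tuned tokenizer of the same
# model family and save a local copy that does define one.
donor = AutoTokenizer.from_pretrained("my-org/my-chat-model")
tok.chat_template = donor.chat_template
tok.save_pretrained("./model-with-chat-template")
```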
Trying to work with llama3-70b-8192 and getting out of memory
llama3-70b-8192
but I can't deploy my serverless endpoint because it runs out of memory.
I have attached a screenshot of my config. Please recommend other settings to make it work.
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU
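For this kind of OOM, the usual vLLM levers are context length, GPU memory utilization, tensor parallelism, and quantization. A minimal sketch with the vLLM Python API; the values are illustrative assumptions, not the poster's settings, and on the serverless vLLM worker they are normally set through endpoint environment variables rather than Python code:

```python
from vllm import LLM

# Illustrative settings only; pick values to match the GPUs on the endpoint.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=2,        # split the 70B weights across multiple GPUs
    max_model_len=8192,            # cap context length to shrink the KV cache
    gpu_memory_utilization=0.90,   # leave headroom for activation spikes
    # quantization="awq",          # or serve an AWQ/GPTQ checkpoint instead
)
```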
Thanks...
Increase serverless worker count after 30
Consistently timing out after 90 seconds
Upload files to network storage
Serverless problems since 10.12
Git LFS on GitHub integration
Using RunPod serverless for an HF 72B Qwen model --> seeking help
Docker Image EXTREMELY Slow to load on endpoint but blazing locally
Constantly getting "Failed to return job results."
Why are my serverless endpoint requests waiting in the queue when there are free workers?
GitHub integration
Is vLLM Automatic Prefix Caching enabled by default?
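Whether it is on by default depends on the vLLM version pinned in the worker image; in the versions where it is off, it can be turned on explicitly. A minimal sketch using the vLLM Python API, with a placeholder model:

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching is vLLM's switch for automatic prefix caching
# in versions where it is not already on by default.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

# Prompts sharing a long common prefix are where the cache pays off,
# since the shared KV blocks are reused across requests.
outputs = llm.generate(
    ["Long shared system prompt. Question A?",
     "Long shared system prompt. Question B?"],
    SamplingParams(max_tokens=32),
)
```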