Simon Posts - Answer Overflow

Simon

•Created by Simon on 1/31/2025 in #⚡｜serverless

Serverless worker keeps failing

We run several serverless workers in parallel for inference. Occasionally a worker starts failing with an out-of-memory (OOM) error, and every subsequent run on that worker fails until the worker is terminated. We have noticed that retries initiated by our backend always land on the same worker: if we have 10 prompts and run one prompt per worker, a retry with a given prompt always ends up on the worker that handled it before. If a random worker were picked each time, this wouldn't be a problem, because a retry would eventually succeed; since it's always the same worker, all the retries fail. How is the target worker selected? Is there a hash on the input, or on the webhook? Can we add some random data to the input so that a different worker is selected each time?
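For illustration, the "random data in the input" idea from the question could be sketched as below. This only shows how a backend might attach a fresh value to each retry; whether worker selection depends on the input at all is exactly what the question asks, so this is not a confirmed fix. The helper name `with_retry_nonce`, the `retry_nonce` key, and the `{"input": ...}` request wrapper are assumptions made for this sketch, and the worker's handler would need to ignore the extra key.

```python
import copy
import uuid


def with_retry_nonce(job_input, key="retry_nonce"):
    """Return a request body whose input carries a fresh random value.

    Hypothetical helper: the key name and the {"input": ...} wrapper are
    assumptions for this sketch, not a documented requirement.
    """
    payload = copy.deepcopy(job_input)  # leave the caller's dict untouched
    payload[key] = uuid.uuid4().hex     # new random value on every call
    return {"input": payload}


# Two attempts for the same prompt now differ in their input payload.
first_attempt = with_retry_nonce({"prompt": "a red bicycle"})
retry_attempt = with_retry_nonce({"prompt": "a red bicycle"})
```

If the retries still reach the same worker with differing payloads, the selection is presumably not derived from the input.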

8 replies

RunPod

•Created by Simon on 12/17/2024 in #⛅｜pods-clusters

Creating a Pod with dockerArgs and a docker image from a registry that requires auth

I'm trying to create a pod either from a template or from a Docker image hosted in a registry that requires authentication, using the podFindAndDeployOnDemand method. If I specify a templateId, the pod starts, but the dockerArgs I pass in the API call seem to be ignored and the CMD from the Dockerfile runs instead. If I omit the templateId and specify a Docker image instead, I don't know how to provide the registry credentials: passing containerRegistryAuthId gives me a GraphQL error. I have already created the registry auth with the saveRegistryAuth API.

2 replies
