jojje
RunPod
•Created by blistick on 1/5/2024 in #⚡|serverless
What does "throttled" mean?
I think it's the word choice of "throttled" that causes the confusion: RunPod is hijacking an established term, whose de facto meaning is "mitigation of a user-induced policy violation", to instead mean "pending" or "queued", i.e. waiting for the requisite (resource) conditions before executing a task. Had they used either of the latter terms, I expect there wouldn't be nearly as many questions about serverless throttling.
16 replies
What does "throttled" mean?
@justin [Not Staff] Are the "workers" bound to a specific data center (region)?
If not, then I don't see why adding more workers would help, since the supply of the requested GPUs wouldn't change one iota: they'd all just be throttled as well, for the same reason the initial ones were. But if a worker is pinned to a specific colo, then it would make sense, as the resource horizon would be limited to that single colo.
Do you know which of these holds true for workers? (Is the data center pinned at worker creation, or does pinning only happen once a resource match has been found?)
16 replies
Deploy custom private docker image
3. Finally, at RunPod, go to Account > Settings > Container Registry Auth and add the credential.
4. When launching pods, select/attach that named credential (dockerhub-read in the example above); otherwise the pod won't actually use it when pulling the image.
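For reference, the Container Registry Auth entry might look something like this (field labels are my assumption from the dashboard; for Docker Hub, use a read-only access token rather than your account password):

```
Name:     dockerhub-read
Username: <your Docker Hub username>
Password: <read-only access token>
```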
17 replies
My pod is taking forever to download the image
Thanks. Did they say why?
I'm asking because it's faster than dockerhub for me in the US.
Edit: I followed your suggestion @rahul and pushed a copy to dockerhub. The download from there was almost instant. Thanks for this tidbit!
So it seems there is some serious issue between GitHub and RunPod, and it would be good to learn why.
But my problem is at least resolved for now. Thx again.
9 replies
pods just keep stopping without any reason why when downloading?
Seems most plausible. Given the 20 Gbps speed, that file was probably already cached in RunPod's in-datacenter reverse proxy.
Had it not been cached there and instead been fetched directly from Hugging Face, it could have been an unlucky draw on the network connection to the HF CDN, in which case one just has to redo the download to get a "healthy" link. But in this case the "luck of the draw" explanation seems far-fetched, since the in-DC cache shouldn't have that problem; at least I've never encountered one there.
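That "redo the download" step is easy to automate with a small retry loop. A minimal sketch (the helper name and curl flags are mine, nothing RunPod- or HF-specific):

```shell
#!/bin/sh
# fetch_with_retry URL OUTFILE [TRIES]: re-attempt a download on failure, so an
# unlucky connection draw just costs one more try instead of a failed job.
fetch_with_retry() {
  url=$1; out=$2; tries=${3:-5}
  i=1
  while [ "$i" -le "$tries" ]; do
    curl -fsSL -o "$out" "$url" && return 0    # -f: treat HTTP errors as failures
    echo "attempt $i of $tries failed; retrying with a fresh connection" >&2
    i=$((i + 1))
  done
  return 1
}
```

Each iteration opens a fresh connection, which is exactly the "new draw" you want when the previous link was unhealthy.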
6 replies
My pod is taking forever to download the image
Same issue here. My images are hosted on ghcr.io (GitHub's registry). This has been a problem in the RO region for the past few days. In US regions with H100s etc., the transfer is blazingly quick (~300-400 MB/s).
But in RO it's slow as molasses: I was getting around 2-3 MB/s on a 2000-Ada. I switched to a 4000-Ada hoping that machine had better connectivity, but zero difference.
At this rate it'll take forever to download the 14 GB image (with all the ML libs needed for the workload); the NVIDIA CUDA layers alone are >10 GB.
Could someone from RunPod please shed some light on the docker layer-pull issue in general?
In particular for my case: is the network link/path between RO and GitHub limited in some way? If so, which EU regions would you recommend instead?
PS. I'm using "secure cloud" if that wasn't self-evident.
9 replies
Runpod API documentation
Great. Your AI helpbot revealed the GraphQL API is documented here: https://graphql-spec.runpod.io/ (for anyone else looking)
So that means the question can be narrowed to just the REST API.
4 replies
any way to control the restart policy of pods?
Haven't found one. So, to avoid crash-loops, I wrap all my containers in an init script that execs into a "wait" process and launches the actual work in sub-processes. That way I can see any errors in the logs, and debug and fix whatever is broken, without the frigging container vanishing in a puff of smoke a second after an error happens.
The always-restart policy is only really useful for stable production workloads, not for R&D or experimental setups, which is all I'm using RunPod for.
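A minimal sketch of that wrapper idea (paths and the workload command are placeholders, not anything RunPod-specific):

```shell
#!/bin/sh
# Entrypoint sketch: run the real work in a sub-process so a crash doesn't kill
# PID 1 (and with it the container) before the logs can be inspected.
LOG=/tmp/workload.log

run_supervised() {
  "$@" >"$LOG" 2>&1 &                            # launch the actual work in the background
  if wait $!; then status=0; else status=$?; fi  # block until it exits, success or crash
  echo "workload exited with status $status; logs kept at $LOG"
}

# The real entrypoint would then keep PID 1 alive for debugging, e.g.:
#   run_supervised python /app/main.py
#   exec sleep infinity
```

Since PID 1 never exits on a workload crash, the restart policy never kicks in and you can shell into the pod and read the logs at leisure.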
7 replies