RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

How does the soft check on the worker limit work?

I've noticed that the first soft cap is about $100, so I guess that having a balance of more than $100 will increase my worker limit. What happens if my balance drops to $90 afterwards? Will my limit be lowered? What will happen to active workers?
Solution:
The soft limit only checks your balance at the time of the upgrade; if you fall below that balance later, you will not lose access to the upgraded worker count.

Stuck in the initialization

Seems that I'm stuck in an initialization loop, e.g.:
```
2024-06-24T10:47:39Z worker is ready
2024-06-24T10:49:04Z loading container image from cache
2024-06-24T10:49:33Z The image runpod/worker-vllm:stable-cuda12.1.0 already exists, renaming the old one with ID sha256:08d4ab2735bbe3528acdd1a11322c570347bcf3b77c9779e9886e78b647818bd to empty string...
```
Solution:
I've cloned my endpoint and deleted the original one. The cloned one seems to work just fine.

Cannot stream OpenAI-compatible response out

I have the below code for streaming the response; the generator is working, but it cannot stream the response:
```
llm = Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", n_gpu_layers=-1, n_ctx=4096, ...
```
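
A minimal sketch of one way this is usually wired up on RunPod, for comparison: the handler must itself be a generator (with `return_aggregate_stream` enabled), and the client then reads from the endpoint's /stream/{job_id} route. Model path and parameters are taken from the question; the input shape and prompt handling are assumptions.
```
# Hedged sketch, not the poster's full code: a generator handler lets the
# RunPod SDK emit partial output, which clients read via /stream/{job_id}.
import runpod
from llama_cpp import Llama

llm = Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", n_gpu_layers=-1, n_ctx=4096)

def handler(job):
    prompt = job["input"]["prompt"]  # assumed input shape
    # stream=True makes llama-cpp return an iterator of completion chunks
    for chunk in llm.create_completion(prompt, max_tokens=512, stream=True):
        yield chunk["choices"][0]["text"]  # each yield becomes one stream event

# return_aggregate_stream also exposes the concatenated text on the sync routes
runpod.serverless.start({"handler": handler, "return_aggregate_stream": True})
```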

[URGENT] Failed to return results

Hi, I am having issues for a few hours with one of my serverless pods. When the process ends, it fails to reach api.runpod:
```
2024-06-23T09:09:05.462788318Z {"requestId": "sync-53542990-e57d-4f02-acb4-988800d2cd1a-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/2ylrt71iu9oxpi/job-done/wy06bwgvghwp50/sync-53542990-e57d-4f02-acb4-988800d2cd1a-u1?gpu=NVIDIA+RTX+A4000&isStream=false", "level": "ERROR"}
```
...

Is there an equivalent of flash boot for CPU-only serverless?

I was trying to figure out if there is a way to have a CPU job fire up only when it is needed, so it would not accrue charges when idle (like FlashBoot for GPU serverless). Thanks!

Why is the number of available GPUs only 1?

I want to run my pod with at least 2 GPUs. My pod is an A5000. Now the available GPUs are only 2. What happened?...
Solution:
@Robbie if you created the pod, you can't edit the number of GPUs; you would need to make a new one with the correct amount.

Faster-Whisper worker template is not fully up-to-date

Hi, we're using the Faster-Whisper worker (https://github.com/runpod-workers/worker-faster_whisper) on Serverless. I saw that Faster-Whisper itself is currently on version 1.0.2, whereas the RunPod template is still on 0.10.0. A few changes have been introduced in Faster-Whisper since then (it now uses CUDA 12) that we would like to benefit from, especially the language_detection_threshold setting: most of our transcriptions of speakers with a British accent are being transcribed into Welsh (with a language detection confidence of around 0.51 to 0.55), which could be circumvented by increasing the threshold....
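
For reference, a minimal sketch of the setting in question when calling Faster-Whisper 1.0.x directly; the model size, audio file, and 0.7 threshold are illustrative, and pinning `language="en"` would sidestep detection entirely:
```
# Hedged sketch against faster-whisper >= 1.0: detections below the
# threshold are not trusted (cf. the ~0.51-0.55 Welsh confidences above).
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "audio.mp3",
    language_detection_threshold=0.7,  # illustrative value
)
print(info.language, info.language_probability)
```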

Slow IO speeds on serverless

An A6000 always-active worker takes twice as long to run my code as a normal A6000 pod; I think it is I/O speed. How can I see I/O speeds?
Solution:
It looks like the method I was using for seeking had really high I/O. Changing to another method sped up serverless a lot, but not necessarily a ton on a pod. This leads me to believe that serverless I/O is just slow.
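
A rough way to see the numbers yourself; a minimal sketch (path, file size, and chunk size are arbitrary) that can be run on both a pod and a worker for comparison:
```
# Hedged sketch: time a 1 GiB sequential write to estimate disk throughput.
import os, time

path = "/tmp/io_test.bin"
chunk = os.urandom(1 << 20)      # 1 MiB buffer
total = 1 << 30                  # 1 GiB total

t0 = time.time()
with open(path, "wb") as f:
    for _ in range(total // len(chunk)):
        f.write(chunk)
    f.flush()
    os.fsync(f.fileno())         # force data to disk so the timing is honest
print(f"write: {total / (time.time() - t0) / 1e6:.0f} MB/s")
os.remove(path)
```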

How to download models for Stable Diffusion XL on serverless?

1) I created a new network storage of 26 GB for various models I'm interested in trying.
2) I created a Stable Diffusion XL endpoint on serverless, but couldn't attach the network storage.
3) After the deployment succeeded, I clicked on edit endpoint and attached that network storage to it. So far so good, I believe. But how exactly do I download various SDXL models into my network storage, so that I could use them via Postman?...
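
One hedged approach, not official RunPod guidance: attach the network volume to a temporary pod, where it mounts at /workspace, and download the weights there; the same volume then shows up at /runpod-volume inside serverless workers. The repo ID and filename below are illustrative.
```
# Hedged sketch: run this on a temporary pod with the network volume attached.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="stabilityai/stable-diffusion-xl-base-1.0",
    filename="sd_xl_base_1.0.safetensors",
    local_dir="/workspace/models",  # /workspace is the network volume on pods
)
```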

0% GPU utilization and 100% CPU utilization on Faster Whisper quick deploy endpoint

I used the "Quick Deploy" option to deploy a Faster Whisper custom endpoint (https://github.com/runpod-workers/worker-faster_whisper). Then, I called the endpoint to transcribe a 1 hour long podcast by using the following parameters: ``` { 'input': { 'audio': 'https://www.podtrac.com/pts/redirect.mp3/pdst.fm/e/traffic.megaphone.fm/ISOSO6446456065.mp3?updated=1715037715',...

Loading models from network volume cache is taking too long.

Hello all, I'm loading my model like the following so that I can use the cache from my network volume:
```
model = AutoModel.from_pretrained(...
```
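
For context, a minimal sketch of that pattern, assuming a serverless worker where the network volume mounts at /runpod-volume (the model name is a placeholder). Worth noting: network volumes are network-attached storage, so large sequential reads from them are typically much slower than from container-local disk, which may be the whole story here.
```
# Hedged sketch: point the Hugging Face cache at the network volume so
# downloaded weights are reused across workers.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2",  # placeholder model
    cache_dir="/runpod-volume/hf-cache",       # persists across cold starts
)
```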

Are webhooks fired from Digital Ocean?

I set up a WAF in AWS to block bots, and I am getting a bunch of requests to my RunPod Serverless webhook blocked by AWS#AWSManagedRulesBotControlRuleSet#SignalKnownBotDataCenter. The IP addresses in these requests seem to belong to a Digital Ocean data center. I have temporarily disabled the WAF on my ALB for my RunPod webhooks, but I'm hoping that someone can confirm whether these are legitimate requests or not, because I was under the impression that RunPod uses AWS and not Digital Ocean.

best architecture opinion

Hello, I would like to build an app that, out of 1 prompt specified by a user, creates 10 prompts, then calls a model once for each of these 10 prompts, giving me 10 responses, and then does a final call to aggregate the 10 responses into one final response that is returned to the user. My question is the following: do you have any advice on how to build this? Option a) send the user prompt to the serverless endpoint and, within the endpoint, create the 10 prompts, call the model sequentially, and then call it one last time to aggregate the result, all of that in 1 call from the user to the serverless endpoint...
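
A minimal sketch of what option a) could look like, with `expand`, `call_model`, and `aggregate` as hypothetical stand-ins for the poster's prompt logic:
```
# Hedged sketch of option a): one serverless request fans out to 10 prompts
# and aggregates. The three helpers are hypothetical placeholders.
import runpod

def expand(prompt):                 # hypothetical: derive 10 sub-prompts
    return [f"{prompt} (angle {i})" for i in range(10)]

def call_model(prompt):             # hypothetical: one model call per prompt
    raise NotImplementedError

def aggregate(prompt, answers):     # hypothetical: final combining call
    raise NotImplementedError

def handler(job):
    user_prompt = job["input"]["prompt"]
    prompts = expand(user_prompt)
    answers = [call_model(p) for p in prompts]  # sequential; could run concurrently
    return {"output": aggregate(user_prompt, answers)}

runpod.serverless.start({"handler": handler})
```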

Cancelling a job resets FlashBoot

For some reason, whenever we cancel a job, the next time the serverless worker cold boots it doesn't use FlashBoot and instead reloads the LLM model weights into the GPU from scratch. Any idea why cancelling jobs might be causing this problem? Is there maybe a more graceful solution for stopping jobs early than the /cancel/{job_id} endpoint?

RUNPOD_API_KEY and MAX_CONTEXT_LEN_TO_CAPTURE

We are also starting a vLLM project, and I have two questions: 1) In the environment variables, do I have to define RUNPOD_API_KEY with my own secret key to access the final vLLM OpenAI endpoint? 2) Isn't MAX_CONTEXT_LEN_TO_CAPTURE now deprecated? Do we still need to provide it if MAX_MODEL_LEN is already set? ...
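
On question 1, for what it's worth: in RunPod's documented pattern the RunPod API key is passed by the client as the bearer token, not set as a worker environment variable. A hedged sketch, where the endpoint ID and model name are placeholders:
```
# Hedged sketch of calling the worker's OpenAI-compatible route with the
# openai client; <RUNPOD_API_KEY>, <ENDPOINT_ID>, <MODEL_NAME> are placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="<RUNPOD_API_KEY>",
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)
resp = client.chat.completions.create(
    model="<MODEL_NAME>",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```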

Do I need to allocate extra container space for Flashboot?

I'm planning to use a Llama 3 model that takes about 40 GB of space. I believe FlashBoot takes a snapshot of the worker and keeps it on disk to load within seconds when the worker becomes active. Do I need to allocate enough space on the container for this? In this case, since I'm planning to select a 48 GB vRAM GPU, do I need to allocate 40 GB (model) + 48 GB (snapshot) + 5 GB (extra) = 93 GB of container space?
Thanks...

When serverless is used, does the machine reboot if it is executed consecutively? Currently seeing issues

When serverless is used, does the machine reboot if it is executed consecutively? Currently seeing issues with the last execution affecting the next.

unusual usage

Hello! We got billed weirdly this past weekend...

Slow I/O

Hey, I am trying to download a 7 GB file and run an ffmpeg process to extract the audio from that file (it's a video). Locally it takes around 5 minutes on average, but when I try it on the cloud (I chose a CPU, general purpose, since a GPU doesn't seem to give any advantage here), the I/O is SUPER SLOW. Is there anything I can do to speed up the disk I/O?...
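
One hedged workaround, assuming the source is reachable over HTTP: let ffmpeg read the URL directly and stream-copy the audio track, so the 7 GB video never has to land on the slow disk. The URL and output name below are illustrative.
```
# Hedged sketch: ffmpeg reads HTTP inputs directly; -vn drops the video
# stream and "-acodec copy" avoids re-encoding, so only audio is written.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-i", "https://example.com/input.mp4",  # placeholder source URL
        "-vn",                                  # drop the video stream
        "-acodec", "copy",                      # copy audio as-is
        "audio.m4a",
    ],
    check=True,
)
```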