RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


⚡|serverless

⛅|pods

What is meant by a runner?

I have created my worker template and I am configuring GitHub Actions. I'm just unsure what RUNNER_24GB is supposed to be, since creating a serverless endpoint requires a container image, but building and testing that image is the whole point of the CI/CD pipeline.

Kohya-ss on serverless

Hi there, I was wondering if anyone got Kohya setup successfully via a serverless endpoint on RunPod...

vLLM serverless throws 502 errors

I'm getting these errors out of the blue, does anyone know why? 2024-06-28 00:44:12.053 [71ncv12913w751]...

Error Handling Issue: Updating Response Status in Python’s Runpod

Hello everyone! I encountered an issue where I need to raise an error in my handler, but I've found that in the Python runpod library, errors are added to a job_output list. There's a condition that searches for the error field to update the response status to FAILED. However, since the error is nested inside an output list, that field isn't recognized and the status remains COMPLETED. This is my handler and here is where I raise the error ```python...
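A minimal sketch of the usual workaround, assuming the standard runpod Python SDK handler contract: return a dict whose top-level key is "error" (instead of nesting it inside an output list), so the SDK's status check can see it and mark the job as FAILED. The input field names below are illustrative.

```python
# Hypothetical handler sketch; the "prompt" field name is a placeholder.
import runpod


def handler(job):
    job_input = job.get("input", {})
    prompt = job_input.get("prompt")

    if prompt is None:
        # A top-level "error" key (not one buried in an output list) is
        # what the SDK looks for when deciding to report FAILED.
        return {"error": "missing required field: prompt"}

    return {"output": f"processed: {prompt}"}


runpod.serverless.start({"handler": handler})
```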

Exposing http ports on serverless

There's no way to expose HTTP ports on serverless, is there? When I'm creating a new template and flip the template type from Pod to Serverless, that option goes away.

Prevent Extra Workers from appearing

Extra workers are often spawned for multiple hours even though there is no need for them, since the load is easily handled by the normal workers. How can I prevent these from appearing? I already set max workers but it does not help. This costs so much money that I am thinking about switching providers....

Quantization method

Hello, I am trying to quantize a model and I see several libraries. Do you have any advice on which library is best? Or are they all fine and I can choose any of them?
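For reference, a hedged sketch of one common option: 4-bit quantization with bitsandbytes through transformers. The model id below is only an example; GPTQ, AWQ, or GGUF-based tooling are equally valid choices depending on the deployment target.

```python
# Sketch of 4-bit loading with bitsandbytes via transformers.
# The model id is an example, not a recommendation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # example model id
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
```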

Maximum queue size

Hi, is there a limit for maximum pending jobs in the queue with serverless endpoints or are there any other queue size limitations?

LoRA adapter on Runpod.io (using vLLM Worker)

Hi, I hope everyone is doing well. I'm reaching out to seek some insights or advice regarding an issue I'm encountering while attempting to deploy a serverless API endpoint on RunPod.io. The model in question has been adapted using a LoRA adapter, and it seems like I am stuck because of a missing configuration file. However, the nature of the model's adaptation with the LoRA adapter means that I don't have a traditional configuration file available (see screenshot please). Given the technical nature of this issue, I was hoping someone here might have encountered a similar situation or could offer guidance on how to proceed. Specifically, I'm looking for advice on how to bypass the requirement for a config file in this context, or whether there's an alternative way of supplying the necessary configuration information to satisfy the deployment process....
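One hedged workaround sketch, assuming a PEFT-style adapter whose base model is available on Hugging Face: merge the LoRA adapter into the base model so the saved result carries its own config.json, then point the vLLM worker at the merged repository. All repo ids below are placeholders.

```python
# Merge a LoRA adapter into its base model so the result has a full config.json.
# Repo ids are placeholders, not real models.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("base-org/base-model")
model = PeftModel.from_pretrained(base, "your-org/lora-adapter")
merged = model.merge_and_unload()

merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained("base-org/base-model").save_pretrained("merged-model")
```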

No config error /

Hello everyone, I have fine-tuned unsloth/Phi-3-mini-4k-instruct and then pushed my model to a Hugging Face repo. One of the problems is: OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like alsokit/eLM-mini-4B-4K-4bit-v01-merged-4bit is not the path to a directory containing a file named config.json. But I do have a config.json file (see screenshot). There are also other errors alongside that one (please see the second screenshot)...
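A hedged way to check, from the worker's environment, that the repo is reachable and actually lists config.json. The repo id is taken from the error message above; the HF_TOKEN environment variable is an assumption and only needed if the repo is private.

```python
# Quick sanity check that the repo is reachable and contains config.json.
import os
from huggingface_hub import list_repo_files

files = list_repo_files(
    "alsokit/eLM-mini-4B-4K-4bit-v01-merged-4bit",
    token=os.environ.get("HF_TOKEN"),  # only needed for private repos
)
print("config.json present:", "config.json" in files)
```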

Distributing model across multiple GPUs using vLLM

vLLM has a TENSOR_PARALLEL_SIZE parameter to distribute a model across multiple GPUs, but is this parameter supported in the serverless vLLM template? I tried setting it, but the inference time was the same for the model running on a single GPU vs. multiple GPUs.
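For comparison, a hedged sketch of how tensor parallelism is requested in plain vLLM outside the serverless template; the model id is illustrative, and on serverless the endpoint's GPUs-per-worker setting would also need to match the value.

```python
# Plain-vLLM equivalent of TENSOR_PARALLEL_SIZE=2; the model id is an example.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel_size=2)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```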

Worker does not execute

This worker "ir6fwdhb34ujm7" runs but does not execute, why?

Environment Variable in Serverless

Hello, quick question: how do I pass environment variables in a request to a serverless endpoint? Especially, how do I assign them a value dynamically while sending the request?
Solution:
Modify your handler to read the data from request payload instead of environment variables.
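A minimal sketch of that suggestion, with hypothetical field names: the values go into the request's "input" payload and the handler reads them there rather than from os.environ.

```python
# Handler that reads per-request settings from the payload instead of
# environment variables; the "settings" field name is hypothetical.
import runpod


def handler(job):
    job_input = job.get("input", {})
    settings = job_input.get("settings", {})  # e.g. {"temperature": 0.7}
    return {"received_settings": settings}


runpod.serverless.start({"handler": handler})
```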

How does the soft check on workers limit work?

I've noticed that the first soft cap is about $100, so I guess that having a balance of more than $100 will increase my workers limit. What happens if my balance drops to $90 afterwards? Will my limit be lowered? What will happen to active workers?
Solution:
The soft limit just checks your balance at the time of the upgrade; if you fall below that balance later, you will not lose access to the upgraded worker count.

Stuck in the initialization

Seems that I'm stuck in an initialization loop, e.g. ``` 2024-06-24T10:47:39Z worker is ready 2024-06-24T10:49:04Z loading container image from cache 2024-06-24T10:49:33Z The image runpod/worker-vllm:stable-cuda12.1.0 already exists, renaming the old one with ID sha256:08d4ab2735bbe3528acdd1a11322c570347bcf3b77c9779e9886e78b647818bd to empty string...
Solution:
I've cloned my endpoint and deleted the original one. The cloned one seems to work just fine.

cannot stream openai compatible response out

I have the code below for streaming the response; the generator is working but it cannot stream the response: llm = Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", n_gpu_layers=-1, n_ctx=4096,...
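A hedged sketch of the usual streaming setup with the runpod Python SDK: the handler is a generator that yields chunks, and "return_aggregate_stream" is enabled so non-streaming calls still get the collected output. The llama-cpp-python call mirrors the snippet above; the "messages" payload shape is an assumption.

```python
# Generator-handler sketch that forwards streamed chunks from llama-cpp-python.
import runpod
from llama_cpp import Llama

llm = Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", n_gpu_layers=-1, n_ctx=4096)


def handler(job):
    messages = job["input"]["messages"]  # assumed payload shape
    # stream=True yields OpenAI-style chunks; forward only the text deltas.
    for chunk in llm.create_chat_completion(messages=messages, stream=True):
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]


runpod.serverless.start({
    "handler": handler,
    "return_aggregate_stream": True,  # also return the collected stream on /run
})
```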

[URGENT] Failed to return results

Hi, I have been having issues for a few hours with one of my serverless pods. When the process ends, it fails to reach api.runpod. 2024-06-23T09:09:05.462788318Z {"requestId": "sync-53542990-e57d-4f02-acb4-988800d2cd1a-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/2ylrt71iu9oxpi/job-done/wy06bwgvghwp50/sync-53542990-e57d-4f02-acb4-988800d2cd1a-u1?gpu=NVIDIA+RTX+A4000&isStream=false", "level": "ERROR"}...

Is there an equivalent of flash boot for CPU-only serverless?

I was trying to figure out if there is a way to have a CPU job only fire up when it is needed, so it does not accrue charges when idle (like flash boot for GPU serverless). Thanks!

Why is only 1 GPU available?

I want to run my pod with at least 2 GPUs. My pod is an A5000. Now only 2 GPUs are available. What happened?...
Solution:
@Robbie if you created the pod you can't edit the number of GPUs; you would need to make a new one with the correct amount.

Faster-Whisper worker template is not fully up-to-date

Hi, we're using the Faster-Whisper worker (https://github.com/runpod-workers/worker-faster_whisper) on Serverless. I saw that Faster-Whisper itself is currently on version 1.0.2, whereas the RunPod template is still on 0.10.0. A few changes have been introduced in Faster-Whisper since then (it now uses CUDA 12) that we would like to benefit from, especially the language_detection_threshold setting, since it seems like most of our transcriptions of people with a British accent are being transcribed into Welsh (with a language detection confidence of around 0.51 to 0.55), which could be circumvented by increasing the threshold....
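For reference, a hedged sketch against upstream faster-whisper (>= 1.0), using the language_detection_threshold parameter the post mentions; forcing language="en" is an alternative workaround. The file name and threshold value are illustrative.

```python
# Upstream faster-whisper usage; the threshold and file name are examples.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Raise the detection threshold (the setting the post asks for) ...
segments, info = model.transcribe("audio.wav", language_detection_threshold=0.6)

# ... or sidestep detection entirely by forcing the language.
segments, info = model.transcribe("audio.wav", language="en")
print(info.language, info.language_probability)
```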