n8tzto Comments - Answer Overflow

n8tzto

•Created by luckedup. on 7/12/2024 in #⚡｜serverless

Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2...

I hope this issue gets fixed ASAP because it causes production jobs to get stuck indefinitely. Even worse, the stuck jobs might continue to drain credits.

30 replies

RunPod

•Created by luckedup. on 7/12/2024 in #⚡｜serverless

Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2...

Yes, it only happens sometimes, not consistently. It seems like the internal webhook connection on RunPod isn't stable.

30 replies

RunPod

•Created by luckedup. on 7/12/2024 in #⚡｜serverless

Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2...

The problem still exists; it just occurred again.

30 replies

RunPod

•Created by luckedup. on 7/12/2024 in #⚡｜serverless

Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2...

I'm experiencing the same issue. When the error "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/{endpoint-id}/job-done/..." occurs, the job remains stuck in IN_PROGRESS indefinitely, even though the log shows that the job has completed. In addition, RunPod sends no webhook. It appears that RunPod fails to mark the job as completed because this internal HTTP request fails. Some people have suggested that a large payload returned from the handler might cause the issue. However, in my case the output is only a few KB, since it is just a JSON object containing a URL to the output file.
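As a client-side guard against jobs that never leave IN_PROGRESS, something like the following watchdog could be used. This is only a sketch under assumptions: `wait_for_job`, `fetch_status`, and the set of terminal status names are hypothetical (only IN_PROGRESS is confirmed in this thread), and `fetch_status` stands in for whatever call you use to read the job's status.

```python
import time
from typing import Callable, Optional

# Assumption: these names are illustrative terminal states, not confirmed by the thread.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def wait_for_job(
    fetch_status: Callable[[], str],
    max_wait_s: float,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll until the job reaches a terminal state or max_wait_s elapses.

    Returns the terminal status, or None if the job still looked unfinished
    (for example stuck in IN_PROGRESS) when the deadline passed.
    """
    deadline = clock() + max_wait_s
    while True:
        status = fetch_status()  # hypothetical: one status lookup per call
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            return None  # caller decides whether to cancel or resubmit
        sleep(poll_interval_s)
```

A `None` result means the job outlived the budget you gave it, so the caller can cancel or resubmit instead of waiting for a webhook that may never arrive.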

30 replies

RunPod

•Created by n8tzto on 3/14/2024 in #⚡｜serverless

Unstable Internet Connection in the Workers

Never mind. After using RunPod for over 4 months and spending over a thousand dollars on it, our company has decided to drop RunPod completely and switch to another platform. This decision was made due to RunPod's frequent instability and lack of timely and adequate support. During our time with RunPod, we encountered numerous issues, including a significant "throttle disaster" two weeks ago, problems with webhooks, network issues, and more. These incidents have caused us financial losses, with some customers becoming upset and leaving. We can no longer tolerate these challenges. We appreciate the creation of RunPod, and it can still be useful for testing, development, or personal purposes. 🙏 However, it is not suitable for use in production environments.

10 replies

RunPod

•Created by ashleyk on 3/18/2024 in #⚡｜serverless

High execution time, high amount of failed jobs

Any idea what is causing the failures and delays?

5 replies

RunPod

•Created by n8tzto on 3/14/2024 in #⚡｜serverless

Unstable Internet Connection in the Workers

We've transitioned to Cloudflare R2 and are using its auto region, but the problem persists: downloads from its public HTTP URL and uploads via boto are excessively slow, with frequent packet loss. Note that this issue is intermittent. Some tasks are processed quickly and without any problems. However, we've noticed it happening more often over the past few days.

10 replies

RunPod

•Created by n8tzto on 3/14/2024 in #⚡｜serverless

Unstable Internet Connection in the Workers

Things have gotten worse lately. Our jobs are failing or taking way too long, which means our credits are burning and our clients are getting frustrated. Any updates on this?

10 replies

RunPod

•Created by ribbit on 3/14/2024 in #⚡｜serverless

Knowing Which Machine The Endpoint Used

I've thought of a way to get the GPU info in the handler function using Python code. You can use either the torch library or the GPUtil library to get the GPU info. References: - https://stackoverflow.com/questions/76581229/is-it-possible-to-check-if-gpu-is-available-without-using-deep-learning-packages - https://stackoverflow.com/questions/64776822/how-do-i-list-all-currently-available-gpus-with-pytorch
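A minimal sketch of that idea, assuming either `torch` or `GPUtil` may be installed in the worker image. The names `detect_gpus` and `handler`, and the shape of the returned dict, are hypothetical and only for illustration; the real handler would do its actual work where the placeholder is.

```python
from typing import Dict, List


def detect_gpus() -> List[Dict[str, object]]:
    """Return one dict per visible GPU, or an empty list if none can be detected."""
    # First choice: PyTorch, if it is installed and can see a CUDA device.
    try:
        import torch

        if torch.cuda.is_available():
            return [
                {
                    "index": i,
                    "name": torch.cuda.get_device_name(i),
                    "memory_mb": torch.cuda.get_device_properties(i).total_memory // (1024 * 1024),
                }
                for i in range(torch.cuda.device_count())
            ]
    except Exception:
        pass  # torch missing or unusable: try the next option

    # Fallback: GPUtil, which queries nvidia-smi under the hood.
    try:
        import GPUtil

        return [
            {"index": gpu.id, "name": gpu.name, "memory_mb": int(gpu.memoryTotal)}
            for gpu in GPUtil.getGPUs()
        ]
    except Exception:
        pass  # GPUtil missing or unusable

    return []


def handler(event):
    """Hypothetical handler: returns the detected GPUs alongside the job output."""
    result = {"echo": event.get("input")}  # placeholder for the real work
    result["gpus"] = detect_gpus()
    return result
```

Returning the GPU list in the job output means it shows up wherever the result is read, so each job records which GPU type served it.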

10 replies

RunPod

•Created by ribbit on 3/14/2024 in #⚡｜serverless

Knowing Which Machine The Endpoint Used

I believe what @gpu poor wants to know is whether it's possible to figure out which type of GPU a specific serverless job is using, especially when different GPU types are configured for the endpoint. To my understanding, there is currently no way to find out programmatically which type of GPU a job is using. When you check the result through the /status/{job_id} endpoint or the webhook, no GPU information is provided. Even the event parameter passed to the runpod handler function doesn't include GPU details. Right now, the only way to know the GPU type is to check the RunPod web UI and see which worker is handling the job.

10 replies

RunPod

•Created by n8tzto on 3/14/2024 in #⚡｜serverless

Unstable Internet Connection in the Workers

I'm not exactly sure what time the issue occurred, but I observed significant instability in the connection just recently (on March 14, 2024, between approximately 2pm and 4pm UTC). While it's currently showing signs of improvement compared to earlier, the speed is still notably slow.

10 replies

RunPod

•Created by n8tzto on 3/14/2024 in #⚡｜serverless

Unstable Internet Connection in the Workers

Endpoint ID: 1pzws8rhbpku7g

10 replies

RunPod

•Created by ssssteven on 2/26/2024 in #⚡｜serverless

Failed to get job. | Error Type: ClientConnectorError

I have also encountered these errors. In recent days, endpoints have occasionally had network connection problems inside the serverless workers. This impacts several steps within running jobs, such as downloading files from URLs, uploading files to S3, and sending HTTP update requests, causing them to fail or become extremely slow.
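One way to soften intermittent failures like these inside a job is to retry the network step with exponential backoff. The sketch below is a generic illustration, not anything RunPod provides: `retry_with_backoff` and its parameters are hypothetical names, and `operation` stands for whatever download, upload, or HTTP call the job makes.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on network-style errors with growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OSError:
            # OSError covers the built-in ConnectionError and TimeoutError.
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Jitter keeps many workers from retrying in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
    raise ValueError("max_attempts must be at least 1")
```

Retries only help with transient errors. They cannot fix a connection that stays slow, so pairing them with explicit timeouts on each call is usually still necessary.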

14 replies
