n8tzto
RunPod
•Created by n8tzto on 3/14/2024 in #⚡|serverless
Unstable Internet Connection in the Workers
Recently, I've noticed that several serverless jobs have a very unstable internet connection, with extremely slow download and upload speeds.
The instability causes connection errors on HTTP requests and packet loss. The slow connection also significantly delays downloads from and uploads to S3, even for asset files of only a few MB, which consumes excessive credits.
In some cases, connection errors or timeouts prevent the generated output files from uploading, so the job fails. This is particularly frustrating because credits were already spent generating the output.
The issue is intermittent: it happens occasionally, on some jobs only.
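One possible stopgap for the failed uploads is to retry them with exponential backoff, so a brief connection drop does not immediately fail a job whose output has already been generated. The following is a minimal sketch, not a RunPod or S3 API: `retry_with_backoff` and its parameters are hypothetical names, and `operation` stands in for whatever upload call the handler makes. It assumes that call raises `ConnectionError` or `TimeoutError` on network failure; adjust `retry_on` if your client raises something else.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    All names here are illustrative; operation is any zero-argument callable,
    for example a lambda wrapping the upload of one output file.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Retries still cost worker time, so this limits lost output rather than fixing the slow connection itself.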
Is anyone else experiencing this issue, or is it just me?
10 replies
RunPod
•Created by n8tzto on 1/19/2024 in #⚡|serverless
Intermittent Slow Performance Issue with GPU Workers
I am encountering an intermittent issue where some GPU workers perform significantly slower than others. I measured the time taken for a specific task on one type of GPU worker (4090 24GB). Typically, when I send an identical payload to the endpoint, the execution time is around 1 minute. Occasionally, however, a worker becomes exceptionally slow: even with the same payload, Docker image, tag, and GPU type, the execution time extends to a few hours. Notably, during these occurrences, GPU utilization stays at 0%.
The output log shows that inference is unusually slow whenever the affected worker handles the job. Has anyone experienced a similar problem, and if so, how did you resolve it?
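To catch a slow worker early instead of paying for hours of runtime, one option is to time each step of the job and abort when a step runs far beyond its usual duration. This is only a sketch under assumptions: `run_steps`, `SlowWorkerError`, `expected_step_seconds`, and `slow_factor` are hypothetical names, and it assumes the job can be split into steps with a roughly known baseline time. It does not query GPU utilization.

```python
import time


class SlowWorkerError(RuntimeError):
    """Raised when a step runs far slower than the expected baseline."""


def run_steps(steps, expected_step_seconds, slow_factor=5.0, clock=time.perf_counter):
    """Run callables in order; abort if one step is much slower than expected.

    steps is a list of zero-argument callables (hypothetical job stages).
    The check happens after each step finishes, so a single step that hangs
    is not interrupted; only the remaining steps are skipped.
    """
    results = []
    for index, step in enumerate(steps):
        start = clock()
        results.append(step())
        elapsed = clock() - start
        if elapsed > expected_step_seconds * slow_factor:
            raise SlowWorkerError(
                f"step {index} took {elapsed:.2f}s, "
                f"expected about {expected_step_seconds:.2f}s"
            )
    return results
```

Failing fast like this lets the job be resubmitted, possibly landing on a different worker, rather than running to completion on the slow one.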
Your insights and assistance in addressing this issue would be greatly appreciated. Thank you.
5 replies