ssssteven
RunPod
•Created by ssssteven on 7/3/2024 in #⚡|serverless
network connections are very slow, Failed to return job results.
Hello, I'm starting to see this error for all my jobs:
2024-07-03T16:16:03.367164993Z {"requestId": "c1ffb9c9-970e-4602-906a-b64838a17309-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/m6fixbtjutzbvo/job-done/ap10mhw4d9tubz/c1ffb9c9-970e-4602-906a-b64838a17309-u1?gpu=NVIDIA+RTX+A6000&isStream=false", "level": "ERROR"}
It seems like all network requests are taking much longer, e.g. 8-10s to post a 1 MB image to an AWS server. Has anyone else seen this? Thanks!
40 replies
RunPod
•Created by ssssteven on 7/1/2024 in #⚡|serverless
Can two serverless endpoints point to the same Docker image with different tags?
Hey all, I didn't realize that deploying a new version would update the image for both of my serverless endpoints. How do I separate them? Thanks!
4 replies
RunPod
•Created by ssssteven on 6/30/2024 in #⚡|serverless
error starting: Error response from daemon: Container aa58de3216b8515a3ee78aa46d9102331aaaf6c210a36c
Hey all, all my requests stopped working on p45dj1lfott9ob.
Example: 2319a9c9-dccd-4930-b34f-0acce67bdc3c-u1
Example: 40c01e77-32c9-451e-ac7a-dc134b513a97-u1
Can someone help me look into why this is happening? Thanks
59 replies
RunPod
•Created by ssssteven on 5/16/2024 in #⚡|serverless
container create: signal: killed?
Hey all, I have a task stuck in the booting state, and this is the error message I got:
2024-05-16T18:41:04Z create container stevenynie/dreamweaver:20240516111003
2024-05-16T18:42:04Z error creating container: container: create: container create: signal: killed
2024-05-16T18:42:24Z create container stevenynie/dreamweaver:20240516111003
request_id: 70ca2fc5-bcdf-4819-b936-742a5ed739e6-u1
Any hints on how to avoid this error? It has been like this for 3 mins.
8 replies
RunPod
•Created by ssssteven on 5/13/2024 in #⚡|serverless
Hey all, why does this worker stay alive after the task is completed?
Here is the request ID: d7aa74ec-5b1f-4af9-8bf4-d8211740019b-u1
According to the logs, there was no crash at all, but the worker status is still green on the RP dashboard. Thanks!
10 replies
RunPod
•Created by ssssteven on 3/19/2024 in #⚡|serverless
S3 download is quite slow
Hey all, I just learned that my workers spend 4s downloading ~5 MB of files in total from S3. Is that normal? Or is the best practice to include these files in the RunPod payload? If so, is there a size limit on the RunPod POST request? Thanks!
12 replies
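For scale, the figures quoted in the post above (~5 MB in 4s) work out to roughly 10 Mbit/s of effective throughput. Below is a minimal Python sketch of that arithmetic, plus a small timer that could wrap any download call to measure it on a worker. The helper names `throughput_mbit_per_s` and `stopwatch` are made up for illustration and are not part of any RunPod or AWS SDK.

```python
import time
from contextlib import contextmanager


def throughput_mbit_per_s(size_mb, seconds):
    """Effective throughput in megabits per second, taking 1 MB = 8 Mbit."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_mb * 8 / seconds


@contextmanager
def stopwatch(label):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s")


# Figures from the post: ~5 MB downloaded in 4 s.
print(throughput_mbit_per_s(5, 4))  # 10.0
```

Wrapping each individual download in `with stopwatch("file name"):` would show whether the 4s is spread evenly across files or dominated by connection setup on the first request.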
RunPod
•Created by ssssteven on 3/7/2024 in #⚡|serverless
GPU memory usage is at 99% when starting the task.
I started to notice some GPU OOM failures today, and they're specific to this instance:
A40 - 44adfw5inhfp98
When the job starts, it says the GPU utilization is at 99%. Did something change on RP?
4 replies
RunPod
•Created by ssssteven on 3/4/2024 in #⚡|serverless
Failed to get job. | Error Type: ClientConnectorError
3 replies
RunPod
•Created by ssssteven on 3/3/2024 in #⚡|serverless
error pulling image: Error response from daemon: Get "https://registry-1.docker.io/v2/"
Hey all, I have been getting this error on the worker:
error pulling image: Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
worker: e70bl866azw54g
Can someone take a look please?
2 replies
RunPod
•Created by ssssteven on 2/26/2024 in #⚡|serverless
Failed to get job. | Error Type: ClientConnectorError
Hey all, I'm starting to receive this kind of error:
2024-02-26T21:49:02.442274586Z connectionpool.py :872 2024-02-26 21:49:02,441 Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fd718d52aa0>: Failed to resolve 'api.runpod.ai' ([Errno -3] Temporary failure in name resolution)")': /v2/d7n1ceeuq4swlp/ping/xkqvldjqlccihw?gpu=NVIDIA+A40&runpod_version=1.6.0
2024-02-26T21:49:12.459986454Z {"requestId": null, "message": "Failed to get job. | Error Type: ClientConnectorError | Error Message: Cannot connect to host api.runpod.ai:443 ssl:default [Temporary failure in name resolution]", "level": "ERROR"}
It seems like the system keeps retrying to get the job for 40s, and this interval is included in the serverless billing time. What is going on? Thanks!
request id: 0e0314f9-3a78-46bc-b708-969d86ec5b84-u1
worker id: xkqvldjqlccihw
14 replies
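Both log lines in the post above point at name resolution for `api.runpod.ai` failing inside the worker. A minimal sketch of a check that could be run from inside the container to confirm this, using only the Python standard library; `check_dns` is a hypothetical helper, not part of the RunPod SDK, and it only tests DNS, not the HTTPS connection itself.

```python
import socket
import time


def check_dns(host, port=443, resolver=socket.getaddrinfo):
    """Try to resolve `host`; return (ok, elapsed_seconds, detail).

    `detail` is a sorted list of resolved addresses on success,
    or the resolver's error message on failure.
    """
    start = time.monotonic()
    try:
        infos = resolver(host, port)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address.
        detail = sorted({info[4][0] for info in infos})
        ok = True
    except socket.gaierror as exc:
        detail = str(exc)
        ok = False
    return ok, time.monotonic() - start, detail


# Example call (performs a real DNS lookup):
# ok, elapsed, detail = check_dns("api.runpod.ai")
```

The `resolver` parameter exists only so the function can be exercised without network access; in normal use the default `socket.getaddrinfo` is what gets called.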
RunPod
•Created by ssssteven on 2/14/2024 in #⚡|serverless
error pulling image: Error response from daemon: Get "https://registry-1.docker.io/v2/"
2024-02-14T04:23:33Z create container stevenynie/dreamweaver:20240213192815
2024-02-14T04:23:41Z pending image pull stevenynie/dreamweaver:20240213192815
2024-02-14T04:23:47Z create container stevenynie/dreamweaver:20240213192815
2024-02-14T04:23:47Z pending image pull stevenynie/dreamweaver:20240213192815
2024-02-14T04:24:02Z create container stevenynie/dreamweaver:20240213192815
2024-02-14T04:24:02Z pending image pull stevenynie/dreamweaver:20240213192815
2024-02-14T04:24:08Z error pulling image: Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
It seems like my worker has been stuck in this loop.
24 replies
RunPod
•Created by ssssteven on 1/20/2024 in #⚡|serverless
Is it possible to release a new version via the command line?
Instead of using the web interface, can we do this via the command line? Thanks!
5 replies
RunPod
•Created by ssssteven on 1/18/2024 in #⚡|serverless
Is there any way to restart the worker when SSH'd into the device?
Hey all, I have a reserved instance and I'm debugging some issues running on the GPU. I can SSH into the device and change some code. Is there an easy way for me to restart the worker process? Thanks!
2 replies
RunPod
•Created by ssssteven on 1/11/2024 in #⚡|serverless
Performance Difference between machine u3q0zswsna6v88 and cizgr1kbbfrp04
Hey all, what is the difference between these two machines? For the exact same code, u3q0zswsna6v88 takes 60s and cizgr1kbbfrp04 takes 8s. I repeated the same request multiple times and none of the requests hit a cold start. This happened around 5:15pm today.
Thanks!
22 replies
RunPod
•Created by ssssteven on 1/10/2024 in #⚡|serverless
RuntimeError: The NVIDIA driver on your system is too old (found version 11080). Please update your
I deployed a new version today but keep running into this error. Did something change on RunPod? Thanks!
30 replies
RunPod
•Created by ssssteven on 1/7/2024 in #⚡|serverless
What do the delay time and execution time mean on the Requests page?
Hey all, I'm not sure what the delay time means on the Requests page. Is it about the cold start? Could someone help me understand it? Also, the execution time seems to be much larger than the duration I've logged. Does the execution time mean the execution time of the handler function? Thanks!
31 replies
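One way to compare a self-logged duration against the execution time shown on the Requests page is to time the handler function itself. The sketch below is a plain Python decorator that does this. The `handler` shown is a hypothetical stand-in, and nothing here establishes what the dashboard figure does or does not include.

```python
import functools
import time


def log_duration(func):
    """Print how long each call to `func` takes, even if it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{func.__name__} took {elapsed:.3f}s")
    return wrapper


@log_duration
def handler(job):
    # Hypothetical handler: a real one would run the model here.
    return {"echo": job["input"]}
```

Any gap between the printed figure and the dashboard number would then be time spent outside the handler body, which narrows down where to look.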
RunPod
•Created by ssssteven on 1/7/2024 in #⚡|serverless
Set timeout on each job
Hello, is there any way to set a hard timeout limit for each job? Thank you!
13 replies
RunPod
•Created by ssssteven on 1/6/2024 in #⚡|serverless
Monitor Logs from command line
Hello all, is there any command-line tool to monitor the logs of an endpoint without opening the web page? Thanks!
4 replies