nevermind
RunPod
Created by nevermind on 9/26/2024 in #⛅|pods
Urgent: {'message': 'Something went wrong. Please try again later or contact support.'}
We have been hitting this API error every day for about 3 days, usually between 6:00 and 12:00 (about 6 hours a day). Could you please check whether the error is on our side or yours?
Timestamps of API errors that might be useful:
2024-09-26T07:15:01.021521866Z
2024-09-26T07:15:00.95935792Z
2024-09-26T07:08:59.823314972Z
P.S. We also noticed this one:
2024-09-25T22:54:16.07879268Z stderr F Response json was: response.json()={'message': 'Service Unavailable'}
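For anyone working around intermittent errors like these on the client side, here is a minimal sketch of retrying with exponential backoff while recording a UTC timestamp for each failure, so the timestamps can be passed on to support. `TransientAPIError` and `call_with_retry` are hypothetical names used for illustration only; they are not part of any RunPod SDK, and the retry counts and delays are arbitrary.

```python
import time
from datetime import datetime, timezone


class TransientAPIError(Exception):
    """Hypothetical error type for responses such as 'Service Unavailable'."""


def call_with_retry(call, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); on TransientAPIError wait base_delay * 2**n, then retry.

    Returns (result, failures). Each entry in failures is a pair of
    (UTC ISO timestamp, error message) for one failed attempt.
    The last failure is re-raised once all attempts are used up.
    """
    failures = []
    for attempt in range(attempts):
        try:
            return call(), failures
        except TransientAPIError as exc:
            stamp = datetime.now(timezone.utc).isoformat()
            failures.append((stamp, str(exc)))
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The caller decides which responses count as transient, for example by raising `TransientAPIError` when the HTTP status is 5xx.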
7 replies
RunPod
Created by nevermind on 9/21/2024 in #⛅|pods
SOS: pod GPU errors
pod_id=usg9djjhmpjfpd
# nvidia-smi
Sat Sep 21 14:48:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:A1:00.0 Off |                  Off |
|ERR!   36C    P0              48W / 450W |  24205MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
As you can see, the GPU is completely dead. We have hit errors like this multiple times, starting on September 19.
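A minimal sketch of how a pod could flag this state automatically, assuming the `ERR!` marker in the output above is how `nvidia-smi` reports a field it failed to read. The helper name `gpu_rows_with_errors` is made up for illustration.

```python
def gpu_rows_with_errors(smi_output: str) -> list[str]:
    """Return the table rows of nvidia-smi text output that contain 'ERR!'."""
    return [
        line
        for line in smi_output.splitlines()
        if line.startswith("|") and "ERR!" in line
    ]
```

The text to check can be captured with `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`; a non-empty result would be a reason to mark the pod unhealthy and replace it.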
8 replies
RunPod
Created by nevermind on 9/3/2024 in #⛅|pods
My pod got stuck during initialization
ogw47gdxzk3a26 is stuck pulling its image. Could you check what happened and resolve the issue? Our infrastructure is not set up to handle this kind of error on your side.
11 replies
RunPod
Created by nevermind on 8/21/2024 in #⛅|pods
How does RunPod handle pod termination?
It looks very likely that RunPod simply sends SIGKILL to the main container process. This is really annoying when you are trying to handle termination gracefully. Could you please explain how your orchestration system handles pod termination and how I can receive the OS signal?
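For context, SIGKILL cannot be caught or ignored by a process, so graceful shutdown is only possible if the platform sends a catchable signal such as SIGTERM first. Whether RunPod does so is exactly the open question here, so the sketch below is written under that assumption. The names `stop_requested` and `run_until_terminated` are illustrative.

```python
import signal
import threading

# Set by the signal handler; the main loop checks it between units of work.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record that termination was requested; do the real cleanup elsewhere."""
    stop_requested.set()


# Must be registered from the main thread of the main interpreter.
signal.signal(signal.SIGTERM, handle_sigterm)


def run_until_terminated(do_work, poll_seconds=1.0):
    """Call do_work() repeatedly until SIGTERM has been received."""
    while not stop_requested.is_set():
        do_work()
        # Sleeps up to poll_seconds, but wakes early once the event is set.
        stop_requested.wait(poll_seconds)
    # Cleanup (flushing state, closing connections) would go here.
```

The handler only runs if the signal actually reaches this process. When the container's main process is a shell wrapper, the signal may not be forwarded to the Python child, which can look the same as being killed outright.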
26 replies
RunPod
Created by nevermind on 5/21/2024 in #⛅|pods
GraphQL Unauthorized
When I run the "myPods" query (similar to https://graphql-spec.runpod.io/#query-myself) with the "machines" field, I get a strange response:
{
  "errors": [
    {
      "message": "Unauthorized",
      "locations": [
        {
          "line": 15,
          "column": 3
        }
      ],
      "path": ["myself", "machines"],
      "extensions": {
        "code": "RUNPOD"
      }
    }
  ],
  "data": ...
}
The "data" field contains the normal data, but without "machines".
1. Why am I getting "Unauthorized"?
2. How do I filter my pods by dataCenter's value?
Script to reproduce:
import requests

query = {
    "operationName": "myPods",
    "variables": {},
    "query": "query myPods {\n myself { pods {\n desiredStatus \n dockerId\n id\n imageName\n lastStatusChange\n locked\n machineId\n name\n machineType\n templateId\n uptimeSeconds\n }\n machines { id } }\n}"
}
r = requests.post(
    "https://api.runpod.io/graphql?api_key=...",
    json=query
)
print(r.content)
print(r.status_code)
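The response above is a partial result, which GraphQL allows: "data" and "errors" can both be present, and each error's "path" names the field that failed. A minimal sketch of separating the two, so the pods data stays usable while the "machines" failure is reported on its own. The helper name `split_graphql_response` is made up for illustration.

```python
def split_graphql_response(payload: dict) -> tuple[dict, list[str]]:
    """Split a GraphQL response body into (data, readable error lines)."""
    data = payload.get("data") or {}
    problems = []
    for err in payload.get("errors", []):
        # "path" lists the response keys leading to the failed field.
        path = ".".join(str(part) for part in err.get("path", []))
        problems.append(f"{path or '<no path>'}: {err.get('message', '')}")
    return data, problems
```

With the reproduction script, `data, problems = split_graphql_response(r.json())` would give the pods under `data["myself"]` and one entry in `problems` for `myself.machines`.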
11 replies