ashleyk
RunPod
Created by ashleyk on 3/18/2024 in #⚡|serverless
High execution time, high amount of failed jobs
No description
5 replies
RunPod
Created by ashleyk on 3/11/2024 in #⛅|pods
Different levels of performance from same GPU types in Community Cloud
When I use an A5000 in Community Cloud, I am able to get over 3it/s training Kohya_ss in ES and FR regions, but only a pathetic 1s/it in BG region. If different hosts are going to offer different levels of performance, they should not all earn the same fixed rate. A less performant host needs to earn less than the ones that have decent performance.
15 replies
RunPod
Created by ashleyk on 3/11/2024 in #⚡|serverless
What is the difference between setting execution timeout on an endpoint and setting in the request?
No description
4 replies
RunPod
Created by ashleyk on 3/11/2024 in #⚡|serverless
What is N95 in serverless metrics?
I finally understand the percentiles (P70, P90, P98) for serverless metrics, but I don't understand what N95 is. Can someone please explain what it is, how it's calculated, and what its significance is?
10 replies
RunPod
Created by ashleyk on 3/7/2024 in #⚡|serverless
Something broken at 1am UTC
No description
5 replies
RunPod
Created by ashleyk on 3/6/2024 in #⚡|serverless
What happened to the webhook graph?
There was a webhook graph for serverless but I can't seem to find it anymore. Was it removed for some reason?
2 replies
RunPod
Created by ashleyk on 3/4/2024 in #⚡|serverless
Massive spike in executionTime causing my jobs to fail (AGAIN)
No description
4 replies
RunPod
Created by ashleyk on 3/1/2024 in #⚡|serverless
Broken serverless worker - wqk2lrr3e9cekc
No description
7 replies
RunPod
Created by ashleyk on 2/26/2024 in #⚡|serverless
Unacceptably high failed jobs suddenly
Suddenly almost 20% of my serverless jobs failed. I have never had this issue until yesterday. It is completely UNACCEPTABLE that I am being charged for this immense fuck up and that my customers are being impacted. This needs to be resolved IMMEDIATELY and I demand a refund for this!
46 replies
RunPod
Created by ashleyk on 2/20/2024 in #⚡|serverless
Broken serverless worker - can't find GPU
Serverless worker qbw30nmknd6cmh is broken and can't find the GPU.
{
"dt":"2024-02-19 23:34:37.252459",
"endpointid":"qbw30nmknd6cmh",
"level":"error",
"message":"An exception was raised: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running FusedConv node. Name:'Conv_24' Status Message: CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=acb6f843d220 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/nn/conv.cc ; line=382 ; expr=cudnnFindConvolutionForwardAlgorithmEx( GetCudnnHandle(context), s_.x_tensor, s_.x_data, s_.w_desc, s_.w_data, s_.conv_desc, s_.y_tensor, s_.y_data, 1, &algo_count, &perf, algo_search_workspace.get(), max_ws_size); ",
"workerId":"ptrh2jn7wjkcmd"
}
3 replies
RunPod
Created by ashleyk on 2/16/2024 in #⚡|serverless
24GB PRO availability in RO
No description
46 replies