There's inconsistency in performance (Pod)
Hello. I rent and operate 20 RTX 4090 GPUs around the clock.
However, there are significant differences in inference speeds.
Each line in the table in the attached image represents 2 RTX 4090 GPUs.
One pod processes 150 images in 3 minutes, but the rest only manage 50-80. On my own 2-way RTX 4090 server, which I purchased directly, the throughput is 180 images in 3 minutes. I haven't been able to figure out why these speed differences occur.
Each inference task generates one image.
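To make the gap concrete, here's a rough throughput comparison in Python (the pod labels are placeholders; the image counts are the ones quoted above):

```python
# Rough per-pod throughput over the same 3-minute window.
window_minutes = 3
images_done = {
    "fast community pod": 150,   # best RunPod community pod
    "slow community pod": 65,    # typical slow pod (50-80 range)
    "own 2-way server":   180,   # my own RTX 4090 2-way server
}

for pod, images in images_done.items():
    print(f"{pod}: {images / window_minutes:.1f} images/min")
```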
17 Replies
I used the Community Cloud.
There are significant performance variations each time an instance is created.
Do you have the pod IDs?
I'm currently looking for a suitable GPU provider, but with RunPod the performance variance is too severe. I've also tested Vast.ai, and this kind of performance instability hardly ever occurs there. I need to be prepared for a situation where I'll have to rent 100-200 RTX 4090 GPUs in the future, so this problem needs to be resolved.
qqqvn244j95e71
This is a pod ID with worse performance.
eeqpk5j05wyz6j
kts3emj3q087ee
These too.
And the ifjn32k8a0hru1 pod has good performance.
@streamize
Escalated To Zendesk
The thread has been escalated to Zendesk!
Hey, does the ticket look right?
Yes, but I entered an email that isn't my RunPod account email. Is that a problem?
I created the ticket with a different email.
Ooh alright
I'm wondering, do those pods have the same specifications?
Yes, right. I used the same Docker image with the same specs.
The CPU model, RAM amount?
I'm guessing those are the factors, or the SSD.
Or maybe the network connection (depending on how your app works).
But most likely it's the specs of the server.
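If it helps, something like this could be run inside each pod to compare what hardware you actually got; just a sketch that reads /proc and calls nvidia-smi, nothing RunPod-specific:

```python
# Quick hardware fingerprint to compare pods (run inside each pod).
import subprocess

def first_line_matching(path, prefix):
    """Return the first line in `path` that starts with `prefix`."""
    with open(path) as f:
        for line in f:
            if line.startswith(prefix):
                return line.strip()
    return ""

print(first_line_matching("/proc/cpuinfo", "model name"))   # CPU model
print(first_line_matching("/proc/meminfo", "MemTotal"))      # total RAM
# GPU name, VRAM, power limit, and max SM clock for each GPU
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total,power.limit,clocks.max.sm",
     "--format=csv"],
    capture_output=True, text=True).stdout)
```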
As you know, the CPU and RAM are not directly specified by me and always come out differently. But looking at the pods with good performance in relation to this issue, there were cases with even less VRAM and lower CPU performance. So for now, I've put judging it as a VRAM or CPU problem on hold.
We use a system where tasks are stacked in a queue for processing. In our server source code, we've also separated out the pending state where the next task can't be processed due to network delays. The criterion for a task being completed in the queue is just the inference itself (it doesn't include delivery to the user), so I'm not currently considering this a network connection problem.
If it were a network connection issue, I'd expect all pods to show a uniform drop in processing capacity. However, some pods are working well.
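For context, here's a stripped-down sketch of what I mean (generate_image and deliver are placeholders, not our actual code); only the inference call is timed as the completed task:

```python
import queue
import time

task_queue = queue.Queue()

def worker(generate_image, deliver):
    """Pull prompts off the queue; only the inference counts toward throughput."""
    while True:
        prompt = task_queue.get()
        start = time.perf_counter()
        image = generate_image(prompt)          # inference only
        elapsed = time.perf_counter() - start   # this is what the 3-minute counts measure
        print(f"inference took {elapsed:.2f}s")
        deliver(image)                          # network delivery is NOT part of the metric
        task_queue.task_done()
```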
So, I haven't been able to identify the cause yet, haha...
This problem didn't occur on Vast.ai, so now I have a headache...
If we assume the SSD is the problem, what exactly would the issue be? Is it something I can control? When creating pods, I only enter the disk capacity, and I entered the same value for all of them, so I don't really understand what it could be.
There is also about a 2-fold difference in creation speed between the low-performance instances and the others. That's a very significant gap.
Not really; SSDs wear out, right? An older SSD can yield worse performance, leading to lower throughput, just like GPUs.
Yeah I agree
Is that still true even though inference runs while the model is already loaded on the GPU?
I'm not sure what the real issue is here. Oh, you're keeping the same model in VRAM? Most likely the GPU or CPU then.
Have you also checked nvidia-smi? Maybe they're power limited.
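Something like this would show whether any of the cards are power capped (plain nvidia-smi queries, so it should work on any pod):

```python
# Compare each GPU's current power limit against its default limit.
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,power.limit,power.default_limit,power.draw",
     "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout
print(out)  # power.limit below power.default_limit means the card has been capped
```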
Just guessing, but I'm sure staff can check more about the machine.
How does your app receive input for its inference?
What about Secure Cloud? Have you ever tried that too?
The same model is kept in VRAM, and the actions performed by each instance are 100% identical. They use the same Docker image and receive the same requests from clients. What I'm curious about is the performance of the machines in the Community Cloud. As far as I know, these GPUs come from an unspecified number of hosts. Does RunPod have internal criteria for determining whether a machine is suitable for hosting? As you mentioned, if it's a power issue, GPU performance could drop. Are there no management standards for this?
The Docker image hosts a Socket.IO server. It receives messages from users, generates images, and when generation is complete it sends the generated image back as a base64 string.
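Roughly like this, a stripped-down sketch rather than our actual code (the event names and run_inference are placeholders):

```python
# Minimal Socket.IO flow: receive a prompt, run inference, reply with base64.
import base64
import socketio

sio = socketio.AsyncServer(async_mode="asgi")
app = socketio.ASGIApp(sio)  # served by an ASGI server such as uvicorn

@sio.event
async def generate(sid, data):
    prompt = data["prompt"]
    image_bytes = run_inference(prompt)                 # placeholder for the model call
    encoded = base64.b64encode(image_bytes).decode("ascii")
    await sio.emit("result", {"image": encoded}, to=sid)

def run_inference(prompt: str) -> bytes:
    raise NotImplementedError  # the actual image generation happens here
```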
As you mentioned, I've now set things up to test Secure Cloud. Because of the price difference, I had been using the Community Cloud.
There is, but I don't know if they periodically run any checks for that.
Oh okay nice
If you don't mind, update me here. I'm curious what could cause what you're experiencing.
I think we can discuss it here.
Yeah sure, discuss what?