Issues with network volume access
It seems like there is latency when accessing files from serverless workers.
Error on the worker:
2024-10-22T18:36:15.970116008Z CompletedProcess(args=['python3', '/runpod-volume/script3.py', 'xx', 'xx-xx'], returncode=1, stdout='File /runpod-volume/xx/xx/file.txt not found after 15 seconds. Exiting.\n', stderr='')
root@9bd985fc78bf:/workspace/xx/xx# ls -ltr
total 45
-rwxrwxrwx 1 root root 46062 Oct 22 18:35 file.txt
Not sure why the pod was unable to find it. Started happening last night, Pacific time.
DC: EU-RO-1
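For context, the check in script3.py is roughly the following (a minimal sketch reconstructed from the error message above; the path and 15-second timeout come from the log, everything else is assumed):

import sys
import time
from pathlib import Path

# Hypothetical reconstruction: poll the network volume for the input file,
# giving up after 15 seconds (matches the error message in the log).
path = Path("/runpod-volume/xx/xx/file.txt")
deadline = time.time() + 15
while time.time() < deadline:
    if path.exists():
        break
    time.sleep(1)
else:
    # Loop finished without finding the file; exit with returncode 1.
    print(f"File {path} not found after 15 seconds. Exiting.")
    sys.exit(1)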
Anyone?
Still no update on this one?
What's your code like?
Is your endpoint using the network volume?
Does this happen rarely or all the time?
Yes, my endpoint uses a network volume, but I have since switched to syncing data from S3 to the local filesystem. It was working well until 2 days ago. Now I'm facing a new issue where a job gets retried while it is still executing; this seems like a new bug. I use https://api.runpod.ai/v2/tbkwpbgdwzm1z3/run to execute my jobs. Is there any way to get around the bug?
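The sync I switched to is essentially the following (a rough sketch; the bucket, prefix, and local path are placeholders, not my real ones):

import subprocess

# Workaround: copy inputs from S3 to the worker's local disk instead of
# reading them from the network volume. Bucket/prefix are placeholders.
subprocess.run(
    ["aws", "s3", "sync", "s3://my-bucket/inputs/", "/tmp/inputs/"],
    check=True,
)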
Also, I see the same worker being used for two jobs at once: it picks up a second job while the first is still executing, delaying the second one. These are all new findings; earlier, a job would wait until the worker had completed its current request.
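For reference, my handlers use the plain single-job setup, roughly like this sketch (the handler body is a placeholder); as far as I understand the SDK, a worker should only run jobs concurrently when a concurrency_modifier is configured, so this behavior is unexpected:

import runpod

def handler(job):
    # Placeholder handler body.
    return {"output": "done"}

# Default setup: without a concurrency_modifier, a worker should take
# one job at a time (as far as I understand the SDK's behavior).
runpod.serverless.start({"handler": handler})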
@flash-singh @yhlong00000
Very erratic behavior now, please look into it as soon as possible. I have 3 endpoints with about 250 workers, and the performance has been nothing short of horrible for the last 3 days.
@prongs
The thread has been escalated to Zendesk!
I guess this is possibly caused by a bug in the new SDK.
Will take a look and let you know
Thank you. Please let me know if you were able to check.
I have downgraded to 1.6.2 and am retrying.
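For reference, the downgrade is just a version pin in the image (assuming the SDK is installed via pip):

pip install "runpod==1.6.2"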
I can only see two endpoints on your account. For the first one (tbkwpbgdwzm1z3), you are using SDK 1.7.3—let’s see if downgrading helps improve the situation.
For the second one (vrk03g7imxvx90), there are multiple errors related to the SFTP transfer for script3.py. Let me know if you need more info.
Do you still see 1.7.3 being used for tbkwp*? I downgraded to 1.6.2 about an hour ago. Seeing better results. Monitoring.
Nope, I see it changed to 1.6.2.
Looking good so far, increasing workers now. Will monitor.