RunPod · 2mo ago
prongs

Issues with network volume access

There seems to be latency when accessing files from serverless workers. Error on the worker:

2024-10-22T18:36:15.970116008Z CompletedProcess(args=['python3', '/runpod-volume/script3.py', 'xx', 'xx-xx'], returncode=1, stdout='File /runpod-volume/xx/xx/file.txt not found after 15 seconds. Exiting.\n', stderr='')

Yet the file shows up in a directory listing:

root@9bd985fc78bf:/workspace/xx/xx# ls -ltr
total 45
-rwxrwxrwx 1 root root 46062 Oct 22 18:35 file.txt

Not sure why the pod was unable to find it. This started happening last night, Pacific time. DC: EU-RO-1
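The worker-side check in the log above gives up after 15 seconds. script3.py itself is not shown in the thread, so the following is only a minimal sketch of such a wait loop; `wait_for_file` and its parameters are hypothetical names, not the original code:

```python
import os
import time


def wait_for_file(path, timeout=15.0, interval=0.5):
    """Poll until `path` exists and is non-empty, or `timeout` seconds pass.

    Hypothetical helper; the name and defaults are not from the thread.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # os.stat() raises FileNotFoundError while the file is not visible.
            if os.stat(path).st_size > 0:
                return True
        except FileNotFoundError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Raising the timeout would only mask the delay. Logging how long each wait took might help show whether the volume is slow or the file never becomes visible to the worker at all.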
11 Replies
prongs (OP) · 2mo ago
Anyone? Still no update on this one?
nerdylive · 2mo ago
What's your code like? Is your endpoint using the network volume? Does this happen rarely or all the time?
prongs (OP) · 2mo ago
Yes, my endpoint uses a network volume, but I have since switched to syncing data from S3 to the local filesystem. That was working well until 2 days ago.

I'm now facing a new issue: a job gets retried while it is still executing. I use https://api.runpod.ai/v2/tbkwpbgdwzm1z3/run to execute my job. Is there any way to get around the bug?

Also, I see the same worker being used for 2 jobs while the first one is still executing, which delays the second one. These are all new findings. Earlier, the job used to be delayed until the worker completed its current request.

@flash-singh @yhlong00000 very erratic behavior now, please look into it at the earliest. I have 3 endpoints with about 250 workers, and the performance has been nothing short of horrible for the last 3 days.
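One way to limit the damage when the same job is delivered twice is to have the handler skip ids it has already started. The sketch below is not a RunPod feature and does not fix the underlying bug; `claim_job` and `CLAIM_DIR` are hypothetical names, and it assumes the handler can pass in a unique job id:

```python
import os
import tempfile

# Assumption: a directory local to one worker. A path shared by all workers
# would be needed to deduplicate across workers.
CLAIM_DIR = os.path.join(tempfile.gettempdir(), "job-claims")


def claim_job(job_id, claim_dir=CLAIM_DIR):
    """Return True the first time a job id is seen, False on any repeat."""
    os.makedirs(claim_dir, exist_ok=True)
    marker = os.path.join(claim_dir, f"{job_id}.claimed")
    try:
        # O_CREAT | O_EXCL fails if the marker already exists, so only one
        # caller can claim a given id.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

A handler could call `claim_job(...)` with the job's id before doing any work and return early when it gets `False`, so a retried delivery does not repeat work that is already in progress.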
Poddy · 2mo ago
@prongs
Escalated To Zendesk
The thread has been escalated to Zendesk!
nerdylive · 2mo ago
I guess this could be because of a bug in the new SDK.
yhlong00000 · 2mo ago
Will take a look and let you know
prongs (OP) · 2mo ago
Thank you. Please let me know if you were able to check. I have downgraded to 1.6.2 and am retrying.
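To confirm which SDK version a worker image actually runs after a downgrade, the installed version can be read at startup. A small sketch using only the standard library; the helper names are hypothetical, and it assumes the SDK is installed under the PyPI package name `runpod`:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_pinned(package, expected):
    """True only if `package` is installed at exactly the `expected` version."""
    return installed_version(package) == expected
```

Logging `installed_version("runpod")` when the worker starts, or checking `is_pinned("runpod", "1.6.2")`, would show whether a rebuilt image really picked up the downgrade.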
yhlong00000 · 2mo ago
I can only see two endpoints on your account. For the first one (tbkwpbgdwzm1z3), you are using SDK 1.7.3, so let's see if downgrading helps improve the situation. For the second one (vrk03g7imxvx90), there are multiple errors related to the SFTP transfer for script3.py. Let me know if you need more info.
prongs (OP) · 2mo ago
Do you still see 1.7.3 being used for tbkwp*? I downgraded to 1.6.2 about an hour ago. Seeing better results. Monitoring.
yhlong00000 · 2mo ago
Nope, I see it changed to 1.6.2.
prongs (OP) · 2mo ago
Looking good so far, increasing workers now. Will monitor.