Issues with network volume access
It seems like there is latency when accessing files from serverless workers.
Error on the worker:
2024-10-22T18:36:15.970116008Z CompletedProcess(args=['python3', '/runpod-volume/script3.py', 'xx', 'xx-xx'], returncode=1, stdout='File /runpod-volume/xx/xx/file.txt not found after 15 seconds. Exiting.\n', stderr='')
root@9bd985fc78bf:/workspace/xx/xx# ls -ltr
total 45
-rwxrwxrwx 1 root root 46062 Oct 22 18:35 file.txt
Not sure why the pod was unable to find it. Started happening last night, Pacific time.
DC: EU-RO-1
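For context, the check in script3.py is roughly the following (a minimal sketch reconstructed from the error message above; the path and 15-second timeout come from the log, everything else is assumed):

import sys
import time
from pathlib import Path

# Hypothetical reconstruction: poll the network volume for the input file,
# giving up after 15 seconds (matches the error message in the log).
path = Path("/runpod-volume/xx/xx/file.txt")
deadline = time.time() + 15
while time.time() < deadline:
    if path.exists():
        break
    time.sleep(1)
else:
    # Loop finished without finding the file; exit with returncode 1.
    print(f"File {path} not found after 15 seconds. Exiting.")
    sys.exit(1)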
Anyone?
Still no update on this one?
What's your code like?
Is your endpoint using the network volume?
Does this happen rarely or all the time?
Yes, my endpoint uses a network volume, but I have since switched to syncing data from S3 to the local filesystem. It was working well until 2 days ago. Now I'm facing a new issue where a job gets retried while it is still executing; this seems like a new bug. I use https://api.runpod.ai/v2/tbkwpbgdwzm1z3/run to execute my jobs. Is there any way to get around the bug?
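The sync I switched to is essentially the following (a rough sketch; the bucket, prefix, and local path are placeholders, not my real ones):

import subprocess

# Workaround: copy inputs from S3 to the worker's local disk instead of
# reading them from the network volume. Bucket/prefix are placeholders.
subprocess.run(
    ["aws", "s3", "sync", "s3://my-bucket/inputs/", "/tmp/inputs/"],
    check=True,
)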
Also, I see the same worker being used for two jobs at once: it picks up a second job while the first is still executing, delaying the second one. These are all new findings; earlier, a job would wait until the worker had completed its current request.
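For reference, my handlers use the plain single-job setup, roughly like this sketch (the handler body is a placeholder); as far as I understand the SDK, a worker should only run jobs concurrently when a concurrency_modifier is configured, so this behavior is unexpected:

import runpod

def handler(job):
    # Placeholder handler body.
    return {"output": "done"}

# Default setup: without a concurrency_modifier, a worker should take
# one job at a time (as far as I understand the SDK's behavior).
runpod.serverless.start({"handler": handler})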
@flash-singh @yhlong00000
Very erratic behavior now, please look into it as soon as possible. I have 3 endpoints with about 250 workers, and the performance has been nothing short of horrible for the last 3 days.
@prongs
The thread has been escalated to Zendesk!
I guess this is possibly caused by a bug in the new SDK.
Will take a look and let you know
Thank you. Please let me know if you were able to check.
I have downgraded to 1.6.2 and am retrying.
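For reference, the downgrade is just a version pin in the image (assuming the SDK is installed via pip):

pip install "runpod==1.6.2"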
I can only see two endpoints on your account. For the first one (tbkwpbgdwzm1z3), you are using SDK 1.7.3—let’s see if downgrading helps improve the situation.
For the second one (vrk03g7imxvx90), there are multiple errors related to the SFTP transfer for script3.py. Let me know if you need more info.
Do you still see 1.7.3 being used for tbkwp*? I downgraded to 1.6.2 about an hour ago. Seeing better results. Monitoring.
Nope, I see it changed to 1.6.2.
Looking good so far, increasing workers now. Will monitor.