RunPod (2mo ago)
JCtheMC

Issue with Huggingface dataset not being cached to storage volume

I want to use https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu for a project. I'm trying to download this dataset through the Python datasets package, and I want the download to be stored on my storage volume. As per the documentation here: https://huggingface.co/docs/datasets/v3.2.0/en/cache#cache-directory , the package offers the option to either set an environment variable or use a function argument to specify the download directory. I've tried both approaches, but whatever I do, the cached files keep ending up on the Container instead of my storage Volume. Edit: it may very well be that I'm not defining the path correctly, as I have limited Linux experience. Please help.
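For reference, the two approaches mentioned in the question can be sketched roughly as below. This is a minimal sketch, not a confirmed fix: it assumes the storage volume is mounted at /workspace, the helper names point_hf_cache_at and load_fineweb_edu and the hf_cache/datasets subdirectory are made up for illustration, and the sample-10BT config name is taken from the dataset card. The environment variable is set before datasets is imported, because the library reads it at import time.

```python
import os
from pathlib import Path


def point_hf_cache_at(volume_root: str = "/workspace") -> Path:
    """Point HF_DATASETS_CACHE at a directory on the storage volume.

    Hypothetical helper: volume_root is an assumption about where the
    volume is mounted, and the subdirectory name is arbitrary.
    """
    cache_dir = Path(volume_root) / "hf_cache" / "datasets"
    # This only affects the current process and its children, and it
    # must happen before `import datasets`.
    os.environ["HF_DATASETS_CACHE"] = str(cache_dir)
    return cache_dir


def load_fineweb_edu(volume_root: str = "/workspace"):
    """Download a sample of the dataset with the cache on the volume."""
    cache_dir = point_hf_cache_at(volume_root)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Imported here so the environment variable is already in place.
    from datasets import load_dataset

    # The function argument is passed as well, so the location does not
    # depend on the environment variable alone.
    return load_dataset(
        "HuggingFaceFW/fineweb-edu",
        name="sample-10BT",  # config name assumed from the dataset card
        split="train",
        cache_dir=str(cache_dir),
    )
```

Calling load_fineweb_edu() from a script run in the pod should then place the cached files under /workspace/hf_cache/datasets, provided /workspace really is the volume mount.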
nerdylive (2mo ago)
Yes, you need to set the Hugging Face cache path.
JCtheMC (OP, 2mo ago)
Yep, I'm doing that, and it's not working for some reason.
nerdylive (2mo ago)
Oh, what did you set it to? Via an environment variable, right?
JCtheMC (OP, 2mo ago)
[image attachment, no description]
nerdylive (2mo ago)
It must be inside your mount point, most likely /workspace. And did you download it manually from the terminal, or how?
JCtheMC (OP, 2mo ago)
Through a Python script.
JCtheMC (OP, 2mo ago)
Files are still ending up in root.
[image attachment, no description]
nerdylive (2mo ago)
Did you run it manually, or from the container's Dockerfile or a start script executed by the Dockerfile? It seems like you ran it manually.
JCtheMC (OP, 2mo ago)
Yes, I ran it manually.
Solution
nerdylive (2mo ago)
If that's the right variable, use the export command in Linux to set the environment variable instead of setting it in RunPod.
nerdylive (2mo ago)
The other way (setting it in RunPod) will work if the script is run from the Dockerfile.
JCtheMC (OP, 2mo ago)
Oh OK! Let me give that a try.
nerdylive (2mo ago)
It doesn't work here because if you run it in another terminal, the environment variable isn't there. You can check that, if you want, before setting it.
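The check suggested above can also be done from inside the Python script itself. The following is a small sketch, assuming the variable in question is HF_DATASETS_CACHE and the volume is mounted at /workspace; the function name check_cache_env is hypothetical.

```python
import os
from pathlib import Path


def check_cache_env(var: str = "HF_DATASETS_CACHE",
                    mount_point: str = "/workspace") -> str:
    """Report whether `var` is visible in this process and where it points."""
    value = os.environ.get(var)
    if value is None:
        return f"{var} is not set in this process"
    path = Path(value).resolve()
    mount = Path(mount_point).resolve()
    # The cache is on the volume only if the path is the mount point
    # itself or somewhere below it.
    if path == mount or mount in path.parents:
        return f"{var}={value} (inside {mount_point})"
    return f"{var}={value} (NOT inside {mount_point})"


print(check_cache_env())
```

Running this in the same terminal that will run the download shows what that terminal's processes actually see, which may differ from what was entered in the web interface.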
JCtheMC (OP, 2mo ago)
Oooh OK, I apparently do not understand how this works at all, hahaha.
JCtheMC (OP, 2mo ago)
[image attachment, no description]
JCtheMC (OP, 2mo ago)
Now it seems to be saving to my workspace correctly.
nerdylive (2mo ago)
Nice. That's because when you set it on the RunPod website, it applies to your Dockerfile only. Think of it this way: your terminal is an external terminal (outside the Dockerfile).
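The general rule behind this explanation is that a process only sees the environment it was started with. The sketch below illustrates that rule with two child processes; it says nothing about how RunPod injects variables internally, and the helper name child_sees and the example path are made up for illustration.

```python
import os
import subprocess
import sys


def child_sees(var: str, env: dict) -> str:
    """Start a fresh Python process with `env` and return its value of `var`."""
    code = f"import os; print(os.environ.get({var!r}, 'unset'))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Two environments that differ only in the cache variable.
base = {k: v for k, v in os.environ.items() if k != "HF_DATASETS_CACHE"}
with_var = {**base, "HF_DATASETS_CACHE": "/workspace/hf_cache/datasets"}

print(child_sees("HF_DATASETS_CACHE", with_var))  # started with the variable
print(child_sees("HF_DATASETS_CACHE", base))      # started without it
```

The first child reports the path and the second reports unset, even though both are launched from the same script. A terminal that was not started from the container's main process is in the position of the second child.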
JCtheMC (OP, 2mo ago)
Aaah OK, I didn't realise that.
nerdylive (2mo ago)
So a way to sync this up is to use a script that re-exports each of these environment variables, and have the Dockerfile execute that script. Then, when you open a terminal, the variables will be there.
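The script nerdylive refers to is not reproduced in this thread. As a rough idea of what such a re-export step might look like, here is a sketch that writes export lines for selected variables to a file; the function name write_export_file, the HF_ prefix filter, and the choice of target file are all assumptions.

```python
import os
import shlex
from pathlib import Path


def write_export_file(target, prefixes=("HF_",), environ=None):
    """Write `export NAME=value` lines for matching variables to `target`.

    Hypothetical helper: `prefixes` selects which variables to copy and
    defaults to Hugging Face ones purely for illustration.
    """
    environ = os.environ if environ is None else environ
    lines = [
        # shlex.quote keeps values with spaces or quotes shell-safe.
        f"export {name}={shlex.quote(value)}"
        for name, value in sorted(environ.items())
        if name.startswith(tuple(prefixes)) and name.isidentifier()
    ]
    Path(target).write_text("\n".join(lines) + "\n")
    return lines
```

If this runs from the container's start script, where the variables from the web interface are present, and the target is a file that new shells source at startup (for example one referenced from ~/.bashrc), terminals opened later should pick the variables up. Which startup file your terminal actually reads is something to verify on your pod.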
JCtheMC (OP, 2mo ago)
And this is some bash script that I would have to write myself?
nerdylive (2mo ago)
I have the script, if you want.
JCtheMC (OP, 2mo ago)
Yes please. I'm trying to learn, and examples generally work best for me.
nerdylive (2mo ago)
https://discord.com/channels/912829806415085598/948767517332107274/1305682393654366228 This should go in the entrypoint or CMD. If you want to execute other commands, make a new .sh file and move this into that script along with the other commands.
JCtheMC (OP, 2mo ago)
So if I run this when the pod is set up, then whenever I open a terminal, it'll have the environment variables I defined through the web interface?
nerdylive (2mo ago)
Yep. Wait, did you see the script? I think I linked it wrong.
JCtheMC (OP, 2mo ago)
Thank you for the help, I appreciate it. I really should take some time to go over the tutorials in more depth.
JCtheMC (OP, 2mo ago)
[image attachment, no description]
nerdylive (2mo ago)
Yep, sure. Some GitHub references for RunPod templates and YouTube tutorials would be useful.
JCtheMC (OP, 2mo ago)
I honestly need to work on a bunch of stuff: limited Linux experience, barely any Docker experience. For now though, this works, which means I can do stuff. Thank you for the help, I appreciate it.
nerdylive (2mo ago)
You're welcome!