Issue with Hugging Face dataset not being cached to storage volume
I want to use https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu for a project. I'm trying to download this dataset through the Python datasets package, and I want the download to be stored on my storage volume. As per the documentation here: https://huggingface.co/docs/datasets/v3.2.0/en/cache#cache-directory, the package lets you either set an environment variable or pass a function argument to specify the download directory. I've tried both approaches, but whatever I do, the cached files keep ending up on the container disk instead of my storage volume. Edit: it may very well be that I'm not defining the path correctly - I have limited Linux experience. Please help.
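For reference, a minimal Python sketch of the two approaches the docs describe (environment variable and function argument), combined in one script. It assumes the storage volume is mounted at /workspace; the folder name hf_cache and the helper names are hypothetical, and sample-10BT is only picked here as one of the smaller configs of the dataset. As far as I know the Hugging Face libraries read these variables when they are imported, so the variable is set before the import.

```python
import os
from pathlib import Path

# Assumption: the storage volume is mounted at /workspace (adjust to your mount point).
VOLUME_ROOT = Path("/workspace")
HF_CACHE = VOLUME_ROOT / "hf_cache"  # hypothetical folder name


def point_hf_cache_at(cache_root: Path) -> Path:
    """Set HF_HOME for this process and return the datasets cache folder under it."""
    os.environ["HF_HOME"] = str(cache_root)
    return cache_root / "datasets"


def download_fineweb_edu(cache_root: Path = HF_CACHE):
    """Download a fineweb-edu sample with the cache on the volume."""
    datasets_cache = point_hf_cache_at(cache_root)
    # Imported only after HF_HOME is set, so the library sees the new location.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb-edu",
        name="sample-10BT",  # assumption: a smaller config, swap in the one you need
        split="train",
        cache_dir=str(datasets_cache),  # the function-argument approach
    )


# Usage (starts a large download, so it is left commented out):
# ds = download_fineweb_edu()
# print(ds)
```

Setting the variable inside the script means it does not matter which terminal the script is started from.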
You must set the HF cache path, yes
yep - I'm doing it, and it's not working for some reason.
Oh, what did you set it to?
Via an env variable, right?
It must be in your mount point, most likely inside /workspace
Then did you download it manually from the terminal, or how?
through a python script
files are still ending up in root
Did you run it manually, or from the container's Dockerfile / a start script executed from the Dockerfile?
Seems like you ran it manually
yes i ran it manually
Solution
If that's the right variable, use the export command in Linux to set the env variable in your terminal instead of setting it in RunPod
If you ran it the other way (from the Dockerfile), the variable set in RunPod would work
oh ok! let me give that a try
It doesn't work here because when you run it in a separate terminal, the env variable isn't there
You can check whether it's there before setting it, if you want
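A quick way to do that check from Python, as a rough sketch: it prints the cache-related variables the current process can actually see and whether each one points into the volume. The /workspace mount point and the function names are assumptions for illustration.

```python
import os

# Cache-related variables the Hugging Face libraries look at.
HF_CACHE_VARS = ("HF_HOME", "HF_DATASETS_CACHE", "HF_HUB_CACHE")


def is_under(path, root):
    """True if path is root itself or lies somewhere below it."""
    path, root = os.path.abspath(path), os.path.abspath(root)
    return os.path.commonpath([path, root]) == root


def report_cache_env(environ=os.environ, mount_point="/workspace"):
    """Return {name: (value, on_volume)} for each cache variable."""
    report = {}
    for name in HF_CACHE_VARS:
        value = environ.get(name)  # None if this process never received it
        report[name] = (value, value is not None and is_under(value, mount_point))
    return report


for name, (value, on_volume) in report_cache_env().items():
    status = "not set" if value is None else ("on volume" if on_volume else "NOT on volume")
    print(f"{name}: {value!r} ({status})")
```

If everything shows up as not set in the terminal where the download script is run, the libraries fall back to their default cache location in the home directory.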
oooh ok i completely do not understand how this works apparently hahaha
now it seems to be saving to my workspace correctly
Nice
Because... when you set it on the RunPod website, it only applies to whatever your Dockerfile starts
Think of it that way, and treat your terminal as an external terminal (outside the Dockerfile)
aaah ok - that I didn't realise.
So a way to sync this up is to use a script that re-exports each of these env variables
And put it in a script that the Dockerfile will execute
So when you open a terminal, the env variables will be there
and this is some bash script that I would have to write myself?
I have the script if you want
yes please
I'm trying to learn, and for me examples generally work best
https://discord.com/channels/912829806415085598/948767517332107274/1305682393654366228
It should go in the ENTRYPOINT or CMD. If you want to execute other commands too, make a new .sh file and move this into that script together with the other commands
so if I run this when the pod is set up, whenever I open a terminal, it'll have the environment variables I defined through the web interface?
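To illustrate the idea (this is not the linked script, just a sketch of the same technique in Python): at container start, dump the environment the start process received into a file of export lines, so that terminals opened later can source it. The target path /etc/pod_environment, the skip list, and the function names are hypothetical.

```python
import os
import re
import shlex
from pathlib import Path

# Variables the shell manages itself; copying them into new terminals is not useful.
SKIP = {"_", "PWD", "OLDPWD", "SHLVL", "HOME", "TERM", "HOSTNAME"}
VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def render_exports(environ, skip=SKIP):
    """Turn an environment mapping into shell 'export NAME=value' lines."""
    lines = []
    for name, value in sorted(environ.items()):
        if name in skip or not VALID_NAME.fullmatch(name):
            continue
        # shlex.quote keeps spaces and special characters safe for the shell.
        lines.append(f"export {name}={shlex.quote(value)}")
    return "\n".join(lines) + "\n"


def write_env_file(target, environ=None):
    """Write the export lines to target and return its path."""
    environ = os.environ if environ is None else environ
    target = Path(target)
    target.write_text(render_exports(environ))
    return target


# Usage, run from the start script that ENTRYPOINT/CMD executes (hypothetical path):
# write_env_file("/etc/pod_environment")
# and add this line to ~/.bashrc so new terminals pick the variables up:
#   source /etc/pod_environment
```

Because the file is written by the process that does receive the web-interface variables, sourcing it is what carries them over into a terminal.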
Yep
Wait, did you see the script? I think I linked the wrong thing
Thank you for the help - appreciate it. I really should take some time to go over the tutorials in more depth.
Yep, sure. Some GitHub references for RunPod templates or a YouTube tutorial would be useful
I honestly need to work on a bunch of stuff - limited Linux experience, barely any Docker experience.
for now though, this works. Which means I can do stuff. Thank you for the help, I appreciate it.
You're welcome!