#Issue with Huggingface dataset not being cached to storage volume


river stratus
#

I want to use https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu for a project. I'm trying to download this dataset through the Python datasets package, and I want the download to be stored on my storage volume. As per the documentation here: https://huggingface.co/docs/datasets/v3.2.0/en/cache#cache-directory , the package lets you either set an environment variable or pass a function argument to specify the download directory. I've tried both approaches, but whatever I do, the cached files keep ending up on the container instead of my storage volume. Edit: it may very well be that I'm not defining the path correctly - I have limited Linux experience. Please help.
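For readers following along, the two approaches mentioned in the question can be sketched roughly as below. This is a minimal sketch, not the poster's actual script: the helper name `configure_hf_cache` and the `hf_cache` folder are made up for illustration, and `/workspace` as the volume mount point is an assumption. `HF_HOME`, `HF_DATASETS_CACHE` and the `cache_dir` argument of `load_dataset` are the documented knobs.

```python
import os
from pathlib import Path


def configure_hf_cache(volume_root: str) -> Path:
    """Point the Hugging Face caches at a directory on the storage volume.

    Call this before `import datasets`: the library reads these
    variables when it is first imported.
    """
    cache_root = Path(volume_root) / "hf_cache"  # hypothetical folder name
    cache_root.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HOME"] = str(cache_root)
    os.environ["HF_DATASETS_CACHE"] = str(cache_root / "datasets")
    return cache_root


# Usage on a pod whose volume is mounted at /workspace (an assumption):
#
#   cache_root = configure_hf_cache("/workspace")
#   from datasets import load_dataset   # import only after the env is set
#   ds = load_dataset(
#       "HuggingFaceFW/fineweb-edu",
#       cache_dir=str(cache_root / "datasets"),  # explicit argument as well
#   )
```

Setting the variables inside the script only affects that Python process, which sidesteps the question of which terminal inherited which environment.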

wind dagger
#

You must set the hf cache path, yes

river stratus
#

yep - i'm doing it, and it's not working for some reason.

wind dagger
#

Oh, what did you set it to?

#

Via env variable right?

wind dagger
#

It must be in your mount point, most likely inside /workspace

river stratus
#

through a python script

#

files are still ending up in root

wind dagger
#

Did you run it manually, or from the container's Dockerfile / a start script executed by the Dockerfile?

#

Seems like you ran it manually

river stratus
#

yes i ran it manually

wind dagger
#

If that's the right variable, use the export command in Linux to set the env variable instead of setting it in RunPod

river stratus
#

oh ok! let me give that a try

wind dagger
#

It doesn't work here because if you run it in another terminal, the env variable isn't there

#

Check it if you want before setting
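One way to do the check suggested here, as a rough sketch: print what the current process actually sees and where the datasets cache would resolve to. The fallback order below mirrors the documented defaults (`HF_DATASETS_CACHE`, then `HF_HOME/datasets`, then `~/.cache/huggingface/datasets`); the helper names and the `/workspace` mount point are assumptions for illustration.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_datasets_cache(env: Mapping[str, str]) -> Path:
    """Return the directory the datasets cache should resolve to for `env`."""
    if env.get("HF_DATASETS_CACHE"):
        return Path(env["HF_DATASETS_CACHE"]).expanduser()
    xdg = env.get("XDG_CACHE_HOME", "~/.cache")
    hf_home = env.get("HF_HOME") or os.path.join(xdg, "huggingface")
    return (Path(hf_home) / "datasets").expanduser()


def is_on_volume(path: Path, mount_point: str = "/workspace") -> bool:
    """True if `path` is the mount point or lies somewhere beneath it."""
    mount = Path(mount_point)
    return path == mount or mount in path.parents


# Report what this process (and therefore this terminal) actually sees.
cache_dir = resolve_datasets_cache(os.environ)
print(f"HF_HOME           = {os.environ.get('HF_HOME')}")
print(f"HF_DATASETS_CACHE = {os.environ.get('HF_DATASETS_CACHE')}")
print(f"cache resolves to   {cache_dir}")
print(f"on the volume?      {is_on_volume(cache_dir)}")
```

If both variables print as `None` in a freshly opened terminal, that terminal never received the values set in the web interface.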

river stratus
#

oooh ok i completely do not understand how this works apparently hahaha

#

now it seems to be saving to my workspace correctly

wind dagger
#

Nice

#

Because... when you set it on the RunPod website, it only applies to what your Dockerfile runs

#

Imagine it that way: your terminal is an external terminal (outside the Dockerfile)

river stratus
#

aaah ok - that i didn't realise.

wind dagger
#

So a way to sync this up is to use a script that re-exports each of these env variables

#

And put it in some script that the Dockerfile will execute

#

So when you open a terminal, the env variables will be there

river stratus
#

and this is some bash script that i would have to write myself?

wind dagger
#

I have the script if you want

river stratus
#

yes please

#

i'm trying to learn and for me examples generally work best

wind dagger
#

#🎤|general message

#

It should go in the ENTRYPOINT or CMD. If you want to execute other commands, make a new .sh file and move this into that script along with the other commands
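The linked script itself is not reproduced in this thread. As a rough illustration of the idea only, and not the script that was shared, the re-export step could look something like this: at container start, dump the variables you care about into a file as `export` lines, then have new shells source that file. The function names, the `HF_` prefix filter and the file name `/etc/pod_environment` are all hypothetical.

```python
import os
import shlex
from typing import Iterable, Mapping


def render_exports(env: Mapping[str, str], prefixes: Iterable[str]) -> str:
    """Build `export NAME=value` lines for variables whose name starts with a prefix."""
    wanted = tuple(prefixes)
    lines = [
        f"export {name}={shlex.quote(value)}"  # quote so spaces survive sourcing
        for name, value in sorted(env.items())
        if name.startswith(wanted)
    ]
    return "\n".join(lines) + "\n"


def write_exports(path: str, prefixes: Iterable[str] = ("HF_",)) -> None:
    """Write the current process's matching variables to a file shells can source."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(render_exports(os.environ, prefixes))


# At container start (hypothetical file name):
#   write_exports("/etc/pod_environment")
# and once in ~/.bashrc, so every new terminal picks the values up:
#   source /etc/pod_environment
```

This only helps if it runs in the process that received the web-interface variables, which is why it belongs in whatever the ENTRYPOINT or CMD executes.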

river stratus
#

so if i run this when the pod is set up, whenever i open a terminal, it'll have the environment variables i defined through the web interface?

wind dagger
#

Yep

#

Wait, did you see the script? I think I linked it wrong

river stratus
#

Thank you for the help - appreciate it. I really should take some time to go over the tutorials in more depth.

wind dagger
#

Yep, sure. Some GitHub references for RunPod templates and YouTube tutorials would be useful

river stratus
#

i honestly need to work on a bunch of stuff - limited Linux experience, barely any Docker experience.

#

for now though, this works. Which means i can do stuff. Thank you for the help, i appreciate it.

wind dagger
#

You're welcome!