-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Use OS-specific cache directories instead of home directory #31295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
I think this would warrant a new entry, see: https://github.com/scikit-learn/scikit-learn/blob/main/doc/whats_new/upcoming_changes/README.md for instructions. There's quite a few other places that need to be updated, I would suggest searching for "scikit_learn_data" in the codebase. |
I believe I have covered all the files now |
Thanks for the PR @norgera. This is a breaking change so I think it needs a deprecation cycle. |
Indeed. Here is a link on how we want to go through when it comes to deprecation: https://scikit-learn.org/dev/developers/contributing.html#deprecation |
Thank you all for the information. I've added deprecation to use the original home directory if its detected or manually set, providing a warning + instructions to move files to new cache folder. Otherwise uses new cache folder. Let me know if this approach is appropriate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a flyby note, current release is 1.7, so should be 'deprecated in 1.7, removed in 1.9'.
Thanks a lot for tackling this and also adding all the machinery for deprecation. Do we really need to do a deprecation (and introduce a new parameter)? To me the location of a cache directory feels like an internal detail of the data downloader. If the user specified a path explicitly and we somehow changed it then I'd agree we need a deprecation. But for a cache directory that is implicitly set I am less sure. But then again people might be using this to download datasets and then accessing the files they downloaded. So a better way to think of the path is as a default download location, not a cache directory (in which case we should do a deprecation). But if we think of this as more of a data downloading tool where people can (and should) access the files - should we be using a directory that is explicitly called cache (which to me implies that it can be deleted at random or its structure changed without notice)? For example if my OS asks me to "clean up caches and temporary files?" I think it should always be safe to click yes. But for the use-case where someone uses the downloading machinery to place a file in a specific directory, that is no longer true? Or am I making this too complicated? What do you all think about this? |
Resolves #31267
The get_data_home function now uses standard OS cache directories:
Previously, data was stored in ~/scikit_learn_data by default.
This change follows OS conventions for cache storage and improves
maintainability.
Implemented deprecation protocol and added tests in test_base.py