ENH fetch_file to fetch data files by URL with retries, checksumming and local caching #29354


Merged — 31 commits merged into scikit-learn:main on Jul 11, 2024

Conversation

@ogrisel (Member) commented Jun 26, 2024

The goal of this helper is to make it possible to write educational examples that download and cache data files that are then loaded manually with functions such as pandas.read_csv, pandas.read_parquet, and so on.

The goal is to avoid leading readers of scikit-learn examples to believe that machine learning is only about working with benchmark data wrapped as a Bunch object and fetched by magic helpers such as fetch_openml, which hide how to properly load a parquet file or specify non-default parameters to read_csv.
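As an illustration, a minimal sketch of the intended usage pattern (the URL and checksum below are placeholders, and the exact final signature may differ slightly):

# Sketch only: fetch and locally cache a remote parquet file, then load it
# explicitly with pandas instead of relying on a Bunch-returning helper.
import pandas as pd

from sklearn.datasets import fetch_file

data_path = fetch_file(
    "https://example.com/path/to/dataset.parquet",  # placeholder URL
    sha256="<expected checksum>",  # optional integrity check (placeholder)
)
df = pd.read_parquet(data_path)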

TODO

  • write more tests;
  • upgrade at least one example to show how to use it, for instance with a parquet file from openml.org

github-actions bot commented Jun 26, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4fc5c10.

# expressions (like `pl.col("count").shift(1)` below). See
# https://docs.pola.rs/user-guide/lazy/optimizations/ for more information.

df = pl.read_parquet(bike_sharing_data_file)
Member Author (@ogrisel):

Note to reviewers: for some reason this file had Windows-style CRLF line endings. This PR only changes the content of the first 2 cells.

@ogrisel marked this pull request as ready for review on June 27, 2024 at 13:58

@glemaitre (Member) left a comment

Quick review of the example (I don't know why the diff is so large on GitHub's side).

I really feel that this pattern has added educational value for our users.

ogrisel and others added 5 commits June 27, 2024 17:26
@ogrisel (Member Author) commented Jun 28, 2024

The ruff problems will be fixed by #29359.

@adrinjalali (Member) left a comment

Thanks @ogrisel

Comment on lines +301 to +302
# We can now visualize the performance of the model with regards
# to the 5th percentile, median and the 95th percentile:
Member:

This is where I wish we had a much nicer / easier API to get the same output with much less boilerplate.

Member Author (@ogrisel):

We could use a for loop instead of repeating calls. However, this is outside the scope of this PR.

@adrinjalali (Member) left a comment

Some nits, otherwise LGTM.

Member:

cc @MarcoGorelli 🥳

Contributor:

🥳 nice!

(when I saw I had a notification from scikit-learn, my first thought was "oh no this is gonna be another grid-search cv_results_ issue?" 😅 )

@adrinjalali (Member) left a comment

Thanks @ogrisel 🥳

@ogrisel (Member Author) commented Jul 5, 2024

> Thanks @ogrisel 🥳

Thank you @adrinjalali for the quality review :)

@ogrisel added the labels "Waiting for Second Reviewer" (first reviewer is done, need a second one!) and "Waiting for Reviewer" on Jul 8, 2024

pl.Config.set_fmt_str_lengths(20)

bike_sharing_data_file = fetch_file(
Member:

Two questions about this:

  • Should we use the sha256 argument? What's the worst that can happen if we don't use it? Using a corrupted file and getting a weird error? Using it in this example kind of makes fetch_file look not super convenient to use. Originally I thought, by looking at the code in this example, that the sha256 was required, but it's not.
  • I am guessing the URL is likely to change (my understanding is that this is some kind of OpenML internal detail), but we can wait and see if that actually happens.

Member Author (@ogrisel):

> Should we use the sha256 argument? What's the worst that can happen if we don't use it? Using a corrupted file and getting a weird error? Using it in this example kind of makes fetch_file look not super convenient to use. Originally I thought, by looking at the code in this example, that the sha256 was required, but it's not.

I think it's good to use the sha256 argument here for two reasons:

  • it makes it possible to explicitly detect when the upstream dataset changes: some changes in the upstream data could be such that the example still runs without error, but the analysis of the results or the plots would become invalid;
  • it teaches readers the good practice of always checking the integrity of files downloaded from the open web.

> I am guessing the URL is likely to change (my understanding is that this is some kind of OpenML internal detail), but we can wait and see if that actually happens.

OpenML URLs that include the dataset id should always serve the same contents because new dataset versions get a different id on openml.org.
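For reference, the digest to pass to the sha256 argument can be computed from a first, manually verified download; a minimal sketch (illustrative helper, not code from this PR):

import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    # Stream the file in chunks so large data files do not need to fit in memory.
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()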

@lesteve (Member) commented Jul 9, 2024

OK fair enough, let's leave it like this then!

One weird thing in particular, I think, is that there is no way to know the sha256 checksum in advance in the OpenML parquet case, but maybe one day there will be a way.

Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
@ogrisel (Member Author) commented Jul 9, 2024

Thanks @lesteve for the review. I answered your questions and applied the suggested change.

assert filename == "file.tar.gz"

folder, filename = _derive_folder_and_filename_from_url(
"https://example.com/نمونه نماینده.data"
@lesteve (Member) commented Jul 9, 2024

Apparently this means "representative sample", for those who were wondering 😉

Member Author (@ogrisel):

It's so meta, like the original contents of the corrupted file ;)

@lesteve (Member) left a comment

LGTM, two small questions

f"re-downloading from {remote.url} ."
)

temp_file = NamedTemporaryFile(
Member:

Just curious: here it seems like you are using NamedTemporaryFile only to get a filename, right?

Is it important to use delete=False? I guess maybe this is for Windows (glancing over the docs: https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile)?

Member Author (@ogrisel):

> Just curious: here it seems like you are using NamedTemporaryFile only to get a filename, right?

And to create the temporary file in a concurrency-safe way.

> Is it important to use delete=False?

Because we rename the file at the end. If we keep the default delete=True, we get an error because the original file no longer exists when the temporary file object gets garbage collected by Python.
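For illustration, a minimal sketch of the pattern being discussed (the helper name and download call are assumptions, not the exact code from this PR):

import shutil
from pathlib import Path
from tempfile import NamedTemporaryFile
from urllib.request import urlretrieve


def download_then_move(url, target_path):
    # Hypothetical helper showing the temporary-file-then-rename pattern.
    target_path = Path(target_path)
    # delete=False because we move the file into place ourselves; with the
    # default delete=True, cleanup would fail once the file has been moved.
    temp_file = NamedTemporaryFile(
        prefix=target_path.name + ".part_", dir=target_path.parent, delete=False
    )
    try:
        temp_file.close()  # close before moving (required on Windows)
        urlretrieve(url, temp_file.name)
        shutil.move(temp_file.name, str(target_path))
    except BaseException:
        Path(temp_file.name).unlink(missing_ok=True)
        raise
    return target_path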

Member:

Maybe there is a way to mention in the comment that the delete-on-garbage-collection behaviour only applies to Python < 3.12.

I was a bit confused by your explanation, because I was reading the Python 3.12 doc, where delete_on_close=True by default, so I was thinking the file would then be deleted on close.

Turns out delete_on_close is a new parameter in Python 3.12 ...

Member Author (@ogrisel):

We need to close the file to be able to call shutil.move. I hadn't realized that delete_on_close existed in 3.12, but it's ignored when delete=False.

Not sure how to improve the comment. I think mentioning the 3.12-only delete_on_close option that we don't use would be more confusing than helpful.

Member Author (@ogrisel):

I decided to add a dedicated inline comment before the line where we close temp_file in anticipation: bed8515.

@lesteve (Member) commented Jul 11, 2024

OK, good enough I think, thanks!

I looked a bit more and the behaviour is Python-version specific; for example, in Python 3.11 the following snippet (moving a NamedTemporaryFile before it is deleted) raises a traceback, but not in Python 3.12:

import tempfile
import shutil

# delete=True (the default) removes the file when the object is closed or
# garbage collected.
tf = tempfile.NamedTemporaryFile(mode='w')

# Moving the file away makes that cleanup fail on Python < 3.12 because
# tf.name no longer exists when the object is garbage collected.
shutil.move(tf.name, '/tmp/new')
del tf

I originally missed the fact that NamedTemporaryFile ensures the file name does not already exist, so that is indeed a protection when calling this function in parallel.

@ogrisel (Member Author) commented Jul 10, 2024

@lesteve I pushed a comment to explicitly answer your questions about the handling of the temporary file.

@lesteve enabled auto-merge (squash) on July 11, 2024 at 13:25
@lesteve (Member) commented Jul 11, 2024

Alright, I set auto-merge 🏁

@lesteve merged commit 2b2e290 into scikit-learn:main on Jul 11, 2024
28 checks passed
@ogrisel deleted the fetch_file branch on August 5, 2024 at 15:02