ENH fetch_file to fetch data files by URL with retries, checksumming and local caching #29354
Conversation
…mand to use fetch_file
# expressions (like `pl.col("count").shift(1)` below). See
# https://docs.pola.rs/user-guide/lazy/optimizations/ for more information.

df = pl.read_parquet(bike_sharing_data_file)
Note to reviewers: for some reason this file had Windows-style CRLF line endings. This PR only changes the content of the first 2 cells.
The HTML of the updated example looks good: https://output.circle-artifacts.com/output/job/14e0e12d-bc0a-44fe-ae16-1aa287820d57/artifacts/0/doc/auto_examples/applications/plot_time_series_lagged_features.html
Quick review of the example (I don't know why the diff is so large on GitHub's side).
I really feel that this pattern has added educational value for our users.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
The ruff problems will be fixed by #29359.
Thanks @ogrisel
# We can now visualize the performance of the model with regards
# to the 5th percentile, median and the 95th percentile:
This is where I wish we had a much nicer / easier API to get the same output with much less boilerplate.
We could use a for loop instead of repeating calls, as sketched below. However, this is outside the scope of this PR.
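For illustration, a minimal sketch of such a for loop, using synthetic data and GradientBoostingRegressor quantile models rather than the example's actual pipeline:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative synthetic data, not the bike sharing dataset.
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y = X.sum(axis=1) + rng.normal(scale=0.2, size=200)

quantiles = {"5th percentile": 0.05, "median": 0.5, "95th percentile": 0.95}
fig, axes = plt.subplots(ncols=len(quantiles), figsize=(12, 4), sharey=True)
for ax, (name, q) in zip(axes, quantiles.items()):
    # One quantile regressor per subplot instead of three copied blocks.
    model = GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
    ax.scatter(y, model.predict(X), alpha=0.3)
    ax.set(title=name, xlabel="observed", ylabel="predicted")
plt.show()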
some nits, otherwise LGTM.
cc @MarcoGorelli 🥳
🥳 nice!
(when I saw I had a notification from scikit-learn, my first thought was "oh no this is gonna be another grid-search cv_results_ issue?" 😅 )
Thanks @ogrisel 🥳
Thank you @adrinjalali for the quality review :)
pl.Config.set_fmt_str_lengths(20)

bike_sharing_data_file = fetch_file(
Two questions about this:
- Should we use the sha256 argument? What's the worst that can happen if we don't use it? Using a corrupted file and getting a weird error? Using it in this example kind of makes fetch_file look not super convenient to use. Originally I thought, looking at the code in this example, that sha256 was required, but it's not.
- I am guessing the URL is likely to change (my understanding is that this is some kind of OpenML internal detail), but we can wait and see if that actually happens.
Should we use the sha256 argument? What's the worst that can happen if we don't use it? Using a corrupted file and getting a weird error? Using it in this example kind of makes fetch_file look not super convenient to use. Originally I thought, looking at the code in this example, that sha256 was required, but it's not.
I think it's good to use the sha256 argument here for two reasons:
- it makes it possible to detect explicitly when the upstream dataset changes: some changes in the upstream data could be such that the example still runs without error, but the analysis of the results in the example or the plots would become invalid.
- it teaches readers the good practice of always checking the integrity of what you download from the open web.
I am guessing the URL is likely to change (my understanding is that this is some kind of OpenML internal detail), but we can wait and see if that actually happens.
OpenML URLs that include the dataset id should always serve the same contents, because new dataset versions get a different id on openml.org.
OK fair enough, let's leave it like this then!
One weird thing in particular, I think, is that there is no way to know the sha256sum in advance in the OpenML parquet case, but maybe one day there will be one.
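To make the pattern under discussion concrete, here is a minimal sketch of checksummed fetching; the URL and hash below are placeholders, not the values used in the actual example:

import polars as pl
from sklearn.datasets import fetch_file

# Placeholder URL and checksum for illustration only. The pinned hash can
# be computed once on a trusted copy, e.g. with hashlib.sha256 or the
# sha256sum command line tool.
data_file = fetch_file(
    "https://example.com/bike_sharing.parquet",
    sha256="0000000000000000000000000000000000000000000000000000000000000000",
)

# fetch_file returns a local file path (cached across runs) that the
# example then loads explicitly, instead of a pre-wrapped Bunch object.
df = pl.read_parquet(data_file)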
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Thanks @lesteve for the review. I answered your questions and applied the suggested change.
assert filename == "file.tar.gz"

folder, filename = _derive_folder_and_filename_from_url(
    "https://example.com/نمونه نماینده.data"
Apparently this means "representative sample" for those who wonder 😉
It's so meta, like the original contents of the corrupted file ;)
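For readers curious what a helper like this might do, here is a hedged sketch (not the actual scikit-learn implementation): split the URL path into a folder part and a filename, unquoting percent-encoded characters so non-ASCII names survive intact.

from urllib.parse import unquote, urlparse

def derive_folder_and_filename_from_url(url):
    # Decode percent-escapes, then split the path on its last slash.
    path = unquote(urlparse(url).path)
    folder, _, filename = path.strip("/").rpartition("/")
    return folder or ".", filename

assert derive_folder_and_filename_from_url(
    "https://example.com/نمونه نماینده.data"
)[1] == "نمونه نماینده.data"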
LGTM, two small questions
f"re-downloading from {remote.url} ." | ||
) | ||
|
||
temp_file = NamedTemporaryFile( |
Just curious: here it seems like you are using NamedTemporaryFile only to get a filename, right?
Is it important to use delete=False? I guess maybe this is for Windows (glancing over the doc: https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile)?
Just curious: here it seems like you are using NamedTemporaryFile only to get a filename, right?
And to create the temporary file in a concurrency-safe way.
Is it important to use delete=False?
Because we rename the file at the end. If we keep the default delete=True, we get an error because the original file no longer exists when the temporary file object gets garbage collected by Python.
Maybe there is a way to note in the comment that the delete-on-garbage-collection behaviour only applies to Python < 3.12.
I was a bit confused by your explanation, because I was reading the Python 3.12 doc, where delete_on_close=True by default, so I was thinking "but then it's deleted at close". Turns out delete_on_close is a new parameter in Python 3.12...
We need to close the file to be able to call shutil.move. I hadn't realized that delete_on_close existed in 3.12, but it's ignored when delete=False.
Not sure how to improve the comment. I think mentioning the 3.12-only delete_on_close option that we don't use would be more confusing than helpful.
I decided to add a dedicated inline comment before the line where we close temp_file, in anticipation: bed8515.
OK good enough I think, thanks!
I looked a bit more, and the behaviour is Python version specific: for example, in Python 3.11 there is a traceback with the following snippet (moving a NamedTemporaryFile before it is deleted), but not in Python 3.12:
import tempfile
import shutil

tf = tempfile.NamedTemporaryFile(mode='w')
shutil.move(tf.name, '/tmp/new')
del tf  # on Python 3.11 this prints an ignored exception traceback; not on 3.12
I missed the fact originally that NamedTemporaryFile indeed ensures that the file name does not already exist, so that is a protection when calling this function in parallel.
@lesteve I pushed a comment to explicitly answer your questions about the handling of the temporary file.
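Summarizing the thread, here is a minimal sketch of the download-then-rename pattern under discussion, with illustrative names (not the actual scikit-learn implementation):

import shutil
from tempfile import NamedTemporaryFile
from urllib.request import urlopen

def download_then_move(url, target_path):
    # NamedTemporaryFile picks a fresh, non-existing name, which makes
    # concurrent calls safe. delete=False because the file is renamed
    # below: with the default delete=True, Python would raise on garbage
    # collection since the original path no longer exists.
    temp_file = NamedTemporaryFile(delete=False)
    with urlopen(url) as response:
        shutil.copyfileobj(response, temp_file)
    # Close before moving: shutil.move needs the file closed, in
    # particular on Windows.
    temp_file.close()
    shutil.move(temp_file.name, target_path)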
Alright, I set auto-merge 🏁
The goal of this helper is to make it possible to have educational examples that download and cache data files that are then manually loaded with functions such as pandas.read_csv, pandas.read_parquet and so on. The goal is to avoid leading readers of scikit-learn examples to believe that machine learning is only about working with benchmark data wrapped as a Bunch object fetched by magic helpers such as fetch_openml, which hide how to properly load a parquet file or how to specify non-default parameters to read_csv.

TODO
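As an illustration of that intended pattern (the URL and the read_csv parameters below are hypothetical):

import pandas as pd
from sklearn.datasets import fetch_file

# Hypothetical dataset URL; fetch_file downloads it once and caches it locally.
data_file = fetch_file("https://example.com/some_dataset.csv")

# The reader loads the file explicitly and sees the real parsing decisions,
# instead of receiving a pre-wrapped Bunch object.
df = pd.read_csv(data_file, parse_dates=["date"], na_values=["?"])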