ENH fetch_file to fetch data files by URL with retries, checksumming and local caching #29354
Conversation
…mand to use fetch_file
# expressions (like `pl.col("count").shift(1)` below). See
# https://docs.pola.rs/user-guide/lazy/optimizations/ for more information.

df = pl.read_parquet(bike_sharing_data_file)
Note to reviewers: for some reason this file had Windows-style CRLF line endings. This PR only changes the content of the first 2 cells.
The HTML of the updated example looks good: https://output.circle-artifacts.com/output/job/14e0e12d-bc0a-44fe-ae16-1aa287820d57/artifacts/0/doc/auto_examples/applications/plot_time_series_lagged_features.html
Quick review of the example (I don't know why the diff is so large on GitHub's side).
I really feel that this pattern has added educational value for our users.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
The ruff problems will be fixed by #29359.
Thanks @ogrisel
# We can now visualize the performance of the model with regards
# to the 5th percentile, median and the 95th percentile:
This is where I wish we had a much nicer / easier API to get the same output with much less boilerplate.
We could use a for loop instead of repeating calls, as sketched below. However, this is outside the scope of this PR.
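For illustration, a minimal sketch of such a for loop, using synthetic data and GradientBoostingRegressor quantile models rather than the example's actual pipeline:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative synthetic data, not the bike sharing dataset.
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y = X.sum(axis=1) + rng.normal(scale=0.2, size=200)

quantiles = {"5th percentile": 0.05, "median": 0.5, "95th percentile": 0.95}
fig, axes = plt.subplots(ncols=len(quantiles), figsize=(12, 4), sharey=True)
for ax, (name, q) in zip(axes, quantiles.items()):
    # One quantile regressor per subplot instead of three copied blocks.
    model = GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
    ax.scatter(y, model.predict(X), alpha=0.3)
    ax.set(title=name, xlabel="observed", ylabel="predicted")
plt.show()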
some nits, otherwise LGTM.
cc @MarcoGorelli 🥳
🥳 nice!
(when I saw I had a notification from scikit-learn, my first thought was "oh no this is gonna be another grid-search cv_results_ issue?" 😅 )
Thanks @ogrisel 🥳
Thank you @adrinjalali for the quality review :)
pl.Config.set_fmt_str_lengths(20)

bike_sharing_data_file = fetch_file(
Two questions about this:
- Should we use the sha256 argument? What's the worst that can happen if we don't use it? Using a corrupted file and getting a weird error? Using it in this example kind of makes fetch_file look not super convenient to use. Originally I thought, looking at the code in this example, that sha256 was required, but it's not.
- I am guessing the URL is likely to change (my understanding is that this is some kind of OpenML internal detail), but we can wait and see if that actually happens.
Should we use the sha256 argument? What's the worst that can happen if we don't use it? Using a corrupted file and getting a weird error? Using it in this example kind of makes fetch_file look not super convenient to use. Originally I thought, looking at the code in this example, that sha256 was required, but it's not.
I think it's good to use the sha256 argument here for two reasons:
- it makes it possible to detect explicitly when the upstream dataset changes: some changes in the upstream data could be such that the example still runs without error, but the analysis of the results in the example or the plots would become invalid.
- it teaches readers the good practice of always checking the integrity of what you download from the open web.
I am guessing the URL is likely to change (my understanding is that this is some kind of OpenML internal detail), but we can wait and see if that actually happens.
OpenML URLs that include the dataset id should always serve the same contents, because new dataset versions get a different id on openml.org.
OK fair enough, let's leave it like this then!
One weird thing in particular, I think, is that there is no way to know the sha256sum in advance in the OpenML parquet case, but maybe one day there will be one.
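To make the pattern under discussion concrete, here is a minimal sketch of checksummed fetching; the URL and hash below are placeholders, not the values used in the actual example:

import polars as pl
from sklearn.datasets import fetch_file

# Placeholder URL and checksum for illustration only. The pinned hash can
# be computed once on a trusted copy, e.g. with hashlib.sha256 or the
# sha256sum command line tool.
data_file = fetch_file(
    "https://example.com/bike_sharing.parquet",
    sha256="0000000000000000000000000000000000000000000000000000000000000000",
)

# fetch_file returns a local file path (cached across runs) that the
# example then loads explicitly, instead of a pre-wrapped Bunch object.
df = pl.read_parquet(data_file)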
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Thanks @lesteve for the review. I answered your questions and applied the suggested change.
assert filename == "file.tar.gz"

folder, filename = _derive_folder_and_filename_from_url(
    "https://example.com/نمونه نماینده.data"
Apparently this means "representative sample" for those who wonder 😉
It's so meta, like the original contents of the corrupted file ;)
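For readers curious what a helper like this might do, here is a hedged sketch (not the actual scikit-learn implementation): split the URL path into a folder part and a filename, unquoting percent-encoded characters so non-ASCII names survive intact.

from urllib.parse import unquote, urlparse

def derive_folder_and_filename_from_url(url):
    # Decode percent-escapes, then split the path on its last slash.
    path = unquote(urlparse(url).path)
    folder, _, filename = path.strip("/").rpartition("/")
    return folder or ".", filename

assert derive_folder_and_filename_from_url(
    "https://example.com/نمونه نماینده.data"
)[1] == "نمونه نماینده.data"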
LGTM, two small questions
f"re-downloading from {remote.url} ." | ||
) | ||
|
||
temp_file = NamedTemporaryFile( |
Just curious: here it seems like you are using NamedTemporaryFile only to get a filename, right?
Is it important to use delete=False? I guess maybe this is for Windows (glancing over the doc: https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile)?
Just curious: here it seems like you are using NamedTemporaryFile only to get a filename, right?
And to create the temporary file in a concurrency-safe way.
Is it important to use delete=False?
Because we rename the file at the end. If we keep the default delete=True, we get an error because the original file no longer exists when the temporary file object gets garbage collected by Python.
Maybe there is a way to note in the comment that the delete-on-garbage-collection behaviour only applies to Python < 3.12.
I was a bit confused by your explanation, because I was reading the Python 3.12 doc, where delete_on_close=True by default, so I was thinking "but then it's deleted at close". Turns out delete_on_close is a new parameter in Python 3.12...
We need to close the file to be able to call shutil.move. I hadn't realized that delete_on_close existed in 3.12, but it's ignored when delete=False.
Not sure how to improve the comment. I think mentioning the 3.12-only delete_on_close option that we don't use would be more confusing than helpful.
I decided to add a dedicated inline comment before the line where we close temp_file, in anticipation: bed8515.
OK good enough I think, thanks!
I looked a bit more, and the behaviour is Python version specific: for example, in Python 3.11 there is a traceback with the following snippet (moving a NamedTemporaryFile before it is deleted), but not in Python 3.12:
import tempfile
import shutil

tf = tempfile.NamedTemporaryFile(mode='w')
shutil.move(tf.name, '/tmp/new')
del tf  # on Python 3.11 this prints an ignored exception traceback; not on 3.12
I missed the fact originally that NamedTemporaryFile indeed ensures that the file name does not already exist, so that is a protection when calling this function in parallel.
@lesteve I pushed a comment to explicitly answer your questions about the handling of the temporary file.
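Summarizing the thread, here is a minimal sketch of the download-then-rename pattern under discussion, with illustrative names (not the actual scikit-learn implementation):

import shutil
from tempfile import NamedTemporaryFile
from urllib.request import urlopen

def download_then_move(url, target_path):
    # NamedTemporaryFile picks a fresh, non-existing name, which makes
    # concurrent calls safe. delete=False because the file is renamed
    # below: with the default delete=True, Python would raise on garbage
    # collection since the original path no longer exists.
    temp_file = NamedTemporaryFile(delete=False)
    with urlopen(url) as response:
        shutil.copyfileobj(response, temp_file)
    # Close before moving: shutil.move needs the file closed, in
    # particular on Windows.
    temp_file.close()
    shutil.move(temp_file.name, target_path)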
Alright, I set auto-merge 🏁
The goal of this helper is to make it possible to have educational examples that download and cache data files that are then manually loaded with functions such as pandas.read_csv, pandas.read_parquet and so on. The goal is to avoid leading readers of scikit-learn examples to believe that machine learning is only about working with benchmark data wrapped as a Bunch object fetched by magic helpers such as fetch_openml, which hide how to properly load a parquet file or how to specify non-default parameters to read_csv.

TODO
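As an illustration of that intended pattern (the URL and the read_csv parameters below are hypothetical):

import pandas as pd
from sklearn.datasets import fetch_file

# Hypothetical dataset URL; fetch_file downloads it once and caches it locally.
data_file = fetch_file("https://example.com/some_dataset.csv")

# The reader loads the file explicitly and sees the real parsing decisions,
# instead of receiving a pre-wrapped Bunch object.
df = pd.read_csv(data_file, parse_dates=["date"], na_values=["?"])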