ENH Adds categories with missing values support to fetch_openml with as_frame=True #19365

Merged: 10 commits, Feb 6, 2021
doc/whats_new/v0.24.rst (4 changes: 0 additions & 4 deletions)
@@ -234,10 +234,6 @@ Changelog
files downloaded or cached to ensure data integrity.
:pr:`14800` by :user:`Shashank Singh <shashanksingh28>` and `Joel Nothman`_.

- |Feature| :func:`datasets.fetch_openml` now validates md5checksum of arff
files downloaded or cached to ensure data integrity.
:pr:`14800` by :user:`Shashank Singh <shashanksingh28>` and `Joel Nothman`_.

- |Enhancement| :func:`datasets.fetch_openml` now allows argument `as_frame`
to be 'auto', which tries to convert returned data to pandas DataFrame
unless data is sparse.
doc/whats_new/v1.0.rst (8 changes: 8 additions & 0 deletions)
@@ -172,6 +172,14 @@ Changelog
:class:`~sklearn.semi_supervised.LabelPropagation`.
:pr:`19271` by :user:`Zhaowei Wang <ThuWangzw>`.

:mod:`sklearn.datasets`
.......................

- |Enhancement| :func:`datasets.fetch_openml` now supports categories with
missing values when returning a pandas dataframe. :pr:`19365` by
`Thomas Fan`_ and :user:`Amanda Dsouza <amy12xx>` and
:user:`EL-ATEIF Sara <elateifsara>`.

Code and Documentation Contributors
-----------------------------------

sklearn/datasets/_openml.py (6 changes: 5 additions & 1 deletion)
@@ -23,6 +23,7 @@
 from . import get_data_home
 from urllib.error import HTTPError
 from ..utils import Bunch
+from ..utils import is_scalar_nan
 from ..utils import get_chunk_n_rows
 from ..utils import _chunk_generator
 from ..utils import check_pandas_support  # noqa
@@ -357,7 +358,10 @@ def _convert_arff_data_dataframe(
     for column in columns_to_keep:
         dtype = _feature_to_dtype(features_dict[column])
         if dtype == 'category':
-            dtype = pd.api.types.CategoricalDtype(attributes[column])
+            cats_without_missing = [cat for cat in attributes[column]
+                                    if cat is not None and
+                                    not is_scalar_nan(cat)]
+            dtype = pd.api.types.CategoricalDtype(cats_without_missing)
         df[column] = df[column].astype(dtype, copy=False)
     return (df, )
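
The filtering step in this hunk can be sketched in isolation. The snippet below is a minimal, standard-library-only illustration, not the scikit-learn implementation: `is_missing` and `categories_without_missing` are hypothetical names standing in for the `is_scalar_nan` check and the list comprehension in the diff, and the category list is made-up sample input.

```python
import math
import numbers


def is_missing(value):
    """Return True for None or a scalar NaN.

    Hypothetical stand-in for the `cat is not None and not is_scalar_nan(cat)`
    condition used in the diff; it is not the scikit-learn helper itself.
    """
    if value is None:
        return True
    return isinstance(value, numbers.Real) and math.isnan(value)


def categories_without_missing(categories):
    """Drop missing markers from a category list, preserving order."""
    return [cat for cat in categories if not is_missing(cat)]


# Made-up nominal values such as an ARFF attribute might declare.
arff_categories = ["FEMALE", "MALE", "_", None, float("nan")]
cleaned = categories_without_missing(arff_categories)
print(cleaned)  # ['FEMALE', 'MALE', '_']
```

Only the list of category labels is filtered here; missing entries in the data column itself are left in place, so they can still surface as NaN after the cast to a categorical dtype.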

Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
sklearn/datasets/tests/test_openml.py (15 changes: 15 additions & 0 deletions)
@@ -1311,3 +1311,18 @@ def test_convert_arff_data_type():
     msg = r"arff\['data'\] must be a generator when converting to pd.DataFrame"
     with pytest.raises(ValueError, match=msg):
         _convert_arff_data_dataframe(arff, ['a'], {})
+
+
+def test_missing_values_pandas(monkeypatch):
+    """check that missing values in categories are compatible with pandas
+    categorical"""
+    pytest.importorskip('pandas')
+
+    data_id = 42585
+    _monkey_patch_webbased_functions(monkeypatch, data_id, True)
+    penguins = fetch_openml(data_id=data_id, cache=False, as_frame=True)
+
+    cat_dtype = penguins.data.dtypes['sex']
+    # there are nans in the categorical
+    assert penguins.data['sex'].isna().any()
+    assert_array_equal(cat_dtype.categories, ['FEMALE', 'MALE', '_'])
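
The behaviour this test checks can also be reproduced offline with a hand-built frame. The sketch below assumes pandas is installed and uses made-up data rather than the OpenML dataset. It shows that a categorical dtype built without missing markers still represents missing entries as NaN, and that pandas rejects a null category outright.

```python
import pandas as pd

# Made-up column with one missing entry; not the actual OpenML data.
raw = pd.DataFrame({"sex": ["MALE", "FEMALE", None, "_", "MALE"]})

# Categories listed without any missing marker, as in the patched code path.
dtype = pd.api.types.CategoricalDtype(["FEMALE", "MALE", "_"])
frame = raw.astype({"sex": dtype})

# The missing entry survives the cast as NaN rather than as a category.
has_missing = bool(frame["sex"].isna().any())
categories = list(frame["sex"].dtype.categories)

# Passing a null category directly is rejected by pandas.
try:
    pd.api.types.CategoricalDtype(["FEMALE", "MALE", None])
    null_category_rejected = False
except ValueError:
    null_category_rejected = True
```

This mirrors the two assertions in the test: the column contains NaN values, and the categories are exactly the non-missing labels.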