Remove nan-likes from category header #1037

Merged
PGijsbers merged 2 commits into openml/openml-python:develop from openml/openml-python:fix_1036 on Mar 12, 2021

@PGijsbers (Collaborator)

pandas does not accept None/NaN as a category (it does, of course, allow NaN values in the data itself). However, outside sources (e.g. ARFF files) do allow NaN as a category, so we must filter these out of the category list.

Penguins has the column: @ATTRIBUTE sex {?,FEMALE,MALE,_}

Running:

import openml
penguins = openml.datasets.get_dataset(42585)
data, *_ = penguins.get_data()
print(data.head())

Before:

Traceback (most recent call last):
  File "E:/repositories/openml-python/mwe.py", line 4, in <module>
    data, *_ = penguins.get_data()
  File "E:\repositories\openml-python\openml\datasets\dataset.py", line 693, in get_data
    data, categorical, attribute_names = self._load_data()
  File "E:\repositories\openml-python\openml\datasets\dataset.py", line 531, in _load_data
    return self._cache_compressed_file_from_file(file_to_load)
  File "E:\repositories\openml-python\openml\datasets\dataset.py", line 488, in _cache_compressed_file_from_file
    data, categorical, attribute_names = self._parse_data_from_arff(data_file)
  File "E:\repositories\openml-python\openml\datasets\dataset.py", line 445, in _parse_data_from_arff
    self._unpack_categories(X[column_name], categories_names[column_name])
  File "E:\repositories\openml-python\openml\datasets\dataset.py", line 650, in _unpack_categories
    raw_cat = pd.Categorical(col, ordered=True, categories=categories)
  File "E:\repositories\openml-python\venv\lib\site-packages\pandas\core\arrays\categorical.py", line 316, in __init__
    dtype = CategoricalDtype._from_values_or_dtype(
  File "E:\repositories\openml-python\venv\lib\site-packages\pandas\core\dtypes\dtypes.py", line 330, in _from_values_or_dtype
    dtype = CategoricalDtype(categories, ordered)
  File "E:\repositories\openml-python\venv\lib\site-packages\pandas\core\dtypes\dtypes.py", line 222, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "E:\repositories\openml-python\venv\lib\site-packages\pandas\core\dtypes\dtypes.py", line 369, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "E:\repositories\openml-python\venv\lib\site-packages\pandas\core\dtypes\dtypes.py", line 543, in validate_categories
    raise ValueError("Categorial categories cannot be null")
ValueError: Categorial categories cannot be null

Process finished with exit code 1

After:

  species     island  culmen_length_mm  ...  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen              39.1  ...              181.0       3750.0    MALE
1  Adelie  Torgersen              39.5  ...              186.0       3800.0  FEMALE
2  Adelie  Torgersen              40.3  ...              195.0       3250.0  FEMALE
3  Adelie  Torgersen               NaN  ...                NaN          NaN     NaN
4  Adelie  Torgersen              36.7  ...              193.0       3450.0  FEMALE

[5 rows x 7 columns]
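
The filtering described above can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: the helper names `drop_null_categories` and `to_categorical` are hypothetical, and it assumes the ARFF parser hands back `None` for the `?` entry in the attribute header.

```python
import pandas as pd


def drop_null_categories(categories):
    """Return the category labels with None/NaN entries removed (hypothetical helper)."""
    return [c for c in categories if not pd.isna(c)]


def to_categorical(values, categories):
    """Build an ordered Categorical while ignoring null-like category labels."""
    return pd.Categorical(
        values, ordered=True, categories=drop_null_categories(categories)
    )


# Assumption: "?" in "@ATTRIBUTE sex {?,FEMALE,MALE,_}" was parsed to None.
header_categories = [None, "FEMALE", "MALE", "_"]
values = ["MALE", "FEMALE", None]

# Passing the unfiltered header reproduces the ValueError from the traceback.
try:
    pd.Categorical(values, ordered=True, categories=header_categories)
    raised = False
except ValueError:
    raised = True

# With the null-like label filtered out, missing values stay missing in the data.
cat = to_categorical(values, header_categories)
print(raised)
print(list(cat.categories))
print(list(cat.codes))
```

Missing entries in the data are still represented as NaN (code `-1`); only the category list is filtered.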

PGijsbers requested a review from mfeurer on March 12, 2021 09:59
PGijsbers merged commit 4aec00a into develop on Mar 12, 2021
PGijsbers deleted the fix_1036 branch on March 12, 2021 13:09
PGijsbers added a commit to Mirkazemi/openml-python that referenced this pull request on Feb 23, 2023
* Remove nan-likes from category header
* Test output of _unpack_categories