# Source: datasets_tutorial.py (forked from openml/openml-python)
# %% [markdown]
# How to list and download datasets.
# %%
import pandas as pd
import openml
from openml.datasets import edit_dataset, fork_dataset, get_dataset
# %% [markdown]
# ## Exercise 0
#
# * List datasets and return a dataframe
# %%
datalist = openml.datasets.list_datasets()
datalist = datalist[["did", "name", "NumberOfInstances", "NumberOfFeatures", "NumberOfClasses"]]
print(f"First 10 of {len(datalist)} datasets...")
datalist.head(n=10)
# The same can be done with fewer lines of code
openml_df = openml.datasets.list_datasets()
openml_df.head(n=10)
# %% [markdown]
# ## Exercise 1
#
# * Find datasets with more than 10000 examples.
# * Find a dataset called 'eeg-eye-state'.
# * Find all datasets with more than 50 classes.
# %%
datalist[datalist.NumberOfInstances > 10000].sort_values(["NumberOfInstances"]).head(n=20)
# %%
datalist.query('name == "eeg-eye-state"')
# %%
datalist.query("NumberOfClasses > 50")
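# %% [markdown]
# The filters above need a network call to run. As an offline illustration, the
# sketch below applies the same boolean-mask and `query` patterns to a small
# hand-made table. The rows and their values are invented for this example and
# are not real OpenML listings; only the column names follow the listing above.
# %%
import pandas as pd

# Hypothetical stand-in for the listing dataframe (made-up values).
toy_list = pd.DataFrame(
    {
        "did": [1, 2, 3, 4],
        "name": ["alpha", "beta", "gamma", "delta"],
        "NumberOfInstances": [150, 20000, 12000, 500],
        "NumberOfClasses": [3, 2, 60, 100],
    }
)

# Boolean mask: keep rows with more than 10000 examples, smallest first.
large = toy_list[toy_list.NumberOfInstances > 10000].sort_values("NumberOfInstances")

# `query` expresses the same kind of filter as a string.
by_name = toy_list.query('name == "gamma"')
many_classes = toy_list.query("NumberOfClasses > 50")

print(large["name"].tolist())
print(by_name["did"].tolist())
print(many_classes["name"].tolist())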
# %% [markdown]
# ## Download datasets
# %%
# Datasets are fetched by ID or, as here, by name and version.
dataset = openml.datasets.get_dataset(dataset_id="eeg-eye-state", version=1)
# Print a summary
print(
    f"This is dataset '{dataset.name}', the target feature is '{dataset.default_target_attribute}'"
)
print(f"URL: {dataset.url}")
print(dataset.description[:500])
# %% [markdown]
# Get the actual data.
#
# openml-python returns data as pandas dataframes (stored in the `eeg` variable below),
# and also some additional metadata that we don't care about right now.
# %%
eeg, *_ = dataset.get_data()
# %% [markdown]
# You can optionally choose to have openml separate out a column from the
# dataset. In particular, many datasets for supervised problems have a set
# `default_target_attribute` which may help identify the target variable.
# %%
X, y, categorical_indicator, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute
)
print(X.head())
print(X.info())
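# %% [markdown]
# To make the four return values more concrete, the sketch below mimics the
# split on a hand-made dataframe. `split_target` is a hypothetical helper
# written for this illustration, not part of openml-python, and treating
# category-typed columns as categorical is an assumption of the sketch.
# %%
import pandas as pd


def split_target(frame, target):
    """Illustrative only: separate `target` from the remaining columns."""
    features = frame.drop(columns=[target])
    labels = frame[target]
    # One boolean per remaining column: True when the dtype is categorical.
    is_categorical = [isinstance(dtype, pd.CategoricalDtype) for dtype in features.dtypes]
    return features, labels, is_categorical, list(features.columns)


# Made-up data for the illustration.
toy = pd.DataFrame(
    {
        "v1": [0.1, 0.4, 0.3],
        "colour": pd.Categorical(["red", "blue", "red"]),
        "label": [0, 1, 0],
    }
)
toy_X, toy_y, toy_categorical, toy_names = split_target(toy, "label")
print(toy_names, toy_categorical)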
# %% [markdown]
# Sometimes you only need access to a dataset's metadata.
# In those cases, you can download the dataset without downloading the
# data file. The dataset object can be used as normal.
# Whenever you use any functionality that requires the data,
# such as `get_data`, the data will be downloaded.
# Starting from version 0.15, not downloading the data will be the default behavior.
# The data will be downloaded automatically when you try to access it through
# openml objects, e.g., using `dataset.features`.
# %%
dataset = openml.datasets.get_dataset(1471)
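# %% [markdown]
# The deferred download described above is a form of lazy loading. The sketch
# below shows the general pattern in plain Python. `LazyDataset` and its loader
# are hypothetical names for this illustration and do not reflect how
# openml-python is implemented.
# %%
class LazyDataset:
    """Illustrative only: metadata is available at once, data loads on first use."""

    def __init__(self, name, loader):
        self.name = name  # cheap metadata, available immediately
        self._loader = loader  # callable standing in for the expensive download
        self._data = None
        self.load_count = 0

    def get_data(self):
        # Fetch on first access, then reuse the cached copy.
        if self._data is None:
            self._data = self._loader()
            self.load_count += 1
        return self._data


lazy = LazyDataset("toy", loader=lambda: [[1, 2], [3, 4]])
print(lazy.name, lazy.load_count)  # metadata is readable before any load
rows = lazy.get_data()
rows_again = lazy.get_data()  # served from the cached copy
print(lazy.load_count)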
# %% [markdown]
# ## Exercise 2
# * Explore the data visually.
# %%
eegs = eeg.sample(n=1000)
_ = pd.plotting.scatter_matrix(
    X.iloc[:100, :4],
    c=y[:100],
    figsize=(10, 10),
    marker="o",
    hist_kwds={"bins": 20},
    alpha=0.8,
    cmap="plasma",
)
# %% [markdown]
# ## Edit a created dataset
# This example uses the test server, to avoid editing a dataset on the main server.
# %%
openml.config.start_using_configuration_for_example()
# %% [markdown]
# Edit non-critical fields, allowed for all authorized users:
# description, creator, contributor, collection_date, language, citation,
# original_data_url, paper_url
# %%
desc = (
    "This data set consists of 3 different types of irises' "
    "(Setosa, Versicolour, and Virginica) petal and sepal length,"
    " stored in a 150x4 numpy.ndarray"
)
did = 128
data_id = edit_dataset(
    did,
    description=desc,
    creator="R.A.Fisher",
    collection_date="1937",
    citation="The use of multiple measurements in taxonomic problems",
    language="English",
)
edited_dataset = get_dataset(data_id)
print(f"Edited dataset ID: {data_id}")
# %% [markdown]
# Editing critical fields (default_target_attribute, row_id_attribute, ignore_attribute) is allowed
# only for the dataset owner. Further, critical fields cannot be edited if the dataset has any
# tasks associated with it. To edit critical fields of a dataset (without tasks) owned by you,
# configure the API key:
# openml.config.apikey = 'FILL_IN_OPENML_API_KEY'
# The example below only demonstrates the failure that occurs when you try to edit
# a dataset you do not own:
# %%
try:
    data_id = edit_dataset(1, default_target_attribute="shape")
except openml.exceptions.OpenMLServerException as e:
    print(e)
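# %% [markdown]
# The permission rules described above can be summarised as a small decision
# function. This is only a sketch of the rules as stated in this tutorial;
# `can_edit` is a hypothetical helper, and the server's actual checks may differ.
# %%
CRITICAL_FIELDS = {"default_target_attribute", "row_id_attribute", "ignore_attribute"}


def can_edit(field, is_owner, has_tasks):
    """Illustrative only: return True if the stated rules would allow the edit."""
    if field not in CRITICAL_FIELDS:
        # Non-critical fields: allowed for all authorized users.
        return True
    # Critical fields: owner only, and only while no tasks use the dataset.
    return is_owner and not has_tasks


print(can_edit("description", is_owner=False, has_tasks=True))
print(can_edit("default_target_attribute", is_owner=False, has_tasks=False))
print(can_edit("default_target_attribute", is_owner=True, has_tasks=False))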
# %% [markdown]
# ## Fork dataset
# Forking creates a copy of the dataset with you as the owner.
# Use this API only if you are unable to edit the critical fields (default_target_attribute,
# ignore_attribute, row_id_attribute) of a dataset through the edit_dataset API.
# After the dataset is forked, you can edit the new version of the dataset using edit_dataset.
# %%
data_id = fork_dataset(1)
print(data_id)
data_id = edit_dataset(data_id, default_target_attribute="shape")
print(f"Forked dataset ID: {data_id}")
# %%
openml.config.stop_using_configuration_for_example()