Description
SimpleImputer seems to scale badly with the number of rows when fitting it to a large DataFrame object, taking several times longer than looping over the columns with pandas' own functions.
Example:
import numpy as np, pandas as pd
from sklearn.impute import SimpleImputer
rng = np.random.default_rng(seed=1)
nrows = int(1e6)
ncols = 10
nmissing = int(.1 * nrows)
X = pd.DataFrame(rng.normal(size=(nrows, ncols)))
for cl in X.columns:
    X.loc[rng.choice(nrows, nmissing, replace=False), cl] = np.nan
%%timeit
imputer = SimpleImputer(strategy="median").fit(X)
1.68 s ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
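A possible explanation (an assumption on my part, not verified against the scikit-learn source) is that the dense median path goes through NumPy masked arrays, which tend to be noticeably slower than pandas'/NumPy's nan-aware reductions. A quick sketch to check where the time goes on the same data:

X_np = X.to_numpy()
masked = np.ma.masked_array(X_np, mask=np.isnan(X_np))
%timeit np.ma.median(masked, axis=0)   # masked-array median
%timeit np.nanmedian(X_np, axis=0)     # nan-aware median on the raw ndarray
%timeit X.median()                     # pandas per-column median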
Compare against this:
from sklearn.base import BaseEstimator, TransformerMixin

class CustomImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Compute each column's median once, ignoring NaNs (pandas' default).
        self.medians = dict()
        for cl in X.columns:
            self.medians[cl] = X[cl].median()
        return self

    def transform(self, X):
        # Fill each column's NaNs with the median learned in fit.
        X_new = X.copy()
        for cl, m in self.medians.items():
            X_new[cl] = X_new[cl].fillna(m)
        return X_new
%%timeit
imputer = CustomImputer().fit(X)
227 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
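As a sanity check (not in the original report), the two imputers should agree on the filled values; something like this can confirm that before swapping one for the other:

sk_out = SimpleImputer(strategy="median").fit_transform(X)
custom_out = CustomImputer().fit(X).transform(X)
# Both fill each column's NaNs with that column's median, so the outputs should match.
assert np.allclose(sk_out, custom_out.to_numpy())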