ENH Change fit_transform of MissingIndicator class to get mask only once #14356
Conversation
…alling _get_missing_features_info twice (scikit-learn#14278)
sklearn/impute/_base.py
Outdated

        self._n_features = X.shape[1]

        if self.features not in ('missing-only', 'all'):
            raise ValueError("'features' has to be either 'missing-only' or "
Ideally you shouldn't duplicate the code. It's not covered by the tests, which is why you're seeing the code coverage failure.
@amueller Then what should I do? I need to do exactly the same things, and I can't change the variable name (can I?). I can't call the fit function, because if I do I need to create a data member to store what the function returns, which raises more failures during the checks because the tests don't have any information about it.
Any help?
Thanks.
@amueller Can I use name mangling (a double leading underscore)?
sklearn/impute/_base.py
Outdated

    @@ -1,7 +1,3 @@
    # Authors: Nicolas Tresegnie <nicolas.tresegnie@gmail.com>
Why did you remove the author information?
I didn't do that intentionally; it was a mistake. Sorry, I'll restore it immediately.
sklearn/impute/_base.py
Outdated

    @@ -612,7 +612,8 @@ def fit(self, X, y=None):
            raise ValueError("'sparse' has to be a boolean or 'auto'. "
                             "Got {!r} instead.".format(self.sparse))

        self.features_ = self._get_missing_features_info(X)[1]
        self.__missing_features_info = self._get_missing_features_info(X)
I think we usually only use a single underscore, but otherwise this solution seems good to me.
Do we have a test that checks that fit(X).transform(X) == fit_transform(X)?
A single underscore declares a protected instance variable, which can still be accessed outside the class, making the data transparent. That's why I've used a double underscore, which can't be accessed directly outside the class and keeps it opaque: the user doesn't need to know that data, since they already get it when using fit_transform.
Moreover, I think a single underscore could raise an issue in the PR checks: fit returns self, which would include that protected member, but the tests don't know that fit returns it, so we would need to add that to the tests. (I think this may happen, please correct me if I'm wrong.)
So, is it okay to use a private instance variable, or should I change it to a protected one?
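For context, a small generic illustration of the difference being discussed (a sketch only, not the sklearn code): a single leading underscore is just a naming convention, while a double leading underscore triggers name mangling, which hides the attribute behind a class-prefixed name but does not make it truly private.

    class Example:
        def fit(self):
            self._info = "single underscore: convention only"
            self.__info = "double underscore: mangled to _Example__info"
            return self

    est = Example().fit()
    print(est._info)            # directly accessible from outside the class
    print(est._Example__info)   # the mangled name is still reachable
    # est.__info                # would raise AttributeError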
> Do we have a test that checks that
> fit(X).transform(X) == fit_transform(X)?

I don't think so. Sorry, I'm very new to testing, but I had a look at the test file and I don't think we have anything that checks it.
I'll try to learn more about testing ASAP and try again.
Could you please add one? It should be quite simple.
Should I add it to the pre-existing test function test_missing_indicator_new?
Okay, I added a test and I'm pushing it. Please let me know if any changes need to be made.
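For reference, a minimal sketch of what such a check could look like (the test name and data below are my own assumptions, not necessarily the exact test added in this PR):

    import numpy as np
    from numpy.testing import assert_array_equal
    from sklearn.impute import MissingIndicator

    def test_missing_indicator_fit_transform_equivalence():
        # fit(X).transform(X) and fit_transform(X) should produce the same mask.
        X = np.array([[np.nan, 1.0, 3.0],
                      [4.0, 0.0, np.nan],
                      [8.0, 1.0, 0.0]])
        indicator = MissingIndicator()
        assert_array_equal(indicator.fit(X).transform(X),
                           indicator.fit_transform(X))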
@amueller I did add a test, please have a look and let me know if I need to make some changes.

Storing

@thomasjpfan I'm a bit confused, please help out here -

Okay, I see where you are going. Amazing, that's a nice idea! Thanks!

@thomasjpfan Just out of curiosity, what if I had stored the return values in a local variable in the

It will only exist in

In scikit-learn we use a single

@thomasjpfan Thanks for clearing that up. I made the necessary changes and pushed them; can you please have a look and let me know? I was also asked to add a test for the same, can you please have a look at that too (because I'm very new to tests)? Thanks.
sklearn/impute/_base.py
Outdated

    @@ -612,7 +612,23 @@ def fit(self, X, y=None):
            raise ValueError("'sparse' has to be a boolean or 'auto'. "
                             "Got {!r} instead.".format(self.sparse))

        self.features_ = self._get_missing_features_info(X)[1]
        return self._get_missing_features_info(X)
So that _fit actually fits and learns something:

    missing_features_info = self._get_missing_features_info(X)
    self.features_ = missing_features_info[1]
    return missing_features_info
sklearn/impute/_base.py
Outdated

        self : object
            Returns self.
        """
        self.features_ = self._fit(X, y)[1]
With the change suggested in _fit this can be:

    - self.features_ = self._fit(X, y)[1]
    + self._fit(X, y)
sklearn/impute/_base.py
Outdated

    @@ -667,7 +683,14 @@ def fit_transform(self, X, y=None):
            will be boolean.
        """
        return self.fit(X, y).transform(X)
        imputer_mask, features = self._fit(X, y)
        self.features_ = features
self.features_ = features can be removed with the suggested change in _fit.
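To make the overall shape of the suggestion concrete, here is a minimal, self-contained sketch (plain NumPy, NaN as the missing marker, simplified compared with the real class in sklearn.impute): _fit learns features_ and returns the full mask, fit just calls _fit, and fit_transform reuses the returned mask instead of recomputing it.

    import numpy as np

    class MissingIndicatorSketch:
        """Illustrative sketch only; the real implementation lives in sklearn.impute."""

        def _get_missing_features_info(self, X):
            # Mask of missing entries and the indices of columns that contain any.
            imputer_mask = np.isnan(X)
            features_with_missing = np.flatnonzero(imputer_mask.any(axis=0))
            return imputer_mask, features_with_missing

        def _fit(self, X, y=None):
            X = np.asarray(X, dtype=float)
            self._n_features = X.shape[1]
            imputer_mask, features = self._get_missing_features_info(X)
            self.features_ = features      # learned state lives on self
            return imputer_mask            # returned so fit_transform can reuse it

        def fit(self, X, y=None):
            self._fit(X, y)
            return self

        def fit_transform(self, X, y=None):
            imputer_mask = self._fit(X, y)  # the mask is computed only once
            if self.features_.size < self._n_features:
                imputer_mask = imputer_mask[:, self.features_]
            return imputer_mask

The column selection at the end roughly mirrors what transform does for the default features='missing-only' setting.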
After I approve this PR, we need to wait for another core dev to review the PR before it is merged.

@thomasjpfan Yup, I also thought of doing that, but since you asked me not to store it in missing_features_info, I thought you didn't want me to store it in a local variable either, so I didn't. No problem, I'll patch it up.

@jnothman Sorry, I'm creating a mess here; please help me out. As I was getting a merge conflict, I mistakenly merged my missing_indicator branch into the upstream/master branch. To correct it, I merged the upstream/master changes into my local master, reset my missing_indicator head and force pushed it. But I'm still getting merge conflicts. Thanks.

@jnothman I think I reversed it: I restored my local repo to the state before the merge (commit ffe3a5a), made the necessary changes (removed the unnecessary condition, removed the tests and added the efficiency entry), and pushed it; I think everything is back to where the merge conflict arose. Should I resolve the conflict by keeping both logs or by removing the log that was merged into master? Thanks.

We prefer that you do not force push to remove commits.
sklearn/impute/_base.py
Outdated

        return self.fit(X, y).transform(X)
        imputer_mask = self._fit(X, y)

        if (self.features_.size < self._n_features):
    - if (self.features_.size < self._n_features):
    + if self.features_.size < self._n_features:
Well done for fixing it up, though.
sklearn/impute/_base.py
Outdated

    @@ -598,7 +598,7 @@ def fit(self, X, y=None):
        Returns
        -------
        self : object
This is no longer accurate
Do you mean the return comment of _fit or fit?
@thomasjpfan Sorry, I just got scared about messing things up. It won't happen again.

@jnothman Thanks! Please have a look and let me know if I did it correctly -

Are you talking about the return comment of Thanks.
sklearn/impute/_base.py
Outdated

            Returns self.
        Xt : {ndarray or sparse matrix}, shape (n_samples, n_features)
            The missing indicator for input data. The data type of ``Xt``
            will be boolean.
You can use the docstring from _get_missing_features_info, since _fit returns one of the outputs of _get_missing_features_info:

    imputer_mask : {ndarray or sparse matrix}, shape (n_samples, n_features_with_missing)
        The imputer mask of the original data.
Sure. Just one thing: since _fit returns the full imputer_mask rather than only the features_with_missing part (as is checked in fit_transform), can I change the shape to just (n_samples, n_features)?
Thanks.
Yup I agree with just (n_samples, n_features).
@harsh020 It is okay! Thank you for working on this PR!

The pleasure was all mine. Thank you for helping me all along; without you it would have been impossible for me to work on this issue. Thanks.
sklearn/impute/_base.py
Outdated

    @@ -597,8 +597,10 @@ def fit(self, X, y=None):
        Returns
        -------
        self : object
            Returns self.
        imputer_mask : {ndarray or sparse matrix},
    imputer_mask : {ndarray or sparse matrix}, shape (n_samples, \
n_features)
        The imputer mask of the original data.
Is something wrong?
The spacing. This is only for consistency. This docstring is not going to be rendered since _fit is private.
Notice the use of \, how n_features) is aligned, and how many spaces there are before The imputer....
Done.
Thanks.
Thank you @harsh020!

@jnothman Can I delete my branch?

You are welcome to.
    @@ -107,6 +107,11 @@ Changelog
      preserve the class balance of the original training set. :pr:`14194`
      by :user:`Johann Faouzi <johannfaouzi>`.

    - |Efficiency| :func:`ensemble.MissingIndicator.fit_transform` the
I've just realised this is in completely the wrong section.
Oh, my bad. Really sorry for the blunder. I'm opening a new PR immediately to fix it up.
    @@ -107,6 +107,11 @@ Changelog
      preserve the class balance of the original training set. :pr:`14194`
      by :user:`Johann Faouzi <johannfaouzi>`.

    - |Efficiency| :func:`ensemble.MissingIndicator.fit_transform` the
      _get_missing_features_info function is now called once when calling
We wouldn't usually mention private methods in a change log. Rather, this should just say "avoid repeated computation"
    - |Efficiency| :func:`ensemble.MissingIndicator.fit_transform` the
      _get_missing_features_info function is now called once when calling
      fit_transform for MissingIndicator class. :pr:`14356` by :user:
      `Harsh Soni <harsh020>`.
This doesn't render correctly with space before it. It must be right after the :
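For illustration, the corrected entry might look roughly like this (the wording and the impute placement are my guesses based on the review comments above, not the exact text that was merged):

    - |Efficiency| :func:`impute.MissingIndicator.fit_transform` now avoids
      repeated computation of the missing mask. :pr:`14356` by
      :user:`Harsh Soni <harsh020>`.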
Reference Issues/PRs
Fixes #14278
What does this implement/fix? Explain your changes.
The implementation prevents the function _get_missing_features_info from being called twice in fit_transform, and also adds a test to check that the results are correct.
- Declare a private instance variable (double leading underscore, i.e. name mangling) in the fit function to store the return values of _get_missing_features_info.
- Call fit from fit_transform.
- Use the private instance variable when implementing the transform step in fit_transform to retrieve the data.
Any other comments?