FIX "CategoricalNB fails with sparse matrix" #16561 #23224
Conversation
Thank you for the PR. This requires tests to be added to `sklearn/tests/test_naive_bayes.py` that cover the sparse case. A good reference is scikit-learn/sklearn/tests/test_naive_bayes.py, lines 453 to 454 in 9752c03:

```python
@pytest.mark.parametrize("kind", ("dense", "sparse"))
def test_mnnb(kind):
```

where we can adjust the following test to accept sparse data:

```python
def test_categoricalnb():
```
Note that we are releasing v1.1 soon, so most of the maintainers' attention is focused on releasing.
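As a sketch, the referenced dense/sparse parametrization pattern looks like this, demonstrated here with `MultinomialNB` (which already accepts sparse input); the toy data is illustrative, not taken from the scikit-learn test suite:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import MultinomialNB

X_DENSE = np.array([[2, 1, 0], [0, 3, 1], [1, 0, 4]])
y = np.array([0, 1, 2])

# In the real suite this helper would be driven by
# @pytest.mark.parametrize("kind", ("dense", "sparse"))
def fit_predict(kind):
    X = csr_matrix(X_DENSE) if kind == "sparse" else X_DENSE
    return MultinomialNB().fit(X, y).predict(X)

# both representations should yield identical predictions
assert (fit_predict("dense") == fit_predict("sparse")).all()
```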
sklearn/naive_bayes.py (Outdated)

```diff
@@ -1339,14 +1340,14 @@ def _more_tags(self):
     def _check_X(self, X):
         """Validate X, used only in predict* methods."""
         X = self._validate_data(
-            X, dtype="int", accept_sparse=False, force_all_finite=True, reset=False
+            X, dtype="int", accept_sparse="csr", force_all_finite=True, reset=False
```
Given how `X` is accessed column-wise (with `X[:, i]`), I think this is better as CSC.
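For illustration (toy matrix assumed), the same column can be sliced from either format, but on CSC each column's entries are stored contiguously, while on CSR they are scattered across every row's storage:

```python
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[1, 0], [0, 2], [3, 0]]))

col_csr = X[:, 0]           # gathers entries from each row's storage
col_csc = X.tocsc()[:, 0]   # reads one contiguous slice of the CSC arrays

assert X.tocsc().format == "csc"
assert (col_csr.toarray() == col_csc.toarray()).all()
```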
Thank you for the review. I agree with you. How about changing to `accept_sparse=True` and converting a given matrix to CSC if it is not already CSC?
By the way, even when I specified `csr` in `accept_sparse`, a CSC matrix was accepted. Is it a bug?
If `accept_sparse='csc'` and a CSR matrix is passed in, `check_array` will convert. See scikit-learn/sklearn/utils/validation.py, lines 645 to 646 in 40f9460:

```
'csr', etc. If the input is sparse but not in the allowed format,
it will be converted to the first listed format. True allows the input
```
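A small demonstration of that documented behavior (toy matrix assumed):

```python
from scipy.sparse import csr_matrix
from sklearn.utils.validation import check_array

X_csr = csr_matrix([[0, 1], [2, 0]])

# a CSR input is converted to the first (here: only) listed format
X_checked = check_array(X_csr, accept_sparse="csc")
assert X_checked.format == "csc"

# an input already in an accepted format passes through unchanged
assert check_array(X_checked, accept_sparse="csc").format == "csc"
```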
I missed it. Thank you.
So what about considering `accept_sparse="csc"` as initially suggested by Thomas above?
Yes, it sounds good. I changed it.
```diff
@@ -1386,6 +1390,8 @@ def _update_cat_count_dims(cat_count, highest_feature):
             return cat_count

         def _update_cat_count(X_feature, Y, cat_count, n_classes):
+            if sp.issparse(X_feature):
+                X_feature = X_feature.toarray().flatten()
```
If possible, I prefer to avoid calling `toarray` altogether and have `_update_cat_count` work on sparse data without the conversion.

Details: If `X` started as CSC, then I think you can extract the feature for a specific column and then create a new bincount that works with CSC data. (There are other details for you to work out here, such as the weights and the mask.)

That being said, I am +0.25 on the current solution. It densifies per column, which is slightly better than densifying all of `X`. A follow-up PR can work on making it more efficient.
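A minimal sketch of the per-column count idea under discussion; `column_bincount` is a hypothetical helper for illustration, not scikit-learn API, and the class weighting and mask handling mentioned above are omitted:

```python
import numpy as np
from scipy.sparse import csc_matrix

def column_bincount(X_csc, i):
    """Count category occurrences in column i without calling toarray."""
    start, end = X_csc.indptr[i], X_csc.indptr[i + 1]
    vals = X_csc.data[start:end]                  # explicitly stored entries
    counts = np.bincount(vals, minlength=1)
    counts[0] += X_csc.shape[0] - (end - start)   # implicit zeros are category 0
    return counts

X = csc_matrix(np.array([[1, 0], [0, 2], [1, 2], [3, 0]]))
assert column_bincount(X, 0).tolist() == [1, 2, 0, 1]
assert column_bincount(X, 1).tolist() == [2, 0, 2]
```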
Thanks for the details. I completely agree with you. I don't like calling `toarray` every time either, but I didn't come up with a good idea. I don't think I fully understand the details, but I will look into it.
I understand it now. I'll implement it.
```python
for i in range(self.n_features_in_):
    indices = X[:, i]
    if is_sparse:
        indices = indices.toarray().flatten()
```
Same here regarding CSC and not needing to densify in that case.
I may be wrong or missing something, but even though the matrix is CSC, `X[:, i]` is still a sparse matrix. So it cannot be used for indexing?
I understand it now. You suggested I use `indices` and `indptr` to access the column directly.

`csc_matrix((data, indices, indptr), [shape=(M, N)])` is the standard CSC representation, where the row indices for column i are stored in `indices[indptr[i]:indptr[i+1]]` and their corresponding values in `data[indptr[i]:indptr[i+1]]`. If the shape parameter is not supplied, the matrix dimensions are inferred from the index arrays.
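Concretely, with a toy matrix assumed for illustration, one column's stored entries can be read directly from those arrays:

```python
import numpy as np
from scipy.sparse import csc_matrix

X = csc_matrix(np.array([[5, 0], [0, 7], [8, 0]]))
i = 0
start, end = X.indptr[i], X.indptr[i + 1]
rows = X.indices[start:end]   # row indices of stored entries in column i
vals = X.data[start:end]      # corresponding nonzero values

assert rows.tolist() == [0, 2]
assert vals.tolist() == [5, 8]
```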
Thanks for the PR @g-morishita. Please add an entry to the 1.2 what's new. I'd consider it more like an enhancement than a fix.
Sorry for the late reply. I got it. Could you tell me how to add the entry? I read how to send a PR, but I couldn't find it. @jeremiedbb

It means adding an entry in the file

Hey! Why wasn't this merged?
Reference Issues/PRs
Fixes #16561.
What does this implement/fix? Explain your changes.
I fixed a bug where, even though the documentation says `CategoricalNB` accepts a sparse matrix as a feature matrix, it cannot. Now `CategoricalNB` accepts a sparse matrix (in particular, `csr_matrix`) without any error. Also, I added the test for it.
Any other comments?
In #16561, one of the maintainers asked in what situation you would want CategoricalNB for sparse data.

When I did an assignment in my university class, I was required to train a very simple model like the naive Bayes algorithms on sparse data, which was originally text data. If `toarray()` was used, a memory error occurred.

I created such an example. If I run the following code on my local computer with 16 GB of memory, the process was killed because of memory shortage when I didn't use a sparse matrix. When I did, I was able to finish it successfully.
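The author's original script is not included in this extract. As an illustrative sketch (assumed sizes, not the original code), the memory gap that makes densifying text-like data fail can be estimated without allocating the dense array:

```python
import numpy as np
from scipy.sparse import csr_matrix

# ~100,000 documents over a 10,000-term vocabulary, ~10 terms per document
rng = np.random.default_rng(0)
n_docs, n_terms, nnz_per_doc = 100_000, 10_000, 10
rows = np.repeat(np.arange(n_docs), nnz_per_doc)
cols = rng.integers(0, n_terms, size=n_docs * nnz_per_doc)
data = np.ones(n_docs * nnz_per_doc)
X = csr_matrix((data, (rows, cols)), shape=(n_docs, n_terms))

sparse_mb = (X.data.nbytes + X.indices.nbytes + X.indptr.nbytes) / 1e6
dense_mb = n_docs * n_terms * 8 / 1e6   # ~8,000 MB if densified with toarray()

# the dense representation is orders of magnitude larger
assert dense_mb > 100 * sparse_mb
```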