Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

FIX "CategoricalNB fails with sparse matrix" #16561 #23224

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
Loading
from

Conversation

g-morishita
Copy link

@g-morishita g-morishita commented Apr 27, 2022

Reference Issues/PRs

Fixes #16561.

What does this implement/fix? Explain your changes.

I fixed a bug where even though the documentation says CategoricalNB accepts sparse_matrix as a feature matrix, it cannot.

Now CategoricalNB accepts a sparce matrix (in particular, csr_matrix) without any error.
Also, I added the test for it.

Any other comments?

In #16561, one of the maintainers asked in what situation you want CategoricalNB for sparse data.

When I did an assignment in my university class, I was required to train a very simple model like naive Bayes algorithms with sparse data, which is originally text data.
If toarray() was used, the memory error occurred.

I created such an example. If I run the following code on my local computer with 16GB memory, the process was killed becaues of the memory shortage if I didn't use a sparse matrix. If I did, I was able to successfully finish it.

from sklearn.naive_bayes import CategoricalNB, MultinomialNB
from scipy.sparse import csr_matrix
import numpy as np

sample_size = 1000
n_features = 50000
n_ones = 100
is_sparse = True

rng = np.random.RandomState(1)
rows = rng.choice(sample_size, size=n_ones)
cols = rng.choice(n_features, size=n_ones)

if is_sparse:
    X = csr_matrix((np.ones(n_ones), [rows, cols]), shape=(sample_size, n_features))
else:
    X = np.zeros((sample_size, n_features))
    X[rows, cols] = 1
y = rng.choice(2, size=sample_size)
clf = CategoricalNB(min_categories=3, class_prior=np.array([0.3, 0.7]))
clf.fit(X, y)

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR. This requires test to be added to sklearn/tests/test_naive_bayes.py that test for the sparse data. A good reference is:

@pytest.mark.parametrize("kind", ("dense", "sparse"))
def test_mnnb(kind):

where we can adjust the following test to accept sparse data:

def test_categoricalnb():

Note that we are releasing v1.1 soon, so most of the maintainers' attention is focused on releasing.

@@ -1339,14 +1340,14 @@ def _more_tags(self):
def _check_X(self, X):
"""Validate X, used only in predict* methods."""
X = self._validate_data(
X, dtype="int", accept_sparse=False, force_all_finite=True, reset=False
X, dtype="int", accept_sparse="csr", force_all_finite=True, reset=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how X is accessed column-wise (with X[:, i]), I think this is better as CSC.

Copy link
Author

@g-morishita g-morishita Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review. I'm with you. How about changing accept_sparse=True and converting a given matrix to csc if it is not csc.

Copy link
Author

@g-morishita g-morishita Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, even if I specified csr in accept_sparse, CSC matrix was accepeted. Is it a bug?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If accept_sparse='csc' and a CSR matrix is passed in, check_array will convert:

'csr', etc. If the input is sparse but not in the allowed format,
it will be converted to the first listed format. True allows the input

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed it. Thank you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what about considering accept_sparse="csc" as initially suggested by Thomas above?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it sounds good. I changed it.

@@ -1386,6 +1390,8 @@ def _update_cat_count_dims(cat_count, highest_feature):
return cat_count

def _update_cat_count(X_feature, Y, cat_count, n_classes):
if sp.issparse(X_feature):
X_feature = X_feature.toarray().flatten()
Copy link
Member

@thomasjpfan thomasjpfan Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, I prefer to avoid calling toarray all together and have _update_cat_count work on sparse data without the conversion.

Details: If X started as a CSC, then I think you can extract the feature for a specific column and then create a new bincount that works with CSC data. (There are other details for you to work on here, such as the weights and the mask.)

That being said, I am +0.25 on the current solution. It densitfies per column, which is slightly better than densifing all of X. A follow up PR can work on making it more efficient.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the details. I completely agree with you. I don't like calling toarray every time either. But I didn't come up with a good idea.

I don't think I fully understand the details, but I will look into it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand it now. I'll implement it.

for i in range(self.n_features_in_):
indices = X[:, i]
if is_sparse:
indices = indices.toarray().flatten()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here regarding CSC and not needing to denstify in that case.

Copy link
Author

@g-morishita g-morishita Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be wrong or I may miss something, but even though the matrix is CSC, X[:, i] is still a sparse matrix. So, it cannot be used as indexing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand it now. You suggested I use indices and indptr to access to the column directly.

csc_matrix((data, indices, indptr), [shape=(M, N)])
is the standard CSC representation where the row indices for column i are stored in indices[indptr[i]:indptr[i+1]] and their corresponding values are stored in data[indptr[i]:indptr[i+1]]. If the shape parameter is not supplied, the matrix dimensions are inferred from the index arrays.

sklearn/tests/test_naive_bayes.py Show resolved Hide resolved
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @g-morishita. Please add an entry to the 1.2 what's new. I'd consider it more like an enhancement than a fix.

@g-morishita
Copy link
Author

Sorry for the late reply. I got it. Could you tell me how to add the entry? I read how to send PR but I couldn't find it. @jeremiedbb

@jeremiedbb
Copy link
Member

Could you tell me how to add the entry?

It means adding an entry in the file doc/whats_new/v1.2.rst. You can look at v1.1.rst to see how it's organised and how to write it. Don't forget to mention the PR number (:pr:) and credit yourself (:user:).

@neel2299
Copy link

Hey! Why wasn't this merged ??

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

categoricalNB fails with sparse matrix
5 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.