FIX "CategoricalNB fails with sparse matrix" #16561 #23224
Conversation
Thank you for the PR. This requires tests to be added to `sklearn/tests/test_naive_bayes.py` that cover the sparse case. A good reference is scikit-learn/sklearn/tests/test_naive_bayes.py, lines 453 to 454 in 9752c03:

```python
@pytest.mark.parametrize("kind", ("dense", "sparse"))
def test_mnnb(kind):
```

where we can adjust the following test to accept sparse data:

```python
def test_categoricalnb():
```
Note that we are releasing v1.1 soon, so most of the maintainers' attention is focused on releasing.
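As a sketch, the referenced dense/sparse parametrization pattern looks like this, demonstrated here with `MultinomialNB` (which already accepts sparse input); the toy data is illustrative, not taken from the scikit-learn test suite:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import MultinomialNB

X_DENSE = np.array([[2, 1, 0], [0, 3, 1], [1, 0, 4]])
y = np.array([0, 1, 2])

# In the real suite this helper would be driven by
# @pytest.mark.parametrize("kind", ("dense", "sparse"))
def fit_predict(kind):
    X = csr_matrix(X_DENSE) if kind == "sparse" else X_DENSE
    return MultinomialNB().fit(X, y).predict(X)

# both representations should yield identical predictions
assert (fit_predict("dense") == fit_predict("sparse")).all()
```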
sklearn/naive_bayes.py (Outdated)

```diff
@@ -1339,14 +1340,14 @@ def _more_tags(self):
     def _check_X(self, X):
         """Validate X, used only in predict* methods."""
         X = self._validate_data(
-            X, dtype="int", accept_sparse=False, force_all_finite=True, reset=False
+            X, dtype="int", accept_sparse="csr", force_all_finite=True, reset=False
```
Given how `X` is accessed column-wise (with `X[:, i]`), I think this is better as CSC.
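For illustration (toy matrix assumed), the same column can be sliced from either format, but on CSC each column's entries are stored contiguously, while on CSR they are scattered across every row's storage:

```python
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[1, 0], [0, 2], [3, 0]]))

col_csr = X[:, 0]           # gathers entries from each row's storage
col_csc = X.tocsc()[:, 0]   # reads one contiguous slice of the CSC arrays

assert X.tocsc().format == "csc"
assert (col_csr.toarray() == col_csc.toarray()).all()
```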
Thank you for the review. I agree with you. How about changing to `accept_sparse=True` and converting a given matrix to CSC if it is not already CSC?
By the way, even when I specified `csr` in `accept_sparse`, a CSC matrix was accepted. Is it a bug?
If `accept_sparse='csc'` and a CSR matrix is passed in, `check_array` will convert. See scikit-learn/sklearn/utils/validation.py, lines 645 to 646 in 40f9460:

```
'csr', etc. If the input is sparse but not in the allowed format,
it will be converted to the first listed format. True allows the input
```
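A small demonstration of that documented behavior (toy matrix assumed):

```python
from scipy.sparse import csr_matrix
from sklearn.utils.validation import check_array

X_csr = csr_matrix([[0, 1], [2, 0]])

# a CSR input is converted to the first (here: only) listed format
X_checked = check_array(X_csr, accept_sparse="csc")
assert X_checked.format == "csc"

# an input already in an accepted format passes through unchanged
assert check_array(X_checked, accept_sparse="csc").format == "csc"
```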
I missed it. Thank you.
So what about considering `accept_sparse="csc"` as initially suggested by Thomas above?
Yes, it sounds good. I changed it.
```diff
@@ -1386,6 +1390,8 @@ def _update_cat_count_dims(cat_count, highest_feature):
             return cat_count

         def _update_cat_count(X_feature, Y, cat_count, n_classes):
+            if sp.issparse(X_feature):
+                X_feature = X_feature.toarray().flatten()
```
If possible, I prefer to avoid calling `toarray` altogether and have `_update_cat_count` work on sparse data without the conversion.

Details: If `X` started as CSC, then I think you can extract the feature for a specific column and then create a new bincount that works with CSC data. (There are other details for you to work out here, such as the weights and the mask.)

That being said, I am +0.25 on the current solution. It densifies per column, which is slightly better than densifying all of `X`. A follow-up PR can work on making it more efficient.
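A minimal sketch of the per-column count idea under discussion; `column_bincount` is a hypothetical helper for illustration, not scikit-learn API, and the class weighting and mask handling mentioned above are omitted:

```python
import numpy as np
from scipy.sparse import csc_matrix

def column_bincount(X_csc, i):
    """Count category occurrences in column i without calling toarray."""
    start, end = X_csc.indptr[i], X_csc.indptr[i + 1]
    vals = X_csc.data[start:end]                  # explicitly stored entries
    counts = np.bincount(vals, minlength=1)
    counts[0] += X_csc.shape[0] - (end - start)   # implicit zeros are category 0
    return counts

X = csc_matrix(np.array([[1, 0], [0, 2], [1, 2], [3, 0]]))
assert column_bincount(X, 0).tolist() == [1, 2, 0, 1]
assert column_bincount(X, 1).tolist() == [2, 0, 2]
```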
Thanks for the details. I completely agree with you. I don't like calling `toarray` every time either, but I didn't come up with a good idea. I don't think I fully understand the details, but I will look into it.
I understand it now. I'll implement it.
```python
for i in range(self.n_features_in_):
    indices = X[:, i]
    if is_sparse:
        indices = indices.toarray().flatten()
```
Same here regarding CSC and not needing to densify in that case.
I may be wrong or missing something, but even though the matrix is CSC, `X[:, i]` is still a sparse matrix. So it cannot be used for indexing?
I understand it now. You suggested I use `indices` and `indptr` to access the column directly.

`csc_matrix((data, indices, indptr), [shape=(M, N)])` is the standard CSC representation, where the row indices for column i are stored in `indices[indptr[i]:indptr[i+1]]` and their corresponding values in `data[indptr[i]:indptr[i+1]]`. If the shape parameter is not supplied, the matrix dimensions are inferred from the index arrays.
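Concretely, with a toy matrix assumed for illustration, one column's stored entries can be read directly from those arrays:

```python
import numpy as np
from scipy.sparse import csc_matrix

X = csc_matrix(np.array([[5, 0], [0, 7], [8, 0]]))
i = 0
start, end = X.indptr[i], X.indptr[i + 1]
rows = X.indices[start:end]   # row indices of stored entries in column i
vals = X.data[start:end]      # corresponding nonzero values

assert rows.tolist() == [0, 2]
assert vals.tolist() == [5, 8]
```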
Thanks for the PR @g-morishita. Please add an entry to the 1.2 what's new. I'd consider it more like an enhancement than a fix.
Sorry for the late reply. I got it. Could you tell me how to add the entry? I read how to send a PR, but I couldn't find it. @jeremiedbb

It means adding an entry in the file

Hey! Why wasn't this merged?
Reference Issues/PRs
Fixes #16561.
What does this implement/fix? Explain your changes.
I fixed a bug where, even though the documentation says `CategoricalNB` accepts a sparse matrix as a feature matrix, it cannot. Now `CategoricalNB` accepts a sparse matrix (in particular, `csr_matrix`) without any error. Also, I added the test for it.
Any other comments?
In #16561, one of the maintainers asked in what situation you would want CategoricalNB for sparse data.

When I did an assignment in my university class, I was required to train a very simple model like the naive Bayes algorithms on sparse data, which was originally text data. If `toarray()` was used, a memory error occurred.

I created such an example. If I run the following code on my local computer with 16 GB of memory, the process was killed because of memory shortage when I didn't use a sparse matrix. When I did, I was able to finish it successfully.
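The author's original script is not included in this extract. As an illustrative sketch (assumed sizes, not the original code), the memory gap that makes densifying text-like data fail can be estimated without allocating the dense array:

```python
import numpy as np
from scipy.sparse import csr_matrix

# ~100,000 documents over a 10,000-term vocabulary, ~10 terms per document
rng = np.random.default_rng(0)
n_docs, n_terms, nnz_per_doc = 100_000, 10_000, 10
rows = np.repeat(np.arange(n_docs), nnz_per_doc)
cols = rng.integers(0, n_terms, size=n_docs * nnz_per_doc)
data = np.ones(n_docs * nnz_per_doc)
X = csr_matrix((data, (rows, cols)), shape=(n_docs, n_terms))

sparse_mb = (X.data.nbytes + X.indices.nbytes + X.indptr.nbytes) / 1e6
dense_mb = n_docs * n_terms * 8 / 1e6   # ~8,000 MB if densified with toarray()

# the dense representation is orders of magnitude larger
assert dense_mb > 100 * sparse_mb
```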