ENH Add new smoothing methods to MultinomialNB #12943
Conversation
So far this is mostly issues of idiom. But you need to add tests, firstly to show that changing the parameter has the desired (or at least any) effect, secondly to test your implementation against known values from the literature or toy data.
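A rough sketch of the kind of test being asked for, assuming the parameter keeps the name smoothing and a 'good-turing' option as proposed in this PR (both names are assumptions about the final API):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical check that the smoothing choice actually changes the fitted model.
rng = np.random.RandomState(0)
X = rng.randint(5, size=(100, 20))
y = rng.randint(2, size=100)

clf_add = MultinomialNB(smoothing='additive').fit(X, y)
clf_gt = MultinomialNB(smoothing='good-turing').fit(X, y)

assert not np.allclose(clf_add.feature_log_prob_, clf_gt.feature_log_prob_)

A second test would compare feature_log_prob_ on a small hand-worked example against values computed by following Gale & Sampson's recipe by hand.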
I'm not persuaded that in the context of naive bayes classification we should expect substantial benefit from a wide array of smoothing choices. Ideally this would be supported by literature or examples to show that less naive smoothing can help substantially in multinomial naive bayes classification.
@@ -698,10 +704,16 @@ class MultinomialNB(BaseDiscreteNB):
C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to
Information Retrieval. Cambridge University Press, pp. 234-265.
https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

Gale, William A., and Geoffrey Sampson. 1996. Good-Turing Frequency
Please roughly follow a consistent citation style, i.e. first name first.
(0 for no smoothing).
Smoothing parameter (0 for no smoothing).

smoothing : string, optional (default='additive')
This should state the acceptable values.
@@ -638,8 +640,11 @@ class MultinomialNB(BaseDiscreteNB):
Parameters
----------
alpha : float, optional (default=1.0)
    Additive (Laplace/Lidstone) smoothing parameter
    (0 for no smoothing).
    Smoothing parameter (0 for no smoothing).
The meaning of this parameter in different smoothing methods should be noted here or under smoothing
@@ -728,6 +745,92 @@ def _joint_log_likelihood(self, X):
return (safe_sparse_dot(X, self.feature_log_prob_.T) +
        self.class_log_prior_)

def _additive_smoothing(self, alpha):
To avoid obscurity and follow the convention that function names correspond to verbs, let's use the naming convention _smooth_additive, etc.
"""Compute log probabilities using Simple Good-Turing smoothing""" | ||
def sgt(fc): | ||
# Get the frequencies of frequencies | ||
n = dict() |
This is not using numpy idiom. This should be using numpy data structures and numpy.sort where possible.
Use np.unique and np.bincount? Roughly:
freq_values, freq_idx = np.unique(fc, return_inverse=True)
freq_freqs = np.bincount(freq_idx, minlength=len(freq_values))
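For instance, on a toy count vector this replaces the dict-building loop entirely (a minimal sketch, not part of the PR):

import numpy as np

fc = np.array([0, 1, 1, 2, 5, 1, 2])   # toy feature counts for one class
freq_values, freq_idx = np.unique(fc, return_inverse=True)
freq_freqs = np.bincount(freq_idx, minlength=len(freq_values))
# freq_values -> [0, 1, 2, 5]; freq_freqs -> [1, 3, 2, 1]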
# Compute the Z values
r = np.array(sorted(n.items(), key=lambda keyval: keyval[0]),
             dtype='int')[:, 0]
n = np.array(sorted(n.items(), key=lambda keyval: keyval[0]),
Please avoid duplicating the sort
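For example, sorting once and slicing out both columns (a sketch, assuming n is still the dict of frequencies of frequencies at this point):

pairs = np.array(sorted(n.items()), dtype=int)
r, n = pairs[:, 0], pairs[:, 1]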
Z[r[0]] = 2*n[0] / r[1]
Z[r[-1]] = n[-1] / (r[-1] - r[-2])
for (idx, j) in enumerate(r):
    if idx == 0 or idx >= len(r) - 1:
These conditions can be specified in the for line by slicing r and using the second parameter of enumerate
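A sketch of what that could look like, assuming the loop body computes the usual interior Z value keyed by the count:

for idx, j in enumerate(r[1:-1], start=1):
    Z[j] = 2 * n[idx] / (r[idx + 1] - r[idx - 1])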
Z = dict()
Z[r[0]] = 2*n[0] / r[1]
Z[r[-1]] = n[-1] / (r[-1] - r[-2])
for (idx, j) in enumerate(r):
I suspect this loop can be written as a vectorised numpy operation, e.g. Z[1:-1] = 2 * freq_freqs[1:-1] / (freq_values[2:] - freq_values[:-2])
or something??
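If Z is stored as an array aligned with the r and n arrays instead of a dict, the whole computation vectorises along these lines (a sketch of the averaging transform from Gale & Sampson):

Z = np.empty(len(n), dtype=float)
Z[0] = 2 * n[0] / r[1]                    # first count: previous neighbour taken as 0
Z[-1] = n[-1] / (r[-1] - r[-2])           # last count: no next neighbour
Z[1:-1] = 2 * n[1:-1] / (r[2:] - r[:-2])  # interior counts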
p_r = (1-P_0)*(r_star/N_prime)

# Calculate probabilities for each feature
total_unseen = np.count_nonzero(fc == 0)
use sum rather than count_nonzero. count_nonzero is implemented as sum(x != 0)
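That is, roughly:

total_unseen = (fc == 0).sum()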
# Calculate probabilities for each feature
total_unseen = np.count_nonzero(fc == 0)
unseen_prob = P_0/total_unseen
Please use spaces around binary operators unless deleting spaces helps make the order of operations clearer.
How would you define "substantial" @jnothman? Looking at Improving Naive Bayes Text Classifier Using Smoothing Methods, it seems like alternate smoothing methods can perform a good bit better depending on the feature size and the amount of training data.
I've not looked yet, but I did not claim that sophisticated smoothing would
not help, rather that allowing the user to choose among many might not give much benefit beyond the gains of a single sophisticated method. Certainly it would be helpful to cite work that identifies the benefits of each method for classification.
Sorry, I misunderstood your comment. It seems sensible to provide justification for each individual smoothing method.
@oanise93 are you still working on this?
Sorry for the confusion @amueller, I was merely commenting on the thread. I'm not working on this PR.
I am a complete noob in this area. I suppose Good-Turing is a smoothing method independent of learning algorithms, so should it be separated out to allow other learning methods to use it (like logistic regression)?
Reference Issues/PRs
Resolves #12862
What does this implement/fix? Explain your changes.
Implements the Good-Turing smoothing algorithm for the Naive Bayes classifier and adds two other possible options.
Any other comments?
If the maintainers decide to add Jelinek-Mercer or Absolute Discounting, we need to use different default smoothing parameters depending on the selected algorithm.
Since Good-Turing uses raw counts, the code won't work if the input is transformed using, for example, tf-idf. Would throwing an error be a reasonable solution when the input is not raw counts?
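One possible shape for such a check (a sketch only; the helper name and error message are hypothetical, not part of this PR):

import numpy as np
from scipy import sparse

def _check_raw_counts(X):
    # Good-Turing needs raw non-negative integer counts, so reject e.g. tf-idf input.
    data = X.data if sparse.issparse(X) else np.asarray(X)
    if np.any(data < 0) or np.any(data != np.rint(data)):
        raise ValueError("smoothing='good-turing' requires raw (integer) "
                         "counts; got negative or non-integer values.")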
At first, I implemented Simple Good-Turing operating on the entire matrix self.feature_count_, but the current solution is more readable. For more information on the notation see Good-Turing Frequency Estimation Without Tears.
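For readers unfamiliar with the method, here is a rough, self-contained sketch of Simple Good-Turing estimation for one class's feature counts, loosely following Gale & Sampson's notation (P_0, Z, r*, N'). It simplifies the published procedure by using the log-linear estimate for every count instead of the Turing/log-linear switch rule, so it illustrates the idea rather than reproducing this PR's implementation:

import numpy as np

def simple_good_turing(fc):
    """Simplified Simple Good-Turing probabilities for one class's counts."""
    fc = np.asarray(fc)
    seen = fc > 0

    # Frequencies of frequencies: distinct nonzero counts r and how often each occurs (n).
    r, idx = np.unique(fc[seen], return_inverse=True)
    n = np.bincount(idx)

    N = np.sum(r * n)                       # total number of observations
    P_0 = n[0] / N if r[0] == 1 else 0.0    # mass reserved for unseen features

    # Averaging transform Z (assumes at least two distinct nonzero count values).
    Z = np.empty(len(n), dtype=float)
    Z[0] = 2 * n[0] / r[1]
    Z[-1] = n[-1] / (r[-1] - r[-2])
    Z[1:-1] = 2 * n[1:-1] / (r[2:] - r[:-2])

    # Fit log Z = a + b * log r and use the log-linear adjusted counts r*.
    b, _ = np.polyfit(np.log(r), np.log(Z), 1)
    r_star = r * (1 + 1 / r) ** (b + 1)

    # Renormalise so that the seen features share probability mass (1 - P_0).
    N_prime = np.sum(n * r_star)
    p_r = (1 - P_0) * (r_star / N_prime)

    # Seen features get the probability for their count; unseen features split P_0 evenly.
    probs = np.empty(len(fc), dtype=float)
    probs[seen] = p_r[idx]
    total_unseen = (~seen).sum()
    if total_unseen:
        probs[~seen] = P_0 / total_unseen
    return probs

Applied to each row of self.feature_count_, np.log of the result would play the role of that class's row of feature_log_prob_.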