ENH Add new smoothing methods to MultinomialNB #12943
Conversation
So far this is mostly issues of idiom. But you need to add tests, firstly to show that changing the parameter has the desired (or at least any) effect, secondly to test your implementation against known values from the literature or toy data.
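A rough sketch of the kind of test being asked for, assuming the parameter keeps the name smoothing and a 'good-turing' option as proposed in this PR (both names are assumptions about the final API):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical check that the smoothing choice actually changes the fitted model.
rng = np.random.RandomState(0)
X = rng.randint(5, size=(100, 20))
y = rng.randint(2, size=100)

clf_add = MultinomialNB(smoothing='additive').fit(X, y)
clf_gt = MultinomialNB(smoothing='good-turing').fit(X, y)

assert not np.allclose(clf_add.feature_log_prob_, clf_gt.feature_log_prob_)

A second test would compare feature_log_prob_ on a small hand-worked example against values computed by following Gale & Sampson's recipe by hand.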
I'm not persuaded that in the context of naive bayes classification we should expect substantial benefit from a wide array of smoothing choices. Ideally this would be supported by literature or examples to show that less naive smoothing can help substantially in multinomial naive bayes classification.
@@ -698,10 +704,16 @@ class MultinomialNB(BaseDiscreteNB):
C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to
Information Retrieval. Cambridge University Press, pp. 234-265.
https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

Gale, William A., and Geoffrey Sampson. 1996. Good-Turing Frequency
Please roughly follow a consistent citation style, i.e. first name first.
(0 for no smoothing).
Smoothing parameter (0 for no smoothing).

smoothing : string, optional (default='additive')
This should state the acceptable values.
@@ -638,8 +640,11 @@ class MultinomialNB(BaseDiscreteNB):
Parameters
----------
alpha : float, optional (default=1.0)
    Additive (Laplace/Lidstone) smoothing parameter
    (0 for no smoothing).
    Smoothing parameter (0 for no smoothing).
The meaning of this parameter in different smoothing methods should be noted here or under smoothing
@@ -728,6 +745,92 @@ def _joint_log_likelihood(self, X):
return (safe_sparse_dot(X, self.feature_log_prob_.T) +
        self.class_log_prior_)

def _additive_smoothing(self, alpha):
To avoid obscurity and follow the convention that function names correspond to verbs, let's use the naming convention _smooth_additive, etc.
"""Compute log probabilities using Simple Good-Turing smoothing""" | ||
def sgt(fc): | ||
# Get the frequencies of frequencies | ||
n = dict() |
This is not using numpy idiom. This should be using numpy data structures and numpy.sort where possible.
Use np.unique and np.bincount? Roughly:
freq_values, freq_idx = np.unique(fc, return_inverse=True)
freq_freqs = np.bincount(freq_idx, minlength=len(freq_values))
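For instance, on a toy count vector this replaces the dict-building loop entirely (a minimal sketch, not part of the PR):

import numpy as np

fc = np.array([0, 1, 1, 2, 5, 1, 2])   # toy feature counts for one class
freq_values, freq_idx = np.unique(fc, return_inverse=True)
freq_freqs = np.bincount(freq_idx, minlength=len(freq_values))
# freq_values -> [0, 1, 2, 5]; freq_freqs -> [1, 3, 2, 1]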
# Compute the Z values
r = np.array(sorted(n.items(), key=lambda keyval: keyval[0]),
             dtype='int')[:, 0]
n = np.array(sorted(n.items(), key=lambda keyval: keyval[0]),
Please avoid duplicating the sort
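For example, sorting once and slicing out both columns (a sketch, assuming n is still the dict of frequencies of frequencies at this point):

pairs = np.array(sorted(n.items()), dtype=int)
r, n = pairs[:, 0], pairs[:, 1]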
Z[r[0]] = 2*n[0] / r[1]
Z[r[-1]] = n[-1] / (r[-1] - r[-2])
for (idx, j) in enumerate(r):
    if idx == 0 or idx >= len(r) - 1:
These conditions can be specified in the for line by slicing r and using the second parameter of enumerate
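A sketch of what that could look like, assuming the loop body computes the usual interior Z value keyed by the count:

for idx, j in enumerate(r[1:-1], start=1):
    Z[j] = 2 * n[idx] / (r[idx + 1] - r[idx - 1])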
Z = dict()
Z[r[0]] = 2*n[0] / r[1]
Z[r[-1]] = n[-1] / (r[-1] - r[-2])
for (idx, j) in enumerate(r):
I suspect this loop can be written as a vectorised numpy operation, e.g. Z[1:-1] = 2 * freq_freqs[1:-1] / (freq_values[2:] - freq_values[:-2])
or something??
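If Z is stored as an array aligned with the r and n arrays instead of a dict, the whole computation vectorises along these lines (a sketch of the averaging transform from Gale & Sampson):

Z = np.empty(len(n), dtype=float)
Z[0] = 2 * n[0] / r[1]                    # first count: previous neighbour taken as 0
Z[-1] = n[-1] / (r[-1] - r[-2])           # last count: no next neighbour
Z[1:-1] = 2 * n[1:-1] / (r[2:] - r[:-2])  # interior counts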
p_r = (1-P_0)*(r_star/N_prime)

# Calculate probabilities for each feature
total_unseen = np.count_nonzero(fc == 0)
use sum rather than count_nonzero. count_nonzero is implemented as sum(x != 0)
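That is, roughly:

total_unseen = (fc == 0).sum()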
# Calculate probabilities for each feature
total_unseen = np.count_nonzero(fc == 0)
unseen_prob = P_0/total_unseen
Please use spaces around binary operators unless deleting spaces helps make the order of operations clearer.
How would you define "substantial" @jnothman? Looking at Improving Naive Bayes Text Classifier Using Smoothing Methods, it seems like alternate smoothing methods can perform a good bit better depending on the feature size and the amount of training data.
I've not looked yet, but I did not claim that sophisticated smoothing would
not help, rather that allowing the user to choose among many might not give much benefit beyond the gains of a single sophisticated method. Certainly it would be helpful to cite work that identifies the benefits of each method for classification.
Sorry, I misunderstood your comment. It seems sensible to provide justification for each individual smoothing method.
@oanise93 are you still working on this?
Sorry for the confusion @amueller, I was merely commenting on the thread. I'm not working on this PR.
I am a complete noob in this area. I suppose Good-Turing is a smoothing method independent of learning algorithms, so should it be separated out to allow other learning methods to use it (like logistic regression)?
Reference Issues/PRs
Resolves #12862
What does this implement/fix? Explain your changes.
Implements the Good-Turing smoothing algorithm for the Naive Bayes classifier and adds two other possible options.
Any other comments?
If the maintainers decide to add Jelinek-Mercer or Absolute Discounting, we need to use different default smoothing parameters depending on the selected algorithm.
Since Good-Turing uses raw counts, the code won't work if the input is transformed using, for example, tf-idf. Would throwing an error be a reasonable solution when the input is not raw counts?
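One possible shape for such a check (a sketch only; the helper name and error message are hypothetical, not part of this PR):

import numpy as np
from scipy import sparse

def _check_raw_counts(X):
    # Good-Turing needs raw non-negative integer counts, so reject e.g. tf-idf input.
    data = X.data if sparse.issparse(X) else np.asarray(X)
    if np.any(data < 0) or np.any(data != np.rint(data)):
        raise ValueError("smoothing='good-turing' requires raw (integer) "
                         "counts; got negative or non-integer values.")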
At first, I implemented Simple Good-Turing operating on the entire matrix self.feature_count_, but the current solution is more readable. For more information on the notation see Good-Turing Frequency Estimation Without Tears.
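For readers unfamiliar with the method, here is a rough, self-contained sketch of Simple Good-Turing estimation for one class's feature counts, loosely following Gale & Sampson's notation (P_0, Z, r*, N'). It simplifies the published procedure by using the log-linear estimate for every count instead of the Turing/log-linear switch rule, so it illustrates the idea rather than reproducing this PR's implementation:

import numpy as np

def simple_good_turing(fc):
    """Simplified Simple Good-Turing probabilities for one class's counts."""
    fc = np.asarray(fc)
    seen = fc > 0

    # Frequencies of frequencies: distinct nonzero counts r and how often each occurs (n).
    r, idx = np.unique(fc[seen], return_inverse=True)
    n = np.bincount(idx)

    N = np.sum(r * n)                       # total number of observations
    P_0 = n[0] / N if r[0] == 1 else 0.0    # mass reserved for unseen features

    # Averaging transform Z (assumes at least two distinct nonzero count values).
    Z = np.empty(len(n), dtype=float)
    Z[0] = 2 * n[0] / r[1]
    Z[-1] = n[-1] / (r[-1] - r[-2])
    Z[1:-1] = 2 * n[1:-1] / (r[2:] - r[:-2])

    # Fit log Z = a + b * log r and use the log-linear adjusted counts r*.
    b, _ = np.polyfit(np.log(r), np.log(Z), 1)
    r_star = r * (1 + 1 / r) ** (b + 1)

    # Renormalise so that the seen features share probability mass (1 - P_0).
    N_prime = np.sum(n * r_star)
    p_r = (1 - P_0) * (r_star / N_prime)

    # Seen features get the probability for their count; unseen features split P_0 evenly.
    probs = np.empty(len(fc), dtype=float)
    probs[seen] = p_r[idx]
    total_unseen = (~seen).sum()
    if total_unseen:
        probs[~seen] = P_0 / total_unseen
    return probs

Applied to each row of self.feature_count_, np.log of the result would play the role of that class's row of feature_log_prob_.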