FIX An overflow issue in HashingVectorizer #19035
Conversation
Thank you for the PR @ly648499246!
Can you add a non-regression test that would fail on master but pass in this PR?
Thank you for your attention.

from sklearn.feature_extraction.text import HashingVectorizer
hashing = HashingVectorizer(n_features=1000000, ngram_range=(2,3), strip_accents='ascii')
print(hashing.transform(['22pcs efuture']).indices)

Before this change, this code printed a negative index. We can also test this case:

from sklearn.feature_extraction.text import HashingVectorizer
hashing = HashingVectorizer(n_features=1000000, ngram_range=(2,3), strip_accents='ascii')
print(hashing.transform(['22pcs efuture']))

Before this change, this code raised an exception. This happens on both Linux and Windows.
Would the hash values for all inputs change then? That is probably fine in this case as a bug fix, but we need to add a note on breaking backward compatibility (and potentially the token collisions that happen) in the release notes. Please add an entry to the change log.
For the non-regression test, we could monkey patch the murmurhash function to ensure it returns -2**31.
Thank you for your advice. I already did some tests locally; hash values for other inputs would not change.
Thanks for your advice! Maybe I'm ignorant, but is there a way to monkey patch the murmurhash function? In my case above, I can promise that the murmurhash function returns -2**31, because I printed it to check while testing. Moreover, the test above works even without strip_accents.
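For context, here is a minimal sketch of why a Python-level monkey patch may not reach the hashing code. It patches sklearn.utils.murmurhash3_32 (the public Python entry point); the assumption, suggested by this exchange, is that the compiled transform path calls the C implementation directly and bypasses the patch:

from unittest import mock
from sklearn.feature_extraction.text import HashingVectorizer

hashing = HashingVectorizer(n_features=1000000, ngram_range=(2, 3))
baseline = hashing.transform(['22pcs efuture']).indices

# Assumption: the Cython extension resolves the hash at the C level,
# so patching the Python-level name leaves transform unchanged.
with mock.patch('sklearn.utils.murmurhash3_32', return_value=-2**31):
    patched = hashing.transform(['22pcs efuture']).indices

print((baseline == patched).all())  # True: the patch never takes effect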
It's really a latent bug, nice work!
Thank you for your advice @jeremiedbb!
Thanks @ly648499246. For the non-regression test, what the other reviewers asked for is to actually write a test inside the scikit-learn test suite. The test you wrote in the comment is perfectly fine, i.e.

hashing = HashingVectorizer(n_features=1000000, ngram_range=(2,3))
indices = hashing.transform(['22pcs efuture']).indices

and you can assert that the values in indices are not negative. You can add this test at the end of sklearn/feature_extraction/tests/test_text.py, with a small comment mentioning the PR number.
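A minimal sketch of what such a test could look like (the test name and comment wording are illustrative, not taken from the PR):

from sklearn.feature_extraction.text import HashingVectorizer

def test_hashing_vectorizer_indices_non_negative():
    # Non-regression test for gh-19035: abs(-2**31) overflows int32,
    # which used to produce negative column indices.
    hashing = HashingVectorizer(n_features=1000000, ngram_range=(2, 3))
    indices = hashing.transform(['22pcs efuture']).indices
    assert (indices >= 0).all()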
@glemaitre it's only hashing 1 string. The 1000000 features only specify the range for the indices, but the resulting array is a sparse matrix containing only 1 non-zero element.
OK so it should not be an issue then.
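A quick sketch illustrating this point; the expected non-zero count is my reading of the input (one word bigram, no trigram), so treat the exact nnz as an assumption:

from sklearn.feature_extraction.text import HashingVectorizer

hashing = HashingVectorizer(n_features=1000000, ngram_range=(2, 3))
X = hashing.transform(['22pcs efuture'])

print(X.shape)  # (1, 1000000): n_features only sets the column range
print(X.nnz)    # 1: '22pcs efuture' contains a single word bigram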
Thanks @ly648499246. Just nitpicks.
Thanks @jeremiedbb for your careful check!
LGTM
LGTM
Thank you for working on this @ly648499246!
Reference Issues/PRs
Fixes #19034
What does this implement/fix? Explain your changes.
There is an overflow issue in _hashing_fast.pyx.
In this code, when h == -2147483648 (-2**31), the result of abs(h) is still -2147483648, because the absolute value of the smallest 32-bit integer cannot be represented in 32 bits, so abs(h) % n_features is negative.
After this change, when h != -2147483648 the result of abs(h) % n_features is the same as before the change, and when h == -2147483648 it returns a correct non-negative result.
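A minimal sketch of the failure mode in plain Python. math.fmod emulates C's truncating %, and numpy's int32 reproduces the fixed-width wraparound; the concrete values assume n_features=1000000 as in the report above:

import math
import numpy as np

n_features = 1000000
h = np.int32(-2**31)  # the minimum value of a signed 32-bit integer

# abs() wraps around: the int32 range is [-2**31, 2**31 - 1], so
# |-2**31| cannot be represented and the result stays -2147483648.
print(abs(h))  # -2147483648

# C's % truncates toward zero (math.fmod matches that), so the old
# expression abs(h) % n_features yielded a negative index:
print(math.fmod(int(abs(h)), n_features))  # -483648.0

# Reducing modulo n_features before taking the absolute value keeps
# the magnitude below n_features, avoiding the overflow entirely:
print(abs(math.fmod(-2**31, n_features)))  # 483648.0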
Any other comments?
This issue actually happened in my job; I hope it can be resolved.
Thanks for your attention!