FIX An overflow issue in HashingVectorizer #19035


Status: Merged (18 commits, Jan 12, 2021)

Conversation

@ly648499246 (Contributor) commented on Dec 18, 2020

Reference Issues/PRs

Fixes #19034

What does this implement/fix? Explain your changes.

There is an overflow issue in _hashing_fast.pyx.

In this code, when h == -2147483648 (-2^31), the result of abs(h) is still -2147483648 because the 32-bit absolute value overflows, so abs(h) % n_features is negative.

After this change, when h != -2147483648 the result of abs(h) % n_features is the same as before, and when h == -2147483648 it now returns a correct (non-negative) result.
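The overflow described above can be reproduced outside of Cython with NumPy's fixed-width integers. A minimal sketch (the variable names and the 64-bit workaround shown here are illustrative, not necessarily the exact fix merged in this PR):

```python
import numpy as np

n_features = 1_000_000
h = np.int32(-2**31)  # MurmurHash3 can return this 32-bit value

# |-2**31| does not fit in int32, so the 32-bit abs wraps back to -2**31.
with np.errstate(over="ignore"):
    abs_h = np.abs(h)
print(int(abs_h))  # -2147483648

# C-style remainder (what Cython's % on C integers computes) keeps the
# sign of the dividend, producing the negative index seen in the bug.
bad_index = int(np.fmod(abs_h, np.int32(n_features)))
print(bad_index)  # -483648

# One safe alternative: widen to Python/64-bit integers before abs.
good_index = abs(int(h)) % n_features
print(good_index)  # 483648
```

Note that Python's own `%` always returns a non-negative result for a positive divisor, which is why the bug only appears in compiled 32-bit arithmetic.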

Any other comments?

This issue actually occurred in my work; I hope it can be resolved.
Thanks for your attention!

@thomasjpfan (Member) left a comment

Thank you for the PR @ly648499246 !

Can you add a non-regression test that would fail at master but pass in this PR?

@ly648499246 (Contributor, Author) commented on Dec 18, 2020

> Thank you for the PR @ly648499246 !
>
> Can you add a non-regression test that would fail at master but pass in this PR?

Thank you for your attention.
The case is:

from sklearn.feature_extraction.text import HashingVectorizer

hashing = HashingVectorizer(n_features=1000000, ngram_range=(2, 3), strip_accents='ascii')

print(hashing.transform(['22pcs efuture']).indices)

Before this change, this code prints array([-483648], dtype=int32); after this change, it prints array([483648], dtype=int32).

We can also test this case:

from sklearn.feature_extraction.text import HashingVectorizer

hashing = HashingVectorizer(n_features=1000000, ngram_range=(2, 3), strip_accents='ascii')

print(hashing.transform(['22pcs efuture']))

Before this change, this code raises ValueError: negative column index found; after this change, it prints (0, 483648) -1.0.

This reproduction works on Linux and Windows. Results on different platforms are as follows:

                 Linux     Windows   MacOS
before change    wrong     wrong     correct
after change     correct   correct   correct

@rth (Member) commented on Dec 18, 2020

Would the hash value for all inputs change then? That is probably fine in this case as a bug fix, but we need to add a note on breaking backward compatibility (and the token collisions that may change as a result) in the release notes.

Please add an entry to the change log at doc/whats_new/v1.0.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.

@glemaitre (Member) commented

For the non-regression test, could we monkey-patch the murmurhash function to ensure it returns -2 ** 31, without having to run a potentially expensive hashing fit?

@ly648499246 (Contributor, Author) commented

> Would the hash value for all inputs change then? That is probably fine in this case as a bug fix, but we need to add a note on breaking backward compatibility (and the token collisions that may change as a result) in the release notes.
>
> Please add an entry to the change log at doc/whats_new/v1.0.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.

Thank you for your advice. I already ran some tests locally; the hash values for other inputs do not change.
I will also add an entry to the change log.

@ly648499246 (Contributor, Author) commented

> For the non-regression test, could we monkey-patch the murmurhash function to ensure it returns -2 ** 31, without having to run a potentially expensive hashing fit?

Thanks for your advice!

Maybe I'm missing something, but is there a way to monkey-patch the murmurhash function? _hashing_fast and murmurhash are precompiled.

In my case above, I can guarantee that the murmurhash function returns -2**31, because I printed it while testing.

Moreover, the test above does not call fit, so it is fast to run; maybe you can try it.

@ly648499246 changed the title from "fic bug: when h == -2**31 abs(h) cause an overflow" to "fix bug: when h == -2**31 abs(h) cause an overflow" on Dec 19, 2020
@williechai commented
It's really a latent bug, nice work.

@ly648499246 changed the title from "fix bug: when h == -2**31 abs(h) cause an overflow" to "[MRG]fix bug: when h == -2**31 abs(h) cause an overflow" on Dec 21, 2020
@ly648499246 changed the title from "[MRG]fix bug: when h == -2**31 abs(h) cause an overflow" to "[MRG]fix bug: an overflow issue in HashingVectorizer" on Dec 21, 2020
Review thread on sklearn/feature_extraction/_hashing_fast.pyx (outdated, resolved)
ly648499246 and others added 2 commits December 23, 2020 09:41
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
@ly648499246 (Contributor, Author) commented

Thank you for your advice @jeremiedbb !
This is a safer change, and I have committed the suggestion.

@jeremiedbb (Member) left a comment

Thanks @ly648499246. For the non-regression test, what the other reviewers asked for is a test written inside the scikit-learn test suite. The test you wrote in the comment is perfectly fine, i.e.

hashing = HashingVectorizer(n_features=1000000, ngram_range=(2, 3))
indices = hashing.transform(['22pcs efuture']).indices

and you can assert that the values in indices are not negative. You can add this test at the end of sklearn/feature_extraction/tests/test_text.py, with a small comment mentioning the PR number.

Review threads on doc/whats_new/v1.0.rst (outdated, resolved)
@jeremiedbb (Member) commented

> For the non-regression test, could we monkey-patch the murmurhash function to ensure it returns -2 ** 31, without having to run a potentially expensive hashing fit?

@glemaitre it's only hashing one string. n_features=1000000 only specifies the range for the indices; the resulting array is a sparse matrix containing only one non-zero element.

@glemaitre (Member) commented

> @glemaitre it's only hashing one string. n_features=1000000 only specifies the range for the indices; the resulting array is a sparse matrix containing only one non-zero element.

OK, so it should not be an issue then.
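A quick way to see jeremiedbb's point (an illustrative check, assuming scikit-learn is installed): the output shape spans all n_features columns, but the number of stored elements stays tiny because only one word 2-gram is produced from the input.

```python
from sklearn.feature_extraction.text import HashingVectorizer

hashing = HashingVectorizer(n_features=1_000_000, ngram_range=(2, 3))
X = hashing.transform(["22pcs efuture"])

print(X.shape)  # (1, 1000000) -- n_features only sets the column range
print(X.nnz)    # 1 -- the single n-gram "22pcs efuture" is the only entry
```

So the test is cheap to run despite the large n_features; no fit is needed at all.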

ly648499246 and others added 6 commits January 6, 2021 21:03
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
@jeremiedbb (Member) left a comment

Thanks @ly648499246. Just nitpicks.

Review threads on sklearn/feature_extraction/tests/test_text.py (resolved)
ly648499246 and others added 3 commits January 6, 2021 21:34
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
@ly648499246 (Contributor, Author) commented

> Thanks @ly648499246. Just nitpicks.

Thanks @jeremiedbb for your careful check!

@jeremiedbb (Member) left a comment

LGTM

Review thread on sklearn/feature_extraction/_hashing_fast.pyx (resolved)
@thomasjpfan (Member) left a comment

LGTM

@thomasjpfan changed the title from "[MRG]fix bug: an overflow issue in HashingVectorizer" to "FIX An overflow issue in HashingVectorizer" on Jan 12, 2021
@thomasjpfan thomasjpfan merged commit 1bb0306 into scikit-learn:master Jan 12, 2021
@thomasjpfan (Member) commented

Thank you for working on this @ly648499246 !

@glemaitre glemaitre mentioned this pull request Apr 22, 2021
@ly648499246 ly648499246 deleted the development-ly branch January 3, 2022 14:46
Linked issue (closed by this pull request): A negative is in indices of HashingVectorizer result
6 participants