FIX An overflow issue in HashingVectorizer #19035


Status: Merged (18 commits, Jan 12, 2021)

Conversation

@ly648499246 (Contributor) commented on Dec 18, 2020

Reference Issues/PRs

Fixes #19034

What does this implement/fix? Explain your changes.

There is an overflow issue in _hashing_fast.pyx.

In this code, when h == -2147483648 (-2^31), the result of abs(h) is still -2147483648 because the 32-bit absolute value overflows, so abs(h) % n_features is negative.

After this change, when h != -2147483648 the result of abs(h) % n_features is the same as before, and when h == -2147483648 it now returns a correct (non-negative) result.
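The overflow described above can be reproduced outside of Cython with NumPy's fixed-width integers. A minimal sketch (the variable names and the 64-bit workaround shown here are illustrative, not necessarily the exact fix merged in this PR):

```python
import numpy as np

n_features = 1_000_000
h = np.int32(-2**31)  # MurmurHash3 can return this 32-bit value

# |-2**31| does not fit in int32, so the 32-bit abs wraps back to -2**31.
with np.errstate(over="ignore"):
    abs_h = np.abs(h)
print(int(abs_h))  # -2147483648

# C-style remainder (what Cython's % on C integers computes) keeps the
# sign of the dividend, producing the negative index seen in the bug.
bad_index = int(np.fmod(abs_h, np.int32(n_features)))
print(bad_index)  # -483648

# One safe alternative: widen to Python/64-bit integers before abs.
good_index = abs(int(h)) % n_features
print(good_index)  # 483648
```

Note that Python's own `%` always returns a non-negative result for a positive divisor, which is why the bug only appears in compiled 32-bit arithmetic.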

Any other comments?

This issue actually occurred in my work; I hope it can be resolved.
Thanks for your attention!

@thomasjpfan (Member) left a comment

Thank you for the PR @ly648499246 !

Can you add a non-regression test that would fail at master but pass in this PR?

@ly648499246 (Contributor, Author) commented on Dec 18, 2020

> Thank you for the PR @ly648499246 !
>
> Can you add a non-regression test that would fail at master but pass in this PR?

Thank you for your attention.
The case is:

from sklearn.feature_extraction.text import HashingVectorizer

hashing = HashingVectorizer(n_features=1000000, ngram_range=(2, 3), strip_accents='ascii')

print(hashing.transform(['22pcs efuture']).indices)

Before this change, this code prints array([-483648], dtype=int32); after this change, it prints array([483648], dtype=int32).

We can also test this case:

from sklearn.feature_extraction.text import HashingVectorizer

hashing = HashingVectorizer(n_features=1000000, ngram_range=(2, 3), strip_accents='ascii')

print(hashing.transform(['22pcs efuture']))

Before this change, this code raises ValueError: negative column index found; after this change, it prints (0, 483648) -1.0.

This reproduction works on Linux and Windows. Results on different platforms are as follows:

                 Linux     Windows   MacOS
before change    wrong     wrong     correct
after change     correct   correct   correct

@rth (Member) commented on Dec 18, 2020

Would the hash value for all inputs change then? That is probably fine in this case as a bug fix, but we need to add a note on breaking backward compatibility (and the token collisions that may change as a result) in the release notes.

Please add an entry to the change log at doc/whats_new/v1.0.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.

@glemaitre (Member) commented

For the non-regression test, could we monkey-patch the murmurhash function to ensure it returns -2 ** 31, without having to run a potentially expensive hashing fit?

@ly648499246 (Contributor, Author) commented

> Would the hash value for all inputs change then? That is probably fine in this case as a bug fix, but we need to add a note on breaking backward compatibility (and the token collisions that may change as a result) in the release notes.
>
> Please add an entry to the change log at doc/whats_new/v1.0.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.

Thank you for your advice. I already ran some tests locally; the hash values for other inputs do not change.
I will also add an entry to the change log.

@ly648499246 (Contributor, Author) commented

> For the non-regression test, could we monkey-patch the murmurhash function to ensure it returns -2 ** 31, without having to run a potentially expensive hashing fit?

Thanks for your advice!

Maybe I'm missing something, but is there a way to monkey-patch the murmurhash function? _hashing_fast and murmurhash are precompiled.

In my case above, I can guarantee that the murmurhash function returns -2**31, because I printed it while testing.

Moreover, the test above does not call fit, so it is fast to run; maybe you can try it.

@ly648499246 changed the title from "fic bug: when h == -2**31 abs(h) cause an overflow" to "fix bug: when h == -2**31 abs(h) cause an overflow" on Dec 19, 2020
@williechai commented
It's really a latent bug, nice work.

@ly648499246 changed the title from "fix bug: when h == -2**31 abs(h) cause an overflow" to "[MRG]fix bug: when h == -2**31 abs(h) cause an overflow" on Dec 21, 2020
@ly648499246 changed the title from "[MRG]fix bug: when h == -2**31 abs(h) cause an overflow" to "[MRG]fix bug: an overflow issue in HashingVectorizer" on Dec 21, 2020
Review thread on sklearn/feature_extraction/_hashing_fast.pyx (outdated, resolved)
ly648499246 and others added 2 commits December 23, 2020 09:41
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
@ly648499246 (Contributor, Author) commented

Thank you for your advice @jeremiedbb !
This is a safer change, and I have committed the suggestion.

@jeremiedbb (Member) left a comment

Thanks @ly648499246. For the non-regression test, what the other reviewers asked for is a test written inside the scikit-learn test suite. The test you wrote in the comment is perfectly fine, i.e.

hashing = HashingVectorizer(n_features=1000000, ngram_range=(2, 3))
indices = hashing.transform(['22pcs efuture']).indices

and you can assert that the values in indices are not negative. You can add this test at the end of sklearn/feature_extraction/tests/test_text.py, with a small comment mentioning the PR number.

Review threads on doc/whats_new/v1.0.rst (outdated, resolved)
@jeremiedbb (Member) commented

> For the non-regression test, could we monkey-patch the murmurhash function to ensure it returns -2 ** 31, without having to run a potentially expensive hashing fit?

@glemaitre it's only hashing one string. n_features=1000000 only specifies the range for the indices; the resulting array is a sparse matrix containing only one non-zero element.

@glemaitre (Member) commented

> @glemaitre it's only hashing one string. n_features=1000000 only specifies the range for the indices; the resulting array is a sparse matrix containing only one non-zero element.

OK, so it should not be an issue then.
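A quick way to see jeremiedbb's point (an illustrative check, assuming scikit-learn is installed): the output shape spans all n_features columns, but the number of stored elements stays tiny because only one word 2-gram is produced from the input.

```python
from sklearn.feature_extraction.text import HashingVectorizer

hashing = HashingVectorizer(n_features=1_000_000, ngram_range=(2, 3))
X = hashing.transform(["22pcs efuture"])

print(X.shape)  # (1, 1000000) -- n_features only sets the column range
print(X.nnz)    # 1 -- the single n-gram "22pcs efuture" is the only entry
```

So the test is cheap to run despite the large n_features; no fit is needed at all.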

ly648499246 and others added 6 commits January 6, 2021 21:03
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
@jeremiedbb (Member) left a comment

Thanks @ly648499246. Just nitpicks.

Review threads on sklearn/feature_extraction/tests/test_text.py (resolved)
ly648499246 and others added 3 commits January 6, 2021 21:34
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
@ly648499246 (Contributor, Author) commented

> Thanks @ly648499246. Just nitpicks.

Thanks @jeremiedbb for your careful check!

@jeremiedbb (Member) left a comment

LGTM

Review thread on sklearn/feature_extraction/_hashing_fast.pyx (resolved)
@thomasjpfan (Member) left a comment

LGTM

@thomasjpfan changed the title from "[MRG]fix bug: an overflow issue in HashingVectorizer" to "FIX An overflow issue in HashingVectorizer" on Jan 12, 2021
@thomasjpfan thomasjpfan merged commit 1bb0306 into scikit-learn:master Jan 12, 2021
@thomasjpfan (Member) commented

Thank you for working on this @ly648499246 !

@glemaitre glemaitre mentioned this pull request Apr 22, 2021
@ly648499246 ly648499246 deleted the development-ly branch January 3, 2022 14:46
Linked issue (closed by this pull request): A negative is in indices of HashingVectorizer result
6 participants