Underflow issues in the Cython perplexity search code used in t-SNE

The _binary_search_perplexity function in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/manifold/_utils.pyx declares several variables as float (single precision). On some datasets this causes underflow and incorrect t-SNE results.
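To illustrate the kind of underflow meant here, below is a minimal NumPy sketch. It assumes that the search evaluates something like exp(-distance * beta) with beta starting near 1, so that squared distances of about 100 lead to terms like exp(-100). The arrays and names are made up for illustration and are not taken from _utils.pyx.

```python
import numpy as np

# Hypothetical squared distances: one around 100, one somewhat larger.
d = np.array([100.0, 110.0])

p32 = np.exp(-d.astype(np.float32))   # single precision, comparable to a C float
p64 = np.exp(-d.astype(np.float64))   # double precision, comparable to a C double

# Smallest normal float32 (about 1.18e-38); anything smaller is subnormal or zero.
print(np.finfo(np.float32).tiny)
print(p32)  # exp(-110) becomes exactly 0, exp(-100) falls below the normal range
print(p64)  # both values are ordinary doubles
```

In single precision, the values that survive at all have lost most of their significant bits, so any sum or entropy computed from them is unreliable. In double precision the same terms are far from the underflow threshold.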

Two reproducible examples:

import numpy as np
from sklearn.manifold._utils import _binary_search_perplexity

np.random.seed(42)
x = np.random.randn(1,90).astype(np.float32) + 100
B = _binary_search_perplexity(x, 30, verbose=0)
print(2**-np.nansum(B[0,1:]*np.log2(B[0,1:])))   # Perplexity should be 30

np.random.seed(0)
x = np.random.randint(2100,2140,size=90).astype(np.float32)[np.newaxis,:] / 10000
B = _binary_search_perplexity(x, 30, verbose=0)
print(np.max(B[0,1:]))   # Should be O(0.01)

This outputs:

151311.23286064313        # should be 30
7.006443589251029e-38     # should be O(0.01)

My student @xwymary, with help from @pavlin-policar, tracked the bug down to the float declarations in lines 54-65. Replacing them with double and rerunning the same code yields:

30.00022084763819
0.06641180919096631

which is the correct result (as confirmed by using the de-Cythonized version of the _binary_search_perplexity function).
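For reference, a double-precision perplexity search can be sketched in plain NumPy as follows. This is an independent sketch of the usual t-SNE binary search over the precision beta, not the actual de-Cythonized code; the function name binary_search_perplexity_f64 and its parameters n_steps and tol are made up here, and the perplexity is computed over all entries of the row.

```python
import numpy as np

def binary_search_perplexity_f64(sqdist, desired_perplexity, n_steps=100, tol=1e-5):
    """Row-wise conditional probabilities with the requested perplexity (float64 only)."""
    sqdist = np.asarray(sqdist, dtype=np.float64)
    desired_entropy = np.log(desired_perplexity)
    P = np.zeros_like(sqdist)
    for i, d in enumerate(sqdist):
        beta, beta_min, beta_max = 1.0, -np.inf, np.inf
        for _ in range(n_steps):
            p = np.exp(-d * beta)
            sum_p = p.sum()
            if sum_p == 0.0:
                sum_p = 1e-8  # guard against division by zero
            p /= sum_p
            # Shannon entropy (in nats) of the normalized row
            entropy = np.log(sum_p) + beta * np.sum(d * p)
            diff = entropy - desired_entropy
            if abs(diff) <= tol:
                break
            if diff > 0.0:
                # Entropy too high: the distribution is too flat, so increase beta
                beta_min = beta
                beta = beta * 2.0 if np.isinf(beta_max) else (beta + beta_max) / 2.0
            else:
                # Entropy too low: the distribution is too peaked, so decrease beta
                beta_max = beta
                beta = beta / 2.0 if np.isinf(beta_min) else (beta + beta_min) / 2.0
        P[i] = p
    return P

# Same input as the first example above
np.random.seed(42)
x = np.random.randn(1, 90).astype(np.float32) + 100
P = binary_search_perplexity_f64(x, 30)
perplexity = np.exp(-np.sum(P[0] * np.log(P[0])))
print(perplexity)  # should be close to 30
```

Because everything is cast to float64 up front, terms such as exp(-100) stay well inside the representable range and the search converges to the requested perplexity.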

We ran into this on a real-life dataset, where the bug caused strong, clearly visible artifacts in the t-SNE results.

I am going to submit a PR to fix this.

Underflow issues in the Cython perplexity search code used in t-SNE #19471