Description
The _binary_search_perplexity function in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/manifold/_utils.pyx declares several intermediate variables as float (single precision). On some datasets this causes underflow and incorrect t-SNE results.
Two reproducible examples:
import numpy as np
from sklearn.manifold._utils import _binary_search_perplexity

# Example 1: 90 distances close to 100
np.random.seed(42)
x = np.random.randn(1, 90).astype(np.float32) + 100
B = _binary_search_perplexity(x, 30, verbose=0)
print(2 ** -np.nansum(B[0, 1:] * np.log2(B[0, 1:])))  # Perplexity should be 30

# Example 2: 90 distances between 0.21 and 0.214
np.random.seed(0)
x = np.random.randint(2100, 2140, size=90).astype(np.float32)[np.newaxis, :] / 10000
B = _binary_search_perplexity(x, 30, verbose=0)
print(np.max(B[0, 1:]))  # Should be O(0.01)
This outputs:
151311.23286064313 # should be 30
7.006443589251029e-38 # should be O(0.01)
My student @xwymary, with help from @pavlin-policar, tracked the bug down to the float declarations in lines 54--65. Replacing them with double and rerunning the same code yields:
30.00022084763819
0.06641180919096631
which is the correct result (as confirmed by running a de-Cythonized version of the _binary_search_perplexity function).
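For anyone who wants to experiment with the precision issue without rebuilding the Cython extension, here is a minimal NumPy sketch of a perplexity binary search for a single point. It is not the scikit-learn code: the names binary_search_perplexity and perplexity_of, the dtype parameter, and the 1e-8 fallback for a zero sum are assumptions made for illustration. All intermediate arithmetic runs in the requested dtype, so the single- and double-precision behaviour can be compared side by side.

```python
import numpy as np


def binary_search_perplexity(sqdist, perplexity, dtype=np.float64,
                             n_steps=100, tol=1e-5):
    """Illustrative sketch (not the scikit-learn implementation).

    Searches for beta such that p_j proportional to exp(-beta * sqdist_j)
    has the requested perplexity. Intermediate values are kept in `dtype`.
    """
    dtype = np.dtype(dtype).type
    d = np.asarray(sqdist, dtype=dtype)
    target_entropy = np.log(perplexity)
    beta, beta_min, beta_max = 1.0, -np.inf, np.inf
    p = np.zeros_like(d)
    for _ in range(n_steps):
        b = dtype(beta)
        p = np.exp(-d * b)           # may underflow to 0 in float32
        sum_p = p.sum()
        if sum_p == 0:
            sum_p = dtype(1e-8)      # assumed fallback to avoid dividing by zero
        # Shannon entropy (in nats) of the normalized distribution
        entropy = float(np.log(sum_p) + b * (d * p).sum() / sum_p)
        p = p / sum_p
        diff = entropy - target_entropy
        if abs(diff) <= tol:
            break
        if diff > 0:                 # distribution too flat: increase beta
            beta_min = beta
            beta = beta * 2 if beta_max == np.inf else (beta + beta_max) / 2
        else:                        # distribution too peaked: decrease beta
            beta_max = beta
            beta = beta / 2 if beta_min == -np.inf else (beta + beta_min) / 2
    return p


def perplexity_of(p):
    """Perplexity exp(H) of a probability vector, ignoring zero entries."""
    p = np.asarray(p, dtype=np.float64)
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))


rng = np.random.default_rng(42)
sqdist = rng.standard_normal(89) + 100   # values near 100, as in the first example

p64 = binary_search_perplexity(sqdist, 30, dtype=np.float64)
p32 = binary_search_perplexity(sqdist, 30, dtype=np.float32)
print("float64 perplexity:", perplexity_of(p64))
print("float32 perplexity:", perplexity_of(p32))

# exp(-100) is about 3.7e-44, below the smallest normal float32 (about 1.2e-38),
# and exp(-200) is not representable in float32 at all.
print(np.finfo(np.float32).tiny, np.exp(np.float32(-200.0)), np.exp(np.float64(-200.0)))
```

In double precision the search should land close to the requested perplexity of 30. The single-precision number is printed only for comparison; its exact value depends on how the intermediate sums round, so it should not be read as a reproduction of the outputs above.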
We ran into this on a real-life dataset, where the bug caused strong, clearly visible artifacts in the t-SNE results.
I am going to submit a PR to fix this.