The function that calculates Morgan hashes shows some presumably unintended behavior.
When given a larger radius, it keeps updating the hashes, so the number of hashes grows and the same structures are reported several times at different radii:
c1 = smiles("C/C=C/C")
print(c1.morgan_smiles_hash(1,4))
print(c1.morgan_hash_smiles(1,4))
print(c1._morgan_hash_dict(1, 4))
output:
{'C': [-3850700631077715909], 'CC': [6744783386241714987], 'C(=C)C': [-7539450242065226921, -8323264873442648791], 'C(=C/C)\\C': [3662467204175228013, 8307239162483189762, 631123986373182916]}
{-3850700631077715909: ['C'], 6744783386241714987: ['CC'], -7539450242065226921: ['C(=C)C'], -8323264873442648791: ['C(=C)C'], 3662467204175228013: ['C(=C/C)\\C'], 8307239162483189762: ['C(=C/C)\\C'], 631123986373182916: ['C(=C/C)\\C']}
[{1: -3850700631077715909, 2: -3850700631077715909, 3: -3850700631077715909, 4: -3850700631077715909}, {1: 6744783386241714987, 2: -7539450242065226921, 3: -7539450242065226921, 4: 6744783386241714987}, {1: -8323264873442648791, 2: 3662467204175228013, 3: 3662467204175228013, 4: -8323264873442648791}, {1: 8307239162483189762, 2: 631123986373182916, 3: 631123986373182916, 4: 8307239162483189762}]
(Note that in the output above, the hashes for the SMILES C(=C)C come from two different radii; the same holds for C(=C/C)\C.)
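To make the growth easy to see, here is a small sketch that counts distinct hashes in the `_morgan_hash_dict` output posted above. It assumes that each list entry corresponds to one iteration (radius step) and that the dict keys are atom numbers; `cumulative_distinct` is a helper made up for this illustration, not part of the library.

```python
# Output of c1._morgan_hash_dict(1, 4), copied from above.
# Assumption: list index = iteration (radius step), dict key = atom number.
layers = [
    {1: -3850700631077715909, 2: -3850700631077715909,
     3: -3850700631077715909, 4: -3850700631077715909},
    {1: 6744783386241714987, 2: -7539450242065226921,
     3: -7539450242065226921, 4: 6744783386241714987},
    {1: -8323264873442648791, 2: 3662467204175228013,
     3: 3662467204175228013, 4: -8323264873442648791},
    {1: 8307239162483189762, 2: 631123986373182916,
     3: 631123986373182916, 4: 8307239162483189762},
]


def cumulative_distinct(layers):
    """Return the running count of distinct hashes after each iteration."""
    seen = set()
    counts = []
    for layer in layers:
        seen.update(layer.values())
        counts.append(len(seen))
    return counts


if __name__ == "__main__":
    for step, count in enumerate(cumulative_distinct(layers)):
        print(f"iteration {step}: {count} distinct hashes so far")
```

Every iteration adds new hashes, even after the environments already cover the whole four-atom molecule.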
For reference, the RDKit implementation of Morgan fingerprints does not create superfluous hashes. In the example below, the result is always the same whether the radius is set to 2, 3, or 4.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
m = Chem.MolFromSmiles('C/C=C/C')
fpgen = AllChem.GetMorganGenerator(radius=4)
ao = AllChem.AdditionalOutput()
ao.CollectBitInfoMap()
fp = fpgen.GetSparseCountFingerprint(m,additionalOutput=ao)
info = ao.GetBitInfoMap()
print(info)
output:
{736731344: ((1, 1), (2, 1)), 1796441752: ((1, 2),), 2246703798: ((1, 0), (2, 0)), 2246728737: ((0, 0), (3, 0)), 3545353036: ((0, 1), (3, 1))}
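One way to get a radius-independent result like this is to skip any environment whose set of bonds has already been seen. The sketch below is only a simplified illustration of that idea in plain Python, not RDKit's actual code: the bond list, the 1-based atom numbering, and the helper names are assumptions made for the example, and radius 0 (single atoms, no bonds) is left out.

```python
from collections import deque

# Hypothetical bond list for C/C=C/C, atoms numbered 1..4 along the chain.
BONDS = [(1, 2), (2, 3), (3, 4)]


def distances_from(start, bonds):
    """Breadth-first search: topological distance from start to each atom."""
    adjacency = {}
    for u, v in bonds:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def environment(atom, radius, bonds):
    """Set of bonds that lie within `radius` bonds of `atom`."""
    dist = distances_from(atom, bonds)
    inf = float("inf")
    return frozenset(
        b for b in bonds
        if min(dist.get(b[0], inf), dist.get(b[1], inf)) < radius
    )


def unique_environments(bonds, max_radius):
    """Keep (atom, radius) pairs only when their bond set is new."""
    atoms = sorted({a for bond in bonds for a in bond})
    seen = set()
    kept = []
    for radius in range(1, max_radius + 1):
        for atom in atoms:
            env = environment(atom, radius, bonds)
            if env in seen:
                continue  # same substructure already produced at an earlier step
            seen.add(env)
            kept.append((atom, radius))
    return kept


if __name__ == "__main__":
    for max_radius in (2, 3, 4):
        print(max_radius, unique_environments(BONDS, max_radius))
```

With this filter, the kept environments are identical for a maximum radius of 2, 3, or 4, because nothing new appears once an environment spans the whole molecule.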
Is the behavior intended?