Hi, thanks for inventing this great algorithm! And glad to see ntHash2 is also published.
I've used ntHash in several projects of mine, specifically, using a Golang implementation (by @will-rowe) of ntHash v1.0.4. It's really fast!
I just found that the hash values are unstable for k > 64. I guess it's related to the algorithm itself, because the threshold happens to be 64 and not some other number. I am not familiar with C/C++, so I can't check the original implementation. Here are the steps to reproduce the issue with tools that use the Golang implementation.
$ echo -ne ">s\nACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" | seqkit stats
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
-     FASTA   DNA          1       65       65       65       65
# compute canonical ntHash with https://github.com/shenwei356/unikmer
$ echo -ne ">s\nACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 65 -l | unikmer view
14872199115326727818 ****
# Now we add one base at the start. The last 65-mer is unchanged, but its hash (marked ****) differs each time.
$ echo -ne ">s\naACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 65 -l | unikmer view
5042997269030491403
13219011773654478434 ****
$ echo -ne ">s\ncACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 65 -l | unikmer view
5252439041897790003
1248486797404628174 ****
$ echo -ne ">s\ngACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 65 -l | unikmer view
6432712771380638299
10232415768797241538 ****
$ echo -ne ">s\ntACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 65 -l | unikmer view
5774425108696765737
11299031900349869606 ****
While for k=64, it's stable (7718595180140858881):
$ echo -ne ">s\nCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 64 -l | unikmer view
7718595180140858881
$ echo -ne ">s\naCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 64 -l | unikmer view
8752650216170443135
7718595180140858881
$ echo -ne ">s\ncCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 64 -l | unikmer view
9222164657016850147
7718595180140858881
$ echo -ne ">s\ntCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 64 -l | unikmer view
8329673441993552238
7718595180140858881
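To spell out what "stable" means here: the hash of a given k-mer should not depend on the base that precedes it in the sequence. The Go sketch below is only an illustration under stated assumptions, not the code of ntHash or of the Golang port. It uses a forward-strand-only rolling hash in the ntHash style (the commands above use canonical hashes), the per-base seed constants are illustrative and not checked against any ntHash release, and the names `seed`, `directHash` and `rolledHashes` are hypothetical. It compares the hash of the 65-mer computed from scratch with the hash reached by rolling over a prepended base. Since `bits.RotateLeft64` rotates by k mod 64, a plain implementation like this one gives the same value either way, also for k = 65.

```go
package main

import (
	"fmt"
	"math/bits"
)

// seed returns an illustrative 64-bit constant per base (assumed values,
// not verified against any ntHash release).
func seed(b byte) uint64 {
	switch b {
	case 'A', 'a':
		return 0x3c8bfbb395c60474
	case 'C', 'c':
		return 0x3193c18562a02b4c
	case 'G', 'g':
		return 0x20323ed082572324
	case 'T', 't':
		return 0x295549f54be24456
	}
	return 0
}

// directHash hashes one k-mer from scratch. After the loop, the seed of
// base i has been rotated left (k-1-i) times and XORed into the result.
func directHash(kmer string) uint64 {
	var h uint64
	for i := 0; i < len(kmer); i++ {
		h = bits.RotateLeft64(h, 1) ^ seed(kmer[i])
	}
	return h
}

// rolledHashes returns the forward hash of every k-mer in seq, computing
// the first one directly and rolling one base at a time afterwards.
func rolledHashes(seq string, k int) []uint64 {
	if k <= 0 || len(seq) < k {
		return nil
	}
	h := directHash(seq[:k])
	out := []uint64{h}
	for i := k; i < len(seq); i++ {
		// Rotate the old hash, cancel the outgoing base (its seed has now
		// been rotated k times in total), and add the incoming base.
		h = bits.RotateLeft64(h, 1) ^
			bits.RotateLeft64(seed(seq[i-k]), k) ^
			seed(seq[i])
		out = append(out, h)
	}
	return out
}

func main() {
	// The sequence from the reproduction steps; k is its full length.
	base := "ACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA"
	k := len(base)
	want := directHash(base)

	for _, prefix := range []string{"a", "c", "g", "t"} {
		hs := rolledHashes(prefix+base, k)
		last := hs[len(hs)-1]
		fmt.Printf("prefix %s: last k-mer hash %d, stable=%v\n", prefix, last, last == want)
	}
}
```

If an implementation passes this kind of check for k = 64 but fails it for k = 65, the step that removes the outgoing base (the rotation by k) would be the first place I'd look, but that is a guess.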