Page 1 of 4 123 ... LastLast
Results 1 to 30 of 97

Thread: New saca and bwt library (libsais)

  1. #1
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    354
    Thanks
    102
    Thanked 273 Times in 99 Posts

    New saca and bwt library (libsais)

    libsais is my new library for fast (see Benchmarks on GitHub) linear time suffix array and Burrows-Wheeler transform construction based on induced sorting (same algorithm as in sais-lite by Yuta Mori).

    The algorithms runs in a linear time (and outperforms divsufsort) using typically only ~12KB of extra memory (with 2n bytes as absolute worst-case extra working space).

    Source code and Benchmarks is available at:
    https://github.com/IlyaGrebnov/libsais
    Enjoy coding, enjoy life!

  2. Thanks (16):

    algorithm (24th February 2021),avitar (7th March 2021),Bulat Ziganshin (24th February 2021),data man (19th December 2021),encode (25th February 2021),hexagone (24th February 2021),introspec (23rd March 2021),JamesB (8th March 2021),Lucas (24th February 2021),michael maniscalco (25th February 2021),Mike (24th February 2021),mitiko (24th February 2021),ne0n (28th February 2021),Shelwien (24th February 2021),snowcat (24th February 2021),xinix (24th February 2021)

  3. #2
    Member
    Join Date
    Feb 2015
    Location
    Canada
    Posts
    224
    Thanks
    37
    Thanked 109 Times in 65 Posts
    It's great to see an actual improvement in SACA performance after 10 years of stagnation in the field. Brilliant work!
    This certainly seems like it will become a new standard.

  4. #3
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    354
    Thanks
    102
    Thanked 273 Times in 99 Posts
    Quote Originally Posted by Lucas View Post
    It's great to see an actual improvement in SACA performance after 10 years of stagnation in the field.
    libsais is not based on net new ideas (we are still using 10+ years old sais algorithm). I think what changed is hardware profile. CPU Frequencies are stalled, but CPU cache and RAM keep getting bigger and faster. So something which were not practical 10 years ago become practical now. DDR5 is also expect to land later this year, so I think could be even better.
    Enjoy coding, enjoy life!

  5. Thanks:

    encode (18th March 2021)

  6. #4
    Member
    Join Date
    Feb 2015
    Location
    Canada
    Posts
    224
    Thanks
    37
    Thanked 109 Times in 65 Posts
    I know sais isn't a new algorithm, but being able to execute a majority of it out-of-order is an impressive optimization nonetheless.

  7. #5
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    143
    Thanks
    45
    Thanked 40 Times in 27 Posts
    Also my experience after my submission to GDCC.

    For inverse BWT, if you do it in parallel like in Lucas iBWT, the bottleneck is the Memory Level Parallelism of the cpu (that is the number of concurrent memory cache misses). Intel Skylake has around 10 and ZEN2 around 20. (I have not tested for ZEN but it is likely faster). Also you need Huge pages because then TLB misses will be the bottleneck.

    But note my block sorting was not exactly BWT but something a bit strange that used the cache better (but also with a compression ratio penalty).

  8. #6
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    758
    Thanks
    287
    Thanked 123 Times in 91 Posts
    Will you be updating your BSC with this new library?

  9. #7
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    354
    Thanks
    102
    Thanked 273 Times in 99 Posts
    Quote Originally Posted by Gonzalo View Post
    Will you be updating your BSC with this new library?
    I have no plans on updating bsc.
    Enjoy coding, enjoy life!

  10. #8
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    354
    Thanks
    102
    Thanked 273 Times in 99 Posts
    Asking for help. I do not have access to latest AMD (Zen 2/3) or ARM64 (Apple M1) CPUs. So I would appreciate if folks could help and benchmark libsais vs divsufsort on this CPU microarchitectures.
    Enjoy coding, enjoy life!

  11. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,904
    Thanks
    432
    Thanked 1,985 Times in 1,147 Posts
    i7-7820X @ 4.5Ghz
    Code:
    > bwttest enwik9 enwik9bwt
    divsufsort time = 100.422 speed = 9.497MB/s
    libsais time = 60.677 speed = 15.717MB/s
    Attached Files Attached Files

  12. Thanks:

    Gribok (9th March 2021)

  13. #10
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    666
    Thanks
    366
    Thanked 287 Times in 149 Posts
    AMD Ryzen 5 4600H (Zen 2), 6 x 3.0 GHz (4.0 GHz boost)
    enwik8 and enwik9, 3 runs each

    Code:
    enwik8:
    divsufsort time = 7.273 speed = 13.113MB/s
    libsais time = 4.815 speed = 19.808MB/s
    divsufsort time = 7.255 speed = 13.145MB/s
    libsais time = 4.803 speed = 19.856MB/s
    divsufsort time = 7.298 speed = 13.067MB/s
    libsais time = 4.823 speed = 19.773MB/s
    
    enwik9:
    divsufsort time = 103.198 speed = 9.241MB/s
    libsais time = 71.389 speed = 13.359MB/s
    divsufsort time = 103.012 speed = 9.258MB/s
    libsais time = 71.305 speed = 13.375MB/s
    divsufsort time = 103.091 speed = 9.251MB/s
    libsais time = 70.881 speed = 13.455MB/s
    @Shelwien: Could you post the source code or a -march=znver2 version? Or do you expect no speedup from this?
    http://schnaader.info
    Damn kids. They're all alike.

  14. Thanks (2):

    encode (18th March 2021),Gribok (9th March 2021)

  15. #11
    Member kampaster's Avatar
    Join Date
    Apr 2010
    Location
    ->
    Posts
    62
    Thanks
    23
    Thanked 15 Times in 12 Posts
    AMD Ryzen 5 3600X, 4342 MHz (from AIDA64)
    Code:
    > bwttest enwik9 enwik9bwt
    divsufsort time = 85.416 speed = 11.165MB/s
    libsais time = 54.552 speed = 17.482MB/s

  16. Thanks:

    Gribok (9th March 2021)

  17. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,904
    Thanks
    432
    Thanked 1,985 Times in 1,147 Posts

  18. Thanks (3):

    Gribok (9th March 2021),schnaader (9th March 2021),xinix (9th March 2021)

  19. #13
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,146
    Thanks
    617
    Thanked 558 Times in 213 Posts
    To test out the latest and greatest BWT/SA library by Ilya Grebnov, please check out the brand new BCM v1.60:

    https://compressme.net/#downloads


  20. Thanks:

    xinix (18th March 2021)

  21. #14
    Member
    Join Date
    May 2007
    Location
    Poland
    Posts
    118
    Thanks
    14
    Thanked 12 Times in 12 Posts
    AMD Ryzen 5800x (Zen3)

    Enwik8
    Code:
    filelength = 100000000
    memory allocation (6N) complete.
    data loading complete (100000000 bytes).
    divsufsort time = 6.468 speed = 14.744MB/s
    libsais time = 3.875 speed = 24.611MB/s
    BWT storing complete. index = 22481309
    CARE_SBI_Join.tar (some extracted excel)
    Code:
    filelength = 54346752
    memory allocation (6N) complete.
    data loading complete (54346752 bytes).
    divsufsort time = 1.507 speed = 34.385MB/s
    libsais time = 1.190 speed = 43.556MB/s
    BWT storing complete. index = 51422568

  22. Thanks:

    Gribok (25th March 2021)

  23. #15
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    354
    Thanks
    102
    Thanked 273 Times in 99 Posts
    I updated libsais to version 2.0 which comes with OpenMP support (no changes to single threaded performance).

    Source code and Benchmarks are available at project page: https://github.com/IlyaGrebnov/libsais

    For OpenMP testing I strongly recommend Clang for best performance.
    Attached Files Attached Files
    Last edited by Gribok; 9th April 2021 at 00:49.
    Enjoy coding, enjoy life!

  24. Thanks (4):

    encode (4th April 2021),JamesB (5th April 2021),Shelwien (4th April 2021),xinix (24th April 2021)

  25. #16
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    354
    Thanks
    102
    Thanked 273 Times in 99 Posts

    Version 2.1.0 with additional OpenMP parallelization

    I updated libsais to version 2.1 with additional OpenMP parallelization (no changes to single threaded performance).

    I am also attaching updated bwtbench program and test result of "divsufsort vs libsais vs msufsort" on enwik9 on Azure DS14_v2 VM (Xeon 8171M 2.1 GHz)
    Attached Thumbnails Attached Thumbnails Click image for larger version. 

Name:	Enwik9bench.png 
Views:	577 
Size:	36.4 KB 
ID:	8462  
    Attached Files Attached Files
    Enjoy coding, enjoy life!

  26. Thanks (2):

    Mike (19th April 2021),xinix (20th April 2021)

  27. #17
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    244
    Thanks
    48
    Thanked 172 Times in 64 Posts
    Quote Originally Posted by Gribok View Post
    I updated libsais to version 2.1 with additional OpenMP parallelization (no changes to single threaded performance).

    I am also attaching updated bwtbench program and test result of "divsufsort vs libsais vs msufsort" on enwik9 on Azure DS14_v2 VM (Xeon 8171M 2.1 GHz)
    Great work. I never added induced sorting to MSufSort 4 so it's no where near as fast as it should be. But your recent work has inspired me to finish up MSufSort 4.

    I'm curious, however, as to why the numbers I get for multithreaded performance on enwik8 are very different from yours. I find it hard to believe that different platforms could possibly produce such a very different result. Here are my numbers using an AMD Ryzen 9 3900 with Ubuntu 20.04:


    Code:
    libsais$ for ((i=1; i <=12 ; i++)); do echo $i; ./build/bin/libsais_demo enwik8 $i; done
    1
    elasped time = 18.95 MB/sec
    2
    elasped time = 27.58 MB/sec
    3
    elasped time = 34.09 MB/sec
    4
    elasped time = 39.21 MB/sec
    5
    elasped time = 42.76 MB/sec
    6
    elasped time = 45.09 MB/sec
    7
    elasped time = 45.93 MB/sec
    8
    elasped time = 46.27 MB/sec
    9
    elasped time = 47.87 MB/sec
    10
    elasped time = 47.77 MB/sec
    11
    elasped time = 48.06 MB/sec
    12
    elasped time = 48.33 MB/sec

    Code:
    msufsort$ for ((i=1; i <=12; i++)); do echo $i; ./build/bin/msufsort_demo b enwik8 $i | grep "transform completed"; done
    1
    burrows wheeler transform completed - total elapsed time: 15.65 MB/sec
    2
    burrows wheeler transform completed - total elapsed time: 30.96 MB/sec
    3
    burrows wheeler transform completed - total elapsed time: 41.80 MB/sec
    4
    burrows wheeler transform completed - total elapsed time: 51.70 MB/sec
    5
    burrows wheeler transform completed - total elapsed time: 60.31 MB/sec
    6
    burrows wheeler transform completed - total elapsed time: 65.61 MB/sec
    7
    burrows wheeler transform completed - total elapsed time: 73.26 MB/sec
    8
    burrows wheeler transform completed - total elapsed time: 78.18 MB/sec
    9
    burrows wheeler transform completed - total elapsed time: 83.19 MB/sec
    10
    burrows wheeler transform completed - total elapsed time: 87.56 MB/sec
    11
    burrows wheeler transform completed - total elapsed time: 90.90 MB/sec
    12
    burrows wheeler transform completed - total elapsed time: 93.89 MB/sec
    Click image for larger version. 

Name:	libsais_vs_msufsort.png 
Views:	2056 
Size:	21.4 KB 
ID:	8463

  28. Thanks:

    Mike (20th April 2021)

  29. #18
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    354
    Thanks
    102
    Thanked 273 Times in 99 Posts
    sais algorithm is very sensitive to memory speed. My RAM is also overclocked with tight sub-timings. So I am not actually that surprised to see this difference especially with AMD (which is know by having weaker memory controller prior to Zen3 compensated by larger L3 cache).

    For best performance try compiling with Clang 11.0 with "-Ofast -fopenmp -march=znver2 -DNDEBUG" (this is what I am using for testing with exception to -march=skylake instead of -march=znver2).

    I am also having trouble with MSufSort4 on very repetitive inputs like fib41, rs13 and tm29 from Pizza & Chilli Repetitive Corpus.
    Enjoy coding, enjoy life!

  30. #19
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    244
    Thanks
    48
    Thanked 172 Times in 64 Posts
    Quote Originally Posted by Gribok View Post
    sais algorithm is very sensitive to memory speed. My RAM is also overclocked with tight sub-timings. So I am not actually that surprised to see this difference especially with AMD (which is know by having weaker memory controller prior to Zen3 compensated by larger L3 cache).

    For best performance try compiling with Clang 11.0 with "-Ofast -fopenmp -march=znver2 -DNDEBUG" (this is what I am using for testing with exception to -march=skylake instead of -march=znver2).

    I am also having trouble with MSufSort4 on very repetitive inputs like fib41, rs13 and tm29 from Pizza & Chilli Repetitive Corpus.
    I'll measure again with the compiler flags you mentioned and report back. But I think that I'm probably getting results for libsais that roughly match your results. I think your result for msufsort is what is different from mine.

    Yes, msufsort4 doesn't have the induced sort component yet so it will suffer on those types of input at the moment. I'm mostly curious about MT performance in this case. Is there a reason why libsais appears to max out at about 6ish cores?

  31. #20
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    354
    Thanks
    102
    Thanked 273 Times in 99 Posts
    Quote Originally Posted by michael maniscalco View Post
    I think your result for msufsort is what is different from mine.
    I have some trouble compiling from sources, so I pick up binary from "msufsort 4 update" thread. Maybe it is not the latest version? RAM speed is the bottleneck and with 2 channels it max out around 6 cores. Intel Xeon 8171M I used from Azure have 6 channels, so it scales a bit better.
    Enjoy coding, enjoy life!

  32. #21
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    354
    Thanks
    102
    Thanked 273 Times in 99 Posts
    Also attaching benchmark on same computer. It only have 8 cores.
    Attached Thumbnails Attached Thumbnails Click image for larger version. 

Name:	msufsort4.png 
Views:	627 
Size:	25.4 KB 
ID:	8464  
    Enjoy coding, enjoy life!

  33. #22
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    354
    Thanks
    102
    Thanked 273 Times in 99 Posts
    I did another test on Azure D32ds_v4 VM (Intel® Xeon® Platinum 8272CL). And this time I was able to compile msufsort from sources. Results looks consistent with my previous benchmarks.
    Attached Thumbnails Attached Thumbnails Click image for larger version. 

Name:	Azure_D32dsV4.png 
Views:	645 
Size:	49.8 KB 
ID:	8465  
    Enjoy coding, enjoy life!

  34. #23
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,146
    Thanks
    617
    Thanked 558 Times in 213 Posts
    Unfortunately, I have some issues using your great library under Linux:
    The first version gives "Segmentation fault (core dumped)". I believe stack size is about 8 KB.
    The latest version cannot be compiled at all...

  35. #24
    Member
    Join Date
    Feb 2020
    Posts
    51
    Thanks
    8
    Thanked 4 Times in 4 Posts
    this is so obviuos: #include <stddef.h>, #include <limits.h>:)

  36. Thanks:

    Gribok (8th May 2021)

  37. #25
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    354
    Thanks
    102
    Thanked 273 Times in 99 Posts
    Quote Originally Posted by well View Post
    this is so obviuos: #include <stddef.h>, #include <limits.h>:)
    I fixed compilation on linux. Thanks @well for suggestion!
    Enjoy coding, enjoy life!

  38. #26
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    354
    Thanks
    102
    Thanked 273 Times in 99 Posts
    Quote Originally Posted by encode View Post
    Unfortunately, I have some issues using your great library under Linux:
    The first version gives "Segmentation fault (core dumped)". I believe stack size is about 8 KB.
    The latest version cannot be compiled at all...
    Ilya, I can have a look into what cause segmentation fault if you could share file and compiler version / options with me.
    Enjoy coding, enjoy life!

  39. #27
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,146
    Thanks
    617
    Thanked 558 Times in 213 Posts
    I'm testing the newest version - no issues of any kind! Well done!
    One question - what is the purpose of a newly added function last parameter "fs"? It's an extra space, but for what? Is it okay to keep it 0 or with some added extra space the BWT construction will be faster?

    BTW, tested your great library on ARM (aarch64) machine - it's more than 2 times slower than divsufsort...

  40. #28
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    354
    Thanks
    102
    Thanked 273 Times in 99 Posts
    Quote Originally Posted by encode View Post
    One question - what is the purpose of a newly added function last parameter "fs"? It's an extra space, but for what? Is it okay to keep it 0 or with some added extra space the BWT construction will be faster?:
    fs is additional space available at the end of SA array. It should be 0 in most cases. But it also can improve performance in same case, but this is uncommon. Internally libsais uses space (including free space) allocated for suffix array for bucket counting with different induction algorithms. There are 4 algorithms with different break point: 6K, 4K, 2K and 1K (where K is alphabet size). If free space is not sufficient for most efficient algorithm (6K), libsais will need to fallback to less efficient one (4K, 2K, 1K, etc..). You can find some benchmarks on project site under 'Additional memory' section.

    libsais is very sensitive to fast memory and software prefetching, so it might not be suitable for some platform like ARM / AArch64. So I am not surprised. Was it Apple M1?
    Enjoy coding, enjoy life!

  41. Thanks:

    encode (16th May 2021)

  42. #29
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,146
    Thanks
    617
    Thanked 558 Times in 213 Posts
    No, it wasn't Apple M1. Can't say exact model because of a trade secret. But that was the one with a slow memory and weird behaviour indeed.
    On Intel platform your library is always faster. I've tested it it on many Intel CPUs, even on ones with T suffix.

  43. #30
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,146
    Thanks
    617
    Thanked 558 Times in 213 Posts
    Quote Originally Posted by michael maniscalco View Post
    ... I find it hard to believe that different platforms could possibly produce such a very different result.
    I can confirm that the performance of libsais could be vastly different depending on a platform.

Page 1 of 4 123 ... LastLast

Similar Threads

  1. TurboPFor Library usage
    By Baker22 in forum Data Compression
    Replies: 6
    Last Post: 31st July 2020, 12:03
  2. Compression library advice
    By bloom256 in forum Data Compression
    Replies: 24
    Last Post: 23rd November 2015, 23:22
  3. Need C++ range code library
    By coyote in forum Data Compression
    Replies: 10
    Last Post: 14th September 2015, 22:33
  4. SACA-K vs. divsufsort for computing BWT
    By Matt Mahoney in forum Data Compression
    Replies: 5
    Last Post: 16th February 2014, 06:41
  5. MM compression library
    By Bulat Ziganshin in forum Forum Archive
    Replies: 29
    Last Post: 12th September 2007, 16:40

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Morty Proxy This is a proxified and sanitized view of the page, visit original site.