New saca and bwt library (libsais)

**Gribok** · 24th February 2021, 07:31

libsais is my new library for fast (see Benchmarks on GitHub) linear time suffix array and Burrows-Wheeler transform construction based on induced sorting (same algorithm as in sais-lite by Yuta Mori).

The algorithm runs in linear time (and outperforms divsufsort), typically using only ~12KB of extra memory (with 2n bytes as the absolute worst-case extra working space).

Source code and benchmarks are available at:
https://github.com/IlyaGrebnov/libsais
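
For readers new to the terms, here is a minimal Python sketch of what a suffix array and a BWT are. It is a naive sort-based construction for illustration only, not the induced-sorting algorithm libsais uses. The function names and the rotation-based primary index convention are assumptions of this sketch, not the libsais API.

```python
def suffix_array(data: bytes) -> list[int]:
    """Naive suffix array: sort suffix start positions by the suffix itself.

    Far slower than linear time; it only shows what the output looks like.
    """
    return sorted(range(len(data)), key=lambda i: data[i:])


def bwt(data: bytes) -> tuple[bytes, int]:
    """Naive rotation-based BWT.

    Returns the last column of the sorted rotations and the row that holds
    the original string (the primary index in this sketch's convention).
    """
    n = len(data)
    if n == 0:
        return b"", 0
    doubled = data + data  # rotation i is doubled[i:i + n]
    order = sorted(range(n), key=lambda i: doubled[i:i + n])
    last_column = bytes(doubled[i + n - 1] for i in order)
    return last_column, order.index(0)


print(suffix_array(b"banana"))  # [5, 3, 1, 0, 4, 2]
print(bwt(b"banana"))           # (b'nnbaaa', 3)
```

Index conventions differ between libraries, so the second value should not be compared directly with the index a real BWT library reports.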

**Lucas** · 24th February 2021, 19:23

It's great to see an actual improvement in SACA performance after 10 years of stagnation in the field. Brilliant work!
This certainly seems like it will become a new standard.

**Gribok** · 24th February 2021, 20:39

Originally Posted by Lucas

It's great to see an actual improvement in SACA performance after 10 years of stagnation in the field.

libsais is not based on net-new ideas (we are still using the 10+ year old sais algorithm). I think what changed is the hardware profile. CPU frequencies have stalled, but CPU caches and RAM keep getting bigger and faster. So something that was not practical 10 years ago has become practical now. DDR5 is also expected to land later this year, so I think it could get even better.

**Lucas** · 24th February 2021, 21:11

I know sais isn't a new algorithm, but being able to execute a majority of it out-of-order is an impressive optimization nonetheless.

**algorithm** · 24th February 2021, 22:41

Also my experience after my submission to GDCC.

For inverse BWT, if you do it in parallel like in Lucas's iBWT, the bottleneck is the memory-level parallelism of the CPU (that is, the number of concurrent cache misses it can keep in flight). Intel Skylake has around 10 and Zen 2 around 20. (I have not tested for ZEN but it is likely faster). You also need huge pages, because otherwise TLB misses become the bottleneck.

But note my block sorting was not exactly BWT but something a bit strange that used the cache better (but also with a compression ratio penalty).
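
To make the inverse BWT bottleneck mentioned above concrete, here is a minimal single-chain inverse BWT in Python. It assumes a rotation-based BWT whose index is the sorted row holding the original string; that convention and the function name are assumptions of this sketch, not taken from any library in this thread.

```python
def inverse_bwt(last_column: bytes, index: int) -> bytes:
    """Invert a rotation-based BWT by following the LF mapping."""
    n = len(last_column)

    # starts[c] = number of symbols in the input that are smaller than c.
    counts = [0] * 256
    for c in last_column:
        counts[c] += 1
    starts = [0] * 256
    total = 0
    for c in range(256):
        starts[c] = total
        total += counts[c]

    # lf[i] = row whose first symbol is the same occurrence as last_column[i].
    lf = [0] * n
    seen = [0] * 256
    for i, c in enumerate(last_column):
        lf[i] = starts[c] + seen[c]
        seen[c] += 1

    # Walk the chain backwards from the original row. Every step reads
    # lf[row] at an effectively random position and needs the previous
    # step's result, so a single chain has only one miss in flight.
    out = bytearray(n)
    row = index
    for k in range(n - 1, -1, -1):
        out[k] = last_column[row]
        row = lf[row]
    return bytes(out)


print(inverse_bwt(b"nnbaaa", 3))  # b'banana'
```

The `row = lf[row]` loop is the dependent random-access pattern. Decoding several independent chains at once is one way to keep more cache misses in flight, which is where the memory-level parallelism of the CPU becomes the limit.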

**Gonzalo** · 24th February 2021, 22:53

Will you be updating your BSC with this new library?

**Gribok** · 25th February 2021, 02:02

Originally Posted by Gonzalo

Will you be updating your BSC with this new library?

I have no plans on updating bsc.

**Gribok** · 7th March 2021, 19:29

Asking for help. I do not have access to the latest AMD (Zen 2/3) or ARM64 (Apple M1) CPUs, so I would appreciate it if folks could help and benchmark libsais vs divsufsort on these CPU microarchitectures.

**Shelwien** · 9th March 2021, 15:04

i7-7820X @ 4.5 GHz

Code:

> bwttest enwik9 enwik9bwt
divsufsort time = 100.422 speed = 9.497MB/s
libsais time = 60.677 speed = 15.717MB/s

**schnaader** · 9th March 2021, 16:10

AMD Ryzen 5 4600H (Zen 2), 6 x 3.0 GHz (4.0 GHz boost)
enwik8 and enwik9, 3 runs each

Code:

enwik8:
divsufsort time = 7.273 speed = 13.113MB/s
libsais time = 4.815 speed = 19.808MB/s
divsufsort time = 7.255 speed = 13.145MB/s
libsais time = 4.803 speed = 19.856MB/s
divsufsort time = 7.298 speed = 13.067MB/s
libsais time = 4.823 speed = 19.773MB/s

enwik9:
divsufsort time = 103.198 speed = 9.241MB/s
libsais time = 71.389 speed = 13.359MB/s
divsufsort time = 103.012 speed = 9.258MB/s
libsais time = 71.305 speed = 13.375MB/s
divsufsort time = 103.091 speed = 9.251MB/s
libsais time = 70.881 speed = 13.455MB/s

@Shelwien: Could you post the source code or a -march=znver2 version? Or do you expect no speedup from this?

**kampaster** · 9th March 2021, 17:30

AMD Ryzen 5 3600X, 4342 MHz (from AIDA64)

Code:

> bwttest enwik9 enwik9bwt
divsufsort time = 85.416 speed = 11.165MB/s
libsais time = 54.552 speed = 17.482MB/s

**Shelwien** · 9th March 2021, 17:40

@schnaader: added source to https://encode.su/threads/3579-New-s...ll=1#post68977

**encode** · 18th March 2021, 18:22

To test out the latest and greatest BWT/SA library by Ilya Grebnov, please check out the brand new BCM v1.60:

https://compressme.net/#downloads

**jethro** · 22nd March 2021, 00:45

AMD Ryzen 5800x (Zen3)

Enwik8

Code:

filelength = 100000000
memory allocation (6N) complete.
data loading complete (100000000 bytes).
divsufsort time = 6.468 speed = 14.744MB/s
libsais time = 3.875 speed = 24.611MB/s
BWT storing complete. index = 22481309

CARE_SBI_Join.tar (some extracted excel)

Code:

filelength = 54346752
memory allocation (6N) complete.
data loading complete (54346752 bytes).
divsufsort time = 1.507 speed = 34.385MB/s
libsais time = 1.190 speed = 43.556MB/s
BWT storing complete. index = 51422568

**Gribok** · 4th April 2021, 09:27

I updated libsais to version 2.0 which comes with OpenMP support (no changes to single threaded performance).

Source code and Benchmarks are available at project page: https://github.com/IlyaGrebnov/libsais

For OpenMP testing I strongly recommend Clang for best performance.

**Gribok** · 19th April 2021, 20:57

I updated libsais to version 2.1 with additional OpenMP parallelization (no changes to single threaded performance).

I am also attaching an updated bwtbench program and test results of "divsufsort vs libsais vs msufsort" on enwik9 on an Azure DS14_v2 VM (Xeon 8171M 2.1 GHz).

**michael maniscalco** · 20th April 2021, 02:11

Originally Posted by Gribok

I updated libsais to version 2.1 with additional OpenMP parallelization (no changes to single threaded performance).

I am also attaching an updated bwtbench program and test results of "divsufsort vs libsais vs msufsort" on enwik9 on an Azure DS14_v2 VM (Xeon 8171M 2.1 GHz).

Great work. I never added induced sorting to MSufSort 4, so it's nowhere near as fast as it should be. But your recent work has inspired me to finish up MSufSort 4.

I'm curious, however, as to why the numbers I get for multithreaded performance on enwik8 are very different from yours. I find it hard to believe that different platforms could possibly produce such a very different result. Here are my numbers using an AMD Ryzen 9 3900 with Ubuntu 20.04:

Code:

libsais$ for ((i=1; i <=12 ; i++)); do echo $i; ./build/bin/libsais_demo enwik8 $i; done
1
elasped time = 18.95 MB/sec
2
elasped time = 27.58 MB/sec
3
elasped time = 34.09 MB/sec
4
elasped time = 39.21 MB/sec
5
elasped time = 42.76 MB/sec
6
elasped time = 45.09 MB/sec
7
elasped time = 45.93 MB/sec
8
elasped time = 46.27 MB/sec
9
elasped time = 47.87 MB/sec
10
elasped time = 47.77 MB/sec
11
elasped time = 48.06 MB/sec
12
elasped time = 48.33 MB/sec

Code:

msufsort$ for ((i=1; i <=12; i++)); do echo $i; ./build/bin/msufsort_demo b enwik8 $i | grep "transform completed"; done
1
burrows wheeler transform completed - total elapsed time: 15.65 MB/sec
2
burrows wheeler transform completed - total elapsed time: 30.96 MB/sec
3
burrows wheeler transform completed - total elapsed time: 41.80 MB/sec
4
burrows wheeler transform completed - total elapsed time: 51.70 MB/sec
5
burrows wheeler transform completed - total elapsed time: 60.31 MB/sec
6
burrows wheeler transform completed - total elapsed time: 65.61 MB/sec
7
burrows wheeler transform completed - total elapsed time: 73.26 MB/sec
8
burrows wheeler transform completed - total elapsed time: 78.18 MB/sec
9
burrows wheeler transform completed - total elapsed time: 83.19 MB/sec
10
burrows wheeler transform completed - total elapsed time: 87.56 MB/sec
11
burrows wheeler transform completed - total elapsed time: 90.90 MB/sec
12
burrows wheeler transform completed - total elapsed time: 93.89 MB/sec

Attachment: libsais_vs_msufsort.png

**Gribok** · 20th April 2021, 04:35

The sais algorithm is very sensitive to memory speed. My RAM is also overclocked with tight sub-timings. So I am not actually that surprised to see this difference, especially with AMD (which is known for having a weaker memory controller prior to Zen 3, compensated by a larger L3 cache).

For best performance, try compiling with Clang 11.0 with "-Ofast -fopenmp -march=znver2 -DNDEBUG" (this is what I use for testing, except with -march=skylake instead of -march=znver2).

I am also having trouble with MSufSort4 on very repetitive inputs like fib41, rs13 and tm29 from the Pizza & Chilli Repetitive Corpus.

**michael maniscalco** · 20th April 2021, 04:54

Originally Posted by Gribok

The sais algorithm is very sensitive to memory speed. My RAM is also overclocked with tight sub-timings. So I am not actually that surprised to see this difference, especially with AMD (which is known for having a weaker memory controller prior to Zen 3, compensated by a larger L3 cache).

For best performance, try compiling with Clang 11.0 with "-Ofast -fopenmp -march=znver2 -DNDEBUG" (this is what I use for testing, except with -march=skylake instead of -march=znver2).

I am also having trouble with MSufSort4 on very repetitive inputs like fib41, rs13 and tm29 from the Pizza & Chilli Repetitive Corpus.

I'll measure again with the compiler flags you mentioned and report back. But I think that I'm probably getting results for libsais that roughly match your results. I think your result for msufsort is what is different from mine.

Yes, msufsort4 doesn't have the induced sort component yet so it will suffer on those types of input at the moment. I'm mostly curious about MT performance in this case. Is there a reason why libsais appears to max out at about 6ish cores?

**Gribok** · 20th April 2021, 05:14

Originally Posted by michael maniscalco

I think your result for msufsort is what is different from mine.

I had some trouble compiling from sources, so I picked up the binary from the "msufsort 4 update" thread. Maybe it is not the latest version? RAM speed is the bottleneck, and with 2 memory channels it maxes out around 6 cores. The Intel Xeon 8171M I used from Azure has 6 channels, so it scales a bit better.
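
To put numbers on the scaling being discussed, the Python snippet below computes the speedup over one thread from the Ryzen 9 3900 enwik8 figures posted earlier in the thread. It only restates those measurements; nothing here is a new benchmark, and the helper name is made up for this sketch.

```python
# MB/sec for 1..12 threads, copied from the Ryzen 9 3900 runs above.
libsais = [18.95, 27.58, 34.09, 39.21, 42.76, 45.09,
           45.93, 46.27, 47.87, 47.77, 48.06, 48.33]
msufsort = [15.65, 30.96, 41.80, 51.70, 60.31, 65.61,
            73.26, 78.18, 83.19, 87.56, 90.90, 93.89]


def speedups(mb_per_s: list[float]) -> list[float]:
    """Speedup of each thread count relative to the single-thread run."""
    return [value / mb_per_s[0] for value in mb_per_s]


for name, series in (("libsais", libsais), ("msufsort", msufsort)):
    s = speedups(series)
    print(f"{name}: 6 threads {s[5]:.2f}x, 12 threads {s[11]:.2f}x")
# libsais: 6 threads 2.38x, 12 threads 2.55x
# msufsort: 6 threads 4.19x, 12 threads 6.00x
```

On that machine libsais gains only about 7% going from 6 to 12 threads (45.09 to 48.33 MB/sec), which is consistent with a memory-bound workload flattening out once the memory channels are saturated.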

**Gribok** · 20th April 2021, 05:37

Also attaching a benchmark on the same computer. It only has 8 cores.

**Gribok** · 21st April 2021, 08:07

I did another test on an Azure D32ds_v4 VM (Intel® Xeon® Platinum 8272CL), and this time I was able to compile msufsort from sources. The results look consistent with my previous benchmarks.

**encode** · 8th May 2021, 01:17

Unfortunately, I have some issues using your great library under Linux:
The first version gives "Segmentation fault (core dumped)". I believe stack size is about 8 KB.
The latest version cannot be compiled at all...

**well** · 8th May 2021, 05:12

this is so obvious: #include <stddef.h>, #include <limits.h> :)

**Gribok** · 8th May 2021, 07:36

Originally Posted by well

this is so obvious: #include <stddef.h>, #include <limits.h> :)

I fixed compilation on Linux. Thanks @well for the suggestion!

**Gribok** · 8th May 2021, 07:41

Originally Posted by encode

Unfortunately, I have some issues using your great library under Linux:
The first version gives "Segmentation fault (core dumped)". I believe stack size is about 8 KB.
The latest version cannot be compiled at all...

Ilya, I can look into what causes the segmentation fault if you could share the file and the compiler version / options with me.

**encode** · 16th May 2021, 00:53

I'm testing the newest version - no issues of any kind! Well done!

One question - what is the purpose of the newly added last function parameter "fs"? It's extra space, but for what? Is it okay to keep it at 0, or will BWT construction be faster with some added extra space?

BTW, I tested your great library on an ARM (aarch64) machine - it's more than 2 times slower than divsufsort...

**Gribok** · 16th May 2021, 07:47

Originally Posted by encode

One question - what is the purpose of the newly added last function parameter "fs"? It's extra space, but for what? Is it okay to keep it at 0, or will BWT construction be faster with some added extra space?

fs is additional space available at the end of the SA array. It should be 0 in most cases. It can improve performance in some cases, but this is uncommon. Internally, libsais uses the space allocated for the suffix array (including free space) for bucket counting with different induction algorithms. There are 4 algorithms with different break points: 6K, 4K, 2K and 1K (where K is the alphabet size). If the free space is not sufficient for the most efficient algorithm (6K), libsais needs to fall back to a less efficient one (4K, 2K or 1K). You can find some benchmarks on the project site under the 'Additional memory' section.

libsais is very sensitive to fast memory and software prefetching, so it might not be suitable for some platforms like ARM / AArch64. So I am not surprised. Was it Apple M1?
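
The fallback between the 6K, 4K, 2K and 1K bucket layouts described for fs can be pictured with the small Python sketch below. It is a hypothetical illustration of the idea, not libsais code: the function name, the unit (number of free entries) and the return value when even 1K does not fit are all assumptions.

```python
from typing import Optional

# Break points quoted in the post, as multiples of the alphabet size K,
# ordered from the most efficient layout to the least efficient one.
BREAK_POINTS = (6, 4, 2, 1)


def pick_bucket_multiplier(alphabet_size: int,
                           available_entries: int) -> Optional[int]:
    """Return the largest multiplier m with m * K <= available_entries.

    Returns None when even 1K does not fit; what libsais does in that
    case is not covered in the post, so this sketch does not model it.
    """
    for multiplier in BREAK_POINTS:
        if multiplier * alphabet_size <= available_entries:
            return multiplier
    return None


print(pick_bucket_multiplier(256, 2000))  # 6
print(pick_bucket_multiplier(256, 1100))  # 4
print(pick_bucket_multiplier(256, 300))   # 1
```

How libsais actually measures the available space is not spelled out in the thread, so the sketch should be read only as a picture of why extra space can occasionally unlock a faster layout.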

**encode** · 16th May 2021, 11:28

No, it wasn't Apple M1. Can't say exact model because of a trade secret. But that was the one with a slow memory and weird behaviour indeed.
On Intel platforms your library is always faster. I've tested it on many Intel CPUs, even on ones with the T suffix.

**encode** · 16th May 2021, 13:41

Originally Posted by michael maniscalco

... I find it hard to believe that different platforms could possibly produce such a very different result.

I can confirm that the performance of libsais can be vastly different depending on the platform.
