From 9c17582faf9acf067e7ee384434d6c6f83ecc6ed Mon Sep 17 00:00:00 2001
From: AlexXan312 <62149707+AlexXan312@users.noreply.github.com>
Date: Wed, 2 Feb 2022 15:42:37 +0300
Subject: [PATCH 001/173] fix dp divide and conquer

---
 content/russian/cs/layer-optimizations/divide-and-conquer.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/russian/cs/layer-optimizations/divide-and-conquer.md b/content/russian/cs/layer-optimizations/divide-and-conquer.md
index a7731f49..61a7304a 100644
--- a/content/russian/cs/layer-optimizations/divide-and-conquer.md
+++ b/content/russian/cs/layer-optimizations/divide-and-conquer.md
@@ -19,10 +19,10 @@ $$

 Specifically, in the problem of covering points with segments, one can notice the following:

 $$
-opt[i, j] \leq opt[i, j+1]
+opt[i, j] \leq opt[i+1, j]
 $$

-The intuition is this: if we get an additional segment, then making the last segment larger is not beneficial; if anything, it should be "shrunk".
+The intuition is this: when we move i to the right, the point at which the last group can start cannot decrease.

 ### Idea

From 91108ace5d37d6730480dc2624cc1fdd64d26361 Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Thu, 31 Mar 2022 17:46:29 +0300
Subject: [PATCH 002/173] note about filtering performance

---
 content/english/hpc/simd/shuffling.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/content/english/hpc/simd/shuffling.md b/content/english/hpc/simd/shuffling.md
index f2a2cd15..5774b1fd 100644
--- a/content/english/hpc/simd/shuffling.md
+++ b/content/english/hpc/simd/shuffling.md
@@ -225,7 +225,9 @@ The vectorized version takes some work to implement, but it is 6-7x faster than

 ![](../img/filter.svg)

-This operation is considerably faster on AVX-512: it has a special "[compress](_mm512_mask_compress_epi32)" instruction that takes a vector of data and a mask and writes its unmasked elements contiguously. It makes a huge difference in algorithms that rely on various filtering subroutines.
+The loop performance is relatively low (taking 4 CPU cycles per iteration) because, on this particular CPU (Zen 2), `movemask`, `permute`, and `store` have low throughput and all have to go through the same execution port (P2). On most other platforms, you can expect it to be ~2x faster.
+
+Filtering can also be implemented considerably faster on AVX-512: it has a special "[compress](_mm512_mask_compress_epi32)" instruction that takes a vector of data and a mask and writes its unmasked elements contiguously. It makes a huge difference in algorithms that rely on various filtering subroutines, such as quicksort.

[...]

diff --git a/content/english/hpc/algorithms/img/mm-blocked-barplot.svg b/content/english/hpc/algorithms/img/mm-blocked-barplot.svg
new file mode 100644
index 00000000..93334ac1
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-blocked-barplot.svg
@@ -0,0 +1,1402 @@
+[SVG markup omitted: Matplotlib v3.5.1 figure, created 2022-04-05T01:18:41.689702]
diff --git a/content/english/hpc/algorithms/img/mm-blocked-plot.svg b/content/english/hpc/algorithms/img/mm-blocked-plot.svg
new file mode 100644
index 00000000..87dda835
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-blocked-plot.svg
@@ -0,0 +1,1474 @@
+[SVG markup omitted: Matplotlib v3.5.1 figure, created 2022-04-05T01:18:54.049300]
diff --git a/content/english/hpc/algorithms/img/mm-kernel-barplot.svg b/content/english/hpc/algorithms/img/mm-kernel-barplot.svg
new file mode 100644
index 00000000..834d8b39
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-kernel-barplot.svg
@@ -0,0 +1,1277 @@
+[SVG markup omitted: Matplotlib v3.5.1 figure, created 2022-04-05T01:18:16.721432]
diff --git a/content/english/hpc/algorithms/img/mm-kernel-plot.svg b/content/english/hpc/algorithms/img/mm-kernel-plot.svg
new file mode 100644
index 00000000..99f9315a
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-kernel-plot.svg
@@ -0,0 +1,1385 @@
+[SVG markup omitted: Matplotlib v3.5.1 figure, created 2022-04-05T01:18:30.773700]
diff --git a/content/english/hpc/algorithms/img/mm-noalloc.svg b/content/english/hpc/algorithms/img/mm-noalloc.svg
new file mode 100644
index 00000000..a4911ea0
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-noalloc.svg
@@ -0,0 +1,1344 @@
+[SVG markup omitted: Matplotlib v3.5.1 figure, created 2022-04-05T01:19:35.314892]
diff --git a/content/english/hpc/algorithms/img/mm-vectorized-barplot.svg b/content/english/hpc/algorithms/img/mm-vectorized-barplot.svg
new file mode 100644
index 00000000..610d8276
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-vectorized-barplot.svg
@@ -0,0 +1,1140 @@
+[SVG markup omitted: Matplotlib v3.5.1 figure, created 2022-04-05T01:17:55.289785]
diff --git a/content/english/hpc/algorithms/img/mm-vectorized-plot.svg b/content/english/hpc/algorithms/img/mm-vectorized-plot.svg
new file mode 100644
index 00000000..7374f73f
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-vectorized-plot.svg
@@ -0,0 +1,1379 @@
+[SVG markup omitted: Matplotlib v3.5.1 figure, created 2022-04-05T01:18:01.560593]
diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 408c6892..29081c0c 100644
--- 
a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -1,9 +1,49 @@ --- title: Matrix Multiplication -weight: 4 +weight: 20 draft: true --- +"[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)" by Kazushige Goto and Robert van de Geijn. + +For reasons that will later become aparent, we only use sizes that are multiples of $48$. 1920 + +Cache associativity strikes again. This is also an issue, but we will not address it for now. + +GCC 13. + +3.5s for 1025 ad 12s for 1024. + +baseline 13.58622 0.5209607970428861 +hugepages 16.749895 0.42256312651512146 +transposed 12.377302 0.5718441708863531 +autovec 3.117215 2.2705806304666187 +vectorized 3.075742 2.301196914435606 +kernel 2.24264 3.1560517960974566 +blocked 0.461477 15.33746643928083 +noalloc 0.408031 17.346446716058338 +nomove 0.303826 23.295860130469414 +blas 0.27489790320396423 25.747333528217077 + +![](../img/mm-vectorized-barplot.svg) + +![](../img/mm-vectorized-plot.svg) + +![](../img/mm-kernel-barplot.svg) + +![](../img/mm-kernel-plot.svg) + +![](../img/mm-blocked-plot.svg) + +![](../img/mm-blocked-barplot.svg) + +![](../img/mm-noalloc.svg) + +![](../img/mm-blas.svg) + +Which is fine, considering that this is not the only thing that CPUs are made for. 
+ +--- ## Case Study: Distance Product From d97421b9a3d47b22fe6ba12591a04edc5e406af9 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 5 Apr 2022 01:41:49 +0300 Subject: [PATCH 012/173] matmul code --- content/english/hpc/algorithms/matmul.md | 137 +++++++++++++++++++++++ 1 file changed, 137 insertions(+) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index 29081c0c..b787ae52 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -25,14 +25,151 @@ noalloc 0.408031 17.346446716058338 nomove 0.303826 23.295860130469414 blas 0.27489790320396423 25.747333528217077 +```c++ +void matmul(const float *a, const float *b, float *c, int n) { + for (int i = 0; i < n; i++) + for (int j = 0; j < n; j++) + for (int k = 0; k < n; k++) + c[i * n + j] += a[i * n + k] * b[k * n + j]; +} +``` + +Transpose: + +```c++ +void matmul(const float *a, const float *_b, float *c, int n) { + float *b = new float[n * n]; + + for (int i = 0; i < n; i++) + for (int j = 0; j < n; j++) + b[i * n + j] = _b[j * n + i]; + + for (int i = 0; i < n; i++) + for (int j = 0; j < n; j++) + for (int k = 0; k < n; k++) + c[i * n + j] += a[i * n + k] * b[j * n + k]; // notice indices +} +``` + +```c++ +void matmul(const float *a, const float *_b, float * __restrict__ c, int n) { + // ... 
+} +``` + +```c++ +const int B = 8; // number of elements in a vector +const int vecsize = B * sizeof(float); // size of a vector in bytes +typedef float vector __attribute__ (( vector_size(vecsize) )); + +vector* alloc(int n) { + vector* ptr = (vector*) std::aligned_alloc(vecsize, vecsize * n); + memset(ptr, 0, vecsize * n); + return ptr; +} + +float hsum(vector s) { + float res = 0; + for (int i = 0; i < B; i++) + res += s[i]; + return res; +} + +void matmul(const float *_a, const float *_b, float *c, int n) { + int nB = (n + B - 1) / B; + + vector *a = alloc(n * nB); + vector *b = alloc(n * nB); + + for (int i = 0; i < n; i++) { + for (int j = 0; j < n; j++) { + a[i * nB + j / 8][j % 8] = _a[i * n + j]; + b[i * nB + j / 8][j % 8] = _b[j * n + i]; // <- still transposed + } + } + + for (int i = 0; i < n; i++) { + for (int j = 0; j < n; j++) { + vector s = {0}; + for (int k = 0; k < nB; k++) + s += a[i * nB + k] * b[j * nB + k]; + c[i * n + j] = hsum(s); + } + } +} +``` + ![](../img/mm-vectorized-barplot.svg) ![](../img/mm-vectorized-plot.svg) +```c++ +void kernel(float *a, vector *b, vector *c, int x, int y, int l, int r, int n) { + vector t[6][2]{}; + + for (int k = l; k < r; k++) { + for (int i = 0; i < 6; i++) { + vector alpha = vector{} + a[(x + i) * n + k]; + for (int j = 0; j < 2; j++) + t[i][j] += alpha * b[(k * n + y) / 8 + j]; + } + } + + for (int i = 0; i < 6; i++) + for (int j = 0; j < 2; j++) + c[((x + i) * n + y) / 8 + j] += t[i][j]; +} +``` + +```c++ +void matmul(const float *_a, const float *_b, float *_c, int n) { + int nx = (n + 5) / 6 * 6; + int ny = (n + 15) / 16 * 16; + + float *a = alloc(nx * ny); + float *b = alloc(nx * ny); + float *c = alloc(nx * ny); + + for (int i = 0; i < n; i++) { + memcpy(&a[i * ny], &_a[i * n], 4 * n); + memcpy(&b[i * ny], &_b[i * n], 4 * n); + } + + for (int x = 0; x < nx; x += 6) + for (int y = 0; y < ny; y += 16) + kernel(a, (vector*) b, (vector*) c, x, y, 0, n, ny); + + for (int i = 0; i < n; i++) + memcpy(&_c[i 
* n], &c[i * ny], 4 * n); + + std::free(a); + std::free(b); + std::free(c); +} +``` + ![](../img/mm-kernel-barplot.svg) ![](../img/mm-kernel-plot.svg) +```c++ +const int s3 = 64; +const int s2 = 120; +const int s1 = 240; + +for (int i3 = 0; i3 < ny; i3 += s3) + // now we are working with b[:][i3:i3+s3] + for (int i2 = 0; i2 < nx; i2 += s2) + // now we are working with a[i2:i2+s2][:] + for (int i1 = 0; i1 < ny; i1 += s1) + // now we are working with b[i1:i1+s1][i3:i3+s3] + // this equates to updating c[i2:i2+s2][i3:i3+s3] + // with [l:r] = [i1:i1+s1] + for (int x = i2; x < std::min(i2 + s2, nx); x += 6) + for (int y = i3; y < std::min(i3 + s3, ny); y += 16) + kernel(a, (vector*) b, (vector*) c, x, y, i1, std::min(i1 + s1, n), ny); +``` + ![](../img/mm-blocked-plot.svg) ![](../img/mm-blocked-barplot.svg) From 4f3fb47f84d394b114338f3bfb8dd6fc28ac5bff Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 5 Apr 2022 01:49:56 +0300 Subject: [PATCH 013/173] matmul outline --- content/english/hpc/algorithms/matmul.md | 434 ++--------------------- 1 file changed, 26 insertions(+), 408 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index b787ae52..1a611a52 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -6,6 +6,8 @@ draft: true "[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)" by Kazushige Goto and Robert van de Geijn. +Inspired by "[Programming Parallel Computers](http://ppc.cs.aalto.fi/ch2/)" course. + For reasons that will later become aparent, we only use sizes that are multiples of $48$. 1920 Cache associativity strikes again. This is also an issue, but we will not address it for now. 
@@ -25,6 +27,12 @@ noalloc 0.408031 17.346446716058338 nomove 0.303826 23.295860130469414 blas 0.27489790320396423 25.747333528217077 +$$ +C_{ij} = \sum_{i=1}^{n} A_{ik} \cdot B_{kj} +$$ + +Implement the definition of what we need to do, but using arrays instead of matrices: + ```c++ void matmul(const float *a, const float *b, float *c, int n) { for (int i = 0; i < n; i++) @@ -103,6 +111,15 @@ void matmul(const float *_a, const float *_b, float *c, int n) { ![](../img/mm-vectorized-plot.svg) +## Theoretical Performance + +$$ +\underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11}) +$$ + +RAM bandwidth is lower than that + + ```c++ void kernel(float *a, vector *b, vector *c, int x, int y, int l, int r, int n) { vector t[6][2]{}; @@ -180,424 +197,25 @@ for (int i3 = 0; i3 < ny; i3 += s3) Which is fine, considering that this is not the only thing that CPUs are made for. ---- - -## Case Study: Distance Product - -(We are going to speedrun "[Programming Parallel Computers](http://ppc.cs.aalto.fi/ch2/)" course) +### Generalizations Given a matrix $D$, we need to calculate its "min-plus matrix multiplication" defined as: $(D \circ D)_{ij} = \min_k(D_{ik} + D_{kj})$ ----- - -Graph interpretation: -find shortest paths of length 2 between all vertices in a fully-connected weighted graph - -![](https://i.imgur.com/Zf4G7qj.png) - ----- +Graph interpretation: find shortest paths of length 2 between all vertices in a fully-connected weighted graph A cool thing about distance product is that if if we iterate the process and calculate: -$D_2 = D \circ D, \;\; -D_4 = D_2 \circ D_2, \;\; -D_8 = D_4 \circ D_4, \;\; -\ldots$ - -Then we can find all-pairs shortest distances in $O(\log n)$ steps - -(but recall that there are [more direct ways](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm) to solve it) - ---- - -## V0: Baseline - -Implement the definition of 
what we need to do, but using arrays instead of matrices: - -```cpp -const float infty = std::numeric_limits::infinity(); - -void step(float* r, const float* d, int n) { - for (int i = 0; i < n; ++i) { - for (int j = 0; j < n; ++j) { - float v = infty; - for (int k = 0; k < n; ++k) { - float x = d[n*i + k]; - float y = d[n*k + j]; - float z = x + y; - v = std::min(v, z); - } - r[n*i + j] = v; - } - } -} -``` - -Compile with `g++ -O3 -march=native -std=c++17` - -On our Intel Core i5-6500 ("Skylake," 4 cores, 3.6 GHz) with $n=4000$ it runs for 99s, -which amounts to ~1.3B useful floating point operations per second - ---- - -## Theoretical Performance - $$ -\underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11}) +D_2 = D \circ D \\ +D_4 = D_2 \circ D_2 \\ +D_8 = D_4 \circ D_4 \\ +\ldots $$ -RAM bandwidth: 34.1 GB/s (or ~10 bytes per cycle) - - ---- - -## OpenMP - -* We have 4 cores, so why don't we use them? 
-* There are low-level ways of creating threads, but they involve a lot of code -* We will use a high-level interface called OpenMP -* (We will talk about multithreading in much more detail on the next lecture) - -![](https://www.researchgate.net/profile/Mario_Storti/publication/231168223/figure/fig2/AS:393334787985424@1470789729707/The-master-thread-creates-a-team-of-parallel-threads.png =400x) - ----- - -## Multithreading Made Easy - -All you need to know for now is the `#pragma omp parallel for` directive - -```cpp -#pragma omp parallel for -for (int i = 0; i < 10; ++i) { - do_stuff(i); -} -``` - -It splits iterations of a loop among multiple threads - -There are many ways to control scheduling, -but we'll just leave defaults because our use case is simple - - - ----- - -## Warning: Data Races - -This only works when all iterations can safely be executed simultaneously -It's not always easy to determine, but for now following rules of thumb are enough: - -* There must not be any shared data element that is read by X and written by Y -* There must not be any shared data element that is written by X and written by Y - -E. g. 
sum can't be parallelized this way, as threads would modify a shared variable - - ---- - -## Parallel Baseline - -OpenMP is included in compilers: just add `-fopenmp` flag and that's it - -```cpp -void step(float* r, const float* d, int n) { - #pragma omp parallel for - for (int i = 0; i < n; ++i) { - for (int j = 0; j < n; ++j) { - float v = infty; - for (int k = 0; k < n; ++k) { - float x = d[n*i + k]; - float y = d[n*k + j]; - float z = x + y; - v = std::min(v, z); - } - r[n*i + j] = v; - } - } -} -``` - -Runs ~4x times faster, as it should - ---- - -## Memory Bottleneck - -![](https://i.imgur.com/z4d6aez.png =450x) - -(It is slower on macOS because of smaller page sizes) - ----- - -## Virtual Memory - -![](https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/images/Chapter9/9_01_VirtualMemoryLarger.jpg =500x) - ---- - -## V1: Linear Reading - -Just transpose it, as we did with matrices - -```cpp -void step(float* r, const float* d, int n) { - std::vector t(n*n); - #pragma omp parallel for - for (int i = 0; i < n; ++i) { - for (int j = 0; j < n; ++j) { - t[n*j + i] = d[n*i + j]; - } - } - - #pragma omp parallel for - for (int i = 0; i < n; ++i) { - for (int j = 0; j < n; ++j) { - float v = std::numeric_limits::infinity(); - for (int k = 0; k < n; ++k) { - float x = d[n*i + k]; - float y = t[n*j + k]; - float z = x + y; - v = std::min(v, z); - } - r[n*i + j] = v; - } - } -} -``` - ----- - -![](https://i.imgur.com/UwxcEG7.png =600x) - ----- - -![](https://i.imgur.com/2ySfr0V.png =600x) - ---- - -## V2: Instruction-Level Parallelism - -We can apply the same trick as we did with array sum earlier, so that instead of: - -```cpp -v = min(v, z0); -v = min(v, z1); -v = min(v, z2); -v = min(v, z3); -v = min(v, z4); -``` - -We use a few registers and compute minimum simultaneously utilizing ILP: - -```cpp -v0 = min(v0, z0); -v1 = min(v1, z1); -v0 = min(v0, z2); -v1 = min(v1, z3); -v0 = min(v0, z4); -... 
-v = min(v0, v1); -``` - ----- - -![](https://i.imgur.com/ihMC6z2.png) - -Our memory layout looks like this now - ----- - -```cpp -void step(float* r, const float* d_, int n) { - constexpr int nb = 4; - int na = (n + nb - 1) / nb; - int nab = na*nb; - - // input data, padded - std::vector d(n*nab, infty); - // input data, transposed, padded - std::vector t(n*nab, infty); - - #pragma omp parallel for - for (int j = 0; j < n; ++j) { - for (int i = 0; i < n; ++i) { - d[nab*j + i] = d_[n*j + i]; - t[nab*j + i] = d_[n*i + j]; - } - } - - #pragma omp parallel for - for (int i = 0; i < n; ++i) { - for (int j = 0; j < n; ++j) { - // vv[0] = result for k = 0, 4, 8, ... - // vv[1] = result for k = 1, 5, 9, ... - // vv[2] = result for k = 2, 6, 10, ... - // vv[3] = result for k = 3, 7, 11, ... - float vv[nb]; - for (int kb = 0; kb < nb; ++kb) { - vv[kb] = infty; - } - for (int ka = 0; ka < na; ++ka) { - for (int kb = 0; kb < nb; ++kb) { - float x = d[nab*i + ka * nb + kb]; - float y = t[nab*j + ka * nb + kb]; - float z = x + y; - vv[kb] = std::min(vv[kb], z); - } - } - // v = result for k = 0, 1, 2, ... - float v = infty; - for (int kb = 0; kb < nb; ++kb) { - v = std::min(vv[kb], v); - } - r[n*i + j] = v; - } - } -} -``` - ----- - -![](https://i.imgur.com/5uHVRL4.png =600x) - ---- - -## V3: Vectorization - -![](https://i.imgur.com/EG0WjHl.png =400x) - ----- - -```cpp -static inline float8_t min8(float8_t x, float8_t y) { - return x < y ? x : y; -} - -void step(float* r, const float* d_, int n) { - // elements per vector - constexpr int nb = 8; - // vectors per input row - int na = (n + nb - 1) / nb; - - // input data, padded, converted to vectors - float8_t* vd = float8_alloc(n*na); - // input data, transposed, padded, converted to vectors - float8_t* vt = float8_alloc(n*na); - - #pragma omp parallel for - for (int j = 0; j < n; ++j) { - for (int ka = 0; ka < na; ++ka) { - for (int kb = 0; kb < nb; ++kb) { - int i = ka * nb + kb; - vd[na*j + ka][kb] = i < n ? 
d_[n*j + i] : infty; - vt[na*j + ka][kb] = i < n ? d_[n*i + j] : infty; - } - } - } - - #pragma omp parallel for - for (int i = 0; i < n; ++i) { - for (int j = 0; j < n; ++j) { - float8_t vv = f8infty; - for (int ka = 0; ka < na; ++ka) { - float8_t x = vd[na*i + ka]; - float8_t y = vt[na*j + ka]; - float8_t z = x + y; - vv = min8(vv, z); - } - r[n*i + j] = hmin8(vv); - } - } - - std::free(vt); - std::free(vd); -} -``` - ----- - -![](https://i.imgur.com/R3OvLKO.png =600x) - ---- - -## V4: Register Reuse - -* At this point we are actually bottlenecked by memory -* It turns out that calculating one $r_{ij}$ at a time is not optimal -* We can reuse data that we read into registers to update other fields - ----- - -![](https://i.imgur.com/ljvD0ba.png =400x) - ----- - -```cpp -for (int ka = 0; ka < na; ++ka) { - float8_t y0 = vt[na*(jc * nd + 0) + ka]; - float8_t y1 = vt[na*(jc * nd + 1) + ka]; - float8_t y2 = vt[na*(jc * nd + 2) + ka]; - float8_t x0 = vd[na*(ic * nd + 0) + ka]; - float8_t x1 = vd[na*(ic * nd + 1) + ka]; - float8_t x2 = vd[na*(ic * nd + 2) + ka]; - vv[0][0] = min8(vv[0][0], x0 + y0); - vv[0][1] = min8(vv[0][1], x0 + y1); - vv[0][2] = min8(vv[0][2], x0 + y2); - vv[1][0] = min8(vv[1][0], x1 + y0); - vv[1][1] = min8(vv[1][1], x1 + y1); - vv[1][2] = min8(vv[1][2], x1 + y2); - vv[2][0] = min8(vv[2][0], x2 + y0); - vv[2][1] = min8(vv[2][1], x2 + y1); - vv[2][2] = min8(vv[2][2], x2 + y2); -} -``` - -Ugly, but worth it - ----- - -![](https://i.imgur.com/GZvIt8J.png =600x) - ---- - -## V5: More Register Reuse - -![](https://i.imgur.com/amUznoQ.png =400x) - ----- - -![](https://i.imgur.com/24nBJ1Y.png =600x) - ---- - -## V6: Software Prefetching - -![](https://i.imgur.com/zwqa1ZS.png =600x) - ---- - -## V7: Temporal Cache Locality - -![](https://i.imgur.com/29vTLKJ.png) - ----- - -### Z-Curve - -![](https://i.imgur.com/0optLZ3.png) - ----- - -![](https://i.imgur.com/U3GaO5b.png) - ---- +Then we can find all-pairs shortest distances in $O(\log n)$ steps -## Summary 
+(but recall that there are [more direct ways](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm) to solve it) -* Deal with memory problems first (make sure data fits L3 cache) -* SIMD can get you ~10x speedup -* ILP can get you 2-3x speedup -* Multi-core parallelism can get you $NUM_CORES speedup - (and it can be just one `#pragma omp parallel for` away) +Which is an exercise. From 823b55298830685d3eaa57b2b28d10ea91c92de1 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 5 Apr 2022 17:33:05 +0300 Subject: [PATCH 014/173] matmul intro --- content/english/hpc/algorithms/matmul.md | 64 +++++++++++++++++++----- 1 file changed, 51 insertions(+), 13 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index 1a611a52..c092f138 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -4,17 +4,8 @@ weight: 20 draft: true --- -"[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)" by Kazushige Goto and Robert van de Geijn. - -Inspired by "[Programming Parallel Computers](http://ppc.cs.aalto.fi/ch2/)" course. - -For reasons that will later become aparent, we only use sizes that are multiples of $48$. 1920 - -Cache associativity strikes again. This is also an issue, but we will not address it for now. - -GCC 13. - -3.5s for 1025 ad 12s for 1024. + + +In this case study, we will design and implement several algorithms for matrix multiplication. We start with the naive "for-for-for" algorithm and incrementally improve it, eventually developing an implementation that is 50 times faster and matches the performance of BLAS libraries while being under 40 lines of C. + +We compile our implementations with GCC 13 and run them on Zen 2 clocked at 2GHz. 
+ +## Baseline + +The result of multiplying an $l \times n$ matrix $A$ by an $n \times m$ matrix $B$ is an $l \times m$ matrix $C$ calculated as: $$ -C_{ij} = \sum_{i=1}^{n} A_{ik} \cdot B_{kj} +C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj} $$ -Implement the definition of what we need to do, but using arrays instead of matrices: +For simplicity, we will only consider *square* matrices, where $l = m = n$. + +To implement matrix multiplication, we can just transfer this definition into code, but instead of two-dimensional arrays (aka matrices), we will be using one-dimensional arrays, to be explicit about memory addressing: ```c++ void matmul(const float *a, const float *b, float *c, int n) { @@ -42,6 +44,14 @@ void matmul(const float *a, const float *b, float *c, int n) { } ``` +For reasons that will become apparent later, we only use matrix sizes that are multiples of $48$ for benchmarking, but the implementations are still correct for all other sizes. + +Compiled with `g++ -O3 -march=native -funroll-loops`, this code runs in ~16.7s for $n = 1920$. + +[Cache associativity](/hpc/cpu-cache/associativity/) strikes again. This is also an issue, but we will not address it for now. + +3.5s for 1025 and 12s for 1024. + Transpose: ```c++ @@ -113,6 +123,8 @@ void matmul(const float *_a, const float *_b, float *c, int n) { ![](../img/mm-vectorized-plot.svg) ## Theoretical Performance +This CPU importantly supports the [FMA3](https://en.wikipedia.org/wiki/FMA_instruction_set) SIMD extension that we will utilize in the later implementations. + $$ \underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11}) $$ @@ -197,6 +209,20 @@ for (int i3 = 0; i3 < ny; i3 += s3) Which is fine, considering that this is not the only thing that CPUs are made for. 
+```c++ +for (int i3 = 0; i3 < n; i3 += s3) + for (int i2 = 0; i2 < n; i2 += s2) + for (int i1 = 0; i1 < n; i1 += s1) + for (int x = i2; x < i2 + s2; x += 6) + for (int y = i3; y < i3 + s3; y += 16) + for (int k = i1; k < i1 + s1; k++) + for (int i = 0; i < 6; i++) + for (int j = 0; j < 2; j++) + c[x * n / 8 + i * n / 8 + y / 8 + j] + += (vector{} + a[x * n + i * n + k]) + * b[n / 8 * k + y / 8 + j]; +``` + ### Generalizations Given a matrix $D$, we need to calculate its "min-plus matrix multiplication" defined as: @@ -219,3 +245,15 @@ Then we can find all-pairs shortest distances in $O(\log n)$ steps (but recall that there are [more direct ways](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm) to solve it) Which is an exercise. + +Strassen algorithm is only useful for large matrices. + +https://arxiv.org/pdf/1605.01078.pdf + +[cache-oblivious](/hpc/external-memory/oblivious/#matrix-multiplication) algorithms + +## Acknowledgements + +"[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)" by Kazushige Goto and Robert van de Geijn. + +Inspired by "[Programming Parallel Computers](http://ppc.cs.aalto.fi/ch2/)" course. 
From ab322dd710898564821b2f58cf84ada7e171f845 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 5 Apr 2022 19:39:23 +0300 Subject: [PATCH 015/173] transposet matmul --- .../hpc/algorithms/img/column-major.jpg | Bin 0 -> 22004 bytes content/english/hpc/algorithms/matmul.md | 61 +++++++++++++----- 2 files changed, 46 insertions(+), 15 deletions(-) create mode 100644 content/english/hpc/algorithms/img/column-major.jpg diff --git a/content/english/hpc/algorithms/img/column-major.jpg b/content/english/hpc/algorithms/img/column-major.jpg new file mode 100644 index 0000000000000000000000000000000000000000..675d0b856231c263b12d3a246ab6216fe02af3f0 GIT binary patch literal 22004 zcmdRWcUV(f+HdS#57MM7RY2fSLKQei2uK%52rY1=6Pol6mXRt!KtQ?%Bn^Uu9-2t+ zHS{9AgY*tJdPZbs&dfLW`R+e=^AO&wz4qJIyV|>co1@;NFMwZ^6_gYJr%nR^r^r8m zqXB>%;M|$BXV0EFNB%f>?%eqcS1(>5UpFpaxpbA{#?6})Hz>r*&YeAT>hwkOUC1whQ)kHZh07POTsVFH^ck{Mr_Y={ zcb?+HO`1D`I??SNl(b^MLtNe7l3Hve_S4Jd2FV({{SZ z2{L(3kTa7ve@|vhf9e$A^ck{ueqMjFCExp%g6y}`WKW$r_f4^HHBOzTICJ)veNT8`AeKVubn@cwH%8~fvSew{{s zZYFvXxbiZ1kmn}ER=bSPFzlr!wRdufa>)lwI(+gn6ea?ii%GvaKVFL2;krCapY=pW zsy`+22|fb5b*==vX*%|#w3&*or`B70Rh_y6X!OO$Jx`W>6vhTwx)5ztOZjri>Rmmd zsmI!53TW($%^qEo16^*|9xnnfOHgQvs6Fniab@!AkLN*Ei$iIT0KFvdXOmEZ$1DB< zt9LDB$HLD1)ZjjLQ}H|Be|uXzQNS5=Uc^ zlyN^^9#cIImh@*@)TkGSKKvn;um7%a7Ge`Mk;t0{Fa5N=4EIy|OjN)P-ry;Xs2cT- z|M#~MTvrC$f4>l9uBWE%v|!1eulEOZfXVo3_Z-H2U(bL(uWow)HEx5osWYN_`!+#^ z4+zrq0Gz$|{~~55+-|12&%Jz;^2WXU zgK3fVV)$n3#Ayw-W|(~)%i~;AvBZGUb5d}4--C$kH!DRDx{@8kZd?+PC&O0C&gQqCc!>9OM?H~2qtYS>g< zqZbf2^d&ljg-Rp#YeHLSS?m*g8{Knrh7)~oH)^xPb#T2s_QjsX_4v61?se|nYt!5p zjsSt>r+W)}dzP7vn58(dBWAgs{dOW5Y46kRCw|R_XoiuX@;#7Ai1gpIX+rtH!tR)38n-8aUzK(m%eDy#|{Ifl$@Uf5DYTP42FS?;5h1gu94tUJ}95eG}&iOXwJ7oJyG7UMx4SYbf3<(xY$Kp$0-AI~owkIAl_LSf8~mqj_qU>AhcKMJbzx363}T1RaDRSFdUR|gmEY;<^|?s#Rq&QyVV0bxT!l+oItNo_55teedX@NkNI z0u#ztKI((&tEEWxo2sE%E-rVMrl#lO=Q4YYgm{DB=2AhZnkO|Ch2eLy^H{+w^#EczSGqb&6cf*!X 
zcvwJ6XJ!*E@zW1cWg?!PGgXxff%lbWXjLKDR8^>?yk*C#<#aU(>jQP5P)K!4k$WS6%a&1j-9#mEy0>x=k%By)qUsKz>!nR0F zF9UBfl$E=e{mqnpm!^q5NYG$(d{#IFrpD}J^kp&4^rpcg z6-|>3!}AYqW!#!f!c$?lBJ!#1U)lI+VX-SKR)KJzE0BJs6z+wx zI!Qy-#kl634!_>!%ImW5*3E05$P-hBt6%8sf%Ic==BQv2uT<3Jk_5H8tCMvW3k`85 zpe7h;6&T02&h27=)%yMG#xw5jR7WXf6#~j8R^;vqfyqR5{QFMw{m#ijQns~Nwq_3J zSW8lTYfi-%Ir#*PO2Y<6C?4dyw4<}JD9#dGgYBe=MVYEck^Y)sdO+?)ql?{Zdm|Ax!dEUS#Ba)kg_ro3 z-_ZHTVN;vFzgs|0BS&R_shQ9pLkNSIBonv<_pPfY9!9tv0faEyVQ}gC%KgsR&0gPn z-Skoz<@zfd{jWuTB}TuL!%ova2bLT*OS`c+iT`+>P0gD{p>|kyBoIy-ZAobl zF2xEz6dX$FPtUWti3dT$_6cm>ns9M4TFmfNoh?*o;0n~0)W#MCc3?$fc|6)77ZY=9hXj*@6ZG(3 zLs?cAv%1IU8Ch%hi)}Uvck4)>KQ{aQlW?U`OBcuWJsrazwQ4VB9|8FL5A0Ktxyj7(*tlq-$*m~2yztHf>W^sx z1Vf|kj)Cj*iBXzJp7f;xN8`9(qHJ zzA~iPV;!@}(i`*34Z<_RzUa_3?YkqtEFjK3PHe69;=f)uYUvz~>L-lS$W9IzXV!N6 zeRoO@*(q$t7P0Ga1dTbF)8#subaN4S2W`Xf1rN|Q`{$!w;Pxw=SRB%WptEex6Z&UHIX8=~jvXNWBOv2O&ip!TH9kal!|aLhwJC&j`|C}m zFYuxJj7#mum5!skB#6@mtSK_j)OoW#!#Tf5>gT$ceE{6v z>?3!@l;g%|a_S2jQ_0&Ux>norUN(JQ7m-P%hD(iVac(9MZIIR&%kL35P40UVy`&vT zGv%hW$q`^0-0|_))MH;o?0*kZ7el6q*b>0fb;^b#K((3cJ?Gdtc`e?bsyi7v zGv%4}N<*?(avy=8M2foW>)(3sH`mcMm(UF7PgJT>6W>bE6NF7>1Fp_L{XQ1}0OzF$ z<#8EnVYLG&Q8uPNjAOn8Bq)BhiGX1j-oGIN9qH`kie&_ zYi#9aYj!-Tw{nbaMZLVd4XM4iE(4R+l-LP5u0-V)`7oJeEjUiu!nUnqGPm|g$iP{5 z{kJ?g{UOfyi#oT4mp|OXJLSo`VZA~*xKa+|Kl|R%Fvap-x_<;%@1~r+?9HU9tCLK~ zaD9DRE0q6tSZukMcLH%ATNsi~-o79%fXtzlQ& zWqcWH^B|0*u}Ki}C~nsj8# zE01>>_Rc*5)M1hYni%T{6&Cq$2&|-x1LSa`vx>JPN|DnNKBv zhE6NYj|MWWQMWFmL&kp6FF{O@Pb}3BROdVy)HfA{vbDsdi<@n7#~^Cj;GSyi&D6c~ z+H!h6*_QcNctxMd7(B^9`?xnck3C~hx+y=+CuhK**#^u~0Mc*3Cj1g@YT#AckZx?! 
zooj(U3oZM5EFZOK*+rSwa-`N4{3;$AA!c0fvushW908~SjQb(WpvT#yVSWK;Ym6zh z+!g2?8$nO*B@O5<4SMXIY+*~3%c&IG$i+8<0upwdkVmy(Egfx19t$8vu3bzLVT%g9 zZ#T3^-PnM}T5(w8z%Z)jvE^v!RF40(k+#G>uICvVO}Qyf9SGc-UU|UsjJj=9g$~!0 zzDAMN;lo4ar&kN$eJs_*zY+9poI_qIlce$K>3X9N^CEQ`DRoJDx7Ov`TaDP&e~0Ls z%wjVI-O{eBv(^-_q`P>Y?K`W5tY`WRm7k2m!?ZzRnQ){NXOL@fa0%y41I7JuiuIU; zr72z|zSKj7D>k&-`i$Wb~@%J{c z)dEuo`lZ#Px|AU;IF@@Z4todr^a_WV&DU1FnI@SYZhLwCec^5|;Tw4k2Fa$ce5949|o@R1u%|ndA@sB zAd|nsb7yHo zJr3{GG_Bm@cMY2m6ck*8AbI|>U>k4*Fpd}v)^kEX$EPO~u`g|In+QqfKDShe0%EEX z>bhW*p)i**PU~FGhE-$n1Bn;AYP5u|=(A;xyUrKZ;|0tyTsMf9y>h>tDr@=si~zW` z)Np8#(<1gQ3HHj}++SF4nt`15v^OiB9xaGkPI;-skjr~r!=tztch8l^JUw4^H;$WM ziGLzzCYUPa#pEDqE~rEwD-e9;w#ys}tXD4Rkt5t|t0(WO7g(E-N1&YmsvkD>X|}{^ zNl&UBa)hgi7{OdcK4EbU7m-PVCT}}w9xf(S=qPP+`Ga{U_EXF9*hQPa;Mn&l%dym} zewrit5lkIHOR=Nz(;IsNs*(6`$=Y|mmKC=mAUgWpX1Fldea4$WM(h+8pQ%Po?NJt` zd=Z-%q6QwO1j9J8=WOygn=`t$(n&&?h;lc{58YOyO9)*()-2{-#lp}mW=xIFQ#C=W zYzixahU@|GZ@UkYYr#4Iod2+c@!4cESj)+uIJ2N~9aZXaNh8TZMX_W+EH7k>HpM{gddSlA;1Vnlg}f-z)6XdKL6 z@jP8YfZD8OSl(-XN7M-@qcw-C&VB^m7YewX_Z*p0tk5SFEhokv4(@SnUXi$Oy`J)i zPP}a?#CYMRnDJHd<+pf$UwP;WT8%t))W$FJ+0qa(nO(0kG7+?r$w{kQoDJ(5k=1^o z#n=@a6vgaWY(1ymYGNZ@VgIjEHBug_S$3eYNj>>~GuulHkq>F)jTaf_GUYqLW5Xm7 z;Kz&U349RfMO&$93-{^XgXgb{?tENMiW^BW-Msu@-CYj#H9?Qu99Z0VksZ!T&)0Wp zi_e@%b$t)vN)%_%#i&;;h3RC`OocR4219@%?=Gg3`&7J3bqY^bv&A7p`Fjb1#yQ=T zIBS{74^00uxwP*`>_?x?*N*7wU4B$6ddH1uFolBgX0~x?_wjE{q}=pLN2Z4$#??Gx zAEZ$V`USrwikZ=P&C8I_5W85YKe%>aiOEAHpmP4Mj}HSCgsOKPN*OPK%3sA z3WTgbF^8nZv>yS)t9&`q_liO~a<`PpoeIVny4S>H>k9T3FR?vLwcNVAA41SiLrgU2 z@B!rsb<>KvbD4e*3Ow;JKWzw+Re+-86pTIu)3r8$jKAV5LM`m>IsWohq;ZcEkOFpS zgH3*VB_knk+cIO_g2m=d0V|&$-MlniVE-gvt6=g$wK}H_DR$+4Qq3h!-H9m2kj1)z zl4+~>nc#8A${i*GFQ^7lt>;_bDVcomx?5%(cI5~_pF}%RdHM)&oim&0D^>jtXx{@k zd+A!tp?SovhST08Z~Y0n5keAr*1t$3r038T@x>s?N1ORQO2C$$GdH0Ph+qRM%uOtN zY4dz((cXqUL~b;Po<7jRlX^Fh1y?IL*UO!XLSmkACrQguA9SP^a!s`BPl<(t+?KG3 z0U_t{Is$cEo6%oJ?ym7!4&Hxvf)1Jsa&c@;$S=>9+p?3Me!;j?Y_mrt>()8UZYkwh 
z%}wxD<6sgM)Nm@*QGWBt&@9lC?_)7SDS4uEA`eXzbXAl#sBX|V*;H`@BHDEMeqpDg z`bG32-T09Fr11xcE5w){n{-=1lwf~zr*^i9rNgQ;CJoK87T)<*Hm%*8PsvthhI_8_ z>G?hJ9}|md!acAmFdpj{QMi2sIGAmI_1Vq&*<)H;n7x>)a=n3`bj=q+t28C^7FMSv z2s^0;4W^K-D`U0{OwuXlWEP7O%j-#KJ}^5p^(@awX&T$g?rRlv91aghyB|}{8&;b~ z0NasA;Wl|CBi$NPVhOwT`z-?AOo?l>Of%cNfj=oy>cBw?xngX-rC6zs(rVNi7dZ{+ zMZGFf+CIW90j;q#Fzy>ZCzo?I03UwsV`mpg0gup~F|~P1MA7qbQRs0txj98jsnRk| z8Xw5WAZ0MLWho2eV_O{y4q_WoJ7H6Y1b^Ys;abc4)wNzSGVhNmq+cfU$inX5ykVK& z>Ii<<)X}?lVft<1vSMwQ)G;OV_lkytJI&kkmv)raUf%aVOw!c*{)lB!cgt5E?-u&H zUmLHgqRZK}fo}wn4_lvui(Ll|t#SH9-@7jrk1Y<`%}|tRxpHGX*Kzjos+ zf`@bK(4C4kQKn_-UO8`LrNVh*D8|0j*Of>*Sv#=&oy|MJ?D;@Xr5AJ%{J=V zXJO5-H9ov6KV_Xlu&#%)c(_;8A4`<_At$O@n;=fV6{yRYd@Qjv-g!%G_SQk~7^e4| zS|$wFclwfcHf8@vySw+}!DMB-*9Nm#ULNkX`yCQE3#B=uSWcUJK{eToP-!26KUPrMtG2&_TXmzp>`+krf)EWIt{a;zU6?u3I-Jk!Yca&$>L%4)*_tts z7PnMCj_%j4(adT4Dt!??!%6R|~EwMb1AT>q(whUNNU`<;3hdJh^5~`}r5p+a{_L&PQ$F{j=2Bzp0C!}e1+SFuNcxcO!4L+KVtvil}1?j{;4;boTm z#15mIAI-mr81CHnPW;{gKk8Iel8R}9T`TI2pK>Q}$C;;@$lek6DiBc#CyX=^WR;eT zeLi@~I`r{Tmqrmnfz5x|f#jowQtefDoJw82d%pT$ODs{PsO>HBSx#?)v6`BsaWeO}ROkv<{>K#?{yXfvV~ORzpq};ieIq z&Bhh0JUs6OpJ^hE7M!ORvlbp53}|>(zc62^IQv8{X=(u>%66-(NLV3{QeLC};KJ6i zrK>!H?fiujOpCI$MYUviBQ{gU2gk`xUN<^NzEdQSNIs!H;0*19tA`&e(h+F!=FzZS zfn-xEH&Jp(`zzLi*Cu{D5^VYwp_57pqgYzmZq$AuD99bc$>Bsk68bN=EtlB7g3==qSnopI5%smqvZJIrC6=fTXo6Nio<b4^(-yo*wwsm&Axp)RJbNI06h1tv;J{nrIH#epQL` zeK~UQI_G;%?A3tfbHjuM{v*J`%fxT6kF69Cpv zkpt(BTuNd-T$+|fW2XggO0yYp3`5%GL3N>dUsV?fge}KDQb;iy$v&zBxK`AxP_&C( zl5SRKca9@Y%1=PF-Uf%>xsxl<`Bp5=dI4hYd4l#x4&DQZfxw_Kx$~IfdV>ISi4Ox- z1-zWU!?Ac<@YPH+@Vd&C!PlSWAR3>S^D-b%IDTXdJjTg70UD^ge`^v3@sqO6%0z8T zL(~|BHEtY;6$MR;?MFjycwW@d-U`JO|j+wt8a_uEK<@u|H)MMpcUL{|2=g9m=VzCa8A#&8o`EJa6=I;HQ z+)Tyy_Tlk3T|8KxC%15r9DB^ljrBwW-m!AD34Xzk{%2J$NmA1Ji(B7Rhbz zEQ{2=sjoiK18kGXkw8ZjLH%{GE7-ZjfXIu%1Xe9?_K$;(CD%hpuC{^@QN1aB594ws zUbDDW(eqL@-N3r8&d|`%pbUWO;6uA(+>X&nnhWVyk!V@S$8Vg=z|(E2vm7#$i_a>O zxftVBbgTT?2Sq=bF@Z)ed=R?3pe7x0RxN_^55RB9dD|}O>qZXnSM5z>J)<27?VjYr 
zM@6xZ#dqS9p91R0iO2u__1IgsdBuiVG#^(w@Wu6`p-qRbS$cvev0HUw!XnWMT|Q<1 z^n(k?(OSN`59-rInJDO$JM?Ts`0VsmqfD_sNWX^%hLPH@;>s*8wcIMzxBs*p_uM%^ zKquiN{N0n4BIGURZs7%m*&Krd;*f_jmKXvu`?Xd@sA!2K9yLKN&~JY+%PAd1E3nY% zc1`sLD__@t;6?XW3j8)Y?kxYS>VQlYFaP$d+=4f8Z(^&G-#33qc$m7)ry>?R;p8BV z*}zf_t^KP0q^W2*%Br0ifYQ4Zm z;vR$TR}(Il^*iz9iW2Ix7MkPFi8ki@6oGDR!LJ`4--WCc0pxsb>hOk*Y72~?j&1(Fxr z(cNqU36xh8aHuiRIk06rpoW=_^WpsRDGo#MIc3-I%mIOPp~cTuVo2o%7molSsP1+n zd?fR4{zzQL^0Ue|$G64EoU6rf@nHeSa*gY>noL)Dg^?x@V5>ZM3J~DFB)LsiD7%j)yJs2>yNQ=D64k!H!mKS)pC!8oZaF{K{?|hLmNmAV}rLWat~nSns!MvXZ6!{gbjk%XGJE;Uifu$jWDBygSp9 zIviX%G&uk4A zY_`G$E~YzBaO3&FRc!eKUt&4;u9Rb`guYJ%19pGT^`y&m-vFMfZ($wl%$NF%8^6A( zbDHbt-CJ*XoS}iHdVX20S$?W!>c(NBs|%dW=&yB z7V+t}QHkZ(BBj&M_u1e<&S@w5WK5P@fhSV-H?* z%3JXEQMZXlmzQ40)KxBsc+>zvrtf_S)ymbN7zk-ix&iNlFz$9voEGSItq&cDH`$zu zZT55gwq1i~A4~+=t4-Wqi)csLkwj=P54k z<^-Zcf@Tda7ZYXJA>&?LoY)|=5uN$~bUZ1gXJRu0pNy=q30sBsN8Fk-i2S?({j`&? zt-ynG7t?fs$bu*7=xbb=N=`52ZJiA1g-QLheUz)t(Dh)W$@@{u(?TVk9c&E$4E zn62|welPuV9T8WZRrx^-xkd(?A{-vO$pNJSry~CNcfoarvc}G%1|n+U)3TJK=$St} zhZ&sZkJREzPnHltEqQ0b6%gT0YDa)&UJ?ZQ`r60%x>cS83p+AO1r_qpK?@>k&4?>y zae8T{HbP6E9*RlPThv`Ljj?EHnP-o|cRDUnxpWM&qs*^eeoza#Z`C6P^2>5+Y$P9j z+0Quts*+T*_@oa6t08zD0Tv6;7+%Z$4=`S~=+~;tAB-7v7yV1|WxtBO_D`|zW2ZR+ z?BBBxz?I$;yU4S&?nQjBWucyz!|sS;UX@Z5liw9tp$f0gjC1REYEXT)5gjvU!ui_y3=jS+}E2?`#UXdlyA zD7KPP8oe6bS*tq7+0jWL>L@C)whOkzh``gsN!pVh1s=I1*K6-srZe1n-{{gb39n}r zsznXii`ER<3z6!kNwwOmW?_qvAswrUcA}#_5=B?jAKh>{K~yEBH?ymICo|SvyV5PQ zj%(YgOHNW@3Zxp8X$#MEVuw#Ck}};CGE};w8mr$0H>tqeJSArX`r@ zfwZ2~=x#hVo`2Q44pmq$tny)VD8cS}`e#Dxlhn@`?A_aHMoB{~x@dF@$Su>M2YE-z zx4=~NT8ggoJ^mdCG>X*KwtqcmoSQ3Vg`JSi9-@PbW>*BZLeyjw;?tLA7-odRQpVrV z+M%Yl=F{?`qiLF6W)C1EKOEFV?NJLSPAP;ImrVN8V+B~PQOv}`e;3Bz%Fpor+`T+H zF%X*a4R^G9Ij&{^7+SRJE%V#-7GE3ixkPBTYZtiPVpXHB>VZ+ny$JMxcyCFXM_OyIoI^ZH@clq@kzbyZXKIn33RBtrfR`MYo!ajjf9!OAsKHV?D3SxEdoS>*flo z&8N_jY#|HTLbLG9yHcdUFVYG z@XlvFrz|9{qkZm8t8N604E%yscTUVD8XEDlMO3~bEBjpmM)7Y3tiO?dNBvfTLC+0c 
z3>k}24%v;l=iJpO$1o9%jpvbQh{t29YsbcozhKg~ec>$6Q2Vone~1Yzjn(jb_0ra?cn@OZ#$%BHWfm2w5}hM zh!$61fVg6aU1Kpy^K}V9QLL4J_X}6aFw(W>~5V^ zs#ybWyRj@7j~n0hv0VA9wv!Uc1-lTsRMb??-h?O(!I>DXNYQ)fp=lmJp`|5Tq;vkLzP(MC)@F8Gw}Koyoc!!#|^8? z0FAHHm!|`y|841W%UfFb;J2m+f3H3MA8Y?L=kU_INg780w=J!ly)x zo)qr6O!<_^lbTUUaUrY&<;kZiTlNP~>~oA~0+V(M)RnrRXWT(j67h3GY@ftEM-p5n zlH{H~St}Qvt(Oe>BBGs*&+HTJ`C5n(ef+vBT|2fVyS-?r#k+EZp|CT!89XjmxKG3G zUA|^--8hMn47d2F?Zq}&+jC-hTWk)ATZ;H1Vz}6}3s-;t$*O1|ZoMlBWF_qBvao3V zx7bt2X?x2~E!XR3KrMr<>1lJkzpc*pc3v&#uVf852o!!v=e(D@0+Z>=$)rg1MoKz$V}0-a%2ef2G?RTHnjFy8;#yz4|3 zEbcj)us)Fl@(=;4Xw6$pZQetAc!zz;-Xe-qw~l%rD(T7d_BJ-vcg+ zhbY7Yp%JoFC?mUoS!x4!(g-k{^+r>k>Ah?OhNHK}Vub3i;HiV~8 zD%{zxeNNC#o%0{iQrFp3#dUKerze}fkP~EkisPh!6G%_n+#MQnNq3Q{k_YGOf^!jBXTD1D=UhNj8dkZKy)8kcX}>$9s%yCgEVRM zgOeUD;-$iE*1ii?yx_Rx&@ZjPU*Bb<{Wal1L9FnL!Y#2y=2TjDKZ3aZxYnoId2f!DCrCdo zq1fk#@@&Vf#h`b>X1Hc3!R1pf?Fl7nUwYkJnwDF$_s-pcvjXLCU-s9&;h~s}ZkQhap1b?TH=<&}zfLqo zE1cj`k)1B;|g z|6!-6RAPxrb@SQgKnVsMipm(bhbWQXM10OT3UPnDhww;ba>Vd62f5+dmER-v6{{kG(>KEuUS;n%uVNQ$I#v3(fOOsGP$lWAK@&W-ZBj9MRbB<%eV}t$37?wvuqk-2BtZ*Q7QJ-Gn(#Up9to z;C%B8I~!V&?92{$(3d5?2){QL zxZ({s3LV+IajknG$3~iZN-_wW8n*YbOUg_4u^wZ#!p2HbHjci$^m4~~^B4^(k>kXS zaVdU(tm5E~h6$FxY7as#%^L?Q zJAkIhNMFVlK0hYzQrd+6R1s(QD+D^s4B%?0xNA`W8wkqBVWDlBjt*|*$R$`3uH z?FyjChiAa&)imhqMY@D*87QM$TFwAdtWd9aWj+N}mnusk|;V#hWssNnQ1mrp4$>;=*5kP9nKb zQ)OV2BRUnP*uV0yNARR4G($l0PEP4)6#4BVY(vbDcfq1D^GAc64CF+&97n@4wM$z) zrM1B-n@Oq(d|8Qg#c@?(9KxD!Y+WQ|vjp)^jf`CO&4lhVgFwDOcKuq7o3mIOV#tR4 z;d+REG*SZQqg2v+0-!b#P+8_?>E|2X%YF5YyWEE#U^VeRU+SfCD`nnmeK8@y7A9)Y z!44(1K;ziKc1>*}k0WalH6_eNiXB0+k5kKGFm+@AL3}@B-9F8}3nLH_K+td;n{#|P zSiYSBOU|3F&|wqgIw9>%m0X1b`LS-*bgu`6u9))?cra;rXZHAz56j?(F$#IW5fg`{ zEPCs8?suN;4Ub{KOA29HK3rkQ|dd1p39gVib;^zITa zCq9n{>zNd42y?h(LP#>&^xar4t~-5fn=0m6gz@NmJiC;8(7J zmN3rBM8lfM)j0!;d&AR%*gH$;*Exka_X#D_9UC5jdXh({>!YCoqy)7`r}i~$wh{8P zfQRZflU*$<9)oca>#{Yz5V5K}4bqu=>9)s2&+Dh2P?sU^K=V@2VH0}u2w*XJ1lYb2 z&BhTDc`H4(Sh=phr+#-yULNWnzmlRY%xd2=nw-=A))%ie81xf;cfZDai<*4apkri^ 
z3yh{|K=YBpkQtF&qL*XkBhv)HV}4^;wp=4=;JC1=4iWppI#+xnRkCpEmwo2czfq7H zsg3}&byW_|e(QGX6VtMuAs58Zc4d{UVHYR?L8Xp$}hTh=a=2WjDUM;GL_8~By_+dI;DmsN<~#f zOqPbiMfmO0K@V+zu8dC7`gqJ6JV}`69_1BWIYg2=L8k&lMPRp!e?OmQ2Nhs*9e3J} z4=H2Y>wD=dU3nGh;|b%})?{$E;ip>7TlAT)S(kShL-NdoP0GtlNNMFkRs5<}^}X3+ zcMX^fSHT0wK|Lt3y09YsC74pWg+pAlKS}s ziMbEK9lTifun_bxS6(o7nTLxdC_y1=q3zPa2M@7^`;B^AH}0OGd6_h?BYpzUznUK7 z(W_o{jPo^e_Pg#erwA08y12oW^|=2isAm)+x$6|QB(-`Q;!Po96qsr(e+kOcGr{H^vEJZ?3A*C zA)LF9Re#i-F=uW`S`&@#PA8TcV68Fc2))Jajevk!Ym~Cr#@sUXmy9>pYiy#@DBA%!(K_YW=$R4uw+(Ff4re`&#eRiH$efte6Y?A^cKUB=}P; z$LDbzjFlC&IlHfa{69Qi%dlKS9uN`^x;MfD+Wq>}4(Gj^=ZoBYE52^q(JsCjSt`iz z5e?mJaM&{FzQW=5PbWGIa9o;OJwYcQo79lY-+ogEn?>)p`dqm|dduGyHEDbV;Pq-8 z)tHNY|JA6W%=nKGOdU3`a`cUV9(j969M1mQAlpxVA#TV)FaGOd*_Uk}q3}(jKUtIv zo#Gx$b}w_DbIki^Ml^2I8v3y4`Y#xWIy&w$&mnb*(pSl>Vc6d_Oz8GprrapMuHfC} z+d2eQ1Hx{|R$7(YrXK-3OHD^}jq|nrONMh2JCChf7W-jcLiYulpNIXz1|n_E1(L1- zRul3yoFv_Kx0L?W-D`J;KQ9Epy#QxeKYjQ1Rg`Jj^~EDVEp8JNGNyRq%Qz}<;QDm| zpZ%w&7apH{O(?qRANI#E4!G2NqY~8UF8P<|Dd68skrZ0%u28_GM*zTI&sA46hprdA zI2`kn9eW3`l@bBQW_!BMX9aSi6CV)wfpMvHeT6WG5(f_Gw~mh_=J zy!+9B(~SVY$x$`(#Op80+Cx0GdVq7?0DxNm@oOs)H04(}l3!IWI>k)CZS@;NQ3B`c z06_XLfd2#u23$S~);&qTos&yN_zdM8gs>D_FWQ0k-&phm107nLxNinx|y%D7hu|9)C^eeA-~qId|Un#i8itI zkK8oU|7q{C4X(jVtTSKRMPv8Uy+^+I)YlNnuWl~;KU?FXC%=)+aaY(HWoe^o8-HBH zR(g1$Dba@fs^^<`)+kd|+vc3B43`5A0%~bk!fWc6>lvHv@~M>QzADGs7m0av>g2f? z({r`yZx#0)0R*!V!7m6-C-xl`EBv~H`A*Iy*%F-p#?EOYeRoy&pO2rY#{H=zR$+9L zfho{#A*9x#Kz}@eF1jNh5&fEt#FHuyQ9#1W^)mh;Wi|Q_N0R6LH3;5JyiC;B_V|#` z%zVP0CvS;3W*bd<7wCsp#TP*}hJB@a28u`tF8c@OU%0?a_z<_>x3UVHaiX(3yR}|> z?vgq7DJdBxXTVG0{9&(s=NAX(nu&gXIjzP7UM8Pw&kI)m=8WpVqPKT-%6! 
z3QubNM?su*8BT_G3T0WFlAvcb0=Q!iWY%tNz)R7@Ri5mS^3C_iQ&{)=#DA-|r;F{2 zWB!OABK17ZV+L1e`YsHY4Ygzppr+^V=aS#Z({6>_r#o-TnCFy6x?#8LqG$BVhN5MB z9M%n#Nq?Vu+7&|Vs?ruiMV={z;F9Q28m-+@B>kfH_-_u}rk@+~b+v!G41rH8Sd*fPml0uL`S|Ejg&gR$g~)pDO8#9F z#DS!!C{V;k6F4C+FQ;$@WG@NOc^z$R^Uvss$!MR-U!a*g4 z)xS9yo9CP=LT3nSit_GEQf{{y2 z_4^CMpGR7nM!tFMr+kX^q|gU9j%)hh+%zmRQX+Q7vm~JSK~r zaizcbEs^*v&^)H_v6-N`1dc5XT>}(Yo9U8orx0(QKW1cfDqZwHsQ%*Elob7IaecE{ zG<%6#iSeU+WvQbgD?PBuWX}4+DYj??j9Qa5~GrIUbd)e+&M}438T5Nst*xKdT6^-@ReT>$f?NfW2dh*QT zNsD(W+i|n>-JO+kwc_fUdl!ZKL!Hak^`+Vd|K?miD?0VnpVLe4ZMs;r!&2wXj>4iSPe2Y>mozKI^KZXV%tBQExwNld`giWD(pxu_rljt^bpbDp%l+ zw~$w}7Zblw(9mbfjhoM1IwJ! z;Iqt|n;*^F{_y1as^*fbrcdn8f4$x%`B7%M_6f7rqD@byomzG6sN3m9hfb-xd{?#6 z^j!C5ZFSVqt7nV$*OtUD{>$KyE|HZw^X3g`7GK&T#8*iYUAum3Lq2a@fIw_3tMB@eliJzFMs~ROZH1QPFj-S4As* zepW5gdUoYpuqklD^m}H)r1!m@|1Ouj@VZ}Q{Wz(9%Uvs>?RRF+OO@U^FSPDvSd;fY zGxguTmltZ+ohGqXZ?{8M$%bg8fHrp;pI!BA8~46*`qxfp-F=mK1X|Lb~O%i8a1`DVQjZ+zCcC#rx|Uis|e3ZPrs-yxPS8QH8h+8BE8qvgwe=98?n zFNK8c)Y&?}x&Ogdo_p~JL8~{9ewq6B<3{~k>*jLCzVj3B__rte^7+Zr=gf_m-kPDZ z`*BkI)>tc{-FNza+?f70+^=QbceS>zZCO#F3qWC2cNG{`??GV&Rn{w-6BTzFF_?JD zVD?ct$8BzRbbZ!Mn#5D)kuocb^X08cx;}YsueGIDz3AE#@3?yQjKHkKX;fQyOK$oZ KcKVqA-vj`yOfcI3 literal 0 HcmV?d00001 diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index c092f138..7102bc8c 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -5,8 +5,6 @@ draft: true --- ```c++ void matmul(const float *a, const float *_b, float *c, int n) { @@ -65,16 +71,26 @@ void matmul(const float *a, const float *_b, float *c, int n) { for (int i = 0; i < n; i++) for (int j = 0; j < n; j++) for (int k = 0; k < n; k++) - c[i * n + j] += a[i * n + k] * b[j * n + k]; // notice indices + c[i * n + j] += a[i * n + k] * b[j * n + k]; // <- note the indices } ``` +This code runs in ~12.4s, or about 30% faster. 
As we will see in a bit, there are more important benefits to transposing it than just the sequential memory reads. + +## Vectorization + +/hpc/compilation/contracts/#memory-aliasing + +/hpc/simd/auto-vectorization/ + ```c++ void matmul(const float *a, const float *_b, float * __restrict__ c, int n) { // ... } ``` +![](../img/mm-vectorized-barplot.svg) + ```c++ const int B = 8; // number of elements in a vector const int vecsize = B * sizeof(float); // size of a vector in bytes @@ -117,20 +133,27 @@ void matmul(const float *_a, const float *_b, float *c, int n) { } ``` -![](../img/mm-vectorized-barplot.svg) - ![](../img/mm-vectorized-plot.svg) -## Theoretical Performance +[memory bandwidth](/hpc/cpu-cache/bandwidth/) is not the problem. -This CPU importantly supports the [FMA3](https://en.wikipedia.org/wiki/FMA_instruction_set) SIMD extension that we will utilize in the later implementations. +[Cache associativity](/hpc/cpu-cache/associativity/) strikes again. This is also an issue, but we will not address it for now. -$$ -\underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11}) -$$ +$1920 = 2^7 \times 3 \times 5$, so it is divisible by a large power of two. + +Slightly slower than. + +3.5s for 1025 ad 12s for 1024. + +However, now we *really* hit the memory limit. + +## Register reuse + +This CPU importantly supports the [FMA3](https://en.wikipedia.org/wiki/FMA_instruction_set) SIMD extension that we will utilize in later implementations. RAM bandwidth is lower than that +The latency of FMA is 5 cycles, while its reciprocal throughput is ½. 
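The two FMA figures above suggest a rough back-of-the-envelope check on how many independent accumulators a kernel needs. The sketch below is only an illustration under those stated numbers; the constant and function names are hypothetical, and the 6 × 2 tile shape is taken from the `kernel` function that appears in this patch series.

```c++
// Figures quoted in the text for this CPU.
constexpr int fma_latency_cycles = 5;    // cycles until an FMA result is ready
constexpr int fma_issued_per_cycle = 2;  // reciprocal throughput of 1/2

// Consecutive updates of one accumulator form a dependency chain, so
// keeping the FMA units busy requires this many independent accumulators.
constexpr int min_accumulators = fma_latency_cycles * fma_issued_per_cycle;

// Number of vector accumulators in an h-by-w register tile (hypothetical helper).
constexpr int tile_accumulators(int h, int w) { return h * w; }

// AVX2 exposes 16 ymm registers; a 6 x 2 tile leaves a few for operands.
constexpr int ymm_registers = 16;
static_assert(tile_accumulators(6, 2) >= min_accumulators, "enough independent chains");
static_assert(tile_accumulators(6, 2) < ymm_registers, "tile fits into registers");
```

With 5 cycles of latency and two FMAs issued per cycle, about 10 independent chains are needed, and a 6 × 2 tile provides 12 while still fitting into the register file.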
```c++ void kernel(float *a, vector *b, vector *c, int x, int y, int l, int r, int n) { @@ -203,10 +226,18 @@ for (int i3 = 0; i3 < ny; i3 += s3) ![](../img/mm-blocked-barplot.svg) +Avoid moving anything: + ![](../img/mm-noalloc.svg) +$$ +\underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11}) +$$ + ![](../img/mm-blas.svg) +We hit about 95. + Which is fine, considering that this is not the only thing that CPUs are made for. ```c++ From f08193cceb6cfd7563d2d558ab77533e6c4b6ded Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 5 Apr 2022 21:43:45 +0300 Subject: [PATCH 016/173] vectorized matmul --- content/english/hpc/algorithms/matmul.md | 147 +++++++++++++++++------ 1 file changed, 109 insertions(+), 38 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index 7102bc8c..09184bce 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -44,7 +44,7 @@ void matmul(const float *a, const float *b, float *c, int n) { For reasons that will become apparent later, we only use matrix sizes that are multiples of $48$ for benchmarking, but the implementations are still correct for all other sizes. We also use [32-bit floats](/hpc/arithmetic/ieee-754) specifically, although it can be [generalized](#generalizations) to other types and operations. -Compiled with `g++ -O3 -march=native -funroll-loops`, the naive approach multiplies two matrices of size $n = 1920$ in ~16.7 seconds. That is approximately $\frac{1920^3}{16.7 \times 10^9} \approx 0.42$ useful operations per nanosecond (GFLOPS), or roughly 5 CPU cycles per multiplication. +Compiled with `g++ -O3 -march=native -ffast-math -funroll-loops`, the naive approach multiplies two matrices of size $n = 1920 = 48 \times 40$ in ~16.7 seconds. 
That is approximately $\frac{1920^3}{16.7 \times 10^9} \approx 0.42$ useful operations per nanosecond (GFLOPS), or roughly 5 CPU cycles per multiplication. ## Transposition @@ -79,76 +79,132 @@ This code runs in ~12.4s, or about 30% faster. As we will see in a bit, there ar ## Vectorization -/hpc/compilation/contracts/#memory-aliasing +Now that we are just sequentially reading the elements of `a` and `b`, multiplying them, and adding the result to an accumulator variable, we can use [SIMD](/hpc/simd/) instructions to speed it up like [any other reduction](/hpc/simd/reduction/). -/hpc/simd/auto-vectorization/ +We can use [GCC vector types](/hpc/simd/intrinsics/#gcc-vector-extensions) to implement it: ```c++ -void matmul(const float *a, const float *_b, float * __restrict__ c, int n) { - // ... -} -``` +// a vector of 256 / 32 = 8 floats +typedef float vec __attribute__ (( vector_size(32) )); -![](../img/mm-vectorized-barplot.svg) - -```c++ -const int B = 8; // number of elements in a vector -const int vecsize = B * sizeof(float); // size of a vector in bytes -typedef float vector __attribute__ (( vector_size(vecsize) )); - -vector* alloc(int n) { - vector* ptr = (vector*) std::aligned_alloc(vecsize, vecsize * n); - memset(ptr, 0, vecsize * n); +// helper function that allocates n vectors and initializes them with zeros +vec* alloc(int n) { + vec* ptr = (vec*) std::aligned_alloc(32, 32 * n); + memset(ptr, 0, 32 * n); return ptr; } -float hsum(vector s) { - float res = 0; - for (int i = 0; i < B; i++) - res += s[i]; - return res; -} - void matmul(const float *_a, const float *_b, float *c, int n) { - int nB = (n + B - 1) / B; + // first, we need to align rows and pad them with zeros + int nB = (n + 7) / 8; // number of 8-element vectors in a row (rounded up) - vector *a = alloc(n * nB); - vector *b = alloc(n * nB); + vec *a = alloc(n * nB); + vec *b = alloc(n * nB); + // move both matrices for (int i = 0; i < n; i++) { for (int j = 0; j < n; j++) { a[i * nB + j / 8][j 
% 8] = _a[i * n + j]; - b[i * nB + j / 8][j % 8] = _b[j * n + i]; // <- still transposed + b[i * nB + j / 8][j % 8] = _b[j * n + i]; // <- b is still transposed } } for (int i = 0; i < n; i++) { for (int j = 0; j < n; j++) { - vector s = {0}; + vec s{}; // initialize the accumulator with zeros + + // vertical summation for (int k = 0; k < nB; k++) s += a[i * nB + k] * b[j * nB + k]; - c[i * n + j] = hsum(s); + + // horizontal summation + for (int k = 0; k < 8; k++) + c[i * n + j] += s[k]; } } } ``` +The performance for $n = 1920$ is now around 2.3 GFLOPS, or another ~4 times higher. + +![](../img/mm-vectorized-barplot.svg) + +This optimization looks neither too complex nor specific to matrix multiplication. Why can't the compiler simply [auto-vectorize](/hpc/simd/auto-vectorization/) the inner loop? It actually can: the only thing preventing that is the possibility that `c` overlaps with either `a` or `b`. The only thing that you need to do is to guarantee that `c` is not [aliased](/hpc/compilation/contracts/#memory-aliasing) with anything by adding the `__restrict__` keyword to it: + + + +```c++ +void matmul(const float *a, const float *_b, float * __restrict__ c, int n) { + // ... +} +``` + +Both manually and auto-vectorized implementations perform roughly the same. + +## Memory efficiency + +Now, what is interesting is that the implementation efficiency depends on the problem size: + ![](../img/mm-vectorized-plot.svg) [memory bandwidth](/hpc/cpu-cache/bandwidth/) is not the problem. [Cache associativity](/hpc/cpu-cache/associativity/) strikes again. This is also an issue, but we will not address it for now. +You can see an even more noticeable dip at $1536 = 2^9 \times 3$. + $1920 = 2^7 \times 3 \times 5$, so it is divisible by a large power of two. Slightly slower than. 3.5s for 1025 ad 12s for 1024. +Now it is clear that we are really bottlenecked by the memory system. + However, now we *really* hit the memory limit. 
## Register reuse +If we + +Here is a proof of concept: + +```c++ +void update(int x, int y) { + int c00 = 0, c01 = 0, c10 = 0, c11 = 0; + + for (int k = 0; k < n; k++) { + int a0 = a[x][k]; + int a1 = a[x + 1][k]; + + int b0 = b[k][y]; + int b1 = b[k][y + 1]; + + c00 += a0 * b0; + c01 += a0 * b0; + c10 += a0 * b0; + c11 += a1 * b1; + } + + c[x][y] += c00; + c[x][y + 1] += c01; + c[x + 1][y] += c10; + c[x + 1][y + 1] += c11; +} +``` + +Before, we were reading $2 n$ elements to update one cell, and now we are reading $4n$ elements to update four cells: that is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency. + +It also boosts instruction-level parallelism and saves some instructions from execcuting the read instructions. + +We are not going to really try it. Instead, we will generalize it right away. + +Of course, this would not beat SIMD. + +## Micro-kernel + +*micro-kernel*. + This CPU importantly supports the [FMA3](https://en.wikipedia.org/wiki/FMA_instruction_set) SIMD extension that we will utilize in later implementations. RAM bandwidth is lower than that @@ -156,14 +212,14 @@ RAM bandwidth is lower than that The latency of FMA is 5 cycles, while its reciprocal throughput is ½. 
```c++ -void kernel(float *a, vector *b, vector *c, int x, int y, int l, int r, int n) { - vector t[6][2]{}; +void kernel(float *a, vec *b, vec *c, int x, int y, int l, int r, int n) { + vec t[6][2]{}; // will be stored in ymm registers for (int k = l; k < r; k++) { for (int i = 0; i < 6; i++) { - vector alpha = vector{} + a[(x + i) * n + k]; + vec alpha = vec{} + a[(x + i) * n + k]; // broadcast for (int j = 0; j < 2; j++) - t[i][j] += alpha * b[(k * n + y) / 8 + j]; + t[i][j] += alpha * b[(k * n + y) / 8 + j]; // fused multiply-add } } @@ -173,6 +229,8 @@ void kernel(float *a, vector *b, vector *c, int x, int y, int l, int r, int n) { } ``` +## Macro-kernel + ```c++ void matmul(const float *_a, const float *_b, float *_c, int n) { int nx = (n + 5) / 6 * 6; @@ -189,7 +247,7 @@ void matmul(const float *_a, const float *_b, float *_c, int n) { for (int x = 0; x < nx; x += 6) for (int y = 0; y < ny; y += 16) - kernel(a, (vector*) b, (vector*) c, x, y, 0, n, ny); + kernel(a, (vec*) b, (vec*) c, x, y, 0, n, ny); for (int i = 0; i < n; i++) memcpy(&_c[i * n], &c[i * ny], 4 * n); @@ -204,6 +262,10 @@ void matmul(const float *_a, const float *_b, float *_c, int n) { ![](../img/mm-kernel-plot.svg) +There is still a memory bandwidth problem. 
+ +## Blocking + ```c++ const int s3 = 64; const int s2 = 120; @@ -219,7 +281,7 @@ for (int i3 = 0; i3 < ny; i3 += s3) // with [l:r] = [i1:i1+s1] for (int x = i2; x < std::min(i2 + s2, nx); x += 6) for (int y = i3; y < std::min(i3 + s3, ny); y += 16) - kernel(a, (vector*) b, (vector*) c, x, y, i1, std::min(i1 + s1, n), ny); + kernel(a, (vec*) b, (vec*) c, x, y, i1, std::min(i1 + s1, n), ny); ``` ![](../img/mm-blocked-plot.svg) @@ -234,6 +296,13 @@ $$ \underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11}) $$ +(and also getting rid of `std::min` in the macro-kernel) + + +[OpenBLAS](https://www.openblas.net/) + +[numpy](/hpc/complexity/languages/#blas) + ![](../img/mm-blas.svg) We hit about 95. Which is fine, considering that this is not the only thing that CPUs are made for. @@ -250,11 +319,13 @@ for (int i3 = 0; i3 < n; i3 += s3) for (int i = 0; i < 6; i++) for (int j = 0; j < 2; j++) c[x * n / 8 + i * n / 8 + y / 8 + j] - += (vector{} + a[x * n + i * n + k]) + += (vec{} + a[x * n + i * n + k]) * b[n / 8 * k + y / 8 + j]; ``` -### Generalizations +Register spilling. 
+ +## Generalizations Given a matrix $D$, we need to calculate its "min-plus matrix multiplication" defined as: From c8860104c5e7610c7e03eddda174ade8712f7100 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 5 Apr 2022 23:31:52 +0300 Subject: [PATCH 017/173] matmul memory efficiency --- content/english/hpc/algorithms/matmul.md | 19 +++++++------------ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index 09184bce..09bb6f29 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -122,6 +122,9 @@ void matmul(const float *_a, const float *_b, float *c, int n) { c[i * n + j] += s[k]; } } + + std::free(a); + std::free(b); } ``` @@ -147,21 +150,13 @@ Now, what is interesting is that the implementation efficiency depends on the pr ![](../img/mm-vectorized-plot.svg) -[memory bandwidth](/hpc/cpu-cache/bandwidth/) is not the problem. - -[Cache associativity](/hpc/cpu-cache/associativity/) strikes again. This is also an issue, but we will not address it for now. - -You can see an even more noticeable dip at $1536 = 2^9 \times 3$. - -$1920 = 2^7 \times 3 \times 5$, so it is divisible by a large power of two. - -Slightly slower than. +First, the performance (in terms of useful operations per second) increases, as the overhead of the loop management and horizontal reduction decreases. However, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/). -3.5s for 1025 ad 12s for 1024. 
+It is also interesting that the naive implementation is mostly on par with the non-vectorized transposed version — and actually slightly better because of the transpose itself — for all but few data points, where the performance deteriorates. This is because of [cache associativity](/hpc/cpu-cache/associativity/): when $n$ is divisible by a large power of two, we are fetching addresses of `b` that all likely map to the same cache line, reducing the effective cache size. This explains the 30% performance dip for $n = 1920 = 2^7 \times 3 \times 5$, and you can see an even more noticeable one for $1536 = 2^9 \times 3$: it is roughly 3 times slower than for $n=1535$. -Now it is clear that we are really bottlenecked by the memory system. +One may think that there would be at least some general performance gain from full sequential reads since we are fetching fewer cache lines, but this is not the case: fetching the first column of `b` is painful, but the next 15 columns will actually be in the same cache lines as the first one, so they will be cached — unless the matrix is so large that it can't even fit `n * cache_line_size` bytes into the cache, which is not the case for all practical problem sizes. -However, now we *really* hit the memory limit. +So, counterintuitively, transposing the matrix doesn't help the memory bandwidth — and in the naive implementation, we are not really bottlenecked by it anyway. But for our vectorize implementation, we certainly are, so let's tackle it. 
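The associativity effect described above can be made concrete with a small model. The sketch below is an illustration, not a measurement: it assumes a 512 KB, 8-way set-associative cache with 64-byte lines (so 1024 sets), a matrix that starts at address zero, and set selection by the low bits of the line index. The names `LINE_BYTES`, `NUM_SETS`, and `distinct_sets` are hypothetical. It counts how many different cache sets one column of a row-major $n \times n$ float matrix falls into:

```c++
#include <cstddef>

// Assumed cache geometry (illustrative values, not measured):
constexpr std::size_t LINE_BYTES = 64;  // cache line size in bytes
constexpr std::size_t NUM_SETS = 1024;  // 512 KB / (64 B * 8 ways)

// Counts the distinct cache sets touched when reading one column of a
// row-major n x n float matrix, i.e. byte addresses 0, 4n, 8n, ...
constexpr std::size_t distinct_sets(std::size_t n) {
    bool seen[NUM_SETS] = {};
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; i++) {
        std::size_t address = i * n * sizeof(float);
        std::size_t set = (address / LINE_BYTES) % NUM_SETS;
        if (!seen[set]) {
            seen[set] = true;
            count++;
        }
    }
    return count;
}
```

Under these assumptions, `distinct_sets(1536)` is 32 and `distinct_sets(1920)` is 128, while for $n = 1535$ the column spreads over far more sets. With 8 ways per set, a column of 1536 elements competes for only $32 \times 8 = 256$ cache lines, which is consistent with the sharper dip at that size.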
## Register reuse From 376d46a118aed91aa8d78a279c4ba327f4bd3ab5 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 5 Apr 2022 23:50:40 +0300 Subject: [PATCH 018/173] matmul register reuse --- content/english/hpc/algorithms/matmul.md | 45 ++++++++++++++---------- 1 file changed, 26 insertions(+), 19 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index 09bb6f29..e90abf3b 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -144,13 +144,15 @@ void matmul(const float *a, const float *_b, float * __restrict__ c, int n) { Both manually and auto-vectorized implementations perform roughly the same. +The performance is bottlenecked by using a single variable. We could use multiple variables similar to other reductions, but we will solve it later anyway. + ## Memory efficiency -Now, what is interesting is that the implementation efficiency depends on the problem size: +Now, what is interesting is that the implementation efficiency depends on the problem size. -![](../img/mm-vectorized-plot.svg) +At first, the performance (in terms of useful operations per second) increases, as the overhead of the loop management and horizontal reduction decreases. Then, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/). -First, the performance (in terms of useful operations per second) increases, as the overhead of the loop management and horizontal reduction decreases. However, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/). 
+![](../img/mm-vectorized-plot.svg) It is also interesting that the naive implementation is mostly on par with the non-vectorized transposed version — and actually slightly better because of the transpose itself — for all but few data points, where the performance deteriorates. This is because of [cache associativity](/hpc/cpu-cache/associativity/): when $n$ is divisible by a large power of two, we are fetching addresses of `b` that all likely map to the same cache line, reducing the effective cache size. This explains the 30% performance dip for $n = 1920 = 2^7 \times 3 \times 5$, and you can see an even more noticeable one for $1536 = 2^9 \times 3$: it is roughly 3 times slower than for $n=1535$. @@ -160,13 +162,20 @@ So, counterintuitively, transposing the matrix doesn't help the memory bandwidth ## Register reuse -If we +To compute the cell $C[i][j]$, we need to compute the dot product of $A[i][:]$ and $B[:][j]$ (we are using the Python notation here to select rows and columns), which requires fetching $2n$ elements, even when $B$ is stored in column-major order. + +What if we were to compute $C[i:i+2][j:j+2]$, a $2 \times 2$ submatrix of $C$? We would need $A[i:i+2][:]$ and $B[:][j:j+2]$, which is $4n$ elements in total: that is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency. + +To actually avoid reading more data, we need to read these $2+2$ rows and columns in parallel and update all $2 \times 2$ cells at once using all possible combinations of products. 
Here is a proof of concept: ```c++ -void update(int x, int y) { - int c00 = 0, c01 = 0, c10 = 0, c11 = 0; +void update_2x2(int x, int y) { + int c00 = c[x][y], + c01 = c[x][y + 1], + c10 = c[x + 1][y], + c11 = c[x + 1][y + 1]; for (int k = 0; k < n; k++) { int a0 = a[x][k]; @@ -176,25 +185,21 @@ void update(int x, int y) { int b1 = b[k][y + 1]; c00 += a0 * b0; - c01 += a0 * b0; - c10 += a0 * b0; + c01 += a0 * b1; + c10 += a1 * b0; c11 += a1 * b1; } - c[x][y] += c00; - c[x][y + 1] += c01; - c[x + 1][y] += c10; - c[x + 1][y + 1] += c11; + c[x][y] = c00; + c[x][y + 1] = c01; + c[x + 1][y] = c10; + c[x + 1][y + 1] = c11; } ``` -Before, we were reading $2 n$ elements to update one cell, and now we are reading $4n$ elements to update four cells: that is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency. - -It also boosts instruction-level parallelism and saves some instructions from execcuting the read instructions. +It also boosts instruction-level parallelism (we don't have to wait between iterations to update the loop state) and saves some cycles from executing the read instructions. -We are not going to really try it. Instead, we will generalize it right away. - -Of course, this would not beat SIMD. +Of course, although better in terms of I/O, this $2 \times 2$ update would not beat our vectorized implementation, so we are not going to try this version in particular and instead will scale the idea right away. ## Micro-kernel @@ -224,7 +229,7 @@ void kernel(float *a, vec *b, vec *c, int x, int y, int l, int r, int n) { } ``` -## Macro-kernel +The rest of the implementaiton: ```c++ void matmul(const float *_a, const float *_b, float *_c, int n) { @@ -261,6 +266,8 @@ There is still a memory bandwidth problem. 
## Blocking +*Macro-kernel* + ```c++ const int s3 = 64; const int s2 = 120; From 4491d76d98814d18f62aa498eababe0bd87f7ac3 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 6 Apr 2022 15:17:30 +0300 Subject: [PATCH 019/173] matmul kernel --- content/english/hpc/algorithms/matmul.md | 42 +++++++++++++++++------- 1 file changed, 31 insertions(+), 11 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index e90abf3b..1f6abce0 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -162,6 +162,8 @@ So, counterintuitively, transposing the matrix doesn't help the memory bandwidth ## Register reuse +Any two cells of A and B are used to update some cell of C. + To compute the cell $C[i][j]$, we need to compute the dot product of $A[i][:]$ and $B[:][j]$ (we are using the Python notation here to select rows and columns), which requires fetching $2n$ elements, even when $B$ is stored in column-major order. What if we were to compute $C[i:i+2][j:j+2]$, a $2 \times 2$ submatrix of $C$? We would need $A[i:i+2][:]$ and $B[:][j:j+2]$, which is $4n$ elements in total: that is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency. @@ -201,17 +203,22 @@ It also boosts instruction-level parallelism (we don't have to wait between iter Of course, although better in terms of I/O, this $2 \times 2$ update would not beat our vectorized implementation, so we are not going to try this version in particular and instead will scale the idea right away. -## Micro-kernel +## Designing the kernel -*micro-kernel*. +We follow this approach and design a general kernel that updates a $h \times w$ submatrix of C using columns from $l$ to $r$ of $A$ and rows from $l$ to $r$ of $B$ (i. e. not a full computation, but only a partial update — it will be clear why later). 
We have several considerations: -This CPU importantly supports the [FMA3](https://en.wikipedia.org/wiki/FMA_instruction_set) SIMD extension that we will utilize in later implementations. +- In general, if we are updating an $h \times w$ submatrix, we will be fetching $2 \cdot n \cdot (h + w)$ elements to update $h \cdot w$ elements. We want that ratio of $\frac{h \cdot w}{2 \cdot n \cdot (h + w)}$ to be as high as possible, which is achieved with large square-ish submatrices. +- We want to use the [FMA](https://en.wikipedia.org/wiki/FMA_instruction_set) ("fused multiply-add") instructions that are available on all modern x86 architectures. As you can guess from the name, they perform a vector `c += a * b` operation in one go, which is the core of our computation. +- We want to be able to exploit [instruction-level parallelism](/hpc/pipelining/) to achieve better utilizaiton of this instruction. On Zen 2, the `fma` instruction has the latency of 5 and the throughput of 2, meaning that we need to concurrently execute at least $5 \times 2 = 10$ of them to fully saturate its execution ports. +- We only have $16$ logical vector registers that we can use as accumulators, and we want to avoid register spill. -RAM bandwidth is lower than that +For these reasons, we settle on a $6 \times 16$ kernel. We process $96$ elements at once, which can be stored in $6 \times 2 = 12$ vector registers (we need some more to store temporary values). We [broadcast](/hpc/simd/moving/#broadcast) an element of A, and then use it to update the first row ($8 + 8$ elements). Then we load the one below it, and so on. When we have updated the last row, we move to the next $6$ elements to the right. -The latency of FMA is 5 cycles, while its reciprocal throughput is ½. 
+The final implementation is simpler than it sounds: ```c++ +// update 6x16 submatrix C[x:x+6][y:y+16] +// using A[x:x+6][l:r] and B[l:r][y:y+16] void kernel(float *a, vec *b, vec *c, int x, int y, int l, int r, int n) { vec t[6][2]{}; // will be stored in ymm registers @@ -229,10 +236,15 @@ void kernel(float *a, vec *b, vec *c, int x, int y, int l, int r, int n) { } ``` -The rest of the implementaiton: +We need `t` so that the compiler stores these elements in vector registers. We could just update the final destinations, but unfortunately, the compiler re-writes them back to memory, causing a huge slowdown — and wrapping everything in `__restrict__` keywords doesn't help. + +The rest of the implementaiton is straightforward. Similar to the previous vectorized implementation, we just allocate aligned arrays and call the kernel instead of the innermost loop: ```c++ void matmul(const float *_a, const float *_b, float *_c, int n) { + // to avoid implementing partials, + // we pad height to nearest 6 and width to 16 + int nx = (n + 5) / 6 * 6; int ny = (n + 15) / 16 * 16; @@ -242,7 +254,7 @@ void matmul(const float *_a, const float *_b, float *_c, int n) { for (int i = 0; i < n; i++) { memcpy(&a[i * ny], &_a[i * n], 4 * n); - memcpy(&b[i * ny], &_b[i * n], 4 * n); + memcpy(&b[i * ny], &_b[i * n], 4 * n); // we don't need to transpose b this time } for (int x = 0; x < nx; x += 6) @@ -258,15 +270,19 @@ void matmul(const float *_a, const float *_b, float *_c, int n) { } ``` +This improves the performance by another ~40%: + ![](../img/mm-kernel-barplot.svg) +The speedup is much better (2-3x) on smaller arrays, indicating that there is still a bandwidth problem: + ![](../img/mm-kernel-plot.svg) -There is still a memory bandwidth problem. 
+If you've read the section on [cache-oblivious algorithms](/hpc/external-memory/oblivious/), you know that one universal solution to these types of things is to split matrices in four parts, do eight recursive block matrix multiplications until the matrix fits into cache, and carefully combine the results together. We will follow a different, simpler approach. ## Blocking -*Macro-kernel* +Note that we are reading. ```c++ const int s3 = 64; @@ -286,6 +302,8 @@ for (int i3 = 0; i3 < ny; i3 += s3) kernel(a, (vec*) b, (vec*) c, x, y, i1, std::min(i1 + s1, n), ny); ``` +This part is sometimes called *macro-kernel* (as opposed to the *micro-kernel* that only updates a 6x16 submatrix). + ![](../img/mm-blocked-plot.svg) ![](../img/mm-blocked-barplot.svg) @@ -294,8 +312,10 @@ Avoid moving anything: ![](../img/mm-noalloc.svg) +The theoretical performance limit is: + $$ -\underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11}) +\underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{2 \cdot 10^9}_{cycles/sec} = 32 \; GFLOPS \;\; (3.2 \cdot 10^{10}) $$ (and also getting rid of `std::min` in the macro-kernel) @@ -325,7 +345,7 @@ for (int i3 = 0; i3 < n; i3 += s3) * b[n / 8 * k + y / 8 + j]; ``` -Register spilling. +(Assuming that we are in 2050 and using the 35th version of GCC, which finally properly manager not to screwing up with register spilling.) 
## Generalizations From 02aba431230500c968e972019b7aa92f0e03ebfb Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 6 Apr 2022 15:52:04 +0300 Subject: [PATCH 020/173] matmul cache blocking --- content/english/hpc/algorithms/matmul.md | 40 ++++++++++++++++++++---- 1 file changed, 34 insertions(+), 6 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index 1f6abce0..9acbc61a 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -282,12 +282,24 @@ If you've read the section on [cache-oblivious algorithms](/hpc/external-memory/ ## Blocking -Note that we are reading. +Alternative to divide-and-conquer is *cache blocking*: selecting a subset of data and processing it, and then going to the next block. Sometimes blocking is hierarchical: we first select a block of data that fits into the L3 cache, then we split it into blocks that fit into the L2 cache, and so on. + +It is less trivial to do for matrices than for arrays, but the trick is like this: + +- Let's select a subset of B that fits into the L3 cache (say, a subset of its columns). +- Now, let's select a submatrix of A that fits into the L2 cache (a subset of its rows). +- Select a submatrix of previously selected submatrix of B that fits into the L1 cache, and use it to do the kernel update (a subset of its rows). + +Here is a good [visualization](https://jukkasuomela.fi/cache-blocking-demo/) by Jukka Suomela (it shows different approaches; we use the last one). + +We could have started with A, but this would be slower. Note that during the kernel execution, we are reading the elements of $A$ slower than elements of $B$: we are fetching and broadcasting just one element, and then we multiply it with $16$ elements of $B$, so we need to store $B$ in cache, and the last stage be about selecting B in cache. 
+ +We can implement it with three more outer `for` loops: ```c++ -const int s3 = 64; -const int s2 = 120; -const int s1 = 240; +const int s3 = 64; // how many columns of B to select +const int s2 = 120; // how many rows of A to select +const int s1 = 240; // how many rows of B to select for (int i3 = 0; i3 < ny; i3 += s3) // now we are working with b[:][i3:i3+s3] @@ -302,13 +314,29 @@ for (int i3 = 0; i3 < ny; i3 += s3) kernel(a, (vec*) b, (vec*) c, x, y, i1, std::min(i1 + s1, n), ny); ``` -This part is sometimes called *macro-kernel* (as opposed to the *micro-kernel* that only updates a 6x16 submatrix). +These outer `for` loops are sometimes called *macro-kernel* (as opposed to the *micro-kernel* that only updates a 6x16 submatrix). + +It completely removes the memory bottleneck: ![](../img/mm-blocked-plot.svg) +The performance is no longer seriously affected by the problem size: + ![](../img/mm-blocked-barplot.svg) -Avoid moving anything: +Notice the dip at $1536$ is still there. Cache associativity affects the effective cache size. We need to adjust the step constants or insert holes into the layout to mitigate this. + +## Optimization + +We need a few more optimizations to reach the performance limit: + +- Remove memory allocation and just operate on the arrays that we are given (note that we don't need to do anything with `a` as we are reading just one element at a time, and we can use unaligned `store` for `c` as we only use it rarely). +- Get rid of the `std::min` so that the size parameters are mostly constant and can be embedded into the machine code. +- Rewrite the micro-kernel using 12 variables (the compiler seems to have a problem with keeping them fully in registers). + +Effectively supporting weird sizes requires a bit more work, and this is the reason why we benchmarked at an array sizes that are divisible by $48 = \frac{6 \cdot 16}{\gcd(6, 16)}$. 
+ +Avoiding moving anything pays off: ![](../img/mm-noalloc.svg) From 6553b3f085132e827b2dd207ab702927b8ca89db Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 6 Apr 2022 16:04:13 +0300 Subject: [PATCH 021/173] matmul optimization --- content/english/hpc/algorithms/matmul.md | 23 +++++++++-------------- 1 file changed, 9 insertions(+), 14 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index 9acbc61a..30bdd36c 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -332,32 +332,27 @@ We need a few more optimizations to reach the performance limit: - Remove memory allocation and just operate on the arrays that we are given (note that we don't need to do anything with `a` as we are reading just one element at a time, and we can use unaligned `store` for `c` as we only use it rarely). - Get rid of the `std::min` so that the size parameters are mostly constant and can be embedded into the machine code. -- Rewrite the micro-kernel using 12 variables (the compiler seems to have a problem with keeping them fully in registers). +- Rewrite the micro-kernel by hand using 12 variables (the compiler seems to have a problem with keeping them fully in registers). Effectively supporting weird sizes requires a bit more work, and this is the reason why we benchmarked at an array sizes that are divisible by $48 = \frac{6 \cdot 16}{\gcd(6, 16)}$. -Avoiding moving anything pays off: +Avoiding moving anything pays off. 
These improvements sum up and give us a 50% improvement: ![](../img/mm-noalloc.svg) -The theoretical performance limit is: +We are actually not that far from the theoretical performance limit — which can be calculated as the throughput of the SIMD lane width times the fma instruction times the clock frequency: $$ -\underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{2 \cdot 10^9}_{cycles/sec} = 32 \; GFLOPS \;\; (3.2 \cdot 10^{10}) +\underbrace{8}_{SIMD} \cdot \underbrace{2}_{thr.} \cdot \underbrace{2 \cdot 10^9}_{cycles/sec} = 32 \; GFLOPS \;\; (3.2 \cdot 10^{10}) $$ -(and also getting rid of `std::min` in the macro-kernel) - - -[https://www.openblas.net/](OpenBLAS) - -[numpy](/hpc/complexity/languages/#blas) +A more realistic comparison is some practical library, such as [https://www.openblas.net/](OpenBLAS). We just call it from Python using [numpy](/hpc/complexity/languages/#blas), so there may be some minor overhead, but reaching 80% of theoretical performance seems plausible (matrix multiplication is not the only thing that CPUs are made for): ![](../img/mm-blas.svg) -We hit about 95. +We've reached ~93% of BLAS and ~75% of the theoretical performance limit. Which is really great for what is basically 40 lines of C. -Which is fine, considering that this is not the only thing that CPUs are made for. +Interestingly, the whole thing can be rolled into one large `for` loop: ```c++ for (int i3 = 0; i3 < n; i3 += s3) @@ -406,6 +401,6 @@ https://arxiv.org/pdf/1605.01078.pdf ## Acknowledgements -"[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)" by Kazushige Goto and Robert van de Geijn. +The algorithm was originally designed by Kazushige Goto, and it is the basis of GotoBLAS and OpenBLAS. The author himself described it and some other aspects in more detail in "[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)". 
-Inspired by "[Programming Parallel Computers](http://ppc.cs.aalto.fi/ch2/)" course. +The exposition style is inspired by "[Programming Parallel Computers](http://ppc.cs.aalto.fi/)" course by Jukka Suomela, which features a [similar case study](http://ppc.cs.aalto.fi/ch2/) on speeding up the distance product. From 32518f4c003b666546ba393e167556c5ff8fd430 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 6 Apr 2022 16:39:02 +0300 Subject: [PATCH 022/173] floyd algorithm and matmul --- content/english/hpc/algorithms/matmul.md | 45 ++++++++++++++++-------- 1 file changed, 30 insertions(+), 15 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index 30bdd36c..e01544db 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -334,7 +334,9 @@ We need a few more optimizations to reach the performance limit: - Get rid of the `std::min` so that the size parameters are mostly constant and can be embedded into the machine code. - Rewrite the micro-kernel by hand using 12 variables (the compiler seems to have a problem with keeping them fully in registers). -Effectively supporting weird sizes requires a bit more work, and this is the reason why we benchmarked at an array sizes that are divisible by $48 = \frac{6 \cdot 16}{\gcd(6, 16)}$. +Effectively supporting weird sizes requires a bit more work, and this is the reason why we benchmarked at an array sizes that are divisible by $48 = \frac{6 \cdot 16}{\gcd(6, 16)}$. We leave the code out, because the change is large and tedious and involves slightly modifying the benchmarking code itself. It is straightforward, but we only implement the version for this particular size, whithout any safety checks. Cheating on the benchmark. + +https://github.com/sslotin/amh-code/blob/main/matmul/v5-unrolled.cc Avoiding moving anything pays off. 
These improvements sum up and give us a 50% improvement: @@ -350,9 +352,9 @@ A more realistic comparison is some practical library, such as [https://www.open ![](../img/mm-blas.svg) -We've reached ~93% of BLAS and ~75% of the theoretical performance limit. Which is really great for what is basically 40 lines of C. +We've reached ~93% of BLAS and ~75% of the theoretical performance limit. Which is really great for what is essentially just 40 lines of C. -Interestingly, the whole thing can be rolled into one large `for` loop: +Interestingly, the whole thing can be rolled into one large `for` loop (assuming that we are in 2050 and using the 35th version of GCC, which finally properly manager not to screwing up with register spilling.): ```c++ for (int i3 = 0; i3 < n; i3 += s3) @@ -368,17 +370,21 @@ for (int i3 = 0; i3 < n; i3 += s3) * b[n / 8 * k + y / 8 + j]; ``` -(Assuming that we are in 2050 and using the 35th version of GCC, which finally properly manager not to screwing up with register spilling.) +There is also a way to do fewer arithmetic operations — [the Strassen algorithm](/hpc/external-memory/oblivious/#strassen-algorithm) — but it has a large constant factor, and it is [only efficient for very large matrices](https://arxiv.org/pdf/1605.01078.pdf) ($n > 4000$) for which we typically use multi-threading anyway. ## Generalizations -Given a matrix $D$, we need to calculate its "min-plus matrix multiplication" defined as: +FMA also supports 64-bit floating point number, but it does not support integers: you need to perform addition and multiplication separately, which projects to decreased performance. If you know that all intermediate results can be represented exactly as a 32- or 64-bit floating-point number (which is [often the case](/hpc/arithmetic/errors/)), it may be better convert them to and from floats. + +You can also apply the same trick to other similar computations. 
One example is the "min-plus matrix multiplication" defined as: -$(D \circ D)_{ij} = \min_k(D_{ik} + D_{kj})$ +$$ +(D \circ D)_{ij} = \min_{1 \le k \le n} (D_{ik} + D_{kj}) +$$ -Graph interpretation: find shortest paths of length 2 between all vertices in a fully-connected weighted graph +It is also known as the "distance product" due to its graph interpretation: the result is the matrix of shortest paths of length two between all pairs of vertices in a fully-connected weighted graph. -A cool thing about distance product is that if if we iterate the process and calculate: +A cool thing about the distance product is that if if we iterate the process and calculate: $$ D_2 = D \circ D \\ @@ -387,17 +393,26 @@ D_8 = D_4 \circ D_4 \\ \ldots $$ -Then we can find all-pairs shortest distances in $O(\log n)$ steps +Then we can find all-pairs shortest distances in $O(\log n)$ steps: -(but recall that there are [more direct ways](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm) to solve it) - -Which is an exercise. +```c++ +for (int l = 0; l < logn; l++) + for (int i = 0; i < n; i++) + for (int j = 0; j < n; j++) + for (int k = 0; k < n; k++) + d[i][j] = min(d[i][j], d[i][k] + d[k][j]); +``` -Strassen algorithm is only useful for large matrices. +This requires $O(n^3 \log n)$ operations, but if we do these two-edge relaxations in a particular order, we can do it with just one pass, which is known as the [Floyd-Warshall algorithm](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm): -https://arxiv.org/pdf/1605.01078.pdf +```c++ +for (int k = 0; k < n; k++) + for (int i = 0; i < n; i++) + for (int j = 0; j < n; j++) + d[i][j] = min(d[i][j], d[i][k] + d[k][j]); +``` -[cache-oblivious](/hpc/external-memory/oblivious/#matrix-multiplication) algorithms +As an exercise, try to think of ways of speeding up this "for-for-for" computation. It will be harder than matrix multiplication because you need to perform updates in this particular order. 
## Acknowledgements From 82ddb7412be5c82bb937185e5c199cb7c418fe23 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 6 Apr 2022 20:32:17 +0300 Subject: [PATCH 023/173] scalar matmul edits --- content/english/hpc/algorithms/matmul.md | 49 ++++++++++++++---------- 1 file changed, 28 insertions(+), 21 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index e01544db..a6237da5 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -17,13 +17,15 @@ nomove 0.303826 23.295860130469414 blas 0.27489790320396423 25.747333528217077 --> -In this case study, we will design and implement several algorithms for matrix multiplication. We start with the naive "for-for-for" algorithm and incrementally improve it, eventually developing an implementation that is 50 times faster and matches the performance of BLAS libraries while being under 40 lines of C. +In this case study, we will design and implement several algorithms for matrix multiplication. -We compile our implementations with GCC 13 and run them on Zen 2 clocked at 2GHz. +We start with the naive "for-for-for" algorithm and incrementally improve it, eventually arriving at a version that is 50 times faster and matches the performance of BLAS libraries while being under 40 lines of C. + +All implementations are compiled with GCC 13 and run on a [Zen 2](https://en.wikichip.org/wiki/amd/microarchitectures/zen_2) CPU clocked at 2GHz. ## Baseline -The result of multiplying an $l \times n$ matrix $A$ by an $n \times m$ matrix $B$ is an $l \times m$ matrix $C$ calculated as: +The result of multiplying an $l \times n$ matrix $A$ by an $n \times m$ matrix $B$ is defined as an $l \times m$ matrix $C$ calculated as $$ C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj} @@ -31,7 +33,7 @@ $$ For simplicity, we will only consider *square* matrices, where $l = m = n$. 
-To implement matrix multiplication, we can just transfer this definition into code — but instead of two-dimensional arrays (aka matrices), we will be using one-dimensional arrays, to be explicit about memory addressing: +To implement matrix multiplication, we can simply transfer this definition into code — but instead of two-dimensional arrays (aka matrices), we will be using one-dimensional arrays to be explicit about pointer arithmetic: ```c++ void matmul(const float *a, const float *b, float *c, int n) { @@ -42,17 +44,17 @@ void matmul(const float *a, const float *b, float *c, int n) { } ``` -For reasons that will become apparent later, we only use matrix sizes that are multiples of $48$ for benchmarking, but the implementations are still correct for all other sizes. We also use [32-bit floats](/hpc/arithmetic/ieee-754) specifically, although it can be [generalized](#generalizations) to other types and operations. +For reasons that will become apparent later, we only use matrix sizes that are multiples of $48$ for benchmarking, but the implementations remain correct for all other sizes. We also use [32-bit floats](/hpc/arithmetic/ieee-754) specifically, although all implementations can be easily [generalized](#generalizations) to other data types and operations. -Compiled with `g++ -O3 -march=native -ffast-math -funroll-loops`, the naive approach multiplies two matrices of size $n = 1920 = 48 \times 40$ in ~16.7 seconds. That is approximately $\frac{1920^3}{16.7 \times 10^9} \approx 0.42$ useful operations per nanosecond (GFLOPS), or roughly 5 CPU cycles per multiplication. +Compiled with `g++ -O3 -march=native -ffast-math -funroll-loops`, the naive approach multiplies two matrices of size $n = 1920 = 48 \times 40$ in ~16.7 seconds. Put in perspective, it is approximately $\frac{1920^3}{16.7 \times 10^9} \approx 0.42$ useful operations per nanosecond (GFLOPS), or roughly 5 CPU cycles per multiplication, which doesn't look that good yet. 
## Transposition In general, when you optimize an algorithm that processes large quantities of data — and $1920^2 \times 3 \times 4 \approx 42$ MB clearly is a large quantity as it can't fit into any of the [CPU caches](/hpc/cpu-cache) — you should always start with memory before optimizing arithmetic, as it is much more likely to be the bottleneck. -Note that the field $C_{ij}$ can be viewed as the dot product of row $i$ in matrix $A$ and column $j$ in matrix $B$. As we are incrementing the `k` variable in the inner loop above, we are reading the matrix `a` sequentially, but we are jumping over $n$ elements as we iterate over a column of `b`, which is [not as fast](/hpc/cpu-cache/aos-soa). +The field $C_{ij}$ can be seen as the dot product of row $i$ of matrix $A$ and column $j$ of matrix $B$. As we increment `k` in the inner loop above, we are reading the matrix `a` sequentially, but we are jumping over $n$ elements as we iterate over a column of `b`, which is [not as fast](/hpc/cpu-cache/aos-soa) as sequential iteration. -One [well-known optimization](/hpc/external-memory/oblivious/#matrix-multiplication) that mitigates this problem is to either store matrix $B$ in *column-major* order or to *transpose* it before the matrix multiplication — spending $O(n^2)$ additional operations, but ensuring sequential reads in the hot loop: +One [well-known](/hpc/external-memory/oblivious/#matrix-multiplication) optimization that tackles this problem is to store matrix $B$ in *column-major* order — or, alternatively, to *transpose* it before the matrix multiplication. This requires $O(n^2)$ additional operations but ensures sequential reads in the innermost loop: @@ -144,21 +145,27 @@ void matmul(const float *a, const float *_b, float * __restrict__ c, int n) { Both manually and auto-vectorized implementations perform roughly the same. + + ## Memory efficiency -Now, what is interesting is that the implementation efficiency depends on the problem size. 
+What is interesting is that the implementation efficiency depends on the problem size. -At first, the performance (in terms of useful operations per second) increases, as the overhead of the loop management and horizontal reduction decreases. Then, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/). +At first, the performance (in terms of useful operations per second) increases as the overhead of the loop management and horizontal reduction decreases. Then, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/). ![](../img/mm-vectorized-plot.svg) -It is also interesting that the naive implementation is mostly on par with the non-vectorized transposed version — and actually slightly better because of the transpose itself — for all but few data points, where the performance deteriorates. This is because of [cache associativity](/hpc/cpu-cache/associativity/): when $n$ is divisible by a large power of two, we are fetching addresses of `b` that all likely map to the same cache line, reducing the effective cache size. This explains the 30% performance dip for $n = 1920 = 2^7 \times 3 \times 5$, and you can see an even more noticeable one for $1536 = 2^9 \times 3$: it is roughly 3 times slower than for $n=1535$. +It is also interesting that the naive implementation is mostly on par with the non-vectorized transposed version — and even slightly better because it doesn't need to perform a transposition. 
+ +One might think that there would be some *general* performance gain from doing sequential reads since we are fetching fewer cache lines, but this is not the case: fetching the first column of `b` indeed takes more time, but the next 15 column reads will be in the same cache lines as the first one, so they will be cached — unless the matrix is so large that it can't even fit `n * cache_line_size` bytes into the cache, which is not the case for any practical matrix sizes. -One may think that there would be at least some general performance gain from full sequential reads since we are fetching fewer cache lines, but this is not the case: fetching the first column of `b` is painful, but the next 15 columns will actually be in the same cache lines as the first one, so they will be cached — unless the matrix is so large that it can't even fit `n * cache_line_size` bytes into the cache, which is not the case for all practical problem sizes. +Instead, the performance deteriorates on only a few specific matrix sizes due to the effects of [cache associativity](/hpc/cpu-cache/associativity/): when $n$ is a multiple of a large power of two, we are fetching the addresses of `b` that all likely map to the same cache line, which reduces the effective cache size. This explains the 30% performance dip for $n = 1920 = 2^7 \times 3 \times 5$, and you can see an even more noticeable one for $1536 = 2^9 \times 3$: it is roughly 3 times slower than for $n=1535$. -So, counterintuitively, transposing the matrix doesn't help the memory bandwidth — and in the naive implementation, we are not really bottlenecked by it anyway. But for our vectorize implementation, we certainly are, so let's tackle it. +So, counterintuitively, transposing the matrix doesn't help with caching — and in the naive implementation, we are not really bottlenecked by the memory bandwidth anyway. But our vectorized implementation certainly is, so let's work on its I/O efficiency. 
## Register reuse From 55edc44d68bf04054446ee084f575a397a5ee66b Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 6 Apr 2022 21:48:29 +0300 Subject: [PATCH 024/173] matmul kernel --- content/english/hpc/algorithms/matmul.md | 83 ++++++++++++++++-------- 1 file changed, 56 insertions(+), 27 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index a6237da5..64282bd3 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -169,36 +169,41 @@ So, counterintuitively, transposing the matrix doesn't help with caching — and ## Register reuse -Any two cells of A and B are used to update some cell of C. +Using a Python-like notation to refer to submatrices, to compute the cell $C[x][y]$, we need to calculate the dot product of $A[x][:]$ and $B[:][y]$, which requires fetching $2n$ elements, even if we store $B$ in column-major order. -To compute the cell $C[i][j]$, we need to compute the dot product of $A[i][:]$ and $B[:][j]$ (we are using the Python notation here to select rows and columns), which requires fetching $2n$ elements, even when $B$ is stored in column-major order. + -What if we were to compute $C[i:i+2][j:j+2]$, a $2 \times 2$ submatrix of $C$? We would need $A[i:i+2][:]$ and $B[:][j:j+2]$, which is $4n$ elements in total: that is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency. +To compute $C[x:x+2][y:y+2]$, a $2 \times 2$ submatrix of $C$, we would need two rows from $A$ and two columns from $B$, namely $A[x:x+2][:]$ and $B[:][y:y+2]$, containing $4n$ elements in total, to update four elements instead of one — which is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency. + + + +To avoid re-fetching data, we need to iterate these rows and columns in parallel and calculate all $2 \times 2$ possible combinations of products. 
Here is a proof of concept: ```c++ -void update_2x2(int x, int y) { - int c00 = c[x][y], - c01 = c[x][y + 1], - c10 = c[x + 1][y], - c11 = c[x + 1][y + 1]; +void kernel_2x2(int x, int y) { + int c00 = 0, c01 = 0, c10 = 0, c11 = 0; for (int k = 0; k < n; k++) { + // read rows int a0 = a[x][k]; int a1 = a[x + 1][k]; + // read columns int b0 = b[k][y]; int b1 = b[k][y + 1]; + // update all combinations c00 += a0 * b0; c01 += a0 * b1; c10 += a1 * b0; c11 += a1 * b1; } + // write the results to C c[x][y] = c00; c[x][y + 1] = c01; c[x + 1][y] = c10; @@ -206,52 +211,74 @@ void update_2x2(int x, int y) { } ``` -It also boosts instruction-level parallelism (we don't have to wait between iterations to update the loop state) and saves some cycles from executing the read instructions. +We can now simply call this kernel on all 2x2 submatrices of $C$, but we won't bother evaluating it: although this algorithm is better in terms of I/O operations, it would still not beat our SIMD-based implementation. Instead, we will extend this approach and develop a similar *vectorized* kernel right away. + + + ## Designing the kernel -We follow this approach and design a general kernel that updates a $h \times w$ submatrix of C using columns from $l$ to $r$ of $A$ and rows from $l$ to $r$ of $B$ (i. e. not a full computation, but only a partial update — it will be clear why later). We have several considerations: +Instead of designing a kernel that computes an $h \times w$ submatrix of $C$ from scratch, we will declare a function that *updates* it using columns from $l$ to $r$ of $A$ and rows from $l$ to $r$ of $B$. For now, this seems like an over-generalization, but this API will be useful later. + + -For these reasons, we settle on a $6 \times 16$ kernel. We process $96$ elements at once, which can be stored in $6 \times 2 = 12$ vector registers (we need some more to store temporary values). 
We [broadcast](/hpc/simd/moving/#broadcast) an element of A, and then use it to update the first row ($8 + 8$ elements). Then we load the one below it, and so on. When we have updated the last row, we move to the next $6$ elements to the right. +To determine $h$ and $w$, we have several performance considerations: + +- In general, to compute an $h \times w$ submatrix, we need to fetch $n \cdot (h + w)$ elements. To optimize the I/O efficiency, we would want the $\frac{h \cdot w}{h + w}$ ratio to be high, which is achieved with large and square-ish submatrices. +- We want to use the [FMA](https://en.wikipedia.org/wiki/FMA_instruction_set) ("fused multiply-add") instruction available on all modern x86 architectures. As you can guess from the name, it performs the `c += a * b` operation — which is the core of a dot product — on 8-element vectors in one go, which saves us from executing vector multiplication and addition separately. +- To achieve better utilization of this instruction, we want to make use of [instruction-level parallelism](/hpc/pipelining/). On Zen 2, the `fma` instruction has a latency of 5 and a throughput of 2, meaning that we need to concurrently execute at least $5 \times 2 = 10$ of them to saturate its execution ports. +- We want to avoid register spill, and we only have $16$ logical vector registers that we can use as accumulators. + +For these reasons, we settle on a $6 \times 16$ kernel. This way, we process $96$ elements at once, which can be stored in $6 \times 2 = 12$ vector registers (we can't use an $8 \times 16$ kernel and use all 16 vector registers because we need some to hold temporary values).
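The constraints listed above can be checked mechanically for any candidate kernel shape. The helpers below are a hypothetical sketch, not part of the article's code: they assume 8 floats per 256-bit register, and `fits` encodes the register constraint loosely as "at least 10 accumulators, but fewer than all 16".

```c++
#include <cassert>

// Number of 8-float (256-bit) accumulator registers that an h x w kernel needs.
int accumulators(int h, int w) {
    assert(h > 0 && w > 0 && w % 8 == 0);
    return h * (w / 8);
}

// The h * w / (h + w) ratio from the list above: updated cells of C
// per element of A and B fetched on each step of the inner loop.
double reuse_ratio(int h, int w) {
    return double(h * w) / (h + w);
}

// Loose check of the constraints (an assumption of this sketch): enough independent
// FMA chains to cover latency 5 x throughput 2, and at least one register left free.
bool fits(int h, int w) {
    int r = accumulators(h, w);
    return r >= 10 && r < 16;
}
```

For example, `fits(6, 16)` holds with 12 accumulators, `fits(8, 16)` fails because all 16 registers would be taken, and `fits(2, 16)` fails because 4 independent accumulators are too few to hide the `fma` latency.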
+ +To update them efficiently, we use the following procedure: + + + ```c++ // update 6x16 submatrix C[x:x+6][y:y+16] // using A[x:x+6][l:r] and B[l:r][y:y+16] void kernel(float *a, vec *b, vec *c, int x, int y, int l, int r, int n) { - vec t[6][2]{}; // will be stored in ymm registers + vec t[6][2]{}; // will be zero-filled and stored in ymm registers for (int k = l; k < r; k++) { for (int i = 0; i < 6; i++) { - vec alpha = vec{} + a[(x + i) * n + k]; // broadcast + // broadcast a[x + i][k] into a register + vec alpha = vec{} + a[(x + i) * n + k]; // converts to a broadcast + // multiply b[k][y:y+16] by it and update t[i][0] and t[i][1] for (int j = 0; j < 2; j++) - t[i][j] += alpha * b[(k * n + y) / 8 + j]; // fused multiply-add + t[i][j] += alpha * b[(k * n + y) / 8 + j]; // converts to an fma } } + // write the results back to C for (int i = 0; i < 6; i++) for (int j = 0; j < 2; j++) c[((x + i) * n + y) / 8 + j] += t[i][j]; } ``` -We need `t` so that the compiler stores these elements in vector registers. We could just update the final destinations, but unfortunately, the compiler re-writes them back to memory, causing a huge slowdown — and wrapping everything in `__restrict__` keywords doesn't help. +We need `t` so that the compiler stores these elements in vector registers. We could just update the final destinations, but, unfortunately, the compiler re-writes them back to memory, causing a slowdown (wrapping everything in `__restrict__` keywords doesn't help). -The rest of the implementaiton is straightforward. Similar to the previous vectorized implementation, we just allocate aligned arrays and call the kernel instead of the innermost loop: +The rest of the implementation is straightforward. 
Similar to the previous vectorized implementation, we just allocate aligned arrays and call the kernel instead of the innermost loop: ```c++ void matmul(const float *_a, const float *_b, float *_c, int n) { - // to avoid implementing partials, - // we pad height to nearest 6 and width to 16 - + // to simplify the implementation, we pad the height and width + // so that they are divisible by 6 and 16 respectively int nx = (n + 5) / 6 * 6; int ny = (n + 15) / 16 * 16; @@ -277,15 +304,15 @@ void matmul(const float *_a, const float *_b, float *_c, int n) { } ``` -This improves the performance by another ~40%: +This improves the benchmark performance, but only by ~40%: ![](../img/mm-kernel-barplot.svg) -The speedup is much better (2-3x) on smaller arrays, indicating that there is still a bandwidth problem: +The speedup is much higher (2-3x) on smaller arrays, indicating that there is still a bandwidth problem: ![](../img/mm-kernel-plot.svg) -If you've read the section on [cache-oblivious algorithms](/hpc/external-memory/oblivious/), you know that one universal solution to these types of things is to split matrices in four parts, do eight recursive block matrix multiplications until the matrix fits into cache, and carefully combine the results together. We will follow a different, simpler approach. +Now, if you've read the section on [cache-oblivious algorithms](/hpc/external-memory/oblivious/), you know that one universal solution to these types of things is to split all matrices into four parts, perform eight recursive block matrix multiplications, and carefully combine the results together. This solution is okay in practice, but there is some [overhead to recursion](/hpc/architecture/functions/), and it also doesn't allow us to fine-tune the algorithm, so instead, we will follow a different, simpler approach. 
## Blocking @@ -419,6 +446,8 @@ for (int k = 0; k < n; k++) d[i][j] = min(d[i][j], d[i][k] + d[k][j]); ``` +Vectorizing the distance product and executing it $O(\log n)$ times is faster than than naively executing the Floyd-Warshall algorithm, although not by a lot. + As an exercise, try to think of ways of speeding up this "for-for-for" computation. It will be harder than matrix multiplication because you need to perform updates in this particular order. ## Acknowledgements From d129828bb3d764ccf146eb41df189aac7559a4dc Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 6 Apr 2022 22:17:08 +0300 Subject: [PATCH 025/173] matmul cache blocking --- content/english/hpc/algorithms/matmul.md | 32 +++++++++++------------- 1 file changed, 15 insertions(+), 17 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index 64282bd3..2126daea 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -316,19 +316,20 @@ Now, if you've read the section on [cache-oblivious algorithms](/hpc/external-me ## Blocking -Alternative to divide-and-conquer is *cache blocking*: selecting a subset of data and processing it, and then going to the next block. Sometimes blocking is hierarchical: we first select a block of data that fits into the L3 cache, then we split it into blocks that fit into the L2 cache, and so on. +The *cache-aware* alternative to this divide-and-conquer trick is *cache blocking*: splitting the data into blocks that can fit into the cache and processing them one by one. If we have more than one layer of cache, we can do hierarchical blocking: we first select a block of data that fits into the L3 cache, then we split it into blocks that fit into the L2 cache, and so on. This requires knowing the cache sizes in advance, but it is usually easier to implement and also faster in practice. 
-It is less trivial to do for matrices than for arrays, but the trick is like this: +Cache blocking is less trivial to do with matrices than with arrays, but the general idea is this: -- Let's select a subset of B that fits into the L3 cache (say, a subset of its columns). -- Now, let's select a submatrix of A that fits into the L2 cache (a subset of its rows). -- Select a submatrix of previously selected submatrix of B that fits into the L1 cache, and use it to do the kernel update (a subset of its rows). +- Select a submatrix of $B$ that fits into the L3 cache (say, a subset of its columns). +- Select a submatrix of $A$ that fits into the L2 cache (say, a subset of its rows). +- Select a submatrix of the previously selected submatrix of $B$ (a subset of its rows) that fits into the L1 cache. +- Update the relevant submatrix of $C$ using the kernel. -Here is a good [visualization](https://jukkasuomela.fi/cache-blocking-demo/) by Jukka Suomela (it shows different approaches; we use the last one). +Here is a good [visualization](https://jukkasuomela.fi/cache-blocking-demo/) by Jukka Suomela (it features many different approaches; you are interested in the last one). -We could have started with A, but this would be slower. Note that during the kernel execution, we are reading the elements of $A$ slower than elements of $B$: we are fetching and broadcasting just one element, and then we multiply it with $16$ elements of $B$, so we need to store $B$ in cache, and the last stage be about selecting B in cache. +Note that the decision to start this process with matrix $B$ is not arbitrary. During the kernel execution, we are reading the elements of $A$ much slower than the elements of $B$: we fetch and broadcast just one element of $A$ and then multiply it with $16$ elements of $B$. Therefore, we want $B$ to be in the L1 cache while $A$ can stay in the L2 cache and not the other way around. 
-We can implement it with three more outer `for` loops: +This sounds complicated, but we can implement it with just three more outer `for` loops, which are collectively called *macro-kernel* (and the highly optimized low-level function that updates a 6x16 submatrix is called *micro-kernel*): ```c++ const int s3 = 64; // how many columns of B to select @@ -341,24 +342,21 @@ for (int i3 = 0; i3 < ny; i3 += s3) // now we are working with a[i2:i2+s2][:] for (int i1 = 0; i1 < ny; i1 += s1) // now we are working with b[i1:i1+s1][i3:i3+s3] - // this equates to updating c[i2:i2+s2][i3:i3+s3] - // with [l:r] = [i1:i1+s1] + // and we need to update c[i2:i2+s2][i3:i3+s3] with [l:r] = [i1:i1+s1] for (int x = i2; x < std::min(i2 + s2, nx); x += 6) for (int y = i3; y < std::min(i3 + s3, ny); y += 16) kernel(a, (vec*) b, (vec*) c, x, y, i1, std::min(i1 + s1, n), ny); ``` -These outer `for` loops are sometimes called *macro-kernel* (as opposed to the *micro-kernel* that only updates a 6x16 submatrix). +Cache blocking completely removes the memory bottleneck: -It completely removes the memory bottleneck: - -![](../img/mm-blocked-plot.svg) +![](../img/mm-blocked-barplot.svg) -The performance is no longer seriously affected by the problem size: +The performance is no longer significantly affected by the problem size: -![](../img/mm-blocked-barplot.svg) +![](../img/mm-blocked-plot.svg) -Notice the dip at $1536$ is still there. Cache associativity affects the effective cache size. We need to adjust the step constants or insert holes into the layout to mitigate this. +Notice that the dip at $1536$ is still there: cache associativity still affects the effective cache size. To mitigate this, we can adjust the step constants or insert holes into the layout, but we are not going to bother doing that for now. 
## Optimization From f50135e9fa4cd1937da55eb1df4d5077d26e70df Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 6 Apr 2022 22:57:16 +0300 Subject: [PATCH 026/173] matmul final edits --- content/english/hpc/algorithms/matmul.md | 52 ++++++++++++++---------- 1 file changed, 30 insertions(+), 22 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index 2126daea..e6749b81 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -360,33 +360,39 @@ Notice that the dip at $1536$ is still there: cache associativity still affects ## Optimization -We need a few more optimizations to reach the performance limit: +To approach closer to the performance limit, we need a few more optimizations: -- Remove memory allocation and just operate on the arrays that we are given (note that we don't need to do anything with `a` as we are reading just one element at a time, and we can use unaligned `store` for `c` as we only use it rarely). -- Get rid of the `std::min` so that the size parameters are mostly constant and can be embedded into the machine code. -- Rewrite the micro-kernel by hand using 12 variables (the compiler seems to have a problem with keeping them fully in registers). +- Remove memory allocation and operate on the arrays that are passed to the function. Note that we don't need to do anything with `a` as we are reading just one element at a time, and we can use an unaligned `store` for `c` as we only use it rarely. +- Get rid of the `std::min` so that the size parameters are (mostly) constant and can be embedded into the machine code by the compiler (which also lets it [unroll](/hpc/architecture/loops/) the micro-kernel loop more efficiently without runtime checks). +- Rewrite the micro-kernel by hand using 12 vector variables (the compiler seems to struggle with keeping them in registers and writes them first to temporary storage and only then to $C$). 
+ +These optimizations are straightforward but quite tedious to implement, so we are not going to list [the code](https://github.com/sslotin/amh-code/blob/main/matmul/v5-unrolled.cc) in the article. It also requires some more work to effectively support "weird" matrix sizes, which is why we only run benchmarks for sizes that are multiples of $48 = \frac{6 \cdot 16}{\gcd(6, 16)}$. + + + +These individually small improvements sum up and result in another 50% improvement: ![](../img/mm-noalloc.svg) -We are actually not that far from the theoretical performance limit — which can be calculated as the throughput of the SIMD lane width times the fma instruction times the clock frequency: +We are actually not that far from the theoretical performance limit — which can be calculated as the width of a SIMD lane times the `fma` instruction throughput times the clock frequency: $$ \underbrace{8}_{SIMD} \cdot \underbrace{2}_{thr.} \cdot \underbrace{2 \cdot 10^9}_{cycles/sec} = 32 \; GFLOPS \;\; (3.2 \cdot 10^{10}) $$ -A more realistic comparison is some practical library, such as [https://www.openblas.net/](OpenBLAS). We just call it from Python using [numpy](/hpc/complexity/languages/#blas), so there may be some minor overhead, but reaching 80% of theoretical performance seems plausible (matrix multiplication is not the only thing that CPUs are made for): +It is more useful to compare against some practical library, such as [OpenBLAS](https://www.openblas.net/). The laziest way is to simply invoke matrix multiplication from Python with [numpy](/hpc/complexity/languages/#blas). There may be some minor overhead, but it ends up reaching 80% of the theoretical limit, which seems plausible (this overhead is typical, as matrix multiplication is not the only thing that CPUs are made for): ![](../img/mm-blas.svg) -We've reached ~93% of BLAS and ~75% of the theoretical performance limit. Which is really great for what is essentially just 40 lines of C.
+We've reached ~93% of BLAS performance and ~75% of the theoretical performance limit, which is really great for what is essentially just 40 lines of C. -Interestingly, the whole thing can be rolled into one large `for` loop (assuming that we are in 2050 and using the 35th version of GCC, which finally properly manager not to screwing up with register spilling.): +Interestingly, the whole thing can be rolled into just one deeply nested `for` loop (assuming that we are in 2050 and using the 35th version of GCC, which finally does not screw up with register spilling.): ```c++ for (int i3 = 0; i3 < n; i3 += s3) @@ -402,21 +408,23 @@ for (int i3 = 0; i3 < n; i3 += s3) * b[n / 8 * k + y / 8 + j]; ``` -There is also a way to do fewer arithmetic operations — [the Strassen algorithm](/hpc/external-memory/oblivious/#strassen-algorithm) — but it has a large constant factor, and it is [only efficient for very large matrices](https://arxiv.org/pdf/1605.01078.pdf) ($n > 4000$) for which we typically use multi-threading anyway. +There is also a way to do fewer arithmetic operations — [the Strassen algorithm](/hpc/external-memory/oblivious/#strassen-algorithm) — but it has a large constant factor, and it is only efficient for [very large matrices](https://arxiv.org/pdf/1605.01078.pdf) ($n > 4000$), where we typically have to use either multiprocessing or some approximate dimensionality-reducing methods anyway. + + ## Generalizations -FMA also supports 64-bit floating point number, but it does not support integers: you need to perform addition and multiplication separately, which projects to decreased performance. If you know that all intermediate results can be represented exactly as a 32- or 64-bit floating-point number (which is [often the case](/hpc/arithmetic/errors/)), it may be better convert them to and from floats. 
+FMA also supports 64-bit floating-point numbers, but it does not support integers: you need to perform addition and multiplication separately, which projects to decreased performance. If you can guarantee that all intermediate results can be represented exactly as a 32- or 64-bit floating-point number (which is [often the case](/hpc/arithmetic/errors/)), it may be better to convert them to and from floats. -You can also apply the same trick to other similar computations. One example is the "min-plus matrix multiplication" defined as: +You can also apply the same trick to other similar computations. One example is the "min-plus matrix multiplication," which is defined as: $$ -(D \circ D)_{ij} = \min_{1 \le k \le n} (D_{ik} + D_{kj}) +(A \circ B)_{ij} = \min_{1 \le k \le n} (A_{ik} + B_{kj}) $$ -It is also known as the "distance product" due to its graph interpretation: the result is the matrix of shortest paths of length two between all pairs of vertices in a fully-connected weighted graph. +It is also known as the "distance product" due to its graph interpretation: when applied to itself $(D \circ D)$, the result is the matrix of shortest paths of length two between all pairs of vertices in a fully-connected weighted graph specified by the edge weight matrix $D$. 
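To make the definition of the min-plus product concrete, here is a plain scalar reference version. It is a minimal sketch, not the optimized kernel: the `matrix` alias and the `min_plus` name are made up for illustration, and missing edges are assumed to be stored as positive infinity.

```c++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using matrix = std::vector<std::vector<float>>;

// Scalar reference version of the min-plus ("distance") product of two n x n matrices:
// c[i][j] = min over k of (a[i][k] + b[k][j]).
// Assumption of this sketch: absent edges are stored as +infinity, so inf + x stays inf.
matrix min_plus(const matrix &a, const matrix &b) {
    std::size_t n = a.size();
    assert(b.size() == n);
    const float inf = std::numeric_limits<float>::infinity();
    matrix c(n, std::vector<float>(n, inf));
    for (std::size_t i = 0; i < n; i++)
        for (std::size_t k = 0; k < n; k++)
            for (std::size_t j = 0; j < n; j++)
                c[i][j] = std::min(c[i][j], a[i][k] + b[k][j]);
    return c;
}
```

The loop nest has the same shape as ordinary matrix multiplication, with `+` replaced by `min` and `*` replaced by `+`, which is why the same kernel and blocking ideas carry over.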
-A cool thing about the distance product is that if if we iterate the process and calculate: +A cool thing about the distance product is that if we iterate the process and calculate $$ D_2 = D \circ D \\ @@ -425,7 +433,7 @@ D_8 = D_4 \circ D_4 \\ \ldots $$ -Then we can find all-pairs shortest distances in $O(\log n)$ steps: +…we can find all-pairs shortest paths in $O(\log n)$ steps: ```c++ for (int l = 0; l < logn; l++) @@ -435,7 +443,7 @@ for (int l = 0; l < logn; l++) d[i][j] = min(d[i][j], d[i][k] + d[k][j]); ``` -This requires $O(n^3 \log n)$ operations, but if we do these two-edge relaxations in a particular order, we can do it with just one pass, which is known as the [Floyd-Warshall algorithm](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm): +This requires $O(n^3 \log n)$ operations. If we do these two-edge relaxations in a particular order, we can do it with just one pass, which is known as the [Floyd-Warshall algorithm](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm): ```c++ for (int k = 0; k < n; k++) @@ -444,12 +452,12 @@ for (int k = 0; k < n; k++) d[i][j] = min(d[i][j], d[i][k] + d[k][j]); ``` -Vectorizing the distance product and executing it $O(\log n)$ times is faster than than naively executing the Floyd-Warshall algorithm, although not by a lot. +Interestingly, vectorizing the distance product and executing it $O(\log n)$ times in $O(n^3 \log n)$ total operations is faster than naively executing the Floyd-Warshall algorithm in $O(n^3)$ operations, although not by a lot. -As an exercise, try to think of ways of speeding up this "for-for-for" computation. It will be harder than matrix multiplication because you need to perform updates in this particular order. +As an exercise, try to speed up this "for-for-for" computation. 
It is harder to do than in the matrix multiplication case because you need to perform updates in a particular order, but it is still possible to design a similar kernel and an iteration order that achieves a 30-50x total speedup. ## Acknowledgements -The algorithm was originally designed by Kazushige Goto, and it is the basis of GotoBLAS and OpenBLAS. The author himself described it and some other aspects in more detail in "[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)". +The final algorithm was originally designed by Kazushige Goto, and it is the basis of GotoBLAS and OpenBLAS. The author himself describes it in more detail in "[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)". -The exposition style is inspired by "[Programming Parallel Computers](http://ppc.cs.aalto.fi/)" course by Jukka Suomela, which features a [similar case study](http://ppc.cs.aalto.fi/ch2/) on speeding up the distance product. +The exposition style is inspired by the "[Programming Parallel Computers](http://ppc.cs.aalto.fi/)" course by Jukka Suomela, which features a [similar case study](http://ppc.cs.aalto.fi/ch2/) on speeding up the distance product. From 15e65f57a7b32c64d4afdabff0e726a542615fb5 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 6 Apr 2022 23:00:53 +0300 Subject: [PATCH 027/173] publish matmul --- content/english/hpc/algorithms/matmul.md | 3 +-- content/english/hpc/complexity/languages.md | 2 +- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index e6749b81..01159313 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -1,7 +1,6 @@ --- title: Matrix Multiplication weight: 20 -draft: true --- @@ -154,17 +156,17 @@ The performance is bottlenecked by using a single variable. 
We could use multipl What is interesting is that the implementation efficiency depends on the problem size. -At first, the performance (in terms of useful operations per second) increases as the overhead of the loop management and horizontal reduction decreases. Then, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/). +At first, the performance (defined as the number of useful operations per second) increases as the overhead of the loop management and the horizontal reduction decreases. Then, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/). ![](../img/mm-vectorized-plot.svg) It is also interesting that the naive implementation is mostly on par with the non-vectorized transposed version — and even slightly better because it doesn't need to perform a transposition. -One might think that there would be some *general* performance gain from doing sequential reads since we are fetching fewer cache lines, but this is not the case: fetching the first column of `b` indeed takes more time, but the next 15 column reads will be in the same cache lines as the first one, so they will be cached — unless the matrix is so large that it can't even fit `n * cache_line_size` bytes into the cache, which is not the case for any practical matrix sizes. 
+One might think that there would be some general performance gain from doing sequential reads since we are fetching fewer cache lines, but this is not the case: fetching the first column of `b` indeed takes more time, but the next 15 column reads will be in the same cache lines as the first one, so they will be cached anyway — unless the matrix is so large that it can't even fit `n * cache_line_size` bytes into the cache, which is not the case for any practical matrix sizes. Instead, the performance deteriorates on only a few specific matrix sizes due to the effects of [cache associativity](/hpc/cpu-cache/associativity/): when $n$ is a multiple of a large power of two, we are fetching the addresses of `b` that all likely map to the same cache line, which reduces the effective cache size. This explains the 30% performance dip for $n = 1920 = 2^7 \times 3 \times 5$, and you can see an even more noticeable one for $1536 = 2^9 \times 3$: it is roughly 3 times slower than for $n=1535$. -So, counterintuitively, transposing the matrix doesn't help with caching — and in the naive implementation, we are not really bottlenecked by the memory bandwidth anyway. But our vectorized implementation certainly is, so let's work on its I/O efficiency. +So, counterintuitively, transposing the matrix doesn't help with caching — and in the naive scalar implementation, we are not really bottlenecked by the memory bandwidth anyway. But our vectorized implementation certainly is, so let's work on its I/O efficiency. ## Register reuse @@ -172,7 +174,7 @@ Using a Python-like notation to refer to submatrices, to compute the cell $C[x][ -To compute $C[x:x+2][y:y+2]$, a $2 \times 2$ submatrix of $C$, we would need two rows from $A$ and two columns from $B$, namely $A[x:x+2][:]$ and $B[:][y:y+2]$, containing $4n$ elements in total, to update four elements instead of one — which is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency. 
+To compute $C[x:x+2][y:y+2]$, a $2 \times 2$ submatrix of $C$, we would need two rows from $A$ and two columns from $B$, namely $A[x:x+2][:]$ and $B[:][y:y+2]$, containing $4n$ elements in total, to update *four* elements instead of *one* — which is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency. -To avoid re-fetching data, we need to iterate these rows and columns in parallel and calculate all $2 \times 2$ possible combinations of products. Here is a proof of concept: +To avoid fetching data more than once, we need to iterate over these rows and columns in parallel and calculate all $2 \times 2$ possible combinations of products. Here is a proof of concept: ```c++ void kernel_2x2(int x, int y) { @@ -220,7 +222,7 @@ Of course, although better in terms of I/O, this $2 \times 2$ update would not b ## Designing the kernel -Instead of designing a kernel that computes an $h \times w$ submatrix of $C$ from scratch, we will declare a function that *updates* it using columns from $l$ to $r$ of $A$ and rows from $l$ to $r$ of $B$. For now, this seems like an over-generalization, but this API will be useful later. +Instead of designing a kernel that computes an $h \times w$ submatrix of $C$ from scratch, we will declare a function that *updates* it using columns from $l$ to $r$ of $A$ and rows from $l$ to $r$ of $B$. For now, this seems like an over-generalization, but this function interface will prove useful later. - To achieve better utilization of this instruction, we want to make use of [instruction-level parallelism](/hpc/pipelining/). On Zen 2, the `fma` instruction has a latency of 5 and a throughput of 2, meaning that we need to concurrently execute at least $5 \times 2 = 10$ of them to saturate its execution ports. -- We want to avoid register spill, and we only have $16$ logical vector registers that we can use as accumulators. - -For these reasons, we settle on a $6 \times 16$ kernel. 
This way, we process $96$ elements at once, which can be stored in $6 \times 2 = 12$ vector registers (we can't use an $8 \times 16$ kernel and use all 16 vector registers because we need some to hold temporary values). +- We want to avoid register spill (move data to and from registers more than necessary), and we only have $16$ logical vector registers that we can use as accumulators (minus those that we need to hold temporary values). -To update them efficiently, we use the following procedure: +For these reasons, we settle on a $6 \times 16$ kernel. This way, we process $96$ elements at once that are stored in $6 \times 2 = 12$ vector registers. To update them efficiently, we use the following procedure: -These individually small improvements sum up and result in another 50% improvement: +These individually small improvements compound and result in another 50% improvement: ![](../img/mm-noalloc.svg) -We are actually not that far from the theoretical performance limit — which can be calculated as the width of a SIMD lane times the `fma` instruction throughput times the clock frequency: +We are actually not that far from the theoretical performance limit — which can be calculated as the SIMD width times the `fma` instruction throughput times the clock frequency: $$ \underbrace{8}_{SIMD} \cdot \underbrace{2}_{thr.} \cdot \underbrace{2 \cdot 10^9}_{cycles/sec} = 32 \; GFLOPS \;\; (3.2 \cdot 10^{10}) $$ -It is more useful to compare against some practical library, such as [OpenBLAS](https://www.openblas.net/). The laziest way is to simply invoke matrix multiplication from Python with [numpy](/hpc/complexity/languages/#blas). There may be some minor overhead, but it ends up reaching 80% of the theoretical limit, which seems plausible (this overhead is typical, as matrix multiplication is not the only thing that CPUs are made for): +It is more representative to compare against some practical library, such as [OpenBLAS](https://www.openblas.net/). 
The laziest way to do it is to simply [invoke matrix multiplication from NumPy](/hpc/complexity/languages/#blas). There may be some minor overhead due to Python, but it ends up reaching 80% of the theoretical limit, which seems plausible (a 20% overhead is okay: matrix multiplication is not the only thing that CPUs are made for). ![](../img/mm-blas.svg) We've reached ~93% of BLAS performance and ~75% of the theoretical performance limit, which is really great for what is essentially just 40 lines of C. -Interestingly, the whole thing can be rolled into just one deeply nested `for` loop with a BLAS-level of performance (assuming that we're in 2050 and using GCC 35, which finally does not screw up with register spilling): +Interestingly, the whole thing can be rolled into just one deeply nested `for` loop with a BLAS-level of performance (assuming that we're in 2050 and using GCC version 35, which finally stopped screwing up with register spilling): ```c++ for (int i3 = 0; i3 < n; i3 += s3) @@ -407,13 +407,13 @@ for (int i3 = 0; i3 < n; i3 += s3) * b[n / 8 * k + y / 8 + j]; ``` -There is also a way to do fewer arithmetic operations — [the Strassen algorithm](/hpc/external-memory/oblivious/#strassen-algorithm) — but it has a large constant factor, and it is only efficient for [very large matrices](https://arxiv.org/pdf/1605.01078.pdf) ($n > 4000$), where we typically have to use either multiprocessing or some approximate dimensionality-reducing methods anyway. +There is also an approach that performs asymptotically fewer arithmetic operations — [the Strassen algorithm](/hpc/external-memory/oblivious/#strassen-algorithm) — but it has a large constant factor, and it is only efficient for [very large matrices](https://arxiv.org/pdf/1605.01078.pdf) ($n > 4000$), where we typically have to use either multiprocessing or some approximate dimensionality-reducing methods anyway. 
## Generalizations -FMA also supports 64-bit floating-point numbers, but it does not support integers: you need to perform addition and multiplication separately, which projects to decreased performance. If you can guarantee that all intermediate results can be represented exactly as a 32- or 64-bit floating-point number (which is [often the case](/hpc/arithmetic/errors/)), it may be better to convert them to and from floats. +FMA also supports 64-bit floating-point numbers, but it does not support integers: you need to perform addition and multiplication separately, which results in decreased performance. If you can guarantee that all intermediate results can be represented exactly as 32- or 64-bit floating-point numbers (which is [often the case](/hpc/arithmetic/errors/)), it may be faster to just convert them to and from floats. You can also apply the same trick to other similar computations. One example is the "min-plus matrix multiplication," which is defined as: @@ -453,7 +453,7 @@ for (int k = 0; k < n; k++) Interestingly, vectorizing the distance product and executing it $O(\log n)$ times in $O(n^3 \log n)$ total operations is faster than naively executing the Floyd-Warshall algorithm in $O(n^3)$ operations, although not by a lot. -As an exercise, try to speed up this "for-for-for" computation. It is harder to do than in the matrix multiplication case because you need to perform updates in a particular order, but it is still possible to design a similar kernel and an iteration order that achieves a 30-50x total speedup. +As an exercise, try to speed up this "for-for-for" computation. It is harder to do than in the matrix multiplication case because now there is a logical dependency between the iterations, and you need to perform updates in a particular order, but it is still possible to design a similar kernel and a block iteration order that achieves a 30-50x total speedup. 
## Acknowledgements From b149f0900ce5b63a3c94088152879ee5530f81dc Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Thu, 7 Apr 2022 01:42:11 +0300 Subject: [PATCH 029/173] typo --- content/english/hpc/algorithms/matmul.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index e0ebdaac..a5a7b4f2 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -391,7 +391,7 @@ It is more representative to compare against some practical library, such as [Op We've reached ~93% of BLAS performance and ~75% of the theoretical performance limit, which is really great for what is essentially just 40 lines of C. -Interestingly, the whole thing can be rolled into just one deeply nested `for` loop with a BLAS-level of performance (assuming that we're in 2050 and using GCC version 35, which finally stopped screwing up with register spilling): +Interestingly, the whole thing can be rolled into just one deeply nested `for` loop with a BLAS level of performance (assuming that we're in 2050 and using GCC version 35, which finally stopped screwing up with register spilling): ```c++ for (int i3 = 0; i3 < n; i3 += s3) @@ -409,8 +409,6 @@ for (int i3 = 0; i3 < n; i3 += s3) There is also an approach that performs asymptotically fewer arithmetic operations — [the Strassen algorithm](/hpc/external-memory/oblivious/#strassen-algorithm) — but it has a large constant factor, and it is only efficient for [very large matrices](https://arxiv.org/pdf/1605.01078.pdf) ($n > 4000$), where we typically have to use either multiprocessing or some approximate dimensionality-reducing methods anyway. - - ## Generalizations FMA also supports 64-bit floating-point numbers, but it does not support integers: you need to perform addition and multiplication separately, which results in decreased performance. 
If you can guarantee that all intermediate results can be represented exactly as 32- or 64-bit floating-point numbers (which is [often the case](/hpc/arithmetic/errors/)), it may be faster to just convert them to and from floats. From 1d039027db5e184c4d0b4b4824ddfbd119ae1f62 Mon Sep 17 00:00:00 2001 From: Daniel Paleka Date: Thu, 7 Apr 2022 13:30:12 +0200 Subject: [PATCH 030/173] Typo in argmin.md --- content/english/hpc/algorithms/argmin.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/english/hpc/algorithms/argmin.md b/content/english/hpc/algorithms/argmin.md index ccd9f140..0a9531c1 100644 --- a/content/english/hpc/algorithms/argmin.md +++ b/content/english/hpc/algorithms/argmin.md @@ -3,7 +3,7 @@ title: Argmin with SIMD weight: 7 --- -Computing the *minimum* of an array [easily vectorizable](/hpc/simd/reduction), as it is not different from any other reduction: in AVX2, you just need to use a convenient `_mm256_min_epi32` intrinsic as the inner operation. It computes the minimum of two 8-element vectors in one cycle — even faster than in the scalar case, which requires at least a comparison and a conditional move. +Computing the *minimum* of an array is [easily vectorizable](/hpc/simd/reduction), as it is not different from any other reduction: in AVX2, you just need to use a convenient `_mm256_min_epi32` intrinsic as the inner operation. It computes the minimum of two 8-element vectors in one cycle — even faster than in the scalar case, which requires at least a comparison and a conditional move. Finding the *index* of that minimum element (*argmin*) is much harder, but it is still possible to vectorize very efficiently. In this section, we design an algorithm that computes the argmin (almost) at the speed of computing the minimum and ~15x faster than the naive scalar approach. 
From 965c76bb87126d51013dbbe8e181fa439c638138 Mon Sep 17 00:00:00 2001 From: Alex Saveau Date: Sat, 9 Apr 2022 14:33:35 -0700 Subject: [PATCH 031/173] Fix extra word typo --- content/english/hpc/arithmetic/errors.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/english/hpc/arithmetic/errors.md b/content/english/hpc/arithmetic/errors.md index f2e0fbf6..df62e91d 100644 --- a/content/english/hpc/arithmetic/errors.md +++ b/content/english/hpc/arithmetic/errors.md @@ -125,7 +125,7 @@ $$ f(x, y) = x^2 - y^2 = (x + y) \cdot (x - y) $$ -In this one, it is easy to show that the error is be bound by $\epsilon \cdot |x - y|$. It is also faster because it needs 2 additions and 1 multiplication: one fast addition more and one slow multiplication less compared to the original. +In this one, it is easy to show that the error is bound by $\epsilon \cdot |x - y|$. It is also faster because it needs 2 additions and 1 multiplication: one fast addition more and one slow multiplication less compared to the original. ### Kahan Summation From a211cf62040495eddefa3c88f46b2206b513fd86 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Sun, 10 Apr 2022 19:52:42 +0300 Subject: [PATCH 032/173] bugfix --- content/russian/cs/tree-structures/treap.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/russian/cs/tree-structures/treap.md b/content/russian/cs/tree-structures/treap.md index dd3417dd..724ed15f 100644 --- a/content/russian/cs/tree-structures/treap.md +++ b/content/russian/cs/tree-structures/treap.md @@ -199,7 +199,7 @@ struct Node { Вместо того, чтобы модифицировать и `merge`, и `split` под наши хотелки, напишем вспомогательную функцию `upd`, которую будем вызывать при обновлении детей вершины: ```c++ -void sum(Node* v) { return v ? v->sum : 0; } +int sum(Node* v) { return v ? 
v->sum : 0; } // обращаться по пустому указателю нельзя -- выдаст ошибку void upd(Node* v) { v->sum = sum(v->l) + sum(v->r) + v->val; } From cbd4948a082bc4959dfc565a2cc99041753d03b9 Mon Sep 17 00:00:00 2001 From: Alex Saveau Date: Sun, 10 Apr 2022 13:32:20 -0700 Subject: [PATCH 033/173] Fix possible typo? I'm pretty sure this should say not. --- content/english/hpc/external-memory/policies.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/english/hpc/external-memory/policies.md b/content/english/hpc/external-memory/policies.md index 1ff0e724..4cb36bdd 100644 --- a/content/english/hpc/external-memory/policies.md +++ b/content/english/hpc/external-memory/policies.md @@ -33,7 +33,7 @@ $$ The main idea of the proof is to consider the worst case scenario. For LRU it would be the repeating series of $\frac{M}{B}$ distinct blocks: each block is new and so LRU has 100% cache misses. Meanwhile, $OPT_{M/2}$ would be able to cache half of them (but not more, because it only has half the memory). Thus $LRU_M$ needs to fetch double the number of blocks that $OPT_{M/2}$ does, which is basically what is expressed in the inequality, and anything better for $LRU$ would only weaken it. -![Dimmed are the blocks cached by OPT (but note cached by LRU)](../img/opt.png) +![Dimmed are the blocks cached by OPT (but not cached by LRU)](../img/opt.png) This is a very relieving result. It means that, at least in terms of asymptotic I/O complexity, you can just assume that the eviction policy is either LRU or OPT — whichever is easier for you — do complexity analysis with it, and the result you get will normally transfer to any other reasonable cache replacement policy. 
From 6e13a8d7a027ad4dc486e7b82335e766d8137c59 Mon Sep 17 00:00:00 2001 From: Alex Saveau Date: Mon, 11 Apr 2022 00:26:23 -0700 Subject: [PATCH 034/173] Fix code typo --- content/english/hpc/cpu-cache/paging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/english/hpc/cpu-cache/paging.md b/content/english/hpc/cpu-cache/paging.md index fad39a54..684fcd65 100644 --- a/content/english/hpc/cpu-cache/paging.md +++ b/content/english/hpc/cpu-cache/paging.md @@ -53,7 +53,7 @@ always [madvise] never #include void *ptr = std::aligned_alloc(page_size, array_size); -madvise(pre, array_size, MADV_HUGEPAGE); +madvise(ptr, array_size, MADV_HUGEPAGE); ``` You can only request a memory region to be allocated using huge pages if it has the corresponding alignment. From fc5fb2c45ee664d270bc65ca78e40a3a0aaaffbf Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Mon, 11 Apr 2022 17:58:33 +0300 Subject: [PATCH 035/173] fix approximate logarithm formula --- content/english/hpc/arithmetic/rsqrt.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/content/english/hpc/arithmetic/rsqrt.md b/content/english/hpc/arithmetic/rsqrt.md index 06659136..63b2799a 100644 --- a/content/english/hpc/arithmetic/rsqrt.md +++ b/content/english/hpc/arithmetic/rsqrt.md @@ -77,13 +77,13 @@ $$ \log_2 x = e_x + \log_2 (1 + m_x) \approx e_x + m_x + \sigma $$ -Now, having this approximation in mind and defining $L=23$ as the number of mantissa bits in a `float` and $B=127$ for the exponent bias, when we reinterpret the bit-pattern of $x$ as an integer $I_x$, we get +Now, having this approximation in mind and defining $L=2^{23}$ (the number of mantissa bits in a `float`) and $B=127$ (the exponent bias), when we reinterpret the bit-pattern of $x$ as an integer $I_x$, we get $$ \begin{aligned} -I_x &= L(e_x + B + m_x) -\\ &= L(e_x + m_x + \sigma +B-\sigma ) -\\ &\approx L\log_2 (x) + L (B-\sigma ) +I_x &= L \cdot (e_x + B + m_x) +\\ &= L \cdot (e_x + m_x + \sigma 
+B-\sigma ) +\\ &\approx L \cdot \log_2 (x) + L \cdot (B-\sigma ) \end{aligned} $$ From bb31ad26a9cb50c350a24104c2d734704ea72e2f Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Mon, 11 Apr 2022 18:30:22 +0300 Subject: [PATCH 036/173] exponent bias --- content/english/hpc/arithmetic/ieee-754.md | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/content/english/hpc/arithmetic/ieee-754.md b/content/english/hpc/arithmetic/ieee-754.md index ae624add..6b1e2a24 100644 --- a/content/english/hpc/arithmetic/ieee-754.md +++ b/content/english/hpc/arithmetic/ieee-754.md @@ -15,7 +15,7 @@ When we designed our [DIY floating-point type](../float), we omitted quite a lot - What happens if we increment the largest representable number? - Can we somehow detect if one of the above three happened? -Most of the early computers didn't have floating-point arithmetic, and when vendors started adding floating-point coprocessors, they had slightly different visions for what answers to those questions should be. Diverse implementations made it difficult to use floating-point arithmetic reliably and portably — particularly for people developing compilers. +Most of the early computers didn't support floating-point arithmetic, and when vendors started adding floating-point coprocessors, they had slightly different visions for what the answers to these questions should be. Diverse implementations made it difficult to use floating-point arithmetic reliably and portably — especially for the people who develop compilers. In 1985, the Institute of Electrical and Electronics Engineers published a standard (called [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754)) that provided a formal specification of how floating-point numbers should work, which was quickly adopted by the vendors and is now used in virtually all general-purpose computers. 
@@ -27,6 +27,15 @@ Similar to our handmade float implementation, hardware floats use one bit for si One of the reasons why they are stored in this exact order is that it is easier to compare and sort them: you can use mostly the same comparator circuit as for [unsigned integers](../integer), except for maybe flipping some bits in case one of the numbers is negative. +For the same reason, the exponent is *biased:* the actual value is 127 less than the stored unsigned integer, which lets us also cover the values less than one (with negative exponents). In the example above: + +$$ +(-1)^0 \times 2^{01111100_2 - 127} \times (1 + 2^{-2}) += 2^{124 - 127} \times 1.25 += \frac{1.25}{8} += 0.15625 +$$ + IEEE 754 and a few subsequent standards define not one but *several* representations that differ in sizes, most notably: | Type | Sign | Exponent | Mantissa | Total bits | Approx. decimal digits | @@ -46,11 +55,11 @@ Their availability ranges from chip to chip: - Half-precision arithmetic only supports a small subset of operations and is generally used for machine learning applications, especially neural networks, because they tend to do a large amount of calculation, but don't require a high level of precision. - Half-precision is being gradually replaced by bfloat, which trades off 3 mantissa bits to have the same range as single-precision, enabling interoperability with it. It is mostly being adopted by specialized hardware: TPUs, FPGAs, and GPUs. The name stands for "[Brain](https://en.wikipedia.org/wiki/Google_Brain) float." -Lower precision types need less memory bandwidth to move them around and usually take fewer cycles to operate on (e. g. the division instruction may take $x$, $y$, or $z$ cycles depending on the type), which is why they are preferred when error tolerance allows it. +Lower-precision types need less memory bandwidth to move them around and usually take fewer cycles to operate on (e. g.
the division instruction may take $x$, $y$, or $z$ cycles depending on the type), which is why they are preferred when error tolerance allows it. Deep learning, emerging as a very popular and computationally intensive field, created a huge demand for low-precision matrix multiplication, which led to manufacturers developing separate hardware or at least adding specialized instructions that support these types of computations — most notably, Google developing a custom chip called TPU (*tensor processing unit*) that specializes in multiplying 128-by-128 bfloat matrices, and NVIDIA adding "tensor cores," capable of performing 4-by-4 matrix multiplication in one go, to all their newer GPUs. -Apart from their sizes, most of the behavior is exactly the same between all floating-point types, which we will now clarify. +Apart from their sizes, most of the behavior is the same between all floating-point types, which we will now clarify. ## Handling Corner Cases From 436ffa7b608309d8a2246f403d2c95557bbb7d76 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Mon, 11 Apr 2022 19:10:37 +0300 Subject: [PATCH 037/173] comments about bit tricks in fast rsqrt --- content/english/hpc/arithmetic/rsqrt.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/content/english/hpc/arithmetic/rsqrt.md b/content/english/hpc/arithmetic/rsqrt.md index 63b2799a..9817e5a9 100644 --- a/content/english/hpc/arithmetic/rsqrt.md +++ b/content/english/hpc/arithmetic/rsqrt.md @@ -77,7 +77,7 @@ $$ \log_2 x = e_x + \log_2 (1 + m_x) \approx e_x + m_x + \sigma $$ -Now, having this approximation in mind and defining $L=2^{23}$ (the number of mantissa bits in a `float`) and $B=127$ (the exponent bias), when we reinterpret the bit-pattern of $x$ as an integer $I_x$, we get +Now, having this approximation in mind and defining $L=2^{23}$ ($2$ to the power of the number of mantissa bits in a `float`) and $B=127$ (the exponent bias), when we reinterpret the bit-pattern of $x$ as an integer $I_x$, we essentially
get $$ \begin{aligned} @@ -87,9 +87,11 @@ I_x &= L \cdot (e_x + B + m_x) \end{aligned} $$ +(Multiplying a number by $L=2^{23}$ is equivalent to left-shifting it by 23.) + When you tune $\sigma$ to minimize the mean square error, this results in a surprisingly accurate approximation. -![](../img/approx.svg) +![Reinterpreting a floating-point number $x$ as an integer (blue) compared to its scaled and shifted logarithm (gray)](../img/approx.svg) Now, expressing the logarithm from the approximation, we get From 95899a63c97b582a7b93ceb66369d27cd854c3e0 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Mon, 11 Apr 2022 19:12:10 +0300 Subject: [PATCH 038/173] more precise wording --- content/english/hpc/arithmetic/rsqrt.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/english/hpc/arithmetic/rsqrt.md b/content/english/hpc/arithmetic/rsqrt.md index 9817e5a9..0fa4d209 100644 --- a/content/english/hpc/arithmetic/rsqrt.md +++ b/content/english/hpc/arithmetic/rsqrt.md @@ -87,7 +87,7 @@ I_x &= L \cdot (e_x + B + m_x) \end{aligned} $$ -(Multiplying a number by $L=2^{23}$ is equivalent to left-shifting it by 23.) +(Multiplying an integer by $L=2^{23}$ is equivalent to left-shifting it by 23.) When you tune $\sigma$ to minimize the mean square error, this results in a surprisingly accurate approximation. 
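As a quick sanity check of the approximation above, here is a minimal C++ sketch (not part of the patch itself) that reinterprets a `float` as an integer and inverts the formula to recover an approximate $\log_2 x$. The helper names `float_bits` and `approx_log2` are hypothetical, and the default `sigma = 0.045` is an assumed illustrative value rather than a carefully tuned constant:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret the bit pattern of a float as a 32-bit unsigned integer.
// std::memcpy is used instead of a pointer cast to avoid undefined behavior.
uint32_t float_bits(float x) {
    uint32_t i;
    std::memcpy(&i, &x, sizeof i);
    return i;
}

// Invert I_x ~ L * log2(x) + L * (B - sigma) to get an approximate log2(x).
// Only meaningful for positive, normalized x; sigma is an assumed constant.
double approx_log2(float x, double sigma = 0.045) {
    const double L = 8388608.0; // 2^23
    const double B = 127.0;     // exponent bias
    return float_bits(x) / L - (B - sigma);
}
```

Since $\log_2(1 + m_x) - m_x$ stays between $0$ and roughly $0.086$ on $[0, 1)$, with this choice of `sigma` the absolute error of `approx_log2` should stay below $0.05$ for positive normalized inputs.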
From aec8d782b10e76df76c6483d34fb05a4f988462a Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Mon, 11 Apr 2022 20:40:12 +0300 Subject: [PATCH 039/173] fix variable names in dp example --- .../english/hpc/external-memory/locality.md | 49 +++++++++++-------- 1 file changed, 28 insertions(+), 21 deletions(-) diff --git a/content/english/hpc/external-memory/locality.md b/content/english/hpc/external-memory/locality.md index a26ff70f..569d9437 100644 --- a/content/english/hpc/external-memory/locality.md +++ b/content/english/hpc/external-memory/locality.md @@ -47,44 +47,51 @@ In practice, there is still some overhead associated with the recursion, and for ### Dynamic Programming -Similar reasoning can be applied to the implementations of dynamic programming algorithms but leading to the reverse result. Consider the classic knapsack problem, where we got $n$ items with integer costs $c_i$, and we need to pick a subset of items with the maximum total cost that does not exceed a given constant $w$. +Similar reasoning can be applied to the implementations of dynamic programming algorithms but leading to the reverse result. Consider the classic *knapsack problem:* given $N$ items with positive integer costs $c_i$, pick a subset of items with the maximum total cost that does not exceed a given constant $W$. -The way to solve it is to introduce the *state* $f[i, k]$, which corresponds to the maximum total cost not exceeding $k$ that can be achieved having already considered and excluded the first $i$ items. The state can be updated in $O(1)$ time per entry if consider either taking or not taking the $i$-th item and using further states of the dynamic to compute the optimal decision for each state. +The way to solve it is to introduce the *state* $f[n, w]$, which corresponds to the maximum total cost not exceeding $w$ that can be achieved using only the first $n$ items. 
These values can be computed in $O(1)$ time per entry if we consider either taking or not taking the $n$-th item and using the previous states of the dynamic to make the optimal decision. -Python has a handy `lru_cache` decorator, which can be used for implementing it with memoized recursion: +Python has a handy `lru_cache` decorator which can be used for implementing it with memoized recursion: ```python @lru_cache -def f(i, k): - if i == n or k == 0: +def f(n, w): + # check if we have no items to choose + if n == 0: return 0 - if w[i] > k: - return f(i + 1, k) - return max(f(i + 1, k), c[i] + f(i + 1, k - w[i])) + + # check if we can't pick the last item (note zero-based indexing) + if c[n - 1] > w: + return f(n - 1, w) + + # otherwise, we can either pick the last item or not + return max(f(n - 1, w), c[n - 1] + f(n - 1, w - c[n - 1])) ``` -When computing $f[n, w]$, the recursion may visit up to $O(n \cdot w)$ different states, which is asymptotically efficient, but rather slow in reality. Even after nullifying the overhead of Python recursion and all the hash table queries required for the LRU cache to work, it would still be slow because it does random I/O throughout most of the execution. +When computing $f[N, W]$, the recursion may visit up to $O(N \cdot W)$ different states, which is asymptotically efficient, but rather slow in reality. Even after nullifying the overhead of Python recursion and all the [hash table queries](../policies/#implementing-caching) required for the LRU cache to work, it would still be slow because it does random I/O throughout most of the execution. What we can do instead is to create a two-dimensional array for the dynamic and replace the recursion with a nice nested loop like this: ```cpp -int f[N + 1][W + 1]; +int f[N + 1][W + 1] = {0}; // this zero-fills the array -for (int i = n - 1; i >= 0; i++) - for (int k = 0; k <= W; k++) - f[i][k] = w[i] > k ? 
f[i + 1][k] : max(f[i + 1][k], c[i] + f[i + 1][k - w[i]]); +for (int n = 1; n <= N; n++) + for (int w = 0; w <= W; w++) + f[n][w] = c[n - 1] > w ? + f[n - 1][w] : + max(f[n - 1][w], c[n - 1] + f[n - 1][w - c[n - 1]]); ``` -Notice that we are only using the previous layer of the dynamic to calculate the next one. This means that if we can store one layer in the cache, we would only need to write $O(\frac{n \cdot w}{B})$ blocks in external memory. +Notice that we are only using the previous layer of the dynamic to calculate the next one. This means that if we can store one layer in the cache, we would only need to write $O(\frac{N \cdot W}{B})$ blocks in external memory. -Moreover, if we only need the answer, we don't actually have to store the whole 2d array but only the last layer. This lets us use just $O(w)$ memory by maintaining a single array of $w$ values. To simplify the code, we can slightly change the dynamic to store a binary value: whether it is possible to get the sum of exactly $k$ using the items that we have already considered. This dynamic is even faster to compute: +Moreover, if we only need the answer, we don't actually have to store the whole 2d array but only the last layer. This lets us use just $O(W)$ memory by maintaining a single array of $W$ values. To simplify the code, we can slightly change the dynamic to store a binary value: whether it is possible to get the sum of exactly $w$ using the items that we have already considered.
This dynamic is even faster to compute: ```cpp -bool f[W + 1] = {}; // this zero-fills the array +bool f[W + 1] = {0}; f[0] = 1; -for (int i = 0; i < n; i++) - for (int x = W - a[i]; x >= 0; x--) - f[x + a[i]] |= f[x]; +for (int n = 0; n < N; n++) + for (int x = W - c[n]; x >= 0; x--) + f[x + c[n]] |= f[x]; ``` As a side note, now that it only uses simple bitwise operations, it can be optimized further by using a bitset: @@ -92,8 +99,8 @@ As a side note, now that it only uses simple bitwise operations, it can be optim ```cpp std::bitset b; b[0] = 1; -for (int i = 0; i < n; i++) - b |= b << c[i]; +for (int n = 0; n < N; n++) + b |= b << c[n]; ``` Surprisingly, there is still some room for improvement, and we will come back to this problem later. From 9872a11b931c184f51ce01e076d6b9adb1bbe690 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Mon, 11 Apr 2022 20:46:50 +0300 Subject: [PATCH 040/173] change wording --- content/english/hpc/cpu-cache/bandwidth.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/english/hpc/cpu-cache/bandwidth.md b/content/english/hpc/cpu-cache/bandwidth.md index 88b547ad..a28570f5 100644 --- a/content/english/hpc/cpu-cache/bandwidth.md +++ b/content/english/hpc/cpu-cache/bandwidth.md @@ -38,7 +38,7 @@ All CPU cache layers are placed on the same microchip as the processor, so the b ![](../img/boost.svg) -This detail comes into play when comparing algorithm implementations. Unless the dataset fits entirely in the cache, the relative performance of the two implementations may be different depending on the CPU clock rate because the RAM remains unaffected by it, while everything else does. +This detail comes into play when comparing algorithm implementations. When the working dataset fits in the cache, the relative performance of the two implementations may be different depending on the CPU clock rate because the RAM remains unaffected by it (while everything else does not). 
For this reason, it is [advised](/hpc/profiling/noise) to keep the clock rate fixed, and as the turbo boost isn't stable enough, we run most of the benchmarks in this book at plain 2GHz. From 69390b1012b84459f33b279f2e4646a0ed41f357 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 12 Apr 2022 17:45:08 +0300 Subject: [PATCH 041/173] typo --- content/english/hpc/external-memory/list-ranking.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/english/hpc/external-memory/list-ranking.md b/content/english/hpc/external-memory/list-ranking.md index cf5d9929..6d7c0053 100644 --- a/content/english/hpc/external-memory/list-ranking.md +++ b/content/english/hpc/external-memory/list-ranking.md @@ -50,11 +50,11 @@ List ranking is especially useful in graph algorithms. For example, we can obtain the Euler tour of a tree in external memory by constructing a linked list from the tree that corresponds to its Euler tour and then applying the list ranking algorithm — the ranks of each node will be the same as its index $tin_v$ in the Euler tour. To construct this list, we need to: -- split each undirected tree edge into two directed ones; -- duplicate the parent node for each up-edge (because list nodes can only have one incoming edge, but we visit some tree vertices multiple times); +- split each undirected edge into two directed ones; +- duplicate the parent node for each up-edge (because list nodes can only have one incoming edge, but we visit some vertices multiple times); - route each such node either to the "next sibling," if it has one, or otherwise to its own parent; - and then finally break the resulting cycle at the root. This general technique is called *tree contraction*, and it serves as the basis for a large number of tree algorithms. -Exactly the same approach can be applied to parallel algorithms, and we will convert that much more deeply in part 2. 
+The same approach can be applied to parallel algorithms, and we will cover that much more deeply in part II. From 0b9d2bb532003b65c1ae4bc9f5477bd2f4a5ddf4 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 12 Apr 2022 17:48:21 +0300 Subject: [PATCH 042/173] link to strassen algorithm implementation paper --- content/english/hpc/external-memory/oblivious.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/english/hpc/external-memory/oblivious.md b/content/english/hpc/external-memory/oblivious.md index 5e4650b2..a0327855 100644 --- a/content/english/hpc/external-memory/oblivious.md +++ b/content/english/hpc/external-memory/oblivious.md @@ -198,7 +198,7 @@ $$ T(N) = O\left(\frac{(\sqrt{M})^2}{B} \cdot \left(\frac{N}{\sqrt M}\right)^3\right) = O\left(\frac{N^3}{B\sqrt{M}}\right) $$ -This is better than just $O(\frac{N^3}{B})$ and by quite a lot. +This is better than just $O(\frac{N^3}{B})$, and by quite a lot. ### Strassen Algorithm @@ -237,7 +237,7 @@ $$ You can verify these formulas with simple substitution if you feel like it. -As far as I know, none of the mainstream optimized linear algebra libraries use the Strassen algorithm, although there are some prototype implementations that are efficient for matrices larger than 4000 or so. +As far as I know, none of the mainstream optimized linear algebra libraries use the Strassen algorithm, although there are [some prototype implementations](https://arxiv.org/pdf/1605.01078.pdf) that are efficient for matrices larger than 2000 or so. This technique can and actually has been extended multiple times to reduce the asymptotic even further by considering more submatrix products. As of 2020, current world record is $O(n^{2.3728596})$. Whether you can multiply matrices in $O(n^2)$ or at least $O(n^2 \log^k n)$ time is an open problem. 
From c5b7bd4b85ab1a90c25400c073181df687978377 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 12 Apr 2022 19:11:58 +0300 Subject: [PATCH 043/173] note about kernel design choices --- content/english/hpc/algorithms/matmul.md | 25 ++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index a5a7b4f2..c692a227 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -272,6 +272,31 @@ void kernel(float *a, vec *b, vec *c, int x, int y, int l, int r, int n) { We need `t` so that the compiler stores these elements in vector registers. We could just update their final destinations in `c`, but, unfortunately, the compiler re-writes them back to memory, causing a slowdown (wrapping everything in `__restrict__` keywords doesn't help). +After unrolling these loops and hoisting `b` out of the `i` loop (`b[(k * n + y) / 8 + j]` does not depend on `i` and can be loaded once and reused in all 6 iterations), the compiler generates something more similar to this: + + + +```c++ +for (int k = l; k < r; k++) { + __m256 b0 = _mm256_load_ps((float*) &b[k * n + y]); + __m256 b1 = _mm256_load_ps((float*) &b[k * n + y + 8]); + + __m256 a0 = _mm256_broadcast_ss(&a[x * n + k]); + t00 = _mm256_fmadd_ps(a0, b0, t00); + t01 = _mm256_fmadd_ps(a0, b1, t01); + + __m256 a1 = _mm256_broadcast_ss(&a[(x + 1) * n + k]); + t10 = _mm256_fmadd_ps(a1, b0, t10); + t11 = _mm256_fmadd_ps(a1, b1, t11); + + // ... +} +``` + +We are using $12+3=15$ vector registers and a total of $6 \times 3 + 2 = 20$ instructions to perform $16 \times 6 = 96$ updates. Assuming that there are no other bottlenecks, we should be hitting the throughput of `_mm256_fmadd_ps`. + +Note that this kernel is architecture-specific.
If we didn't have `fma`, or if its throughput/latency were different, or if the SIMD width was 128 or 512 bits, we would have made different design choices. Multi-platform BLAS implementations ship [many kernels](https://github.com/xianyi/OpenBLAS/tree/develop/kernel), each written in assembly by hand and optimized for a particular architecture.
+
 The rest of the implementation is straightforward. Similar to the previous vectorized implementation, we just move the matrices to memory-aligned arrays and call the kernel instead of the innermost loop:

 ```c++

From 473fe8562d44b769d27a0b8c8229f281eea2d3b3 Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Tue, 12 Apr 2022 23:40:21 +0300
Subject: [PATCH 044/173] mlp clarifications

---
 content/english/hpc/cpu-cache/mlp.md | 2 +-
 content/english/hpc/cpu-cache/prefetching.md | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/content/english/hpc/cpu-cache/mlp.md b/content/english/hpc/cpu-cache/mlp.md
index 11c5b660..95dfa4cb 100644
--- a/content/english/hpc/cpu-cache/mlp.md
+++ b/content/english/hpc/cpu-cache/mlp.md
@@ -3,7 +3,7 @@ title: Memory-Level Parallelism
 weight: 5
 ---

-Memory requests can overlap in time: while you wait for a read request to complete, you can send a few others, which will be executed concurrently with it. This is the reason why [linear iteration](../bandwidth) is so much faster than [pointer jumping](../latency): the CPU knows which memory locations it needs to fetch next and sends memory requests far ahead of time.
+Memory requests can overlap in time: while you wait for a read request to complete, you can send a few others, which will be executed concurrently with it. This is the main reason why [linear iteration](../bandwidth) is so much faster than [pointer jumping](../latency): the CPU knows which memory locations it needs to fetch next and sends memory requests far ahead of time.
 The number of concurrent memory operations is large but limited, and it is different for different types of memory. When designing algorithms and especially data structures, you may want to know this number, as it limits the amount of parallelism your computation can achieve.

diff --git a/content/english/hpc/cpu-cache/prefetching.md b/content/english/hpc/cpu-cache/prefetching.md
index 8ccdea6b..3001389c 100644
--- a/content/english/hpc/cpu-cache/prefetching.md
+++ b/content/english/hpc/cpu-cache/prefetching.md
@@ -70,7 +70,7 @@ There is some overhead to computing the next address, but for arrays large enoug

 ![](../img/sw-prefetch.svg)

-Interestingly, we can prefetch more than just two elements ahead, making use of this pattern in the LCG function:
+Interestingly, we can prefetch more than just one element ahead, making use of this pattern in the LCG function:

 $$
 \begin{aligned}
@@ -82,17 +82,17 @@ $$
 \end{aligned}
 $$

-Hence, in order to load `D` elements ahead, we can do this:
+Hence, to load the `D`-th element ahead, we can do this:

 ```cpp
 __builtin_prefetch(&q[((1 << D) * k + (1 << D) - 1) % n]);
 ```

-Ignoring some issues such as the integer overflow, this way we can reduce the latency arbitrarily close to the cost of computing the next index (which in this case is dominated by the [modulo operation](/hpc/arithmetic/division)).
+If we execute this request on every iteration, we will be simultaneously prefetching `D` elements ahead on average, increasing the throughput by `D` times. Ignoring some issues such as the integer overflow when `D` is too large, this way, we can reduce the average latency arbitrarily close to the cost of computing the next index (which, in this case, is dominated by the [modulo operation](/hpc/arithmetic/division)).

 ![](../img/sw-prefetch-others.svg)

-Note that this is an artificial example, and you actually fail more often than not when trying to insert software prefetching into practical programs.
This is largely due to the fact that you need to issue a separate memory instruction that may compete for resources with the others. At the same time, hardware prefetching is 100% harmless as it only activates when the memory and cache buses are not busy.
+Note that this is an artificial example, and you actually fail more often than not when trying to insert software prefetching into practical programs. This is largely because you need to issue a separate memory instruction that may compete for resources with the others. At the same time, hardware prefetching is 100% harmless as it only activates when the memory and cache buses are not busy.

 You can also specify a specific level of cache the data needs to be brought to when doing software prefetching — when you aren't sure if you will be using it and don't want to kick out what is already in the L1 cache. You can use it with the `_mm_prefetch` intrinsic, which takes an integer value as the second parameter, specifying the cache level. This is useful in combination with [non-temporal loads and stores](../bandwidth#bypassing-the-cache).

From 2a0cf6808d345a51c19de689c8b206f30d1ae92d Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Tue, 12 Apr 2022 23:47:11 +0300
Subject: [PATCH 045/173] prefetching edits

---
 content/english/hpc/cpu-cache/prefetching.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/content/english/hpc/cpu-cache/prefetching.md b/content/english/hpc/cpu-cache/prefetching.md
index 3001389c..4f5a7545 100644
--- a/content/english/hpc/cpu-cache/prefetching.md
+++ b/content/english/hpc/cpu-cache/prefetching.md
@@ -30,9 +30,9 @@ for (int i = 0; i + 16 < N; i += 16) {
 }
 ```

-There is no point in making a graph because the latency is flat: 3ns regardless of the array size.
Even though the instruction scheduler still can't tell what we are going to fetch next, the memory prefetcher can detect a pattern just by looking at the memory accesses and start loading the next cache line ahead of time, leveling out its latency.
+There is no point in making a graph because it would be just flat: the latency is 3ns regardless of the array size. Even though the instruction scheduler still can't tell what we are going to fetch next, the memory prefetcher can detect a pattern just by looking at the memory accesses and start loading the next cache line ahead of time, mitigating the latency.

-Hardware prefetching is usually powerful enough for most cases, but it only detects simple patterns. You can iterate forward and backward over multiple arrays in parallel, perhaps with small-to-medium strides, but that's about it. For anything more complex, the prefetcher won't figure out what's happening, and we need to help it out ourselves.
+Hardware prefetching is smart enough for most use cases, but it only detects simple patterns. You can iterate forward and backward over multiple arrays in parallel, perhaps with small-to-medium strides, but that's about it. For anything more complex, the prefetcher won't figure out what's happening, and we need to help it out ourselves.

 ### Software Prefetching

@@ -88,7 +88,7 @@ Hence, to load the `D`-th element ahead, we can do this:
 __builtin_prefetch(&q[((1 << D) * k + (1 << D) - 1) % n]);
 ```

-If we execute this request on every iteration, we will be simultaneously prefetching `D` elements ahead on average, increasing the throughput by `D` times. Ignoring some issues such as the integer overflow when `D` is too large, this way, we can reduce the average latency arbitrarily close to the cost of computing the next index (which, in this case, is dominated by the [modulo operation](/hpc/arithmetic/division)).
+If we execute this request on every iteration, we will be simultaneously prefetching `D` elements ahead on average, increasing the throughput by `D` times. Ignoring some issues such as the integer overflow when `D` is too large, we can reduce the average latency arbitrarily close to the cost of computing the next index (which, in this case, is dominated by the [modulo operation](/hpc/arithmetic/division)).

 ![](../img/sw-prefetch-others.svg)

From 68ae398833ab4e47918c4e20f013a333729b0bb9 Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Wed, 13 Apr 2022 10:54:28 +0300
Subject: [PATCH 046/173] column -> cell

---
 content/english/hpc/cpu-cache/aos-soa.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/english/hpc/cpu-cache/aos-soa.md b/content/english/hpc/cpu-cache/aos-soa.md
index 048271db..d5765339 100644
--- a/content/english/hpc/cpu-cache/aos-soa.md
+++ b/content/english/hpc/cpu-cache/aos-soa.md
@@ -99,8 +99,8 @@ As the performance on smaller arrays sizes is not affected, this clearly has som
 From the performance analysis point of view, all data in RAM is physically stored in a two-dimensional array of tiny capacitor cells, which is split into rows and columns. To read or write any cell, you need to perform one, two, or three actions:

 1. Read the contents of a row in a *row buffer*, which temporarily discharges the capacitors.
-2. Read or write a specific column in this buffer.
-3. Write the contents of a row buffer back into the capacitors, so that the data is preserved, and the row buffer can be used for other memory accesses.
+2. Read or write a specific cell in this buffer.
+3. Write the contents of a row buffer back into the capacitors so that the data is preserved and the row buffer can be used for other memory accesses.

 Here is the punchline: you don't have to perform steps 1 and 3 between two memory accesses that correspond to the same row — you can just use the row buffer as a temporary cache.
These three actions take roughly the same time, so this optimization makes long sequences of row-local accesses run thrice as fast compared to dispersed access patterns.

From 50ffb1c9324e9d62433f178ba62494070c9b1afd Mon Sep 17 00:00:00 2001
From: Alex Saveau
Date: Fri, 15 Apr 2022 11:33:19 -0700
Subject: [PATCH 047/173] Fix missing word

---
 content/english/hpc/data-structures/binary-search.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index ff9f73b4..36bb5059 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -9,7 +9,7 @@ Instead, the most fascinating showcases of performance engineering are multifold

-In this article, we focus on such fundamental algorithm — *binary search* — and implement two of its variants that are, depending on the problem size, up to 4x faster than `std::lower_bound`, while being under just 15 lines of code.
+In this article, we focus on one such fundamental algorithm — *binary search* — and implement two of its variants that are, depending on the problem size, up to 4x faster than `std::lower_bound`, while being under just 15 lines of code.

 The first algorithm achieves that by removing [branches](/hpc/pipelining/branching), and the second also optimizes the memory layout to achieve better [cache system](/hpc/cpu-cache) performance. This technically disqualifies it from being a drop-in replacement for `std::lower_bound` as it needs to permute the elements of the array before it can start answering queries — but I can't recall a lot of scenarios where you obtain a sorted array but can't afford to spend linear time on preprocessing.
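
To illustrate what "removing branches" can look like for binary search, here is a minimal sketch. It is not taken from the article: `branchless_lower_bound` is a hypothetical name, and whether the compiler really emits a conditional move instead of a branch depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the first element >= x in the sorted vector a,
// or a.size() if there is no such element.
std::size_t branchless_lower_bound(const std::vector<int>& a, int x) {
    if (a.empty())
        return 0;
    const int* base = a.data();
    std::size_t len = a.size();
    // Invariant: the answer lies in [base, base + len].
    while (len > 1) {
        std::size_t half = len / 2;
        // The comparison result is used as a 0/1 multiplier, not as a jump condition.
        base += (base[half - 1] < x) * half;
        len -= half;
    }
    return static_cast<std::size_t>(base - a.data()) + (*base < x);
}
```

The loop always runs the same number of iterations for a given array size, so the only data-dependent part is the arithmetic on `base`.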
From 35016003c29a455a023f56118f5a9a0cf9c48072 Mon Sep 17 00:00:00 2001
From: Elk Cloner <28754537+elkcl@users.noreply.github.com>
Date: Sat, 16 Apr 2022 17:34:45 +0300
Subject: [PATCH 048/173] Fix the link to the z-function in the suffix array article
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 content/russian/cs/string-structures/suffix-array.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/russian/cs/string-structures/suffix-array.md b/content/russian/cs/string-structures/suffix-array.md
index 80d2b129..25b90a3e 100644
--- a/content/russian/cs/string-structures/suffix-array.md
+++ b/content/russian/cs/string-structures/suffix-array.md
@@ -136,7 +136,7 @@ vector suffix_array(vector &s) {

 ### The Kasai, Arimura, Arikawa, Lee, and Park algorithm

-In practice, the algorithm is called anything but its original name (*Kasai's algorithm*, *the algorithm of five Koreans*, etc.). It is used to compute $lcp$ in linear time. To the author, the idea of the algorithm seems somewhat similar to the [z-function](string-searching).
+In practice, the algorithm is called anything but its original name (*Kasai's algorithm*, *the algorithm of five Koreans*, etc.). It is used to compute $lcp$ in linear time. To the author, the idea of the algorithm seems somewhat similar to the [z-function](/cs/string-searching/z-function).

 **Claim.** Suppose we have already built the suffix array and computed $lcp[i]$.
Then:

From 656f10fb82d03cb22d928566a5a67e7f6a8fcbd6 Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Wed, 20 Apr 2022 05:28:45 +0300
Subject: [PATCH 049/173] bugfix

---
 content/russian/cs/layer-optimizations/_index.md | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/content/russian/cs/layer-optimizations/_index.md b/content/russian/cs/layer-optimizations/_index.md
index 492473b5..2456aa4c 100644
--- a/content/russian/cs/layer-optimizations/_index.md
+++ b/content/russian/cs/layer-optimizations/_index.md
@@ -10,10 +10,7 @@ date: 2021-08-29

 **Problem.** Given $n$ points on a line, sorted by their coordinate $x_i$, find $m$ segments that cover all the points while minimizing the sum of the squares of their lengths.

-**The basic solution** is the following dynamic programming:
-
-- $f[i, j]$ = the minimum cost of covering the first $i$ points using at most $j$ segments.
-- The transition is to try all possible last segments, that is
+**The basic solution** is to define the DP state $f[i, j]$ as the minimum cost of covering the first $i$ points using at most $j$ segments.
It can be recomputed by trying all possible last segments:

 $$
 f[i, j] = \min_{k < i} \{f[k, j-1] + (x_{i-1}-x_k)^2 \}
@@ -30,7 +27,7 @@ int cost(int i, int j) {
 }

 for (int i = 0; i <= m; i++)
-    f[0][k] = 0; // if there is nothing to cover, everything is already fine
+    f[0][i] = 0; // if there is nothing to cover, everything is already fine
 // all other f are assumed to be infinity

 for (int i = 1; i <= n; i++)

From 85bc919acc8cb33a7a09e0d37d973cef0548e7bf Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Wed, 20 Apr 2022 05:54:04 +0300
Subject: [PATCH 050/173] fix divide and conquer dp

---
 .../layer-optimizations/divide-and-conquer.md | 31 +++++++++----------
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/content/russian/cs/layer-optimizations/divide-and-conquer.md b/content/russian/cs/layer-optimizations/divide-and-conquer.md
index 61a7304a..c5e218db 100644
--- a/content/russian/cs/layer-optimizations/divide-and-conquer.md
+++ b/content/russian/cs/layer-optimizations/divide-and-conquer.md
@@ -8,44 +8,43 @@ published: true

 *This article is one of a [series](../). It is recommended to read all the previous ones first.*

-Let's look at the DP recurrence for the basic solution:
+Let's look at the DP recurrence from the basic solution:

 $$
 f[i, j] = \min_{k < i} \{f[k, j-1] + (x_{i-1}-x_k)^2 \}
 $$

-Denote by $opt[i, j]$ the optimal $k$ for a given state, that is, of the expression above. For definiteness, if there is more than one optimal index, we pick the rightmost one.
+Denote by $opt[i, j]$ the optimal $k$ for a given state, that is, the argmin of the expression above. For definiteness, if there is more than one optimal index, we pick the rightmost one.

-Specifically in the problem of covering points with segments, we can notice the following:
+Specifically in the problem of covering points with segments we can notice the following:

 $$
-opt[i, j] \leq opt[i+1, j]
+opt[i, j] \leq opt[i + 1, j]
 $$

-The intuition is this: when we shift i to the right, the point at which the last group can start cannot decrease.
+The intuition is this: if we need to cover a larger prefix of points, the start of the last segment will certainly not be earlier.

-### Idea
+### Algorithm

-Suppose we already know $opt[i, l]$ and $opt[i, r]$ and want to compute $opt[i, j]$ for some $j$ between $l$ and $r$. Then, using the inequality above, we can narrow the search range for the optimal index for $j$ from the whole range $[0, i-1]$ to $[opt[i, l], opt[i, r]]$.
+Suppose we already know $opt[l, k]$ and $opt[r, k]$ and want to compute $opt[i, k]$ for some $i$ between $l$ and $r$. Then, using the inequality above, we can narrow the search range for the optimal index for $i$ from the whole range $[0, i - 1]$ to $[opt[l, k], opt[r, k]]$.

-We will do the following: create a recursive function that computes the DP values for the range $[l, r]$, knowing that their $opt$ lie between $l'$ and $r'$. This function simply takes the midpoint of $[l, r]$ and computes the answer for it with a linear pass, and then recursively calls itself on the halves, passing $[l', opt]$ and $[opt, r']$ as the bounds, respectively.
-
-### Implementation
-
-One whole $k$-th layer is recomputed from the $(k-1)$-th as follows:
+We will do the following: create a recursive function that computes the DP values for the range $[l, r]$ on the $k$-th layer, knowing that their $opt$ lie between $l'$ and $r'$.
This function simply takes the midpoint of $[l, r]$ and computes the answer for it with a linear pass, and then recursively calls itself on the halves, passing $[l', opt]$ and $[opt, r']$ as the bounds, respectively:

 ```c++
+// [ l, r] -- which DP values on the k-th layer to compute
+// [_l, _r] -- where their answers can be
 void solve(int l, int r, int _l, int _r, int k) {
     if (l > r)
         return; // the range is empty, so return
     int opt = _l, t = (l + r) / 2;
+    // compute the answer for f[t][k]
     for (int i = _l; i <= min(_r, t); i++) {
         int val = f[i + 1][k - 1] + cost(i, t - 1);
         if (val < f[t][k])
             f[t][k] = val, opt = i;
     }
-    solve(l, t - 1, _l, opt, k);
-    solve(t + 1, r, opt, _r, k);
+    solve(l, t - 1, _l, opt, k);
+    solve(t + 1, r, opt, _r, k);
 }
 ```

@@ -56,8 +55,6 @@ for (int k = 1; k <= m; k++)
     solve(0, n - 1, 0, n - 1, k);
 ```

-### Asymptotics
-
 Since the range $[l, r]$ roughly halves on each call, the recursion depth is $O(\log n)$. Since the search ranges for all elements on one "level" can only overlap at their boundaries, the search checks $O(n)$ distinct indices in total on each level. Accordingly, recomputing a whole layer takes $O(n \log n)$ operations instead of $O(n^2)$ in the basic solution.

-Thus, we have improved the asymptotic complexity to $O(n m \log n)$.
+Thus, we have improved the asymptotic complexity to $O(n \cdot m \cdot \log n)$.
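
The `solve` routine in the patch above relies on a global table `f` and a `cost` helper that are defined elsewhere in the article. As a rough, self-contained sketch of the same divide-and-conquer idea, the version below keeps only two layers and indexes states by prefix length. It assumes that `opt` is non-decreasing in the prefix length and that coordinates are small enough for the squared lengths to fit in 64 bits. All names (`segment_cost`, `solve_layer`, `min_cover_cost`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

using ll = long long;
const ll INF = std::numeric_limits<ll>::max();

// Cost of one segment covering the sorted points x[k..i-1] (requires k < i).
ll segment_cost(const std::vector<ll>& x, int k, int i) {
    ll len = x[i - 1] - x[k];
    return len * len;
}

// Fills cur[lo..hi], knowing that the optimal split points lie in [opt_lo, opt_hi].
void solve_layer(const std::vector<ll>& x, const std::vector<ll>& prev,
                 std::vector<ll>& cur, int lo, int hi, int opt_lo, int opt_hi) {
    if (lo > hi)
        return;
    int mid = (lo + hi) / 2;
    int opt = opt_lo;
    cur[mid] = INF;
    for (int k = opt_lo; k <= std::min(opt_hi, mid - 1); k++) {
        if (prev[k] == INF)
            continue; // unreachable state on the previous layer
        ll val = prev[k] + segment_cost(x, k, mid);
        if (val < cur[mid]) {
            cur[mid] = val;
            opt = k;
        }
    }
    // Prefixes shorter than mid cannot have a later optimum, longer ones an earlier one.
    solve_layer(x, prev, cur, lo, mid - 1, opt_lo, opt);
    solve_layer(x, prev, cur, mid + 1, hi, opt, opt_hi);
}

// Minimum total squared length of at most m segments covering all points.
ll min_cover_cost(std::vector<ll> x, int m) {
    std::sort(x.begin(), x.end());
    int n = (int) x.size();
    std::vector<ll> prev(n + 1, INF), cur(n + 1, INF);
    prev[0] = 0; // covering nothing costs nothing
    for (int layer = 1; layer <= m; layer++) {
        cur[0] = 0;
        solve_layer(x, prev, cur, 1, n, 0, n - 1);
        prev = cur;
    }
    return prev[n];
}
```

For the points 1, 2, 10, 11 and two segments, the best cover is [1, 2] and [10, 11] with a total cost of 2, which is what `min_cover_cost({1, 2, 10, 11}, 2)` returns.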
From d5c5fb5a62c2a5645d9473dda6bec8eb7430a39f Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 20 Apr 2022 05:56:29 +0300 Subject: [PATCH 051/173] fix knuth dp criterion --- content/russian/cs/layer-optimizations/knuth.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/russian/cs/layer-optimizations/knuth.md b/content/russian/cs/layer-optimizations/knuth.md index 5c49dbe6..8a184d2d 100644 --- a/content/russian/cs/layer-optimizations/knuth.md +++ b/content/russian/cs/layer-optimizations/knuth.md @@ -9,13 +9,13 @@ prerequisites: Предыдущий метод оптимизации опирался на тот факт, что $opt[i, j] \leq opt[i, j + 1]$. -Асимптотику можно ещё улучшить, заметив, что $opt$ монотонен ещё и по первому параметру: +Асимптотику можно ещё улучшить, заметив, что $opt$ монотонен также и по второму параметру: $$ -opt[i-1, j] \leq opt[i, j] \leq opt[i, j+1] +opt[i - 1, j] \leq opt[i, j] \leq opt[i, j + 1] $$ -В задаче про покрытие отрезками это выполняется примерно по той же причине: если нам нужно покрывать меньше точек, то новый оптимальный последний отрезок будет начинаться не позже старого. +В задаче про покрытие отрезками это выполняется примерно по той же причине: если нам доступно больше отрезков, то последний отрезок в оптимальном решении точно не будет длиннее, чем раньше. ### Алгоритм From ac8906113eee302e9ee6b681909a56d391cf5bb3 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 20 Apr 2022 06:15:41 +0300 Subject: [PATCH 052/173] mark drafts in toc --- themes/algorithmica/assets/style.sass | 5 +++++ themes/algorithmica/layouts/partials/sidebar.html | 4 ++-- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/themes/algorithmica/assets/style.sass b/themes/algorithmica/assets/style.sass index fe3ebaeb..0a42a2d6 100644 --- a/themes/algorithmica/assets/style.sass +++ b/themes/algorithmica/assets/style.sass @@ -157,6 +157,11 @@ body &::before content: counter(chapter-counter) "." counter(section-counter) ". 
" font-weight: bold + + .draft, .draft a + color: $dimmed + + #wrapper width: 100% diff --git a/themes/algorithmica/layouts/partials/sidebar.html b/themes/algorithmica/layouts/partials/sidebar.html index 2276957a..816887f5 100644 --- a/themes/algorithmica/layouts/partials/sidebar.html +++ b/themes/algorithmica/layouts/partials/sidebar.html @@ -24,13 +24,13 @@ {{ if isset .Params "part" }}
  • {{.Params.Part}}
  • {{ end }} -
  • {{ .Title }}
  • {{ if .IsSection }}
      {{ range .Pages }} -
    1. {{ .Title }}
    2. {{ end }}

From 16a9a52c12e777103d06cb52728aadc8fcb5c4ce Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Wed, 20 Apr 2022 07:22:48 +0300
Subject: [PATCH 053/173] inversions edits

---
 content/russian/cs/sequences/inversions.md | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/content/russian/cs/sequences/inversions.md b/content/russian/cs/sequences/inversions.md
index f18d1f4a..2fbec7d9 100644
--- a/content/russian/cs/sequences/inversions.md
+++ b/content/russian/cs/sequences/inversions.md
@@ -4,13 +4,18 @@ title: Number of inversions
 weight: 5
 authors:
 - Сергей Слотин
+draft: true
 ---

-Suppose we have some permutation $p$ (a sequence of numbers from $1$ to $n$ in which every number occurs exactly once). An *inversion* is a pair of indices $i$ and $j$ such that $i < j$ and $p_i > p_j$. The task is to find the number of inversions in a given permutation.
+**Definition.** An *inversion* in a permutation $p$ is a pair of indices $i$ and $j$ such that $i < j$ and $p_i > p_j$.

-## Naive algorithm
+For example:

-This problem is easily solved in $O(n^2)$ by simply iterating over all pairs of indices and checking each for an inversion:
+- the permutation $[1, 2, 3]$ has no inversions,
+- $[1, 3, 2]$ has one inversion ($3 \leftrightarrow 2$),
+- $[3, 2, 1]$ has three inversions ($3 \leftrightarrow 2$, $3 \leftrightarrow 1$, and $2 \leftrightarrow 1$).
+
+In this article, we look at how to find the number of inversions in a permutation. This problem is easily solved in $O(n^2)$ by simply iterating over all pairs of indices and checking each for an inversion:

 ```cpp
 int count_inversions(int *p, int n) {
@@ -23,6 +28,8 @@ int count_inversions(int *p, int n) {
 }
 ```

+Solving it faster is harder.
+
 ## With merge sort

 Surprisingly, this problem can be solved with merge sort by modifying it slightly.
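
The merge sort modification mentioned at the end of the patch above can be sketched as follows. This is one common way to do it and not necessarily the article's implementation. `merge_count` is a hypothetical helper name, and the top-level function takes a `std::vector<int>` instead of the pointer and length used in the quadratic version.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sorts a[l, r) using buf as scratch space and returns the number of inversions inside it.
long long merge_count(std::vector<int>& a, std::vector<int>& buf, std::size_t l, std::size_t r) {
    if (r - l < 2)
        return 0;
    std::size_t m = l + (r - l) / 2;
    long long inv = merge_count(a, buf, l, m) + merge_count(a, buf, m, r);
    std::size_t i = l, j = m, k = l;
    while (i < m || j < r) {
        if (j == r || (i < m && a[i] <= a[j])) {
            buf[k++] = a[i++];
        } else {
            // a[j] is smaller than every element still left in the left half
            inv += (long long) (m - i);
            buf[k++] = a[j++];
        }
    }
    for (k = l; k < r; k++)
        a[k] = buf[k];
    return inv;
}

// Counts pairs i < j with p[i] > p[j] in O(n log n); works on a copy of the input.
long long count_inversions(std::vector<int> p) {
    std::vector<int> buf(p.size());
    return merge_count(p, buf, 0, p.size());
}
```

On the three examples from the patch, `[1, 2, 3]`, `[1, 3, 2]`, and `[3, 2, 1]`, it returns 0, 1, and 3.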
From b402d342b998a1a13d17eea48781845321abcae4 Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Wed, 20 Apr 2022 07:23:07 +0300
Subject: [PATCH 054/173] quickselect edits

---
 content/russian/cs/sequences/quickselect.md | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/content/russian/cs/sequences/quickselect.md b/content/russian/cs/sequences/quickselect.md
index b1606bbd..7e83a267 100644
--- a/content/russian/cs/sequences/quickselect.md
+++ b/content/russian/cs/sequences/quickselect.md
@@ -1,12 +1,12 @@
 ---
-# TODO: implementation
 title: Order statistics
 weight: 4
+draft: true
 ---

 While at the [beginning of the previous chapter](/cs/interactive/binary-search) we were looking for the number of array elements less than $x$ (also known as the index of this element in the sorted array), now we are interested in the inverse problem: finding which element is the $k$-th in ascending order.

-If the array is already sorted, the problem is trivial - just take the $k$-th element. Otherwise we can sort it, but that requires $O(n \log n)$ operations, and we know that using only comparisons it cannot be done faster.
+If the array is already sorted, the problem is trivial: just take the $k$-th element. Otherwise we can sort it, but that requires $O(n \log n)$ operations, and we know that if we use only comparisons, it cannot be done faster.

 There is another approach: we can modify the quicksort algorithm.

@@ -26,4 +26,17 @@ weight: 4

 Considering that the size of the range shrinks roughly by a factor of 2 each time, that the sum $n + \frac{n}{2} + \frac{n}{4} + \ldots = 2 \cdot n$ is bounded, and waving our hands a little, we get that the algorithm runs in $O(n)$.

+
+
 In C++, this algorithm is already implemented and available as `nth_element`.
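
The patch above drops the `TODO` for an implementation, so here is a possible sketch of quickselect, under a few assumptions of my own: a random pivot, a three-way split built from two `std::partition` calls, and a fixed seed chosen only for reproducibility. `kth_smallest` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Returns the k-th smallest element (0-based) of a; requires k < a.size().
int kth_smallest(std::vector<int> a, std::size_t k) {
    std::mt19937 rng(12345); // arbitrary fixed seed
    std::size_t l = 0, r = a.size(); // the answer always stays inside a[l, r)
    while (true) {
        int pivot = a[l + rng() % (r - l)];
        // Rearrange a[l, r) into [ < pivot | == pivot | > pivot ].
        auto lt = std::partition(a.begin() + l, a.begin() + r,
                                 [&](int v) { return v < pivot; });
        auto gt = std::partition(lt, a.begin() + r,
                                 [&](int v) { return v == pivot; });
        std::size_t lo = lt - a.begin(), hi = gt - a.begin();
        if (k < lo)
            r = lo;       // the answer is among the smaller elements
        else if (k >= hi)
            l = hi;       // the answer is among the larger elements
        else
            return pivot; // k falls into the block equal to the pivot
    }
}
```

In production code, `std::nth_element(a.begin(), a.begin() + k, a.end())` followed by reading `a[k]` gives the same element without a hand-written loop.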
From c10ebb35240390d9bd7fd69769c8b65ed4f0cdfe Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Wed, 20 Apr 2022 07:23:21 +0300
Subject: [PATCH 055/173] sequence compression

---
 content/russian/cs/sequences/_index.md | 3 +-
 content/russian/cs/sequences/compression.md | 50 ++++++++++++++-------
 2 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/content/russian/cs/sequences/_index.md b/content/russian/cs/sequences/_index.md
index d02ed49b..6888831d 100644
--- a/content/russian/cs/sequences/_index.md
+++ b/content/russian/cs/sequences/_index.md
@@ -1,7 +1,6 @@
 ---
 title: Sequences
 weight: 4
-draft: true
 ---

-This chapter covers some algorithms on unsorted sequences.
+This chapter covers algorithms for unsorted sequences.
diff --git a/content/russian/cs/sequences/compression.md b/content/russian/cs/sequences/compression.md
index 332011b3..58686d5c 100644
--- a/content/russian/cs/sequences/compression.md
+++ b/content/russian/cs/sequences/compression.md
@@ -3,46 +3,64 @@ title: Coordinate compression
 authors:
 - Сергей Слотин
 weight: -1
-draft: true
+date: 2022-04-20
 ---

+It is often useful to map a sequence of numbers, or of some other objects, into a range of consecutive integers, for example, to use its elements as indices in an array or some other structure.

-## Coordinate compression
-This is a general idea that may turn out to be useful. Suppose there are $n$ numbers $a_1,\ldots,a_n$. We want to transform the $a_i$ so that equal ones stay equal, different ones stay different, but all of them are from 0 to $n-1$. To do this, we sort the numbers, remove the duplicates, and replace each $a_i$ with its index in the sorted array.
+This problem is equivalent to numbering the elements of a set, which can be done in $O(n)$ with a hash table:

+```c++
+vector<int> compress(vector<int> a) {
+    unordered_map<int, int> m;
-```
-int a[n], all[n];
-for (int i = 0; i < n; ++i) {
-    cin >> a[i];
-    all[i] = a[i];
+    for (int &x : a) {
+        // new elements get the next free number
+        if (!m.count(x))
+            m.emplace(x, (int) m.size());
+        x = m[x];
+    }
+
+    return a;
 }
-sort(all, all + n);
-m = unique(all, all + n) - all; // now m is the number of distinct coordinates
-for (int i = 0; i < n; ++i)
-    a[i] = lower_bound(all, all + m, x[i]) - all;
 ```

-```cpp
+The elements will be assigned numbers in the order of their first occurrence in the sequence. If we need to preserve the *order*, assigning smaller numbers to smaller elements, the problem becomes slightly harder, and it can be solved in several ways.
+
+One option is to sort the array and then pass over it twice with a hash table: the first time filling it, and the second time compressing the array itself:
+
+```c++
 vector<int> compress(vector<int> a) {
+    vector<int> b = a;
+    sort(b.begin(), b.end());
+
     unordered_map<int, int> m;
-    for (int x : a)
-        if (m.count(x))
+
+    for (int x : b)
+        if (!m.count(x))
             m[x] = m.size();
+
     for (int &x : a)
         x = m[x];
+
     return a;
 }
 ```

+Alternatively, we can remove the duplicates from the sorted array (in linear time) and then use it to find the index of each element of the original array by binary search:

-```cpp
+```c++
 vector<int> compress(vector<int> a) {
     vector<int> b = a;
+
     sort(b.begin(), b.end());
     b.erase(unique(b.begin(), b.end()), b.end());
+
     for (int &x : a)
         x = int(lower_bound(b.begin(), b.end(), x) - b.begin());
+
     return a;
 }
 ```
+
+Both approaches run in $O(n \log n)$. Use whichever you like better.
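
As a quick sanity check of the order-preserving variant, here is a self-contained version of the sort, deduplicate, and binary search approach from the patch above, together with a worked input. Only the headers and the `std::` qualification are added. The function name `compress` is kept from the patch, and the sample values are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Replaces every element with its rank among the distinct values (0-based),
// so that smaller elements receive smaller numbers.
std::vector<int> compress(std::vector<int> a) {
    std::vector<int> b = a;

    std::sort(b.begin(), b.end());
    b.erase(std::unique(b.begin(), b.end()), b.end());

    for (int& x : a)
        x = int(std::lower_bound(b.begin(), b.end(), x) - b.begin());

    return a;
}
```

For the input `{100, -5, 100, 7}`, the distinct sorted values are `-5, 7, 100`, so the result is `{2, 0, 2, 1}`: equal elements stay equal and the relative order of values is preserved.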
From ad0c2aa70cfb6e6d3622174e8cbd6fee8399bba7 Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Wed, 20 Apr 2022 07:28:06 +0300
Subject: [PATCH 056/173] quicksort edits

---
 content/russian/cs/sorting/quicksort.md | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/content/russian/cs/sorting/quicksort.md b/content/russian/cs/sorting/quicksort.md
index f3a6a5d6..e6494cd3 100644
--- a/content/russian/cs/sorting/quicksort.md
+++ b/content/russian/cs/sorting/quicksort.md
@@ -7,13 +7,18 @@ draft: true
 Quicksort works as follows: at each step we pick a pivot element, move all elements smaller than it to the left part and the rest to the right part, and then recursively descend into both parts.

 ```cpp
+// partition is a function that splits the elements
+// into those smaller than and those greater than or equal to a[index];
+// it returns the boundary of the split
+int partition(int l, int r, int p) {
+
+}
+
 void quicksort(int l, int r){
     if (l < r){
         int index = (l + r) / 2; /* index is the index of the pivot element;
                                     to begin with, we set it to the middle of the range */
-        index = divide(l, r, index); /* divide is a function that splits the elements
-                                        into those smaller than and those greater than or equal to a[index];
-                                        it returns the boundary of the split */
+        index = partition(l, r, index);
         quicksort(l, index);
         quicksort(index + 1, r);
     }
@@ -25,8 +30,6 @@ void quicksort(int l, int r){

 There are several ways out of this situation:

-2. If quicksort is taking too long, let's run any other $NlogN$ sort instead.
-
-3. Let's split the array into three parts instead of two (less, equal, greater).
-
-4. To get rid of the problem with the maximum/minimum in the middle, let's **pick a random element**.
+1. If quicksort is taking too long, let's run any other $N \log N$ sort instead.
+2. Let's split the array into three parts instead of two (less, equal, greater).
+3. To get rid of the problem with the maximum/minimum in the middle, let's **pick a random element**.
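
The `partition` stub in the patch above is left empty, so the following is only one possible way to complete the idea, not the article's code. It combines two of the listed remedies: a random pivot and a split into three parts. The in-place three-way split, the fixed seed, and the half-open range `[l, r)` are my own assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

std::mt19937 rng(2022); // arbitrary fixed seed

// Sorts a[l, r) in place.
void quicksort(std::vector<int>& a, int l, int r) {
    if (r - l < 2)
        return;
    int pivot = a[l + rng() % (r - l)]; // random pivot
    // Invariant: a[l, lt) < pivot, a[lt, i) == pivot, a[gt, r) > pivot.
    int lt = l, i = l, gt = r;
    while (i < gt) {
        if (a[i] < pivot)
            std::swap(a[i++], a[lt++]);
        else if (a[i] > pivot)
            std::swap(a[i], a[--gt]);
        else
            i++;
    }
    // Elements equal to the pivot are already in their final positions.
    quicksort(a, l, lt);
    quicksort(a, gt, r);
}
```

Because the block of elements equal to the pivot is excluded from both recursive calls, an array of identical values is handled in a single pass.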
From 78d207d2d08787ecfecfafb25dfe6adaf347a03c Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Wed, 20 Apr 2022 07:31:36 +0300
Subject: [PATCH 057/173] fix ru broken links

---
 content/russian/cs/algebra/matmul.md                     | 2 +-
 content/russian/cs/basic-structures/iterators.md         | 4 ++--
 content/russian/cs/matching/matching-problems.md         | 2 +-
 content/russian/cs/spanning-trees/kruskal.md             | 2 +-
 content/russian/cs/spanning-trees/safe-edge.md           | 2 +-
 content/russian/cs/string-searching/manacher.md          | 2 +-
 content/russian/cs/string-structures/palindromic-tree.md | 2 +-
 content/russian/cs/string-structures/suffix-array.md     | 4 ++--
 content/russian/cs/tree-structures/treap.md              | 2 +-
 9 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/content/russian/cs/algebra/matmul.md b/content/russian/cs/algebra/matmul.md
index bc5ca593..8a633bea 100644
--- a/content/russian/cs/algebra/matmul.md
+++ b/content/russian/cs/algebra/matmul.md
@@ -188,7 +188,7 @@ matrix binpow(matrix a, int p) {

 Эту технику можно применить и к другим динамикам, где нужно посчитать количество способов что-то сделать — иногда очень неочевидными способами.

-Например, можно решить такую задачу: найти количество строк длины $k \approx 10^{18}$, не содержащих данные маленькие запрещённые подстроки. Для этого нужно построить граф «легальных» переходов в [Ахо-Корасике](/cs/automata/aho-corasick), возвести его матрицу смежности в $k$-тую степень и просуммировать в нём первую строчку.
+Например, можно решить такую задачу: найти количество строк длины $k \approx 10^{18}$, не содержащих данные маленькие запрещённые подстроки. Для этого нужно построить граф «легальных» переходов в [Ахо-Корасике](/cs/string-structures/aho-corasick), возвести его матрицу смежности в $k$-тую степень и просуммировать в нём первую строчку.

 В некоторых изощрённых случаях в матричном умножении вместо умножения и сложения нужно использовать другие операции, которые ведут себя как умножение и сложение. Пример задачи: «найти путь от $s$ до $t$ с минимальным весом ребра, использующий ровно $k$ переходов»; здесь нужно возводить в $(k-1)$-ую степень матрицу весов графа, и вместо и сложения, и умножения использовать минимум из двух весов.

diff --git a/content/russian/cs/basic-structures/iterators.md b/content/russian/cs/basic-structures/iterators.md
index b2d8269f..c048e0b6 100644
--- a/content/russian/cs/basic-structures/iterators.md
+++ b/content/russian/cs/basic-structures/iterators.md
@@ -71,7 +71,7 @@ for (int x : c)

 ### Алгоритмы из STL

-Например, итераторы `std::vector` относятся к `random_access_iterator`, и если вызвать функцию `lower_bound` из стандартной библиотеки, то она произведет [бинарный поиск](../../ordered-search/binary-search) по элементам (предполагая, что они отсортированы в порядке неубывания):
+Например, итераторы `std::vector` относятся к `random_access_iterator`, и если вызвать функцию `lower_bound` из стандартной библиотеки, то она произведет [бинарный поиск](/cs/interactive/binary-search/) по элементам (предполагая, что они отсортированы в порядке неубывания):

 ```cpp
 vector a = {1, 2, 3, 5, 8, 13};
@@ -93,4 +93,4 @@ array a = {4, 2, 1, 3};
 cout << *min_element(a.begin(), a.end()) << endl;
 ```

-Подробнее про разные полезные алгоритмы STL можно прочитать в [ликбезе по C++](../../programming/cpp).
+
diff --git a/content/russian/cs/matching/matching-problems.md b/content/russian/cs/matching/matching-problems.md
index cedfe69d..cd14e54e 100644
--- a/content/russian/cs/matching/matching-problems.md
+++ b/content/russian/cs/matching/matching-problems.md
@@ -81,6 +81,6 @@ $$

 Пусть у вершин левой доли есть какие-то веса, и нам нужно набрать максимальное паросочетание минимального веса.

-Выясняется, что можно просто отсортировать вершины левой доли по весу и пытаться в таком порядке добавлять их в паросочетание стандартным алгоритмом Куна. Для доказательства этого факта читатель может прочитать про [жадный алгоритм Радо-Эдмондса](/cs/greedy/matroid), частным случаем которого является такая модификация алгоритма Куна.
+Выясняется, что можно просто отсортировать вершины левой доли по весу и пытаться в таком порядке добавлять их в паросочетание стандартным алгоритмом Куна. Для доказательства этого факта читатель может прочитать про [жадный алгоритм Радо-Эдмондса](/cs/combinatorial-optimization/matroid), частным случаем которого является такая модификация алгоритма Куна.

 Аналогичную задачу, но когда у *ребер* есть веса, проще всего решать сведением к нахождению [потока минимальной стоимости](/cs/flows/mincost-maxflow).
diff --git a/content/russian/cs/spanning-trees/kruskal.md b/content/russian/cs/spanning-trees/kruskal.md
index ddb9cabf..1f4c98a4 100644
--- a/content/russian/cs/spanning-trees/kruskal.md
+++ b/content/russian/cs/spanning-trees/kruskal.md
@@ -34,4 +34,4 @@ for (auto [a, b, w] : edges) {
 }
 ```

-Раз остовные деревья являются частным случаем [матроида](/cs/greedy/matroid), то алгоритм Краскала является частным случаем алгоритма Радо-Эдмондса.
+Раз остовные деревья являются частным случаем [матроида](/cs/combinatorial-optimization/matroid), то алгоритм Краскала является частным случаем алгоритма Радо-Эдмондса.
diff --git a/content/russian/cs/spanning-trees/safe-edge.md b/content/russian/cs/spanning-trees/safe-edge.md
index cc7138c9..19f97006 100644
--- a/content/russian/cs/spanning-trees/safe-edge.md
+++ b/content/russian/cs/spanning-trees/safe-edge.md
@@ -24,4 +24,4 @@ weight: 1
 - Если веса всех рёбер различны, то остов будет уникален.
 - Минимальный остов является также и остовом с минимальным произведением весов рёбер (замените веса всех рёбер на их логарифмы).
 - Минимальный остов является также и остовом с минимальным весом самого тяжелого ребра.
-- Остовные деревья — частный случай [матроидов](/cs/greedy/matroid).
+- Остовные деревья — частный случай [матроидов](/cs/combinatorial-optimization/matroid).
diff --git a/content/russian/cs/string-searching/manacher.md b/content/russian/cs/string-searching/manacher.md
index 8954b653..16d32ccb 100644
--- a/content/russian/cs/string-searching/manacher.md
+++ b/content/russian/cs/string-searching/manacher.md
@@ -32,7 +32,7 @@ vector pal_array(string s) {

 Тот же пример $s = aa\dots a$ показывает, что данная реализация работает за $O(n^2)$.

-Для оптимизации применим идею, знакомую из алгоритма [z-функции](string-searching): при инициализации $t_i$ будем пользоваться уже посчитанными $t$. А именно, будем поддерживать $(l, r)$ — интервал, соответствующий самому правому из найденных подпалиндромов. Тогда мы можем сказать, что часть наибольшего палиндрома с центром в $s_i$, которая лежит внутри $s_{l:r}$, имеет радиус хотя бы $\min(r-i, \; t_{l+r-i})$. Первая величина равна длине, дальше которой произошел бы выход за пределы $s_{l:r}$, а вторая — значению радиуса в позиции, зеркальной относительно центра палиндрома $s_{l:r}$.
+Для оптимизации применим идею, знакомую из алгоритма [z-функции](/cs/string-searching/z-function/): при инициализации $t_i$ будем пользоваться уже посчитанными $t$. А именно, будем поддерживать $(l, r)$ — интервал, соответствующий самому правому из найденных подпалиндромов. Тогда мы можем сказать, что часть наибольшего палиндрома с центром в $s_i$, которая лежит внутри $s_{l:r}$, имеет радиус хотя бы $\min(r-i, \; t_{l+r-i})$. Первая величина равна длине, дальше которой произошел бы выход за пределы $s_{l:r}$, а вторая — значению радиуса в позиции, зеркальной относительно центра палиндрома $s_{l:r}$.

 ```c++
diff --git a/content/russian/cs/string-structures/palindromic-tree.md b/content/russian/cs/string-structures/palindromic-tree.md
index 3d70c76b..9b57534a 100644
--- a/content/russian/cs/string-structures/palindromic-tree.md
+++ b/content/russian/cs/string-structures/palindromic-tree.md
@@ -19,7 +19,7 @@ weight: 3

 Будем поддерживать наибольший суффикс-палиндром. Когда мы будем дописывать очередной символ $c$, нужно найти наибольший суффикс этого палиндрома, который может быть дополнен символом $c$ — это и будет новый наидлиннейший суффикс-палиндром.

-Для этого поступим аналогично [алгоритму Ахо-Корасик](aho-corasick): будем поддерживать для каждого палиндрома суффиксную ссылку $l(v)$, ведущую из $v$ в её наибольший суффикс-палиндром. При добавлении очередного символа, будем подниматься по суффиксным ссылкам, пока не найдём вершину, из которой можно совершить нужный переход.
+Для этого поступим аналогично [алгоритму Ахо-Корасик](../aho-corasick): будем поддерживать для каждого палиндрома суффиксную ссылку $l(v)$, ведущую из $v$ в её наибольший суффикс-палиндром. При добавлении очередного символа, будем подниматься по суффиксным ссылкам, пока не найдём вершину, из которой можно совершить нужный переход.

 Если в подходящей вершине этого перехода не существовало, то нужно создать новую вершину, и для неё тоже понадобится своя суффиксная ссылка. Чтобы найти её, будем продолжать подниматься по суффиксным ссылкам предыдущего суффикс-палиндрома, пока не найдём второе такое место, которое мы можем дополнить символом $c$.

diff --git a/content/russian/cs/string-structures/suffix-array.md b/content/russian/cs/string-structures/suffix-array.md
index 25b90a3e..a7b90768 100644
--- a/content/russian/cs/string-structures/suffix-array.md
+++ b/content/russian/cs/string-structures/suffix-array.md
@@ -22,7 +22,7 @@ weight: 100

 ![Сортировка всех суффиксов строки «mississippi$»](../img/sa-sort.png)

-**Где это может быть полезно.** Пусть вы хотите основать ещё один поисковик, и чтобы получить финансирование, вам нужно сделать хоть что-то минимально работающее — хотя бы просто научиться искать по ключевому слову документы, включающие его, а также позиции их вхождения (в 90-е это был бы уже довольно сильный MVP). Простыми алгоритмами — [полиномиальными хешами](/cs/hashing), [z- и префикс-функцией](/cs/string-searching) и даже [Ахо-Корасиком](/cs/automata/aho-corasick) — это сделать быстро нельзя, потому что на каждый раз нужно проходиться по всем данным, а суффиксными структурами — можно.
+**Где это может быть полезно.** Пусть вы хотите основать ещё один поисковик, и чтобы получить финансирование, вам нужно сделать хоть что-то минимально работающее — хотя бы просто научиться искать по ключевому слову документы, включающие его, а также позиции их вхождения (в 90-е это был бы уже довольно сильный MVP). Простыми алгоритмами — [полиномиальными хешами](/cs/hashing), [z- и префикс-функцией](/cs/string-searching) и даже [Ахо-Корасиком](../aho-corasick) — это сделать быстро нельзя, потому что на каждый раз нужно проходиться по всем данным, а суффиксными структурами — можно.

 В случае с суффиксным массивом можно сделать следующее: сконкатенировать все строки-документы с каким-нибудь внеалфавитным разделителем (`$`), построить по ним суффиксный массив, а дальше для каждого запроса искать бинарным поиском первый суффикс в суффиксном массиве, который меньше искомого слова, а также последний, который меньше. Все суффиксы между этими двумя будут включать искомую строку как префикс.

@@ -132,7 +132,7 @@ vector suffix_array(vector &s) {

 Тогда есть мотивация посчитать массив `lcp$` в котором окажутся наибольшие общие префиксы соседних суффиксов, а после как-нибудь считать минимумы на отрезках в этом массиве (например, с помощью [разреженной таблицы](/cs/range-queries/sparse-table)).

-Осталось придумать способ быстро посчитать массив `lcp`. Можно воспользоваться идеей из построения суффиксного массива за $O(n \log^2 n)$: с помощью [хешей](hashing) и бинпоиска находить `lcp` для каждой пары соседей. Такой метод работает за $O(n \log n)$, но является не самым удобным и популярным.
+Осталось придумать способ быстро посчитать массив `lcp`. Можно воспользоваться идеей из построения суффиксного массива за $O(n \log^2 n)$: с помощью [хешей](/cs/hashing/polynomial/) и бинпоиска находить `lcp` для каждой пары соседей. Такой метод работает за $O(n \log n)$, но является не самым удобным и популярным.

 ### Алгоритм Касаи, Аримуры, Арикавы, Ли, Парка

diff --git a/content/russian/cs/tree-structures/treap.md b/content/russian/cs/tree-structures/treap.md
index 724ed15f..ad11c794 100644
--- a/content/russian/cs/tree-structures/treap.md
+++ b/content/russian/cs/tree-structures/treap.md
@@ -100,7 +100,7 @@ $$

 Примечательно, что ожидаемая глубина вершин зависит от их позиции: вершина из середины должна быть примерно в два раза глубже, чем крайняя.

-**Упражнение.** Выведите по аналогии с этим рассуждением асимптотику [quicksort](/cs/sorting/quicksort).
+**Упражнение.** Выведите по аналогии с этим рассуждением асимптотику quicksort.

 ## Реализация

From d184936628da9db13363466ce12f91f7c1af4660 Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Wed, 20 Apr 2022 07:39:32 +0300
Subject: [PATCH 058/173] fix hpc broken links

---
 content/english/hpc/algorithms/prefix.md      | 2 +-
 content/english/hpc/architecture/assembly.md  | 2 +-
 content/english/hpc/architecture/indirect.md  | 2 +-
 content/english/hpc/cpu-cache/paging.md       | 2 +-
 content/english/hpc/data-structures/b-tree.md | 4 ++--
 content/english/hpc/data-structures/s-tree.md | 2 +-
 content/english/hpc/pipelining/branchless.md  | 4 ++--
 content/english/hpc/pipelining/throughput.md  | 2 +-
 content/english/hpc/simd/shuffling.md         | 2 +-
 9 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/content/english/hpc/algorithms/prefix.md b/content/english/hpc/algorithms/prefix.md
index 5e31570d..f07daaf3 100644
--- a/content/english/hpc/algorithms/prefix.md
+++ b/content/english/hpc/algorithms/prefix.md
@@ -61,7 +61,7 @@ for (int l = 0; l < logn; l++)

 We can prove that this algorithm works by induction: if on $k$-th iteration every element $a_i$ is equal to the sum of the $(i - 2^k, i]$ segment of the original array, then after adding $a_{i - 2^k}$ to it, it will be equal to the sum of $(i - 2^{k+1}, i]$. After $O(\log n)$ iterations, the array will turn into its prefix sum.

-To implement it in SIMD, we could use [permutations](/hpc/simd/shuffles) to place $i$-th element against $(i-2^k)$-th, but they are too slow. Instead, we will use the `sll` ("shift lanes left") instruction that does exactly that and also replaces the unmatched elements with zeros:
+To implement it in SIMD, we could use [permutations](/hpc/simd/shuffling) to place $i$-th element against $(i-2^k)$-th, but they are too slow. Instead, we will use the `sll` ("shift lanes left") instruction that does exactly that and also replaces the unmatched elements with zeros:

 ```c++
 typedef __m128i v4i;
diff --git a/content/english/hpc/architecture/assembly.md b/content/english/hpc/architecture/assembly.md
index 013d2987..5c981547 100644
--- a/content/english/hpc/architecture/assembly.md
+++ b/content/english/hpc/architecture/assembly.md
@@ -57,7 +57,7 @@ Most instructions write their result into the first operand, which can also be i

 There are also 32-, 16-bit and 8-bit registers that have similar names (`rax` → `eax` → `ax` → `al`). They are not fully separate but *aliased*: the first 32 bits of `rax` are `eax`, the first 16 bits of `eax` are `ax`, and so on. This is made to save die space while maintaining compatibility, and it is also the reason why basic type casts in compiled programming languages are usually free.

-These are just the *general-purpose* registers that you can, with [some exceptions](../functions), use however you like in most instructions. There is also a separate set of registers for [floating-point arithmetic](/hpc/arithmetic/float), a bunch of very wide registers used in [vector extensions](/hpc/simd), and a few special ones that are needed for [control flow](../jumps), but we'll get there in time.
+These are just the *general-purpose* registers that you can, with [some exceptions](../functions), use however you like in most instructions. There is also a separate set of registers for [floating-point arithmetic](/hpc/arithmetic/float), a bunch of very wide registers used in [vector extensions](/hpc/simd), and a few special ones that are needed for [control flow](../loops), but we'll get there in time.

 **Constants** are just integer or floating-point values: `42`, `0x2a`, `3.14`, `6.02e23`. They are more commonly called *immediate values* because they are embedded right into the machine code. Because it may considerably increase the complexity of the instruction encoding, some instructions don't support immediate values or allow just a fixed subset of them. In some cases, you have to load a constant value into a register and then use it instead of an immediate value.

diff --git a/content/english/hpc/architecture/indirect.md b/content/english/hpc/architecture/indirect.md
index ce6e86b8..487b81e3 100644
--- a/content/english/hpc/architecture/indirect.md
+++ b/content/english/hpc/architecture/indirect.md
@@ -106,7 +106,7 @@ During a virtual method call, that offset field is fetched from the instance of

 Of course, this adds some overhead:

-- You may need to spend another 15 cycles or so for the same pipeline flushing reasons as for [branch misprediction](../pipelining).
+- You may need to spend another 15 cycles or so for the same pipeline flushing reasons as for [branch misprediction](/hpc/pipelining).
 - The compiler most likely won't be able to inline the function call itself.
 - Class size increases by a couple of bytes or so (this is implementation-specific).
 - The binary size itself increases a little bit.
diff --git a/content/english/hpc/cpu-cache/paging.md b/content/english/hpc/cpu-cache/paging.md
index 684fcd65..3e6cfd8f 100644
--- a/content/english/hpc/cpu-cache/paging.md
+++ b/content/english/hpc/cpu-cache/paging.md
@@ -81,7 +81,7 @@ Enabling huge pages also improves [latency](../latency) by up to 10-15% for arra

 In general, enabling huge pages is a good idea when you have any sort of sparse reads, as they usually slightly improve and ([almost](../aos-soa)) never hurt performance.

-That said, you shouldn't rely on huge pages if possible, as they aren't always available due to either hardware or computing environment restrictions. There are [many](../cache-lines) [other](../hw-prefetching) [reasons](../aos-soa) why grouping data accesses spatially may be beneficial, which automatically solves the paging problem.
+That said, you shouldn't rely on huge pages if possible, as they aren't always available due to either hardware or computing environment restrictions. There are [many](../cache-lines) [other](../prefetching) [reasons](../aos-soa) why grouping data accesses spatially may be beneficial, which automatically solves the paging problem.

-The usual disclaimer: the CPU is a [Zen 2](https://www.7-cpu.com/cpu/Zen2.html), the RAM is a [DDR4-2666](http://localhost:1313/hpc/cpu-cache/), and the compiler we will be using by default is Clang 10. The performance on your machine may be different, so I highly encourage to [go and test it](https://godbolt.org/z/14rd5Pnve) for yourself.
+The usual disclaimer: the CPU is a [Zen 2](https://www.7-cpu.com/cpu/Zen2.html), the RAM is a [DDR4-2666](/hpc/cpu-cache/), and the compiler we will be using by default is Clang 10. The performance on your machine may be different, so I highly encourage to [go and test it](https://godbolt.org/z/14rd5Pnve) for yourself.
      {{.Title}}
@@ -20,7 +25,9 @@
-
+
From 5bb09004d6024361734f78e4518b2fb829a7b103 Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Thu, 21 Apr 2022 16:02:16 +0300
Subject: [PATCH 064/173] search string translations

---
 themes/algorithmica/i18n/en.toml | 9 +++++++++
 themes/algorithmica/i18n/ru.toml | 9 +++++++++
 2 files changed, 18 insertions(+)

diff --git a/themes/algorithmica/i18n/en.toml b/themes/algorithmica/i18n/en.toml
index 9aae4777..6fa12340 100644
--- a/themes/algorithmica/i18n/en.toml
+++ b/themes/algorithmica/i18n/en.toml
@@ -15,6 +15,15 @@ other = "updated"
 [sections]
 other = "sections"

+[search]
+other = "Search this book…"
+
+[searchCountPrefix]
+other = "Found"
+
+[searchCountSuffix]
+other = "pages"
+
 [prerequisites]
 other = "prerequisites"

diff --git a/themes/algorithmica/i18n/ru.toml b/themes/algorithmica/i18n/ru.toml
index 5e96226c..08d47b66 100644
--- a/themes/algorithmica/i18n/ru.toml
+++ b/themes/algorithmica/i18n/ru.toml
@@ -21,6 +21,15 @@ other = "обновлено"
 [sections]
 other = "статьи раздела"

+[search]
+other = "Поиск по сайту…"
+
+[searchCountPrefix]
+other = "Найдено"
+
+[searchCountSuffix]
+other = "страниц"
+
 [prerequisites]
 other = "пререквизиты"

From 641a7d6dd401360a778594035d1ddc62ee55d21a Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Thu, 21 Apr 2022 16:02:32 +0300
Subject: [PATCH 065/173] add lunr

---
 themes/algorithmica/static/scripts/lunr.multi.min.js           | 1 +
 themes/algorithmica/static/scripts/lunr.ru.min.js              | 1 +
 themes/algorithmica/static/scripts/lunr.stemmer.support.min.js | 1 +
 3 files changed, 3 insertions(+)
 create mode 100644 themes/algorithmica/static/scripts/lunr.multi.min.js
 create mode 100644 themes/algorithmica/static/scripts/lunr.ru.min.js
 create mode 100644 themes/algorithmica/static/scripts/lunr.stemmer.support.min.js

diff --git a/themes/algorithmica/static/scripts/lunr.multi.min.js b/themes/algorithmica/static/scripts/lunr.multi.min.js
new file mode 100644
index 00000000..6f417304
--- /dev/null
+++ b/themes/algorithmica/static/scripts/lunr.multi.min.js
@@ -0,0 +1 @@
[minified vendor code for lunr.multi.min.js, lunr.ru.min.js, and lunr.stemmer.support.min.js omitted]
\ No newline at end of file
From 4ffb00832e101ca478e2f973cef4afdff72b82aa Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Thu, 21 Apr 2022 16:02:50 +0300
Subject: [PATCH 066/173] build search index

---
 themes/algorithmica/layouts/_default/list.searchindex.json | 5 +++++
 1 file changed, 5 insertions(+)
 create mode 100644 themes/algorithmica/layouts/_default/list.searchindex.json

diff --git a/themes/algorithmica/layouts/_default/list.searchindex.json b/themes/algorithmica/layouts/_default/list.searchindex.json
new file mode 100644
index 00000000..6310c263
--- /dev/null
+++ b/themes/algorithmica/layouts/_default/list.searchindex.json
@@ -0,0 +1,5 @@
+{{- $.Scratch.Add "searchindex" slice -}}
+{{- range $index, $element := .Site.Pages -}}
+  {{- $.Scratch.Add "searchindex" (dict "id" $index "title" $element.Title "path" $element.RelPermalink "content" $element.Plain) -}}
+{{- end -}}
+{{- $.Scratch.Get "searchindex" | jsonify -}}
From c387bd73ba6a942b8b922b07ea019eafbb4672d6 Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Thu, 21 Apr 2022 16:03:16 +0300
Subject: [PATCH 067/173] implement search

---
 config.yaml                                   |  9 ++
 .../algorithmica/layouts/_default/baseof.html |  1 +
 .../algorithmica/layouts/partials/head.html   | 88 ++++++++++++++++++-
 .../algorithmica/layouts/partials/search.html |  6 ++
 4 files changed, 103 insertions(+), 1 deletion(-)
 create mode 100644 themes/algorithmica/layouts/partials/search.html

diff --git a/config.yaml b/config.yaml
index 7e4ca1b7..8fb26a1c 100644
--- a/config.yaml
+++ b/config.yaml
@@ -8,6 +8,15 @@ outputFormats:
     baseName: index
     mediaType: text/html
     isHTML: true
+  SearchIndex:
+    mediaType: "application/json"
+    baseName: "searchindex"
+    isPlainText: true
+    notAlternative: true
+outputs:
+  home:
+    - HTML
+    - SearchIndex
 markup:
   goldmark:
     footnote: false # katex conflict
diff --git a/themes/algorithmica/layouts/_default/baseof.html b/themes/algorithmica/layouts/_default/baseof.html
index f9056521..dbe71ede 100644
--- a/themes/algorithmica/layouts/_default/baseof.html
+++ b/themes/algorithmica/layouts/_default/baseof.html
@@ -6,6 +6,7 @@
       {{- partial "buttons.html" . -}}
+      {{ partial "search.html" . }}
       {{- partial "header.html" . -}}
       {{- block "main" . }}{{- end }}
diff --git a/themes/algorithmica/layouts/partials/head.html b/themes/algorithmica/layouts/partials/head.html
index f87a8873..2f4c3c46 100644
--- a/themes/algorithmica/layouts/partials/head.html
+++ b/themes/algorithmica/layouts/partials/head.html
@@ -10,6 +10,11 @@
+
+
+
+
+
 {{ $dark := resources.Get "dark.sass" | toCSS | minify | fingerprint }}
@@ -18,22 +23,100 @@
     console.log("Toggling sidebar visibility")
     var sidebar = document.getElementById('sidebar')
     var wrapper = document.getElementById('wrapper')
-    if (sidebar.classList.contains('sidebar-toggled') || window.getComputedStyle(sidebar).display == 'block') {
+    if (sidebar.classList.contains('sidebar-toggled') || window.getComputedStyle(sidebar).display == 'block') {
       sidebar.classList.toggle('sidebar-hidden')
       wrapper.classList.toggle('sidebar-hidden')
     }
     sidebar.classList.add('sidebar-toggled')
     wrapper.classList.add('sidebar-toggled')
   }
+
   function switchTheme(theme) {
     console.log("Changing theme:", theme)
     document.getElementById('theme').href = (theme == 'dark' ? "{{ $dark.RelPermalink }}" : "")
     document.getElementById('syntax-theme').href = (theme == 'dark' ? '/syntax-dark.css' : '/syntax.css')
     localStorage.setItem('theme', theme)
   }
+
+  async function toggleSearch() {
+    console.log("Toggling search")
+
+    var searchDiv = document.getElementById('search')
+    if (window.getComputedStyle(searchDiv).display == 'none') {
+      searchDiv.style.display = 'block'
+      window.scrollTo({ top: 0 });
+    } else {
+      searchDiv.style.display = 'none'
+    }
+
+    if (!index) {
+      console.log("Fetching index")
+      const response = await fetch('/searchindex.json')
+      const pages = await response.json()
+      index = lunr(function() {
+        this.use(lunr.multiLanguage('en', 'ru'))
+        this.field('title', {
+          boost: 5
+        })
+        this.field('content', {
+          boost: 1
+        })
+        pages.forEach(function(doc) {
+          this.add(doc)
+          articles.push(doc)
+        }, this)
+      })
+      console.log("Ready to search")
+    }
+  }
+
+  var articles = []
+  var index = undefined
+
+  function search() {
+    var query = document.getElementById('search-bar').value
+    var resultsDiv = document.getElementById('search-results')
+    var countDiv = document.getElementById('search-count')
+
+    if (query == '') {
+      resultsDiv.innerHTML = ''
+      countDiv.innerHTML = ''
+      return
+    }
+
+    var results = index.search(query)
+
+    countDiv.innerHTML = '{{ T "searchCountPrefix" }} ' + results.length + ' {{ T "searchCountSuffix" }}'
+
+    let resultList = ''
+
+    for (const n in results) {
+      const item = articles[results[n].ref]
+      resultList += '' + item.title + ''
+      const text = item.content
+
+      const contextLimit = 80
+
+      if (text.includes(query)) {
+        const start = text.indexOf(query)
+        if (start > contextLimit)
+          resultList += '…'
+        resultList += text.substring(start - contextLimit, start) +
+          '' + query + '' + text.substring(start + query.length, start + query.length + contextLimit)
+
+      } else {
+        resultList += text.substring(0, contextLimit * 2)
+      }
+      resultList += '…'
+    }
+    resultsDiv.innerHTML = resultList
+  }
+
   if (localStorage.getItem('theme') == 'dark') {
     switchTheme('dark')
   }
+
   window.addEventListener('load', function() {
     var el = document.getElementById("active-element")
     //console.log(el)
@@ -46,6 +129,7 @@
       toggleSidebar()
     }*/
   })
+
   window.addEventListener('scroll', function() {
     var menu = document.getElementById('menu')
     if (window.scrollY < 120) {
@@ -56,8 +140,10 @@
       menu.classList.add('scrolled')
     }
   })
+
   window.addEventListener('keydown', function(e) {
     if (e.altKey) { return }
+    if (document.activeElement.tagName == 'INPUT') { return }
     if (e.key == 'ArrowLeft') {
       document.getElementById('prev-article').click()
     } else if (e.key == 'ArrowRight') {
diff --git a/themes/algorithmica/layouts/partials/search.html b/themes/algorithmica/layouts/partials/search.html
new file mode 100644
index 00000000..ee853dfa
--- /dev/null
+++ b/themes/algorithmica/layouts/partials/search.html
@@ -0,0 +1,6 @@
+
From 849e9d1e652b60e4c7bdc8e7d35ba37ed5741ffc Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Thu, 21 Apr 2022 16:03:21 +0300
Subject: [PATCH 068/173] search styling

---
 themes/algorithmica/assets/style.sass | 29 ++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/themes/algorithmica/assets/style.sass b/themes/algorithmica/assets/style.sass
index 0a42a2d6..a6835c1e 100644
--- a/themes/algorithmica/assets/style.sass
+++ b/themes/algorithmica/assets/style.sass
@@ -222,7 +222,34 @@ menu
   .title
     opacity: 1
     transition: opacity 0.1s
-
+
+#search
+  display: none
+  font-family: $font-interface
+
+  input
+    width: 100%
+    padding: 6px
+
+    color: $font-color
+
+    background: $code-background
+    border: $code-border
+
+  #search-count
+    margin-top: 8px
+    color: $dimmed
+
+  #search-results
+    margin-top: 6px
+    border-bottom: $borders
+
+    li
+      list-style: none
+      margin: 12px 6px
+
+    p
+      margin-top: 0

 /*
 .github
From 882df601131e76b86687c2c843b58252c5faec46 Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Thu, 21 Apr 2022 16:22:12 +0300
Subject: [PATCH 069/173] update readme

---
 README.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 171f5406..959dc025 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,10 @@
 # Algorithmica v3

-Algorithmica is a free and open web book about Computer Science.
+Algorithmica is an open-access web book dedicated to the art and science of computing.

-If you are concerned with editing, please read the [contributing guide](https://ru.algorithmica.org/contributing/) (in Russian).
+You can contribute via [Prose](https://prose.io/) by clicking on the pencil icon on the top right on any page or by editing its source directly on GitHub. We use a slightly different Markdown dialect, so if you are not sure that the change is correct (e. g. editing an intricate LaTeX formula), you can install [Hugo](https://gohugo.io/) and build the site locally — or just create a pull request, and a preview link will be automatically generated for you.
+
+If you happen to speak Russian, please also read the [contributing guidelines](https://ru.algorithmica.org/contributing/).

 ---

@@ -16,11 +18,11 @@ Key technical changes from the [previous version](https://github.com/algorithmic
 * Rich metadata support (language, sections, TOCs, authors...)
 * Automated global table of contents
 * Theming support
+* Search support (Lunr)

 Short-term todo list:

-* Search with lunr
-* Themes (especially a better dark theme)
-* Minor style adjustments for mobile and print versions
+* Style adjustments for mobile and print versions
 * A pdf version of the whole website
+* Meta-information support (for Google Scholar and social media)
 * [Sticky table of contents](https://css-tricks.com/table-of-contents-with-intersectionobserver/)
From 75443b0d15f22a3d5f765621d5919ee1aaf2e9a6 Mon Sep 17 00:00:00 2001
From: Sergey Slotin
Date: Mon, 25 Apr 2022 17:13:20 +0300
Subject: [PATCH 070/173] consistent spelling

---
 content/russian/cs/sequences/compression.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/russian/cs/sequences/compression.md b/content/russian/cs/sequences/compression.md
index 58686d5c..5b469fec 100644
--- a/content/russian/cs/sequences/compression.md
+++ b/content/russian/cs/sequences/compression.md
@@ -8,7 +8,7 @@ date: 2022-04-20

 Часто бывает полезно преобразовать последовательность чисел либо каких-то других объектов в промежуток последовательных целых чисел — например, чтобы использовать её элементы как индексы в массиве либо какой-нибудь другой структуре.
-Эта задача эквивалентна нумерации элементов множества, что можно сделать за $O(n)$ через хэш-таблицу: +Эта задача эквивалентна нумерации элементов множества, что можно сделать за $O(n)$ через хеш-таблицу: ```c++ vector compress(vector a) { From aeef2db22cf8692463b39fbdc8f321c080f4356a Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 26 Apr 2022 13:37:36 +0300 Subject: [PATCH 071/173] fix integer overflow issue --- content/russian/cs/modular/reciprocal.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/russian/cs/modular/reciprocal.md b/content/russian/cs/modular/reciprocal.md index 5d0e34e9..7b966de3 100644 --- a/content/russian/cs/modular/reciprocal.md +++ b/content/russian/cs/modular/reciprocal.md @@ -99,7 +99,7 @@ $$ ax + my = 1 \iff ax \equiv 1 \iff x \equiv a^{-1} \pmod m $$ int inv(int a, int m) { if (a == 1) return 1; - return (1 - inv(m % a, a) * m) / a + m; + return (1 - 1ll * inv(m % a, a) * m) / a + m; } ``` From 238e3987c9f1c6d6e2716e8bb1b8ce4a9a2cdfee Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 27 Apr 2022 00:01:51 +0300 Subject: [PATCH 072/173] number theory intro --- content/english/hpc/number-theory/_index.md | 32 +++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-) diff --git a/content/english/hpc/number-theory/_index.md b/content/english/hpc/number-theory/_index.md index f4936581..d532bcfd 100644 --- a/content/english/hpc/number-theory/_index.md +++ b/content/english/hpc/number-theory/_index.md @@ -4,10 +4,38 @@ weight: 7 draft: true --- -In 1940, British mathematician Godfrey Harold Hardy published a famous essay titled [A Mathematician's Apology](https://en.wikipedia.org/wiki/A_Mathematician%27s_Apology) where he discusses the notion that mathematics should be pursued for its own sake rather than for the sake of its applications. As a 62-year-old, he saw the devastation caused by first world war, and was amidst the second one. +In 1940, a British mathematician [G. H. 
Hardy](https://en.wikipedia.org/wiki/G._H._Hardy) published a famous essay titled "[A Mathematician's Apology](https://en.wikipedia.org/wiki/A_Mathematician%27s_Apology)" discussing the notion that mathematics should be pursued for its own sake rather than for the sake of its applications. -A scientist faces a moral dilemma because some of its inventions may do more harm than good. One can find calm in pursuing useless math. Hardy himself specialized in number theory, and he was content about it not having any applications: "No one has yet discovered any warlike purpose to be served by the theory of numbers or relativity, and it seems unlikely that anyone will do so for many years." +I personally don't agree — and I wrote this book partially to show that there are way too few people working on practical algorithm design instead of theoretical computer science — but I understand where Hardy is coming from. Being 62 years old, he witnessed the devastation caused by the First and the ongoing Second World War that was greatly amplified by the weaponization of science. + +As a number theorist, Hardy finds calm working in a "useless" field and not having to face any moral dilemmas, writing: + +> No one has yet discovered any warlike purpose to be served by the theory of numbers or relativity, and it seems unlikely that anyone will do so for many years. + +Ironically, this statement was proved very wrong just 5 years later with the development of the atomic bomb, which would not have been possible without the [understanding](https://en.wikipedia.org/wiki/Einstein%E2%80%93Szil%C3%A1rd_letter) of relativity, and the inception of computer-era cryptography, which extensively builds on number theory. 
+ + From cf6e133d384586f41c59c859fcf1130afd57ef28 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 27 Apr 2022 17:30:50 +0300 Subject: [PATCH 073/173] ignoreIndexing conflicted with drafts --- themes/algorithmica/layouts/partials/sidebar.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/themes/algorithmica/layouts/partials/sidebar.html b/themes/algorithmica/layouts/partials/sidebar.html index 816887f5..652a1f1b 100644 --- a/themes/algorithmica/layouts/partials/sidebar.html +++ b/themes/algorithmica/layouts/partials/sidebar.html @@ -24,7 +24,7 @@ {{ if isset .Params "part" }}
    5. {{.Params.Part}}
    6. {{ end }} -
    7. {{ .Title }}
    8. {{ if .IsSection }} From 47df9a54170f32812ac3043aace1ef7f2df2027a Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 27 Apr 2022 17:42:27 +0300 Subject: [PATCH 074/173] modular arithmetic intro --- .../english/hpc/number-theory/img/clock.gif | Bin 0 -> 2331 bytes content/english/hpc/number-theory/inverse.md | 137 +++++++++++++++--- 2 files changed, 116 insertions(+), 21 deletions(-) create mode 100644 content/english/hpc/number-theory/img/clock.gif diff --git a/content/english/hpc/number-theory/img/clock.gif b/content/english/hpc/number-theory/img/clock.gif new file mode 100644 index 0000000000000000000000000000000000000000..0d0c65556eafa788280115c63496ef6024ecce91 GIT binary patch literal 2331 zcmV+$3FP)iNk%w1Vc7uL0K@GD4WmdGa(nH9|*Us54 z-qPRF-rs=0Bk0WH0OjT-*zn1e?CI+7BKPvpKmrIR0Sq8OKtNZkc=+_~BP1jM1BN3E zI5@K)fU$G)U@0_FuZKfu0vyoD`*aAIyFVqLK|81+<`Sk$ObyQdJS8L ztyr}<%AWYh)JIT`W83DPBDciH13dh>+B?Q=l)fVdJOsSsqbI^(3?F_eg1|(pjUQ45 z1-Xml$t4Vk+`1#?M7y1}e1;+T1Od!5`U-I1fV7UGMG>D$jfFQ1$0i0iNE-X40-_6& zD~CO!>ebD=?^5yo;^SuFCu{VD^Dl?)!Ek$1Z(;gnfT0tKbNLP)dY08La2vgOazJ)3 z?4yTQHJ&`?z69l?-WK`k=iYSv#lYVGeO(l2plS&!=pKXX9r&1Z(O5{(EB7*yEC`gI;oB&@f7~a5QjV-8%T8uK+;}09R7x?WHtuA}NsiR?;-z?2X7WduV@4s1LBgP!rVaxPWI#?# zikXrYzorebS z00Sft@IV42*vS)?T!yERXx$|?kr{xFDL?{%Qux)NDiOMwXb-dsKmZdF6w{@LiV%_l zrDlYyAd&(gz;VHn!0HI1l3Hy42{gs3Ym=97%WMJ#^h&~U7LCAw1J_DRVIs}Gg(?XN z#>MgtPnlNvv11^>BPz>bjFIWGnFmO5rKR~RP0no%izuh*x ztqu%IE`7Lr+3T4jjNJZO=2C`@8n3 zCw1_`XB*K2jP)4%yF)SODcR7+87kK9vJEE4>FA#KPH-Idna_WN2o3y5kiG#12LTOO zU_KH!DGH7Ne5Covkc5Y{2DEAr2=s#qr1q)ot;>NqdDhebH;D@7tpE`mNjn7Sz!I3G zd_@W&$M{!6rASOy%9FvY;s*elRL_A7u-DKYU_hp&Y6hO-7p5GbLyag1*RSPKXg>x zAA1(aLAs%Iwlky)AE!vpF|v`?upA^wm$^hv5{Qz#Bbc2%1@dHv;h_G z1x~6x1e-2|rzwF(Api=FWhW!&k@8kROf|p(IHX#*Ko!t;>NB3pGYJDsQX9q1rfUE= z**uzww|AERK!s*tKs}4~km_kPFC4|2h3>H!5fp$n(4<rGff{KT#$)fed6Ii9&$FoN-i0F~n3u5u>Lz7`KK_a81M51`6zR zu_Y2wabC4*Rg;<#FP2fF3GM1Qb4bi>-GwG*sE8Z$#5WnxG_JEVlqgx*)4|pCb!%l3 zVUgO$!+tKY{aY+w%lXjJLDr#_4Gv8=yV;S}=(2Vt!cZx)g#d7trZIh)GJEG(066uj 
zTa#cGNO%T!UMaJ;MS??Ucm7aa%_J%*raC39f=*7 zfZ62OXtky(9MS#Wmpe=gH!X=6SKhIw9}QNPB<)S3=AEboOb@}PWz~vH^^YESo<=o}onvC_O%sbGWUvR=!e~F(*n@<&M^(J+ zUq0ezKl}D#2@S$>PkY=dHMhD8o$Yle+m_Box4h>~?|R$&-uTY9zW2@Ve$Orh06Qq$ BG)w>h literal 0 HcmV?d00001 diff --git a/content/english/hpc/number-theory/inverse.md b/content/english/hpc/number-theory/inverse.md index aec428fe..beb56611 100644 --- a/content/english/hpc/number-theory/inverse.md +++ b/content/english/hpc/number-theory/inverse.md @@ -3,39 +3,79 @@ title: Modular Inverse weight: 1 --- -```c++ -mint inv() const { - uint t = x; - uint res = 1; - while (t != 1) { - uint z = mod / t; - res = (ull) res * (mod - z) % mod; - t = mod - t * z; - } - return res; -} -``` + + +Computers usually store time as the number of seconds that have passed since the 1st of January, 1970 — the start of the "Unix era" — and use these timestamps in all computations that have to do with time. + +We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. For example, at the moment of writing, approximately 63882260594 seconds have passed since 0 AD. + +But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now and it's time to go to dinner, or that it's Thursday and so Subway's sub of the day is an Italian BMT. Instead of the whole timestamp, we use its *remainer* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones. + +### Modular Arithmetic + +Two integers $a$ and $b$ are said to be *congruent* modulo $m$ if $m$ divides their difference: + +$$ +m \mid (a - b) \; \Longleftrightarrow \; a \equiv b \pmod m +$$ + +Congruence modulo $m$ is an equivalence relation, which splits all integers into equivalence classes, called *residues*. 
Each residue class modulo $m$ may be represented by any one of its members — although we commonly use the smallest nonnegative integer of that class (equal to the remainder $x \bmod m$ for all nonnegative $x$). + + + +*Modular arithmetic* studies these sets of residues, which are fundamental for number theory. -Modular arithmetic studies the way these sets of remainders behave, and it has fundamental applications in number theory, cryptography and data compression. +**Problem.** Today is Thursday. What day of the week it will be exactly in a year? +If we enumerate each day of the week starting with Monday from $0$ to $6$ inclusive, Thursday gets number $3$. To find out what day it is going to be in a year from now, we need to add $365$ to it and then reduce modulo $7$. Conveniently, $365 \bmod 7 = 1$, so we know that it will be Friday unless it is a leap year (in which case it will be Saturday). -Consider the following problem: our "week" now consists of $m$ days, and we cycle through it with a steps of $a > 0$. How many distinct days there will be? +**Problem.** Our "week" now consists of $m$ days, and our year consists of $a$ days (no leap years). How many distinct days of the week there will be among one, two, three and so on whole years from now? -Let's assume that the first day is always Monday. At some point the sequence of day is going to cycle. The days will be representable as $k a \mod m$, so we need to find the first $k$ such as $k a$ is divisible by $m$. In the case of $m=7$, $m$ is prime, so the cycle length will be 7 exactly for any $a$. +For simplicity, assume that today is Monday, so that the initial day number $d_0$ is zero and after each year, it changes to + +$$ +d_{k + 1} = (d_k + a) \bmod m +$$ + +After $k$ years, it will be + +$$ +d_k = k \cdot a \bmod m +$$ -Now, if $m$ is not prime, but it is still coprime with $a$. For $ka$ to be divisible by $m$, $k$ needs to be divisible by $m$. In general, the answer is $\frac{m}{gcd(a, m)}$. 
For example, if the week is 10 days long, if the starting number is even, then it will cycle through all even numbers, and if the number is 5, then it will only cycle between 0 and 5. Otherwise it will go through all 10 remainders. +Since there are only $m$ days in a week, at some point it will be Monday again, and the sequence of day numbers is going to cycle. The number of distinct days is the length of this cycle, so we need to find the smallest $k$ such that + +$$ +k \cdot a \equiv 0 \pmod m +$$ + +First of all, if $a \equiv 0$, it will be ethernal Monday. We now assume the non-trivial case of $a \not \equiv 0$. + +For a seven-day week, $m = 7$ is prime. There is no $k$ smaller than $m$ such that $k \cdot a$ is divisible by $m$ because $m$ can not be decomposed in such a product by the definition of primality. So, if $m$ is prime, we will cycle through all of $m$ week days. + +If $m$ is not prime, but $a$ is *coprime* with it (that is, $a$ and $m$ do not have common divisors), then the answer is still $m$ for the same reason: the divisors of $a$ do not help in zeroing out the product any faster. + +If $a$ and $m$ share some divisors, then it is only possible to get residues that are also divisible by them. For example, if the week is $m = 10$ days long, and the year has $a = 42$ or any other even number of days, then we will cycle through all even day numbers, and if the number of days is a multiple of $5$, then we will only oscillate between $0$ and $5$. Otherwise, we will go through all the $10$ remainders. + +Therefore, in general, the answer is $\frac{m}{\gcd(a, m)}$, where $\gcd(a, m)$ is the [greatest common divisor](/hpc/algorithms/gcd/) of $a$ and $m$. ### Fermat's Theorem @@ -65,6 +105,17 @@ $$ where $\phi(m)$ is called Euler's totient function and is equal to the number of residues of $m$ that is coprime with it. In particular case of when $m$ is prime, $\phi(p) = p - 1$ and we get Fermat's theorem, which is just a special case. 
+Несколько причин: + +Это выражение довольно легко вбивать (1e9+7). +Простое число. +Достаточно большое. +int не переполняется при сложении. +long long не переполняется при умножении. +Кстати, 10^9 + 910 +9 + +9 обладает всеми теми же свойствами. Иногда используют и его. + ### Primality Testing These theorems have a lot of applications. One of them is checking whether a number $n$ is prime or not faster than factoring it. You can pick any base $a$ at random and try to raise it to power $a^{p-1}$ modulo $n$ and check if it is $1$. Such base is called *witness*. @@ -105,8 +156,27 @@ int binpow(int a, int n) { } ``` +179.64 + This helps if `n` or `mod` is a constant. +```c++ +int inverse(int _a) { + long long a = _a, r = 1; + + #pragma GCC unroll(30) + for (int l = 0; l < 30; l++) { + if ( (M - 2) >> l & 1 ) + r = r * a % M; + a = a * a % M; + } + + return r; +} +``` + +171.68 + ### Modular Division "Normal" operations also apply to residues: +, -, *. But there is an issue with division, because we can't just bluntly divide two numbers: $\frac{8}{2} = 4$, but $\frac{8 \\% 5 = 3}{2 \\% 5 = 2} \neq 4$. @@ -180,8 +250,33 @@ int gcd(int a, int b, int &x, int &y) { y = x1; return d; } + +int inverse(int a) { + int x, y; + gcd(a, M, x, y); + if (x < 0) + x += M; + return x; +} ``` +159.28 + +```c++ +int inverse(int a) { + int b = M, x = 1, y = 0; + while (a != 1) { + y -= b / a * x; + b %= a; + swap(a, b); + swap(x, y); + } + return x < 0 ? x + M : x; +} +``` + +134.33 + Another application is the exact division modulo $2^k$. **Exercise**. Try to adapt the technique for binary GCD. 
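Editorial sketch, not part of the commit: the iterative `inverse` added in this patch can be checked in isolation with something like the following. The concrete modulus `M = 1000000007` and the precondition `assert` are assumptions made for illustration only; the patch itself leaves `M` unspecified and does not validate its input.

```cpp
#include <cassert>
#include <utility>

// Assumed modulus (prime), chosen only for illustration.
const int M = 1000000007;

// Iterative extended Euclid.
// Invariant: a is congruent to x * a0 and b is congruent to y * a0 (mod M),
// where a0 is the original argument. Requires gcd(a0, M) = 1.
int inverse(int a) {
    assert(a > 0 && a < M);
    int b = M, x = 1, y = 0;
    while (a != 1) {
        y -= b / a * x;  // apply the same multiple to the coefficient...
        b %= a;          // ...that is subtracted from the value itself
        std::swap(a, b);
        std::swap(x, y);
    }
    return x < 0 ? x + M : x;  // normalize into [0, M)
}
```

Multiplying the result back, as in `1LL * inverse(a) * a % M`, should give `1` for any `a` coprime with `M`; the widening to 64 bits avoids the same kind of overflow that the earlier `reciprocal.md` fix in this series addresses.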
From 0d4d13729d5662f715d258fb37bebba5ada2928c Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 27 Apr 2022 17:54:18 +0300 Subject: [PATCH 075/173] reorganize hpc number theory --- .../english/hpc/number-theory/cryptography.md | 2 +- .../hpc/number-theory/error-correction.md | 2 +- .../hpc/number-theory/exponentiation.md | 70 ++++++++ content/english/hpc/number-theory/finite.md | 2 +- content/english/hpc/number-theory/inverse.md | 170 +----------------- content/english/hpc/number-theory/modular.md | 105 +++++++++++ .../english/hpc/number-theory/montgomery.md | 14 +- 7 files changed, 192 insertions(+), 173 deletions(-) create mode 100644 content/english/hpc/number-theory/exponentiation.md create mode 100644 content/english/hpc/number-theory/modular.md diff --git a/content/english/hpc/number-theory/cryptography.md b/content/english/hpc/number-theory/cryptography.md index 0dd500dc..e552372a 100644 --- a/content/english/hpc/number-theory/cryptography.md +++ b/content/english/hpc/number-theory/cryptography.md @@ -1,6 +1,6 @@ --- title: Cryptography -weight: 6 +weight: 7 draft: true --- diff --git a/content/english/hpc/number-theory/error-correction.md b/content/english/hpc/number-theory/error-correction.md index 91f1f472..e8774ed8 100644 --- a/content/english/hpc/number-theory/error-correction.md +++ b/content/english/hpc/number-theory/error-correction.md @@ -1,6 +1,6 @@ --- title: Error Correction -weight: 4 +weight: 6 draft: true --- diff --git a/content/english/hpc/number-theory/exponentiation.md b/content/english/hpc/number-theory/exponentiation.md new file mode 100644 index 00000000..f82af3e6 --- /dev/null +++ b/content/english/hpc/number-theory/exponentiation.md @@ -0,0 +1,70 @@ +--- +title: Binary Exponentiation +weight: 2 +--- + +### Binary Exponentiation + +To perform the Fermat test, we need to raise a number to power $n-1$, preferrably using less than $n-2$ modular multiplications. 
We can use the fact that multiplication is associative: + +$$ +\begin{aligned} + a^{2k} &= (a^k)^2 +\\ a^{2k + 1} &= (a^k)^2 \cdot a +\end{aligned} +$$ + +We essentially group it like this: + +$$ +a^8 = (aaaa) \cdot (aaaa) = ((aa)(aa))((aa)(aa)) +$$ + +This allows using only $O(\log n)$ operations (or, more specifically, at most $2 \cdot \log_2 n$ modular multiplications). + +```c++ +int binpow(int a, int n) { + int res = 1; + while (n) { + if (n & 1) + res = res * a % mod; + a = a * a % mod; + n >>= 1; + } + return res; +} +``` + +179.64 + +This helps if `n` or `mod` is a constant. + +```c++ +int inverse(int _a) { + long long a = _a, r = 1; + + #pragma GCC unroll(30) + for (int l = 0; l < 30; l++) { + if ( (M - 2) >> l & 1 ) + r = r * a % M; + a = a * a % M; + } + + return r; +} +``` + +171.68 + + +Несколько причин: + +Это выражение довольно легко вбивать (1e9+7). +Простое число. +Достаточно большое. +int не переполняется при сложении. +long long не переполняется при умножении. +Кстати, 10^9 + 910 +9 + +9 обладает всеми теми же свойствами. Иногда используют и его. + diff --git a/content/english/hpc/number-theory/finite.md b/content/english/hpc/number-theory/finite.md index fbef0015..cae2f2ef 100644 --- a/content/english/hpc/number-theory/finite.md +++ b/content/english/hpc/number-theory/finite.md @@ -1,6 +1,6 @@ --- title: Finite Fields -weight: 3 +weight: 5 draft: true --- diff --git a/content/english/hpc/number-theory/inverse.md b/content/english/hpc/number-theory/inverse.md index beb56611..c0d9df08 100644 --- a/content/english/hpc/number-theory/inverse.md +++ b/content/english/hpc/number-theory/inverse.md @@ -1,121 +1,8 @@ --- -title: Modular Inverse -weight: 1 +title: Extended Euclidean Algorithm +weight: 3 --- - - -Computers usually store time as the number of seconds that have passed since the 1st of January, 1970 — the start of the "Unix era" — and use these timestamps in all computations that have to do with time. 
- -We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. For example, at the moment of writing, approximately 63882260594 seconds have passed since 0 AD. - -But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now and it's time to go to dinner, or that it's Thursday and so Subway's sub of the day is an Italian BMT. Instead of the whole timestamp, we use its *remainer* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones. - -### Modular Arithmetic - -Two integers $a$ and $b$ are said to be *congruent* modulo $m$ if $m$ divides their difference: - -$$ -m \mid (a - b) \; \Longleftrightarrow \; a \equiv b \pmod m -$$ - -Congruence modulo $m$ is an equivalence relation, which splits all integers into equivalence classes, called *residues*. Each residue class modulo $m$ may be represented by any one of its members — although we commonly use the smallest nonnegative integer of that class (equal to the remainder $x \bmod m$ for all nonnegative $x$). - - - -*Modular arithmetic* studies these sets of residues, which are fundamental for number theory. - -**Problem.** Today is Thursday. What day of the week it will be exactly in a year? - -If we enumerate each day of the week starting with Monday from $0$ to $6$ inclusive, Thursday gets number $3$. To find out what day it is going to be in a year from now, we need to add $365$ to it and then reduce modulo $7$. Conveniently, $365 \bmod 7 = 1$, so we know that it will be Friday unless it is a leap year (in which case it will be Saturday). - -**Problem.** Our "week" now consists of $m$ days, and our year consists of $a$ days (no leap years). How many distinct days of the week there will be among one, two, three and so on whole years from now? 
- -For simplicity, assume that today is Monday, so that the initial day number $d_0$ is zero and after each year, it changes to - -$$ -d_{k + 1} = (d_k + a) \bmod m -$$ - -After $k$ years, it will be - -$$ -d_k = k \cdot a \bmod m -$$ - -Since there are only $m$ days in a week, at some point it will be Monday again, and the sequence of day numbers is going to cycle. The number of distinct days is the length of this cycle, so we need to find the smallest $k$ such that - -$$ -k \cdot a \equiv 0 \pmod m -$$ - -First of all, if $a \equiv 0$, it will be ethernal Monday. We now assume the non-trivial case of $a \not \equiv 0$. - -For a seven-day week, $m = 7$ is prime. There is no $k$ smaller than $m$ such that $k \cdot a$ is divisible by $m$ because $m$ can not be decomposed in such a product by the definition of primality. So, if $m$ is prime, we will cycle through all of $m$ week days. - -If $m$ is not prime, but $a$ is *coprime* with it (that is, $a$ and $m$ do not have common divisors), then the answer is still $m$ for the same reason: the divisors of $a$ do not help in zeroing out the product any faster. - -If $a$ and $m$ share some divisors, then it is only possible to get residues that are also divisible by them. For example, if the week is $m = 10$ days long, and the year has $a = 42$ or any other even number of days, then we will cycle through all even day numbers, and if the number of days is a multiple of $5$, then we will only oscillate between $0$ and $5$. Otherwise, we will go through all the $10$ remainders. - -Therefore, in general, the answer is $\frac{m}{\gcd(a, m)}$, where $\gcd(a, m)$ is the [greatest common divisor](/hpc/algorithms/gcd/) of $a$ and $m$. - -### Fermat's Theorem - -Now, consider what happens if instead of adding a number $a$, we repeatedly multiply by it, that is, write numbers in the form $a^n \mod m$. Since these are all finite numbers there is going to be a cycle, but what will its length be? 
If $p$ is prime, it turns out, all of them. - -**Theorem.** $a^p \equiv a \pmod p$ for all $a$ that are not multiple of $p$. - -**Proof**. Let $P(x_1, x_2, \ldots, x_n) = \frac{k}{\prod (x_i!)}$ be the *multinomial coefficient*, that is, the number of times the element $a_1^{x_1} a_2^{x_2} \ldots a_n^{x_n}$ would appear after the expansion of $(a_1 + a_2 + \ldots + a_n)^k$. Then - -$$ -\begin{aligned} -a^p &= (\underbrace{1+1+\ldots+1+1}_\text{$a$ times})^p & -\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by defenition)} -\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} \frac{p!}{x_1! x_2! \ldots x_a!} & \text{(which terms will not be divisible by $p$?)} -\\\ &\equiv P(p, 0, \ldots, 0) + \ldots + P(0, 0, \ldots, p) & \text{(everything else will be canceled)} -\\\ &= a -\end{aligned} -$$ - -and then dividing by $a$ gives us the Fermat's theorem. - -Note that this is only true for prime $p$. Euler's theorem handles the case of arbitary $m$, and states that - -$$ -a^{\phi(m)} \equiv 1 \pmod m -$$ - -where $\phi(m)$ is called Euler's totient function and is equal to the number of residues of $m$ that is coprime with it. In particular case of when $m$ is prime, $\phi(p) = p - 1$ and we get Fermat's theorem, which is just a special case. - -Несколько причин: - -Это выражение довольно легко вбивать (1e9+7). -Простое число. -Достаточно большое. -int не переполняется при сложении. -long long не переполняется при умножении. -Кстати, 10^9 + 910 -9 - +9 обладает всеми теми же свойствами. Иногда используют и его. - ### Primality Testing These theorems have a lot of applications. One of them is checking whether a number $n$ is prime or not faster than factoring it. You can pick any base $a$ at random and try to raise it to power $a^{p-1}$ modulo $n$ and check if it is $1$. Such base is called *witness*. @@ -124,59 +11,6 @@ Such probabilistic tests are therefore returning either "no" or "maybe." 
It may Unless the input is provided by an adversary, the mistake probability will be low. This test is adequate for finding large primes: there are roughly $\frac{n}{\ln n}$ primes among the first $n$ numbers, which is another fact that we are not going to prove. These primes are distributed more or less evenly, so one can just pick a random number and check numbers in sequence, and after checking $O(\ln n)$ numbers one will probably be found. -### Binary Exponentiation - -To perform the Fermat test, we need to raise a number to power $n-1$, preferrably using less than $n-2$ modular multiplications. We can use the fact that multiplication is associative: - -$$ -\begin{aligned} - a^{2k} &= (a^k)^2 -\\ a^{2k + 1} &= (a^k)^2 \cdot a -\end{aligned} -$$ - -We essentially group it like this: - -$$ -a^8 = (aaaa) \cdot (aaaa) = ((aa)(aa))((aa)(aa)) -$$ - -This allows using only $O(\log n)$ operations (or, more specifically, at most $2 \cdot \log_2 n$ modular multiplications). - -```c++ -int binpow(int a, int n) { - int res = 1; - while (n) { - if (n & 1) - res = res * a % mod; - a = a * a % mod; - n >>= 1; - } - return res; -} -``` - -179.64 - -This helps if `n` or `mod` is a constant. - -```c++ -int inverse(int _a) { - long long a = _a, r = 1; - - #pragma GCC unroll(30) - for (int l = 0; l < 30; l++) { - if ( (M - 2) >> l & 1 ) - r = r * a % M; - a = a * a % M; - } - - return r; -} -``` - -171.68 - ### Modular Division "Normal" operations also apply to residues: +, -, *. But there is an issue with division, because we can't just bluntly divide two numbers: $\frac{8}{2} = 4$, but $\frac{8 \\% 5 = 3}{2 \\% 5 = 2} \neq 4$. 
diff --git a/content/english/hpc/number-theory/modular.md b/content/english/hpc/number-theory/modular.md new file mode 100644 index 00000000..92e0c687 --- /dev/null +++ b/content/english/hpc/number-theory/modular.md @@ -0,0 +1,105 @@ +--- +title: Modular Arithmetic +weight: -1 +--- + + + + +Computers usually store time as the number of seconds that have passed since the 1st of January, 1970 — the start of the "Unix era" — and use these timestamps in all computations that have to do with time. + +We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. For example, at the moment of writing, approximately 63882260594 seconds have passed since 0 AD. + +But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now and it's time to go to dinner, or that it's Thursday and so Subway's sub of the day is an Italian BMT. Instead of the whole timestamp, we use its *remainer* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones. + +**Problem.** Today is Thursday. What day of the week it will be exactly in a year? + +If we enumerate each day of the week starting with Monday from $0$ to $6$ inclusive, Thursday gets number $3$. To find out what day it is going to be in a year from now, we need to add $365$ to it and then reduce modulo $7$. Conveniently, $365 \bmod 7 = 1$, so we know that it will be Friday unless it is a leap year (in which case it will be Saturday). + +**Definition.** Two integers $a$ and $b$ are said to be *congruent* modulo $m$ if $m$ divides their difference: + +$$ +m \mid (a - b) \; \Longleftrightarrow \; a \equiv b \pmod m +$$ + +For example, day 42 of the year is 161 119 = 17 \times 7. + +Congruence modulo $m$ is an equivalence relation, which splits all integers into equivalence classes, called *residues*. 
Each residue class modulo $m$ may be represented by any one of its members — although we commonly use the smallest nonnegative integer of that class (equal to the remainder $x \bmod m$ for all nonnegative $x$). + + + +*Modular arithmetic* studies these sets of residues, which are fundamental for number theory. + +**Problem.** Our "week" now consists of $m$ days, and our year consists of $a$ days (no leap years). How many distinct days of the week there will be among one, two, three and so on whole years from now? + +For simplicity, assume that today is Monday, so that the initial day number $d_0$ is zero and after each year, it changes to + +$$ +d_{k + 1} = (d_k + a) \bmod m +$$ + +After $k$ years, it will be + +$$ +d_k = k \cdot a \bmod m +$$ + +Since there are only $m$ days in a week, at some point it will be Monday again, and the sequence of day numbers is going to cycle. The number of distinct days is the length of this cycle, so we need to find the smallest $k$ such that + +$$ +k \cdot a \equiv 0 \pmod m +$$ + +First of all, if $a \equiv 0$, it will be ethernal Monday. Now, assuming the non-trivial case of $a \not \equiv 0$: + +- For a seven-day week, $m = 7$ is prime. There is no $k$ smaller than $m$ such that $k \cdot a$ is divisible by $m$ because $m$ can not be decomposed in such a product by the definition of primality. So, if $m$ is prime, we will cycle through all of $m$ week days. +- If $m$ is not prime, but $a$ is *coprime* with it (that is, $a$ and $m$ do not have common divisors), then the answer is still $m$ for the same reason: the divisors of $a$ do not help in zeroing out the product any faster. +- If $a$ and $m$ share some divisors, then it is only possible to get residues that are also divisible by them. 
For example, if the week is $m = 10$ days long, and the year has $a = 42$ or any other even number of days, then we will cycle through all even day numbers, and if the number of days is a multiple of $5$, then we will only oscillate between $0$ and $5$. Otherwise, we will go through all the $10$ remainders. + +Therefore, in general, the answer is $\frac{m}{\gcd(a, m)}$, where $\gcd(a, m)$ is the [greatest common divisor](/hpc/algorithms/gcd/) of $a$ and $m$. + +### Fermat's Theorem + +Now, consider what happens if instead of adding a number $a$, we repeatedly multiply by it, that is, write numbers in the form $a^n \mod m$. Since these are all finite numbers there is going to be a cycle, but what will its length be? If $p$ is prime, it turns out, all of them. + +**Theorem.** $a^p \equiv a \pmod p$ for all $a$ that are not multiple of $p$. + +**Proof**. Let $P(x_1, x_2, \ldots, x_n) = \frac{k}{\prod (x_i!)}$ be the *multinomial coefficient*, that is, the number of times the element $a_1^{x_1} a_2^{x_2} \ldots a_n^{x_n}$ would appear after the expansion of $(a_1 + a_2 + \ldots + a_n)^k$. Then + +$$ +\begin{aligned} +a^p &= (\underbrace{1+1+\ldots+1+1}_\text{$a$ times})^p & +\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by defenition)} +\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} \frac{p!}{x_1! x_2! \ldots x_a!} & \text{(which terms will not be divisible by $p$?)} +\\\ &\equiv P(p, 0, \ldots, 0) + \ldots + P(0, 0, \ldots, p) & \text{(everything else will be canceled)} +\\\ &= a +\end{aligned} +$$ + +and then dividing by $a$ gives us the Fermat's theorem. + +Note that this is only true for prime $p$. Euler's theorem handles the case of arbitary $m$, and states that + +$$ +a^{\phi(m)} \equiv 1 \pmod m +$$ + +where $\phi(m)$ is called Euler's totient function and is equal to the number of residues of $m$ that is coprime with it. In particular case of when $m$ is prime, $\phi(p) = p - 1$ and we get Fermat's theorem, which is just a special case. 
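Editorial sketch, not part of the commit: the two congruences stated above, $a^{p-1} \equiv 1 \pmod p$ for prime $p$ and $a^{\phi(m)} \equiv 1 \pmod m$ for coprime $a$ and $m$, can be spot-checked numerically. The helper names `binpow` and `fermat_holds`, the three-argument signature, and the restriction to moduli below $2^{32}$ are assumptions of this sketch rather than the author's code.

```cpp
#include <cassert>

typedef unsigned long long u64;

// Computes a^n mod m by repeated squaring.
// Assumes m < 2^32 so that every intermediate product fits in 64 bits.
u64 binpow(u64 a, u64 n, u64 m) {
    assert(m > 0 && m < (1ULL << 32));
    u64 res = 1 % m;
    a %= m;
    while (n) {
        if (n & 1)
            res = res * a % m;  // fold in the current power of a
        a = a * a % m;          // square for the next bit of n
        n >>= 1;
    }
    return res;
}

// Checks a^(p-1) == 1 (mod p); true for every prime p not dividing a,
// and false for most (but not all) composite p.
bool fermat_holds(u64 a, u64 p) {
    return binpow(a, p - 1, p) == 1;
}
```

For example, `fermat_holds(2, 7)` is true because $2^6 = 64 = 9 \cdot 7 + 1$, while `fermat_holds(2, 9)` is false because $2^8 = 256 \equiv 4 \pmod 9$. Euler's version can be checked the same way: $\phi(10) = 4$ and `binpow(3, 4, 10)` gives $81 \bmod 10 = 1$.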
diff --git a/content/english/hpc/number-theory/montgomery.md b/content/english/hpc/number-theory/montgomery.md index e784dfaf..233e355d 100644 --- a/content/english/hpc/number-theory/montgomery.md +++ b/content/english/hpc/number-theory/montgomery.md @@ -1,6 +1,6 @@ --- title: Montgomery Multiplication -weight: 2 +weight: 4 --- When we talked about [integers](../integer) in general, we discussed how to perform division and modulo by multiplication, and, unsurprisingly, in modular arithmetic 90% of its time is spent calculating modulo. Apart from using the general tricks described in the previous article, there is another method specifically for modular arithmetic, called *Montgomery multiplication*. @@ -79,6 +79,9 @@ Since $x < n \cdot n < r \cdot n$ (as $x$ is a product of multiplicatio) and $q Here is an equivalent C implementation for 64-bit integers: ```c++ +typedef unsigned long long u64; +typedef __uint128_t u128; + u64 reduce(u128 x) { u64 q = u64(x) * nr; u64 m = ((u128) q * n) >> 64; @@ -134,7 +137,6 @@ Transforming a number into the space is just a multiplication inside the space o ### Complete Implementation ```c++ -// TODO fix me and prettify me struct montgomery { u64 n, nr; @@ -148,6 +150,9 @@ struct montgomery { u64 q = u64(x) * nr; u64 m = ((u128) q * n) >> 64; u64 xhi = (x >> 64); + //cout << u64(x>>64) << " " << u64(x) << " " << q << endl; + //cout << u64(m>>64) << " " << u64(m) << endl; + //exit(0); if (xhi >= m) return (xhi - m); else @@ -163,3 +168,8 @@ struct montgomery { } }; ``` + +```c++ +montgomery m(n); +m.transform(x); +``` \ No newline at end of file From 013fd0109e05d69a4a30b5b86db74ffa6e784adf Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 27 Apr 2022 21:01:34 +0300 Subject: [PATCH 076/173] publish modular arithmetic --- content/english/hpc/number-theory/_index.md | 1 - .../{inverse.md => euclid-extended.md} | 32 ++------- .../hpc/number-theory/exponentiation.md | 9 ++- content/english/hpc/number-theory/modular.md | 66 
++++++++++++++----- .../english/hpc/number-theory/montgomery.md | 1 + 5 files changed, 65 insertions(+), 44 deletions(-) rename content/english/hpc/number-theory/{inverse.md => euclid-extended.md} (51%) diff --git a/content/english/hpc/number-theory/_index.md b/content/english/hpc/number-theory/_index.md index d532bcfd..bb6a8b3c 100644 --- a/content/english/hpc/number-theory/_index.md +++ b/content/english/hpc/number-theory/_index.md @@ -1,7 +1,6 @@ --- title: Number Theory weight: 7 -draft: true --- In 1940, a British mathematician [G. H. Hardy](https://en.wikipedia.org/wiki/G._H._Hardy) published a famous essay titled "[A Mathematician's Apology](https://en.wikipedia.org/wiki/A_Mathematician%27s_Apology)" discussing the notion that mathematics should be pursued for its own sake rather than for the sake of its applications. diff --git a/content/english/hpc/number-theory/inverse.md b/content/english/hpc/number-theory/euclid-extended.md similarity index 51% rename from content/english/hpc/number-theory/inverse.md rename to content/english/hpc/number-theory/euclid-extended.md index c0d9df08..ea01588c 100644 --- a/content/english/hpc/number-theory/inverse.md +++ b/content/english/hpc/number-theory/euclid-extended.md @@ -1,41 +1,21 @@ --- title: Extended Euclidean Algorithm weight: 3 +draft: true --- -### Primality Testing -These theorems have a lot of applications. One of them is checking whether a number $n$ is prime or not faster than factoring it. You can pick any base $a$ at random and try to raise it to power $a^{p-1}$ modulo $n$ and check if it is $1$. Such base is called *witness*. - -Such probabilistic tests are therefore returning either "no" or "maybe." It may be the case that it just happened to be equal to $1$ but in fact $n$ is composite, in which case you need to repeat the test until you are okay with the false positive probability. Moreover, there exist carmichael numbers, which are composite numbers $n$ that satisfy $a^n \equiv 1 \pmod n$ for all $a$. 
These numbers are rare, but still [exist](https://oeis.org/A002997). - -Unless the input is provided by an adversary, the mistake probability will be low. This test is adequate for finding large primes: there are roughly $\frac{n}{\ln n}$ primes among the first $n$ numbers, which is another fact that we are not going to prove. These primes are distributed more or less evenly, so one can just pick a random number and check numbers in sequence, and after checking $O(\ln n)$ numbers one will probably be found. - -### Modular Division - -"Normal" operations also apply to residues: +, -, *. But there is an issue with division, because we can't just bluntly divide two numbers: $\frac{8}{2} = 4$, but $\frac{8 \\% 5 = 3}{2 \\% 5 = 2} \neq 4$. - -To perform division, we need to find an element that will behave itself like the reciprocal $\frac{1}{a} = a^{-1}$, and instead of "division" multiply by it. This element is called a *modular inverse*. +If the modulo is not prime, then we can still get by calculating $\phi(m)$ and invoking Euler's theorem. But calculating $\phi(m)$ is as difficult as factoring it, which is not fast. There is a more general method. -If the modulo is a prime number, then the solution is $a^{-1} \equiv a^{p-2}$, which follows directly from Fermat's theorem by dividing the equivalence by $a$: +Note that this is only true for prime $p$. Euler's theorem handles the case of arbitary $m$, and states that $$ -a^p \equiv a \implies a^{p-1} \equiv 1 \implies a^{p-2} \equiv a^{-1} +a^{\phi(m)} \equiv 1 \pmod m $$ -This means that $a^{p-2}$ "behaves" like $a^{-1}$ which is what we need. - -You can calculate $a^{p-2}$ in $O(\log p)$ time using binary exponentiation: +where $\phi(m)$ is called [Euler's totient function](https://en.wikipedia.org/wiki/Euler%27s_totient_function) and is equal to the number of residues of $m$ that is coprime with it. In particular case of when $m$ is prime, $\phi(p) = p - 1$ and we get Fermat's theorem, which is just a special case. 
-```c++ -int inv(int x) { - return binpow(x, mod - 2); -} -``` - -If the modulo is not prime, then we can still get by calculating $\phi(m)$ and invoking Euler's theorem. But calculating $\phi(m)$ is as difficult as factoring it, which is not fast. There is a more general method. - -### Extended Euclidean Algorithm +--- *Extended Euclidean algorithm* apart from finding $g = \gcd(a, b)$ also finds integers $x$ and $y$ such that diff --git a/content/english/hpc/number-theory/exponentiation.md b/content/english/hpc/number-theory/exponentiation.md index f82af3e6..68142c30 100644 --- a/content/english/hpc/number-theory/exponentiation.md +++ b/content/english/hpc/number-theory/exponentiation.md @@ -1,9 +1,16 @@ --- title: Binary Exponentiation weight: 2 +draft: true --- -### Binary Exponentiation +You can calculate $a^{p-2}$ in $O(\log p)$ time using binary exponentiation: + +```c++ +int inv(int x) { + return binpow(x, mod - 2); +} +``` To perform the Fermat test, we need to raise a number to power $n-1$, preferrably using less than $n-2$ modular multiplications. We can use the fact that multiplication is associative: diff --git a/content/english/hpc/number-theory/modular.md b/content/english/hpc/number-theory/modular.md index 92e0c687..b6045a3a 100644 --- a/content/english/hpc/number-theory/modular.md +++ b/content/english/hpc/number-theory/modular.md @@ -20,11 +20,13 @@ Computers usually store time as the number of seconds that have passed since the We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. For example, at the moment of writing, approximately 63882260594 seconds have passed since 0 AD. -But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now and it's time to go to dinner, or that it's Thursday and so Subway's sub of the day is an Italian BMT. 
Instead of the whole timestamp, we use its *remainer* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones. +But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now and it's time to go to dinner, or that it's Thursday, and so Subway's sub of the day is an Italian BMT. Instead of the whole timestamp, we use its *remainder* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones. -**Problem.** Today is Thursday. What day of the week it will be exactly in a year? +**Problem.** Today is Thursday. What day of the week will be exactly in a year? -If we enumerate each day of the week starting with Monday from $0$ to $6$ inclusive, Thursday gets number $3$. To find out what day it is going to be in a year from now, we need to add $365$ to it and then reduce modulo $7$. Conveniently, $365 \bmod 7 = 1$, so we know that it will be Friday unless it is a leap year (in which case it will be Saturday). +If we enumerate each day of the week, starting with Monday, from $0$ to $6$ inclusive, Thursday gets number $3$. To find out what day it is going to be in a year from now, we need to add $365$ to it and then reduce modulo $7$. Conveniently, $365 \bmod 7 = 1$, so we know that it will be Friday unless it is a leap year (in which case it will be Saturday). + +### Residues **Definition.** Two integers $a$ and $b$ are said to be *congruent* modulo $m$ if $m$ divides their difference: @@ -32,9 +34,9 @@ $$ m \mid (a - b) \; \Longleftrightarrow \; a \equiv b \pmod m $$ -For example, day 42 of the year is 161 119 = 17 \times 7. +For example, the 42nd day of the year is the same weekday as the 161st since $(161 - 42) = 119 = 17 \times 7$. -Congruence modulo $m$ is an equivalence relation, which splits all integers into equivalence classes, called *residues*. 
Each residue class modulo $m$ may be represented by any one of its members — although we commonly use the smallest nonnegative integer of that class (equal to the remainder $x \bmod m$ for all nonnegative $x$). +Congruence modulo $m$ is an equivalence relation that splits all integers into equivalence classes called *residues*. Each residue class modulo $m$ may be represented by any one of its members — although we commonly use the smallest nonnegative integer of that class (equal to the remainder $x \bmod m$ for all nonnegative $x$). + +![](../img/gcd-dependency1.png) Modern processors can execute many instructions in parallel, essentially meaning that the true "cost" of this computation is roughly the sum of latencies on its critical path. In this case, it is the total latency of `diff`, `abs`, `ctz`, and `shift`. We can decrease this latency using the fact that we can actually calculate `ctz` using just `diff = a - b`, because a negative number divisible by $2^k$ still has $k$ zeros at the end. 
This lets us not wait for `max(diff, -diff)` to be computed first, resulting in a shorter graph like this: -@@ + + +![](../img/gcd-dependency2.png) Hopefully you will be less confused when you think about how the final code will be executed: diff --git a/content/english/hpc/algorithms/img/gcd-dependency1.png b/content/english/hpc/algorithms/img/gcd-dependency1.png new file mode 100644 index 0000000000000000000000000000000000000000..4e58904c19b0720e58dd26977fc20d0b837a9eb9 GIT binary patch literal 14837 zcmaL8cQ}`Q{5JlvH?7iphAYwPRc=^<$5;(nH-ciu%%X0M=;?L|R3896yYSveI21r-GuK_gQ` zzl_-&5=oGBQ0IVIK-Toz<<+3`qn>dGdh$K~@ipAMJ zHsffP`9Ci!W*MY-PycMXQc`qXorac{)-f|PYDq~+F$oFn)2FvrR8(BOdR6oIaZdG+ z)tKVqJuia(vX-8HdAX!SikXGQ*29BUSy{R24lRwP_{XT|=*FfdDi;?QnSJ}NEiDlN!ye!U-8WwCqc<(c`b$y4H0zh{pI!tEi8C!$lm_`yR@&ciXOeA zKK^t6hrY=-8WiXK{PY|gb|h_Ai~IHY`TV8Y%zd7XU%qU=fAmgtcJ}sRa=vU zhqkh^qU1F~+O}=m)!VlvyPuwtcy@oZvcKPghnM#&US}mBPQHR8`r7-@e0E{MK~mfG zz+bcAs!RX;ruxexdbKFR^7BQ`xw?A)_^|tg&tuxhX2HJ8zt3*cI%99&J{Oy=6(X6_S`mS(fI zwyw$#Zut0->-6c?8J%2{(hdnzdmd0>e6s@q%nJV3$UtFAE+Z6qEW$~wsy!_a1fo6W3 zYsBlRk=01xHvBpFnfliSr@VgsA{Zf@V5b71)P*!?>pjd*WBr=Itf z^f

      {QTVf{HFeZu57h`KY~7e`eb7@+EZ$y;XBp8*~G-;>a}Z)T^bv!pXUM>e?<9B z4+u$0vc$#3C5+!y&zIg<65PF;fr^%S?r><<9)!^aUOG-Wy%=m%#_-jtHwe;DJ~dcf+HjGRgJaUgFj&j5nNCzx)XI7w`|4E^6&2O~(DiMH z4jm$eMbNQ@l)t^D=z7e=zXG{q^CVSrLtnf$wt6LMTOj}mKNc? zd%3U!*GI|L&;6pVU*~w{H~qTz00}2@x1eB7!XpA}W3snCSnID@VoZE|Q?B-vGw04R zIypH3WUyo@<()cpPMqL!dL9!)UsF>vGTD>QqZX)e?beDa(*c5o-f@0CbhjmMQBrMZcz4N%TS{v(YDOp(wH|3p#Wn`{CY&A~I z4=U2zAu5`S`vkyf!Kc?f|2$9bDtGP?X693%=jG+C@%pNjZ|~~tJc1?N|Ja3HK~Ygi zNQmP2@#Dn$GQEE!>iKh3N-8SvseV};KqHm_9qYERv^36h=gwibMd4aJc=)iXx0e%l zw&m`@R2ghF+M}l8Vq%&{jKQ#o#XrZkK3tX?cZNs{-gM%h+e`-u#AM8 z$8c?;Zafzdj^erh%tY{TM+bAIq0f+ z;^R*q*>kzy!tEH3lB#Ozfe6|Z_FYG2-IKmHT}j0Xop)co*Z(qT-}2hu;jN#g)YY@y zhH6qUn3x2Qzj^oL(BkY@?Y#T<^|P|F3^3Mg$SNVLN+%yXZ?++i=mrL=e1CtR?cmX) zFMeLZ9U#R|;{xxIT^u8A6+@$itky3AK$*Ub3f!GklsMaSLr zU3x6f7E*u5`^zdTwKX+KD+}M7ER!mox`>mk30YHd<+Wt86sJ%MUXsBj{W?AU_VsGA zfJSs;B8$gRP0t8t!;AikvrOv2OC+qwwl{C&@6;#xyn+JV604_;)yh&NRa>i-b&TTo^uE7;w7ou@s{M&; zUw_i?G=M+5yuAFitgMc{i^0Lc1~Xq?wlDtdTUKEcihT9z)m4Ge)s#|!qcPXG%*8b{ zG~|Mpyhdkcy8N!ssQ69s7@L}ka7&+d2%h@=`+SAV`<=wK7Zw)g5!eX&QWtvkM6rcX z*>dE`c=6+)9K0GK`;~nsHw_x|ace3+S%l5%8v zn#JASow#;QSJ=q*c15z1_sj|k3M`)cJ!+Gx72x5a4-E~CjEp3xi>0OIu@fislDDgM zm)q-uH2|s^1vEkkeE9M*=!$vi=>sMvTcV?*F)q(6Om6S>#ivQOW0=qWnLdT>BD`-O z_syF(aU&z1JyZGi?ORSx4hj31dHep1CyG)JWe0^WtR`u^sYD^s(vQkn} zX>4we>a9FTofxuYIG6b1#eU3dI(>cps(gA5NzP_-08zxPTYM88H!;n_H-9%tyK$qE zKYX+iTx1TTr_AqHeom}g;gKvQA+zfByL^BC&D7bqCh@m*bvdar4#zqa{+^#F7=V(p zayft0=)1NQ5n2CVR#_?+;-!}sH}{5@>&5@6?&~bya!iPQAOHAFDh* zs}$SRWl!@2@8T#ZDlXQswH3n0ZMBwCmX~Mm>FEK*i4A*KE`dE4x;jS(=2&Lm%F4V= z>FTtBnF#f|@^L#m;q3hvFN1iE0&hUP^xfSh9UL4mdXqr2a`W;i85tRI`J=0<_Ua|_ zH9wUfDY1Tj%-npJ$$eqPK5Rj+=>f%nfB^dDv&&{p)Fwq(URfHU+d)Ufq@-GK5eQyk zivbR5s)JLn>#wb<;yQWqq`~3CuYUHGZsy{O28H0+y0xjVj~l$iwWqo_AvsweYlRr% zH*Q>MY}C4Y|Gxc5eR!6^(|-^v#vczn2-w?t@jE!GrbC< zCh)7K54{Y!A<)FGMZbH`o*hX^Ne-{C@7@8t^qil4^X7(h`@;A4?Km$3XJ_ZiR*YuR z)?|TQ0|NsNKR-Ug>b|!uv^<5~=Fpj??ACf_TYf*L+jCtgEYgeq;U8M>z#YC;|uI1!bnRu<&JOeigGxbCq}R-;3DV 
z+Ts+LMG7om1}dJcbYx-IR(vx(SRH%s-p+%E4?C#7Ha}5fbr6i4THO;1gjdB^>}lX) zQ9XGg{mK&lMUUAAj}sZ6Zc^1%K4$x8ldGthO1One6~sK0I#wDvnd44KVX5=>WaZ4ne^% z+=*z==JH!(-~|q=OLM+yq7=?23fkrtSIVxrb>n$NK#qv7>XV)^68 zk)@^5gR-@|Wo0*m8lP3wt5hg7%gxP|k&!v0!bGYc9NhZ*_iuwEM?C&m0DmTjo>tZm zYwPIg@>ZWn!0(@rh2~w$FEP# znwy&q4jpO)J}#%wxeq)Jzk7EFrXCNkCk~#Tjm^nnkB8c&6%LKH1=GqFsJxa~_FaSm zShCxVy+R0Lkc#{ZJkGAJ^CCFfp`mS<6O(3Elu*bG)#*v?ChVC=8P~#g{LNF;1z4seWDtIP8}&;31zUR-3yIMp*UVj-OY@(V`R1C&8)TA}wf=@0ak zSj*bCFarXs8ukz`Ja~|y^k`-K>({R#&DwjVwY|M%S-0!WqA6AXGmE;8^M7ah2{nCkd#bJ z5z!xQPu-<8-=HiVAaxuf)O=%dSQx*-HPtdVzcJ7B@%~W(tgka5%URQbe}Antr>Mgk z!Ii!wn~n{>zDySAv<=2eTKLh+1nrlYeh9Y@$N1ns@w8vVIvW(jd+{$7sidT2c7A^B z1-TL%CN=fFxzlAGUEQxZ5iopW;*yn5Sz6lt`!kKhB^U3#FwwyaZK2uYBa%R4LxVKbLsoOi@<(1bAsh2-VsRS!13m7iGH{3^a9;=7u+ zW{19qR1`PScLZv5KRUH9{OzvYyJMkrbHUkq+&#a%h=pNuz}R>*;TAws)!Z7xiXy4c zJOL;}RX?9%REH}NNENd4C- z6I0W=!0&CI5Q~Y~o3-d0u!74S-|~UAQj*ByZdkz(26LZVcS1AT)P+nAR!d_B5e5P~ zJNwjdoq)Q!Iz1!fIR*Rk=jGt$)P%z72#_MS`{?K>-_TBV_3eNOT(J7rX%b%{4&n3I zVeGbbc1Fj>GJxpFU@K75(T%-|W%oLNp3uMv2?@t6Eg9+P=!WV-1u#ennLIn%%#@Lt zDWayv50Q2NGZ4EY>SfU1y%|^*M@D`oL$V60s61Y<>ZICkvVi2hDJNsBTProN1#B4>XVknswPNATO85kJrZuL~ENYrMw@=LQ#MYA zU~m7k3lbb095U_$3fq*tNML1~*x9vlp$;1vDbp!UO-<#%eZr-RiH*&D@PHm?8j+CD z;ZK$Xft>x?l#-Fr*8ly?Dr^x-PfyQ1mXBkm#a_6)UKIZG=TGp-$|RR3PoC(2QI6cG z2wh!DA!a+KZg}ULM{uZOIuws%>pQJ6dYo4Lel-IF&LHbxFo+-~IW1s^Yw`u$2n?)E z18*c&S68W@kv&-5-KX_z4l!l!Ond#|14nszc}#M$lL}k@i5AA|kr~5zFD@>&u(`Pi zKhy#9POEZHKW3AzN9_2*g0I~9Zw6KNvv=Erh<#Xm`lY4V%z>b3lV z@)!tZ7k-{N;X@kP#yM2uEiz~v@D3n{JCjpUacx>`$kjvQ(W6JPnVDPJX`FN*sf>>w z7sbgrs4@zDn=d*7dA|b!#8$EOG7VE3PRQbIw;5C@`>mI1`)*kmB_uGzk`v*m)-Eh9 z-35b0bo=)0Zu-kwZf<*c6`VK0m^^cK9Yi|?v#LVQ-}O}+3*)chx}(?A(oWV1T)Tan z|I>?$@(+1y$+#psz>ksU0qF4ZiVESy#YM{7)r$60Csd#;bhNc2p4dIIDA$^t*)})X zBa&BAa^%H}7u0gYrFH9vV7uvzmc&Ui zeIcu$umk?jPOP8cn`0OkK>SF;rjQ;pWwBq)u7jr~NLbDg84iJyrTZ`bj$B=x?{a`l z3BD755Ie{LF2FY!lfu{vPHsb3cNI=u+iO3sTpwk%@wmJ8wTaHo)zxs1q~t*z9qO?$ z^W(=i!&JzFguz>HuXuU;qFJnU#t)X6xZ31 
zbyGV{$RC&I$>pINA*&1TkCr`o(mc`-k*Rq4;zebgzSruKH(~3+QlA`nBBAZ_iJgZh z27bMeurOua>Tf29XhL`czC)j!rX%6j6E;)8@^4XsWPk|4j5-6>u2rXCX^OjIyEOY% zE7WqRZpl)7Wd4FS?k9m5cpv0eg7;X9_e;MsE-t$X4GFVp6z(zX895OePCeub^eim8 za6d2-ufl!%Gd-A(OPQspj4Xm4qx3;hk=^V_1CP8TbI9^IcUAuU=P(MYS(qvYhK7~= z1;)YBy`eNfL{OoC)g@WbDH*R3&4uYfZk!+xf*Isy?)&>)9NgR@khoYccw0lA5XwOF~dow0BbYp5#f) zLE~e`G!1(<(a=QAoK}8;z2@xV^7T`t)8yyp`$T?TGtD!CS^7}%U}X52kJFPLCZ_{iZba^k~Dn;?GH)tcV=fF|&`XG6M#Jxo|!mFU+_HFC4xr<`_ZSEFi@O zQy%EiyBCsrtmYG4;3jIpv4b%%OTH7z7|h@r=da~&uvTYjPuLlA3*SvaZ-g)$u?PUh z2OvhuFim}b4e>&vgahe80BMOBLxd?3i2}e)$>|%te6Ca)1lr(c>IqOISd$G8%}ZzB zAJrf{M;=wbJy>^w5)$!XEF;7;6BpM4rMI)U|N7%coTy>?ncv^qGn9QCrjy*XbRw@th)LN_8XKX!-WeM4(M<{ zz#&3yRPaZrx>>jN4j__CC5$^5rPlxjGIA4e9AaQ19Fyd2qj@cPvlD(TWdLKn2nHKgLEE%G(3Hzh` zT(5XbD@dI*=sgn+4NYfCQ+iHW*(U2}zP75g=FWxG$Ry~G9617WnzHZjU-zO%r>aMB z6Iy`CIzL*-!(cL05LCWDKEzIoS#2I*KezF7ShWo9X7;q05f>*XjgOB{ma-2mbW&>l z#OEij+|VOYE3DwY22+J_>!NoZ%`W55Z>JZssP^E;$`|d{qNu-Tl-XTkt-88QzC++J zd#p(!zbY3jB1pc}>Hcym6%~~XMOQ9~`I&(yOkciyp=V;UKfC@>Z^t1;yI1sZ?Ff!B zEw(>seu4!XoPM>*f`?`WVL!X7@9fG5kIBF0GqMG}N3FwL`H74G5G^z_NlRCgDb z*nt5{YFgTgTltEcZkiZnDH(qY5t5UW+q7j1<*8Grs`5!BpuX4o>cuMi#{t5o;@aBU zZEbDDa=~?&|NHmK)2C0nvnm}O9it~FojVs#=Gl$4B*r8pIH=P0>D)6Tc_E|2GKrt& zv0`Fi2+z&6t!({NW=looqAo8Zqv`A{W`lLZ37*Rr!&F|JCgB*nz)u^Aw|=wO3y(kcUo4hRe!1*~9ag~PEX zodFv1V7yLr-j0VN&A55<0CqS0|CpEQ#;9G5B}?h_DPQkQ^j&qS$%Cd!wZ2+w^oI>^dZ6=|9~fH-S}pzf{Zg`1~vi z+q%lWAuv4?EdyPm3oxs*f*(V#JLR`*B*q_>AIugxgd55 z(aMR^(=5163saxi)q?)KsyO>bNn;q0AuJ*SYxpwAa~|j%p@yK^7{U0#I7g951T6mG zM9|8Y^%Fi5u73jzG~SCzANq8jnZa)3Q9i!|z`XOi|JmyD@&=rX?YVQW{Wy{c5FdRN zOZfPzhvG%CaKgIis7h?=q^?I%st=D@WONowOGq?f8S2`l^}$JRXl=bR&0{3|u7FbA z#!0jo3U?DY4=fG>5~OTu6M)%!S18J50)BssHQ>)(l}?dPllj@5OEuATCpmT3(F%T= zdUE9|!mu}f33}KYU*X~jKR>b``$G~5MRs;JVTe6c#tkK+>4a^{*Lt@N5#bR`RKbcX zYc7RLhoNcV#x{YfoP0y14umFzq&_Q2C5?gviWZ%I2|HcA3kDFK4Q{TjaA@B`*jdfZ zx+6Y)A3xfGMfTTTdfpbSj>3dwMn(n!!Iox)$e_YR^pohv7qz6Mq-rkwb_`R zY!OcQ>aZt($Az$Ka|#NgaG61z+I3cZiaeE3z#tM?Y}X3@aS0$$hN{05(l!~V&TX|_ zItl 
z1re-@yos25kAU23@yBsc3|p9T#JcS=JT(mb1OU{-mG6P5*&i@V($mumK6KbSznTBR zrsH#SbA;T&2VyVA3u#9qSz3I)?_E-SyfzRVZXPAZct?gT71oy1KNp3Q*4Cf7%{AiX zEX9Qu*4Kj2cNl|?AnZ4g4%|~z4W893 zG$h<6dS>QV2ybB)|5(HA&5(Cu#Y(OB<*P@--C-dH5J|b{Z}Rdb!-s>WYlha={7`|LpY|Tz{k2d0ui1u) zx%n0NgM+dUxbF4-D%5F45*ZF%N~#Aylp#h>XE#pFE!c7%jTShy`f&C9X1=_gTnnX< zR8ryrz__)Js9F|z&i?u1s7iUIpgt$ANkz6&qF4w z>N_%iBU%}Fj0`L+XBDsH^q6=P{x0_FzMw2N0rgE7GjcGce*Ydfq`h`& z7;Z-lZnJX0te_3qr2~`t% zd=JeD_vuNY3lud-54iSN5V?9dC*yw#^!xW%^ndI?qS{`+-qJtxh;Q)uA>$iaSwkaT za32o)`ueuPpr*b)Ik>+5&r=R6=vqoj!?~o7Vd+J@tODwX!C2bB>}!|jPgQ$tW?^CZ z4G~GP>)&2~e}6lKS%(Y_c?DF(#fy6RuP$x3EOHEl;sy8#0ULM1^30k+rke^bk&2kA zJmZ%uY**o`s;Yz}UOv76 z04F3NVf@=S9SG-ktb88cQqdg!4IU0&-dNONP>W#e#;hH9!VTqg#@RWhZex9`eM^GN zJ04>G)?QKtc_$2Ui&LlMvrHn(KmSPI2`cP-?i?t@ma2S(HVnJNH{}WrCL0|*D1<~I zVS3t<@7aC3e)yFTuhLugoy-KtV-url;+8G2b^CUyEq~b@sn)XM4uGtqSMkh%gK;T1 z)su}ad8b-B*N=bBg*?7JCL|cUvx=J9L=POYn{R?zTdl5K#q^{YY5VP3?R;LEhoi+kL7yO9Vvg zD?g>Ns4a4ME%{b{CksiRe&PH(^Do$ZvGsE<Oi7*l1QBI@!T*yW*toIt^RvM84qjPcfLquC zWzd6jA!>VAV5*SyN=zmo0-J6~S!>}KAd_S~hLm)4bim@NFc-_p%PB}G9Q=qMU6Y23 zdi}Ia!8E8}`o?^?#76qO33RID)q#gUbh8x6KTb;|I+AVFV8XA>+D7~GA^*@((N zJ@wf|{y8#oMp#H_uDJ&Gy90bio8YeY_SlXr<<$FRG&csm^=3Vg?zs`l3={a@kC>qi zaX199iO{4vKp!3%`O17>_QS%$@Te#n@=tWQ1Z8DY<`>yCR_onnHy)IhUV%+D+~R43 zbt{M~0h&o$ZH(yvDYO&WA)NCNSd9z$uQ=^15gtzNo9j>jC#E8jTeZ%H3 zEOY4UvIckjkT5ZXkzaye85kKwp*8yY&70+0lJExF+uK<@OfgIF`dLMNS9JGCuy^#1 zs$as5-@ku<4hjSv4cdyMTYrR zPdmSB#}0+%2-hF3TS6a456aLJ!W2Ob^0pbz_5CQi5XKvZEJ)4j+L|mPK!lrPC_Y&r zO+WABL%5~zr5Zkc;zd-agJ2j_O9+7yrhy1f8Sb`>oLr+#=voxqI{bnaKbah^(?E-g z3eZYmJ>ml*vkVVE5b+6Ofej3inO81SAXM=YtD50$A;9Nll>W%Yw8m%|(ZPQKhE<^8 znuBd4htvrbjASVM_{2nA^z|e;B5*V196NXjLlXSUP5)~wQU$#?PPm{rvn;67$AaR)J=sWHbV|89(X16S~e|Z*L#SbDIsn{8)P_$nDDuUVP~^U+r3GvQfVU z_62q+tT2v-&MU|vTRS_gVZ*u2pH1gdP#^?g(qT+h<>%bHN6W-5MI66N?h+c~$twQ8 z5+G0-hld4_AQq29-CYI=2A2yXx>rxOJ)(g|*)1kE0{AB)@bR-F061A-EJVtdg&~)Q zoPsT30wGHjDELHUC*X<_`B8XB(8_|#;oB;lSbELi)6{aiP|l_w!c2#;#R`OGLJwuK zzg%R@2Mu0I^cpom`_MRI#Jz4I=hPeZp 
zS!sn{F|cp2>j<>ZBcr2*_v~TDhTxGoLv;EgAR)EjZ(zp(q4i)xV|k}syLK5Tir%F? zD%c2h>d{>74-w~ZE{rts^V^F_diCbbSi;#IA1BONqUM8QJQO=Cgbh>(IN`17nhzVe zbQE{gFA?EIG%G*UX&lpkZT-N7dWY3~UNL^K$-((JgAn z4xsKI^#6GTK$H^f^t*V4aAf+SHdsVw{LCK2H=J5T)SM%Fe#w z%D8o_fW1bDqKOg)?B^=hN2EY!A+nUICG?%L{<(`iu#pawZ~^c za60foEd)wn3_R+=N`ond8+a2nNcPvUv55M5O`=o`m5EDC3k#nPbi{YEJKQJ54pg0Z zh5()coTV42HsCu6hab`8|Guae2%oQj*)+X>&yG}(_#J|#06q~74FTBE;UFc9OiVmlocw$$X=&+N4n`7u zOX6~&a6+`|rG1VK>l=y+0xn(^pR6qNo3;#}QVaYu#ekP1Jg|qHh6z6YFYANT!uTCh zZw_ze;Tgdf z5r*ENx8)v7arJ{ojLcN&Z6 zf6&bfR>TRmJ3jU(q!HG`pseK%OY!LD*4F&m+U-#DK+UV*YRK@q{1Uig)8lM6Z{EB# z*DCOIApQH|&i!g?n*{{AdaLsF5*sm~X3++exfPyQU}uFi1$}Dtf{|lGN+oozEzQ|y zSM`1Ppo0yU&x?)UQgH7cR_)UUcPzm(PELfz)SD;N>6d|oNe=gzcvuC75V0OXk0N)Q-m7nH%xl=D7Pvr^7(}`<&%uS* zv2R~>&os=HHXwZ6$~r1`F;AZ=lU?zBb}lXl>4hA|U#EQoAI`dSF)(loVr(sXA>t7M z3Qdgmn$AGT{0blPk&^2BHQF}ovc>2~0;`My-4N}qug)Ls9~d}a>jWblqAd>v*#Bv2 zAa9`{oDQ@^+|Qo?8wQPSVQ;6z5q40(;@f&tRZ#_bHm1 zObIDPmMpOv02d?kr;8zwvSb;!%@+uugCRGztZc%92Ss@96&A)~hw4=m;tCd0neTTK zlreK){g6mOK|!5%k^o?E(^gpZ6eMsyf@o*mzAc2zBQH+*$zh7F` z=)Xr`kTgYwuz9VmTnH?cP|qzGjo_7c96=cr;P0sVdY|=+fXzr|60`FS6rtt>5Feev z)K&^y5QEwKeQ9pYqU;O>_Vu2?oXbav!aGtqOW5v85Ou z!y1*xZlI>7E~~AssIG2>iS^x@izY1FxWNelS2)NJ4lM$L(thBv3uV{Hw{O(wV7d+0 zT95ct>tPb)XHXgVbmCaGKJ$E4a-_N#;h1nE->U4NnZ zhXBI-_R+*BblUwx<_*-}qXIrDE zTG8h?b6=f(^+~NG@)q5rJld?;BG=#Z@xhhVEQTY~8}fJluK6`vd8B!o^%fep2AHw$ z;@-}ns%uPRF0WQx_4}BtUTzi#QMgDuCehR8ylNF9J0(oe|{-8!otGH$jEo)?i1#hFYltKr}vs4PM`byVBnCEhEe(7-}CRgyZ7C@ zcW-s-u9>;HIhm?zy_)kSBcq!Q@thN_=`!V>(|Z{O46f}vdf}w8aax0+-W-)6pP;^_ zCC~o-`x8=AZ{9w7v9_--<@w^L&QA9~zwFE3zJ1@*!>D=pn9U)*o7v$HRhNGKs=Fh2 z$=u2+6mJ%l_R-SXv-0C}^j>yp^UZ%7Upt>r=S=9$(c`+!*SGB)otdFjS6AP|CJ|%j zzL!&ekB*Mcez|~KPu!obTpup6SNyj+L0wl@_pYH~r`(m5N&7tSnI4PLch^^!d(Fna zd?6PU6wDyy3`pP!ckSG{v#!2=tTvpIhHkvAtiHK$h+3J7!k3=luEeRdtUB zIPgv~UteEY|E290Uz$!`qFr+tcx8V`Kp?E7MBHVl{5pqRz=F7Zw?w-mpTL#n@B5mY zo86{c*$*8$RB)#u^Xk8-cus}!x+wa*sgot9uM!HJzBJzx7885_vijc#pCiH|A~m!9 z#TFJ8H9kl5k1CIpSaiJhoL

      Tz{g6V zjxtMm@R4FzLiL_#7pCv-{?M*eo8;J=uc)oPmsc&uT;?__qo7gP&*`?}baK4$eOz7)cR64KLSM@L7=d*kEcZp6n^OM8FcuCz<*%o&u?_3Uh(=2UUY)YR0H zQe8bgWo*dU{5-XsTqXaMdWluo_}if&V;){!yBORzt&GfD{%N%mvtN;6JF>E} zn8j@OY~Qg1mn$mrcwv(+(D?lM@aAT%J! z&YnH#<0IkW;lahvPcq2RA5;}8E#i`w=fIC-*)oL}E+?d=?SHaxR$uumEd?bi*m1vL z*YxsCubYQQNM4@6F7B^`Wv;PpoU(p2c5NBqSS?vOIjNa#?^4Abxda6Ekoj%?lRKxU zSLimT6vHmVSiP}wsnQ+iD*o56U$(RC-5ecCwIS>4>uQys+34xF1qB6#g@-GiKE0bv zQg!?-6`KUti4#n18FJee=JG$@6SH&sF(kFVHl3Mp`}Phpls!Z5OS6XlVmr3mR(Mxp zX6DFhdiH>LfW!NPU%q@vyL$KT-5e}2l}oWz!`{pG_U<=#kzICl>@}LN-rNx4mHPM8 zsv+*_)2HTv#?PFy!^3o5y?RyPKA~o9XQ!;8@l?0^;pX~JYI16h=7?E`=XrUPfA2Ty z&!|#TWcKy-om5rbfs60p#%}G19mu~_<|5lVJH31N?yq@2^>aEDr1s#?N){H}**Q71 z?Ci0d8}nDi<%NzNi_=b*R>V2~_U+qQUEN&YibAUf9x17NcMta-Ms4+ko7{XxDadCW zfaUu!QXP*K%X#yLb>q)+C{d9uEsM)D7hBDi+uP4d%E%B+%GA^}GCn@y`gO%htNFgd zZOQy+Y9pvQxda8N$jbeH@{@6ii6whFhtZsyxZ~>Y?;pE)b<@}0-ahB|QGWg#jC9XB zE_08a^YfF$-WS+>ypw6VYvULn-*(MA0@o`lWN+QRJvKevJKA^p^y!6_6%rcOgS@;L zqDM6*aMRGzUN0;Z(N2@Rj(V7wY)WoyZ6y`id@S(#cFt+Ig2lnX!Jslg+{xK_alD># zs49r<(W6Hx63(op&Vx5Pa@2cSP8>c=v2Wi#H*fFo&`{;#s$=?pl@3Z+DypfG|5{$A z-M6pa=ZL}-PGmuW@O~+ehomLc-}=g+YkBa7wC24Mv-AH*fCtn(d?I z;o%{g=NqqYwLSsAA|HxyCCnV6|=%euB zV`BD1Mn@|f8nW=J$A!hk?bXRl&8L*($yL|aXF#Kl!Lb!EtEwOsa}9a;`l=fkFax$o zeJNTRT3U56EMlxduWsGF8-ex^!*(K{_ihMUQ<%aZgoa?p=Du-!kx$luF`BH-!r;->t0@^?@;aB=eKtc9XXQJ`9$ZZ^*UfL z`{6^C;p(ex$&2?L|E|nUwkd2*2Be`oCJUKWbs6}~4U{mJRSVYFM>9p;xx-jkRK&Ao z%NFC<<dF4jhn;>Sg8Q=cm)u)UMK8W84 z6ELkvdHC?5u=GlGhtd-L3D@DtO9I%l@#*P?GQVZR3O^Ak_wns!RsmP8&|kT7rHbTq z*kz>Z%;eOR|37hGA0PfBM~+MmS6-2gR-~l($RQu7p%%leh*dLendr*ZQ_<9nzTF34 z6PMRMb&8vxpFeN1R>-b3jr;u#DqiBOn~m?-zMTfW)#Kl~e@2WB~%V)fF&&kP& zDlL_8_w=+1bOMq_0*<>=eQ$GOS=6j;Y?|IjP*ZS+^$iXtQ_%A#jExUq+y|Yx{r{~?^_Ms@ z4cd1V1g?2lIy<+N`RQcHa-#)JVuQr*jFh{F64ijqEeYQJf*L%Cz9oLgJyc>1v7Q^E5^2WF6GuGJeCW+YrR%dY%(8K0jwvodsY5;!a< zXmdEnwe0Uft(21>4>d-66+nUb$jQlLVtaq=Zt8x1G3w15>DS&fjDxvd-xkK+tDHIW zq9<`HJqDyC>4kWS_jL?s~;(}+Sy-ZL+QA9 
z)fP?DdVaVvCZ;=o_bK$^j*bp)ZEbC*C!Zc1)X7Ld;n5#GdX%(R@$=A7!r=vw&Unm$`EC`WHL`<0Lh>ySo!GKJCuS!$VNfwLi;8+pNS@ZEg7z@7}eW ziT|d9HDC3;qn1#Q4cd~Vfq}91<;^lR` zHA5~MWs#Mgy%mMW6-IPzoa1Yeks6^^1bKb@`0>KhQtcznyAN`6`=3hLU1HsGJ@uzw zR8*A3>cUvsu);_`@JL7JWzV0M5-c|}8=EppJ#jJR(~Cu&l;Ljj!Lo_ZS3EHqyrImnv$iAN~7Qla`jY#UYa9sH&=Z_TokLvu9~gt-Z7E z182DHkAC?0adCakvm)p(=^yYcJ0|y zSK`<^Iy>8ZelM6*A4-4}%NBv|syFJs>vduefI3x zoZ#T^%9N+r+WVB2a(>T`hDGB-C0O31dsu`@hO!tdFom!_2k_MQC1 zaWam71gOVjD*>|f9&3$xZ8*l`9v0Sq+Lf1=n}gXiOM6|eoSB}UwnB-cAh+=@QgHjN zE!iZ$Y(mT6l8`9vcORV9kA9+)spjV9_PkQ_JvH7M3c~#F&y0}Mz^ka4MVyNroSd8j z1_fj@JulubO?_l(Eb{Izu-G;|lAD`L#gPd0Gy!aCY-?+PX4yh^U&0v}w3Td>&|(40 zN6?Hw6?WhR9y)wD3yf*!-u>4As02i(r0h~uRLnCf+1=XOikGjyeJcZoob}*A`{?hB zN=jRCm|8{_Z8I54%pv(y6cw*m`2V~yr@xvl)vHlOfEVh;xb4GoPFlW~xHZ%XrD)y$K=I+LSGQJIS7~r6 zTHm06t(U)dr^Snli{JN(o&3^r-@5+hzF5=Ek%7tFmYe&I3IlR*pM~kCqfcCJ=b9a< z=43nJ0(MW%Yjs*vbK=t@&4fF5ZX_nsZr!?-nWe1c@UdfGp6O**R?6A7rbTV8Pb-jJ z)6G%avg)(Jd-N!!yu5s5Ow4vNLQMDdmG>9E@t(OcK0a>iI&wtY8P9gs*f{3Y{3PAV#@zJUSD)-)-i z;BqF)Y{F>5!op61_<)0c5$OH&DdfQe2m%2&fa=rSTt#E!{n-BZz(9kKW%c`#CMT`8 z@7y^77>SIEs)4#jH{A(E==HUI3tA$W+DUvaH=Y6{F@{zAX6HiF=g;hvi~^%xS&lQH zk)(Sy8>pCax3R4t1fM6}Npb8Sy&TBMwr$(?^!anEk9PzMUcCwfxvUh+E;|6wdvX3N z+XU2UO^S#W7Z=wx91AP#UP%wuQO(?(3&ezmfg!AV^B)_k59I6g3vcvL5zD&BfXqC> z!-vP7DQHI@;=>&+EG%4ubcE6o310U@8zt5vAt~uux6q^Va%MDBB|x8Pb#N>sInmwK z)YQuT7J0x(n_J@w%s)LmRqi{_@#)j2(Xp{DP>rQ}&s4ESI4&g6P;+Zyb#~_OeLs_Y z>sDBq>&Q4*1JMl$x=T1B-@7PqWtHmS4KAz^_zjkEkuPFiMO8I1Ik|H*0LSDhY^2)#3mZz8^F~a^X`Lk>SW4~fdQw;`inyUJb3gdd?fhq3sUZSycJ+sNl&kMSP32M zeRpr@vXaMapGcalKOGR{Jyg{2Q#F08S}2qvD7CQncYU&V*!0o4IVyDa{gQ5E&o93G zV!Y<^F(xL4hLMpRgmQi3_lR+N^3MUNT1oYTcwda11P3GI4JfFZkr7Tv+O*#ZP+fcQ z5&=OK75iVmeywrp)b;9W1@L+1uiLv-ax359I(Tp= zAgl4?$M>T(pfXN$%Fq-``J~5uVM}VlTXTD8MO-Yq`e*E~+ zQFdTc3}=>E(#^bhh2DJ8Xsmem-nE=k6|2>eL^C zvX_*UJieNzY7!U}H1_@bO>{FL=DT+muuQ{k#SaP9s`dtU$VE9C`OH-kk{ z`O*?V{>8U+OfI!;cVo6*rIvfH;+ zYu~?bg00i|z597ndzNz4qenK$)jf%ci7GHFjH?3gHJRkrw6w%x7ZanIge2_WVgZZH zYqpA_rqqVn(ZhKUYNv5Qfm%#3Q 
z$wx2dA3y41rJ|yea9HngLM+>fB&&Y))=q_ElB#OORg0fX3 zb%9;JG^@6?7BHotqB?+cZX8=vrgZN7c~z`tRC+plwUWmWFakFr@~3^@OB~x6W@cs@ zM@L6e+|!E6%BH8fIjLOi@@q08l0(2}a zO`XPYU7}Ebu(rkm1kfoiJn?giQTVqSI%BwP?Scj@FcqRjzH}>*^;@_`eDpA5BGWO` z3JsJjvn$KV#gtjt>E`9>EPiy4+U?Blf($to{TA-w44gSBwid9fmS3cU>U^vVqT_OWcd zuP;`DZ&1sxaJfzg4vqEK_N+Y4+pB;7CgWHZrEcR610iw~-?A{>me~Y8)d=d;1Sop# z@CMJO!oa{_WV~s5;=~C{8=LxW=3~c>)ipLo;?#-WI^P-iZ~flaZ{KL~vRcE`DekD~ zXyOJcm{WD%`p(;Ka8!qf@96d2kz#3SS-YH^4a?@h)s1DQj;<~~&*@g@9%cwYK2Xhx zBh;oW@_~}5-Y*U@8-IV(qjZw)-D^5~A@=9D&V!=>J3 zW(q+P`4kkYd!}LHn9W4>JchpHfzgRVzPqS43wPKW03Q?MJyyGeM^Nz0F_SlSjeg#1 z!=iPaoym40%$W{_1qC!%h;VokoyyzDP|>V`2i_b8?i<`+UPp1aTpe86sh};{EGQ=@ zr=4kGiXsU^U3PVM1MRkeTN5@D;DH2OA?Qv%X&!I_M>Cw7Q=YIHv+}BSFI-R~oUr1m zwsaXLqz2&;5j(NJq;7#93O{j33=9oxpul{83|+?|&zabaTvS?uwFXy9R8AlQrw?R` zTTrm2(GTXKA5J?jWH!Vzy3>Qlk4fECu6pWo~N2S=}ck3J|@K_EH{9AVtR4CDsP}Dj2ME1jZ}!77^($A8sGDu(u~}IM1MvIQ%Kc z?djCwIjGK`?+!Sahi$FJ9njLzZM}N+Dgz@UjK{55{lNQQZ!Rh&Wn>JGPQITXsg6KW zgI{)oX}Nv-_G>gDuoY$p%c$4Z*8EmxSpf0C*XvlJ9Mn4$eE6eB2a=MK+@Q=rqIa*X ztVjm^Iqv1{ZDDB{(${AIzVZMLDC~U);SqFVUO~aw+|RDBZr!@2*7gBieFwk}o+u^g zv+DWt`y`wP^j&psp&4`Y@N9*f27*Vz`Jv_Dh=UyjsOVzRJa?{0w>V(&G$NUcV&03P zVPPuCGX3Ie=oc2w&Qt(Cg8DB`G_3w?Q&>a|z(&mhCH)_J75wrfii+5D$aBe`Oh4%- z!in>6!FS~<(u{X7F4~LjbbwOhphaK%3VESX3jG!>^GqvQAz;ZD=DH>z5)6PT1~mwha$+;OWYJ=2(G%;b-nk5aJPSLs(qA9`fLOiz>7XcCUFhkRG?ul^VTb z2XBI~VzDC4sBv^R8YZR)xWIWYUWoWDm_t00BNTa2XmaE*e*&+Uhq~((swnX zJ7)xW-j(Tvz|Kdb6Alo3CbEmWv8BbiYbXm;7dxN?Xp?e2zA`^z=D2kNt4YvM(0IR% zKjKIfoi(Cv=FlvwzfBrlJu}$!tLTREkhqkEMm4=v6-K7f!6)6 zMygNz*A~}_K2#CIlmt(hkb5=DgcigRwm|zx+2tT8 zISQC;Wbb|P-ljvuC@3U0h-2E3^H5FgtDAGN@pMlbwv`kUMK)^4WZL-i$LiDIwuvGHLj|QPQ@ObnEy@`l#f2mLEk=Cj{$BTLPKEgI{-MT(c-_y4d@aror3Mm(_7ftMHm*_5@Hqj687rV@jokb(eN23K>n#nwlg!GNg}2iSWsQ~$G-|> z_3*#7k1aZ%VL^43L_&fm@He+@z??Uq?JN8`^jfT_sEEiriKBh)+&QRt6{ul2R`b$j zFA)RaP#i7p@crHu)%o^m7B68WpF}}GMmIpgqbz%$K3>GR*U;Xnrh(}LhHzpQKwAwl zHTc19xY@sEda0L|mSTTgG=_)$p-oGZw3;~%S4k{*KI3jx;9BizF+q3tMYWQ~{fH;W 
zCng-(O)C8sx4vt;k{wtUus0V4&Mfk6@aV=MB%@Zdf{Up#@Sy!wJ8_w7Wsdsx>VMieaKdkEpo1 zSvBkpgr>%&7?_v{Q$h0TKUr921>Q3k9zJ1b#;*!A@kzb4qn_Bd1+~BWNd6W zY!&$R#9qq=uQSW|{-{gjRbPOzfG8Y=W*3%_P=gDL;ez0R;q@Hf`QgH)p^&5`I>>W` zP-gQ1_^|hQ<05KZz_4h$jL+;z6_xEsycLWRx6`OMZeke@{(i$Sclf1Gb z(DTv^!M2=i(4SW|o_~Sai@&bd-aumDaqK}4?C-6xfCyZ*>&Nv0@Hf!A;3p&&%fdYI z6nqAs7$KFU-vXDk*Vk)pnF{YeevB^uQGH*ld#dIB31fwAT|EmA74M{`vi=yZj1J!T zBO-Tv0@`G|n3$M$gM6?NOf)Qsl<$3`5Ogt1Jc_Zx<6&bwCFp~CD0Mw>t)`=c7Fzlm z&hulfdm(U)l=bxJiN^%pvT$+Pi`Xnp;VK6_F;vbq#F9@;z9Nc*We`dPgD{j}4fMVC zw5#k@fQ=?%-ADt{;L&}wtYxg8QkJOUwP+137_MAcu)G#3%mFCg5i)B%J)3z7wFo+r z_XSp&7A*Z=olJj>Lf~4^BFrl7yzK7hG1A%IPE%D?MX*qU-GFBWtjso@4#ac`cyI9^ zzHjmQ#;MMZ@<_^Z)+ zzR6q5nvfj?VmnF&-i4tqX4m@e)2F04eF6+F3}1C@`I_QIx?~nb1_B2Yn+>sw4m&mKHhI7^N=@SpiBX4TK6|S=N zWMg7U>ciqG&`Fg`m+tFdyjTnuv#yvo38NyT@F5zig8usL%kBsB)jfN*ooDaF7W04W zYZxaG^7%2;^t+E!Avg?T*=K(EbM_ARq;vWXa8J2mSaZWR7`@aCYlW~95END# zYs(%2v*-_MeT*U+aaYnuQIV2#7w{G#_s#A_bpbIn*BlzLF-trf;X1kFU8o%+1 zg2z~NK?^Oj5w&d)^;thWJlqhz_~ z5<+$A>FE)M-d(>J>?643$jvD95HlT59Ehf*1DZzqe;Y_1-rfhmsQBTY@f|%H{k#x+ zr-IpoD5xqzek-#qU9`65yC><+vo;a0;K#^|z|+9MAmRM;3yB{-oX)7XcU(2nfSgPf z!-`fJUkh4WIv^q}e0g8Xo9`g8+VCXQKsg(ckPsm!YY?UAfdk>|^H(={NhA^xd-(bL zFG^k-K6>mJKg6kuj!yidUON3jZ5lvFm2r-PquH+?W`~0w6MZofIC`BNWDq)#|yCNC@72%PQVMFEB0a+2z zkeQg6Xx*2f!sl6EVPf_M%ib>^$PB^_C%Nv{VZBswZAHZ`sA2b&S;Ip|kG?07$YI-A zNDh|9-&R`soU8X_baIktZvw{U35gPn@U7nChj7AUaCTosM!Da+fSEQV6j%nr)WkqY zWLzAHaHKJo0UD;HsF-E4u7nmrICSV}?;f5?e(>T2Arz$l5~P(|LV^+LeO>ZF9XC8$ zbD9)AVXa5f@~)-@4c0htZ@Gp64uTVdn=nK0z=svKs0~wV`*hyK1fx&p@Px>SIT{#G zM3=;#TjBt>cXoy%fXq24?-Dy)6%-A}Y#hm))05g%Lg#UDk%7+w&5s$25 z-zjm;*1!zH>_yZ72OP6iS&tu+1DA-M0Qn)>9~kR4Y?#(P(H&?y25=$<-!_Q7O6a;^ zY8r>N7=?Bs#v`6+r?=;*Q^Qk;nClvN$ar2O7}i86gbq-jFv>w?pXTSQVOu(&8we7F zmQPINi0dwbznfcFTnBPZKnS*Xby?dBt%hQh(K)7#r;QK;uue}nULx#b78ck%OlZCh z7~x}Xjmv(@xmcY=$%$Xw3Cb=Rye zt}RW$Bo0NO<>lqoryb!CF`&wAuPGn6N`zl^4Gp}Qb?oo&M-+F`#pM|2=tp}2!2<^n z|L=f#lczRFesg>Wh7*W!meoIhJOW-Jq_F@mNJvOH2}hY5vs}0_jHmD_C~&f{un-v^ 
zG1LT|hlHsHcD0yo^V2eugR0_87*y=TGK~KC5r=33y4nicf2Jz=ysqwb=+jkT_~0NL zprZiZVv{3<{$EuLS^QcUQwEpL2JgVnL-O;F@lG%@FpMG-mGocYvu#e<0;L5-DJ&|{ zT~+c?sR-sZiUiY?chKsw-L>G6wEOq(!8o05b!1OMZLJb6I1bMQXn*kWnfs(Otl5DQN+qR=s~3b19ohw8uZNHHU2hy1U5(`yNyOZya}Yf0T&m(`Y*7G)CwN}Qi_PqF_`sJT%z0ORnx&sY|5uj zQ9x}rgi|sRBWyiO3%`Fi;L+T_wsTi8(T-jkq=6Se#5fp{V;h;!b5?#U$Fp{fpPw9r zgy=N|!5aZpq`edrHK2>5U%rs~cAR%{8Y-tn4NLmWG7sjSeWCSMi0@rlRas@g^0;F& z6dz%07nGGX-~_C(zLm$+X+0o;7;Gxl%QB)h(l9`dc^%T*3XZwclXJ@U_J=W-7ggSbBM7)(i=bq`nAIyb_~p!+$RzC1bhc#B$h<-^6?>|wOsO|-Es|c z5Zp(O?8fAZh0MNZ`yPZpwa3oneq@yv?q>N_#wW8Y%{eoAQn_bf}TIGVJaobi}RAeKThAogJt z!IO}O@+#43cVb#s^h6yClg7Fc%=WrZCFq^w)&mK1jdj1AZ%&{C*jd74Oax%N$==J9V*kye3#5Yc2!zEmYV-Us=^OlIp$gW{7 za};S$?9G?i*-Ap?=aw-!8S&@$eA}9ktZdrzLd-+58Bg1V6Kw;zGNE9nA7Q*p1@c;4 zx)dJC+_-vobjHAfBCLK#>IVoc6;F~_IC z@lF^UUje?AGuF1Y1c(yD<=0%*>1=In#e4iQqKT2=u>bLlh>3FaOkjh1k6D=uJDgTG zOf6{_8oatpE8*04t*fhxoU+ju1J{bs%M-pS4!&P=&rlMxgqZ6@8_tF~0rV(&%nZi^ zk`Z4Z*s|r<{0JwekkZ;Ve^+l#83q$ZM1P~ z@){CK1SS#&*UE2Wsj-Ivy-kEmCC>3|1G`R5%<7EJ^}q!O!%%|7hjyrb`SKyc@<1mL z$%qGZO>6K&!Ph7)!Al3o#xKe~Ek%XPGt2r3gS&kA4B z_Rv4yG+inFEkwax;<604jTcPkU0Ku}qypbt1Tm)!uwKqrE^!TkFj#Rq*wN#SZw3%% z^PM{v($7>VT>S^Gvj;}Y7#5=(17(oG2xvnv4@8VL5KRfU3sj|od<(mAgGW@94y8!U zk%C#H_Tqqt%&e@n{rxQJarS!VAZt`~f?W@)Z%FkZ7;Bx+FV&Dnb<;c#HBsiXYSJ81^CER+ds-vBBs_N<(@J(f; zeTMn?g(%nn7{yzh`p8tOXP*S%==E!i%?`$#uKc&ITzY_)`_zYhc#RHf4%R>chTCh= ztkq%jhQEFLS7$!|mCA@v5lBZ7rsDte68_xY)8n$Q(sCC!K}hZ#Gr{n^{?U{30}F{U^oEbU}3s;I&4Q`qP2Cfp5Z(vaKa6#e+B#=j3qfeS<*u+ZScnHFur2Jee77P zk2Ef$<7`mi(0v42Zvycyz(g}RSRMv(7?R3%9wl*n7es&vHK1pSbAS=Tnt=gU$g>(u zQ}qEDRKM1#2_O$^ASND=G+0=SF)rqj2;b%TMWMd_{{Ig6^U^WeU%otw*~5a;(l2jb z3YVeH?@-!RS$~pY%VffNcnS zOllav^`SzE)?b7p%hLP(C%u;cz0YXPQ0!o)3RfZAJlJuPB~Jg3nCu>K5*%D0_D=TQ@t1gEIKTz)ZFZB-JMX*9A-{PJBWsx6Ni{IhrX>3hlH4f1c$hU{9Re_=Fm6Qy{=gy z0)gD{(tdE?%sXRq+RyA?J2vud(ZkQe0*DauyW0AXX!Mk(Azm_Fx42O+kh~v+${Gce z($b3rYF!jR#sNd zrS^;Eo4n~wUG~*crGfuaPXeF#L0Vdxyp)53L(A!IpNp&Oe0+~LSj&A4jc|i$91h2WRaR26 
zE?ik!3S*MAr?_xlCg9sOVXzT@e8fuXRn7-uL;r`PlBLk%eT1{~h|hUze*P6{Y3U65 zGtZ)7gc%H`ns0Ky9Hy5069OqKFRz0gaE0xOY{kgQ%QMT#$!+iNUx8DD3s4E^H{GzU zBmj=^aI%);mG>>VjCNlluUiV?Jt$~tu|IP)<#;66;MBdny^Wrm%x|Nje*XGJoTf*n zqM|bMGumx)LW+l+`>Smt z%g2v98ACO4?^{`A7*X-KqWbR<1YNQNPVQWyX}Hv!TR|ZQaBl;g1T31)s%G%nnCLcl>%*2FV8a2}|muY+F&KU!4sglxs`dL5_|V*{R#lA$c7 zOsgB?HDe7Z_Uf7%B8aoI^U>t&dARjw&-sJR$&5jdfwD(6+v5(|^8ZAlp!0M6tQ`AwCx zy+oJw_f+CZ2^ksD6BCwGO=xleFpKtp*1Zu+_x=6-p5EThKf|Rynrq2^tAq3m4B~<3 zuM(LK2MY~fARFrIKbDnIcXoDW&@+jOiP1?(NmO_QIC6 zkzUaR6I0XBM1I2skfUYReB6?ooG=tx<-PN4#<)2~JUo>A8keFX3rK9sqT(KiZ)ddq z{N}~!s*1SVlGYJ{K;?z)Zxs(bnvY;q@GPrUf@7%xN)WBz7={~W1ELbRH7PHQ)xtrQ z$QQ9Oi>$dgW@hF}`)|xIry8$!h0`pu7EUT3FEOOPf6vXy8SM4%Y#3O3S7#>?zd=Sv zJf|8R8(VB+qwGvotqZFE(9Od`-QQn{l8OqX?43?N20TD-p$keg-X5qDevSRRf<_$F zs*%pv*f=sedWkiEvcjq<1P~S%H#bQ;VH)p^^Z7m9=t{CaQmzKTU!a#3!X9u$G(SHd z`Q}ZqYR(+1A*N2;!QLJ-J#D=nq7R32Lppw$M6Tft>>M0&D=X=8a&msPfBeYRv+NcY z9{#bpIo<0)?$Gcsr;w2Dx?tU%@mePaK7))%UfpDf;{~0Csqv#nZ&Fhk^9l5$&v+7F3ONg=PZq#-9K2jzn* zDY3#}u*k^Bcm$w$H9b9QoEg@**vz8Ijkg{-$#-U~7;q{}OiVntu>mNczN6!wRokma zrI&1N;Iw7vaJxY>nvruzQF0V}ilGvno4q)1*t=x_SWzT}TUq(Zk)*xMFOi^es+K&iLNvkcnC3o4q}lImrI}+*}aoOYJ&IhwpRyqc*FlIw~N} zQ1l^l43_5{-+uC^l!B5{RZA;!FTKL1J+`DoQ1NVUFk|qHeRON}_wUG_9<9?ed|*Ut zt5Qj6X_fCjBGHe9usz?UrK5u}R5=U#^5wzlVf#gQVRnGvR9c)e0B?9iM8vjX-}DQ> zC;|ck9bU8kg%1t&^_Tzc@pfynD&_QIlyl4LovIj{U+PV~q?q%+*%u!G4=>{4A{rWG z4z{L+mR{^^Z+8Tfkkc_Ra9h_lR8@7xvMR{(4`R=1_jYjXx1EieUjRPf^Sl0IVBlJQ zetrww{c%8mGM++78oiYQwvLIqExLeX4}A5*hph;t5&(2wuy>x3{Of2&Pq6SNFxWefAwn?ZRMOmU^z2$A!@wo0>8J!Kg*VDW3@3Wu^g~jJuPx zGchsYeyj2t6;;=5n^#1BetrhU(WW*wali(Zr^m%zTXJjCj&6PZ`ii2WVl4vnTYTR# zMkXRXojJ2ATg;Z&AVU^Jh(e*RnVXxJTRJ)l(6Y!vsHmte?_+IUK8W_kL@Nmkc`~T> zsD=D@7GX+-moPdqHAVgS@nfg=;$%rWyr)SXsy$sm@^h=J=>?N05p~SWF=UXZNMy|Y zP_nf%rrfQsM}Pod17tlLtF-Ou>+1pzYJPR~@af=%4iH}wFE1}(hc1OYMbrk{8A|Tq ztj3U_ksGYXr{?1$2U2NdY#a=bnu9VetEkX;`BFHXmi2Am=_0$lji`tS7G(Eldtl($ z_;@GqkcoWy3-M+tfatcSKf7*C$+b2lu>$5`Vq&VSskyx4jQ1i9@M?PjKm5e7@jz!@ zj8^0{HZnq?P|$&vxFXUmuNo 
+ +![](../img/prefix-sum.png) ### Other operations From 6b3df0447d0d22ba55d4ad7496f7a3965b7f8ff2 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Sat, 7 May 2022 00:46:05 +0300 Subject: [PATCH 084/173] typo --- content/english/hpc/architecture/functions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/english/hpc/architecture/functions.md
b/content/english/hpc/architecture/functions.md index f7a74cc6..02614f94 100644 --- a/content/english/hpc/architecture/functions.md +++ b/content/english/hpc/architecture/functions.md @@ -16,7 +16,7 @@ Both of these concerns can be solved by having a dedicated location in memory wh The hardware stack works the same way software stacks do and is similarly implemented as just two pointers: - The *base pointer* marks the start of the stack and is conventionally stored in `rbp`. -- The *stack pointer* marks the last element on the stack and is conventionally stored in `rsp`. +- The *stack pointer* marks the last element of the stack and is conventionally stored in `rsp`. When you need to call a function, you push all your local variables onto the stack (which you can also do in other circumstances, e. g. when you run out of registers), push the current instruction pointer, and then jump to the beginning of the function. When exiting from a function, you look at the pointer stored on top of the stack, jump there, and then carefully read all the variables stored on the stack back into their registers. From 2808896da9ff9168952ef3e4b5725dd172dd400f Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Sat, 7 May 2022 00:46:13 +0300 Subject: [PATCH 085/173] extra space --- content/english/hpc/algorithms/argmin.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/english/hpc/algorithms/argmin.md b/content/english/hpc/algorithms/argmin.md index 0a9531c1..2089d083 100644 --- a/content/english/hpc/algorithms/argmin.md +++ b/content/english/hpc/algorithms/argmin.md @@ -164,7 +164,7 @@ int argmin(int *a, int n) { The compiler [optimized the machine code layout](/hpc/architecture/layout), and the CPU is now able to execute the loop at around 2 GFLOPS — a slight but sizeable improvement from 1.5 GFLOPS of the non-hinted loop. 
-Here is the idea: if we are only updating the minimum a dozen or so times during the entire computation, we can ditch all the vector-blending and index updating and just maintain the minimum and regularly check if it has changed. Inside this check, we can use however slow method of updating the argmin we want because it will only be called a few times. +Here is the idea: if we are only updating the minimum a dozen or so times during the entire computation, we can ditch all the vector-blending and index updating and just maintain the minimum and regularly check if it has changed. Inside this check, we can use however slow method of updating the argmin we want because it will only be called a few times. To implement it with SIMD, all we need to do on each iteration is a vector load, a comparison, and a test-if-zero: From ece7674101f421484943c6df14c142e30059abde Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Sat, 7 May 2022 03:29:13 +0300 Subject: [PATCH 086/173] typo --- content/english/hpc/pipelining/branchless.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/english/hpc/pipelining/branchless.md b/content/english/hpc/pipelining/branchless.md index f84627b5..280498b1 100644 --- a/content/english/hpc/pipelining/branchless.md +++ b/content/english/hpc/pipelining/branchless.md @@ -41,7 +41,7 @@ sar ebx, 31 ; t >>= 31 imul eax, ebx ; x *= t ``` -Another, more complicated way to implement this whole sequence is to convert this sign bit into a mask and then use bitwise `and` instead of multiplication: `((a[i] - 50) >> 31 - 1) & a`. This makes the whole sequence one cycle faster, considering that, unlike other instructions, `imul` takes 3 cycles: +Another, more complicated way to implement this whole sequence is to convert this sign bit into a mask and then use bitwise `and` instead of multiplication: `((a[i] - 50) >> 31 - 1) & a[i]`. 
This makes the whole sequence one cycle faster, considering that, unlike other instructions, `imul` takes 3 cycles: ```nasm mov ebx, eax ; t = x From e6d9601a8dcb8a41d6776f57cde82e0abfe22732 Mon Sep 17 00:00:00 2001 From: yatancuyu <45235844+yatancuyu@users.noreply.github.com> Date: Wed, 11 May 2022 15:40:33 +0300 Subject: [PATCH 087/173] Add missing return value --- content/russian/cs/graph-traversals/cycle.md | 1 + 1 file changed, 1 insertion(+) diff --git a/content/russian/cs/graph-traversals/cycle.md b/content/russian/cs/graph-traversals/cycle.md index 5347e9cd..7a274da1 100644 --- a/content/russian/cs/graph-traversals/cycle.md +++ b/content/russian/cs/graph-traversals/cycle.md @@ -60,6 +60,7 @@ int dfs(int v, int p = -1) { } } } + return -1; } ``` From 912a24172441950b850629814c0893ebea8c6915 Mon Sep 17 00:00:00 2001 From: yatancuyu <45235844+yatancuyu@users.noreply.github.com> Date: Wed, 11 May 2022 15:45:14 +0300 Subject: [PATCH 088/173] Prevent infinite loop --- content/russian/cs/graph-traversals/connectivity.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/russian/cs/graph-traversals/connectivity.md b/content/russian/cs/graph-traversals/connectivity.md index 45ceec28..17628308 100644 --- a/content/russian/cs/graph-traversals/connectivity.md +++ b/content/russian/cs/graph-traversals/connectivity.md @@ -31,7 +31,7 @@ void dfs(int v, int num) { int num = 0; for (int v = 0; v < n; v++) if (!component[v]) - dfs(v, num++); + dfs(v, ++num); ``` After this, the variable `num` will hold the number of connected components, and the array `component` will hold the component number of each vertex, which can be used, for example, to quickly check whether a path exists between a given pair of vertices.
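Reviewer note on patch 088: a small self-contained sketch of why the pre-increment matters. It assumes, as the loop in the hunk suggests, that `component[v] == 0` means "not visited yet". The `dfs` body, the adjacency list `g`, and the helper `count_components` are hypothetical and written only for illustration; they are not taken from the article. Under that assumption, `num++` would label the first component 0, which is indistinguishable from "unvisited", so its vertices would be entered again and again. `++num` starts the labels at 1 and avoids this.

```cpp
#include <vector>

// Hypothetical setup (not from the article): adjacency lists and a
// component array in which 0 means "not visited yet".
std::vector<std::vector<int>> g;
std::vector<int> component;

// Assumed shape of the traversal: label v, then visit unlabeled neighbors.
void dfs(int v, int num) {
    component[v] = num;
    for (int u : g[v])
        if (!component[u])
            dfs(u, num);
}

// Labels every vertex with a component number in 1..k and returns k.
int count_components(int n) {
    component.assign(n, 0);
    int num = 0;
    for (int v = 0; v < n; v++)
        if (!component[v])
            dfs(v, ++num); // pre-increment: the first label is 1, never 0
    return num;
}
```

With a five-vertex graph made of the path 0-1-2 and the edge 3-4, `count_components(5)` labels the first three vertices 1 and the last two 2, so comparing `component[a] == component[b]` answers the reachability question.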
From 63526ca0348b0abd5359d53cd91f24c92ddbf654 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Fri, 13 May 2022 16:59:54 +0300 Subject: [PATCH 089/173] slides theme --- assets/slides.sass | 50 +++ config.yaml | 6 +- content/english/hpc/slides/01-intro/_index.md | 297 ++++++++++++++++++ content/english/hpc/slides/_index.md | 10 + 4 files changed, 360 insertions(+), 3 deletions(-) create mode 100644 content/english/hpc/slides/01-intro/_index.md create mode 100644 content/english/hpc/slides/_index.md diff --git a/assets/slides.sass b/assets/slides.sass index e69de29b..671ababe 100644 --- a/assets/slides.sass +++ b/assets/slides.sass @@ -0,0 +1,50 @@ +$font-text: 'Source Sans', serif !default +$font-code: 'Inconsolata', monospace !default +$font-headings: 'Garamond', serif !default + +$borders: 1px solid #eaecef !default + +/* fonts */ +@font-face + font-family: 'CMU' + src: url(fonts/cmu.woff2) + +@font-face + font-family: 'Merriweather' + src: url(fonts/merriweather.woff2) + +@font-face + font-family: 'Inconsolata' + src: url(fonts/inconsolata.woff2) + +@font-face + font-family: 'Garamond' + src: url(fonts/garamond.woff2) + +@font-face + font-family: "Open Sans" + src: url(fonts/opensans.woff2) + +@font-face + font-family: "Source Sans" + src: url(fonts/sourcesans.ttf) + +@font-face + font-family: "Crimson" + src: url(fonts/crimson.ttf) + +body + font-family: $font-text + font-size: 24px + +h1 + font-size: 2em + text-align: center + margin-top: 0 + margin-bottom: 20px + +h2 + font-size: 1.5em + +h3 + font-size: 1.25em diff --git a/config.yaml b/config.yaml index 8fb26a1c..1f196de4 100644 --- a/config.yaml +++ b/config.yaml @@ -42,8 +42,8 @@ languages: params: repo: "https://github.com/algorithmica-org/algorithmica" reveal_hugo: - theme: white + #theme: white slide_number: true transition: none - #custom_theme: "slides.sass" - #custom_theme_compile: true + custom_theme: "slides.sass" + custom_theme_compile: true diff --git 
a/content/english/hpc/slides/01-intro/_index.md b/content/english/hpc/slides/01-intro/_index.md new file mode 100644 index 00000000..492ceb6a --- /dev/null +++ b/content/english/hpc/slides/01-intro/_index.md @@ -0,0 +1,297 @@ +--- +title: Why Go Beyond Big O? +outputs: [Reveal] +--- + +# Performance Engineering + +Sergey Slotin + +$x + y$ + +May 7, 2022 + +--- + +### About me + +- Former [competitive programmer](https://codeforces.com/profile/sslotin) +- Created [Algorithmica.org](https://ru.algorithmica.org/cs) and "co-founded" [Tinkoff Generation](https://algocode.ru/) +- Wrote [Algorithms for Modern Hardware](https://en.algorithmica.org/hpc/), on which these lectures are based +- Twitter: [@sergey_slotin](https://twitter.com/sergey_slotin); Telegram: [@bydlokoder](https://t.me/bydlokoder); anywhere else: @sslotin + +---- + +### About this mini-course + +- Low-level algorithm optimization +- Two days, six lectures +- **Day 1:** CPU architecture & assembly, pipelining, SIMD programming +- **Day 2:** CPU caches & memory, binary search, tree data structures +- Prerequisites: CS 102, C/C++ +- No assignments, but you are encouraged to reproduce case studies: https://github.com/sslotin/amh-code + +--- + +## Lecture 0: Why Go Beyond Big O + +*(AMH chapter 1)* + +--- + +## The RAM Model of Computation + +- There is a set of *elementary operations* (read, write, add, multiply, divide) +- Each operation is executed sequentially and has some constant *cost* +- Running time ≈ sum of all elementary operations weighted by their costs + +---- + +![](https://en.algorithmica.org/hpc/complexity/img/cpu.png =400x) + +- The “elementary operations” of a CPU are called *instructions* +- Their “costs” are called *latencies* (measured in cycles) +- Instructions modify the state of the CPU stored in a number of *registers* +- To convert to real time, sum up all latencies of executed instructions and divide by the *clock frequency* (the number of cycles a particular CPU does per second) +-
Clock speed is volatile, so counting cycles is more useful for analytical purposes + +---- + +![](https://external-preview.redd.it/6PIp0RLbdWFGFUOT6tFuufpMlplgWdnXWOmjuqkpMMU.jpg?auto=webp&s=9bed495f3dbb994d7cdda33cc114aba1cebd30e2 =400x) + +http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/ + +---- + +### Asymptotic complexity + +![](https://en.algorithmica.org/hpc/complexity/img/complexity.jpg =400x) + +For sufficiently large $n$, we only care about asymptotic complexity: $O(n) = O(1000 \cdot n)$ + +$\implies$ The costs of basic ops don't matter since they don't affect complexity + +But can we handle "sufficiently large" $n$? + +--- + +When complexity theory was developed, computers were different + +![](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4e/Eniac.jpg/640px-Eniac.jpg =500x) + +Bulky, costly, and fundamentally slow (due to speed of light) + +---- + +![](https://researchresearch-news-wordpress-media-live.s3.eu-west-1.amazonaws.com/2022/02/microchip_fingertip-738x443.jpg =500x) + +Micro-scale circuits allow signals to propagate faster + +---- + + + +

+ +---- + +The development of microchips and photolithography enabled: + +- higher clock rates +- the ability to scale the production +- **much** lower material and power usage (= lower cost) + +---- + +![](https://upload.wikimedia.org/wikipedia/commons/4/49/MOS_6502AD_4585_top.jpg =500x) + +MOS Technology 6502 (1975), Atari 2600 (1977), Apple II (1977), Commodore 64 (1982) + +---- + +Also a clear path to improvement: just make lenses stronger and chips smaller + +**Moore’s law:** transistor count doubles every two years. + +---- + +**Dennard scaling:** reducing die dimensions by 30% + +- doubles the transistor density ($0.7^2 \approx 0.5$) +- increases the clock speed by 40% ($\frac{1}{0.7} \approx 1.4$) +- leaves the overall *power density* the same + (we have a mechanical limit on how much heat can be dissipated) + +$\implies$ Each new "generation" should have roughly the same total cost, but 40% higher clock and twice as many transistors + +(which can be used e. g. to add new instructions or increase the word size) + +---- + +Around 2005, Dennard scaling stopped — due to *leakage* issues: + +- transistors became very small +- $\implies$ their magnetic fields started to interfere with the neighboring circuitry +- $\implies$ unnecessary heating and occasional bit flipping +- $\implies$ have to increase voltage to fix it +- $\implies$ have to reduce clock frequency to balance off power consumption + +---- + +![](https://en.algorithmica.org/hpc/complexity/img/dennard.ppm =600x) + +A limit on the clock speed + +--- + +Clock rates have plateaued, but we still have more transistors to use: + +- **Pipelining:** overlapping the execution of sequential instructions to keep different parts of the CPU busy +- **Out-of-order execution:** no waiting for the previous instructions to complete +- **Superscalar processing:** adding duplicates of execution units +- **Caching:** adding layers of faster memory on the chip to speed up RAM access +- **SIMD:** adding instructions
that handle a block of 128, 256, or 512 bits of data +- **Parallel computing:** adding multiple identical cores on a chip +- **Distributed computing:** multiple chips in a motherboard or multiple computers +- **FPGAs** and **ASICs:** using custom hardware to solve a specific problem + +---- + +![](https://en.algorithmica.org/hpc/complexity/img/die-shot.jpg =500x) + +For modern computers, the “let’s count all operations” approach for predicting algorithm performance is off by several orders of magnitude + +--- + +### Matrix multiplication + +```python +import random + +n = 1024 + +a = [[random.random() + for row in range(n)] + for col in range(n)] + +b = [[random.random() + for row in range(n)] + for col in range(n)] + +c = [[0 + for row in range(n)] + for col in range(n)] + +for i in range(n): + for j in range(n): + for k in range(n): + c[i][j] += a[i][k] * b[k][j] +``` + +630 seconds or 10.5 minutes to multiply two $1024 \times 1024$ matrices in plain Python + +~880 cycles per multiplication + +---- + +```java +import java.util.Random; + +public class Matmul { + static int n = 1024; + static double[][] a = new double[n][n]; + static double[][] b = new double[n][n]; + static double[][] c = new double[n][n]; + + public static void main(String[] args) { + Random rand = new Random(); + + for (int i = 0; i < n; i++) { + for (int j = 0; j < n; j++) { + a[i][j] = rand.nextDouble(); + b[i][j] = rand.nextDouble(); + c[i][j] = 0; + } + } + + for (int i = 0; i < n; i++) + for (int j = 0; j < n; j++) + for (int k = 0; k < n; k++) + c[i][j] += a[i][k] * b[k][j]; + } +} +``` + +Java needs 10 seconds, 63 times faster + +~13 cycles per multiplication + +---- + +```c +#include <stdlib.h> + +#define n 1024 +double a[n][n], b[n][n], c[n][n]; + +int main() { + for (int i = 0; i < n; i++) { + for (int j = 0; j < n; j++) { + a[i][j] = (double) rand() / RAND_MAX; + b[i][j] = (double) rand() / RAND_MAX; + } + } + + for (int i = 0; i < n; i++) + for (int j = 0; j < n; j++) + for (int k = 0; k < n; k++) + c[i][j] += a[i][k] * b[k][j]; + + return 0; +} +```
+ +`GCC -O3` needs 9 seconds, but if we include `-march=native` and `-ffast-math`, the compiler vectorizes the code, and it drops down to 0.6s. + +---- + +```python +import time +import numpy as np + +n = 1024 + +a = np.random.rand(n, n) +b = np.random.rand(n, n) + +start = time.time() + +c = np.dot(a, b) + +duration = time.time() - start +print(duration) +``` + +BLAS needs ~0.12 seconds +(~5x over auto-vectorized C and ~5250x over plain Python) diff --git a/content/english/hpc/slides/_index.md b/content/english/hpc/slides/_index.md new file mode 100644 index 00000000..794e67a6 --- /dev/null +++ b/content/english/hpc/slides/_index.md @@ -0,0 +1,10 @@ +--- +title: Slides +ignoreIndexing: true +weight: 1000 +draft: true +--- + +This is an attempt to make a university course out of the book. + +Work in progress. From 498100cf79e8d6d512edcedbb09e274f40030d38 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Fri, 13 May 2022 19:00:33 +0300 Subject: [PATCH 090/173] typos --- content/english/hpc/external-memory/locality.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/english/hpc/external-memory/locality.md b/content/english/hpc/external-memory/locality.md index 569d9437..eca83766 100644 --- a/content/english/hpc/external-memory/locality.md +++ b/content/english/hpc/external-memory/locality.md @@ -23,7 +23,7 @@ In this article, we continue designing algorithms for the external memory model In this context, we can talk about the degree of cache reuse primarily in two ways: -- *Temporal locality* refers to the repeated access of the same data within a relatively small time duration, such that the data likely remains cached between the requests. +- *Temporal locality* refers to the repeated access of the same data within a relatively small time period, such that the data likely remains cached between the requests. 
- *Spatial locality* refers to the use of elements relatively close to each other in terms of their memory locations, such that they are likely fetched in the same memory block. In other words, temporal locality is when it is likely that this same memory location will soon be requested again, while spatial locality is when it is likely that a nearby location will be requested right after. @@ -136,7 +136,7 @@ $$ t[k][i] = \min(t[k-1][i], t[k-1][i+2^{k-1}]) $$ -Now, there are two design choices to make: whether the log-size $k$ should be the first or the second dimension, and whether to iterate over $k$ and then $i$ or the other way around. This means that there are of $2×2=4$ ways to build it, and here is the optimal one: +Now, there are two design choices to make: whether the log-size $k$ should be the first or the second dimension, and whether to iterate over $k$ and then $i$ or the other way around. This means that there are $2×2=4$ ways to build it, and here is the optimal one: ```cpp int mn[logn][maxn]; From 457960740ed133df92f23d0002176a66c8abd923 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Fri, 13 May 2022 19:48:07 +0300 Subject: [PATCH 091/173] "great-grandfather" --- content/english/hpc/data-structures/binary-search.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md index 56f1609a..d9a3dcf6 100644 --- a/content/english/hpc/data-structures/binary-search.md +++ b/content/english/hpc/data-structures/binary-search.md @@ -175,7 +175,7 @@ With prefetching, the performance on large arrays becomes roughly the same: ![](../img/search-branchless-prefetch.svg) -The graph still grows faster as the branchy version also prefetches "grandchildren," "grand-grandchildren," and so on — although the usefulness of each new speculative read diminishes exponentially as the prediction is less and less likely to be correct. 
+The graph still grows faster as the branchy version also prefetches "grandchildren," "great-grandchildren," and so on — although the usefulness of each new speculative read diminishes exponentially as the prediction is less and less likely to be correct. In the branchless version, we could also fetch ahead by more than one layer, but the number of fetches we'd need also grows exponentially. Instead, we will try a different approach to optimize memory operations. @@ -359,9 +359,9 @@ This observation extends to the grand-children of node $k$ — they are also sto \end{aligned} --> -Their cache line can also be fetched with one instruction. Interesting… what if we continue this, and instead of fetching direct children, we fetch ahead as many descendants as we can cramp into one cache line? That would be $\frac{64}{4} = 16$ elements, our grand-grand-grandchildren with indices from $16k$ to $(16k + 15)$. +Their cache line can also be fetched with one instruction. Interesting… what if we continue this, and instead of fetching direct children, we fetch ahead as many descendants as we can cram into one cache line? That would be $\frac{64}{4} = 16$ elements, our great-great-grandchildren with indices from $16k$ to $(16k + 15)$. -Now, if we prefetch just one of these 16 elements, we will probably only get some but not all of them, as they may cross a cache line boundary. We can prefetch the first *and* the last element, but to get away with just one memory request, we need to notice that the index of the first element, $16k$, is divisible by $16$, so its memory address will be the base address of the array plus something divisible by $16 \cdot 4 = 64$, the cache line size. If the array were to begin on a cache line, then these $16$ grand-gran-grandchildren elements will be guaranteed to be on a single cache line, which is just what we needed.
+Now, if we prefetch just one of these 16 elements, we will probably only get some but not all of them, as they may cross a cache line boundary. We can prefetch the first *and* the last element, but to get away with just one memory request, we need to notice that the index of the first element, $16k$, is divisible by $16$, so its memory address will be the base address of the array plus something divisible by $16 \cdot 4 = 64$, the cache line size. If the array were to begin on a cache line, then these $16$ great-great-grandchildren elements will be guaranteed to be on a single cache line, which is just what we needed. Therefore, we only need to [align](/hpc/cpu-cache/alignment) the array: From 01f16643633967f4b2ac68f889316263e5239b8e Mon Sep 17 00:00:00 2001 From: hectonit <48787141+hectonit@users.noreply.github.com> Date: Sun, 15 May 2022 19:57:02 +0300 Subject: [PATCH 092/173] Update fenwick.md Wrong variable naming --- content/russian/cs/range-queries/fenwick.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/russian/cs/range-queries/fenwick.md b/content/russian/cs/range-queries/fenwick.md index f07a1ed4..9e37fc8d 100644 --- a/content/russian/cs/range-queries/fenwick.md +++ b/content/russian/cs/range-queries/fenwick.md @@ -84,7 +84,7 @@ int sum (int r1, int r2) { int res = 0; for (int i = r1; i > 0; i -= i & -i) for (int j = r2; j > 0; j -= j & -j) - ans += t[i][j]; + res += t[i][j]; return res; } ``` From 0339dbbd098c1cfd2943d443cbdc0b78ee6849f6 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Sun, 15 May 2022 22:39:35 +0300 Subject: [PATCH 093/173] elaborate on b-tree insert performance --- content/english/hpc/data-structures/b-tree.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/english/hpc/data-structures/b-tree.md b/content/english/hpc/data-structures/b-tree.md index 122e1c8e..d69a814e 100644 --- a/content/english/hpc/data-structures/b-tree.md +++ b/content/english/hpc/data-structures/b-tree.md @@ 
-305,7 +305,7 @@ The relative speedup varies with the structure size — 7-18x/3-8x over STL and ![](../img/btree-relative.svg) -Insertions are only 1.5-2 faster than for `absl::btree`, which uses scalar code to do everything. I don't know (yet) why insertions are *that* slow, but I guess it has something to do with data dependencies between queries. +Insertions are only 1.5-2x faster than for `absl::btree`, which uses scalar code to do everything. My best guess is that insertions are *that* slow because of a data dependency: since the tree nodes may change, the CPU can't start processing the next query before the previous one finishes (the [true latency](../s-tree/#comparison-with-stdlower_bound) of both queries is roughly equal and ~3x the reciprocal throughput of `lower_bound`). ![](../img/btree-absl.svg) From eefefe42b7db3cdb8dc5ab74cbfc864e08ad0dff Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Mon, 16 May 2022 08:55:12 +0300 Subject: [PATCH 094/173] elaborate on why ctz of negative diff works --- content/english/hpc/algorithms/gcd.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/english/hpc/algorithms/gcd.md b/content/english/hpc/algorithms/gcd.md index b9e9007a..63efdec9 100644 --- a/content/english/hpc/algorithms/gcd.md +++ b/content/english/hpc/algorithms/gcd.md @@ -207,7 +207,7 @@ Let's draw the dependency graph of this loop: Modern processors can execute many instructions in parallel, essentially meaning that the true "cost" of this computation is roughly the sum of latencies on its critical path. In this case, it is the total latency of `diff`, `abs`, `ctz`, and `shift`. -We can decrease this latency using the fact that we can actually calculate `ctz` using just `diff = a - b`, because a negative number divisible by $2^k$ still has $k$ zeros at the end.
This lets us not wait for `max(diff, -diff)` to be computed first, resulting in a shorter graph like this: +We can decrease this latency using the fact that we can actually calculate `ctz` using just `diff = a - b`, because a [negative number](../hpc/arithmetic/integer/#signed-integers) divisible by $2^k$ still has $k$ zeros at the end of its binary representation. This lets us not wait for `max(diff, -diff)` to be computed first, resulting in a shorter graph like this: + +**Definition.** The *representative* $\bar x$ of a number $x$ in the Montgomery space is defined as $$ \bar{x} = x \cdot r \bmod n $$ -Note that the transformation is actually such a multiplication that we want to optimize, so it is still an expensive operation. However, we will only need to transform a number into the space once, perform as many operations as we want efficiently in that space and at the end transform the final result back, which should be profitable if we are doing lots of operations modulo $n$. +Computing this transformation involves a multiplication and a modulo — an expensive operation that we wanted to optimize away in the first place — which is why we don't use this method for general modular multiplication but only for long sequences of operations where transforming numbers to and from the Montgomery space is worth it. + + -Inside the Montgomery space addition, substraction and checking for equality is performed as usual ($x \cdot r + y \cdot r \equiv (x + y) \cdot r \bmod n$). However, this is not the case for multiplication. Denoting multiplication in Montgomery space as $*$ and normal multiplication as $\cdot$, we expect the result to be: +Inside the Montgomery space, addition, subtraction, and checking for equality are performed as usual: + +$$ +x \cdot r + y \cdot r \equiv (x + y) \cdot r \bmod n +$$ + +However, this is not the case for multiplication.
Denoting multiplication in the Montgomery space as $*$ and the "normal" multiplication as $\cdot$, we expect the result to be: $$ \bar{x} * \bar{y} = \overline{x \cdot y} = (x \cdot y) \cdot r \bmod n $$ -But the normal multiplication will give us: +But the normal multiplication in the Montgomery space yields: $$ \bar{x} \cdot \bar{y} = (x \cdot y) \cdot r \cdot r \bmod n $$ -Therefore the multiplication in the Montgomery space is defined as +Therefore, the multiplication in the Montgomery space is defined as $$ \bar{x} * \bar{y} = \bar{x} \cdot \bar{y} \cdot r^{-1} \bmod n $$ -This means that whenever we multiply two numbers, after the multiplication we need to *reduce* them. Therefore, we need to have an efficient way of calculating $x \cdot r^{-1} \bmod n$. +This means that, after we normally multiply two numbers in the Montgomery space, we need to *reduce* the result by multiplying it by $r^{-1}$ and taking the modulo — and there is an efficient way to do this particular operation. ### Montgomery reduction -Assume that $r=2^{64}$, the modulo $n$ is 64-bit and the number $x$ we need to reduce (multiply by $r^{-1}$) is 128-bit (the product of two 64-bit numbers). +Assume that $r=2^{32}$, the modulo $n$ is 32-bit, and the number $x$ we need to reduce (multiply by $r^{-1}$ and take it modulo $n$) is the 64-bit product of two 32-bit numbers. -Because $\gcd(n, r) = 1$, we know that there are two numbers $r^{-1}$ and $n'$ in the $[0, n)$ range such that +By definition, $\gcd(n, r) = 1$, so we know that there are two numbers $r^{-1}$ and $n'$ in the $[0, n)$ range such that $$ r \cdot r^{-1} + n \cdot n' = 1 $$ and both $r^{-1}$ and $n'$ can be computed using the [extended Euclidean algorithm](../euclid-extended).
-Using this identity we can express $r \cdot r^{-1}$ as $(-n \cdot n' + 1)$ and write $x \cdot r^{-1}$ as +Using this identity, we can express $r \cdot r^{-1}$ as $(-n \cdot n' + 1)$ and write $x \cdot r^{-1}$ as $$ \begin{aligned} @@ -75,7 +122,13 @@ def reduce(x): return a ``` -Since $x < n \cdot n < r \cdot n$ (as $x$ is a product of multiplicatio) and $q \cdot n < r \cdot n$, we know that $-n < (x - q \cdot n) / r < n$. Therefore the final modulo operation can be implemented using a single bound check and addition. +Since $x < n \cdot n < r \cdot n$ and $q \cdot n < r \cdot n$, we know that + +$$ +-n < (x - q \cdot n) / r < n +$$ + +Therefore, the final modulo operation can be implemented using a single bound check and addition. Here is an equivalent C implementation for 64-bit integers: @@ -138,39 +191,86 @@ Transforming a number into the space is just a multiplication inside the space o ### Complete Implementation ```c++ +typedef __uint32_t u32; +typedef __uint64_t u64; + struct montgomery { - u64 n, nr; + u32 n, nr; - montgomery(u64 n) : n(n) { - nr = 1; + constexpr montgomery(u32 n) : n(n), nr(1) { for (int i = 0; i < 6; i++) nr *= 2 - n * nr; } - u64 reduce(u128 x) { - u64 q = u64(x) * nr; - u64 m = ((u128) q * n) >> 64; - u64 xhi = (x >> 64); - //cout << u64(x>>64) << " " << u64(x) << " " << q << endl; - //cout << u64(m>>64) << " " << u64(m) << endl; - //exit(0); - if (xhi >= m) - return (xhi - m); - else - return (xhi - m) + n; + u32 reduce(u64 x) const { + u32 q = u32(x) * nr; + u32 m = ((u64) q * n) >> 32; + u32 xhi = (x >> 32); + return xhi + n - m; + + // if you need + // u32 t = xhi - m; + // return xhi >= m ? 
t : t + n; } - u64 mult(u64 x, u64 y) { - return reduce((u128) x * y); + u32 multiply(u32 x, u32 y) const { + return reduce((u64) x * y); } - u64 transform(u64 x) { - return (u128(x) << 64) % n; + u32 transform(u32 x) const { + return (u64(x) << 32) % n; } }; ``` ```c++ montgomery m(n); -m.transform(x); -``` \ No newline at end of file + +a = m.transform(a); +b = m.transform(b); +c = m.multiply(a, b); +c = m.reduce(c); +``` + +```c++ +int inverse(int _a) { + u32 a = space.transform(_a); + u32 r = space.transform(1); + + int n = M - 2; + while (n) { + if (n & 1) + r = space.multiply(r, a); + a = space.multiply(a, a); + n >>= 1; + } + + return space.reduce(r); +} +``` + +SIMD + +166.79 ns + +207.04 ns + +```c++ +constexpr montgomery space(M); + +int inverse(int _a) { + u64 a = space.transform(_a); + u64 r = space.transform(1); + + #pragma GCC unroll(30) + for (int l = 0; l < 30; l++) { + if ( (M - 2) >> l & 1 ) + r = space.multiply(r, a); + a = space.multiply(a, a); + } + + return space.reduce(r); +} +``` + +**Exercise.** Implement efficient *modular* [matrix multiplication](/hpc/algorithms/matmul). From acfb5c857b2bf915adf6f28c17bc2d2ba5adef91 Mon Sep 17 00:00:00 2001 From: Project Nayuki Date: Wed, 18 May 2022 05:40:26 +0000 Subject: [PATCH 097/173] Improved spelling and word choice.
--- content/english/hpc/_index.md | 4 ++-- content/english/hpc/architecture/assembly.md | 6 +++--- content/english/hpc/architecture/functions.md | 8 ++++---- content/english/hpc/architecture/isa.md | 2 +- content/english/hpc/architecture/layout.md | 6 +++--- content/english/hpc/architecture/loops.md | 4 ++-- content/english/hpc/arithmetic/division.md | 2 +- content/english/hpc/arithmetic/float.md | 2 +- content/english/hpc/compilation/_index.md | 2 +- content/english/hpc/complexity/_index.md | 2 +- content/english/hpc/complexity/hardware.md | 10 +++++----- content/english/hpc/complexity/languages.md | 4 ++-- content/english/hpc/external-memory/sorting.md | 2 +- content/english/hpc/pipelining/_index.md | 8 ++++---- content/english/hpc/pipelining/branchless.md | 4 ++-- content/english/hpc/pipelining/hazards.md | 4 ++-- content/english/hpc/pipelining/tables.md | 4 ++-- content/english/hpc/pipelining/throughput.md | 4 ++-- 18 files changed, 39 insertions(+), 39 deletions(-) diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md index 92d0cd91..942c9f6a 100644 --- a/content/english/hpc/_index.md +++ b/content/english/hpc/_index.md @@ -33,7 +33,7 @@ A "release" for an open-source book like this essentially means: - mostly freezing the table of contents (except for the case studies), - doing one final round of heavy copyediting (hopefully, with the help of a professional editor — I still haven’t figured out how commas work in English), - drawing illustrations (I stole a lot of those that are currently displayed), -- making a print-optimized pdf and figuring out the best way to distribute it. +- making a print-optimized PDF and figuring out the best way to distribute it. After that, I will mostly be fixing errors and only doing some minor edits reflecting the changes in technology or new algorithm advancements. 
The e-book/printed editions will most likely be sold on a "pay what you want" basis, and in any case, the web version will always be fully available online. @@ -51,7 +51,7 @@ However, as the book is still evolving, it is probably not the best idea to star There are two highly impactful textbooks on which most computer science courses are built. Both are undoubtedly outstanding, but [one of them](https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming) is 50 years old, and [the other](https://en.wikipedia.org/wiki/Introduction_to_Algorithms) is 30 years old, and [computers have changed a lot](/hpc/complexity/hardware) since then. Asymptotic complexity is not the sole deciding factor anymore. In modern practical algorithm design, you choose the approach that makes better use of different types of parallelism available in the hardware over the one that theoretically does fewer raw operations on galaxy-scale inputs. -And yet, the computer science curricula in most colleges completely ignore this shift. Although there are some great courses that aim to correct that — such as "[Performance Engineering of Software Systems](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/)" from MIT, "[Programming Parallel Computers](https://ppc.cs.aalto.fi/)" from Aalto University, and some non-academic ones like Denis Bakhvalov's "[Performance Ninja](https://github.com/dendibakh/perf-ninja)" — most computer science graduates still treat the hardware like something from the 90s. +And yet, the computer science curricula in most colleges completely ignore this shift. 
Although there are some great courses that aim to correct that — such as "[Performance Engineering of Software Systems](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/)" from MIT, "[Programming Parallel Computers](https://ppc.cs.aalto.fi/)" from Aalto University, and some non-academic ones like Denis Bakhvalov's "[Performance Ninja](https://github.com/dendibakh/perf-ninja)" — most computer science graduates still treat the hardware like something from the 1990s. What I really want to achieve is that performance engineering becomes taught right after introduction to algorithms. Writing the first comprehensive textbook on the subject is a large part of it, and this is why I rush to finish it by the summer so that the colleges can pick it up in the next academic year. But creating a new course requires more than that: you need a balanced curriculum, course infrastructure, lecture slides, lab assignments… so for some time after finishing the main book, I will be working on course materials and tools for *teaching* performance engineering — and I'm looking forward to collaborating with other people who want to make it a reality as well. 
diff --git a/content/english/hpc/architecture/assembly.md b/content/english/hpc/architecture/assembly.md index 5c981547..00c7caac 100644 --- a/content/english/hpc/architecture/assembly.md +++ b/content/english/hpc/architecture/assembly.md @@ -19,7 +19,7 @@ Jumping right into it, here is how you add two numbers (`*c = *a + *b`) in Arm a ldr w0, [x0] ; load 4 bytes from wherever x0 points into w0 ldr w1, [x1] ; load 4 bytes from wherever x1 points into w1 add w0, w0, w1 ; add w0 with w1 and save the result to w0 -str w0, [x2] ; write contents of w0 to wherever x2 points/ +str w0, [x2] ; write contents of w0 to wherever x2 points ``` Here is the same operation in x86 assembly: @@ -33,7 +33,7 @@ mov DWORD PTR [rdx], eax ; write contents of eax to wherever rdx points Assembly is very simple in the sense that it doesn't have many syntactical constructions compared to high-level programming languages. From what you can observe from the examples above: -- A program is a sequence of instructions, each written as its name followed by a variable amount of operands. +- A program is a sequence of instructions, each written as its name followed by a variable number of operands. - The `[reg]` syntax is used for "dereferencing" a pointer stored in a register, and on x86 you need to prefix it with size information (`DWORD` here means 32 bit). - The `;` sign is used for line comments, similar to `#` and `//` in other languages. @@ -55,7 +55,7 @@ Most instructions write their result into the first operand, which can also be i **Registers** are named `rax`, `rbx`, `rcx`, `rdx`, `rdi`, `rsi`, `rbp`, `rsp`, and `r8`-`r15` for a total of 16 of them. The "letter" ones are named like that for historical reasons: `rax` is "accumulator," `rcx` is "counter," `rdx` is "data" and so on — but, of course, they don't have to be used only for that. -There are also 32-, 16-bit and 8-bit registers that have similar names (`rax` → `eax` → `ax` → `al`). 
They are not fully separate but *aliased*: the first 32 bits of `rax` are `eax`, the first 16 bits of `eax` are `ax`, and so on. This is made to save die space while maintaining compatibility, and it is also the reason why basic type casts in compiled programming languages are usually free. +There are also 32-, 16-, and 8-bit registers that have similar names (`rax` → `eax` → `ax` → `al`). They are not fully separate but *aliased*: the lowest 32 bits of `rax` are `eax`, the lowest 16 bits of `eax` are `ax`, and so on. This is done to save die space while maintaining compatibility, and it is also the reason why basic type casts in compiled programming languages are usually free. These are just the *general-purpose* registers that you can, with [some exceptions](../functions), use however you like in most instructions. There is also a separate set of registers for [floating-point arithmetic](/hpc/arithmetic/float), a bunch of very wide registers used in [vector extensions](/hpc/simd), and a few special ones that are needed for [control flow](../loops), but we'll get there in time. diff --git a/content/english/hpc/architecture/functions.md b/content/english/hpc/architecture/functions.md index 02614f94..412fc027 100644 --- a/content/english/hpc/architecture/functions.md +++ b/content/english/hpc/architecture/functions.md @@ -18,7 +18,7 @@ The hardware stack works the same way software stacks do and is similarly implem - The *base pointer* marks the start of the stack and is conventionally stored in `rbp`. - The *stack pointer* marks the last element of the stack and is conventionally stored in `rsp`. -When you need to call a function, you push all your local variables onto the stack (which you can also do in other circumstances, e. g. when you run out of registers), push the current instruction pointer, and then jump to the beginning of the function.
When exiting from a function, you look at the pointer stored on top of the stack, jump there, and then carefully read all the variables stored on the stack back into their registers. +When you need to call a function, you push all your local variables onto the stack (which you can also do in other circumstances, e.g. when you run out of registers), push the current instruction pointer, and then jump to the beginning of the function. When exiting from a function, you look at the pointer stored on top of the stack, jump there, and then carefully read all the variables stored on the stack back into their registers. -By convention, a function should take its arguments in `rdi`, `rsi`, `rdx`, `rcx`, `r8`, `r9` (and the rest in the stack if that wasn't enough), put the return value into `rax`, and then return. Thus, `square`, being a simple one-argument function, can be implemented like this: +By convention, a function should take its arguments in `rdi`, `rsi`, `rdx`, `rcx`, `r8`, `r9` (and the rest in the stack if those weren't enough), put the return value into `rax`, and then return. Thus, `square`, being a simple one-argument function, can be implemented like this: ```nasm square: ; x = edi, ret = eax @@ -190,7 +190,7 @@ distance: ret ``` -This is better, but we are still implicitly accessing stack memory: you need to push and pop the instruction pointer on each function call. In simple cases like this, we can *inline* function calls by stitching callee's code into the caller and resolving conflicts over registers. In our example: +This is better, but we are still implicitly accessing stack memory: you need to push and pop the instruction pointer on each function call. In simple cases like this, we can *inline* function calls by stitching the callee's code into the caller and resolving conflicts over registers. 
In our example: ```nasm distance: diff --git a/content/english/hpc/architecture/isa.md b/content/english/hpc/architecture/isa.md index a1a4e66c..4862efb3 100644 --- a/content/english/hpc/architecture/isa.md +++ b/content/english/hpc/architecture/isa.md @@ -14,7 +14,7 @@ Abstractions help us in reducing all this complexity down to a single *interface Hardware engineers love abstractions too. An abstraction of a CPU is called an *instruction set architecture* (ISA), and it defines how a computer should work from a programmer's perspective. Similar to software interfaces, it gives computer engineers the ability to improve on existing CPU designs while also giving its users — us, programmers — the confidence that things that worked before won't break on newer chips. -An ISA essentially defines how the hardware should interpret the machine language. Apart from instructions and their binary encodings, ISA importantly defines counts, sizes, and purposes of registers, the memory model, and the input/output model. Similar to software interfaces, ISAs can be extended too: in fact, they are often updated, mostly in a backward-compatible way, to add new and more specialized instructions that can improve performance. +An ISA essentially defines how the hardware should interpret the machine language. Apart from instructions and their binary encodings, an ISA importantly defines the counts, sizes, and purposes of registers, the memory model, and the input/output model. Similar to software interfaces, ISAs can be extended too: in fact, they are often updated, mostly in a backward-compatible way, to add new and more specialized instructions that can improve performance. 
### RISC vs CISC diff --git a/content/english/hpc/architecture/layout.md b/content/english/hpc/architecture/layout.md index 11735951..9ddebfd5 100644 --- a/content/english/hpc/architecture/layout.md +++ b/content/english/hpc/architecture/layout.md @@ -16,7 +16,7 @@ During the **fetch** stage, the CPU simply loads a fixed-size chunk of bytes fro -Next comes the **decode** stage: the CPU looks at this chunk of bytes, discards everything that comes before the instruction pointer, and splits the rest of them into instructions. Machine instructions are encoded using a variable amount of bytes: something simple and very common like `inc rax` takes one byte, while some obscure instruction with encoded constants and behavior-modifying prefixes may take up to 15. So, from a 32-byte block, a variable number of instructions may be decoded, but no more than a certain machine-dependant limit called the *decode width*. On my CPU (a [Zen 2](https://en.wikichip.org/wiki/amd/microarchitectures/zen_2)), the decode width is 4, which means that on each cycle, up to 4 instructions can be decoded and passed to the next stage. +Next comes the **decode** stage: the CPU looks at this chunk of bytes, discards everything that comes before the instruction pointer, and splits the rest of them into instructions. Machine instructions are encoded using a variable number of bytes: something simple and very common like `inc rax` takes one byte, while some obscure instruction with encoded constants and behavior-modifying prefixes may take up to 15. So, from a 32-byte block, a variable number of instructions may be decoded, but no more than a certain machine-dependent limit called the *decode width*. On my CPU (a [Zen 2](https://en.wikichip.org/wiki/amd/microarchitectures/zen_2)), the decode width is 4, which means that on each cycle, up to 4 instructions can be decoded and passed to the next stage. 
The stages work in a pipelined fashion: if the CPU can tell (or [predict](/hpc/pipelining/branching/)) which instruction block it needs next, then the fetch stage doesn't wait for the last instruction in the current block to be decoded and loads the next one right away. @@ -49,12 +49,12 @@ The instructions are stored and fetched using largely the same [memory system](/ The instruction cache is crucial in situations when you either - don't know what instructions you are going to execute next, and need to fetch the next block with [low latency](/hpc/cpu-cache/latency), -- or executing a long sequence of verbose-but-quick-to-process instructions, and need [high bandwidth](/hpc/cpu-cache/bandwidth). +- or are executing a long sequence of verbose-but-quick-to-process instructions, and need [high bandwidth](/hpc/cpu-cache/bandwidth). The memory system can therefore become the bottleneck for programs with large machine code. This consideration limits the applicability of the optimization techniques we've previously discussed: - [Inlining functions](../functions) is not always optimal, because it reduces code sharing and increases the binary size, requiring more instruction cache. -- [Unrolling loops](../loops) is only beneficial up to some extent, even if the number of loops is known during compile-time: at some point, the CPU would have to fetch both instructions and data from the main memory, in which case it will likely be bottlenecked by the memory bandwidth. +- [Unrolling loops](../loops) is only beneficial up to some extent, even if the number of iterations is known during compile-time: at some point, the CPU would have to fetch both instructions and data from the main memory, in which case it will likely be bottlenecked by the memory bandwidth. - Huge [code alignments](#code-alignment) increase the binary size, again requiring more instruction cache. 
Spending one more cycle on fetch is a minor penalty compared to missing the cache and waiting for the instructions to be fetched from the main memory. Another aspect is that placing frequently used instruction sequences on the same [cache lines](/hpc/cpu-cache/cache-lines) and [memory pages](/hpc/cpu-cache/paging) improves [cache locality](/hpc/external-memory/locality). To improve instruction cache utilization, you should group hot code with hot code and cold code with cold code, and remove dead (unused) code if possible. If you want to explore this idea further, check out Facebook's [Binary Optimization and Layout Tool](https://engineering.fb.com/2018/06/19/data-infrastructure/accelerate-large-scale-applications-with-bolt/), which was recently [merged](https://github.com/llvm/llvm-project/commit/4c106cfdf7cf7eec861ad3983a3dd9a9e8f3a8ae) into LLVM. diff --git a/content/english/hpc/architecture/loops.md b/content/english/hpc/architecture/loops.md index b441ae67..9dc1faba 100644 --- a/content/english/hpc/architecture/loops.md +++ b/content/english/hpc/architecture/loops.md @@ -23,11 +23,11 @@ Assembly doesn't have if-s, for-s, functions, or other control flow structures t **Jump** moves the instruction pointer to a location specified by its operand. This location may be either an absolute address in memory, relative to the current address or even [computed during runtime](../indirect). To avoid the headache of managing these addresses directly, you can mark any instruction with a string followed by `:`, and then use this string as a label which gets replaced by the relative address of this instruction when converted to machine code. -Labels can be any strings, but compilers don't get creative and [typically](https://godbolt.org/z/T45x8GKa5) just use the line numbers in the source code and function names with their signatures when picking names for labels. 
+Labels can be any string, but compilers don't get creative and [typically](https://godbolt.org/z/T45x8GKa5) just use the line numbers in the source code and function names with their signatures when picking names for labels. **Unconditional** jump `jmp` can only be used to implement `while (true)` kind of loops or stitch parts of a program together. A family of **conditional** jumps is used to implement actual control flow. -It is reasonable to think that these conditions are computed as `bool`-s somewhere and passed to conditional jumps as operands: after all, this is how it works in programming languages. But that is not how it is implemented in hardware. Conditional operations use a special `FLAGS` register, which first needs to be populated by executing instructions that perform some kind of checks. +It is reasonable to think that these conditions are computed as `bool`-s somewhere and passed to conditional jumps as operands: after all, this is how it works in programming languages. But that is not how it is implemented in hardware. Conditional operations use a special `FLAGS` register, which first needs to be populated by executing instructions that perform some kind of check. In our example, `cmp rax, rcx` compares the iterator `rax` with the end-of-array pointer `rcx`. This updates the FLAGS register, and now it can be used by `jne loop`, which looks up a certain bit there that tells whether the two values are equal or not, and then either jumps back to the beginning or continues to the next instruction, thus breaking the loop. 
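The compare-then-jump mechanics described above for `cmp rax, rcx` and `jne loop` can be mimicked in C++ with an explicit label and `goto`. This is only a sketch: the function name `sum_with_jumps` and the summation loop body are assumptions chosen for illustration, not code taken from the article.

```cpp
#include <cstddef>

// Hypothetical illustration: sums an array using only a label,
// a comparison, and a conditional jump, mirroring the assembly loop.
int sum_with_jumps(const int *a, std::size_t n) {
    const int *it = a;       // plays the role of rax (the iterator)
    const int *end = a + n;  // plays the role of rcx (the end-of-array pointer)
    int s = 0;
    if (it == end)           // guard so an empty array never enters the loop
        return s;
loop:
    s += *it;                // loop body (assumed here to be a summation)
    ++it;                    // advance the iterator
    if (it != end)           // corresponds to `cmp rax, rcx` followed by `jne loop`
        goto loop;
    return s;                // falling through the jump breaks the loop
}
```

The `if (it != end) goto loop;` pair is the high-level counterpart of populating the flags with a comparison and then conditionally jumping on them.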
diff --git a/content/english/hpc/arithmetic/division.md b/content/english/hpc/arithmetic/division.md index e3f699db..ad1cf525 100644 --- a/content/english/hpc/arithmetic/division.md +++ b/content/english/hpc/arithmetic/division.md @@ -45,7 +45,7 @@ You can also divide 128-bit integer (stored in `rdx:rax`) by a 64-bit integer: ```nasm div(u128, u64): ; a = rdi + rsi, b = rdx - mov rcx, rdx ; + mov rcx, rdx mov rax, rdi mov rdx, rsi div rcx diff --git a/content/english/hpc/arithmetic/float.md b/content/english/hpc/arithmetic/float.md index cda42944..70217a91 100644 --- a/content/english/hpc/arithmetic/float.md +++ b/content/english/hpc/arithmetic/float.md @@ -139,7 +139,7 @@ $$ \{ \pm \; (1 + m) \cdot 2^e \; | \; m = \frac{x}{2^{32}}, \; x \in [0, 2^{32}) \} $$ -Since $m$ is now a nonnegative value, we will now make it unsigned integer, and instead add a separate boolean field for the sign of the number: +Since $m$ is now a nonnegative value, we will now make it an unsigned integer, and instead add a separate Boolean field for the sign of the number: ```cpp struct fp { diff --git a/content/english/hpc/compilation/_index.md b/content/english/hpc/compilation/_index.md index cbc0f691..07b0e07f 100644 --- a/content/english/hpc/compilation/_index.md +++ b/content/english/hpc/compilation/_index.md @@ -8,4 +8,4 @@ The main benefit of [learning assembly language](../architecture/assembly) is no There are rare cases where we *really* need to switch to handwritten assembly for maximal performance, but most of the time compilers are capable of producing near-optimal code all by themselves. When they do not, it is usually because the programmer knows more about the problem than what can be inferred from the source code, but failed to communicate this extra information to the compiler. -In this chapter, we will discuss the intricacies of getting compiler to do exactly what we want and gathering useful information that can guide further optimizations.
+In this chapter, we will discuss the intricacies of getting the compiler to do exactly what we want and gathering useful information that can guide further optimizations. diff --git a/content/english/hpc/complexity/_index.md b/content/english/hpc/complexity/_index.md index 69cebf4c..c537c4ce 100644 --- a/content/english/hpc/complexity/_index.md +++ b/content/english/hpc/complexity/_index.md @@ -11,7 +11,7 @@ Complexity is an old concept. It was [systematically formulated](http://www.cs.a ### Classical Complexity Theory -The "elementary operations" of a CPU are called *instructions*, and their "costs" are called *latencies*. Instructions are stored in *memory* and executed one by one by the processor, which has some internal *state* stored in a number of *registers*. One of these registers is the *instruction pointer* that indicates the address of the next instruction to read and execute. Each instruction changes the state of the processor in a certain way (including moving the instruction pointer), possibly modifies the main memory, and takes a different amount of *CPU cycles* to complete before the next one can be started. +The "elementary operations" of a CPU are called *instructions*, and their "costs" are called *latencies*. Instructions are stored in *memory* and executed one by one by the processor, which has some internal *state* stored in a number of *registers*. One of these registers is the *instruction pointer* that indicates the address of the next instruction to read and execute. Each instruction changes the state of the processor in a certain way (including moving the instruction pointer), possibly modifies the main memory, and takes a different number of *CPU cycles* to complete before the next one can be started. To estimate the real running time of a program, you need to sum all latencies for its executed instructions and divide it by the *clock frequency*, that is, the number of cycles a particular CPU does per second. 
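The estimate in the last paragraph (sum the latencies, then divide by the clock frequency) can be written out as a small helper. This is a minimal sketch under the simple model described there; the name `estimate_seconds` and its parameters are hypothetical and introduced only for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: estimated running time in seconds, assuming
// instructions run strictly one after another (the classical model above).
double estimate_seconds(const std::vector<std::uint64_t> &latencies, double clock_hz) {
    std::uint64_t total_cycles = 0;
    for (std::uint64_t cycles : latencies)
        total_cycles += cycles;  // sum the latency of every executed instruction
    // clock_hz is the number of cycles per second, so cycles / clock_hz gives seconds
    return static_cast<double>(total_cycles) / clock_hz;
}
```

For instance, three instructions with latencies of 1, 3, and 4 cycles add up to 8 cycles, which on a 2 GHz CPU would take about 4 nanoseconds under this model.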
diff --git a/content/english/hpc/complexity/hardware.md b/content/english/hpc/complexity/hardware.md index 1d59d101..d1c950b6 100644 --- a/content/english/hpc/complexity/hardware.md +++ b/content/english/hpc/complexity/hardware.md @@ -4,9 +4,9 @@ weight: 1 ignoreIndexing: true --- -The main disadvantage of the supercomputers of the 1960s wasn't that they were slow — relatively speaking, they weren't — but that they were giant, complex to use, and so expensive that only the governments of the world superpowers could afford them. Their size was the reason they were so expensive: they required a lot of custom components that had to be very carefully assembled in the macro-world, by people holding advanced degrees in electrical engineering, in a process that couldn't be up-scaled for mass production. +The main disadvantage of the supercomputers of the 1960s wasn't that they were slow — relatively speaking, they weren't — but that they were giant, complex to use, and so expensive that only the governments of the world superpowers could afford them. Their size was the reason they were so expensive: they required a lot of custom components that had to be very carefully assembled in the macro-world, by people holding advanced degrees in electrical engineering, in a process that couldn't be scaled up for mass production. -The turning point was the development of *microchips* — single, tiny, complete circuits — which revolutionized the industry and turned out to be probably the most important invention of the 20th century. What was a multimillion-dollar cupboard of computing machinery in 1965 could in 1975 fit on a [4×4 mm slice of silicon](https://en.wikipedia.org/wiki/MOS_Technology_6502)[^size] that you can buy for $25. This dramatic improvement in affordability started the home computer revolution during the following decade, with computers like Apple II, Atari 2600, Commodore 64, and IBM PC becoming available to the masses. 
+The turning point was the development of *microchips* — single, tiny, complete circuits — which revolutionized the industry and turned out to be probably the most important invention of the 20th century. What was a multimillion-dollar cupboard of computing machinery in 1965 could in 1975 fit on a [4mm × 4mm slice of silicon](https://en.wikipedia.org/wiki/MOS_Technology_6502)[^size] that you can buy for $25. This dramatic improvement in affordability started the home computer revolution during the following decade, with computers like Apple II, Atari 2600, Commodore 64, and IBM PC becoming available to the masses. [^size]: Actual sizes of CPUs are about centimeter-scale because of power management, heat dissipation, and the need to plug it into the motherboard without excessive swearing. @@ -17,7 +17,7 @@ Microchips are "printed" on a slice of crystalline silicon using a process calle 1. growing and slicing a [very pure silicon crystal](https://en.wikipedia.org/wiki/Wafer_(electronics)), 2. covering it with a layer of [a substance that dissolves when photons hit it](https://en.wikipedia.org/wiki/Photoresist), 3. hitting it with photons in a set pattern, -4. chemically [etching](https://en.wikipedia.org/wiki/Etching_(microfabrication)) the now exposed parts, +4. chemically [etching](https://en.wikipedia.org/wiki/Etching_(microfabrication)) the now-exposed parts, 5. removing the remaining photoresist, …and then performing another 40-50 steps over several months to complete the rest of the CPU. @@ -56,11 +56,11 @@ Throughout most of the computing history, optical shrinking was the main driving Both Dennard scaling and Moore's law are not actual laws of physics, but just observations made by savvy engineers. They are both destined to stop at some point due to fundamental physical limitations, the ultimate one being the size of silicon atoms. In fact, Dennard scaling already did — due to power issues. 
-Thermodynamically, a computer is just a very efficient device for converting electrical power into heat. This heat eventually needs to be removed, and there are physical limits to how much power you can dissipate from a millimeter-scale crystal. Computer engineers, aiming to maximize performance, essentially just choose the maximum possible clock rate so that the overall power consumption stays the same. If transistors become smaller, they have less capacity, meaning less required voltage to flip them, which in turn allows increasing the clock rate. +Thermodynamically, a computer is just a very efficient device for converting electrical power into heat. This heat eventually needs to be removed, and there are physical limits to how much power you can dissipate from a millimeter-scale crystal. Computer engineers, aiming to maximize performance, essentially just choose the maximum possible clock rate so that the overall power consumption stays the same. If transistors become smaller, they have less capacitance, meaning less required voltage to flip them, which in turn allows increasing the clock rate. Around 2005–2007, this strategy stopped working because of *leakage* effects: the circuit features became so small that their magnetic fields started to make the electrons in the neighboring circuitry move in directions they are not supposed to, causing unnecessary heating and occasional bit flipping. -The only way to mitigate this is to increase voltage; and to balance off power consumption you need to reduce clock frequency, which in turn makes the whole process progressively less profitable as transistor density increases. At some point, clock rates could no longer be increased by scaling, and the miniaturization trend started to slow down. 
+The only way to mitigate this is to increase the voltage; and to balance off power consumption you need to reduce clock frequency, which in turn makes the whole process progressively less profitable as transistor density increases. At some point, clock rates could no longer be increased by scaling, and the miniaturization trend started to slow down. , but for now, you can assume that the CPU maintains a buffer of pending instructions up to some distance in the future, and executes them as soon as the values of its operands are computed and there is an execution unit available. +You can only take advantage of superscalar processing if the stream of instructions contains groups of logically independent operations that can be processed separately. The instructions don't always arrive in the most convenient order, so, when possible, modern CPUs can execute them *out of order* to improve overall utilization and minimize pipeline stalls. How this magic works is a topic for a more advanced discussion, but for now, you can assume that the CPU maintains a buffer of pending instructions up to some distance in the future, and executes them as soon as the values of its operands are computed and there is an execution unit available. ### An Education Analogy Consider how our education system works: 1. Topics are taught to groups of students instead of individuals as broadcasting the same things to everyone at once is more efficient. -2. An intake of students is split into groups lead by different teachers; assignments and other course materials are shared between groups. +2. An intake of students is split into groups led by different teachers; assignments and other course materials are shared between groups. 3. Each year the same course is taught to a new intake so that the teachers are kept busy. 
These innovations greatly increase the *throughput* of the whole system, although the *latency* (time to graduation for a particular student) remains unchanged (and maybe increases a little bit because personalized tutoring is more effective). diff --git a/content/english/hpc/pipelining/branchless.md b/content/english/hpc/pipelining/branchless.md index 280498b1..0f87da83 100644 --- a/content/english/hpc/pipelining/branchless.md +++ b/content/english/hpc/pipelining/branchless.md @@ -32,7 +32,7 @@ Suddenly, the loop now takes ~7 cycles per element instead of the original ~14. But wait… shouldn't there still be a branch? How does `(a[i] < 50)` map to assembly? -There are no boolean types in assembly, nor any instructions that yield either one or zero based on the result of the comparison, but we can compute it indirectly like this: `(a[i] - 50) >> 31`. This trick relies on the [binary representation of integers](/hpc/arithmetic/integer), specifically on the fact that if the expression `a[i] - 50` is negative (implying `a[i] < 50`), then the highest bit of the result will be set to one, which we can then extract using a right-shift. +There are no Boolean types in assembly, nor any instructions that yield either one or zero based on the result of the comparison, but we can compute it indirectly like this: `(a[i] - 50) >> 31`. This trick relies on the [binary representation of integers](/hpc/arithmetic/integer), specifically on the fact that if the expression `a[i] - 50` is negative (implying `a[i] < 50`), then the highest bit of the result will be set to one, which we can then extract using a right-shift. ```nasm mov ebx, eax ; t = x @@ -101,7 +101,7 @@ In our example, the branchy code wins when the branch can be predicted with a pr ![](../img/branchy-vs-branchless.svg) -This 75% threshold is commonly used by the compilers as a heuristic for determining whether to use the `cmov` or not. 
Unfortunately, this probability is usually unknown at the compile-time, so it needs to be provided in one of several ways: +This 75% threshold is commonly used by the compilers as a heuristic for determining whether to use the `cmov` or not. Unfortunately, this probability is usually unknown at the compile time, so it needs to be provided in one of several ways: - We can use [profile-guided optimization](/hpc/compilation/situational/#profile-guided-optimization) which will decide for itself whether to use predication or not. - We can use [likeliness attributes](../branching#hinting-likeliness-of-branches) and [compiler-specific intrinsics](/hpc/compilation/situational) to hint at the likeliness of branches: `__builtin_expect_with_probability` in GCC and `__builtin_unpredictable` in Clang. diff --git a/content/english/hpc/pipelining/hazards.md b/content/english/hpc/pipelining/hazards.md index 02a0869d..d4a2d7df 100644 --- a/content/english/hpc/pipelining/hazards.md +++ b/content/english/hpc/pipelining/hazards.md @@ -20,6 +20,6 @@ Different hazards have different penalties: - In structural hazards, you have to wait (usually one more cycle) until the execution unit is ready. They are fundamental bottlenecks on performance and can't be avoided — you have to engineer around them. - In data hazards, you have to wait for the required data to be computed (the latency of the *critical path*). Data hazards are solved by restructuring computations so that the critical path is shorter. -- In control hazards, you generally have to flush the entire pipeline and start over, wasting whole 15-20 cycles. They are solved by either removing branches completely, or making them predictable so that the CPU can effectively *speculate* on what is going to be executed next. +- In control hazards, you generally have to flush the entire pipeline and start over, wasting a whole 15-20 cycles. 
They are solved by either removing branches completely, or making them predictable so that the CPU can effectively *speculate* on what is going to be executed next. -As they have very different impact on performance, we are going to go in the reversed order and start with the more grave ones. +As they have very different impacts on performance, we are going to go in the reversed order and start with the more grave ones. diff --git a/content/english/hpc/pipelining/tables.md b/content/english/hpc/pipelining/tables.md index 24678270..5f69c579 100644 --- a/content/english/hpc/pipelining/tables.md +++ b/content/english/hpc/pipelining/tables.md @@ -14,7 +14,7 @@ In this context, it makes sense to use two different "[costs](/hpc/complexity)" -You can get latency and throughput numbers for a specific architecture from special documents called [instruction tables](https://www.agner.org/optimize/instruction_tables.pdf). Here are some samples values for my Zen 2 (all specified for 32-bit operands, if there is any difference): +You can get latency and throughput numbers for a specific architecture from special documents called [instruction tables](https://www.agner.org/optimize/instruction_tables.pdf). Here are some sample values for my Zen 2 (all specified for 32-bit operands, if there is any difference): | Instruction | Latency | RThroughput | |-------------|---------|:------------| @@ -34,7 +34,7 @@ Some comments: - If a certain instruction is especially frequent, its execution unit could be duplicated to increase its throughput, possibly to even more than one, but not higher than the [decode width](/hpc/architecture/layout). - Some instructions have a latency of 0. This means that these instructions are used to control the scheduler and don't reach the execution stage. They still have non-zero reciprocal throughput because the [CPU front-end](/hpc/architecture/layout) still needs to process them.
- Most instructions are pipelined, and if they have the reciprocal throughput of $n$, this usually means that their execution unit can take another instruction after $n$ cycles (and if it is below 1, this means that there are multiple execution units, all capable of taking another instruction on the next cycle). One notable exception is the [integer division](/hpc/arithmetic/division): it is either very poorly pipelined or not pipelined at all. -- Some instructions have variable latency, depending on not only the size, but also the values of the operands. For memory operations (including fused ones like `add`), latency is usually specified for the best case (an L1 cache hit). +- Some instructions have variable latency, depending on not only the size, but also the values of the operands. For memory operations (including fused ones like `add`), the latency is usually specified for the best case (an L1 cache hit). There are many more important little details, but this mental model will suffice for now. diff --git a/content/english/hpc/pipelining/throughput.md b/content/english/hpc/pipelining/throughput.md index 27789b28..03562291 100644 --- a/content/english/hpc/pipelining/throughput.md +++ b/content/english/hpc/pipelining/throughput.md @@ -6,7 +6,7 @@ weight: 4 Optimizing for *latency* is usually quite different from optimizing for *throughput*: - When optimizing data structure queries or small one-time or branchy algorithms, you need to [look up the latencies](../tables) of its instructions, mentally construct the execution graph of the computation, and then try to reorganize it so that the critical path is shorter. -- When optimizing hot loops and large-dataset algorithms, you need to look up the throughputs of its instructions, count how many times each one is used per iteration, determine which of them is the bottleneck, and then try to restructure the loop so that it is used less often. 
+- When optimizing hot loops and large-dataset algorithms, you need to look up the throughputs of their instructions, count how many times each one is used per iteration, determine which of them is the bottleneck, and then try to restructure the loop so that it is used less often. The last advice only works for *data-parallel* loops, where each iteration is fully independent of the previous one. When there is some interdependency between consecutive iterations, there may potentially be a pipeline stall caused by a [data hazard](../hazards) as the next iteration is waiting for the previous one to complete. @@ -64,7 +64,7 @@ If an instruction has a latency of $x$ and a throughput of $y$, then you would n This technique is mostly used with [SIMD](/hpc/simd) and not in scalar code. You can [generalize](/hpc/simd/reduction) the code above and compute sums and other reductions faster than the compiler. -In general, when optimizing loops, you usually have just one or a few *execution ports* that you want to utilize to their fullest, and you engineer the rest of the loop around them. As different instructions may use different sets of ports, it is not always clear which one is going to be the overused. In situations like this, [machine code analyzers](/hpc/profiling/mca) can be very helpful for finding bottlenecks of small assembly loops. +In general, when optimizing loops, you usually have just one or a few *execution ports* that you want to utilize to their fullest, and you engineer the rest of the loop around them. As different instructions may use different sets of ports, it is not always clear which one is going to be overused. In situations like this, [machine code analyzers](/hpc/profiling/mca) can be very helpful for finding the bottlenecks of small assembly loops. 
diff --git a/content/english/hpc/architecture/assembly.md b/content/english/hpc/architecture/assembly.md index 00c7caac..de94e4cf 100644 --- a/content/english/hpc/architecture/assembly.md +++ b/content/english/hpc/architecture/assembly.md @@ -128,7 +128,7 @@ movl %eax, (%rdx) The key differences can be summarized as follows: 1. The *last* operand is used to specify the destination. -2. Registers and constants need to be prefixed by `%` and `$` respectively (e. g. `addl $1, %rdx` increments `rdx`). +2. Registers and constants need to be prefixed by `%` and `$` respectively (e.g., `addl $1, %rdx` increments `rdx`). 3. Memory addressing looks like this: `displacement(%base, %index, scale)`. 4. Both `;` and `#` can be used for line comments, and also `/* */` can be used for block comments. diff --git a/content/english/hpc/architecture/functions.md b/content/english/hpc/architecture/functions.md index 412fc027..3f98a381 100644 --- a/content/english/hpc/architecture/functions.md +++ b/content/english/hpc/architecture/functions.md @@ -18,7 +18,7 @@ The hardware stack works the same way software stacks do and is similarly implem - The *base pointer* marks the start of the stack and is conventionally stored in `rbp`. - The *stack pointer* marks the last element of the stack and is conventionally stored in `rsp`. -When you need to call a function, you push all your local variables onto the stack (which you can also do in other circumstances, e.g. when you run out of registers), push the current instruction pointer, and then jump to the beginning of the function. When exiting from a function, you look at the pointer stored on top of the stack, jump there, and then carefully read all the variables stored on the stack back into their registers. 
+When you need to call a function, you push all your local variables onto the stack (which you can also do in other circumstances; e.g., when you run out of registers), push the current instruction pointer, and then jump to the beginning of the function. When exiting from a function, you look at the pointer stored on top of the stack, jump there, and then carefully read all the variables stored on the stack back into their registers. @@ -249,7 +249,7 @@ Apart from requiring much less memory, which is good for fitting into the CPU ca To improve the performance further, we can: -- manually optimize the index arithmetic (e. g. noticing that we need to multiply `v` by `2` either way), +- manually optimize the index arithmetic (e.g., noticing that we need to multiply `v` by `2` either way), - replace division by two with an explicit binary shift (because [compilers aren't always able to do it themselves](/hpc/compilation/contracts/#arithmetic)), - and, most importantly, get rid of [recursion](/hpc/architecture/functions) and make the implementation fully iterative. @@ -724,7 +724,7 @@ This makes both queries much slower — especially the reduction — but this sh **Minimum** is a nice exception where the update query can be made slightly faster if the new value of the element is less than the current one: we can skip the horizontal reduction part and just update $\log_B n$ nodes using a scalar procedure. -This works very fast when we mostly have such updates, which is the case e. g. for the sparse-graph Dijkstra algorithm when we have more edges than vertices. For this problem, the wide segment tree can serve as an efficient fixed-universe min-heap. +This works very fast when we mostly have such updates, which is the case, e.g., for the sparse-graph Dijkstra algorithm when we have more edges than vertices. For this problem, the wide segment tree can serve as an efficient fixed-universe min-heap. 
**Lazy propagation** can be done by storing a separate array for the delayed operations in a node. To propagate the updates, we need to go top to bottom (which can be done by simply reversing the direction of the `for` loop and using `k >> (h * b)` to calculate the `h`-th ancestor), [broadcast](/hpc/simd/moving/#broadcast) and reset the delayed operation value stored in the parent of the current node, and apply it to all values stored in the current node with SIMD. diff --git a/content/english/hpc/external-memory/hierarchy.md b/content/english/hpc/external-memory/hierarchy.md index 35670da9..f0ca9c65 100644 --- a/content/english/hpc/external-memory/hierarchy.md +++ b/content/english/hpc/external-memory/hierarchy.md @@ -40,8 +40,8 @@ Everything up to the RAM level is called *volatile memory* because it does not p From fastest to slowest: -- **CPU registers**, which are the zero-time access data cells CPU uses to store all its intermediate values, can also be thought of as a memory type. There is only a limited number of them (e. g. 16 "general purpose" ones), and in some cases, you may want to use all of them for performance reasons. -- **CPU caches.** Modern CPUs have multiple layers of cache (L1, L2, often L3, and rarely even L4). The lowest layer is shared between cores and is usually scaled with their number (e. g. a 10-core CPU should have around 10M of L3 cache). +- **CPU registers**, which are the zero-time access data cells CPU uses to store all its intermediate values, can also be thought of as a memory type. There is only a limited number of them (e.g., just 16 "general purpose" ones), and in some cases, you may want to use all of them for performance reasons. +- **CPU caches.** Modern CPUs have multiple layers of cache (L1, L2, often L3, and rarely even L4). The lowest layer is shared between cores and is usually scaled with their number (e.g., a 10-core CPU should have around 10M of L3 cache). 
- **Random access memory,** which is the first scalable type of memory: nowadays you can rent machines with half a terabyte of RAM on the public clouds. This is the one where most of your working data is supposed to be stored. The CPU cache system has an important concept of a *cache line*, which is the basic unit of data transfer between the CPU and the RAM. The size of a cache line is 64 bytes on most architectures, meaning that all main memory is divided into blocks of 64 bytes, and whenever you request (read or write) a single byte, you are also fetching all its 63 cache line neighbors whether you want them or not. diff --git a/content/english/hpc/external-memory/oblivious.md b/content/english/hpc/external-memory/oblivious.md index a0327855..93c4f2fc 100644 --- a/content/english/hpc/external-memory/oblivious.md +++ b/content/english/hpc/external-memory/oblivious.md @@ -118,7 +118,7 @@ It seems like we can't do better, but it turns out we can. ### Algorithm -Cache-oblivious matrix multiplication relies on essentially the same trick as the transposition. We need to divide the data until it fits into lowest cache (i. e. $N^2 \leq M$). For matrix multiplication, this equates to using this formula: +Cache-oblivious matrix multiplication relies on essentially the same trick as the transposition. We need to divide the data until it fits into the lowest cache (i.e., $N^2 \leq M$).
For matrix multiplication, this equates to using this formula: $$ \begin{pmatrix} diff --git a/content/english/hpc/external-memory/virtual.md b/content/english/hpc/external-memory/virtual.md index 6535283d..92bb454c 100644 --- a/content/english/hpc/external-memory/virtual.md +++ b/content/english/hpc/external-memory/virtual.md @@ -19,7 +19,7 @@ Virtual memory gives each process the impression that it fully controls a contig To achieve this, the memory address space is divided into *pages* (typically 4KB in size), which are the base units of memory that the programs can request from the operating system. The memory system maintains a special hardware data structure called the *page table*, which contains the mappings of virtual page addresses to the physical ones. When a process accesses data using its virtual memory address, the memory system calculates its page number (by right-shifting it by $12$ if $4096=2^{12}$ is the page size), looks up in the page table what its physical address is, and forwards the read or write request to where that data is actually stored. -Since the address translation needs to be done for each memory request, and the number of memory pages itself may be large (e. g. 16G RAM / 4K page size = 4M pages), address translation poses a difficult problem in itself. One way to speed it up is to use a special cache for the page table itself called *translation lookaside buffer* (TLB), and the other is to [increase the page size](/hpc/cpu-cache/paging) so that the total number of memory pages is made smaller at the cost of reduced granularity. +Since the address translation needs to be done for each memory request, and the number of memory pages itself may be large (e.g., 16G RAM / 4K page size = 4M pages), address translation poses a difficult problem in itself.
One way to speed it up is to use a special cache for the page table itself called *translation lookaside buffer* (TLB), and the other is to [increase the page size](/hpc/cpu-cache/paging) so that the total number of memory pages is made smaller at the cost of reduced granularity. -Interleaving the stages of execution is a general idea in digital electronics, and it is applied not only in the main CPU pipeline, but also on the level of separate instructions and [memory](/hpc/cpu-cache/mlp). Most execution units have their own little pipelines, and can take another instruction just one or two cycles after the previous one. If a certain instruction is frequently used, it makes sense to duplicate its execution unit also, and also place frequently jointly used instructions on the same execution unit: e. g. not using the same for arithmetic and memory operation. +Interleaving the stages of execution is a general idea in digital electronics, and it is applied not only in the main CPU pipeline, but also on the level of separate instructions and [memory](/hpc/cpu-cache/mlp). Most execution units have their own little pipelines, and can take another instruction just one or two cycles after the previous one. If a certain instruction is frequently used, it makes sense to duplicate its execution unit also, and also place frequently jointly used instructions on the same execution unit: e.g., not using the same for arithmetic and memory operation. ### Microcode diff --git a/content/english/hpc/pipelining/throughput.md b/content/english/hpc/pipelining/throughput.md index 03562291..0b596404 100644 --- a/content/english/hpc/pipelining/throughput.md +++ b/content/english/hpc/pipelining/throughput.md @@ -84,7 +84,7 @@ Bandwidth is the rate at which data can be read or stored. For the purpose of de In the previous version, we have an inherently sequential chain of operations in the innermost loop. We accumulate the minimum in variable v by a sequence of min operations. 
There is no way to start the second operation before we know the result of the first operation; there is no room for parallelism here: -The result will be clearly the same, but we are calculating the operations in a different order. In essence, we split the work in two independent parts, calculating the minimum of odd elements and the minimum of even elements, and finally combining the results. If we calculate the odd minimum v0 and even minimum v1 in an interleaved manner, as shown above, we will have more opportunities for parallelism. For example, the 1st and 2nd operation could be calculated simultaneously in parallel (or they could be executed in a pipelined fashion in the same execution unit). Once these results are available, the 3rd and 4th operation could be calculated simultaneously in parallel, etc. We could potentially obtain a speedup of a factor of 2 here, and naturally the same idea could be extended to calculating e.g. 4 minimums in an interleaved fashion. +The result will be clearly the same, but we are calculating the operations in a different order. In essence, we split the work in two independent parts, calculating the minimum of odd elements and the minimum of even elements, and finally combining the results. If we calculate the odd minimum v0 and even minimum v1 in an interleaved manner, as shown above, we will have more opportunities for parallelism. For example, the 1st and 2nd operation could be calculated simultaneously in parallel (or they could be executed in a pipelined fashion in the same execution unit). Once these results are available, the 3rd and 4th operation could be calculated simultaneously in parallel, etc. We could potentially obtain a speedup of a factor of 2 here, and naturally the same idea could be extended to calculating, e.g., 4 minimums in an interleaved fashion. 
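To make the interleaving concrete, here is a minimal C++ sketch of the two variants described above. It assumes the data is a plain `std::vector<float>`; the function names `min_sequential` and `min_interleaved` are made up for this illustration, and only the accumulator names `v`, `v0`, and `v1` come from the text. Whether the second version is actually faster depends on the compiler and the CPU, so treat it as an illustration of the dependency structure rather than a benchmark.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: a single dependency chain.
// Each min operation has to wait for the result of the previous one.
float min_sequential(const std::vector<float>& a) {
    float v = std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < a.size(); i++)
        v = std::min(v, a[i]);
    return v;
}

// Hypothetical helper: two independent dependency chains.
// v0 tracks the 1st, 3rd, 5th, ... elements and v1 tracks the 2nd, 4th, ...,
// so the two min operations in the loop body do not depend on each other.
float min_interleaved(const std::vector<float>& a) {
    float v0 = std::numeric_limits<float>::infinity();
    float v1 = std::numeric_limits<float>::infinity();
    std::size_t i = 0;
    for (; i + 1 < a.size(); i += 2) {
        v0 = std::min(v0, a[i]);
        v1 = std::min(v1, a[i + 1]);
    }
    if (i < a.size())  // leftover element when the length is odd
        v0 = std::min(v0, a[i]);
    return std::min(v0, v1);  // combine the two partial results
}
```

Both functions return the same value for any input because `min` is associative and commutative; only the order of the operations differs. Extending the sketch to four accumulators follows the same pattern, with a slightly longer combining step at the end.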
Instruction-level parallelism is automatic Now that we know how to reorganize calculations so that there is potential for parallelism, we will need to know how to realize the potential. For example, if we have these two operations in the C++ code, how do we tell the computer that the operations can be safely executed in parallel? diff --git a/content/english/hpc/profiling/_index.md b/content/english/hpc/profiling/_index.md index 0b7ca30f..ceca0f2f 100644 --- a/content/english/hpc/profiling/_index.md +++ b/content/english/hpc/profiling/_index.md @@ -10,7 +10,7 @@ There are many different types of profilers. I like to think about them by analo - When objects are on a micrometer scale, they use optical microscopes. - When objects are on a nanometer scale, and light no longer interacts with them, they use electron microscopes. -- When objects are smaller than that (e. g. the insides of an atom), they resort to theories and assumptions about how things work (and test these assumptions using intricate and indirect experiments). +- When objects are smaller than that (e.g., the insides of an atom), they resort to theories and assumptions about how things work (and test these assumptions using intricate and indirect experiments). 
Similarly, there are three main profiling techniques, each operating by its own principles, having distinct areas of applicability, and allowing for different levels of precision: diff --git a/content/english/hpc/profiling/benchmarking.md b/content/english/hpc/profiling/benchmarking.md index d873ca62..dd543bcc 100644 --- a/content/english/hpc/profiling/benchmarking.md +++ b/content/english/hpc/profiling/benchmarking.md @@ -59,7 +59,7 @@ Although *efficient* in terms of execution speed, C and C++ are not the most *pr One way to improve modularity and reusability is to separate all testing and analytics code from the actual implementation of the algorithm, and also make it so that different versions are implemented in separate files, but have the same interface. -In C/C++, you can do this by creating a single header file (e. g. `gcd.hh`) with a function interface and all its benchmarking code in `main`: +In C/C++, you can do this by creating a single header file (e.g., `gcd.hh`) with a function interface and all its benchmarking code in `main`: ```c++ int gcd(int a, int b); // to be implemented @@ -93,7 +93,7 @@ int main() { } ``` -Then you create many implementation files for each algorithm version (e. g. `v1.cc`, `v2.cc` and so on, or some meaningful names if applicable) that all include that single header file: +Then you create many implementation files for each algorithm version (e.g., `v1.cc`, `v2.cc`, and so on, or some meaningful names if applicable) that all include that single header file: ```c++ #include "gcd.hh" diff --git a/content/english/hpc/profiling/events.md b/content/english/hpc/profiling/events.md index 71ae9cd3..eb2ba613 100644 --- a/content/english/hpc/profiling/events.md +++ b/content/english/hpc/profiling/events.md @@ -93,7 +93,7 @@ Overhead Command Shared Object Symbol 0.80% run libc-2.33.so [.] rand ``` -Note that, for each function, just its *overhead* is listed and not the total running time (e. g. 
`setup` includes `std::__introsort_loop` but only its own overhead is accounted as 3.43%). There are tools for constructing [flame graphs](https://www.brendangregg.com/flamegraphs.html) out of perf reports to make them more clear. You also need to account for possible inlining, which is apparently what happened with `std::lower_bound` here. Perf also tracks shared libraries (like `libc`) and, in general, any other spawned processes: if you want, you can launch a web browser with perf and see what's happening inside. +Note that, for each function, just its *overhead* is listed and not the total running time (e.g., `setup` includes `std::__introsort_loop` but only its own overhead is accounted as 3.43%). There are tools for constructing [flame graphs](https://www.brendangregg.com/flamegraphs.html) out of perf reports to make them more clear. You also need to account for possible inlining, which is apparently what happened with `std::lower_bound` here. Perf also tracks shared libraries (like `libc`) and, in general, any other spawned processes: if you want, you can launch a web browser with perf and see what's happening inside. Next, you can "zoom in" on any of these functions, and, among other things, it will offer to show you its disassembly with an associated heatmap. For example, here is the assembly for `query`: diff --git a/content/english/hpc/profiling/mca.md b/content/english/hpc/profiling/mca.md index 4634ba25..99cfe2ed 100644 --- a/content/english/hpc/profiling/mca.md +++ b/content/english/hpc/profiling/mca.md @@ -40,7 +40,7 @@ First, it outputs general information about the loop and the hardware: - It "ran" the loop 100 times, executing 400 instructions in total in 108 cycles, which is the same as executing $\frac{400}{108} \approx 3.7$ [instructions per cycle](/hpc/complexity/hardware) on average (IPC). - The CPU is theoretically capable of executing up to 6 instructions per cycle ([dispatch width](/hpc/architecture/layout)).
- Each iteration in theory can be executed in 0.8 cycles on average ([block reciprocal throughput](/hpc/pipelining/tables)). -The "uOps" here are the micro-operations that CPU splits each instruction into (e. g. fused load-add is composed of two uOps). +The "uOps" here are the micro-operations that the CPU splits each instruction into (e.g., fused load-add is composed of two uOps). Then it proceeds to give information about each individual instruction: diff --git a/content/english/hpc/profiling/noise.md b/content/english/hpc/profiling/noise.md index 8dcdb032..74ff0272 100644 --- a/content/english/hpc/profiling/noise.md +++ b/content/english/hpc/profiling/noise.md @@ -87,7 +87,7 @@ for (int i = 0; i < N; i++) checksum ^= lower_bound(checksum ^ q[i]); ``` -It usually makes the most difference in algorithms with possible pipeline stall issues, e. g. when comparing branchy and branch-free algorithms. +It usually makes the most difference in algorithms with possible pipeline stall issues, e.g., when comparing branchy and branch-free algorithms. **Cold cache.** Another source of bias is the *cold cache effect*, when memory reads initially take longer time because the required data is not in cache yet. @@ -130,7 +130,7 @@ The issues we've described produce *bias* in measurements: they consistently giv These types of issues are caused by side effects and some sort of external noise, mostly due to noisy neighbors and CPU frequency scaling: - If you benchmark a compute-bound algorithm, measure its performance in cycles using `perf stat`: this way it will be independent of clock frequency, fluctuations of which are usually the main source of noise. -- Otherwise, set core frequency to the what you expect it to be and make sure nothing interferes with it. On Linux you can do it with `cpupower` (e. g. `sudo cpupower frequency-set -g powersave` to put it to minimum or `sudo cpupower frequency-set -g ondemand` to enable turbo boost).
I use a [convenient GNOME shell extension](https://extensions.gnome.org/extension/1082/cpufreq/) that has a separate button to do it. +- Otherwise, set core frequency to what you expect it to be and make sure nothing interferes with it. On Linux you can do it with `cpupower` (e.g., `sudo cpupower frequency-set -g powersave` to put it to minimum or `sudo cpupower frequency-set -g ondemand` to enable turbo boost). I use a [convenient GNOME shell extension](https://extensions.gnome.org/extension/1082/cpufreq/) that has a separate button to do it. - If applicable, turn hyper-threading off and attach jobs to specific cores. Make sure no other jobs are running on the system, turn off networking and try not to fiddle with the mouse. You can't remove noises and biases completely. Even a program's name can affect its speed: the executable's name ends up in an environment variable, environment variables end up on the call stack, and so the length of the name affects stack alignment, which can result in data accesses slowing down due to crossing cache line or memory page boundaries. diff --git a/content/english/hpc/profiling/simulation.md b/content/english/hpc/profiling/simulation.md index 2f6c6dc6..75401b8a 100644 --- a/content/english/hpc/profiling/simulation.md +++ b/content/english/hpc/profiling/simulation.md @@ -50,7 +50,7 @@ Mispred rate: 22.0% ( 22.5% + 0.0% ) We've fed Cachegrind exactly the same example code as in [the previous section](../events): we create an array of a million random integers, sort it, and then perform a million binary searches on it. Cachegrind shows roughly the same numbers as perf does, except that perf's measured numbers of memory reads and branches are slightly inflated due to [speculative execution](/hpc/pipelining): they really happen in hardware and thus increment hardware counters, but are discarded and don't affect actual performance, and are thus ignored in the simulation.
-Cachegrind only models the first (`D1` for data, `I1` for instructions) and the last (`LL`, unified) levels of cache, the characteristics of which are inferred from the system. It doesn't limit you in any way as you can also set them from the command line, e. g. to model the L2 cache: `--LL=,,`. +Cachegrind only models the first (`D1` for data, `I1` for instructions) and the last (`LL`, unified) levels of cache, the characteristics of which are inferred from the system. It doesn't limit you in any way as you can also set them from the command line, e.g., to model the L2 cache: `--LL=,,`. It seems like it only slowed down our program so far and hasn't provided us any information that `perf stat` couldn't. To get more out of it than just the summary info, we can inspect a special file with profiling info, which it dumps by default in the same directory named as `cachegrind.out.`. It is human-readable, but is expected to be read via the `cg_annotate` command: diff --git a/content/english/hpc/simd/intrinsics.md b/content/english/hpc/simd/intrinsics.md index e091ddb6..4e9c6804 100644 --- a/content/english/hpc/simd/intrinsics.md +++ b/content/english/hpc/simd/intrinsics.md @@ -95,7 +95,7 @@ for (int i = 0; i < 100; i += 4) { The main challenge of using SIMD is getting the data into contiguous fixed-sized blocks suitable for loading into registers. In the code above, we may in general have a problem if the length of the array is not divisible by the block size. There are two common solutions to this: -1. We can "overshoot" by iterating over the last incomplete segment either way. To make sure we don't segfault by trying to read from or write to a memory region we don't own, we need to pad the arrays to the nearest block size (typically with some "neutral" element, e. g. zero). +1. We can "overshoot" by iterating over the last incomplete segment either way.
To make sure we don't segfault by trying to read from or write to a memory region we don't own, we need to pad the arrays to the nearest block size (typically with some "neutral" element, e.g., zero). 2. Make one iteration less and write a little loop at the end that calculates the remainder normally (with scalar operations). Humans prefer #1 because it is simpler and results in less code, and compilers prefer #2 because they don't really have another legal option. @@ -135,7 +135,7 @@ Also, some of the intrinsics don't map to a single instruction but a short seque diff --git a/content/english/hpc/simd/reduction.md b/content/english/hpc/simd/reduction.md index 078983d2..c67c1942 100644 --- a/content/english/hpc/simd/reduction.md +++ b/content/english/hpc/simd/reduction.md @@ -3,7 +3,7 @@ title: Sums and Other Reductions weight: 3 --- -*Reduction* (also known as *folding* in functional programming) is the action of computing the value of some associative and commutative operation (i.e. $(a \circ b) \circ c = a \circ (b \circ c)$ and $a \circ b = b \circ a$) over a range of arbitrary elements. +*Reduction* (also known as *folding* in functional programming) is the action of computing the value of some associative and commutative operation (i.e., $(a \circ b) \circ c = a \circ (b \circ c)$ and $a \circ b = b \circ a$) over a range of arbitrary elements. The simplest example of reduction is calculating the sum of an array: @@ -68,7 +68,7 @@ int hsum(__m256i x) { } ``` -There are [other similar instructions](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2&ig_expand=3037,3009,5135,4870,4870,4872,4875,833,879,874,849,848,6715,4845&text=horizontal), e. g. for integer multiplication or calculating absolute differences between adjacent elements (used in image processing).
+There are [other similar instructions](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2&ig_expand=3037,3009,5135,4870,4870,4872,4875,833,879,874,849,848,6715,4845&text=horizontal), e.g., for integer multiplication or calculating absolute differences between adjacent elements (used in image processing). There is also one specific instruction, `_mm_minpos_epu16`, that calculates the horizontal minimum and its index among eight 16-bit integers. This is the only horizontal reduction that works in one go: all others are computed in multiple steps. diff --git a/content/english/hpc/slides/01-intro/_index.md b/content/english/hpc/slides/01-intro/_index.md index 492ceb6a..615a89aa 100644 --- a/content/english/hpc/slides/01-intro/_index.md +++ b/content/english/hpc/slides/01-intro/_index.md @@ -151,7 +151,7 @@ Also a clear path to improvement: just make lenses stronger and chips smaller $\implies$ Each new "generation" should have roughly the same total cost, but 40% higher clock and twice as many transistors -(which can be used e. g. to add new instructions or increase the word size) +(which can be used, e.g., to add new instructions or increase the word size) ---- diff --git a/content/english/hpc/stats.md b/content/english/hpc/stats.md index 6e436d15..15d81e39 100644 --- a/content/english/hpc/stats.md +++ b/content/english/hpc/stats.md @@ -18,7 +18,7 @@ A **random variable** is any variable whose value depends on an outcome of a ran 2. $\forall x \in X, 0 \leq P \leq 1$. 3. $\sum_{x \in X} P(x) = 1$. -For example, consider a random variable $X$ with $k$ discrete states (e. g. the result of a die toss). We can place a *uniform distribution* on $X$ — that is, make each of its states equally likely — by setting its probability distribution to: +For example, consider a random variable $X$ with $k$ discrete states (e.g., the result of a die toss). 
We can place a *uniform distribution* on $X$ — that is, make each of its states equally likely — by setting its probability distribution to: $$ P(x=x_i) = \frac{1}{k} @@ -121,7 +121,7 @@ The last transition is true because it is a sum of harmonic series. ### Order Statistics -There is a slight modification of quicksort called quickselect that allows finding the $k$-th smallest element in $O(n)$ time, which is useful when we need to quickly compute order statistics, e. g. medians or 75-th quantiles. +There is a slight modification of quicksort called quickselect that allows finding the $k$-th smallest element in $O(n)$ time, which is useful when we need to quickly compute order statistics; e.g., medians or 75-th quantiles. 1. Select a random element $p$ from the array. 2. Partition the array into two arrays $L$ and $R$ using the predicate $a_i > p$. From 3bb8fad0b2f4c9bfeca09d3dfd8e2c9d24763184 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 18 May 2022 10:57:34 +0300 Subject: [PATCH 101/173] amount/number and much/many --- content/english/hpc/algorithms/gcd.md | 2 +- content/english/hpc/arithmetic/float.md | 2 +- content/english/hpc/arithmetic/ieee-754.md | 4 ++-- content/english/hpc/compilation/precalc.md | 2 +- content/english/hpc/cpu-cache/alignment.md | 2 +- content/english/hpc/data-structures/binary-search.md | 2 +- content/english/hpc/external-memory/locality.md | 2 +- content/english/hpc/external-memory/sorting.md | 4 ++-- content/english/hpc/profiling/benchmarking.md | 2 +- content/english/hpc/simd/shuffling.md | 2 +- 10 files changed, 12 insertions(+), 12 deletions(-) diff --git a/content/english/hpc/algorithms/gcd.md b/content/english/hpc/algorithms/gcd.md index 63efdec9..d56be8f7 100644 --- a/content/english/hpc/algorithms/gcd.md +++ b/content/english/hpc/algorithms/gcd.md @@ -135,7 +135,7 @@ int gcd(int a, int b) { Let's run it, and… it sucks. The difference in speed compared to `std::gcd` is indeed 2x, but on the other side of the equation. 
This is mainly because of all the branching needed to differentiate between the cases. Let's start optimizing. -First, let's replace all divisions by 2 with divisions by whichever highest power of 2 we can. We can do it efficiently with `__builtin_ctz`, the "count trailing zeros" instruction available on modern CPUs. Whenever we are supposed to divide by 2 in the original algorithm, we will call this function instead, which will give us the exact amount to right-shift the number by. Assuming that the we are dealing with large random numbers, this is expected to decrease the number of iterations by almost a factor 2, because $1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \ldots \to 2$. +First, let's replace all divisions by 2 with divisions by whichever highest power of 2 we can. We can do it efficiently with `__builtin_ctz`, the "count trailing zeros" instruction available on modern CPUs. Whenever we are supposed to divide by 2 in the original algorithm, we will call this function instead, which will give us the exact number of bits to right-shift the number by. Assuming that we are dealing with large random numbers, this is expected to decrease the number of iterations by almost a factor of 2, because $1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \ldots \to 2$. Second, we can notice that condition 2 can now only be true once — in the very beginning — because every other identity leaves at least one of the numbers odd. Therefore we can handle this case just once in the beginning and not consider it in the main loop.
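The two observations in this hunk can be sketched roughly as follows. This is only an illustration under stated assumptions, not the article's final code: the name `binary_gcd` is made up here, both inputs are assumed to be non-negative, and `__builtin_ctz` is the GCC/Clang builtin, whose result is undefined for a zero argument (which is why zeros are filtered out before it is called).

```c++
#include <algorithm>
#include <utility>

// Hypothetical sketch of the binary GCD with the two optimizations described above.
// Assumes a >= 0 and b >= 0; relies on the GCC/Clang builtin __builtin_ctz.
int binary_gcd(int a, int b) {
    if (a == 0) return b;
    if (b == 0) return a;

    int az = __builtin_ctz(a);       // number of trailing zero bits in a
    int bz = __builtin_ctz(b);       // number of trailing zero bits in b
    int shift = std::min(az, bz);    // common power of two, handled once up front

    a >>= az;                        // a stays odd from here on
    while (b != 0) {
        b >>= __builtin_ctz(b);      // drop all factors of two in a single shift
        if (a > b) std::swap(a, b);  // keep a <= b so the difference is non-negative
        b -= a;                      // odd minus odd is even (or zero)
    }
    return a << shift;               // restore the common power of two
}
```

Because both numbers are odd at the point of subtraction, the difference is always even, so every iteration removes at least one bit from `b`.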
diff --git a/content/english/hpc/arithmetic/float.md b/content/english/hpc/arithmetic/float.md index 70217a91..dcc33039 100644 --- a/content/english/hpc/arithmetic/float.md +++ b/content/english/hpc/arithmetic/float.md @@ -9,7 +9,7 @@ The users of floating-point arithmetic deserve one of these IQ bell curve memes - Then they discover that `0.1 + 0.2 != 0.3` or some other quirk like that, freak out, start thinking that some random error term is added to every computation, and for many years avoid any real data types completely. - Then they finally man up, read the specification of how IEEE-754 floats work and start using them appropriately. -Too many people are unfortunately still at stage 2, breeding various misconceptions about floating-point arithmetic — thinking that it is fundamentally imprecise and unstable, and slower than integer arithmetic. +Unfortunately, too many people are still at stage 2, breeding various misconceptions about floating-point arithmetic — thinking that it is fundamentally imprecise and unstable, and slower than integer arithmetic. ![](../img/iq.svg) diff --git a/content/english/hpc/arithmetic/ieee-754.md b/content/english/hpc/arithmetic/ieee-754.md index 06d58e4d..65cc5f48 100644 --- a/content/english/hpc/arithmetic/ieee-754.md +++ b/content/english/hpc/arithmetic/ieee-754.md @@ -52,7 +52,7 @@ Their availability ranges from chip to chip: - Most CPUs support single- and double-precision — which is what `float` and `double` types refer to in C. - Extended formats are exclusive to x86, and are available in C as the `long double` type, which falls back to double precision on arm. The choice of 64 bits for mantissa is so that every `long long` integer can be represented exactly. There is also a 40-bit format that similarly allocates 32 mantissa bits. - Quadruple as well as the 256-bit "octuple" formats are only used for specific scientific computations and are not supported by general-purpose hardware. 
-- Half-precision arithmetic only supports a small subset of operations and is generally used for machine learning applications, especially neural networks, because they tend to do a large amount of calculation, but don't require a high level of precision. +- Half-precision arithmetic only supports a small subset of operations and is generally used for applications such as machine learning, especially neural networks, because they tend to perform large amounts of calculations but don't require high levels of precision. - Half-precision is being gradually replaced by bfloat, which trades off 3 mantissa bits to have the same range as single-precision, enabling interoperability with it. It is mostly being adopted by specialized hardware: TPUs, FGPAs, and GPUs. The name stands for "[Brain](https://en.wikipedia.org/wiki/Google_Brain) float." Lower-precision types need less memory bandwidth to move them around and usually take fewer cycles to operate on (e.g., the division instruction may take $x$, $y$, or $z$ cycles depending on the type), which is why they are preferred when error tolerance allows it. @@ -77,7 +77,7 @@ This is a complex mechanism that deserves an article of its own, but since this ### NaNs, Zeros and Infinities -Floating-point arithmetic often deals with noisy, real-world data, and exceptions there are much more common than in the integer case. For this reason, the default behavior is different. Instead of crashing, the result is substituted with a special value without interrupting the executing, unless the programmer explicitly wants to. +Floating-point arithmetic often deals with noisy, real-world data. Exceptions there are much more common than in the integer case, and for this reason, the default behavior when handling them is different. Instead of crashing, the result is substituted with a special value without interrupting the program execution (unless the programmer explicitly wants it to). 
The first type of such value is the two infinities: a positive and a negative one. They are generated if the result of an operation can't fit within the representable range, and they are treated as such in arithmetic. diff --git a/content/english/hpc/compilation/precalc.md b/content/english/hpc/compilation/precalc.md index 29b31cd6..4a7cb7b7 100644 --- a/content/english/hpc/compilation/precalc.md +++ b/content/english/hpc/compilation/precalc.md @@ -37,7 +37,7 @@ constexpr int fibonacci(int n) { } ``` -There used to be much more limitations in earlier C++ standards, like you could not use any sort of state inside them and had to rely on recursion, so the whole process felt more like Haskell programming rather than C++. Since C++17, you can even compute static arrays using the imperative style, which is useful for precomputing lookup tables: +There used to be many more limitations in earlier C++ standards, like you could not use any sort of state inside them and had to rely on recursion, so the whole process felt more like Haskell programming rather than C++. Since C++17, you can even compute static arrays using the imperative style, which is useful for precomputing lookup tables: ```c++ struct Precalc { diff --git a/content/english/hpc/cpu-cache/alignment.md b/content/english/hpc/cpu-cache/alignment.md index 83d62310..59579467 100644 --- a/content/english/hpc/cpu-cache/alignment.md +++ b/content/english/hpc/cpu-cache/alignment.md @@ -77,7 +77,7 @@ This potentially wastes space but saves a lot of CPU cycles. This trade-off is m ### Optimizing Member Order -Padding is only inserted before a not-yet-aligned member or at the end of the structure. By changing the ordering of members in a structure, it is possible to change the required amount of padding bytes and the total size of the structure. +Padding is only inserted before a not-yet-aligned member or at the end of the structure. 
By changing the ordering of members in a structure, it is possible to change the required number of padding bytes and the total size of the structure. In the previous example, we could reorder the structure members like this: diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md index 7c408228..48bf07b4 100644 --- a/content/english/hpc/data-structures/binary-search.md +++ b/content/english/hpc/data-structures/binary-search.md @@ -313,7 +313,7 @@ Here we query the array of $[1, …, 8]$ for the lower bound of $x=4$. We compar The trick is to notice that, unless the answer is the last element of the array, we compare $x$ against it at some point, and after we've learned that it is not less than $x$, we start comparing $x$ against elements to the left, and all these comparisons evaluate true (that is, leading to the right). Therefore, to restore the answer, we just need to "cancel" some number of right turns. -This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing ones in the binary representation and right-shift $k$ by exactly that amount. To do this, we can invert the number (`~k`) and call the "find first set" instruction: +This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing ones in the binary representation and right-shift $k$ by exactly that. 
To do this, we can invert the number (`~k`) and call the "find first set" instruction: ```c++ int lower_bound(int x) { diff --git a/content/english/hpc/external-memory/locality.md b/content/english/hpc/external-memory/locality.md index eca83766..e61cb5a3 100644 --- a/content/english/hpc/external-memory/locality.md +++ b/content/english/hpc/external-memory/locality.md @@ -174,7 +174,7 @@ The AoS layout is usually preferred for data structures, but SoA still has good This difference in design is important in data processing applications. For example, databases can be either *row-* or *column-oriented* (also called *columnar*): -- *Row-oriented* storage formats are used when you need to search for a limited amount of objects in a large dataset and fetch all or most of their fields. Examples: PostgreSQL, MongoDB. +- *Row-oriented* storage formats are used when you need to search for a limited number of objects in a large dataset and/or fetch all or most of their fields. Examples: PostgreSQL, MongoDB. - *Columnar* storage formats are used for big data processing and analytics, where you need to scan through everything anyway to calculate certain statistics. Examples: ClickHouse, Hbase. Columnar formats have the additional advantage that you can only read the fields that you need, as different fields are stored in separate external memory regions. diff --git a/content/english/hpc/external-memory/sorting.md b/content/english/hpc/external-memory/sorting.md index c7effc46..299da78f 100644 --- a/content/english/hpc/external-memory/sorting.md +++ b/content/english/hpc/external-memory/sorting.md @@ -34,7 +34,7 @@ So far the examples have been simple, and their analysis doesn't differ too much In the standard RAM model, the asymptotic complexity would be multiplied $k$, since we would need to perform $O(k)$ comparisons to fill each next element. 
But in the external memory model, since everything we do in-memory doesn't cost us anything, its asymptotic complexity would not change as long as we can fit $(k+1)$ full blocks in memory, that is, if $k = O(\frac{M}{B})$. -Remember [the $M \gg B$ assumption](../model) when we introduced the computational model? If we have $M \geq B^{1+ε}$ for $\epsilon > 0$, then we can fit any sub-polynomial amount of blocks in memory, certainly including $O(\frac{M}{B})$. This condition is called *tall cache assumption*, and it is usually required in many other external memory algorithms. +Remember [the $M \gg B$ assumption](../model) when we introduced the computational model? If we have $M \geq B^{1+ε}$ for $\epsilon > 0$, then we can fit any sub-polynomial number of blocks in memory, certainly including $O(\frac{M}{B})$. This condition is called *tall cache assumption*, and it is usually required in many other external memory algorithms. ### Merge Sorting @@ -58,7 +58,7 @@ Half of a page ago we have learned that in the external memory model, we can mer Let's sort each block of size $M$ in-memory just as we did before, but during each merge stage, we will split sorted blocks not just in pairs to be merged, but take as many blocks we can fit into our memory during a $k$-way merge. This way the height of the merge tree would be greatly reduced, while each layer would still be done in $O(\frac{N}{B})$ IOPS. -How many sorted arrays can we merge at once? Exactly $k = \frac{M}{B}$, since we need memory for one block for each array. Since the total amount of layers will be reduced to $\log_{\frac{M}{B}} \frac{N}{M}$, the total complexity will be reduced to +How many sorted arrays can we merge at once? Exactly $k = \frac{M}{B}$, since we need memory for one block for each array. 
Since the total number of layers will be reduced to $\log_{\frac{M}{B}} \frac{N}{M}$, the total complexity will be reduced to $$ SORT(N) \stackrel{\text{def}}{=} O\left(\frac{N}{B} \log_{\frac{M}{B}} \frac{N}{M} \right) diff --git a/content/english/hpc/profiling/benchmarking.md b/content/english/hpc/profiling/benchmarking.md index dd543bcc..2be61235 100644 --- a/content/english/hpc/profiling/benchmarking.md +++ b/content/english/hpc/profiling/benchmarking.md @@ -186,4 +186,4 @@ plt.plot(ns, [x / y for x, y in zip(baseline, results)]) plt.show() ``` -Once established, this workflow makes you iterate much faster and just focus on optimizing the algorithm itself. +Once established, this workflow makes you iterate much faster and focus on optimizing the algorithm itself. diff --git a/content/english/hpc/simd/shuffling.md b/content/english/hpc/simd/shuffling.md index 111c34d5..6ff3b749 100644 --- a/content/english/hpc/simd/shuffling.md +++ b/content/english/hpc/simd/shuffling.md @@ -175,7 +175,7 @@ The general idea of our algorithm is as follows: - use this mask to index a lookup table that returns a permutation moving the elements that satisfy the predicate to the beginning of the vector (in their original order); - use the `_mm256_permutevar8x32_epi32` intrinsic to permute the values; - write the whole permuted vector to the buffer — it may have some trailing garbage, but its prefix is correct; -- calculate the population count of the scalar mask and move the buffer pointer by that amount. +- calculate the population count of the scalar mask and move the buffer pointer by that number. 
First, we need to precompute the permutations: From 893772a2538f1592fb1fdc55611267a7effd5868 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 18 May 2022 11:49:06 +0300 Subject: [PATCH 102/173] fix eytzinger example (tnx @tmp-coder) --- content/english/hpc/data-structures/binary-search.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md index 48bf07b4..babe0092 100644 --- a/content/english/hpc/data-structures/binary-search.md +++ b/content/english/hpc/data-structures/binary-search.md @@ -286,7 +286,9 @@ This function takes the current node number `k`, recursively writes out all elem Despite being recursive, it is actually quite fast as all the memory reads are sequential, and the memory writes are only in $O(\log n)$ different memory blocks at a time. -Note that the Eytzinger array is one-indexed — this will be important for performance later. You can put in the zeroth element the value that you want to be returned in the case when the lower bound doesn't exist (similar to `a.end()` for `std::lower_bound`). +Note that this traversal and the resulting permutation are not exactly equivalent to the "tree" of vanilla binary search: for example, the left child subtree may be larger than the right child subtree — and even more than just by one node — but it doesn't matter since both approaches result in the same logarithmic tree depth. + +Also note that the Eytzinger array is one-indexed — this will be important for performance later. You can put in the zeroth element the value that you want to be returned in the case when the lower bound doesn't exist (similar to `a.end()` for `std::lower_bound`). 
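To make the two notes above concrete, here is a small sketch of building the one-indexed layout from a sorted array. It is an assumption-laden illustration rather than the article's own code: the helper names `eytzinger_fill` and `eytzinger_layout` are hypothetical, and slot 0 is simply left as zero where the article suggests storing a sentinel.

```c++
#include <vector>

// Hypothetical helper: in-order traversal of the implicit tree rooted at node k.
// i is the index of the next unused element of the sorted array a;
// the function returns the updated index.
static int eytzinger_fill(const std::vector<int>& a, std::vector<int>& t, int i, int k) {
    int n = static_cast<int>(a.size());
    if (k <= n) {
        i = eytzinger_fill(a, t, i, 2 * k);      // left subtree takes the smaller elements
        t[k] = a[i++];                           // then the node itself
        i = eytzinger_fill(a, t, i, 2 * k + 1);  // then the right subtree
    }
    return i;
}

// Returns the one-indexed Eytzinger layout of a sorted array; t[0] is a spare slot.
std::vector<int> eytzinger_layout(const std::vector<int>& a) {
    std::vector<int> t(a.size() + 1, 0);
    eytzinger_fill(a, t, 0, 1);
    return t;
}
```

For the sorted array $[1, …, 8]$ this produces `5 3 7 2 4 6 8 1` in positions 1 to 8. For five elements the result is `4 2 5 1 3`: the root's left subtree holds three nodes and the right one holds a single node, which is the kind of imbalance the first note mentions.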
### Search Implementation @@ -302,18 +304,18 @@ The only problem arises when we need to restore the index of the resulting eleme ``` array: 1 2 3 4 5 6 7 8 -eytzinger: 4 2 5 1 6 3 7 8 +eytzinger: 5 3 7 2 4 6 8 1 1st range: --------------- k := 1 2nd range: ------- k := 2*k (=2) 3rd range: --- k := 2*k + 1 (=5) -4th range: - k := 2*k + 1 (=11) +4th range: - k := 2*k (=10) ``` -Here we query the array of $[1, …, 8]$ for the lower bound of $x=4$. We compare it against $4$, $2$, and $5$, go left-right-right, and end up with $k = 11$, which isn't even a valid array index. +Here we query the array of $[1, …, 8]$ for the lower bound of $x=4$. We compare it against $5$, $3$, and $4$, go left-right-left, and end up with $k = 10$, which isn't even a valid array index. The trick is to notice that, unless the answer is the last element of the array, we compare $x$ against it at some point, and after we've learned that it is not less than $x$, we start comparing $x$ against elements to the left, and all these comparisons evaluate true (that is, leading to the right). Therefore, to restore the answer, we just need to "cancel" some number of right turns. -This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing ones in the binary representation and right-shift $k$ by exactly that. To do this, we can invert the number (`~k`) and call the "find first set" instruction: +This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing 1s in the binary representation and right-shift $k$ by exactly that number of bits. 
To do this, we can invert the number (`~k`) and call the "find first set" instruction: ```c++ int lower_bound(int x) { From b82fb8fa10e5eaac97e6111016f9886464b3135c Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Thu, 19 May 2022 13:44:01 +0300 Subject: [PATCH 103/173] on optimizing latency and efficiency --- content/english/hpc/complexity/levels.md | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/content/english/hpc/complexity/levels.md b/content/english/hpc/complexity/levels.md index 281bdea2..b1e29e2e 100644 --- a/content/english/hpc/complexity/levels.md +++ b/content/english/hpc/complexity/levels.md @@ -40,3 +40,25 @@ Programmers can be put in several "levels" in terms of their software optimizati In this book, we expect that the average reader is somewhere around stage 1, and hopefully by the end of it will get to 4. You should also go through these levels when designing algorithms. First get it working in the first place, then select a bunch of reasonably asymptotically optimal algorithm. Then think about how they are going to work in terms of their memory operations or ability to execute in parallel (even if you consider single-threaded programs, there is still going to be plenty of parallelism inside a core, so this model is extremely ), and then proceed toward actual implementation. Avoid premature optimization, as Knuth once said. + +--- + +For most web services, efficiency doesn't matter, but *latency* does. + +Increasing efficiency is not how it is done nowadays. + +A pageview usually generates somewhere on the order of 0.1 to 1 cent. This is a typical rate at which you monetize user attention. Say, if I simply installed AdSense, I'd be getting something like that — depending on where most of my readers are from and how many of them are using an ad blocker.
+At the same time, a server with a dedicated core and 1 GB of RAM (which is an absurdly large amount of resources for a simple web service) costs around one millionth per second when amortized. You could fetch 100 photos with that. + +Amazon had an experiment where they A/B tested their service with artificial delays and found out that a 100ms delay decreased revenue. The same holds for most other services: say, if users lose their "flow" on Twitter, they are likely to start thinking about something else and leave. If the delay at Google is more than a few seconds, people will just think that Google isn't working and quit. + +Latency can usually be minimized with parallel computing, which is why distributed systems focus more on scalability. This part of the book is concerned with improving the *efficiency* of algorithms, which makes latency lower as a by-product. + +However, there are still use cases where there is a trade-off between quality and the cost of servers. + +- Search is hierarchical. There are usually many layers of more accurate but slower models. The more documents you rank on each layer, the better the final quality. +- Games. They are more enjoyable at a large scale, but the required computational power also increases. This includes AI. +- AI workloads — those that have large quantities of data such as language models. Heavier models require more compute. The bottleneck in them is not the amount of data but efficiency. + +Inherently sequential algorithms, or cases when the resources are constrained. Ctrl+f'ing a large PDF is painful. Factorization.
From 25333d550985213cdde5a743f5d6e4862207e4ce Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Fri, 20 May 2022 06:37:07 +0300 Subject: [PATCH 104/173] estimating performance engineering impact --- content/english/hpc/complexity/levels.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/content/english/hpc/complexity/levels.md b/content/english/hpc/complexity/levels.md index b1e29e2e..e2e8b58f 100644 --- a/content/english/hpc/complexity/levels.md +++ b/content/english/hpc/complexity/levels.md @@ -62,3 +62,17 @@ However, there are still use cases when there is a trade-off between quality and - AI workloads — those that have large quantities of data such as language models. Heavier models require more compute. The bottleneck in them is not the number of data, but efficiencty. Inherently sequential algorithms, or cases when the resources are constrained. Ctrl+f'ing a large PDF is painful. Factorization. + +## Estimating the impact + +Sometime the optimization needs to happen in the calling layer. + +SIMDJSON speeds up JSON parsing, but it may be better to not use JSON in the first place. + +Protobuf or flat binary formats. + +There is also a chicken and egg problem: people don't use an approach that much because it is slow and not feasible. + +Cost to implement, bugs, maintainability. It is perfectly fine that most software in the world is inefficient. + +What does it mean to be a better programmer? Faster programs? Faster speed of work? Fewer bugs? It is a combination of those. 
From bf8a1e817963151180f6f7352e3d979b1f6f7f33 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Fri, 20 May 2022 06:50:32 +0300 Subject: [PATCH 105/173] how to read this book --- content/english/hpc/complexity/levels.md | 2 ++ content/english/hpc/preface.md | 16 ++++++++++++++++ 2 files changed, 18 insertions(+) diff --git a/content/english/hpc/complexity/levels.md b/content/english/hpc/complexity/levels.md index e2e8b58f..981d467c 100644 --- a/content/english/hpc/complexity/levels.md +++ b/content/english/hpc/complexity/levels.md @@ -76,3 +76,5 @@ There is also a chicken and egg problem: people don't use an approach that much Cost to implement, bugs, maintainability. It is perfectly fine that most software in the world is inefficient. What does it mean to be a better programmer? Faster programs? Faster speed of work? Fewer bugs? It is a combination of those. + +Implementing compiler optimizations or databases are examples of high-leverage activities because they act as a tax on everything else — which is why you see most people writing books on these particular topics rather than software optimization in general. diff --git a/content/english/hpc/preface.md b/content/english/hpc/preface.md index 28adae07..2e18e715 100644 --- a/content/english/hpc/preface.md +++ b/content/english/hpc/preface.md @@ -19,6 +19,22 @@ There are a lot of forward references I couldn't get rid of. Read some of the SIMD and memory chapter first. +Chapter 1 is a "why you should care" sort of read. + +Chapter 2 is an introduction to computer architectures from the perspective of performance. There is a high chance that you already know it from a college course, but I still advise to read it to get into context, as we will cover assembly-level optimization techniques there. + +Chapter 3 is where experienced programmers should start from. + +Chapter 4 discusses compilation with the example of C++ and GCC/Clang. Chapter 5 discusses language-agnostic profiling methods. You are free to skip both. 
+Chapter 6 discusses arithmetic and chapter 7 discusses modular arithmetic and its applications. They also act as a sort of reference for algorithms in the case studies. + +Chapter 8 introduces the external memory model and how the memory system works. Chapter 9 follows up with experimental studies of how it can affect performance. + +Chapter 10 discusses SIMD programming, which is a major part. It is not *that* intertwined with the previous ones, and if you are feeling comfortable, I'd suggest that you start reading with it because it will teach you powerful techniques right away. + +Chapters 11-12 contain case studies of complex algorithms. Performance engineering is a practical field, so you should learn from major examples. + The first 5 chapters build up general understanding of performance. Chapters 6-10 go deeper into modern features. Arithmetic, number theory (the techniques that are also relevant outside of it). Some are theoretic, and then applied in practice. From 22ad3b1ff984da97081d26b4ef60f9b7c7137a24 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Mon, 23 May 2022 10:25:34 +0300 Subject: [PATCH 106/173] fix formatting --- .../russian/cs/factorization/eratosthenes.md | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/content/russian/cs/factorization/eratosthenes.md b/content/russian/cs/factorization/eratosthenes.md index 02e72c0e..acf47749 100644 --- a/content/russian/cs/factorization/eratosthenes.md +++ b/content/russian/cs/factorization/eratosthenes.md @@ -12,10 +12,10 @@ published: true Основная идея соответствует названию алгоритма: запишем ряд чисел $1, 2,\ldots, n$, а затем будем вычеркивать -* сначала числа, делящиеся на $2$, кроме самого числа $2$, -* потом числа, делящиеся на $3$, кроме самого числа $3$, -* с числами, делящимися на $4$, ничего делать не будем — мы их уже вычёркивали, -* потом продолжим вычеркивать числа, делящиеся на $5$, кроме самого числа $5$, +- сначала числа, делящиеся на $2$, кроме
самого числа $2$, +- потом числа, делящиеся на $3$, кроме самого числа $3$, +- с числами, делящимися на $4$, ничего делать не будем — мы их уже вычёркивали, +- потом продолжим вычеркивать числа, делящиеся на $5$, кроме самого числа $5$, …и так далее. @@ -23,10 +23,10 @@ published: true ```c++ vector sieve(int n) { - vector is_prime(n+1, true); + vector is_prime(n + 1, true); for (int i = 2; i <= n; i++) if (is_prime[i]) - for (int j = 2*i; j <= n; j += i) + for (int j = 2 * i; j <= n; j += i) is_prime[j] = false; return is_prime; } @@ -49,7 +49,6 @@ $$ У исходного алгоритма асимптотика должна быть ещё лучше. Чтобы найти её точнее, нам понадобятся два факта про простые числа: 1. Простых чисел от $1$ до $n$ примерно $\frac{n}{\ln n}$ . - 2. Простые числа распределены без больших «разрывов» и «скоплений», то есть $k$-тое простое число примерно равно $k \ln k$. Мы можем упрощённо считать, что число $k$ является простым с «вероятностью» $\frac{1}{\ln n}$. Тогда, время работы алгоритма можно более точнее оценить как @@ -65,11 +64,11 @@ $$ ## Линейное решето -Основная проблема решета Эратосфена состоит в том, что некоторые числа мы будем помечать как составные несколько раз — а именно столько раз, сколько у них различных простых делителей. Чтобы достичь линейного времени работы, нам нужно придумать способ, как рассматривать все составные числа ровно один раз. +Основная проблема решета Эратосфена состоит в том, что некоторые числа мы будем помечать как составные несколько раз — столько, сколько у них различных простых делителей. Чтобы достичь линейного времени работы, нам нужно придумать способ, как рассматривать все составные числа ровно один раз. Обозначим за $d(k)$ минимальный простой делитель числа $k$ и заметим следующий факт: у составного числа $k$ есть единственное представление $k = d(k) \cdot r$, и при этом у числа $r$ нет простых делителей меньше $d(k)$. 
-Идея оптимизации состоит в том, чтобы перебирать этот $r$, и для каждого перебирать только нужные множители — а именно все от $2$ до $d(r)$ включительно. +Идея оптимизации состоит в том, чтобы перебирать этот $r$, и для каждого перебирать только нужные множители — а именно, все от $2$ до $d(r)$ включительно. ### Алгоритм From a80dc4f9e0aeb28a873c5c3347626477cb45ac42 Mon Sep 17 00:00:00 2001 From: Timofey Date: Tue, 24 May 2022 13:11:02 +0300 Subject: [PATCH 107/173] Update products.md --- content/russian/cs/geometry-basic/products.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/content/russian/cs/geometry-basic/products.md b/content/russian/cs/geometry-basic/products.md index a4e1a3d5..ca0a5dd3 100644 --- a/content/russian/cs/geometry-basic/products.md +++ b/content/russian/cs/geometry-basic/products.md @@ -1,6 +1,7 @@ --- title: Скалярное и векторное произведение weight: 2 +published: true --- Помимо очевидных сложения, вычитания и умножения на константу, у векторов можно ввести и свои особенные операции, которые нам упростят жизнь. @@ -42,7 +43,7 @@ $$ Так же, как и со скалярным произведением, доказательство координатной формулы оставляется упражнением читателю. Если кто-то захочет это сделать: это следует из линейности обоих произведений (что в свою очередь тоже нужно доказать) и разложения и разложения по базисным векторам $\overline{(0, 1)}$ и $\overline{(1, 0)}$. 
-Геометрически, это ориентированный объем параллелограмма, натянутого на вектора $a$ и $b$: +Геометрически, это ориентированная площадь параллелограмма, натянутого на вектора $a$ и $b$: ![](../img/cross.jpg) From 9ef98a1c9f5b4a103e68d9db4d44634753e9378e Mon Sep 17 00:00:00 2001 From: Timofey Date: Tue, 24 May 2022 13:27:57 +0300 Subject: [PATCH 108/173] http://www.gramota.ru/slovari/dic/?word=%D0%B2%D0%B5%D0%BA%D1%82%D0%BE%D1%80&all=x --- content/russian/cs/geometry-basic/vectors.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/content/russian/cs/geometry-basic/vectors.md b/content/russian/cs/geometry-basic/vectors.md index 05051396..ee1a052a 100644 --- a/content/russian/cs/geometry-basic/vectors.md +++ b/content/russian/cs/geometry-basic/vectors.md @@ -1,6 +1,7 @@ --- -title: Точки и векторы +title: Точки и вектора weight: 1 +published: true --- Отрезок, для которого указано, какой из его концов считается началом, а какой концом, называется *вектором*. Вектор на плоскости можно задать двумя числами — его координатами по горизонтали и вертикали. From 689bc2ee0615285809b0086df388dc5b3dafcfbc Mon Sep 17 00:00:00 2001 From: Timofey Date: Tue, 24 May 2022 14:47:45 +0300 Subject: [PATCH 109/173] =?UTF-8?q?=D0=9E=D0=BF=D0=B8=D1=81=D0=BA=D0=B0?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- content/russian/cs/geometry-basic/products.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/russian/cs/geometry-basic/products.md b/content/russian/cs/geometry-basic/products.md index ca0a5dd3..488dbca6 100644 --- a/content/russian/cs/geometry-basic/products.md +++ b/content/russian/cs/geometry-basic/products.md @@ -41,7 +41,7 @@ $$ a \times b = |a| \cdot |b| \cdot \sin \theta = x_a y_b - y_a x_b $$ -Так же, как и со скалярным произведением, доказательство координатной формулы оставляется упражнением читателю. 
Если кто-то захочет это сделать: это следует из линейности обоих произведений (что в свою очередь тоже нужно доказать) и разложения и разложения по базисным векторам $\overline{(0, 1)}$ и $\overline{(1, 0)}$. +Так же, как и со скалярным произведением, доказательство координатной формулы оставляется упражнением читателю. Если кто-то захочет это сделать: это следует из линейности обоих произведений (что в свою очередь тоже нужно доказать) и разложения по базисным векторам $\overline{(0, 1)}$ и $\overline{(1, 0)}$. Геометрически, это ориентированная площадь параллелограмма, натянутого на вектора $a$ и $b$: @@ -66,7 +66,7 @@ int operator^(r a, r b) { return a.x*b.y - b.x*a.y; } Скалярное и векторное произведения тесно связаны с углами между векторами и могут использоваться для подсчета величин вроде ориентированных углов и площадей, которые обычно используются для разных проверок. -Когда они уже реализованы, использовать произведения гораздо проще, чем опираться на алгебру. Например, можно легко угол между двумя векторами, подставив в знакомый нам `atan2` векторное и скалярное произведение: +Когда они уже реализованы, использовать произведения гораздо проще, чем опираться на алгебру. 
Например, можно легко вычислить угол между двумя векторами, подставив в знакомый нам `atan2` векторное и скалярное произведение: ```c++ double angle(r a, r b) { From 2638bf74c962ab81b0515318e832d0a9451e4b7c Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 24 May 2022 21:04:56 +0300 Subject: [PATCH 110/173] fix formatting --- content/russian/cs/interactive/answer-search.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/russian/cs/interactive/answer-search.md b/content/russian/cs/interactive/answer-search.md index 28e4b4bc..0b38ce24 100644 --- a/content/russian/cs/interactive/answer-search.md +++ b/content/russian/cs/interactive/answer-search.md @@ -66,7 +66,7 @@ int solve() { Здесь, в отличие от предыдущей задачи, кажется, существует прямое решение с формулой. Но вместо того, чтобы о нем думать, можно просто свести задачу к обратной. Давайте подумаем, как по числу минут $t$ (ответу) понять, сколько листов напечатается за это время? Очень легко: $$ -\lfloor\frac{t}{x}\rfloor + \lfloor\frac{t}{y}\rfloor +\left \lfloor \frac{t}{x} \right \rfloor + \left \lfloor \frac{t}{y} \right \rfloor $$ -Ясно, что за $0$ минут $n$ листов распечатать нельзя, а за $xn$ минут один только первый принтер успеет напечатать $n$ листов. Поэтому $0$ и $xn$ — это подходящие изначальные границы для бинарного поиска. +Ясно, что за $0$ минут $n$ листов распечатать нельзя, а за $x \cdot n$ минут один только первый принтер успеет напечатать $n$ листов. Поэтому $0$ и $xn$ — это подходящие изначальные границы для бинарного поиска. 
From f81a9ea614b579811ba6b8e7edf9a112c601b441 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 24 May 2022 21:05:09 +0300 Subject: [PATCH 111/173] factorization code --- .../english/hpc/algorithms/factorization.md | 243 +++++++++++++++++- 1 file changed, 242 insertions(+), 1 deletion(-) diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md index 4ff8061d..7c2d8aa7 100644 --- a/content/english/hpc/algorithms/factorization.md +++ b/content/english/hpc/algorithms/factorization.md @@ -14,7 +14,6 @@ Integer factorization is interesting because of RSA problem. - Less than 10^100: Quadratic Sieve - More than 10^100: General Number Field Sieve - and do other computations such as computing the greatest common multiple (given that it is not even so that ) (since $\gcd(n, r) = 1$) For all methods, we will implement `find_factor` function which returns one divisor ot 1. You can apply it recurively to get the factorization, so whatever asymptotic you had won't affect it: @@ -32,6 +31,23 @@ vector factorize(u64 n) { } ``` +0.056024 +2043.968140 + +```c++ +typedef __uint16_t u16; +typedef __uint32_t u32; +typedef __uint64_t u64; +typedef __uint128_t u128; + +u64 find_factor(u64 n) { + for (u64 d = 2; d * d <= n; d++) + if (n % d == 0) + return d; + return 1; +} +``` + ## Trial division This is the most basic algorithm to find a prime factorization. @@ -193,3 +209,228 @@ This is exactly the type of problem when we need specific knowledge, because we ## Further optimizations Существуют также [субэкспоненциальные](https://ru.wikipedia.org/wiki/%D0%A4%D0%B0%D0%BA%D1%82%D0%BE%D1%80%D0%B8%D0%B7%D0%B0%D1%86%D0%B8%D1%8F_%D1%86%D0%B5%D0%BB%D1%8B%D1%85_%D1%87%D0%B8%D1%81%D0%B5%D0%BB#%D0%A1%D1%83%D0%B1%D1%8D%D0%BA%D1%81%D0%BF%D0%BE%D0%BD%D0%B5%D0%BD%D1%86%D0%B8%D0%B0%D0%BB%D1%8C%D0%BD%D1%8B%D0%B5_%D0%B0%D0%BB%D0%B3%D0%BE%D1%80%D0%B8%D1%82%D0%BC%D1%8B), но не полиномиальные алгоритмы факторизации. 
Человечество [умеет](https://en.wikipedia.org/wiki/Integer_factorization_records) факторизовывать числа порядка $2^{200}$. + + +--- + +If you have limited time, you should probably compute as much forward as possible, and then half the time computing the other. + +How to optimize for the *average* case is unclear. + +0.087907 +3964.321045 + +```c++ +u64 find_factor(u64 n) { + if (n % 2 == 0) + return 2; + for (u64 d = 3; d * d <= n; d += 2) + if (n % d == 0) + return d; + return 1; +} +``` + +0.199740 +7615.217773 + +```c++ +u64 find_factor(u64 n) { + for (u64 d : {2, 3, 5}) + if (n % d == 0) + return d; + u64 increments[] = {0, 4, 6, 10, 12, 16, 22, 24}; + for (u64 d = 7; d * d <= n; d += 30) { + for (u64 k = 0; k < 8; k++) { + u64 x = d + increments[k]; + if (n % x == 0) + return x; + } + } + return 1; +} +``` + +19430.058594 + +```c++ +const int N = (1 << 16); + +struct Precalc { + u16 primes[6542]; // # of primes under N=2^16 + + constexpr Precalc() : primes{} { + bool marked[N] = {}; + int n_primes = 0; + + for (int i = 2; i < N; i++) { + if (!marked[i]) { + primes[n_primes++] = i; + for (int j = 2 * i; j < N; j += i) + marked[j] = true; + } + } + } +}; + +constexpr Precalc P{}; + +u64 find_factor(u64 n) { + for (u16 p : P.primes) + if (n % p == 0) + return p; + return 1; +} +``` + +352997.656250 + +```c++ +u64 magic[6542]; +magic[n_primes++] = u64(-1) / i + 1; + +u64 find_factor(u64 n) { + for (u64 m : P.magic) + if (m * n < m) + return u64(-1) / m + 1; + return 1; +} +``` + +Except that it is contant, so the speedup should be twice as much. + +--- + +```c++ +u64 find_factor(u64 n) { + while (true) { + if (u64 g = gcd(randint(2, n - 1), n); g != 1) + return g; + } +} +``` + +99.292641 +25720.164062 almost 15x slower + +```c++ +u64 f(u64 x, u64 a, u64 mod) { + return ((u128) x * x + a) % mod; +} + +u64 diff(u64 a, u64 b) { + // a and b are unsigned and so is their difference, so we can't just call abs(a - b) + return a > b ? 
 a - b : b - a; +} + +u64 rho(u64 n, u64 x0 = 2, u64 a = 1) { + u64 x = x0, y = x0, g = 1; + while (g == 1) { + x = f(x, a, n); + y = f(y, a, n); + y = f(y, a, n); + g = gcd(diff(x, y), n); + } + return g; +} + +u64 find_factor(u64 n) { + return rho(n); +} +``` + +56.745281 + +```c++ +u64 rho(u64 n, u64 x0 = 2, u64 a = 1) { + u64 x = x0, y = x0; + + for (int l = 256; l < (1 << 20); l *= 2) { + x = y; + for (int i = 0; i < l; i++) { + y = f(y, a, n); + if (u64 g = gcd(diff(x, y), n); g != 1) + return g; + } + } + + return 1; +} +``` + +426.389160 + +```c++ +const int M = 1024; + +u64 rho(u64 n, u64 x0 = 2, u64 a = 1) { + u64 x = x0, y = x0, p = 1; + + for (int l = M; l < (1 << 20); l *= 2) { + x = y; + for (int i = 0; i < l; i += M) { + for (int j = 0; j < M; j++) { + y = f(y, a, n); + p = (u128) p * diff(x, y) % n; + } + if (u64 g = gcd(p, n); g != 1) + return g; + } + } + + return 1; +} +``` + +2948.260986 + +```c++ +struct Montgomery { + u64 n, nr; + + Montgomery(u64 n) : n(n) { + nr = 1; + for (int i = 0; i < 6; i++) + nr *= 2 - n * nr; + } + + u64 reduce(u128 x) const { + u64 q = u64(x) * nr; + u64 m = ((u128) q * n) >> 64; + return (x >> 64) + n - m; + } + + u64 multiply(u64 x, u64 y) { + return reduce((u128) x * y); + } +}; + +u64 f(u64 x, u64 a, Montgomery m) { + return m.multiply(x, x) + a; +} + +const int M = 1024; + +u64 rho(u64 n, u64 x0 = 2, u64 a = 1) { + Montgomery m(n); + u64 y = x0; + + for (int l = M; l < (1 << 20); l *= 2) { + u64 x = y, p = 1; + for (int i = 0; i < l; i += M) { + for (int j = 0; j < M; j++) { + y = f(y, a, m); + p = m.multiply(p, diff(x, y)); + } + if (u64 g = gcd(p, n); g != 1) + return g; + } + } + + return 1; +} +``` + +There are slightly more errors because we are a bit loose with modular arithmetic here. The error rate grows higher when we increase and decrease (due to overflows). 
+ +788.4861246275735 From 968abd50c4b267ab6a7b27e991b02267c1d08518 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 24 May 2022 23:02:47 +0300 Subject: [PATCH 112/173] factorization intro --- .../english/hpc/algorithms/factorization.md | 70 +++++++++++++------ 1 file changed, 48 insertions(+), 22 deletions(-) diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md index 7c2d8aa7..8baf4aaf 100644 --- a/content/english/hpc/algorithms/factorization.md +++ b/content/english/hpc/algorithms/factorization.md @@ -4,42 +4,59 @@ weight: 3 draft: true --- -Integer factorization is interesting because of RSA problem. +The problem of factoring integers into primes is central to computational [number theory](/hpc/number-theory/). It has been [studied](https://www.cs.purdue.edu/homes/ssw/chapter3.pdf) since at least the 3rd century BC, and [many methods](https://en.wikipedia.org/wiki/Category:Integer_factorization_algorithms) have been developed that are efficient for different inputs. -"How big are your numbers?" determines the method to use: +In this case study, we specifically consider the factorization of *word-sized* integers: those on the order of $10^9$ and $10^{18}$. Untypical for this book, in this one, you may actually learn an asymptotically better algorithm: we start with a few basic approaches, and then gradually build up to the $O(\sqrt[4]{n})$-time *Pollard's rho algorithm* and optimize it to the point where it can factorize 60-bit semiprimes in 0.3-0.4ms, which is almost 4x faster than the previous state-of-the-art. -- Less than 2^16 or so: Lookup table. -- Less than 2^70 or so: Richard Brent's modification of Pollard's rho algorithm. 
-- Less than 10^50: Lenstra elliptic curve factorization -- Less than 10^100: Quadratic Sieve -- More than 10^100: General Number Field Sieve + -and do other computations such as computing the greatest common multiple (given that it is not even so that ) (since $\gcd(n, r) = 1$) +### Benchmark -For all methods, we will implement `find_factor` function which returns one divisor ot 1. You can apply it recurively to get the factorization, so whatever asymptotic you had won't affect it: +For all methods, we will implement a `find_factor` function that takes a positive integer $n$ and returns its smallest non-trivial divisor (or `1` if the number is prime): ```c++ -typedef uint32_t u32; -typedef uint64_t u64; +// I don't feel like typing "unsigned long long" each time +typedef __uint16_t u16; +typedef __uint32_t u32; +typedef __uint64_t u64; typedef __uint128_t u128; +u64 find_factor(u64 n); +``` + +To find the full factorization, you can apply it to $n$, divide $n$ by the factor found, and continue until a new factor can no longer be found: + +```c++ vector<u64> factorize(u64 n) { - vector<u64> res; - while (int d = find_factor(n); d > 1) // does it work? - res.push_back(d); - return res; + vector<u64> factorization; + while (n > 1) { + u64 d = find_factor(n); + if (d == 1) + d = n; // no factor found: n itself is prime + factorization.push_back(d); + n /= d; + } + return factorization; } ``` +Since after each removed factor the problem becomes considerably smaller and simpler, the worst-case running time of full factorization is equal to the worst-case running time of a `find_factor` call. + +For many factorization algorithms, including those presented in this article, the running time scales with the least prime factor. Therefore, to provide worst-case input, we use *semiprimes:* products of two prime numbers $p \le q$ that are on the same order of magnitude. To generate a $k$-bit semiprime, we generate two random $\lfloor k / 2 \rfloor$-bit primes. 
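One way to generate such inputs can be sketched as follows. This is only an illustration under stated assumptions: the helper names `is_prime`, `random_prime`, and `random_semiprime` are hypothetical, primality is tested by plain trial division (adequate for primes of up to 32 bits), and the generator used for the actual benchmark may differ:

```c++
#include <cstdint>
#include <random>

typedef uint64_t u64;

// Deterministic trial-division primality test (hypothetical helper);
// fast enough for the <= 32-bit primes needed to build 64-bit semiprimes
bool is_prime(u64 x) {
    if (x < 2)
        return false;
    for (u64 d = 2; d * d <= x; d++)
        if (x % d == 0)
            return false;
    return true;
}

// Returns a random prime with exactly `bits` bits, assuming 2 <= bits <= 32
u64 random_prime(int bits, std::mt19937_64 &rng) {
    u64 lo = u64(1) << (bits - 1), hi = (u64(1) << bits) - 1;
    std::uniform_int_distribution<u64> dist(lo, hi);
    while (true) {
        u64 x = dist(rng); // rejection sampling: retry until x is prime
        if (is_prime(x))
            return x;
    }
}

// Product of two random floor(k / 2)-bit primes, assuming 4 <= k <= 64
u64 random_semiprime(int k, std::mt19937_64 &rng) {
    u64 p = random_prime(k / 2, rng);
    u64 q = random_prime(k / 2, rng);
    return p * q;
}
```

Note that, for even $k$, the product of two $k/2$-bit primes has either $k - 1$ or $k$ bits, so "a $k$-bit semiprime" is slightly loose terminology here.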
+ +Since some of the algorithms are inherently randomized, we also tolerate a small (<1%) percentage of errors, although they can be reduced to almost zero without significant performance penalties. + +### Trial division + +Trial division was first described by Fibonacci in 1202. Although it was probably known to animals. Perhaps some animals can factor? The scientific priority probably belongs to dinosaurs or ancient fish trying to divvy stuff up. + +I tried finding references to who invented trial division, but probably it was known to animals long before to split into equal parts. + 0.056024 2043.968140 ```c++ -typedef __uint16_t u16; -typedef __uint32_t u32; -typedef __uint64_t u64; -typedef __uint128_t u128; - u64 find_factor(u64 n) { for (u64 d = 2; d * d <= n; d++) if (n % d == 0) @@ -48,8 +65,6 @@ u64 find_factor(u64 n) { } ``` -## Trial division - This is the most basic algorithm to find a prime factorization. We divide by each possible divisor $d$. @@ -434,3 +449,14 @@ u64 rho(u64 n, u64 x0 = 2, u64 a = 1) { There are slightly more errors because we are a bit loose with modular arithmetic here. The error rate grows higher when we increase and decrease (due to overflows). 788.4861246275735 + +### Larger Numbers + +"How big are your numbers?" determines the method to use: + + +- Less than 2^16 or so: Lookup table. +- Less than 2^70 or so: Richard Brent's modification of Pollard's rho algorithm. 
+- Less than 10^50: Lenstra elliptic curve factorization +- Less than 10^100: Quadratic Sieve +- More than 10^100: General Number Field Sieve From 6f0850d4fc30ba584623340bc8ad3a774b8ff8e9 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 25 May 2022 14:10:08 +0300 Subject: [PATCH 113/173] centered codeblock --- content/english/hpc/arithmetic/newton.md | 6 +++--- content/english/hpc/data-structures/binary-search.md | 12 ++++++------ content/russian/cs/numerical/newton.md | 6 +++--- themes/algorithmica/assets/style.sass | 8 +++++++- .../_default/_markup/render-codeblock-center.html | 3 +++ 5 files changed, 22 insertions(+), 13 deletions(-) create mode 100644 themes/algorithmica/layouts/_default/_markup/render-codeblock-center.html diff --git a/content/english/hpc/arithmetic/newton.md b/content/english/hpc/arithmetic/newton.md index 38bcddda..de42104c 100644 --- a/content/english/hpc/arithmetic/newton.md +++ b/content/english/hpc/arithmetic/newton.md @@ -68,9 +68,9 @@ The algorithm converges for many functions, although it does so reliably and pro Let's run a few iterations of Newton's method to find the square root of $2$, starting with $x_0 = 1$, and check how many digits it got correct after each iteration: -
      -1
      -1.5
      +
      +1.0000000000000000000000000000000000000000000000000000000000000
      +1.5000000000000000000000000000000000000000000000000000000000000
       1.4166666666666666666666666666666666666666666666666666666666675
       1.4142156862745098039215686274509803921568627450980392156862745
       1.4142135623746899106262955788901349101165596221157440445849057
      diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
      index babe0092..8a4924ea 100644
      --- a/content/english/hpc/data-structures/binary-search.md
      +++ b/content/english/hpc/data-structures/binary-search.md
      @@ -302,12 +302,12 @@ while (k <= n)
       
       The only problem arises when we need to restore the index of the resulting element, as $k$ may end up not pointing to a leaf node. Here is an example of how that can happen:
       
      -```
      -    array:  1 2 3 4 5 6 7 8
      -eytzinger:  5 3 7 2 4 6 8 1
      -1st range:  ---------------  k := 1
      -2nd range:  -------          k := 2*k      (=2)
      -3rd range:      ---          k := 2*k + 1  (=5)
      +```center
      +    array:  1 2 3 4 5 6 7 8                     
      +eytzinger:  5 3 7 2 4 6 8 1                     
      +1st range:  ---------------  k := 1             
      +2nd range:  -------          k := 2*k      (=2) 
      +3rd range:      ---          k := 2*k + 1  (=5) 
       4th range:        -          k := 2*k      (=10)
       ```
       
      diff --git a/content/russian/cs/numerical/newton.md b/content/russian/cs/numerical/newton.md
      index 248e1b4e..5426cff5 100644
      --- a/content/russian/cs/numerical/newton.md
      +++ b/content/russian/cs/numerical/newton.md
      @@ -66,9 +66,9 @@ double sqrt(double n) {
       
       Запустим метод Ньютона для поиска квадратного корня $2$, начиная с $x_0 = 1$, и посмотрим, сколько первых цифр оказались правильными после каждой итерации:
       
      -
      -1
      -1.5
      +
      +1.0000000000000000000000000000000000000000000000000000000000000
      +1.5000000000000000000000000000000000000000000000000000000000000
       1.4166666666666666666666666666666666666666666666666666666666675
       1.4142156862745098039215686274509803921568627450980392156862745
       1.4142135623746899106262955788901349101165596221157440445849057
      diff --git a/themes/algorithmica/assets/style.sass b/themes/algorithmica/assets/style.sass
      index a6835c1e..eb5e2410 100644
      --- a/themes/algorithmica/assets/style.sass
      +++ b/themes/algorithmica/assets/style.sass
      @@ -492,7 +492,13 @@ pre
         padding-left: 8px
         font-size: 0.85em
         text-align: left
      -  
      +
      +pre.center-pre
      +  text-align: center
      +  font-size: 1em
      +  background: none
      +  border: none
      +
       .highlight
         margin: 0px
       
      diff --git a/themes/algorithmica/layouts/_default/_markup/render-codeblock-center.html b/themes/algorithmica/layouts/_default/_markup/render-codeblock-center.html
      new file mode 100644
      index 00000000..d263bb5a
      --- /dev/null
      +++ b/themes/algorithmica/layouts/_default/_markup/render-codeblock-center.html
      @@ -0,0 +1,3 @@
      +
      +{{.Inner}}
      +
      From 7297d591846a63f1615ec5415db99d0e5d447e26 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 25 May 2022 14:11:59 +0300 Subject: [PATCH 114/173] bump hugo version --- netlify.toml | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/netlify.toml b/netlify.toml index 1b5ed16e..fb612037 100644 --- a/netlify.toml +++ b/netlify.toml @@ -2,7 +2,7 @@ command = "hugo --gc --minify" [context.production.environment] -HUGO_VERSION = "0.87.0" +HUGO_VERSION = "0.96.0" HUGO_ENV = "production" HUGO_ENABLEGITINFO = "true" @@ -10,20 +10,20 @@ HUGO_ENABLEGITINFO = "true" command = "hugo --gc --minify --enableGitInfo" [context.split1.environment] -HUGO_VERSION = "0.87.0" +HUGO_VERSION = "0.96.0" HUGO_ENV = "production" [context.deploy-preview] command = "hugo --gc --minify --buildFuture -b $DEPLOY_PRIME_URL" [context.deploy-preview.environment] -HUGO_VERSION = "0.87.0" +HUGO_VERSION = "0.96.0" [context.branch-deploy] command = "hugo --gc --minify -b $DEPLOY_PRIME_URL" [context.branch-deploy.environment] -HUGO_VERSION = "0.87.0" +HUGO_VERSION = "0.96.0" [context.next.environment] HUGO_ENABLEGITINFO = "true" From 251dd08c54db23dac6a977a84ea4a60ced3c9532 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 25 May 2022 16:38:32 +0300 Subject: [PATCH 115/173] wheel and lookup factorization --- .../english/hpc/algorithms/factorization.md | 259 ++++++++---------- 1 file changed, 118 insertions(+), 141 deletions(-) diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md index 8baf4aaf..9f7958ed 100644 --- a/content/english/hpc/algorithms/factorization.md +++ b/content/english/hpc/algorithms/factorization.md @@ -15,7 +15,7 @@ Unlike other case studies of this book, in this one you will actually learn an a ### Benchmark -For all methods, we will implement `find_factor` function that takes a positive integer $n$ and returns either its smallest divisor (or `1` if the number is prime): +For all methods, 
 we will implement a `find_factor` function that takes a positive integer $n$ and returns any of its non-trivial divisors (or `1` if the number is prime): ```c++ // I don't feel like typing "unsigned long long" each time @@ -45,35 +45,30 @@ Since after each removed factor the problem becomes considerably smaller and sim For many factorization algorithms, including those presented in this article, the running time scales with the least prime factor. Therefore, to provide worst-case input, we use *semiprimes:* products of two prime numbers $p \le q$ that are on the same order of magnitude. To generate a $k$-bit semiprime, we generate two random $\lfloor k / 2 \rfloor$-bit primes. -Since some of the algorithms are inherently randomized, we also tolerate a small (<1%) percentage of errors, although they can be reduced to almost zero without significant performance penalties. +Since some of the algorithms are inherently randomized, we also tolerate a small (<1%) percentage of false negative errors (when `find_factor` returns `1` despite $n$ being composite), although this rate can be reduced to almost zero without significant performance penalties. ### Trial division -Trial division was first described by Fibonacci in 1202. Although it was probably known to animals. Perhaps some animals can factor? The scientific priority probably belongs to dinosaurs or ancient fish trying to divvy stuff up. + + +The most basic approach is to try every number less than $n$ as a divisor: ```c++ u64 find_factor(u64 n) { - for (u64 d = 2; d * d <= n; d++) + for (u64 d = 2; d < n; d++) if (n % d == 0) return d; return 1; } ``` -This is the most basic algorithm to find a prime factorization. - -We divide by each possible divisor $d$. -We can notice, that it is impossible that all prime factors of a composite number $n$ are bigger than $\sqrt{n}$. -Therefore, we only need to test the divisors $2 \le d \le \sqrt{n}$, which gives us the prime factorization in $O(\sqrt{n})$. 
- +-The smallest divisor has to be a prime number. -We remove the factor from the number, and repeat the process. -If we cannot find any divisor in the range $[2; \sqrt{n}]$, then the number itself has to be prime. +One simple optimization is to notice that it is enough to only check divisors that do not exceed $\sqrt n$. This works because if $n$ is divisible by $d > \sqrt n$, then it is also divisible by $\frac{n}{d} < \sqrt n$, so we don't have to check $d$ separately. ```c++ u64 find_factor(u64 n) { @@ -84,13 +79,43 @@ u64 find_factor(u64 n) { } ``` +In our benchmark, $n$ is a semiprime, and we always find the lesser divisor, so both $O(n)$ and $O(\sqrt n)$ implementations perform the same and are able to factorize ~2k 30-bit numbers per second, while taking a whole ~20 seconds to factorize a single 60-bit number. + +### Lookup Table + +Nowadays, you can type `factor 57` in your Linux terminal or Google search bar to get the factorization of any number. But before computers were invented, it was more practical to use *factorization tables:* special books containing factorizations of the first $N$ numbers. + +We can also use this approach and compute such lookup tables [during compile time](/hpc/compilation/precalc/). To save space, it is convenient to only store the smallest divisor of a number, requiring just one byte for a 16-bit integer: + +```c++ +template <int N = (1 << 16)> +struct Precalc { + unsigned char divisor[N]; + + constexpr Precalc() : divisor{} { + for (int i = 0; i < N; i++) + divisor[i] = 1; + for (int i = 2; i * i < N; i++) + if (divisor[i] == 1) + for (int k = i * i; k < N; k += i) + divisor[k] = i; + } +}; + +constexpr Precalc P{}; + +u64 find_factor(u64 n) { + return P.divisor[n]; +} +``` + +This approach can process 3M 16-bit integers per second, although it [probably gets slower](../hpc/cpu-cache/bandwidth/) for larger numbers. 
 While it requires just a few milliseconds and 64KB of memory to calculate and store the divisors of the first $2^{16}$ numbers, it does not scale well for larger inputs. + ### Wheel factorization -This is an optimization of the trial division. -The idea is the following. -Once we know that the number is not divisible by 2, we don't need to check every other even number. -This leaves us with only $50\%$ of the numbers to check. -After checking 2, we can simply start with 3 and skip every other number. +To save paper space, pre-computer era factorization tables typically excluded numbers divisible by 2 and 5: in the decimal numeral system, you can quickly determine whether a number is divisible by 2 or 5 (by looking at its last digit) and keep dividing the number $n$ by 2 or 5 while it is possible, eventually arriving at some entry in the factorization table. This makes the factorization table just ½ × ⅘ = 0.4 of its original size. + +We can apply a similar trick to trial division, first checking if the number is divisible by $2$, and then only checking odd divisors: ```c++ u64 find_factor(u64 n) { @@ -103,24 +128,27 @@ u64 find_factor(u64 n) { } ``` -This method can be extended. -If the number is not divisible by 3, we can also ignore all other multiples of 3 in the future computations. -So we only need to check the numbers $5, 7, 11, 13, 17, 19, 23, \dots$. -We can observe a pattern of these remaining numbers. -We need to check all numbers with $d \bmod 6 = 1$ and $d \bmod 6 = 5$. -So this leaves us with only $33.3\%$ percent of the numbers to check. -We can implement this by checking the primes 2 and 3 first, and then start checking with 5 and alternatively skip 1 or 3 numbers. +With 50% fewer divisions to do, this algorithm works twice as fast, but it can be extended. If the number is not divisible by $3$, we can also ignore all multiples of $3$, and the same goes for all other divisors. 
+ +The problem is, as we increase the number of primes to exclude, it becomes less straightforward to iterate only over the numbers not divisible by them as they follow an irregular pattern — unless the number of primes is small. For example, if we consider $2$, $3$, and $5$, then, among the first $90$ numbers, we only need to check: + +```center +(1,) 7, 11, 13, 17, 19, 23, 29, +31, 37, 41, 43, 47, 49, 53, 59, +61, 67, 71, 73, 77, 79, 83, 89… +``` + +You can notice a pattern: the sequence repeats itself every $30$ numbers because remainder modulo $2 \times 3 \times 5 = 30$ is all we need to determine whether a number is divisible by $2$, $3$, or $5$. This means that we only need to check $8$ specific numbers in every $30$, proportionally improving the performance: ```c++ u64 find_factor(u64 n) { for (u64 d : {2, 3, 5}) if (n % d == 0) return d; - u64 increments[] = {0, 4, 6, 10, 12, 16, 22, 24}; - u64 sum = 30; - for (u64 d = 7; d * d <= n; d += sum) { - for (u64 k = 0; k < 8; k++) { - u64 x = d + increments[k]; + u64 offsets[] = {0, 4, 6, 10, 12, 16, 22, 24}; + for (u64 d = 7; d * d <= n; d += 30) { + for (u64 offset : offsets) { + u64 x = d + offset; if (n % x == 0) return x; } @@ -129,38 +157,80 @@ u64 find_factor(u64 n) { } ``` -We can extend this even further. -Here is an implementation for the prime number 2, 3 and 5. -It's convenient to use an array to store how much we have to skip. +As expected, it works $\frac{30}{8} = 3.75$ times faster than the naive trial division, processing about 7.6k 30-bit numbers per second. The performance can be improved by considering more primes, but the returns are diminishing: adding a new prime $p$ reduces the number of iterations by $\frac{1}{p}$, but increases the size of the skip-list by a factor of $p$, requiring proportionally more memory. 
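The offsets array in the snippet above does not have to be typed by hand. As a minimal sketch (the `wheel_residues` helper is a hypothetical name, not part of the article's code), the following lists the residues modulo the product of the excluded primes that are divisible by none of them:

```c++
#include <cstdint>
#include <vector>

typedef uint64_t u64;

// Residues modulo the product of `primes` that are divisible by none of them:
// only numbers with these residues still need to be tried as divisors
std::vector<u64> wheel_residues(const std::vector<u64> &primes) {
    u64 period = 1;
    for (u64 p : primes)
        period *= p;

    std::vector<u64> residues;
    for (u64 r = 1; r < period; r++) {
        bool excluded = false;
        for (u64 p : primes)
            if (r % p == 0)
                excluded = true;
        if (!excluded)
            residues.push_back(r);
    }
    return residues;
}
```

For $\{2, 3, 5\}$ it yields $1, 7, 11, 13, 17, 19, 23, 29$, which are the same eight candidates per period as in the loop above (where the period is shifted to start at $7$), and for $\{2, 3, 5, 7\}$ it yields $48$ residues per $210$ numbers.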
-### Lookup table +### Precomputed Primes -We will choose to store smallest factors of first $2^16$ — because this way they all fit in just one byte, so we are sort of saving on memory here. +If we keep increasing the number of primes we exclude in wheel factorization, we eventually exclude all composite numbers and only check for prime factors. In this case, we don't need this array of offsets, but we need to precompute primes, which we can do during compile time like this: ```c++ -template +const int N = (1 << 16); + struct Precalc { - char divisor[N]; + u16 primes[6542]; // # of primes under N=2^16 - constexpr Precalc() : divisor{} { - for (int i = 0; i < N; i++) - divisor[i] = 1; - for (int i = 2; i * i < N; i++) - if (divisor[i] == 1) - for (int k = i * i; k < N; k += i) - divisor[k] = i; + constexpr Precalc() : primes{} { + bool marked[N] = {}; + int n_primes = 0; + + for (int i = 2; i < N; i++) { + if (!marked[i]) { + primes[n_primes++] = i; + for (int j = 2 * i; j < N; j += i) + marked[j] = true; + } + } } }; -constexpr Precalc precalc{}; +constexpr Precalc P{}; + +u64 find_factor(u64 n) { + for (u16 p : P.primes) + if (n % p == 0) + return p; + return 1; +} +``` + +This approach lets us process almost 20k 30-bit integers per second, but it does not work for larger (64-bit) numbers unless they have small ($< 2^{16}$) factors. Note that this is actually an asymptotic optimization: there are $O(\frac{n}{\ln n})$ primes among the first $n$ numbers, so this algorithm performs $O(\frac{\sqrt n}{\ln \sqrt n})$ operations, while wheel factorization only eliminates a large but fixed fraction of divisors. If we extend it to 64-bit numbers and precompute every prime under $2^{32}$ (storing which would require several hundred megabytes of memory), the relative speedup would grow by a factor of $\frac{\ln \sqrt{n^2}}{\ln \sqrt n} = 2 \cdot \frac{1/2}{1/2} \cdot \frac{\ln n}{\ln n} = 2$. 
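The constant `6542` used for the array size above is the number of primes below $2^{16}$. A small sketch for double-checking it at run time, assuming a plain sieve of Eratosthenes (`count_primes` is a hypothetical helper, not part of the article's code):

```c++
#include <vector>

// Counts the primes in [2, limit) with a plain sieve of Eratosthenes
int count_primes(int limit) {
    std::vector<bool> marked(limit > 0 ? limit : 0, false);
    int count = 0;
    for (int i = 2; i < limit; i++) {
        if (!marked[i]) {
            count++; // i was never crossed out, so it is prime
            for (long long j = 2LL * i; j < limit; j += i)
                marked[j] = true;
        }
    }
    return count;
}
```

Calling `count_primes(1 << 16)` should return `6542`, matching the size of the `primes` array.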
+ +All variants of trial division, including this one, are bottlenecked by the speed of integer division, which can be [optimized](../hpc/arithmetic/division/) if we know the divisors in advice and allow for some precomputation: + +```c++ +// ...precomputation is the same as before, +// but we store the reciprocal instead of the prime number itself +u64 magic[6542]; +// for each prime i: +magic[n_primes++] = u64(-1) / i + 1; u64 find_factor(u64 n) { - return precalc.divisor[n]; + for (u64 m : P.magic) + if (m * n < m) + return u64(-1) / m + 1; + return 1; } ``` +This makes the algorithm ~18x faster: we can now process ~350k 30-bit numbers per second. This is actually the most efficient algorithm we have + + +$\tilde{O}(\sqrt n)$ territory + ### Pollard's Rho Algorithm +--- + +```c++ +u64 find_factor(u64 n) { + while (true) { + if (u64 g = gcd(randint(2, n - 1), n); g != 1) + return g; + } +} +``` + + The algorithm is probabilistic. This means that it may or may not work. You would also need to Ро-алгоритм Полларда — рандомизированный алгоритм факторизации целых чисел, работающий за время $O(n^\frac{1}{4})$ и основывающийся не следствии из парадокса дней рождений: @@ -232,99 +302,6 @@ If you have limited time, you should probably compute as much forward as possibl How to optimize for the *average* case is unclear. 
-0.087907 -3964.321045 - -```c++ -u64 find_factor(u64 n) { - if (n % 2 == 0) - return 2; - for (u64 d = 3; d * d <= n; d += 2) - if (n % d == 0) - return d; - return 1; -} -``` - -0.199740 -7615.217773 - -```c++ -u64 find_factor(u64 n) { - for (u64 d : {2, 3, 5}) - if (n % d == 0) - return d; - u64 increments[] = {0, 4, 6, 10, 12, 16, 22, 24}; - for (u64 d = 7; d * d <= n; d += 30) { - for (u64 k = 0; k < 8; k++) { - u64 x = d + increments[k]; - if (n % x == 0) - return x; - } - } - return 1; -} -``` - -19430.058594 - -```c++ -const int N = (1 << 16); - -struct Precalc { - u16 primes[6542]; // # of primes under N=2^16 - - constexpr Precalc() : primes{} { - bool marked[N] = {}; - int n_primes = 0; - - for (int i = 2; i < N; i++) { - if (!marked[i]) { - primes[n_primes++] = i; - for (int j = 2 * i; j < N; j += i) - marked[j] = true; - } - } - } -}; - -constexpr Precalc P{}; - -u64 find_factor(u64 n) { - for (u16 p : P.primes) - if (n % p == 0) - return p; - return 1; -} -``` - -352997.656250 - -```c++ -u64 magic[6542]; -magic[n_primes++] = u64(-1) / i + 1; - -u64 find_factor(u64 n) { - for (u64 m : P.magic) - if (m * n < m) - return u64(-1) / m + 1; - return 1; -} -``` - -Except that it is contant, so the speedup should be twice as much. 
- ---- - -```c++ -u64 find_factor(u64 n) { - while (true) { - if (u64 g = gcd(randint(2, n - 1), n); g != 1) - return g; - } -} -``` - 99.292641 25720.164062 almost 15x slower From dd88f5e0bdc1fbf03ac4f62c35ac458b085eeafb Mon Sep 17 00:00:00 2001 From: arnu152 <36503815+arnu152@users.noreply.github.com> Date: Wed, 25 May 2022 17:36:10 +0200 Subject: [PATCH 116/173] Fix a typo in the prefix sum code sample --- content/english/hpc/algorithms/prefix.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/english/hpc/algorithms/prefix.md b/content/english/hpc/algorithms/prefix.md index f07daaf3..81d31900 100644 --- a/content/english/hpc/algorithms/prefix.md +++ b/content/english/hpc/algorithms/prefix.md @@ -76,7 +76,7 @@ v4i prefix(v4i x) { // x = 1, 3, 5, 7 // + 0, 0, 1, 3 // = 1, 3, 6, 10 - return s; + return x; } ``` @@ -91,7 +91,7 @@ v8i prefix(v8i x) { x = _mm256_add_epi32(x, _mm256_slli_si256(x, 8)); x = _mm256_add_epi32(x, _mm256_slli_si256(x, 16)); // <- this does nothing // x = 1, 3, 6, 10, 5, 11, 18, 26 - return s; + return x; } ``` From 88b757a7ceb792b7ca1435a06f4e17617e443381 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 25 May 2022 19:45:20 +0300 Subject: [PATCH 117/173] pollard rho --- .../english/hpc/algorithms/factorization.md | 147 ++++++++---------- content/english/hpc/algorithms/img/rho.jpg | Bin 0 -> 14570 bytes 2 files changed, 67 insertions(+), 80 deletions(-) create mode 100644 content/english/hpc/algorithms/img/rho.jpg diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md index 9f7958ed..90a1bf43 100644 --- a/content/english/hpc/algorithms/factorization.md +++ b/content/english/hpc/algorithms/factorization.md @@ -195,7 +195,7 @@ u64 find_factor(u64 n) { This approach lets us process almost 20k 30-bit integers per second, but it does not work for larger (64-bit) numbers unless they have small ($< 2^{16}$) factors. 
Note that this is actually an asymptotic optimization: there are $O(\frac{n}{\ln n})$ primes among the first $n$ numbers, so this algorithm performs $O(\frac{\sqrt n}{\ln \sqrt n})$ operations, while wheel factorization only eliminates a large but fixed fraction of divisors. If we extend it to 64-bit numbers and precompute every prime under $2^{32}$ (storing which would require several hundred megabytes of memory), the relative speedup would grow by a factor of $\frac{\ln \sqrt{n^2}}{\ln \sqrt n} = 2 \cdot \frac{1/2}{1/2} \cdot \frac{\ln n}{\ln n} = 2$. -All variants of trial division, including this one, are bottlenecked by the speed of integer division, which can be [optimized](../hpc/arithmetic/division/) if we know the divisors in advice and allow for some precomputation: +All variants of trial division, including this one, are bottlenecked by the speed of integer division, which can be [optimized](/hpc/arithmetic/division/) if we know the divisors in advice and allow for some precomputation. In particular, we can use [Lemire division check](/hpc/arithmetic/division/#lemire-reduction): ```c++ // ...precomputation is the same as before, @@ -212,14 +212,13 @@ u64 find_factor(u64 n) { } ``` -This makes the algorithm ~18x faster: we can now process ~350k 30-bit numbers per second. This is actually the most efficient algorithm we have - - -$\tilde{O}(\sqrt n)$ territory +This makes the algorithm ~18x faster: we can now process ~350k 30-bit numbers per second. This is actually the most efficient algorithm we have for this number range. While it can probably be even further optimized by performing these checks in parallel with [SIMD](/hpc/simd), we will stop there and consider a different, asymptotically better approach. ### Pollard's Rho Algorithm ---- + -### Brent's Method +To construct this sequence, we need a "seemingly random" function that maps the remainders of $n$. Typical choice is $f(x) = (x + 1)^2 \mod n$. 
-Another idea is to accumulate the product and instead of calculating GCD on each step to calculate it every log n steps. +Now, consider a graph where each vertex $x$ has an edge pointing to $f(x)$. Such graphs are called *functional*. The "trajectory" of any element — the path we walk starting from that element and following edges — eventually loop around. This trajectory resembles the greek letter $\rho$ (rho), which is why the algorithm is named so. -### Optimizing division +![](../img/rho.jpg) -The next step is to actually apply Montgomery Multiplication. +Apart from this trick, Pollard's rho algorithm relies on a consequence from the Birthday paradox: we need to add $O(\sqrt{n})$ random numbers from $1$ to $n$ to a set until we get a collision. -This is exactly the type of problem when we need specific knowledge, because we have 64-bit modulo by not-compile-constants, and compiler can't really do much to optimize it. +Now, consider a trajectory of some element $x_0$: {$x_0$, $f(x_0)$, $f(f(x_0))$, $\ldots$}. -... +Make another sequence out of it, virtually taking each element modulo $p$, the lesser of prime divisors of $n$. -## Further optimizations +**Lemma.** The expected length in that sequence is $O(\sqrt[4]{n})$. -Существуют также [субэкспоненциальные](https://ru.wikipedia.org/wiki/%D0%A4%D0%B0%D0%BA%D1%82%D0%BE%D1%80%D0%B8%D0%B7%D0%B0%D1%86%D0%B8%D1%8F_%D1%86%D0%B5%D0%BB%D1%8B%D1%85_%D1%87%D0%B8%D1%81%D0%B5%D0%BB#%D0%A1%D1%83%D0%B1%D1%8D%D0%BA%D1%81%D0%BF%D0%BE%D0%BD%D0%B5%D0%BD%D1%86%D0%B8%D0%B0%D0%BB%D1%8C%D0%BD%D1%8B%D0%B5_%D0%B0%D0%BB%D0%B3%D0%BE%D1%80%D0%B8%D1%82%D0%BC%D1%8B), но не полиномиальные алгоритмы факторизации. Человечество [умеет](https://en.wikipedia.org/wiki/Integer_factorization_records) факторизовывать числа порядка $2^{200}$. +**Proof.** Each time we walk a new edge, we generate a random number. It has some chance if looping around. +As $p$ is the lesser divisor, $p \leq \sqrt n$. 
Now we need to plug it into the [Birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem): we need to add $O(\sqrt{p}) = O(\sqrt[4]{n})$ elements to the set to get a collision, which means that the. ---- +Another observation: the length of the "tail" and the cycle is equal in expectation, since when we loop around, we choose any vertex of the path we walked independently. -If you have limited time, you should probably compute as much forward as possible, and then half the time computing the other. +Now, if we find a cycle in this sequence — $i$ and $j$ such that $f^i(x_0) \equiv f^j(x_0) \pmod p$ — we can find some divisor of $n$ using the $\gcd$ trick: $\gcd(|f^i(x_0) - f^j(x_0)|, n)$ would be less than $n$ and divisible by $p$. -How to optimize for the *average* case is unclear. +Floyd's cycle-finding algorithm -99.292641 -25720.164062 almost 15x slower +The algorithm itself just finds a loop in this sequence using the Ford algorithms, also known as the "hare and turtle" technique: we maintain two pointers $i$ and $j$ ($i = 2j$) and check that $f^i(x_0) \equiv f^j(x_0) \pmod p$, which is equivalent to checking $\gcd(|f^i(x_0) - f^j(x_0)|, n) \neq 1$. ```c++ -u64 f(u64 x, u64 a, u64 mod) { - return ((u128) x * x + a) % mod; +u64 f(u64 x, u64 mod) { + return ((u128) x * x + 1) % mod; } u64 diff(u64 a, u64 b) { @@ -315,7 +271,7 @@ u64 diff(u64 a, u64 b) { return a > b ? a - b : b - a; } -u64 rho(u64 n, u64 x0 = 2, u64 a = 1) { +u64 find_factor(u64 n) { u64 x = x0, y = x0, g = 1; while (g == 1) { x = f(x, a, n); @@ -325,16 +281,16 @@ u64 rho(u64 n, u64 x0 = 2, u64 a = 1) { } return g; } - -u64 find_factor(u64 n) { - return rho(n); -} ``` -56.745281 +While it processes 25k 30-bit numbers — almost 15 times slower than the fastest algorithm we have — it drammatically outperforms every $\tilde{O}(\sqrt n)$ algorithm for 60-bit numbers, processing around 90 of them per second. 
+ +### Pollard-Brent Algorithm + +Floyd's cycle-finding algorithm has a problem in that it does more iterator increments than necessary. One way to solve it is to memorize the values that the faster iterator visits and compute the gcd using the difference of $x_i$ and $x_{\lfloor i / 2 \rfloor}$, but it can also be done without extra memory using this trick: ```c++ -u64 rho(u64 n, u64 x0 = 2, u64 a = 1) { +u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) { u64 x = x0, y = x0; for (int l = 256; l < (1 << 20); l *= 2) { @@ -350,12 +306,14 @@ u64 rho(u64 n, u64 x0 = 2, u64 a = 1) { } ``` -426.389160 +It actually does *not* improve performance and even makes it ~1.5x *slower*, which probably has something to do with the fact that $x$ is stale. It spends most of the time computing the GCD and not advancing the iterator — in fact, the asymptotic of the algorithm is currently $O(\sqrt[4]{n} \log n)$ because of it. + +We can remove the logarithm from the asymptotic using the fact that if one of $a$ and $b$ contains factor $p$, then $a \cdot b \bmod n$ will also contain it, so instead of computing $\gcd(a, n)$ and $\gcd(b, n)$, we can compute $\gcd(a \cdot b \bmod n, n)$. This way, we can group the calculations of GCP in groups of $M = O(\log n)$, we remove $\log n$ out of the asymptotic: ```c++ const int M = 1024; -u64 rho(u64 n, u64 x0 = 2, u64 a = 1) { +u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) { u64 x = x0, y = x0, p = 1; for (int l = M; l < (1 << 20); l *= 2) { @@ -374,7 +332,13 @@ u64 rho(u64 n, u64 x0 = 2, u64 a = 1) { } ``` -2948.260986 +It now works at 425 factorizations per second, bottlenecked by the speed of modulo. + +### Optimizing Modulo + +The next step is to actually apply [Montgomery Multiplication](/hpc/number-theory/montgomery/). + +This is exactly the type of problem when we need specific knowledge, because we have 64-bit modulo by not-compile-constants, and compiler can't really do much to optimize it. 
```c++ struct Montgomery { @@ -403,7 +367,7 @@ u64 f(u64 x, u64 a, Montgomery m) { const int M = 1024; -u64 rho(u64 n, u64 x0 = 2, u64 a = 1) { +u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) { Montgomery m(n); u64 y = x0; @@ -423,15 +387,38 @@ u64 rho(u64 n, u64 x0 = 2, u64 a = 1) { } ``` +It processes around 3000 per second, which is ~3.8 faster than what [PARI](https://pari.math.u-bordeaux.fr/) library can do (invocated via [sage](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html)). + +### Further Optimization + +There might be a way to . + +It may be beneficial to start multiplying only after a certain threshold since there is little probability that we enter a cycle in the beginning. + +It may be worth it to run a few versions in parallel and stop whichever finishes first. If we run $p$ runs, it is expected to finish $\sqrt p$ times faster. Either scalar code and taking advantage of there being multiple execution ports for multiplication, or using [SIMD](/hpc/simd) instructions to do 4 or 8 multiplications in parallel. + +Would not be surprised to see another 3x improvement and throughputs of 10k/sec. + +If you have limited time, you should probably compute as much forward as possible, and then half the time computing the other. + +How to optimize for the *average* case is unclear. + +### Reducing Errors + There are slightly more errors because we are a bit loose with modular arithmetic here. The error rate grows higher when we increase and decrease (due to overflows). -788.4861246275735 +Our implementation has less than 0.7% error rate, but it grows higher if the numbers are lower than $10^{18}$. + +Since Pollard's rho algorithm is randomized, you need to account for errors. There may be several sources: + +- Factors not being found (need to perform a primality test and start again if it's negative). +- The `p` variable can get zeroed out (need to either restart or roll back and do it iteration-by-iteration). 
+- Overflows in Montgomery multiplication (our implementation is pretty loose). ### Larger Numbers "How big are your numbers?" determines the method to use: - - Less than 2^16 or so: Lookup table. - Less than 2^70 or so: Richard Brent's modification of Pollard's rho algorithm. - Less than 10^50: Lenstra elliptic curve factorization diff --git a/content/english/hpc/algorithms/img/rho.jpg b/content/english/hpc/algorithms/img/rho.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d7f01ad81ee48c90ae02e9b248cc880cc3665e9e GIT binary patch literal 14570 zcmc(F1y~$=vTvh?hssp2TOum2oNl|ySqaI!QDN`;1FCwfPvsna2p(gJu&KJw5;G>Z)H={pw-nVFkcakdc=GU|?W?H;*6SVIGhIkPs0; zhzLj^5C|C=2?Y%o9Ss!~jR@xnCN3E<1vwcpDJdl_8v`XZ3k@kLBmYwt4o)5(9twtM zA_82(Y}`Ctzug1|85tQ36^#%bosf%)l#1(L-X1yuY$TW!SR^siZ@660CEFGPkU0mJVJpzM*LqfyCBN7rnd`wFIl#-g8mtRm=R9sT} zxwfvp0o>Ts+|}LF+t)uZI0TuP{5~~3Gds7sw*F&db8CBN_xR-W?EK>L>iXwzdcgp2 ze^Ki{HT!RRVL$2x3l9$m5Bg0n7+BXwfy0JJpyEWtkx&B}**~G?@<+mzjL)h0f=t8x z5{mcMVH^dYmS>gj_&3%5w`RYlSirxf*?%hbfApFI(BNPmjR%Jfhyj5PeIO9_0IXkSW0XdO%$>}gh;dD= z5sf_nxXOe+Z<}n1z2u>n{x3K!6O>pdyFGN~vfcT%NMhsw6Y`>TQm1Uxi+j{@R-gV2 zI_M4A{L+_}mrd5SV#~%U82cNaF=W5pdL4Y)sGiO{W&@XK_Z9{_02;F1*(4v+;wngxNi^DG1< zzUX`l50m3-d_JY3kS2$6tWw7KhDtgVn(OR0G|4nBd=WLP4*-f%?p4gmwRFCn!lm*N z;`8Ka;VyMk=qr;xxMfFaVWS4plLt*dmUUZ%`aG*-_l3R%+{;8ZX% zK~&*S`ar?)qx$Vgl5I*1>8=WW;gWD|L*(nk{kKbs;RnVcToHauGL>>{WQ!Re0~x7d zTaXtfVIXRHh{5G<$vP1BgsDTsjwA@=aIsRdFXx@v(lMS)51s0@fn}Y)|0!-1-1N5Q zV9Ax|i;f8SA2GX-rymdTo<|YEAar@<#JJD*dDx2+*G-ohEwjAj!G>W}gNXRwYTt3DecpcFu z4cLPXQEo}>1sA0ADx$m0-Z$X1wP4FrsQ)p6-tEn?)!gyY}BO8V*QWip+kaLwWM3lzO zN~0NH=DeI21LpIjG2h9Li|#j1&K{RL`u&WR!Jz(zucAoTKzjCF6rX5lv6uod2BM)U z!0UVfx=_omoE5K_dvNBT=WR^za@nvdgHZA_j$<;AA1y)6mts$r9Cemb#~f;1=AtEU zMs*M_yZDVeq>mAVf+JtZS5Y!fJ z&UT`89GW))Tt$HaFt>~i)wQ+r7rJtk^El$d#a(MW-+I=sG#%eK5ee*RFsC2)M^<<} zSq(o=`;_{&oF)Xpx=@=WcdL3O!5d}|;>SL^T7kH`r0W^|L@5>727NMq&@!rulaLEb 
z1~tC#u)gqXaY0HzX;gwTfVV*P5ZbXOk75uLlfQt0%B6v`?PL5F^gA@%xNPS0ptV-RWF8PmuLZFROTHFy9_EE^KGn@^(ZgU2bvT2&y@KYi4vG-Shs zx%01bdvtai${1fYKYe|WfWJS`{qgaon$B%K;LY${Ju9PW{fxC=-5H z6xx3rt1K(*dI8qBY^c%K86mwmDR$e44rDIbmLQV`6o2!fKS!pWBKQ(yUP}|bGxh_O z@Bq8eswf}knqw2YD{jdKYZcWSoh&}mkMggvYlJvqV%EoHU!{WARl zvgLEoeKllB{wfN6#7Z=&I0(5A%Kkn1?XzmAdcX*EuV@vubeBo5! zBo^d_Fr9l$Ro{)0*pFL??oW>o+n$~tytpt~xmF&o zVRfH)Mc6iMSm}v04YDucU1%D4bMz3KR`(tNR7;1etS$}$&9^@@<3f?0ga!!@7$1QC z_BM|LRK4nB{8-ykiwR`bO}*Hor-$W29KzV*#616FOm({pid2JeW*oF=tNxd_+TSHKlGZgU?Dyebw;UK|X~!I*Aeu zQ!*m=2dA zf8B=-oGW*dMd79jkKYBO%I7ce9at%aIa8b-6Yh6k`R$4cp6N)!J=FjLI3$k5v9t_z zc3rRxwO=I8D+kEsOT`U&5Z4&kzZYB8am!(;?(XvF*%}b*E?TI$$sF&n>HO(v5G#&= ztM+e-hGtl+bdj5(T;j{g0%qcj65^=VN)Kv`eRJTLAgt~PbqB{%=R<+wC>Cq zJyXOL87(i0%3K~+={KNi9IJaCs$|HmJz0S7;pH}ZqzCV z2s#x>-Ja`Fu{f|{Boa}D(00YU?FRsGfChW8un;J?^_f+Hr{=`wgCv2iZ^g30X2_ zCYI{#=_nNz88f6_hQCf+QDSa#O;w|gVx2)6a3 zu%B!d^Y7)=or3Bt4O|?yc~_c1hUMm4625U#bq*efylynxZYmT{-^np@=dh3Dd_`*^nyX=Ly1|fWOjU~Os4I3*Tq*P&u0ghX1if7G=YwieLR72N9{@LF!o5sJI7Pi_ zo+Av4C%dTyU07M}TF zYgzeyJo&qQn!70>zcc!lpC=DMwBYxuP@V%+Z5bM^x(*y|g#K@hJ*CV0cme`jv>LYE z%Q2u{1FA^r?{(W1^i2ccR1lpaOh!4T@XBdcm!D7-#e&6DJAu11x{fzDPSNZ}z)rRW z{pc_e)r13>%HI^mVLs93ylPdbvFHH^3a+TSpRMS3Sv&00m;MnlC6N)ILw6zL+e$!t zh=5h~A;S&5H9=+p4b1`b+15#+3W<^}i#$)9P>y;MM?qZtWyp$*tbqi1_85qrBGR_< zeJii4b)L-lp1e)}*Oocep%9tZU1S?H@?>E5H-4~5#&zcQOiD!*Wps`M^jfOa%n&b;D6&MQK1lnL{$%tSXyFS4WLCPwJ%neh64Zr z0E6@8FNFz^$a+JoA&gaF=Kx!KFMIOFmWxYPON`@A%=xi`G|NOeW6VjdH4>s$)(em6 zc^`sfX>i1TW#xf2JL@{Q$5&)Oc_4Op7qNEz<7HLViq^vz+MCHMub~Z8hVdsz)(}w?4_DNsr9boE~laFG-`#J)KyvbZdtr-xdeT(gJb`vj@GdFT_4kIjNJ&v zY}j9w8f&jt1qN=Fm))ug2PLzTwaS^=D5a9|r%ntQZ6cOa1(g=5!c+E^1l99{o*V&- zq^HaY0N_J&T2_W@MI%>J9DjcbGx#$V@cs?!?K^NQO(Q#9lQepzr;2EGoP=6Z(y1O( zTSZM%J^skaT~RTpM%-I%>XN}B1!&v^A#C9ogH8Z1lV^EKYi=dD50Ox+ryW1qNg`P%F$PlDNo2$OeW9Z$2{tVTDmjW z(PKzrsAH+BkDLh+z@tO|v>pjxJgW57F^2m*8&f8?`c}tg9eP_kYk6Xk*tNSyEfc$W z8q17MlHNom^Zng|Cd#sN$u@W{uJ(I9(U11YLViMw)uXe{*0 
z*D)zUYSNB7Yd6RpH8_Mt?&r~cOQd}jHJqEG+5yd#5_hxSL#~w0n|Xol0;w^wsG!)&as9S`(&p65*a~;v^M2{g#|jHkthI}L7jieLr<;D1 zUL6#Kd=m0f9)7({dJ}N|p#F^4nvd)y8JoW3I4^Rdmqs*KxW4G2=(R-WnnrLQtwkAf+qU7h|_tzvUr+rfoZ|$n-D#z9kDVJnhTzp(l0J&^4=W?g!9p!BK)Ym2X#zw_( z?TCSbPl9kHR8d7B7IEGgHF2V!^s%~$2pNH z^0Kvd-7xh1tyIyrOSZri8~fDM;0Gw&DV5QpWTj8zWj2P{Dn)ObtKso$55zWVebjnK z_%4p9s8X%KC|Hh)>Gbu~%hXfMP(u*QEoNy$g<~<9SW>`HOFJz+v(41iD`dhJSxet(R0uW#np` z%Io)QkSY0U!GuydxRY!QtJQ4JG!?Af%GHiK-j)&?O5b5#YF>(~4VUxb+h>rD8XvN4 z4H+@k-rPaz%$((|_xCJT#s4I2c}nyVw-yT*Ex{87SQXY=hQ_%qcQ4_-;B9^3Y%LB4 z5%@;^s$VAVkzA`^(gZb8v^i%c3dWnel#A~3p5?@m{bI?;} z?4Emq2Yk|A$M&^EtYYv!08xrjyx@?1dC_TpidEwKI+#W3QtI3&rYJxNMdzO+gN@7K z9?`uW!F?a!^jHcm>=o(8Av15<9rRPeFF4M%Fs-bBda?-2(*34Hi5pIg2UZqeaP^R` zZ-|4Gk+U&|UrZdFWw1>s*XuW2)Ua^IN~`&66#qSgEx?YQxH=k6Ob`YdHZ6S$u~l}% zP~Ikt^7VB1aY5ZO*K_RR*UkW~%*juhr5)cS65C2IJEieKgDJg7JciGtg;vxrLfmzN z=qjpmk4TR*Dv<{J`?}L9Ru zem-1^3fHhmVkYWfbwa=i)(`)`y6o>9qun`b1WJxo)$3-R^Ti2&6|v*?KlJe74Tdhu zW+_0Oo){G0k5A@%OG1Oca1FV>xr2d&mWqlQ9vtY3Lmi^W4fVV$Z{J{Fn9IGzycw<5 zba|XjQTW}|hk101=8GXtf^SovTSkuqj$QRK+J=;#jJ?*2f;79@Tq14+m#9=4UQ?oCztVUXGMgJ~5zLX_TaCO6_rOy-))p-IgVf~?@*6g$4cv$PXkK-+(V6ljUH4}hA8S^< zb+$A3%gjyH~nDBhwnqg;Btu@1b>aCk(JjA?NTT50!oeu_aI&FSkut6AEeq$C#!Hr1Hjb1vb*)ye8d( zPY@&?I9_UE2yG=QUe*nmHWf}&hodWWtOv-hDerM%qE~taFQbHI*Jhtw8rWW_`@iA- z(dFX;8qXiRWo|7uL^kKVGVAhq7frq}*IBxs2a{FSATpD9HTC@;X4-ZJJfEAP_cfrH zx=*RH4H6g???dlJS-CShU7dDCYDWnxr@DU5zp4)S`OP$SZWTU$por>2RN$*PB86&r z>;O!U)5w=TL~haVX$he*WzTMyFD5DpfAo3N7>`|uWdI)1xXzDJj(M2Zn_nuihd63< z)vC#UTod;=HB_?l0N%g4KpuhS}m?(N>a#Z5oB;ryE8>!v&h9Q-|K z{}FLnYU%MYGq!NSlHjx)zhT=zo+KOY_ADDMi^wfQm_3&yY?WUyxI5C5MEYjp>jKlA zP?pgPsAh;^x5x9W_h|=$^=E~9s!a@{CKZ#I(r|tvZF>Y=I?h|+^)>6!4vRTMxZ~yd zdsZ-eG#C(p+fWMSpMRNl>Yun}&)pupg`8$YMva6$CSYcDWm>)zF8ol;$ErIU~k*p+$Q?a;Jj$u)2!4V z_QEyQx!|WhszB~*m=N(Vr0Hffv?mDO&mM#8>{U^Bax?F>DY)qcfmA>cmK0_VFztUU za>Yy#M$YPR`U*2QG6kM9-_&3OdnklKKy~IIQ%n%}DX+y90-}tH&;dF-cm8o_1W!dB zvR_}pwyrcuseNFBl^%Y1y*m_62S* z4)fYPJBl2J{A10^Q6xFhab@Oei 
zF4X%&L3u^GV~Jc5xmE1-rdY7&x#)+mi=|YOPD76z9Q+Zv)w&vtJmNgZpq5U8YpvMt!p%bk}@p{V^K^Km2rt^N>qr zRo4UmBSvucjD@OiS)D%>XxhBbm%M(+&F=xQ@lt~xh81yG*o#&<@R`^=0Gddn;tpE( zzmmzS8qDhKYP_6Xn(w!_<~m4tO5jj^MqQNa$6ZAah8~4(rY;T6&GjvOew=4ptd~WJ z?A+!+U<)rT1n16KTG)zwUZ8&f=vVm$GQWygNv1Hyb*)k7Div86#bLf5gWE2*MUBNm z#{-2&r%YZptea)A+xQr?9qXd(_6AhbDISCb>S6v=&yG7U5)(**%*rDpCOq>@v+3T)@crpgL5>@>;eo)d#9B-Jcjxbh zi+-H8W|?Iu6n97`_H!+Qz;o$nJZ#Cw{M_FURKWTxV;6X#>#P=DLMw}g8Z4z#Q{To9 z1rQ}U0*~K<*2M`8;J#Hlxxbv;Qp_yqo)gN)Z2aSuZl0ju+3%Hnu z5k6n%c({jd`rf*wZvXn&Ww}16tJyH*(D~Wf%!bUQ4f@wz2)_Q0#FLo7k~NHfsog1Ksf%YYp8Lq{IE)PK79Q_UnEe0riRP?ce-8Puf zT|_DfQz9OqNkI`bF+(egleII4^>|vmLpj7%JxKH_Q(EwEr)MSYkk8Hva{T}YZfR+I zM}&NIk+0+3IB0bV9O@J=tzw(5GI9s)ta*^wg{9mu&^%0}$GzuX1;H)T`w8i6wMwsB zqjn>_ZExl>7`!hXntf3_$?Q6-Z6LUSX%s){-DvYHw$9%);c#nv{L>f*#6sd%t2RC%X9ph1W?H5 ztykZ1Ua_-84Ke!8Q|51P z#^)jjLxzDwd4eJ9aWWdVOp?`=dYeigZG`Z>w|%)zNr)_>$0QpyM!f&o-h9OQ6=ld2 ze83XpS}-Dxz$8=g8Q@ffdX)jy)tDAmv9EM6GL%m0fdFOG5|&Q%j6gYQZtID0$l;Jl=nOXz+=DBBaqU1b&zJ zUF{lvoE&eohWpwZv5g3dr=WINif;|Aj%5$Pn=2BX$-q)!zSxagr&B^}m0P!ntS%kR z3yL6~XpVK~`h5E)I3nvj`~GPjnA+bNfB09MD2QYaKxCaa*ScE3p<*85S(p$iXJp4$qo7%2>2)0ZSud6@*~iC3z`JO9xn`a2sfZPuBciB*tm z6yx(KL53lIvpLJu*5;6Kv`1Puap*YBe6T3A2e|O%)u|C?YV;U&DyfrU2NI+7vCxns$YQP zw7=7Ffv5tn-%B3IyugxJjgjqHCgu0LW#dJnls{tlAOS0XDTO8f?ob>q`y6k4CzUlR zeltxyBR?5tqp2xjO$JUt|Ad=wJc-KDPdki=L4X3MmBD!J2(+MU@NMb=FddnE730EH zvMUtz{h)Za`7y|MFxmyL>uR4ZZ_N;4+cfL)XG)v1-bvE7VU~5T#~8@CXtuw|BB_C+ zv|M8>pAq*BkPQ<%RQ@M1y#Fo*fY%-UGr6x&>}jCfxZln|l7-mShsw1^6#%~D7!=Jr zPK3CnenZ{ji=&$@F5BR$3Ud7uBGkVc(f;f82-cD(B0?{k+fzQiZ3y0U=m!ws)9=-X zZ&~V+rdRvoPMZ=u8|O|mX-HBvxXezBWP625tnW#8Yji7bB6}4X69pC^&HcBZ?$sw} z=46eE=Cyl(7^EaxxNH=FV#tr>m;CtmeEZM&4yNl0`j~`=dlBR$*)H#foPT!=QFI$V zjd{f8%H2lCO-nK24_<}K^5fx`I3o-Xz!Ngpe;v90R3#Oq0lRkidEYf#;=MP|PU|_} zM$K+mucP)iY}$NW_|D2KS2J!+JgpKq7+m0<;#!My*!yWCTsXn<9H7IkQ-xJ~yuttf J1JJ|l{{rESw}}7% literal 0 HcmV?d00001 From e796f1669013f46eb3e7d31db9f227534b78a2ed Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 25 May 2022 
19:50:36 +0300 Subject: [PATCH 118/173] pollard refactor --- .../english/hpc/algorithms/factorization.md | 33 ++++++++++--------- 1 file changed, 18 insertions(+), 15 deletions(-) diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md index 90a1bf43..18d46824 100644 --- a/content/english/hpc/algorithms/factorization.md +++ b/content/english/hpc/algorithms/factorization.md @@ -271,12 +271,13 @@ u64 diff(u64 a, u64 b) { return a > b ? a - b : b - a; } +const u64 SEED = 42; + u64 find_factor(u64 n) { - u64 x = x0, y = x0, g = 1; + u64 x = SEED, y = SEED, g = 1; while (g == 1) { - x = f(x, a, n); - y = f(y, a, n); - y = f(y, a, n); + x = f(f(x, n), n); // advance x twice + y = f(y, n); // advance y once g = gcd(diff(x, y)); } return g; @@ -290,13 +291,13 @@ While it processes 25k 30-bit numbers — almost 15 times slower than the fastes Floyd's cycle-finding algorithm has a problem in that it does more iterator increments than necessary. 
One way to solve it is to memorize the values that the faster iterator visits and compute the gcd using the difference of $x_i$ and $x_{\lfloor i / 2 \rfloor}$, but it can also be done without extra memory using this trick: ```c++ -u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) { - u64 x = x0, y = x0; +u64 find_factor(u64 n) { + u64 x = SEED; for (int l = 256; l < (1 << 20); l *= 2) { - x = y; + u64 y = x; for (int i = 0; i < l; i++) { - y = f(y, a, n); + x = f(x, n); if (u64 g = gcd(diff(x, y), n); g != 1) return g; } @@ -313,14 +314,14 @@ We can remove the logarithm from the asymptotic using the fact that if one of $a ```c++ const int M = 1024; -u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) { - u64 x = x0, y = x0, p = 1; +u64 find_factor(u64 n) { + u64 x = SEED; for (int l = M; l < (1 << 20); l *= 2) { - x = y; + u64 y = x, p = 1; for (int i = 0; i < l; i += M) { for (int j = 0; j < M; j++) { - y = f(y, a, n); + y = f(y, n); p = (u128) p * diff(x, y) % n; } if (u64 g = gcd(p, n); g != 1) @@ -340,6 +341,8 @@ The next step is to actually apply [Montgomery Multiplication](/hpc/number-theor This is exactly the type of problem when we need specific knowledge, because we have 64-bit modulo by not-compile-constants, and compiler can't really do much to optimize it. +We do not need to convert numbers out of Montgomery representation before computing the GCD. 
+ ```c++ struct Montgomery { u64 n, nr; @@ -369,13 +372,13 @@ const int M = 1024; u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) { Montgomery m(n); - u64 y = x0; + u64 x = SEED; for (int l = M; l < (1 << 20); l *= 2) { - u64 x = y, p = 1; + u64 y = x, p = 1; for (int i = 0; i < l; i += M) { for (int j = 0; j < M; j++) { - y = f(y, a, m); + x = f(x, m); p = m.multiply(p, diff(x, y)); } if (u64 g = gcd(p, n); g != 1) From 54fe1ba3afb88fdd5b2ec9a041c74be42469afb5 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 25 May 2022 20:23:30 +0300 Subject: [PATCH 119/173] trial division edits --- .../english/hpc/algorithms/factorization.md | 50 +++++++++++-------- 1 file changed, 30 insertions(+), 20 deletions(-) diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md index 18d46824..9e886375 100644 --- a/content/english/hpc/algorithms/factorization.md +++ b/content/english/hpc/algorithms/factorization.md @@ -6,7 +6,7 @@ draft: true The problem of factoring integers into primes is central to computational [number theory](/hpc/number-theory/). It has been [studied](https://www.cs.purdue.edu/homes/ssw/chapter3.pdf) since at least the 3rd century BC, and [many methods](https://en.wikipedia.org/wiki/Category:Integer_factorization_algorithms) have been developed that are efficient for different inputs. -In this case study, we specifically consider the factorization of *word-sized* integers: those on the order of $10^9$ and $10^{18}$. Untypical for this book, in this one, you may actually learn an asymptotically better algorithm: we start with a few basic approaches, and then gradually build up to the $O(\sqrt[4]{n})$-time *Pollard's rho algorithm* and optimize it to the point where it can factorize 60-bit semiprimes in 0.3-0.4ms, which is almost 4x faster than the previous state-of-the-art. +In this case study, we specifically consider the factorization of *word-sized* integers: those on the order of $10^9$ and $10^{18}$. 
Untypical for this book, in this one, you may actually learn an asymptotically better algorithm: we start with a few basic approaches and gradually build up to the $O(\sqrt[4]{n})$-time *Pollard's rho algorithm* and optimize it to the point where it can factorize 60-bit semiprimes in 0.3-0.4ms and almost 4 times faster than the previous state-of-the-art. -The most basic approach is to try every number less than $n$ as a divosor: +The most basic approach is to try every integer smaller than $n$ as a divisor: ```c++ u64 find_factor(u64 n) { @@ -68,7 +68,7 @@ u64 find_factor(u64 n) { } ``` -One simple optimization is to notice that it is enough to only check divisors that do not exceed $\sqrt n$. This works because if $n$ is divided by $d > \sqrt n$, then it is also divided by $\frac{n}{d} < \sqrt n$, so we can don't have to check it separately. +We can notice that if $n$ is divided by $d < \sqrt n$, then it is also divided by $\frac{n}{d} > \sqrt n$, and there is no need to check for it separately. This lets us stop trial division early and only check for potential divisors that do not exceed $\sqrt n$: ```c++ u64 find_factor(u64 n) { @@ -79,13 +79,13 @@ u64 find_factor(u64 n) { } ``` -In our benchmark, $n$ is a semiprime, and we always find the lesser divisor, so both $O(n)$ and $O(\sqrt n)$ implementations perform the same and are able to factorize ~2k 30-bit numbers per second, while taking whole ~20 seconds to factorize a single 60-bit number. +In our benchmark, $n$ is a semiprime, and we always find the lesser divisor, so both $O(n)$ and $O(\sqrt n)$ implementations perform the same and are able to factorize ~2k 30-bit numbers per second — while taking whole 20 seconds to factorize a single 60-bit number. ### Lookup Table Nowadays, you can type `factor 57` in your Linux terminal or Google search bar to get the factorization of any number. 
But before computers were invented, it was more practical to use *factorization tables:* special books containing factorizations of the first $N$ numbers. -We can also use this approach to compute these lookup tables [during compile time](/hpc/compilation/precalc/). To save space, it is convenient to only store the smallest divisor of a number, requiring just one byte for a 16-bit integer: +We can also use this approach to compute these lookup tables [during compile time](/hpc/compilation/precalc/). To save space, we can store only the smallest divisor of a number. Since the smallest divisor does not exceed the $\sqrt n$, we need just one byte per a 16-bit integer: ```c++ template @@ -109,13 +109,13 @@ u64 find_factor(u64 n) { } ``` -This approach can process 3M 16-bit integers per second, although it [probably gets slower](../hpc/cpu-cache/bandwidth/) for larger numbers. While it requires just a few milliseconds and 64KB of memory to calculate and store the divisors of the first $2^{16}$ numbers, it does not scale well for larger inputs. +With this approach, we can process 3M 16-bit integers per second, although it would probably [get slower](../hpc/cpu-cache/bandwidth/) for larger numbers. While it requires just a few milliseconds and 64KB of memory to calculate and store the divisors of the first $2^{16}$ numbers, it does not scale well for larger inputs. ### Wheel factorization -To save paper space, pre-computer era factorization tables typically excluded numbers divisible by 2 and 5: in decimal numeral system, you can quickly determine whether a number is divisible by 2 or 5 (by looking at its last digit) and keep dividing the number $n$ by 2 or 5 while it is possible, eventually arriving to some entry in the factorization table. This makes the factorization table just ½ × ⅘ = 0.4 its original size. 
+To save paper space, pre-computer era factorization tables typically excluded numbers divisible by $2$ and $5$, making the factorization table ½ × ⅘ = 0.4 of its original size. In the decimal numeral system, you can quickly determine whether a number is divisible by $2$ or $5$ (by looking at its last digit) and keep dividing the number $n$ by $2$ or $5$ while it is possible, eventually arriving at some entry in the factorization table. -We can apply a similar trick to trial division, first checking if the number is divisible by $2$, and then only check for odd divisors: +We can apply a similar trick to trial division by first checking if the number is divisible by $2$ and then only considering odd divisors: ```c++ u64 find_factor(u64 n) { @@ -128,9 +128,11 @@ u64 find_factor(u64 n) { } ``` -With 50% fewer divisions to do, this algorithm works twice as fast, but it can be extended. If the number is not divisible by $3$, we can also ignore all multiples of $3$, and the same goes for all other divisors. +With 50% fewer divisions to perform, this algorithm works twice as fast. -The problem is, as we increase the number of primes to exclude, it becomes less straightforward to iterate only over the numbers not divisible by them as they follow an irregular pattern — unless the number of primes is small. For example, if we consider $2$, $3$, and $5$, then, among the first $90$ numbers, we only need to check: +This method can be extended: if the number is not divisible by $3$, we can also ignore all multiples of $3$, and the same goes for all other divisors. The problem is, as we increase the number of primes to exclude, it becomes less straightforward to iterate only over the numbers not divisible by them as they follow an irregular pattern — unless the number of primes is small. 
+ +For example, if we consider $2$, $3$, and $5$, then, among the first $90$ numbers, we only need to check: ```center (1,) 7, 11, 13, 17, 19, 23, 29, @@ -138,7 +140,7 @@ The problem is, as we increase the number of primes to exclude, it becomes less 61, 67, 71, 73, 77, 79, 83, 89… ``` -You can notice a pattern: the sequence repeats itself every $30$ numbers because remainder modulo $2 \times 3 \times 5 = 30$ is all we need to determine whether a number is divisible by $2$, $3$, or $5$. This means that we only need to check $8$ specific numbers in every $30$, proportionally improving the performance: +You can notice a pattern: the sequence repeats itself every $30$ numbers. This is not surprising since the remainder modulo $2 \times 3 \times 5 = 30$ is all we need to determine whether a number is divisible by $2$, $3$, or $5$. This means that we only need to check $8$ numbers with specific remainders out of every $30$, proportionally improving the performance: ```c++ u64 find_factor(u64 n) { @@ -157,11 +159,11 @@ u64 find_factor(u64 n) { } ``` -As expected, it works $\frac{30}{8} = 3.75$ times faster than the naive trial division, processing about 7.6k 30-bit numbers per second. The performance can be improved by considering more primes, but the returns are diminishing: adding a new prime $p$ reduces the number of iterations by $\frac{1}{p}$, but increases the size of the skip-list by a factor of $p$, requiring proportionally more memory. +As expected, it works $\frac{30}{8} = 3.75$ times faster than the naive trial division, processing about 7.6k 30-bit numbers per second. The performance can be improved further by considering more primes, but the returns are diminishing: adding a new prime $p$ reduces the number of iterations by $\frac{1}{p}$ but increases the size of the skip-list by a factor of $p$, requiring proportionally more memory. 
### Precomputed Primes -If we keep increasing the number of primes we exclude in wheel factorization, we eventually exclude all composite numbers and only check for prime factors. In this case, we don't need this array of offsets, but we need to precompute primes, which we can do during compile time like this: +If we keep increasing the number of primes in wheel factorization, we eventually exclude all composite numbers and only check for prime factors. In this case, we don't need this array of offsets but just the array of primes: ```c++ const int N = (1 << 16); @@ -193,9 +195,11 @@ u64 find_factor(u64 n) { } ``` -This approach lets us process almost 20k 30-bit integers per second, but it does not work for larger (64-bit) numbers unless they have small ($< 2^{16}$) factors. Note that this is actually an asymptotic optimization: there are $O(\frac{n}{\ln n})$ primes among the first $n$ numbers, so this algorithm performs $O(\frac{\sqrt n}{\ln \sqrt n})$ operations, while wheel factorization only eliminates a large but fixed fraction of divisors. If we extend it to 64-bit numbers and precompute every prime under $2^{32}$ (storing which would require several hundred megabytes of memory), the relative speedup would grow by a factor of $\frac{\ln \sqrt{n^2}}{\ln \sqrt n} = 2 \cdot \frac{1/2}{1/2} \cdot \frac{\ln n}{\ln n} = 2$. +This approach lets us process almost 20k 30-bit integers per second, but it does not work for larger (64-bit) numbers unless they have small ($< 2^{16}$) factors. + +Note that this is actually an asymptotic optimization: there are $O(\frac{n}{\ln n})$ primes among the first $n$ numbers, so this algorithm performs $O(\frac{\sqrt n}{\ln \sqrt n})$ operations, while wheel factorization only eliminates a large but constant fraction of divisors. 
If we extend it to 64-bit numbers and precompute every prime under $2^{32}$ (storing which would require several hundred megabytes of memory), the relative speedup would grow by a factor of $\frac{\ln \sqrt{n^2}}{\ln \sqrt n} = 2 \cdot \frac{1/2}{1/2} \cdot \frac{\ln n}{\ln n} = 2$. -All variants of trial division, including this one, are bottlenecked by the speed of integer division, which can be [optimized](/hpc/arithmetic/division/) if we know the divisors in advice and allow for some precomputation. In particular, we can use [Lemire division check](/hpc/arithmetic/division/#lemire-reduction): +All variants of trial division, including this one, are bottlenecked by the speed of integer division, which can be [optimized](/hpc/arithmetic/division/) if we know the divisors in advance and allow for some additional precomputation. In our case, it is suitable to use [the Lemire division check](/hpc/arithmetic/division/#lemire-reduction): ```c++ // ...precomputation is the same as before, @@ -212,7 +216,7 @@ u64 find_factor(u64 n) { } ``` -This makes the algorithm ~18x faster: we can now process ~350k 30-bit numbers per second. This is actually the most efficient algorithm we have for this number range. While it can probably be even further optimized by performing these checks in parallel with [SIMD](/hpc/simd), we will stop there and consider a different, asymptotically better approach. +This makes the algorithm ~18x faster: we can now factorize **~350k** 30-bit numbers per second, which is actually the most efficient algorithm we have for this number range. While it can probably be optimized even further by performing these checks in parallel with [SIMD](/hpc/simd), we will stop there and try a different, asymptotically better approach. ### Pollard's Rho Algorithm @@ -235,6 +239,8 @@ By itself, this algorithm is just an esoteric way of computing factorization, bu --> +Pollard's rho algorithm is a randomized $O(\sqrt[4]{n})$ integer factorization algorithm. 
+ To construct this sequence, we need a "seemingly random" function that maps the remainders of $n$. Typical choice is $f(x) = (x + 1)^2 \mod n$. Now, consider a graph where each vertex $x$ has an edge pointing to $f(x)$. Such graphs are called *functional*. The "trajectory" of any element — the path we walk starting from that element and following edges — eventually loop around. This trajectory resembles the greek letter $\rho$ (rho), which is why the algorithm is named so. @@ -427,3 +433,7 @@ Since Pollard's rho algorithm is randomized, you need to account for errors. The - Less than 10^50: Lenstra elliptic curve factorization - Less than 10^100: Quadratic Sieve - More than 10^100: General Number Field Sieve + +Requiring about 100KB of memory. + +6542 * 8 From 002b4aece30f3a63c2dc06a8a1b016afb55c4904 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 25 May 2022 21:46:07 +0300 Subject: [PATCH 120/173] pollard rho description --- .../english/hpc/algorithms/factorization.md | 47 +++++++++++++------ 1 file changed, 32 insertions(+), 15 deletions(-) diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md index 9e886375..d44ca6af 100644 --- a/content/english/hpc/algorithms/factorization.md +++ b/content/english/hpc/algorithms/factorization.md @@ -237,35 +237,49 @@ It also searches for a factor, but it does so by repeatedly trying to compute th By itself, this algorithm is just an esoteric way of computing factorization, but can be made useful. If, instead of random numbers, we apply this $\gcd$ trick to a particular number sequence, we get a $O(n^\frac{1}{4})$ approach known as Pollard's rho algorithm. +Apart from this trick, Pollard's rho algorithm relies on a consequence from the Birthday paradox: we need to add $O(\sqrt{n})$ random numbers from $1$ to $n$ to a set until we get a collision. + --> -Pollard's rho algorithm is a randomized $O(\sqrt[4]{n})$ integer factorization algorithm. 
+Pollard's rho algorithm is a randomized $O(\sqrt[4]{n})$ integer factorization algorithm that makes use of the [birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem): one only needs to draw $\Theta(\sqrt{n})$ random numbers between $1$ and $n$ to get a collision with high probability. -To construct this sequence, we need a "seemingly random" function that maps the remainders of $n$. Typical choice is $f(x) = (x + 1)^2 \mod n$. +Consider some function $f(x)$ that takes a remainder $x \in [0, n)$ and maps it to some other remainder of $n$ in a way that that seems random from the number theory point of view. Specifically, we will use $f(x) = x^2 + 1 \bmod n$, which is random enough for our purposes. -Now, consider a graph where each vertex $x$ has an edge pointing to $f(x)$. Such graphs are called *functional*. The "trajectory" of any element — the path we walk starting from that element and following edges — eventually loop around. This trajectory resembles the greek letter $\rho$ (rho), which is why the algorithm is named so. +Now, consider a graph where each number-vertex $x$ has an edge pointing to $f(x)$. Such graphs are called *functional*. In functional graphs, the "trajectory" of any element — the path we walk if we start from that element and keep following the edges — is a path that eventually loops around (because the set of vertices is limited, and at some point we have to go to a vertex we have already visited). -![](../img/rho.jpg) +![The trajectory of an element resembles the greek letter ρ (rho), which is what the algorithm is named after](../img/rho.jpg) -Apart from this trick, Pollard's rho algorithm relies on a consequence from the Birthday paradox: we need to add $O(\sqrt{n})$ random numbers from $1$ to $n$ to a set until we get a collision. +Consider a trajectory of some particular element $x_0$: -Now, consider a trajectory of some element $x_0$: {$x_0$, $f(x_0)$, $f(f(x_0))$, $\ldots$}. 
+$$ +x_0, \; f(x_0), \; f(f(x_0)), \; \ldots +$$ -Make another sequence out of it, virtually taking each element modulo $p$, the lesser of prime divisors of $n$. +Now, let's make another sequence out of this one by reducing each element modulo $p$, the smallest prime divisor of $n$. -**Lemma.** The expected length in that sequence is $O(\sqrt[4]{n})$. +**Lemma.** The expected length of that sequence before it turns into a cycle is $O(\sqrt[4]{n})$. -**Proof.** Each time we walk a new edge, we generate a random number. It has some chance if looping around. +**Proof:** Since $p$ is the smallest divisor, $p \leq \sqrt n$. Each time we follow a new edge, we essentially generate a random number between $0$ and $p$ (we treat $f$ as a "deterministically-random" function). The birthday paradox states that we only need to generate $O(\sqrt p) = O(\sqrt[4]{n})$ numers until we get a collision and thus enter a loop. -As $p$ is the lesser divisor, $p \leq \sqrt n$. Now we need to plug it into the [Birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem): we need to add $O(\sqrt{p}) = O(\sqrt[4]{n})$ elements to the set to get a collision, which means that the. +Since we don't know $p$, this mod-$p$ sequence is only imaginary, but if find a cycle in it — that is, $i$ and $j$ such that -Another observation: the length of the "tail" and the cycle is equal in expectation, since when we loop around, we choose any vertex of the path we walked independently. +$$ +f^i(x_0) \equiv f^j(x_0) \pmod p +$$ + +then we can also find $p$ itself as + +$$ +p = \gcd(|f^i(x_0) - f^j(x_0)|, n) +$$ -Now, if we find a cycle in this sequence — $i$ and $j$ such that $f^i(x_0) \equiv f^j(x_0) \pmod p$ — we can find some divisor of $n$ using the $\gcd$ trick: $\gcd(|f^i(x_0) - f^j(x_0)|, n)$ would be less than $n$ and divisible by $p$. 
+The algorithm itself just finds this cycle and $p$ using this GCD trick and Floyd's "[tortoise and hare](https://en.wikipedia.org/wiki/Cycle_detection#Floyd's_tortoise_and_hare)" algorithm: we maintain two pointers $i$ and $j = 2i$ and check that -Floyd's cycle-finding algorithm +$$ +\gcd(|f^i(x_0) - f^j(x_0)|, n) \neq 1 +$$ -The algorithm itself just finds a loop in this sequence using the Ford algorithms, also known as the "hare and turtle" technique: we maintain two pointers $i$ and $j$ ($i = 2j$) and check that $f^i(x_0) \equiv f^j(x_0) \pmod p$, which is equivalent to checking $\gcd(|f^i(x_0) - f^j(x_0)|, n) \neq 1$. +which is equivalent to comparing $f^i(x_0)$ and $f^j(x_0)$ modulo $p$. Since $j$ (hare) is increasing at twice the rate of $i$ (tortoise), their difference is increasing by $1$ each iteration and eventually will become equal to (or a multiple of) the cycle length, with $i$ and $j$ pointing to the same elements. And as we proved half a page ago, reaching a cycle would only require $O(\sqrt[4]{n})$ iterations: ```c++ u64 f(u64 x, u64 mod) { @@ -290,7 +304,7 @@ u64 find_factor(u64 n) { } ``` -While it processes 25k 30-bit numbers — almost 15 times slower than the fastest algorithm we have — it drammatically outperforms every $\tilde{O}(\sqrt n)$ algorithm for 60-bit numbers, processing around 90 of them per second. +While it processes only ~25k 30-bit integers — almost 15 times slower than the fastest algorithm we have — it drammatically outperforms every $\tilde{O}(\sqrt n)$ algorithm for 60-bit numbers, factorizing around 90 of them per second. ### Pollard-Brent Algorithm @@ -412,6 +426,9 @@ If you have limited time, you should probably compute as much forward as possibl How to optimize for the *average* case is unclear. +Another observation: the length of the "tail" and the cycle is equal in expectation, since when we loop around, we choose any vertex of the path we walked independently. 
+ + ### Reducing Errors There are slightly more errors because we are a bit loose with modular arithmetic here. The error rate grows higher when we increase and decrease (due to overflows). From 428407e09d0461d13b55a5bae555d98bea75d320 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Thu, 26 May 2022 15:32:28 +0300 Subject: [PATCH 121/173] factorization edits --- .../english/hpc/algorithms/factorization.md | 76 ++++++++----------- 1 file changed, 33 insertions(+), 43 deletions(-) diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md index d44ca6af..fd61d441 100644 --- a/content/english/hpc/algorithms/factorization.md +++ b/content/english/hpc/algorithms/factorization.md @@ -1,7 +1,6 @@ --- title: Integer Factorization weight: 3 -draft: true --- The problem of factoring integers into primes is central to computational [number theory](/hpc/number-theory/). It has been [studied](https://www.cs.purdue.edu/homes/ssw/chapter3.pdf) since at least the 3rd century BC, and [many methods](https://en.wikipedia.org/wiki/Category:Integer_factorization_algorithms) have been developed that are efficient for different inputs. @@ -241,7 +240,11 @@ Apart from this trick, Pollard's rho algorithm relies on a consequence from the --> -Pollard's rho algorithm is a randomized $O(\sqrt[4]{n})$ integer factorization algorithm that makes use of the [birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem): one only needs to draw $\Theta(\sqrt{n})$ random numbers between $1$ and $n$ to get a collision with high probability. +Pollard's rho is a randomized $O(\sqrt[4]{n})$ integer factorization algorithm that makes use of the [birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem): + +> One only needs to draw $d = \Theta(\sqrt{n})$ random numbers between $1$ and $n$ to get a collision with high probability. 
+ +You can look up formal proof on Wikipedia, but the informal reasoning behind it is that that each of $d$ added numbers has a chance of approximately $\frac{d}{n}$ of colliding with anythin else, meaning that the expected number of collisions is $\frac{d^2}{n}$. If $d$ is asymptotically smaller than $\sqrt n$, then this ratio grows to zero as $n$ rises and to infinity otherwise. Consider some function $f(x)$ that takes a remainder $x \in [0, n)$ and maps it to some other remainder of $n$ in a way that that seems random from the number theory point of view. Specifically, we will use $f(x) = x^2 + 1 \bmod n$, which is random enough for our purposes. @@ -308,7 +311,9 @@ While it processes only ~25k 30-bit integers — almost 15 times slower than the ### Pollard-Brent Algorithm -Floyd's cycle-finding algorithm has a problem in that it does more iterator increments than necessary. One way to solve it is to memorize the values that the faster iterator visits and compute the gcd using the difference of $x_i$ and $x_{\lfloor i / 2 \rfloor}$, but it can also be done without extra memory using this trick: +Floyd's cycle-finding algorithm has a problem in that it moves iterators more than necessary: at least half of the vertices are visited one additional time by the slower iterator. + +One way to solve it is to memorize the values $x_i$ that the faster iterator visits and every two iterations compute the GCD using the difference of $x_i$ and $x_{\lfloor i / 2 \rfloor}$, but it can also be done without extra memory using a different principle: the tortoise doesn't move on every iteration, but it gets reset to the value of the faster iterator when the iteration number becomes a power of two. 
This lets us save additional iterations while still using the same GCD trick to compare $x_i$ and $x_{2^{\lfloor \log_2 i \rfloor}}$ on each iteration: ```c++ u64 find_factor(u64 n) { @@ -327,9 +332,11 @@ u64 find_factor(u64 n) { } ``` -It actually does *not* improve performance and even makes it ~1.5x *slower*, which probably has something to do with the fact that $x$ is stale. It spends most of the time computing the GCD and not advancing the iterator — in fact, the asymptotic of the algorithm is currently $O(\sqrt[4]{n} \log n)$ because of it. +Note that we also set an upper limit on the number of iterations so that the algorithm finishes in reasonable time and returns `1` if $n$ turns out to be a prime. + +It actually does *not* improve performance and even makes the algorithm ~1.5x *slower*, which probably has something to do with the fact that $x$ is stale. It spends most of the time computing the GCD and not advancing the iterator — in fact, the asymptotic of the algorithm is currently $O(\sqrt[4]{n} \log n)$ because of it. -We can remove the logarithm from the asymptotic using the fact that if one of $a$ and $b$ contains factor $p$, then $a \cdot b \bmod n$ will also contain it, so instead of computing $\gcd(a, n)$ and $\gcd(b, n)$, we can compute $\gcd(a \cdot b \bmod n, n)$. This way, we can group the calculations of GCP in groups of $M = O(\log n)$, we remove $\log n$ out of the asymptotic: +Instead of [optimizing the GCD itself](../gcd), we can optimize the number of its invocations. We can use the fact that if one of $a$ and $b$ contains factor $p$, then $a \cdot b \bmod n$ will also contain it, so instead of computing $\gcd(a, n)$ and $\gcd(b, n)$, we can compute $\gcd(a \cdot b \bmod n, n)$. 
This way, we can group the calculations of GCP in groups of $M = O(\log n)$, we remove $\log n$ out of the asymptotic: ```c++ const int M = 1024; @@ -357,11 +364,7 @@ It now works at 425 factorizations per second, bottlenecked by the speed of modu ### Optimizing Modulo -The next step is to actually apply [Montgomery Multiplication](/hpc/number-theory/montgomery/). - -This is exactly the type of problem when we need specific knowledge, because we have 64-bit modulo by not-compile-constants, and compiler can't really do much to optimize it. - -We do not need to convert numbers out of Montgomery representation before computing the GCD. +The final step is to apply [Montgomery multiplication](/hpc/number-theory/montgomery/): the modulo is constant, so we can perform all computations — advancing the iterator, multiplication, and even computing the GCD — in the Montgomery space where reduction is cheap. ```c++ struct Montgomery { @@ -410,47 +413,34 @@ u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) { } ``` -It processes around 3000 per second, which is ~3.8 faster than what [PARI](https://pari.math.u-bordeaux.fr/) library can do (invocated via [sage](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html)). - -### Further Optimization - -There might be a way to . - -It may be beneficial to start multiplying only after a certain threshold since there is little probability that we enter a cycle in the beginning. - -It may be worth it to run a few versions in parallel and stop whichever finishes first. If we run $p$ runs, it is expected to finish $\sqrt p$ times faster. Either scalar code and taking advantage of there being multiple execution ports for multiplication, or using [SIMD](/hpc/simd) instructions to do 4 or 8 multiplications in parallel. - -Would not be surprised to see another 3x improvement and throughputs of 10k/sec. 
- -If you have limited time, you should probably compute as much forward as possible, and then half the time computing the other. - -How to optimize for the *average* case is unclear. - -Another observation: the length of the "tail" and the cycle is equal in expectation, since when we loop around, we choose any vertex of the path we walked independently. +This implementation can processes around 3k 60-bit integers per second, which is ~3.8 faster than the [PARI](https://pari.math.u-bordeaux.fr/) library (invoked via [sage](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html)). +### Further Improvements -### Reducing Errors +I belive there is still a lot of potential for optimization in our implementation of the Pollard's algorithm: -There are slightly more errors because we are a bit loose with modular arithmetic here. The error rate grows higher when we increase and decrease (due to overflows). +- There is probably be a better cycle-finding algorithm that exploits the fact that the graph is random. It is currently bottlenecked by advancing the iterator (the latency of Montgomery multiplication is much higher than its reciprocal throughput), and while we do that, we could calculate more than one multiplication of the values we've seen to detect a loop sooner. On the other hand, there is little chance that we enter the loop in within the first few iterations, so we may just advance the iterator for some time before starting the trials with the GCD trick. +- If we run $p$ independent instances of the algorithm with different seeds in parallel and stop when one of them finds the answer, it would finish $\sqrt p$ times faster (try to prove it). We don't have to use multiple cores for that: there is a lot of untapped [instruction-level parallelism](/hpc/pipelining/), so we could run two or three pairs of operations on the same thread, or use [SIMD](/hpc/simd) instructions to perform 4 or 8 multiplications in parallel. 
-Our implementation has less than 0.7% error rate, but it grows higher if the numbers are lower than $10^{18}$. +I would not be surprised to see another 3x improvement and a throughput of ~10k/sec. -Since Pollard's rho algorithm is randomized, you need to account for errors. There may be several sources: + -- Factors not being found (need to perform a primality test and start again if it's negative). -- The `p` variable can get zeroed out (need to either restart or roll back and do it iteration-by-iteration). -- Overflows in Montgomery multiplication (our implementation is pretty loose). +Another aspect that we need to handle in a practical implementation is possible errors. Our current implementation has a 0.7% error rate which grows higher if the numbers are lower than $10^{18}$. They come from three main sources: -### Larger Numbers +- Factors simply not being found (the algorithm is inherently randomized, and there is no guarantee that they will be found). In this case, we need to perform a primality test and optionally start again. +- The `p` variable becoming zero (because both $p$ and $q$ can get into the product). It becomes increasingly more likely as we decrease size of the inputs or increase the constant `M`. In this case, we need to either restart the process or (better) roll back the last $M$ iterations and perform the trials one-by-one. +- Overflows in the Montgomery multiplication. Our current implementation is pretty loose with them, and if $n$ is large, we need to add more `x > mod ? x - mod : x` kind of statements to deal with overflows. -"How big are your numbers?" determines the method to use: +These issues become less important if we exclude small numbers and numbers with small prime factors using the algorithms we've implemented before. In general the optimal approach should depend on the size of the numbers: -- Less than 2^16 or so: Lookup table. -- Less than 2^70 or so: Richard Brent's modification of Pollard's rho algorithm. 
-- Less than 10^50: Lenstra elliptic curve factorization -- Less than 10^100: Quadratic Sieve -- More than 10^100: General Number Field Sieve +- Smaller than $2^{16}$: use a lookup table +- Smaller than $2^{32}$: use a list of precomputed primes with a fast divsibility check +- Smaller than $2^{64}$ or so: use Pollard's rho algorithm with Montgomery multiplication +- Smaller than $10^{50}$: switch to [Lenstra elliptic curve factorization](https://en.wikipedia.org/wiki/Lenstra_elliptic-curve_factorization) +- Smaller than $10^{100}$: switch to [Quadratic Sieve](https://en.wikipedia.org/wiki/Quadratic_sieve) +- Larger than $10^{100}$: switch to [General Number Field Sieve](https://en.wikipedia.org/wiki/General_number_field_sieve) -Requiring about 100KB of memory. + -6542 * 8 +If you [implement](https://github.com/sslotin/amh-code/tree/main/factor) some of these ideas, please [let me know](http://sereja.me/). From 19143a513bdc88a564391fa4b71f5d01e3ef6a0b Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Thu, 26 May 2022 18:44:58 +0300 Subject: [PATCH 122/173] factorization improvements --- .../english/hpc/algorithms/factorization.md | 21 ++++++++++--------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md index fd61d441..7fc51f93 100644 --- a/content/english/hpc/algorithms/factorization.md +++ b/content/english/hpc/algorithms/factorization.md @@ -362,7 +362,7 @@ u64 find_factor(u64 n) { It now works at 425 factorizations per second, bottlenecked by the speed of modulo. -### Optimizing Modulo +### Optimizing the Modulo The final step is to apply [Montgomery multiplication](/hpc/number-theory/montgomery/): the modulo is constant, so we can perform all computations — advancing the iterator, multiplication, and even computing the GCD — in the Montgomery space where reduction is cheap. 
@@ -413,26 +413,27 @@ u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) { } ``` -This implementation can processes around 3k 60-bit integers per second, which is ~3.8 faster than the [PARI](https://pari.math.u-bordeaux.fr/) library (invoked via [sage](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html)). +This implementation can processes around 3k 60-bit integers per second, which is ~3.8 faster than what [PARI](https://pari.math.u-bordeaux.fr/) / [SageMath](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html)'s `factor` function measures. ### Further Improvements -I belive there is still a lot of potential for optimization in our implementation of the Pollard's algorithm: +**Optimizations.** There is still a lot of potential for optimization in our implementation of the Pollard's algorithm: -- There is probably be a better cycle-finding algorithm that exploits the fact that the graph is random. It is currently bottlenecked by advancing the iterator (the latency of Montgomery multiplication is much higher than its reciprocal throughput), and while we do that, we could calculate more than one multiplication of the values we've seen to detect a loop sooner. On the other hand, there is little chance that we enter the loop in within the first few iterations, so we may just advance the iterator for some time before starting the trials with the GCD trick. -- If we run $p$ independent instances of the algorithm with different seeds in parallel and stop when one of them finds the answer, it would finish $\sqrt p$ times faster (try to prove it). We don't have to use multiple cores for that: there is a lot of untapped [instruction-level parallelism](/hpc/pipelining/), so we could run two or three pairs of operations on the same thread, or use [SIMD](/hpc/simd) instructions to perform 4 or 8 multiplications in parallel. 
+- We could probably use a better cycle-finding algorithm, exploiting the fact that the graph is random. For example, there is little chance that we enter the loop within the first few iterations (the length of the cycle and the path we walk before entering it should be equal in expectation since before we loop around, we choose the vertex of the path we've walked independently), so we may just advance the iterator for some time before starting the trials with the GCD trick. +- Our current approach is bottlenecked by advancing the iterator (the latency of Montgomery multiplication is much higher than its reciprocal throughput), and while we are waiting for it to complete, we could perform more than just one trial using the previous values. +- If we run $p$ independent instances of the algorithm with different seeds in parallel and stop when one of them finds the answer, it would finish $\sqrt p$ times faster (the reasoning is similar to the birthday paradox; try to prove it yourself). We don't have to use multiple cores for that: there is a lot of untapped [instruction-level parallelism](/hpc/pipelining/), so we could concurrently run two or three of the same operations on the same thread, or use [SIMD](/hpc/simd) instructions to perform 4 or 8 multiplications in parallel. -I would not be surprised to see another 3x improvement and a throughput of ~10k/sec. +I would not be surprised to see another 3x improvement and a throughput of ~10k/sec. If you [implement](https://github.com/sslotin/amh-code/tree/main/factor) some of these ideas, please [let me know](http://sereja.me/). -Another aspect that we need to handle in a practical implementation is possible errors. Our current implementation has a 0.7% error rate which grows higher if the numbers are lower than $10^{18}$. They come from three main sources: +**Errors.** Another aspect that we need to handle in a practical implementation is possible errors. 
Our current implementation has a 0.7% error rate for 60-bit integers, and it grows higher if the numbers are smaller. These errors come from three main sources: -- Factors simply not being found (the algorithm is inherently randomized, and there is no guarantee that they will be found). In this case, we need to perform a primality test and optionally start again. +- A cycle simply not being found (the algorithm is inherently random, and there is no guarantee that it will be found). In this case, we need to perform a primality test and optionally start again. - The `p` variable becoming zero (because both $p$ and $q$ can get into the product). It becomes increasingly more likely as we decrease size of the inputs or increase the constant `M`. In this case, we need to either restart the process or (better) roll back the last $M$ iterations and perform the trials one-by-one. - Overflows in the Montgomery multiplication. Our current implementation is pretty loose with them, and if $n$ is large, we need to add more `x > mod ? x - mod : x` kind of statements to deal with overflows. -These issues become less important if we exclude small numbers and numbers with small prime factors using the algorithms we've implemented before. In general the optimal approach should depend on the size of the numbers: +**Larger numbers.** These issues become less important if we exclude small numbers and numbers with small prime factors using the algorithms we've implemented before. In general, the optimal approach should depend on the size of the numbers: - Smaller than $2^{16}$: use a lookup table - Smaller than $2^{32}$: use a list of precomputed primes with a fast divsibility check @@ -443,4 +444,4 @@ These issues become less important if we exclude small numbers and numbers with -If you [implement](https://github.com/sslotin/amh-code/tree/main/factor) some of these ideas, please [let me know](http://sereja.me/). 
+The last three approaches are very different from what we've been doing and require much more advanced number theory, and they deserve an article (or a full-length university course) of their own. From 709340d509d45719c5f9d76432d273a1d84d44c5 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Thu, 26 May 2022 19:22:45 +0300 Subject: [PATCH 123/173] pollard edits --- .../english/hpc/algorithms/factorization.md | 44 +++++++++---------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md index 7fc51f93..07bf7408 100644 --- a/content/english/hpc/algorithms/factorization.md +++ b/content/english/hpc/algorithms/factorization.md @@ -244,11 +244,11 @@ Pollard's rho is a randomized $O(\sqrt[4]{n})$ integer factorization algorithm t > One only needs to draw $d = \Theta(\sqrt{n})$ random numbers between $1$ and $n$ to get a collision with high probability. -You can look up formal proof on Wikipedia, but the informal reasoning behind it is that that each of $d$ added numbers has a chance of approximately $\frac{d}{n}$ of colliding with anythin else, meaning that the expected number of collisions is $\frac{d^2}{n}$. If $d$ is asymptotically smaller than $\sqrt n$, then this ratio grows to zero as $n$ rises and to infinity otherwise. +The reasoning behind it is that each of the $d$ added elements has a $\frac{d}{n}$ chance of colliding with some other element, implying that the expected number of collisions is $\frac{d^2}{n}$. If $d$ is asymptotically smaller than $\sqrt n$, then this ratio tends to zero as $n \to \infty$, and to infinity otherwise. -Consider some function $f(x)$ that takes a remainder $x \in [0, n)$ and maps it to some other remainder of $n$ in a way that that seems random from the number theory point of view. Specifically, we will use $f(x) = x^2 + 1 \bmod n$, which is random enough for our purposes. 
+Consider some function $f(x)$ that takes a remainder $x \in [0, n)$ and maps it to some other remainder of $n$ in a way that seems random from the number theory point of view. Specifically, we will use $f(x) = x^2 + 1 \bmod n$, which is random enough for our purposes. -Now, consider a graph where each number-vertex $x$ has an edge pointing to $f(x)$. Such graphs are called *functional*. In functional graphs, the "trajectory" of any element — the path we walk if we start from that element and keep following the edges — is a path that eventually loops around (because the set of vertices is limited, and at some point we have to go to a vertex we have already visited). +Now, consider a graph where each number-vertex $x$ has an edge pointing to $f(x)$. Such graphs are called *functional*. In functional graphs, the "trajectory" of any element — the path we walk if we start from that element and keep following the edges — is a path that eventually loops around (because the set of vertices is limited, and at some point, we have to go to a vertex we have already visited). ![The trajectory of an element resembles the greek letter ρ (rho), which is what the algorithm is named after](../img/rho.jpg) @@ -258,11 +258,11 @@ $$ x_0, \; f(x_0), \; f(f(x_0)), \; \ldots $$ -Now, let's make another sequence out of this one by reducing each element modulo $p$, the smallest prime divisor of $n$. +Let's make another sequence out of this one by reducing each element modulo $p$, the smallest prime divisor of $n$. -**Lemma.** The expected length of that sequence before it turns into a cycle is $O(\sqrt[4]{n})$. +**Lemma.** The expected length of the reduced sequence before it turns into a cycle is $O(\sqrt[4]{n})$. -**Proof:** Since $p$ is the smallest divisor, $p \leq \sqrt n$. Each time we follow a new edge, we essentially generate a random number between $0$ and $p$ (we treat $f$ as a "deterministically-random" function). 
The birthday paradox states that we only need to generate $O(\sqrt p) = O(\sqrt[4]{n})$ numers until we get a collision and thus enter a loop. +**Proof:** Since $p$ is the smallest divisor, $p \leq \sqrt n$. Each time we follow a new edge, we essentially generate a random number between $0$ and $p$ (we treat $f$ as a "deterministically-random" function). The birthday paradox states that we only need to generate $O(\sqrt p) = O(\sqrt[4]{n})$ numbers until we get a collision and thus enter a loop. Since we don't know $p$, this mod-$p$ sequence is only imaginary, but if find a cycle in it — that is, $i$ and $j$ such that @@ -307,13 +307,13 @@ u64 find_factor(u64 n) { } ``` -While it processes only ~25k 30-bit integers — almost 15 times slower than the fastest algorithm we have — it drammatically outperforms every $\tilde{O}(\sqrt n)$ algorithm for 60-bit numbers, factorizing around 90 of them per second. +While it processes only ~25k 30-bit integers — which is almost 15 times slower than by checking each prime using a fast division trick — it dramatically outperforms every $\tilde{O}(\sqrt n)$ algorithm for 60-bit numbers, factorizing around 90 of them per second. ### Pollard-Brent Algorithm Floyd's cycle-finding algorithm has a problem in that it moves iterators more than necessary: at least half of the vertices are visited one additional time by the slower iterator. -One way to solve it is to memorize the values $x_i$ that the faster iterator visits and every two iterations compute the GCD using the difference of $x_i$ and $x_{\lfloor i / 2 \rfloor}$, but it can also be done without extra memory using a different principle: the tortoise doesn't move on every iteration, but it gets reset to the value of the faster iterator when the iteration number becomes a power of two. 
This lets us save additional iterations while still using the same GCD trick to compare $x_i$ and $x_{2^{\lfloor \log_2 i \rfloor}}$ on each iteration: +One way to solve it is to memorize the values $x_i$ that the faster iterator visits and, every two iterations, compute the GCD using the difference of $x_i$ and $x_{\lfloor i / 2 \rfloor}$. But it can also be done without extra memory using a different principle: the tortoise doesn't move on every iteration, but it gets reset to the value of the faster iterator when the iteration number becomes a power of two. This lets us save additional iterations while still using the same GCD trick to compare $x_i$ and $x_{2^{\lfloor \log_2 i \rfloor}}$ on each iteration: ```c++ u64 find_factor(u64 n) { @@ -332,11 +332,11 @@ u64 find_factor(u64 n) { } ``` -Note that we also set an upper limit on the number of iterations so that the algorithm finishes in reasonable time and returns `1` if $n$ turns out to be a prime. +Note that we also set an upper limit on the number of iterations so that the algorithm finishes in a reasonable amount of time and returns `1` if $n$ turns out to be a prime. -It actually does *not* improve performance and even makes the algorithm ~1.5x *slower*, which probably has something to do with the fact that $x$ is stale. It spends most of the time computing the GCD and not advancing the iterator — in fact, the asymptotic of the algorithm is currently $O(\sqrt[4]{n} \log n)$ because of it. +It actually does *not* improve performance and even makes the algorithm ~1.5x *slower*, which probably has something to do with the fact that $x$ is stale. It spends most of the time computing the GCD and not advancing the iterator — in fact, the time requirement of this algorithm is currently $O(\sqrt[4]{n} \log n)$ because of it. -Instead of [optimizing the GCD itself](../gcd), we can optimize the number of its invocations. 
We can use the fact that if one of $a$ and $b$ contains factor $p$, then $a \cdot b \bmod n$ will also contain it, so instead of computing $\gcd(a, n)$ and $\gcd(b, n)$, we can compute $\gcd(a \cdot b \bmod n, n)$. This way, we can group the calculations of GCP in groups of $M = O(\log n)$, we remove $\log n$ out of the asymptotic: +Instead of [optimizing the GCD itself](../gcd), we will optimize the number of its invocations. We can use the fact that if one of $a$ and $b$ contains factor $p$, then $a \cdot b \bmod n$ will also contain it, so instead of computing $\gcd(a, n)$ and $\gcd(b, n)$, we can compute $\gcd(a \cdot b \bmod n, n)$. This way, by grouping the GCD computations in batches of $M = O(\log n)$, we remove the $\log n$ factor from the asymptotic complexity: ```c++ const int M = 1024; @@ -360,11 +360,11 @@ u64 find_factor(u64 n) { } ``` -It now works at 425 factorizations per second, bottlenecked by the speed of modulo. +Now it performs 425 factorizations per second, bottlenecked by the speed of modulo. ### Optimizing the Modulo -The final step is to apply [Montgomery multiplication](/hpc/number-theory/montgomery/): the modulo is constant, so we can perform all computations — advancing the iterator, multiplication, and even computing the GCD — in the Montgomery space where reduction is cheap. +The final step is to apply [Montgomery multiplication](/hpc/number-theory/montgomery/). Since the modulo is constant, we can perform all computations — advancing the iterator, multiplication, and even computing the GCD — in the Montgomery space where reduction is cheap: ```c++ struct Montgomery { @@ -413,7 +413,7 @@ u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) { } ``` -This implementation can processes around 3k 60-bit integers per second, which is ~3.8 faster than what [PARI](https://pari.math.u-bordeaux.fr/) / [SageMath](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html)'s `factor` function measures.
+This implementation can process around 3k 60-bit integers per second, which is ~3.8 times faster than what [PARI](https://pari.math.u-bordeaux.fr/) / [SageMath's `factor`](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html) function measures. ### Further Improvements @@ -423,24 +423,24 @@ This implementation can processes around 3k 60-bit integers per second, which is - Our current approach is bottlenecked by advancing the iterator (the latency of Montgomery multiplication is much higher than its reciprocal throughput), and while we are waiting for it to complete, we could perform more than just one trial using the previous values. - If we run $p$ independent instances of the algorithm with different seeds in parallel and stop when one of them finds the answer, it would finish $\sqrt p$ times faster (the reasoning is similar to the Birthday paradox; try to prove it yourself). We don't have to use multiple cores for that: there is a lot of untapped [instruction-level parallelism](/hpc/pipelining/), so we could concurrently run two or three of the same operations on the same thread, or use [SIMD](/hpc/simd) instructions to perform 4 or 8 multiplications in parallel. -I would not be surprised to see another 3x improvement and a throughput of ~10k/sec. If you [implement](https://github.com/sslotin/amh-code/tree/main/factor) some of these ideas, please [let me know](http://sereja.me/). +I would not be surprised to see another 3x improvement and throughput of ~10k/sec. If you [implement](https://github.com/sslotin/amh-code/tree/main/factor) some of these ideas, please [let me know](http://sereja.me/). **Errors.** Another aspect that we need to handle in a practical implementation is possible errors. Our current implementation has a 0.7% error rate for 60-bit integers, and it grows higher if the numbers are lower.
These errors come from three main sources: - A cycle simply not being found (the algorithm is inherently random, and there is no guarantee that it will be found). In this case, we need to perform a primality test and optionally start again. -- The `p` variable becoming zero (because both $p$ and $q$ can get into the product). It becomes increasingly more likely as we decrease size of the inputs or increase the constant `M`. In this case, we need to either restart the process or (better) roll back the last $M$ iterations and perform the trials one-by-one. +- The `p` variable becoming zero (because both $p$ and $q$ can get into the product). It becomes increasingly more likely as we decrease size of the inputs or increase the constant `M`. In this case, we need to either restart the process or (better) roll back the last $M$ iterations and perform the trials one by one. - Overflows in the Montgomery multiplication. Our current implementation is pretty loose with them, and if $n$ is large, we need to add more `x > mod ? x - mod : x` kind of statements to deal with overflows. **Larger numbers.** These issues become less important if we exclude small numbers and numbers with small prime factors using the algorithms we've implemented before. 
In general, the optimal approach should depend on the size of the numbers: -- Smaller than $2^{16}$: use a lookup table -- Smaller than $2^{32}$: use a list of precomputed primes with a fast divsibility check -- Smaller than $2^{64}$ or so: use Pollard's rho algorithm with Montgomery multiplication -- Smaller than $10^{50}$: switch to [Lenstra elliptic curve factorization](https://en.wikipedia.org/wiki/Lenstra_elliptic-curve_factorization) -- Smaller than $10^{100}$: switch to [Quadratic Sieve](https://en.wikipedia.org/wiki/Quadratic_sieve) -- Larger than $10^{100}$: switch to [General Number Field Sieve](https://en.wikipedia.org/wiki/General_number_field_sieve) +- Smaller than $2^{16}$: use a lookup table; +- Smaller than $2^{32}$: use a list of precomputed primes with a fast divisibility check; +- Smaller than $2^{64}$ or so: use Pollard's rho algorithm with Montgomery multiplication; +- Smaller than $10^{50}$: switch to [Lenstra elliptic curve factorization](https://en.wikipedia.org/wiki/Lenstra_elliptic-curve_factorization); +- Smaller than $10^{100}$: switch to [Quadratic Sieve](https://en.wikipedia.org/wiki/Quadratic_sieve); +- Larger than $10^{100}$: switch to [General Number Field Sieve](https://en.wikipedia.org/wiki/General_number_field_sieve). From ab5ffcb7135a3848720535b47694d95acb27d504 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Thu, 26 May 2022 19:44:35 +0300 Subject: [PATCH 124/173] elaborate on benchmarking --- content/english/hpc/algorithms/factorization.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md index 07bf7408..acfd0b0c 100644 --- a/content/english/hpc/algorithms/factorization.md +++ b/content/english/hpc/algorithms/factorization.md @@ -5,7 +5,7 @@ weight: 3 The problem of factoring integers into primes is central to computational [number theory](/hpc/number-theory/). 
It has been [studied](https://www.cs.purdue.edu/homes/ssw/chapter3.pdf) since at least the 3rd century BC, and [many methods](https://en.wikipedia.org/wiki/Category:Integer_factorization_algorithms) have been developed that are efficient for different inputs. -In this case study, we specifically consider the factorization of *word-sized* integers: those on the order of $10^9$ and $10^{18}$. Untypical for this book, in this one, you may actually learn an asymptotically better algorithm: we start with a few basic approaches and gradually build up to the $O(\sqrt[4]{n})$-time *Pollard's rho algorithm* and optimize it to the point where it can factorize 60-bit semiprimes in 0.3-0.4ms and almost 4 times faster than the previous state-of-the-art. +In this case study, we specifically consider the factorization of *word-sized* integers: those on the order of $10^9$ and $10^{18}$. Untypical for this book, in this one, you may actually learn an asymptotically better algorithm: we start with a few basic approaches and gradually build up to the $O(\sqrt[4]{n})$-time *Pollard's rho algorithm* and optimize it to the point where it can factorize 60-bit semiprimes in 0.3-0.4ms and ~3 times faster than the previous state-of-the-art. + *Instrumentation* is an overcomplicated term that means inserting timers and other tracking code into programs. The simplest example is using the `time` utility in Unix-like systems to measure the duration of execution for the whole program. More generally, we want to know *which parts* of the program need optimization. 
There are tools shipped with compilers and IDEs that can time designated functions automatically, but it is more robust to do it by hand using any methods of interacting with time that the language provides: From 1cd629fa9dde73de0d810890effbc4c7cdac4db8 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Fri, 10 Jun 2022 15:32:41 +0300 Subject: [PATCH 128/173] add anagrams problem --- content/russian/cs/programming/bayans.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/content/russian/cs/programming/bayans.md b/content/russian/cs/programming/bayans.md index d35880cc..9faf6139 100644 --- a/content/russian/cs/programming/bayans.md +++ b/content/russian/cs/programming/bayans.md @@ -307,6 +307,10 @@ def query(y): You are given $3 \cdot 10^5$ points on a plane. Choose any subset of 500 of them and solve the traveling salesman problem for it: find the shortest cycle that passes through all of these points. +## Anagrams + +Find the first substring of the string $s$ that is an anagram (a permutation of the characters) of the string $t$, in $O(n)$ time. + -Due to difficulties in [refraining the compiler from cheating](/hpc/profiling/noise/), the code snippets in this article are slightly simplified for exposition purposes. Check the [code repository](https://github.com/sslotin/amh-code/tree/main/cpu-cache) if you want to reproduce them yourself. +Due to difficulties in [preventing the compiler from optimizing away unused values](/hpc/profiling/noise/), the code snippets in this article are slightly simplified for exposition purposes. Check the [code repository](https://github.com/sslotin/amh-code/tree/main/cpu-cache) if you want to reproduce them yourself.
### Acknowledgements diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md index 8a4924ea..6e73d32d 100644 --- a/content/english/hpc/data-structures/binary-search.md +++ b/content/english/hpc/data-structures/binary-search.md @@ -9,7 +9,7 @@ Instead, the most fascinating showcases of performance engineering are multifold -In this article, we focus on one such fundamental algorithm — *binary search* — and implement two of its variants that are, depending on the problem size, up to 4x faster than `std::lower_bound`, while being under just 15 lines of code. +In this section, we focus on one such fundamental algorithm — *binary search* — and implement two of its variants that are, depending on the problem size, up to 4x faster than `std::lower_bound`, while being under just 15 lines of code. The first algorithm achieves that by removing [branches](/hpc/pipelining/branching), and the second also optimizes the memory layout to achieve better [cache system](/hpc/cpu-cache) performance. This technically disqualifies it from being a drop-in replacement for `std::lower_bound` as it needs to permute the elements of the array before it can start answering queries — but I can't recall a lot of scenarios where you obtain a sorted array but can't afford to spend linear time on preprocessing. @@ -401,7 +401,7 @@ Also, note that the last few prefetch requests are actually not needed, and in f This prefetching technique allows us to read up to four elements ahead, but it doesn't really come for free — we are effectively trading off excess memory [bandwidth](/hpc/cpu-cache/bandwidth) for reduced [latency](/hpc/cpu-cache/latency). If you run more than one instance at a time on separate hardware threads or just any other memory-intensive computation in the background, it will significantly [affect](/hpc/cpu-cache/sharing) the benchmark performance. -But we can do better. 
Instead of fetching four cache lines at a time, we could fetch four times *fewer* cache lines. And in the [next article](../s-tree), we will explore the approach. +But we can do better. Instead of fetching four cache lines at a time, we could fetch four times *fewer* cache lines. And in the [next section](../s-tree), we will explore the approach. -When you fetch anything from memory, there is always some non-zero latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through an incredibly complex system of address translation units and caching layers designed to both help in memory management and reduce the latency. +When you fetch anything from memory, there is always some latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through a complex system of address translation units and caching layers designed to both help in memory management and reduce the latency. Therefore, the only correct answer to this question is "it depends" — primarily on where the operands are stored: diff --git a/content/english/hpc/external-memory/hierarchy.md b/content/english/hpc/external-memory/hierarchy.md index f0ca9c65..da1f5bb6 100644 --- a/content/english/hpc/external-memory/hierarchy.md +++ b/content/english/hpc/external-memory/hierarchy.md @@ -58,7 +58,7 @@ There are other caches inside CPUs that are used for something other than data. ### Non-Volatile Memory -While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data to be persisted for prolonged periods of time without power but comes at the cost of performance and durability — because when you have more electrons, you also have more opportunities for them colliding with silicon atoms. 
+While the data cells in CPU caches and the RAM only gently store just a few electrons (which leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data persist for prolonged periods of time without power but comes at the cost of performance and durability because, when you have more electrons, you also have more opportunities for them to collide with silicon atoms. diff --git a/content/english/hpc/pipelining/_index.md b/content/english/hpc/pipelining/_index.md index e18a31cc..aab72d79 100644 --- a/content/english/hpc/pipelining/_index.md +++ b/content/english/hpc/pipelining/_index.md @@ -5,7 +5,7 @@ weight: 3 When programmers hear the word *parallelism*, they mostly think about *multi-core parallelism*, the practice of explicitly splitting a computation into semi-independent *threads* that work together to solve a common problem. -This type of parallelism is mainly about reducing *latency* and achieving *scalability*, but not about improving *efficiency*. You can solve a problem ten times as big with a parallel algorithm, but it would take at least ten times as much computational resources. Although parallel hardware is becoming [ever more abundant](/hpc/complexity/hardware), and parallel algorithm design is becoming an increasingly important area, for now, we will consider the use of more than one CPU core cheating. +This type of parallelism is mainly about reducing *latency* and achieving *scalability*, but not about improving *efficiency*. You can solve a problem ten times as big with a parallel algorithm, but it would take at least ten times as many computational resources. Although parallel hardware is becoming [ever more abundant](/hpc/complexity/hardware) and parallel algorithm design is becoming an increasingly important area, for now, we will limit ourselves to considering only a single CPU core.
But there are other types of parallelism, already existing inside a CPU core, that you can use *for free*. diff --git a/content/english/hpc/pipelining/branchless.md b/content/english/hpc/pipelining/branchless.md index 0f87da83..d7416f35 100644 --- a/content/english/hpc/pipelining/branchless.md +++ b/content/english/hpc/pipelining/branchless.md @@ -28,7 +28,7 @@ for (int i = 0; i < N; i++) s += (a[i] < 50) * a[i]; ``` -Suddenly, the loop now takes ~7 cycles per element instead of the original ~14. Also, the performance remains constant if we change `50` to some other threshold, so it doesn't depend on the branch probability. +The loop now takes ~7 cycles per element instead of the original ~14. Also, the performance remains constant if we change `50` to some other threshold, so it doesn't depend on the branch probability. But wait… shouldn't there still be a branch? How does `(a[i] < 50)` map to assembly? @@ -182,7 +182,7 @@ int abs(int a) { **Strings.** Oversimplifying things, an `std::string` is comprised of a pointer to a null-terminated char array (also known as "C-string") allocated somewhere on the heap and one integer containing the string size. -A very common value for strings is the empty string — which is also its default value. You also need to handle them somehow, and the idiomatic thing to do is to assign `nullptr` as the pointer and `0` as the string size, and then check if the pointer is null or if the size is zero at the beginning of every procedure involving strings. +A common value for strings is the empty string — which is also its default value. You also need to handle them somehow, and the idiomatic thing to do is to assign `nullptr` as the pointer and `0` as the string size, and then check if the pointer is null or if the size is zero at the beginning of every procedure involving strings. However, this requires a separate branch, which is costly unless most strings are empty. 
What we can do to get rid of it is to allocate a "zero C-string," which is just a zero byte allocated somewhere, and then simply point all empty strings there. Now all string operations with empty strings have to read this useless zero byte, but this is still much cheaper than a branch misprediction. @@ -216,7 +216,7 @@ That there are no substantial reasons why compilers can't do this on their own, --> -**Data-parallel programming.** Branchless programming is very important for [SIMD](/hpc/simd) applications, including GPU programming, because they don't have branching in the first place. +**Data-parallel programming.** Branchless programming is very important for [SIMD](/hpc/simd) applications because they don't have branching in the first place. In our array sum example, if you remove the `volatile` type qualifier from the accumulator, the compiler becomes able to [vectorize](/hpc/simd/auto-vectorization) the loop: diff --git a/content/english/hpc/pipelining/tables.md b/content/english/hpc/pipelining/tables.md index 5f69c579..ad90c400 100644 --- a/content/english/hpc/pipelining/tables.md +++ b/content/english/hpc/pipelining/tables.md @@ -33,7 +33,7 @@ Some comments: - Because our minds are so used to the cost model where "more" means "worse," people mostly use *reciprocals* of throughput instead of throughput. - If a certain instruction is especially frequent, its execution unit could be duplicated to increase its throughput — possibly to even more than one, but not higher than the [decode width](/hpc/architecture/layout). - Some instructions have a latency of 0. This means that these instruction are used to control the scheduler and don't reach the execution stage. They still have non-zero reciprocal throughput because the [CPU front-end](/hpc/architecture/layout) still needs to process them. 
-- Most instructions are pipelined, and if they have the reciprocal throughput of $n$, this usually means that their execution unit can take another instruction after $n$ cycles (and if it is below 1, this means that there are multiple execution units, all capable of taking another instruction on the next cycle). One notable exception is the [integer division](/hpc/arithmetic/division): it is either very poorly pipelined or not pipelined at all. +- Most instructions are pipelined, and if they have the reciprocal throughput of $n$, this usually means that their execution unit can take another instruction after $n$ cycles (and if it is below 1, this means that there are multiple execution units, all capable of taking another instruction on the next cycle). One notable exception is [integer division](/hpc/arithmetic/division): it is either very poorly pipelined or not pipelined at all. - Some instructions have variable latency, depending on not only the size, but also the values of the operands. For memory operations (including fused ones like `add`), the latency is usually specified for the best case (an L1 cache hit). There are many more important little details, but this mental model will suffice for now. From 59ca0451a59c0b3c81e1e542d2f8aff3588207c6 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Mon, 18 Jul 2022 01:17:15 +0300 Subject: [PATCH 132/173] four new theoretical problems --- content/russian/cs/programming/bayans.md | 37 ++++++++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/content/russian/cs/programming/bayans.md b/content/russian/cs/programming/bayans.md index 9faf6139..aee5deda 100644 --- a/content/russian/cs/programming/bayans.md +++ b/content/russian/cs/programming/bayans.md @@ -311,6 +311,43 @@ def query(y): Find the first substring of the string $s$ that is an anagram (a permutation of the characters) of the string $t$, in $O(n)$ time.
+## Functional Graph + +You are given a directed graph with $n < 10^5$ vertices in which exactly one edge leaves each vertex. Answer $q < 10^5$ queries of the form "which vertex do we end up at if we start at vertex $v_i$ and make $k_i < 10^{18}$ transitions" in $O(q + n)$ time. + +## Asynchronous Hat + +Seryozha and his $(n - 1)$ friends decided to play "hat," a game in which one player has to explain as many words as possible within a limited time so that their partner can guess them. + +Every player has to talk to every other player once; the game usually goes like this: + +- the 1st player explains words to the 2nd for one minute, +- the 2nd player explains words to the 3rd, +- ..., +- the $n$-th player explains words to the 1st, +- the 1st player explains words to the 3rd, +- the 2nd player explains words to the 4th… + +…and so on, until the $(n-1)$-th player finishes explaining words to the $(n-2)$-th. + +If many friends have gathered, the game can take quite a while. Seryozha wonders what the minimum duration of the game is if pairs of participants are allowed to talk simultaneously and in any order. + +For a given $n \le 500$, find the minimum amount of time $k$ and a schedule that achieves it. + +## Random coffee + +The company you work for employs an unknown number of people: anywhere from one to infinity, with equal probability. To combat loneliness, every employee takes part in "random coffee": each week, you meet a random person from the company to have coffee and talk about anything. + +You have taken part in random coffee $n$ times and have talked to $k$ different people (to some of them more than once). What is the most likely number of people working at the company? + +## Mafia + +13 people are playing "mafia": 10 civilians and 3 mafia members. All roles are dealt using a standard deck of playing cards: 10 red and 3 black cards were selected and shuffled in advance, and whoever draws a black card is mafia. All cards are distinct and known to everyone. The game begins with a daytime vote.
+How can the civilians guarantee a win? [SVG image markup lost in extraction; only the node labels 0-9 survive] From f3fb1ae8eceaaf73d231763b3bcf0fb3f4b964eb Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 19 Jul 2022 01:10:19 +0300 Subject: [PATCH 136/173] typos --- content/english/hpc/algorithms/gcd.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/english/hpc/algorithms/gcd.md b/content/english/hpc/algorithms/gcd.md index 7941edd0..6a4f8ca7 100644 --- a/content/english/hpc/algorithms/gcd.md +++ b/content/english/hpc/algorithms/gcd.md @@ -252,9 +252,9 @@ int gcd(int a, int b) { } ``` -It runs in 91ns — which is good enough to leave it there. +It runs in 91ns, which is good enough to leave it there. -If somebody wants to try to shove off a few more nanoseconds by re-writing assembly by hand or trying a lookup table to save a few last iterations, please [let me know](http://sereja.me/). +If somebody wants to try to shave off a few more nanoseconds by rewriting the assembly by hand or trying a lookup table to save a few last iterations, please [let me know](http://sereja.me/).
### Acknowledgements From 9d626692f78d3e173644d1bbbf8dbbca7d9c2d79 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 19 Jul 2022 01:28:13 +0300 Subject: [PATCH 137/173] improve wording --- content/english/hpc/algorithms/matmul.md | 2 +- content/english/hpc/cpu-cache/alignment.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index 02c68f36..5f2847d2 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -438,7 +438,7 @@ There is also an approach that performs asymptotically fewer arithmetic operatio FMA also supports 64-bit floating-point numbers, but it does not support integers: you need to perform addition and multiplication separately, which results in decreased performance. If you can guarantee that all intermediate results can be represented exactly as 32- or 64-bit floating-point numbers (which is [often the case](/hpc/arithmetic/errors/)), it may be faster to just convert them to and from floats. -You can also apply the same trick to other similar computations. One example is the "min-plus matrix multiplication," which is defined as: +This approach can be also applied to some similar-looking computations. One example is the "min-plus matrix multiplication" defined as: $$ (A \circ B)_{ij} = \min_{1 \le k \le n} (A_{ik} + B_{kj}) diff --git a/content/english/hpc/cpu-cache/alignment.md b/content/english/hpc/cpu-cache/alignment.md index 59579467..e9c5f4d3 100644 --- a/content/english/hpc/cpu-cache/alignment.md +++ b/content/english/hpc/cpu-cache/alignment.md @@ -185,4 +185,4 @@ int load(int *p) { } ``` -Compilers usually don't do that because this is not technically always legal: that 4th byte may be on a memory page that you don't own, so the operating system won't let you load it even if you are going to discard it right away. 
+Compilers usually don't do that because it's technically not legal: that 4th byte may be on a memory page that you don't own, so the operating system won't let you load it even if you are going to discard it right away. From 05f05c5b4eb587ff533769f3fca83486b0307890 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 19 Jul 2022 03:27:19 +0300 Subject: [PATCH 138/173] elaborating on eytzinger layout --- .../hpc/data-structures/binary-search.md | 33 ++++++++++--------- 1 file changed, 18 insertions(+), 15 deletions(-) diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md index 6e73d32d..d2f237cb 100644 --- a/content/english/hpc/data-structures/binary-search.md +++ b/content/english/hpc/data-structures/binary-search.md @@ -248,7 +248,7 @@ Apart from being compact, it has some nice properties, like that all even-number Here is how this layout looks when applied to binary search: -![](../img/eytzinger.png) +![Note that the tree is slightly imbalanced (because of the last layer is continuous)](../img/eytzinger.png) When searching in this layout, we just need to start from the first element of the array, and then on each iteration jump to either $2 k$ or $(2k + 1)$, depending on how the comparison went: @@ -278,15 +278,15 @@ void eytzinger(int k = 1) { } ``` -This function takes the current node number `k`, recursively writes out all elements to the left of the middle of the search interval, writes out the current element we'd compare against, and then recursively writes out all the elements on the right. It seems a bit complicated, but to convince ourselves that it works, we only need three observations: +This function takes the current node number `k`, recursively writes out all elements to the left of the middle of the search interval, writes out the current element we'd compare against, and then recursively writes out all the elements on the right. 
It seems a bit complicated, but to convince yourself that it works, you only need three observations: - It writes exactly `n` elements as we enter the body of `if` for each `k` from `1` to `n` just once. - It writes out sequential elements from the original array as it increments the `i` pointer each time. -- By the time we write the element at node `k`, we have already written all the elements to its left (exactly `i`). +- By the time we write the element at node `k`, we will have already written all the elements to its left (exactly `i`). -Despite being recursive, it is actually quite fast as all the memory reads are sequential, and the memory writes are only in $O(\log n)$ different memory blocks at a time. +Despite being recursive, it is actually quite fast as all the memory reads are sequential, and the memory writes are only in $O(\log n)$ different memory blocks at a time. Maintaining the permutation is both logically and computationally harder to maintain though: adding an element to a sorted array only requires shifting a suffix of its elements one position to the right, while Eytzinger array practically needs to be rebuilt from scratch. -Note that this traversal and the resulting permutation are not exactly equivalent to the "tree" of vanilla binary search: for example, the left child subtree may be larger than the right child subtree — and even more than just by one node — but it doesn't matter since both approaches result in the same logarithmic tree depth. +Note that this traversal and the resulting permutation are not exactly equivalent to the "tree" of vanilla binary search: for example, the left child subtree may be larger than the right child subtree — up to twice as large — but it doesn't matter much since both approaches result in the same $\lceil \log_2 n \rceil$ tree depth. Also note that the Eytzinger array is one-indexed — this will be important for performance later. 
You can put in the zeroth element the value that you want to be returned in the case when the lower bound doesn't exist (similar to `a.end()` for `std::lower_bound`). @@ -300,22 +300,25 @@ while (k <= n) k = 2 * k + (t[k] < x); ``` -The only problem arises when we need to restore the index of the resulting element, as $k$ may end up not pointing to a leaf node. Here is an example of how that can happen: +The only problem arises when we need to restore the index of the resulting element, as $k$ does not directly point to it. Consider this example (its corresponding tree is listed above): ```center - array: 1 2 3 4 5 6 7 8 -eytzinger: 5 3 7 2 4 6 8 1 -1st range: --------------- k := 1 -2nd range: ------- k := 2*k (=2) -3rd range: --- k := 2*k + 1 (=5) -4th range: - k := 2*k (=10) + array: 0 1 2 3 4 5 6 7 8 9 +eytzinger: 6 3 7 1 5 8 9 0 2 4 +1st range: ------------------- k := 1 +2nd range: ------------- k := 2*k = 2 (6 ≥ 3) +3rd range: ------- k := 2*k = 4 (3 ≥ 3) +4th range: --- k := 2*k + 1 = 9 (1 < 3) +5th range: - k := 2*k + 1 = 19 (2 < 3) ``` -Here we query the array of $[1, …, 8]$ for the lower bound of $x=4$. We compare it against $5$, $3$, and $4$, go left-right-left, and end up with $k = 10$, which isn't even a valid array index. + -The trick is to notice that, unless the answer is the last element of the array, we compare $x$ against it at some point, and after we've learned that it is not less than $x$, we start comparing $x$ against elements to the left, and all these comparisons evaluate true (that is, leading to the right). Therefore, to restore the answer, we just need to "cancel" some number of right turns. +Here we query the array of $[0, …, 9]$ for the lower bound of $x=3$. We compare it against $6$, $3$, $1$, and $2$, go left-left-right-right, and end up with $k = 19$, which isn't even a valid array index. 
-This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing 1s in the binary representation and right-shift $k$ by exactly that number of bits. To do this, we can invert the number (`~k`) and call the "find first set" instruction: +The trick is to notice that, unless the answer is the last element of the array, we compare $x$ against it at some point, and after we've learned that it is not less than $x$, we go left exactly once and then keep going right until we reach a leaf (because we will only be comparing $x$ against lesser elements). Therefore, to restore the answer, we just need to "cancel" some number of right turns and then one more. + +This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing 1s in the binary representation and right-shift $k$ by exactly that number of bits plus one. To do this, we can invert the number (`~k`) and call the "find first set" instruction: ```c++ int lower_bound(int x) { From c98fcddab8225ab707a71820da7ff45e744da04d Mon Sep 17 00:00:00 2001 From: song-jx <79297685+song-jx@users.noreply.github.com> Date: Tue, 19 Jul 2022 22:05:52 +0800 Subject: [PATCH 139/173] Fixed a problem that could cause out of bounds. Calling add(32, 0) when N = 33 will be out of bounds. 
--- content/english/hpc/data-structures/segment-trees.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/english/hpc/data-structures/segment-trees.md b/content/english/hpc/data-structures/segment-trees.md index f4c6fb7f..e98c16cb 100644 --- a/content/english/hpc/data-structures/segment-trees.md +++ b/content/english/hpc/data-structures/segment-trees.md @@ -594,7 +594,7 @@ constexpr int offset(int h) { int s = 0, n = N; while (h--) { s += (n + B - 1) / B * B; - n /= B; + n = (n + B - 1) / B; } return s; } From dad89c8d3155433d45875a057039bcc944ca98f8 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 19 Jul 2022 18:23:56 +0300 Subject: [PATCH 140/173] new branchless binary search --- .../hpc/data-structures/binary-search.md | 61 ++++++++++++++++--- 1 file changed, 51 insertions(+), 10 deletions(-) diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md index d2f237cb..f2e61ffb 100644 --- a/content/english/hpc/data-structures/binary-search.md +++ b/content/english/hpc/data-structures/binary-search.md @@ -3,6 +3,8 @@ title: Binary Search weight: 1 --- + + While improving the speed of user-facing applications is the end goal of performance engineering, people don't really get excited over 5-10% improvements in some databases. Yes, this is what software engineers are paid for, but these types of optimizations tend to be too intricate and system-specific to be readily generalized to other software. Instead, the most fascinating showcases of performance engineering are multifold optimizations of textbook algorithms: the kinds that everybody knows and deemed so simple that it would never even occur to try to optimize them in the first place. These optimizations are simple and instructive and can very much be adopted elsewhere. And they are surprisingly not as rare as you'd think. 
@@ -71,7 +73,7 @@ int lower_bound(int x) { Find the middle element of the search range, compare it to `x`, shrink the range in half. Beautiful in its simplicity. -A similar approach is employed by `std::lower_bound`, except that it needs to be more generic to support containers with non-random-access iterators and thus uses the first element and the size of the search interval instead of the two of its ends. Implementations from both [Clang](https://github.com/llvm-mirror/libcxx/blob/78d6a7767ed57b50122a161b91f59f19c9bd0d19/include/algorithm#L4169) and [GCC](https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algobase.h#L1023) use this metaprogramming monstrosity: +A similar approach is employed by `std::lower_bound`, except that it needs to be more generic to support containers with non-random-access iterators and thus uses the first element and the size of the search interval instead of the two of its ends. To this end, implementations from both [Clang](https://github.com/llvm-mirror/libcxx/blob/78d6a7767ed57b50122a161b91f59f19c9bd0d19/include/algorithm#L4169) and [GCC](https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algobase.h#L1023) use this metaprogramming monstrosity: ```c++ template @@ -131,23 +133,60 @@ Now, let's try to get rid of these obstacles one by one. ## Removing Branches -We can replace branching with [predication](/hpc/pipelining/branchless). To do this, we need to adopt the STL approach and rewrite the loop using the first element and the size of the search interval — instead of its first and last element. This way we only need to update the first element of the search interval with a `cmov` instruction and halve its size on each iteration: +We can replace branching with [predication](/hpc/pipelining/branchless). 
To make the task easier, we can adopt the STL approach and rewrite the loop using the first element and the size of the search interval (instead of its first and last element): ```c++ int lower_bound(int x) { int *base = t, len = n; while (len > 1) { int half = len / 2; - base = (base[half] < x ? &base[half] : base); + if (base[half - 1] < x) { + base += half; + len = len - half; + } else { + len = half; + } + } + return *base; +} +``` + +Note that, on each iteration, `len` is essentially just halved and then either floored or ceiled, depending on how the comparison went. This conditional update seems unnecessary; to avoid it, we can simply say that it's always ceiled: + +```c++ +int lower_bound(int x) { + int *base = t, len = n; + while (len > 1) { + int half = len / 2; + if (base[half - 1] < x) + base += half; + len -= half; // = ceil(len / 2) + } + return *base; +} +``` + +This way, we only need to update the first element of the search interval with a [conditional move](/hpc/pipelining/branchless/) and halve its size on each iteration: + +```c++ +int lower_bound(int x) { + int *base = t, len = n; + while (len > 1) { + int half = len / 2; + base += (base[half - 1] < x) * half; // will be replaced with a "cmov" len -= half; } - return *(base + (*base < x)); + return *base; } ``` -Note that this loop is not always equivalent to the standard binary search — it always rounds *up* the size of the search interval, so it accesses slightly different elements and may perform one comparison more than what is needed. We do this to make the number of iterations constant and remove the need for branching completely, although it does require an awkward `(*base < x)` check at the end. + -As typical for predication, this trick is very fragile to compiler optimizations. 
It doesn't make a difference on Clang — for some reason, it replaces the ternary operator with a branch anyway — but it works fine on GCC (9.3), yielding a 2.5-3x improvement on small arrays: +Note that this loop is not always equivalent to the standard binary search. Since it always rounds *up* the size of the search interval, it accesses slightly different elements and may perform one comparison more than needed. Apart from simplifying computations on each iteration, it also makes the number of iterations constant if the array size is constant, removing branch mispredictions completely. + +As typical for predication, this trick is very fragile to compiler optimizations — depending on the compiler and how the funciton is invoked, it may still leave a branch or generate suboptimal code. It works fine on Clang 10, yielding a 2.5-3x improvement on small arrays: + + ![](../img/search-branchless.svg) @@ -162,15 +201,17 @@ int lower_bound(int x) { int *base = t, len = n; while (len > 1) { int half = len / 2; - __builtin_prefetch(&base[(len - half) / 2]); - __builtin_prefetch(&base[half + (len - half) / 2]); - base = (base[half] < x ? 
&base[half] : base); len -= half; + __builtin_prefetch(&base[len / 2 - 1]); + __builtin_prefetch(&base[half + len / 2 - 1]); + base += (base[half - 1] < x) * half; } - return *(base + (*base < x)); + return *base; } ``` + + With prefetching, the performance on large arrays becomes roughly the same: ![](../img/search-branchless-prefetch.svg) From 3b2037f968fec31bf7f0ffef74a80df432338a51 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 20 Jul 2022 05:31:43 +0300 Subject: [PATCH 141/173] simplify code --- content/english/hpc/data-structures/segment-trees.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/english/hpc/data-structures/segment-trees.md b/content/english/hpc/data-structures/segment-trees.md index e98c16cb..90435a38 100644 --- a/content/english/hpc/data-structures/segment-trees.md +++ b/content/english/hpc/data-structures/segment-trees.md @@ -593,8 +593,8 @@ constexpr int height(int n) { constexpr int offset(int h) { int s = 0, n = N; while (h--) { - s += (n + B - 1) / B * B; n = (n + B - 1) / B; + s += n * B; } return s; } From b8e8ede0ad7a040478df5d985c5bdf417758385b Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 20 Jul 2022 05:37:35 +0300 Subject: [PATCH 142/173] change wording --- content/english/hpc/data-structures/segment-trees.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/english/hpc/data-structures/segment-trees.md b/content/english/hpc/data-structures/segment-trees.md index 90435a38..9ad14608 100644 --- a/content/english/hpc/data-structures/segment-trees.md +++ b/content/english/hpc/data-structures/segment-trees.md @@ -603,14 +603,14 @@ constexpr int H = height(N); alignas(64) int t[offset(H)]; // an array for storing nodes ``` -This way we effectively reduce the height of the tree by approximately $\frac{\log_B n}{\log_2 n} = \log_2 B$ times ($\sim4$ times if $B = 16$), but it becomes non-trivial to implement in-node operations efficiently. 
For our problem, we have two main options: +This way, we effectively reduce the height of the tree by approximately $\frac{\log_B n}{\log_2 n} = \log_2 B$ times ($\sim4$ times if $B = 16$), but it becomes non-trivial to implement in-node operations efficiently. For our problem, we have two main options: 1. We could store $B$ *sums* in each node (for each of its $B$ children). 2. We could store $B$ *prefix sums* in each node (the $i$-th being the sum of the first $(i + 1)$ children). If we go with the first option, the `add` query would be largely the same as in the bottom-up segment tree, but the `sum` query would need to add up to $B$ scalars in each node it visits. And if we go with the second option, the `sum` query would be trivial, but the `add` query would need to add `x` to some suffix on each node it visits. -In either case, one operation will perform $O(\log_B n)$ operations, touching just one scalar in each node, while the other will perform $O(B \cdot \log_B n)$ operations, touching up to $B$ scalars in each node. However, it is 21st century, and we can use [SIMD](/hpc/simd) to accelerate the slower operation. Since there are no fast [horizontal reductions](/hpc/simd/reduction) in SIMD instruction sets, but it is easy to add a vector to a vector, we will choose the second approach and store prefix sums in each node. +In either case, one operation would perform $O(\log_B n)$ operations, touching just one scalar in each node, while the other would perform $O(B \cdot \log_B n)$ operations, touching up to $B$ scalars in each node. We can, however, use [SIMD](/hpc/simd) to accelerate the slower operation, and since there are no fast [horizontal reductions](/hpc/simd/reduction) in SIMD instruction sets, but it is easy to add a vector to a vector, we will choose the second approach and store prefix sums in each node. 
This makes the `sum` query extremely fast and easy to implement: @@ -623,7 +623,7 @@ int sum(int k) { } ``` -The `add` query is more complicated and slower. We need to add a number to only a suffix of a node, and we can do this by [masking out](/hpc/simd/masking) the positions that need not be modified. +The `add` query is more complicated and slower. We need to add a number only to a suffix of a node, and we can do this by [masking out](/hpc/simd/masking) the positions that should not be modified. We can pre-calculate a $B \times B$ array corresponding to $B$ such masks that tell, for each of $B$ positions within a node, whether a certain prefix sum value needs to be updated or not: From 6a06a065e37eb052a4774a2415cfe9fd356acfce Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 20 Jul 2022 05:48:13 +0300 Subject: [PATCH 143/173] adjust header padding --- themes/algorithmica/assets/style.sass | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/themes/algorithmica/assets/style.sass b/themes/algorithmica/assets/style.sass index eb5e2410..b91f9a5f 100644 --- a/themes/algorithmica/assets/style.sass +++ b/themes/algorithmica/assets/style.sass @@ -187,10 +187,10 @@ menu display: flex font-family: $font-headings - height: 30px + height: 26px background-color: $background justify-content: space-between - padding: 12px + padding: 14px margin: 0 text-align: center From 1c8c455097f458dcb86f67e40c77fd1ca15830c6 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 20 Jul 2022 05:57:43 +0300 Subject: [PATCH 144/173] change simd titles --- content/english/hpc/_index.md | 4 ++-- content/english/hpc/simd/moving.md | 2 +- content/english/hpc/simd/reduction.md | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md index 942c9f6a..7a0068ff 100644 --- a/content/english/hpc/_index.md +++ b/content/english/hpc/_index.md @@ -163,8 +163,8 @@ Planned table of contents: 9.11. AoS and SoA 10. 
SIMD Parallelism 10.1. Intrinsics and Vector Types - 10.2. Loading and Writing Data - 10.3. Sums and Other Reductions + 10.2. Moving Data + 10.3. Reductions 10.4. Masking and Blending 10.5. In-Register Shuffles 10.6. Auto-Vectorization diff --git a/content/english/hpc/simd/moving.md b/content/english/hpc/simd/moving.md index 948c31c4..72cbbd33 100644 --- a/content/english/hpc/simd/moving.md +++ b/content/english/hpc/simd/moving.md @@ -1,5 +1,5 @@ --- -title: Loading and Writing Data +title: Moving Data aliases: [/hpc/simd/vectorization] weight: 2 --- diff --git a/content/english/hpc/simd/reduction.md b/content/english/hpc/simd/reduction.md index c67c1942..5a0ace1e 100644 --- a/content/english/hpc/simd/reduction.md +++ b/content/english/hpc/simd/reduction.md @@ -1,5 +1,5 @@ --- -title: Sums and Other Reductions +title: Reductions weight: 3 --- From af2c2b90dedcd2ab701cff977383529244219dbc Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 20 Jul 2022 06:11:31 +0300 Subject: [PATCH 145/173] improving search --- themes/algorithmica/assets/style.sass | 3 +++ themes/algorithmica/layouts/partials/head.html | 1 + 2 files changed, 4 insertions(+) diff --git a/themes/algorithmica/assets/style.sass b/themes/algorithmica/assets/style.sass index b91f9a5f..00a420cf 100644 --- a/themes/algorithmica/assets/style.sass +++ b/themes/algorithmica/assets/style.sass @@ -236,6 +236,9 @@ menu background: $code-background border: $code-border + &:focus + outline: 1px solid $dimmed + #search-count margin-top: 8px color: $dimmed diff --git a/themes/algorithmica/layouts/partials/head.html b/themes/algorithmica/layouts/partials/head.html index 2f4c3c46..c5013dba 100644 --- a/themes/algorithmica/layouts/partials/head.html +++ b/themes/algorithmica/layouts/partials/head.html @@ -45,6 +45,7 @@ if (window.getComputedStyle(searchDiv).display == 'none') { searchDiv.style.display = 'block' window.scrollTo({ top: 0 }); + document.getElementById('search-bar').focus() } else { 
searchDiv.style.display = 'none' } From 72e00452f0bbd5d30d436ff207fc3f91ee3c678d Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 20 Jul 2022 07:47:49 +0300 Subject: [PATCH 146/173] spmd --- content/english/hpc/_index.md | 2 +- content/english/hpc/simd/_index.md | 2 +- .../english/hpc/simd/auto-vectorization.md | 26 ++++++++++++++----- 3 files changed, 21 insertions(+), 9 deletions(-) diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md index 7a0068ff..8d73bcb0 100644 --- a/content/english/hpc/_index.md +++ b/content/english/hpc/_index.md @@ -167,7 +167,7 @@ Planned table of contents: 10.3. Reductions 10.4. Masking and Blending 10.5. In-Register Shuffles - 10.6. Auto-Vectorization + 10.6. Auto-Vectorization and SPMD 11. Algorithm Case Studies 11.1. Binary GCD (11.2. Prime Number Sieves) diff --git a/content/english/hpc/simd/_index.md b/content/english/hpc/simd/_index.md index 5e05da8e..50f6e3ed 100644 --- a/content/english/hpc/simd/_index.md +++ b/content/english/hpc/simd/_index.md @@ -43,6 +43,6 @@ In particular, AVX2 has instructions for working with 256-bit registers, while b ![](img/intel-extensions.webp) -Compilers often do a good job rewriting simple loops with SIMD instructions, like in the case above. This optimization is called [auto-vectorization](auto-vectorization), and it is the preferred way to use SIMD. +Compilers often do a good job rewriting simple loops with SIMD instructions, like in the case above. This optimization is called [auto-vectorization](auto-vectorization), and it is the most popular way of using SIMD. The problem is that it only works with certain types of loops, and even then it often yields suboptimal results. To understand its limitations, we need to get our hands dirty and explore this technology on a lower level, which is what we are going to do in this chapter. 
diff --git a/content/english/hpc/simd/auto-vectorization.md b/content/english/hpc/simd/auto-vectorization.md index 5fc568c3..b7b8a45f 100644 --- a/content/english/hpc/simd/auto-vectorization.md +++ b/content/english/hpc/simd/auto-vectorization.md @@ -1,15 +1,17 @@ --- -title: Auto-Vectorization +title: Auto-Vectorization and SPMD weight: 10 --- -SIMD-parallelism is most often used for *embarrassingly parallel* computations: the kinds where all you do is apply some elementwise function to all elements of an array and write it back somewhere else. In this setting, you don't even need to know how SIMD works: the compiler is perfectly capable of optimizing such loops by itself — you just need to be aware that such optimization exists and that it usually yields a 5-10x speedup. +SIMD parallelism is most often used for *embarrassingly parallel* computations: the kinds where all you do is apply some elementwise function to all elements of an array and write it back somewhere else. In this setting, you don't even need to know how SIMD works: the compiler is perfectly capable of optimizing such loops by itself — you just need to be aware that such optimization exists and that it usually yields a 5-10x speedup. -Doing nothing and relying on auto-vectorization is actually the preferred way of using SIMD. Whenever you can, you should always stick with the scalar code for its simplicity and maintainability. But often even the loops that seem straightforward to vectorize are not optimized because of some technical nuances. [As in many other cases](/hpc/compilation/contracts), the compiler may need some additional input from the programmer as he may know a bit more about the problem than what can be inferred from static analysis. +Doing nothing and relying on auto-vectorization is actually the most popular way of using SIMD. In fact, in many cases, it even advised to stick with the plain scalar code for its simplicity and maintainability. 
+ +But often even the loops that seem straightforward to vectorize are not optimized because of some technical nuances. [As in many other cases](/hpc/compilation/contracts), the compiler may need some additional input from the programmer as he may know a bit more about the problem than what can be inferred from static analysis. ### Potential Problems -Consider the "a + b" example: +Consider the "a + b" example we [started with](../intrinsics/#simd-intrinsics): ```c++ void sum(int *a, int *b, int *c, int n) { @@ -47,8 +49,18 @@ for (int i = 0; i < n; i++) To help the compiler eliminate this corner case, we can use the `alignas` specifier on static arrays and the `std::assume_aligned` function to mark pointers aligned. -**Checking if vectorization happened.** In either case, it is useful to check if the compiler vectorized the loop the way you intended. You can either [compiling it to assembly](/hpc/compilation/stages) and look for blocks for instructions that start with a "v" or add the `-fopt-info-vec-optimized` compiler flag so that the compiler indicates where auto-vectorization is happening and what SIMD width is being used. If you swap `optimized` for `missed` or `all`, you may also get some reasoning behind why it is not happening in other places. +**Checking if vectorization happened.** In either case, it is useful to check if the compiler vectorized the loop the way you intended. You can either [compiling it to assembly](/hpc/compilation/stages) and look for blocks for instructions that start with a "v" or add the `-fopt-info-vec-optimized` compiler flag so that the compiler indicates where auto-vectorization is happening and what SIMD width is being used. If you swap `optimized` for `missed` or `all`, you may also get some reasoning behind why it is not happening in other places. 
---- +There are [many other ways](https://software.intel.com/sites/default/files/m/4/8/8/2/a/31848-CompilerAutovectorizationGuide.pdf) of telling the compiler exactly what we mean, but in especially complex cases — e.g., when there are a lot of branches or function calls inside the loop — it is easier to go one level of abstraction down and vectorize manually. + +### SPMD + +There is a neat compromise between auto-vectorization and the manual use of SIMD intrinsics: "single program, multiple data" (SPMD). This is a model of computation in which the programmer writes what appears to be a regular serial program, but that is actually executed in parallel on the hardware. + +The programming experience is largely the same, and there is still the fundamental limitation in that the computation must be data-parallel, but SPMD ensures that the vectorization will happen regardless of the compiler and the target CPU architecture. It also allows for the computation to be automatically parallelized across multiple cores and, in some cases, even offloaded to other types of parallel hardware. + +There is support for SPMD is some modern languages ([Julia](https://docs.julialang.org/en/v1/base/base/#Base.SimdLoop.@simd)), multiprocessing APIs ([OpenMP](https://www.openmp.org/spec-html/5.0/openmpsu42.html)), and specialized compilers (Intel [ISPC](https://ispc.github.io/)), but it has seen the most success in the context of GPU programming where both problems and hardware are massively parallel. + +We will cover this model of computation in much more depth in Part 2 -There are [many other ways](https://software.intel.com/sites/default/files/m/4/8/8/2/a/31848-CompilerAutovectorizationGuide.pdf) of telling the compiler what we meant exactly, but in especially complex cases — when inside the loop there are a lot of branches or some functions are called — it is easier to go down to the intrinsics level and write it yourself. 
+ From af8d237fcc253fd5a1d32d281e26fa94f2cae948 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 20 Jul 2022 08:39:36 +0300 Subject: [PATCH 147/173] update index --- content/english/hpc/_index.md | 28 ++++++++++++++++------------ 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md index 8d73bcb0..a1ff7f42 100644 --- a/content/english/hpc/_index.md +++ b/content/english/hpc/_index.md @@ -39,11 +39,11 @@ After that, I will mostly be fixing errors and only doing some minor edits refle **Pre-ordering / financially supporting the book.** Due to my unfortunate citizenship and place of birth, you can't — that is, until I find a way that at the same time complies with international sanctions, doesn't sponsor [the war](https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukraine), and won't put me in prison for tax evasion. -So, don't bother. If you want to support this book, just share the articles you like on link aggregators and social media and help fix typos — that would be enough. +So, don't bother. If you want to support this book, just share it and help fix typos — that would be enough. **Translations.** The website has a separate functionality for creating and managing translations — and I've already been contacted by some nice people willing to translate the book into Italian and Chinese (and I will personally translate at least some of it into my native Russian). -However, as the book is still evolving, it is probably not the best idea to start translating it at least until Part I is finished. That said, you are very much encouraged to make translations of any articles and publish them in your blogs — just send me the link so that we can merge it back when a centralized translation process starts. +However, as the book is still evolving, it is probably not the best idea to start translating it at least until Part I is finished. 
That said, you are very much encouraged to make translations of any articles and publish them in your blogs — just send me the link so that we can merge it back when centralized translation starts. **"Translating" the Russian version.** The articles hosted at [ru.algorithmica.org/cs/](https://ru.algorithmica.org/cs/) are not about advanced performance engineering but mostly about classical computer science algorithms — without discussing how to speed them up beyond asymptotic complexity. Most of the information there is not unique and already exists in English on some other places on the internet: for example, the similar-spirited [cp-algorithms.com](https://cp-algorithms.com/). @@ -51,7 +51,7 @@ However, as the book is still evolving, it is probably not the best idea to star There are two highly impactful textbooks on which most computer science courses are built. Both are undoubtedly outstanding, but [one of them](https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming) is 50 years old, and [the other](https://en.wikipedia.org/wiki/Introduction_to_Algorithms) is 30 years old, and [computers have changed a lot](/hpc/complexity/hardware) since then. Asymptotic complexity is not the sole deciding factor anymore. In modern practical algorithm design, you choose the approach that makes better use of different types of parallelism available in the hardware over the one that theoretically does fewer raw operations on galaxy-scale inputs. -And yet, the computer science curricula in most colleges completely ignore this shift. 
Although there are some great courses that aim to correct that — such as "[Performance Engineering of Software Systems](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/)" from MIT, "[Programming Parallel Computers](https://ppc.cs.aalto.fi/)" from Aalto University, and some non-academic ones like Denis Bakhvalov's "[Performance Ninja](https://github.com/dendibakh/perf-ninja)" — most computer science graduates still treat the hardware like something from the 1990s. +And yet, the computer science curricula in most colleges completely ignore this shift. Although there are some great courses that aim to correct that — such as "[Performance Engineering of Software Systems](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/)" from MIT, "[Programming Parallel Computers](https://ppc.cs.aalto.fi/)" from Aalto University, and some non-academic ones like Denis Bakhvalov's "[Performance Ninja](https://github.com/dendibakh/perf-ninja)" — most computer science graduates still treat modern hardware like something from the 1990s. What I really want to achieve is that performance engineering becomes taught right after introduction to algorithms. Writing the first comprehensive textbook on the subject is a large part of it, and this is why I rush to finish it by the summer so that the colleges can pick it up in the next academic year. But creating a new course requires more than that: you need a balanced curriculum, course infrastructure, lecture slides, lab assignments… so for some time after finishing the main book, I will be working on course materials and tools for *teaching* performance engineering — and I'm looking forward to collaborating with other people who want to make it a reality as well. @@ -76,7 +76,7 @@ Competitive programming is, in my opinion, misguided. 
They are doing useless thi The first part covers the basics of computer architecture and optimization of single-threaded algorithms. -It walks through the main CPU optimization topics such as caching, SIMD and pipelining, and provides brief examples in C++, followed by large case studies where we usually achieve a significant speedup over some STL algorithm or data structure. +It walks through the main CPU optimization topics such as caching, SIMD, and pipelining, and provides brief examples in C++, followed by large case studies where we usually achieve a significant speedup over some STL algorithm or data structure. Planned table of contents: @@ -94,7 +94,7 @@ Planned table of contents: 1.4. Functions and Recursion 1.5. Indirect Branching 1.6. Machine Code Layout - 1.7. Interrupts and System Calls + 1.7. System Calls 1.8. Virtualization 3. Instruction-Level Parallelism 3.1. Pipeline Hazards @@ -215,7 +215,7 @@ Among the cool things that we will speed up: - optimal Karatsuba Algorithm - optimal FFT -This work is largely based on blog posts, research papers, conference talks and other work authored by a lot of people: +This work is largely based on blog posts, research papers, conference talks, and other work authored by a lot of people: - [Agner Fog](https://agner.org/optimize/) - [Daniel Lemire](https://lemire.me/en/#publications) @@ -248,29 +248,33 @@ This work is largely based on blog posts, research papers, conference talks and - [Creel](https://www.youtube.com/c/WhatsACreel) Volume: 450-600 pages -Release date: Q2 2022 +Release date: Q3 2022 ### Part II: Parallel Algorithms -Concurrency, models of parallelism, green threads and concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking and graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication and sorting. 
+Concurrency, models of parallelism, green threads, concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking, graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication, sorting. Volume: 150-200 pages -Release date: 2023? +Release date: 2023-2024? ### Part III: Distributed Computing -Communication-constrained algorithms, message passing, actor model, partitioning, MapReduce, consistency and reliability at scale, storage, compression, scheduling and cloud computing, distributed deep learning. +(I might need some help from here on.) + +Metworking, message passing, actor model, communication-constrained algorithms, distributed primitives, all-reduce, MapReduce, stream processing, query planning, storage, sharding, compression, consistency, reliability, scheduling, cloud computing. Release date: ??? (more likely to be completed than not) ### Part IV: Compilers and Domain-Specific Architectures -LLVM IR, compiler optimizations, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++ and oneAPI, XLA, Verilog, FPGAs, ASICs, TPUs and other AI accelerators. +(TODO: come up with a better title — one that emphasizes that this part is mainly about the software-hardware boundary and not PL/IC design.) + +LLVM IR, compiler optimizations & back-end, interpreters, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++, oneAPI, XLA, (basic) Verilog, FPGAs, ASICs, TPUs and other AI accelerators. Release date: ??? (less likely to be completed than not) ### Disclaimer: Technology Choices -The examples in this book use C++, GCC, x86-64, CUDA, and Spark, although the underlying principles we aim to convey are not specific to them. +The examples in this book use C++, GCC, x86-64, CUDA, and Spark, although the underlying principles conveyed are not specific to them. 
To clear my conscience, I'm not happy with any of these choices: these technologies just happen to be the most widespread and stable at the moment and thus more helpful to the reader. I would have respectively picked C / Rust, LLVM, arm, OpenCL, and Dask; maybe there will be a 2nd edition in which some of the tech stack is changed. From 20b8479c5ac2ed627cd86baa12e4e6656074c8ae Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 20 Jul 2022 22:52:44 +0300 Subject: [PATCH 148/173] update hpc index --- content/english/hpc/_index.md | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md index a1ff7f42..8c5e9ef2 100644 --- a/content/english/hpc/_index.md +++ b/content/english/hpc/_index.md @@ -239,6 +239,10 @@ This work is largely based on blog posts, research papers, conference talks, and - [Geoff Langdale](https://branchfree.org/) - [Matt Kulukundis](https://twitter.com/JuvHarlequinKFM) - [Georg Sauthoff](https://gms.tf/) +- [Danila Kutenin](https://danlark.org/author/kutdanila/) +- [Ivica Bogosavljević](https://johnysswlab.com/author/ibogi/) +- [Matt Pharr](https://pharr.org/matt/) +- [Jan Wassenberg](https://research.google/people/JanWassenberg/) - [Marshall Lochbaum](https://mlochbaum.github.io/publications.html) - [Pavel Zemtsov](https://pzemtsov.github.io/) - [Nayuki](https://www.nayuki.io/category/programming) @@ -252,22 +256,22 @@ Release date: Q3 2022 ### Part II: Parallel Algorithms -Concurrency, models of parallelism, green threads, concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking, graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication, sorting. 
+Concurrency, models of parallelism, context switching, green threads, concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking, graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication, sorting. Volume: 150-200 pages Release date: 2023-2024? ### Part III: Distributed Computing -(I might need some help from here on.) + -Metworking, message passing, actor model, communication-constrained algorithms, distributed primitives, all-reduce, MapReduce, stream processing, query planning, storage, sharding, compression, consistency, reliability, scheduling, cloud computing. +Metworking, message passing, actor model, communication-constrained algorithms, distributed primitives, all-reduce, MapReduce, stream processing, query planning, storage, sharding, compression, distributed databases, consistency, reliability, scheduling, workflow engines, cloud computing. Release date: ??? (more likely to be completed than not) -### Part IV: Compilers and Domain-Specific Architectures +### Part IV: Software & Hardware -(TODO: come up with a better title — one that emphasizes that this part is mainly about the software-hardware boundary and not PL/IC design.) + LLVM IR, compiler optimizations & back-end, interpreters, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++, oneAPI, XLA, (basic) Verilog, FPGAs, ASICs, TPUs and other AI accelerators. @@ -277,4 +281,4 @@ Release date: ??? (less likely to be completed than not) The examples in this book use C++, GCC, x86-64, CUDA, and Spark, although the underlying principles conveyed are not specific to them. -To clear my conscience, I'm not happy with any of these choices: these technologies just happen to be the most widespread and stable at the moment and thus more helpful to the reader. 
I would have respectively picked C / Rust, LLVM, arm, OpenCL, and Dask; maybe there will be a 2nd edition in which some of the tech stack is changed. +To clear my conscience, I'm not happy with any of these choices: these technologies just happen to be the most widespread and stable at the moment and thus more helpful to the reader. I would have respectively picked C / Rust / [Carbon?](https://github.com/carbon-language/carbon-lang), LLVM, arm, OpenCL, and Dask; maybe there will be a 2nd edition in which some of the tech stack is changed. From 6b522385797429bf1a1b5c0295f33ac73350e1a1 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 20 Jul 2022 23:55:19 +0300 Subject: [PATCH 149/173] edit number theory intro --- content/english/hpc/number-theory/_index.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/content/english/hpc/number-theory/_index.md b/content/english/hpc/number-theory/_index.md index 6812e14c..d66f85fd 100644 --- a/content/english/hpc/number-theory/_index.md +++ b/content/english/hpc/number-theory/_index.md @@ -3,17 +3,15 @@ title: Number Theory weight: 7 --- -*Disclaimer: this chapter is a very early draft that is probably not worth reading yet.* - In 1940, a British mathematician [G. H. Hardy](https://en.wikipedia.org/wiki/G._H._Hardy) published a famous essay titled "[A Mathematician's Apology](https://en.wikipedia.org/wiki/A_Mathematician%27s_Apology)" discussing the notion that mathematics should be pursued for its own sake rather than for the sake of its applications. -I personally don't agree — and I wrote this book partially to show that there are way too few people working on practical algorithm design instead of theoretical computer science — but I understand where Hardy is coming from. Being 62 years old, he witnessed the devastation caused by the First and the ongoing Second World War that was greatly amplified by the weaponization of science. 
+Similar to mathematics, the various fields of computer science also form a spectrum, with mathematical logic and computability theory on one end and web programming and application development on the other. I assume that you, the reader, is more on the applied side: this book was written to show that there are way too few people working on practical algorithm design instead of theoretical computer science — and since you got to Chapter 7, you probably also believe in that statement. -As a number theorist, Hardy finds calm working in a "useless" field and not having to face any moral dilemmas, writing: +But, regardless of the personal views on the matter, one can see where Hardy is coming from. Being 62 years old at the moment of writing, he witnessed the devastation caused by the First and the ongoing Second World War — which was greatly amplified by the weaponization of science. As a number theorist, Hardy finds calm working in a "useless" field and not having to face any moral dilemmas, writing: > No one has yet discovered any warlike purpose to be served by the theory of numbers or relativity, and it seems unlikely that anyone will do so for many years. -Ironically, this statement was proved very wrong just 5 years later with the development of the atomic bomb, which would not have been possible without the [understanding](https://en.wikipedia.org/wiki/Einstein%E2%80%93Szil%C3%A1rd_letter) of relativity, and the inception of computer-era cryptography, which extensively builds on number theory. +Ironically, this statement was proved very wrong just 5 years later with the development of the atomic bomb, which would not have been possible without the [understanding](https://en.wikipedia.org/wiki/Einstein%E2%80%93Szil%C3%A1rd_letter) of relativity, and the inception of computer-era cryptography, which extensively builds on number theory — the computational aspect of which is the topic of this chapter. 
@@ -54,7 +54,7 @@ $$ \bar{x} = x \cdot r \bmod n $$ -Computing this transformation involves a multiplication and a modulo — an expensive operation that we wanted to optimize away in the first place — which is why we don't use this method for general modular multiplication and only long sequences of operations where transforming numbers to and from the Montgomery space is worth it. +Computing this transformation involves a multiplication and a modulo — an expensive operation that we wanted to optimize away in the first place — which is why we only use this method when the overhead of transforming numbers to and from the Montgomery space is worth it and not for general modular multiplication. @@ -287,6 +287,6 @@ int inverse(int _a) { } ``` -While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158s we omit `transform` and `reduce` (a reasonable use case in modular arithmetic is for `inverse` to be used as a subprocedure in a bigger computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types. +While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158s we omit `transform` and `reduce` (a reasonable use case is for `inverse` to be used as a subprocedure in a bigger modular computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types. **Exercise.** Implement efficient *modular* [matix multiplication](/hpc/algorithms/matmul). 
From a05f571a1762a1f6f8d8b6b329cdbde03b2f56a6 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Thu, 21 Jul 2022 02:49:51 +0300 Subject: [PATCH 152/173] move acknowledgements section --- content/english/hpc/_index.md | 58 +++++++++++++++++++---------------- 1 file changed, 31 insertions(+), 27 deletions(-) diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md index 8c5e9ef2..ed71792a 100644 --- a/content/english/hpc/_index.md +++ b/content/english/hpc/_index.md @@ -215,7 +215,35 @@ Among the cool things that we will speed up: - optimal Karatsuba Algorithm - optimal FFT -This work is largely based on blog posts, research papers, conference talks, and other work authored by a lot of people: +Volume: 450-600 pages +Release date: Q3 2022 + +### Part II: Parallel Algorithms + +Concurrency, models of parallelism, context switching, green threads, concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking, graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication, sorting. + +Volume: 150-200 pages +Release date: 2023-2024? + +### Part III: Distributed Computing + + + +Metworking, message passing, actor model, communication-constrained algorithms, distributed primitives, all-reduce, MapReduce, stream processing, query planning, storage, sharding, compression, distributed databases, consistency, reliability, scheduling, workflow engines, cloud computing. + +Release date: ??? (more likely to be completed than not) + +### Part IV: Software & Hardware + + + +LLVM IR, compiler optimizations & back-end, interpreters, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++, oneAPI, XLA, (basic) Verilog, FPGAs, ASICs, TPUs and other AI accelerators. + +Release date: ??? 
(less likely to be completed than not) + +### Acknowledgements + +The book is largely based on blog posts, research papers, conference talks, and other work authored by a lot of people: - [Agner Fog](https://agner.org/optimize/) - [Daniel Lemire](https://lemire.me/en/#publications) @@ -245,38 +273,14 @@ This work is largely based on blog posts, research papers, conference talks, and - [Jan Wassenberg](https://research.google/people/JanWassenberg/) - [Marshall Lochbaum](https://mlochbaum.github.io/publications.html) - [Pavel Zemtsov](https://pzemtsov.github.io/) +- [Gustavo Duarte](https://manybutfinite.com/) +- [Nyaan](https://nyaannyaan.github.io/library/) - [Nayuki](https://www.nayuki.io/category/programming) - [InstLatX64](https://twitter.com/InstLatX64) - [ridiculous_fish](https://ridiculousfish.com/blog/) - [Z boson](https://stackoverflow.com/users/2542702/z-boson) - [Creel](https://www.youtube.com/c/WhatsACreel) -Volume: 450-600 pages -Release date: Q3 2022 - -### Part II: Parallel Algorithms - -Concurrency, models of parallelism, context switching, green threads, concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking, graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication, sorting. - -Volume: 150-200 pages -Release date: 2023-2024? - -### Part III: Distributed Computing - - - -Metworking, message passing, actor model, communication-constrained algorithms, distributed primitives, all-reduce, MapReduce, stream processing, query planning, storage, sharding, compression, distributed databases, consistency, reliability, scheduling, workflow engines, cloud computing. - -Release date: ??? 
(more likely to be completed than not) - -### Part IV: Software & Hardware - - - -LLVM IR, compiler optimizations & back-end, interpreters, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++, oneAPI, XLA, (basic) Verilog, FPGAs, ASICs, TPUs and other AI accelerators. - -Release date: ??? (less likely to be completed than not) - ### Disclaimer: Technology Choices The examples in this book use C++, GCC, x86-64, CUDA, and Spark, although the underlying principles conveyed are not specific to them. From 19bb6305fb564080bc8f0e8995bfeb51038116bd Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Fri, 22 Jul 2022 01:49:24 +0300 Subject: [PATCH 153/173] links to floyd-warshall --- content/english/hpc/algorithms/matmul.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md index 5f2847d2..cf976045 100644 --- a/content/english/hpc/algorithms/matmul.md +++ b/content/english/hpc/algorithms/matmul.md @@ -474,9 +474,9 @@ for (int k = 0; k < n; k++) d[i][j] = min(d[i][j], d[i][k] + d[k][j]); ``` -Interestingly, vectorizing the distance product and executing it $O(\log n)$ times in $O(n^3 \log n)$ total operations is faster than naively executing the Floyd-Warshall algorithm in $O(n^3)$ operations, although not by a lot. +Interestingly, similarly vectorizing the distance product and executing it $O(\log n)$ times ([or possibly fewer](https://arxiv.org/pdf/1904.01210.pdf)) in $O(n^3 \log n)$ total operations is faster than naively executing the Floyd-Warshall algorithm in $O(n^3)$ operations, although not by a lot. -As an exercise, try to speed up this "for-for-for" computation. It is harder to do than in the matrix multiplication case because now there is a logical dependency between the iterations, and you need to perform updates in a particular order, but it is still possible to design a similar kernel and a block iteration order that achieves a 30-50x total speedup. 
+As an exercise, try to speed up this "for-for-for" computation. It is harder to do than in the matrix multiplication case because now there is a logical dependency between the iterations, and you need to perform updates in a particular order, but it is still possible to design [a similar kernel and a block iteration order](https://github.com/sslotin/amh-code/blob/main/floyd/blocked.cc) that achieves a 30-50x total speedup. ## Acknowledgements From fd9bdbea9477ed7e4e0c749f2967bf5997bb73a8 Mon Sep 17 00:00:00 2001 From: Rinat Valiullov <9755333+RinatValiullov@users.noreply.github.com> Date: Tue, 26 Jul 2022 01:54:42 +0500 Subject: [PATCH 154/173] fix typo (duplicate text) --- content/russian/cs/sorting/bubble.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/content/russian/cs/sorting/bubble.md b/content/russian/cs/sorting/bubble.md index 2d9af9b5..38fa5c8a 100644 --- a/content/russian/cs/sorting/bubble.md +++ b/content/russian/cs/sorting/bubble.md @@ -1,9 +1,10 @@ --- title: Сортировка пузырьком weight: 1 +published: true --- -Наш первый подход будет заключаться в следующем: обозначим за $n$ длину массива и $n$ раз пройдёмся раз пройдемся по нему слева направо, меняя два соседних элемента, если первый больше второго. +Наш первый подход будет заключаться в следующем: обозначим за $n$ длину массива и $n$ раз пройдёмся по нему слева направо, меняя два соседних элемента, если первый больше второго. Каждую итерацию максимальный элемент «всплывает» как пузырек к концу массива — отсюда и название. 
From 326755608c2464b2fddf960cf972b03d2f8a684f Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Thu, 28 Jul 2022 09:05:21 +0300 Subject: [PATCH 155/173] underline eytzinger search example --- .../english/hpc/data-structures/binary-search.md | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md index f2e61ffb..7401712e 100644 --- a/content/english/hpc/data-structures/binary-search.md +++ b/content/english/hpc/data-structures/binary-search.md @@ -343,7 +343,7 @@ while (k <= n) The only problem arises when we need to restore the index of the resulting element, as $k$ does not directly point to it. Consider this example (its corresponding tree is listed above): -```center + + +
      +    array:  0 1 2 3 4 5 6 7 8 9                           
      +eytzinger:  6 3 7 1 5 8 9 0 2 4                           
      +1st range:  -------------------  k := 1                    
      +2nd range:  -------------        k := 2*k     = 2   (6 ≥ 3)
      +3rd range:  -------              k := 2*k     = 4   (3 ≥ 3)
      +4th range:      ---              k := 2*k + 1 = 9   (1 < 3)
      +5th range:        -              k := 2*k + 1 = 19  (2 < 3)
      +
      From da216d6c81f59d334f3c9c26cf2ce768871314bb Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Thu, 28 Jul 2022 09:17:31 +0300 Subject: [PATCH 156/173] fix example --- .../english/hpc/data-structures/binary-search.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md index 7401712e..85f9ef52 100644 --- a/content/english/hpc/data-structures/binary-search.md +++ b/content/english/hpc/data-structures/binary-search.md @@ -354,13 +354,13 @@ eytzinger: 6 3 7 1 5 8 9 0 2 4 -->
      -    array:  0 1 2 3 4 5 6 7 8 9                           
      -eytzinger:  6 3 7 1 5 8 9 0 2 4                           
      -1st range:  -------------------  k := 1                    
      -2nd range:  -------------        k := 2*k     = 2   (6 ≥ 3)
      -3rd range:  -------              k := 2*k     = 4   (3 ≥ 3)
      -4th range:      ---              k := 2*k + 1 = 9   (1 < 3)
      -5th range:        -              k := 2*k + 1 = 19  (2 < 3)
      +    array:  0 1 2 3 4 5 6 7 8 9                            
      +eytzinger:  6 3 7 1 5 8 9 0 2 4                            
      +1st range:  ------------?------  k := 2*k     = 2   (6 ≥ 3)
      +2nd range:  ------?------        k := 2*k     = 4   (3 ≥ 3)
      +3rd range:  --?----              k := 2*k + 1 = 9   (1 < 3)
      +4th range:      ?--              k := 2*k + 1 = 19  (2 < 3)
      +5th range:        !                                        
       
      From 0d811cc49a1784a813f071a2aeb5755e5dfd958a Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Thu, 28 Jul 2022 13:50:38 +0300 Subject: [PATCH 157/173] add s-tree rank example --- content/english/hpc/data-structures/s-tree.md | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/content/english/hpc/data-structures/s-tree.md b/content/english/hpc/data-structures/s-tree.md index d241aed5..875f72ec 100644 --- a/content/english/hpc/data-structures/s-tree.md +++ b/content/english/hpc/data-structures/s-tree.md @@ -102,7 +102,19 @@ int i = __builtin_ffs(mask) - 1; // now i is the number of the correct child node ``` -Unfortunately, the compilers are not smart enough yet to auto-vectorize this code, so we need to manually vectorize it with intrinsics: +Unfortunately, the compilers are not smart enough to [auto-vectorize](/hpc/simd/auto-vectorization/) this code yet, so we have to optimize it manually. In AVX2, we can load 8 elements, compare them against the search key, producing a [vector mask](/hpc/simd/masking/), and then extract the scalar mask from it with `movemask`. Here is a minimized illustrated example of what we want to do: + +```center + y = 4 17 65 103 + x = 42 42 42 42 + y ≥ x = 00000000 00000000 11111111 11111111 + ├┬┬┬─────┴────────┴────────┘ +movemask = 0011 + ┌─┘ + ffs = 3 +``` + +Since we are limited to processing 8 elements at a time (half our block / cache line size), we have to split the elements into two groups and then combine the two 8-bit masks. To do this, it will be slightly easier to swap the condition for `x > y` and compute the inverted mask instead: ```c++ typedef __m256i reg; @@ -114,7 +126,7 @@ int cmp(reg x_vec, int* y_ptr) { } ``` -This function works for 8-element vectors, which is half our block / cache line size. 
To process the entire block, we need to call it twice and then combine the masks: +Now, to process the entire block, we need to call it twice and combine the masks: ```c++ int mask = ~( @@ -123,7 +135,7 @@ int mask = ~( ); ``` -Now, to descend down the tree, we use `ffs` on that mask to get the correct child number and just call the `go` function we defined earlier: +To descend down the tree, we use `ffs` on that mask to get the correct child number and just call the `go` function we defined earlier: ```c++ int i = __builtin_ffs(mask) - 1; From f01a7d3df6e6a885fb5b63376df5a0399981bbe2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Anti=20R=C3=A4is?= Date: Fri, 29 Jul 2022 18:23:51 +0300 Subject: [PATCH 158/173] Improve wording. --- content/english/hpc/profiling/noise.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/content/english/hpc/profiling/noise.md b/content/english/hpc/profiling/noise.md index 74ff0272..243f3600 100644 --- a/content/english/hpc/profiling/noise.md +++ b/content/english/hpc/profiling/noise.md @@ -1,6 +1,7 @@ --- title: Getting Accurate Results weight: 10 +published: true --- It is not an uncommon for there to be two library algorithm implementations, each maintaining its own benchmarking code, and each claiming to be faster than the other. This confuses everyone involved, especially the users, who have to somehow choose between the two. @@ -111,7 +112,7 @@ for (int i = 0; i < N; i++) checksum ^= lower_bound(q[i]); ``` -It is also sometimes convenient to combine the warm-up run with answer validation, it if is more complicated than just computing some sort of checksum. +It is also sometimes convenient to combine the warm-up run with answer validation, if it is more complicated than just computing some sort of checksum. **Over-optimization.** Sometimes the benchmark is outright erroneous because the compiler just optimized the benchmarked code away. 
To prevent the compiler from cutting corners, you need to add checksums and either print them somewhere or add the `volatile` qualifier, which also prevents any sort of interleaving of loop iterations. From 20d53920f54959981cbab0c17b877a4025763cf4 Mon Sep 17 00:00:00 2001 From: Pasha Date: Sun, 31 Jul 2022 16:33:16 +0300 Subject: [PATCH 159/173] =?UTF-8?q?=D0=BD=D0=B5=D1=81=D0=BE=D0=B3=D0=BB?= =?UTF-8?q?=D0=B0=D1=81=D0=BE=D0=B2=D0=B0=D0=BD=D0=BE=D1=81=D1=82=D1=8C=20?= =?UTF-8?q?=D1=81=D0=BB=D0=BE=D0=B2=20=D0=B2=20=D0=BA=D0=BE=D0=BD=D1=86?= =?UTF-8?q?=D0=B5=20=D0=B0=D0=B1=D0=B7=D0=B0=D1=86=D0=B0?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- content/russian/cs/decomposition/scanline.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/content/russian/cs/decomposition/scanline.md b/content/russian/cs/decomposition/scanline.md index 4c9bcdf0..3bc99afd 100644 --- a/content/russian/cs/decomposition/scanline.md +++ b/content/russian/cs/decomposition/scanline.md @@ -1,11 +1,12 @@ --- title: Сканирующая прямая authors: -- Сергей Слотин + - Сергей Слотин prerequisites: -- /cs/range-queries -- /cs/segment-tree + - /cs/range-queries + - /cs/segment-tree weight: 1 +published: true --- Метод сканирующей прямой (англ. *scanline*) заключается в сортировке точек на координатной прямой либо каких-то абстрактных «событий» по какому-то признаку и последующему проходу по ним. @@ -22,7 +23,7 @@ weight: 1 Это решение можно улучшить. Отсортируем интересные точки по возрастанию координаты и пройдем по ним слева направо, поддерживая количество отрезков `cnt`, которые покрывают данную точку. Если в данной точке начинается отрезок, то надо увеличить `cnt` на единицу, а если заканчивается, то уменьшить. После этого пробуем обновить ответ на задачу текущим значением `cnt`. 
-Как такое писать: нужно представить интересные точки в виде структур с полями «координата» и «тип» (начало / конец) и отсортировать со своим компаратором. Удобно начало отрезка обозначать +1, а конец -1, чтобы просто прибавлять к `cnt` это значение и на разбирать случае. +Как такое писать: нужно представить интересные точки в виде структур с полями «координата» и «тип» (начало / конец) и отсортировать со своим компаратором. Удобно начало отрезка обозначать +1, а конец -1, чтобы просто прибавлять к `cnt` это значение и не разбивать на случаи. Единственный нюанс — если координаты двух точек совпали, чтобы получить правильный ответ, сначала надо рассмотреть все начала отрезков, а только потом концы (чтобы при обновлении ответа в этой координате учлись и правые, и левые граничные отрезки). From 6661563a59217abbe6f69c38d27a6af2cd69aeb4 Mon Sep 17 00:00:00 2001 From: Iago-lito Date: Fri, 5 Aug 2022 16:39:01 +0200 Subject: [PATCH 160/173] Update integer.md --- content/english/hpc/arithmetic/integer.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/english/hpc/arithmetic/integer.md b/content/english/hpc/arithmetic/integer.md index 47f5bd32..686db686 100644 --- a/content/english/hpc/arithmetic/integer.md +++ b/content/english/hpc/arithmetic/integer.md @@ -93,7 +93,7 @@ This seems like an important architecture aspect, but in most cases, it doesn't - Little-endian has the advantage that you can cast a value to a smaller type (e.g., `long long` to `int`) by just loading fewer bytes, which in most cases means doing nothing — thanks to *register aliasing*, `eax` refers to the first 4 bytes of `rax`, so conversion is essentially free. It is also easier to read values in a variety of type sizes — while on big-endian architectures, loading an `int` from a `long long` array would require shifting the pointer by 2 bytes. 
- Big-endian has the advantage that higher bytes are loaded first, which in theory can make highest-to-lowest routines such as comparisons and printing faster. You can also perform certain checks such as finding out whether a number is negative by only loading its first byte. -Big-endian is also more "natural" — this is how we write binary numbers on paper — but the advantage of having faster type conversions outweigh it. For this reason, little-endian is used by default on most hardware, although some CPUs are "bi-endian" and can be configured to switch modes on demand. +Big-endian is also more "natural" — this is how we write binary numbers on paper — but the advantage of having faster type conversions outweighs it. For this reason, little-endian is used by default on most hardware, although some CPUs are "bi-endian" and can be configured to switch modes on demand. ### 128-bit Integers From 387715b6c648a722b2fce506aedf2b79a38d18aa Mon Sep 17 00:00:00 2001 From: psn2706 <69345823+psn2706@users.noreply.github.com> Date: Wed, 10 Aug 2022 23:19:44 +0300 Subject: [PATCH 161/173] Correction of typos --- content/russian/cs/persistent/persistent-array.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/content/russian/cs/persistent/persistent-array.md b/content/russian/cs/persistent/persistent-array.md index e476c355..018c287a 100644 --- a/content/russian/cs/persistent/persistent-array.md +++ b/content/russian/cs/persistent/persistent-array.md @@ -2,8 +2,9 @@ title: Структуры с откатами weight: 1 authors: -- Сергей Слотин -date: 2021-09-12 + - Сергей Слотин +date: {} +published: true --- Состояние любой структуры как-то лежит в памяти: в каких-то массивах, или в более общем случае, по каким-то определенным адресам в памяти. Для простоты, пусть у нас есть некоторый массив $a$ размера $n$, и нам нужно обрабатывать запросы присвоения и чтения, а также иногда откатывать изменения обратно.
@@ -20,7 +21,7 @@ int a[N]; stack< pair<int, int> > s; void change(int k, int x) { - l.push({k, a[k]}); + s.push({k, a[k]}); a[k] = x; } @@ -84,7 +85,7 @@ void rollback() { ```cpp int t = 0; -vector versions[N]; +vector< pair<int, int> > versions[N]; void change(int k, int x) { versions[k].push_back({t++, x}); From 155891c5ed8502decd64047a21736c6285d8edcd Mon Sep 17 00:00:00 2001 From: zh Wang Date: Fri, 12 Aug 2022 05:36:02 +0800 Subject: [PATCH 162/173] Fix typo --- content/english/hpc/profiling/noise.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/english/hpc/profiling/noise.md b/content/english/hpc/profiling/noise.md index 243f3600..b1b186ae 100644 --- a/content/english/hpc/profiling/noise.md +++ b/content/english/hpc/profiling/noise.md @@ -128,10 +128,10 @@ https://github.com/sosy-lab/benchexec The issues we've described produce *bias* in measurements: they consistently give advantage to one algorithm over the other. There are other types of possible problems with benchmarking that result in either unpredictable skews or just completely random noise, thus increasing *variance*. -These type of issues are caused by side effects and some sort of external noise, mostly due to noisy neighbors and CPU frequency scaling: +These types of issues are caused by side effects and some sort of external noise, mostly due to noisy neighbors and CPU frequency scaling: - If you benchmark a compute-bound algorithm, measure its performance in cycles using `perf stat`: this way it will be independent of clock frequency, fluctuations of which is usually the main source of noise. -- Otherwise, set core frequency to the what you expect it to be and make sure nothing interferes with it. On Linux you can do it with `cpupower` (e.g., `sudo cpupower frequency-set -g powersave` to put it to minimum or `sudo cpupower frequency-set -g ondemand` to enable turbo boost).
I use a [convenient GNOME shell extension](https://extensions.gnome.org/extension/1082/cpufreq/) that has a separate button to do it. +- Otherwise, set core frequency to what you expect it to be and make sure nothing interferes with it. On Linux you can do it with `cpupower` (e.g., `sudo cpupower frequency-set -g powersave` to put it to minimum or `sudo cpupower frequency-set -g ondemand` to enable turbo boost). I use a [convenient GNOME shell extension](https://extensions.gnome.org/extension/1082/cpufreq/) that has a separate button to do it. - If applicable, turn hyper-threading off and attach jobs to specific cores. Make sure no other jobs are running on the system, turn off networking and try not to fiddle with the mouse. You can't remove noises and biases completely. Even a program's name can affect its speed: the executable's name ends up in an environment variable, environment variables end up on the call stack, and so the length of the name affects stack alignment, which can result in data accesses slowing down due to crossing cache line or memory page boundaries. From adcdf626b1408ad0635ace91dc2a1facdad127d1 Mon Sep 17 00:00:00 2001 From: ar1emicus <87391584+ar1emicus@users.noreply.github.com> Date: Tue, 16 Aug 2022 03:52:35 +0500 Subject: [PATCH 163/173] Update sqrt-structures.md --- content/russian/cs/range-queries/sqrt-structures.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/content/russian/cs/range-queries/sqrt-structures.md b/content/russian/cs/range-queries/sqrt-structures.md index bac0da16..8d2cfd6f 100644 --- a/content/russian/cs/range-queries/sqrt-structures.md +++ b/content/russian/cs/range-queries/sqrt-structures.md @@ -1,10 +1,11 @@ --- title: Корневые структуры authors: -- Сергей Слотин -- Иван Сафонов + - Сергей Слотин + - Иван Сафонов weight: 6 -date: 2021-09-13 +date: {} +published: true --- Корневые оптимизации можно использовать много для чего, в частности в контексте структур данных. 
@@ -68,6 +69,7 @@ void upd(int l, int r, int x) { l += c; } else { + b[l / c] += x; a[l] += x; l++; } @@ -111,8 +113,8 @@ vector< vector > blocks; // возвращает индекс блока и индекс элемента внутри блока pair<int, int> find_block(int pos) { int idx = 0; - while (blocks[idx].size() >= pos) - pos -= blocks[idx--].size(); + while (blocks[idx].size() <= pos) + pos -= blocks[idx++].size(); return {idx, pos}; } ``` From b80dafe5a8efe0389b1395919dc1770df7408d9f Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Tue, 16 Aug 2022 07:39:08 +0300 Subject: [PATCH 164/173] code style --- content/russian/cs/range-queries/sqrt-structures.md | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/content/russian/cs/range-queries/sqrt-structures.md b/content/russian/cs/range-queries/sqrt-structures.md index 8d2cfd6f..25fe3b5e 100644 --- a/content/russian/cs/range-queries/sqrt-structures.md +++ b/content/russian/cs/range-queries/sqrt-structures.md @@ -4,8 +4,7 @@ authors: - Сергей Слотин - Иван Сафонов weight: 6 -date: {} -published: true +date: 2022-08-16 --- Корневые оптимизации можно использовать много для чего, в частности в контексте структур данных. @@ -24,16 +23,15 @@ published: true ```c++ // c это и количество блоков, и также их размер; оно должно быть чуть больше корня const int maxn = 1e5, c = 330; -int a[maxn], b[c]; -int add[c]; +int a[maxn], b[c], add[c]; for (int i = 0; i < n; i++) b[i / c] += a[i]; ``` -Заведем также массив `add` размера $\sqrt n$, который будем использовать для отложенной операции прибавления на блоке. Будем считать, что реальное значение $i$-го элемента равно `a[i] + add[i / c]`. +Заведем также массив `add` размера $\sqrt n$, который будем использовать для отложенной операции прибавления на блоке: будем считать, что реальное значение $i$-го элемента равно `a[i] + add[i / c]`.
-Теперь мы можем отвечать на запросы первого типа за $O(\sqrt n)$ на запрос: +Теперь мы можем отвечать на запросы первого типа за $O(\sqrt n)$ операций на запрос: 1. Для всех блоков, лежащих целиком внутри запроса, просто возьмём уже посчитанные суммы и сложим. 2. Для блоков, пересекающихся с запросом только частично (их максимум два — правый и левый), проитерируемся по нужным элементам и поштучно прибавим к ответу. @@ -69,7 +67,7 @@ void upd(int l, int r, int x) { l += c; } else { - b[l / c] += x; + b[l / c] += x; a[l] += x; l++; } From a9e98c13f2373883145b951af55cc881671e1804 Mon Sep 17 00:00:00 2001 From: Vladislav Shirshakov Date: Tue, 16 Aug 2022 18:59:47 +0500 Subject: [PATCH 165/173] =?UTF-8?q?=D0=94=D0=BB=D1=8F=20=D0=BF=D0=B5=D1=80?= =?UTF-8?q?=D0=B5=D0=BC=D0=B5=D0=BD=D0=BD=D0=BE=D0=B9=20=D0=BD=D0=B5=20?= =?UTF-8?q?=D1=83=D0=BA=D0=B0=D0=B7=D0=B0=D0=BD=20=D1=82=D0=B8=D0=BF?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- content/russian/cs/sorting/selection.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/content/russian/cs/sorting/selection.md b/content/russian/cs/sorting/selection.md index b47f2320..30854b5f 100644 --- a/content/russian/cs/sorting/selection.md +++ b/content/russian/cs/sorting/selection.md @@ -1,6 +1,7 @@ --- title: Сортировка выбором weight: 2 +published: true --- Похожим методом является **сортировка выбором** (минимума или максимума). 
@@ -10,7 +11,7 @@ weight: 2 ```cpp void selection_sort(int *a, int n) { for (int k = 0; k < n - 1; k++) - for (j = k + 1; j < n; j++) + for (int j = k + 1; j < n; j++) if (a[k] > a[j]) swap(a[j], a[k]); } From 7fd943e685a0d3ab4c9073cd704bdb25f2455606 Mon Sep 17 00:00:00 2001 From: Sergey Slotin Date: Wed, 17 Aug 2022 09:40:56 +0300 Subject: [PATCH 166/173] improve wording in branchless programming section --- content/english/hpc/pipelining/branchless.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/content/english/hpc/pipelining/branchless.md b/content/english/hpc/pipelining/branchless.md index d7416f35..31bd5a39 100644 --- a/content/english/hpc/pipelining/branchless.md +++ b/content/english/hpc/pipelining/branchless.md @@ -91,7 +91,7 @@ $$ This way you can eliminate branching, but this comes at the cost of evaluating *both* branches and the `cmov` itself. Because evaluating the ">=" branch costs nothing, the performance is exactly equal to [the "always yes" case](../branching/#branch-prediction) in the branchy version. -### When It Is Beneficial +### When Predication Is Beneficial Using predication eliminates [a control hazard](../hazards) but introduces a data hazard. There is still a pipeline stall, but it is a cheaper one: you only need to wait for `cmov` to be resolved and not flush the entire pipeline in case of a mispredict. @@ -180,11 +180,11 @@ int abs(int a) { ### Larger Examples -**Strings.** Oversimplifying things, an `std::string` is comprised of a pointer to a null-terminated char array (also known as "C-string") allocated somewhere on the heap and one integer containing the string size. +**Strings.** Oversimplifying things, an `std::string` is comprised of a pointer to a null-terminated `char` array (also known as a "C-string") allocated somewhere on the heap and one integer containing the string size. -A common value for strings is the empty string — which is also its default value. 
You also need to handle them somehow, and the idiomatic thing to do is to assign `nullptr` as the pointer and `0` as the string size, and then check if the pointer is null or if the size is zero at the beginning of every procedure involving strings. +A common value for a string is the empty string — which is also its default value. You also need to handle them somehow, and the idiomatic approach is to assign `nullptr` as the pointer and `0` as the string size, and then check if the pointer is null or if the size is zero at the beginning of every procedure involving strings. -However, this requires a separate branch, which is costly unless most strings are empty. What we can do to get rid of it is to allocate a "zero C-string," which is just a zero byte allocated somewhere, and then simply point all empty strings there. Now all string operations with empty strings have to read this useless zero byte, but this is still much cheaper than a branch misprediction. +However, this requires a separate branch, which is costly (unless the majority of strings are either empty or non-empty). To remove the check and thus also the branch, we can allocate a "zero C-string," which is just a zero byte allocated somewhere, and then simply point all empty strings there. Now all string operations with empty strings have to read this useless zero byte, but this is still much cheaper than a branch misprediction. **Binary search.** The standard binary search [can be implemented](/hpc/data-structures/binary-search) without branches, and on small arrays (that fit into cache) it works ~4x faster than the branchy `std::lower_bound`: @@ -193,10 +193,10 @@ int lower_bound(int x) { int *base = t, len = n; while (len > 1) { int half = len / 2; - base = (base[half] < x ? 
&base[half] : base); + base += (base[half - 1] < x) * half; // will be replaced with a "cmov" len -= half; } - return *(base + (*base < x)); + return *base; } ``` @@ -218,7 +218,7 @@ That there are no substantial reasons why compilers can't do this on their own, **Data-parallel programming.** Branchless programming is very important for [SIMD](/hpc/simd) applications because they don't have branching in the first place. -In our array sum example, if you remove the `volatile` type qualifier from the accumulator, the compiler becomes able to [vectorize](/hpc/simd/auto-vectorization) the loop: +In our array sum example, removing the `volatile` type qualifier from the accumulator allows the compiler to [vectorize](/hpc/simd/auto-vectorization) the loop: ```c++ /* volatile */ int s = 0; @@ -230,7 +230,7 @@ for (int i = 0; i < N; i++) It now works in ~0.3 per element, which is mainly [bottlenecked by the memory](/hpc/cpu-cache/bandwidth). -The compiler is usually able to vectorize any loop that doesn't have branches or dependencies between the iterations — and some specific deviations from that, such as [reductions](/hpc/simd/reduction) or simple loops that contain just one if-without-else. Vectorization of anything more complex is a very nontrivial problem, which may involve various techniques such as [masking](/hpc/simd/masking) and [in-register permutations](/hpc/simd/shuffling). +The compiler is usually able to vectorize any loop that doesn't have branches or dependencies between the iterations — and some specific small deviations from that, such as [reductions](/hpc/simd/reduction) or simple loops that contain just one if-without-else. Vectorization of anything more complex is a very nontrivial problem, which may involve various techniques such as [masking](/hpc/simd/masking) and [in-register permutations](/hpc/simd/shuffling). -When you fetch anything from memory, there is always some latency before the data arrives. 
Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through a complex system of address translation units and caching layers designed to both help in memory management and reduce the latency. +When you fetch anything from memory, there is always some latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through a complex system of address translation units and caching layers designed to both help in memory management and reduce latency. Therefore, the only correct answer to this question is "it depends" — primarily on where the operands are stored: @@ -27,7 +27,7 @@ Therefore, the only correct answer to this question is "it depends" — primaril - If it was accessed recently, it is probably *cached* and will take less than that to fetch, depending on how long ago it was accessed — it could be ~50 cycles for the slowest layer of cache and around 4-5 cycles for the fastest. - But it could also be stored on some type of *external memory* such as a hard drive, and in this case, it will take around 5ms, or roughly $10^7$ cycles (!) to access it. -Such high variance of memory performance is caused by the fact that memory hardware doesn't follow the same [laws of silicon scaling](/hpc/complexity/hardware) as CPU chips do. Memory is still improving through other means, but if 50 years ago memory timings were roughly on the same scale with the instruction latencies, nowadays they lag far behind. +Such a high variance of memory performance is caused by the fact that memory hardware doesn't follow the same [laws of silicon scaling](/hpc/complexity/hardware) as CPU chips do. Memory is still improving through other means, but if 50 years ago memory timings were roughly on the same scale with the instruction latencies, nowadays they lag far behind. 
![](img/memory-vs-compute.png) diff --git a/content/english/hpc/external-memory/hierarchy.md b/content/english/hpc/external-memory/hierarchy.md index da1f5bb6..26dfc144 100644 --- a/content/english/hpc/external-memory/hierarchy.md +++ b/content/english/hpc/external-memory/hierarchy.md @@ -58,7 +58,7 @@ There are other caches inside CPUs that are used for something other than data. ### Non-Volatile Memory -While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data to persist for prolonged periods of time without power but comes at the cost of performance and durability — because when you have more electrons, you also have more opportunities for them colliding with silicon atoms. +While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data persist for prolonged periods of time without power but comes at the cost of performance and durability — because when you have more electrons, you also have more opportunities for them to collide with silicon atoms. diff --git a/content/english/hpc/external-memory/model.md b/content/english/hpc/external-memory/model.md index 35cba4ea..9ab86eba 100644 --- a/content/english/hpc/external-memory/model.md +++ b/content/english/hpc/external-memory/model.md @@ -18,7 +18,7 @@ Similar in spirit, in the *external memory model*, we simply ignore every operat In this model, we measure the performance of an algorithm in terms of its high-level *I/O operations*, or *IOPS* — that is, the total number of blocks read or written to external memory during execution. 
-We will mostly focus on the case where the internal memory is RAM and external memory is SSD or HDD, although the underlying analysis techniques that we will develop are applicable to any layer in the cache hierarchy. Under these settings, reasonable block size $B$ is about 1MB, internal memory size $M$ is usually a few gigabytes, and $N$ is up to a few terabytes. +We will mostly focus on the case where the internal memory is RAM and the external memory is SSD or HDD, although the underlying analysis techniques that we will develop are applicable to any layer in the cache hierarchy. Under these settings, reasonable block size $B$ is about 1MB, internal memory size $M$ is usually a few gigabytes, and $N$ is up to a few terabytes. ### Array Scan diff --git a/content/english/hpc/number-theory/modular.md b/content/english/hpc/number-theory/modular.md index 47310780..3d05e2f9 100644 --- a/content/english/hpc/number-theory/modular.md +++ b/content/english/hpc/number-theory/modular.md @@ -100,7 +100,7 @@ $$ $$ \begin{aligned} a^p &= (\underbrace{1+1+\ldots+1+1}_\text{$a$ times})^p & -\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by defenition)} +\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by definition)} \\\ &= \sum_{x_1+x_2+\ldots+x_a = p} \frac{p!}{x_1! x_2! 
\ldots x_a!} & \text{(which terms will not be divisible by $p$?)} \\\ &\equiv P(p, 0, \ldots, 0) + \ldots + P(0, 0, \ldots, p) & \text{(everything else will be canceled)} \\\ &= a From 88ed77156863353ad37e486195ce0ba3ef682afb Mon Sep 17 00:00:00 2001 From: trasua Date: Mon, 5 Sep 2022 15:59:32 +0700 Subject: [PATCH 171/173] fix typo --- content/english/hpc/data-structures/binary-search.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md index 85f9ef52..6426ddde 100644 --- a/content/english/hpc/data-structures/binary-search.md +++ b/content/english/hpc/data-structures/binary-search.md @@ -1,6 +1,7 @@ --- title: Binary Search weight: 1 +published: true --- @@ -184,7 +185,7 @@ int lower_bound(int x) { Note that this loop is not always equivalent to the standard binary search. Since it always rounds *up* the size of the search interval, it accesses slightly different elements and may perform one comparison more than needed. Apart from simplifying computations on each iteration, it also makes the number of iterations constant if the array size is constant, removing branch mispredictions completely. -As typical for predication, this trick is very fragile to compiler optimizations — depending on the compiler and how the funciton is invoked, it may still leave a branch or generate suboptimal code. It works fine on Clang 10, yielding a 2.5-3x improvement on small arrays: +As typical for predication, this trick is very fragile to compiler optimizations — depending on the compiler and how the function is invoked, it may still leave a branch or generate suboptimal code. 
It works fine on Clang 10, yielding a 2.5-3x improvement on small arrays: From 4e00ee7cc5769cd650be6d215e0efacf2f14de51 Mon Sep 17 00:00:00 2001 From: novikov-vladimir <99834014+novikov-vladimir@users.noreply.github.com> Date: Fri, 11 Nov 2022 19:57:50 +0300 Subject: [PATCH 172/173] =?UTF-8?q?=D0=90=D0=B2=D1=82=D0=BE=D0=BC=D0=B0?= =?UTF-8?q?=D1=82=D0=BD=D1=8B=D0=B9=20=D0=BF=D0=B5=D1=80=D0=B5=D1=85=D0=BE?= =?UTF-8?q?=D0=B4=20=D0=B4=D0=BE=D0=BB=D0=B6=D0=B5=D0=BD=20=D0=B2=D0=B5?= =?UTF-8?q?=D1=81=D1=82=D0=B8=20=D0=B2=20=D0=B2=D0=B5=D1=80=D1=88=D0=B8?= =?UTF-8?q?=D0=BD=D1=83,=20=D1=81=D0=BE=D0=BE=D1=82=D0=B2=D0=B5=D1=82?= =?UTF-8?q?=D1=81=D1=82=D0=B2=D1=83=D1=8E=D1=89=D1=83=D1=8E=20=D0=BC=D0=B0?= =?UTF-8?q?=D0=BA=D1=81=D0=B8=D0=BC=D0=B0=D0=BB=D1=8C=D0=BD=D0=BE=D0=BC?= =?UTF-8?q?=D1=83=20=D0=BF=D1=80=D0=B8=D0=BD=D0=B8=D0=BC=D0=B0=D0=B5=D0=BC?= =?UTF-8?q?=D0=BE=D0=BC=D1=83=20=D0=B1=D0=BE=D1=80=D0=BE=D0=BC=20=D1=81?= =?UTF-8?q?=D1=83=D1=84=D1=84=D0=B8=D0=BA=D1=81=D1=83=20(=D0=BD=D0=B5=20?= =?UTF-8?q?=D0=BC=D0=B8=D0=BD=D0=B8=D0=BC=D0=B0=D0=BB=D1=8C=D0=BD=D0=BE?= =?UTF-8?q?=D0=BC=D1=83).?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- content/russian/cs/string-structures/aho-corasick.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/content/russian/cs/string-structures/aho-corasick.md b/content/russian/cs/string-structures/aho-corasick.md index 369f5171..2ca1da65 100644 --- a/content/russian/cs/string-structures/aho-corasick.md +++ b/content/russian/cs/string-structures/aho-corasick.md @@ -1,10 +1,11 @@ --- title: Алгоритм Ахо-Корасик authors: -- Сергей Слотин + - Сергей Слотин weight: 2 prerequisites: -- trie + - trie +published: true --- Представим, что мы работаем журналистами в некотором авторитарном государстве, контролирующем СМИ, и в котором время от времени издаются законы, запрещающие упоминать определенные политические события или использовать определенные слова. 
Как эффективно реализовать подобную цензуру программно? @@ -36,7 +37,7 @@ prerequisites: **Определение.** *Суффиксная ссылка* $l(v)$ ведёт в вершину $u \neq v$, которая соответствует наидлиннейшему принимаемому бором суффиксу $v$. -**Определение.** *Автоматный переход* $\delta(v, c)$ ведёт в вершину, соответствующую минимальному принимаемому бором суффиксу строки $v + c$. +**Определение.** *Автоматный переход* $\delta(v, c)$ ведёт в вершину, соответствующую максимальному принимаемому бором суффиксу строки $v + c$. **Наблюдение.** Если переход и так существует в боре (будем называть такой переход *прямым*), то автоматный переход будет вести туда же. From 0fa54119101693a9670972a3c27657d2ee1c59d1 Mon Sep 17 00:00:00 2001 From: DavideGianessi <118054693+DavideGianessi@users.noreply.github.com> Date: Sat, 12 Nov 2022 12:49:11 +0100 Subject: [PATCH 173/173] typo --- content/english/hpc/number-theory/montgomery.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/content/english/hpc/number-theory/montgomery.md b/content/english/hpc/number-theory/montgomery.md index 669e39ba..0eeef0b0 100644 --- a/content/english/hpc/number-theory/montgomery.md +++ b/content/english/hpc/number-theory/montgomery.md @@ -1,6 +1,7 @@ --- title: Montgomery Multiplication weight: 4 +published: true --- Unsurprisingly, a large fraction of computation in [modular arithmetic](../modular) is often spent on calculating the modulo operation, which is as slow as [general integer division](/hpc/arithmetic/division/) and typically takes 15-20 cycles, depending on the operand size. @@ -287,6 +288,6 @@ int inverse(int _a) { } ``` -While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158s we omit `transform` and `reduce` (a reasonable use case is for `inverse` to be used as a subprocedure in a bigger modular computation). 
This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types. +While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158ns if we omit `transform` and `reduce` (a reasonable use case is for `inverse` to be used as a subprocedure in a bigger modular computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types. **Exercise.** Implement efficient *modular* [matix multiplication](/hpc/algorithms/matmul).
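
For readers who only see this hunk, here is a minimal sketch of what an `inverse` built on Montgomery multiplication might look like. It is not the code from the patched file: the struct layout, the two-argument `inverse` signature, and the restriction to an odd prime modulus below $2^{31}$ are assumptions made for illustration. The names `transform` and `reduce` follow the text above.

```cpp
#include <cassert>
#include <cstdint>

typedef uint32_t u32;
typedef uint64_t u64;

// Assumption of this sketch: mod is an odd prime below 2^31,
// so the intermediate sum in reduce() fits into 32 bits.
struct Montgomery {
    u32 n;   // the modulus
    u32 nr;  // inverse of n modulo 2^32

    explicit Montgomery(u32 mod) : n(mod), nr(1) {
        // Newton's iteration: each step doubles the number of correct low bits,
        // so 5 steps take us from 1 correct bit to 32
        for (int i = 0; i < 5; i++)
            nr *= 2 - n * nr;
    }

    // returns x * 2^(-32) mod n in [0, n), assuming x < n * 2^32
    u32 reduce(u64 x) const {
        u32 q = u32(x) * nr;           // q * n has the same low 32 bits as x
        u32 m = (u64(q) * n) >> 32;    // high half of q * n, always below n
        u32 t = u32(x >> 32) + n - m;  // (x - q * n) / 2^32 + n, in [1, 2n - 1]
        return t >= n ? t - n : t;
    }

    // product of two numbers that are already in the Montgomery space
    u32 multiply(u32 x, u32 y) const {
        return reduce(u64(x) * y);
    }

    // moves x into the Montgomery space: x * 2^32 mod n
    u32 transform(u32 x) const {
        return (u64(x) << 32) % n;
    }
};

// modular inverse as a^(n - 2) mod n (Fermat's little theorem, n prime)
u32 inverse(const Montgomery &space, u32 a) {
    u32 base = space.transform(a % space.n);
    u32 result = space.transform(1);
    for (u32 e = space.n - 2; e > 0; e >>= 1) {
        if (e & 1)
            result = space.multiply(result, base);
        base = space.multiply(base, base);
    }
    return space.reduce(result);  // back to the ordinary representation
}
```

Calling `inverse(Montgomery(1000000007), a)` returns the ordinary (non-Montgomery) inverse of `a`. When the result feeds into further modular computation, the leading `transform` and trailing `reduce` can be dropped, which is the case the timing note above refers to.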