From 9c17582faf9acf067e7ee384434d6c6f83ecc6ed Mon Sep 17 00:00:00 2001
From: AlexXan312 <62149707+AlexXan312@users.noreply.github.com>
Date: Wed, 2 Feb 2022 15:42:37 +0300
Subject: [PATCH 001/173] fix dp divide and conquer

---
 content/russian/cs/layer-optimizations/divide-and-conquer.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
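
Reviewer note (ignored by `git am`, not part of the commit): a minimal sketch of how the corrected inequality opt[i, j] <= opt[i+1, j] is typically used in the divide-and-conquer optimization. The cost model (sum of squared segment lengths), the name `min_cost_cover`, and the `prev`/`cur` layout are assumptions made for illustration; they are not taken from the article's code.

```python
def min_cost_cover(xs, m):
    """Cover all points with at most m segments, minimizing the sum of squared lengths."""
    xs = sorted(xs)
    n = len(xs)
    INF = float("inf")
    # prev[i]: best cost for the first i points using the layers processed so far
    prev = [0] + [INF] * n

    for _ in range(m):
        cur = [0] + [INF] * n

        def solve(lo, hi, opt_lo, opt_hi):
            # Fill cur[lo..hi], knowing the optimal split lies in [opt_lo, opt_hi]
            if lo > hi:
                return
            mid = (lo + hi) // 2
            best, best_k = INF, opt_lo
            for k in range(opt_lo, min(mid - 1, opt_hi) + 1):
                # the last segment covers points k..mid-1 (0-based)
                cost = prev[k] + (xs[mid - 1] - xs[k]) ** 2
                if cost < best:
                    best, best_k = cost, k
            cur[mid] = best
            # opt is non-decreasing in i, so each half searches a narrower range
            solve(lo, mid - 1, opt_lo, best_k)
            solve(mid + 1, hi, best_k, opt_hi)

        solve(1, n, 0, n - 1)
        prev = cur

    return prev[n]
```

Each layer costs O(n log n) here instead of O(n^2), which is only valid because the optimal split point never moves left as i grows.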
diff --git a/content/russian/cs/layer-optimizations/divide-and-conquer.md b/content/russian/cs/layer-optimizations/divide-and-conquer.md
index a7731f49..61a7304a 100644
--- a/content/russian/cs/layer-optimizations/divide-and-conquer.md
+++ b/content/russian/cs/layer-optimizations/divide-and-conquer.md
@@ -19,10 +19,10 @@ $$
 Конкретно в задаче покрытия точек отрезками, можно заметить следующее:
 
 $$
-opt[i, j] \leq opt[i, j+1]
+opt[i, j] \leq opt[i+1, j]
 $$
 
-Интуиция такая: если у нас появился дополнительный отрезок, то последний отрезок нам не выгодно делать больше, а скорее наоборот его нужно «сжать».
+Интуиция такая: когда мы сдвигаем i вправо, то точка, с которой может начинаться последняя группа, не может уменьшаться.
 
 ### Идея
 

From 91108ace5d37d6730480dc2624cc1fdd64d26361 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 31 Mar 2022 17:46:29 +0300
Subject: [PATCH 002/173] note about filtering performance

---
 content/english/hpc/simd/shuffling.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
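
Reviewer note (ignored by `git am`, not part of the commit): a scalar model of what the "compress" operation mentioned in this hunk does. It is only a sketch of the semantics, not the intrinsic itself; the names `compress` and `filter_blocks` and the block width of 8 are assumptions for illustration.

```python
def compress(block, mask):
    # Keep the elements whose mask bit is set, packed contiguously
    return [x for i, x in enumerate(block) if (mask >> i) & 1]

def filter_blocks(a, pred, width=8):
    # Process the array in fixed-size blocks, as a vectorized loop would
    out = []
    for start in range(0, len(a), width):
        block = a[start:start + width]
        mask = 0
        for i, x in enumerate(block):
            if pred(x):
                mask |= 1 << i  # one mask bit per element of the block
        out.extend(compress(block, mask))
    return out
```

Without a dedicated instruction, the packing step has to be emulated with a `movemask`, a permutation lookup and a store, which is where the port pressure described in the new paragraph comes from.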

diff --git a/content/english/hpc/simd/shuffling.md b/content/english/hpc/simd/shuffling.md
index f2a2cd15..5774b1fd 100644
--- a/content/english/hpc/simd/shuffling.md
+++ b/content/english/hpc/simd/shuffling.md
@@ -225,7 +225,9 @@ The vectorized version takes some work to implement, but it is 6-7x faster than
 
 ![](../img/filter.svg)
 
-This operation is considerably faster on AVX-512: it has a special "[compress](_mm512_mask_compress_epi32)" instruction that takes a vector of data and a mask and writes its unmasked elements contiguously. It makes a huge difference in algorithms that rely on various filtering subroutines.
+The loop performance is relatively low — taking 4 CPU cycles per iteration —  because, on this particular CPU (Zen 2), `movemask`, `permute`, and `store` have low throughput and all have to go through the same execution port (P2). On most other platforms, you can expect it to be ~2x faster.
+
+Filtering can also be implemented considerably faster on AVX-512: it has a special "[compress](_mm512_mask_compress_epi32)" instruction that takes a vector of data and a mask and writes its unmasked elements contiguously. It makes a huge difference in algorithms that rely on various filtering subroutines, such as quicksort.
 
 <!--
 

From f6af7ad3299cdb32f52dc12b9bd1828d8b7aa990 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 31 Mar 2022 17:51:42 +0300
Subject: [PATCH 003/173] update copyright year

---
 themes/algorithmica/i18n/en.toml | 2 +-
 themes/algorithmica/i18n/ru.toml | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/themes/algorithmica/i18n/en.toml b/themes/algorithmica/i18n/en.toml
index d58a7924..9aae4777 100644
--- a/themes/algorithmica/i18n/en.toml
+++ b/themes/algorithmica/i18n/en.toml
@@ -22,7 +22,7 @@ other = "prerequisites"
 other = "translations"
 
 [copyright1]
-other = "Copyright 2021 Sergey Slotin"
+other = "Copyright 2021–2022 Sergey Slotin"
 
 [copyright2]
 other = " " # Content is distributed under <a href='https://tldrlegal.com/license/creative-commons-attribution-noncommercial-4.0-international-(cc-by-nc-4.0)'>CC BY-NC</a>
diff --git a/themes/algorithmica/i18n/ru.toml b/themes/algorithmica/i18n/ru.toml
index a25a0c27..5e96226c 100644
--- a/themes/algorithmica/i18n/ru.toml
+++ b/themes/algorithmica/i18n/ru.toml
@@ -28,7 +28,7 @@ other = "пререквизиты"
 other = "переводы"
 
 [copyright1]
-other = "Copyleft 2017–2021 Тинькофф Образование" # {{ .Count / . }}
+other = "Copyleft 2017–2022 Algorithmica.org" # {{ .Count / . }}
 
 [copyright2]
 other = "Материалы распространяются под <a href='https://tldrlegal.com/license/creative-commons-attribution-sharealike-4.0-international-(cc-by-sa-4.0)'>CC BY-SA</a>"

From 137cae87d4bc3418ca10aa47c2f471f784ca500e Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 31 Mar 2022 17:57:32 +0300
Subject: [PATCH 004/173] update table of contents

---
 content/english/hpc/_index.md | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md
index 5bb1fe60..5ccf7821 100644
--- a/content/english/hpc/_index.md
+++ b/content/english/hpc/_index.md
@@ -178,20 +178,22 @@ Planned table of contents:
  11.7. Number-Theoretic Transform
  11.8. Argmin with SIMD
  11.9. Prefix Sum with SIMD
- 11.10. Reading and Writing Integers
-(11.11. Reading and Writing Floats)
-(11.12. String Searching)
- 11.13. Sorting
- 11.14. Matrix Multiplication
+ 11.10. Reading Decimal Integers
+ 11.11. Writing Decimal Integers
+(11.12. Reading and Writing Floats)
+(11.13. String Searching)
+ 11.14. Sorting
+ 11.15. Matrix Multiplication
 12. Data Structure Case Studies
  12.1. Binary Search
  12.2. Static B-Trees
- 12.3. Segment Trees
-(12.4. Search Trees)
-(12.5. Range Minimum Query)
- 12.6. Hash Tables
-(12.7. Bitmaps)
-(12.8. Probabilistic Filters)
+(12.3. Search Trees)
+ 12.4. Segment Trees
+(12.5. Tries)
+(12.6. Range Minimum Query)
+ 12.7. Hash Tables
+(12.8. Bitmaps)
+(12.9. Probabilistic Filters)
 ```
 
 Among the cool things that we will speed up:
@@ -201,12 +203,13 @@ Among the cool things that we will speed up:
 - 5-10x faster segment trees (compared to Fenwick trees)
 - 5x faster hash tables (compared to `std::unordered_map`)
 - 2x faster popcount (compared to repeatedly calling `popcnt`)
-- 2x faster parsing series of integers (compared to `scanf`)
+- 35x faster parsing series of integers (compared to `scanf`)
 - ?x faster sorting (compared to `std::sort`)
 - 2x faster sum (compared to `std::accumulate`)
 - 2-3x faster prefix sum (compared to naive implementation)
 - 10x faster argmin (compared to naive implementation)
 - 10x faster array searching (compared to `std::find`)
+- 15x faster search tree (compared to `std::set`)
 - 100x faster matrix multiplication (compared to "for-for-for")
 - optimal word-size integer factorization (~0.4ms per 60-bit integer)
 - optimal Karatsuba Algorithm
@@ -237,8 +240,10 @@ This work is largely based on blog posts, research papers, conference talks and
 - [Matt Kulukundis](https://twitter.com/JuvHarlequinKFM)
 - [Georg Sauthoff](https://gms.tf/)
 - [Marshall Lochbaum](https://mlochbaum.github.io/publications.html)
+- [Pavel Zemtsov](https://pzemtsov.github.io/)
 - [Nayuki](https://www.nayuki.io/category/programming)
 - [ridiculous_fish](https://ridiculousfish.com/blog/)
+- [Z boson](https://stackoverflow.com/users/2542702/z-boson)
 - [Creel](https://www.youtube.com/c/WhatsACreel)
 
 Volume: 450-600 pages  

From 79320d3a6de01237ee1cd13f8373c8d70cfd0a14 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 31 Mar 2022 18:12:24 +0300
Subject: [PATCH 005/173] change wording

---
 content/english/hpc/_index.md         | 1 +
 content/english/hpc/simd/shuffling.md | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md
index 5ccf7821..92d0cd91 100644
--- a/content/english/hpc/_index.md
+++ b/content/english/hpc/_index.md
@@ -242,6 +242,7 @@ This work is largely based on blog posts, research papers, conference talks and
 - [Marshall Lochbaum](https://mlochbaum.github.io/publications.html)
 - [Pavel Zemtsov](https://pzemtsov.github.io/)
 - [Nayuki](https://www.nayuki.io/category/programming)
+- [InstLatX64](https://twitter.com/InstLatX64)
 - [ridiculous_fish](https://ridiculousfish.com/blog/)
 - [Z boson](https://stackoverflow.com/users/2542702/z-boson)
 - [Creel](https://www.youtube.com/c/WhatsACreel)
diff --git a/content/english/hpc/simd/shuffling.md b/content/english/hpc/simd/shuffling.md
index 5774b1fd..b7e13ba1 100644
--- a/content/english/hpc/simd/shuffling.md
+++ b/content/english/hpc/simd/shuffling.md
@@ -225,7 +225,7 @@ The vectorized version takes some work to implement, but it is 6-7x faster than
 
 ![](../img/filter.svg)
 
-The loop performance is relatively low — taking 4 CPU cycles per iteration —  because, on this particular CPU (Zen 2), `movemask`, `permute`, and `store` have low throughput and all have to go through the same execution port (P2). On most other platforms, you can expect it to be ~2x faster.
+The loop performance is still relatively low — taking 4 CPU cycles per iteration —  because, on this particular CPU (Zen 2), `movemask`, `permute`, and `store` have low throughput and all have to go through the same execution port (P2). On most other x86 CPUs, you can expect it to be ~2x faster.
 
 Filtering can also be implemented considerably faster on AVX-512: it has a special "[compress](_mm512_mask_compress_epi32)" instruction that takes a vector of data and a mask and writes its unmasked elements contiguously. It makes a huge difference in algorithms that rely on various filtering subroutines, such as quicksort.
 

From 190a5bc11ad3e60212e721d26ac93ef151626afb Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Sat, 2 Apr 2022 23:31:54 +0300
Subject: [PATCH 006/173] reading integers outline

---
 content/english/hpc/algorithms/parsing.md     |  5 --
 .../hpc/algorithms/reading-integers.md        | 59 +++++++++++++++++++
 2 files changed, 59 insertions(+), 5 deletions(-)
 delete mode 100644 content/english/hpc/algorithms/parsing.md
 create mode 100644 content/english/hpc/algorithms/reading-integers.md
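
Reviewer note (ignored by `git am`, not part of the commit): the outline below does not yet show a baseline, so here is a minimal sketch of the scalar, buffered approach that the "Getchar" and "Buffering" sections presumably lead up to. It is an assumption about the baseline, not the author's SIMD algorithm; the name `parse_uints` is hypothetical, and signs and overflow are deliberately ignored.

```python
def parse_uints(buf):
    """Parse every run of ASCII digits in a bytes buffer as an unsigned integer."""
    out = []
    value = 0
    in_number = False
    for byte in buf:
        digit = byte - 48  # ord("0") == 48
        if 0 <= digit <= 9:
            value = value * 10 + digit
            in_number = True
        elif in_number:
            # any non-digit byte terminates the current number
            out.append(value)
            value = 0
            in_number = False
    if in_number:
        out.append(value)
    return out
```

The loop-carried dependency on `value` is what limits this version; the SIMD and ILP sections of the outline are about breaking it.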

diff --git a/content/english/hpc/algorithms/parsing.md b/content/english/hpc/algorithms/parsing.md
deleted file mode 100644
index c189e66a..00000000
--- a/content/english/hpc/algorithms/parsing.md
+++ /dev/null
@@ -1,5 +0,0 @@
----
-title: Parsing with SIMD
-weight: 5
-draft: true
----
diff --git a/content/english/hpc/algorithms/reading-integers.md b/content/english/hpc/algorithms/reading-integers.md
new file mode 100644
index 00000000..de9da4e9
--- /dev/null
+++ b/content/english/hpc/algorithms/reading-integers.md
@@ -0,0 +1,59 @@
+---
+title: Reading Decimal Integers
+weight: 10
+draft: true
+---
+
+I wrote a new integer parsing algorithm that is ~35x faster than scanf.
+
+(No, this is not an April Fools' joke — although it does sound ridiculous.)
+
+Zen 2 @ 2GHz. The compiler is Clang 13.
+
+Ridiculous.
+
+### Iostream
+
+### Scanf
+
+### Synchronization
+
+### Getchar
+
+### Buffering
+
+### SIMD
+
+http://0x80.pl/notesen/2014-10-12-parsing-decimal-numbers-part-1-swar.html
+
+
+### Serial
+
+### Transpose-based approach
+
+### Instruction-level parallelism
+
+
+### Modifications
+
+ILP benefits would not be that huge.
+
+One huge asterisk. We get the integers, and we can even do other parsing algorithms on them.
+
+1.75 cycles per byte. 
+
+AVX-512 both due to larger SIMD lane size and dedicated operations for filtering.
+
+It accounts for ~2% of all time, but it can be optimized by using special procedures. Pad buffer with any digits.
+
+### Future work
+
+Next time, we will be *writing* integers.
+
+You can also build a string searching algorithm by computing hashes as in the Rabin-Karp algorithm, although it does not seem to be possible to make an *exact* algorithm that way.
+
+## Acknowledgements
+
+http://0x80.pl/articles/simd-parsing-int-sequences.html
+
+https://stackoverflow.com/questions/25622745/transpose-an-8x8-float-using-avx-avx2/25627536#25627536

From 7edf7f627eff465a53497af39e20d3efa26fbfa6 Mon Sep 17 00:00:00 2001
From: Marco <cognetta.marco@gmail.com>
Date: Sat, 2 Apr 2022 21:55:15 -0700
Subject: [PATCH 007/173] Remove erroneous sentence fragment

Looks like a case of accidental copy paste.
---
 content/english/hpc/simd/intrinsics.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/simd/intrinsics.md b/content/english/hpc/simd/intrinsics.md
index f10f7b9d..e091ddb6 100644
--- a/content/english/hpc/simd/intrinsics.md
+++ b/content/english/hpc/simd/intrinsics.md
@@ -141,7 +141,7 @@ For example, the group of `extract` intrinsics that are used to get individual e
 
 ### GCC Vector Extensions
 
-If you feel like the design of C intrinsics is terrible, you are not alone. are all generated by cats walking on keyboards. I've spent hundreds of hours writing SIMD code and reading the Intel Intrinsics Guide, and I still can't remember whether I need to type `_mm256` or `__m256`.
+If you feel like the design of C intrinsics is terrible, you are not alone. I've spent hundreds of hours writing SIMD code and reading the Intel Intrinsics Guide, and I still can't remember whether I need to type `_mm256` or `__m256`.
 
 Intrinsics are not only hard to use but also neither portable nor maintainable. In good software, you don't want to maintain different procedures for each CPU: you want to implement it just once, in an architecture-agnostic way.
 

From 73c7e7a16dce3695b46e452fc81816c548ed7333 Mon Sep 17 00:00:00 2001
From: Marco <cognetta.marco@gmail.com>
Date: Mon, 4 Apr 2022 07:24:13 -0500
Subject: [PATCH 008/173] Replace Russian text in English version

---
 content/english/hpc/number-theory/inverse.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
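
Reviewer note (ignored by `git am`, not part of the commit): a small worked check of the 8 / 2 example in the corrected sentence. The sketch assumes a prime modulus and uses Fermat's little theorem to get the inverse; the name `mod_div` is hypothetical and this is not necessarily how the article implements it.

```python
def mod_div(a, b, m):
    """Compute a / b modulo a prime m, assuming b is not divisible by m."""
    inv = pow(b, m - 2, m)  # b^(m-2) is the inverse of b modulo a prime m
    return (a % m) * inv % m
```

Modulo 5, the inverse of 2 is 3 (since 2 * 3 = 6 = 1 mod 5), so the residues 3 and 2 give 3 * 3 mod 5 = 4, which matches 8 / 2 = 4 even though 3 / 2 is not an integer.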

diff --git a/content/english/hpc/number-theory/inverse.md b/content/english/hpc/number-theory/inverse.md
index ccbb14ea..aec428fe 100644
--- a/content/english/hpc/number-theory/inverse.md
+++ b/content/english/hpc/number-theory/inverse.md
@@ -109,7 +109,7 @@ This helps if `n` or `mod` is a constant.
 
 ### Modular Division
 
-"Normal" operations also apply to residues: +, -, *. But there is an issue with division, because we can't just bluntly divide two numbers: $\frac{8}{2} = 4$, но $\frac{8 \\% 5 = 3}{2 \\% 5 = 2} \neq 4$.
+"Normal" operations also apply to residues: +, -, *. But there is an issue with division, because we can't just bluntly divide two numbers: $\frac{8}{2} = 4$, but $\frac{8 \\% 5 = 3}{2 \\% 5 = 2} \neq 4$.
 
 To perform division, we need to find an element that will behave itself like the reciprocal $\frac{1}{a} = a^{-1}$, and instead of "division" multiply by it. This element is called a *modular inverse*.
 

From b354e447de990a32cd68ca0ffa17994d11017cdd Mon Sep 17 00:00:00 2001
From: Marco <cognetta.marco@gmail.com>
Date: Mon, 4 Apr 2022 10:28:38 -0500
Subject: [PATCH 009/173] Fix typo

---
 content/english/hpc/simd/masking.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
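
Reviewer note (ignored by `git am`, not part of the commit): a lane-wise model of why `and` can stand in for `blend` in this loop. It assumes each mask lane is either all ones or all zeros and that the other blend operand is a zero vector; the helper names are hypothetical.

```python
ALL_ONES = 0xFFFFFFFF  # a 32-bit lane with every bit set

def blend(a, b, mask):
    # Lane-wise select: take b[i] where the mask lane is set, a[i] otherwise
    return [y if m else x for x, y, m in zip(a, b, mask)]

def and_mask(b, mask):
    # Bitwise AND keeps a lane under an all-ones mask and zeroes it otherwise
    return [y & m for y, m in zip(b, mask)]

values = [5, 7, 9, 11]
mask = [ALL_ONES, 0, ALL_ONES, 0]
zeros = [0] * len(values)
```

With these inputs both `blend(zeros, values, mask)` and `and_mask(values, mask)` keep lanes 0 and 2 and zero out the rest, so the cheaper instruction gives the same result.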

diff --git a/content/english/hpc/simd/masking.md b/content/english/hpc/simd/masking.md
index 332597c1..dbe71575 100644
--- a/content/english/hpc/simd/masking.md
+++ b/content/english/hpc/simd/masking.md
@@ -67,7 +67,7 @@ for (int i = 0; i < N; i += 8) {
 }
 ```
 
-This loop performs slightly faster because on this particular CPU, the vector `and` take one cycle less than `blend`.
+This loop performs slightly faster because on this particular CPU, the vector `and` takes one cycle less than `blend`.
 
 Several other instructions support masks as inputs, most notably:
 

From aedaf297a8c30b55b2ae427875095dd33ddaac43 Mon Sep 17 00:00:00 2001
From: Seyoung Lee <nickte89@gmail.com>
Date: Mon, 4 Apr 2022 20:06:14 +0100
Subject: [PATCH 010/173] fix typo in architecture/functions.md

Hello! It's a really small typo, but I just wanted to contribute :)
---
 content/english/hpc/architecture/functions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/architecture/functions.md b/content/english/hpc/architecture/functions.md
index 24ab9898..f7a74cc6 100644
--- a/content/english/hpc/architecture/functions.md
+++ b/content/english/hpc/architecture/functions.md
@@ -230,7 +230,7 @@ Equivalent assembly:
 ```nasm
 ; n = edi, ret = eax
 factorial:
-    test edi, edi   ; test if a value if zero
+    test edi, edi   ; test if a value is zero
     jne  nonzero    ; (the machine code of "cmp rax, 0" would be one byte longer)
     mov  eax, 1     ; return 1
     ret

From 50c1e809b01f465beed3236c6b8517fa4473e3f9 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 5 Apr 2022 01:36:29 +0300
Subject: [PATCH 011/173] matmul plots

---
 .../english/hpc/algorithms/img/mm-blas.svg    | 1570 +++++++++++++++++
 .../hpc/algorithms/img/mm-blocked-barplot.svg | 1402 +++++++++++++++
 .../hpc/algorithms/img/mm-blocked-plot.svg    | 1474 ++++++++++++++++
 .../hpc/algorithms/img/mm-kernel-barplot.svg  | 1277 ++++++++++++++
 .../hpc/algorithms/img/mm-kernel-plot.svg     | 1385 +++++++++++++++
 .../english/hpc/algorithms/img/mm-noalloc.svg | 1344 ++++++++++++++
 .../algorithms/img/mm-vectorized-barplot.svg  | 1140 ++++++++++++
 .../hpc/algorithms/img/mm-vectorized-plot.svg | 1379 +++++++++++++++
 content/english/hpc/algorithms/matmul.md      |   42 +-
 9 files changed, 11012 insertions(+), 1 deletion(-)
 create mode 100644 content/english/hpc/algorithms/img/mm-blas.svg
 create mode 100644 content/english/hpc/algorithms/img/mm-blocked-barplot.svg
 create mode 100644 content/english/hpc/algorithms/img/mm-blocked-plot.svg
 create mode 100644 content/english/hpc/algorithms/img/mm-kernel-barplot.svg
 create mode 100644 content/english/hpc/algorithms/img/mm-kernel-plot.svg
 create mode 100644 content/english/hpc/algorithms/img/mm-noalloc.svg
 create mode 100644 content/english/hpc/algorithms/img/mm-vectorized-barplot.svg
 create mode 100644 content/english/hpc/algorithms/img/mm-vectorized-plot.svg

diff --git a/content/english/hpc/algorithms/img/mm-blas.svg b/content/english/hpc/algorithms/img/mm-blas.svg
new file mode 100644
index 00000000..5027faef
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-blas.svg
@@ -0,0 +1,1570 @@
+<?xml version="1.0" encoding="utf-8" standalone="no"?>
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
+  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns:xlink="http://www.w3.org/1999/xlink" width="576pt" height="360pt" viewBox="0 0 576 360" xmlns="http://www.w3.org/2000/svg" version="1.1">
+ <metadata>
+  <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
+   <cc:Work>
+    <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/>
+    <dc:date>2022-04-05T01:19:43.486396</dc:date>
+    <dc:format>image/svg+xml</dc:format>
+    <dc:creator>
+     <cc:Agent>
+      <dc:title>Matplotlib v3.5.1, https://matplotlib.org/</dc:title>
+     </cc:Agent>
+    </dc:creator>
+   </cc:Work>
+  </rdf:RDF>
+ </metadata>
+ <defs>
+  <style type="text/css">*{stroke-linejoin: round; stroke-linecap: butt}</style>
+ </defs>
+ <g id="figure_1">
+  <g id="patch_1">
+   <path d="M 0 360 
+L 576 360 
+L 576 0 
+L 0 0 
+z
+" style="fill: #ffffff"/>
+  </g>
+  <g id="axes_1">
+   <g id="patch_2">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+L 518.4 43.2 
+L 72 43.2 
+z
+" style="fill: #ffffff"/>
+   </g>
+   <g id="matplotlib.axis_1">
+    <g id="xtick_1">
+     <g id="line2d_1">
+      <path d="M 116.162567 320.4 
+L 116.162567 43.2 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_1">
+      <!-- naive -->
+      <g style="fill: #262626" transform="translate(102.504754 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-6e" d="M 3513 2113 
+L 3513 0 
+L 2938 0 
+L 2938 2094 
+Q 2938 2591 2744 2837 
+Q 2550 3084 2163 3084 
+Q 1697 3084 1428 2787 
+Q 1159 2491 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1366 3272 1645 3428 
+Q 1925 3584 2291 3584 
+Q 2894 3584 3203 3211 
+Q 3513 2838 3513 2113 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-61" d="M 2194 1759 
+Q 1497 1759 1228 1600 
+Q 959 1441 959 1056 
+Q 959 750 1161 570 
+Q 1363 391 1709 391 
+Q 2188 391 2477 730 
+Q 2766 1069 2766 1631 
+L 2766 1759 
+L 2194 1759 
+z
+M 3341 1997 
+L 3341 0 
+L 2766 0 
+L 2766 531 
+Q 2569 213 2275 61 
+Q 1981 -91 1556 -91 
+Q 1019 -91 701 211 
+Q 384 513 384 1019 
+Q 384 1609 779 1909 
+Q 1175 2209 1959 2209 
+L 2766 2209 
+L 2766 2266 
+Q 2766 2663 2505 2880 
+Q 2244 3097 1772 3097 
+Q 1472 3097 1187 3025 
+Q 903 2953 641 2809 
+L 641 3341 
+Q 956 3463 1253 3523 
+Q 1550 3584 1831 3584 
+Q 2591 3584 2966 3190 
+Q 3341 2797 3341 1997 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-69" d="M 603 3500 
+L 1178 3500 
+L 1178 0 
+L 603 0 
+L 603 3500 
+z
+M 603 4863 
+L 1178 4863 
+L 1178 4134 
+L 603 4134 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-76" d="M 191 3500 
+L 800 3500 
+L 1894 563 
+L 2988 3500 
+L 3597 3500 
+L 2284 0 
+L 1503 0 
+L 191 3500 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-65" d="M 3597 1894 
+L 3597 1613 
+L 953 1613 
+Q 991 1019 1311 708 
+Q 1631 397 2203 397 
+Q 2534 397 2845 478 
+Q 3156 559 3463 722 
+L 3463 178 
+Q 3153 47 2828 -22 
+Q 2503 -91 2169 -91 
+Q 1331 -91 842 396 
+Q 353 884 353 1716 
+Q 353 2575 817 3079 
+Q 1281 3584 2069 3584 
+Q 2775 3584 3186 3129 
+Q 3597 2675 3597 1894 
+z
+M 3022 2063 
+Q 3016 2534 2758 2815 
+Q 2500 3097 2075 3097 
+Q 1594 3097 1305 2825 
+Q 1016 2553 972 2059 
+L 3022 2063 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-6e"/>
+       <use xlink:href="#DejaVuSans-61" x="63.378906"/>
+       <use xlink:href="#DejaVuSans-69" x="124.658203"/>
+       <use xlink:href="#DejaVuSans-76" x="152.441406"/>
+       <use xlink:href="#DejaVuSans-65" x="211.621094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_2">
+     <g id="line2d_2">
+      <path d="M 175.841711 320.4 
+L 175.841711 43.2 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_2">
+      <!-- transposed -->
+      <g style="fill: #262626" transform="translate(147.899524 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-74" d="M 1172 4494 
+L 1172 3500 
+L 2356 3500 
+L 2356 3053 
+L 1172 3053 
+L 1172 1153 
+Q 1172 725 1289 603 
+Q 1406 481 1766 481 
+L 2356 481 
+L 2356 0 
+L 1766 0 
+Q 1100 0 847 248 
+Q 594 497 594 1153 
+L 594 3053 
+L 172 3053 
+L 172 3500 
+L 594 3500 
+L 594 4494 
+L 1172 4494 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-72" d="M 2631 2963 
+Q 2534 3019 2420 3045 
+Q 2306 3072 2169 3072 
+Q 1681 3072 1420 2755 
+Q 1159 2438 1159 1844 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1341 3275 1631 3429 
+Q 1922 3584 2338 3584 
+Q 2397 3584 2469 3576 
+Q 2541 3569 2628 3553 
+L 2631 2963 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-73" d="M 2834 3397 
+L 2834 2853 
+Q 2591 2978 2328 3040 
+Q 2066 3103 1784 3103 
+Q 1356 3103 1142 2972 
+Q 928 2841 928 2578 
+Q 928 2378 1081 2264 
+Q 1234 2150 1697 2047 
+L 1894 2003 
+Q 2506 1872 2764 1633 
+Q 3022 1394 3022 966 
+Q 3022 478 2636 193 
+Q 2250 -91 1575 -91 
+Q 1294 -91 989 -36 
+Q 684 19 347 128 
+L 347 722 
+Q 666 556 975 473 
+Q 1284 391 1588 391 
+Q 1994 391 2212 530 
+Q 2431 669 2431 922 
+Q 2431 1156 2273 1281 
+Q 2116 1406 1581 1522 
+L 1381 1569 
+Q 847 1681 609 1914 
+Q 372 2147 372 2553 
+Q 372 3047 722 3315 
+Q 1072 3584 1716 3584 
+Q 2034 3584 2315 3537 
+Q 2597 3491 2834 3397 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-70" d="M 1159 525 
+L 1159 -1331 
+L 581 -1331 
+L 581 3500 
+L 1159 3500 
+L 1159 2969 
+Q 1341 3281 1617 3432 
+Q 1894 3584 2278 3584 
+Q 2916 3584 3314 3078 
+Q 3713 2572 3713 1747 
+Q 3713 922 3314 415 
+Q 2916 -91 2278 -91 
+Q 1894 -91 1617 61 
+Q 1341 213 1159 525 
+z
+M 3116 1747 
+Q 3116 2381 2855 2742 
+Q 2594 3103 2138 3103 
+Q 1681 3103 1420 2742 
+Q 1159 2381 1159 1747 
+Q 1159 1113 1420 752 
+Q 1681 391 2138 391 
+Q 2594 391 2855 752 
+Q 3116 1113 3116 1747 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-6f" d="M 1959 3097 
+Q 1497 3097 1228 2736 
+Q 959 2375 959 1747 
+Q 959 1119 1226 758 
+Q 1494 397 1959 397 
+Q 2419 397 2687 759 
+Q 2956 1122 2956 1747 
+Q 2956 2369 2687 2733 
+Q 2419 3097 1959 3097 
+z
+M 1959 3584 
+Q 2709 3584 3137 3096 
+Q 3566 2609 3566 1747 
+Q 3566 888 3137 398 
+Q 2709 -91 1959 -91 
+Q 1206 -91 779 398 
+Q 353 888 353 1747 
+Q 353 2609 779 3096 
+Q 1206 3584 1959 3584 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-64" d="M 2906 2969 
+L 2906 4863 
+L 3481 4863 
+L 3481 0 
+L 2906 0 
+L 2906 525 
+Q 2725 213 2448 61 
+Q 2172 -91 1784 -91 
+Q 1150 -91 751 415 
+Q 353 922 353 1747 
+Q 353 2572 751 3078 
+Q 1150 3584 1784 3584 
+Q 2172 3584 2448 3432 
+Q 2725 3281 2906 2969 
+z
+M 947 1747 
+Q 947 1113 1208 752 
+Q 1469 391 1925 391 
+Q 2381 391 2643 752 
+Q 2906 1113 2906 1747 
+Q 2906 2381 2643 2742 
+Q 2381 3103 1925 3103 
+Q 1469 3103 1208 2742 
+Q 947 2381 947 1747 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-74"/>
+       <use xlink:href="#DejaVuSans-72" x="39.208984"/>
+       <use xlink:href="#DejaVuSans-61" x="80.322266"/>
+       <use xlink:href="#DejaVuSans-6e" x="141.601562"/>
+       <use xlink:href="#DejaVuSans-73" x="204.980469"/>
+       <use xlink:href="#DejaVuSans-70" x="257.080078"/>
+       <use xlink:href="#DejaVuSans-6f" x="320.556641"/>
+       <use xlink:href="#DejaVuSans-73" x="381.738281"/>
+       <use xlink:href="#DejaVuSans-65" x="433.837891"/>
+       <use xlink:href="#DejaVuSans-64" x="495.361328"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_3">
+     <g id="line2d_3">
+      <path d="M 235.520856 320.4 
+L 235.520856 43.2 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_3">
+      <!-- vectorized -->
+      <g style="fill: #262626" transform="translate(209.396637 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-63" d="M 3122 3366 
+L 3122 2828 
+Q 2878 2963 2633 3030 
+Q 2388 3097 2138 3097 
+Q 1578 3097 1268 2742 
+Q 959 2388 959 1747 
+Q 959 1106 1268 751 
+Q 1578 397 2138 397 
+Q 2388 397 2633 464 
+Q 2878 531 3122 666 
+L 3122 134 
+Q 2881 22 2623 -34 
+Q 2366 -91 2075 -91 
+Q 1284 -91 818 406 
+Q 353 903 353 1747 
+Q 353 2603 823 3093 
+Q 1294 3584 2113 3584 
+Q 2378 3584 2631 3529 
+Q 2884 3475 3122 3366 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-7a" d="M 353 3500 
+L 3084 3500 
+L 3084 2975 
+L 922 459 
+L 3084 459 
+L 3084 0 
+L 275 0 
+L 275 525 
+L 2438 3041 
+L 353 3041 
+L 353 3500 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-76"/>
+       <use xlink:href="#DejaVuSans-65" x="59.179688"/>
+       <use xlink:href="#DejaVuSans-63" x="120.703125"/>
+       <use xlink:href="#DejaVuSans-74" x="175.683594"/>
+       <use xlink:href="#DejaVuSans-6f" x="214.892578"/>
+       <use xlink:href="#DejaVuSans-72" x="276.074219"/>
+       <use xlink:href="#DejaVuSans-69" x="317.1875"/>
+       <use xlink:href="#DejaVuSans-7a" x="344.970703"/>
+       <use xlink:href="#DejaVuSans-65" x="397.460938"/>
+       <use xlink:href="#DejaVuSans-64" x="458.984375"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_4">
+     <g id="line2d_4">
+      <path d="M 295.2 320.4 
+L 295.2 43.2 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_4">
+      <!-- kernel -->
+      <g style="fill: #262626" transform="translate(279.807031 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-6b" d="M 581 4863 
+L 1159 4863 
+L 1159 1991 
+L 2875 3500 
+L 3609 3500 
+L 1753 1863 
+L 3688 0 
+L 2938 0 
+L 1159 1709 
+L 1159 0 
+L 581 0 
+L 581 4863 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-6c" d="M 603 4863 
+L 1178 4863 
+L 1178 0 
+L 603 0 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-6b"/>
+       <use xlink:href="#DejaVuSans-65" x="54.285156"/>
+       <use xlink:href="#DejaVuSans-72" x="115.808594"/>
+       <use xlink:href="#DejaVuSans-6e" x="155.171875"/>
+       <use xlink:href="#DejaVuSans-65" x="218.550781"/>
+       <use xlink:href="#DejaVuSans-6c" x="280.074219"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_5">
+     <g id="line2d_5">
+      <path d="M 354.879144 320.4 
+L 354.879144 43.2 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_5">
+      <!-- blocked -->
+      <g style="fill: #262626" transform="translate(335.542426 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-62" d="M 3116 1747 
+Q 3116 2381 2855 2742 
+Q 2594 3103 2138 3103 
+Q 1681 3103 1420 2742 
+Q 1159 2381 1159 1747 
+Q 1159 1113 1420 752 
+Q 1681 391 2138 391 
+Q 2594 391 2855 752 
+Q 3116 1113 3116 1747 
+z
+M 1159 2969 
+Q 1341 3281 1617 3432 
+Q 1894 3584 2278 3584 
+Q 2916 3584 3314 3078 
+Q 3713 2572 3713 1747 
+Q 3713 922 3314 415 
+Q 2916 -91 2278 -91 
+Q 1894 -91 1617 61 
+Q 1341 213 1159 525 
+L 1159 0 
+L 581 0 
+L 581 4863 
+L 1159 4863 
+L 1159 2969 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-62"/>
+       <use xlink:href="#DejaVuSans-6c" x="63.476562"/>
+       <use xlink:href="#DejaVuSans-6f" x="91.259766"/>
+       <use xlink:href="#DejaVuSans-63" x="152.441406"/>
+       <use xlink:href="#DejaVuSans-6b" x="207.421875"/>
+       <use xlink:href="#DejaVuSans-65" x="261.707031"/>
+       <use xlink:href="#DejaVuSans-64" x="323.230469"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_6">
+     <g id="line2d_6">
+      <path d="M 414.558289 320.4 
+L 414.558289 43.2 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_6">
+      <!-- in-place -->
+      <g style="fill: #262626" transform="translate(394.743445 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-2d" d="M 313 2009 
+L 1997 2009 
+L 1997 1497 
+L 313 1497 
+L 313 2009 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-69"/>
+       <use xlink:href="#DejaVuSans-6e" x="27.783203"/>
+       <use xlink:href="#DejaVuSans-2d" x="91.162109"/>
+       <use xlink:href="#DejaVuSans-70" x="127.246094"/>
+       <use xlink:href="#DejaVuSans-6c" x="190.722656"/>
+       <use xlink:href="#DejaVuSans-61" x="218.505859"/>
+       <use xlink:href="#DejaVuSans-63" x="279.785156"/>
+       <use xlink:href="#DejaVuSans-65" x="334.765625"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_7">
+     <g id="line2d_7">
+      <path d="M 474.237433 320.4 
+L 474.237433 43.2 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_7">
+      <!-- BLAS -->
+      <g style="fill: #262626" transform="translate(461.313996 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-42" d="M 1259 2228 
+L 1259 519 
+L 2272 519 
+Q 2781 519 3026 730 
+Q 3272 941 3272 1375 
+Q 3272 1813 3026 2020 
+Q 2781 2228 2272 2228 
+L 1259 2228 
+z
+M 1259 4147 
+L 1259 2741 
+L 2194 2741 
+Q 2656 2741 2882 2914 
+Q 3109 3088 3109 3444 
+Q 3109 3797 2882 3972 
+Q 2656 4147 2194 4147 
+L 1259 4147 
+z
+M 628 4666 
+L 2241 4666 
+Q 2963 4666 3353 4366 
+Q 3744 4066 3744 3513 
+Q 3744 3084 3544 2831 
+Q 3344 2578 2956 2516 
+Q 3422 2416 3680 2098 
+Q 3938 1781 3938 1306 
+Q 3938 681 3513 340 
+Q 3088 0 2303 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-4c" d="M 628 4666 
+L 1259 4666 
+L 1259 531 
+L 3531 531 
+L 3531 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-41" d="M 2188 4044 
+L 1331 1722 
+L 3047 1722 
+L 2188 4044 
+z
+M 1831 4666 
+L 2547 4666 
+L 4325 0 
+L 3669 0 
+L 3244 1197 
+L 1141 1197 
+L 716 0 
+L 50 0 
+L 1831 4666 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-53" d="M 3425 4513 
+L 3425 3897 
+Q 3066 4069 2747 4153 
+Q 2428 4238 2131 4238 
+Q 1616 4238 1336 4038 
+Q 1056 3838 1056 3469 
+Q 1056 3159 1242 3001 
+Q 1428 2844 1947 2747 
+L 2328 2669 
+Q 3034 2534 3370 2195 
+Q 3706 1856 3706 1288 
+Q 3706 609 3251 259 
+Q 2797 -91 1919 -91 
+Q 1588 -91 1214 -16 
+Q 841 59 441 206 
+L 441 856 
+Q 825 641 1194 531 
+Q 1563 422 1919 422 
+Q 2459 422 2753 634 
+Q 3047 847 3047 1241 
+Q 3047 1584 2836 1778 
+Q 2625 1972 2144 2069 
+L 1759 2144 
+Q 1053 2284 737 2584 
+Q 422 2884 422 3419 
+Q 422 4038 858 4394 
+Q 1294 4750 2059 4750 
+Q 2388 4750 2728 4690 
+Q 3069 4631 3425 4513 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-42"/>
+       <use xlink:href="#DejaVuSans-4c" x="68.603516"/>
+       <use xlink:href="#DejaVuSans-41" x="126.566406"/>
+       <use xlink:href="#DejaVuSans-53" x="194.974609"/>
+      </g>
+     </g>
+    </g>
+   </g>
+   <g id="matplotlib.axis_2">
+    <g id="ytick_1">
+     <g id="line2d_8">
+      <path d="M 72 320.4 
+L 518.4 320.4 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_8">
+      <!-- 0 -->
+      <g style="fill: #262626" transform="translate(55.50125 324.579141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-30" d="M 2034 4250 
+Q 1547 4250 1301 3770 
+Q 1056 3291 1056 2328 
+Q 1056 1369 1301 889 
+Q 1547 409 2034 409 
+Q 2525 409 2770 889 
+Q 3016 1369 3016 2328 
+Q 3016 3291 2770 3770 
+Q 2525 4250 2034 4250 
+z
+M 2034 4750 
+Q 2819 4750 3233 4129 
+Q 3647 3509 3647 2328 
+Q 3647 1150 3233 529 
+Q 2819 -91 2034 -91 
+Q 1250 -91 836 529 
+Q 422 1150 422 2328 
+Q 422 3509 836 4129 
+Q 1250 4750 2034 4750 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-30"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_2">
+     <g id="line2d_9">
+      <path d="M 72 279.15 
+L 518.4 279.15 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_9">
+      <!-- 5 -->
+      <g style="fill: #262626" transform="translate(55.50125 283.329141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-35" d="M 691 4666 
+L 3169 4666 
+L 3169 4134 
+L 1269 4134 
+L 1269 2991 
+Q 1406 3038 1543 3061 
+Q 1681 3084 1819 3084 
+Q 2600 3084 3056 2656 
+Q 3513 2228 3513 1497 
+Q 3513 744 3044 326 
+Q 2575 -91 1722 -91 
+Q 1428 -91 1123 -41 
+Q 819 9 494 109 
+L 494 744 
+Q 775 591 1075 516 
+Q 1375 441 1709 441 
+Q 2250 441 2565 725 
+Q 2881 1009 2881 1497 
+Q 2881 1984 2565 2268 
+Q 2250 2553 1709 2553 
+Q 1456 2553 1204 2497 
+Q 953 2441 691 2322 
+L 691 4666 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-35"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_3">
+     <g id="line2d_10">
+      <path d="M 72 237.9 
+L 518.4 237.9 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_10">
+      <!-- 10 -->
+      <g style="fill: #262626" transform="translate(48.5025 242.079141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-31" d="M 794 531 
+L 1825 531 
+L 1825 4091 
+L 703 3866 
+L 703 4441 
+L 1819 4666 
+L 2450 4666 
+L 2450 531 
+L 3481 531 
+L 3481 0 
+L 794 0 
+L 794 531 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-30" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_4">
+     <g id="line2d_11">
+      <path d="M 72 196.65 
+L 518.4 196.65 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_11">
+      <!-- 15 -->
+      <g style="fill: #262626" transform="translate(48.5025 200.829141)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-35" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_5">
+     <g id="line2d_12">
+      <path d="M 72 155.4 
+L 518.4 155.4 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_12">
+      <!-- 20 -->
+      <g style="fill: #262626" transform="translate(48.5025 159.579141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-32" d="M 1228 531 
+L 3431 531 
+L 3431 0 
+L 469 0 
+L 469 531 
+Q 828 903 1448 1529 
+Q 2069 2156 2228 2338 
+Q 2531 2678 2651 2914 
+Q 2772 3150 2772 3378 
+Q 2772 3750 2511 3984 
+Q 2250 4219 1831 4219 
+Q 1534 4219 1204 4116 
+Q 875 4013 500 3803 
+L 500 4441 
+Q 881 4594 1212 4672 
+Q 1544 4750 1819 4750 
+Q 2544 4750 2975 4387 
+Q 3406 4025 3406 3419 
+Q 3406 3131 3298 2873 
+Q 3191 2616 2906 2266 
+Q 2828 2175 2409 1742 
+Q 1991 1309 1228 531 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-32"/>
+       <use xlink:href="#DejaVuSans-30" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_6">
+     <g id="line2d_13">
+      <path d="M 72 114.15 
+L 518.4 114.15 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_13">
+      <!-- 25 -->
+      <g style="fill: #262626" transform="translate(48.5025 118.329141)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-32"/>
+       <use xlink:href="#DejaVuSans-35" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_7">
+     <g id="line2d_14">
+      <path d="M 72 72.9 
+L 518.4 72.9 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_14">
+      <!-- 30 -->
+      <g style="fill: #262626" transform="translate(48.5025 77.079141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-33" d="M 2597 2516 
+Q 3050 2419 3304 2112 
+Q 3559 1806 3559 1356 
+Q 3559 666 3084 287 
+Q 2609 -91 1734 -91 
+Q 1441 -91 1130 -33 
+Q 819 25 488 141 
+L 488 750 
+Q 750 597 1062 519 
+Q 1375 441 1716 441 
+Q 2309 441 2620 675 
+Q 2931 909 2931 1356 
+Q 2931 1769 2642 2001 
+Q 2353 2234 1838 2234 
+L 1294 2234 
+L 1294 2753 
+L 1863 2753 
+Q 2328 2753 2575 2939 
+Q 2822 3125 2822 3475 
+Q 2822 3834 2567 4026 
+Q 2313 4219 1838 4219 
+Q 1578 4219 1281 4162 
+Q 984 4106 628 3988 
+L 628 4550 
+Q 988 4650 1302 4700 
+Q 1616 4750 1894 4750 
+Q 2613 4750 3031 4423 
+Q 3450 4097 3450 3541 
+Q 3450 3153 3228 2886 
+Q 3006 2619 2597 2516 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-33"/>
+       <use xlink:href="#DejaVuSans-30" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="text_15">
+     <!-- GFLOPS -->
+     <g style="fill: #262626" transform="translate(42.006875 205.175625)rotate(-90)scale(0.12 -0.12)">
+      <defs>
+       <path id="DejaVuSans-47" d="M 3809 666 
+L 3809 1919 
+L 2778 1919 
+L 2778 2438 
+L 4434 2438 
+L 4434 434 
+Q 4069 175 3628 42 
+Q 3188 -91 2688 -91 
+Q 1594 -91 976 548 
+Q 359 1188 359 2328 
+Q 359 3472 976 4111 
+Q 1594 4750 2688 4750 
+Q 3144 4750 3555 4637 
+Q 3966 4525 4313 4306 
+L 4313 3634 
+Q 3963 3931 3569 4081 
+Q 3175 4231 2741 4231 
+Q 1884 4231 1454 3753 
+Q 1025 3275 1025 2328 
+Q 1025 1384 1454 906 
+Q 1884 428 2741 428 
+Q 3075 428 3337 486 
+Q 3600 544 3809 666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-46" d="M 628 4666 
+L 3309 4666 
+L 3309 4134 
+L 1259 4134 
+L 1259 2759 
+L 3109 2759 
+L 3109 2228 
+L 1259 2228 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4f" d="M 2522 4238 
+Q 1834 4238 1429 3725 
+Q 1025 3213 1025 2328 
+Q 1025 1447 1429 934 
+Q 1834 422 2522 422 
+Q 3209 422 3611 934 
+Q 4013 1447 4013 2328 
+Q 4013 3213 3611 3725 
+Q 3209 4238 2522 4238 
+z
+M 2522 4750 
+Q 3503 4750 4090 4092 
+Q 4678 3434 4678 2328 
+Q 4678 1225 4090 567 
+Q 3503 -91 2522 -91 
+Q 1538 -91 948 565 
+Q 359 1222 359 2328 
+Q 359 3434 948 4092 
+Q 1538 4750 2522 4750 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-50" d="M 1259 4147 
+L 1259 2394 
+L 2053 2394 
+Q 2494 2394 2734 2622 
+Q 2975 2850 2975 3272 
+Q 2975 3691 2734 3919 
+Q 2494 4147 2053 4147 
+L 1259 4147 
+z
+M 628 4666 
+L 2053 4666 
+Q 2838 4666 3239 4311 
+Q 3641 3956 3641 3272 
+Q 3641 2581 3239 2228 
+Q 2838 1875 2053 1875 
+L 1259 1875 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-47"/>
+      <use xlink:href="#DejaVuSans-46" x="77.490234"/>
+      <use xlink:href="#DejaVuSans-4c" x="135.009766"/>
+      <use xlink:href="#DejaVuSans-4f" x="187.097656"/>
+      <use xlink:href="#DejaVuSans-50" x="265.808594"/>
+      <use xlink:href="#DejaVuSans-53" x="326.111328"/>
+     </g>
+    </g>
+   </g>
+   <g id="patch_3">
+    <path d="M 92.290909 320.4 
+L 140.034225 320.4 
+L 140.034225 316.913854 
+L 92.290909 316.913854 
+z
+" clip-path="url(#p51dc6f4d83)" style="fill: #4c72b0; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_4">
+    <path d="M 151.970053 320.4 
+L 199.713369 320.4 
+L 199.713369 315.682286 
+L 151.970053 315.682286 
+z
+" clip-path="url(#p51dc6f4d83)" style="fill: #dd8452; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_5">
+    <path d="M 211.649198 320.4 
+L 259.392513 320.4 
+L 259.392513 301.66771 
+L 211.649198 301.66771 
+z
+" clip-path="url(#p51dc6f4d83)" style="fill: #55a868; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_6">
+    <path d="M 271.328342 320.4 
+L 319.071658 320.4 
+L 319.071658 294.362573 
+L 271.328342 294.362573 
+z
+" clip-path="url(#p51dc6f4d83)" style="fill: #c44e52; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_7">
+    <path d="M 331.007487 320.4 
+L 378.750802 320.4 
+L 378.750802 193.865902 
+L 331.007487 193.865902 
+z
+" clip-path="url(#p51dc6f4d83)" style="fill: #8172b3; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_8">
+    <path d="M 390.686631 320.4 
+L 438.429947 320.4 
+L 438.429947 128.209154 
+L 390.686631 128.209154 
+z
+" clip-path="url(#p51dc6f4d83)" style="fill: #937860; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_9">
+    <path d="M 450.365775 320.4 
+L 498.109091 320.4 
+L 498.109091 107.984498 
+L 450.365775 107.984498 
+z
+" clip-path="url(#p51dc6f4d83)" style="fill: #da8bc3; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="line2d_15">
+    <path d="M 339.84 56.4 
+L 496.08 56.4 
+" clip-path="url(#p51dc6f4d83)" style="fill: none; stroke: #808080; stroke-width: 1.5; stroke-linecap: round"/>
+   </g>
+   <g id="patch_10">
+    <path d="M 72 320.4 
+L 72 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_11">
+    <path d="M 518.4 320.4 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_12">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_13">
+    <path d="M 72 43.2 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="text_16">
+    <!-- Theoretical maximum -->
+    <g style="fill: #808080" transform="translate(375.766845 67.5375)scale(0.1 -0.1)">
+     <defs>
+      <path id="DejaVuSans-54" d="M -19 4666 
+L 3928 4666 
+L 3928 4134 
+L 2272 4134 
+L 2272 0 
+L 1638 0 
+L 1638 4134 
+L -19 4134 
+L -19 4666 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-68" d="M 3513 2113 
+L 3513 0 
+L 2938 0 
+L 2938 2094 
+Q 2938 2591 2744 2837 
+Q 2550 3084 2163 3084 
+Q 1697 3084 1428 2787 
+Q 1159 2491 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 4863 
+L 1159 4863 
+L 1159 2956 
+Q 1366 3272 1645 3428 
+Q 1925 3584 2291 3584 
+Q 2894 3584 3203 3211 
+Q 3513 2838 3513 2113 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-20" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6d" d="M 3328 2828 
+Q 3544 3216 3844 3400 
+Q 4144 3584 4550 3584 
+Q 5097 3584 5394 3201 
+Q 5691 2819 5691 2113 
+L 5691 0 
+L 5113 0 
+L 5113 2094 
+Q 5113 2597 4934 2840 
+Q 4756 3084 4391 3084 
+Q 3944 3084 3684 2787 
+Q 3425 2491 3425 1978 
+L 3425 0 
+L 2847 0 
+L 2847 2094 
+Q 2847 2600 2669 2842 
+Q 2491 3084 2119 3084 
+Q 1678 3084 1418 2786 
+Q 1159 2488 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1356 3278 1631 3431 
+Q 1906 3584 2284 3584 
+Q 2666 3584 2933 3390 
+Q 3200 3197 3328 2828 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-78" d="M 3513 3500 
+L 2247 1797 
+L 3578 0 
+L 2900 0 
+L 1881 1375 
+L 863 0 
+L 184 0 
+L 1544 1831 
+L 300 3500 
+L 978 3500 
+L 1906 2253 
+L 2834 3500 
+L 3513 3500 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-75" d="M 544 1381 
+L 544 3500 
+L 1119 3500 
+L 1119 1403 
+Q 1119 906 1312 657 
+Q 1506 409 1894 409 
+Q 2359 409 2629 706 
+Q 2900 1003 2900 1516 
+L 2900 3500 
+L 3475 3500 
+L 3475 0 
+L 2900 0 
+L 2900 538 
+Q 2691 219 2414 64 
+Q 2138 -91 1772 -91 
+Q 1169 -91 856 284 
+Q 544 659 544 1381 
+z
+M 1991 3584 
+L 1991 3584 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-54"/>
+     <use xlink:href="#DejaVuSans-68" x="61.083984"/>
+     <use xlink:href="#DejaVuSans-65" x="124.462891"/>
+     <use xlink:href="#DejaVuSans-6f" x="185.986328"/>
+     <use xlink:href="#DejaVuSans-72" x="247.167969"/>
+     <use xlink:href="#DejaVuSans-65" x="286.03125"/>
+     <use xlink:href="#DejaVuSans-74" x="347.554688"/>
+     <use xlink:href="#DejaVuSans-69" x="386.763672"/>
+     <use xlink:href="#DejaVuSans-63" x="414.546875"/>
+     <use xlink:href="#DejaVuSans-61" x="469.527344"/>
+     <use xlink:href="#DejaVuSans-6c" x="530.806641"/>
+     <use xlink:href="#DejaVuSans-20" x="558.589844"/>
+     <use xlink:href="#DejaVuSans-6d" x="590.376953"/>
+     <use xlink:href="#DejaVuSans-61" x="687.789062"/>
+     <use xlink:href="#DejaVuSans-78" x="749.068359"/>
+     <use xlink:href="#DejaVuSans-69" x="808.248047"/>
+     <use xlink:href="#DejaVuSans-6d" x="836.03125"/>
+     <use xlink:href="#DejaVuSans-75" x="933.443359"/>
+     <use xlink:href="#DejaVuSans-6d" x="996.822266"/>
+    </g>
+   </g>
+   <g id="text_17">
+    <!-- 1.00x -->
+    <g style="fill: #262626" transform="translate(97.489442 311.918229)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-31" d="M 750 831 
+L 1813 831 
+L 1813 3847 
+L 722 3622 
+L 722 4441 
+L 1806 4666 
+L 2950 4666 
+L 2950 831 
+L 4013 831 
+L 4013 0 
+L 750 0 
+L 750 831 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-2e" d="M 653 1209 
+L 1778 1209 
+L 1778 0 
+L 653 0 
+L 653 1209 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-30" d="M 2944 2338 
+Q 2944 3213 2780 3570 
+Q 2616 3928 2228 3928 
+Q 1841 3928 1675 3570 
+Q 1509 3213 1509 2338 
+Q 1509 1453 1675 1090 
+Q 1841 728 2228 728 
+Q 2613 728 2778 1090 
+Q 2944 1453 2944 2338 
+z
+M 4147 2328 
+Q 4147 1169 3647 539 
+Q 3147 -91 2228 -91 
+Q 1306 -91 806 539 
+Q 306 1169 306 2328 
+Q 306 3491 806 4120 
+Q 1306 4750 2228 4750 
+Q 3147 4750 3647 4120 
+Q 4147 3491 4147 2328 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-78" d="M 1422 1791 
+L 159 3500 
+L 1344 3500 
+L 2059 2463 
+L 2784 3500 
+L 3969 3500 
+L 2706 1797 
+L 4031 0 
+L 2847 0 
+L 2059 1106 
+L 1281 0 
+L 97 0 
+L 1422 1791 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-31"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-30" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-30" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_18">
+    <!-- 1.35x -->
+    <g style="fill: #262626" transform="translate(157.168586 310.686661)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-33" d="M 2981 2516 
+Q 3453 2394 3698 2092 
+Q 3944 1791 3944 1325 
+Q 3944 631 3412 270 
+Q 2881 -91 1863 -91 
+Q 1503 -91 1142 -33 
+Q 781 25 428 141 
+L 428 1069 
+Q 766 900 1098 814 
+Q 1431 728 1753 728 
+Q 2231 728 2486 893 
+Q 2741 1059 2741 1369 
+Q 2741 1688 2480 1852 
+Q 2219 2016 1709 2016 
+L 1228 2016 
+L 1228 2791 
+L 1734 2791 
+Q 2188 2791 2409 2933 
+Q 2631 3075 2631 3366 
+Q 2631 3634 2415 3781 
+Q 2200 3928 1806 3928 
+Q 1516 3928 1219 3862 
+Q 922 3797 628 3669 
+L 628 4550 
+Q 984 4650 1334 4700 
+Q 1684 4750 2022 4750 
+Q 2931 4750 3382 4451 
+Q 3834 4153 3834 3553 
+Q 3834 3144 3618 2883 
+Q 3403 2622 2981 2516 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-35" d="M 678 4666 
+L 3669 4666 
+L 3669 3781 
+L 1638 3781 
+L 1638 3059 
+Q 1775 3097 1914 3117 
+Q 2053 3138 2203 3138 
+Q 3056 3138 3531 2711 
+Q 4006 2284 4006 1522 
+Q 4006 766 3489 337 
+Q 2972 -91 2053 -91 
+Q 1656 -91 1267 -14 
+Q 878 63 494 219 
+L 494 1166 
+Q 875 947 1217 837 
+Q 1559 728 1863 728 
+Q 2300 728 2551 942 
+Q 2803 1156 2803 1522 
+Q 2803 1891 2551 2103 
+Q 2300 2316 1863 2316 
+Q 1603 2316 1309 2248 
+Q 1016 2181 678 2041 
+L 678 4666 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-31"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-35" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_19">
+    <!-- 5.37x -->
+    <g style="fill: #262626" transform="translate(216.847731 296.672085)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-37" d="M 428 4666 
+L 3944 4666 
+L 3944 3988 
+L 2125 0 
+L 953 0 
+L 2675 3781 
+L 428 3781 
+L 428 4666 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-35"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-37" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_20">
+    <!-- 7.47x -->
+    <g style="fill: #262626" transform="translate(276.526875 289.366948)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-34" d="M 2356 3675 
+L 1038 1722 
+L 2356 1722 
+L 2356 3675 
+z
+M 2156 4666 
+L 3494 4666 
+L 3494 1722 
+L 4159 1722 
+L 4159 850 
+L 3494 850 
+L 3494 0 
+L 2356 0 
+L 2356 850 
+L 288 850 
+L 288 1881 
+L 2156 4666 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-37"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-34" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-37" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_21">
+    <!-- 36.30x -->
+    <g style="fill: #262626" transform="translate(332.031332 188.870277)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-36" d="M 2316 2303 
+Q 2000 2303 1842 2098 
+Q 1684 1894 1684 1484 
+Q 1684 1075 1842 870 
+Q 2000 666 2316 666 
+Q 2634 666 2792 870 
+Q 2950 1075 2950 1484 
+Q 2950 1894 2792 2098 
+Q 2634 2303 2316 2303 
+z
+M 3803 4544 
+L 3803 3681 
+Q 3506 3822 3243 3889 
+Q 2981 3956 2731 3956 
+Q 2194 3956 1894 3657 
+Q 1594 3359 1544 2772 
+Q 1750 2925 1990 3001 
+Q 2231 3078 2516 3078 
+Q 3231 3078 3670 2659 
+Q 4109 2241 4109 1563 
+Q 4109 813 3618 361 
+Q 3128 -91 2303 -91 
+Q 1394 -91 895 523 
+Q 397 1138 397 2266 
+Q 397 3422 980 4083 
+Q 1563 4744 2578 4744 
+Q 2900 4744 3203 4694 
+Q 3506 4644 3803 4544 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-33"/>
+     <use xlink:href="#DejaVuSans-Bold-36" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="139.160156"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-30" x="246.728516"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="316.308594"/>
+    </g>
+   </g>
+   <g id="text_22">
+    <!-- 55.13x -->
+    <g style="fill: #262626" transform="translate(391.710476 123.213529)scale(0.12 -0.12)">
+     <use xlink:href="#DejaVuSans-Bold-35"/>
+     <use xlink:href="#DejaVuSans-Bold-35" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="139.160156"/>
+     <use xlink:href="#DejaVuSans-Bold-31" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="246.728516"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="316.308594"/>
+    </g>
+   </g>
+   <g id="text_23">
+    <!-- 60.93x -->
+    <g style="fill: #262626" transform="translate(451.389621 102.988873)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-39" d="M 641 103 
+L 641 966 
+Q 928 831 1190 764 
+Q 1453 697 1709 697 
+Q 2247 697 2547 995 
+Q 2847 1294 2900 1881 
+Q 2688 1725 2447 1647 
+Q 2206 1569 1925 1569 
+Q 1209 1569 770 1986 
+Q 331 2403 331 3084 
+Q 331 3838 820 4291 
+Q 1309 4744 2131 4744 
+Q 3044 4744 3544 4128 
+Q 4044 3513 4044 2388 
+Q 4044 1231 3459 570 
+Q 2875 -91 1856 -91 
+Q 1528 -91 1228 -42 
+Q 928 6 641 103 
+z
+M 2125 2350 
+Q 2441 2350 2600 2554 
+Q 2759 2759 2759 3169 
+Q 2759 3575 2600 3781 
+Q 2441 3988 2125 3988 
+Q 1809 3988 1650 3781 
+Q 1491 3575 1491 3169 
+Q 1491 2759 1650 2554 
+Q 1809 2350 2125 2350 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-36"/>
+     <use xlink:href="#DejaVuSans-Bold-30" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="139.160156"/>
+     <use xlink:href="#DejaVuSans-Bold-39" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="246.728516"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="316.308594"/>
+    </g>
+   </g>
+   <g id="text_24">
+    <!-- Matrix multiplication ($n=1920$) -->
+    <g style="fill: #262626" transform="translate(184.74 23.2)scale(0.14 -0.14)">
+     <defs>
+      <path id="DejaVuSans-4d" d="M 628 4666 
+L 1569 4666 
+L 2759 1491 
+L 3956 4666 
+L 4897 4666 
+L 4897 0 
+L 4281 0 
+L 4281 4097 
+L 3078 897 
+L 2444 897 
+L 1241 4097 
+L 1241 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-28" d="M 1984 4856 
+Q 1566 4138 1362 3434 
+Q 1159 2731 1159 2009 
+Q 1159 1288 1364 580 
+Q 1569 -128 1984 -844 
+L 1484 -844 
+Q 1016 -109 783 600 
+Q 550 1309 550 2009 
+Q 550 2706 781 3412 
+Q 1013 4119 1484 4856 
+L 1984 4856 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Oblique-6e" d="M 3566 2113 
+L 3156 0 
+L 2578 0 
+L 2988 2091 
+Q 3016 2238 3031 2350 
+Q 3047 2463 3047 2528 
+Q 3047 2791 2881 2937 
+Q 2716 3084 2419 3084 
+Q 1956 3084 1622 2776 
+Q 1288 2469 1184 1941 
+L 800 0 
+L 225 0 
+L 903 3500 
+L 1478 3500 
+L 1363 2950 
+Q 1603 3253 1940 3418 
+Q 2278 3584 2650 3584 
+Q 3113 3584 3367 3334 
+Q 3622 3084 3622 2631 
+Q 3622 2519 3608 2391 
+Q 3594 2263 3566 2113 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-3d" d="M 678 2906 
+L 4684 2906 
+L 4684 2381 
+L 678 2381 
+L 678 2906 
+z
+M 678 1631 
+L 4684 1631 
+L 4684 1100 
+L 678 1100 
+L 678 1631 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-39" d="M 703 97 
+L 703 672 
+Q 941 559 1184 500 
+Q 1428 441 1663 441 
+Q 2288 441 2617 861 
+Q 2947 1281 2994 2138 
+Q 2813 1869 2534 1725 
+Q 2256 1581 1919 1581 
+Q 1219 1581 811 2004 
+Q 403 2428 403 3163 
+Q 403 3881 828 4315 
+Q 1253 4750 1959 4750 
+Q 2769 4750 3195 4129 
+Q 3622 3509 3622 2328 
+Q 3622 1225 3098 567 
+Q 2575 -91 1691 -91 
+Q 1453 -91 1209 -44 
+Q 966 3 703 97 
+z
+M 1959 2075 
+Q 2384 2075 2632 2365 
+Q 2881 2656 2881 3163 
+Q 2881 3666 2632 3958 
+Q 2384 4250 1959 4250 
+Q 1534 4250 1286 3958 
+Q 1038 3666 1038 3163 
+Q 1038 2656 1286 2365 
+Q 1534 2075 1959 2075 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-29" d="M 513 4856 
+L 1013 4856 
+Q 1481 4119 1714 3412 
+Q 1947 2706 1947 2009 
+Q 1947 1309 1714 600 
+Q 1481 -109 1013 -844 
+L 513 -844 
+Q 928 -128 1133 580 
+Q 1338 1288 1338 2009 
+Q 1338 2731 1133 3434 
+Q 928 4138 513 4856 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-4d" transform="translate(0 0.015625)"/>
+     <use xlink:href="#DejaVuSans-61" transform="translate(86.279297 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(147.558594 0.015625)"/>
+     <use xlink:href="#DejaVuSans-72" transform="translate(186.767578 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(227.880859 0.015625)"/>
+     <use xlink:href="#DejaVuSans-78" transform="translate(255.664062 0.015625)"/>
+     <use xlink:href="#DejaVuSans-20" transform="translate(314.84375 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6d" transform="translate(346.630859 0.015625)"/>
+     <use xlink:href="#DejaVuSans-75" transform="translate(444.042969 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6c" transform="translate(507.421875 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(535.205078 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(574.414062 0.015625)"/>
+     <use xlink:href="#DejaVuSans-70" transform="translate(602.197266 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6c" transform="translate(665.673828 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(693.457031 0.015625)"/>
+     <use xlink:href="#DejaVuSans-63" transform="translate(721.240234 0.015625)"/>
+     <use xlink:href="#DejaVuSans-61" transform="translate(776.220703 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(837.5 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(876.708984 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6f" transform="translate(904.492188 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6e" transform="translate(965.673828 0.015625)"/>
+     <use xlink:href="#DejaVuSans-20" transform="translate(1029.052734 0.015625)"/>
+     <use xlink:href="#DejaVuSans-28" transform="translate(1060.839844 0.015625)"/>
+     <use xlink:href="#DejaVuSans-Oblique-6e" transform="translate(1099.853516 0.015625)"/>
+     <use xlink:href="#DejaVuSans-3d" transform="translate(1182.714844 0.015625)"/>
+     <use xlink:href="#DejaVuSans-31" transform="translate(1285.986328 0.015625)"/>
+     <use xlink:href="#DejaVuSans-39" transform="translate(1349.609375 0.015625)"/>
+     <use xlink:href="#DejaVuSans-32" transform="translate(1411.482422 0.015625)"/>
+     <use xlink:href="#DejaVuSans-30" transform="translate(1475.105469 0.015625)"/>
+     <use xlink:href="#DejaVuSans-29" transform="translate(1538.728516 0.015625)"/>
+    </g>
+   </g>
+  </g>
+ </g>
+ <defs>
+  <clipPath id="p51dc6f4d83">
+   <rect x="72" y="43.2" width="446.4" height="277.2"/>
+  </clipPath>
+ </defs>
+</svg>
diff --git a/content/english/hpc/algorithms/img/mm-blocked-barplot.svg b/content/english/hpc/algorithms/img/mm-blocked-barplot.svg
new file mode 100644
index 00000000..93334ac1
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-blocked-barplot.svg
@@ -0,0 +1,1402 @@
+<?xml version="1.0" encoding="utf-8" standalone="no"?>
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
+  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns:xlink="http://www.w3.org/1999/xlink" width="576pt" height="360pt" viewBox="0 0 576 360" xmlns="http://www.w3.org/2000/svg" version="1.1">
+ <metadata>
+  <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
+   <cc:Work>
+    <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/>
+    <dc:date>2022-04-05T01:18:41.689702</dc:date>
+    <dc:format>image/svg+xml</dc:format>
+    <dc:creator>
+     <cc:Agent>
+      <dc:title>Matplotlib v3.5.1, https://matplotlib.org/</dc:title>
+     </cc:Agent>
+    </dc:creator>
+   </cc:Work>
+  </rdf:RDF>
+ </metadata>
+ <defs>
+  <style type="text/css">*{stroke-linejoin: round; stroke-linecap: butt}</style>
+ </defs>
+ <g id="figure_1">
+  <g id="patch_1">
+   <path d="M 0 360 
+L 576 360 
+L 576 0 
+L 0 0 
+z
+" style="fill: #ffffff"/>
+  </g>
+  <g id="axes_1">
+   <g id="patch_2">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+L 518.4 43.2 
+L 72 43.2 
+z
+" style="fill: #ffffff"/>
+   </g>
+   <g id="matplotlib.axis_1">
+    <g id="xtick_1">
+     <g id="line2d_1">
+      <path d="M 126.109091 320.4 
+L 126.109091 43.2 
+" clip-path="url(#p2c5cd7951c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_1">
+      <!-- naive -->
+      <g style="fill: #262626" transform="translate(112.451278 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-6e" d="M 3513 2113 
+L 3513 0 
+L 2938 0 
+L 2938 2094 
+Q 2938 2591 2744 2837 
+Q 2550 3084 2163 3084 
+Q 1697 3084 1428 2787 
+Q 1159 2491 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1366 3272 1645 3428 
+Q 1925 3584 2291 3584 
+Q 2894 3584 3203 3211 
+Q 3513 2838 3513 2113 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-61" d="M 2194 1759 
+Q 1497 1759 1228 1600 
+Q 959 1441 959 1056 
+Q 959 750 1161 570 
+Q 1363 391 1709 391 
+Q 2188 391 2477 730 
+Q 2766 1069 2766 1631 
+L 2766 1759 
+L 2194 1759 
+z
+M 3341 1997 
+L 3341 0 
+L 2766 0 
+L 2766 531 
+Q 2569 213 2275 61 
+Q 1981 -91 1556 -91 
+Q 1019 -91 701 211 
+Q 384 513 384 1019 
+Q 384 1609 779 1909 
+Q 1175 2209 1959 2209 
+L 2766 2209 
+L 2766 2266 
+Q 2766 2663 2505 2880 
+Q 2244 3097 1772 3097 
+Q 1472 3097 1187 3025 
+Q 903 2953 641 2809 
+L 641 3341 
+Q 956 3463 1253 3523 
+Q 1550 3584 1831 3584 
+Q 2591 3584 2966 3190 
+Q 3341 2797 3341 1997 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-69" d="M 603 3500 
+L 1178 3500 
+L 1178 0 
+L 603 0 
+L 603 3500 
+z
+M 603 4863 
+L 1178 4863 
+L 1178 4134 
+L 603 4134 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-76" d="M 191 3500 
+L 800 3500 
+L 1894 563 
+L 2988 3500 
+L 3597 3500 
+L 2284 0 
+L 1503 0 
+L 191 3500 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-65" d="M 3597 1894 
+L 3597 1613 
+L 953 1613 
+Q 991 1019 1311 708 
+Q 1631 397 2203 397 
+Q 2534 397 2845 478 
+Q 3156 559 3463 722 
+L 3463 178 
+Q 3153 47 2828 -22 
+Q 2503 -91 2169 -91 
+Q 1331 -91 842 396 
+Q 353 884 353 1716 
+Q 353 2575 817 3079 
+Q 1281 3584 2069 3584 
+Q 2775 3584 3186 3129 
+Q 3597 2675 3597 1894 
+z
+M 3022 2063 
+Q 3016 2534 2758 2815 
+Q 2500 3097 2075 3097 
+Q 1594 3097 1305 2825 
+Q 1016 2553 972 2059 
+L 3022 2063 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-6e"/>
+       <use xlink:href="#DejaVuSans-61" x="63.378906"/>
+       <use xlink:href="#DejaVuSans-69" x="124.658203"/>
+       <use xlink:href="#DejaVuSans-76" x="152.441406"/>
+       <use xlink:href="#DejaVuSans-65" x="211.621094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_2">
+     <g id="line2d_2">
+      <path d="M 210.654545 320.4 
+L 210.654545 43.2 
+" clip-path="url(#p2c5cd7951c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_2">
+      <!-- transposed -->
+      <g style="fill: #262626" transform="translate(182.712358 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-74" d="M 1172 4494 
+L 1172 3500 
+L 2356 3500 
+L 2356 3053 
+L 1172 3053 
+L 1172 1153 
+Q 1172 725 1289 603 
+Q 1406 481 1766 481 
+L 2356 481 
+L 2356 0 
+L 1766 0 
+Q 1100 0 847 248 
+Q 594 497 594 1153 
+L 594 3053 
+L 172 3053 
+L 172 3500 
+L 594 3500 
+L 594 4494 
+L 1172 4494 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-72" d="M 2631 2963 
+Q 2534 3019 2420 3045 
+Q 2306 3072 2169 3072 
+Q 1681 3072 1420 2755 
+Q 1159 2438 1159 1844 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1341 3275 1631 3429 
+Q 1922 3584 2338 3584 
+Q 2397 3584 2469 3576 
+Q 2541 3569 2628 3553 
+L 2631 2963 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-73" d="M 2834 3397 
+L 2834 2853 
+Q 2591 2978 2328 3040 
+Q 2066 3103 1784 3103 
+Q 1356 3103 1142 2972 
+Q 928 2841 928 2578 
+Q 928 2378 1081 2264 
+Q 1234 2150 1697 2047 
+L 1894 2003 
+Q 2506 1872 2764 1633 
+Q 3022 1394 3022 966 
+Q 3022 478 2636 193 
+Q 2250 -91 1575 -91 
+Q 1294 -91 989 -36 
+Q 684 19 347 128 
+L 347 722 
+Q 666 556 975 473 
+Q 1284 391 1588 391 
+Q 1994 391 2212 530 
+Q 2431 669 2431 922 
+Q 2431 1156 2273 1281 
+Q 2116 1406 1581 1522 
+L 1381 1569 
+Q 847 1681 609 1914 
+Q 372 2147 372 2553 
+Q 372 3047 722 3315 
+Q 1072 3584 1716 3584 
+Q 2034 3584 2315 3537 
+Q 2597 3491 2834 3397 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-70" d="M 1159 525 
+L 1159 -1331 
+L 581 -1331 
+L 581 3500 
+L 1159 3500 
+L 1159 2969 
+Q 1341 3281 1617 3432 
+Q 1894 3584 2278 3584 
+Q 2916 3584 3314 3078 
+Q 3713 2572 3713 1747 
+Q 3713 922 3314 415 
+Q 2916 -91 2278 -91 
+Q 1894 -91 1617 61 
+Q 1341 213 1159 525 
+z
+M 3116 1747 
+Q 3116 2381 2855 2742 
+Q 2594 3103 2138 3103 
+Q 1681 3103 1420 2742 
+Q 1159 2381 1159 1747 
+Q 1159 1113 1420 752 
+Q 1681 391 2138 391 
+Q 2594 391 2855 752 
+Q 3116 1113 3116 1747 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-6f" d="M 1959 3097 
+Q 1497 3097 1228 2736 
+Q 959 2375 959 1747 
+Q 959 1119 1226 758 
+Q 1494 397 1959 397 
+Q 2419 397 2687 759 
+Q 2956 1122 2956 1747 
+Q 2956 2369 2687 2733 
+Q 2419 3097 1959 3097 
+z
+M 1959 3584 
+Q 2709 3584 3137 3096 
+Q 3566 2609 3566 1747 
+Q 3566 888 3137 398 
+Q 2709 -91 1959 -91 
+Q 1206 -91 779 398 
+Q 353 888 353 1747 
+Q 353 2609 779 3096 
+Q 1206 3584 1959 3584 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-64" d="M 2906 2969 
+L 2906 4863 
+L 3481 4863 
+L 3481 0 
+L 2906 0 
+L 2906 525 
+Q 2725 213 2448 61 
+Q 2172 -91 1784 -91 
+Q 1150 -91 751 415 
+Q 353 922 353 1747 
+Q 353 2572 751 3078 
+Q 1150 3584 1784 3584 
+Q 2172 3584 2448 3432 
+Q 2725 3281 2906 2969 
+z
+M 947 1747 
+Q 947 1113 1208 752 
+Q 1469 391 1925 391 
+Q 2381 391 2643 752 
+Q 2906 1113 2906 1747 
+Q 2906 2381 2643 2742 
+Q 2381 3103 1925 3103 
+Q 1469 3103 1208 2742 
+Q 947 2381 947 1747 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-74"/>
+       <use xlink:href="#DejaVuSans-72" x="39.208984"/>
+       <use xlink:href="#DejaVuSans-61" x="80.322266"/>
+       <use xlink:href="#DejaVuSans-6e" x="141.601562"/>
+       <use xlink:href="#DejaVuSans-73" x="204.980469"/>
+       <use xlink:href="#DejaVuSans-70" x="257.080078"/>
+       <use xlink:href="#DejaVuSans-6f" x="320.556641"/>
+       <use xlink:href="#DejaVuSans-73" x="381.738281"/>
+       <use xlink:href="#DejaVuSans-65" x="433.837891"/>
+       <use xlink:href="#DejaVuSans-64" x="495.361328"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_3">
+     <g id="line2d_3">
+      <path d="M 295.2 320.4 
+L 295.2 43.2 
+" clip-path="url(#p2c5cd7951c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_3">
+      <!-- vectorized -->
+      <g style="fill: #262626" transform="translate(269.075781 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-63" d="M 3122 3366 
+L 3122 2828 
+Q 2878 2963 2633 3030 
+Q 2388 3097 2138 3097 
+Q 1578 3097 1268 2742 
+Q 959 2388 959 1747 
+Q 959 1106 1268 751 
+Q 1578 397 2138 397 
+Q 2388 397 2633 464 
+Q 2878 531 3122 666 
+L 3122 134 
+Q 2881 22 2623 -34 
+Q 2366 -91 2075 -91 
+Q 1284 -91 818 406 
+Q 353 903 353 1747 
+Q 353 2603 823 3093 
+Q 1294 3584 2113 3584 
+Q 2378 3584 2631 3529 
+Q 2884 3475 3122 3366 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-7a" d="M 353 3500 
+L 3084 3500 
+L 3084 2975 
+L 922 459 
+L 3084 459 
+L 3084 0 
+L 275 0 
+L 275 525 
+L 2438 3041 
+L 353 3041 
+L 353 3500 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-76"/>
+       <use xlink:href="#DejaVuSans-65" x="59.179688"/>
+       <use xlink:href="#DejaVuSans-63" x="120.703125"/>
+       <use xlink:href="#DejaVuSans-74" x="175.683594"/>
+       <use xlink:href="#DejaVuSans-6f" x="214.892578"/>
+       <use xlink:href="#DejaVuSans-72" x="276.074219"/>
+       <use xlink:href="#DejaVuSans-69" x="317.1875"/>
+       <use xlink:href="#DejaVuSans-7a" x="344.970703"/>
+       <use xlink:href="#DejaVuSans-65" x="397.460938"/>
+       <use xlink:href="#DejaVuSans-64" x="458.984375"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_4">
+     <g id="line2d_4">
+      <path d="M 379.745455 320.4 
+L 379.745455 43.2 
+" clip-path="url(#p2c5cd7951c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_4">
+      <!-- kernel -->
+      <g style="fill: #262626" transform="translate(364.352486 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-6b" d="M 581 4863 
+L 1159 4863 
+L 1159 1991 
+L 2875 3500 
+L 3609 3500 
+L 1753 1863 
+L 3688 0 
+L 2938 0 
+L 1159 1709 
+L 1159 0 
+L 581 0 
+L 581 4863 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-6c" d="M 603 4863 
+L 1178 4863 
+L 1178 0 
+L 603 0 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-6b"/>
+       <use xlink:href="#DejaVuSans-65" x="54.285156"/>
+       <use xlink:href="#DejaVuSans-72" x="115.808594"/>
+       <use xlink:href="#DejaVuSans-6e" x="155.171875"/>
+       <use xlink:href="#DejaVuSans-65" x="218.550781"/>
+       <use xlink:href="#DejaVuSans-6c" x="280.074219"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_5">
+     <g id="line2d_5">
+      <path d="M 464.290909 320.4 
+L 464.290909 43.2 
+" clip-path="url(#p2c5cd7951c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_5">
+      <!-- blocked -->
+      <g style="fill: #262626" transform="translate(444.95419 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-62" d="M 3116 1747 
+Q 3116 2381 2855 2742 
+Q 2594 3103 2138 3103 
+Q 1681 3103 1420 2742 
+Q 1159 2381 1159 1747 
+Q 1159 1113 1420 752 
+Q 1681 391 2138 391 
+Q 2594 391 2855 752 
+Q 3116 1113 3116 1747 
+z
+M 1159 2969 
+Q 1341 3281 1617 3432 
+Q 1894 3584 2278 3584 
+Q 2916 3584 3314 3078 
+Q 3713 2572 3713 1747 
+Q 3713 922 3314 415 
+Q 2916 -91 2278 -91 
+Q 1894 -91 1617 61 
+Q 1341 213 1159 525 
+L 1159 0 
+L 581 0 
+L 581 4863 
+L 1159 4863 
+L 1159 2969 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-62"/>
+       <use xlink:href="#DejaVuSans-6c" x="63.476562"/>
+       <use xlink:href="#DejaVuSans-6f" x="91.259766"/>
+       <use xlink:href="#DejaVuSans-63" x="152.441406"/>
+       <use xlink:href="#DejaVuSans-6b" x="207.421875"/>
+       <use xlink:href="#DejaVuSans-65" x="261.707031"/>
+       <use xlink:href="#DejaVuSans-64" x="323.230469"/>
+      </g>
+     </g>
+    </g>
+   </g>
+   <g id="matplotlib.axis_2">
+    <g id="ytick_1">
+     <g id="line2d_6">
+      <path d="M 72 320.4 
+L 518.4 320.4 
+" clip-path="url(#p2c5cd7951c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_6">
+      <!-- 0 -->
+      <g style="fill: #262626" transform="translate(55.50125 324.579141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-30" d="M 2034 4250 
+Q 1547 4250 1301 3770 
+Q 1056 3291 1056 2328 
+Q 1056 1369 1301 889 
+Q 1547 409 2034 409 
+Q 2525 409 2770 889 
+Q 3016 1369 3016 2328 
+Q 3016 3291 2770 3770 
+Q 2525 4250 2034 4250 
+z
+M 2034 4750 
+Q 2819 4750 3233 4129 
+Q 3647 3509 3647 2328 
+Q 3647 1150 3233 529 
+Q 2819 -91 2034 -91 
+Q 1250 -91 836 529 
+Q 422 1150 422 2328 
+Q 422 3509 836 4129 
+Q 1250 4750 2034 4750 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-30"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_2">
+     <g id="line2d_7">
+      <path d="M 72 286.8 
+L 518.4 286.8 
+" clip-path="url(#p2c5cd7951c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_7">
+      <!-- 2 -->
+      <g style="fill: #262626" transform="translate(55.50125 290.979141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-32" d="M 1228 531 
+L 3431 531 
+L 3431 0 
+L 469 0 
+L 469 531 
+Q 828 903 1448 1529 
+Q 2069 2156 2228 2338 
+Q 2531 2678 2651 2914 
+Q 2772 3150 2772 3378 
+Q 2772 3750 2511 3984 
+Q 2250 4219 1831 4219 
+Q 1534 4219 1204 4116 
+Q 875 4013 500 3803 
+L 500 4441 
+Q 881 4594 1212 4672 
+Q 1544 4750 1819 4750 
+Q 2544 4750 2975 4387 
+Q 3406 4025 3406 3419 
+Q 3406 3131 3298 2873 
+Q 3191 2616 2906 2266 
+Q 2828 2175 2409 1742 
+Q 1991 1309 1228 531 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-32"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_3">
+     <g id="line2d_8">
+      <path d="M 72 253.2 
+L 518.4 253.2 
+" clip-path="url(#p2c5cd7951c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_8">
+      <!-- 4 -->
+      <g style="fill: #262626" transform="translate(55.50125 257.379141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-34" d="M 2419 4116 
+L 825 1625 
+L 2419 1625 
+L 2419 4116 
+z
+M 2253 4666 
+L 3047 4666 
+L 3047 1625 
+L 3713 1625 
+L 3713 1100 
+L 3047 1100 
+L 3047 0 
+L 2419 0 
+L 2419 1100 
+L 313 1100 
+L 313 1709 
+L 2253 4666 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-34"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_4">
+     <g id="line2d_9">
+      <path d="M 72 219.6 
+L 518.4 219.6 
+" clip-path="url(#p2c5cd7951c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_9">
+      <!-- 6 -->
+      <g style="fill: #262626" transform="translate(55.50125 223.779141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-36" d="M 2113 2584 
+Q 1688 2584 1439 2293 
+Q 1191 2003 1191 1497 
+Q 1191 994 1439 701 
+Q 1688 409 2113 409 
+Q 2538 409 2786 701 
+Q 3034 994 3034 1497 
+Q 3034 2003 2786 2293 
+Q 2538 2584 2113 2584 
+z
+M 3366 4563 
+L 3366 3988 
+Q 3128 4100 2886 4159 
+Q 2644 4219 2406 4219 
+Q 1781 4219 1451 3797 
+Q 1122 3375 1075 2522 
+Q 1259 2794 1537 2939 
+Q 1816 3084 2150 3084 
+Q 2853 3084 3261 2657 
+Q 3669 2231 3669 1497 
+Q 3669 778 3244 343 
+Q 2819 -91 2113 -91 
+Q 1303 -91 875 529 
+Q 447 1150 447 2328 
+Q 447 3434 972 4092 
+Q 1497 4750 2381 4750 
+Q 2619 4750 2861 4703 
+Q 3103 4656 3366 4563 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-36"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_5">
+     <g id="line2d_10">
+      <path d="M 72 186 
+L 518.4 186 
+" clip-path="url(#p2c5cd7951c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_10">
+      <!-- 8 -->
+      <g style="fill: #262626" transform="translate(55.50125 190.179141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-38" d="M 2034 2216 
+Q 1584 2216 1326 1975 
+Q 1069 1734 1069 1313 
+Q 1069 891 1326 650 
+Q 1584 409 2034 409 
+Q 2484 409 2743 651 
+Q 3003 894 3003 1313 
+Q 3003 1734 2745 1975 
+Q 2488 2216 2034 2216 
+z
+M 1403 2484 
+Q 997 2584 770 2862 
+Q 544 3141 544 3541 
+Q 544 4100 942 4425 
+Q 1341 4750 2034 4750 
+Q 2731 4750 3128 4425 
+Q 3525 4100 3525 3541 
+Q 3525 3141 3298 2862 
+Q 3072 2584 2669 2484 
+Q 3125 2378 3379 2068 
+Q 3634 1759 3634 1313 
+Q 3634 634 3220 271 
+Q 2806 -91 2034 -91 
+Q 1263 -91 848 271 
+Q 434 634 434 1313 
+Q 434 1759 690 2068 
+Q 947 2378 1403 2484 
+z
+M 1172 3481 
+Q 1172 3119 1398 2916 
+Q 1625 2713 2034 2713 
+Q 2441 2713 2670 2916 
+Q 2900 3119 2900 3481 
+Q 2900 3844 2670 4047 
+Q 2441 4250 2034 4250 
+Q 1625 4250 1398 4047 
+Q 1172 3844 1172 3481 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-38"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_6">
+     <g id="line2d_11">
+      <path d="M 72 152.4 
+L 518.4 152.4 
+" clip-path="url(#p2c5cd7951c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_11">
+      <!-- 10 -->
+      <g style="fill: #262626" transform="translate(48.5025 156.579141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-31" d="M 794 531 
+L 1825 531 
+L 1825 4091 
+L 703 3866 
+L 703 4441 
+L 1819 4666 
+L 2450 4666 
+L 2450 531 
+L 3481 531 
+L 3481 0 
+L 794 0 
+L 794 531 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-30" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_7">
+     <g id="line2d_12">
+      <path d="M 72 118.8 
+L 518.4 118.8 
+" clip-path="url(#p2c5cd7951c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_12">
+      <!-- 12 -->
+      <g style="fill: #262626" transform="translate(48.5025 122.979141)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-32" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_8">
+     <g id="line2d_13">
+      <path d="M 72 85.2 
+L 518.4 85.2 
+" clip-path="url(#p2c5cd7951c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_13">
+      <!-- 14 -->
+      <g style="fill: #262626" transform="translate(48.5025 89.379141)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-34" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_9">
+     <g id="line2d_14">
+      <path d="M 72 51.6 
+L 518.4 51.6 
+" clip-path="url(#p2c5cd7951c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_14">
+      <!-- 16 -->
+      <g style="fill: #262626" transform="translate(48.5025 55.779141)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-36" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="text_15">
+     <!-- GFLOPS -->
+     <g style="fill: #262626" transform="translate(42.006875 205.175625)rotate(-90)scale(0.12 -0.12)">
+      <defs>
+       <path id="DejaVuSans-47" d="M 3809 666 
+L 3809 1919 
+L 2778 1919 
+L 2778 2438 
+L 4434 2438 
+L 4434 434 
+Q 4069 175 3628 42 
+Q 3188 -91 2688 -91 
+Q 1594 -91 976 548 
+Q 359 1188 359 2328 
+Q 359 3472 976 4111 
+Q 1594 4750 2688 4750 
+Q 3144 4750 3555 4637 
+Q 3966 4525 4313 4306 
+L 4313 3634 
+Q 3963 3931 3569 4081 
+Q 3175 4231 2741 4231 
+Q 1884 4231 1454 3753 
+Q 1025 3275 1025 2328 
+Q 1025 1384 1454 906 
+Q 1884 428 2741 428 
+Q 3075 428 3337 486 
+Q 3600 544 3809 666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-46" d="M 628 4666 
+L 3309 4666 
+L 3309 4134 
+L 1259 4134 
+L 1259 2759 
+L 3109 2759 
+L 3109 2228 
+L 1259 2228 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4c" d="M 628 4666 
+L 1259 4666 
+L 1259 531 
+L 3531 531 
+L 3531 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4f" d="M 2522 4238 
+Q 1834 4238 1429 3725 
+Q 1025 3213 1025 2328 
+Q 1025 1447 1429 934 
+Q 1834 422 2522 422 
+Q 3209 422 3611 934 
+Q 4013 1447 4013 2328 
+Q 4013 3213 3611 3725 
+Q 3209 4238 2522 4238 
+z
+M 2522 4750 
+Q 3503 4750 4090 4092 
+Q 4678 3434 4678 2328 
+Q 4678 1225 4090 567 
+Q 3503 -91 2522 -91 
+Q 1538 -91 948 565 
+Q 359 1222 359 2328 
+Q 359 3434 948 4092 
+Q 1538 4750 2522 4750 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-50" d="M 1259 4147 
+L 1259 2394 
+L 2053 2394 
+Q 2494 2394 2734 2622 
+Q 2975 2850 2975 3272 
+Q 2975 3691 2734 3919 
+Q 2494 4147 2053 4147 
+L 1259 4147 
+z
+M 628 4666 
+L 2053 4666 
+Q 2838 4666 3239 4311 
+Q 3641 3956 3641 3272 
+Q 3641 2581 3239 2228 
+Q 2838 1875 2053 1875 
+L 1259 1875 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-53" d="M 3425 4513 
+L 3425 3897 
+Q 3066 4069 2747 4153 
+Q 2428 4238 2131 4238 
+Q 1616 4238 1336 4038 
+Q 1056 3838 1056 3469 
+Q 1056 3159 1242 3001 
+Q 1428 2844 1947 2747 
+L 2328 2669 
+Q 3034 2534 3370 2195 
+Q 3706 1856 3706 1288 
+Q 3706 609 3251 259 
+Q 2797 -91 1919 -91 
+Q 1588 -91 1214 -16 
+Q 841 59 441 206 
+L 441 856 
+Q 825 641 1194 531 
+Q 1563 422 1919 422 
+Q 2459 422 2753 634 
+Q 3047 847 3047 1241 
+Q 3047 1584 2836 1778 
+Q 2625 1972 2144 2069 
+L 1759 2144 
+Q 1053 2284 737 2584 
+Q 422 2884 422 3419 
+Q 422 4038 858 4394 
+Q 1294 4750 2059 4750 
+Q 2388 4750 2728 4690 
+Q 3069 4631 3425 4513 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-47"/>
+      <use xlink:href="#DejaVuSans-46" x="77.490234"/>
+      <use xlink:href="#DejaVuSans-4c" x="135.009766"/>
+      <use xlink:href="#DejaVuSans-4f" x="187.097656"/>
+      <use xlink:href="#DejaVuSans-50" x="265.808594"/>
+      <use xlink:href="#DejaVuSans-53" x="326.111328"/>
+     </g>
+    </g>
+   </g>
+   <g id="patch_3">
+    <path d="M 92.290909 320.4 
+L 159.927273 320.4 
+L 159.927273 313.300939 
+L 92.290909 313.300939 
+z
+" clip-path="url(#p2c5cd7951c)" style="fill: #4c72b0; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_4">
+    <path d="M 176.836364 320.4 
+L 244.472727 320.4 
+L 244.472727 310.793018 
+L 176.836364 310.793018 
+z
+" clip-path="url(#p2c5cd7951c)" style="fill: #dd8452; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_5">
+    <path d="M 261.381818 320.4 
+L 329.018182 320.4 
+L 329.018182 282.254245 
+L 261.381818 282.254245 
+z
+" clip-path="url(#p2c5cd7951c)" style="fill: #55a868; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_6">
+    <path d="M 345.927273 320.4 
+L 413.563636 320.4 
+L 413.563636 267.37833 
+L 345.927273 267.37833 
+z
+" clip-path="url(#p2c5cd7951c)" style="fill: #c44e52; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_7">
+    <path d="M 430.472727 320.4 
+L 498.109091 320.4 
+L 498.109091 62.730564 
+L 430.472727 62.730564 
+z
+" clip-path="url(#p2c5cd7951c)" style="fill: #8172b3; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_8">
+    <path d="M 72 320.4 
+L 72 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_9">
+    <path d="M 518.4 320.4 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_10">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_11">
+    <path d="M 72 43.2 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="text_16">
+    <!-- 1.00x -->
+    <g style="fill: #262626" transform="translate(107.435966 308.305314)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-31" d="M 750 831 
+L 1813 831 
+L 1813 3847 
+L 722 3622 
+L 722 4441 
+L 1806 4666 
+L 2950 4666 
+L 2950 831 
+L 4013 831 
+L 4013 0 
+L 750 0 
+L 750 831 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-2e" d="M 653 1209 
+L 1778 1209 
+L 1778 0 
+L 653 0 
+L 653 1209 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-30" d="M 2944 2338 
+Q 2944 3213 2780 3570 
+Q 2616 3928 2228 3928 
+Q 1841 3928 1675 3570 
+Q 1509 3213 1509 2338 
+Q 1509 1453 1675 1090 
+Q 1841 728 2228 728 
+Q 2613 728 2778 1090 
+Q 2944 1453 2944 2338 
+z
+M 4147 2328 
+Q 4147 1169 3647 539 
+Q 3147 -91 2228 -91 
+Q 1306 -91 806 539 
+Q 306 1169 306 2328 
+Q 306 3491 806 4120 
+Q 1306 4750 2228 4750 
+Q 3147 4750 3647 4120 
+Q 4147 3491 4147 2328 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-78" d="M 1422 1791 
+L 159 3500 
+L 1344 3500 
+L 2059 2463 
+L 2784 3500 
+L 3969 3500 
+L 2706 1797 
+L 4031 0 
+L 2847 0 
+L 2059 1106 
+L 1281 0 
+L 97 0 
+L 1422 1791 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-31"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-30" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-30" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_17">
+    <!-- 1.35x -->
+    <g style="fill: #262626" transform="translate(191.98142 305.797393)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-33" d="M 2981 2516 
+Q 3453 2394 3698 2092 
+Q 3944 1791 3944 1325 
+Q 3944 631 3412 270 
+Q 2881 -91 1863 -91 
+Q 1503 -91 1142 -33 
+Q 781 25 428 141 
+L 428 1069 
+Q 766 900 1098 814 
+Q 1431 728 1753 728 
+Q 2231 728 2486 893 
+Q 2741 1059 2741 1369 
+Q 2741 1688 2480 1852 
+Q 2219 2016 1709 2016 
+L 1228 2016 
+L 1228 2791 
+L 1734 2791 
+Q 2188 2791 2409 2933 
+Q 2631 3075 2631 3366 
+Q 2631 3634 2415 3781 
+Q 2200 3928 1806 3928 
+Q 1516 3928 1219 3862 
+Q 922 3797 628 3669 
+L 628 4550 
+Q 984 4650 1334 4700 
+Q 1684 4750 2022 4750 
+Q 2931 4750 3382 4451 
+Q 3834 4153 3834 3553 
+Q 3834 3144 3618 2883 
+Q 3403 2622 2981 2516 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-35" d="M 678 4666 
+L 3669 4666 
+L 3669 3781 
+L 1638 3781 
+L 1638 3059 
+Q 1775 3097 1914 3117 
+Q 2053 3138 2203 3138 
+Q 3056 3138 3531 2711 
+Q 4006 2284 4006 1522 
+Q 4006 766 3489 337 
+Q 2972 -91 2053 -91 
+Q 1656 -91 1267 -14 
+Q 878 63 494 219 
+L 494 1166 
+Q 875 947 1217 837 
+Q 1559 728 1863 728 
+Q 2300 728 2551 942 
+Q 2803 1156 2803 1522 
+Q 2803 1891 2551 2103 
+Q 2300 2316 1863 2316 
+Q 1603 2316 1309 2248 
+Q 1016 2181 678 2041 
+L 678 4666 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-31"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-35" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_18">
+    <!-- 5.37x -->
+    <g style="fill: #262626" transform="translate(276.526875 277.25862)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-37" d="M 428 4666 
+L 3944 4666 
+L 3944 3988 
+L 2125 0 
+L 953 0 
+L 2675 3781 
+L 428 3781 
+L 428 4666 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-35"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-37" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_19">
+    <!-- 7.47x -->
+    <g style="fill: #262626" transform="translate(361.07233 262.382705)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-34" d="M 2356 3675 
+L 1038 1722 
+L 2356 1722 
+L 2356 3675 
+z
+M 2156 4666 
+L 3494 4666 
+L 3494 1722 
+L 4159 1722 
+L 4159 850 
+L 3494 850 
+L 3494 0 
+L 2356 0 
+L 2356 850 
+L 288 850 
+L 288 1881 
+L 2156 4666 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-37"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-34" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-37" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_20">
+    <!-- 36.30x -->
+    <g style="fill: #262626" transform="translate(441.443097 57.734939)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-36" d="M 2316 2303 
+Q 2000 2303 1842 2098 
+Q 1684 1894 1684 1484 
+Q 1684 1075 1842 870 
+Q 2000 666 2316 666 
+Q 2634 666 2792 870 
+Q 2950 1075 2950 1484 
+Q 2950 1894 2792 2098 
+Q 2634 2303 2316 2303 
+z
+M 3803 4544 
+L 3803 3681 
+Q 3506 3822 3243 3889 
+Q 2981 3956 2731 3956 
+Q 2194 3956 1894 3657 
+Q 1594 3359 1544 2772 
+Q 1750 2925 1990 3001 
+Q 2231 3078 2516 3078 
+Q 3231 3078 3670 2659 
+Q 4109 2241 4109 1563 
+Q 4109 813 3618 361 
+Q 3128 -91 2303 -91 
+Q 1394 -91 895 523 
+Q 397 1138 397 2266 
+Q 397 3422 980 4083 
+Q 1563 4744 2578 4744 
+Q 2900 4744 3203 4694 
+Q 3506 4644 3803 4544 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-33"/>
+     <use xlink:href="#DejaVuSans-Bold-36" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="139.160156"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-30" x="246.728516"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="316.308594"/>
+    </g>
+   </g>
+   <g id="text_21">
+    <!-- Matrix multiplication ($n=1920$) -->
+    <g style="fill: #262626" transform="translate(184.74 23.2)scale(0.14 -0.14)">
+     <defs>
+      <path id="DejaVuSans-4d" d="M 628 4666 
+L 1569 4666 
+L 2759 1491 
+L 3956 4666 
+L 4897 4666 
+L 4897 0 
+L 4281 0 
+L 4281 4097 
+L 3078 897 
+L 2444 897 
+L 1241 4097 
+L 1241 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-78" d="M 3513 3500 
+L 2247 1797 
+L 3578 0 
+L 2900 0 
+L 1881 1375 
+L 863 0 
+L 184 0 
+L 1544 1831 
+L 300 3500 
+L 978 3500 
+L 1906 2253 
+L 2834 3500 
+L 3513 3500 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-20" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6d" d="M 3328 2828 
+Q 3544 3216 3844 3400 
+Q 4144 3584 4550 3584 
+Q 5097 3584 5394 3201 
+Q 5691 2819 5691 2113 
+L 5691 0 
+L 5113 0 
+L 5113 2094 
+Q 5113 2597 4934 2840 
+Q 4756 3084 4391 3084 
+Q 3944 3084 3684 2787 
+Q 3425 2491 3425 1978 
+L 3425 0 
+L 2847 0 
+L 2847 2094 
+Q 2847 2600 2669 2842 
+Q 2491 3084 2119 3084 
+Q 1678 3084 1418 2786 
+Q 1159 2488 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1356 3278 1631 3431 
+Q 1906 3584 2284 3584 
+Q 2666 3584 2933 3390 
+Q 3200 3197 3328 2828 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-75" d="M 544 1381 
+L 544 3500 
+L 1119 3500 
+L 1119 1403 
+Q 1119 906 1312 657 
+Q 1506 409 1894 409 
+Q 2359 409 2629 706 
+Q 2900 1003 2900 1516 
+L 2900 3500 
+L 3475 3500 
+L 3475 0 
+L 2900 0 
+L 2900 538 
+Q 2691 219 2414 64 
+Q 2138 -91 1772 -91 
+Q 1169 -91 856 284 
+Q 544 659 544 1381 
+z
+M 1991 3584 
+L 1991 3584 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-28" d="M 1984 4856 
+Q 1566 4138 1362 3434 
+Q 1159 2731 1159 2009 
+Q 1159 1288 1364 580 
+Q 1569 -128 1984 -844 
+L 1484 -844 
+Q 1016 -109 783 600 
+Q 550 1309 550 2009 
+Q 550 2706 781 3412 
+Q 1013 4119 1484 4856 
+L 1984 4856 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Oblique-6e" d="M 3566 2113 
+L 3156 0 
+L 2578 0 
+L 2988 2091 
+Q 3016 2238 3031 2350 
+Q 3047 2463 3047 2528 
+Q 3047 2791 2881 2937 
+Q 2716 3084 2419 3084 
+Q 1956 3084 1622 2776 
+Q 1288 2469 1184 1941 
+L 800 0 
+L 225 0 
+L 903 3500 
+L 1478 3500 
+L 1363 2950 
+Q 1603 3253 1940 3418 
+Q 2278 3584 2650 3584 
+Q 3113 3584 3367 3334 
+Q 3622 3084 3622 2631 
+Q 3622 2519 3608 2391 
+Q 3594 2263 3566 2113 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-3d" d="M 678 2906 
+L 4684 2906 
+L 4684 2381 
+L 678 2381 
+L 678 2906 
+z
+M 678 1631 
+L 4684 1631 
+L 4684 1100 
+L 678 1100 
+L 678 1631 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-39" d="M 703 97 
+L 703 672 
+Q 941 559 1184 500 
+Q 1428 441 1663 441 
+Q 2288 441 2617 861 
+Q 2947 1281 2994 2138 
+Q 2813 1869 2534 1725 
+Q 2256 1581 1919 1581 
+Q 1219 1581 811 2004 
+Q 403 2428 403 3163 
+Q 403 3881 828 4315 
+Q 1253 4750 1959 4750 
+Q 2769 4750 3195 4129 
+Q 3622 3509 3622 2328 
+Q 3622 1225 3098 567 
+Q 2575 -91 1691 -91 
+Q 1453 -91 1209 -44 
+Q 966 3 703 97 
+z
+M 1959 2075 
+Q 2384 2075 2632 2365 
+Q 2881 2656 2881 3163 
+Q 2881 3666 2632 3958 
+Q 2384 4250 1959 4250 
+Q 1534 4250 1286 3958 
+Q 1038 3666 1038 3163 
+Q 1038 2656 1286 2365 
+Q 1534 2075 1959 2075 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-29" d="M 513 4856 
+L 1013 4856 
+Q 1481 4119 1714 3412 
+Q 1947 2706 1947 2009 
+Q 1947 1309 1714 600 
+Q 1481 -109 1013 -844 
+L 513 -844 
+Q 928 -128 1133 580 
+Q 1338 1288 1338 2009 
+Q 1338 2731 1133 3434 
+Q 928 4138 513 4856 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-4d" transform="translate(0 0.015625)"/>
+     <use xlink:href="#DejaVuSans-61" transform="translate(86.279297 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(147.558594 0.015625)"/>
+     <use xlink:href="#DejaVuSans-72" transform="translate(186.767578 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(227.880859 0.015625)"/>
+     <use xlink:href="#DejaVuSans-78" transform="translate(255.664062 0.015625)"/>
+     <use xlink:href="#DejaVuSans-20" transform="translate(314.84375 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6d" transform="translate(346.630859 0.015625)"/>
+     <use xlink:href="#DejaVuSans-75" transform="translate(444.042969 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6c" transform="translate(507.421875 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(535.205078 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(574.414062 0.015625)"/>
+     <use xlink:href="#DejaVuSans-70" transform="translate(602.197266 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6c" transform="translate(665.673828 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(693.457031 0.015625)"/>
+     <use xlink:href="#DejaVuSans-63" transform="translate(721.240234 0.015625)"/>
+     <use xlink:href="#DejaVuSans-61" transform="translate(776.220703 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(837.5 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(876.708984 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6f" transform="translate(904.492188 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6e" transform="translate(965.673828 0.015625)"/>
+     <use xlink:href="#DejaVuSans-20" transform="translate(1029.052734 0.015625)"/>
+     <use xlink:href="#DejaVuSans-28" transform="translate(1060.839844 0.015625)"/>
+     <use xlink:href="#DejaVuSans-Oblique-6e" transform="translate(1099.853516 0.015625)"/>
+     <use xlink:href="#DejaVuSans-3d" transform="translate(1182.714844 0.015625)"/>
+     <use xlink:href="#DejaVuSans-31" transform="translate(1285.986328 0.015625)"/>
+     <use xlink:href="#DejaVuSans-39" transform="translate(1349.609375 0.015625)"/>
+     <use xlink:href="#DejaVuSans-32" transform="translate(1411.482422 0.015625)"/>
+     <use xlink:href="#DejaVuSans-30" transform="translate(1475.105469 0.015625)"/>
+     <use xlink:href="#DejaVuSans-29" transform="translate(1538.728516 0.015625)"/>
+    </g>
+   </g>
+  </g>
+ </g>
+ <defs>
+  <clipPath id="p2c5cd7951c">
+   <rect x="72" y="43.2" width="446.4" height="277.2"/>
+  </clipPath>
+ </defs>
+</svg>
diff --git a/content/english/hpc/algorithms/img/mm-blocked-plot.svg b/content/english/hpc/algorithms/img/mm-blocked-plot.svg
new file mode 100644
index 00000000..87dda835
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-blocked-plot.svg
@@ -0,0 +1,1474 @@
+<?xml version="1.0" encoding="utf-8" standalone="no"?>
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
+  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns:xlink="http://www.w3.org/1999/xlink" width="576pt" height="360pt" viewBox="0 0 576 360" xmlns="http://www.w3.org/2000/svg" version="1.1">
+ <metadata>
+  <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
+   <cc:Work>
+    <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/>
+    <dc:date>2022-04-05T01:18:54.049300</dc:date>
+    <dc:format>image/svg+xml</dc:format>
+    <dc:creator>
+     <cc:Agent>
+      <dc:title>Matplotlib v3.5.1, https://matplotlib.org/</dc:title>
+     </cc:Agent>
+    </dc:creator>
+   </cc:Work>
+  </rdf:RDF>
+ </metadata>
+ <defs>
+  <style type="text/css">*{stroke-linejoin: round; stroke-linecap: butt}</style>
+ </defs>
+ <g id="figure_1">
+  <g id="patch_1">
+   <path d="M 0 360 
+L 576 360 
+L 576 0 
+L 0 0 
+z
+" style="fill: #ffffff"/>
+  </g>
+  <g id="axes_1">
+   <g id="patch_2">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+L 518.4 43.2 
+L 72 43.2 
+z
+" style="fill: #ffffff"/>
+   </g>
+   <g id="matplotlib.axis_1">
+    <g id="xtick_1">
+     <g id="line2d_1">
+      <path d="M 117.784615 320.4 
+L 117.784615 43.2 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_1">
+      <!-- 240 -->
+      <g style="fill: #262626" transform="translate(107.28649 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-32" d="M 1228 531 
+L 3431 531 
+L 3431 0 
+L 469 0 
+L 469 531 
+Q 828 903 1448 1529 
+Q 2069 2156 2228 2338 
+Q 2531 2678 2651 2914 
+Q 2772 3150 2772 3378 
+Q 2772 3750 2511 3984 
+Q 2250 4219 1831 4219 
+Q 1534 4219 1204 4116 
+Q 875 4013 500 3803 
+L 500 4441 
+Q 881 4594 1212 4672 
+Q 1544 4750 1819 4750 
+Q 2544 4750 2975 4387 
+Q 3406 4025 3406 3419 
+Q 3406 3131 3298 2873 
+Q 3191 2616 2906 2266 
+Q 2828 2175 2409 1742 
+Q 1991 1309 1228 531 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-34" d="M 2419 4116 
+L 825 1625 
+L 2419 1625 
+L 2419 4116 
+z
+M 2253 4666 
+L 3047 4666 
+L 3047 1625 
+L 3713 1625 
+L 3713 1100 
+L 3047 1100 
+L 3047 0 
+L 2419 0 
+L 2419 1100 
+L 313 1100 
+L 313 1709 
+L 2253 4666 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-30" d="M 2034 4250 
+Q 1547 4250 1301 3770 
+Q 1056 3291 1056 2328 
+Q 1056 1369 1301 889 
+Q 1547 409 2034 409 
+Q 2525 409 2770 889 
+Q 3016 1369 3016 2328 
+Q 3016 3291 2770 3770 
+Q 2525 4250 2034 4250 
+z
+M 2034 4750 
+Q 2819 4750 3233 4129 
+Q 3647 3509 3647 2328 
+Q 3647 1150 3233 529 
+Q 2819 -91 2034 -91 
+Q 1250 -91 836 529 
+Q 422 1150 422 2328 
+Q 422 3509 836 4129 
+Q 1250 4750 2034 4750 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-32"/>
+       <use xlink:href="#DejaVuSans-34" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_2">
+     <g id="line2d_2">
+      <path d="M 175.015385 320.4 
+L 175.015385 43.2 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_2">
+      <!-- 480 -->
+      <g style="fill: #262626" transform="translate(164.51726 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-38" d="M 2034 2216 
+Q 1584 2216 1326 1975 
+Q 1069 1734 1069 1313 
+Q 1069 891 1326 650 
+Q 1584 409 2034 409 
+Q 2484 409 2743 651 
+Q 3003 894 3003 1313 
+Q 3003 1734 2745 1975 
+Q 2488 2216 2034 2216 
+z
+M 1403 2484 
+Q 997 2584 770 2862 
+Q 544 3141 544 3541 
+Q 544 4100 942 4425 
+Q 1341 4750 2034 4750 
+Q 2731 4750 3128 4425 
+Q 3525 4100 3525 3541 
+Q 3525 3141 3298 2862 
+Q 3072 2584 2669 2484 
+Q 3125 2378 3379 2068 
+Q 3634 1759 3634 1313 
+Q 3634 634 3220 271 
+Q 2806 -91 2034 -91 
+Q 1263 -91 848 271 
+Q 434 634 434 1313 
+Q 434 1759 690 2068 
+Q 947 2378 1403 2484 
+z
+M 1172 3481 
+Q 1172 3119 1398 2916 
+Q 1625 2713 2034 2713 
+Q 2441 2713 2670 2916 
+Q 2900 3119 2900 3481 
+Q 2900 3844 2670 4047 
+Q 2441 4250 2034 4250 
+Q 1625 4250 1398 4047 
+Q 1172 3844 1172 3481 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-34"/>
+       <use xlink:href="#DejaVuSans-38" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_3">
+     <g id="line2d_3">
+      <path d="M 232.246154 320.4 
+L 232.246154 43.2 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_3">
+      <!-- 720 -->
+      <g style="fill: #262626" transform="translate(221.748029 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-37" d="M 525 4666 
+L 3525 4666 
+L 3525 4397 
+L 1831 0 
+L 1172 0 
+L 2766 4134 
+L 525 4134 
+L 525 4666 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-37"/>
+       <use xlink:href="#DejaVuSans-32" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_4">
+     <g id="line2d_4">
+      <path d="M 289.476923 320.4 
+L 289.476923 43.2 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_4">
+      <!-- 960 -->
+      <g style="fill: #262626" transform="translate(278.978798 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-39" d="M 703 97 
+L 703 672 
+Q 941 559 1184 500 
+Q 1428 441 1663 441 
+Q 2288 441 2617 861 
+Q 2947 1281 2994 2138 
+Q 2813 1869 2534 1725 
+Q 2256 1581 1919 1581 
+Q 1219 1581 811 2004 
+Q 403 2428 403 3163 
+Q 403 3881 828 4315 
+Q 1253 4750 1959 4750 
+Q 2769 4750 3195 4129 
+Q 3622 3509 3622 2328 
+Q 3622 1225 3098 567 
+Q 2575 -91 1691 -91 
+Q 1453 -91 1209 -44 
+Q 966 3 703 97 
+z
+M 1959 2075 
+Q 2384 2075 2632 2365 
+Q 2881 2656 2881 3163 
+Q 2881 3666 2632 3958 
+Q 2384 4250 1959 4250 
+Q 1534 4250 1286 3958 
+Q 1038 3666 1038 3163 
+Q 1038 2656 1286 2365 
+Q 1534 2075 1959 2075 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-36" d="M 2113 2584 
+Q 1688 2584 1439 2293 
+Q 1191 2003 1191 1497 
+Q 1191 994 1439 701 
+Q 1688 409 2113 409 
+Q 2538 409 2786 701 
+Q 3034 994 3034 1497 
+Q 3034 2003 2786 2293 
+Q 2538 2584 2113 2584 
+z
+M 3366 4563 
+L 3366 3988 
+Q 3128 4100 2886 4159 
+Q 2644 4219 2406 4219 
+Q 1781 4219 1451 3797 
+Q 1122 3375 1075 2522 
+Q 1259 2794 1537 2939 
+Q 1816 3084 2150 3084 
+Q 2853 3084 3261 2657 
+Q 3669 2231 3669 1497 
+Q 3669 778 3244 343 
+Q 2819 -91 2113 -91 
+Q 1303 -91 875 529 
+Q 447 1150 447 2328 
+Q 447 3434 972 4092 
+Q 1497 4750 2381 4750 
+Q 2619 4750 2861 4703 
+Q 3103 4656 3366 4563 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-39"/>
+       <use xlink:href="#DejaVuSans-36" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_5">
+     <g id="line2d_5">
+      <path d="M 346.707692 320.4 
+L 346.707692 43.2 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_5">
+      <!-- 1200 -->
+      <g style="fill: #262626" transform="translate(332.710192 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-31" d="M 794 531 
+L 1825 531 
+L 1825 4091 
+L 703 3866 
+L 703 4441 
+L 1819 4666 
+L 2450 4666 
+L 2450 531 
+L 3481 531 
+L 3481 0 
+L 794 0 
+L 794 531 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-32" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+       <use xlink:href="#DejaVuSans-30" x="190.869141"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_6">
+     <g id="line2d_6">
+      <path d="M 403.938462 320.4 
+L 403.938462 43.2 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_6">
+      <!-- 1440 -->
+      <g style="fill: #262626" transform="translate(389.940962 338.258281)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-34" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-34" x="127.246094"/>
+       <use xlink:href="#DejaVuSans-30" x="190.869141"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_7">
+     <g id="line2d_7">
+      <path d="M 461.169231 320.4 
+L 461.169231 43.2 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_7">
+      <!-- 1680 -->
+      <g style="fill: #262626" transform="translate(447.171731 338.258281)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-36" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-38" x="127.246094"/>
+       <use xlink:href="#DejaVuSans-30" x="190.869141"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_8">
+     <g id="line2d_8">
+      <path d="M 518.4 320.4 
+L 518.4 43.2 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_8">
+      <!-- 1920 -->
+      <g style="fill: #262626" transform="translate(504.4025 338.258281)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-39" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-32" x="127.246094"/>
+       <use xlink:href="#DejaVuSans-30" x="190.869141"/>
+      </g>
+     </g>
+    </g>
+    <g id="text_9">
+     <!-- Matrix size ($n \times n$) -->
+     <g style="fill: #262626" transform="translate(241.2 353.664062)scale(0.12 -0.12)">
+      <defs>
+       <path id="DejaVuSans-4d" d="M 628 4666 
+L 1569 4666 
+L 2759 1491 
+L 3956 4666 
+L 4897 4666 
+L 4897 0 
+L 4281 0 
+L 4281 4097 
+L 3078 897 
+L 2444 897 
+L 1241 4097 
+L 1241 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-61" d="M 2194 1759 
+Q 1497 1759 1228 1600 
+Q 959 1441 959 1056 
+Q 959 750 1161 570 
+Q 1363 391 1709 391 
+Q 2188 391 2477 730 
+Q 2766 1069 2766 1631 
+L 2766 1759 
+L 2194 1759 
+z
+M 3341 1997 
+L 3341 0 
+L 2766 0 
+L 2766 531 
+Q 2569 213 2275 61 
+Q 1981 -91 1556 -91 
+Q 1019 -91 701 211 
+Q 384 513 384 1019 
+Q 384 1609 779 1909 
+Q 1175 2209 1959 2209 
+L 2766 2209 
+L 2766 2266 
+Q 2766 2663 2505 2880 
+Q 2244 3097 1772 3097 
+Q 1472 3097 1187 3025 
+Q 903 2953 641 2809 
+L 641 3341 
+Q 956 3463 1253 3523 
+Q 1550 3584 1831 3584 
+Q 2591 3584 2966 3190 
+Q 3341 2797 3341 1997 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-74" d="M 1172 4494 
+L 1172 3500 
+L 2356 3500 
+L 2356 3053 
+L 1172 3053 
+L 1172 1153 
+Q 1172 725 1289 603 
+Q 1406 481 1766 481 
+L 2356 481 
+L 2356 0 
+L 1766 0 
+Q 1100 0 847 248 
+Q 594 497 594 1153 
+L 594 3053 
+L 172 3053 
+L 172 3500 
+L 594 3500 
+L 594 4494 
+L 1172 4494 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-72" d="M 2631 2963 
+Q 2534 3019 2420 3045 
+Q 2306 3072 2169 3072 
+Q 1681 3072 1420 2755 
+Q 1159 2438 1159 1844 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1341 3275 1631 3429 
+Q 1922 3584 2338 3584 
+Q 2397 3584 2469 3576 
+Q 2541 3569 2628 3553 
+L 2631 2963 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-69" d="M 603 3500 
+L 1178 3500 
+L 1178 0 
+L 603 0 
+L 603 3500 
+z
+M 603 4863 
+L 1178 4863 
+L 1178 4134 
+L 603 4134 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-78" d="M 3513 3500 
+L 2247 1797 
+L 3578 0 
+L 2900 0 
+L 1881 1375 
+L 863 0 
+L 184 0 
+L 1544 1831 
+L 300 3500 
+L 978 3500 
+L 1906 2253 
+L 2834 3500 
+L 3513 3500 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-20" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-73" d="M 2834 3397 
+L 2834 2853 
+Q 2591 2978 2328 3040 
+Q 2066 3103 1784 3103 
+Q 1356 3103 1142 2972 
+Q 928 2841 928 2578 
+Q 928 2378 1081 2264 
+Q 1234 2150 1697 2047 
+L 1894 2003 
+Q 2506 1872 2764 1633 
+Q 3022 1394 3022 966 
+Q 3022 478 2636 193 
+Q 2250 -91 1575 -91 
+Q 1294 -91 989 -36 
+Q 684 19 347 128 
+L 347 722 
+Q 666 556 975 473 
+Q 1284 391 1588 391 
+Q 1994 391 2212 530 
+Q 2431 669 2431 922 
+Q 2431 1156 2273 1281 
+Q 2116 1406 1581 1522 
+L 1381 1569 
+Q 847 1681 609 1914 
+Q 372 2147 372 2553 
+Q 372 3047 722 3315 
+Q 1072 3584 1716 3584 
+Q 2034 3584 2315 3537 
+Q 2597 3491 2834 3397 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-7a" d="M 353 3500 
+L 3084 3500 
+L 3084 2975 
+L 922 459 
+L 3084 459 
+L 3084 0 
+L 275 0 
+L 275 525 
+L 2438 3041 
+L 353 3041 
+L 353 3500 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-65" d="M 3597 1894 
+L 3597 1613 
+L 953 1613 
+Q 991 1019 1311 708 
+Q 1631 397 2203 397 
+Q 2534 397 2845 478 
+Q 3156 559 3463 722 
+L 3463 178 
+Q 3153 47 2828 -22 
+Q 2503 -91 2169 -91 
+Q 1331 -91 842 396 
+Q 353 884 353 1716 
+Q 353 2575 817 3079 
+Q 1281 3584 2069 3584 
+Q 2775 3584 3186 3129 
+Q 3597 2675 3597 1894 
+z
+M 3022 2063 
+Q 3016 2534 2758 2815 
+Q 2500 3097 2075 3097 
+Q 1594 3097 1305 2825 
+Q 1016 2553 972 2059 
+L 3022 2063 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-28" d="M 1984 4856 
+Q 1566 4138 1362 3434 
+Q 1159 2731 1159 2009 
+Q 1159 1288 1364 580 
+Q 1569 -128 1984 -844 
+L 1484 -844 
+Q 1016 -109 783 600 
+Q 550 1309 550 2009 
+Q 550 2706 781 3412 
+Q 1013 4119 1484 4856 
+L 1984 4856 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-Oblique-6e" d="M 3566 2113 
+L 3156 0 
+L 2578 0 
+L 2988 2091 
+Q 3016 2238 3031 2350 
+Q 3047 2463 3047 2528 
+Q 3047 2791 2881 2937 
+Q 2716 3084 2419 3084 
+Q 1956 3084 1622 2776 
+Q 1288 2469 1184 1941 
+L 800 0 
+L 225 0 
+L 903 3500 
+L 1478 3500 
+L 1363 2950 
+Q 1603 3253 1940 3418 
+Q 2278 3584 2650 3584 
+Q 3113 3584 3367 3334 
+Q 3622 3084 3622 2631 
+Q 3622 2519 3608 2391 
+Q 3594 2263 3566 2113 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-d7" d="M 4488 3438 
+L 3059 2003 
+L 4488 575 
+L 4116 197 
+L 2681 1631 
+L 1247 197 
+L 878 575 
+L 2303 2003 
+L 878 3438 
+L 1247 3816 
+L 2681 2381 
+L 4116 3816 
+L 4488 3438 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-29" d="M 513 4856 
+L 1013 4856 
+Q 1481 4119 1714 3412 
+Q 1947 2706 1947 2009 
+Q 1947 1309 1714 600 
+Q 1481 -109 1013 -844 
+L 513 -844 
+Q 928 -128 1133 580 
+Q 1338 1288 1338 2009 
+Q 1338 2731 1133 3434 
+Q 928 4138 513 4856 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-4d" transform="translate(0 0.015625)"/>
+      <use xlink:href="#DejaVuSans-61" transform="translate(86.279297 0.015625)"/>
+      <use xlink:href="#DejaVuSans-74" transform="translate(147.558594 0.015625)"/>
+      <use xlink:href="#DejaVuSans-72" transform="translate(186.767578 0.015625)"/>
+      <use xlink:href="#DejaVuSans-69" transform="translate(227.880859 0.015625)"/>
+      <use xlink:href="#DejaVuSans-78" transform="translate(255.664062 0.015625)"/>
+      <use xlink:href="#DejaVuSans-20" transform="translate(314.84375 0.015625)"/>
+      <use xlink:href="#DejaVuSans-73" transform="translate(346.630859 0.015625)"/>
+      <use xlink:href="#DejaVuSans-69" transform="translate(398.730469 0.015625)"/>
+      <use xlink:href="#DejaVuSans-7a" transform="translate(426.513672 0.015625)"/>
+      <use xlink:href="#DejaVuSans-65" transform="translate(479.003906 0.015625)"/>
+      <use xlink:href="#DejaVuSans-20" transform="translate(540.527344 0.015625)"/>
+      <use xlink:href="#DejaVuSans-28" transform="translate(572.314453 0.015625)"/>
+      <use xlink:href="#DejaVuSans-Oblique-6e" transform="translate(611.328125 0.015625)"/>
+      <use xlink:href="#DejaVuSans-d7" transform="translate(694.189453 0.015625)"/>
+      <use xlink:href="#DejaVuSans-Oblique-6e" transform="translate(797.460938 0.015625)"/>
+      <use xlink:href="#DejaVuSans-29" transform="translate(860.839844 0.015625)"/>
+     </g>
+    </g>
+   </g>
+   <g id="matplotlib.axis_2">
+    <g id="ytick_1">
+     <g id="line2d_9">
+      <path d="M 72 320.4 
+L 518.4 320.4 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_10">
+      <!-- 0 -->
+      <g style="fill: #262626" transform="translate(55.50125 324.579141)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-30"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_2">
+     <g id="line2d_10">
+      <path d="M 72 262.193219 
+L 518.4 262.193219 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_11">
+      <!-- 5 -->
+      <g style="fill: #262626" transform="translate(55.50125 266.37236)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-35" d="M 691 4666 
+L 3169 4666 
+L 3169 4134 
+L 1269 4134 
+L 1269 2991 
+Q 1406 3038 1543 3061 
+Q 1681 3084 1819 3084 
+Q 2600 3084 3056 2656 
+Q 3513 2228 3513 1497 
+Q 3513 744 3044 326 
+Q 2575 -91 1722 -91 
+Q 1428 -91 1123 -41 
+Q 819 9 494 109 
+L 494 744 
+Q 775 591 1075 516 
+Q 1375 441 1709 441 
+Q 2250 441 2565 725 
+Q 2881 1009 2881 1497 
+Q 2881 1984 2565 2268 
+Q 2250 2553 1709 2553 
+Q 1456 2553 1204 2497 
+Q 953 2441 691 2322 
+L 691 4666 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-35"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_3">
+     <g id="line2d_11">
+      <path d="M 72 203.986439 
+L 518.4 203.986439 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_12">
+      <!-- 10 -->
+      <g style="fill: #262626" transform="translate(48.5025 208.165579)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-30" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_4">
+     <g id="line2d_12">
+      <path d="M 72 145.779658 
+L 518.4 145.779658 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_13">
+      <!-- 15 -->
+      <g style="fill: #262626" transform="translate(48.5025 149.958799)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-35" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_5">
+     <g id="line2d_13">
+      <path d="M 72 87.572878 
+L 518.4 87.572878 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_14">
+      <!-- 20 -->
+      <g style="fill: #262626" transform="translate(48.5025 91.752018)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-32"/>
+       <use xlink:href="#DejaVuSans-30" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="text_15">
+     <!-- GFLOPS -->
+     <g style="fill: #262626" transform="translate(42.006875 205.175625)rotate(-90)scale(0.12 -0.12)">
+      <defs>
+       <path id="DejaVuSans-47" d="M 3809 666 
+L 3809 1919 
+L 2778 1919 
+L 2778 2438 
+L 4434 2438 
+L 4434 434 
+Q 4069 175 3628 42 
+Q 3188 -91 2688 -91 
+Q 1594 -91 976 548 
+Q 359 1188 359 2328 
+Q 359 3472 976 4111 
+Q 1594 4750 2688 4750 
+Q 3144 4750 3555 4637 
+Q 3966 4525 4313 4306 
+L 4313 3634 
+Q 3963 3931 3569 4081 
+Q 3175 4231 2741 4231 
+Q 1884 4231 1454 3753 
+Q 1025 3275 1025 2328 
+Q 1025 1384 1454 906 
+Q 1884 428 2741 428 
+Q 3075 428 3337 486 
+Q 3600 544 3809 666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-46" d="M 628 4666 
+L 3309 4666 
+L 3309 4134 
+L 1259 4134 
+L 1259 2759 
+L 3109 2759 
+L 3109 2228 
+L 1259 2228 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4c" d="M 628 4666 
+L 1259 4666 
+L 1259 531 
+L 3531 531 
+L 3531 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4f" d="M 2522 4238 
+Q 1834 4238 1429 3725 
+Q 1025 3213 1025 2328 
+Q 1025 1447 1429 934 
+Q 1834 422 2522 422 
+Q 3209 422 3611 934 
+Q 4013 1447 4013 2328 
+Q 4013 3213 3611 3725 
+Q 3209 4238 2522 4238 
+z
+M 2522 4750 
+Q 3503 4750 4090 4092 
+Q 4678 3434 4678 2328 
+Q 4678 1225 4090 567 
+Q 3503 -91 2522 -91 
+Q 1538 -91 948 565 
+Q 359 1222 359 2328 
+Q 359 3434 948 4092 
+Q 1538 4750 2522 4750 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-50" d="M 1259 4147 
+L 1259 2394 
+L 2053 2394 
+Q 2494 2394 2734 2622 
+Q 2975 2850 2975 3272 
+Q 2975 3691 2734 3919 
+Q 2494 4147 2053 4147 
+L 1259 4147 
+z
+M 628 4666 
+L 2053 4666 
+Q 2838 4666 3239 4311 
+Q 3641 3956 3641 3272 
+Q 3641 2581 3239 2228 
+Q 2838 1875 2053 1875 
+L 1259 1875 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-53" d="M 3425 4513 
+L 3425 3897 
+Q 3066 4069 2747 4153 
+Q 2428 4238 2131 4238 
+Q 1616 4238 1336 4038 
+Q 1056 3838 1056 3469 
+Q 1056 3159 1242 3001 
+Q 1428 2844 1947 2747 
+L 2328 2669 
+Q 3034 2534 3370 2195 
+Q 3706 1856 3706 1288 
+Q 3706 609 3251 259 
+Q 2797 -91 1919 -91 
+Q 1588 -91 1214 -16 
+Q 841 59 441 206 
+L 441 856 
+Q 825 641 1194 531 
+Q 1563 422 1919 422 
+Q 2459 422 2753 634 
+Q 3047 847 3047 1241 
+Q 3047 1584 2836 1778 
+Q 2625 1972 2144 2069 
+L 1759 2144 
+Q 1053 2284 737 2584 
+Q 422 2884 422 3419 
+Q 422 4038 858 4394 
+Q 1294 4750 2059 4750 
+Q 2388 4750 2728 4690 
+Q 3069 4631 3425 4513 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-47"/>
+      <use xlink:href="#DejaVuSans-46" x="77.490234"/>
+      <use xlink:href="#DejaVuSans-4c" x="135.009766"/>
+      <use xlink:href="#DejaVuSans-4f" x="187.097656"/>
+      <use xlink:href="#DejaVuSans-50" x="265.808594"/>
+      <use xlink:href="#DejaVuSans-53" x="326.111328"/>
+     </g>
+    </g>
+   </g>
+   <g id="line2d_14">
+    <path d="M 72 309.671326 
+L 83.446154 311.498075 
+L 94.892308 312.017819 
+L 106.338462 312.236301 
+L 117.784615 312.38036 
+L 129.230769 312.50921 
+L 140.676923 312.532968 
+L 152.123077 312.600467 
+L 163.569231 312.647151 
+L 175.015385 312.627851 
+L 186.461538 312.67057 
+L 197.907692 312.701045 
+L 209.353846 312.717554 
+L 220.8 312.708816 
+L 232.246154 312.710882 
+L 243.692308 315.407713 
+L 255.138462 312.728222 
+L 266.584615 312.801696 
+L 278.030769 313.848938 
+L 289.476923 313.911222 
+L 300.923077 313.705591 
+L 312.369231 313.756657 
+L 323.815385 313.596682 
+L 335.261538 315.165161 
+L 346.707692 313.8018 
+L 358.153846 313.675801 
+L 369.6 313.837237 
+L 381.046154 313.44832 
+L 392.492308 313.911027 
+L 403.938462 313.481161 
+L 415.384615 313.836402 
+L 426.830769 318.50093 
+L 438.276923 313.158347 
+L 449.723077 314.031662 
+L 461.169231 313.386046 
+L 472.615385 313.784265 
+L 484.061538 313.695596 
+L 495.507692 313.710558 
+L 506.953846 313.419903 
+L 518.4 315.480792 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #4c72b0; stroke-width: 1.5; stroke-linecap: round"/>
+   </g>
+   <g id="line2d_15">
+    <path d="M 72 310.100473 
+L 83.446154 311.723061 
+L 94.892308 312.164771 
+L 106.338462 312.369963 
+L 117.784615 312.46459 
+L 129.230769 312.566557 
+L 140.676923 312.627815 
+L 152.123077 312.735949 
+L 163.569231 312.765275 
+L 175.015385 312.823565 
+L 186.461538 312.89925 
+L 197.907692 312.923884 
+L 209.353846 312.952443 
+L 220.8 313.044405 
+L 232.246154 313.118095 
+L 243.692308 313.154137 
+L 255.138462 313.397423 
+L 266.584615 313.377379 
+L 278.030769 313.286294 
+L 289.476923 313.310017 
+L 300.923077 313.404439 
+L 312.369231 313.41027 
+L 323.815385 313.35796 
+L 335.261538 313.469655 
+L 346.707692 313.731341 
+L 358.153846 313.628065 
+L 369.6 313.918748 
+L 381.046154 313.780767 
+L 392.492308 313.762085 
+L 403.938462 313.712702 
+L 415.384615 313.680426 
+L 426.830769 313.69078 
+L 438.276923 313.655243 
+L 449.723077 313.801 
+L 461.169231 313.856531 
+L 472.615385 313.749041 
+L 484.061538 313.756204 
+L 495.507692 313.765085 
+L 506.953846 313.733048 
+L 518.4 313.742958 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #dd8452; stroke-width: 1.5; stroke-linecap: round"/>
+   </g>
+   <g id="line2d_16">
+    <path d="M 72 270.883044 
+L 83.446154 250.335192 
+L 94.892308 241.217988 
+L 106.338462 239.934947 
+L 117.784615 240.533942 
+L 129.230769 240.282678 
+L 140.676923 243.292926 
+L 152.123077 245.477322 
+L 163.569231 246.74824 
+L 175.015385 248.423657 
+L 186.461538 251.925124 
+L 197.907692 253.275787 
+L 209.353846 256.016235 
+L 220.8 256.125166 
+L 232.246154 260.496327 
+L 243.692308 265.344168 
+L 255.138462 269.443769 
+L 266.584615 275.716187 
+L 278.030769 279.160382 
+L 289.476923 284.835335 
+L 300.923077 287.083554 
+L 312.369231 289.513765 
+L 323.815385 291.436354 
+L 335.261538 292.78857 
+L 346.707692 292.867424 
+L 358.153846 292.906062 
+L 369.6 292.525449 
+L 381.046154 292.674721 
+L 392.492308 294.161419 
+L 403.938462 293.327772 
+L 415.384615 294.184601 
+L 426.830769 294.107848 
+L 438.276923 293.676165 
+L 449.723077 293.709604 
+L 461.169231 294.153819 
+L 472.615385 294.084407 
+L 484.061538 293.36601 
+L 495.507692 292.413419 
+L 506.953846 293.865577 
+L 518.4 293.967362 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #55a868; stroke-width: 1.5; stroke-linecap: round"/>
+   </g>
+   <g id="line2d_17">
+    <path d="M 72 136.479878 
+L 83.446154 56.309568 
+L 94.892308 113.489862 
+L 106.338462 116.952556 
+L 117.784615 119.488506 
+L 129.230769 129.668021 
+L 140.676923 134.074172 
+L 152.123077 140.348179 
+L 163.569231 146.208392 
+L 175.015385 148.580855 
+L 186.461538 145.687135 
+L 197.907692 153.104332 
+L 209.353846 155.817924 
+L 220.8 166.367399 
+L 232.246154 186.465197 
+L 243.692308 212.969007 
+L 255.138462 228.798185 
+L 266.584615 242.299424 
+L 278.030769 252.946887 
+L 289.476923 264.280352 
+L 300.923077 260.045711 
+L 312.369231 246.46932 
+L 323.815385 252.022952 
+L 335.261538 275.665034 
+L 346.707692 256.224693 
+L 358.153846 260.129384 
+L 369.6 255.160551 
+L 381.046154 252.305103 
+L 392.492308 261.313163 
+L 403.938462 260.210566 
+L 415.384615 260.971397 
+L 426.830769 276.441432 
+L 438.276923 260.644667 
+L 449.723077 258.857527 
+L 461.169231 253.939829 
+L 472.615385 252.91023 
+L 484.061538 258.367244 
+L 495.507692 267.230819 
+L 506.953846 271.063016 
+L 518.4 283.659277 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #c44e52; stroke-width: 1.5; stroke-linecap: round"/>
+   </g>
+   <g id="line2d_18">
+    <path d="M 72 136.479878 
+L 83.446154 62.911829 
+L 94.892308 113.489862 
+L 106.338462 120.893427 
+L 117.784615 126.741628 
+L 129.230769 132.376183 
+L 140.676923 138.224994 
+L 152.123077 146.338812 
+L 163.569231 144.938347 
+L 175.015385 150.306083 
+L 186.461538 151.357504 
+L 197.907692 147.204375 
+L 209.353846 147.740523 
+L 220.8 148.47389 
+L 232.246154 143.409658 
+L 243.692308 145.629253 
+L 255.138462 140.487126 
+L 266.584615 136.533924 
+L 278.030769 138.805922 
+L 289.476923 137.368886 
+L 300.923077 134.588138 
+L 312.369231 138.938768 
+L 323.815385 140.456141 
+L 335.261538 142.70316 
+L 346.707692 137.373466 
+L 358.153846 134.139357 
+L 369.6 135.082407 
+L 381.046154 133.469061 
+L 392.492308 132.046951 
+L 403.938462 132.576325 
+L 415.384615 135.306496 
+L 426.830769 187.347797 
+L 438.276923 131.519839 
+L 449.723077 136.72422 
+L 461.169231 130.629802 
+L 472.615385 131.495713 
+L 484.061538 133.317256 
+L 495.507692 133.160196 
+L 506.953846 138.508473 
+L 518.4 141.851091 
+" clip-path="url(#pc0259ccbb5)" style="fill: none; stroke: #8172b3; stroke-width: 1.5; stroke-linecap: round"/>
+   </g>
+   <g id="patch_3">
+    <path d="M 72 320.4 
+L 72 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_4">
+    <path d="M 518.4 320.4 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_5">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_6">
+    <path d="M 72 43.2 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="text_16">
+    <!-- Matrix multiplication -->
+    <g style="fill: #262626" transform="translate(223.167812 23.2)scale(0.14 -0.14)">
+     <defs>
+      <path id="DejaVuSans-6d" d="M 3328 2828 
+Q 3544 3216 3844 3400 
+Q 4144 3584 4550 3584 
+Q 5097 3584 5394 3201 
+Q 5691 2819 5691 2113 
+L 5691 0 
+L 5113 0 
+L 5113 2094 
+Q 5113 2597 4934 2840 
+Q 4756 3084 4391 3084 
+Q 3944 3084 3684 2787 
+Q 3425 2491 3425 1978 
+L 3425 0 
+L 2847 0 
+L 2847 2094 
+Q 2847 2600 2669 2842 
+Q 2491 3084 2119 3084 
+Q 1678 3084 1418 2786 
+Q 1159 2488 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1356 3278 1631 3431 
+Q 1906 3584 2284 3584 
+Q 2666 3584 2933 3390 
+Q 3200 3197 3328 2828 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-75" d="M 544 1381 
+L 544 3500 
+L 1119 3500 
+L 1119 1403 
+Q 1119 906 1312 657 
+Q 1506 409 1894 409 
+Q 2359 409 2629 706 
+Q 2900 1003 2900 1516 
+L 2900 3500 
+L 3475 3500 
+L 3475 0 
+L 2900 0 
+L 2900 538 
+Q 2691 219 2414 64 
+Q 2138 -91 1772 -91 
+Q 1169 -91 856 284 
+Q 544 659 544 1381 
+z
+M 1991 3584 
+L 1991 3584 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6c" d="M 603 4863 
+L 1178 4863 
+L 1178 0 
+L 603 0 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-70" d="M 1159 525 
+L 1159 -1331 
+L 581 -1331 
+L 581 3500 
+L 1159 3500 
+L 1159 2969 
+Q 1341 3281 1617 3432 
+Q 1894 3584 2278 3584 
+Q 2916 3584 3314 3078 
+Q 3713 2572 3713 1747 
+Q 3713 922 3314 415 
+Q 2916 -91 2278 -91 
+Q 1894 -91 1617 61 
+Q 1341 213 1159 525 
+z
+M 3116 1747 
+Q 3116 2381 2855 2742 
+Q 2594 3103 2138 3103 
+Q 1681 3103 1420 2742 
+Q 1159 2381 1159 1747 
+Q 1159 1113 1420 752 
+Q 1681 391 2138 391 
+Q 2594 391 2855 752 
+Q 3116 1113 3116 1747 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-63" d="M 3122 3366 
+L 3122 2828 
+Q 2878 2963 2633 3030 
+Q 2388 3097 2138 3097 
+Q 1578 3097 1268 2742 
+Q 959 2388 959 1747 
+Q 959 1106 1268 751 
+Q 1578 397 2138 397 
+Q 2388 397 2633 464 
+Q 2878 531 3122 666 
+L 3122 134 
+Q 2881 22 2623 -34 
+Q 2366 -91 2075 -91 
+Q 1284 -91 818 406 
+Q 353 903 353 1747 
+Q 353 2603 823 3093 
+Q 1294 3584 2113 3584 
+Q 2378 3584 2631 3529 
+Q 2884 3475 3122 3366 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6f" d="M 1959 3097 
+Q 1497 3097 1228 2736 
+Q 959 2375 959 1747 
+Q 959 1119 1226 758 
+Q 1494 397 1959 397 
+Q 2419 397 2687 759 
+Q 2956 1122 2956 1747 
+Q 2956 2369 2687 2733 
+Q 2419 3097 1959 3097 
+z
+M 1959 3584 
+Q 2709 3584 3137 3096 
+Q 3566 2609 3566 1747 
+Q 3566 888 3137 398 
+Q 2709 -91 1959 -91 
+Q 1206 -91 779 398 
+Q 353 888 353 1747 
+Q 353 2609 779 3096 
+Q 1206 3584 1959 3584 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6e" d="M 3513 2113 
+L 3513 0 
+L 2938 0 
+L 2938 2094 
+Q 2938 2591 2744 2837 
+Q 2550 3084 2163 3084 
+Q 1697 3084 1428 2787 
+Q 1159 2491 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1366 3272 1645 3428 
+Q 1925 3584 2291 3584 
+Q 2894 3584 3203 3211 
+Q 3513 2838 3513 2113 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-4d"/>
+     <use xlink:href="#DejaVuSans-61" x="86.279297"/>
+     <use xlink:href="#DejaVuSans-74" x="147.558594"/>
+     <use xlink:href="#DejaVuSans-72" x="186.767578"/>
+     <use xlink:href="#DejaVuSans-69" x="227.880859"/>
+     <use xlink:href="#DejaVuSans-78" x="255.664062"/>
+     <use xlink:href="#DejaVuSans-20" x="314.84375"/>
+     <use xlink:href="#DejaVuSans-6d" x="346.630859"/>
+     <use xlink:href="#DejaVuSans-75" x="444.042969"/>
+     <use xlink:href="#DejaVuSans-6c" x="507.421875"/>
+     <use xlink:href="#DejaVuSans-74" x="535.205078"/>
+     <use xlink:href="#DejaVuSans-69" x="574.414062"/>
+     <use xlink:href="#DejaVuSans-70" x="602.197266"/>
+     <use xlink:href="#DejaVuSans-6c" x="665.673828"/>
+     <use xlink:href="#DejaVuSans-69" x="693.457031"/>
+     <use xlink:href="#DejaVuSans-63" x="721.240234"/>
+     <use xlink:href="#DejaVuSans-61" x="776.220703"/>
+     <use xlink:href="#DejaVuSans-74" x="837.5"/>
+     <use xlink:href="#DejaVuSans-69" x="876.708984"/>
+     <use xlink:href="#DejaVuSans-6f" x="904.492188"/>
+     <use xlink:href="#DejaVuSans-6e" x="965.673828"/>
+    </g>
+   </g>
+   <g id="legend_1">
+    <g id="patch_7">
+     <path d="M 246.863594 132.729687 
+L 343.536406 132.729687 
+Q 345.736406 132.729687 345.736406 130.529687 
+L 345.736406 50.9 
+Q 345.736406 48.7 343.536406 48.7 
+L 246.863594 48.7 
+Q 244.663594 48.7 244.663594 50.9 
+L 244.663594 130.529687 
+Q 244.663594 132.729687 246.863594 132.729687 
+z
+" style="fill: #ffffff; opacity: 0.8; stroke: #cccccc; stroke-linejoin: miter"/>
+    </g>
+    <g id="line2d_19">
+     <path d="M 249.063594 57.608281 
+L 260.063594 57.608281 
+L 271.063594 57.608281 
+" style="fill: none; stroke: #4c72b0; stroke-width: 1.5; stroke-linecap: round"/>
+    </g>
+    <g id="text_17">
+     <!-- naive -->
+     <g style="fill: #262626" transform="translate(279.863594 61.458281)scale(0.11 -0.11)">
+      <defs>
+       <path id="DejaVuSans-76" d="M 191 3500 
+L 800 3500 
+L 1894 563 
+L 2988 3500 
+L 3597 3500 
+L 2284 0 
+L 1503 0 
+L 191 3500 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-6e"/>
+      <use xlink:href="#DejaVuSans-61" x="63.378906"/>
+      <use xlink:href="#DejaVuSans-69" x="124.658203"/>
+      <use xlink:href="#DejaVuSans-76" x="152.441406"/>
+      <use xlink:href="#DejaVuSans-65" x="211.621094"/>
+     </g>
+    </g>
+    <g id="line2d_20">
+     <path d="M 249.063594 73.754219 
+L 260.063594 73.754219 
+L 271.063594 73.754219 
+" style="fill: none; stroke: #dd8452; stroke-width: 1.5; stroke-linecap: round"/>
+    </g>
+    <g id="text_18">
+     <!-- transposed -->
+     <g style="fill: #262626" transform="translate(279.863594 77.604219)scale(0.11 -0.11)">
+      <defs>
+       <path id="DejaVuSans-64" d="M 2906 2969 
+L 2906 4863 
+L 3481 4863 
+L 3481 0 
+L 2906 0 
+L 2906 525 
+Q 2725 213 2448 61 
+Q 2172 -91 1784 -91 
+Q 1150 -91 751 415 
+Q 353 922 353 1747 
+Q 353 2572 751 3078 
+Q 1150 3584 1784 3584 
+Q 2172 3584 2448 3432 
+Q 2725 3281 2906 2969 
+z
+M 947 1747 
+Q 947 1113 1208 752 
+Q 1469 391 1925 391 
+Q 2381 391 2643 752 
+Q 2906 1113 2906 1747 
+Q 2906 2381 2643 2742 
+Q 2381 3103 1925 3103 
+Q 1469 3103 1208 2742 
+Q 947 2381 947 1747 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-74"/>
+      <use xlink:href="#DejaVuSans-72" x="39.208984"/>
+      <use xlink:href="#DejaVuSans-61" x="80.322266"/>
+      <use xlink:href="#DejaVuSans-6e" x="141.601562"/>
+      <use xlink:href="#DejaVuSans-73" x="204.980469"/>
+      <use xlink:href="#DejaVuSans-70" x="257.080078"/>
+      <use xlink:href="#DejaVuSans-6f" x="320.556641"/>
+      <use xlink:href="#DejaVuSans-73" x="381.738281"/>
+      <use xlink:href="#DejaVuSans-65" x="433.837891"/>
+      <use xlink:href="#DejaVuSans-64" x="495.361328"/>
+     </g>
+    </g>
+    <g id="line2d_21">
+     <path d="M 249.063594 89.900156 
+L 260.063594 89.900156 
+L 271.063594 89.900156 
+" style="fill: none; stroke: #55a868; stroke-width: 1.5; stroke-linecap: round"/>
+    </g>
+    <g id="text_19">
+     <!-- vectorized -->
+     <g style="fill: #262626" transform="translate(279.863594 93.750156)scale(0.11 -0.11)">
+      <use xlink:href="#DejaVuSans-76"/>
+      <use xlink:href="#DejaVuSans-65" x="59.179688"/>
+      <use xlink:href="#DejaVuSans-63" x="120.703125"/>
+      <use xlink:href="#DejaVuSans-74" x="175.683594"/>
+      <use xlink:href="#DejaVuSans-6f" x="214.892578"/>
+      <use xlink:href="#DejaVuSans-72" x="276.074219"/>
+      <use xlink:href="#DejaVuSans-69" x="317.1875"/>
+      <use xlink:href="#DejaVuSans-7a" x="344.970703"/>
+      <use xlink:href="#DejaVuSans-65" x="397.460938"/>
+      <use xlink:href="#DejaVuSans-64" x="458.984375"/>
+     </g>
+    </g>
+    <g id="line2d_22">
+     <path d="M 249.063594 106.046094 
+L 260.063594 106.046094 
+L 271.063594 106.046094 
+" style="fill: none; stroke: #c44e52; stroke-width: 1.5; stroke-linecap: round"/>
+    </g>
+    <g id="text_20">
+     <!-- kernel -->
+     <g style="fill: #262626" transform="translate(279.863594 109.896094)scale(0.11 -0.11)">
+      <defs>
+       <path id="DejaVuSans-6b" d="M 581 4863 
+L 1159 4863 
+L 1159 1991 
+L 2875 3500 
+L 3609 3500 
+L 1753 1863 
+L 3688 0 
+L 2938 0 
+L 1159 1709 
+L 1159 0 
+L 581 0 
+L 581 4863 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-6b"/>
+      <use xlink:href="#DejaVuSans-65" x="54.285156"/>
+      <use xlink:href="#DejaVuSans-72" x="115.808594"/>
+      <use xlink:href="#DejaVuSans-6e" x="155.171875"/>
+      <use xlink:href="#DejaVuSans-65" x="218.550781"/>
+      <use xlink:href="#DejaVuSans-6c" x="280.074219"/>
+     </g>
+    </g>
+    <g id="line2d_23">
+     <path d="M 249.063594 122.192031 
+L 260.063594 122.192031 
+L 271.063594 122.192031 
+" style="fill: none; stroke: #8172b3; stroke-width: 1.5; stroke-linecap: round"/>
+    </g>
+    <g id="text_21">
+     <!-- blocked -->
+     <g style="fill: #262626" transform="translate(279.863594 126.042031)scale(0.11 -0.11)">
+      <defs>
+       <path id="DejaVuSans-62" d="M 3116 1747 
+Q 3116 2381 2855 2742 
+Q 2594 3103 2138 3103 
+Q 1681 3103 1420 2742 
+Q 1159 2381 1159 1747 
+Q 1159 1113 1420 752 
+Q 1681 391 2138 391 
+Q 2594 391 2855 752 
+Q 3116 1113 3116 1747 
+z
+M 1159 2969 
+Q 1341 3281 1617 3432 
+Q 1894 3584 2278 3584 
+Q 2916 3584 3314 3078 
+Q 3713 2572 3713 1747 
+Q 3713 922 3314 415 
+Q 2916 -91 2278 -91 
+Q 1894 -91 1617 61 
+Q 1341 213 1159 525 
+L 1159 0 
+L 581 0 
+L 581 4863 
+L 1159 4863 
+L 1159 2969 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-62"/>
+      <use xlink:href="#DejaVuSans-6c" x="63.476562"/>
+      <use xlink:href="#DejaVuSans-6f" x="91.259766"/>
+      <use xlink:href="#DejaVuSans-63" x="152.441406"/>
+      <use xlink:href="#DejaVuSans-6b" x="207.421875"/>
+      <use xlink:href="#DejaVuSans-65" x="261.707031"/>
+      <use xlink:href="#DejaVuSans-64" x="323.230469"/>
+     </g>
+    </g>
+   </g>
+  </g>
+ </g>
+ <defs>
+  <clipPath id="pc0259ccbb5">
+   <rect x="72" y="43.2" width="446.4" height="277.2"/>
+  </clipPath>
+ </defs>
+</svg>
diff --git a/content/english/hpc/algorithms/img/mm-kernel-barplot.svg b/content/english/hpc/algorithms/img/mm-kernel-barplot.svg
new file mode 100644
index 00000000..834d8b39
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-kernel-barplot.svg
@@ -0,0 +1,1277 @@
+<?xml version="1.0" encoding="utf-8" standalone="no"?>
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
+  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns:xlink="http://www.w3.org/1999/xlink" width="576pt" height="360pt" viewBox="0 0 576 360" xmlns="http://www.w3.org/2000/svg" version="1.1">
+ <metadata>
+  <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
+   <cc:Work>
+    <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/>
+    <dc:date>2022-04-05T01:18:16.721432</dc:date>
+    <dc:format>image/svg+xml</dc:format>
+    <dc:creator>
+     <cc:Agent>
+      <dc:title>Matplotlib v3.5.1, https://matplotlib.org/</dc:title>
+     </cc:Agent>
+    </dc:creator>
+   </cc:Work>
+  </rdf:RDF>
+ </metadata>
+ <defs>
+  <style type="text/css">*{stroke-linejoin: round; stroke-linecap: butt}</style>
+ </defs>
+ <g id="figure_1">
+  <g id="patch_1">
+   <path d="M 0 360 
+L 576 360 
+L 576 0 
+L 0 0 
+z
+" style="fill: #ffffff"/>
+  </g>
+  <g id="axes_1">
+   <g id="patch_2">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+L 518.4 43.2 
+L 72 43.2 
+z
+" style="fill: #ffffff"/>
+   </g>
+   <g id="matplotlib.axis_1">
+    <g id="xtick_1">
+     <g id="line2d_1">
+      <path d="M 135.008612 320.4 
+L 135.008612 43.2 
+" clip-path="url(#p4787e7a7d6)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_1">
+      <!-- naive -->
+      <g style="fill: #262626" transform="translate(121.3508 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-6e" d="M 3513 2113 
+L 3513 0 
+L 2938 0 
+L 2938 2094 
+Q 2938 2591 2744 2837 
+Q 2550 3084 2163 3084 
+Q 1697 3084 1428 2787 
+Q 1159 2491 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1366 3272 1645 3428 
+Q 1925 3584 2291 3584 
+Q 2894 3584 3203 3211 
+Q 3513 2838 3513 2113 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-61" d="M 2194 1759 
+Q 1497 1759 1228 1600 
+Q 959 1441 959 1056 
+Q 959 750 1161 570 
+Q 1363 391 1709 391 
+Q 2188 391 2477 730 
+Q 2766 1069 2766 1631 
+L 2766 1759 
+L 2194 1759 
+z
+M 3341 1997 
+L 3341 0 
+L 2766 0 
+L 2766 531 
+Q 2569 213 2275 61 
+Q 1981 -91 1556 -91 
+Q 1019 -91 701 211 
+Q 384 513 384 1019 
+Q 384 1609 779 1909 
+Q 1175 2209 1959 2209 
+L 2766 2209 
+L 2766 2266 
+Q 2766 2663 2505 2880 
+Q 2244 3097 1772 3097 
+Q 1472 3097 1187 3025 
+Q 903 2953 641 2809 
+L 641 3341 
+Q 956 3463 1253 3523 
+Q 1550 3584 1831 3584 
+Q 2591 3584 2966 3190 
+Q 3341 2797 3341 1997 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-69" d="M 603 3500 
+L 1178 3500 
+L 1178 0 
+L 603 0 
+L 603 3500 
+z
+M 603 4863 
+L 1178 4863 
+L 1178 4134 
+L 603 4134 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-76" d="M 191 3500 
+L 800 3500 
+L 1894 563 
+L 2988 3500 
+L 3597 3500 
+L 2284 0 
+L 1503 0 
+L 191 3500 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-65" d="M 3597 1894 
+L 3597 1613 
+L 953 1613 
+Q 991 1019 1311 708 
+Q 1631 397 2203 397 
+Q 2534 397 2845 478 
+Q 3156 559 3463 722 
+L 3463 178 
+Q 3153 47 2828 -22 
+Q 2503 -91 2169 -91 
+Q 1331 -91 842 396 
+Q 353 884 353 1716 
+Q 353 2575 817 3079 
+Q 1281 3584 2069 3584 
+Q 2775 3584 3186 3129 
+Q 3597 2675 3597 1894 
+z
+M 3022 2063 
+Q 3016 2534 2758 2815 
+Q 2500 3097 2075 3097 
+Q 1594 3097 1305 2825 
+Q 1016 2553 972 2059 
+L 3022 2063 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-6e"/>
+       <use xlink:href="#DejaVuSans-61" x="63.378906"/>
+       <use xlink:href="#DejaVuSans-69" x="124.658203"/>
+       <use xlink:href="#DejaVuSans-76" x="152.441406"/>
+       <use xlink:href="#DejaVuSans-65" x="211.621094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_2">
+     <g id="line2d_2">
+      <path d="M 241.802871 320.4 
+L 241.802871 43.2 
+" clip-path="url(#p4787e7a7d6)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_2">
+      <!-- transposed -->
+      <g style="fill: #262626" transform="translate(213.860683 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-74" d="M 1172 4494 
+L 1172 3500 
+L 2356 3500 
+L 2356 3053 
+L 1172 3053 
+L 1172 1153 
+Q 1172 725 1289 603 
+Q 1406 481 1766 481 
+L 2356 481 
+L 2356 0 
+L 1766 0 
+Q 1100 0 847 248 
+Q 594 497 594 1153 
+L 594 3053 
+L 172 3053 
+L 172 3500 
+L 594 3500 
+L 594 4494 
+L 1172 4494 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-72" d="M 2631 2963 
+Q 2534 3019 2420 3045 
+Q 2306 3072 2169 3072 
+Q 1681 3072 1420 2755 
+Q 1159 2438 1159 1844 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1341 3275 1631 3429 
+Q 1922 3584 2338 3584 
+Q 2397 3584 2469 3576 
+Q 2541 3569 2628 3553 
+L 2631 2963 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-73" d="M 2834 3397 
+L 2834 2853 
+Q 2591 2978 2328 3040 
+Q 2066 3103 1784 3103 
+Q 1356 3103 1142 2972 
+Q 928 2841 928 2578 
+Q 928 2378 1081 2264 
+Q 1234 2150 1697 2047 
+L 1894 2003 
+Q 2506 1872 2764 1633 
+Q 3022 1394 3022 966 
+Q 3022 478 2636 193 
+Q 2250 -91 1575 -91 
+Q 1294 -91 989 -36 
+Q 684 19 347 128 
+L 347 722 
+Q 666 556 975 473 
+Q 1284 391 1588 391 
+Q 1994 391 2212 530 
+Q 2431 669 2431 922 
+Q 2431 1156 2273 1281 
+Q 2116 1406 1581 1522 
+L 1381 1569 
+Q 847 1681 609 1914 
+Q 372 2147 372 2553 
+Q 372 3047 722 3315 
+Q 1072 3584 1716 3584 
+Q 2034 3584 2315 3537 
+Q 2597 3491 2834 3397 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-70" d="M 1159 525 
+L 1159 -1331 
+L 581 -1331 
+L 581 3500 
+L 1159 3500 
+L 1159 2969 
+Q 1341 3281 1617 3432 
+Q 1894 3584 2278 3584 
+Q 2916 3584 3314 3078 
+Q 3713 2572 3713 1747 
+Q 3713 922 3314 415 
+Q 2916 -91 2278 -91 
+Q 1894 -91 1617 61 
+Q 1341 213 1159 525 
+z
+M 3116 1747 
+Q 3116 2381 2855 2742 
+Q 2594 3103 2138 3103 
+Q 1681 3103 1420 2742 
+Q 1159 2381 1159 1747 
+Q 1159 1113 1420 752 
+Q 1681 391 2138 391 
+Q 2594 391 2855 752 
+Q 3116 1113 3116 1747 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-6f" d="M 1959 3097 
+Q 1497 3097 1228 2736 
+Q 959 2375 959 1747 
+Q 959 1119 1226 758 
+Q 1494 397 1959 397 
+Q 2419 397 2687 759 
+Q 2956 1122 2956 1747 
+Q 2956 2369 2687 2733 
+Q 2419 3097 1959 3097 
+z
+M 1959 3584 
+Q 2709 3584 3137 3096 
+Q 3566 2609 3566 1747 
+Q 3566 888 3137 398 
+Q 2709 -91 1959 -91 
+Q 1206 -91 779 398 
+Q 353 888 353 1747 
+Q 353 2609 779 3096 
+Q 1206 3584 1959 3584 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-64" d="M 2906 2969 
+L 2906 4863 
+L 3481 4863 
+L 3481 0 
+L 2906 0 
+L 2906 525 
+Q 2725 213 2448 61 
+Q 2172 -91 1784 -91 
+Q 1150 -91 751 415 
+Q 353 922 353 1747 
+Q 353 2572 751 3078 
+Q 1150 3584 1784 3584 
+Q 2172 3584 2448 3432 
+Q 2725 3281 2906 2969 
+z
+M 947 1747 
+Q 947 1113 1208 752 
+Q 1469 391 1925 391 
+Q 2381 391 2643 752 
+Q 2906 1113 2906 1747 
+Q 2906 2381 2643 2742 
+Q 2381 3103 1925 3103 
+Q 1469 3103 1208 2742 
+Q 947 2381 947 1747 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-74"/>
+       <use xlink:href="#DejaVuSans-72" x="39.208984"/>
+       <use xlink:href="#DejaVuSans-61" x="80.322266"/>
+       <use xlink:href="#DejaVuSans-6e" x="141.601562"/>
+       <use xlink:href="#DejaVuSans-73" x="204.980469"/>
+       <use xlink:href="#DejaVuSans-70" x="257.080078"/>
+       <use xlink:href="#DejaVuSans-6f" x="320.556641"/>
+       <use xlink:href="#DejaVuSans-73" x="381.738281"/>
+       <use xlink:href="#DejaVuSans-65" x="433.837891"/>
+       <use xlink:href="#DejaVuSans-64" x="495.361328"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_3">
+     <g id="line2d_3">
+      <path d="M 348.597129 320.4 
+L 348.597129 43.2 
+" clip-path="url(#p4787e7a7d6)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_3">
+      <!-- vectorized -->
+      <g style="fill: #262626" transform="translate(322.47291 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-63" d="M 3122 3366 
+L 3122 2828 
+Q 2878 2963 2633 3030 
+Q 2388 3097 2138 3097 
+Q 1578 3097 1268 2742 
+Q 959 2388 959 1747 
+Q 959 1106 1268 751 
+Q 1578 397 2138 397 
+Q 2388 397 2633 464 
+Q 2878 531 3122 666 
+L 3122 134 
+Q 2881 22 2623 -34 
+Q 2366 -91 2075 -91 
+Q 1284 -91 818 406 
+Q 353 903 353 1747 
+Q 353 2603 823 3093 
+Q 1294 3584 2113 3584 
+Q 2378 3584 2631 3529 
+Q 2884 3475 3122 3366 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-7a" d="M 353 3500 
+L 3084 3500 
+L 3084 2975 
+L 922 459 
+L 3084 459 
+L 3084 0 
+L 275 0 
+L 275 525 
+L 2438 3041 
+L 353 3041 
+L 353 3500 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-76"/>
+       <use xlink:href="#DejaVuSans-65" x="59.179688"/>
+       <use xlink:href="#DejaVuSans-63" x="120.703125"/>
+       <use xlink:href="#DejaVuSans-74" x="175.683594"/>
+       <use xlink:href="#DejaVuSans-6f" x="214.892578"/>
+       <use xlink:href="#DejaVuSans-72" x="276.074219"/>
+       <use xlink:href="#DejaVuSans-69" x="317.1875"/>
+       <use xlink:href="#DejaVuSans-7a" x="344.970703"/>
+       <use xlink:href="#DejaVuSans-65" x="397.460938"/>
+       <use xlink:href="#DejaVuSans-64" x="458.984375"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_4">
+     <g id="line2d_4">
+      <path d="M 455.391388 320.4 
+L 455.391388 43.2 
+" clip-path="url(#p4787e7a7d6)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_4">
+      <!-- kernel -->
+      <g style="fill: #262626" transform="translate(439.998419 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-6b" d="M 581 4863 
+L 1159 4863 
+L 1159 1991 
+L 2875 3500 
+L 3609 3500 
+L 1753 1863 
+L 3688 0 
+L 2938 0 
+L 1159 1709 
+L 1159 0 
+L 581 0 
+L 581 4863 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-6c" d="M 603 4863 
+L 1178 4863 
+L 1178 0 
+L 603 0 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-6b"/>
+       <use xlink:href="#DejaVuSans-65" x="54.285156"/>
+       <use xlink:href="#DejaVuSans-72" x="115.808594"/>
+       <use xlink:href="#DejaVuSans-6e" x="155.171875"/>
+       <use xlink:href="#DejaVuSans-65" x="218.550781"/>
+       <use xlink:href="#DejaVuSans-6c" x="280.074219"/>
+      </g>
+     </g>
+    </g>
+   </g>
+   <g id="matplotlib.axis_2">
+    <g id="ytick_1">
+     <g id="line2d_5">
+      <path d="M 72 320.4 
+L 518.4 320.4 
+" clip-path="url(#p4787e7a7d6)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_5">
+      <!-- 0.0 -->
+      <g style="fill: #262626" transform="translate(45.006563 324.579141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-30" d="M 2034 4250 
+Q 1547 4250 1301 3770 
+Q 1056 3291 1056 2328 
+Q 1056 1369 1301 889 
+Q 1547 409 2034 409 
+Q 2525 409 2770 889 
+Q 3016 1369 3016 2328 
+Q 3016 3291 2770 3770 
+Q 2525 4250 2034 4250 
+z
+M 2034 4750 
+Q 2819 4750 3233 4129 
+Q 3647 3509 3647 2328 
+Q 3647 1150 3233 529 
+Q 2819 -91 2034 -91 
+Q 1250 -91 836 529 
+Q 422 1150 422 2328 
+Q 422 3509 836 4129 
+Q 1250 4750 2034 4750 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-2e" d="M 684 794 
+L 1344 794 
+L 1344 0 
+L 684 0 
+L 684 794 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-30"/>
+       <use xlink:href="#DejaVuSans-2e" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="95.410156"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_2">
+     <g id="line2d_6">
+      <path d="M 72 280.8 
+L 518.4 280.8 
+" clip-path="url(#p4787e7a7d6)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_6">
+      <!-- 0.5 -->
+      <g style="fill: #262626" transform="translate(45.006563 284.979141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-35" d="M 691 4666 
+L 3169 4666 
+L 3169 4134 
+L 1269 4134 
+L 1269 2991 
+Q 1406 3038 1543 3061 
+Q 1681 3084 1819 3084 
+Q 2600 3084 3056 2656 
+Q 3513 2228 3513 1497 
+Q 3513 744 3044 326 
+Q 2575 -91 1722 -91 
+Q 1428 -91 1123 -41 
+Q 819 9 494 109 
+L 494 744 
+Q 775 591 1075 516 
+Q 1375 441 1709 441 
+Q 2250 441 2565 725 
+Q 2881 1009 2881 1497 
+Q 2881 1984 2565 2268 
+Q 2250 2553 1709 2553 
+Q 1456 2553 1204 2497 
+Q 953 2441 691 2322 
+L 691 4666 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-30"/>
+       <use xlink:href="#DejaVuSans-2e" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-35" x="95.410156"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_3">
+     <g id="line2d_7">
+      <path d="M 72 241.2 
+L 518.4 241.2 
+" clip-path="url(#p4787e7a7d6)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_7">
+      <!-- 1.0 -->
+      <g style="fill: #262626" transform="translate(45.006563 245.379141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-31" d="M 794 531 
+L 1825 531 
+L 1825 4091 
+L 703 3866 
+L 703 4441 
+L 1819 4666 
+L 2450 4666 
+L 2450 531 
+L 3481 531 
+L 3481 0 
+L 794 0 
+L 794 531 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-2e" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="95.410156"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_4">
+     <g id="line2d_8">
+      <path d="M 72 201.6 
+L 518.4 201.6 
+" clip-path="url(#p4787e7a7d6)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_8">
+      <!-- 1.5 -->
+      <g style="fill: #262626" transform="translate(45.006563 205.779141)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-2e" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-35" x="95.410156"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_5">
+     <g id="line2d_9">
+      <path d="M 72 162 
+L 518.4 162 
+" clip-path="url(#p4787e7a7d6)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_9">
+      <!-- 2.0 -->
+      <g style="fill: #262626" transform="translate(45.006563 166.179141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-32" d="M 1228 531 
+L 3431 531 
+L 3431 0 
+L 469 0 
+L 469 531 
+Q 828 903 1448 1529 
+Q 2069 2156 2228 2338 
+Q 2531 2678 2651 2914 
+Q 2772 3150 2772 3378 
+Q 2772 3750 2511 3984 
+Q 2250 4219 1831 4219 
+Q 1534 4219 1204 4116 
+Q 875 4013 500 3803 
+L 500 4441 
+Q 881 4594 1212 4672 
+Q 1544 4750 1819 4750 
+Q 2544 4750 2975 4387 
+Q 3406 4025 3406 3419 
+Q 3406 3131 3298 2873 
+Q 3191 2616 2906 2266 
+Q 2828 2175 2409 1742 
+Q 1991 1309 1228 531 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-32"/>
+       <use xlink:href="#DejaVuSans-2e" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="95.410156"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_6">
+     <g id="line2d_10">
+      <path d="M 72 122.4 
+L 518.4 122.4 
+" clip-path="url(#p4787e7a7d6)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_10">
+      <!-- 2.5 -->
+      <g style="fill: #262626" transform="translate(45.006563 126.579141)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-32"/>
+       <use xlink:href="#DejaVuSans-2e" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-35" x="95.410156"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_7">
+     <g id="line2d_11">
+      <path d="M 72 82.8 
+L 518.4 82.8 
+" clip-path="url(#p4787e7a7d6)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_11">
+      <!-- 3.0 -->
+      <g style="fill: #262626" transform="translate(45.006563 86.979141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-33" d="M 2597 2516 
+Q 3050 2419 3304 2112 
+Q 3559 1806 3559 1356 
+Q 3559 666 3084 287 
+Q 2609 -91 1734 -91 
+Q 1441 -91 1130 -33 
+Q 819 25 488 141 
+L 488 750 
+Q 750 597 1062 519 
+Q 1375 441 1716 441 
+Q 2309 441 2620 675 
+Q 2931 909 2931 1356 
+Q 2931 1769 2642 2001 
+Q 2353 2234 1838 2234 
+L 1294 2234 
+L 1294 2753 
+L 1863 2753 
+Q 2328 2753 2575 2939 
+Q 2822 3125 2822 3475 
+Q 2822 3834 2567 4026 
+Q 2313 4219 1838 4219 
+Q 1578 4219 1281 4162 
+Q 984 4106 628 3988 
+L 628 4550 
+Q 988 4650 1302 4700 
+Q 1616 4750 1894 4750 
+Q 2613 4750 3031 4423 
+Q 3450 4097 3450 3541 
+Q 3450 3153 3228 2886 
+Q 3006 2619 2597 2516 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-33"/>
+       <use xlink:href="#DejaVuSans-2e" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="95.410156"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_8">
+     <g id="line2d_12">
+      <path d="M 72 43.2 
+L 518.4 43.2 
+" clip-path="url(#p4787e7a7d6)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_12">
+      <!-- 3.5 -->
+      <g style="fill: #262626" transform="translate(45.006563 47.379141)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-33"/>
+       <use xlink:href="#DejaVuSans-2e" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-35" x="95.410156"/>
+      </g>
+     </g>
+    </g>
+    <g id="text_13">
+     <!-- GFLOPS -->
+     <g style="fill: #262626" transform="translate(38.510937 205.175625)rotate(-90)scale(0.12 -0.12)">
+      <defs>
+       <path id="DejaVuSans-47" d="M 3809 666 
+L 3809 1919 
+L 2778 1919 
+L 2778 2438 
+L 4434 2438 
+L 4434 434 
+Q 4069 175 3628 42 
+Q 3188 -91 2688 -91 
+Q 1594 -91 976 548 
+Q 359 1188 359 2328 
+Q 359 3472 976 4111 
+Q 1594 4750 2688 4750 
+Q 3144 4750 3555 4637 
+Q 3966 4525 4313 4306 
+L 4313 3634 
+Q 3963 3931 3569 4081 
+Q 3175 4231 2741 4231 
+Q 1884 4231 1454 3753 
+Q 1025 3275 1025 2328 
+Q 1025 1384 1454 906 
+Q 1884 428 2741 428 
+Q 3075 428 3337 486 
+Q 3600 544 3809 666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-46" d="M 628 4666 
+L 3309 4666 
+L 3309 4134 
+L 1259 4134 
+L 1259 2759 
+L 3109 2759 
+L 3109 2228 
+L 1259 2228 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4c" d="M 628 4666 
+L 1259 4666 
+L 1259 531 
+L 3531 531 
+L 3531 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4f" d="M 2522 4238 
+Q 1834 4238 1429 3725 
+Q 1025 3213 1025 2328 
+Q 1025 1447 1429 934 
+Q 1834 422 2522 422 
+Q 3209 422 3611 934 
+Q 4013 1447 4013 2328 
+Q 4013 3213 3611 3725 
+Q 3209 4238 2522 4238 
+z
+M 2522 4750 
+Q 3503 4750 4090 4092 
+Q 4678 3434 4678 2328 
+Q 4678 1225 4090 567 
+Q 3503 -91 2522 -91 
+Q 1538 -91 948 565 
+Q 359 1222 359 2328 
+Q 359 3434 948 4092 
+Q 1538 4750 2522 4750 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-50" d="M 1259 4147 
+L 1259 2394 
+L 2053 2394 
+Q 2494 2394 2734 2622 
+Q 2975 2850 2975 3272 
+Q 2975 3691 2734 3919 
+Q 2494 4147 2053 4147 
+L 1259 4147 
+z
+M 628 4666 
+L 2053 4666 
+Q 2838 4666 3239 4311 
+Q 3641 3956 3641 3272 
+Q 3641 2581 3239 2228 
+Q 2838 1875 2053 1875 
+L 1259 1875 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-53" d="M 3425 4513 
+L 3425 3897 
+Q 3066 4069 2747 4153 
+Q 2428 4238 2131 4238 
+Q 1616 4238 1336 4038 
+Q 1056 3838 1056 3469 
+Q 1056 3159 1242 3001 
+Q 1428 2844 1947 2747 
+L 2328 2669 
+Q 3034 2534 3370 2195 
+Q 3706 1856 3706 1288 
+Q 3706 609 3251 259 
+Q 2797 -91 1919 -91 
+Q 1588 -91 1214 -16 
+Q 841 59 441 206 
+L 441 856 
+Q 825 641 1194 531 
+Q 1563 422 1919 422 
+Q 2459 422 2753 634 
+Q 3047 847 3047 1241 
+Q 3047 1584 2836 1778 
+Q 2625 1972 2144 2069 
+L 1759 2144 
+Q 1053 2284 737 2584 
+Q 422 2884 422 3419 
+Q 422 4038 858 4394 
+Q 1294 4750 2059 4750 
+Q 2388 4750 2728 4690 
+Q 3069 4631 3425 4513 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-47"/>
+      <use xlink:href="#DejaVuSans-46" x="77.490234"/>
+      <use xlink:href="#DejaVuSans-4c" x="135.009766"/>
+      <use xlink:href="#DejaVuSans-4f" x="187.097656"/>
+      <use xlink:href="#DejaVuSans-50" x="265.808594"/>
+      <use xlink:href="#DejaVuSans-53" x="326.111328"/>
+     </g>
+    </g>
+   </g>
+   <g id="patch_3">
+    <path d="M 92.290909 320.4 
+L 177.726316 320.4 
+L 177.726316 286.933 
+L 92.290909 286.933 
+z
+" clip-path="url(#p4787e7a7d6)" style="fill: #4c72b0; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_4">
+    <path d="M 199.085167 320.4 
+L 284.520574 320.4 
+L 284.520574 275.109942 
+L 199.085167 275.109942 
+z
+" clip-path="url(#p4787e7a7d6)" style="fill: #dd8452; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_5">
+    <path d="M 305.879426 320.4 
+L 391.314833 320.4 
+L 391.314833 140.570014 
+L 305.879426 140.570014 
+z
+" clip-path="url(#p4787e7a7d6)" style="fill: #55a868; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_6">
+    <path d="M 412.673684 320.4 
+L 498.109091 320.4 
+L 498.109091 70.440698 
+L 412.673684 70.440698 
+z
+" clip-path="url(#p4787e7a7d6)" style="fill: #c44e52; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_7">
+    <path d="M 72 320.4 
+L 72 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_8">
+    <path d="M 518.4 320.4 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_9">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_10">
+    <path d="M 72 43.2 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="text_14">
+    <!-- 1.00x -->
+    <g style="fill: #262626" transform="translate(116.335487 281.937375)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-31" d="M 750 831 
+L 1813 831 
+L 1813 3847 
+L 722 3622 
+L 722 4441 
+L 1806 4666 
+L 2950 4666 
+L 2950 831 
+L 4013 831 
+L 4013 0 
+L 750 0 
+L 750 831 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-2e" d="M 653 1209 
+L 1778 1209 
+L 1778 0 
+L 653 0 
+L 653 1209 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-30" d="M 2944 2338 
+Q 2944 3213 2780 3570 
+Q 2616 3928 2228 3928 
+Q 1841 3928 1675 3570 
+Q 1509 3213 1509 2338 
+Q 1509 1453 1675 1090 
+Q 1841 728 2228 728 
+Q 2613 728 2778 1090 
+Q 2944 1453 2944 2338 
+z
+M 4147 2328 
+Q 4147 1169 3647 539 
+Q 3147 -91 2228 -91 
+Q 1306 -91 806 539 
+Q 306 1169 306 2328 
+Q 306 3491 806 4120 
+Q 1306 4750 2228 4750 
+Q 3147 4750 3647 4120 
+Q 4147 3491 4147 2328 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-78" d="M 1422 1791 
+L 159 3500 
+L 1344 3500 
+L 2059 2463 
+L 2784 3500 
+L 3969 3500 
+L 2706 1797 
+L 4031 0 
+L 2847 0 
+L 2059 1106 
+L 1281 0 
+L 97 0 
+L 1422 1791 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-31"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-30" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-30" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_15">
+    <!-- 1.35x -->
+    <g style="fill: #262626" transform="translate(223.129746 270.114317)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-33" d="M 2981 2516 
+Q 3453 2394 3698 2092 
+Q 3944 1791 3944 1325 
+Q 3944 631 3412 270 
+Q 2881 -91 1863 -91 
+Q 1503 -91 1142 -33 
+Q 781 25 428 141 
+L 428 1069 
+Q 766 900 1098 814 
+Q 1431 728 1753 728 
+Q 2231 728 2486 893 
+Q 2741 1059 2741 1369 
+Q 2741 1688 2480 1852 
+Q 2219 2016 1709 2016 
+L 1228 2016 
+L 1228 2791 
+L 1734 2791 
+Q 2188 2791 2409 2933 
+Q 2631 3075 2631 3366 
+Q 2631 3634 2415 3781 
+Q 2200 3928 1806 3928 
+Q 1516 3928 1219 3862 
+Q 922 3797 628 3669 
+L 628 4550 
+Q 984 4650 1334 4700 
+Q 1684 4750 2022 4750 
+Q 2931 4750 3382 4451 
+Q 3834 4153 3834 3553 
+Q 3834 3144 3618 2883 
+Q 3403 2622 2981 2516 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-35" d="M 678 4666 
+L 3669 4666 
+L 3669 3781 
+L 1638 3781 
+L 1638 3059 
+Q 1775 3097 1914 3117 
+Q 2053 3138 2203 3138 
+Q 3056 3138 3531 2711 
+Q 4006 2284 4006 1522 
+Q 4006 766 3489 337 
+Q 2972 -91 2053 -91 
+Q 1656 -91 1267 -14 
+Q 878 63 494 219 
+L 494 1166 
+Q 875 947 1217 837 
+Q 1559 728 1863 728 
+Q 2300 728 2551 942 
+Q 2803 1156 2803 1522 
+Q 2803 1891 2551 2103 
+Q 2300 2316 1863 2316 
+Q 1603 2316 1309 2248 
+Q 1016 2181 678 2041 
+L 678 4666 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-31"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-35" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_16">
+    <!-- 5.37x -->
+    <g style="fill: #262626" transform="translate(329.924004 135.574389)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-37" d="M 428 4666 
+L 3944 4666 
+L 3944 3988 
+L 2125 0 
+L 953 0 
+L 2675 3781 
+L 428 3781 
+L 428 4666 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-35"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-37" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_17">
+    <!-- 7.47x -->
+    <g style="fill: #262626" transform="translate(436.718263 65.445073)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-34" d="M 2356 3675 
+L 1038 1722 
+L 2356 1722 
+L 2356 3675 
+z
+M 2156 4666 
+L 3494 4666 
+L 3494 1722 
+L 4159 1722 
+L 4159 850 
+L 3494 850 
+L 3494 0 
+L 2356 0 
+L 2356 850 
+L 288 850 
+L 288 1881 
+L 2156 4666 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-37"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-34" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-37" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_18">
+    <!-- Matrix multiplication ($n=1920$) -->
+    <g style="fill: #262626" transform="translate(184.74 23.2)scale(0.14 -0.14)">
+     <defs>
+      <path id="DejaVuSans-4d" d="M 628 4666 
+L 1569 4666 
+L 2759 1491 
+L 3956 4666 
+L 4897 4666 
+L 4897 0 
+L 4281 0 
+L 4281 4097 
+L 3078 897 
+L 2444 897 
+L 1241 4097 
+L 1241 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-78" d="M 3513 3500 
+L 2247 1797 
+L 3578 0 
+L 2900 0 
+L 1881 1375 
+L 863 0 
+L 184 0 
+L 1544 1831 
+L 300 3500 
+L 978 3500 
+L 1906 2253 
+L 2834 3500 
+L 3513 3500 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-20" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6d" d="M 3328 2828 
+Q 3544 3216 3844 3400 
+Q 4144 3584 4550 3584 
+Q 5097 3584 5394 3201 
+Q 5691 2819 5691 2113 
+L 5691 0 
+L 5113 0 
+L 5113 2094 
+Q 5113 2597 4934 2840 
+Q 4756 3084 4391 3084 
+Q 3944 3084 3684 2787 
+Q 3425 2491 3425 1978 
+L 3425 0 
+L 2847 0 
+L 2847 2094 
+Q 2847 2600 2669 2842 
+Q 2491 3084 2119 3084 
+Q 1678 3084 1418 2786 
+Q 1159 2488 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1356 3278 1631 3431 
+Q 1906 3584 2284 3584 
+Q 2666 3584 2933 3390 
+Q 3200 3197 3328 2828 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-75" d="M 544 1381 
+L 544 3500 
+L 1119 3500 
+L 1119 1403 
+Q 1119 906 1312 657 
+Q 1506 409 1894 409 
+Q 2359 409 2629 706 
+Q 2900 1003 2900 1516 
+L 2900 3500 
+L 3475 3500 
+L 3475 0 
+L 2900 0 
+L 2900 538 
+Q 2691 219 2414 64 
+Q 2138 -91 1772 -91 
+Q 1169 -91 856 284 
+Q 544 659 544 1381 
+z
+M 1991 3584 
+L 1991 3584 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-28" d="M 1984 4856 
+Q 1566 4138 1362 3434 
+Q 1159 2731 1159 2009 
+Q 1159 1288 1364 580 
+Q 1569 -128 1984 -844 
+L 1484 -844 
+Q 1016 -109 783 600 
+Q 550 1309 550 2009 
+Q 550 2706 781 3412 
+Q 1013 4119 1484 4856 
+L 1984 4856 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Oblique-6e" d="M 3566 2113 
+L 3156 0 
+L 2578 0 
+L 2988 2091 
+Q 3016 2238 3031 2350 
+Q 3047 2463 3047 2528 
+Q 3047 2791 2881 2937 
+Q 2716 3084 2419 3084 
+Q 1956 3084 1622 2776 
+Q 1288 2469 1184 1941 
+L 800 0 
+L 225 0 
+L 903 3500 
+L 1478 3500 
+L 1363 2950 
+Q 1603 3253 1940 3418 
+Q 2278 3584 2650 3584 
+Q 3113 3584 3367 3334 
+Q 3622 3084 3622 2631 
+Q 3622 2519 3608 2391 
+Q 3594 2263 3566 2113 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-3d" d="M 678 2906 
+L 4684 2906 
+L 4684 2381 
+L 678 2381 
+L 678 2906 
+z
+M 678 1631 
+L 4684 1631 
+L 4684 1100 
+L 678 1100 
+L 678 1631 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-39" d="M 703 97 
+L 703 672 
+Q 941 559 1184 500 
+Q 1428 441 1663 441 
+Q 2288 441 2617 861 
+Q 2947 1281 2994 2138 
+Q 2813 1869 2534 1725 
+Q 2256 1581 1919 1581 
+Q 1219 1581 811 2004 
+Q 403 2428 403 3163 
+Q 403 3881 828 4315 
+Q 1253 4750 1959 4750 
+Q 2769 4750 3195 4129 
+Q 3622 3509 3622 2328 
+Q 3622 1225 3098 567 
+Q 2575 -91 1691 -91 
+Q 1453 -91 1209 -44 
+Q 966 3 703 97 
+z
+M 1959 2075 
+Q 2384 2075 2632 2365 
+Q 2881 2656 2881 3163 
+Q 2881 3666 2632 3958 
+Q 2384 4250 1959 4250 
+Q 1534 4250 1286 3958 
+Q 1038 3666 1038 3163 
+Q 1038 2656 1286 2365 
+Q 1534 2075 1959 2075 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-29" d="M 513 4856 
+L 1013 4856 
+Q 1481 4119 1714 3412 
+Q 1947 2706 1947 2009 
+Q 1947 1309 1714 600 
+Q 1481 -109 1013 -844 
+L 513 -844 
+Q 928 -128 1133 580 
+Q 1338 1288 1338 2009 
+Q 1338 2731 1133 3434 
+Q 928 4138 513 4856 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-4d" transform="translate(0 0.015625)"/>
+     <use xlink:href="#DejaVuSans-61" transform="translate(86.279297 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(147.558594 0.015625)"/>
+     <use xlink:href="#DejaVuSans-72" transform="translate(186.767578 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(227.880859 0.015625)"/>
+     <use xlink:href="#DejaVuSans-78" transform="translate(255.664062 0.015625)"/>
+     <use xlink:href="#DejaVuSans-20" transform="translate(314.84375 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6d" transform="translate(346.630859 0.015625)"/>
+     <use xlink:href="#DejaVuSans-75" transform="translate(444.042969 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6c" transform="translate(507.421875 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(535.205078 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(574.414062 0.015625)"/>
+     <use xlink:href="#DejaVuSans-70" transform="translate(602.197266 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6c" transform="translate(665.673828 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(693.457031 0.015625)"/>
+     <use xlink:href="#DejaVuSans-63" transform="translate(721.240234 0.015625)"/>
+     <use xlink:href="#DejaVuSans-61" transform="translate(776.220703 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(837.5 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(876.708984 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6f" transform="translate(904.492188 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6e" transform="translate(965.673828 0.015625)"/>
+     <use xlink:href="#DejaVuSans-20" transform="translate(1029.052734 0.015625)"/>
+     <use xlink:href="#DejaVuSans-28" transform="translate(1060.839844 0.015625)"/>
+     <use xlink:href="#DejaVuSans-Oblique-6e" transform="translate(1099.853516 0.015625)"/>
+     <use xlink:href="#DejaVuSans-3d" transform="translate(1182.714844 0.015625)"/>
+     <use xlink:href="#DejaVuSans-31" transform="translate(1285.986328 0.015625)"/>
+     <use xlink:href="#DejaVuSans-39" transform="translate(1349.609375 0.015625)"/>
+     <use xlink:href="#DejaVuSans-32" transform="translate(1411.482422 0.015625)"/>
+     <use xlink:href="#DejaVuSans-30" transform="translate(1475.105469 0.015625)"/>
+     <use xlink:href="#DejaVuSans-29" transform="translate(1538.728516 0.015625)"/>
+    </g>
+   </g>
+  </g>
+ </g>
+ <defs>
+  <clipPath id="p4787e7a7d6">
+   <rect x="72" y="43.2" width="446.4" height="277.2"/>
+  </clipPath>
+ </defs>
+</svg>
diff --git a/content/english/hpc/algorithms/img/mm-kernel-plot.svg b/content/english/hpc/algorithms/img/mm-kernel-plot.svg
new file mode 100644
index 00000000..99f9315a
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-kernel-plot.svg
@@ -0,0 +1,1385 @@
+<?xml version="1.0" encoding="utf-8" standalone="no"?>
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
+  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns:xlink="http://www.w3.org/1999/xlink" width="576pt" height="360pt" viewBox="0 0 576 360" xmlns="http://www.w3.org/2000/svg" version="1.1">
+ <metadata>
+  <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
+   <cc:Work>
+    <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/>
+    <dc:date>2022-04-05T01:18:30.773700</dc:date>
+    <dc:format>image/svg+xml</dc:format>
+    <dc:creator>
+     <cc:Agent>
+      <dc:title>Matplotlib v3.5.1, https://matplotlib.org/</dc:title>
+     </cc:Agent>
+    </dc:creator>
+   </cc:Work>
+  </rdf:RDF>
+ </metadata>
+ <defs>
+  <style type="text/css">*{stroke-linejoin: round; stroke-linecap: butt}</style>
+ </defs>
+ <g id="figure_1">
+  <g id="patch_1">
+   <path d="M 0 360 
+L 576 360 
+L 576 0 
+L 0 0 
+z
+" style="fill: #ffffff"/>
+  </g>
+  <g id="axes_1">
+   <g id="patch_2">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+L 518.4 43.2 
+L 72 43.2 
+z
+" style="fill: #ffffff"/>
+   </g>
+   <g id="matplotlib.axis_1">
+    <g id="xtick_1">
+     <g id="line2d_1">
+      <path d="M 117.784615 320.4 
+L 117.784615 43.2 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_1">
+      <!-- 240 -->
+      <g style="fill: #262626" transform="translate(107.28649 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-32" d="M 1228 531 
+L 3431 531 
+L 3431 0 
+L 469 0 
+L 469 531 
+Q 828 903 1448 1529 
+Q 2069 2156 2228 2338 
+Q 2531 2678 2651 2914 
+Q 2772 3150 2772 3378 
+Q 2772 3750 2511 3984 
+Q 2250 4219 1831 4219 
+Q 1534 4219 1204 4116 
+Q 875 4013 500 3803 
+L 500 4441 
+Q 881 4594 1212 4672 
+Q 1544 4750 1819 4750 
+Q 2544 4750 2975 4387 
+Q 3406 4025 3406 3419 
+Q 3406 3131 3298 2873 
+Q 3191 2616 2906 2266 
+Q 2828 2175 2409 1742 
+Q 1991 1309 1228 531 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-34" d="M 2419 4116 
+L 825 1625 
+L 2419 1625 
+L 2419 4116 
+z
+M 2253 4666 
+L 3047 4666 
+L 3047 1625 
+L 3713 1625 
+L 3713 1100 
+L 3047 1100 
+L 3047 0 
+L 2419 0 
+L 2419 1100 
+L 313 1100 
+L 313 1709 
+L 2253 4666 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-30" d="M 2034 4250 
+Q 1547 4250 1301 3770 
+Q 1056 3291 1056 2328 
+Q 1056 1369 1301 889 
+Q 1547 409 2034 409 
+Q 2525 409 2770 889 
+Q 3016 1369 3016 2328 
+Q 3016 3291 2770 3770 
+Q 2525 4250 2034 4250 
+z
+M 2034 4750 
+Q 2819 4750 3233 4129 
+Q 3647 3509 3647 2328 
+Q 3647 1150 3233 529 
+Q 2819 -91 2034 -91 
+Q 1250 -91 836 529 
+Q 422 1150 422 2328 
+Q 422 3509 836 4129 
+Q 1250 4750 2034 4750 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-32"/>
+       <use xlink:href="#DejaVuSans-34" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_2">
+     <g id="line2d_2">
+      <path d="M 175.015385 320.4 
+L 175.015385 43.2 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_2">
+      <!-- 480 -->
+      <g style="fill: #262626" transform="translate(164.51726 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-38" d="M 2034 2216 
+Q 1584 2216 1326 1975 
+Q 1069 1734 1069 1313 
+Q 1069 891 1326 650 
+Q 1584 409 2034 409 
+Q 2484 409 2743 651 
+Q 3003 894 3003 1313 
+Q 3003 1734 2745 1975 
+Q 2488 2216 2034 2216 
+z
+M 1403 2484 
+Q 997 2584 770 2862 
+Q 544 3141 544 3541 
+Q 544 4100 942 4425 
+Q 1341 4750 2034 4750 
+Q 2731 4750 3128 4425 
+Q 3525 4100 3525 3541 
+Q 3525 3141 3298 2862 
+Q 3072 2584 2669 2484 
+Q 3125 2378 3379 2068 
+Q 3634 1759 3634 1313 
+Q 3634 634 3220 271 
+Q 2806 -91 2034 -91 
+Q 1263 -91 848 271 
+Q 434 634 434 1313 
+Q 434 1759 690 2068 
+Q 947 2378 1403 2484 
+z
+M 1172 3481 
+Q 1172 3119 1398 2916 
+Q 1625 2713 2034 2713 
+Q 2441 2713 2670 2916 
+Q 2900 3119 2900 3481 
+Q 2900 3844 2670 4047 
+Q 2441 4250 2034 4250 
+Q 1625 4250 1398 4047 
+Q 1172 3844 1172 3481 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-34"/>
+       <use xlink:href="#DejaVuSans-38" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_3">
+     <g id="line2d_3">
+      <path d="M 232.246154 320.4 
+L 232.246154 43.2 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_3">
+      <!-- 720 -->
+      <g style="fill: #262626" transform="translate(221.748029 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-37" d="M 525 4666 
+L 3525 4666 
+L 3525 4397 
+L 1831 0 
+L 1172 0 
+L 2766 4134 
+L 525 4134 
+L 525 4666 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-37"/>
+       <use xlink:href="#DejaVuSans-32" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_4">
+     <g id="line2d_4">
+      <path d="M 289.476923 320.4 
+L 289.476923 43.2 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_4">
+      <!-- 960 -->
+      <g style="fill: #262626" transform="translate(278.978798 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-39" d="M 703 97 
+L 703 672 
+Q 941 559 1184 500 
+Q 1428 441 1663 441 
+Q 2288 441 2617 861 
+Q 2947 1281 2994 2138 
+Q 2813 1869 2534 1725 
+Q 2256 1581 1919 1581 
+Q 1219 1581 811 2004 
+Q 403 2428 403 3163 
+Q 403 3881 828 4315 
+Q 1253 4750 1959 4750 
+Q 2769 4750 3195 4129 
+Q 3622 3509 3622 2328 
+Q 3622 1225 3098 567 
+Q 2575 -91 1691 -91 
+Q 1453 -91 1209 -44 
+Q 966 3 703 97 
+z
+M 1959 2075 
+Q 2384 2075 2632 2365 
+Q 2881 2656 2881 3163 
+Q 2881 3666 2632 3958 
+Q 2384 4250 1959 4250 
+Q 1534 4250 1286 3958 
+Q 1038 3666 1038 3163 
+Q 1038 2656 1286 2365 
+Q 1534 2075 1959 2075 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-36" d="M 2113 2584 
+Q 1688 2584 1439 2293 
+Q 1191 2003 1191 1497 
+Q 1191 994 1439 701 
+Q 1688 409 2113 409 
+Q 2538 409 2786 701 
+Q 3034 994 3034 1497 
+Q 3034 2003 2786 2293 
+Q 2538 2584 2113 2584 
+z
+M 3366 4563 
+L 3366 3988 
+Q 3128 4100 2886 4159 
+Q 2644 4219 2406 4219 
+Q 1781 4219 1451 3797 
+Q 1122 3375 1075 2522 
+Q 1259 2794 1537 2939 
+Q 1816 3084 2150 3084 
+Q 2853 3084 3261 2657 
+Q 3669 2231 3669 1497 
+Q 3669 778 3244 343 
+Q 2819 -91 2113 -91 
+Q 1303 -91 875 529 
+Q 447 1150 447 2328 
+Q 447 3434 972 4092 
+Q 1497 4750 2381 4750 
+Q 2619 4750 2861 4703 
+Q 3103 4656 3366 4563 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-39"/>
+       <use xlink:href="#DejaVuSans-36" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_5">
+     <g id="line2d_5">
+      <path d="M 346.707692 320.4 
+L 346.707692 43.2 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_5">
+      <!-- 1200 -->
+      <g style="fill: #262626" transform="translate(332.710192 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-31" d="M 794 531 
+L 1825 531 
+L 1825 4091 
+L 703 3866 
+L 703 4441 
+L 1819 4666 
+L 2450 4666 
+L 2450 531 
+L 3481 531 
+L 3481 0 
+L 794 0 
+L 794 531 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-32" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+       <use xlink:href="#DejaVuSans-30" x="190.869141"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_6">
+     <g id="line2d_6">
+      <path d="M 403.938462 320.4 
+L 403.938462 43.2 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_6">
+      <!-- 1440 -->
+      <g style="fill: #262626" transform="translate(389.940962 338.258281)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-34" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-34" x="127.246094"/>
+       <use xlink:href="#DejaVuSans-30" x="190.869141"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_7">
+     <g id="line2d_7">
+      <path d="M 461.169231 320.4 
+L 461.169231 43.2 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_7">
+      <!-- 1680 -->
+      <g style="fill: #262626" transform="translate(447.171731 338.258281)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-36" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-38" x="127.246094"/>
+       <use xlink:href="#DejaVuSans-30" x="190.869141"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_8">
+     <g id="line2d_8">
+      <path d="M 518.4 320.4 
+L 518.4 43.2 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_8">
+      <!-- 1920 -->
+      <g style="fill: #262626" transform="translate(504.4025 338.258281)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-39" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-32" x="127.246094"/>
+       <use xlink:href="#DejaVuSans-30" x="190.869141"/>
+      </g>
+     </g>
+    </g>
+    <g id="text_9">
+     <!-- Matrix size ($n \times n$) -->
+     <g style="fill: #262626" transform="translate(241.2 353.664062)scale(0.12 -0.12)">
+      <defs>
+       <path id="DejaVuSans-4d" d="M 628 4666 
+L 1569 4666 
+L 2759 1491 
+L 3956 4666 
+L 4897 4666 
+L 4897 0 
+L 4281 0 
+L 4281 4097 
+L 3078 897 
+L 2444 897 
+L 1241 4097 
+L 1241 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-61" d="M 2194 1759 
+Q 1497 1759 1228 1600 
+Q 959 1441 959 1056 
+Q 959 750 1161 570 
+Q 1363 391 1709 391 
+Q 2188 391 2477 730 
+Q 2766 1069 2766 1631 
+L 2766 1759 
+L 2194 1759 
+z
+M 3341 1997 
+L 3341 0 
+L 2766 0 
+L 2766 531 
+Q 2569 213 2275 61 
+Q 1981 -91 1556 -91 
+Q 1019 -91 701 211 
+Q 384 513 384 1019 
+Q 384 1609 779 1909 
+Q 1175 2209 1959 2209 
+L 2766 2209 
+L 2766 2266 
+Q 2766 2663 2505 2880 
+Q 2244 3097 1772 3097 
+Q 1472 3097 1187 3025 
+Q 903 2953 641 2809 
+L 641 3341 
+Q 956 3463 1253 3523 
+Q 1550 3584 1831 3584 
+Q 2591 3584 2966 3190 
+Q 3341 2797 3341 1997 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-74" d="M 1172 4494 
+L 1172 3500 
+L 2356 3500 
+L 2356 3053 
+L 1172 3053 
+L 1172 1153 
+Q 1172 725 1289 603 
+Q 1406 481 1766 481 
+L 2356 481 
+L 2356 0 
+L 1766 0 
+Q 1100 0 847 248 
+Q 594 497 594 1153 
+L 594 3053 
+L 172 3053 
+L 172 3500 
+L 594 3500 
+L 594 4494 
+L 1172 4494 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-72" d="M 2631 2963 
+Q 2534 3019 2420 3045 
+Q 2306 3072 2169 3072 
+Q 1681 3072 1420 2755 
+Q 1159 2438 1159 1844 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1341 3275 1631 3429 
+Q 1922 3584 2338 3584 
+Q 2397 3584 2469 3576 
+Q 2541 3569 2628 3553 
+L 2631 2963 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-69" d="M 603 3500 
+L 1178 3500 
+L 1178 0 
+L 603 0 
+L 603 3500 
+z
+M 603 4863 
+L 1178 4863 
+L 1178 4134 
+L 603 4134 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-78" d="M 3513 3500 
+L 2247 1797 
+L 3578 0 
+L 2900 0 
+L 1881 1375 
+L 863 0 
+L 184 0 
+L 1544 1831 
+L 300 3500 
+L 978 3500 
+L 1906 2253 
+L 2834 3500 
+L 3513 3500 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-20" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-73" d="M 2834 3397 
+L 2834 2853 
+Q 2591 2978 2328 3040 
+Q 2066 3103 1784 3103 
+Q 1356 3103 1142 2972 
+Q 928 2841 928 2578 
+Q 928 2378 1081 2264 
+Q 1234 2150 1697 2047 
+L 1894 2003 
+Q 2506 1872 2764 1633 
+Q 3022 1394 3022 966 
+Q 3022 478 2636 193 
+Q 2250 -91 1575 -91 
+Q 1294 -91 989 -36 
+Q 684 19 347 128 
+L 347 722 
+Q 666 556 975 473 
+Q 1284 391 1588 391 
+Q 1994 391 2212 530 
+Q 2431 669 2431 922 
+Q 2431 1156 2273 1281 
+Q 2116 1406 1581 1522 
+L 1381 1569 
+Q 847 1681 609 1914 
+Q 372 2147 372 2553 
+Q 372 3047 722 3315 
+Q 1072 3584 1716 3584 
+Q 2034 3584 2315 3537 
+Q 2597 3491 2834 3397 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-7a" d="M 353 3500 
+L 3084 3500 
+L 3084 2975 
+L 922 459 
+L 3084 459 
+L 3084 0 
+L 275 0 
+L 275 525 
+L 2438 3041 
+L 353 3041 
+L 353 3500 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-65" d="M 3597 1894 
+L 3597 1613 
+L 953 1613 
+Q 991 1019 1311 708 
+Q 1631 397 2203 397 
+Q 2534 397 2845 478 
+Q 3156 559 3463 722 
+L 3463 178 
+Q 3153 47 2828 -22 
+Q 2503 -91 2169 -91 
+Q 1331 -91 842 396 
+Q 353 884 353 1716 
+Q 353 2575 817 3079 
+Q 1281 3584 2069 3584 
+Q 2775 3584 3186 3129 
+Q 3597 2675 3597 1894 
+z
+M 3022 2063 
+Q 3016 2534 2758 2815 
+Q 2500 3097 2075 3097 
+Q 1594 3097 1305 2825 
+Q 1016 2553 972 2059 
+L 3022 2063 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-28" d="M 1984 4856 
+Q 1566 4138 1362 3434 
+Q 1159 2731 1159 2009 
+Q 1159 1288 1364 580 
+Q 1569 -128 1984 -844 
+L 1484 -844 
+Q 1016 -109 783 600 
+Q 550 1309 550 2009 
+Q 550 2706 781 3412 
+Q 1013 4119 1484 4856 
+L 1984 4856 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-Oblique-6e" d="M 3566 2113 
+L 3156 0 
+L 2578 0 
+L 2988 2091 
+Q 3016 2238 3031 2350 
+Q 3047 2463 3047 2528 
+Q 3047 2791 2881 2937 
+Q 2716 3084 2419 3084 
+Q 1956 3084 1622 2776 
+Q 1288 2469 1184 1941 
+L 800 0 
+L 225 0 
+L 903 3500 
+L 1478 3500 
+L 1363 2950 
+Q 1603 3253 1940 3418 
+Q 2278 3584 2650 3584 
+Q 3113 3584 3367 3334 
+Q 3622 3084 3622 2631 
+Q 3622 2519 3608 2391 
+Q 3594 2263 3566 2113 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-d7" d="M 4488 3438 
+L 3059 2003 
+L 4488 575 
+L 4116 197 
+L 2681 1631 
+L 1247 197 
+L 878 575 
+L 2303 2003 
+L 878 3438 
+L 1247 3816 
+L 2681 2381 
+L 4116 3816 
+L 4488 3438 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-29" d="M 513 4856 
+L 1013 4856 
+Q 1481 4119 1714 3412 
+Q 1947 2706 1947 2009 
+Q 1947 1309 1714 600 
+Q 1481 -109 1013 -844 
+L 513 -844 
+Q 928 -128 1133 580 
+Q 1338 1288 1338 2009 
+Q 1338 2731 1133 3434 
+Q 928 4138 513 4856 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-4d" transform="translate(0 0.015625)"/>
+      <use xlink:href="#DejaVuSans-61" transform="translate(86.279297 0.015625)"/>
+      <use xlink:href="#DejaVuSans-74" transform="translate(147.558594 0.015625)"/>
+      <use xlink:href="#DejaVuSans-72" transform="translate(186.767578 0.015625)"/>
+      <use xlink:href="#DejaVuSans-69" transform="translate(227.880859 0.015625)"/>
+      <use xlink:href="#DejaVuSans-78" transform="translate(255.664062 0.015625)"/>
+      <use xlink:href="#DejaVuSans-20" transform="translate(314.84375 0.015625)"/>
+      <use xlink:href="#DejaVuSans-73" transform="translate(346.630859 0.015625)"/>
+      <use xlink:href="#DejaVuSans-69" transform="translate(398.730469 0.015625)"/>
+      <use xlink:href="#DejaVuSans-7a" transform="translate(426.513672 0.015625)"/>
+      <use xlink:href="#DejaVuSans-65" transform="translate(479.003906 0.015625)"/>
+      <use xlink:href="#DejaVuSans-20" transform="translate(540.527344 0.015625)"/>
+      <use xlink:href="#DejaVuSans-28" transform="translate(572.314453 0.015625)"/>
+      <use xlink:href="#DejaVuSans-Oblique-6e" transform="translate(611.328125 0.015625)"/>
+      <use xlink:href="#DejaVuSans-d7" transform="translate(694.189453 0.015625)"/>
+      <use xlink:href="#DejaVuSans-Oblique-6e" transform="translate(797.460938 0.015625)"/>
+      <use xlink:href="#DejaVuSans-29" transform="translate(860.839844 0.015625)"/>
+     </g>
+    </g>
+   </g>
+   <g id="matplotlib.axis_2">
+    <g id="ytick_1">
+     <g id="line2d_9">
+      <path d="M 72 320.4 
+L 518.4 320.4 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_10">
+      <!-- 0 -->
+      <g style="fill: #262626" transform="translate(55.50125 324.579141)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-30"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_2">
+     <g id="line2d_10">
+      <path d="M 72 262.193219 
+L 518.4 262.193219 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_11">
+      <!-- 5 -->
+      <g style="fill: #262626" transform="translate(55.50125 266.37236)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-35" d="M 691 4666 
+L 3169 4666 
+L 3169 4134 
+L 1269 4134 
+L 1269 2991 
+Q 1406 3038 1543 3061 
+Q 1681 3084 1819 3084 
+Q 2600 3084 3056 2656 
+Q 3513 2228 3513 1497 
+Q 3513 744 3044 326 
+Q 2575 -91 1722 -91 
+Q 1428 -91 1123 -41 
+Q 819 9 494 109 
+L 494 744 
+Q 775 591 1075 516 
+Q 1375 441 1709 441 
+Q 2250 441 2565 725 
+Q 2881 1009 2881 1497 
+Q 2881 1984 2565 2268 
+Q 2250 2553 1709 2553 
+Q 1456 2553 1204 2497 
+Q 953 2441 691 2322 
+L 691 4666 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-35"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_3">
+     <g id="line2d_11">
+      <path d="M 72 203.986439 
+L 518.4 203.986439 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_12">
+      <!-- 10 -->
+      <g style="fill: #262626" transform="translate(48.5025 208.165579)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-30" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_4">
+     <g id="line2d_12">
+      <path d="M 72 145.779658 
+L 518.4 145.779658 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_13">
+      <!-- 15 -->
+      <g style="fill: #262626" transform="translate(48.5025 149.958799)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-35" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_5">
+     <g id="line2d_13">
+      <path d="M 72 87.572878 
+L 518.4 87.572878 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_14">
+      <!-- 20 -->
+      <g style="fill: #262626" transform="translate(48.5025 91.752018)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-32"/>
+       <use xlink:href="#DejaVuSans-30" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="text_15">
+     <!-- GFLOPS -->
+     <g style="fill: #262626" transform="translate(42.006875 205.175625)rotate(-90)scale(0.12 -0.12)">
+      <defs>
+       <path id="DejaVuSans-47" d="M 3809 666 
+L 3809 1919 
+L 2778 1919 
+L 2778 2438 
+L 4434 2438 
+L 4434 434 
+Q 4069 175 3628 42 
+Q 3188 -91 2688 -91 
+Q 1594 -91 976 548 
+Q 359 1188 359 2328 
+Q 359 3472 976 4111 
+Q 1594 4750 2688 4750 
+Q 3144 4750 3555 4637 
+Q 3966 4525 4313 4306 
+L 4313 3634 
+Q 3963 3931 3569 4081 
+Q 3175 4231 2741 4231 
+Q 1884 4231 1454 3753 
+Q 1025 3275 1025 2328 
+Q 1025 1384 1454 906 
+Q 1884 428 2741 428 
+Q 3075 428 3337 486 
+Q 3600 544 3809 666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-46" d="M 628 4666 
+L 3309 4666 
+L 3309 4134 
+L 1259 4134 
+L 1259 2759 
+L 3109 2759 
+L 3109 2228 
+L 1259 2228 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4c" d="M 628 4666 
+L 1259 4666 
+L 1259 531 
+L 3531 531 
+L 3531 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4f" d="M 2522 4238 
+Q 1834 4238 1429 3725 
+Q 1025 3213 1025 2328 
+Q 1025 1447 1429 934 
+Q 1834 422 2522 422 
+Q 3209 422 3611 934 
+Q 4013 1447 4013 2328 
+Q 4013 3213 3611 3725 
+Q 3209 4238 2522 4238 
+z
+M 2522 4750 
+Q 3503 4750 4090 4092 
+Q 4678 3434 4678 2328 
+Q 4678 1225 4090 567 
+Q 3503 -91 2522 -91 
+Q 1538 -91 948 565 
+Q 359 1222 359 2328 
+Q 359 3434 948 4092 
+Q 1538 4750 2522 4750 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-50" d="M 1259 4147 
+L 1259 2394 
+L 2053 2394 
+Q 2494 2394 2734 2622 
+Q 2975 2850 2975 3272 
+Q 2975 3691 2734 3919 
+Q 2494 4147 2053 4147 
+L 1259 4147 
+z
+M 628 4666 
+L 2053 4666 
+Q 2838 4666 3239 4311 
+Q 3641 3956 3641 3272 
+Q 3641 2581 3239 2228 
+Q 2838 1875 2053 1875 
+L 1259 1875 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-53" d="M 3425 4513 
+L 3425 3897 
+Q 3066 4069 2747 4153 
+Q 2428 4238 2131 4238 
+Q 1616 4238 1336 4038 
+Q 1056 3838 1056 3469 
+Q 1056 3159 1242 3001 
+Q 1428 2844 1947 2747 
+L 2328 2669 
+Q 3034 2534 3370 2195 
+Q 3706 1856 3706 1288 
+Q 3706 609 3251 259 
+Q 2797 -91 1919 -91 
+Q 1588 -91 1214 -16 
+Q 841 59 441 206 
+L 441 856 
+Q 825 641 1194 531 
+Q 1563 422 1919 422 
+Q 2459 422 2753 634 
+Q 3047 847 3047 1241 
+Q 3047 1584 2836 1778 
+Q 2625 1972 2144 2069 
+L 1759 2144 
+Q 1053 2284 737 2584 
+Q 422 2884 422 3419 
+Q 422 4038 858 4394 
+Q 1294 4750 2059 4750 
+Q 2388 4750 2728 4690 
+Q 3069 4631 3425 4513 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-47"/>
+      <use xlink:href="#DejaVuSans-46" x="77.490234"/>
+      <use xlink:href="#DejaVuSans-4c" x="135.009766"/>
+      <use xlink:href="#DejaVuSans-4f" x="187.097656"/>
+      <use xlink:href="#DejaVuSans-50" x="265.808594"/>
+      <use xlink:href="#DejaVuSans-53" x="326.111328"/>
+     </g>
+    </g>
+   </g>
+   <g id="line2d_14">
+    <path d="M 72 309.671326 
+L 83.446154 311.498075 
+L 94.892308 312.017819 
+L 106.338462 312.236301 
+L 117.784615 312.38036 
+L 129.230769 312.50921 
+L 140.676923 312.532968 
+L 152.123077 312.600467 
+L 163.569231 312.647151 
+L 175.015385 312.627851 
+L 186.461538 312.67057 
+L 197.907692 312.701045 
+L 209.353846 312.717554 
+L 220.8 312.708816 
+L 232.246154 312.710882 
+L 243.692308 315.407713 
+L 255.138462 312.728222 
+L 266.584615 312.801696 
+L 278.030769 313.848938 
+L 289.476923 313.911222 
+L 300.923077 313.705591 
+L 312.369231 313.756657 
+L 323.815385 313.596682 
+L 335.261538 315.165161 
+L 346.707692 313.8018 
+L 358.153846 313.675801 
+L 369.6 313.837237 
+L 381.046154 313.44832 
+L 392.492308 313.911027 
+L 403.938462 313.481161 
+L 415.384615 313.836402 
+L 426.830769 318.50093 
+L 438.276923 313.158347 
+L 449.723077 314.031662 
+L 461.169231 313.386046 
+L 472.615385 313.784265 
+L 484.061538 313.695596 
+L 495.507692 313.710558 
+L 506.953846 313.419903 
+L 518.4 315.480792 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #4c72b0; stroke-width: 1.5; stroke-linecap: round"/>
+   </g>
+   <g id="line2d_15">
+    <path d="M 72 310.100473 
+L 83.446154 311.723061 
+L 94.892308 312.164771 
+L 106.338462 312.369963 
+L 117.784615 312.46459 
+L 129.230769 312.566557 
+L 140.676923 312.627815 
+L 152.123077 312.735949 
+L 163.569231 312.765275 
+L 175.015385 312.823565 
+L 186.461538 312.89925 
+L 197.907692 312.923884 
+L 209.353846 312.952443 
+L 220.8 313.044405 
+L 232.246154 313.118095 
+L 243.692308 313.154137 
+L 255.138462 313.397423 
+L 266.584615 313.377379 
+L 278.030769 313.286294 
+L 289.476923 313.310017 
+L 300.923077 313.404439 
+L 312.369231 313.41027 
+L 323.815385 313.35796 
+L 335.261538 313.469655 
+L 346.707692 313.731341 
+L 358.153846 313.628065 
+L 369.6 313.918748 
+L 381.046154 313.780767 
+L 392.492308 313.762085 
+L 403.938462 313.712702 
+L 415.384615 313.680426 
+L 426.830769 313.69078 
+L 438.276923 313.655243 
+L 449.723077 313.801 
+L 461.169231 313.856531 
+L 472.615385 313.749041 
+L 484.061538 313.756204 
+L 495.507692 313.765085 
+L 506.953846 313.733048 
+L 518.4 313.742958 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #dd8452; stroke-width: 1.5; stroke-linecap: round"/>
+   </g>
+   <g id="line2d_16">
+    <path d="M 72 270.883044 
+L 83.446154 250.335192 
+L 94.892308 241.217988 
+L 106.338462 239.934947 
+L 117.784615 240.533942 
+L 129.230769 240.282678 
+L 140.676923 243.292926 
+L 152.123077 245.477322 
+L 163.569231 246.74824 
+L 175.015385 248.423657 
+L 186.461538 251.925124 
+L 197.907692 253.275787 
+L 209.353846 256.016235 
+L 220.8 256.125166 
+L 232.246154 260.496327 
+L 243.692308 265.344168 
+L 255.138462 269.443769 
+L 266.584615 275.716187 
+L 278.030769 279.160382 
+L 289.476923 284.835335 
+L 300.923077 287.083554 
+L 312.369231 289.513765 
+L 323.815385 291.436354 
+L 335.261538 292.78857 
+L 346.707692 292.867424 
+L 358.153846 292.906062 
+L 369.6 292.525449 
+L 381.046154 292.674721 
+L 392.492308 294.161419 
+L 403.938462 293.327772 
+L 415.384615 294.184601 
+L 426.830769 294.107848 
+L 438.276923 293.676165 
+L 449.723077 293.709604 
+L 461.169231 294.153819 
+L 472.615385 294.084407 
+L 484.061538 293.36601 
+L 495.507692 292.413419 
+L 506.953846 293.865577 
+L 518.4 293.967362 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #55a868; stroke-width: 1.5; stroke-linecap: round"/>
+   </g>
+   <g id="line2d_17">
+    <path d="M 72 136.479878 
+L 83.446154 56.309568 
+L 94.892308 113.489862 
+L 106.338462 116.952556 
+L 117.784615 119.488506 
+L 129.230769 129.668021 
+L 140.676923 134.074172 
+L 152.123077 140.348179 
+L 163.569231 146.208392 
+L 175.015385 148.580855 
+L 186.461538 145.687135 
+L 197.907692 153.104332 
+L 209.353846 155.817924 
+L 220.8 166.367399 
+L 232.246154 186.465197 
+L 243.692308 212.969007 
+L 255.138462 228.798185 
+L 266.584615 242.299424 
+L 278.030769 252.946887 
+L 289.476923 264.280352 
+L 300.923077 260.045711 
+L 312.369231 246.46932 
+L 323.815385 252.022952 
+L 335.261538 275.665034 
+L 346.707692 256.224693 
+L 358.153846 260.129384 
+L 369.6 255.160551 
+L 381.046154 252.305103 
+L 392.492308 261.313163 
+L 403.938462 260.210566 
+L 415.384615 260.971397 
+L 426.830769 276.441432 
+L 438.276923 260.644667 
+L 449.723077 258.857527 
+L 461.169231 253.939829 
+L 472.615385 252.91023 
+L 484.061538 258.367244 
+L 495.507692 267.230819 
+L 506.953846 271.063016 
+L 518.4 283.659277 
+" clip-path="url(#p1185134d18)" style="fill: none; stroke: #c44e52; stroke-width: 1.5; stroke-linecap: round"/>
+   </g>
+   <g id="patch_3">
+    <path d="M 72 320.4 
+L 72 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_4">
+    <path d="M 518.4 320.4 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_5">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_6">
+    <path d="M 72 43.2 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="text_16">
+    <!-- Matrix multiplication -->
+    <g style="fill: #262626" transform="translate(223.167812 23.2)scale(0.14 -0.14)">
+     <defs>
+      <path id="DejaVuSans-6d" d="M 3328 2828 
+Q 3544 3216 3844 3400 
+Q 4144 3584 4550 3584 
+Q 5097 3584 5394 3201 
+Q 5691 2819 5691 2113 
+L 5691 0 
+L 5113 0 
+L 5113 2094 
+Q 5113 2597 4934 2840 
+Q 4756 3084 4391 3084 
+Q 3944 3084 3684 2787 
+Q 3425 2491 3425 1978 
+L 3425 0 
+L 2847 0 
+L 2847 2094 
+Q 2847 2600 2669 2842 
+Q 2491 3084 2119 3084 
+Q 1678 3084 1418 2786 
+Q 1159 2488 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1356 3278 1631 3431 
+Q 1906 3584 2284 3584 
+Q 2666 3584 2933 3390 
+Q 3200 3197 3328 2828 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-75" d="M 544 1381 
+L 544 3500 
+L 1119 3500 
+L 1119 1403 
+Q 1119 906 1312 657 
+Q 1506 409 1894 409 
+Q 2359 409 2629 706 
+Q 2900 1003 2900 1516 
+L 2900 3500 
+L 3475 3500 
+L 3475 0 
+L 2900 0 
+L 2900 538 
+Q 2691 219 2414 64 
+Q 2138 -91 1772 -91 
+Q 1169 -91 856 284 
+Q 544 659 544 1381 
+z
+M 1991 3584 
+L 1991 3584 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6c" d="M 603 4863 
+L 1178 4863 
+L 1178 0 
+L 603 0 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-70" d="M 1159 525 
+L 1159 -1331 
+L 581 -1331 
+L 581 3500 
+L 1159 3500 
+L 1159 2969 
+Q 1341 3281 1617 3432 
+Q 1894 3584 2278 3584 
+Q 2916 3584 3314 3078 
+Q 3713 2572 3713 1747 
+Q 3713 922 3314 415 
+Q 2916 -91 2278 -91 
+Q 1894 -91 1617 61 
+Q 1341 213 1159 525 
+z
+M 3116 1747 
+Q 3116 2381 2855 2742 
+Q 2594 3103 2138 3103 
+Q 1681 3103 1420 2742 
+Q 1159 2381 1159 1747 
+Q 1159 1113 1420 752 
+Q 1681 391 2138 391 
+Q 2594 391 2855 752 
+Q 3116 1113 3116 1747 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-63" d="M 3122 3366 
+L 3122 2828 
+Q 2878 2963 2633 3030 
+Q 2388 3097 2138 3097 
+Q 1578 3097 1268 2742 
+Q 959 2388 959 1747 
+Q 959 1106 1268 751 
+Q 1578 397 2138 397 
+Q 2388 397 2633 464 
+Q 2878 531 3122 666 
+L 3122 134 
+Q 2881 22 2623 -34 
+Q 2366 -91 2075 -91 
+Q 1284 -91 818 406 
+Q 353 903 353 1747 
+Q 353 2603 823 3093 
+Q 1294 3584 2113 3584 
+Q 2378 3584 2631 3529 
+Q 2884 3475 3122 3366 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6f" d="M 1959 3097 
+Q 1497 3097 1228 2736 
+Q 959 2375 959 1747 
+Q 959 1119 1226 758 
+Q 1494 397 1959 397 
+Q 2419 397 2687 759 
+Q 2956 1122 2956 1747 
+Q 2956 2369 2687 2733 
+Q 2419 3097 1959 3097 
+z
+M 1959 3584 
+Q 2709 3584 3137 3096 
+Q 3566 2609 3566 1747 
+Q 3566 888 3137 398 
+Q 2709 -91 1959 -91 
+Q 1206 -91 779 398 
+Q 353 888 353 1747 
+Q 353 2609 779 3096 
+Q 1206 3584 1959 3584 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6e" d="M 3513 2113 
+L 3513 0 
+L 2938 0 
+L 2938 2094 
+Q 2938 2591 2744 2837 
+Q 2550 3084 2163 3084 
+Q 1697 3084 1428 2787 
+Q 1159 2491 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1366 3272 1645 3428 
+Q 1925 3584 2291 3584 
+Q 2894 3584 3203 3211 
+Q 3513 2838 3513 2113 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-4d"/>
+     <use xlink:href="#DejaVuSans-61" x="86.279297"/>
+     <use xlink:href="#DejaVuSans-74" x="147.558594"/>
+     <use xlink:href="#DejaVuSans-72" x="186.767578"/>
+     <use xlink:href="#DejaVuSans-69" x="227.880859"/>
+     <use xlink:href="#DejaVuSans-78" x="255.664062"/>
+     <use xlink:href="#DejaVuSans-20" x="314.84375"/>
+     <use xlink:href="#DejaVuSans-6d" x="346.630859"/>
+     <use xlink:href="#DejaVuSans-75" x="444.042969"/>
+     <use xlink:href="#DejaVuSans-6c" x="507.421875"/>
+     <use xlink:href="#DejaVuSans-74" x="535.205078"/>
+     <use xlink:href="#DejaVuSans-69" x="574.414062"/>
+     <use xlink:href="#DejaVuSans-70" x="602.197266"/>
+     <use xlink:href="#DejaVuSans-6c" x="665.673828"/>
+     <use xlink:href="#DejaVuSans-69" x="693.457031"/>
+     <use xlink:href="#DejaVuSans-63" x="721.240234"/>
+     <use xlink:href="#DejaVuSans-61" x="776.220703"/>
+     <use xlink:href="#DejaVuSans-74" x="837.5"/>
+     <use xlink:href="#DejaVuSans-69" x="876.708984"/>
+     <use xlink:href="#DejaVuSans-6f" x="904.492188"/>
+     <use xlink:href="#DejaVuSans-6e" x="965.673828"/>
+    </g>
+   </g>
+   <g id="legend_1">
+    <g id="patch_7">
+     <path d="M 414.027187 116.58375 
+L 510.7 116.58375 
+Q 512.9 116.58375 512.9 114.38375 
+L 512.9 50.9 
+Q 512.9 48.7 510.7 48.7 
+L 414.027187 48.7 
+Q 411.827187 48.7 411.827187 50.9 
+L 411.827187 114.38375 
+Q 411.827187 116.58375 414.027187 116.58375 
+z
+" style="fill: #ffffff; opacity: 0.8; stroke: #cccccc; stroke-linejoin: miter"/>
+    </g>
+    <g id="line2d_18">
+     <path d="M 416.227187 57.608281 
+L 427.227187 57.608281 
+L 438.227187 57.608281 
+" style="fill: none; stroke: #4c72b0; stroke-width: 1.5; stroke-linecap: round"/>
+    </g>
+    <g id="text_17">
+     <!-- naive -->
+     <g style="fill: #262626" transform="translate(447.027187 61.458281)scale(0.11 -0.11)">
+      <defs>
+       <path id="DejaVuSans-76" d="M 191 3500 
+L 800 3500 
+L 1894 563 
+L 2988 3500 
+L 3597 3500 
+L 2284 0 
+L 1503 0 
+L 191 3500 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-6e"/>
+      <use xlink:href="#DejaVuSans-61" x="63.378906"/>
+      <use xlink:href="#DejaVuSans-69" x="124.658203"/>
+      <use xlink:href="#DejaVuSans-76" x="152.441406"/>
+      <use xlink:href="#DejaVuSans-65" x="211.621094"/>
+     </g>
+    </g>
+    <g id="line2d_19">
+     <path d="M 416.227187 73.754219 
+L 427.227187 73.754219 
+L 438.227187 73.754219 
+" style="fill: none; stroke: #dd8452; stroke-width: 1.5; stroke-linecap: round"/>
+    </g>
+    <g id="text_18">
+     <!-- transposed -->
+     <g style="fill: #262626" transform="translate(447.027187 77.604219)scale(0.11 -0.11)">
+      <defs>
+       <path id="DejaVuSans-64" d="M 2906 2969 
+L 2906 4863 
+L 3481 4863 
+L 3481 0 
+L 2906 0 
+L 2906 525 
+Q 2725 213 2448 61 
+Q 2172 -91 1784 -91 
+Q 1150 -91 751 415 
+Q 353 922 353 1747 
+Q 353 2572 751 3078 
+Q 1150 3584 1784 3584 
+Q 2172 3584 2448 3432 
+Q 2725 3281 2906 2969 
+z
+M 947 1747 
+Q 947 1113 1208 752 
+Q 1469 391 1925 391 
+Q 2381 391 2643 752 
+Q 2906 1113 2906 1747 
+Q 2906 2381 2643 2742 
+Q 2381 3103 1925 3103 
+Q 1469 3103 1208 2742 
+Q 947 2381 947 1747 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-74"/>
+      <use xlink:href="#DejaVuSans-72" x="39.208984"/>
+      <use xlink:href="#DejaVuSans-61" x="80.322266"/>
+      <use xlink:href="#DejaVuSans-6e" x="141.601562"/>
+      <use xlink:href="#DejaVuSans-73" x="204.980469"/>
+      <use xlink:href="#DejaVuSans-70" x="257.080078"/>
+      <use xlink:href="#DejaVuSans-6f" x="320.556641"/>
+      <use xlink:href="#DejaVuSans-73" x="381.738281"/>
+      <use xlink:href="#DejaVuSans-65" x="433.837891"/>
+      <use xlink:href="#DejaVuSans-64" x="495.361328"/>
+     </g>
+    </g>
+    <g id="line2d_20">
+     <path d="M 416.227187 89.900156 
+L 427.227187 89.900156 
+L 438.227187 89.900156 
+" style="fill: none; stroke: #55a868; stroke-width: 1.5; stroke-linecap: round"/>
+    </g>
+    <g id="text_19">
+     <!-- vectorized -->
+     <g style="fill: #262626" transform="translate(447.027187 93.750156)scale(0.11 -0.11)">
+      <use xlink:href="#DejaVuSans-76"/>
+      <use xlink:href="#DejaVuSans-65" x="59.179688"/>
+      <use xlink:href="#DejaVuSans-63" x="120.703125"/>
+      <use xlink:href="#DejaVuSans-74" x="175.683594"/>
+      <use xlink:href="#DejaVuSans-6f" x="214.892578"/>
+      <use xlink:href="#DejaVuSans-72" x="276.074219"/>
+      <use xlink:href="#DejaVuSans-69" x="317.1875"/>
+      <use xlink:href="#DejaVuSans-7a" x="344.970703"/>
+      <use xlink:href="#DejaVuSans-65" x="397.460938"/>
+      <use xlink:href="#DejaVuSans-64" x="458.984375"/>
+     </g>
+    </g>
+    <g id="line2d_21">
+     <path d="M 416.227187 106.046094 
+L 427.227187 106.046094 
+L 438.227187 106.046094 
+" style="fill: none; stroke: #c44e52; stroke-width: 1.5; stroke-linecap: round"/>
+    </g>
+    <g id="text_20">
+     <!-- kernel -->
+     <g style="fill: #262626" transform="translate(447.027187 109.896094)scale(0.11 -0.11)">
+      <defs>
+       <path id="DejaVuSans-6b" d="M 581 4863 
+L 1159 4863 
+L 1159 1991 
+L 2875 3500 
+L 3609 3500 
+L 1753 1863 
+L 3688 0 
+L 2938 0 
+L 1159 1709 
+L 1159 0 
+L 581 0 
+L 581 4863 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-6b"/>
+      <use xlink:href="#DejaVuSans-65" x="54.285156"/>
+      <use xlink:href="#DejaVuSans-72" x="115.808594"/>
+      <use xlink:href="#DejaVuSans-6e" x="155.171875"/>
+      <use xlink:href="#DejaVuSans-65" x="218.550781"/>
+      <use xlink:href="#DejaVuSans-6c" x="280.074219"/>
+     </g>
+    </g>
+   </g>
+  </g>
+ </g>
+ <defs>
+  <clipPath id="p1185134d18">
+   <rect x="72" y="43.2" width="446.4" height="277.2"/>
+  </clipPath>
+ </defs>
+</svg>
diff --git a/content/english/hpc/algorithms/img/mm-noalloc.svg b/content/english/hpc/algorithms/img/mm-noalloc.svg
new file mode 100644
index 00000000..a4911ea0
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-noalloc.svg
@@ -0,0 +1,1344 @@
+<?xml version="1.0" encoding="utf-8" standalone="no"?>
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
+  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns:xlink="http://www.w3.org/1999/xlink" width="576pt" height="360pt" viewBox="0 0 576 360" xmlns="http://www.w3.org/2000/svg" version="1.1">
+ <metadata>
+  <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
+   <cc:Work>
+    <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/>
+    <dc:date>2022-04-05T01:19:35.314892</dc:date>
+    <dc:format>image/svg+xml</dc:format>
+    <dc:creator>
+     <cc:Agent>
+      <dc:title>Matplotlib v3.5.1, https://matplotlib.org/</dc:title>
+     </cc:Agent>
+    </dc:creator>
+   </cc:Work>
+  </rdf:RDF>
+ </metadata>
+ <defs>
+  <style type="text/css">*{stroke-linejoin: round; stroke-linecap: butt}</style>
+ </defs>
+ <g id="figure_1">
+  <g id="patch_1">
+   <path d="M 0 360 
+L 576 360 
+L 576 0 
+L 0 0 
+z
+" style="fill: #ffffff"/>
+  </g>
+  <g id="axes_1">
+   <g id="patch_2">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+L 518.4 43.2 
+L 72 43.2 
+z
+" style="fill: #ffffff"/>
+   </g>
+   <g id="matplotlib.axis_1">
+    <g id="xtick_1">
+     <g id="line2d_1">
+      <path d="M 120.27837 320.4 
+L 120.27837 43.2 
+" clip-path="url(#p47e68bb29c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_1">
+      <!-- naive -->
+      <g style="fill: #262626" transform="translate(106.620557 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-6e" d="M 3513 2113 
+L 3513 0 
+L 2938 0 
+L 2938 2094 
+Q 2938 2591 2744 2837 
+Q 2550 3084 2163 3084 
+Q 1697 3084 1428 2787 
+Q 1159 2491 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1366 3272 1645 3428 
+Q 1925 3584 2291 3584 
+Q 2894 3584 3203 3211 
+Q 3513 2838 3513 2113 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-61" d="M 2194 1759 
+Q 1497 1759 1228 1600 
+Q 959 1441 959 1056 
+Q 959 750 1161 570 
+Q 1363 391 1709 391 
+Q 2188 391 2477 730 
+Q 2766 1069 2766 1631 
+L 2766 1759 
+L 2194 1759 
+z
+M 3341 1997 
+L 3341 0 
+L 2766 0 
+L 2766 531 
+Q 2569 213 2275 61 
+Q 1981 -91 1556 -91 
+Q 1019 -91 701 211 
+Q 384 513 384 1019 
+Q 384 1609 779 1909 
+Q 1175 2209 1959 2209 
+L 2766 2209 
+L 2766 2266 
+Q 2766 2663 2505 2880 
+Q 2244 3097 1772 3097 
+Q 1472 3097 1187 3025 
+Q 903 2953 641 2809 
+L 641 3341 
+Q 956 3463 1253 3523 
+Q 1550 3584 1831 3584 
+Q 2591 3584 2966 3190 
+Q 3341 2797 3341 1997 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-69" d="M 603 3500 
+L 1178 3500 
+L 1178 0 
+L 603 0 
+L 603 3500 
+z
+M 603 4863 
+L 1178 4863 
+L 1178 4134 
+L 603 4134 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-76" d="M 191 3500 
+L 800 3500 
+L 1894 563 
+L 2988 3500 
+L 3597 3500 
+L 2284 0 
+L 1503 0 
+L 191 3500 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-65" d="M 3597 1894 
+L 3597 1613 
+L 953 1613 
+Q 991 1019 1311 708 
+Q 1631 397 2203 397 
+Q 2534 397 2845 478 
+Q 3156 559 3463 722 
+L 3463 178 
+Q 3153 47 2828 -22 
+Q 2503 -91 2169 -91 
+Q 1331 -91 842 396 
+Q 353 884 353 1716 
+Q 353 2575 817 3079 
+Q 1281 3584 2069 3584 
+Q 2775 3584 3186 3129 
+Q 3597 2675 3597 1894 
+z
+M 3022 2063 
+Q 3016 2534 2758 2815 
+Q 2500 3097 2075 3097 
+Q 1594 3097 1305 2825 
+Q 1016 2553 972 2059 
+L 3022 2063 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-6e"/>
+       <use xlink:href="#DejaVuSans-61" x="63.378906"/>
+       <use xlink:href="#DejaVuSans-69" x="124.658203"/>
+       <use xlink:href="#DejaVuSans-76" x="152.441406"/>
+       <use xlink:href="#DejaVuSans-65" x="211.621094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_2">
+     <g id="line2d_2">
+      <path d="M 190.247022 320.4 
+L 190.247022 43.2 
+" clip-path="url(#p47e68bb29c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_2">
+      <!-- transposed -->
+      <g style="fill: #262626" transform="translate(162.304834 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-74" d="M 1172 4494 
+L 1172 3500 
+L 2356 3500 
+L 2356 3053 
+L 1172 3053 
+L 1172 1153 
+Q 1172 725 1289 603 
+Q 1406 481 1766 481 
+L 2356 481 
+L 2356 0 
+L 1766 0 
+Q 1100 0 847 248 
+Q 594 497 594 1153 
+L 594 3053 
+L 172 3053 
+L 172 3500 
+L 594 3500 
+L 594 4494 
+L 1172 4494 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-72" d="M 2631 2963 
+Q 2534 3019 2420 3045 
+Q 2306 3072 2169 3072 
+Q 1681 3072 1420 2755 
+Q 1159 2438 1159 1844 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1341 3275 1631 3429 
+Q 1922 3584 2338 3584 
+Q 2397 3584 2469 3576 
+Q 2541 3569 2628 3553 
+L 2631 2963 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-73" d="M 2834 3397 
+L 2834 2853 
+Q 2591 2978 2328 3040 
+Q 2066 3103 1784 3103 
+Q 1356 3103 1142 2972 
+Q 928 2841 928 2578 
+Q 928 2378 1081 2264 
+Q 1234 2150 1697 2047 
+L 1894 2003 
+Q 2506 1872 2764 1633 
+Q 3022 1394 3022 966 
+Q 3022 478 2636 193 
+Q 2250 -91 1575 -91 
+Q 1294 -91 989 -36 
+Q 684 19 347 128 
+L 347 722 
+Q 666 556 975 473 
+Q 1284 391 1588 391 
+Q 1994 391 2212 530 
+Q 2431 669 2431 922 
+Q 2431 1156 2273 1281 
+Q 2116 1406 1581 1522 
+L 1381 1569 
+Q 847 1681 609 1914 
+Q 372 2147 372 2553 
+Q 372 3047 722 3315 
+Q 1072 3584 1716 3584 
+Q 2034 3584 2315 3537 
+Q 2597 3491 2834 3397 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-70" d="M 1159 525 
+L 1159 -1331 
+L 581 -1331 
+L 581 3500 
+L 1159 3500 
+L 1159 2969 
+Q 1341 3281 1617 3432 
+Q 1894 3584 2278 3584 
+Q 2916 3584 3314 3078 
+Q 3713 2572 3713 1747 
+Q 3713 922 3314 415 
+Q 2916 -91 2278 -91 
+Q 1894 -91 1617 61 
+Q 1341 213 1159 525 
+z
+M 3116 1747 
+Q 3116 2381 2855 2742 
+Q 2594 3103 2138 3103 
+Q 1681 3103 1420 2742 
+Q 1159 2381 1159 1747 
+Q 1159 1113 1420 752 
+Q 1681 391 2138 391 
+Q 2594 391 2855 752 
+Q 3116 1113 3116 1747 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-6f" d="M 1959 3097 
+Q 1497 3097 1228 2736 
+Q 959 2375 959 1747 
+Q 959 1119 1226 758 
+Q 1494 397 1959 397 
+Q 2419 397 2687 759 
+Q 2956 1122 2956 1747 
+Q 2956 2369 2687 2733 
+Q 2419 3097 1959 3097 
+z
+M 1959 3584 
+Q 2709 3584 3137 3096 
+Q 3566 2609 3566 1747 
+Q 3566 888 3137 398 
+Q 2709 -91 1959 -91 
+Q 1206 -91 779 398 
+Q 353 888 353 1747 
+Q 353 2609 779 3096 
+Q 1206 3584 1959 3584 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-64" d="M 2906 2969 
+L 2906 4863 
+L 3481 4863 
+L 3481 0 
+L 2906 0 
+L 2906 525 
+Q 2725 213 2448 61 
+Q 2172 -91 1784 -91 
+Q 1150 -91 751 415 
+Q 353 922 353 1747 
+Q 353 2572 751 3078 
+Q 1150 3584 1784 3584 
+Q 2172 3584 2448 3432 
+Q 2725 3281 2906 2969 
+z
+M 947 1747 
+Q 947 1113 1208 752 
+Q 1469 391 1925 391 
+Q 2381 391 2643 752 
+Q 2906 1113 2906 1747 
+Q 2906 2381 2643 2742 
+Q 2381 3103 1925 3103 
+Q 1469 3103 1208 2742 
+Q 947 2381 947 1747 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-74"/>
+       <use xlink:href="#DejaVuSans-72" x="39.208984"/>
+       <use xlink:href="#DejaVuSans-61" x="80.322266"/>
+       <use xlink:href="#DejaVuSans-6e" x="141.601562"/>
+       <use xlink:href="#DejaVuSans-73" x="204.980469"/>
+       <use xlink:href="#DejaVuSans-70" x="257.080078"/>
+       <use xlink:href="#DejaVuSans-6f" x="320.556641"/>
+       <use xlink:href="#DejaVuSans-73" x="381.738281"/>
+       <use xlink:href="#DejaVuSans-65" x="433.837891"/>
+       <use xlink:href="#DejaVuSans-64" x="495.361328"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_3">
+     <g id="line2d_3">
+      <path d="M 260.215674 320.4 
+L 260.215674 43.2 
+" clip-path="url(#p47e68bb29c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_3">
+      <!-- vectorized -->
+      <g style="fill: #262626" transform="translate(234.091455 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-63" d="M 3122 3366 
+L 3122 2828 
+Q 2878 2963 2633 3030 
+Q 2388 3097 2138 3097 
+Q 1578 3097 1268 2742 
+Q 959 2388 959 1747 
+Q 959 1106 1268 751 
+Q 1578 397 2138 397 
+Q 2388 397 2633 464 
+Q 2878 531 3122 666 
+L 3122 134 
+Q 2881 22 2623 -34 
+Q 2366 -91 2075 -91 
+Q 1284 -91 818 406 
+Q 353 903 353 1747 
+Q 353 2603 823 3093 
+Q 1294 3584 2113 3584 
+Q 2378 3584 2631 3529 
+Q 2884 3475 3122 3366 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-7a" d="M 353 3500 
+L 3084 3500 
+L 3084 2975 
+L 922 459 
+L 3084 459 
+L 3084 0 
+L 275 0 
+L 275 525 
+L 2438 3041 
+L 353 3041 
+L 353 3500 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-76"/>
+       <use xlink:href="#DejaVuSans-65" x="59.179688"/>
+       <use xlink:href="#DejaVuSans-63" x="120.703125"/>
+       <use xlink:href="#DejaVuSans-74" x="175.683594"/>
+       <use xlink:href="#DejaVuSans-6f" x="214.892578"/>
+       <use xlink:href="#DejaVuSans-72" x="276.074219"/>
+       <use xlink:href="#DejaVuSans-69" x="317.1875"/>
+       <use xlink:href="#DejaVuSans-7a" x="344.970703"/>
+       <use xlink:href="#DejaVuSans-65" x="397.460938"/>
+       <use xlink:href="#DejaVuSans-64" x="458.984375"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_4">
+     <g id="line2d_4">
+      <path d="M 330.184326 320.4 
+L 330.184326 43.2 
+" clip-path="url(#p47e68bb29c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_4">
+      <!-- kernel -->
+      <g style="fill: #262626" transform="translate(314.791357 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-6b" d="M 581 4863 
+L 1159 4863 
+L 1159 1991 
+L 2875 3500 
+L 3609 3500 
+L 1753 1863 
+L 3688 0 
+L 2938 0 
+L 1159 1709 
+L 1159 0 
+L 581 0 
+L 581 4863 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-6c" d="M 603 4863 
+L 1178 4863 
+L 1178 0 
+L 603 0 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-6b"/>
+       <use xlink:href="#DejaVuSans-65" x="54.285156"/>
+       <use xlink:href="#DejaVuSans-72" x="115.808594"/>
+       <use xlink:href="#DejaVuSans-6e" x="155.171875"/>
+       <use xlink:href="#DejaVuSans-65" x="218.550781"/>
+       <use xlink:href="#DejaVuSans-6c" x="280.074219"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_5">
+     <g id="line2d_5">
+      <path d="M 400.152978 320.4 
+L 400.152978 43.2 
+" clip-path="url(#p47e68bb29c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_5">
+      <!-- blocked -->
+      <g style="fill: #262626" transform="translate(380.816259 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-62" d="M 3116 1747 
+Q 3116 2381 2855 2742 
+Q 2594 3103 2138 3103 
+Q 1681 3103 1420 2742 
+Q 1159 2381 1159 1747 
+Q 1159 1113 1420 752 
+Q 1681 391 2138 391 
+Q 2594 391 2855 752 
+Q 3116 1113 3116 1747 
+z
+M 1159 2969 
+Q 1341 3281 1617 3432 
+Q 1894 3584 2278 3584 
+Q 2916 3584 3314 3078 
+Q 3713 2572 3713 1747 
+Q 3713 922 3314 415 
+Q 2916 -91 2278 -91 
+Q 1894 -91 1617 61 
+Q 1341 213 1159 525 
+L 1159 0 
+L 581 0 
+L 581 4863 
+L 1159 4863 
+L 1159 2969 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-62"/>
+       <use xlink:href="#DejaVuSans-6c" x="63.476562"/>
+       <use xlink:href="#DejaVuSans-6f" x="91.259766"/>
+       <use xlink:href="#DejaVuSans-63" x="152.441406"/>
+       <use xlink:href="#DejaVuSans-6b" x="207.421875"/>
+       <use xlink:href="#DejaVuSans-65" x="261.707031"/>
+       <use xlink:href="#DejaVuSans-64" x="323.230469"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_6">
+     <g id="line2d_6">
+      <path d="M 470.12163 320.4 
+L 470.12163 43.2 
+" clip-path="url(#p47e68bb29c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_6">
+      <!-- in-place -->
+      <g style="fill: #262626" transform="translate(450.306786 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-2d" d="M 313 2009 
+L 1997 2009 
+L 1997 1497 
+L 313 1497 
+L 313 2009 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-69"/>
+       <use xlink:href="#DejaVuSans-6e" x="27.783203"/>
+       <use xlink:href="#DejaVuSans-2d" x="91.162109"/>
+       <use xlink:href="#DejaVuSans-70" x="127.246094"/>
+       <use xlink:href="#DejaVuSans-6c" x="190.722656"/>
+       <use xlink:href="#DejaVuSans-61" x="218.505859"/>
+       <use xlink:href="#DejaVuSans-63" x="279.785156"/>
+       <use xlink:href="#DejaVuSans-65" x="334.765625"/>
+      </g>
+     </g>
+    </g>
+   </g>
+   <g id="matplotlib.axis_2">
+    <g id="ytick_1">
+     <g id="line2d_7">
+      <path d="M 72 320.4 
+L 518.4 320.4 
+" clip-path="url(#p47e68bb29c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_7">
+      <!-- 0 -->
+      <g style="fill: #262626" transform="translate(55.50125 324.579141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-30" d="M 2034 4250 
+Q 1547 4250 1301 3770 
+Q 1056 3291 1056 2328 
+Q 1056 1369 1301 889 
+Q 1547 409 2034 409 
+Q 2525 409 2770 889 
+Q 3016 1369 3016 2328 
+Q 3016 3291 2770 3770 
+Q 2525 4250 2034 4250 
+z
+M 2034 4750 
+Q 2819 4750 3233 4129 
+Q 3647 3509 3647 2328 
+Q 3647 1150 3233 529 
+Q 2819 -91 2034 -91 
+Q 1250 -91 836 529 
+Q 422 1150 422 2328 
+Q 422 3509 836 4129 
+Q 1250 4750 2034 4750 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-30"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_2">
+     <g id="line2d_8">
+      <path d="M 72 267.092308 
+L 518.4 267.092308 
+" clip-path="url(#p47e68bb29c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_8">
+      <!-- 5 -->
+      <g style="fill: #262626" transform="translate(55.50125 271.271448)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-35" d="M 691 4666 
+L 3169 4666 
+L 3169 4134 
+L 1269 4134 
+L 1269 2991 
+Q 1406 3038 1543 3061 
+Q 1681 3084 1819 3084 
+Q 2600 3084 3056 2656 
+Q 3513 2228 3513 1497 
+Q 3513 744 3044 326 
+Q 2575 -91 1722 -91 
+Q 1428 -91 1123 -41 
+Q 819 9 494 109 
+L 494 744 
+Q 775 591 1075 516 
+Q 1375 441 1709 441 
+Q 2250 441 2565 725 
+Q 2881 1009 2881 1497 
+Q 2881 1984 2565 2268 
+Q 2250 2553 1709 2553 
+Q 1456 2553 1204 2497 
+Q 953 2441 691 2322 
+L 691 4666 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-35"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_3">
+     <g id="line2d_9">
+      <path d="M 72 213.784615 
+L 518.4 213.784615 
+" clip-path="url(#p47e68bb29c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_9">
+      <!-- 10 -->
+      <g style="fill: #262626" transform="translate(48.5025 217.963756)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-31" d="M 794 531 
+L 1825 531 
+L 1825 4091 
+L 703 3866 
+L 703 4441 
+L 1819 4666 
+L 2450 4666 
+L 2450 531 
+L 3481 531 
+L 3481 0 
+L 794 0 
+L 794 531 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-30" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_4">
+     <g id="line2d_10">
+      <path d="M 72 160.476923 
+L 518.4 160.476923 
+" clip-path="url(#p47e68bb29c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_10">
+      <!-- 15 -->
+      <g style="fill: #262626" transform="translate(48.5025 164.656064)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-35" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_5">
+     <g id="line2d_11">
+      <path d="M 72 107.169231 
+L 518.4 107.169231 
+" clip-path="url(#p47e68bb29c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_11">
+      <!-- 20 -->
+      <g style="fill: #262626" transform="translate(48.5025 111.348371)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-32" d="M 1228 531 
+L 3431 531 
+L 3431 0 
+L 469 0 
+L 469 531 
+Q 828 903 1448 1529 
+Q 2069 2156 2228 2338 
+Q 2531 2678 2651 2914 
+Q 2772 3150 2772 3378 
+Q 2772 3750 2511 3984 
+Q 2250 4219 1831 4219 
+Q 1534 4219 1204 4116 
+Q 875 4013 500 3803 
+L 500 4441 
+Q 881 4594 1212 4672 
+Q 1544 4750 1819 4750 
+Q 2544 4750 2975 4387 
+Q 3406 4025 3406 3419 
+Q 3406 3131 3298 2873 
+Q 3191 2616 2906 2266 
+Q 2828 2175 2409 1742 
+Q 1991 1309 1228 531 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-32"/>
+       <use xlink:href="#DejaVuSans-30" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_6">
+     <g id="line2d_12">
+      <path d="M 72 53.861538 
+L 518.4 53.861538 
+" clip-path="url(#p47e68bb29c)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_12">
+      <!-- 25 -->
+      <g style="fill: #262626" transform="translate(48.5025 58.040679)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-32"/>
+       <use xlink:href="#DejaVuSans-35" x="63.623047"/>
+      </g>
+     </g>
+    </g>
+    <g id="text_13">
+     <!-- GFLOPS -->
+     <g style="fill: #262626" transform="translate(42.006875 205.175625)rotate(-90)scale(0.12 -0.12)">
+      <defs>
+       <path id="DejaVuSans-47" d="M 3809 666 
+L 3809 1919 
+L 2778 1919 
+L 2778 2438 
+L 4434 2438 
+L 4434 434 
+Q 4069 175 3628 42 
+Q 3188 -91 2688 -91 
+Q 1594 -91 976 548 
+Q 359 1188 359 2328 
+Q 359 3472 976 4111 
+Q 1594 4750 2688 4750 
+Q 3144 4750 3555 4637 
+Q 3966 4525 4313 4306 
+L 4313 3634 
+Q 3963 3931 3569 4081 
+Q 3175 4231 2741 4231 
+Q 1884 4231 1454 3753 
+Q 1025 3275 1025 2328 
+Q 1025 1384 1454 906 
+Q 1884 428 2741 428 
+Q 3075 428 3337 486 
+Q 3600 544 3809 666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-46" d="M 628 4666 
+L 3309 4666 
+L 3309 4134 
+L 1259 4134 
+L 1259 2759 
+L 3109 2759 
+L 3109 2228 
+L 1259 2228 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4c" d="M 628 4666 
+L 1259 4666 
+L 1259 531 
+L 3531 531 
+L 3531 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4f" d="M 2522 4238 
+Q 1834 4238 1429 3725 
+Q 1025 3213 1025 2328 
+Q 1025 1447 1429 934 
+Q 1834 422 2522 422 
+Q 3209 422 3611 934 
+Q 4013 1447 4013 2328 
+Q 4013 3213 3611 3725 
+Q 3209 4238 2522 4238 
+z
+M 2522 4750 
+Q 3503 4750 4090 4092 
+Q 4678 3434 4678 2328 
+Q 4678 1225 4090 567 
+Q 3503 -91 2522 -91 
+Q 1538 -91 948 565 
+Q 359 1222 359 2328 
+Q 359 3434 948 4092 
+Q 1538 4750 2522 4750 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-50" d="M 1259 4147 
+L 1259 2394 
+L 2053 2394 
+Q 2494 2394 2734 2622 
+Q 2975 2850 2975 3272 
+Q 2975 3691 2734 3919 
+Q 2494 4147 2053 4147 
+L 1259 4147 
+z
+M 628 4666 
+L 2053 4666 
+Q 2838 4666 3239 4311 
+Q 3641 3956 3641 3272 
+Q 3641 2581 3239 2228 
+Q 2838 1875 2053 1875 
+L 1259 1875 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-53" d="M 3425 4513 
+L 3425 3897 
+Q 3066 4069 2747 4153 
+Q 2428 4238 2131 4238 
+Q 1616 4238 1336 4038 
+Q 1056 3838 1056 3469 
+Q 1056 3159 1242 3001 
+Q 1428 2844 1947 2747 
+L 2328 2669 
+Q 3034 2534 3370 2195 
+Q 3706 1856 3706 1288 
+Q 3706 609 3251 259 
+Q 2797 -91 1919 -91 
+Q 1588 -91 1214 -16 
+Q 841 59 441 206 
+L 441 856 
+Q 825 641 1194 531 
+Q 1563 422 1919 422 
+Q 2459 422 2753 634 
+Q 3047 847 3047 1241 
+Q 3047 1584 2836 1778 
+Q 2625 1972 2144 2069 
+L 1759 2144 
+Q 1053 2284 737 2584 
+Q 422 2884 422 3419 
+Q 422 4038 858 4394 
+Q 1294 4750 2059 4750 
+Q 2388 4750 2728 4690 
+Q 3069 4631 3425 4513 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-47"/>
+      <use xlink:href="#DejaVuSans-46" x="77.490234"/>
+      <use xlink:href="#DejaVuSans-4c" x="135.009766"/>
+      <use xlink:href="#DejaVuSans-4f" x="187.097656"/>
+      <use xlink:href="#DejaVuSans-50" x="265.808594"/>
+      <use xlink:href="#DejaVuSans-53" x="326.111328"/>
+     </g>
+    </g>
+   </g>
+   <g id="patch_3">
+    <path d="M 92.290909 320.4 
+L 148.265831 320.4 
+L 148.265831 315.894827 
+L 92.290909 315.894827 
+z
+" clip-path="url(#p47e68bb29c)" style="fill: #4c72b0; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_4">
+    <path d="M 162.259561 320.4 
+L 218.234483 320.4 
+L 218.234483 314.303261 
+L 162.259561 314.303261 
+z
+" clip-path="url(#p47e68bb29c)" style="fill: #dd8452; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_5">
+    <path d="M 232.228213 320.4 
+L 288.203135 320.4 
+L 288.203135 296.192117 
+L 232.228213 296.192117 
+z
+" clip-path="url(#p47e68bb29c)" style="fill: #55a868; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_6">
+    <path d="M 302.196865 320.4 
+L 358.171787 320.4 
+L 358.171787 286.751632 
+L 302.196865 286.751632 
+z
+" clip-path="url(#p47e68bb29c)" style="fill: #c44e52; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_7">
+    <path d="M 372.165517 320.4 
+L 428.140439 320.4 
+L 428.140439 156.879012 
+L 372.165517 156.879012 
+z
+" clip-path="url(#p47e68bb29c)" style="fill: #8172b3; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_8">
+    <path d="M 442.134169 320.4 
+L 498.109091 320.4 
+L 498.109091 72.030291 
+L 442.134169 72.030291 
+z
+" clip-path="url(#p47e68bb29c)" style="fill: #937860; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_9">
+    <path d="M 72 320.4 
+L 72 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_10">
+    <path d="M 518.4 320.4 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_11">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_12">
+    <path d="M 72 43.2 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="text_14">
+    <!-- 1.00x -->
+    <g style="fill: #262626" transform="translate(101.605245 310.899202)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-31" d="M 750 831 
+L 1813 831 
+L 1813 3847 
+L 722 3622 
+L 722 4441 
+L 1806 4666 
+L 2950 4666 
+L 2950 831 
+L 4013 831 
+L 4013 0 
+L 750 0 
+L 750 831 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-2e" d="M 653 1209 
+L 1778 1209 
+L 1778 0 
+L 653 0 
+L 653 1209 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-30" d="M 2944 2338 
+Q 2944 3213 2780 3570 
+Q 2616 3928 2228 3928 
+Q 1841 3928 1675 3570 
+Q 1509 3213 1509 2338 
+Q 1509 1453 1675 1090 
+Q 1841 728 2228 728 
+Q 2613 728 2778 1090 
+Q 2944 1453 2944 2338 
+z
+M 4147 2328 
+Q 4147 1169 3647 539 
+Q 3147 -91 2228 -91 
+Q 1306 -91 806 539 
+Q 306 1169 306 2328 
+Q 306 3491 806 4120 
+Q 1306 4750 2228 4750 
+Q 3147 4750 3647 4120 
+Q 4147 3491 4147 2328 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-78" d="M 1422 1791 
+L 159 3500 
+L 1344 3500 
+L 2059 2463 
+L 2784 3500 
+L 3969 3500 
+L 2706 1797 
+L 4031 0 
+L 2847 0 
+L 2059 1106 
+L 1281 0 
+L 97 0 
+L 1422 1791 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-31"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-30" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-30" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_15">
+    <!-- 1.35x -->
+    <g style="fill: #262626" transform="translate(171.573897 309.307636)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-33" d="M 2981 2516 
+Q 3453 2394 3698 2092 
+Q 3944 1791 3944 1325 
+Q 3944 631 3412 270 
+Q 2881 -91 1863 -91 
+Q 1503 -91 1142 -33 
+Q 781 25 428 141 
+L 428 1069 
+Q 766 900 1098 814 
+Q 1431 728 1753 728 
+Q 2231 728 2486 893 
+Q 2741 1059 2741 1369 
+Q 2741 1688 2480 1852 
+Q 2219 2016 1709 2016 
+L 1228 2016 
+L 1228 2791 
+L 1734 2791 
+Q 2188 2791 2409 2933 
+Q 2631 3075 2631 3366 
+Q 2631 3634 2415 3781 
+Q 2200 3928 1806 3928 
+Q 1516 3928 1219 3862 
+Q 922 3797 628 3669 
+L 628 4550 
+Q 984 4650 1334 4700 
+Q 1684 4750 2022 4750 
+Q 2931 4750 3382 4451 
+Q 3834 4153 3834 3553 
+Q 3834 3144 3618 2883 
+Q 3403 2622 2981 2516 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-35" d="M 678 4666 
+L 3669 4666 
+L 3669 3781 
+L 1638 3781 
+L 1638 3059 
+Q 1775 3097 1914 3117 
+Q 2053 3138 2203 3138 
+Q 3056 3138 3531 2711 
+Q 4006 2284 4006 1522 
+Q 4006 766 3489 337 
+Q 2972 -91 2053 -91 
+Q 1656 -91 1267 -14 
+Q 878 63 494 219 
+L 494 1166 
+Q 875 947 1217 837 
+Q 1559 728 1863 728 
+Q 2300 728 2551 942 
+Q 2803 1156 2803 1522 
+Q 2803 1891 2551 2103 
+Q 2300 2316 1863 2316 
+Q 1603 2316 1309 2248 
+Q 1016 2181 678 2041 
+L 678 4666 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-31"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-35" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_16">
+    <!-- 5.37x -->
+    <g style="fill: #262626" transform="translate(241.542549 291.196492)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-37" d="M 428 4666 
+L 3944 4666 
+L 3944 3988 
+L 2125 0 
+L 953 0 
+L 2675 3781 
+L 428 3781 
+L 428 4666 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-35"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-37" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_17">
+    <!-- 7.47x -->
+    <g style="fill: #262626" transform="translate(311.511201 281.756007)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-34" d="M 2356 3675 
+L 1038 1722 
+L 2356 1722 
+L 2356 3675 
+z
+M 2156 4666 
+L 3494 4666 
+L 3494 1722 
+L 4159 1722 
+L 4159 850 
+L 3494 850 
+L 3494 0 
+L 2356 0 
+L 2356 850 
+L 288 850 
+L 288 1881 
+L 2156 4666 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-37"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-34" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-37" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_18">
+    <!-- 36.30x -->
+    <g style="fill: #262626" transform="translate(377.305166 151.883387)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-36" d="M 2316 2303 
+Q 2000 2303 1842 2098 
+Q 1684 1894 1684 1484 
+Q 1684 1075 1842 870 
+Q 2000 666 2316 666 
+Q 2634 666 2792 870 
+Q 2950 1075 2950 1484 
+Q 2950 1894 2792 2098 
+Q 2634 2303 2316 2303 
+z
+M 3803 4544 
+L 3803 3681 
+Q 3506 3822 3243 3889 
+Q 2981 3956 2731 3956 
+Q 2194 3956 1894 3657 
+Q 1594 3359 1544 2772 
+Q 1750 2925 1990 3001 
+Q 2231 3078 2516 3078 
+Q 3231 3078 3670 2659 
+Q 4109 2241 4109 1563 
+Q 4109 813 3618 361 
+Q 3128 -91 2303 -91 
+Q 1394 -91 895 523 
+Q 397 1138 397 2266 
+Q 397 3422 980 4083 
+Q 1563 4744 2578 4744 
+Q 2900 4744 3203 4694 
+Q 3506 4644 3803 4544 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-33"/>
+     <use xlink:href="#DejaVuSans-Bold-36" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="139.160156"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-30" x="246.728516"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="316.308594"/>
+    </g>
+   </g>
+   <g id="text_19">
+    <!-- 55.13x -->
+    <g style="fill: #262626" transform="translate(447.273818 67.034666)scale(0.12 -0.12)">
+     <use xlink:href="#DejaVuSans-Bold-35"/>
+     <use xlink:href="#DejaVuSans-Bold-35" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="139.160156"/>
+     <use xlink:href="#DejaVuSans-Bold-31" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="246.728516"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="316.308594"/>
+    </g>
+   </g>
+   <g id="text_20">
+    <!-- Matrix multiplication ($n=1920$) -->
+    <g style="fill: #262626" transform="translate(184.74 23.2)scale(0.14 -0.14)">
+     <defs>
+      <path id="DejaVuSans-4d" d="M 628 4666 
+L 1569 4666 
+L 2759 1491 
+L 3956 4666 
+L 4897 4666 
+L 4897 0 
+L 4281 0 
+L 4281 4097 
+L 3078 897 
+L 2444 897 
+L 1241 4097 
+L 1241 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-78" d="M 3513 3500 
+L 2247 1797 
+L 3578 0 
+L 2900 0 
+L 1881 1375 
+L 863 0 
+L 184 0 
+L 1544 1831 
+L 300 3500 
+L 978 3500 
+L 1906 2253 
+L 2834 3500 
+L 3513 3500 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-20" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6d" d="M 3328 2828 
+Q 3544 3216 3844 3400 
+Q 4144 3584 4550 3584 
+Q 5097 3584 5394 3201 
+Q 5691 2819 5691 2113 
+L 5691 0 
+L 5113 0 
+L 5113 2094 
+Q 5113 2597 4934 2840 
+Q 4756 3084 4391 3084 
+Q 3944 3084 3684 2787 
+Q 3425 2491 3425 1978 
+L 3425 0 
+L 2847 0 
+L 2847 2094 
+Q 2847 2600 2669 2842 
+Q 2491 3084 2119 3084 
+Q 1678 3084 1418 2786 
+Q 1159 2488 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1356 3278 1631 3431 
+Q 1906 3584 2284 3584 
+Q 2666 3584 2933 3390 
+Q 3200 3197 3328 2828 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-75" d="M 544 1381 
+L 544 3500 
+L 1119 3500 
+L 1119 1403 
+Q 1119 906 1312 657 
+Q 1506 409 1894 409 
+Q 2359 409 2629 706 
+Q 2900 1003 2900 1516 
+L 2900 3500 
+L 3475 3500 
+L 3475 0 
+L 2900 0 
+L 2900 538 
+Q 2691 219 2414 64 
+Q 2138 -91 1772 -91 
+Q 1169 -91 856 284 
+Q 544 659 544 1381 
+z
+M 1991 3584 
+L 1991 3584 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-28" d="M 1984 4856 
+Q 1566 4138 1362 3434 
+Q 1159 2731 1159 2009 
+Q 1159 1288 1364 580 
+Q 1569 -128 1984 -844 
+L 1484 -844 
+Q 1016 -109 783 600 
+Q 550 1309 550 2009 
+Q 550 2706 781 3412 
+Q 1013 4119 1484 4856 
+L 1984 4856 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Oblique-6e" d="M 3566 2113 
+L 3156 0 
+L 2578 0 
+L 2988 2091 
+Q 3016 2238 3031 2350 
+Q 3047 2463 3047 2528 
+Q 3047 2791 2881 2937 
+Q 2716 3084 2419 3084 
+Q 1956 3084 1622 2776 
+Q 1288 2469 1184 1941 
+L 800 0 
+L 225 0 
+L 903 3500 
+L 1478 3500 
+L 1363 2950 
+Q 1603 3253 1940 3418 
+Q 2278 3584 2650 3584 
+Q 3113 3584 3367 3334 
+Q 3622 3084 3622 2631 
+Q 3622 2519 3608 2391 
+Q 3594 2263 3566 2113 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-3d" d="M 678 2906 
+L 4684 2906 
+L 4684 2381 
+L 678 2381 
+L 678 2906 
+z
+M 678 1631 
+L 4684 1631 
+L 4684 1100 
+L 678 1100 
+L 678 1631 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-39" d="M 703 97 
+L 703 672 
+Q 941 559 1184 500 
+Q 1428 441 1663 441 
+Q 2288 441 2617 861 
+Q 2947 1281 2994 2138 
+Q 2813 1869 2534 1725 
+Q 2256 1581 1919 1581 
+Q 1219 1581 811 2004 
+Q 403 2428 403 3163 
+Q 403 3881 828 4315 
+Q 1253 4750 1959 4750 
+Q 2769 4750 3195 4129 
+Q 3622 3509 3622 2328 
+Q 3622 1225 3098 567 
+Q 2575 -91 1691 -91 
+Q 1453 -91 1209 -44 
+Q 966 3 703 97 
+z
+M 1959 2075 
+Q 2384 2075 2632 2365 
+Q 2881 2656 2881 3163 
+Q 2881 3666 2632 3958 
+Q 2384 4250 1959 4250 
+Q 1534 4250 1286 3958 
+Q 1038 3666 1038 3163 
+Q 1038 2656 1286 2365 
+Q 1534 2075 1959 2075 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-29" d="M 513 4856 
+L 1013 4856 
+Q 1481 4119 1714 3412 
+Q 1947 2706 1947 2009 
+Q 1947 1309 1714 600 
+Q 1481 -109 1013 -844 
+L 513 -844 
+Q 928 -128 1133 580 
+Q 1338 1288 1338 2009 
+Q 1338 2731 1133 3434 
+Q 928 4138 513 4856 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-4d" transform="translate(0 0.015625)"/>
+     <use xlink:href="#DejaVuSans-61" transform="translate(86.279297 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(147.558594 0.015625)"/>
+     <use xlink:href="#DejaVuSans-72" transform="translate(186.767578 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(227.880859 0.015625)"/>
+     <use xlink:href="#DejaVuSans-78" transform="translate(255.664062 0.015625)"/>
+     <use xlink:href="#DejaVuSans-20" transform="translate(314.84375 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6d" transform="translate(346.630859 0.015625)"/>
+     <use xlink:href="#DejaVuSans-75" transform="translate(444.042969 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6c" transform="translate(507.421875 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(535.205078 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(574.414062 0.015625)"/>
+     <use xlink:href="#DejaVuSans-70" transform="translate(602.197266 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6c" transform="translate(665.673828 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(693.457031 0.015625)"/>
+     <use xlink:href="#DejaVuSans-63" transform="translate(721.240234 0.015625)"/>
+     <use xlink:href="#DejaVuSans-61" transform="translate(776.220703 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(837.5 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(876.708984 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6f" transform="translate(904.492188 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6e" transform="translate(965.673828 0.015625)"/>
+     <use xlink:href="#DejaVuSans-20" transform="translate(1029.052734 0.015625)"/>
+     <use xlink:href="#DejaVuSans-28" transform="translate(1060.839844 0.015625)"/>
+     <use xlink:href="#DejaVuSans-Oblique-6e" transform="translate(1099.853516 0.015625)"/>
+     <use xlink:href="#DejaVuSans-3d" transform="translate(1182.714844 0.015625)"/>
+     <use xlink:href="#DejaVuSans-31" transform="translate(1285.986328 0.015625)"/>
+     <use xlink:href="#DejaVuSans-39" transform="translate(1349.609375 0.015625)"/>
+     <use xlink:href="#DejaVuSans-32" transform="translate(1411.482422 0.015625)"/>
+     <use xlink:href="#DejaVuSans-30" transform="translate(1475.105469 0.015625)"/>
+     <use xlink:href="#DejaVuSans-29" transform="translate(1538.728516 0.015625)"/>
+    </g>
+   </g>
+  </g>
+ </g>
+ <defs>
+  <clipPath id="p47e68bb29c">
+   <rect x="72" y="43.2" width="446.4" height="277.2"/>
+  </clipPath>
+ </defs>
+</svg>
diff --git a/content/english/hpc/algorithms/img/mm-vectorized-barplot.svg b/content/english/hpc/algorithms/img/mm-vectorized-barplot.svg
new file mode 100644
index 00000000..610d8276
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-vectorized-barplot.svg
@@ -0,0 +1,1140 @@
+<?xml version="1.0" encoding="utf-8" standalone="no"?>
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
+  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns:xlink="http://www.w3.org/1999/xlink" width="576pt" height="360pt" viewBox="0 0 576 360" xmlns="http://www.w3.org/2000/svg" version="1.1">
+ <metadata>
+  <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
+   <cc:Work>
+    <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/>
+    <dc:date>2022-04-05T01:17:55.289785</dc:date>
+    <dc:format>image/svg+xml</dc:format>
+    <dc:creator>
+     <cc:Agent>
+      <dc:title>Matplotlib v3.5.1, https://matplotlib.org/</dc:title>
+     </cc:Agent>
+    </dc:creator>
+   </cc:Work>
+  </rdf:RDF>
+ </metadata>
+ <defs>
+  <style type="text/css">*{stroke-linejoin: round; stroke-linecap: butt}</style>
+ </defs>
+ <g id="figure_1">
+  <g id="patch_1">
+   <path d="M 0 360 
+L 576 360 
+L 576 0 
+L 0 0 
+z
+" style="fill: #ffffff"/>
+  </g>
+  <g id="axes_1">
+   <g id="patch_2">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+L 518.4 43.2 
+L 72 43.2 
+z
+" style="fill: #ffffff"/>
+   </g>
+   <g id="matplotlib.axis_1">
+    <g id="xtick_1">
+     <g id="line2d_1">
+      <path d="M 150.264935 320.4 
+L 150.264935 43.2 
+" clip-path="url(#p7814d30ea0)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_1">
+      <!-- naive -->
+      <g style="fill: #262626" transform="translate(136.607123 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-6e" d="M 3513 2113 
+L 3513 0 
+L 2938 0 
+L 2938 2094 
+Q 2938 2591 2744 2837 
+Q 2550 3084 2163 3084 
+Q 1697 3084 1428 2787 
+Q 1159 2491 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1366 3272 1645 3428 
+Q 1925 3584 2291 3584 
+Q 2894 3584 3203 3211 
+Q 3513 2838 3513 2113 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-61" d="M 2194 1759 
+Q 1497 1759 1228 1600 
+Q 959 1441 959 1056 
+Q 959 750 1161 570 
+Q 1363 391 1709 391 
+Q 2188 391 2477 730 
+Q 2766 1069 2766 1631 
+L 2766 1759 
+L 2194 1759 
+z
+M 3341 1997 
+L 3341 0 
+L 2766 0 
+L 2766 531 
+Q 2569 213 2275 61 
+Q 1981 -91 1556 -91 
+Q 1019 -91 701 211 
+Q 384 513 384 1019 
+Q 384 1609 779 1909 
+Q 1175 2209 1959 2209 
+L 2766 2209 
+L 2766 2266 
+Q 2766 2663 2505 2880 
+Q 2244 3097 1772 3097 
+Q 1472 3097 1187 3025 
+Q 903 2953 641 2809 
+L 641 3341 
+Q 956 3463 1253 3523 
+Q 1550 3584 1831 3584 
+Q 2591 3584 2966 3190 
+Q 3341 2797 3341 1997 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-69" d="M 603 3500 
+L 1178 3500 
+L 1178 0 
+L 603 0 
+L 603 3500 
+z
+M 603 4863 
+L 1178 4863 
+L 1178 4134 
+L 603 4134 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-76" d="M 191 3500 
+L 800 3500 
+L 1894 563 
+L 2988 3500 
+L 3597 3500 
+L 2284 0 
+L 1503 0 
+L 191 3500 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-65" d="M 3597 1894 
+L 3597 1613 
+L 953 1613 
+Q 991 1019 1311 708 
+Q 1631 397 2203 397 
+Q 2534 397 2845 478 
+Q 3156 559 3463 722 
+L 3463 178 
+Q 3153 47 2828 -22 
+Q 2503 -91 2169 -91 
+Q 1331 -91 842 396 
+Q 353 884 353 1716 
+Q 353 2575 817 3079 
+Q 1281 3584 2069 3584 
+Q 2775 3584 3186 3129 
+Q 3597 2675 3597 1894 
+z
+M 3022 2063 
+Q 3016 2534 2758 2815 
+Q 2500 3097 2075 3097 
+Q 1594 3097 1305 2825 
+Q 1016 2553 972 2059 
+L 3022 2063 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-6e"/>
+       <use xlink:href="#DejaVuSans-61" x="63.378906"/>
+       <use xlink:href="#DejaVuSans-69" x="124.658203"/>
+       <use xlink:href="#DejaVuSans-76" x="152.441406"/>
+       <use xlink:href="#DejaVuSans-65" x="211.621094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_2">
+     <g id="line2d_2">
+      <path d="M 295.2 320.4 
+L 295.2 43.2 
+" clip-path="url(#p7814d30ea0)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_2">
+      <!-- transposed -->
+      <g style="fill: #262626" transform="translate(267.257812 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-74" d="M 1172 4494 
+L 1172 3500 
+L 2356 3500 
+L 2356 3053 
+L 1172 3053 
+L 1172 1153 
+Q 1172 725 1289 603 
+Q 1406 481 1766 481 
+L 2356 481 
+L 2356 0 
+L 1766 0 
+Q 1100 0 847 248 
+Q 594 497 594 1153 
+L 594 3053 
+L 172 3053 
+L 172 3500 
+L 594 3500 
+L 594 4494 
+L 1172 4494 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-72" d="M 2631 2963 
+Q 2534 3019 2420 3045 
+Q 2306 3072 2169 3072 
+Q 1681 3072 1420 2755 
+Q 1159 2438 1159 1844 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1341 3275 1631 3429 
+Q 1922 3584 2338 3584 
+Q 2397 3584 2469 3576 
+Q 2541 3569 2628 3553 
+L 2631 2963 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-73" d="M 2834 3397 
+L 2834 2853 
+Q 2591 2978 2328 3040 
+Q 2066 3103 1784 3103 
+Q 1356 3103 1142 2972 
+Q 928 2841 928 2578 
+Q 928 2378 1081 2264 
+Q 1234 2150 1697 2047 
+L 1894 2003 
+Q 2506 1872 2764 1633 
+Q 3022 1394 3022 966 
+Q 3022 478 2636 193 
+Q 2250 -91 1575 -91 
+Q 1294 -91 989 -36 
+Q 684 19 347 128 
+L 347 722 
+Q 666 556 975 473 
+Q 1284 391 1588 391 
+Q 1994 391 2212 530 
+Q 2431 669 2431 922 
+Q 2431 1156 2273 1281 
+Q 2116 1406 1581 1522 
+L 1381 1569 
+Q 847 1681 609 1914 
+Q 372 2147 372 2553 
+Q 372 3047 722 3315 
+Q 1072 3584 1716 3584 
+Q 2034 3584 2315 3537 
+Q 2597 3491 2834 3397 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-70" d="M 1159 525 
+L 1159 -1331 
+L 581 -1331 
+L 581 3500 
+L 1159 3500 
+L 1159 2969 
+Q 1341 3281 1617 3432 
+Q 1894 3584 2278 3584 
+Q 2916 3584 3314 3078 
+Q 3713 2572 3713 1747 
+Q 3713 922 3314 415 
+Q 2916 -91 2278 -91 
+Q 1894 -91 1617 61 
+Q 1341 213 1159 525 
+z
+M 3116 1747 
+Q 3116 2381 2855 2742 
+Q 2594 3103 2138 3103 
+Q 1681 3103 1420 2742 
+Q 1159 2381 1159 1747 
+Q 1159 1113 1420 752 
+Q 1681 391 2138 391 
+Q 2594 391 2855 752 
+Q 3116 1113 3116 1747 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-6f" d="M 1959 3097 
+Q 1497 3097 1228 2736 
+Q 959 2375 959 1747 
+Q 959 1119 1226 758 
+Q 1494 397 1959 397 
+Q 2419 397 2687 759 
+Q 2956 1122 2956 1747 
+Q 2956 2369 2687 2733 
+Q 2419 3097 1959 3097 
+z
+M 1959 3584 
+Q 2709 3584 3137 3096 
+Q 3566 2609 3566 1747 
+Q 3566 888 3137 398 
+Q 2709 -91 1959 -91 
+Q 1206 -91 779 398 
+Q 353 888 353 1747 
+Q 353 2609 779 3096 
+Q 1206 3584 1959 3584 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-64" d="M 2906 2969 
+L 2906 4863 
+L 3481 4863 
+L 3481 0 
+L 2906 0 
+L 2906 525 
+Q 2725 213 2448 61 
+Q 2172 -91 1784 -91 
+Q 1150 -91 751 415 
+Q 353 922 353 1747 
+Q 353 2572 751 3078 
+Q 1150 3584 1784 3584 
+Q 2172 3584 2448 3432 
+Q 2725 3281 2906 2969 
+z
+M 947 1747 
+Q 947 1113 1208 752 
+Q 1469 391 1925 391 
+Q 2381 391 2643 752 
+Q 2906 1113 2906 1747 
+Q 2906 2381 2643 2742 
+Q 2381 3103 1925 3103 
+Q 1469 3103 1208 2742 
+Q 947 2381 947 1747 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-74"/>
+       <use xlink:href="#DejaVuSans-72" x="39.208984"/>
+       <use xlink:href="#DejaVuSans-61" x="80.322266"/>
+       <use xlink:href="#DejaVuSans-6e" x="141.601562"/>
+       <use xlink:href="#DejaVuSans-73" x="204.980469"/>
+       <use xlink:href="#DejaVuSans-70" x="257.080078"/>
+       <use xlink:href="#DejaVuSans-6f" x="320.556641"/>
+       <use xlink:href="#DejaVuSans-73" x="381.738281"/>
+       <use xlink:href="#DejaVuSans-65" x="433.837891"/>
+       <use xlink:href="#DejaVuSans-64" x="495.361328"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_3">
+     <g id="line2d_3">
+      <path d="M 440.135065 320.4 
+L 440.135065 43.2 
+" clip-path="url(#p7814d30ea0)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_3">
+      <!-- vectorized -->
+      <g style="fill: #262626" transform="translate(414.010846 337.498438)scale(0.1 -0.1)">
+       <defs>
+        <path id="DejaVuSans-63" d="M 3122 3366 
+L 3122 2828 
+Q 2878 2963 2633 3030 
+Q 2388 3097 2138 3097 
+Q 1578 3097 1268 2742 
+Q 959 2388 959 1747 
+Q 959 1106 1268 751 
+Q 1578 397 2138 397 
+Q 2388 397 2633 464 
+Q 2878 531 3122 666 
+L 3122 134 
+Q 2881 22 2623 -34 
+Q 2366 -91 2075 -91 
+Q 1284 -91 818 406 
+Q 353 903 353 1747 
+Q 353 2603 823 3093 
+Q 1294 3584 2113 3584 
+Q 2378 3584 2631 3529 
+Q 2884 3475 3122 3366 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-7a" d="M 353 3500 
+L 3084 3500 
+L 3084 2975 
+L 922 459 
+L 3084 459 
+L 3084 0 
+L 275 0 
+L 275 525 
+L 2438 3041 
+L 353 3041 
+L 353 3500 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-76"/>
+       <use xlink:href="#DejaVuSans-65" x="59.179688"/>
+       <use xlink:href="#DejaVuSans-63" x="120.703125"/>
+       <use xlink:href="#DejaVuSans-74" x="175.683594"/>
+       <use xlink:href="#DejaVuSans-6f" x="214.892578"/>
+       <use xlink:href="#DejaVuSans-72" x="276.074219"/>
+       <use xlink:href="#DejaVuSans-69" x="317.1875"/>
+       <use xlink:href="#DejaVuSans-7a" x="344.970703"/>
+       <use xlink:href="#DejaVuSans-65" x="397.460938"/>
+       <use xlink:href="#DejaVuSans-64" x="458.984375"/>
+      </g>
+     </g>
+    </g>
+   </g>
+   <g id="matplotlib.axis_2">
+    <g id="ytick_1">
+     <g id="line2d_4">
+      <path d="M 72 320.4 
+L 518.4 320.4 
+" clip-path="url(#p7814d30ea0)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_4">
+      <!-- 0.0 -->
+      <g style="fill: #262626" transform="translate(45.006563 324.579141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-30" d="M 2034 4250 
+Q 1547 4250 1301 3770 
+Q 1056 3291 1056 2328 
+Q 1056 1369 1301 889 
+Q 1547 409 2034 409 
+Q 2525 409 2770 889 
+Q 3016 1369 3016 2328 
+Q 3016 3291 2770 3770 
+Q 2525 4250 2034 4250 
+z
+M 2034 4750 
+Q 2819 4750 3233 4129 
+Q 3647 3509 3647 2328 
+Q 3647 1150 3233 529 
+Q 2819 -91 2034 -91 
+Q 1250 -91 836 529 
+Q 422 1150 422 2328 
+Q 422 3509 836 4129 
+Q 1250 4750 2034 4750 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-2e" d="M 684 794 
+L 1344 794 
+L 1344 0 
+L 684 0 
+L 684 794 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-30"/>
+       <use xlink:href="#DejaVuSans-2e" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="95.410156"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_2">
+     <g id="line2d_5">
+      <path d="M 72 264.96 
+L 518.4 264.96 
+" clip-path="url(#p7814d30ea0)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_5">
+      <!-- 0.5 -->
+      <g style="fill: #262626" transform="translate(45.006563 269.139141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-35" d="M 691 4666 
+L 3169 4666 
+L 3169 4134 
+L 1269 4134 
+L 1269 2991 
+Q 1406 3038 1543 3061 
+Q 1681 3084 1819 3084 
+Q 2600 3084 3056 2656 
+Q 3513 2228 3513 1497 
+Q 3513 744 3044 326 
+Q 2575 -91 1722 -91 
+Q 1428 -91 1123 -41 
+Q 819 9 494 109 
+L 494 744 
+Q 775 591 1075 516 
+Q 1375 441 1709 441 
+Q 2250 441 2565 725 
+Q 2881 1009 2881 1497 
+Q 2881 1984 2565 2268 
+Q 2250 2553 1709 2553 
+Q 1456 2553 1204 2497 
+Q 953 2441 691 2322 
+L 691 4666 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-30"/>
+       <use xlink:href="#DejaVuSans-2e" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-35" x="95.410156"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_3">
+     <g id="line2d_6">
+      <path d="M 72 209.52 
+L 518.4 209.52 
+" clip-path="url(#p7814d30ea0)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_6">
+      <!-- 1.0 -->
+      <g style="fill: #262626" transform="translate(45.006563 213.699141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-31" d="M 794 531 
+L 1825 531 
+L 1825 4091 
+L 703 3866 
+L 703 4441 
+L 1819 4666 
+L 2450 4666 
+L 2450 531 
+L 3481 531 
+L 3481 0 
+L 794 0 
+L 794 531 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-2e" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="95.410156"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_4">
+     <g id="line2d_7">
+      <path d="M 72 154.08 
+L 518.4 154.08 
+" clip-path="url(#p7814d30ea0)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_7">
+      <!-- 1.5 -->
+      <g style="fill: #262626" transform="translate(45.006563 158.259141)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-2e" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-35" x="95.410156"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_5">
+     <g id="line2d_8">
+      <path d="M 72 98.64 
+L 518.4 98.64 
+" clip-path="url(#p7814d30ea0)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_8">
+      <!-- 2.0 -->
+      <g style="fill: #262626" transform="translate(45.006563 102.819141)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-32" d="M 1228 531 
+L 3431 531 
+L 3431 0 
+L 469 0 
+L 469 531 
+Q 828 903 1448 1529 
+Q 2069 2156 2228 2338 
+Q 2531 2678 2651 2914 
+Q 2772 3150 2772 3378 
+Q 2772 3750 2511 3984 
+Q 2250 4219 1831 4219 
+Q 1534 4219 1204 4116 
+Q 875 4013 500 3803 
+L 500 4441 
+Q 881 4594 1212 4672 
+Q 1544 4750 1819 4750 
+Q 2544 4750 2975 4387 
+Q 3406 4025 3406 3419 
+Q 3406 3131 3298 2873 
+Q 3191 2616 2906 2266 
+Q 2828 2175 2409 1742 
+Q 1991 1309 1228 531 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-32"/>
+       <use xlink:href="#DejaVuSans-2e" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="95.410156"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_6">
+     <g id="line2d_9">
+      <path d="M 72 43.2 
+L 518.4 43.2 
+" clip-path="url(#p7814d30ea0)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_9">
+      <!-- 2.5 -->
+      <g style="fill: #262626" transform="translate(45.006563 47.379141)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-32"/>
+       <use xlink:href="#DejaVuSans-2e" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-35" x="95.410156"/>
+      </g>
+     </g>
+    </g>
+    <g id="text_10">
+     <!-- GFLOPS -->
+     <g style="fill: #262626" transform="translate(38.510937 205.175625)rotate(-90)scale(0.12 -0.12)">
+      <defs>
+       <path id="DejaVuSans-47" d="M 3809 666 
+L 3809 1919 
+L 2778 1919 
+L 2778 2438 
+L 4434 2438 
+L 4434 434 
+Q 4069 175 3628 42 
+Q 3188 -91 2688 -91 
+Q 1594 -91 976 548 
+Q 359 1188 359 2328 
+Q 359 3472 976 4111 
+Q 1594 4750 2688 4750 
+Q 3144 4750 3555 4637 
+Q 3966 4525 4313 4306 
+L 4313 3634 
+Q 3963 3931 3569 4081 
+Q 3175 4231 2741 4231 
+Q 1884 4231 1454 3753 
+Q 1025 3275 1025 2328 
+Q 1025 1384 1454 906 
+Q 1884 428 2741 428 
+Q 3075 428 3337 486 
+Q 3600 544 3809 666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-46" d="M 628 4666 
+L 3309 4666 
+L 3309 4134 
+L 1259 4134 
+L 1259 2759 
+L 3109 2759 
+L 3109 2228 
+L 1259 2228 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4c" d="M 628 4666 
+L 1259 4666 
+L 1259 531 
+L 3531 531 
+L 3531 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4f" d="M 2522 4238 
+Q 1834 4238 1429 3725 
+Q 1025 3213 1025 2328 
+Q 1025 1447 1429 934 
+Q 1834 422 2522 422 
+Q 3209 422 3611 934 
+Q 4013 1447 4013 2328 
+Q 4013 3213 3611 3725 
+Q 3209 4238 2522 4238 
+z
+M 2522 4750 
+Q 3503 4750 4090 4092 
+Q 4678 3434 4678 2328 
+Q 4678 1225 4090 567 
+Q 3503 -91 2522 -91 
+Q 1538 -91 948 565 
+Q 359 1222 359 2328 
+Q 359 3434 948 4092 
+Q 1538 4750 2522 4750 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-50" d="M 1259 4147 
+L 1259 2394 
+L 2053 2394 
+Q 2494 2394 2734 2622 
+Q 2975 2850 2975 3272 
+Q 2975 3691 2734 3919 
+Q 2494 4147 2053 4147 
+L 1259 4147 
+z
+M 628 4666 
+L 2053 4666 
+Q 2838 4666 3239 4311 
+Q 3641 3956 3641 3272 
+Q 3641 2581 3239 2228 
+Q 2838 1875 2053 1875 
+L 1259 1875 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-53" d="M 3425 4513 
+L 3425 3897 
+Q 3066 4069 2747 4153 
+Q 2428 4238 2131 4238 
+Q 1616 4238 1336 4038 
+Q 1056 3838 1056 3469 
+Q 1056 3159 1242 3001 
+Q 1428 2844 1947 2747 
+L 2328 2669 
+Q 3034 2534 3370 2195 
+Q 3706 1856 3706 1288 
+Q 3706 609 3251 259 
+Q 2797 -91 1919 -91 
+Q 1588 -91 1214 -16 
+Q 841 59 441 206 
+L 441 856 
+Q 825 641 1194 531 
+Q 1563 422 1919 422 
+Q 2459 422 2753 634 
+Q 3047 847 3047 1241 
+Q 3047 1584 2836 1778 
+Q 2625 1972 2144 2069 
+L 1759 2144 
+Q 1053 2284 737 2584 
+Q 422 2884 422 3419 
+Q 422 4038 858 4394 
+Q 1294 4750 2059 4750 
+Q 2388 4750 2728 4690 
+Q 3069 4631 3425 4513 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-47"/>
+      <use xlink:href="#DejaVuSans-46" x="77.490234"/>
+      <use xlink:href="#DejaVuSans-4c" x="135.009766"/>
+      <use xlink:href="#DejaVuSans-4f" x="187.097656"/>
+      <use xlink:href="#DejaVuSans-50" x="265.808594"/>
+      <use xlink:href="#DejaVuSans-53" x="326.111328"/>
+     </g>
+    </g>
+   </g>
+   <g id="patch_3">
+    <path d="M 92.290909 320.4 
+L 208.238961 320.4 
+L 208.238961 273.546201 
+L 92.290909 273.546201 
+z
+" clip-path="url(#p7814d30ea0)" style="fill: #4c72b0; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_4">
+    <path d="M 237.225974 320.4 
+L 353.174026 320.4 
+L 353.174026 256.993918 
+L 237.225974 256.993918 
+z
+" clip-path="url(#p7814d30ea0)" style="fill: #dd8452; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_5">
+    <path d="M 382.161039 320.4 
+L 498.109091 320.4 
+L 498.109091 68.63802 
+L 382.161039 68.63802 
+z
+" clip-path="url(#p7814d30ea0)" style="fill: #55a868; stroke: #ffffff; stroke-linejoin: miter"/>
+   </g>
+   <g id="patch_6">
+    <path d="M 72 320.4 
+L 72 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_7">
+    <path d="M 518.4 320.4 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_8">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_9">
+    <path d="M 72 43.2 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="text_11">
+    <!-- 1.00x -->
+    <g style="fill: #262626" transform="translate(131.59181 268.550576)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-31" d="M 750 831 
+L 1813 831 
+L 1813 3847 
+L 722 3622 
+L 722 4441 
+L 1806 4666 
+L 2950 4666 
+L 2950 831 
+L 4013 831 
+L 4013 0 
+L 750 0 
+L 750 831 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-2e" d="M 653 1209 
+L 1778 1209 
+L 1778 0 
+L 653 0 
+L 653 1209 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-30" d="M 2944 2338 
+Q 2944 3213 2780 3570 
+Q 2616 3928 2228 3928 
+Q 1841 3928 1675 3570 
+Q 1509 3213 1509 2338 
+Q 1509 1453 1675 1090 
+Q 1841 728 2228 728 
+Q 2613 728 2778 1090 
+Q 2944 1453 2944 2338 
+z
+M 4147 2328 
+Q 4147 1169 3647 539 
+Q 3147 -91 2228 -91 
+Q 1306 -91 806 539 
+Q 306 1169 306 2328 
+Q 306 3491 806 4120 
+Q 1306 4750 2228 4750 
+Q 3147 4750 3647 4120 
+Q 4147 3491 4147 2328 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-78" d="M 1422 1791 
+L 159 3500 
+L 1344 3500 
+L 2059 2463 
+L 2784 3500 
+L 3969 3500 
+L 2706 1797 
+L 4031 0 
+L 2847 0 
+L 2059 1106 
+L 1281 0 
+L 97 0 
+L 1422 1791 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-31"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-30" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-30" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_12">
+    <!-- 1.35x -->
+    <g style="fill: #262626" transform="translate(276.526875 251.998293)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-33" d="M 2981 2516 
+Q 3453 2394 3698 2092 
+Q 3944 1791 3944 1325 
+Q 3944 631 3412 270 
+Q 2881 -91 1863 -91 
+Q 1503 -91 1142 -33 
+Q 781 25 428 141 
+L 428 1069 
+Q 766 900 1098 814 
+Q 1431 728 1753 728 
+Q 2231 728 2486 893 
+Q 2741 1059 2741 1369 
+Q 2741 1688 2480 1852 
+Q 2219 2016 1709 2016 
+L 1228 2016 
+L 1228 2791 
+L 1734 2791 
+Q 2188 2791 2409 2933 
+Q 2631 3075 2631 3366 
+Q 2631 3634 2415 3781 
+Q 2200 3928 1806 3928 
+Q 1516 3928 1219 3862 
+Q 922 3797 628 3669 
+L 628 4550 
+Q 984 4650 1334 4700 
+Q 1684 4750 2022 4750 
+Q 2931 4750 3382 4451 
+Q 3834 4153 3834 3553 
+Q 3834 3144 3618 2883 
+Q 3403 2622 2981 2516 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Bold-35" d="M 678 4666 
+L 3669 4666 
+L 3669 3781 
+L 1638 3781 
+L 1638 3059 
+Q 1775 3097 1914 3117 
+Q 2053 3138 2203 3138 
+Q 3056 3138 3531 2711 
+Q 4006 2284 4006 1522 
+Q 4006 766 3489 337 
+Q 2972 -91 2053 -91 
+Q 1656 -91 1267 -14 
+Q 878 63 494 219 
+L 494 1166 
+Q 875 947 1217 837 
+Q 1559 728 1863 728 
+Q 2300 728 2551 942 
+Q 2803 1156 2803 1522 
+Q 2803 1891 2551 2103 
+Q 2300 2316 1863 2316 
+Q 1603 2316 1309 2248 
+Q 1016 2181 678 2041 
+L 678 4666 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-31"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-35" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_13">
+    <!-- 5.37x -->
+    <g style="fill: #262626" transform="translate(421.46194 63.642395)scale(0.12 -0.12)">
+     <defs>
+      <path id="DejaVuSans-Bold-37" d="M 428 4666 
+L 3944 4666 
+L 3944 3988 
+L 2125 0 
+L 953 0 
+L 2675 3781 
+L 428 3781 
+L 428 4666 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-Bold-35"/>
+     <use xlink:href="#DejaVuSans-Bold-2e" x="69.580078"/>
+     <use xlink:href="#DejaVuSans-Bold-33" x="107.568359"/>
+     <use xlink:href="#DejaVuSans-Bold-37" x="177.148438"/>
+     <use xlink:href="#DejaVuSans-Bold-78" x="246.728516"/>
+    </g>
+   </g>
+   <g id="text_14">
+    <!-- Matrix multiplication ($n=1920$) -->
+    <g style="fill: #262626" transform="translate(184.74 23.2)scale(0.14 -0.14)">
+     <defs>
+      <path id="DejaVuSans-4d" d="M 628 4666 
+L 1569 4666 
+L 2759 1491 
+L 3956 4666 
+L 4897 4666 
+L 4897 0 
+L 4281 0 
+L 4281 4097 
+L 3078 897 
+L 2444 897 
+L 1241 4097 
+L 1241 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-78" d="M 3513 3500 
+L 2247 1797 
+L 3578 0 
+L 2900 0 
+L 1881 1375 
+L 863 0 
+L 184 0 
+L 1544 1831 
+L 300 3500 
+L 978 3500 
+L 1906 2253 
+L 2834 3500 
+L 3513 3500 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-20" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6d" d="M 3328 2828 
+Q 3544 3216 3844 3400 
+Q 4144 3584 4550 3584 
+Q 5097 3584 5394 3201 
+Q 5691 2819 5691 2113 
+L 5691 0 
+L 5113 0 
+L 5113 2094 
+Q 5113 2597 4934 2840 
+Q 4756 3084 4391 3084 
+Q 3944 3084 3684 2787 
+Q 3425 2491 3425 1978 
+L 3425 0 
+L 2847 0 
+L 2847 2094 
+Q 2847 2600 2669 2842 
+Q 2491 3084 2119 3084 
+Q 1678 3084 1418 2786 
+Q 1159 2488 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1356 3278 1631 3431 
+Q 1906 3584 2284 3584 
+Q 2666 3584 2933 3390 
+Q 3200 3197 3328 2828 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-75" d="M 544 1381 
+L 544 3500 
+L 1119 3500 
+L 1119 1403 
+Q 1119 906 1312 657 
+Q 1506 409 1894 409 
+Q 2359 409 2629 706 
+Q 2900 1003 2900 1516 
+L 2900 3500 
+L 3475 3500 
+L 3475 0 
+L 2900 0 
+L 2900 538 
+Q 2691 219 2414 64 
+Q 2138 -91 1772 -91 
+Q 1169 -91 856 284 
+Q 544 659 544 1381 
+z
+M 1991 3584 
+L 1991 3584 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6c" d="M 603 4863 
+L 1178 4863 
+L 1178 0 
+L 603 0 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-28" d="M 1984 4856 
+Q 1566 4138 1362 3434 
+Q 1159 2731 1159 2009 
+Q 1159 1288 1364 580 
+Q 1569 -128 1984 -844 
+L 1484 -844 
+Q 1016 -109 783 600 
+Q 550 1309 550 2009 
+Q 550 2706 781 3412 
+Q 1013 4119 1484 4856 
+L 1984 4856 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-Oblique-6e" d="M 3566 2113 
+L 3156 0 
+L 2578 0 
+L 2988 2091 
+Q 3016 2238 3031 2350 
+Q 3047 2463 3047 2528 
+Q 3047 2791 2881 2937 
+Q 2716 3084 2419 3084 
+Q 1956 3084 1622 2776 
+Q 1288 2469 1184 1941 
+L 800 0 
+L 225 0 
+L 903 3500 
+L 1478 3500 
+L 1363 2950 
+Q 1603 3253 1940 3418 
+Q 2278 3584 2650 3584 
+Q 3113 3584 3367 3334 
+Q 3622 3084 3622 2631 
+Q 3622 2519 3608 2391 
+Q 3594 2263 3566 2113 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-3d" d="M 678 2906 
+L 4684 2906 
+L 4684 2381 
+L 678 2381 
+L 678 2906 
+z
+M 678 1631 
+L 4684 1631 
+L 4684 1100 
+L 678 1100 
+L 678 1631 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-39" d="M 703 97 
+L 703 672 
+Q 941 559 1184 500 
+Q 1428 441 1663 441 
+Q 2288 441 2617 861 
+Q 2947 1281 2994 2138 
+Q 2813 1869 2534 1725 
+Q 2256 1581 1919 1581 
+Q 1219 1581 811 2004 
+Q 403 2428 403 3163 
+Q 403 3881 828 4315 
+Q 1253 4750 1959 4750 
+Q 2769 4750 3195 4129 
+Q 3622 3509 3622 2328 
+Q 3622 1225 3098 567 
+Q 2575 -91 1691 -91 
+Q 1453 -91 1209 -44 
+Q 966 3 703 97 
+z
+M 1959 2075 
+Q 2384 2075 2632 2365 
+Q 2881 2656 2881 3163 
+Q 2881 3666 2632 3958 
+Q 2384 4250 1959 4250 
+Q 1534 4250 1286 3958 
+Q 1038 3666 1038 3163 
+Q 1038 2656 1286 2365 
+Q 1534 2075 1959 2075 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-29" d="M 513 4856 
+L 1013 4856 
+Q 1481 4119 1714 3412 
+Q 1947 2706 1947 2009 
+Q 1947 1309 1714 600 
+Q 1481 -109 1013 -844 
+L 513 -844 
+Q 928 -128 1133 580 
+Q 1338 1288 1338 2009 
+Q 1338 2731 1133 3434 
+Q 928 4138 513 4856 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-4d" transform="translate(0 0.015625)"/>
+     <use xlink:href="#DejaVuSans-61" transform="translate(86.279297 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(147.558594 0.015625)"/>
+     <use xlink:href="#DejaVuSans-72" transform="translate(186.767578 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(227.880859 0.015625)"/>
+     <use xlink:href="#DejaVuSans-78" transform="translate(255.664062 0.015625)"/>
+     <use xlink:href="#DejaVuSans-20" transform="translate(314.84375 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6d" transform="translate(346.630859 0.015625)"/>
+     <use xlink:href="#DejaVuSans-75" transform="translate(444.042969 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6c" transform="translate(507.421875 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(535.205078 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(574.414062 0.015625)"/>
+     <use xlink:href="#DejaVuSans-70" transform="translate(602.197266 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6c" transform="translate(665.673828 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(693.457031 0.015625)"/>
+     <use xlink:href="#DejaVuSans-63" transform="translate(721.240234 0.015625)"/>
+     <use xlink:href="#DejaVuSans-61" transform="translate(776.220703 0.015625)"/>
+     <use xlink:href="#DejaVuSans-74" transform="translate(837.5 0.015625)"/>
+     <use xlink:href="#DejaVuSans-69" transform="translate(876.708984 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6f" transform="translate(904.492188 0.015625)"/>
+     <use xlink:href="#DejaVuSans-6e" transform="translate(965.673828 0.015625)"/>
+     <use xlink:href="#DejaVuSans-20" transform="translate(1029.052734 0.015625)"/>
+     <use xlink:href="#DejaVuSans-28" transform="translate(1060.839844 0.015625)"/>
+     <use xlink:href="#DejaVuSans-Oblique-6e" transform="translate(1099.853516 0.015625)"/>
+     <use xlink:href="#DejaVuSans-3d" transform="translate(1182.714844 0.015625)"/>
+     <use xlink:href="#DejaVuSans-31" transform="translate(1285.986328 0.015625)"/>
+     <use xlink:href="#DejaVuSans-39" transform="translate(1349.609375 0.015625)"/>
+     <use xlink:href="#DejaVuSans-32" transform="translate(1411.482422 0.015625)"/>
+     <use xlink:href="#DejaVuSans-30" transform="translate(1475.105469 0.015625)"/>
+     <use xlink:href="#DejaVuSans-29" transform="translate(1538.728516 0.015625)"/>
+    </g>
+   </g>
+  </g>
+ </g>
+ <defs>
+  <clipPath id="p7814d30ea0">
+   <rect x="72" y="43.2" width="446.4" height="277.2"/>
+  </clipPath>
+ </defs>
+</svg>
diff --git a/content/english/hpc/algorithms/img/mm-vectorized-plot.svg b/content/english/hpc/algorithms/img/mm-vectorized-plot.svg
new file mode 100644
index 00000000..7374f73f
--- /dev/null
+++ b/content/english/hpc/algorithms/img/mm-vectorized-plot.svg
@@ -0,0 +1,1379 @@
+<?xml version="1.0" encoding="utf-8" standalone="no"?>
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
+  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns:xlink="http://www.w3.org/1999/xlink" width="576pt" height="360pt" viewBox="0 0 576 360" xmlns="http://www.w3.org/2000/svg" version="1.1">
+ <metadata>
+  <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
+   <cc:Work>
+    <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/>
+    <dc:date>2022-04-05T01:18:01.560593</dc:date>
+    <dc:format>image/svg+xml</dc:format>
+    <dc:creator>
+     <cc:Agent>
+      <dc:title>Matplotlib v3.5.1, https://matplotlib.org/</dc:title>
+     </cc:Agent>
+    </dc:creator>
+   </cc:Work>
+  </rdf:RDF>
+ </metadata>
+ <defs>
+  <style type="text/css">*{stroke-linejoin: round; stroke-linecap: butt}</style>
+ </defs>
+ <g id="figure_1">
+  <g id="patch_1">
+   <path d="M 0 360 
+L 576 360 
+L 576 0 
+L 0 0 
+z
+" style="fill: #ffffff"/>
+  </g>
+  <g id="axes_1">
+   <g id="patch_2">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+L 518.4 43.2 
+L 72 43.2 
+z
+" style="fill: #ffffff"/>
+   </g>
+   <g id="matplotlib.axis_1">
+    <g id="xtick_1">
+     <g id="line2d_1">
+      <path d="M 117.784615 320.4 
+L 117.784615 43.2 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_1">
+      <!-- 240 -->
+      <g style="fill: #262626" transform="translate(107.28649 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-32" d="M 1228 531 
+L 3431 531 
+L 3431 0 
+L 469 0 
+L 469 531 
+Q 828 903 1448 1529 
+Q 2069 2156 2228 2338 
+Q 2531 2678 2651 2914 
+Q 2772 3150 2772 3378 
+Q 2772 3750 2511 3984 
+Q 2250 4219 1831 4219 
+Q 1534 4219 1204 4116 
+Q 875 4013 500 3803 
+L 500 4441 
+Q 881 4594 1212 4672 
+Q 1544 4750 1819 4750 
+Q 2544 4750 2975 4387 
+Q 3406 4025 3406 3419 
+Q 3406 3131 3298 2873 
+Q 3191 2616 2906 2266 
+Q 2828 2175 2409 1742 
+Q 1991 1309 1228 531 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-34" d="M 2419 4116 
+L 825 1625 
+L 2419 1625 
+L 2419 4116 
+z
+M 2253 4666 
+L 3047 4666 
+L 3047 1625 
+L 3713 1625 
+L 3713 1100 
+L 3047 1100 
+L 3047 0 
+L 2419 0 
+L 2419 1100 
+L 313 1100 
+L 313 1709 
+L 2253 4666 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-30" d="M 2034 4250 
+Q 1547 4250 1301 3770 
+Q 1056 3291 1056 2328 
+Q 1056 1369 1301 889 
+Q 1547 409 2034 409 
+Q 2525 409 2770 889 
+Q 3016 1369 3016 2328 
+Q 3016 3291 2770 3770 
+Q 2525 4250 2034 4250 
+z
+M 2034 4750 
+Q 2819 4750 3233 4129 
+Q 3647 3509 3647 2328 
+Q 3647 1150 3233 529 
+Q 2819 -91 2034 -91 
+Q 1250 -91 836 529 
+Q 422 1150 422 2328 
+Q 422 3509 836 4129 
+Q 1250 4750 2034 4750 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-32"/>
+       <use xlink:href="#DejaVuSans-34" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_2">
+     <g id="line2d_2">
+      <path d="M 175.015385 320.4 
+L 175.015385 43.2 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_2">
+      <!-- 480 -->
+      <g style="fill: #262626" transform="translate(164.51726 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-38" d="M 2034 2216 
+Q 1584 2216 1326 1975 
+Q 1069 1734 1069 1313 
+Q 1069 891 1326 650 
+Q 1584 409 2034 409 
+Q 2484 409 2743 651 
+Q 3003 894 3003 1313 
+Q 3003 1734 2745 1975 
+Q 2488 2216 2034 2216 
+z
+M 1403 2484 
+Q 997 2584 770 2862 
+Q 544 3141 544 3541 
+Q 544 4100 942 4425 
+Q 1341 4750 2034 4750 
+Q 2731 4750 3128 4425 
+Q 3525 4100 3525 3541 
+Q 3525 3141 3298 2862 
+Q 3072 2584 2669 2484 
+Q 3125 2378 3379 2068 
+Q 3634 1759 3634 1313 
+Q 3634 634 3220 271 
+Q 2806 -91 2034 -91 
+Q 1263 -91 848 271 
+Q 434 634 434 1313 
+Q 434 1759 690 2068 
+Q 947 2378 1403 2484 
+z
+M 1172 3481 
+Q 1172 3119 1398 2916 
+Q 1625 2713 2034 2713 
+Q 2441 2713 2670 2916 
+Q 2900 3119 2900 3481 
+Q 2900 3844 2670 4047 
+Q 2441 4250 2034 4250 
+Q 1625 4250 1398 4047 
+Q 1172 3844 1172 3481 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-34"/>
+       <use xlink:href="#DejaVuSans-38" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_3">
+     <g id="line2d_3">
+      <path d="M 232.246154 320.4 
+L 232.246154 43.2 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_3">
+      <!-- 720 -->
+      <g style="fill: #262626" transform="translate(221.748029 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-37" d="M 525 4666 
+L 3525 4666 
+L 3525 4397 
+L 1831 0 
+L 1172 0 
+L 2766 4134 
+L 525 4134 
+L 525 4666 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-37"/>
+       <use xlink:href="#DejaVuSans-32" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_4">
+     <g id="line2d_4">
+      <path d="M 289.476923 320.4 
+L 289.476923 43.2 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_4">
+      <!-- 960 -->
+      <g style="fill: #262626" transform="translate(278.978798 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-39" d="M 703 97 
+L 703 672 
+Q 941 559 1184 500 
+Q 1428 441 1663 441 
+Q 2288 441 2617 861 
+Q 2947 1281 2994 2138 
+Q 2813 1869 2534 1725 
+Q 2256 1581 1919 1581 
+Q 1219 1581 811 2004 
+Q 403 2428 403 3163 
+Q 403 3881 828 4315 
+Q 1253 4750 1959 4750 
+Q 2769 4750 3195 4129 
+Q 3622 3509 3622 2328 
+Q 3622 1225 3098 567 
+Q 2575 -91 1691 -91 
+Q 1453 -91 1209 -44 
+Q 966 3 703 97 
+z
+M 1959 2075 
+Q 2384 2075 2632 2365 
+Q 2881 2656 2881 3163 
+Q 2881 3666 2632 3958 
+Q 2384 4250 1959 4250 
+Q 1534 4250 1286 3958 
+Q 1038 3666 1038 3163 
+Q 1038 2656 1286 2365 
+Q 1534 2075 1959 2075 
+z
+" transform="scale(0.015625)"/>
+        <path id="DejaVuSans-36" d="M 2113 2584 
+Q 1688 2584 1439 2293 
+Q 1191 2003 1191 1497 
+Q 1191 994 1439 701 
+Q 1688 409 2113 409 
+Q 2538 409 2786 701 
+Q 3034 994 3034 1497 
+Q 3034 2003 2786 2293 
+Q 2538 2584 2113 2584 
+z
+M 3366 4563 
+L 3366 3988 
+Q 3128 4100 2886 4159 
+Q 2644 4219 2406 4219 
+Q 1781 4219 1451 3797 
+Q 1122 3375 1075 2522 
+Q 1259 2794 1537 2939 
+Q 1816 3084 2150 3084 
+Q 2853 3084 3261 2657 
+Q 3669 2231 3669 1497 
+Q 3669 778 3244 343 
+Q 2819 -91 2113 -91 
+Q 1303 -91 875 529 
+Q 447 1150 447 2328 
+Q 447 3434 972 4092 
+Q 1497 4750 2381 4750 
+Q 2619 4750 2861 4703 
+Q 3103 4656 3366 4563 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-39"/>
+       <use xlink:href="#DejaVuSans-36" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_5">
+     <g id="line2d_5">
+      <path d="M 346.707692 320.4 
+L 346.707692 43.2 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_5">
+      <!-- 1200 -->
+      <g style="fill: #262626" transform="translate(332.710192 338.258281)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-31" d="M 794 531 
+L 1825 531 
+L 1825 4091 
+L 703 3866 
+L 703 4441 
+L 1819 4666 
+L 2450 4666 
+L 2450 531 
+L 3481 531 
+L 3481 0 
+L 794 0 
+L 794 531 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-32" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-30" x="127.246094"/>
+       <use xlink:href="#DejaVuSans-30" x="190.869141"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_6">
+     <g id="line2d_6">
+      <path d="M 403.938462 320.4 
+L 403.938462 43.2 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_6">
+      <!-- 1440 -->
+      <g style="fill: #262626" transform="translate(389.940962 338.258281)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-34" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-34" x="127.246094"/>
+       <use xlink:href="#DejaVuSans-30" x="190.869141"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_7">
+     <g id="line2d_7">
+      <path d="M 461.169231 320.4 
+L 461.169231 43.2 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_7">
+      <!-- 1680 -->
+      <g style="fill: #262626" transform="translate(447.171731 338.258281)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-36" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-38" x="127.246094"/>
+       <use xlink:href="#DejaVuSans-30" x="190.869141"/>
+      </g>
+     </g>
+    </g>
+    <g id="xtick_8">
+     <g id="line2d_8">
+      <path d="M 518.4 320.4 
+L 518.4 43.2 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_8">
+      <!-- 1920 -->
+      <g style="fill: #262626" transform="translate(504.4025 338.258281)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+       <use xlink:href="#DejaVuSans-39" x="63.623047"/>
+       <use xlink:href="#DejaVuSans-32" x="127.246094"/>
+       <use xlink:href="#DejaVuSans-30" x="190.869141"/>
+      </g>
+     </g>
+    </g>
+    <g id="text_9">
+     <!-- Matrix size ($n \times n$) -->
+     <g style="fill: #262626" transform="translate(241.2 353.664062)scale(0.12 -0.12)">
+      <defs>
+       <path id="DejaVuSans-4d" d="M 628 4666 
+L 1569 4666 
+L 2759 1491 
+L 3956 4666 
+L 4897 4666 
+L 4897 0 
+L 4281 0 
+L 4281 4097 
+L 3078 897 
+L 2444 897 
+L 1241 4097 
+L 1241 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-61" d="M 2194 1759 
+Q 1497 1759 1228 1600 
+Q 959 1441 959 1056 
+Q 959 750 1161 570 
+Q 1363 391 1709 391 
+Q 2188 391 2477 730 
+Q 2766 1069 2766 1631 
+L 2766 1759 
+L 2194 1759 
+z
+M 3341 1997 
+L 3341 0 
+L 2766 0 
+L 2766 531 
+Q 2569 213 2275 61 
+Q 1981 -91 1556 -91 
+Q 1019 -91 701 211 
+Q 384 513 384 1019 
+Q 384 1609 779 1909 
+Q 1175 2209 1959 2209 
+L 2766 2209 
+L 2766 2266 
+Q 2766 2663 2505 2880 
+Q 2244 3097 1772 3097 
+Q 1472 3097 1187 3025 
+Q 903 2953 641 2809 
+L 641 3341 
+Q 956 3463 1253 3523 
+Q 1550 3584 1831 3584 
+Q 2591 3584 2966 3190 
+Q 3341 2797 3341 1997 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-74" d="M 1172 4494 
+L 1172 3500 
+L 2356 3500 
+L 2356 3053 
+L 1172 3053 
+L 1172 1153 
+Q 1172 725 1289 603 
+Q 1406 481 1766 481 
+L 2356 481 
+L 2356 0 
+L 1766 0 
+Q 1100 0 847 248 
+Q 594 497 594 1153 
+L 594 3053 
+L 172 3053 
+L 172 3500 
+L 594 3500 
+L 594 4494 
+L 1172 4494 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-72" d="M 2631 2963 
+Q 2534 3019 2420 3045 
+Q 2306 3072 2169 3072 
+Q 1681 3072 1420 2755 
+Q 1159 2438 1159 1844 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1341 3275 1631 3429 
+Q 1922 3584 2338 3584 
+Q 2397 3584 2469 3576 
+Q 2541 3569 2628 3553 
+L 2631 2963 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-69" d="M 603 3500 
+L 1178 3500 
+L 1178 0 
+L 603 0 
+L 603 3500 
+z
+M 603 4863 
+L 1178 4863 
+L 1178 4134 
+L 603 4134 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-78" d="M 3513 3500 
+L 2247 1797 
+L 3578 0 
+L 2900 0 
+L 1881 1375 
+L 863 0 
+L 184 0 
+L 1544 1831 
+L 300 3500 
+L 978 3500 
+L 1906 2253 
+L 2834 3500 
+L 3513 3500 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-20" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-73" d="M 2834 3397 
+L 2834 2853 
+Q 2591 2978 2328 3040 
+Q 2066 3103 1784 3103 
+Q 1356 3103 1142 2972 
+Q 928 2841 928 2578 
+Q 928 2378 1081 2264 
+Q 1234 2150 1697 2047 
+L 1894 2003 
+Q 2506 1872 2764 1633 
+Q 3022 1394 3022 966 
+Q 3022 478 2636 193 
+Q 2250 -91 1575 -91 
+Q 1294 -91 989 -36 
+Q 684 19 347 128 
+L 347 722 
+Q 666 556 975 473 
+Q 1284 391 1588 391 
+Q 1994 391 2212 530 
+Q 2431 669 2431 922 
+Q 2431 1156 2273 1281 
+Q 2116 1406 1581 1522 
+L 1381 1569 
+Q 847 1681 609 1914 
+Q 372 2147 372 2553 
+Q 372 3047 722 3315 
+Q 1072 3584 1716 3584 
+Q 2034 3584 2315 3537 
+Q 2597 3491 2834 3397 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-7a" d="M 353 3500 
+L 3084 3500 
+L 3084 2975 
+L 922 459 
+L 3084 459 
+L 3084 0 
+L 275 0 
+L 275 525 
+L 2438 3041 
+L 353 3041 
+L 353 3500 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-65" d="M 3597 1894 
+L 3597 1613 
+L 953 1613 
+Q 991 1019 1311 708 
+Q 1631 397 2203 397 
+Q 2534 397 2845 478 
+Q 3156 559 3463 722 
+L 3463 178 
+Q 3153 47 2828 -22 
+Q 2503 -91 2169 -91 
+Q 1331 -91 842 396 
+Q 353 884 353 1716 
+Q 353 2575 817 3079 
+Q 1281 3584 2069 3584 
+Q 2775 3584 3186 3129 
+Q 3597 2675 3597 1894 
+z
+M 3022 2063 
+Q 3016 2534 2758 2815 
+Q 2500 3097 2075 3097 
+Q 1594 3097 1305 2825 
+Q 1016 2553 972 2059 
+L 3022 2063 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-28" d="M 1984 4856 
+Q 1566 4138 1362 3434 
+Q 1159 2731 1159 2009 
+Q 1159 1288 1364 580 
+Q 1569 -128 1984 -844 
+L 1484 -844 
+Q 1016 -109 783 600 
+Q 550 1309 550 2009 
+Q 550 2706 781 3412 
+Q 1013 4119 1484 4856 
+L 1984 4856 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-Oblique-6e" d="M 3566 2113 
+L 3156 0 
+L 2578 0 
+L 2988 2091 
+Q 3016 2238 3031 2350 
+Q 3047 2463 3047 2528 
+Q 3047 2791 2881 2937 
+Q 2716 3084 2419 3084 
+Q 1956 3084 1622 2776 
+Q 1288 2469 1184 1941 
+L 800 0 
+L 225 0 
+L 903 3500 
+L 1478 3500 
+L 1363 2950 
+Q 1603 3253 1940 3418 
+Q 2278 3584 2650 3584 
+Q 3113 3584 3367 3334 
+Q 3622 3084 3622 2631 
+Q 3622 2519 3608 2391 
+Q 3594 2263 3566 2113 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-d7" d="M 4488 3438 
+L 3059 2003 
+L 4488 575 
+L 4116 197 
+L 2681 1631 
+L 1247 197 
+L 878 575 
+L 2303 2003 
+L 878 3438 
+L 1247 3816 
+L 2681 2381 
+L 4116 3816 
+L 4488 3438 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-29" d="M 513 4856 
+L 1013 4856 
+Q 1481 4119 1714 3412 
+Q 1947 2706 1947 2009 
+Q 1947 1309 1714 600 
+Q 1481 -109 1013 -844 
+L 513 -844 
+Q 928 -128 1133 580 
+Q 1338 1288 1338 2009 
+Q 1338 2731 1133 3434 
+Q 928 4138 513 4856 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-4d" transform="translate(0 0.015625)"/>
+      <use xlink:href="#DejaVuSans-61" transform="translate(86.279297 0.015625)"/>
+      <use xlink:href="#DejaVuSans-74" transform="translate(147.558594 0.015625)"/>
+      <use xlink:href="#DejaVuSans-72" transform="translate(186.767578 0.015625)"/>
+      <use xlink:href="#DejaVuSans-69" transform="translate(227.880859 0.015625)"/>
+      <use xlink:href="#DejaVuSans-78" transform="translate(255.664062 0.015625)"/>
+      <use xlink:href="#DejaVuSans-20" transform="translate(314.84375 0.015625)"/>
+      <use xlink:href="#DejaVuSans-73" transform="translate(346.630859 0.015625)"/>
+      <use xlink:href="#DejaVuSans-69" transform="translate(398.730469 0.015625)"/>
+      <use xlink:href="#DejaVuSans-7a" transform="translate(426.513672 0.015625)"/>
+      <use xlink:href="#DejaVuSans-65" transform="translate(479.003906 0.015625)"/>
+      <use xlink:href="#DejaVuSans-20" transform="translate(540.527344 0.015625)"/>
+      <use xlink:href="#DejaVuSans-28" transform="translate(572.314453 0.015625)"/>
+      <use xlink:href="#DejaVuSans-Oblique-6e" transform="translate(611.328125 0.015625)"/>
+      <use xlink:href="#DejaVuSans-d7" transform="translate(694.189453 0.015625)"/>
+      <use xlink:href="#DejaVuSans-Oblique-6e" transform="translate(797.460938 0.015625)"/>
+      <use xlink:href="#DejaVuSans-29" transform="translate(860.839844 0.015625)"/>
+     </g>
+    </g>
+   </g>
+   <g id="matplotlib.axis_2">
+    <g id="ytick_1">
+     <g id="line2d_9">
+      <path d="M 72 320.4 
+L 518.4 320.4 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_10">
+      <!-- 0 -->
+      <g style="fill: #262626" transform="translate(55.50125 324.579141)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-30"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_2">
+     <g id="line2d_10">
+      <path d="M 72 282.162582 
+L 518.4 282.162582 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_11">
+      <!-- 1 -->
+      <g style="fill: #262626" transform="translate(55.50125 286.341722)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-31"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_3">
+     <g id="line2d_11">
+      <path d="M 72 243.925164 
+L 518.4 243.925164 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_12">
+      <!-- 2 -->
+      <g style="fill: #262626" transform="translate(55.50125 248.104304)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-32"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_4">
+     <g id="line2d_12">
+      <path d="M 72 205.687745 
+L 518.4 205.687745 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_13">
+      <!-- 3 -->
+      <g style="fill: #262626" transform="translate(55.50125 209.866886)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-33" d="M 2597 2516 
+Q 3050 2419 3304 2112 
+Q 3559 1806 3559 1356 
+Q 3559 666 3084 287 
+Q 2609 -91 1734 -91 
+Q 1441 -91 1130 -33 
+Q 819 25 488 141 
+L 488 750 
+Q 750 597 1062 519 
+Q 1375 441 1716 441 
+Q 2309 441 2620 675 
+Q 2931 909 2931 1356 
+Q 2931 1769 2642 2001 
+Q 2353 2234 1838 2234 
+L 1294 2234 
+L 1294 2753 
+L 1863 2753 
+Q 2328 2753 2575 2939 
+Q 2822 3125 2822 3475 
+Q 2822 3834 2567 4026 
+Q 2313 4219 1838 4219 
+Q 1578 4219 1281 4162 
+Q 984 4106 628 3988 
+L 628 4550 
+Q 988 4650 1302 4700 
+Q 1616 4750 1894 4750 
+Q 2613 4750 3031 4423 
+Q 3450 4097 3450 3541 
+Q 3450 3153 3228 2886 
+Q 3006 2619 2597 2516 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-33"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_5">
+     <g id="line2d_13">
+      <path d="M 72 167.450327 
+L 518.4 167.450327 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_14">
+      <!-- 4 -->
+      <g style="fill: #262626" transform="translate(55.50125 171.629468)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-34"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_6">
+     <g id="line2d_14">
+      <path d="M 72 129.212909 
+L 518.4 129.212909 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_15">
+      <!-- 5 -->
+      <g style="fill: #262626" transform="translate(55.50125 133.39205)scale(0.11 -0.11)">
+       <defs>
+        <path id="DejaVuSans-35" d="M 691 4666 
+L 3169 4666 
+L 3169 4134 
+L 1269 4134 
+L 1269 2991 
+Q 1406 3038 1543 3061 
+Q 1681 3084 1819 3084 
+Q 2600 3084 3056 2656 
+Q 3513 2228 3513 1497 
+Q 3513 744 3044 326 
+Q 2575 -91 1722 -91 
+Q 1428 -91 1123 -41 
+Q 819 9 494 109 
+L 494 744 
+Q 775 591 1075 516 
+Q 1375 441 1709 441 
+Q 2250 441 2565 725 
+Q 2881 1009 2881 1497 
+Q 2881 1984 2565 2268 
+Q 2250 2553 1709 2553 
+Q 1456 2553 1204 2497 
+Q 953 2441 691 2322 
+L 691 4666 
+z
+" transform="scale(0.015625)"/>
+       </defs>
+       <use xlink:href="#DejaVuSans-35"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_7">
+     <g id="line2d_15">
+      <path d="M 72 90.975491 
+L 518.4 90.975491 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_16">
+      <!-- 6 -->
+      <g style="fill: #262626" transform="translate(55.50125 95.154632)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-36"/>
+      </g>
+     </g>
+    </g>
+    <g id="ytick_8">
+     <g id="line2d_16">
+      <path d="M 72 52.738073 
+L 518.4 52.738073 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #cccccc; stroke-linecap: round"/>
+     </g>
+     <g id="text_17">
+      <!-- 7 -->
+      <g style="fill: #262626" transform="translate(55.50125 56.917213)scale(0.11 -0.11)">
+       <use xlink:href="#DejaVuSans-37"/>
+      </g>
+     </g>
+    </g>
+    <g id="text_18">
+     <!-- GFLOPS -->
+     <g style="fill: #262626" transform="translate(49.005625 205.175625)rotate(-90)scale(0.12 -0.12)">
+      <defs>
+       <path id="DejaVuSans-47" d="M 3809 666 
+L 3809 1919 
+L 2778 1919 
+L 2778 2438 
+L 4434 2438 
+L 4434 434 
+Q 4069 175 3628 42 
+Q 3188 -91 2688 -91 
+Q 1594 -91 976 548 
+Q 359 1188 359 2328 
+Q 359 3472 976 4111 
+Q 1594 4750 2688 4750 
+Q 3144 4750 3555 4637 
+Q 3966 4525 4313 4306 
+L 4313 3634 
+Q 3963 3931 3569 4081 
+Q 3175 4231 2741 4231 
+Q 1884 4231 1454 3753 
+Q 1025 3275 1025 2328 
+Q 1025 1384 1454 906 
+Q 1884 428 2741 428 
+Q 3075 428 3337 486 
+Q 3600 544 3809 666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-46" d="M 628 4666 
+L 3309 4666 
+L 3309 4134 
+L 1259 4134 
+L 1259 2759 
+L 3109 2759 
+L 3109 2228 
+L 1259 2228 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4c" d="M 628 4666 
+L 1259 4666 
+L 1259 531 
+L 3531 531 
+L 3531 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-4f" d="M 2522 4238 
+Q 1834 4238 1429 3725 
+Q 1025 3213 1025 2328 
+Q 1025 1447 1429 934 
+Q 1834 422 2522 422 
+Q 3209 422 3611 934 
+Q 4013 1447 4013 2328 
+Q 4013 3213 3611 3725 
+Q 3209 4238 2522 4238 
+z
+M 2522 4750 
+Q 3503 4750 4090 4092 
+Q 4678 3434 4678 2328 
+Q 4678 1225 4090 567 
+Q 3503 -91 2522 -91 
+Q 1538 -91 948 565 
+Q 359 1222 359 2328 
+Q 359 3434 948 4092 
+Q 1538 4750 2522 4750 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-50" d="M 1259 4147 
+L 1259 2394 
+L 2053 2394 
+Q 2494 2394 2734 2622 
+Q 2975 2850 2975 3272 
+Q 2975 3691 2734 3919 
+Q 2494 4147 2053 4147 
+L 1259 4147 
+z
+M 628 4666 
+L 2053 4666 
+Q 2838 4666 3239 4311 
+Q 3641 3956 3641 3272 
+Q 3641 2581 3239 2228 
+Q 2838 1875 2053 1875 
+L 1259 1875 
+L 1259 0 
+L 628 0 
+L 628 4666 
+z
+" transform="scale(0.015625)"/>
+       <path id="DejaVuSans-53" d="M 3425 4513 
+L 3425 3897 
+Q 3066 4069 2747 4153 
+Q 2428 4238 2131 4238 
+Q 1616 4238 1336 4038 
+Q 1056 3838 1056 3469 
+Q 1056 3159 1242 3001 
+Q 1428 2844 1947 2747 
+L 2328 2669 
+Q 3034 2534 3370 2195 
+Q 3706 1856 3706 1288 
+Q 3706 609 3251 259 
+Q 2797 -91 1919 -91 
+Q 1588 -91 1214 -16 
+Q 841 59 441 206 
+L 441 856 
+Q 825 641 1194 531 
+Q 1563 422 1919 422 
+Q 2459 422 2753 634 
+Q 3047 847 3047 1241 
+Q 3047 1584 2836 1778 
+Q 2625 1972 2144 2069 
+L 1759 2144 
+Q 1053 2284 737 2584 
+Q 422 2884 422 3419 
+Q 422 4038 858 4394 
+Q 1294 4750 2059 4750 
+Q 2388 4750 2728 4690 
+Q 3069 4631 3425 4513 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-47"/>
+      <use xlink:href="#DejaVuSans-46" x="77.490234"/>
+      <use xlink:href="#DejaVuSans-4c" x="135.009766"/>
+      <use xlink:href="#DejaVuSans-4f" x="187.097656"/>
+      <use xlink:href="#DejaVuSans-50" x="265.808594"/>
+      <use xlink:href="#DejaVuSans-53" x="326.111328"/>
+     </g>
+    </g>
+   </g>
+   <g id="line2d_17">
+    <path d="M 72 285.160395 
+L 83.446154 291.16057 
+L 94.892308 292.867731 
+L 106.338462 293.58536 
+L 117.784615 294.05854 
+L 129.230769 294.481762 
+L 140.676923 294.5598 
+L 152.123077 294.781507 
+L 163.569231 294.934846 
+L 175.015385 294.871454 
+L 186.461538 295.011767 
+L 197.907692 295.111866 
+L 209.353846 295.166093 
+L 220.8 295.137392 
+L 232.246154 295.144178 
+L 243.692308 304.00224 
+L 255.138462 295.201133 
+L 266.584615 295.442468 
+L 278.030769 298.882257 
+L 289.476923 299.086838 
+L 300.923077 298.411418 
+L 312.369231 298.579152 
+L 323.815385 298.053693 
+L 335.261538 303.20555 
+L 346.707692 298.727426 
+L 358.153846 298.313568 
+L 369.6 298.843825 
+L 381.046154 297.566381 
+L 392.492308 299.086196 
+L 403.938462 297.674252 
+L 415.384615 298.841081 
+L 426.830769 314.162278 
+L 438.276923 296.61393 
+L 449.723077 299.482438 
+L 461.169231 297.361834 
+L 472.615385 298.669831 
+L 484.061538 298.378589 
+L 495.507692 298.427732 
+L 506.953846 297.473041 
+L 518.4 304.242277 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #4c72b0; stroke-width: 1.5; stroke-linecap: round"/>
+   </g>
+   <g id="line2d_18">
+    <path d="M 72 286.56998 
+L 83.446154 291.899562 
+L 94.892308 293.35041 
+L 106.338462 294.024387 
+L 117.784615 294.335204 
+L 129.230769 294.670125 
+L 140.676923 294.871336 
+L 152.123077 295.226513 
+L 163.569231 295.322838 
+L 175.015385 295.514299 
+L 186.461538 295.762896 
+L 197.907692 295.843808 
+L 209.353846 295.937615 
+L 220.8 296.239674 
+L 232.246154 296.481716 
+L 243.692308 296.600102 
+L 255.138462 297.399203 
+L 266.584615 297.333365 
+L 278.030769 297.034185 
+L 289.476923 297.112106 
+L 300.923077 297.422247 
+L 312.369231 297.441402 
+L 323.815385 297.269582 
+L 335.261538 297.636459 
+L 346.707692 298.495997 
+L 358.153846 298.156776 
+L 369.6 299.111556 
+L 381.046154 298.658342 
+L 392.492308 298.596979 
+L 403.938462 298.434774 
+L 415.384615 298.328761 
+L 426.830769 298.36277 
+L 438.276923 298.246043 
+L 449.723077 298.724799 
+L 461.169231 298.907197 
+L 472.615385 298.554134 
+L 484.061538 298.577661 
+L 495.507692 298.606833 
+L 506.953846 298.501605 
+L 518.4 298.534155 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #dd8452; stroke-width: 1.5; stroke-linecap: round"/>
+   </g>
+   <g id="line2d_19">
+    <path d="M 72 157.755671 
+L 83.446154 90.263807 
+L 94.892308 60.317269 
+L 106.338462 56.102966 
+L 117.784615 58.070437 
+L 129.230769 57.245131 
+L 140.676923 67.132648 
+L 152.123077 74.307558 
+L 163.569231 78.482037 
+L 175.015385 83.985143 
+L 186.461538 95.486128 
+L 197.907692 99.922542 
+L 209.353846 108.92387 
+L 220.8 109.281666 
+L 232.246154 123.639266 
+L 243.692308 139.562573 
+L 255.138462 153.028202 
+L 266.584615 173.630705 
+L 278.030769 184.943573 
+L 289.476923 203.583631 
+L 300.923077 210.968173 
+L 312.369231 218.950488 
+L 323.815385 225.265461 
+L 335.261538 229.706975 
+L 346.707692 229.965982 
+L 358.153846 230.092893 
+L 369.6 228.842724 
+L 381.046154 229.333025 
+L 392.492308 234.216261 
+L 403.938462 231.478049 
+L 415.384615 234.292407 
+L 426.830769 234.040301 
+L 438.276923 232.622386 
+L 449.723077 232.73222 
+L 461.169231 234.191297 
+L 472.615385 233.963307 
+L 484.061538 231.603647 
+L 495.507692 228.474748 
+L 506.953846 233.244534 
+L 518.4 233.578859 
+" clip-path="url(#pd6c2af2edb)" style="fill: none; stroke: #55a868; stroke-width: 1.5; stroke-linecap: round"/>
+   </g>
+   <g id="patch_3">
+    <path d="M 72 320.4 
+L 72 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_4">
+    <path d="M 518.4 320.4 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_5">
+    <path d="M 72 320.4 
+L 518.4 320.4 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="patch_6">
+    <path d="M 72 43.2 
+L 518.4 43.2 
+" style="fill: none; stroke: #cccccc; stroke-width: 1.25; stroke-linejoin: miter; stroke-linecap: square"/>
+   </g>
+   <g id="text_19">
+    <!-- Matrix multiplication -->
+    <g style="fill: #262626" transform="translate(223.167812 23.2)scale(0.14 -0.14)">
+     <defs>
+      <path id="DejaVuSans-6d" d="M 3328 2828 
+Q 3544 3216 3844 3400 
+Q 4144 3584 4550 3584 
+Q 5097 3584 5394 3201 
+Q 5691 2819 5691 2113 
+L 5691 0 
+L 5113 0 
+L 5113 2094 
+Q 5113 2597 4934 2840 
+Q 4756 3084 4391 3084 
+Q 3944 3084 3684 2787 
+Q 3425 2491 3425 1978 
+L 3425 0 
+L 2847 0 
+L 2847 2094 
+Q 2847 2600 2669 2842 
+Q 2491 3084 2119 3084 
+Q 1678 3084 1418 2786 
+Q 1159 2488 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1356 3278 1631 3431 
+Q 1906 3584 2284 3584 
+Q 2666 3584 2933 3390 
+Q 3200 3197 3328 2828 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-75" d="M 544 1381 
+L 544 3500 
+L 1119 3500 
+L 1119 1403 
+Q 1119 906 1312 657 
+Q 1506 409 1894 409 
+Q 2359 409 2629 706 
+Q 2900 1003 2900 1516 
+L 2900 3500 
+L 3475 3500 
+L 3475 0 
+L 2900 0 
+L 2900 538 
+Q 2691 219 2414 64 
+Q 2138 -91 1772 -91 
+Q 1169 -91 856 284 
+Q 544 659 544 1381 
+z
+M 1991 3584 
+L 1991 3584 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6c" d="M 603 4863 
+L 1178 4863 
+L 1178 0 
+L 603 0 
+L 603 4863 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-70" d="M 1159 525 
+L 1159 -1331 
+L 581 -1331 
+L 581 3500 
+L 1159 3500 
+L 1159 2969 
+Q 1341 3281 1617 3432 
+Q 1894 3584 2278 3584 
+Q 2916 3584 3314 3078 
+Q 3713 2572 3713 1747 
+Q 3713 922 3314 415 
+Q 2916 -91 2278 -91 
+Q 1894 -91 1617 61 
+Q 1341 213 1159 525 
+z
+M 3116 1747 
+Q 3116 2381 2855 2742 
+Q 2594 3103 2138 3103 
+Q 1681 3103 1420 2742 
+Q 1159 2381 1159 1747 
+Q 1159 1113 1420 752 
+Q 1681 391 2138 391 
+Q 2594 391 2855 752 
+Q 3116 1113 3116 1747 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-63" d="M 3122 3366 
+L 3122 2828 
+Q 2878 2963 2633 3030 
+Q 2388 3097 2138 3097 
+Q 1578 3097 1268 2742 
+Q 959 2388 959 1747 
+Q 959 1106 1268 751 
+Q 1578 397 2138 397 
+Q 2388 397 2633 464 
+Q 2878 531 3122 666 
+L 3122 134 
+Q 2881 22 2623 -34 
+Q 2366 -91 2075 -91 
+Q 1284 -91 818 406 
+Q 353 903 353 1747 
+Q 353 2603 823 3093 
+Q 1294 3584 2113 3584 
+Q 2378 3584 2631 3529 
+Q 2884 3475 3122 3366 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6f" d="M 1959 3097 
+Q 1497 3097 1228 2736 
+Q 959 2375 959 1747 
+Q 959 1119 1226 758 
+Q 1494 397 1959 397 
+Q 2419 397 2687 759 
+Q 2956 1122 2956 1747 
+Q 2956 2369 2687 2733 
+Q 2419 3097 1959 3097 
+z
+M 1959 3584 
+Q 2709 3584 3137 3096 
+Q 3566 2609 3566 1747 
+Q 3566 888 3137 398 
+Q 2709 -91 1959 -91 
+Q 1206 -91 779 398 
+Q 353 888 353 1747 
+Q 353 2609 779 3096 
+Q 1206 3584 1959 3584 
+z
+" transform="scale(0.015625)"/>
+      <path id="DejaVuSans-6e" d="M 3513 2113 
+L 3513 0 
+L 2938 0 
+L 2938 2094 
+Q 2938 2591 2744 2837 
+Q 2550 3084 2163 3084 
+Q 1697 3084 1428 2787 
+Q 1159 2491 1159 1978 
+L 1159 0 
+L 581 0 
+L 581 3500 
+L 1159 3500 
+L 1159 2956 
+Q 1366 3272 1645 3428 
+Q 1925 3584 2291 3584 
+Q 2894 3584 3203 3211 
+Q 3513 2838 3513 2113 
+z
+" transform="scale(0.015625)"/>
+     </defs>
+     <use xlink:href="#DejaVuSans-4d"/>
+     <use xlink:href="#DejaVuSans-61" x="86.279297"/>
+     <use xlink:href="#DejaVuSans-74" x="147.558594"/>
+     <use xlink:href="#DejaVuSans-72" x="186.767578"/>
+     <use xlink:href="#DejaVuSans-69" x="227.880859"/>
+     <use xlink:href="#DejaVuSans-78" x="255.664062"/>
+     <use xlink:href="#DejaVuSans-20" x="314.84375"/>
+     <use xlink:href="#DejaVuSans-6d" x="346.630859"/>
+     <use xlink:href="#DejaVuSans-75" x="444.042969"/>
+     <use xlink:href="#DejaVuSans-6c" x="507.421875"/>
+     <use xlink:href="#DejaVuSans-74" x="535.205078"/>
+     <use xlink:href="#DejaVuSans-69" x="574.414062"/>
+     <use xlink:href="#DejaVuSans-70" x="602.197266"/>
+     <use xlink:href="#DejaVuSans-6c" x="665.673828"/>
+     <use xlink:href="#DejaVuSans-69" x="693.457031"/>
+     <use xlink:href="#DejaVuSans-63" x="721.240234"/>
+     <use xlink:href="#DejaVuSans-61" x="776.220703"/>
+     <use xlink:href="#DejaVuSans-74" x="837.5"/>
+     <use xlink:href="#DejaVuSans-69" x="876.708984"/>
+     <use xlink:href="#DejaVuSans-6f" x="904.492188"/>
+     <use xlink:href="#DejaVuSans-6e" x="965.673828"/>
+    </g>
+   </g>
+   <g id="legend_1">
+    <g id="patch_7">
+     <path d="M 414.027187 100.437812 
+L 510.7 100.437812 
+Q 512.9 100.437812 512.9 98.237812 
+L 512.9 50.9 
+Q 512.9 48.7 510.7 48.7 
+L 414.027187 48.7 
+Q 411.827187 48.7 411.827187 50.9 
+L 411.827187 98.237812 
+Q 411.827187 100.437812 414.027187 100.437812 
+z
+" style="fill: #ffffff; opacity: 0.8; stroke: #cccccc; stroke-linejoin: miter"/>
+    </g>
+    <g id="line2d_20">
+     <path d="M 416.227187 57.608281 
+L 427.227187 57.608281 
+L 438.227187 57.608281 
+" style="fill: none; stroke: #4c72b0; stroke-width: 1.5; stroke-linecap: round"/>
+    </g>
+    <g id="text_20">
+     <!-- naive -->
+     <g style="fill: #262626" transform="translate(447.027187 61.458281)scale(0.11 -0.11)">
+      <defs>
+       <path id="DejaVuSans-76" d="M 191 3500 
+L 800 3500 
+L 1894 563 
+L 2988 3500 
+L 3597 3500 
+L 2284 0 
+L 1503 0 
+L 191 3500 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-6e"/>
+      <use xlink:href="#DejaVuSans-61" x="63.378906"/>
+      <use xlink:href="#DejaVuSans-69" x="124.658203"/>
+      <use xlink:href="#DejaVuSans-76" x="152.441406"/>
+      <use xlink:href="#DejaVuSans-65" x="211.621094"/>
+     </g>
+    </g>
+    <g id="line2d_21">
+     <path d="M 416.227187 73.754219 
+L 427.227187 73.754219 
+L 438.227187 73.754219 
+" style="fill: none; stroke: #dd8452; stroke-width: 1.5; stroke-linecap: round"/>
+    </g>
+    <g id="text_21">
+     <!-- transposed -->
+     <g style="fill: #262626" transform="translate(447.027187 77.604219)scale(0.11 -0.11)">
+      <defs>
+       <path id="DejaVuSans-64" d="M 2906 2969 
+L 2906 4863 
+L 3481 4863 
+L 3481 0 
+L 2906 0 
+L 2906 525 
+Q 2725 213 2448 61 
+Q 2172 -91 1784 -91 
+Q 1150 -91 751 415 
+Q 353 922 353 1747 
+Q 353 2572 751 3078 
+Q 1150 3584 1784 3584 
+Q 2172 3584 2448 3432 
+Q 2725 3281 2906 2969 
+z
+M 947 1747 
+Q 947 1113 1208 752 
+Q 1469 391 1925 391 
+Q 2381 391 2643 752 
+Q 2906 1113 2906 1747 
+Q 2906 2381 2643 2742 
+Q 2381 3103 1925 3103 
+Q 1469 3103 1208 2742 
+Q 947 2381 947 1747 
+z
+" transform="scale(0.015625)"/>
+      </defs>
+      <use xlink:href="#DejaVuSans-74"/>
+      <use xlink:href="#DejaVuSans-72" x="39.208984"/>
+      <use xlink:href="#DejaVuSans-61" x="80.322266"/>
+      <use xlink:href="#DejaVuSans-6e" x="141.601562"/>
+      <use xlink:href="#DejaVuSans-73" x="204.980469"/>
+      <use xlink:href="#DejaVuSans-70" x="257.080078"/>
+      <use xlink:href="#DejaVuSans-6f" x="320.556641"/>
+      <use xlink:href="#DejaVuSans-73" x="381.738281"/>
+      <use xlink:href="#DejaVuSans-65" x="433.837891"/>
+      <use xlink:href="#DejaVuSans-64" x="495.361328"/>
+     </g>
+    </g>
+    <g id="line2d_22">
+     <path d="M 416.227187 89.900156 
+L 427.227187 89.900156 
+L 438.227187 89.900156 
+" style="fill: none; stroke: #55a868; stroke-width: 1.5; stroke-linecap: round"/>
+    </g>
+    <g id="text_22">
+     <!-- vectorized -->
+     <g style="fill: #262626" transform="translate(447.027187 93.750156)scale(0.11 -0.11)">
+      <use xlink:href="#DejaVuSans-76"/>
+      <use xlink:href="#DejaVuSans-65" x="59.179688"/>
+      <use xlink:href="#DejaVuSans-63" x="120.703125"/>
+      <use xlink:href="#DejaVuSans-74" x="175.683594"/>
+      <use xlink:href="#DejaVuSans-6f" x="214.892578"/>
+      <use xlink:href="#DejaVuSans-72" x="276.074219"/>
+      <use xlink:href="#DejaVuSans-69" x="317.1875"/>
+      <use xlink:href="#DejaVuSans-7a" x="344.970703"/>
+      <use xlink:href="#DejaVuSans-65" x="397.460938"/>
+      <use xlink:href="#DejaVuSans-64" x="458.984375"/>
+     </g>
+    </g>
+   </g>
+  </g>
+ </g>
+ <defs>
+  <clipPath id="pd6c2af2edb">
+   <rect x="72" y="43.2" width="446.4" height="277.2"/>
+  </clipPath>
+ </defs>
+</svg>
diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 408c6892..29081c0c 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -1,9 +1,49 @@
 ---
 title: Matrix Multiplication
-weight: 4
+weight: 20
 draft: true
 ---
 
+"[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)" by Kazushige Goto and Robert van de Geijn.
+
+For reasons that will later become apparent, we only use sizes that are multiples of $48$. 1920
+
+Cache associativity strikes again. This is also an issue, but we will not address it for now.
+
+GCC 13.
+
+3.5s for 1025 and 12s for 1024.
+
+baseline 13.58622 0.5209607970428861
+hugepages 16.749895 0.42256312651512146
+transposed 12.377302 0.5718441708863531
+autovec 3.117215 2.2705806304666187
+vectorized 3.075742 2.301196914435606
+kernel 2.24264 3.1560517960974566
+blocked 0.461477 15.33746643928083
+noalloc 0.408031 17.346446716058338
+nomove 0.303826 23.295860130469414
+blas 0.27489790320396423 25.747333528217077
+
+![](../img/mm-vectorized-barplot.svg)
+
+![](../img/mm-vectorized-plot.svg)
+
+![](../img/mm-kernel-barplot.svg)
+
+![](../img/mm-kernel-plot.svg)
+
+![](../img/mm-blocked-plot.svg)
+
+![](../img/mm-blocked-barplot.svg)
+
+![](../img/mm-noalloc.svg)
+
+![](../img/mm-blas.svg)
+
+Which is fine, considering that this is not the only thing that CPUs are made for.
+
+---
 
 ## Case Study: Distance Product
 

From d97421b9a3d47b22fe6ba12591a04edc5e406af9 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 5 Apr 2022 01:41:49 +0300
Subject: [PATCH 012/173] matmul code

---
 content/english/hpc/algorithms/matmul.md | 137 +++++++++++++++++++++++
 1 file changed, 137 insertions(+)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 29081c0c..b787ae52 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -25,14 +25,151 @@ noalloc 0.408031 17.346446716058338
 nomove 0.303826 23.295860130469414
 blas 0.27489790320396423 25.747333528217077
 
+```c++
+void matmul(const float *a, const float *b, float *c, int n) {
+    for (int i = 0; i < n; i++)
+        for (int j = 0; j < n; j++)
+            for (int k = 0; k < n; k++)
+                c[i * n + j] += a[i * n + k] * b[k * n + j];
+}
+```
+
+Transpose:
+
+```c++
+void matmul(const float *a, const float *_b, float *c, int n) {
+    float *b = new float[n * n];
+
+    for (int i = 0; i < n; i++)
+        for (int j = 0; j < n; j++)
+            b[i * n + j] = _b[j * n + i];
+    
+    for (int i = 0; i < n; i++)
+        for (int j = 0; j < n; j++)
+            for (int k = 0; k < n; k++)
+                c[i * n + j] += a[i * n + k] * b[j * n + k]; // notice indices
+}
+```
+
+```c++
+void matmul(const float *a, const float *_b, float * __restrict__ c, int n) {
+    // ...
+}
+```
+
+```c++
+const int B = 8; // number of elements in a vector
+const int vecsize = B * sizeof(float); // size of a vector in bytes
+typedef float vector __attribute__ (( vector_size(vecsize) ));
+
+vector* alloc(int n) {
+    vector* ptr = (vector*) std::aligned_alloc(vecsize, vecsize * n);
+    memset(ptr, 0, vecsize * n);
+    return ptr;
+}
+
+float hsum(vector s) {
+    float res = 0;
+    for (int i = 0; i < B; i++)
+        res += s[i];
+    return res;
+}
+
+void matmul(const float *_a, const float *_b, float *c, int n) {
+    int nB = (n + B - 1) / B;
+
+    vector *a = alloc(n * nB);
+    vector *b = alloc(n * nB);
+
+    for (int i = 0; i < n; i++) {
+        for (int j = 0; j < n; j++) {
+            a[i * nB + j / B][j % B] = _a[i * n + j];
+            b[i * nB + j / B][j % B] = _b[j * n + i]; // <- still transposed
+        }
+    }
+
+    for (int i = 0; i < n; i++) {
+        for (int j = 0; j < n; j++) {
+            vector s = {0};
+            for (int k = 0; k < nB; k++)
+                s += a[i * nB + k] * b[j * nB + k];
+            c[i * n + j] = hsum(s);
+        }
+    }
+}
+```
+
 ![](../img/mm-vectorized-barplot.svg)
 
 ![](../img/mm-vectorized-plot.svg)
 
+```c++
+void kernel(float *a, vector *b, vector *c, int x, int y, int l, int r, int n) {
+    vector t[6][2]{};
+
+    for (int k = l; k < r; k++) {
+        for (int i = 0; i < 6; i++) {
+            vector alpha = vector{} + a[(x + i) * n + k];
+            for (int j = 0; j < 2; j++)
+                t[i][j] += alpha * b[(k * n + y) / 8 + j];
+        }
+    }
+
+    for (int i = 0; i < 6; i++)
+        for (int j = 0; j < 2; j++)
+            c[((x + i) * n + y) / 8 + j] += t[i][j];
+}
+```
+
+```c++
+void matmul(const float *_a, const float *_b, float *_c, int n) {
+    int nx = (n + 5) / 6 * 6;
+    int ny = (n + 15) / 16 * 16;
+    
+    float *a = (float*) alloc(nx * ny / B); // alloc() takes a size in vectors
+    float *b = (float*) alloc(nx * ny / B);
+    float *c = (float*) alloc(nx * ny / B);
+
+    for (int i = 0; i < n; i++) {
+        memcpy(&a[i * ny], &_a[i * n], 4 * n);
+        memcpy(&b[i * ny], &_b[i * n], 4 * n);
+    }
+
+    for (int x = 0; x < nx; x += 6)
+        for (int y = 0; y < ny; y += 16)
+            kernel(a, (vector*) b, (vector*) c, x, y, 0, n, ny);
+
+    for (int i = 0; i < n; i++)
+        memcpy(&_c[i * n], &c[i * ny], 4 * n);
+    
+    std::free(a);
+    std::free(b);
+    std::free(c);
+}
+```
+
 ![](../img/mm-kernel-barplot.svg)
 
 ![](../img/mm-kernel-plot.svg)
 
+```c++
+const int s3 = 64;
+const int s2 = 120;
+const int s1 = 240;
+
+for (int i3 = 0; i3 < ny; i3 += s3)
+    // now we are working with b[:][i3:i3+s3]
+    for (int i2 = 0; i2 < nx; i2 += s2)
+        // now we are working with a[i2:i2+s2][:]
+        for (int i1 = 0; i1 < ny; i1 += s1)
+            // now we are working with b[i1:i1+s1][i3:i3+s3]
+            // this equates to updating c[i2:i2+s2][i3:i3+s3]
+            // with [l:r] = [i1:i1+s1]
+            for (int x = i2; x < std::min(i2 + s2, nx); x += 6)
+                for (int y = i3; y < std::min(i3 + s3, ny); y += 16)
+                    kernel(a, (vector*) b, (vector*) c, x, y, i1, std::min(i1 + s1, n), ny);
+```
+
 ![](../img/mm-blocked-plot.svg)
 
 ![](../img/mm-blocked-barplot.svg)

From 4f3fb47f84d394b114338f3bfb8dd6fc28ac5bff Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 5 Apr 2022 01:49:56 +0300
Subject: [PATCH 013/173] matmul outline

---
 content/english/hpc/algorithms/matmul.md | 434 ++---------------------
 1 file changed, 26 insertions(+), 408 deletions(-)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index b787ae52..1a611a52 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -6,6 +6,8 @@ draft: true
 
 "[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)" by Kazushige Goto and Robert van de Geijn.
 
+Inspired by "[Programming Parallel Computers](http://ppc.cs.aalto.fi/ch2/)" course.
+
 For reasons that will later become apparent, we only use sizes that are multiples of $48$. 1920
 
 Cache associativity strikes again. This is also an issue, but we will not address it for now.
@@ -25,6 +27,12 @@ noalloc 0.408031 17.346446716058338
 nomove 0.303826 23.295860130469414
 blas 0.27489790320396423 25.747333528217077
 
+$$
+C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj}
+$$
+
+Implement the definition of what we need to do, but using arrays instead of matrices:
+
 ```c++
 void matmul(const float *a, const float *b, float *c, int n) {
     for (int i = 0; i < n; i++)
@@ -103,6 +111,15 @@ void matmul(const float *_a, const float *_b, float *c, int n) {
 
 ![](../img/mm-vectorized-plot.svg)
 
+## Theoretical Performance
+
+$$
+\underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11})
+$$
+
+RAM bandwidth is lower than that
+
+
 ```c++
 void kernel(float *a, vector *b, vector *c, int x, int y, int l, int r, int n) {
     vector t[6][2]{};
@@ -180,424 +197,25 @@ for (int i3 = 0; i3 < ny; i3 += s3)
 
 Which is fine, considering that this is not the only thing that CPUs are made for.
 
----
-
-## Case Study: Distance Product
-
-(We are going to speedrun "[Programming Parallel Computers](http://ppc.cs.aalto.fi/ch2/)" course)
+### Generalizations
 
 Given a matrix $D$, we need to calculate its "min-plus matrix multiplication" defined as:
 
 $(D \circ D)_{ij} = \min_k(D_{ik} + D_{kj})$
 
-----
-
-Graph interpretation:
-find shortest paths of length 2 between all vertices in a fully-connected weighted graph
-
-![](https://i.imgur.com/Zf4G7qj.png)
-
-----
+Graph interpretation: find shortest paths of length 2 between all vertices in a fully-connected weighted graph
 
 A cool thing about distance product is that if we iterate the process and calculate:
 
-$D_2 = D \circ D, \;\;
-D_4 = D_2 \circ D_2, \;\;
-D_8 = D_4 \circ D_4, \;\;
-\ldots$
-
-Then we can find all-pairs shortest distances in $O(\log n)$ steps
-
-(but recall that there are [more direct ways](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm) to solve it) <!-- .element: class="fragment" data-fragment-index="1" -->
-
----
-
-## V0: Baseline
-
-Implement the definition of what we need to do, but using arrays instead of matrices:
-
-```cpp
-const float infty = std::numeric_limits<float>::infinity();
-
-void step(float* r, const float* d, int n) {
-    for (int i = 0; i < n; ++i) {
-        for (int j = 0; j < n; ++j) {
-            float v = infty;
-            for (int k = 0; k < n; ++k) {
-                float x = d[n*i + k];
-                float y = d[n*k + j];
-                float z = x + y;
-                v = std::min(v, z);
-            }
-            r[n*i + j] = v;
-        }
-    }
-}
-```
-
-Compile with `g++ -O3 -march=native -std=c++17`
-
-On our Intel Core i5-6500 ("Skylake," 4 cores, 3.6 GHz) with $n=4000$ it runs for 99s,
-which amounts to ~1.3B useful floating point operations per second
-
----
-
-## Theoretical Performance
-
 $$
-\underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11})
+D_2 = D \circ D \\
+D_4 = D_2 \circ D_2 \\
+D_8 = D_4 \circ D_4 \\
+\ldots
 $$
 
-RAM bandwidth: 34.1 GB/s (or ~10 bytes per cycle)
-<!-- .element: class="fragment" data-fragment-index="1" -->
-
----
-
-## OpenMP
-
-* We have 4 cores, so why don't we use them?
-* There are low-level ways of creating threads, but they involve a lot of code
-* We will use a high-level interface called OpenMP
-* (We will talk about multithreading in much more detail on the next lecture)
-
-![](https://www.researchgate.net/profile/Mario_Storti/publication/231168223/figure/fig2/AS:393334787985424@1470789729707/The-master-thread-creates-a-team-of-parallel-threads.png =400x)
-
-----
-
-## Multithreading Made Easy
-
-All you need to know for now is the `#pragma omp parallel for` directive
-
-```cpp
-#pragma omp parallel for
-for (int i = 0; i < 10; ++i) {
-    do_stuff(i);
-}
-```
-
-It splits iterations of a loop among multiple threads
-
-There are many ways to control scheduling,
-but we'll just leave defaults because our use case is simple
-<!-- .element: class="fragment" data-fragment-index="1" -->
-
-
-----
-
-## Warning: Data Races
-
-This only works when all iterations can safely be executed simultaneously
-It's not always easy to determine, but for now following rules of thumb are enough:
-
-* There must not be any shared data element that is read by X and written by Y
-* There must not be any shared data element that is written by X and written by Y
-
-E. g. sum can't be parallelized this way, as threads would modify a shared variable
-<!-- .element: class="fragment" data-fragment-index="1" -->
-
----
-
-## Parallel Baseline
-
-OpenMP is included in compilers: just add `-fopenmp` flag and that's it
-
-```cpp
-void step(float* r, const float* d, int n) {
-    #pragma omp parallel for
-    for (int i = 0; i < n; ++i) {
-        for (int j = 0; j < n; ++j) {
-            float v = infty;
-            for (int k = 0; k < n; ++k) {
-                float x = d[n*i + k];
-                float y = d[n*k + j];
-                float z = x + y;
-                v = std::min(v, z);
-            }
-            r[n*i + j] = v;
-        }
-    }
-}
-```
-
-Runs ~4x times faster, as it should
-
----
-
-## Memory Bottleneck
-
-![](https://i.imgur.com/z4d6aez.png =450x)
-
-(It is slower on macOS because of smaller page sizes)
-
-----
-
-## Virtual Memory
-
-![](https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/images/Chapter9/9_01_VirtualMemoryLarger.jpg =500x)
-
----
-
-## V1: Linear Reading
-
-Just transpose it, as we did with matrices
-
-```cpp
-void step(float* r, const float* d, int n) {
-    std::vector<float> t(n*n);
-    #pragma omp parallel for
-    for (int i = 0; i < n; ++i) {
-        for (int j = 0; j < n; ++j) {
-            t[n*j + i] = d[n*i + j];
-        }
-    }
-
-    #pragma omp parallel for
-    for (int i = 0; i < n; ++i) {
-        for (int j = 0; j < n; ++j) {
-            float v = std::numeric_limits<float>::infinity();
-            for (int k = 0; k < n; ++k) {
-                float x = d[n*i + k];
-                float y = t[n*j + k];
-                float z = x + y;
-                v = std::min(v, z);
-            }
-            r[n*i + j] = v;
-        }
-    }
-}
-```
-
-----
-
-![](https://i.imgur.com/UwxcEG7.png =600x)
-
-----
-
-![](https://i.imgur.com/2ySfr0V.png =600x)
-
----
-
-## V2: Instruction-Level Parallelism
-
-We can apply the same trick as we did with array sum earlier, so that instead of:
-
-```cpp
-v = min(v, z0);
-v = min(v, z1);
-v = min(v, z2);
-v = min(v, z3);
-v = min(v, z4);
-```
-
-We use a few registers and compute minimum simultaneously utilizing ILP:
-
-```cpp
-v0 = min(v0, z0);
-v1 = min(v1, z1);
-v0 = min(v0, z2);
-v1 = min(v1, z3);
-v0 = min(v0, z4);
-...
-v = min(v0, v1);
-```
-
-----
-
-![](https://i.imgur.com/ihMC6z2.png)
-
-Our memory layout looks like this now
-
-----
-
-```cpp
-void step(float* r, const float* d_, int n) {
-    constexpr int nb = 4;
-    int na = (n + nb - 1) / nb;
-    int nab = na*nb;
-
-    // input data, padded
-    std::vector<float> d(n*nab, infty);
-    // input data, transposed, padded
-    std::vector<float> t(n*nab, infty);
-
-    #pragma omp parallel for
-    for (int j = 0; j < n; ++j) {
-        for (int i = 0; i < n; ++i) {
-            d[nab*j + i] = d_[n*j + i];
-            t[nab*j + i] = d_[n*i + j];
-        }
-    }
-
-    #pragma omp parallel for
-    for (int i = 0; i < n; ++i) {
-        for (int j = 0; j < n; ++j) {
-            // vv[0] = result for k = 0, 4, 8, ...
-            // vv[1] = result for k = 1, 5, 9, ...
-            // vv[2] = result for k = 2, 6, 10, ...
-            // vv[3] = result for k = 3, 7, 11, ...
-            float vv[nb];
-            for (int kb = 0; kb < nb; ++kb) {
-                vv[kb] = infty;
-            }
-            for (int ka = 0; ka < na; ++ka) {
-                for (int kb = 0; kb < nb; ++kb) {
-                    float x = d[nab*i + ka * nb + kb];
-                    float y = t[nab*j + ka * nb + kb];
-                    float z = x + y;
-                    vv[kb] = std::min(vv[kb], z);
-                }
-            }
-            // v = result for k = 0, 1, 2, ...
-            float v = infty;
-            for (int kb = 0; kb < nb; ++kb) {
-                v = std::min(vv[kb], v);
-            }
-            r[n*i + j] = v;
-        }
-    }
-}
-```
-
-----
-
-![](https://i.imgur.com/5uHVRL4.png =600x)
-
----
-
-## V3: Vectorization
-
-![](https://i.imgur.com/EG0WjHl.png =400x)
-
-----
-
-```cpp
-static inline float8_t min8(float8_t x, float8_t y) {
-    return x < y ? x : y;
-}
-
-void step(float* r, const float* d_, int n) {
-    // elements per vector
-    constexpr int nb = 8;
-    // vectors per input row
-    int na = (n + nb - 1) / nb;
-
-    // input data, padded, converted to vectors
-    float8_t* vd = float8_alloc(n*na);
-    // input data, transposed, padded, converted to vectors
-    float8_t* vt = float8_alloc(n*na);
-
-    #pragma omp parallel for
-    for (int j = 0; j < n; ++j) {
-        for (int ka = 0; ka < na; ++ka) {
-            for (int kb = 0; kb < nb; ++kb) {
-                int i = ka * nb + kb;
-                vd[na*j + ka][kb] = i < n ? d_[n*j + i] : infty;
-                vt[na*j + ka][kb] = i < n ? d_[n*i + j] : infty;
-            }
-        }
-    }
-
-    #pragma omp parallel for
-    for (int i = 0; i < n; ++i) {
-        for (int j = 0; j < n; ++j) {
-            float8_t vv = f8infty;
-            for (int ka = 0; ka < na; ++ka) {
-                float8_t x = vd[na*i + ka];
-                float8_t y = vt[na*j + ka];
-                float8_t z = x + y;
-                vv = min8(vv, z);
-            }
-            r[n*i + j] = hmin8(vv);
-        }
-    }
-
-    std::free(vt);
-    std::free(vd);
-}
-```
-
-----
-
-![](https://i.imgur.com/R3OvLKO.png =600x)
-
----
-
-## V4: Register Reuse
-
-* At this point we are actually bottlenecked by memory
-* It turns out that calculating one $r_{ij}$ at a time is not optimal
-* We can reuse data that we read into registers to update other fields
-
-----
-
-![](https://i.imgur.com/ljvD0ba.png =400x)
-
-----
-
-```cpp
-for (int ka = 0; ka < na; ++ka) {
-    float8_t y0 = vt[na*(jc * nd + 0) + ka];
-    float8_t y1 = vt[na*(jc * nd + 1) + ka];
-    float8_t y2 = vt[na*(jc * nd + 2) + ka];
-    float8_t x0 = vd[na*(ic * nd + 0) + ka];
-    float8_t x1 = vd[na*(ic * nd + 1) + ka];
-    float8_t x2 = vd[na*(ic * nd + 2) + ka];
-    vv[0][0] = min8(vv[0][0], x0 + y0);
-    vv[0][1] = min8(vv[0][1], x0 + y1);
-    vv[0][2] = min8(vv[0][2], x0 + y2);
-    vv[1][0] = min8(vv[1][0], x1 + y0);
-    vv[1][1] = min8(vv[1][1], x1 + y1);
-    vv[1][2] = min8(vv[1][2], x1 + y2);
-    vv[2][0] = min8(vv[2][0], x2 + y0);
-    vv[2][1] = min8(vv[2][1], x2 + y1);
-    vv[2][2] = min8(vv[2][2], x2 + y2);
-}
-```
-
-Ugly, but worth it
-
-----
-
-![](https://i.imgur.com/GZvIt8J.png =600x)
-
----
-
-## V5: More Register Reuse
-
-![](https://i.imgur.com/amUznoQ.png =400x)
-
-----
-
-![](https://i.imgur.com/24nBJ1Y.png =600x)
-
----
-
-## V6: Software Prefetching
-
-![](https://i.imgur.com/zwqa1ZS.png =600x)
-
----
-
-## V7: Temporal Cache Locality
-
-![](https://i.imgur.com/29vTLKJ.png)
-
-----
-
-### Z-Curve
-
-![](https://i.imgur.com/0optLZ3.png)
-
-----
-
-![](https://i.imgur.com/U3GaO5b.png)
-
----
+Then we can find all-pairs shortest distances in $O(\log n)$ steps
 
-## Summary
+(but recall that there are [more direct ways](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm) to solve it)
 
-* Deal with memory problems first (make sure data fits L3 cache)
-* SIMD can get you ~10x speedup
-* ILP can get you 2-3x speedup
-* Multi-core parallelism can get you $NUM_CORES speedup
- (and it can be just one `#pragma omp parallel for` away)
+Which is an exercise.

From 823b55298830685d3eaa57b2b28d10ea91c92de1 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 5 Apr 2022 17:33:05 +0300
Subject: [PATCH 014/173] matmul intro

---
 content/english/hpc/algorithms/matmul.md | 64 +++++++++++++++++++-----
 1 file changed, 51 insertions(+), 13 deletions(-)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 1a611a52..c092f138 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -4,17 +4,8 @@ weight: 20
 draft: true
 ---
 
-"[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)" by Kazushige Goto and Robert van de Geijn.
-
-Inspired by "[Programming Parallel Computers](http://ppc.cs.aalto.fi/ch2/)" course.
-
-For reasons that will later become aparent, we only use sizes that are multiples of $48$. 1920
-
-Cache associativity strikes again. This is also an issue, but we will not address it for now.
-
-GCC 13.
-
-3.5s for 1025 ad 12s for 1024.
+<!--
+todo: FMA, but without kernel?
 
 baseline 13.58622 0.5209607970428861
 hugepages 16.749895 0.42256312651512146
@@ -26,12 +17,23 @@ blocked 0.461477 15.33746643928083
 noalloc 0.408031 17.346446716058338
 nomove 0.303826 23.295860130469414
 blas 0.27489790320396423 25.747333528217077
+-->
+
+In this case study, we will design and implement several algorithms for matrix multiplication. We start with the naive "for-for-for" algorithm and incrementally improve it, eventually developing an implementation that is 50 times faster and matches the performance of BLAS libraries while being under 40 lines of C.
+
+We compile our implementations with GCC 13 and run them on Zen 2 clocked at 2GHz.
+
+## Baseline
+
+The result of multiplying an $l \times n$ matrix $A$ by an $n \times m$ matrix $B$ is an $l \times m$ matrix $C$ calculated as:
 
 $$
-C_{ij} = \sum_{i=1}^{n} A_{ik} \cdot B_{kj}
+C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj}
 $$
 
-Implement the definition of what we need to do, but using arrays instead of matrices:
+For simplicity, we will only consider *square* matrices, where $l = m = n$.
+
+To implement matrix multiplication, we can just transfer this definition into code — but instead of two-dimensional arrays (aka matrices), we will be using one-dimensional arrays, to be explicit about memory addressing:
 
 ```c++
 void matmul(const float *a, const float *b, float *c, int n) {
@@ -42,6 +44,14 @@ void matmul(const float *a, const float *b, float *c, int n) {
 }
 ```
 
+For reasons that will become apparent later, we only use matrix sizes that are multiples of $48$ for benchmarking, but the implementations are still correct for all other sizes.
+
+Compiled with `g++ -O3 -march=native -funroll-loops`, this code runs in ~16.7s for $n = 1920$.
+
+[Cache associativity](/hpc/cpu-cache/associativity/) strikes again. This is also an issue, but we will not address it for now.
+
+3.5s for 1025 and 12s for 1024.
+
 Transpose:
 
 ```c++
@@ -113,6 +123,8 @@ void matmul(const float *_a, const float *_b, float *c, int n) {
 
 ## Theoretical Performance
 
+Importantly, this CPU supports the [FMA3](https://en.wikipedia.org/wiki/FMA_instruction_set) SIMD extension, which we will use in later implementations.
+
 $$
 \underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11})
 $$
@@ -197,6 +209,20 @@ for (int i3 = 0; i3 < ny; i3 += s3)
 
 Which is fine, considering that this is not the only thing that CPUs are made for.
 
+```c++
+for (int i3 = 0; i3 < n; i3 += s3)
+    for (int i2 = 0; i2 < n; i2 += s2)
+        for (int i1 = 0; i1 < n; i1 += s1)
+            for (int x = i2; x < i2 + s2; x += 6)
+                for (int y = i3; y < i3 + s3; y += 16)
+                    for (int k = i1; k < i1 + s1; k++)
+                        for (int i = 0; i < 6; i++)
+                            for (int j = 0; j < 2; j++)
+                                c[x * n / 8 + i * n / 8 + y / 8 + j]
+                                += (vector{} + a[x * n + i * n + k])
+                                   * b[n / 8 * k + y / 8 + j];
+```
+
 ### Generalizations
 
 Given a matrix $D$, we need to calculate its "min-plus matrix multiplication" defined as:
@@ -219,3 +245,15 @@ Then we can find all-pairs shortest distances in $O(\log n)$ steps
 (but recall that there are [more direct ways](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm) to solve it)
 
 Which is an exercise.
+
+The Strassen algorithm is only useful for large matrices.
+
+https://arxiv.org/pdf/1605.01078.pdf
+
+[cache-oblivious](/hpc/external-memory/oblivious/#matrix-multiplication) algorithms
+
+## Acknowledgements
+
+"[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)" by Kazushige Goto and Robert van de Geijn.
+
+Inspired by "[Programming Parallel Computers](http://ppc.cs.aalto.fi/ch2/)" course.

From ab322dd710898564821b2f58cf84ada7e171f845 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 5 Apr 2022 19:39:23 +0300
Subject: [PATCH 015/173] transposet matmul

---
 .../hpc/algorithms/img/column-major.jpg       | Bin 0 -> 22004 bytes
 content/english/hpc/algorithms/matmul.md      |  61 +++++++++++++-----
 2 files changed, 46 insertions(+), 15 deletions(-)
 create mode 100644 content/english/hpc/algorithms/img/column-major.jpg

diff --git a/content/english/hpc/algorithms/img/column-major.jpg b/content/english/hpc/algorithms/img/column-major.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..675d0b856231c263b12d3a246ab6216fe02af3f0
GIT binary patch
literal 22004
zcmdRWcUV(f+HdS#57MM7RY2fSLKQei2uK%52rY1=6Pol6mXRt!KtQ?%Bn^Uu9-2t+
zHS{9AgY*tJdPZbs&dfLW`R+e=^AO&wz4qJIyV|>co1@;NFMwZ^6_gYJr%nR^r^r8m
zqXB>%;M|$BXV0EFNB%f>?%eqcS1(>5UpFpaxpbA{#?6})Hz<Cgq+z6`q@t((h2qwo
zTl9=fEG#UPw5;r`%<PQJEX?1SoFYGZ;rxZ`7cX9CrlO!?{=dGCJ^?5$o=Q61cIFf-
z;55alGZd$essN0DQ$H@i->>r*&YeAT>hwkOUC1whQ)kHZh07POTsVFH^ck{Mr_Y={
zcb?+HO`1D`I??SNl(b^MLtNe7<aSa~vx;guy5z7435x)=z>l3Hve_S4Jd2FV({{SZ
z2{L(3kTa7ve@|vhf9e$A^ck{ueqMjFCExp%g6y}`WKW$r_f4^HHBOzTICJ)<Ce<B5
zp>veNT8`AeKVubn@cwH%8~fv<e!x{SCvpu6fGpq$VE*(c&;P$D^BC1utJr>Sew{{s
zZYFvXxbiZ1kmn}ER=bSPFzlr!wRdufa>)lwI(+gn6ea?ii%GvaKVFL2;krCapY=pW
zsy`+22|fb5b*==vX*%|#w3&*or`B70Rh_y6X!OO$Jx`W>6vhTwx)5ztOZjri>Rmmd
zsmI!53TW($%^qEo16^*|9xnnfOHgQvs6Fniab@!AkLN*Ei$iIT0KFvdXOmEZ$1DB<
zt9LDB$HLD1)ZjjLQ}H|Be|uXzQNS5=Uc<XtG!CATtF2~QIpNw-{6&+OA1Em})S%>^
zlyN^^9#cIImh@*@)TkGSKKvn;um7%a7Ge`Mk;t0{Fa5N=4EIy|OjN)P-ry;Xs2cT-
z|M#~MTvrC$f4>l9uBWE%v|!1eulEOZfXVo3_Z-H2U(bL(uWow)HEx5osWYN_`!+#^
z4+zrq0Gz$|{~~<ojOq?eLDN^l3)JBOP=f5+cF)h_i=7E{_M0)@o#D}cT;S!Yv(v3k
z+A=78P<_U*x3^o+n7u<PIJ>55+<UO(VV4+D!jX1n?nU@ZGm$tWZu`2K$H|xB+uj+D
zEkTh$Ah^9Q1d~PE;dWN7K2$AY&-SZY<iX1k(iLvmyUUpmhw?o*>-|12&%Jz;^2WXU
zgK3fVV)$n3#Ayw-W|(~)%i~;AvBZGUb5d}4--C$kH!DRDx{@8kZd?+PC<ug6R_R;Z
z(G5-x`Q)@BwNcJhixG$^P|nTcgE8<@M+!DO<u!RjR}6cA`k3|iT}(=$P7fEatWRB=
zrT3=RGdxMlyGCPY0cU~pR_*<8oCU6Gl~Z?yYl+%D_P3e}dL)v!9pUtFrX2$G0%}KV
zgVp4l6u}^2D5&lt8iVEG5nz4C0Y1Q0CdsB!dV?)qTWn@scR75ewx}8ZdvncltPSt%
zZvesk|Dwq}Y;3UBOZmIvnXZsuN%w;J!0;+WtSb>&O0C&gQqCc!>9OM?H~2qtYS>g<
zqZbf2^d&ljg-Rp#YeHLSS?m*g8{Knrh7)~oH)^xPb#T2s_QjsX_4v61?se|nYt!5p
zjsSt>r+W)}dzP7vn58(dBWAgs{dOW5Y46kRCw|R_XoiuX@;#7Ai1gpIX+rtH!<aXI
z1NSM-${)LLJZR8pkeb&LvrE22F#cG)``(;ul4%1&F&Ou~QNvn<LSI2|t=yBjg%gS|
z+(Q!OyRX>tR)38n-8aUzK(m%eDy#|{Ifl$@Uf5DYTP42FS?;5h1gu94tUJ}9<dqle
zOi*wgPiYaAn&AB*TVCe$YC$jFYxg=@`Ynpgl8tMV6(x+>5eG}&iOXwJ7oJyG7U<e(
z>Mx4SYbf3<(xY$Kp$0-AI~owkIAl_LSf8~mqj_qU>AhcKMJbzx363}T1RaDR<iTzW
z4Ha@g22SDY?Bak->SFdUR|gmEY;<^|?s#Rq&QyVV0bxT!l+oItNo_55teedX@NkNI
z0u#ztKI((&tEEWxo2sE%E-rVMrl#lO=Q4YYgm{DB=2AhZnkO|Ch2eL<pv~;V2sJ&a
zjPbmhnzSH|4?vlBG!R+%*J+KJs$mjc8eQqnuyS@Rp6qMbeFV5__w?UKz!5-H&l;Oq
zd@Fl0a)LH}Pyl+@u0X1av?^hBmpwyAstAdn@M6<#nN44H;LVfmuza4}XoVtV#xR&O
z@P*VW<XT_x4#{i4PT@p4K_pj=;qwNb*#;}NM*x~^lP%s<*1AZ1%&dcJ13YZk4CP%$
z6UsK2T^h-5Hv&&DB|qWP!5PZgz{^L2%AYjMlM+FV)Wz4s4)S&9bdnaMHuBN~JPS9*
zBSIX+^XY=eI+{hL5Ce8jEkS)|=_nh<DHJ^rBEF?PGa>y^H{+w^#EczSGqb&6cf*!X
z<E4Z_QAVOMx4R}~c-48k=SBo<quRX4$9`>cvwJ6XJ!*E@zW1c<bMRS#a8JA%%V6RG
z1*Op}W-TMP%Y490#ZP%YjGEr$2^ieZjdv1dUYd*3hROip#jX_E;7%hxPf9f%lEeEv
z1M9{7hKF}&FsTpUJ*Qz~3k!;8)9)IS*LcTK_L3h#<Y2#KxUH2$VhA3hVRtH`D{Ul(
zOu}TMAIY{H0jR*53nn}9vbW2@BMY>Wg?!PGgXxff%lbWXjLKDR<NX;H2q-JAAQoc8
z$156>8^>?yk*C#<#aU(>jQP5P)K!4k$WS6%a&1j-9#mEy0>x=k%By)qUsKz>!nR0F
z<F0R&WRLNQ)KLCdP*gWvYP2Omw8BpF6!}HNG^n?(RAm)-@NXsB_ZS<iciFm9Q#E_H
zqlrj3rY5MwDzxsM`L+?ro7iS+G!8M=)Yg8IPpXG(gXFQ^RvRFHRtZ-%SMVfEU5y%e
z&Hgiw+xsUC_^usitMrV>F9UBfl$E=e{mqnpm!^q5NYG$(d{#IFrpD}J^kp&4^rpcg
z6-|>3!}AYqW!#!f!c$?lBJ!#1U)lI+VX-SKR)KJzE0<GPN2z)<lT`{_l>BJs6z+wx
zI!Qy-#kl634!_>!%ImW5*3E05$P-hBt6%8sf%Ic==BQv2uT<3Jk_5H8tCMvW3k`85
zpe7h;6&T02&h27=)%yMG#xw5jR7WXf6#~j8R^;vqfyqR5{QFMw{m#ijQns~Nwq_3J
zSW8lTYfi-%Ir#*PO2Y<6C?4dyw4<}JD9#dGgYBe=MVYEck^Y)sdO+?)ql?{Zdm<aX
zzI9*19AtAeV1a`P{Wg6sW_tH(_r4O(3Y@j^Jc##NkePU*P>|Ax!dEUS#Ba)kg_ro3
z-_ZHTVN;vFzgs|0BS&R_shQ9pLkNSIBonv<_pPfY9!9tv0faEyVQ}gC%KgsR&0gPn
z-Skoz<@zfd{jWuTB}T<M%gmR7#p4LzNBi5$W9<;f3U?aSwr6jJZ;dYW)6O<Abck&7
z5DIH^e2fUZWw%MqgjpSfIYA~4Gv(~83LdB@5mdFrC%!QT;ofF5XkF;+dA!9A0uFm}
z&ED#4SgCjUD_2CNmqx7H`i>uL!%ova2bLT*OS`c+iT`+>P0gD{p>|kyBoIy-ZAobl
zF2xEz6dX$FPtUW<ak7~lULcri6|p>ti<Fe+NgcDFXFv+<Bv3Th1iw-$HhD$=(YhU~
z3pmg5>3dT$_6cm>ns9M4TFmf<DgsHWAkZ!~P&S#u^bnS}JLqenLdejj!B<$#%{zv2
zKveUn?IW!-R~~L<ef}^G<#p9ceWq(EyfPg6f?%s18y7m1s^i8T7kaM{6u#ruGw-;8
z_`%7{;q=EA%kfg$3uaGvuTJ?%wLfuFd6GjE-}S}u|EOd+XQsQmYHa6Mz@kU@DML7J
zdl^*G$03RdGI?auwfcKVuZh>Noh?*o;0n~0)W#MCc3?$fc|6)77ZY=9hXj*@6ZG(3
zLs?cAv%1IU8Ch%hi)}Uvck4)>KQ{aQlW?U`OBcuWJsrazwQ4VB9|8FL5A0<KV#ijA
zQwKQ9GR+i9vAs-ckx79!{!RA?5c^%hz0ddj)%>Ktxyj7(*tlq-$*m~2yztHf>W^sx
z1Vf|kj)Cj*iBXzJp7f;<ObJmS6tes9@7nGGTYi{s`jaukvE_`HD<x78>xN8`9(qHJ
zzA~iPV;!@}(i`*34Z<_RzUa_3?YkqtEFjK3PHe69;=f)uYUvz~>L-lS$W9IzXV!N6
zeRoO@*(q$t7P0Ga1dTbF)8#subaN4S2W`Xf1rN|Q`{$!w;Pxw=SRB%W<QHkmEi77W
z$sJ0Us766!ckI%i&dQPcV!_bN=$(3J0df9uNqa3(h`HYHeCCkMb$6@8e|7F()vx;k
zAN(@y7JO{elW%SKO<iHHMB(eo)(}BPU+P=S2SO~VhaLXr9tbYY97cW56&or-y=9v7
zgfEd+qD5C9i^UE>ptEex6Z&UHIX8=~jvXNWBOv2O&ip!TH9kal!|aLhwJC&j`|C}m
zFYuxJj7#mum5!s<!u?Z|SiJa_f?OZt)k@FQa2GS_h)<-Vik3ynKN~E$fJOIMiiyF4
zs%&w~YAr_nY8IC|Osd}`k!Bfa8mog38&_u^JE+z;mL62E+Em?6r&O-<PiB7$UE#7H
zuB<!uZqz3d(L-Vg`Q11F5Z_J9&)zzSI4->kB#6@mtSK_j)OoW#!#Tf5>gT$ceE{6v
z>?3!@l;g%|a_S2jQ_0&Ux>norUN(JQ7m-P%hD(iVac(9MZIIR&%kL35P40UVy`&vT
zGv%hW$q`^0-0|_))MH;o?0*kZ7el6q<EI22DRNle{+E>*<!wDfnJG+Xj{x{@y)S3&
zXIF?=X#J(iBDIdjR;!XHhrO_-+BcupK93Wuke)dvzj8<KSR`dQ#in{*;9vBQ@bCS@
z`M7^blKaOPo0(_m^&frDa{2f%g4Fk}-$EYHp16eun*6e`#u3k`!h(3Xuwif7XZsRw
z6G=<nx`+j9`j`lgo6TGA{|NEc<GyTmT{hrxR_$Vz-<@!YmeiLJou1QJY_@pMMDo)2
zF8X&BY;^y%E$a@CYy8xkRrG%KPt$VdRlIW`>b>0fb;^b#K((3cJ?Gdtc`e?bsyi7v
zGv%4}N<*?(avy=8M2foW>)(3sH`mcMm(UF7PgJT>6W>bE6NF7>1Fp_L{XQ1}0OzF$
z<#8EnVYLG&Q8uPN<EC!Uns58sIR2e&yNSK&QLR%g^#W9GSD0X9rLaNnxROj0_zoj3
zRkv%Us09q?3WHovYfie32x4<nc|AhlmF_l-jE{^Nm2*!HZtJum)?{aZ;n*yOC|^xz
zgv-64bTn4rVsHtJE#Kjs9#?PD!K19GRm5)B>jAOn8Bq)BhiGX1j-oGIN9qH`kie&_
zYi#9aYj!-Tw{nbaMZLVd4XM4iE(4R+l-LP5u0-V)`7oJeEjUiu!nUnqGPm|g$iP{5
z{kJ?g{UOfyi#oT4mp|OXJLSo`VZA~*xKa+|Kl|R%Fvap-x_<;%@1~r+?9HU9tCLK~
zaD9DRE0q<W(JykVH14Bha=?w3Msw`&;-b{fufGb1uOi01R%g02x*u5B#wGB4I{Trn
z)<?YZ`_2PR=O;B<!=k{siEjRF+`?DSDTg`j>6tSZukMcLH%ATNsi~-o79%fXtzlQ&
zWqcWH^B|0*u}Ki<fvMHVc3|HS?n1F)_N$f(r;<+Xj5jW!LdC_U)G`Xe$d#3M48O6u
z3Femg!#&I-s9!FbAh&oLlw`S8M1!!kzW0ep6pz3T$43BwPQBKLY2XMD4qeoyu0ovn
zQmb(zpOMjMXXlha?kwT0qkimIR9go)pPNP6T$#5!UbJSKH#J<1c>}C~<Uw*0@kA@H
zO?P96Yu}-iGd|h^hs*2KxE~Ufu<WDe94kLI4$<nFHg{PqCO%wrm`V~$UX)#>nsj8#
zE01>>_Rc*5)M<whl@wigd*1BJLf`CjJfg$5aSJv27O*$+x;^6vi5X#-%Gc_spv3go
z2<z;#5d~pv!-pISSZgesM`H#YPbg>1hYni%T{6&Cq$2&|-x1LSa`vx>JPN|DnNKBv
zhE6NYj|MWWQMWFmL&kp6FF{O@Pb}3BROdVy)HfA{vbDsdi<@n7#~^Cj;GSyi&D6c~
z+H!h6*_QcNctxMd7(B^9`?xnck3C~hx+y=+CuhK**#^u~0Mc*3Cj1g@YT#AckZx?!
zooj(U3oZM5EFZOK*+rSwa-`N4{3;$AA!c0fvushW908~SjQb(WpvT#yVSWK;Ym6zh
z+!g2?8$nO*B@O5<4SMXIY+*~3%c&IG$i+8<0upwdkVmy(Egfx19t$8vu3bzLVT%g9
zZ#T3^-PnM}T5(w8z%Z)jvE^v!RF40(k+#G>uICvVO}Qyf9SGc-UU|UsjJj=9g$~!0
zzDAMN;lo4ar&kN$eJs_*zY+9poI_qIlce$K>3X9N^CEQ`DRoJDx7Ov`TaDP&e~0Ls
z%wjVI-O{eBv(^-_q`P>Y?K`W5tY`WRm7k2m!?ZzRnQ){NXOL@fa0%y41I7JuiuIU;
zr72z|zSKj<pj{QUEobTv)=SZ*8BggtEUDb3v+!bSkpt+v8H>7D>k&-`i$Wb~@%J{c
z)dEuo`lZ#Px|AU;IF@@Z4todr^a_WV&DU1FnI@SYZhLwCec^5|;Tw4k2Fa$<dm)Lc
zJ-FNV-Q$(!7oFQX4dHe<%0qMz)mM6~FiltcJ;Na%L#vW>ce5949|o@R1u%|ndA@sB
zAd|ns<l-_AH0x@9-zvL`q{x#xT;lUeBdt5A_+0!`0QM#W-+m+8dh+B}12pmk9fQM~
zot5czpydVO+xk_oeRpy85gmj0Sq8)0UwDm*XR<&xkX8j@-#YZrfkr^R1NFw%y~zHe
zGXxtT5lLgPeElJ*Q=)xY<#L&b!o%!hjW@vui?m_qlc9X|ej1Cs6=j-J`mKeYzn!2d
zNWaXCr-<44n*=WPd$*pKSaa*yyz94!DtUNDJTrdh%_7eMq9R6T+{H~94y~p>b7yHo
zJr3{GG_Bm@cMY2m6ck*8AbI|>U>k4*Fpd}v)^kEX$EPO~u`g|In+QqfKDShe0%EEX
z>bhW*p)i**PU~FGhE-$n1Bn;AYP5u|=(A;xyUrKZ;|0tyTsMf9y>h>tDr@=si~zW`
z)Np8#(<1gQ3HHj}++SF4nt`15v^OiB9xaGkPI;-skjr~r!=tztch8l^JUw4^H;$WM
ziGLzzCYUPa#pEDqE~rEwD-e9;w#ys}tXD4Rkt5t|t0(WO7g(E-N1&YmsvkD>X|}{^
zNl&UBa)hgi7{OdcK4EbU7m-PVCT}}w9xf(S=qPP+`Ga{U_EXF9*hQPa;Mn&l%dym}
zewrit5lkIHOR=Nz(;IsNs*(6`$=Y|mmKC=mAUgWpX1Fldea4$WM(h+8pQ%Po?NJt`
zd=Z-%q6QwO1j9J8=WOygn=`t$(n&&?h;lc{58YOyO9)*()-2{-#lp}mW=xIFQ#C=W
zYzixahU@|GZ@UkYYr#4<J&yph8C)Gyo07<nAzlq^8DX8^p4UeJiB^Y7%)K}f36A{Y
zGWm^-DD8SGN<P~hNkt#(^(?q`QSo3d7)A+FbmBgIMW5B7Lj<xCf%kQ&3OPcCW6j}%
zS~2W6oz;5-!R4qU0H21{`G>Iod2+c@!4cESj)+uIJ2N~9aZXaNh8TZ<KJd0GJeR=#
znZLJ_8@)L1h|;A=m(<}M1A$4}QQ>MX_W+EH7k>HpM{gddSlA;1Vnlg}f-<Z_s%EQr
zuZ5iKGoEoLuMEy`_b$IE_FtJhz?u|BoS<h%H!EL+j>z)6X<p5{Ro$O_afrR6t);HF
zi)1(%Ru_~~3eh;<%*~UzCsBju{lixp0!)wLMtWJz4EM$(;HZv}ke2{Hdv~4vjGIyh
zJ1Dj3VvYTm666H)@kBC)kJKExyfG<zgET372@*PlFOZt@2C5tU&4N<THV+;X>dKL6
z@jP8YfZD8OSl(-XN7M-@qcw-C&VB^m7YewX_Z*p0tk5SFEhokv4(@SnUXi$Oy`J)i
zPP}a?#CYMRnDJHd<+pf$UwP;WT8%t))W$FJ+0qa(nO(0kG7+?r$w{kQoDJ(5k=1^o
z#n=@a6vgaWY(1ymYGNZ@VgIjEHBug_S$3eYNj>>~GuulHkq>F)jTaf_GUYqLW5Xm7
z;Kz&U349RfMO&$93-{^XgXgb{?tENMiW^BW-Msu@-CYj#<p@xi626V*6y7(=;R@+~
z<KEM6Y@exIlU1RoRd3Xd74}4cK|_#Xjp3j&cJ7K~6HYz>H9?Qu99Z0VksZ!T&)0Wp
zi_e@%b$t)vN)%_%#i&;;h3RC`OocR4219@%?=Gg3`&7J3bqY^bv&A7p`Fjb1#yQ=T
zIBS{74^00uxwP*`>_?x?*N*7wU4B$6ddH1uFolBgX0~x?_wjE{q}=pLN2Z4$#??Gx
zAEZ$V`USrwikZ=P&C8I_5W85YKe%>aiOEAHp<nNs`>mP4Mj}HSCgsOKPN*OPK%3sA
z3WTgbF^8nZv>yS)t9&`q_liO~a<`PpoeIVny4S>H>k9T3FR?vLwcNVAA41SiLrgU2
z@B!rsb<>KvbD4e*3Ow;JKWzw+Re+-86pTIu)3r8$jKAV5LM`m>IsWohq;ZcEkOFpS
zgH3*VB_knk+cIO_g2m=d0V|&$-MlniVE-gvt6=g$wK}H_DR$+4Qq3h!-H9m2kj1)z
zl4+~>nc#8A${i*GFQ^7lt>;_bDVcomx?5%(cI5~_pF}%RdHM)&oim&0D^>jtXx{@k
zd+A!tp?SovhST08Z~Y0n5keAr*1t$3r038T@x>s?N1ORQO2C$$GdH0Ph+qRM%uOtN
zY4dz((cXqUL~b;Po<7jRlX^Fh1y?IL*UO!XLSmkACrQguA9SP^a!s`BPl<(t+?KG3
z0U_t{Is$cEo6%oJ?ym7!4&Hxvf)1Jsa&c@;$S=>9+p?3Me!;j?Y_mrt>()8UZYkwh
z%}wxD<6sgM)Nm@*QGWBt&@9lC?_)7SDS4uEA`eXzbXAl#sBX|V*;H`@BHDEMeqpDg
z`bG32-T09Fr11xcE5w){n{-=1lwf~zr*^i9rNgQ;CJoK87T)<*Hm%*8PsvthhI_8_
z>G?hJ9}|md!acAmFdpj{QMi2sIGAmI_1Vq&*<)H;n7x>)a=n3`bj=q+t28C^7FMSv
z2s^0;4W^K-D`U0{OwuXlWEP7O%j-#KJ}^5p^(@awX&T$g?rRlv91aghyB|}{8&;b~
z0NasA;Wl|CBi$NPVhOwT`z-?AOo?l>Of%cNfj=oy>cBw?xngX-rC6zs(rVNi7dZ{+
zMZGFf+CIW90j;q#Fzy>ZCzo?I03UwsV`mpg0gup~F|~P1MA7qbQRs0txj98jsnRk|
z8Xw5WAZ0MLWho2eV_O{y4q_WoJ7H6Y1b^Ys;abc4)wNzSGVhNmq+cfU$inX5ykVK&
z>Ii<<)X}?lVft<1vSMwQ)G;OV_lkytJI&kkmv)raUf%aVOw!c*{)lB!cgt5E?-u&H
zUmLHgqRZK}fo}wn4_lvui(Ll|t#SH9-@7jrk1Y<`%}|t<?W5a9c>RxpHGX*Kzjos+
zf`@<bal5Af>bK(<G#w{3LRYZ4K#1X7%ySasb`w52_q8ay*9Wjw2s*fiNZ<E7!&uaH
z0B1zpZ^*L;T*m$xCk<{=UzSGpy~^Ql@#1fax!)s$+YEB9ehx?=DzR~YDHLQ2GurTu
zASarU@X3`FqfP79!I3o6EsG(EprD|f1S4f4Hm*eS)5KPey5H0R8%r_MLT5@*`NP_!
zMn~{M))}L(^Cu^t-err#gXUcYSvOC-0XwotTQt5L*bKj8o89!vwK|?{+}o(uQLzDG
z1KkfwY!+ssR97f`oACLWr$%M;aO=Kz!ul(}h5chD8r14_Uu=e%{&4c{L&-Y(MT8F_
zeyC^~f--%U<>4C4kQKn_-UO8`LrNVh*D8|0j*Of>*Sv#=&oy|MJ?D;@Xr5AJ%{J=V
zXJO5-H9ov6KV_Xlu&#%)c(_;8A4`<_At$O@n;=fV6{yRYd@Qjv-g!%G_SQk~7^e4|
zS|$wFclwfcHf8@vySw+}!DMB-*9Nm#ULNkX`yCQE3#B=uSWcUJK{eTo<u4|OOttr4
z$2)_ohMNZwRFm1Rq}Hf+H)OqfJvS0uqNID54Q|Nf*hg9x3$1k7@t4ynwElMgVE*xR
zw_S(omT9T&{oQ7wQu2ahmmL@R-qOEpUaUvl8gDy8aEVh5A@MvuX02{Me8sm52JE`J
z=NMZ>P-!26KUPrMtG2&_TXmzp>`+krf)EWIt{a;zU6?u3I-Jk!Yca&$>L%4)*_tts
z7PnMCj_%j4(adT4D<v6S8I)=?2{u+ofJ76S$WGtEYU>t!??!%6R|~EwM<OQ1t?M^@
zk?>b1AT>q(whUNNU`<;3hdJh^5~`}r5p+a{_L&PQ$F{j=2Bzp0C<zb6{Li9u0)K9P
z)N2Vm7ncx3Vc}*5{q37^ogDQEjW;C>!}e1+SFuNcxcO!4L+KVtvil}1?j{;4;boTm
z#15mIAI-mr81CHnPW;{gKk8Iel8R}9T`TI2pK>Q}$C;;@$lek6DiBc#CyX=^WR;eT
zeLi@~I`r{Tmqrmnfz5x|f#jowQtefDoJw82d%pT$ODs{PsO>HBSx#?)<!Q{D=9_;h
z5W$?;j)Xb+TN_<l4HxS6t!&ls)>v6`BsaWeO}ROkv<{>K#?{yXfvV~ORzpq};ieIq
z&Bhh0JUs6OpJ^hE7M!ORvlbp53}|>(zc62^IQv8{X=(u>%66-(NLV3{QeLC};KJ6i
zrK>!H?fiujOpCI$MYUviBQ{gU2gk`xUN<^NzEdQSNIs!H;0*19tA`&e(h+F!=FzZS
zfn-xEH&Jp(`zzLi*Cu{D5^VYwp_57pqgYzmZq$AuD99bc$>Bsk68<q?ow?V?(qvUK
z>bN=EtlB7g3==qSnopI5%smqvZJIrC6=fTXo6Nio<<D=JOoz^+#<8q(!6@2xY<z}#
zuW(X;P=5M|cPVkpqeyQa&u<>b4^(-yo*wwsm&Axp)RJbNI06h1tv;J{nrIH#epQL`
zeK~UQI_G;%?A3tfbHjuM{v*J`%fxT6kF<lRP5;|;=R22OM7$p)2MRUYzfF1>69Cpv
zkpt(BTuNd-T$+|fW2XggO0yYp3`5%GL3N>dUsV?fge}KDQb;iy$v&zBxK`AxP_&C(
zl5SRKca9@Y%1=PF-Uf%>xsxl<`Bp5=dI4hYd4l#x4&DQZfxw_Kx$~IfdV>ISi4Ox-
z1-zWU!?Ac<@YPH+@Vd&C!PlSWAR3>S^D-b%IDTXdJjTg70UD^ge`^v3@sqO6%0z8T
zL(~|BHEtY;6$MR;?MFjycwW@d-<Q<(5)jK%u$pqT`8umMQ$n6TQWFYcubVY~1DNUl
zXpaBQU!mq>U`JO|j+wt8a_uEK<@u|H)MMpcUL{|2=g9m=VzCa8A#&8o`EJa6=I;HQ
z+)Tyy_Tlk3T|8Kx<l2TlpP}|PSp}GnXthes9eDrHYUZ8!eG1f;ZcVCwO8pI8fTCSa
z1;5%Rt^ekOmhS)$VKTtO$>C%15r9DB^ljrBwW-m!AD34Xzk{%2J$NmA1Ji(B7Rhbz
zEQ{2=sjoiK18kGXkw8ZjLH%{GE7-ZjfXIu%1Xe9?_K$;(CD%hpuC{^@QN1aB594ws
zUbDDW(eqL@-N3r8&d|`%pbUWO;6uA(+>X&nnhWVyk!V@S$8Vg=z|(E2vm7#$i_a>O
zxftVBbgTT?2Sq=bF@Z)ed=R?3pe7x0RxN_^55RB9dD|}O>qZXnSM5z>J)<27?VjYr
zM@6xZ#dqS9p91R0iO2u__1IgsdBuiVG#^(w@Wu6`p-qRbS$cvev0HUw!XnWMT|Q<1
z^n(k?(OSN`59-rInJDO$JM?Ts`0VsmqfD_sNWX^%hLPH@;>s*8wcIMzxBs*p_uM%^
zKquiN{N0n4BIGURZs7%m*&Krd;*f_jmKXvu`?Xd@sA!2K9yLKN&~JY+%PAd1E3nY%
zc1`sLD__@t;6?XW3j8)Y?kxYS>VQlYFaP$d+=4f8Z(^&G-#33qc$m7)ry>?R;p8BV
z*}zf_t^KP0q<W4l2YiE7Ohk_7SW~H~KN3enN$CspXvX{YRsoP>^W2*%Br0ifYQ4Zm
z;vR$TR}(Il^*iz9iW2I<d2iZGovQC;1}Fft^xu*gaPf&=teA|a=HxA(Ig$?mApG{f
zZ=L@yIh^6nHuqL8Z>x7MkPFi8ki@6oGDR!LJ`4--WCc0pxsb>hOk*Y72~?j&1(Fxr
z(cNqU36xh8aHuiRIk06rpoW=_^WpsRDGo#MIc3-I%mIOPp~cTuVo2o%7molSsP1+n
zd?fR4{zzQL^0Ue|$G64EoU6rf@nHeSa*gY>noL)Dg^?x@V5>ZM3<NotF6t@8rbdf=
z_K4l!A-3`l_}rWr`349J_rdfH0F1-soBJ^~=)QIrc4KAw5Tsh9o~v<5F%p!}vs
z#?=x|nikZxz?`U+DU#nSuJ39#V-Id=?D7y>J~DFB)LsiD7%j)yJs2>yNQ=D64<h#L
zUqZkktO`rx_lg^9Ysl!#Lo2sK!Ty4Iat>k!H!mKS)pC!8oZaF<E{so?uT*s^sz~&q
zUr8=1`;Rl>{K{?|hLmNmAV}rLWat~nSns!MvXZ6!{gbjk%XGJE;Uifu$jWDBygSp9
z<!o5E9iS4^m|RR(|2|tHD^IeFJ9CIcO-;xL+C*yX<&EuL85jRK#589}#HDlUSton~
z@W1|l5qBqHD}ROUyRNSHdcZMM3h%DiIJh-HB`8N<&nJ%`Gt<l#AKVgP^_Q2d(4<u(
z!_7!Qb&%MH#_%J6p&`Q^RHGn+y^@le^gY^(-ncsnYMY$ccqw&0j}k>IviX%G&uk4A
zY_`G$E~YzBaO3&FRc!eKUt&4;u9Rb`guYJ%19pGT^`y&m-vFMfZ($wl%$NF%8^6A(
zbDHbt-CJ<GuQTnR@%(%j<larivyZYH(A~>*XoS}iHdVX20S$?W!>c(NBs|%dW=&yB
z7V+t<OK8)7Z^XYW&@nka7#`iY7XfZG>}QHkZ(BBj&M_u1e<&S@w5WK5P@fhSV-H?*
z%3JXEQMZXlmzQ40)KxBsc+>zvrtf_S)ymbN7zk-ix&iNlFz$9voEGSItq&cDH`$zu
zZT55g<M1^q8$xTwX|oo3Vg03a2)bCDnS0)kF>wq1i~A4~+=t4-Wqi)csLkwj=P54k
z<^-Zcf@Tda7ZYXJA>&?LoY)|=5uN$~bUZ1gXJRu0pNy=q30sBsN8Fk-i2S?({j`&?
zt-ynG7t?fs$bu*7=xbb=N=`52ZJiA1g-QLheUz)t(Dh)W$@@{u(?TVk9c&E$4E
zn62|welPuV9T8WZRrx^-xkd(?A{-vO$pNJSry~CNcfoarvc}G%1|n+U)3TJK=$St}
zhZ&sZkJREzPnHltEqQ0b6%gT0YDa)&UJ?ZQ`r60%x>cS83p+AO1r_qpK?@>k&4?>y
zae8T{HbP6E9*RlPThv`Ljj?EHnP-o|cRDUnxpWM&qs*^eeoza#Z`C6P^2>5+Y$P9j
z+0Quts*+T*_@oa6t08zD0Tv6;7+%Z$4=`S~=+~;tAB-7v7yV1|WxtBO_D`|zW2ZR+
z?BBBxz?I$;yU4S&?nQjBWucyz!|sS;UX@Z5liw9tp$f0gjC1REYEXT)5gjvU!<T9L
zUb%=ZkSGmzjdjIq!~}K88`I#%+pc-AObu`(<t9*nw7ay$`#}eCtve#vIXPeP;f&v}
zcCwC6HpJ`<6=EiM9A;R~`zkxNa!~?7atmYzak$!Xh^z~CbnFZ1j-i~<z4+)bn<zy*
zX0=4$(Hc9)a8W2MVPtJLENTqshLb)I63L%XCgNW9brE>ui_y3=jS+}E2?`#UXdlyA
zD7KPP8oe6bS*tq7+0jWL>L@C)whOkzh``gsN!pVh1s=I1*K6-srZe1n-{{gb39n}r
zsznXii`ER<3z6!kNwwOmW?_qvAswrUcA}#_5=B?jAKh>{K~yEBH?ymICo|SvyV5PQ
zj%(YgOHNW@3Zxp8X$#MEVuw#Ck}<F7kPEvHcp7>};CGE};w8mr$0H>tqeJSArX`r@
zfwZ2~=x#hVo`2Q44pmq$tny)VD8cS}`e#Dxlhn@`?A_aHMoB{~x@dF@$Su>M2YE-z
zx4=~NT8ggoJ^mdCG>X*KwtqcmoSQ3Vg`JSi9-@PbW>*BZLeyjw;?tLA7-odRQpVrV
z+M%Yl=F{?`qiLF6W)C1EKOEFV?NJLSPAP;ImrVN8V+B~PQOv}`e;3Bz%Fpor+`T+H
zF%X*a4R^G9Ij&{^7+SRJE%V#-7GE3ixkPBTYZtiPVpXHB>VZ+n<Y{*Ul%?#-zAs;Q
z-^+c+Crjs#oGlXTg&jP4|9yW_M(a{~<GCAjb)O^4K&oOH3J=Zeq<fFe!VE2S7BM2C
zTq!X4kocLPAn*&3PD)b$+u*scy_qzP)C690BtlYBJi4LIw8xc*#ja*`V7;Gmvev6i
zm5YD$f2iGy4JA+lcdB2IPhYCRuw?&gX@|;)<c-t%C`%8Z+Kxlc*yb$%^*An<Q2Hez
zcp{>y$JMxcyCFXM_OyIoI^ZH@clq@<!=cS&WVuA*i9N<9t(mZP8-R}HPv@PgND!%f
zANRF6M%Od~p;H=OSVcal;Njsdd~@jG{L;CSdG@aNZ;wv<2Z}@Q3ob<l=8FGiDW&-5
z*eHS8U1jDrDnQC;3w7&N@3i-e*BM6*zt$4}%hJdCHlO$XH6-NvPd}yHm-t*X8t(E&
z`h8_0Z}P1zrT?z<6>kzbyZXKIn33RBtrfR`MYo!ajjf9!OAsKHV?D3SxEdoS>*flo
z&8N_jY#|HTLb<?OWb+Qm=3VHyQzfVxl72a1Q_PCkdfJ*Q`i*1L6c>LG9yHcdUFVYG
z@XlvFrz|9{qkZm8t8N604E%yscTUVD8XEDlMO3~bEBjpmM)7Y3tiO?dNBvfTLC+0c
z3>k}24%v;l=iJpO$1o9%jpvbQh{t29Ysbc<erao$(*xRyyjIkfqGQYv6;Z6~wkhn9
ziq;fMLJDx=))W00ACrCg&BKP{mqG1L`~GAHpK4d(OX9Q5<G{YWduzg*+wc1DL=W3m
zyJoyq2x3Qnwz0Os>ozhKg~ec>$6Q2Vone~1Yzjn(jb<j4$FWG0f&#d5r)IQ<++N?<
zT|bt=l;BIvREFT;xvb00ja2I$4G=NCmGS^6{(3(kAcQ^y8-(E=kI&SEo7q|+;xvoV
zS;os&UGn2AoPjLI&J_RQ-``vYEOYY%BA?|qu%{ti>_0ra?cn@OZ#$%BHWfm2w5}hM
zh!$61fVg6aU1Kpy^K<bTXmn(Nm=X_<6?&QnPDI_DZ+JoGE4A{ik$&=Z7W8J?5Dkwo
zcYkYzxr6<HDtJ_(A(ifo3{XlLG9#ObU0U46;1f!)u9>}V9QLL4J_X}6aFw(W>~5V^
zs#ybWyRj@7j~n0hv0VA9wv!Uc1-lTsRMb??-h?O(!I>DXNYQ)fp=lmJp`|<wb~}Qy
z%D4XT*l)+KJoVcpa_MnIGyVwYcXv<89!K>5Tq;vkLzP(MC)@F8Gw}Koyoc!!#|^8?
z0FAHHm!|`y|841W%UfFb;J2m+f3H3MA8Y?L=kU_<d)0Vq$M!a}d^j<VRzdEZi<b1T
z3}<TIBdObejvwUD^Vf0`A4nm7&1<Q5!h&KZpBRlRx%mhR@7d0->INg780w=J!ly)x
zo)qr6O!<_^lbTUUaUrY&<;kZiTlNP~>~oA~0+V(M)RnrRXWT(j67h3GY@ftEM-p5n
zlH{H~St}Qvt(Oe>BBGs*&+HTJ`C5n(ef+vBT|2fVyS-?r#k+EZp|CT!89XjmxKG3G
zUA|^--8hMn47d2F?Zq}&+jC-hTWk)ATZ;H1Vz}6}3s-;t$*O1|ZoMlBWF_qBvao3V
zx7bt2X?x2~E!XR3KrMr<>1lJkzpc*pc3v&#uVf852o!!v=e(D<!3OaiUDEM4!@#ES
zDI6nQYJM-lD=~>@0+Z>=$)<R$87Jz>rg1MoKz$V}0-a%2ef2G?RTHnjFy8;#yz4|3
zEbcj)us)Fl@(=;4Xw6<fRoQxV<WFz>$pZQetAc!zz;-Xe-qw~l%rD(T7d_BJ-vcg+
zhbY7Yp%Jo<c)AnoJA3l_j_>FC?mUoS!x4!(g-k{^+r>k>Ah?OhNHK}Vub3i;HiV~8
zD%{zxeNNC#o%0{iQrFp3#dUKerze}fkP~Ekis<S-MyA@La;uG7#aya1xq`!nE!W2f
zM{>Ph!6G%_n+#MQnNq3Q{k_YGOf^!jBXTD1D=UhNj8dkZKy)8kcX}>$9s%yCgEVRM
zgOeUD;-$iE*1ii?yx_Rx&@ZjPU*Bb<{Wal1L9FnL!Y#2y=2Tj<C+^Hs2dP~q3rWB7
zojail`%(<UO*?ny;CpVlVck+uiLK9SkytfFB-XkiK5I>DKZ3aZxYnoId2f!DCrCdo
zq1fk#@@&V<XP?_$K4C>f#h`b>X1Hc3!R1pf?Fl7<s%&+*CiN<f&GL#eN^81V+iTEW
z+t+wz$f;=TxBT>nUwYkJnwDF$_s-pcvjXLCU-s9&;h~s}ZkQhap1b?TH=<&}zfLqo
zE1cj`k)1B;<t-O?Tp}(VzH`2FPIFWJExoBp+XTdP^K5#wR;BfRLir(;zv4mKQo}tR
zZc$iBJx65s7P0vu-Wfd(amwH$L0rbQI{x)20VaqsP(VRn^d13T+sS-jF?-_8vU&~G
zl}dJ_fvRoZ`=&3OvJ(s%7{xwD=Gy%i)*E;CxKytRW?TKXc(d9R;iFyD$M<TaZZ>|g
z|6!-6RAPxrb<IoT3uKu1?S5ZnnfmH}+38#1X}Kfrg8EwVreGY!H-G(=IOG2)4rFjo
zlw*L&P{lMa!#eEqgh_jCan|QFHJ8pedB<VLx%l&_S~z(!Tst^^!~8F!lD@#EDDf}a
z`I%$NLQUo4$nZOQ_uLAWCJJZ@yb!?lo96A5p%F6A3yvdVbfIAFUU-Lh><G7!ns&3$
z)N%YFe#ClI`b8q+apXrm%k@6@Tsq0D6;Ta09k9&MDazGEBGYcPYr3nj{yY(=51xAK
zFTN=qUd)^?(S1c3+x{7g1Kq^#ld(&S()&Mj=IZ=;J}uwyQ_RWGEc~fdRFOHd6BO~s
zTDv})+kEt%+DJfLM6t&6t`&Gt=zF-zGz;oie<kFme&n`qTb{K36Tq;=kMa6y)^Yw9
zD0gC!{HnlPMS8Zj;p3FK81cSa1=JI-NaS%#UvNy%ikm}sD^&f7c<n&D)D68ijpJKd
zdz_U$xcEyoJVyEDhr&*!drs<^3bkBy&QKlJa(N>@SQgKnVsMipm(bhbWQXM1<sIt7
zW)C;52Js~@EjEN!QJJKm&<zX;Bl**5`)YgFzPGh7xRTb$4bP#jGqzt6G6QPBwL2Oq
zQ+U$6$?yJ7f`T>0OT3UPnDhww;ba>Vd62f5+dmER-v6{{kG(>KEuUS;n%uVNQ<w8|
z>$I#v3(fOOsGP$lWAK@&W-ZBj9MRbB<%eV}t$37?wvuqk-2BtZ*Q7QJ-Gn(#Up9to
z;C%B8I~!V<N$`SAtS1(?3UzgbY0nA-cjB^*L8Nq^8NQElP8o>&?92{$(3d5?2){QL
zxZ({s3LV+IajknG$3~iZN-_wW8n*YbOUg_4u^wZ#!p2HbHjci$^m4~~^B4^(k>kXS
zaVdU(tm5E~h6$Fx<wA(pNFbN=+*wxzH9c7FwxEGXSLTO;-@hgdcCbnev}%0nRIJ^e
zuVsr3S4j2OCD<p_XK|KT?huik!CIz$(gp0*OVm;zY*uD`Jd)R^0|HjvVaz3Tf`P@E
zCP0&MMOHCa_pr+*iw^^Le`99!jlH-)GqcnosuIf0YG*J|HyaULhB*Q>Y7as#%^L?Q
zJAkIhNMFVlK0hYzQrd+6R1s(QD+D^s4B%?0xNA`W8wkqBVWDlBjt*|<V~rW5WUk(&
z#x82AbDHN4sy7B1IZ#RP%fcXXazIs)dgoBR<Qm?H;M6=4L}YANqjbQ0K&;ph%kNGG
z7G!2(yy~UpH)m4V5k^q*Vi}uwwb6t@vdBa%)HUT?HQF8m7M4@eUa_@Xx>*$R$`3uH
z?Fyj<kB^-xB`~FqduvqPuFLKZ({wS1@RFo5?-3odXAfv2w8O64ykR26U+bnFSZhj9
z9D<v*Z)~*MM?#c?w;R`iKv?%Gt1k~0=RscHXlIXHa61BwllH*Xj%+8tbXG0~UZ5@W
z(;9eW_+C)?Lp2V)*qP65hR%o!W&0AGUDRLITJ22Iq!25Ng1Nv%q`k%Q*rJ6k#=;)e
z**R<`xKo3z$Y@KR<doHoMHO>ChiAa&)imhqMY@D*87QM$T<qmF)*$y8#E{thg*Ao(
z=%+a^q5(uixM~q-IzXw}47+cIMr*8sjsS9HCs1Vn2uF^BDuI~VvdDAnfzhMHG;F=w
zWQ;M!8hbZec}tToX+4@2o)2Wi3|W}6rK@>FwAdtWd9<TPdVoa~6if0GAZ^Ni(Iz3c
z$L<X_=Z)R)t~C=4(`|gzLK)$%st!&WAw!dcaL)KFHVO&xu#kasYJ6u!HlkMzdD!e1
zqO}xyhlCCZdFU`TzrU$nbs2$*>aWj+N}mnusk|;V#hWssNnQ1mrp4$>;=*5kP9nKb
zQ)OV2BRUnP*uV0yNARR4G($l0PEP4)6#4BVY(vbDcfq1D^GAc64CF+&97n@4wM$z)
zrM1B-n@Oq(d|8Qg#c@?(9KxD!Y+WQ|vjp)^jf`CO&4lhVgFwDOcKuq7o3mIOV#tR4
z;d+REG*SZQqg2v+0-!b#P+8_?>E|2X%YF5YyWEE#U^VeRU+SfCD`nnmeK8@y7A9)Y
z!44(1K;ziKc1>*}k0WalH6_eNiXB0+k5kKGFm+@AL3}@B-9F8}3nLH_K+td;n{#|P
zSiYSBOU|3F&|wqgIw9>%m0X1b`LS-*bgu`6u9))?cra;rXZHAz56j?(F$#IW5fg`{
zEPCs8?su<?d6aEj88Y6yjZuT!BQWT!;#%edvj$AbR%w<KQPGB64QUn6%1SH$coT^T
z|F1yQ*0Obn$}f2On|ERk-t{gP+^d;?)i;?30juy`kFOUt56)2JsX659k*!M6!$wJn
zg50u)ADel+Y*{9wxBc@BAd!uX10zvB>N;4Ub{KOA29HK3rkQ|dd1p39gVib;^zITa
zCq9n{>zNd42y?h(LP#>&^xar4t~-5fn=0m6gz@N<TkS$a9G_<)Ps~fh6*(w5$RIjW
zIhQg0Y&9<S!KZks$0F-Wb}VbJ6g%`Cir;3k=ZEjqu+8#`4K?XRXt*FH>mJiC;8(7J
zmN3rBM8lfM)j0!;d&AR%*gH$;*Exka_X#D_9UC5jdXh({>!YCoqy)7`r}i~$wh{8P
zfQRZflU*$<9)oca>#{Yz5V5K}4bqu=>9)s2&+Dh2P?sU^K=V@2VH0}u2w*XJ1lYb2
z&BhTDc`H4(Sh=phr+#-yULNWnzmlRY%xd2=nw-=A))%ie81xf;cfZDai<*4apkri^
z3yh{|K=YBpkQtF&qL*XkBhv)HV}4^;wp=4=;JC1=4iWppI#+xnRkCpEmwo2czfq7H
zsg3}&byW_|e(QGX6VtMu<n63JKDZczoQS5?SD`jnLyTOj8K?5@S|_v)8VE=274P!T
zVQtb&^k?yj>As58Zc4d{UVHYR?L8Xp$}hTh=a=2WjDUM;GL_8~By_+dI;DmsN<~#f
zOqPbiMfmO0K@V+zu8dC7`gqJ6JV}`69_1BWIYg2=L8k&lMPRp!e?OmQ2Nhs*9e3J}
z4=H2Y>wD=dU3nGh;|b%})?{$E;ip>7TlAT)S(kShL-NdoP0GtlNNMFkRs5<}^}X3+
zcMX^fSHT0wK|Lt3y09<RhD1NDP{ya};)Hd4zyAl7*yZMT6Lc>YsC74pWg+pAlKS}s
ziMbEK9lTifun_bxS6(o7nTLxdC_y1=q3zPa2M@7^`;B^AH}0OGd6_h?BYpzUznUK7
z(W_o{jPo^e_Pg#erwA08y12oW^|=2isAm)+x$6|QB(-`Q;!Po96qsr(e+kOcG<I%3
z^rD?z4S{Fp9UhCcb5T_g&3oN6Ga$+8$hiunb{Xq|i6i_59CcT5d>r{H^vEJZ?3A*C
zA)LF9Re#i-F=uW`S`&@#PA8TcV68Fc2))Jajevk!Ym~Cr<y#DADNw<gYT}2{M&BTZ
zF>#@sUXmy9>pYiy#@DBA%!(K_YW=$R4uw+(Ff4re`&#eRiH$efte6Y?A^cKUB=}P;
z$LDbzjFlC&IlHfa{69Qi%dlKS9uN`^x;MfD+Wq>}4(Gj^=ZoBYE52^q(JsCjSt`iz
z5e?mJaM&{FzQW=5PbWGIa9o;OJwYcQo79lY-+ogEn?>)p`dqm|dduGyHEDbV;Pq-8
z)tHNY|JA6W%=nKGOdU3`a`cUV9(j969M1mQAlpxVA#TV)FaGOd*_Uk}q3}(jKUtIv
zo#Gx$b}w_DbIki^Ml^2I8v3y4`Y#xWIy&w$&mnb*(pSl>Vc6d_Oz8GprrapMuHfC}
z+d2eQ1Hx{|R$7(YrXK-3OHD^}jq|nrONMh2JCChf7W-jcLiYulpNIXz1|n_E1(L1-
zRul3yoFv_Kx0L?W-D`J;KQ9Epy#QxeKYjQ1Rg`Jj^~EDVEp8JNGNyRq%Qz}<;QDm|
zpZ%w&7apH{O(?qRANI#E4!G2NqY~8UF8P<|Dd68skrZ0%u28_GM*zTI&sA46hprdA
zI2`kn9eW3`l@<q_nS|_Qj8wX>bBQW_!BMX9aSi6CV)wfpMvHeT6WG5(f_Gw~mh_=J
zy!+9B(~SVY$x$`(#Op80+Cx0GdVq7?0DxNm@oOs)H04(}l3!IWI>k)CZS@;NQ3B`c
z06_XLfd2#u23$S~);&qTos&yN_zdM8gs>D_FW<j4-8^92`+hG!dF3|X_K9c*p8(#L
zvAA$1XNm(Ha-RM@^v~1f*z#T{nv~xtGI68tFT#tRi}4C|(R#R{+}8<}XoeWE%a89~
zYx;+i0b0zTnW?hUrA_}PTT|p83y3Xm*(p)M_FvtUpV-d^3cw7~rG^%c0B?J1<tzG_
zj_bYoV*LVUKp6V?<t>Q0k-&phm107nLxNinx|y%D7hu|9)C^eeA-~qId|Un#i8itI
zkK8oU|7q{C4X(jVtTSKRMPv8Uy+^+I)YlNnuWl~;KU?FXC%=)+aaY(HWoe^o8-HBH
zR(g1$Dba@fs^^<`)+kd|+vc3B43`5A0%~bk!fWc6>lvHv@~M>QzADGs7m0av>g2f?
z({r`yZx#0)0R*!V!7m6-C-xl`EBv~H`A*Iy*%F-p#?EOYeRoy&pO2rY#{H=zR$+9L
zfho{#A*9x#Kz}@eF1jNh5&fEt#FHuyQ9#1W^)mh;Wi|Q_N0R6LH3;5JyiC;B_V|#`
z%zVP0CvS;3W*bd<7wCsp#TP*}hJB@a28u`tF8c@OU%0?a_z<_>x3UVHaiX(3yR}|>
z?vgq7DJdBxXTVG0{9&(s=NAX(nu&gXIjzP7UM8Pw&kI<au&lP(4X+{(B$3kLirIrD
z=&3yIEhPPIdX#e`%8cb3leKS5+<)+G`~*~=j#a3W+1OliwHT6>)m=8WpVqPKT-%6!
z3QubNM?su*8BT_G3T0WFlAvcb0=Q!iWY%tNz)R7@Ri5mS^3C_iQ&{)=#DA-|r;F{2
zWB!OABK17ZV+L1e`YsHY4Ygzppr+^V=aS#Z({6>_r#o-TnCFy6x?#8LqG$BVhN5MB
z9M%n#Nq?Vu+7&|Vs?ruiMV={z;F9Q28m-+@B>kfH_-_u}rk@+~b+v!G41rH8Sd*<m
zZ0O)!ERtkp_598g3xK3zS+@o|et4df4|!<|(g>fPml0uL`S|Ejg&gR$g~)pDO8#9F
z#DS!!C{V;k6F4C+FQ;$@W<XBho*KrIT3&UM)nuB>G@NOc^z$R^Uvss$!MR-U!a*g4
z)xS9yo9CP=LT<RmuRu}}Ik{VHQR}!{jnXeZd@R(d+BXUvu{4@yWkpJe?`Tr{$IToo
z8vDft=L|D=cWjI4v3bVN@R`m3(I`oax_g3yNlq4WZ}YR;(yk{7bS`IDO*#p~Jb@5n
zGTe26Z-1eAm#pzus<*)j4LPle@Ai5PcKQ<5_YX`<UzR7um6QEwbXug7z;{t*ljhLP
z;^13QJa(A2aQGe+)+m#<mKiri)AzQ`sdZ~SZmq&|916xLwH;<+>3nSit_GEQf{{y2
z_4^CMpGR7nM!tFMr+kX^q|gU9j%)hh+%zmRQ<D){oXuS6=&*J*%H>X+Q7vm~JSK~r
zaizcbEs^*v&^)H_v6-N`1dc5XT>}(Yo9U8orx0(QKW1cfDqZwHsQ%*Elob7IaecE{
zG<%6#iSeU+WvQbgD?PBuWX}4+DYj??j9Qa<dCK59*?@O|o5!G)6Qk;1{!^*z{E)Er
z3C+%Y{QueGibY<fMT>5~GrIUbd)e+&M}438T5Nst*xKdT6^-@ReT>$f?NfW2dh*QT
zNsD(W+i|n>-JO+kwc_fUdl!ZKL!Hak^`+Vd|K?miD?0VnpVLe4ZMs;r!&2wXj<Ay<
z&n_MPVKPzXYv>>4iSPe2Y>mozKI^KZXV%tBQExwNld`giWD(pxu_rljt^bpbDp%l+
zw~$w}7Zblw<wV*1@~uI+Y*{|H)py@`le@*JIw*S8(&k-0{q^P5*Yu~X%PP0MQhxGr
zT<DeSJ-@b1S3YjmTysMA$Kg#;(N}GaB{p1j<+-x1b=B3*yH^9n)V1DcU5youU8PwQ
zee=fp#XC3MT75L{+?&$!pVn(nWx4H(6fduQv1)f;*?X~1n@hv}bVI#@;!~$dy~*cX
zvh~h1wyP?e&;3X!+ND+ZTKkDd#6Gi)^1J4Eo(K_JyiISbt<uYy>(9mbfjhoM1IwJ!
z;Iqt|n;*^F{_y1as^*fbrcdn8f4$x%`B7%M_6f7rqD@byomzG6sN3m9hfb-xd{?#6
z^j!C5ZFSVqt7nV$*OtUD{>$KyE|HZw^X<g!HPikxbmsmFyLxS(#a-K&GvyC5C)Ud<
zXZ;hr^--_af60YqNm3Iw9M{U$F1($ZdWdEDniY%NqB<s;yeqQoSeO4(Yn9vwE?g^D
z@=&V@XdO>3g`7GK&T#8*iYUAum3Lq2a@fIw_3tMB@eliJzFMs~ROZH1QPFj-S4As*
zepW5gdUoYpuqklD^m}H)r1!m@|1Ouj@VZ}Q{Wz(9%Uvs>?RRF+OO@U^FSPDvSd;fY
zGxguTmltZ+o<OLyIc09M>hGqXZ?{8M$%bg8fHrp;pI!BA8~46*`qxfp-F=m<l__%~
zDR<IVQPA?th%MIsx7hU7hkCBtll=r}TRgu|$INH%E9_q^S{pN!VcPq~p89LMzNqH!
zl>K1X|Lb~O%i8a1`DVQjZ+zCcC#rx|Uis|e3ZPrs-yxPS8QH8h+8BE8qvgwe=98?n
zFNK8c)Y&?}x&Ogdo_p~JL8~{9ewq6B<3{~k>*jLCzVj3B__rte^7+Zr=gf_m-kPDZ
z`*BkI)>tc{-FNza+?f70+^=QbceS>zZCO#F3qWC2cNG{`??GV&Rn{w-6BTzFF_?JD
zVD?ct$8BzRbbZ!Mn#5D)kuocb^X08cx;}YsueGIDz3AE#@3?yQjKHkKX;fQyOK$oZ
KcKVqA-vj`yOfcI3

literal 0
HcmV?d00001

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index c092f138..7102bc8c 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -5,8 +5,6 @@ draft: true
 ---
 
 <!--
-todo: FMA, but without kernel?
-
 baseline 13.58622 0.5209607970428861
 hugepages 16.749895 0.42256312651512146
 transposed 12.377302 0.5718441708863531
@@ -44,15 +42,23 @@ void matmul(const float *a, const float *b, float *c, int n) {
 }
 ```
 
-For reasons that will become aparent later, we only use matrix sizes that are multiples of $48$ for benchmarking, but the implementations are still correct for all other sizes.
+For reasons that will become apparent later, we only use matrix sizes that are multiples of $48$ for benchmarking, but the implementations are still correct for all other sizes. We also use [32-bit floats](/hpc/arithmetic/ieee-754) specifically, although it can be [generalized](#generalizations) to other types and operations.
 
-Compiled with `g++ -O3 -march=native -funroll-loops`, this code runs in ~16.7s for $n = 1920$.
+Compiled with `g++ -O3 -march=native -funroll-loops`, the naive approach multiplies two matrices of size $n = 1920$ in ~16.7 seconds. That is approximately $\frac{1920^3}{16.7 \times 10^9} \approx 0.42$ useful operations per nanosecond (GFLOPS), or roughly 5 CPU cycles per multiplication.
 
-[Cache associativity](/hpc/cpu-cache/associativity/) strikes again. This is also an issue, but we will not address it for now.
+## Transposition
 
-3.5s for 1025 ad 12s for 1024.
+In general, when you optimize an algorithm that processes large quantities of data — and $1920^2 \times 3 \times 4 \approx 42$ MB clearly is a large quantity as it can't fit into any of the [CPU caches](/hpc/cpu-cache) — you should always start with memory before optimizing arithmetic, as it is much more likely to be the bottleneck.
+
+Note that the field $C_{ij}$ can be viewed as the dot product of row $i$ in matrix $A$ and column $j$ in matrix $B$. As we are incrementing the `k` variable in the inner loop above, we are reading the matrix `a` sequentially, but we are jumping over $n$ elements as we iterate over a column of `b`, which is [not as fast](/hpc/cpu-cache/aos-soa).
+
+One [well-known optimization](/hpc/external-memory/oblivious/#matrix-multiplication) that mitigates this problem is to either store matrix $B$ in *column-major* order or to *transpose* it before the matrix multiplication — spending $O(n^2)$ additional operations, but ensuring sequential reads in the hot loop:
 
-Transpose:
+<!--
+
+![](../img/column-major.jpg)
+
+-->
 
 ```c++
 void matmul(const float *a, const float *_b, float *c, int n) {
@@ -65,16 +71,26 @@ void matmul(const float *a, const float *_b, float *c, int n) {
     for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
             for (int k = 0; k < n; k++)
-                c[i * n + j] += a[i * n + k] * b[j * n + k]; // notice indices
+                c[i * n + j] += a[i * n + k] * b[j * n + k]; // <- note the indices
 }
 ```
 
+This code runs in ~12.4s, or about 30% faster. As we will see in a bit, there are more important benefits to transposing it than just the sequential memory reads.
+
+## Vectorization
+
+/hpc/compilation/contracts/#memory-aliasing
+
+/hpc/simd/auto-vectorization/
+
 ```c++
 void matmul(const float *a, const float *_b, float * __restrict__ c, int n) {
     // ...
 }
 ```
 
+![](../img/mm-vectorized-barplot.svg)
+
 ```c++
 const int B = 8; // number of elements in a vector
 const int vecsize = B * sizeof(float); // size of a vector in bytes
@@ -117,20 +133,27 @@ void matmul(const float *_a, const float *_b, float *c, int n) {
 }
 ```
 
-![](../img/mm-vectorized-barplot.svg)
-
 ![](../img/mm-vectorized-plot.svg)
 
-## Theoretical Performance
+[memory bandwidth](/hpc/cpu-cache/bandwidth/) is not the problem.
 
-This CPU importantly supports the [FMA3](https://en.wikipedia.org/wiki/FMA_instruction_set) SIMD extension that we will utilize in the later implementations.
+[Cache associativity](/hpc/cpu-cache/associativity/) strikes again. This is also an issue, but we will not address it for now.
 
-$$
-\underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11})
-$$
+$1920 = 2^7 \times 3 \times 5$, so it is divisible by a large power of two.
+
+Slightly slower than.
+
+3.5s for 1025 ad 12s for 1024.
+
+However, now we *really* hit the memory limit.
+
+## Register reuse
+
+This CPU importantly supports the [FMA3](https://en.wikipedia.org/wiki/FMA_instruction_set) SIMD extension that we will utilize in later implementations.
 
 RAM bandwidth is lower than that
 
+The latency of FMA is 5 cycles, while its reciprocal throughput is ½. 
 
 ```c++
 void kernel(float *a, vector *b, vector *c, int x, int y, int l, int r, int n) {
@@ -203,10 +226,18 @@ for (int i3 = 0; i3 < ny; i3 += s3)
 
 ![](../img/mm-blocked-barplot.svg)
 
+Avoid moving anything:
+
 ![](../img/mm-noalloc.svg)
 
+$$
+\underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11})
+$$
+
 ![](../img/mm-blas.svg)
 
+We hit about 95.
+
 Which is fine, considering that this is not the only thing that CPUs are made for.
 
 ```c++

From f08193cceb6cfd7563d2d558ab77533e6c4b6ded Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 5 Apr 2022 21:43:45 +0300
Subject: [PATCH 016/173] vectorized matmul

---
 content/english/hpc/algorithms/matmul.md | 147 +++++++++++++++++------
 1 file changed, 109 insertions(+), 38 deletions(-)
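
Note (commentary only, not part of the diff): the throughput figures quoted in this article can be re-derived with a small helper. This is a minimal sketch, assuming $n^3$ useful multiply-add operations per matrix product; the function names `gflops` and `cycles_per_op` are hypothetical, and the 2 GHz clock used in the usage note is the one that appears in the theoretical-limit formula of a later patch in this series.

```c++
#include <cassert>
#include <cmath>

// useful operations per nanosecond, counting n^3 multiply-adds
double gflops(double n, double seconds) {
    return n * n * n / (seconds * 1e9);
}

// average CPU cycles spent per multiply-add at the given clock rate
double cycles_per_op(double n, double seconds, double clock_hz) {
    return seconds * clock_hz / (n * n * n);
}
```

With `n = 1920`, a run time of 16.7 seconds gives about 0.42 from `gflops`, and about 4.7 from `cycles_per_op` at `2e9` Hz, which matches the "roughly 5 CPU cycles per multiplication" figure.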

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 7102bc8c..09184bce 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -44,7 +44,7 @@ void matmul(const float *a, const float *b, float *c, int n) {
 
 For reasons that will become apparent later, we only use matrix sizes that are multiples of $48$ for benchmarking, but the implementations are still correct for all other sizes. We also use [32-bit floats](/hpc/arithmetic/ieee-754) specifically, although it can be [generalized](#generalizations) to other types and operations.
 
-Compiled with `g++ -O3 -march=native -funroll-loops`, the naive approach multiplies two matrices of size $n = 1920$ in ~16.7 seconds. That is approximately $\frac{1920^3}{16.7 \times 10^9} \approx 0.42$ useful operations per nanosecond (GFLOPS), or roughly 5 CPU cycles per multiplication.
+Compiled with `g++ -O3 -march=native -ffast-math -funroll-loops`, the naive approach multiplies two matrices of size $n = 1920 = 48 \times 40$ in ~16.7 seconds. That is approximately $\frac{1920^3}{16.7 \times 10^9} \approx 0.42$ useful operations per nanosecond (GFLOPS), or roughly 5 CPU cycles per multiplication.
 
 ## Transposition
 
@@ -79,76 +79,132 @@ This code runs in ~12.4s, or about 30% faster. As we will see in a bit, there ar
 
 ## Vectorization
 
-/hpc/compilation/contracts/#memory-aliasing
+Now that we are just sequentially reading the elements of `a` and `b`, multiplying them, and adding the result to an accumulator variable, we can use [SIMD](/hpc/simd/) instructions to speed it up like [any other reduction](/hpc/simd/reduction/).
 
-/hpc/simd/auto-vectorization/
+We can use [GCC vector types](/hpc/simd/intrinsics/#gcc-vector-extensions) to implement it:
 
 ```c++
-void matmul(const float *a, const float *_b, float * __restrict__ c, int n) {
-    // ...
-}
-```
+// a vector of 256 / 32 = 8 floats
+typedef float vec __attribute__ (( vector_size(32) ));
 
-![](../img/mm-vectorized-barplot.svg)
-
-```c++
-const int B = 8; // number of elements in a vector
-const int vecsize = B * sizeof(float); // size of a vector in bytes
-typedef float vector __attribute__ (( vector_size(vecsize) ));
-
-vector* alloc(int n) {
-    vector* ptr = (vector*) std::aligned_alloc(vecsize, vecsize * n);
-    memset(ptr, 0, vecsize * n);
+// helper function that allocates n vectors and initializes them with zeros
+vec* alloc(int n) {
+    vec* ptr = (vec*) std::aligned_alloc(32, 32 * n);
+    memset(ptr, 0, 32 * n);
     return ptr;
 }
 
-float hsum(vector s) {
-    float res = 0;
-    for (int i = 0; i < B; i++)
-        res += s[i];
-    return res;
-}
-
 void matmul(const float *_a, const float *_b, float *c, int n) {
-    int nB = (n + B - 1) / B;
+    // first, we need to align rows and pad them with zeros
+    int nB = (n + 7) / 8; // number of 8-element vectors in a row (rounded up)
 
-    vector *a = alloc(n * nB);
-    vector *b = alloc(n * nB);
+    vec *a = alloc(n * nB);
+    vec *b = alloc(n * nB);
 
+    // move both matrices
     for (int i = 0; i < n; i++) {
         for (int j = 0; j < n; j++) {
             a[i * nB + j / 8][j % 8] = _a[i * n + j];
-            b[i * nB + j / 8][j % 8] = _b[j * n + i]; // <- still transposed
+            b[i * nB + j / 8][j % 8] = _b[j * n + i]; // <- b is still transposed
         }
     }
 
     for (int i = 0; i < n; i++) {
         for (int j = 0; j < n; j++) {
-            vector s = {0};
+            vec s{}; // initialize the accumulator with zeros
+
+            // vertical summation
             for (int k = 0; k < nB; k++)
                 s += a[i * nB + k] * b[j * nB + k];
-            c[i * n + j] = hsum(s);
+            
+            // horizontal summation
+            for (int k = 0; k < 8; k++)
+                c[i * n + j] += s[k];
         }
     }
 }
 ```
 
+The performance for $n = 1920$ is now around 2.3 GFLOPS — or another ~4 times higher.
+
+![](../img/mm-vectorized-barplot.svg)
+
+This optimization looks neither too complex nor specific to matrix multiplication. Why can't the compiler simply [auto-vectorize](/hpc/simd/auto-vectorization/) the inner loop? It actually can: the only thing preventing that is the possibility that `c` overlaps with either `a` or `b`. All you need to do is guarantee that `c` is not [aliased](/hpc/compilation/contracts/#memory-aliasing) with anything by adding the `__restrict__` keyword to it:
+
+<!-- (the compiler already knows that reading `a` and `b` is safe in any order because they are marked as `const`): -->
+
+```c++
+void matmul(const float *a, const float *_b, float * __restrict__ c, int n) {
+    // ...
+}
+```
+
+Both manually and auto-vectorized implementations perform roughly the same.
+
+## Memory efficiency
+
+Now, what is interesting is that the implementation efficiency depends on the problem size:
+
 ![](../img/mm-vectorized-plot.svg)
 
 [memory bandwidth](/hpc/cpu-cache/bandwidth/) is not the problem.
 
 [Cache associativity](/hpc/cpu-cache/associativity/) strikes again. This is also an issue, but we will not address it for now.
 
+You can see an even more noticeable dip at $1536 = 2^9 \times 3$.
+
 $1920 = 2^7 \times 3 \times 5$, so it is divisible by a large power of two.
 
 Slightly slower than.
 
 3.5s for 1025 ad 12s for 1024.
 
+Now it is clear that we are really bottlenecked by the memory system.
+
 However, now we *really* hit the memory limit.
 
 ## Register reuse
 
+If we 
+
+Here is a proof of concept:
+
+```c++
+void update(int x, int y) {
+    int c00 = 0, c01 = 0, c10 = 0, c11 = 0;
+
+    for (int k = 0; k < n; k++) {
+        int a0 = a[x][k];
+        int a1 = a[x + 1][k];
+
+        int b0 = b[k][y];
+        int b1 = b[k][y + 1];
+
+        c00 += a0 * b0;
+        c01 += a0 * b0;
+        c10 += a0 * b0;
+        c11 += a1 * b1;
+    }
+
+    c[x][y]         += c00;
+    c[x][y + 1]     += c01;
+    c[x + 1][y]     += c10;
+    c[x + 1][y + 1] += c11;
+}
+```
+
+Before, we were reading $2 n$ elements to update one cell, and now we are reading $4n$ elements to update four cells: that is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency.
+
+It also boosts instruction-level parallelism and saves some instructions from execcuting the read instructions.
+
+We are not going to really try it. Instead, we will generalize it right away.
+
+Of course, this would not beat SIMD.
+
+## Micro-kernel
+
+*micro-kernel*.
+
 This CPU importantly supports the [FMA3](https://en.wikipedia.org/wiki/FMA_instruction_set) SIMD extension that we will utilize in later implementations.
 
 RAM bandwidth is lower than that
@@ -156,14 +212,14 @@ RAM bandwidth is lower than that
 The latency of FMA is 5 cycles, while its reciprocal throughput is ½. 
 
 ```c++
-void kernel(float *a, vector *b, vector *c, int x, int y, int l, int r, int n) {
-    vector t[6][2]{};
+void kernel(float *a, vec *b, vec *c, int x, int y, int l, int r, int n) {
+    vec t[6][2]{}; // will be stored in ymm registers
 
     for (int k = l; k < r; k++) {
         for (int i = 0; i < 6; i++) {
-            vector alpha = vector{} + a[(x + i) * n + k];
+            vec alpha = vec{} + a[(x + i) * n + k];  // broadcast
             for (int j = 0; j < 2; j++)
-                t[i][j] += alpha * b[(k * n + y) / 8 + j];
+                t[i][j] += alpha * b[(k * n + y) / 8 + j]; // fused multiply-add
         }
     }
 
@@ -173,6 +229,8 @@ void kernel(float *a, vector *b, vector *c, int x, int y, int l, int r, int n) {
 }
 ```
 
+## Macro-kernel
+
 ```c++
 void matmul(const float *_a, const float *_b, float *_c, int n) {
     int nx = (n + 5) / 6 * 6;
@@ -189,7 +247,7 @@ void matmul(const float *_a, const float *_b, float *_c, int n) {
 
     for (int x = 0; x < nx; x += 6)
         for (int y = 0; y < ny; y += 16)
-            kernel(a, (vector*) b, (vector*) c, x, y, 0, n, ny);
+            kernel(a, (vec*) b, (vec*) c, x, y, 0, n, ny);
 
     for (int i = 0; i < n; i++)
         memcpy(&_c[i * n], &c[i * ny], 4 * n);
@@ -204,6 +262,10 @@ void matmul(const float *_a, const float *_b, float *_c, int n) {
 
 ![](../img/mm-kernel-plot.svg)
 
+There is still a memory bandwidth problem.
+
+## Blocking
+
 ```c++
 const int s3 = 64;
 const int s2 = 120;
@@ -219,7 +281,7 @@ for (int i3 = 0; i3 < ny; i3 += s3)
             // with [l:r] = [i1:i1+s1]
             for (int x = i2; x < std::min(i2 + s2, nx); x += 6)
                 for (int y = i3; y < std::min(i3 + s3, ny); y += 16)
-                    kernel(a, (vector*) b, (vector*) c, x, y, i1, std::min(i1 + s1, n), ny);
+                    kernel(a, (vec*) b, (vec*) c, x, y, i1, std::min(i1 + s1, n), ny);
 ```
 
 ![](../img/mm-blocked-plot.svg)
@@ -234,6 +296,13 @@ $$
 \underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11})
 $$
 
+(and also getting rid of `std::min` in the macro-kernel)
+
+
+[OpenBLAS](https://www.openblas.net/)
+
+[numpy](/hpc/complexity/languages/#blas)
+
 ![](../img/mm-blas.svg)
 
 We hit about 95.
@@ -250,11 +319,13 @@ for (int i3 = 0; i3 < n; i3 += s3)
                         for (int i = 0; i < 6; i++)
                             for (int j = 0; j < 2; j++)
                                 c[x * n / 8 + i * n / 8 + y / 8 + j]
-                                += (vector{} + a[x * n + i * n + k])
+                                += (vec{} + a[x * n + i * n + k])
                                    * b[n / 8 * k + y / 8 + j];
 ```
 
-### Generalizations
+Register spilling.
+
+## Generalizations
 
 Given a matrix $D$, we need to calculate its "min-plus matrix multiplication" defined as:
 

From c8860104c5e7610c7e03eddda174ade8712f7100 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 5 Apr 2022 23:31:52 +0300
Subject: [PATCH 017/173] matmul memory efficiency

---
 content/english/hpc/algorithms/matmul.md | 19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)
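
Note (commentary only, not part of the diff): the associativity effect described in this patch can be illustrated by counting how many distinct cache sets one column of `b` touches. This is a rough sketch under stated assumptions: the cache geometry (512 KB, 8-way, 64-byte lines) is assumed for illustration, the matrix is assumed to start at a line-aligned address 0, and `distinct_sets` is a hypothetical helper, not code from the article.

```c++
#include <cassert>
#include <set>

// Assumed cache geometry (hypothetical, for illustration only):
// 512 KB, 8-way set-associative, 64-byte lines -> 1024 sets.
const long long LINE = 64, WAYS = 8, SIZE = 512 * 1024;
const long long SETS = SIZE / (LINE * WAYS);

// number of distinct cache sets touched when reading one column
// of a row-major n x n float matrix (a stride of n * 4 bytes)
int distinct_sets(int n) {
    std::set<long long> sets;
    for (long long k = 0; k < n; k++) {
        long long address = k * n * 4; // byte offset of element [k][0]
        sets.insert(address / LINE % SETS);
    }
    return (int) sets.size();
}
```

Under these assumptions, a column touches only 32 sets for $n = 1536$ and 128 sets for $n = 1920$, so the lines of a single column compete for a small fraction of the cache, while for $n = 1535$ the accesses spread over many more sets.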

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 09184bce..09bb6f29 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -122,6 +122,9 @@ void matmul(const float *_a, const float *_b, float *c, int n) {
                 c[i * n + j] += s[k];
         }
     }
+
+    std::free(a);
+    std::free(b);
 }
 ```
 
@@ -147,21 +150,13 @@ Now, what is interesting is that the implementation efficiency depends on the pr
 
 ![](../img/mm-vectorized-plot.svg)
 
-[memory bandwidth](/hpc/cpu-cache/bandwidth/) is not the problem.
-
-[Cache associativity](/hpc/cpu-cache/associativity/) strikes again. This is also an issue, but we will not address it for now.
-
-You can see an even more noticeable dip at $1536 = 2^9 \times 3$.
-
-$1920 = 2^7 \times 3 \times 5$, so it is divisible by a large power of two.
-
-Slightly slower than.
+First, the performance (in terms of useful operations per second) increases, as the overhead of the loop management and horizontal reduction decreases. However, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/).
 
-3.5s for 1025 ad 12s for 1024.
+It is also interesting that the naive implementation is mostly on par with the non-vectorized transposed version — and actually slightly better because of the transpose itself — for all but few data points, where the performance deteriorates. This is because of [cache associativity](/hpc/cpu-cache/associativity/): when $n$ is divisible by a large power of two, we are fetching addresses of `b` that all likely map to the same cache line, reducing the effective cache size. This explains the 30% performance dip for $n = 1920 = 2^7 \times 3 \times 5$, and you can see an even more noticeable one for $1536 = 2^9 \times 3$: it is roughly 3 times slower than for $n=1535$.
 
-Now it is clear that we are really bottlenecked by the memory system.
+One may think that there would be at least some general performance gain from full sequential reads since we are fetching fewer cache lines, but this is not the case: fetching the first column of `b` is painful, but the next 15 columns will actually be in the same cache lines as the first one, so they will be cached — unless the matrix is so large that it can't even fit `n * cache_line_size` bytes into the cache, which is not the case for all practical problem sizes.
 
-However, now we *really* hit the memory limit.
+So, counterintuitively, transposing the matrix doesn't help the memory bandwidth, and in the naive implementation, we are not really bottlenecked by it anyway. But for our vectorized implementation, we certainly are, so let's tackle it.
 
 ## Register reuse
 

From 376d46a118aed91aa8d78a279c4ba327f4bd3ab5 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 5 Apr 2022 23:50:40 +0300
Subject: [PATCH 018/173] matmul register reuse

---
 content/english/hpc/algorithms/matmul.md | 45 ++++++++++++++----------
 1 file changed, 26 insertions(+), 19 deletions(-)
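
Note (commentary only, not part of the diff): the `update_2x2` proof of concept in this patch relies on globals `a`, `b`, `c`, and `n`. Below is a self-contained sketch of the same idea on flat row-major arrays, checked against a naive triple loop. The wrappers `matmul_naive` and `matmul_2x2` and the explicit pointer parameters are additions for illustration, and `matmul_2x2` assumes that `n` is even.

```c++
#include <cassert>
#include <vector>

// reference implementation: c += a * b for row-major n x n matrices
void matmul_naive(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// update the 2x2 block C[x:x+2][y:y+2]: every element loaded from the
// two rows of a and the two columns of b is used in two products
void update_2x2(const int *a, const int *b, int *c, int n, int x, int y) {
    int c00 = c[x * n + y],       c01 = c[x * n + y + 1],
        c10 = c[(x + 1) * n + y], c11 = c[(x + 1) * n + y + 1];

    for (int k = 0; k < n; k++) {
        int a0 = a[x * n + k], a1 = a[(x + 1) * n + k];
        int b0 = b[k * n + y], b1 = b[k * n + y + 1];

        c00 += a0 * b0;
        c01 += a0 * b1;
        c10 += a1 * b0;
        c11 += a1 * b1;
    }

    c[x * n + y] = c00;       c[x * n + y + 1] = c01;
    c[(x + 1) * n + y] = c10; c[(x + 1) * n + y + 1] = c11;
}

// assumes n is even, so that the 2x2 blocks tile the matrix exactly
void matmul_2x2(const int *a, const int *b, int *c, int n) {
    for (int x = 0; x < n; x += 2)
        for (int y = 0; y < n; y += 2)
            update_2x2(a, b, c, n, x, y);
}
```

Integer elements are used, as in the original snippet, so the two results can be compared exactly.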

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 09bb6f29..e90abf3b 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -144,13 +144,15 @@ void matmul(const float *a, const float *_b, float * __restrict__ c, int n) {
 
 Both manually and auto-vectorized implementations perform roughly the same.
 
+The performance is bottlenecked by using a single accumulator variable: each update has to wait for the previous one to complete. We could use multiple accumulators as in other reductions, but we will solve it later anyway.
+
 ## Memory efficiency
 
-Now, what is interesting is that the implementation efficiency depends on the problem size:
+Now, what is interesting is that the implementation efficiency depends on the problem size. 
 
-![](../img/mm-vectorized-plot.svg)
+At first, the performance (in terms of useful operations per second) increases, as the overhead of the loop management and horizontal reduction decreases. Then, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/).
 
-First, the performance (in terms of useful operations per second) increases, as the overhead of the loop management and horizontal reduction decreases. However, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/).
+![](../img/mm-vectorized-plot.svg)
 
 It is also interesting that the naive implementation is mostly on par with the non-vectorized transposed version — and actually slightly better because of the transpose itself — for all but few data points, where the performance deteriorates. This is because of [cache associativity](/hpc/cpu-cache/associativity/): when $n$ is divisible by a large power of two, we are fetching addresses of `b` that all likely map to the same cache line, reducing the effective cache size. This explains the 30% performance dip for $n = 1920 = 2^7 \times 3 \times 5$, and you can see an even more noticeable one for $1536 = 2^9 \times 3$: it is roughly 3 times slower than for $n=1535$.
 
@@ -160,13 +162,20 @@ So, counterintuitively, transposing the matrix doesn't help the memory bandwidth
 
 ## Register reuse
 
-If we 
+To compute the cell $C[i][j]$, we need to compute the dot product of $A[i][:]$ and $B[:][j]$ (we are using the Python notation here to select rows and columns), which requires fetching $2n$ elements, even when $B$ is stored in column-major order.
+
+What if we were to compute $C[i:i+2][j:j+2]$, a $2 \times 2$ submatrix of $C$? We would need $A[i:i+2][:]$ and $B[:][j:j+2]$, which is $4n$ elements in total: that is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency.
+
+To actually avoid reading more data, we need to read these $2+2$ rows and columns in parallel and update all $2 \times 2$ cells at once using all possible combinations of products.
 
 Here is a proof of concept:
 
 ```c++
-void update(int x, int y) {
-    int c00 = 0, c01 = 0, c10 = 0, c11 = 0;
+void update_2x2(int x, int y) {
+    int c00 = c[x][y],
+        c01 = c[x][y + 1],
+        c10 = c[x + 1][y],
+        c11 = c[x + 1][y + 1];
 
     for (int k = 0; k < n; k++) {
         int a0 = a[x][k];
@@ -176,25 +185,21 @@ void update(int x, int y) {
         int b1 = b[k][y + 1];
 
         c00 += a0 * b0;
-        c01 += a0 * b0;
-        c10 += a0 * b0;
+        c01 += a0 * b1;
+        c10 += a1 * b0;
         c11 += a1 * b1;
     }
 
-    c[x][y]         += c00;
-    c[x][y + 1]     += c01;
-    c[x + 1][y]     += c10;
-    c[x + 1][y + 1] += c11;
+    c[x][y]         = c00;
+    c[x][y + 1]     = c01;
+    c[x + 1][y]     = c10;
+    c[x + 1][y + 1] = c11;
 }
 ```
 
-Before, we were reading $2 n$ elements to update one cell, and now we are reading $4n$ elements to update four cells: that is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency.
-
-It also boosts instruction-level parallelism and saves some instructions from execcuting the read instructions.
+It also boosts instruction-level parallelism (we don't have to wait between iterations to update the loop state) and saves some cycles from executing the read instructions.
 
-We are not going to really try it. Instead, we will generalize it right away.
-
-Of course, this would not beat SIMD.
+Of course, although better in terms of I/O, this $2 \times 2$ update would not beat our vectorized implementation, so we are not going to try this version in particular and instead will scale the idea right away.
 
 ## Micro-kernel
 
@@ -224,7 +229,7 @@ void kernel(float *a, vec *b, vec *c, int x, int y, int l, int r, int n) {
 }
 ```
 
-## Macro-kernel
+The rest of the implementaiton:
 
 ```c++
 void matmul(const float *_a, const float *_b, float *_c, int n) {
@@ -261,6 +266,8 @@ There is still a memory bandwidth problem.
 
 ## Blocking
 
+*Macro-kernel*
+
 ```c++
 const int s3 = 64;
 const int s2 = 120;

From 4491d76d98814d18f62aa498eababe0bd87f7ac3 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 6 Apr 2022 15:17:30 +0300
Subject: [PATCH 019/173] matmul kernel

---
 content/english/hpc/algorithms/matmul.md | 42 +++++++++++++++++-------
 1 file changed, 31 insertions(+), 11 deletions(-)
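
Note (commentary only, not part of the diff): the considerations that lead to the $6 \times 16$ kernel can be checked with a few lines of arithmetic. This is a sketch with hypothetical helper names. It counts each element of the $h$ rows of $A$ and the $w$ columns of $B$ once, which is consistent with the $2n$ and $4n$ counts from the register-reuse section, and it takes the FMA latency and throughput as stated in this patch.

```c++
#include <cassert>

// elements of A and B fetched per updated cell of C when an h x w block
// is computed from h rows of A and w columns of B, each of length n
double fetched_per_cell(int h, int w, int n) {
    return double(n) * (h + w) / (h * w);
}

// 8-float vector registers needed to hold an h x w block of accumulators
// (assumes w is a multiple of 8)
int accumulators(int h, int w) {
    return h * (w / 8);
}

// independent FMAs that must be in flight to hide the instruction latency
int fmas_in_flight(int latency, int per_cycle) {
    return latency * per_cycle;
}
```

For $n = 1920$, a $1 \times 1$ update fetches 3840 elements per cell, a $2 \times 2$ update 1920, and a $6 \times 16$ update 440. The $6 \times 16$ shape needs 12 accumulator registers, which is at least the 10 FMAs required in flight and still leaves 4 of the 16 vector registers for temporaries.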

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index e90abf3b..1f6abce0 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -162,6 +162,8 @@ So, counterintuitively, transposing the matrix doesn't help the memory bandwidth
 
 ## Register reuse
 
+Any two cells of A and B are used to update some cell of C.
+
 To compute the cell $C[i][j]$, we need to compute the dot product of $A[i][:]$ and $B[:][j]$ (we are using the Python notation here to select rows and columns), which requires fetching $2n$ elements, even when $B$ is stored in column-major order.
 
 What if we were to compute $C[i:i+2][j:j+2]$, a $2 \times 2$ submatrix of $C$? We would need $A[i:i+2][:]$ and $B[:][j:j+2]$, which is $4n$ elements in total: that is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency.
@@ -201,17 +203,22 @@ It also boosts instruction-level parallelism (we don't have to wait between iter
 
 Of course, although better in terms of I/O, this $2 \times 2$ update would not beat our vectorized implementation, so we are not going to try this version in particular and instead will scale the idea right away.
 
-## Micro-kernel
+## Designing the kernel
 
-*micro-kernel*.
+We follow this approach and design a general kernel that updates an $h \times w$ submatrix of $C$ using columns from $l$ to $r$ of $A$ and rows from $l$ to $r$ of $B$ (i.e., not a full computation, but only a partial update; it will be clear why later). We have several considerations:
 
-This CPU importantly supports the [FMA3](https://en.wikipedia.org/wiki/FMA_instruction_set) SIMD extension that we will utilize in later implementations.
+- In general, if we are updating an $h \times w$ submatrix, we will be fetching $n \cdot (h + w)$ elements to update $h \cdot w$ elements. We want that ratio of $\frac{h \cdot w}{n \cdot (h + w)}$ to be as high as possible, which is achieved with large square-ish submatrices.
+- We want to use the [FMA](https://en.wikipedia.org/wiki/FMA_instruction_set) ("fused multiply-add") instructions that are available on all modern x86 architectures. As you can guess from the name, they perform a vector `c += a * b` operation in one go, which is the core of our computation.
+- We want to be able to exploit [instruction-level parallelism](/hpc/pipelining/) to achieve better utilization of this instruction. On Zen 2, the `fma` instruction has a latency of 5 cycles and a throughput of 2 per cycle, meaning that we need to concurrently execute at least $5 \times 2 = 10$ of them to fully saturate its execution ports.
+- We only have $16$ logical vector registers that we can use as accumulators, and we want to avoid register spill.
 
-RAM bandwidth is lower than that
+For these reasons, we settle on a $6 \times 16$ kernel. We process $96$ elements at once, which can be stored in $6 \times 2 = 12$ vector registers (we need some more to store temporary values). We [broadcast](/hpc/simd/moving/#broadcast) an element of A, and then use it to update the first row ($8 + 8$ elements). Then we load the one below it, and so on. When we have updated the last row, we move to the next $6$ elements to the right.
 
-The latency of FMA is 5 cycles, while its reciprocal throughput is ½. 
+The final implementation is simpler than it sounds:
 
 ```c++
+// update 6x16 submatrix C[x:x+6][y:y+16]
+// using A[x:x+6][l:r] and B[l:r][y:y+16]
 void kernel(float *a, vec *b, vec *c, int x, int y, int l, int r, int n) {
     vec t[6][2]{}; // will be stored in ymm registers
 
@@ -229,10 +236,15 @@ void kernel(float *a, vec *b, vec *c, int x, int y, int l, int r, int n) {
 }
 ```
 
-The rest of the implementaiton:
+We need `t` so that the compiler stores these elements in vector registers. We could just update the final destinations, but unfortunately, the compiler re-writes them back to memory, causing a huge slowdown — and wrapping everything in `__restrict__` keywords doesn't help.
+
+The rest of the implementation is straightforward. Similar to the previous vectorized implementation, we just allocate aligned arrays and call the kernel instead of the innermost loop:
 
 ```c++
 void matmul(const float *_a, const float *_b, float *_c, int n) {
+    // to avoid implementing partials,
+    // we pad height to nearest 6 and width to 16
+    
     int nx = (n + 5) / 6 * 6;
     int ny = (n + 15) / 16 * 16;
     
@@ -242,7 +254,7 @@ void matmul(const float *_a, const float *_b, float *_c, int n) {
 
     for (int i = 0; i < n; i++) {
         memcpy(&a[i * ny], &_a[i * n], 4 * n);
-        memcpy(&b[i * ny], &_b[i * n], 4 * n);
+        memcpy(&b[i * ny], &_b[i * n], 4 * n); // we don't need to transpose b this time
     }
 
     for (int x = 0; x < nx; x += 6)
@@ -258,15 +270,19 @@ void matmul(const float *_a, const float *_b, float *_c, int n) {
 }
 ```
 
+This improves the performance by another ~40%:
+
 ![](../img/mm-kernel-barplot.svg)
 
+The speedup is much better (2-3x) on smaller arrays, indicating that there is still a bandwidth problem:
+
 ![](../img/mm-kernel-plot.svg)
 
-There is still a memory bandwidth problem.
+If you've read the section on [cache-oblivious algorithms](/hpc/external-memory/oblivious/), you know that one universal solution to these types of things is to split matrices into four parts, do eight recursive block matrix multiplications until the matrix fits into cache, and carefully combine the results together. We will follow a different, simpler approach.
 
 ## Blocking
 
-*Macro-kernel*
+Note that we are reading.
 
 ```c++
 const int s3 = 64;
@@ -286,6 +302,8 @@ for (int i3 = 0; i3 < ny; i3 += s3)
                     kernel(a, (vec*) b, (vec*) c, x, y, i1, std::min(i1 + s1, n), ny);
 ```
 
+This part is sometimes called *macro-kernel* (as opposed to the *micro-kernel* that only updates a 6x16 submatrix).
+
 ![](../img/mm-blocked-plot.svg)
 
 ![](../img/mm-blocked-barplot.svg)
@@ -294,8 +312,10 @@ Avoid moving anything:
 
 ![](../img/mm-noalloc.svg)
 
+The theoretical performance limit is:
+
 $$
-\underbrace{4}_{CPUs} \cdot \underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{3.6 \cdot 10^9}_{cycles/sec} = 230.4 \; GFLOPS \;\; (2.3 \cdot 10^{11})
+\underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{2 \cdot 10^9}_{cycles/sec} = 32 \; GFLOPS \;\; (3.2 \cdot 10^{10})
 $$
 
 (and also getting rid of `std::min` in the macro-kernel)
@@ -325,7 +345,7 @@ for (int i3 = 0; i3 < n; i3 += s3)
                                    * b[n / 8 * k + y / 8 + j];
 ```
 
-Register spilling.
+(Assuming that we are in 2050 and using the 35th version of GCC, which finally properly manager not to screwing up with register spilling.)
 
 ## Generalizations
 
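
The hunks above elide the body of `kernel`, so here is a minimal sketch of what a $6 \times 16$ micro-kernel with this signature could look like. It is an illustration under stated assumptions rather than the exact code from the article: the name `kernel_sketch` is hypothetical, the broadcast is written as a plain vector initializer, `n` is assumed to be the padded row length in floats (a multiple of 16), and `y` is assumed to be a multiple of 8.

```cpp
#include <cassert>

// a vector of 256 / 32 = 8 floats (GCC/Clang vector extension)
typedef float vec __attribute__ (( vector_size(32) ));

// Hypothetical sketch: update the 6x16 submatrix C[x:x+6][y:y+16]
// using A[x:x+6][l:r] and B[l:r][y:y+16].
// Assumes n (row length in floats) is a multiple of 16 and y is a multiple of 8.
void kernel_sketch(const float *a, const vec *b, vec *c,
                   int x, int y, int l, int r, int n) {
    vec t[6][2]{}; // 6 x 2 = 12 zero-initialized accumulators

    for (int k = l; k < r; k++) {
        for (int i = 0; i < 6; i++) {
            // broadcast one element of A into all 8 lanes
            float s = a[(x + i) * n + k];
            vec alpha = {s, s, s, s, s, s, s, s};

            // multiply it by 8 + 8 elements of row k of B
            for (int j = 0; j < 2; j++)
                t[i][j] += alpha * b[(k * n + y) / 8 + j];
        }
    }

    // add the accumulated partial sums to C
    for (int i = 0; i < 6; i++)
        for (int j = 0; j < 2; j++)
            c[((x + i) * n + y) / 8 + j] += t[i][j];
}
```

The 12 accumulators in `t` match the $6 \times 2 = 12$ register budget discussed in the text. Whether they actually stay in `ymm` registers depends on the compiler and the flags used (for example `-O3 -march=native`).
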

From 02aba431230500c968e972019b7aa92f0e03ebfb Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 6 Apr 2022 15:52:04 +0300
Subject: [PATCH 020/173] matmul cache blocking

---
 content/english/hpc/algorithms/matmul.md | 40 ++++++++++++++++++++----
 1 file changed, 34 insertions(+), 6 deletions(-)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 1f6abce0..9acbc61a 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -282,12 +282,24 @@ If you've read the section on [cache-oblivious algorithms](/hpc/external-memory/
 
 ## Blocking
 
-Note that we are reading.
+An alternative to divide-and-conquer is *cache blocking*: selecting a subset of data, processing it, and then moving on to the next block. Sometimes blocking is hierarchical: we first select a block of data that fits into the L3 cache, then we split it into blocks that fit into the L2 cache, and so on.
+
+It is less trivial to do for matrices than for arrays, but the trick is like this:
+
+- Let's select a subset of B that fits into the L3 cache (say, a subset of its columns).
+- Now, let's select a submatrix of A that fits into the L2 cache (a subset of its rows).
+- Select a submatrix of the previously selected submatrix of B (a subset of its rows) that fits into the L1 cache, and use it to do the kernel update.
+
+Here is a good [visualization](https://jukkasuomela.fi/cache-blocking-demo/) by Jukka Suomela (it shows different approaches; we use the last one).
+
+We could have started with A, but this would be slower. Note that during the kernel execution, we read the elements of $A$ much more slowly than the elements of $B$: we fetch and broadcast just one element of $A$ and then multiply it with $16$ elements of $B$. Therefore, it is $B$ that we want to keep in the fastest cache, and the last stage selects the block of $B$ that fits into L1.
+
+We can implement it with three more outer `for` loops:
 
 ```c++
-const int s3 = 64;
-const int s2 = 120;
-const int s1 = 240;
+const int s3 = 64;  // how many columns of B to select
+const int s2 = 120; // how many rows of A to select 
+const int s1 = 240; // how many rows of B to select
 
 for (int i3 = 0; i3 < ny; i3 += s3)
     // now we are working with b[:][i3:i3+s3]
@@ -302,13 +314,29 @@ for (int i3 = 0; i3 < ny; i3 += s3)
                     kernel(a, (vec*) b, (vec*) c, x, y, i1, std::min(i1 + s1, n), ny);
 ```
 
-This part is sometimes called *macro-kernel* (as opposed to the *micro-kernel* that only updates a 6x16 submatrix).
+These outer `for` loops are sometimes called the *macro-kernel* (as opposed to the *micro-kernel* that only updates a 6x16 submatrix).
+
+It completely removes the memory bottleneck:
 
 ![](../img/mm-blocked-plot.svg)
 
+The performance is no longer seriously affected by the problem size:
+
 ![](../img/mm-blocked-barplot.svg)
 
-Avoid moving anything:
+Notice that the dip at $n = 1536$ is still there: cache associativity still reduces the effective cache size. To mitigate this, we would need to adjust the step constants or insert holes into the layout.
+
+## Optimization
+
+We need a few more optimizations to reach the performance limit:
+
+- Remove memory allocation and just operate on the arrays that we are given (note that we don't need to do anything with `a` as we are reading just one element at a time, and we can use unaligned `store` for `c` as we only use it rarely).
+- Get rid of the `std::min` so that the size parameters are mostly constant and can be embedded into the machine code.
+- Rewrite the micro-kernel using 12 variables (the compiler seems to have a problem with keeping them fully in registers).
+
+Effectively supporting weird sizes requires a bit more work, and this is the reason why we benchmarked at an array sizes that are divisible by $48 = \frac{6 \cdot 16}{\gcd(6, 16)}$.
+
+Avoiding moving anything pays off:
 
 ![](../img/mm-noalloc.svg)
 
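
The loop nest of the macro-kernel is partly hidden by the hunk boundaries, so its structure can be sketched in scalar form. This is only an illustration under assumptions: `matmul_blocked_sketch` is a hypothetical name, the block sizes are passed as parameters instead of being compile-time constants, and the three innermost loops stand in for the calls to the vectorized micro-kernel.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical scalar sketch of three-level cache blocking.
// c must be zero-initialized; a, b, c are n x n row-major matrices.
void matmul_blocked_sketch(const float *a, const float *b, float *c, int n,
                           int s3, int s2, int s1) {
    for (int i3 = 0; i3 < n; i3 += s3)         // columns of B: b[:][i3:i3+s3]
        for (int i2 = 0; i2 < n; i2 += s2)     // rows of A: a[i2:i2+s2][:]
            for (int i1 = 0; i1 < n; i1 += s1) // rows of B: b[i1:i1+s1][i3:i3+s3]
                // partial update of c[i2:i2+s2][i3:i3+s3]
                // (this is where the micro-kernel would be called)
                for (int x = i2; x < std::min(i2 + s2, n); x++)
                    for (int k = i1; k < std::min(i1 + s1, n); k++)
                        for (int y = i3; y < std::min(i3 + s3, n); y++)
                            c[x * n + y] += a[x * n + k] * b[k * n + y];
}
```

Every triple `(x, k, y)` is still visited exactly once, so the result is the same as for the naive loop; only the order of the updates changes, which is what keeps the selected blocks in cache.
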

From 6553b3f085132e827b2dd207ab702927b8ca89db Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 6 Apr 2022 16:04:13 +0300
Subject: [PATCH 021/173] matmul optimization

---
 content/english/hpc/algorithms/matmul.md | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 9acbc61a..30bdd36c 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -332,32 +332,27 @@ We need a few more optimizations to reach the performance limit:
 
 - Remove memory allocation and just operate on the arrays that we are given (note that we don't need to do anything with `a` as we are reading just one element at a time, and we can use unaligned `store` for `c` as we only use it rarely).
 - Get rid of the `std::min` so that the size parameters are mostly constant and can be embedded into the machine code.
-- Rewrite the micro-kernel using 12 variables (the compiler seems to have a problem with keeping them fully in registers).
+- Rewrite the micro-kernel by hand using 12 variables (the compiler seems to have a problem with keeping them fully in registers).
 
 Effectively supporting weird sizes requires a bit more work, and this is the reason why we benchmarked at an array sizes that are divisible by $48 = \frac{6 \cdot 16}{\gcd(6, 16)}$.
 
-Avoiding moving anything pays off:
+Avoiding moving anything pays off. These improvements sum up and give us a 50% improvement:
 
 ![](../img/mm-noalloc.svg)
 
-The theoretical performance limit is:
+We are actually not that far from the theoretical performance limit, which can be calculated as the SIMD width times the `fma` instruction throughput times the clock frequency:
 
 $$
-\underbrace{8}_{SIMD} \cdot \underbrace{2}_{1/thr} \cdot \underbrace{2 \cdot 10^9}_{cycles/sec} = 32 \; GFLOPS \;\; (3.2 \cdot 10^{10})
+\underbrace{8}_{SIMD} \cdot \underbrace{2}_{thr.} \cdot \underbrace{2 \cdot 10^9}_{cycles/sec} = 32 \; GFLOPS \;\; (3.2 \cdot 10^{10})
 $$
 
-(and also getting rid of `std::min` in the macro-kernel)
-
-
-[https://www.openblas.net/](OpenBLAS)
-
-[numpy](/hpc/complexity/languages/#blas)
+A more realistic comparison is against a practical library, such as [OpenBLAS](https://www.openblas.net/). We just call it from Python using [numpy](/hpc/complexity/languages/#blas), so there may be some minor overhead, but reaching 80% of theoretical performance seems plausible (matrix multiplication is not the only thing that CPUs are made for):
 
 ![](../img/mm-blas.svg)
 
-We hit about 95.
+We've reached ~93% of BLAS and ~75% of the theoretical performance limit. Which is really great for what is basically 40 lines of C.
 
-Which is fine, considering that this is not the only thing that CPUs are made for.
+Interestingly, the whole thing can be rolled into one large `for` loop:
 
 ```c++
 for (int i3 = 0; i3 < n; i3 += s3)
@@ -406,6 +401,6 @@ https://arxiv.org/pdf/1605.01078.pdf
 
 ## Acknowledgements
 
-"[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)" by Kazushige Goto and Robert van de Geijn.
+The algorithm was originally designed by Kazushige Goto, and it is the basis of GotoBLAS and OpenBLAS. The author himself described it and some other aspects in more detail in "[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)".
 
-Inspired by "[Programming Parallel Computers](http://ppc.cs.aalto.fi/ch2/)" course.
+The exposition style is inspired by "[Programming Parallel Computers](http://ppc.cs.aalto.fi/)" course by Jukka Suomela, which features a [similar case study](http://ppc.cs.aalto.fi/ch2/) on speeding up the distance product.
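
The performance figures quoted in this patch follow from simple arithmetic, which can be sketched as below. The function names `peak_gflops` and `measured_gflops` are hypothetical helpers introduced for illustration, and one fused multiply-add is counted as one useful operation, following the convention the article appears to use.

```cpp
#include <cassert>
#include <cmath>

// Peak number of fused multiply-adds per second on one core, in billions:
// SIMD width x FMA instructions per cycle x clock frequency.
constexpr double peak_gflops(int simd_lanes, int fma_per_cycle, double clock_hz) {
    return simd_lanes * fma_per_cycle * clock_hz / 1e9;
}

// Useful operations per second for multiplying two n x n matrices, in billions.
constexpr double measured_gflops(long long n, double seconds) {
    return double(n * n * n) / seconds / 1e9;
}
```

For example, `peak_gflops(8, 2, 2e9)` evaluates the formula above, and `measured_gflops(1920, 16.7)` corresponds to the naive baseline timing quoted in the article.
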

From 32518f4c003b666546ba393e167556c5ff8fd430 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 6 Apr 2022 16:39:02 +0300
Subject: [PATCH 022/173] floyd algorithm and matmul

---
 content/english/hpc/algorithms/matmul.md | 45 ++++++++++++++++--------
 1 file changed, 30 insertions(+), 15 deletions(-)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 30bdd36c..e01544db 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -334,7 +334,9 @@ We need a few more optimizations to reach the performance limit:
 - Get rid of the `std::min` so that the size parameters are mostly constant and can be embedded into the machine code.
 - Rewrite the micro-kernel by hand using 12 variables (the compiler seems to have a problem with keeping them fully in registers).
 
-Effectively supporting weird sizes requires a bit more work, and this is the reason why we benchmarked at an array sizes that are divisible by $48 = \frac{6 \cdot 16}{\gcd(6, 16)}$.
+Supporting arbitrary sizes efficiently requires a bit more work, and this is the reason why we only benchmarked array sizes that are divisible by $48 = \frac{6 \cdot 16}{\gcd(6, 16)}$. We leave the code out because the change is large and tedious and involves slightly modifying the benchmarking code itself. It is straightforward, but we only implement the version for this particular size, without any safety checks, which amounts to cheating on the benchmark.
+
+https://github.com/sslotin/amh-code/blob/main/matmul/v5-unrolled.cc
 
 Avoiding moving anything pays off. These improvements sum up and give us a 50% improvement:
 
@@ -350,9 +352,9 @@ A more realistic comparison is some practical library, such as [https://www.open
 
 ![](../img/mm-blas.svg)
 
-We've reached ~93% of BLAS and ~75% of the theoretical performance limit. Which is really great for what is basically 40 lines of C.
+We've reached ~93% of BLAS performance and ~75% of the theoretical performance limit, which is really great for what is essentially just 40 lines of C.
 
-Interestingly, the whole thing can be rolled into one large `for` loop:
+Interestingly, the whole thing can be rolled into one large `for` loop (assuming that we are in 2050 and using the 35th version of GCC, which finally manages not to screw up register spilling):
 
 ```c++
 for (int i3 = 0; i3 < n; i3 += s3)
@@ -368,17 +370,21 @@ for (int i3 = 0; i3 < n; i3 += s3)
                                    * b[n / 8 * k + y / 8 + j];
 ```
 
-(Assuming that we are in 2050 and using the 35th version of GCC, which finally properly manager not to screwing up with register spilling.)
+There is also a way to do fewer arithmetic operations — [the Strassen algorithm](/hpc/external-memory/oblivious/#strassen-algorithm) — but it has a large constant factor, and it is [only efficient for very large matrices](https://arxiv.org/pdf/1605.01078.pdf) ($n > 4000$) for which we typically use multi-threading anyway.
 
 ## Generalizations
 
-Given a matrix $D$, we need to calculate its "min-plus matrix multiplication" defined as:
+FMA also supports 64-bit floating-point numbers, but it does not support integers: you need to perform addition and multiplication separately, which results in decreased performance. If you know that all intermediate results can be represented exactly as 32- or 64-bit floating-point numbers (which is [often the case](/hpc/arithmetic/errors/)), it may be better to convert the integers to floats and back.
+
+You can also apply the same trick to other similar computations. One example is the "min-plus matrix multiplication" defined as:
 
-$(D \circ D)_{ij} = \min_k(D_{ik} + D_{kj})$
+$$
+(D \circ D)_{ij} = \min_{1 \le k \le n} (D_{ik} + D_{kj})
+$$
 
-Graph interpretation: find shortest paths of length 2 between all vertices in a fully-connected weighted graph
+It is also known as the "distance product" due to its graph interpretation: when $D$ is the edge weight matrix of a fully-connected weighted graph, the result is the matrix of the lengths of the shortest two-edge paths between all pairs of vertices.
 
-A cool thing about distance product is that if if we iterate the process and calculate:
+A cool thing about the distance product is that if we iterate the process and calculate:
 
 $$
 D_2 = D \circ D \\
@@ -387,17 +393,26 @@ D_8 = D_4 \circ D_4 \\
 \ldots
 $$
 
-Then we can find all-pairs shortest distances in $O(\log n)$ steps
+Then we can find all-pairs shortest distances in $O(\log n)$ steps:
 
-(but recall that there are [more direct ways](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm) to solve it)
-
-Which is an exercise.
+```c++
+for (int l = 0; l < logn; l++)
+    for (int i = 0; i < n; i++)
+        for (int j = 0; j < n; j++)
+            for (int k = 0; k < n; k++)
+                d[i][j] = min(d[i][j], d[i][k] + d[k][j]);
+```
 
-Strassen algorithm is only useful for large matrices.
+This requires $O(n^3 \log n)$ operations, but if we do these two-edge relaxations in a particular order, we can do it with just one pass, which is known as the [Floyd-Warshall algorithm](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm):
 
-https://arxiv.org/pdf/1605.01078.pdf
+```c++
+for (int k = 0; k < n; k++)
+    for (int i = 0; i < n; i++)
+        for (int j = 0; j < n; j++)
+            d[i][j] = min(d[i][j], d[i][k] + d[k][j]);
+```
 
-[cache-oblivious](/hpc/external-memory/oblivious/#matrix-multiplication) algorithms
+As an exercise, try to think of ways of speeding up this "for-for-for" computation. It will be harder than matrix multiplication because you need to perform updates in this particular order.
 
 ## Acknowledgements
 
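
The relationship between the repeated distance product and the Floyd-Warshall algorithm can be checked on a small graph. The sketch below is an illustration under assumptions: the names `min_plus`, `all_pairs_by_squaring`, `floyd_warshall`, and the sentinel `INF` are hypothetical, the diagonal of the input is assumed to be zero (so that shorter paths are preserved when squaring), and edge weights are assumed to be non-negative and far below `INF`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using matrix = std::vector<std::vector<long long>>;

// "no edge" marker, small enough that INF + INF does not overflow
const long long INF = 1000000000000000LL;

// distance product: c[i][j] = min over k of (a[i][k] + b[k][j]), capped at INF
matrix min_plus(const matrix &a, const matrix &b) {
    int n = a.size();
    matrix c(n, std::vector<long long>(n, INF));
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i][j] = std::min(c[i][j], a[i][k] + b[k][j]);
    return c;
}

// O(log n) squarings: after t of them, d covers paths of at most 2^t edges
matrix all_pairs_by_squaring(matrix d) {
    int n = d.size();
    for (int len = 1; len < n; len *= 2)
        d = min_plus(d, d);
    return d;
}

// the same relaxation, done in one pass with k as the outermost loop
matrix floyd_warshall(matrix d) {
    int n = d.size();
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                d[i][j] = std::min(d[i][j], d[i][k] + d[k][j]);
    return d;
}
```

Both functions should return the same matrix of shortest distances; the first performs $O(n^3 \log n)$ operations and the second $O(n^3)$.
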

From 82ddb7412be5c82bb937185e5c199cb7c418fe23 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 6 Apr 2022 20:32:17 +0300
Subject: [PATCH 023/173] scalar matmul edits

---
 content/english/hpc/algorithms/matmul.md | 49 ++++++++++++++----------
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index e01544db..a6237da5 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -17,13 +17,15 @@ nomove 0.303826 23.295860130469414
 blas 0.27489790320396423 25.747333528217077
 -->
 
-In this case study, we will design and implement several algorithms for matrix multiplication. We start with the naive "for-for-for" algorithm and incrementally improve it, eventually developing an implementation that is 50 times faster and matches the performance of BLAS libraries while being under 40 lines of C.
+In this case study, we will design and implement several algorithms for matrix multiplication.
 
-We compile our implementations with GCC 13 and run them on Zen 2 clocked at 2GHz.
+We start with the naive "for-for-for" algorithm and incrementally improve it, eventually arriving at a version that is 50 times faster and matches the performance of BLAS libraries while being under 40 lines of C.
+
+All implementations are compiled with GCC 13 and run on a [Zen 2](https://en.wikichip.org/wiki/amd/microarchitectures/zen_2) CPU clocked at 2GHz.
 
 ## Baseline
 
-The result of multiplying an $l \times n$ matrix $A$ by an $n \times m$ matrix $B$ is an $l \times m$ matrix $C$ calculated as:
+The result of multiplying an $l \times n$ matrix $A$ by an $n \times m$ matrix $B$ is defined as an $l \times m$ matrix $C$ calculated as
 
 $$
 C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj}
@@ -31,7 +33,7 @@ $$
 
 For simplicity, we will only consider *square* matrices, where $l = m = n$.
 
-To implement matrix multiplication, we can just transfer this definition into code — but instead of two-dimensional arrays (aka matrices), we will be using one-dimensional arrays, to be explicit about memory addressing:
+To implement matrix multiplication, we can simply transfer this definition into code — but instead of two-dimensional arrays (aka matrices), we will be using one-dimensional arrays to be explicit about pointer arithmetic:
 
 ```c++
 void matmul(const float *a, const float *b, float *c, int n) {
@@ -42,17 +44,17 @@ void matmul(const float *a, const float *b, float *c, int n) {
 }
 ```
 
-For reasons that will become apparent later, we only use matrix sizes that are multiples of $48$ for benchmarking, but the implementations are still correct for all other sizes. We also use [32-bit floats](/hpc/arithmetic/ieee-754) specifically, although it can be [generalized](#generalizations) to other types and operations.
+For reasons that will become apparent later, we only use matrix sizes that are multiples of $48$ for benchmarking, but the implementations remain correct for all other sizes. We also use [32-bit floats](/hpc/arithmetic/ieee-754) specifically, although all implementations can be easily [generalized](#generalizations) to other data types and operations.
 
-Compiled with `g++ -O3 -march=native -ffast-math -funroll-loops`, the naive approach multiplies two matrices of size $n = 1920 = 48 \times 40$ in ~16.7 seconds. That is approximately $\frac{1920^3}{16.7 \times 10^9} \approx 0.42$ useful operations per nanosecond (GFLOPS), or roughly 5 CPU cycles per multiplication.
+Compiled with `g++ -O3 -march=native -ffast-math -funroll-loops`, the naive approach multiplies two matrices of size $n = 1920 = 48 \times 40$ in ~16.7 seconds. Put in perspective, it is approximately $\frac{1920^3}{16.7 \times 10^9} \approx 0.42$ useful operations per nanosecond (GFLOPS), or roughly 5 CPU cycles per multiplication, which doesn't look that good yet.
 
 ## Transposition
 
 In general, when you optimize an algorithm that processes large quantities of data — and $1920^2 \times 3 \times 4 \approx 42$ MB clearly is a large quantity as it can't fit into any of the [CPU caches](/hpc/cpu-cache) — you should always start with memory before optimizing arithmetic, as it is much more likely to be the bottleneck.
 
-Note that the field $C_{ij}$ can be viewed as the dot product of row $i$ in matrix $A$ and column $j$ in matrix $B$. As we are incrementing the `k` variable in the inner loop above, we are reading the matrix `a` sequentially, but we are jumping over $n$ elements as we iterate over a column of `b`, which is [not as fast](/hpc/cpu-cache/aos-soa).
+The field $C_{ij}$ can be seen as the dot product of row $i$ of matrix $A$ and column $j$ of matrix $B$. As we increment `k` in the inner loop above, we are reading the matrix `a` sequentially, but we are jumping over $n$ elements as we iterate over a column of `b`, which is [not as fast](/hpc/cpu-cache/aos-soa) as sequential iteration.
 
-One [well-known optimization](/hpc/external-memory/oblivious/#matrix-multiplication) that mitigates this problem is to either store matrix $B$ in *column-major* order or to *transpose* it before the matrix multiplication — spending $O(n^2)$ additional operations, but ensuring sequential reads in the hot loop:
+One [well-known](/hpc/external-memory/oblivious/#matrix-multiplication) optimization that tackles this problem is to store matrix $B$ in *column-major* order — or, alternatively, to *transpose* it before the matrix multiplication. This requires $O(n^2)$ additional operations but ensures sequential reads in the innermost loop:
 
 <!--
 
@@ -79,15 +81,13 @@ This code runs in ~12.4s, or about 30% faster. As we will see in a bit, there ar
 
 ## Vectorization
 
-Now that we are just sequentially reading the elements of `a` and `b`, multiplying them, and adding the result to an accumulator variable, we can use [SIMD](/hpc/simd/) instructions to speed it up like [any other reduction](/hpc/simd/reduction/).
-
-We can use [GCC vector types](/hpc/simd/intrinsics/#gcc-vector-extensions) to implement it:
+Now that all we do is just sequentially read the elements of `a` and `b`, multiply them, and add the result to an accumulator variable, we can use [SIMD](/hpc/simd/) instructions to speed it all up. It is pretty straightforward to implement using [GCC vector types](/hpc/simd/intrinsics/#gcc-vector-extensions) — we can [memory-align](/hpc/cpu-cache/alignment/) matrix rows, pad them with zeros, and then just compute the multiply-sum as we would normally compute [any other reduction](/hpc/simd/reduction/):
 
 ```c++
 // a vector of 256 / 32 = 8 floats
 typedef float vec __attribute__ (( vector_size(32) ));
 
-// helper function that allocates n vectors and initializes them with zeros
+// a helper function that allocates n vectors and initializes them with zeros
 vec* alloc(int n) {
     vec* ptr = (vec*) std::aligned_alloc(32, 32 * n);
     memset(ptr, 0, 32 * n);
@@ -95,13 +95,12 @@ vec* alloc(int n) {
 }
 
 void matmul(const float *_a, const float *_b, float *c, int n) {
-    // first, we need to align rows and pad them with zeros
     int nB = (n + 7) / 8; // number of 8-element vectors in a row (rounded up)
 
     vec *a = alloc(n * nB);
     vec *b = alloc(n * nB);
 
-    // move both matrices
+    // move both matrices to the aligned region
     for (int i = 0; i < n; i++) {
         for (int j = 0; j < n; j++) {
             a[i * nB + j / 8][j % 8] = _a[i * n + j];
@@ -128,11 +127,13 @@ void matmul(const float *_a, const float *_b, float *c, int n) {
 }
 ```
 
-The performance for $n = 1920$ is now around 2.3 GFLOPS — or another ~4 times higher.
+The performance for $n = 1920$ is now around 2.3 GFLOPS — or another ~4 times higher compared to the transposed but not vectorized version.
 
 ![](../img/mm-vectorized-barplot.svg)
 
-This optimization looks neither too complex or specific to matrix multiplication. Why can't the compiler simply [auto-vectorizate](/hpc/simd/auto-vectorization/) the inner loop? It actually can — the only thing preventing that is the possibility that `c` overlaps with either `a` or `b`. The only thing that you need to do is to guarantee that `c` is not [aliased](/hpc/compilation/contracts/#memory-aliasing) with anything by adding the `__restrict__` keyword to it:
+This optimization looks neither too complex nor specific to matrix multiplication. Why can't the compiler [auto-vectorize](/hpc/simd/auto-vectorization/) the inner loop by itself?
+
+It actually can — the only thing preventing that is the possibility that `c` overlaps with either `a` or `b`. The only thing that you need to do is to guarantee that `c` is not [aliased](/hpc/compilation/contracts/#memory-aliasing) with anything by adding the `__restrict__` keyword to it:
 
 <!-- (the compiler already knows that reading `a` and `b` is safe in any order because they are marked as `const`): -->
 
@@ -144,21 +145,27 @@ void matmul(const float *a, const float *_b, float * __restrict__ c, int n) {
 
 Both manually and auto-vectorized implementations perform roughly the same.
 
+<!--
+
 The performance is bottlenecked by using a single variable. We could use multiple variables similar to other reductions, but we will solve it later anyway.
 
+-->
+
 ## Memory efficiency
 
-Now, what is interesting is that the implementation efficiency depends on the problem size. 
+What is interesting is that the implementation efficiency depends on the problem size. 
 
-At first, the performance (in terms of useful operations per second) increases, as the overhead of the loop management and horizontal reduction decreases. Then, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/).
+At first, the performance (in terms of useful operations per second) increases as the overhead of the loop management and horizontal reduction decreases. Then, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/).
 
 ![](../img/mm-vectorized-plot.svg)
 
-It is also interesting that the naive implementation is mostly on par with the non-vectorized transposed version — and actually slightly better because of the transpose itself — for all but few data points, where the performance deteriorates. This is because of [cache associativity](/hpc/cpu-cache/associativity/): when $n$ is divisible by a large power of two, we are fetching addresses of `b` that all likely map to the same cache line, reducing the effective cache size. This explains the 30% performance dip for $n = 1920 = 2^7 \times 3 \times 5$, and you can see an even more noticeable one for $1536 = 2^9 \times 3$: it is roughly 3 times slower than for $n=1535$.
+It is also interesting that the naive implementation is mostly on par with the non-vectorized transposed version — and even slightly better because it doesn't need to perform a transposition.
+
+One might think that there would be some *general* performance gain from doing sequential reads since we are fetching fewer cache lines, but this is not the case: fetching the first column of `b` indeed takes more time, but the next 15 column reads will be in the same cache lines as the first one, so they will be cached — unless the matrix is so large that it can't even fit `n * cache_line_size` bytes into the cache, which is not the case for any practical matrix sizes.
 
-One may think that there would be at least some general performance gain from full sequential reads since we are fetching fewer cache lines, but this is not the case: fetching the first column of `b` is painful, but the next 15 columns will actually be in the same cache lines as the first one, so they will be cached — unless the matrix is so large that it can't even fit `n * cache_line_size` bytes into the cache, which is not the case for all practical problem sizes.
+Instead, the performance deteriorates on only a few specific matrix sizes due to the effects of [cache associativity](/hpc/cpu-cache/associativity/): when $n$ is a multiple of a large power of two, we are fetching the addresses of `b` that all likely map to the same cache set, which reduces the effective cache size. This explains the 30% performance dip for $n = 1920 = 2^7 \times 3 \times 5$, and you can see an even more noticeable one for $1536 = 2^9 \times 3$: it is roughly 3 times slower than for $n=1535$.
 
-So, counterintuitively, transposing the matrix doesn't help the memory bandwidth — and in the naive implementation, we are not really bottlenecked by it anyway. But for our vectorize implementation, we certainly are, so let's tackle it.
+So, counterintuitively, transposing the matrix doesn't help with caching — and in the naive implementation, we are not really bottlenecked by the memory bandwidth anyway. But our vectorized implementation certainly is, so let's work on its I/O efficiency.
 
 ## Register reuse
 
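
The transposition step that this patch describes is hidden by the hunk boundaries, so here is a minimal sketch of the idea. It is an assumption-based illustration: the name `matmul_transposed_sketch` is hypothetical, and unlike the kernels elsewhere in the article, it overwrites `c` instead of accumulating into it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: multiply two n x n row-major matrices,
// transposing b first so that the innermost loop reads both inputs sequentially.
void matmul_transposed_sketch(const float *a, const float *b, float *c, int n) {
    // O(n^2) extra work and memory for the transposed copy of b
    std::vector<float> bt(n * n);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            bt[i * n + j] = b[j * n + i];

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float s = 0;
            // row i of a and row j of bt are both contiguous in memory
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = s;
        }
    }
}
```

As the text notes, this mostly pays off for matrix sizes that suffer from cache associativity effects, and it is what makes the inner loop a plain reduction that can then be vectorized.
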

From 55edc44d68bf04054446ee084f575a397a5ee66b Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 6 Apr 2022 21:48:29 +0300
Subject: [PATCH 024/173] matmul kernel

---
 content/english/hpc/algorithms/matmul.md | 83 ++++++++++++++++--------
 1 file changed, 56 insertions(+), 27 deletions(-)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index a6237da5..64282bd3 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -169,36 +169,41 @@ So, counterintuitively, transposing the matrix doesn't help with caching — and
 
 ## Register reuse
 
-Any two cells of A and B are used to update some cell of C.
+Using a Python-like notation to refer to submatrices, to compute the cell $C[x][y]$, we need to calculate the dot product of $A[x][:]$ and $B[:][y]$, which requires fetching $2n$ elements, even if we store $B$ in column-major order.
 
-To compute the cell $C[i][j]$, we need to compute the dot product of $A[i][:]$ and $B[:][j]$ (we are using the Python notation here to select rows and columns), which requires fetching $2n$ elements, even when $B$ is stored in column-major order.
+<!-- Any two cells of A and B are used to update some cell of C. -->
 
-What if we were to compute $C[i:i+2][j:j+2]$, a $2 \times 2$ submatrix of $C$? We would need $A[i:i+2][:]$ and $B[:][j:j+2]$, which is $4n$ elements in total: that is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency.
+To compute $C[x:x+2][y:y+2]$, a $2 \times 2$ submatrix of $C$, we would need two rows from $A$ and two columns from $B$, namely $A[x:x+2][:]$ and $B[:][y:y+2]$, containing $4n$ elements in total, to update four elements instead of one — which is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency.
+
+<!--
 
 To actually avoid reading more data, we need to read these $2+2$ rows and columns in parallel and update all $2 \times 2$ cells at once using all possible combinations of products.
 
-Here is a proof of concept:
+-->
+
+To avoid re-fetching data, we need to iterate these rows and columns in parallel and calculate all $2 \times 2$ possible combinations of products. Here is a proof of concept:
 
 ```c++
-void update_2x2(int x, int y) {
-    int c00 = c[x][y],
-        c01 = c[x][y + 1],
-        c10 = c[x + 1][y],
-        c11 = c[x + 1][y + 1];
+void kernel_2x2(int x, int y) {
+    int c00 = 0, c01 = 0, c10 = 0, c11 = 0;
 
     for (int k = 0; k < n; k++) {
+        // read rows
         int a0 = a[x][k];
         int a1 = a[x + 1][k];
 
+        // read columns
         int b0 = b[k][y];
         int b1 = b[k][y + 1];
 
+        // update all combinations
         c00 += a0 * b0;
         c01 += a0 * b1;
         c10 += a1 * b0;
         c11 += a1 * b1;
     }
 
+    // write the results to C
     c[x][y]         = c00;
     c[x][y + 1]     = c01;
     c[x + 1][y]     = c10;
@@ -206,52 +211,74 @@ void update_2x2(int x, int y) {
 }
 ```
 
-It also boosts instruction-level parallelism (we don't have to wait between iterations to update the loop state) and saves some cycles from executing the read instructions.
+We can now simply call this kernel on all 2x2 submatrices of $C$, but we won't bother evaluating it: although this algorithm is better in terms of I/O operations, it would still not beat our SIMD-based implementation. Instead, we will extend this approach and develop a similar *vectorized* kernel right away.
+
+<!-- It also boosts instruction-level parallelism (we don't have to wait between iterations to update the loop state) and saves some cycles from executing the read instructions.
 
 Of course, although better in terms of I/O, this $2 \times 2$ update would not beat our vectorized implementation, so we are not going to try this version in particular and instead will scale the idea right away.
 
+-->
+
 ## Designing the kernel
 
-We follow this approach and design a general kernel that updates a $h \times w$ submatrix of C using columns from $l$ to $r$ of $A$ and rows from $l$ to $r$ of $B$ (i. e. not a full computation, but only a partial update — it will be clear why later). We have several considerations:
+Instead of designing a kernel that computes an $h \times w$ submatrix of $C$ from scratch, we will declare a function that *updates* it using columns from $l$ to $r$ of $A$ and rows from $l$ to $r$ of $B$. For now, this seems like an over-generalization, but this API will be useful later.
+
+<!--
 
-- In general, if we are updating an $h \times w$ submatrix, we will be fetching $2 \cdot n \cdot (h + w)$ elements to update $h \cdot w$ elements. We want that ratio of $\frac{h \cdot w}{2 \cdot n \cdot (h + w)}$ to be as high as possible, which is achieved with large square-ish submatrices.
-- We want to use the [FMA](https://en.wikipedia.org/wiki/FMA_instruction_set) ("fused multiply-add") instructions that are available on all modern x86 architectures. As you can guess from the name, they perform a vector `c += a * b` operation in one go, which is the core of our computation.
-- We want to be able to exploit [instruction-level parallelism](/hpc/pipelining/) to achieve better utilizaiton of this instruction. On Zen 2, the `fma` instruction has the latency of 5 and the throughput of 2, meaning that we need to concurrently execute at least $5 \times 2 = 10$ of them to fully saturate its execution ports.
-- We only have $16$ logical vector registers that we can use as accumulators, and we want to avoid register spill.
+We follow this approach and design a general kernel that updates a $h \times w$ submatrix of C using columns from $l$ to $r$ of $A$ and rows from $l$ to $r$ of $B$ (i. e. not a full computation, but only a partial update — it will be clear why later). 
+
+-->
 
-For these reasons, we settle on a $6 \times 16$ kernel. We process $96$ elements at once, which can be stored in $6 \times 2 = 12$ vector registers (we need some more to store temporary values). We [broadcast](/hpc/simd/moving/#broadcast) an element of A, and then use it to update the first row ($8 + 8$ elements). Then we load the one below it, and so on. When we have updated the last row, we move to the next $6$ elements to the right.
+To determine $h$ and $w$, we have several performance considerations:
+
+- In general, to compute an $h \times w$ submatrix, we need to fetch $h$ rows of $A$ and $w$ columns of $B$, or $n \cdot (h + w)$ elements in total. To optimize the I/O efficiency, we would want the $\frac{h \cdot w}{h + w}$ ratio to be high, which is achieved with large and square-ish submatrices.
+- We want to use the [FMA](https://en.wikipedia.org/wiki/FMA_instruction_set) ("fused multiply-add") instruction available on all modern x86 architectures. As you can guess from the name, it performs the `c += a * b` operation — which is the core of a dot product — on 8-element vectors in one go, which saves us from executing vector multiplication and addition separately.
+- To achieve better utilization of this instruction, we want to make use of [instruction-level parallelism](/hpc/pipelining/). On Zen 2, the `fma` instruction has a latency of 5 and a throughput of 2, meaning that we need to concurrently execute at least $5 \times 2 = 10$ of them to saturate its execution ports.
+- We want to avoid register spill, and we only have $16$ logical vector registers that we can use as accumulators.
+
+For these reasons, we settle on a $6 \times 16$ kernel. This way, we process $96$ elements at once, which can be stored in $6 \times 2 = 12$ vector registers (we can't use an $8 \times 16$ kernel and use all 16 vector registers because we need some to hold temporary values).
+
+To update them efficiently, we use the following procedure:
+
+<!--
+
+We [broadcast](/hpc/simd/moving/#broadcast) an element of A, and then use it to update the first row ($8 + 8$ elements). Then we load the one below it, and so on. When we have updated the last row, we move to the next $6$ elements to the right.
 
 The final implementation is simpler than it sounds:
 
+-->
+
 ```c++
 // update 6x16 submatrix C[x:x+6][y:y+16]
 // using A[x:x+6][l:r] and B[l:r][y:y+16]
 void kernel(float *a, vec *b, vec *c, int x, int y, int l, int r, int n) {
-    vec t[6][2]{}; // will be stored in ymm registers
+    vec t[6][2]{}; // will be zero-filled and stored in ymm registers
 
     for (int k = l; k < r; k++) {
         for (int i = 0; i < 6; i++) {
-            vec alpha = vec{} + a[(x + i) * n + k];  // broadcast
+            // broadcast a[x + i][k] into a register
+            vec alpha = vec{} + a[(x + i) * n + k]; // converts to a broadcast
+            // multiply b[k][y:y+16] by it and update t[i][0] and t[i][1]
             for (int j = 0; j < 2; j++)
-                t[i][j] += alpha * b[(k * n + y) / 8 + j]; // fused multiply-add
+                t[i][j] += alpha * b[(k * n + y) / 8 + j]; // converts to an fma
         }
     }
 
+    // write the results back to C
     for (int i = 0; i < 6; i++)
         for (int j = 0; j < 2; j++)
             c[((x + i) * n + y) / 8 + j] += t[i][j];
 }
 ```
 
-We need `t` so that the compiler stores these elements in vector registers. We could just update the final destinations, but unfortunately, the compiler re-writes them back to memory, causing a huge slowdown — and wrapping everything in `__restrict__` keywords doesn't help.
+We need `t` so that the compiler stores these elements in vector registers. We could just update the final destinations, but, unfortunately, the compiler re-writes them back to memory, causing a slowdown (wrapping everything in `__restrict__` keywords doesn't help).
 
-The rest of the implementaiton is straightforward. Similar to the previous vectorized implementation, we just allocate aligned arrays and call the kernel instead of the innermost loop:
+The rest of the implementation is straightforward. Similar to the previous vectorized implementation, we just allocate aligned arrays and call the kernel instead of the innermost loop:
 
 ```c++
 void matmul(const float *_a, const float *_b, float *_c, int n) {
-    // to avoid implementing partials,
-    // we pad height to nearest 6 and width to 16
-    
+    // to simplify the implementation, we pad the height and width
+    // so that they are divisible by 6 and 16 respectively
     int nx = (n + 5) / 6 * 6;
     int ny = (n + 15) / 16 * 16;
     
@@ -277,15 +304,15 @@ void matmul(const float *_a, const float *_b, float *_c, int n) {
 }
 ```
 
-This improves the performance by another ~40%:
+This improves the benchmark performance, but only by ~40%:
 
 ![](../img/mm-kernel-barplot.svg)
 
-The speedup is much better (2-3x) on smaller arrays, indicating that there is still a bandwidth problem:
+The speedup is much higher (2-3x) on smaller arrays, indicating that there is still a bandwidth problem:
 
 ![](../img/mm-kernel-plot.svg)
 
-If you've read the section on [cache-oblivious algorithms](/hpc/external-memory/oblivious/), you know that one universal solution to these types of things is to split matrices in four parts, do eight recursive block matrix multiplications until the matrix fits into cache, and carefully combine the results together. We will follow a different, simpler approach.
+Now, if you've read the section on [cache-oblivious algorithms](/hpc/external-memory/oblivious/), you know that one universal solution to these types of things is to split all matrices into four parts, perform eight recursive block matrix multiplications, and carefully combine the results together. This solution is okay in practice, but there is some [overhead to recursion](/hpc/architecture/functions/), and it also doesn't allow us to fine-tune the algorithm, so instead, we will follow a different, simpler approach.
 
 ## Blocking
 
@@ -419,6 +446,8 @@ for (int k = 0; k < n; k++)
             d[i][j] = min(d[i][j], d[i][k] + d[k][j]);
 ```
 
+Vectorizing the distance product and executing it $O(\log n)$ times is faster than than naively executing the Floyd-Warshall algorithm, although not by a lot.
+
 As an exercise, try to think of ways of speeding up this "for-for-for" computation. It will be harder than matrix multiplication because you need to perform updates in this particular order.
 
 ## Acknowledgements

From d129828bb3d764ccf146eb41df189aac7559a4dc Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 6 Apr 2022 22:17:08 +0300
Subject: [PATCH 025/173] matmul cache blocking

---
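Note for reviewers: the macro-kernel loop order described in this patch can be sanity-checked with a scalar stand-in for the micro-kernel. The sketch below is a simplification under stated assumptions, not the code from the article: `matmul_naive`, `matmul_blocked`, `make_matrix`, and the default block sizes `s3`, `s2`, `s1` are hypothetical names and values chosen for illustration, and there is no SIMD, padding, or alignment.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// reference: C = A * B for row-major n x n matrices
std::vector<float> matmul_naive(const std::vector<float> &a,
                                const std::vector<float> &b, int n) {
    std::vector<float> c(n * n, 0.0f);
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// same outer loop order as the macro-kernel;
// s3, s2, s1 are illustrative block sizes, not tuned values
std::vector<float> matmul_blocked(const std::vector<float> &a,
                                  const std::vector<float> &b, int n,
                                  int s3 = 16, int s2 = 12, int s1 = 8) {
    std::vector<float> c(n * n, 0.0f);
    for (int i3 = 0; i3 < n; i3 += s3)          // block of columns of B
        for (int i2 = 0; i2 < n; i2 += s2)      // block of rows of A
            for (int i1 = 0; i1 < n; i1 += s1)  // block of rows of B, i.e. [l, r)
                // scalar stand-in for the micro-kernel: a partial update of
                // c[i2:i2+s2][i3:i3+s3] using k in [i1, i1+s1)
                for (int x = i2; x < std::min(i2 + s2, n); x++)
                    for (int y = i3; y < std::min(i3 + s3, n); y++) {
                        float t = 0;
                        for (int k = i1; k < std::min(i1 + s1, n); k++)
                            t += a[x * n + k] * b[k * n + y];
                        c[x * n + y] += t;
                    }
    return c;
}

// fill a matrix with small integers so that float arithmetic stays exact
std::vector<float> make_matrix(int n, int seed) {
    std::vector<float> m(n * n);
    for (int i = 0; i < n * n; i++)
        m[i] = float((i * 7 + seed) % 5);
    return m;
}
```

Calling both functions on the same inputs (for example, with `n = 50`, which is not a multiple of any of the block sizes) should give identical results, since every intermediate sum is a small integer that a 32-bit float represents exactly.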
 content/english/hpc/algorithms/matmul.md | 32 +++++++++++-------------
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 64282bd3..2126daea 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -316,19 +316,20 @@ Now, if you've read the section on [cache-oblivious algorithms](/hpc/external-me
 
 ## Blocking
 
-Alternative to divide-and-conquer is *cache blocking*: selecting a subset of data and processing it, and then going to the next block. Sometimes blocking is hierarchical: we first select a block of data that fits into the L3 cache, then we split it into blocks that fit into the L2 cache, and so on.
+The *cache-aware* alternative to this divide-and-conquer trick is *cache blocking*: splitting the data into blocks that can fit into the cache and processing them one by one. If we have more than one layer of cache, we can do hierarchical blocking: we first select a block of data that fits into the L3 cache, then we split it into blocks that fit into the L2 cache, and so on. This requires knowing the cache sizes in advance, but it is usually easier to implement and also faster in practice.
 
-It is less trivial to do for matrices than for arrays, but the trick is like this:
+Cache blocking is less trivial to do with matrices than with arrays, but the general idea is this:
 
-- Let's select a subset of B that fits into the L3 cache (say, a subset of its columns).
-- Now, let's select a submatrix of A that fits into the L2 cache (a subset of its rows).
-- Select a submatrix of previously selected submatrix of B that fits into the L1 cache, and use it to do the kernel update (a subset of its rows).
+- Select a submatrix of $B$ that fits into the L3 cache (say, a subset of its columns).
+- Select a submatrix of $A$ that fits into the L2 cache (say, a subset of its rows).
+- Select a submatrix of the previously selected submatrix of $B$ (a subset of its rows) that fits into the L1 cache.
+- Update the relevant submatrix of $C$ using the kernel.
 
-Here is a good [visualization](https://jukkasuomela.fi/cache-blocking-demo/) by Jukka Suomela (it shows different approaches; we use the last one).
+Here is a good [visualization](https://jukkasuomela.fi/cache-blocking-demo/) by Jukka Suomela (it features many different approaches; you are interested in the last one).
 
-We could have started with A, but this would be slower. Note that during the kernel execution, we are reading the elements of $A$ slower than elements of $B$: we are fetching and broadcasting just one element, and then we multiply it with $16$ elements of $B$, so we need to store $B$ in cache, and the last stage be about selecting B in cache.
+Note that the decision to start this process with matrix $B$ is not arbitrary. During the kernel execution, we are reading the elements of $A$ much slower than the elements of $B$: we fetch and broadcast just one element of $A$ and then multiply it with $16$ elements of $B$. Therefore, we want $B$ to be in the L1 cache while $A$ can stay in the L2 cache and not the other way around.
 
-We can implement it with three more outer `for` loops:
+This sounds complicated, but we can implement it with just three more outer `for` loops, which are collectively called the *macro-kernel* (and the highly optimized low-level function that updates a 6x16 submatrix is called the *micro-kernel*):
 
 ```c++
 const int s3 = 64;  // how many columns of B to select
@@ -341,24 +342,21 @@ for (int i3 = 0; i3 < ny; i3 += s3)
         // now we are working with a[i2:i2+s2][:]
         for (int i1 = 0; i1 < ny; i1 += s1)
             // now we are working with b[i1:i1+s1][i3:i3+s3]
-            // this equates to updating c[i2:i2+s2][i3:i3+s3]
-            // with [l:r] = [i1:i1+s1]
+            // and we need to update c[i2:i2+s2][i3:i3+s3] with [l:r] = [i1:i1+s1]
             for (int x = i2; x < std::min(i2 + s2, nx); x += 6)
                 for (int y = i3; y < std::min(i3 + s3, ny); y += 16)
                     kernel(a, (vec*) b, (vec*) c, x, y, i1, std::min(i1 + s1, n), ny);
 ```
 
-These outer `for` loops are sometimes called *macro-kernel* (as opposed to the *micro-kernel* that only updates a 6x16 submatrix).
+Cache blocking completely removes the memory bottleneck:
 
-It completely removes the memory bottleneck:
-
-![](../img/mm-blocked-plot.svg)
+![](../img/mm-blocked-barplot.svg)
 
-The performance is no longer seriously affected by the problem size:
+The performance is no longer significantly affected by the problem size:
 
-![](../img/mm-blocked-barplot.svg)
+![](../img/mm-blocked-plot.svg)
 
-Notice the dip at $1536$ is still there. Cache associativity affects the effective cache size. We need to adjust the step constants or insert holes into the layout to mitigate this.
+Notice that the dip at $1536$ is still there: cache associativity still affects the effective cache size. To mitigate this, we can adjust the step constants or insert holes into the layout, but we are not going to bother doing that for now.
 
 ## Optimization
 

From f50135e9fa4cd1937da55eb1df4d5077d26e70df Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 6 Apr 2022 22:57:16 +0300
Subject: [PATCH 026/173] matmul final edits

---
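Note for reviewers: the claim in the "Generalizations" section that iterating the distance product $O(\log n)$ times yields the same all-pairs shortest paths as Floyd-Warshall can be checked with a small scalar program. This is a minimal sketch, assuming integer weights and a sentinel `INF` for missing edges; `matrix`, `INF`, `make_graph`, `repeated_squaring`, and `floyd_warshall` are hypothetical names introduced here and do not appear in the article.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using matrix = std::vector<std::vector<int>>;

// sentinel for "no edge"; INF + INF still fits into a 32-bit int
const int INF = 1000000000;

// deterministic test graph: zero diagonal, some edges missing
matrix make_graph(int n) {
    matrix d(n, std::vector<int>(n, INF));
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (i == j)
                d[i][j] = 0;
            else if ((i * 5 + j * 3) % 4 != 0)
                d[i][j] = (i * 31 + j * 17) % 23 + 1;
        }
    return d;
}

// repeated in-place distance product: after each pass, d covers
// paths with at least twice as many edges as before
matrix repeated_squaring(matrix d) {
    int n = d.size();
    for (int len = 1; len < n - 1; len *= 2)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    d[i][j] = std::min(d[i][j], d[i][k] + d[k][j]);
    return d;
}

// the same relaxations in the Floyd-Warshall order, done in one pass
matrix floyd_warshall(matrix d) {
    int n = d.size();
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                d[i][j] = std::min(d[i][j], d[i][k] + d[k][j]);
    return d;
}
```

Both functions take the edge weight matrix by value and return the matrix of shortest distances, so comparing `repeated_squaring(d)` with `floyd_warshall(d)` on the same input is enough for a quick check. A vectorized version would replace the innermost loops with a kernel similar to the one used for matrix multiplication.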
 content/english/hpc/algorithms/matmul.md | 52 ++++++++++++++----------
 1 file changed, 30 insertions(+), 22 deletions(-)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 2126daea..e6749b81 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -360,33 +360,39 @@ Notice that the dip at $1536$ is still there: cache associativity still affects
 
 ## Optimization
 
-We need a few more optimizations to reach the performance limit:
+To get closer to the performance limit, we need a few more optimizations:
 
-- Remove memory allocation and just operate on the arrays that we are given (note that we don't need to do anything with `a` as we are reading just one element at a time, and we can use unaligned `store` for `c` as we only use it rarely).
-- Get rid of the `std::min` so that the size parameters are mostly constant and can be embedded into the machine code.
-- Rewrite the micro-kernel by hand using 12 variables (the compiler seems to have a problem with keeping them fully in registers).
+- Remove memory allocation and operate on the arrays that are passed to the function. Note that we don't need to do anything with `a` as we are reading just one element at a time, and we can use an unaligned `store` for `c` as we only use it rarely.
+- Get rid of the `std::min` so that the size parameters are (mostly) constant and can be embedded into the machine code by the compiler (which also lets it [unroll](/hpc/architecture/loops/) the micro-kernel loop more efficiently without runtime checks).
+- Rewrite the micro-kernel by hand using 12 vector variables (the compiler seems to struggle with keeping them in registers and writes them first to temporary storage and only then to $C$).
+
+These optimizations are straightforward but quite tedious to implement, so we are not going to list [the code](https://github.com/sslotin/amh-code/blob/main/matmul/v5-unrolled.cc) in the article. It also requires some more work to effectively support "weird" matrix sizes, which is why we only run benchmarks for sizes that are multiples of $48 = \frac{6 \cdot 16}{\gcd(6, 16)}$.
+
+<!--
 
 Effectively supporting weird sizes requires a bit more work, and this is the reason why we benchmarked at an array sizes that are divisible by $48 = \frac{6 \cdot 16}{\gcd(6, 16)}$. We leave the code out, because the change is large and tedious and involves slightly modifying the benchmarking code itself. It is straightforward, but we only implement the version for this particular size, whithout any safety checks. Cheating on the benchmark.
 
-https://github.com/sslotin/amh-code/blob/main/matmul/v5-unrolled.cc
+But avoiding moving anything pays off. 
 
-Avoiding moving anything pays off. These improvements sum up and give us a 50% improvement:
+-->
+
+These individually small improvements add up and result in another 50% improvement:
 
 ![](../img/mm-noalloc.svg)
 
-We are actually not that far from the theoretical performance limit — which can be calculated as the throughput of the SIMD lane width times the fma instruction times the clock frequency:
+We are actually not that far from the theoretical performance limit, which can be calculated as the number of SIMD lanes times the `fma` instruction throughput times the clock frequency:
 
 $$
 \underbrace{8}_{SIMD} \cdot \underbrace{2}_{thr.} \cdot \underbrace{2 \cdot 10^9}_{cycles/sec} = 32 \; GFLOPS \;\; (3.2 \cdot 10^{10})
 $$
 
-A more realistic comparison is some practical library, such as [https://www.openblas.net/](OpenBLAS). We just call it from Python using [numpy](/hpc/complexity/languages/#blas), so there may be some minor overhead, but reaching 80% of theoretical performance seems plausible (matrix multiplication is not the only thing that CPUs are made for):
+It is more useful to compare against some practical library, such as [OpenBLAS](https://www.openblas.net/). The laziest way is to simply invoke matrix multiplication from Python with [numpy](/hpc/complexity/languages/#blas). There may be some minor overhead, but it ends up reaching 80% of the theoretical limit, which seems plausible (this overhead is typical, as matrix multiplication is not the only thing that CPUs are made for):
 
 ![](../img/mm-blas.svg)
 
-We've reached ~93% of BLAS and ~75% of the theoretical performance limit. Which is really great for what is essentially just 40 lines of C.
+We've reached ~93% of BLAS performance and ~75% of the theoretical performance limit, which is really great for what is essentially just 40 lines of C.
 
-Interestingly, the whole thing can be rolled into one large `for` loop (assuming that we are in 2050 and using the 35th version of GCC, which finally properly manager not to screwing up with register spilling.):
+Interestingly, the whole thing can be rolled into just one deeply nested `for` loop (assuming that we are in 2050 and using the 35th version of GCC, which finally does not screw up with register spilling.):
 
 ```c++
 for (int i3 = 0; i3 < n; i3 += s3)
@@ -402,21 +408,23 @@ for (int i3 = 0; i3 < n; i3 += s3)
                                    * b[n / 8 * k + y / 8 + j];
 ```
 
-There is also a way to do fewer arithmetic operations — [the Strassen algorithm](/hpc/external-memory/oblivious/#strassen-algorithm) — but it has a large constant factor, and it is [only efficient for very large matrices](https://arxiv.org/pdf/1605.01078.pdf) ($n > 4000$) for which we typically use multi-threading anyway.
+There is also a way to do fewer arithmetic operations — [the Strassen algorithm](/hpc/external-memory/oblivious/#strassen-algorithm) — but it has a large constant factor, and it is only efficient for [very large matrices](https://arxiv.org/pdf/1605.01078.pdf) ($n > 4000$), where we typically have to use either multiprocessing or some approximate dimensionality-reducing methods anyway.
+
+<!-- for which we typically use multi-threading anyway -->
 
 ## Generalizations
 
-FMA also supports 64-bit floating point number, but it does not support integers: you need to perform addition and multiplication separately, which projects to decreased performance. If you know that all intermediate results can be represented exactly as a 32- or 64-bit floating-point number (which is [often the case](/hpc/arithmetic/errors/)), it may be better convert them to and from floats.
+FMA also supports 64-bit floating-point numbers, but it does not support integers: you need to perform addition and multiplication separately, which translates into decreased performance. If you can guarantee that all intermediate results can be represented exactly as a 32- or 64-bit floating-point number (which is [often the case](/hpc/arithmetic/errors/)), it may be better to convert them to and from floats.
 
-You can also apply the same trick to other similar computations. One example is the "min-plus matrix multiplication" defined as:
+You can also apply the same trick to other similar computations. One example is the "min-plus matrix multiplication," which is defined as:
 
 $$
-(D \circ D)_{ij} = \min_{1 \le k \le n} (D_{ik} + D_{kj})
+(A \circ B)_{ij} = \min_{1 \le k \le n} (A_{ik} + B_{kj})
 $$
 
-It is also known as the "distance product" due to its graph interpretation: the result is the matrix of shortest paths of length two between all pairs of vertices in a fully-connected weighted graph.
+It is also known as the "distance product" due to its graph interpretation: when applied to itself $(D \circ D)$, the result is the matrix of shortest paths of length two between all pairs of vertices in a fully-connected weighted graph specified by the edge weight matrix $D$.
 
-A cool thing about the distance product is that if if we iterate the process and calculate:
+A cool thing about the distance product is that if we iterate the process and calculate
 
 $$
 D_2 = D \circ D \\
@@ -425,7 +433,7 @@ D_8 = D_4 \circ D_4 \\
 \ldots
 $$
 
-Then we can find all-pairs shortest distances in $O(\log n)$ steps:
+…we can find all-pairs shortest paths in $O(\log n)$ steps:
 
 ```c++
 for (int l = 0; l < logn; l++)
@@ -435,7 +443,7 @@ for (int l = 0; l < logn; l++)
                 d[i][j] = min(d[i][j], d[i][k] + d[k][j]);
 ```
 
-This requires $O(n^3 \log n)$ operations, but if we do these two-edge relaxations in a particular order, we can do it with just one pass, which is known as the [Floyd-Warshall algorithm](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm):
+This requires $O(n^3 \log n)$ operations. If we do these two-edge relaxations in a particular order, we can do it with just one pass, which is known as the [Floyd-Warshall algorithm](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm):
 
 ```c++
 for (int k = 0; k < n; k++)
@@ -444,12 +452,12 @@ for (int k = 0; k < n; k++)
             d[i][j] = min(d[i][j], d[i][k] + d[k][j]);
 ```
 
-Vectorizing the distance product and executing it $O(\log n)$ times is faster than than naively executing the Floyd-Warshall algorithm, although not by a lot.
+Interestingly, vectorizing the distance product and executing it $O(\log n)$ times in $O(n^3 \log n)$ total operations is faster than naively executing the Floyd-Warshall algorithm in $O(n^3)$ operations, although not by a lot.
 
-As an exercise, try to think of ways of speeding up this "for-for-for" computation. It will be harder than matrix multiplication because you need to perform updates in this particular order.
+As an exercise, try to speed up this "for-for-for" computation. It is harder to do than in the matrix multiplication case because you need to perform updates in a particular order, but it is still possible to design a similar kernel and an iteration order that achieves a 30-50x total speedup.
 
 ## Acknowledgements
 
-The algorithm was originally designed by Kazushige Goto, and it is the basis of GotoBLAS and OpenBLAS. The author himself described it and some other aspects in more detail in "[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)".
+The final algorithm was originally designed by Kazushige Goto, and it is the basis of GotoBLAS and OpenBLAS. The author himself describes it in more detail in "[Anatomy of High-Performance Matrix Multiplication](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)".
 
-The exposition style is inspired by "[Programming Parallel Computers](http://ppc.cs.aalto.fi/)" course by Jukka Suomela, which features a [similar case study](http://ppc.cs.aalto.fi/ch2/) on speeding up the distance product.
+The exposition style is inspired by the "[Programming Parallel Computers](http://ppc.cs.aalto.fi/)" course by Jukka Suomela, which features a [similar case study](http://ppc.cs.aalto.fi/ch2/) on speeding up the distance product.

From 15e65f57a7b32c64d4afdabff0e726a542615fb5 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 6 Apr 2022 23:00:53 +0300
Subject: [PATCH 027/173] publish matmul

---
 content/english/hpc/algorithms/matmul.md    | 3 +--
 content/english/hpc/complexity/languages.md | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index e6749b81..01159313 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -1,7 +1,6 @@
 ---
 title: Matrix Multiplication
 weight: 20
-draft: true
 ---
 
 <!--
@@ -392,7 +391,7 @@ It is more useful to compare against some practical library, such as [OpenBLAS](
 
 We've reached ~93% of BLAS performance and ~75% of the theoretical performance limit, which is really great for what is essentially just 40 lines of C.
 
-Interestingly, the whole thing can be rolled into just one deeply nested `for` loop (assuming that we are in 2050 and using the 35th version of GCC, which finally does not screw up with register spilling.):
+Interestingly, the whole thing can be rolled into just one deeply nested `for` loop with BLAS-level performance (assuming that we're in 2050 and using GCC 35, which finally does not screw up register spilling):
 
 ```c++
 for (int i3 = 0; i3 < n; i3 += s3)
diff --git a/content/english/hpc/complexity/languages.md b/content/english/hpc/complexity/languages.md
index 72a7cf76..435b450d 100644
--- a/content/english/hpc/complexity/languages.md
+++ b/content/english/hpc/complexity/languages.md
@@ -204,7 +204,7 @@ print(duration)
 
 Now it takes ~0.12 seconds: a ~5x speedup over the auto-vectorized C version and ~5250x speedup over our initial Python implementation!
 
-You don't typically see such dramatic improvements. For now, we are not ready to tell you exactly how this is achieved. Implementations of dense matrix multiplication in OpenBLAS are typically [5000 lines of handwritten assembly](https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/dgemm_kernel_16x2_haswell.S) tailored separately for *each* architecture. In later chapters, we will explain all the relevant techniques one by one, and then return to this example and develop our own BLAS-level implementation using just under 40 lines of C.
+You don't typically see such dramatic improvements. For now, we are not ready to tell you exactly how this is achieved. Implementations of dense matrix multiplication in OpenBLAS are typically [5000 lines of handwritten assembly](https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/dgemm_kernel_16x2_haswell.S) tailored separately for *each* architecture. In later chapters, we will explain all the relevant techniques one by one, and then [return](/hpc/algorithms/matmul) to this example and develop our own BLAS-level implementation using just under 40 lines of C.
 
 ### Takeaway
 

From a1629e9aabce91355793cdc6cf35d1be6c56bebb Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 7 Apr 2022 01:35:51 +0300
Subject: [PATCH 028/173] matmul edits

---
 content/english/hpc/algorithms/matmul.md | 76 ++++++++++++------------
 1 file changed, 38 insertions(+), 38 deletions(-)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 01159313..e0ebdaac 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -24,7 +24,7 @@ All implementations are compiled with GCC 13 and run on a [Zen 2](https://en.wik
 
 ## Baseline
 
-The result of multiplying an $l \times n$ matrix $A$ by an $n \times m$ matrix $B$ is defined as an $l \times m$ matrix $C$ calculated as
+The result of multiplying an $l \times n$ matrix $A$ by an $n \times m$ matrix $B$ is defined as an $l \times m$ matrix $C$ such that:
 
 $$
 C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj}
@@ -32,7 +32,7 @@ $$
 
 For simplicity, we will only consider *square* matrices, where $l = m = n$.
 
-To implement matrix multiplication, we can simply transfer this definition into code — but instead of two-dimensional arrays (aka matrices), we will be using one-dimensional arrays to be explicit about pointer arithmetic:
+To implement matrix multiplication, we can simply transfer this definition into code, but instead of two-dimensional arrays (aka matrices), we will be using one-dimensional arrays to be explicit about pointer arithmetic:
 
 ```c++
 void matmul(const float *a, const float *b, float *c, int n) {
@@ -43,15 +43,15 @@ void matmul(const float *a, const float *b, float *c, int n) {
 }
 ```
 
-For reasons that will become apparent later, we only use matrix sizes that are multiples of $48$ for benchmarking, but the implementations remain correct for all other sizes. We also use [32-bit floats](/hpc/arithmetic/ieee-754) specifically, although all implementations can be easily [generalized](#generalizations) to other data types and operations.
+For reasons that will become apparent later, we will only use matrix sizes that are multiples of $48$ for benchmarking, but the implementations remain correct for all others. We also use [32-bit floats](/hpc/arithmetic/ieee-754) specifically, although all implementations can be easily [generalized](#generalizations) to other data types and operations.
 
-Compiled with `g++ -O3 -march=native -ffast-math -funroll-loops`, the naive approach multiplies two matrices of size $n = 1920 = 48 \times 40$ in ~16.7 seconds. Put in perspective, it is approximately $\frac{1920^3}{16.7 \times 10^9} \approx 0.42$ useful operations per nanosecond (GFLOPS), or roughly 5 CPU cycles per multiplication, which doesn't look that good yet.
+Compiled with `g++ -O3 -march=native -ffast-math -funroll-loops`, the naive approach multiplies two matrices of size $n = 1920 = 48 \times 40$ in ~16.7 seconds. To put it in perspective, this is approximately $\frac{1920^3}{16.7 \times 10^9} \approx 0.42$ useful operations per nanosecond (GFLOPS), or roughly 5 CPU cycles per multiplication, which doesn't look that good yet.
 
 ## Transposition
 
-In general, when you optimize an algorithm that processes large quantities of data — and $1920^2 \times 3 \times 4 \approx 42$ MB clearly is a large quantity as it can't fit into any of the [CPU caches](/hpc/cpu-cache) — you should always start with memory before optimizing arithmetic, as it is much more likely to be the bottleneck.
+In general, when optimizing an algorithm that processes large quantities of data — and $1920^2 \times 3 \times 4 \approx 42$ MB clearly is a large quantity as it can't fit into any of the [CPU caches](/hpc/cpu-cache) — one should always start with memory before optimizing arithmetic, as it is much more likely to be the bottleneck.
 
-The field $C_{ij}$ can be seen as the dot product of row $i$ of matrix $A$ and column $j$ of matrix $B$. As we increment `k` in the inner loop above, we are reading the matrix `a` sequentially, but we are jumping over $n$ elements as we iterate over a column of `b`, which is [not as fast](/hpc/cpu-cache/aos-soa) as sequential iteration.
+The field $C_{ij}$ can be thought of as the dot product of row $i$ of matrix $A$ and column $j$ of matrix $B$. As we increment `k` in the inner loop above, we are reading the matrix `a` sequentially, but we are jumping over $n$ elements as we iterate over a column of `b`, which is [not as fast](/hpc/cpu-cache/aos-soa) as sequential iteration.
 
 One [well-known](/hpc/external-memory/oblivious/#matrix-multiplication) optimization that tackles this problem is to store matrix $B$ in *column-major* order — or, alternatively, to *transpose* it before the matrix multiplication. This requires $O(n^2)$ additional operations but ensures sequential reads in the innermost loop:
 
@@ -76,11 +76,13 @@ void matmul(const float *a, const float *_b, float *c, int n) {
 }
 ```
 
-This code runs in ~12.4s, or about 30% faster. As we will see in a bit, there are more important benefits to transposing it than just the sequential memory reads.
+This code runs in ~12.4s, or about 30% faster.
+
+As we will see in a bit, there are more important benefits to transposing it than just the sequential memory reads.
 
 ## Vectorization
 
-Now that all we do is just sequentially read the elements of `a` and `b`, multiply them, and add the result to an accumulator variable, we can use [SIMD](/hpc/simd/) instructions to speed it all up. It is pretty straightforward to implement using [GCC vector types](/hpc/simd/intrinsics/#gcc-vector-extensions) — we can [memory-align](/hpc/cpu-cache/alignment/) matrix rows, pad them with zeros, and then just compute the multiply-sum as we would normally compute [any other reduction](/hpc/simd/reduction/):
+Now that all we do is just sequentially read the elements of `a` and `b`, multiply them, and add the result to an accumulator variable, we can use [SIMD](/hpc/simd/) instructions to speed it all up. It is pretty straightforward to implement using [GCC vector types](/hpc/simd/intrinsics/#gcc-vector-extensions) — we can [memory-align](/hpc/cpu-cache/alignment/) matrix rows, pad them with zeros, and then compute the multiply-sum as we would normally compute any other [reduction](/hpc/simd/reduction/):
 
 ```c++
 // a vector of 256 / 32 = 8 floats
@@ -132,7 +134,7 @@ The performance for $n = 1920$ is now around 2.3 GFLOPS — or another ~4 times
 
 This optimization looks neither too complex nor specific to matrix multiplication. Why can't the compiler [auto-vectorize](/hpc/simd/auto-vectorization/) the inner loop by itself?
 
-It actually can — the only thing preventing that is the possibility that `c` overlaps with either `a` or `b`. The only thing that you need to do is to guarantee that `c` is not [aliased](/hpc/compilation/contracts/#memory-aliasing) with anything by adding the `__restrict__` keyword to it:
+It actually can; the only thing preventing that is the possibility that `c` overlaps with either `a` or `b`. To rule it out, you can communicate to the compiler that you guarantee `c` is not [aliased](/hpc/compilation/contracts/#memory-aliasing) with anything by adding the `__restrict__` keyword to it:
 
 <!-- (the compiler already knows that reading `a` and `b` is safe in any order because they are marked as `const`): -->
 
@@ -154,17 +156,17 @@ The performance is bottlenecked by using a single variable. We could use multipl
 
 What is interesting is that the implementation efficiency depends on the problem size. 
 
-At first, the performance (in terms of useful operations per second) increases as the overhead of the loop management and horizontal reduction decreases. Then, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/).
+At first, the performance (defined as the number of useful operations per second) increases as the overhead of the loop management and the horizontal reduction decreases. Then, at around $n=256$, it starts smoothly decreasing as the matrices stop fitting into the [cache](/hpc/cpu-cache/) ($2 \times 256^2 \times 4 = 512$ KB is the size of the L2 cache), and the performance becomes bottlenecked by the [memory bandwidth](/hpc/cpu-cache/bandwidth/).
 
 ![](../img/mm-vectorized-plot.svg)
 
 It is also interesting that the naive implementation is mostly on par with the non-vectorized transposed version — and even slightly better because it doesn't need to perform a transposition.
 
-One might think that there would be some *general* performance gain from doing sequential reads since we are fetching fewer cache lines, but this is not the case: fetching the first column of `b` indeed takes more time, but the next 15 column reads will be in the same cache lines as the first one, so they will be cached — unless the matrix is so large that it can't even fit `n * cache_line_size` bytes into the cache, which is not the case for any practical matrix sizes.
+One might think that there would be some general performance gain from doing sequential reads since we are fetching fewer cache lines, but this is not the case: fetching the first column of `b` indeed takes more time, but the next 15 column reads will be in the same cache lines as the first one, so they will be cached anyway — unless the matrix is so large that it can't even fit `n * cache_line_size` bytes into the cache, which is not the case for any practical matrix sizes.
 
 Instead, the performance deteriorates on only a few specific matrix sizes due to the effects of [cache associativity](/hpc/cpu-cache/associativity/): when $n$ is a multiple of a large power of two, we are fetching the addresses of `b` that all likely map to the same cache line, which reduces the effective cache size. This explains the 30% performance dip for $n = 1920 = 2^7 \times 3 \times 5$, and you can see an even more noticeable one for $1536 = 2^9 \times 3$: it is roughly 3 times slower than for $n=1535$.
 
-So, counterintuitively, transposing the matrix doesn't help with caching — and in the naive implementation, we are not really bottlenecked by the memory bandwidth anyway. But our vectorized implementation certainly is, so let's work on its I/O efficiency.
+So, counterintuitively, transposing the matrix doesn't help with caching — and in the naive scalar implementation, we are not really bottlenecked by the memory bandwidth anyway. But our vectorized implementation certainly is, so let's work on its I/O efficiency.
 
 ## Register reuse
 
@@ -172,7 +174,7 @@ Using a Python-like notation to refer to submatrices, to compute the cell $C[x][
 
 <!-- Any two cells of A and B are used to update some cell of C. -->
 
-To compute $C[x:x+2][y:y+2]$, a $2 \times 2$ submatrix of $C$, we would need two rows from $A$ and two columns from $B$, namely $A[x:x+2][:]$ and $B[:][y:y+2]$, containing $4n$ elements in total, to update four elements instead of one — which is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency.
+To compute $C[x:x+2][y:y+2]$, a $2 \times 2$ submatrix of $C$, we would need two rows from $A$ and two columns from $B$, namely $A[x:x+2][:]$ and $B[:][y:y+2]$, containing $4n$ elements in total, to update *four* elements instead of *one* — which is $\frac{2n / 1}{4n / 4} = 2$ times better in terms of I/O efficiency.
 
 <!--
 
@@ -180,7 +182,7 @@ To actually avoid reading more data, we need to read these $2+2$ rows and column
 
 -->
 
-To avoid re-fetching data, we need to iterate these rows and columns in parallel and calculate all $2 \times 2$ possible combinations of products. Here is a proof of concept:
+To avoid fetching data more than once, we need to iterate over these rows and columns in parallel and calculate all $2 \times 2$ possible combinations of products. Here is a proof of concept:
 
 ```c++
 void kernel_2x2(int x, int y) {
@@ -220,7 +222,7 @@ Of course, although better in terms of I/O, this $2 \times 2$ update would not b
 
 ## Designing the kernel
 
-Instead of designing a kernel that computes an $h \times w$ submatrix of $C$ from scratch, we will declare a function that *updates* it using columns from $l$ to $r$ of $A$ and rows from $l$ to $r$ of $B$. For now, this seems like an over-generalization, but this API will be useful later.
+Instead of designing a kernel that computes an $h \times w$ submatrix of $C$ from scratch, we will declare a function that *updates* it using columns from $l$ to $r$ of $A$ and rows from $l$ to $r$ of $B$. For now, this seems like an over-generalization, but this function interface will prove useful later.
 
 <!--
 
@@ -230,14 +232,12 @@ We follow this approach and design a general kernel that updates a $h \times w$
 
 To determine $h$ and $w$, we have several performance considerations:
 
-- In general, to compute an $h \times w$ submatrix, we need to fetch $2 \cdot n \cdot (h + w)$ elements. To optimize the I/O efficiency, we would want the $\frac{h \cdot w}{h + w}$ ratio to be high, which is achieved with large and square-ish submatrices.
-- We want to use the [FMA](https://en.wikipedia.org/wiki/FMA_instruction_set) ("fused multiply-add") instruction available on all modern x86 architectures. As you can guess from the name, it performs the `c += a * b` operation — which is the core of a dot product — on 8-element vectors in one go, which saves us from executing vector multiplication and addition separately.
+- In general, to compute an $h \times w$ submatrix, we need to fetch $2 \cdot n \cdot (h + w)$ elements. To optimize the I/O efficiency, we want the $\frac{h \cdot w}{h + w}$ ratio to be high, which is achieved with large and square-ish submatrices.
+- We want to use the [FMA](https://en.wikipedia.org/wiki/FMA_instruction_set) ("fused multiply-add") instruction available on all modern x86 architectures. As you can guess from the name, it performs the `c += a * b` operation — which is the core of a dot product — on 8-element vectors in one go, which saves us from executing vector multiplication and addition separately. <!-- saxpy: Single-Precision A·X Plus Y -->
 - To achieve better utilization of this instruction, we want to make use of [instruction-level parallelism](/hpc/pipelining/). On Zen 2, the `fma` instruction has a latency of 5 and a throughput of 2, meaning that we need to concurrently execute at least $5 \times 2 = 10$ of them to saturate its execution ports.
-- We want to avoid register spill, and we only have $16$ logical vector registers that we can use as accumulators.
-
-For these reasons, we settle on a $6 \times 16$ kernel. This way, we process $96$ elements at once, which can be stored in $6 \times 2 = 12$ vector registers (we can't use an $8 \times 16$ kernel and use all 16 vector registers because we need some to hold temporary values).
+- We want to avoid register spill (moving data to and from registers more than necessary), and we only have $16$ logical vector registers that we can use as accumulators (minus those that we need to hold temporary values).
 
-To update them efficiently, we use the following procedure:
+For these reasons, we settle on a $6 \times 16$ kernel. This way, we process $96$ elements at once that are stored in $6 \times 2 = 12$ vector registers. To update them efficiently, we use the following procedure:
 
 <!--
 
@@ -270,9 +270,9 @@ void kernel(float *a, vec *b, vec *c, int x, int y, int l, int r, int n) {
 }
 ```
 
-We need `t` so that the compiler stores these elements in vector registers. We could just update the final destinations, but, unfortunately, the compiler re-writes them back to memory, causing a slowdown (wrapping everything in `__restrict__` keywords doesn't help).
+We need `t` so that the compiler stores these elements in vector registers. We could just update their final destinations in `c`, but, unfortunately, the compiler re-writes them back to memory, causing a slowdown (wrapping everything in `__restrict__` keywords doesn't help).
 
-The rest of the implementation is straightforward. Similar to the previous vectorized implementation, we just allocate aligned arrays and call the kernel instead of the innermost loop:
+The rest of the implementation is straightforward. Similar to the previous vectorized implementation, we just move the matrices to memory-aligned arrays and call the kernel instead of the innermost loop:
 
 ```c++
 void matmul(const float *_a, const float *_b, float *_c, int n) {
@@ -307,7 +307,7 @@ This improves the benchmark performance, but only by ~40%:
 
 ![](../img/mm-kernel-barplot.svg)
 
-The speedup is much higher (2-3x) on smaller arrays, indicating that there is still a bandwidth problem:
+The speedup is much higher (2-3x) on smaller arrays, indicating that there is still a memory bandwidth problem:
 
 ![](../img/mm-kernel-plot.svg)
 
@@ -315,7 +315,7 @@ Now, if you've read the section on [cache-oblivious algorithms](/hpc/external-me
 
 ## Blocking
 
-The *cache-aware* alternative to this divide-and-conquer trick is *cache blocking*: splitting the data into blocks that can fit into the cache and processing them one by one. If we have more than one layer of cache, we can do hierarchical blocking: we first select a block of data that fits into the L3 cache, then we split it into blocks that fit into the L2 cache, and so on. This requires knowing the cache sizes in advance, but it is usually easier to implement and also faster in practice.
+The *cache-aware* alternative to the divide-and-conquer trick is *cache blocking*: splitting the data into blocks that can fit into the cache and processing them one by one. If we have more than one layer of cache, we can do hierarchical blocking: we first select a block of data that fits into the L3 cache, then we split it into blocks that fit into the L2 cache, and so on. This approach requires knowing the cache sizes in advance, but it is usually easier to implement and also faster in practice.
 
 Cache blocking is less trivial to do with matrices than with arrays, but the general idea is this:
 
@@ -351,21 +351,21 @@ Cache blocking completely removes the memory bottleneck:
 
 ![](../img/mm-blocked-barplot.svg)
 
-The performance is no longer significantly affected by the problem size:
+The performance is no longer (significantly) affected by the problem size:
 
 ![](../img/mm-blocked-plot.svg)
 
-Notice that the dip at $1536$ is still there: cache associativity still affects the effective cache size. To mitigate this, we can adjust the step constants or insert holes into the layout, but we are not going to bother doing that for now.
+Notice that the dip at $1536$ is still there: cache associativity still affects the performance. To mitigate this, we can adjust the step constants or insert holes into the layout, but we will not bother doing that for now.
 
 ## Optimization
 
 To approach closer to the performance limit, we need a few more optimizations:
 
-- Remove memory allocation and operate on the arrays that are passed to the function. Note that we don't need to do anything with `a` as we are reading just one element at a time, and we can use an unaligned `store` for `c` as we only use it rarely.
-- Get rid of the `std::min` so that the size parameters are (mostly) constant and can be embedded into the machine code by the compiler (which also lets it [unroll](/hpc/architecture/loops/) the micro-kernel loop more efficiently without runtime checks).
-- Rewrite the micro-kernel by hand using 12 vector variables (the compiler seems to struggle with keeping them in registers and writes them first to temporary storage and only then to $C$).
+- Remove memory allocation and operate directly on the arrays that are passed to the function. Note that we don't need to do anything with `a` as we are reading just one element at a time, and we can use an [unaligned](/hpc/simd/moving/#aligned-loads-and-stores) `store` for `c` as we only use it rarely, so our only concern is reading `b`.
+- Get rid of the `std::min` so that the size parameters are (mostly) constant and can be embedded into the machine code by the compiler (which also lets it [unroll](/hpc/architecture/loops/) the micro-kernel loop more efficiently and avoid runtime checks).
+- Rewrite the micro-kernel by hand using 12 vector variables (the compiler seems to struggle with keeping them in registers and writes them first to a temporary memory location and only then to $C$).
 
-These optimizations are straightforward but quite tedious to implement, so we are not going to list [the code](https://github.com/sslotin/amh-code/blob/main/matmul/v5-unrolled.cc) in the article. It also requires some more work to effectively support "weird" matrix sizes, which is why we only run benchmarks for sizes that are multiple of $48 = \frac{6 \cdot 16}{\gcd(6, 16)}$.
+These optimizations are straightforward but quite tedious to implement, so we are not going to list [the code](https://github.com/sslotin/amh-code/blob/main/matmul/v5-unrolled.cc) here in the article. It also requires some more work to effectively support "weird" matrix sizes, which is why we only run benchmarks for sizes that are multiples of $48 = \frac{6 \cdot 16}{\gcd(6, 16)}$.
 
 <!--
 
@@ -375,23 +375,23 @@ But avoiding moving anything pays off.
 
 -->
 
-These individually small improvements sum up and result in another 50% improvement:
+These individually small improvements compound and result in another 50% improvement:
 
 ![](../img/mm-noalloc.svg)
 
-We are actually not that far from the theoretical performance limit — which can be calculated as the width of a SIMD lane times the `fma` instruction throughput times the clock frequency:
+We are actually not that far from the theoretical performance limit — which can be calculated as the SIMD width times the `fma` instruction throughput times the clock frequency:
 
 $$
 \underbrace{8}_{SIMD} \cdot \underbrace{2}_{thr.} \cdot \underbrace{2 \cdot 10^9}_{cycles/sec} = 32 \; GFLOPS \;\; (3.2 \cdot 10^{10})
 $$
 
-It is more useful to compare against some practical library, such as [OpenBLAS](https://www.openblas.net/). The laziest way is to simply invoke matrix multiplication from Python with [numpy](/hpc/complexity/languages/#blas). There may be some minor overhead, but it ends up reaching 80% of the theoretical limit, which seems plausible (this overhead is typical, as matrix multiplication is not the only thing that CPUs are made for):
+It is more representative to compare against some practical library, such as [OpenBLAS](https://www.openblas.net/). The laziest way to do it is to simply [invoke matrix multiplication from NumPy](/hpc/complexity/languages/#blas). There may be some minor overhead due to Python, but it ends up reaching 80% of the theoretical limit, which seems plausible (a 20% overhead is okay: matrix multiplication is not the only thing that CPUs are made for).
 
 ![](../img/mm-blas.svg)
 
 We've reached ~93% of BLAS performance and ~75% of the theoretical performance limit, which is really great for what is essentially just 40 lines of C.
 
-Interestingly, the whole thing can be rolled into just one deeply nested `for` loop with a BLAS-level of performance (assuming that we're in 2050 and using GCC 35, which finally does not screw up with register spilling):
+Interestingly, the whole thing can be rolled into just one deeply nested `for` loop with a BLAS-level of performance (assuming that we're in 2050 and using GCC version 35, which finally stopped screwing up with register spilling):
 
 ```c++
 for (int i3 = 0; i3 < n; i3 += s3)
@@ -407,13 +407,13 @@ for (int i3 = 0; i3 < n; i3 += s3)
                                    * b[n / 8 * k + y / 8 + j];
 ```
 
-There is also a way to do fewer arithmetic operations — [the Strassen algorithm](/hpc/external-memory/oblivious/#strassen-algorithm) — but it has a large constant factor, and it is only efficient for [very large matrices](https://arxiv.org/pdf/1605.01078.pdf) ($n > 4000$), where we typically have to use either multiprocessing or some approximate dimensionality-reducing methods anyway.
+There is also an approach that performs asymptotically fewer arithmetic operations — [the Strassen algorithm](/hpc/external-memory/oblivious/#strassen-algorithm) — but it has a large constant factor, and it is only efficient for [very large matrices](https://arxiv.org/pdf/1605.01078.pdf) ($n > 4000$), where we typically have to use either multiprocessing or some approximate dimensionality-reducing methods anyway.
 
 <!-- for which we typically use multi-threading anyway -->
 
 ## Generalizations
 
-FMA also supports 64-bit floating-point numbers, but it does not support integers: you need to perform addition and multiplication separately, which projects to decreased performance. If you can guarantee that all intermediate results can be represented exactly as a 32- or 64-bit floating-point number (which is [often the case](/hpc/arithmetic/errors/)), it may be better to convert them to and from floats.
+FMA also supports 64-bit floating-point numbers, but it does not support integers: you need to perform addition and multiplication separately, which results in decreased performance. If you can guarantee that all intermediate results can be represented exactly as 32- or 64-bit floating-point numbers (which is [often the case](/hpc/arithmetic/errors/)), it may be faster to just convert them to and from floats.
 
 You can also apply the same trick to other similar computations. One example is the "min-plus matrix multiplication," which is defined as:
 
@@ -453,7 +453,7 @@ for (int k = 0; k < n; k++)
 
 Interestingly, vectorizing the distance product and executing it $O(\log n)$ times in $O(n^3 \log n)$ total operations is faster than naively executing the Floyd-Warshall algorithm in $O(n^3)$ operations, although not by a lot.
 
-As an exercise, try to speed up this "for-for-for" computation. It is harder to do than in the matrix multiplication case because you need to perform updates in a particular order, but it is still possible to design a similar kernel and an iteration order that achieves a 30-50x total speedup.
+As an exercise, try to speed up this "for-for-for" computation. It is harder to do than in the matrix multiplication case because now there is a logical dependency between the iterations, and you need to perform updates in a particular order, but it is still possible to design a similar kernel and a block iteration order that achieves a 30-50x total speedup.
 
 ## Acknowledgements
 
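Note (not part of the patch): the kernel-size reasoning in the "Designing the kernel" hunk can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, assuming the figures quoted in the text (8 floats per 256-bit vector, 16 logical vector registers, `fma` latency 5 and throughput 2). The helper names `accumulators` and `io_ratio` are hypothetical and do not appear in `matmul.md`.

```c++
#include <cassert>

// Figures quoted in the article text (Zen 2, AVX2); treat them as assumptions.
constexpr int kVectorLanes = 8;         // 256 / 32 floats per vector
constexpr int kVectorRegisters = 16;    // logical vector registers
constexpr int kMinInFlightFma = 5 * 2;  // latency * throughput

// Accumulator registers needed to hold an h x w tile of C,
// where w is assumed to be a multiple of the vector width.
constexpr int accumulators(int h, int w) { return h * (w / kVectorLanes); }

// The (h * w) / (h + w) ratio that the text wants to be high.
constexpr double io_ratio(int h, int w) { return double(h) * w / (h + w); }

// A tile is usable if it keeps enough independent FMAs in flight
// while leaving at least one register free for temporaries.
constexpr bool fits(int h, int w) {
    return accumulators(h, w) >= kMinInFlightFma &&
           accumulators(h, w) < kVectorRegisters;
}
```

Under these assumptions, `fits(6, 16)` holds (12 accumulators), while `fits(8, 16)` does not, because an 8 x 16 tile would occupy all 16 registers and leave none for temporaries.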

From b149f0900ce5b63a3c94088152879ee5530f81dc Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 7 Apr 2022 01:42:11 +0300
Subject: [PATCH 029/173] typo

---
 content/english/hpc/algorithms/matmul.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index e0ebdaac..a5a7b4f2 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -391,7 +391,7 @@ It is more representative to compare against some practical library, such as [Op
 
 We've reached ~93% of BLAS performance and ~75% of the theoretical performance limit, which is really great for what is essentially just 40 lines of C.
 
-Interestingly, the whole thing can be rolled into just one deeply nested `for` loop with a BLAS-level of performance (assuming that we're in 2050 and using GCC version 35, which finally stopped screwing up with register spilling):
+Interestingly, the whole thing can be rolled into just one deeply nested `for` loop with a BLAS level of performance (assuming that we're in 2050 and using GCC version 35, which finally stopped screwing up with register spilling):
 
 ```c++
 for (int i3 = 0; i3 < n; i3 += s3)
@@ -409,8 +409,6 @@ for (int i3 = 0; i3 < n; i3 += s3)
 
 There is also an approach that performs asymptotically fewer arithmetic operations — [the Strassen algorithm](/hpc/external-memory/oblivious/#strassen-algorithm) — but it has a large constant factor, and it is only efficient for [very large matrices](https://arxiv.org/pdf/1605.01078.pdf) ($n > 4000$), where we typically have to use either multiprocessing or some approximate dimensionality-reducing methods anyway.
 
-<!-- for which we typically use multi-threading anyway -->
-
 ## Generalizations
 
 FMA also supports 64-bit floating-point numbers, but it does not support integers: you need to perform addition and multiplication separately, which results in decreased performance. If you can guarantee that all intermediate results can be represented exactly as 32- or 64-bit floating-point numbers (which is [often the case](/hpc/arithmetic/errors/)), it may be faster to just convert them to and from floats.

From 1d039027db5e184c4d0b4b4824ddfbd119ae1f62 Mon Sep 17 00:00:00 2001
From: Daniel Paleka <danepale@gmail.com>
Date: Thu, 7 Apr 2022 13:30:12 +0200
Subject: [PATCH 030/173] Typo in argmin.md

---
 content/english/hpc/algorithms/argmin.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/algorithms/argmin.md b/content/english/hpc/algorithms/argmin.md
index ccd9f140..0a9531c1 100644
--- a/content/english/hpc/algorithms/argmin.md
+++ b/content/english/hpc/algorithms/argmin.md
@@ -3,7 +3,7 @@ title: Argmin with SIMD
 weight: 7
 ---
 
-Computing the *minimum* of an array [easily vectorizable](/hpc/simd/reduction), as it is not different from any other reduction: in AVX2, you just need to use a convenient `_mm256_min_epi32` intrinsic as the inner operation. It computes the minimum of two 8-element vectors in one cycle — even faster than in the scalar case, which requires at least a comparison and a conditional move.
+Computing the *minimum* of an array is [easily vectorizable](/hpc/simd/reduction), as it is not different from any other reduction: in AVX2, you just need to use a convenient `_mm256_min_epi32` intrinsic as the inner operation. It computes the minimum of two 8-element vectors in one cycle — even faster than in the scalar case, which requires at least a comparison and a conditional move.
 
 Finding the *index* of that minimum element (*argmin*) is much harder, but it is still possible to vectorize very efficiently. In this section, we design an algorithm that computes the argmin (almost) at the speed of computing the minimum and ~15x faster than the naive scalar approach.
 

From 965c76bb87126d51013dbbe8e181fa439c638138 Mon Sep 17 00:00:00 2001
From: Alex Saveau <saveau.alexandre@gmail.com>
Date: Sat, 9 Apr 2022 14:33:35 -0700
Subject: [PATCH 031/173] Fix extra word typo

---
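Note (not part of the patch): the sentence being fixed compares two ways of computing $x^2 - y^2$. The sketch below puts both forms side by side. The function names are hypothetical, and the sample inputs are chosen only for illustration.

```c++
#include <cassert>
#include <cmath>

// Direct form: two multiplications and one subtraction.
float diff_squares_naive(float x, float y) { return x * x - y * y; }

// Factored form: two additions and one multiplication.
float diff_squares_factored(float x, float y) { return (x + y) * (x - y); }
```

For `x = 4097` and `y = 4096`, the exact answer is 8193. In the factored form every intermediate value (8193 and 1) is exactly representable as a `float`, so the result is exact. In the direct form, `x * x = 16785409` needs 25 significant bits, so under plain single-precision evaluation (no fused multiply-add contraction) it is rounded before the subtraction, and the result can be off by one.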
 content/english/hpc/arithmetic/errors.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/arithmetic/errors.md b/content/english/hpc/arithmetic/errors.md
index f2e0fbf6..df62e91d 100644
--- a/content/english/hpc/arithmetic/errors.md
+++ b/content/english/hpc/arithmetic/errors.md
@@ -125,7 +125,7 @@ $$
 f(x, y) = x^2 - y^2 = (x + y) \cdot (x - y)
 $$
 
-In this one, it is easy to show that the error is be bound by $\epsilon \cdot |x - y|$. It is also faster because it needs 2 additions and 1 multiplication: one fast addition more and one slow multiplication less compared to the original.
+In this one, it is easy to show that the error is bound by $\epsilon \cdot |x - y|$. It is also faster because it needs 2 additions and 1 multiplication: one fast addition more and one slow multiplication less compared to the original.
 
 ### Kahan Summation
 

From a211cf62040495eddefa3c88f46b2206b513fd86 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Sun, 10 Apr 2022 19:52:42 +0300
Subject: [PATCH 032/173] bugfix

---
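Note (not part of the patch): the fix changes the return type of `sum` from `void` to `int`, since the function returns a value. A minimal sketch of how `sum` and `upd` work together, assuming a stripped-down `Node` that has only the fields needed here (the real treap node in the article also carries a key and a priority):

```c++
#include <cassert>

// Simplified node for illustration only; the fields are assumptions.
struct Node {
    int val;                          // value stored in this node
    int sum;                          // sum of values in this subtree
    Node *l = nullptr, *r = nullptr;  // children
    Node(int v) : val(v), sum(v) {}
};

// Null-safe accessor: an empty subtree contributes 0.
int sum(Node* v) { return v ? v->sum : 0; }

// Recompute the subtree sum after the children of v have changed.
void upd(Node* v) { v->sum = sum(v->l) + sum(v->r) + v->val; }
```

Calling `upd` on a node after attaching children with values 1 and 2 to a node with value 3 would set its `sum` to 6.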
 content/russian/cs/tree-structures/treap.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/russian/cs/tree-structures/treap.md b/content/russian/cs/tree-structures/treap.md
index dd3417dd..724ed15f 100644
--- a/content/russian/cs/tree-structures/treap.md
+++ b/content/russian/cs/tree-structures/treap.md
@@ -199,7 +199,7 @@ struct Node {
 Вместо того, чтобы модифицировать и `merge`, и `split` под наши хотелки, напишем вспомогательную функцию `upd`, которую будем вызывать при обновлении детей вершины:
 
 ```c++
-void sum(Node* v) { return v ? v->sum : 0; }
+int sum(Node* v) { return v ? v->sum : 0; }
 // обращаться по пустому указателю нельзя -- выдаст ошибку
 
 void upd(Node* v) { v->sum = sum(v->l) + sum(v->r) + v->val; }

From cbd4948a082bc4959dfc565a2cc99041753d03b9 Mon Sep 17 00:00:00 2001
From: Alex Saveau <saveau.alexandre@gmail.com>
Date: Sun, 10 Apr 2022 13:32:20 -0700
Subject: [PATCH 033/173] Fix possible typo?

I'm pretty sure this should say not.
---
 content/english/hpc/external-memory/policies.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/external-memory/policies.md b/content/english/hpc/external-memory/policies.md
index 1ff0e724..4cb36bdd 100644
--- a/content/english/hpc/external-memory/policies.md
+++ b/content/english/hpc/external-memory/policies.md
@@ -33,7 +33,7 @@ $$
 
 The main idea of the proof is to consider the worst case scenario. For LRU it would be the repeating series of $\frac{M}{B}$ distinct blocks: each block is new and so LRU has 100% cache misses. Meanwhile, $OPT_{M/2}$ would be able to cache half of them (but not more, because it only has half the memory). Thus $LRU_M$ needs to fetch double the number of blocks that $OPT_{M/2}$ does, which is basically what is expressed in the inequality, and anything better for $LRU$ would only weaken it.
 
-![Dimmed are the blocks cached by OPT (but note cached by LRU)](../img/opt.png)
+![Dimmed are the blocks cached by OPT (but not cached by LRU)](../img/opt.png)
 
 This is a very relieving result. It means that, at least in terms of asymptotic I/O complexity, you can just assume that the eviction policy is either LRU or OPT — whichever is easier for you — do complexity analysis with it, and the result you get will normally transfer to any other reasonable cache replacement policy.
 

From 6e13a8d7a027ad4dc486e7b82335e766d8137c59 Mon Sep 17 00:00:00 2001
From: Alex Saveau <saveau.alexandre@gmail.com>
Date: Mon, 11 Apr 2022 00:26:23 -0700
Subject: [PATCH 034/173] Fix code typo

---
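Note (not part of the patch): a slightly fuller version of the corrected snippet, as a sketch. It assumes a 2 MiB huge page size and a Linux system where `MADV_HUGEPAGE` is defined. The helper name `alloc_hugepage_hint` is hypothetical. The `madvise` call is only a hint, so its return value is deliberately ignored here.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <sys/mman.h>

// Assumption: 2 MiB huge pages, the common size on x86-64 Linux.
constexpr std::size_t kHugePageSize = std::size_t(1) << 21;

// Allocates at least `bytes` bytes aligned to the huge page size
// and asks the kernel to back the region with huge pages.
void* alloc_hugepage_hint(std::size_t bytes) {
    // aligned_alloc requires the size to be a multiple of the alignment
    std::size_t rounded =
        (bytes + kHugePageSize - 1) / kHugePageSize * kHugePageSize;
    void* ptr = std::aligned_alloc(kHugePageSize, rounded);
#ifdef MADV_HUGEPAGE
    if (ptr)
        madvise(ptr, rounded, MADV_HUGEPAGE);  // best-effort hint
#endif
    return ptr;
}
```

The returned pointer is released with `std::free` as usual.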
 content/english/hpc/cpu-cache/paging.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/cpu-cache/paging.md b/content/english/hpc/cpu-cache/paging.md
index fad39a54..684fcd65 100644
--- a/content/english/hpc/cpu-cache/paging.md
+++ b/content/english/hpc/cpu-cache/paging.md
@@ -53,7 +53,7 @@ always [madvise] never
 #include <sys/mman.h>
 
 void *ptr = std::aligned_alloc(page_size, array_size);
-madvise(pre, array_size, MADV_HUGEPAGE);
+madvise(ptr, array_size, MADV_HUGEPAGE);
 ```
 
 You can only request a memory region to be allocated using huge pages if it has the corresponding alignment.

From fc5fb2c45ee664d270bc65ca78e40a3a0aaaffbf Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 11 Apr 2022 17:58:33 +0300
Subject: [PATCH 035/173] fix approximate logarithm formula

---
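Note (not part of the patch): the corrected relation $I_x \approx L \cdot \log_2 (x) + L \cdot (B - \sigma)$ can be checked numerically by solving it for $\log_2 x$. This is a minimal sketch for positive normalized floats. The value of `sigma` is an assumption picked for illustration, and the helper names are hypothetical.

```c++
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

constexpr double L = 8388608.0;   // 2^23, where 23 is the mantissa bit count
constexpr double B = 127.0;       // exponent bias
constexpr double sigma = 0.0430;  // assumed correction constant

// Reinterpret the bit pattern of a float as an unsigned integer.
std::uint32_t float_bits(float x) {
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);
    return i;
}

// Solve I_x ~ L * log2(x) + L * (B - sigma) for log2(x).
double approx_log2(float x) { return float_bits(x) / L - B + sigma; }
```

Since $\log_2(1 + m) - m$ stays between 0 and roughly 0.086 for $m \in [0, 1)$, the absolute error of `approx_log2` with this `sigma` stays below 0.05.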
 content/english/hpc/arithmetic/rsqrt.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/content/english/hpc/arithmetic/rsqrt.md b/content/english/hpc/arithmetic/rsqrt.md
index 06659136..63b2799a 100644
--- a/content/english/hpc/arithmetic/rsqrt.md
+++ b/content/english/hpc/arithmetic/rsqrt.md
@@ -77,13 +77,13 @@ $$
 \log_2 x = e_x + \log_2 (1 + m_x) \approx e_x + m_x + \sigma
 $$
 
-Now, having this approximation in mind and defining $L=23$ as the number of mantissa bits in a `float` and $B=127$ for the exponent bias, when we reinterpret the bit-pattern of $x$ as an integer $I_x$, we get
+Now, having this approximation in mind and defining $L=2^{23}$ (the number of mantissa bits in a `float`) and $B=127$ (the exponent bias), when we reinterpret the bit-pattern of $x$ as an integer $I_x$, we get
 
 $$
 \begin{aligned}
-I_x &= L(e_x + B + m_x)
-\\  &= L(e_x + m_x + \sigma +B-\sigma )
-\\  &\approx L\log_2 (x) + L (B-\sigma )
+I_x &= L \cdot (e_x + B + m_x)
+\\  &= L \cdot (e_x + m_x + \sigma +B-\sigma )
+\\  &\approx L \cdot \log_2 (x) + L \cdot (B-\sigma )
 \end{aligned}
 $$
 

From bb31ad26a9cb50c350a24104c2d734704ea72e2f Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 11 Apr 2022 18:30:22 +0300
Subject: [PATCH 036/173] exponent bias

---
 content/english/hpc/arithmetic/ieee-754.md | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)
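
Reviewer note (commentary only, not part of the commit): the worked example added in this patch can be checked mechanically. The sketch below assumes the single-precision layout (1 sign bit, 8 exponent bits, 23 mantissa bits) and handles normal numbers only. `decode_float32` and `rebuild` are hypothetical helper names used for illustration; they do not come from the article.

```python
import struct

def decode_float32(x):
    """Split a value, rounded to single precision, into its three bit fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    stored_exponent = (bits >> 23) & 0xFF   # biased exponent, as stored
    mantissa = bits & 0x7FFFFF              # 23 explicit mantissa bits
    return sign, stored_exponent, mantissa

def rebuild(sign, stored_exponent, mantissa):
    """Recompute the value from the fields (normal numbers only)."""
    actual_exponent = stored_exponent - 127  # remove the bias
    return (-1) ** sign * 2.0 ** actual_exponent * (1 + mantissa / 2 ** 23)

sign, stored_exponent, mantissa = decode_float32(0.15625)
print(sign, stored_exponent, mantissa / 2 ** 23)
print(rebuild(sign, stored_exponent, mantissa))
```

For 0.15625 the stored exponent is 124, so the actual exponent is 124 - 127 = -3, which matches the formula in the added text.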

diff --git a/content/english/hpc/arithmetic/ieee-754.md b/content/english/hpc/arithmetic/ieee-754.md
index ae624add..6b1e2a24 100644
--- a/content/english/hpc/arithmetic/ieee-754.md
+++ b/content/english/hpc/arithmetic/ieee-754.md
@@ -15,7 +15,7 @@ When we designed our [DIY floating-point type](../float), we omitted quite a lot
 - What happens if we increment the largest representable number?
 - Can we somehow detect if one of the above three happened?
 
-Most of the early computers didn't have floating-point arithmetic, and when vendors started adding floating-point coprocessors, they had slightly different visions for what answers to those questions should be. Diverse implementations made it difficult to use floating-point arithmetic reliably and portably — particularly for people developing compilers.
+Most of the early computers didn't support floating-point arithmetic, and when vendors started adding floating-point coprocessors, they had slightly different visions for what the answers to these questions should be. Diverse implementations made it difficult to use floating-point arithmetic reliably and portably — especially for the people who develop compilers.
 
 In 1985, the Institute of Electrical and Electronics Engineers published a standard (called [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754)) that provided a formal specification of how floating-point numbers should work, which was quickly adopted by the vendors and is now used in virtually all general-purpose computers.
 
@@ -27,6 +27,15 @@ Similar to our handmade float implementation, hardware floats use one bit for si
 
 One of the reasons why they are stored in this exact order is that it is easier to compare and sort them: you can use mostly the same comparator circuit as for [unsigned integers](../integer), except for maybe flipping some bits in case one of the numbers is negative.
 
+For the same reason, the exponent is *biased:* the actual value is 127 less than the stored unsigned integer, which lets us also cover the values less than one (with negative exponents). In the example above:
+
+$$
+(-1)^0 \times 2^{01111100_2 - 127} \times (1 + 2^{-2})
+= 2^{124 - 127} \times 1.25
+= \frac{1.25}{8}
+= 0.15625
+$$
+
 IEEE 754 and a few consequent standards define not one but *several* representations that differ in sizes, most notably:
 
 |      Type | Sign | Exponent | Mantissa | Total bits | Approx. decimal digits |
@@ -46,11 +55,11 @@ Their availability ranges from chip to chip:
 - Half-precision arithmetic only supports a small subset of operations and is generally used for machine learning applications, especially neural networks, because they tend to do a large amount of calculation, but don't require a high level of precision.
 - Half-precision is being gradually replaced by bfloat, which trades off 3 mantissa bits to have the same range as single-precision, enabling interoperability with it. It is mostly being adopted by specialized hardware: TPUs, FGPAs, and GPUs. The name stands for "[Brain](https://en.wikipedia.org/wiki/Google_Brain) float."
 
-Lower precision types need less memory bandwidth to move them around and usually take fewer cycles to operate on (e. g. the division instruction may take $x$, $y$, or $z$ cycles depending on the type), which is why they are preferred when error tolerance allows it.
+Lower-precision types need less memory bandwidth to move them around and usually take fewer cycles to operate on (e. g. the division instruction may take $x$, $y$, or $z$ cycles depending on the type), which is why they are preferred when error tolerance allows it.
 
 Deep learning, emerging as a very popular and computationally-intensive field, created a huge demand for low-precision matrix multiplication, which led to manufacturers developing separate hardware or at least adding specialized instructions that support these types of computations — most notably, Google developing a custom chip called TPU (*tensor processing unit*) that specializes on multiplying 128-by-128 bfloat matrices, and NVIDIA adding "tensor cores," capable of performing 4-by-4 matrix multiplication in one go, to all their newer GPUs.
 
-Apart from their sizes, most of the behavior is exactly the same between all floating-point types, which we will now clarify.
+Apart from their sizes, most of the behavior is the same between all floating-point types, which we will now clarify.
 
 ## Handling Corner Cases
 

From 436ffa7b608309d8a2246f403d2c95557bbb7d76 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 11 Apr 2022 19:10:37 +0300
Subject: [PATCH 037/173] comments about bit tricks in fast rsqrt

---
 content/english/hpc/arithmetic/rsqrt.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
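
Reviewer note (commentary only, not part of the commit): the approximation $I_x \approx L \cdot \log_2 (x) + L \cdot (B - \sigma)$ can be rearranged into $\log_2 x \approx I_x / L - B + \sigma$ and tried out directly. The sketch below is a rough check for positive normal floats only. `SIGMA = 0.043` is an assumed value picked for illustration (the article only says that $\sigma$ is tuned), and the helper names are hypothetical.

```python
import math
import struct

L = 2 ** 23     # scale factor: 23 is the number of mantissa bits in a float
B = 127         # exponent bias
SIGMA = 0.043   # assumed tuning constant, for illustration only

def float_bits(x):
    """Reinterpret the single-precision bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def approx_log2(x):
    """Approximate log2(x) for a positive normal x using only its bit pattern."""
    return float_bits(x) / L - B + SIGMA

for x in (0.1, 1.0, 1.5, 3.0, 10.0, 1000.0):
    print(x, approx_log2(x), math.log2(x))
```

With this choice of `SIGMA` the absolute error should stay below about 0.05 across the whole range, since $\log_2(1 + m) - m$ lies between 0 and roughly 0.086 for $m \in [0, 1)$.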

diff --git a/content/english/hpc/arithmetic/rsqrt.md b/content/english/hpc/arithmetic/rsqrt.md
index 63b2799a..9817e5a9 100644
--- a/content/english/hpc/arithmetic/rsqrt.md
+++ b/content/english/hpc/arithmetic/rsqrt.md
@@ -77,7 +77,7 @@ $$
 \log_2 x = e_x + \log_2 (1 + m_x) \approx e_x + m_x + \sigma
 $$
 
-Now, having this approximation in mind and defining $L=2^{23}$ (the number of mantissa bits in a `float`) and $B=127$ (the exponent bias), when we reinterpret the bit-pattern of $x$ as an integer $I_x$, we get
+Now, having this approximation in mind and defining $L=2^{23}$ (where 23 is the number of mantissa bits in a `float`) and $B=127$ (the exponent bias), when we reinterpret the bit-pattern of $x$ as an integer $I_x$, we essentially get
 
 $$
 \begin{aligned}
@@ -87,9 +87,11 @@ I_x &= L \cdot (e_x + B + m_x)
 \end{aligned}
 $$
 
+(Multiplying a number by $L=2^{23}$ is equivalent to left-shifting it by 23.)
+
 When you tune $\sigma$ to minimize the mean square error, this results in a surprisingly accurate approximation.
 
-![](../img/approx.svg)
+![Reinterpreting a floating-point number $x$ as an integer (blue) compared to its scaled and shifted logarithm (gray)](../img/approx.svg)
 
 Now, expressing the logarithm from the approximation, we get
 

From 95899a63c97b582a7b93ceb66369d27cd854c3e0 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 11 Apr 2022 19:12:10 +0300
Subject: [PATCH 038/173] more precise wording

---
 content/english/hpc/arithmetic/rsqrt.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/arithmetic/rsqrt.md b/content/english/hpc/arithmetic/rsqrt.md
index 9817e5a9..0fa4d209 100644
--- a/content/english/hpc/arithmetic/rsqrt.md
+++ b/content/english/hpc/arithmetic/rsqrt.md
@@ -87,7 +87,7 @@ I_x &= L \cdot (e_x + B + m_x)
 \end{aligned}
 $$
 
-(Multiplying a number by $L=2^{23}$ is equivalent to left-shifting it by 23.)
+(Multiplying an integer by $L=2^{23}$ is equivalent to left-shifting it by 23.)
 
 When you tune $\sigma$ to minimize the mean square error, this results in a surprisingly accurate approximation.
 

From aec8d782b10e76df76c6483d34fb05a4f988462a Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 11 Apr 2022 20:40:12 +0300
Subject: [PATCH 039/173] fix variable names in dp example

---
 .../english/hpc/external-memory/locality.md   | 49 +++++++++++--------
 1 file changed, 28 insertions(+), 21 deletions(-)
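
Reviewer note (commentary only, not part of the commit): since this patch renames the variables in all three knapsack variants, here is a small cross-check that they still compute the same answer. It is a Python transcription under my own assumptions: the function names are hypothetical, costs are positive integers, and a Python integer stands in for `std::bitset`.

```python
from functools import lru_cache

def knapsack_recursive(c, W):
    """Memoized recursion: f(n, w) is the best total cost <= w using the first n items."""
    @lru_cache(maxsize=None)
    def f(n, w):
        if n == 0:
            return 0
        if c[n - 1] > w:                      # the last item does not fit
            return f(n - 1, w)
        return max(f(n - 1, w), c[n - 1] + f(n - 1, w - c[n - 1]))
    return f(len(c), W)

def knapsack_table(c, W):
    """Bottom-up version: layer n is computed from layer n - 1 only."""
    N = len(c)
    f = [[0] * (W + 1) for _ in range(N + 1)]
    for n in range(1, N + 1):
        for w in range(W + 1):
            if c[n - 1] > w:
                f[n][w] = f[n - 1][w]
            else:
                f[n][w] = max(f[n - 1][w], c[n - 1] + f[n - 1][w - c[n - 1]])
    return f[N][W]

def knapsack_bitset(c, W):
    """Reachability version: bit w of b is set if some subset sums to exactly w."""
    mask = (1 << (W + 1)) - 1                 # keep only sums 0..W
    b = 1                                     # the empty subset reaches sum 0
    for cost in c:
        b |= (b << cost) & mask
    return b.bit_length() - 1                 # the largest reachable sum <= W

costs = [3, 5, 7, 11]
print(knapsack_recursive(costs, 20), knapsack_table(costs, 20), knapsack_bitset(costs, 20))
```

The reachability variant answers the same question because the maximum total cost not exceeding `W` is simply the largest reachable sum in the range 0..W.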

diff --git a/content/english/hpc/external-memory/locality.md b/content/english/hpc/external-memory/locality.md
index a26ff70f..569d9437 100644
--- a/content/english/hpc/external-memory/locality.md
+++ b/content/english/hpc/external-memory/locality.md
@@ -47,44 +47,51 @@ In practice, there is still some overhead associated with the recursion, and for
 
 ### Dynamic Programming
 
-Similar reasoning can be applied to the implementations of dynamic programming algorithms but leading to the reverse result. Consider the classic knapsack problem, where we got $n$ items with integer costs $c_i$, and we need to pick a subset of items with the maximum total cost that does not exceed a given constant $w$.
+Similar reasoning can be applied to the implementations of dynamic programming algorithms but leading to the reverse result. Consider the classic *knapsack problem:* given $N$ items with positive integer costs $c_i$, pick a subset of items with the maximum total cost that does not exceed a given constant $W$.
 
-The way to solve it is to introduce the *state* $f[i, k]$, which corresponds to the maximum total cost not exceeding $k$ that can be achieved having already considered and excluded the first $i$ items. The state can be updated in $O(1)$ time per entry if consider either taking or not taking the $i$-th item and using further states of the dynamic to compute the optimal decision for each state.
+The way to solve it is to introduce the *state* $f[n, w]$, which corresponds to the maximum total cost not exceeding $w$ that can be achieved using only the first $n$ items. These values can be computed in $O(1)$ time per entry if we consider either taking or not taking the $n$-th item and use the previous states of the dynamic to make the optimal decision.
 
-Python has a handy `lru_cache` decorator, which can be used for implementing it with memoized recursion:
+Python has a handy `lru_cache` decorator which can be used for implementing it with memoized recursion:
 
 ```python
 @lru_cache
-def f(i, k):
-    if i == n or k == 0:
+def f(n, w):
+    # check if we have no items to choose
+    if n == 0:
         return 0
-    if w[i] > k:
-        return f(i + 1, k)
-    return max(f(i + 1, k), c[i] + f(i + 1, k - w[i]))
+    
+    # check if we can't pick the last item (note zero-based indexing)
+    if c[n - 1] > w:
+        return f(n - 1, w)
+    
+    # otherwise, we can either pick the last item or not
+    return max(f(n - 1, w), c[n - 1] + f(n - 1, w - c[n - 1]))
 ```
 
-When computing $f[n, w]$, the recursion may visit up to $O(n \cdot w)$ different states, which is asymptotically efficient, but rather slow in reality. Even after nullifying the overhead of Python recursion and all the hash table queries required for the LRU cache to work, it would still be slow because it does random I/O throughout most of the execution.
+When computing $f[N, W]$, the recursion may visit up to $O(N \cdot W)$ different states, which is asymptotically efficient, but rather slow in reality. Even after nullifying the overhead of Python recursion and all the [hash table queries](../policies/#implementing-caching) required for the LRU cache to work, it would still be slow because it does random I/O throughout most of the execution.
 
 What we can do instead is to create a two-dimensional array for the dynamic and replace the recursion with a nice nested loop like this:
 
 ```cpp
-int f[N + 1][W + 1];
+int f[N + 1][W + 1] = {0}; // this zero-fills the array
 
-for (int i = n - 1; i >= 0; i++)
-    for (int k = 0; k <= W; k++)
-        f[i][k] = w[i] > k ? f[i + 1][k] : max(f[i + 1][k], c[i] + f[i + 1][k - w[i]]);
+for (int n = 1; n <= N; n++)
+    for (int w = 0; w <= W; w++)
+        f[n][w] = c[n - 1] > w ?
+                  f[n - 1][w] :
+                  max(f[n - 1][w], c[n - 1] + f[n - 1][w - c[n - 1]]);
 ```
 
-Notice that we are only using the previous layer of the dynamic to calculate the next one. This means that if we can store one layer in the cache, we would only need to write $O(\frac{n \cdot w}{B})$ blocks in external memory.
+Notice that we are only using the previous layer of the dynamic to calculate the next one. This means that if we can store one layer in the cache, we would only need to write $O(\frac{N \cdot W}{B})$ blocks in external memory.
 
-Moreover, if we only need the answer, we don't actually have to store the whole 2d array but only the last layer. This lets us use just $O(w)$ memory by maintaining a single array of $w$ values. To simplify the code, we can slightly change the dynamic to store a binary value: whether it is possible to get the sum of exactly $k$ using the items that we have already considered. This dynamic is even faster to compute:
+Moreover, if we only need the answer, we don't actually have to store the whole 2d array but only the last layer. This lets us use just $O(W)$ memory by maintaining a single array of $W$ values. To simplify the code, we can slightly change the dynamic to store a binary value: whether it is possible to get the sum of exactly $w$ using the items that we have already considered. This dynamic is even faster to compute:
 
 ```cpp
-bool f[W + 1] = {}; // this zero-fills the array
+bool f[W + 1] = {0};
 f[0] = 1;
-for (int i = 0; i < n; i++)
-    for (int x = W - a[i]; x >= 0; x--)
-        f[x + a[i]] |= f[x];
+for (int n = 0; n < N; n++)
+    for (int x = W - c[n]; x >= 0; x--)
+        f[x + c[n]] |= f[x];
 ```
 
 As a side note, now that it only uses simple bitwise operations, it can be optimized further by using a bitset:
@@ -92,8 +99,8 @@ As a side note, now that it only uses simple bitwise operations, it can be optim
 ```cpp
 std::bitset<W + 1> b;
 b[0] = 1;
-for (int i = 0; i < n; i++)
-    b |= b << c[i];
+for (int n = 0; n < N; n++)
+    b |= b << c[n];
 ```
 
 Surprisingly, there is still some room for improvement, and we will come back to this problem later.

From 9872a11b931c184f51ce01e076d6b9adb1bbe690 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 11 Apr 2022 20:46:50 +0300
Subject: [PATCH 040/173] change wording

---
 content/english/hpc/cpu-cache/bandwidth.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/cpu-cache/bandwidth.md b/content/english/hpc/cpu-cache/bandwidth.md
index 88b547ad..a28570f5 100644
--- a/content/english/hpc/cpu-cache/bandwidth.md
+++ b/content/english/hpc/cpu-cache/bandwidth.md
@@ -38,7 +38,7 @@ All CPU cache layers are placed on the same microchip as the processor, so the b
 
 ![](../img/boost.svg)
 
-This detail comes into play when comparing algorithm implementations. Unless the dataset fits entirely in the cache, the relative performance of the two implementations may be different depending on the CPU clock rate because the RAM remains unaffected by it, while everything else does.
+This detail comes into play when comparing algorithm implementations. When the working dataset fits in the cache, the relative performance of the two implementations may be different depending on the CPU clock rate because the RAM remains unaffected by it (while everything else does not).
 
 For this reason, it is [advised](/hpc/profiling/noise) to keep the clock rate fixed, and as the turbo boost isn't stable enough, we run most of the benchmarks in this book at plain 2GHz.
 

From 69390b1012b84459f33b279f2e4646a0ed41f357 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 12 Apr 2022 17:45:08 +0300
Subject: [PATCH 041/173] typo

---
 content/english/hpc/external-memory/list-ranking.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/content/english/hpc/external-memory/list-ranking.md b/content/english/hpc/external-memory/list-ranking.md
index cf5d9929..6d7c0053 100644
--- a/content/english/hpc/external-memory/list-ranking.md
+++ b/content/english/hpc/external-memory/list-ranking.md
@@ -50,11 +50,11 @@ List ranking is especially useful in graph algorithms.
 
 For example, we can obtain the Euler tour of a tree in external memory by constructing a linked list from the tree that corresponds to its Euler tour and then applying the list ranking algorithm — the ranks of each node will be the same as its index $tin_v$ in the Euler tour. To construct this list, we need to:
 
-- split each undirected tree edge into two directed ones;
-- duplicate the parent node for each up-edge (because list nodes can only have one incoming edge, but we visit some tree vertices multiple times);
+- split each undirected edge into two directed ones;
+- duplicate the parent node for each up-edge (because list nodes can only have one incoming edge, but we visit some vertices multiple times);
 - route each such node either to the "next sibling," if it has one, or otherwise to its own parent;
 - and then finally break the resulting cycle at the root.
 
 This general technique is called *tree contraction*, and it serves as the basis for a large number of tree algorithms.
 
-Exactly the same approach can be applied to parallel algorithms, and we will convert that much more deeply in part 2.
+The same approach can be applied to parallel algorithms, and we will cover that much more deeply in part II.

From 0b9d2bb532003b65c1ae4bc9f5477bd2f4a5ddf4 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 12 Apr 2022 17:48:21 +0300
Subject: [PATCH 042/173] link to strassen algorithm implementation paper

---
 content/english/hpc/external-memory/oblivious.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/english/hpc/external-memory/oblivious.md b/content/english/hpc/external-memory/oblivious.md
index 5e4650b2..a0327855 100644
--- a/content/english/hpc/external-memory/oblivious.md
+++ b/content/english/hpc/external-memory/oblivious.md
@@ -198,7 +198,7 @@ $$
 T(N) = O\left(\frac{(\sqrt{M})^2}{B} \cdot \left(\frac{N}{\sqrt M}\right)^3\right) = O\left(\frac{N^3}{B\sqrt{M}}\right)
 $$
 
-This is better than just $O(\frac{N^3}{B})$ and by quite a lot.
+This is better than just $O(\frac{N^3}{B})$, and by quite a lot.
 
 ### Strassen Algorithm
 
@@ -237,7 +237,7 @@ $$
 
 You can verify these formulas with simple substitution if you feel like it.
 
-As far as I know, none of the mainstream optimized linear algebra libraries use the Strassen algorithm, although there are some prototype implementations that are efficient for matrices larger than 4000 or so.
+As far as I know, none of the mainstream optimized linear algebra libraries use the Strassen algorithm, although there are [some prototype implementations](https://arxiv.org/pdf/1605.01078.pdf) that are efficient for matrices larger than 2000 or so.
 
 This technique can and actually has been extended multiple times to reduce the asymptotic even further by considering more submatrix products. As of 2020, current world record is $O(n^{2.3728596})$. Whether you can multiply matrices in $O(n^2)$ or at least $O(n^2 \log^k n)$ time is an open problem.
 

From c5b7bd4b85ab1a90c25400c073181df687978377 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 12 Apr 2022 19:11:58 +0300
Subject: [PATCH 043/173] note about kernel design choices

---
 content/english/hpc/algorithms/matmul.md | 25 ++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index a5a7b4f2..c692a227 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -272,6 +272,31 @@ void kernel(float *a, vec *b, vec *c, int x, int y, int l, int r, int n) {
 
 We need `t` so that the compiler stores these elements in vector registers. We could just update their final destinations in `c`, but, unfortunately, the compiler re-writes them back to memory, causing a slowdown (wrapping everything in `__restrict__` keywords doesn't help).
 
+After unrolling these loops and hoisting `b` out of the `i` loop (`b[(k * n + y) / 8 + j]` does not depend on `i` and can be loaded once and reused in all 6 iterations), the compiler generates something more similar to this:
+
+<!-- /hpc/simd/intrinsics/#simd-intrinsics -->
+
+```c++
+for (int k = l; k < r; k++) {
+    __m256 b0 = _mm256_load_ps((__m256*) &b[k * n + y]);
+    __m256 b1 = _mm256_load_ps((__m256*) &b[k * n + y + 8]);
+    
+    __m256 a0 = _mm256_broadcast_ss(&a[x * n + k]);
+    t00 = _mm256_fmadd_ps(a0, b0, t00);
+    t01 = _mm256_fmadd_ps(a0, b1, t01);
+
+    __m256 a1 = _mm256_broadcast_ss(&a[(x + 1) * n + k]);
+    t10 = _mm256_fmadd_ps(a1, b0, t10);
+    t11 = _mm256_fmadd_ps(a1, b1, t11);
+
+    // ...
+}
+```
+
+We are using $12+3=15$ vector registers and a total of $6 \times 3 + 2 = 20$ instructions to perform $16 \times 6 = 96$ updates. Assuming that there are no other bottlenecks, we should be hitting the throughput of `_mm256_fmadd_ps`.
+
+Note that this kernel is architecture-specific. If we didn't have `fma`, or if its throughput/latency were different, or if the SIMD width were 128 or 512 bits, we would have made different design choices. Multi-platform BLAS implementations ship [many kernels](https://github.com/xianyi/OpenBLAS/tree/develop/kernel), each written in assembly by hand and optimized for a particular architecture.
+
 The rest of the implementation is straightforward. Similar to the previous vectorized implementation, we just move the matrices to memory-aligned arrays and call the kernel instead of the innermost loop:
 
 ```c++

From 473fe8562d44b769d27a0b8c8229f281eea2d3b3 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 12 Apr 2022 23:40:21 +0300
Subject: [PATCH 044/173] mlp clarifications

---
 content/english/hpc/cpu-cache/mlp.md         | 2 +-
 content/english/hpc/cpu-cache/prefetching.md | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)
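
Reviewer note (commentary only, not part of the commit): the reworded sentence about loading the `D`-th element ahead relies on a closed form for the iterated index. The sketch below checks that form in Python, assuming the step function is $f(k) = (2k + 1) \bmod n$, which is what the prefetch expression `((1 << D) * k + (1 << D) - 1) % n` implies. The function names are hypothetical.

```python
def step(k, n):
    """One step of the assumed index sequence: k -> (2k + 1) mod n."""
    return (2 * k + 1) % n

def iterate(k, D, n):
    """Apply the step D times, the slow way."""
    for _ in range(D):
        k = step(k, n)
    return k

def ahead(k, D, n):
    """Closed form for the index D steps ahead, as used in the prefetch call."""
    return ((1 << D) * k + (1 << D) - 1) % n

n = 1_000_003
print(all(ahead(k, D, n) == iterate(k, D, n)
          for k in range(1000) for D in range(1, 9)))
```

Python integers do not overflow, so this only confirms the algebra. In the C++ version the product `(1 << D) * k` can overflow for large `D`, which is the caveat the patched paragraph mentions.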

diff --git a/content/english/hpc/cpu-cache/mlp.md b/content/english/hpc/cpu-cache/mlp.md
index 11c5b660..95dfa4cb 100644
--- a/content/english/hpc/cpu-cache/mlp.md
+++ b/content/english/hpc/cpu-cache/mlp.md
@@ -3,7 +3,7 @@ title: Memory-Level Parallelism
 weight: 5
 ---
 
-Memory requests can overlap in time: while you wait for a read request to complete, you can send a few others, which will be executed concurrently with it. This is the reason why [linear iteration](../bandwidth) is so much faster than [pointer jumping](../latency): the CPU knows which memory locations it needs to fetch next and sends memory requests far ahead of time.
+Memory requests can overlap in time: while you wait for a read request to complete, you can send a few others, which will be executed concurrently with it. This is the main reason why [linear iteration](../bandwidth) is so much faster than [pointer jumping](../latency): the CPU knows which memory locations it needs to fetch next and sends memory requests far ahead of time.
 
 The number of concurrent memory operations is large but limited, and it is different for different types of memory. When designing algorithms and especially data structures, you may want to know this number, as it limits the amount of parallelism your computation can achieve.
 
diff --git a/content/english/hpc/cpu-cache/prefetching.md b/content/english/hpc/cpu-cache/prefetching.md
index 8ccdea6b..3001389c 100644
--- a/content/english/hpc/cpu-cache/prefetching.md
+++ b/content/english/hpc/cpu-cache/prefetching.md
@@ -70,7 +70,7 @@ There is some overhead to computing the next address, but for arrays large enoug
 
 ![](../img/sw-prefetch.svg)
 
-Interestingly, we can prefetch more than just two elements ahead, making use of this pattern in the LCG function:
+Interestingly, we can prefetch more than just one element ahead, making use of this pattern in the LCG function:
 
 $$
 \begin{aligned}
@@ -82,17 +82,17 @@ $$
 \end{aligned}
 $$
 
-Hence, in order to load `D` elements ahead, we can do this:
+Hence, to load the `D`-th element ahead, we can do this:
 
 ```cpp
 __builtin_prefetch(&q[((1 << D) * k + (1 << D) - 1) % n]);
 ```
 
-Ignoring some issues such as the integer overflow, this way we can reduce the latency arbitrarily close to the cost of computing the next index (which in this case is dominated by the [modulo operation](/hpc/arithmetic/division)).
+If we execute this request on every iteration, we will be simultaneously prefetching `D` elements ahead on average, increasing the throughput by `D` times. Ignoring some issues such as the integer overflow when `D` is too large, this way, we can reduce the average latency arbitrarily close to the cost of computing the next index (which, in this case, is dominated by the [modulo operation](/hpc/arithmetic/division)).
 
 ![](../img/sw-prefetch-others.svg)
 
-Note that this is an artificial example, and you actually fail more often than not when trying to insert software prefetching into practical programs. This is largely due to the fact that you need to issue a separate memory instruction that may compete for resources with the others. At the same time, hardware prefetching is 100% harmless as it only activates when the memory and cache buses are not busy.
+Note that this is an artificial example, and you actually fail more often than not when trying to insert software prefetching into practical programs. This is largely because you need to issue a separate memory instruction that may compete for resources with the others. At the same time, hardware prefetching is 100% harmless as it only activates when the memory and cache buses are not busy.
 
 You can also specify a specific level of cache the data needs to be brought to when doing software prefetching — when you aren't sure if you will be using it and don't want to kick out what is already in the L1 cache. You can use it with the `_mm_prefetch` intrinsic, which takes an integer value as the second parameter, specifying the cache level. This is useful in combination with [non-temporal loads and stores](../bandwidth#bypassing-the-cache).
 

From 2a0cf6808d345a51c19de689c8b206f30d1ae92d Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 12 Apr 2022 23:47:11 +0300
Subject: [PATCH 045/173] prefetching edits

---
 content/english/hpc/cpu-cache/prefetching.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/content/english/hpc/cpu-cache/prefetching.md b/content/english/hpc/cpu-cache/prefetching.md
index 3001389c..4f5a7545 100644
--- a/content/english/hpc/cpu-cache/prefetching.md
+++ b/content/english/hpc/cpu-cache/prefetching.md
@@ -30,9 +30,9 @@ for (int i = 0; i + 16 < N; i += 16) {
 }
 ```
 
-There is no point in making a graph because the latency is flat: 3ns regardless of the array size. Even though the instruction scheduler still can't tell what we are going to fetch next, the memory prefetcher can detect a pattern just by looking at the memory accesses and start loading the next cache line ahead of time, leveling out its latency.
+There is no point in making a graph because it would be just flat: the latency is 3ns regardless of the array size. Even though the instruction scheduler still can't tell what we are going to fetch next, the memory prefetcher can detect a pattern just by looking at the memory accesses and start loading the next cache line ahead of time, mitigating the latency.
 
-Hardware prefetching is usually powerful enough for most cases, but it only detects simple patterns. You can iterate forward and backward over multiple arrays in parallel, perhaps with small-to-medium strides, but that's about it. For anything more complex, the prefetcher won't figure out what's happening, and we need to help it out ourselves.
+Hardware prefetching is smart enough for most use cases, but it only detects simple patterns. You can iterate forward and backward over multiple arrays in parallel, perhaps with small-to-medium strides, but that's about it. For anything more complex, the prefetcher won't figure out what's happening, and we need to help it out ourselves.
 
 ### Software Prefetching
 
@@ -88,7 +88,7 @@ Hence, to load the `D`-th element ahead, we can do this:
 __builtin_prefetch(&q[((1 << D) * k + (1 << D) - 1) % n]);
 ```
 
-If we execute this request on every iteration, we will be simultaneously prefetching `D` elements ahead on average, increasing the throughput by `D` times. Ignoring some issues such as the integer overflow when `D` is too large, this way, we can reduce the average latency arbitrarily close to the cost of computing the next index (which, in this case, is dominated by the [modulo operation](/hpc/arithmetic/division)).
+If we execute this request on every iteration, we will be simultaneously prefetching `D` elements ahead on average, increasing the throughput by `D` times. Ignoring some issues such as the integer overflow when `D` is too large, we can reduce the average latency arbitrarily close to the cost of computing the next index (which, in this case, is dominated by the [modulo operation](/hpc/arithmetic/division)).
 
 ![](../img/sw-prefetch-others.svg)
 

From 68ae398833ab4e47918c4e20f013a333729b0bb9 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 13 Apr 2022 10:54:28 +0300
Subject: [PATCH 046/173] column -> cell

---
 content/english/hpc/cpu-cache/aos-soa.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/english/hpc/cpu-cache/aos-soa.md b/content/english/hpc/cpu-cache/aos-soa.md
index 048271db..d5765339 100644
--- a/content/english/hpc/cpu-cache/aos-soa.md
+++ b/content/english/hpc/cpu-cache/aos-soa.md
@@ -99,8 +99,8 @@ As the performance on smaller arrays sizes is not affected, this clearly has som
 From the performance analysis point of view, all data in RAM is physically stored in a two-dimensional array of tiny capacitor cells, which is split into rows and columns. To read or write any cell, you need to perform one, two, or three actions:
 
 1. Read the contents of a row in a *row buffer*, which temporarily discharges the capacitors. 
-2. Read or write a specific column in this buffer.
-3. Write the contents of a row buffer back into the capacitors, so that the data is preserved, and the row buffer can be used for other memory accesses.
+2. Read or write a specific cell in this buffer.
+3. Write the contents of a row buffer back into the capacitors so that the data is preserved and the row buffer can be used for other memory accesses.
 
 Here is the punchline: you don't have to perform steps 1 and 3 between two memory accesses that correspond to the same row — you can just use the row buffer as a temporary cache. These three actions take roughly the same time, so this optimization makes long sequences of row-local accesses run thrice as fast compared to dispersed access patterns.
 

From 50ffb1c9324e9d62433f178ba62494070c9b1afd Mon Sep 17 00:00:00 2001
From: Alex Saveau <saveau.alexandre@gmail.com>
Date: Fri, 15 Apr 2022 11:33:19 -0700
Subject: [PATCH 047/173] Fix missing word

---
 content/english/hpc/data-structures/binary-search.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index ff9f73b4..36bb5059 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -9,7 +9,7 @@ Instead, the most fascinating showcases of performance engineering are multifold
 
 <!-- Yet, with remarkable periodicity, these can be optimized to ridiculous levels of performance. -->
 
-In this article, we focus on such fundamental algorithm — *binary search* — and implement two of its variants that are, depending on the problem size, up to 4x faster than `std::lower_bound`, while being under just 15 lines of code.
+In this article, we focus on one such fundamental algorithm — *binary search* — and implement two of its variants that are, depending on the problem size, up to 4x faster than `std::lower_bound`, while being under just 15 lines of code.
 
 The first algorithm achieves that by removing [branches](/hpc/pipelining/branching), and the second also optimizes the memory layout to achieve better [cache system](/hpc/cpu-cache) performance. This technically disqualifies it from being a drop-in replacement for `std::lower_bound` as it needs to permute the elements of the array before it can start answering queries — but I can't recall a lot of scenarios where you obtain a sorted array but can't afford to spend linear time on preprocessing.
 

From 35016003c29a455a023f56118f5a9a0cf9c48072 Mon Sep 17 00:00:00 2001
From: Elk Cloner <28754537+elkcl@users.noreply.github.com>
Date: Sat, 16 Apr 2022 17:34:45 +0300
Subject: [PATCH 048/173] =?UTF-8?q?=D0=98=D1=81=D0=BF=D1=80=D0=B0=D0=B2?=
 =?UTF-8?q?=D0=BB=D0=B5=D0=BD=D0=B8=D0=B5=20=D1=81=D1=81=D1=8B=D0=BB=D0=BA?=
 =?UTF-8?q?=D0=B8=20=D0=BD=D0=B0=20z-=D1=84=D1=83=D0=BD=D0=BA=D1=86=D0=B8?=
 =?UTF-8?q?=D1=8E=20=D0=B2=20=D1=81=D1=82=D0=B0=D1=82=D1=8C=D0=B5=20=D0=BF?=
 =?UTF-8?q?=D1=80=D0=BE=20=D1=81=D1=83=D1=84=D0=BC=D0=B0=D1=81?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 content/russian/cs/string-structures/suffix-array.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/russian/cs/string-structures/suffix-array.md b/content/russian/cs/string-structures/suffix-array.md
index 80d2b129..25b90a3e 100644
--- a/content/russian/cs/string-structures/suffix-array.md
+++ b/content/russian/cs/string-structures/suffix-array.md
@@ -136,7 +136,7 @@ vector<int> suffix_array(vector<int> &s) {
 
 ### Алгоритм Касаи, Аримуры, Арикавы, Ли, Парка
 
-Алгоритм в реальности называется как угодно, но не исходным способом (*алгоритм Касаи*, *алгоритм пяти корейцев*, и т. д.). Используется для подсчета $lcp$ за линейное время. Автору алгоритм кажется чем-то похожим на [z-функцию](string-searching) по своей идее.
+Алгоритм в реальности называется как угодно, но не исходным способом (*алгоритм Касаи*, *алгоритм пяти корейцев*, и т. д.). Используется для подсчета $lcp$ за линейное время. Автору алгоритм кажется чем-то похожим на [z-функцию](/cs/string-searching/z-function) по своей идее.
 
 **Утверждение.** Пусть мы уже построили суфмасс и посчитали $lcp[i]$. Тогда:
 

From 656f10fb82d03cb22d928566a5a67e7f6a8fcbd6 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Apr 2022 05:28:45 +0300
Subject: [PATCH 049/173] bugfix

---
 content/russian/cs/layer-optimizations/_index.md | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/content/russian/cs/layer-optimizations/_index.md b/content/russian/cs/layer-optimizations/_index.md
index 492473b5..2456aa4c 100644
--- a/content/russian/cs/layer-optimizations/_index.md
+++ b/content/russian/cs/layer-optimizations/_index.md
@@ -10,10 +10,7 @@ date: 2021-08-29
 
 **Задача.** Даны $n$ точек на прямой, отсортированные по своей координате $x_i$. Нужно найти $m$ отрезков, покрывающих все точки, минимизировав при этом сумму квадратов их длин.
 
-**Базовое решение** — это следующая динамика:
-
-- $f[i, j]$ = минимальная стоимость покрытия $i$ первых точек, используя не более $j$ отрезков.
-- Переход — перебор всех возможных последних отрезков, то есть
+**Базовое решение** — определить состояние динамики $f[i, j]$ как минимальную стоимость покрытия $i$ первых точек, используя не более $j$ отрезков. Пересчитывать её можно перебором всех возможных последних отрезков:
 
 $$
 f[i, j] = \min_{k < i} \{f[k, j-1] + (x_{i-1}-x_k)^2 \}
@@ -30,7 +27,7 @@ int cost(int i, int j) {
 }
 
 for (int i = 0; i <= m; i++)
-    f[0][k] = 0; // если нам не нужно ничего покрывать, то всё и так хорошо
+    f[0][i] = 0; // если нам не нужно ничего покрывать, то всё и так хорошо
 // все остальные f предполагаем равными бесконечности
 
 for (int i = 1; i <= n; i++)

From 85bc919acc8cb33a7a09e0d37d973cef0548e7bf Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Apr 2022 05:54:04 +0300
Subject: [PATCH 050/173] fix divide and conquer dp

---
 .../layer-optimizations/divide-and-conquer.md | 31 +++++++++----------
 1 file changed, 14 insertions(+), 17 deletions(-)
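
A minimal, self-contained sketch of the divide-and-conquer optimization that the
edited article describes, kept here in the notes so it does not become part of
the commit. It assumes sorted points and the recurrence
f[i, j] = min over k < i of f[k, j-1] + (x_{i-1} - x_k)^2, and it relies on
opt[i, j] <= opt[i + 1, j]. The names `seg_cost`, `solve`, and `min_cover_cost`
and the rolling two-layer layout are illustrative choices, not the article's code.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <utility>
#include <vector>

using ll = long long;
const ll INF = LLONG_MAX / 4;

// Cost of a single segment covering points k..i-1 (0-based, x sorted).
static ll seg_cost(const std::vector<ll> &x, int k, int i) {
    ll d = x[i - 1] - x[k];
    return d * d;
}

// Fills cur[l..r], knowing that the optimal split points lie in [optl, optr].
static void solve(const std::vector<ll> &x, const std::vector<ll> &prev,
                  std::vector<ll> &cur, int l, int r, int optl, int optr) {
    if (l > r)
        return; // empty range
    int t = (l + r) / 2, opt = optl;
    cur[t] = INF;
    // linear scan over the candidate starts of the last segment
    for (int k = optl; k <= std::min(optr, t - 1); k++) {
        ll val = prev[k] + seg_cost(x, k, t);
        if (val < cur[t]) {
            cur[t] = val;
            opt = k;
        }
    }
    solve(x, prev, cur, l, t - 1, optl, opt);
    solve(x, prev, cur, t + 1, r, opt, optr);
}

// Minimal sum of squared lengths of at most m segments covering all points.
ll min_cover_cost(const std::vector<ll> &x, int m) {
    assert(std::is_sorted(x.begin(), x.end()));
    int n = (int) x.size();
    std::vector<ll> prev(n + 1, INF), cur(n + 1, INF);
    prev[0] = 0; // nothing to cover costs nothing
    for (int j = 1; j <= m; j++) {
        cur[0] = 0;
        solve(x, prev, cur, 1, n, 0, n - 1);
        std::swap(prev, cur);
    }
    return prev[n];
}
```

For the points 1, 2, 10, 11 and m = 2, the best cover is [1, 2] and [10, 11],
so the sketch should return 2.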

diff --git a/content/russian/cs/layer-optimizations/divide-and-conquer.md b/content/russian/cs/layer-optimizations/divide-and-conquer.md
index 61a7304a..c5e218db 100644
--- a/content/russian/cs/layer-optimizations/divide-and-conquer.md
+++ b/content/russian/cs/layer-optimizations/divide-and-conquer.md
@@ -8,44 +8,43 @@ published: true
 
 *Эта статья — одна из [серии](../). Рекомендуется сначала прочитать все предыдущие.*
 
-Посмотрим на формулу пересчета динамики для базового решения:
+Посмотрим на формулу пересчета динамики из базового решения:
 
 $$
 f[i, j] = \min_{k < i} \{f[k, j-1] + (x_{i-1}-x_k)^2 \}
 $$
 
-Обозначим за $opt[i, j]$ оптимальный $k$ для данного состояния — то есть  от выражения выше. Для однозначности, если оптимальный индекс не один, то выберем среди них самый правый.
+Обозначим за $opt[i, j]$ оптимальный $k$ для данного состояния — то есть аргминимум от выражения выше. Для однозначности, если оптимальный индекс не один, то выберем среди них самый правый.
 
-Конкретно в задаче покрытия точек отрезками, можно заметить следующее:
+Конкретно в задаче покрытия точек отрезками можно заметить следующее:
 
 $$
-opt[i, j] \leq opt[i+1, j]
+opt[i, j] \leq opt[i + 1, j]
 $$
 
-Интуиция такая: когда мы сдвигаем i вправо, то точка, с которой может начинаться последняя группа, не может уменьшаться.
+Интуиция такая: если нам нужно покрыть больший префикс точек, то начало последнего отрезка точно не будет раньше.
 
-### Идея
+### Алгоритм
 
-Пусть мы уже знаем $opt[i, l]$ и $opt[i, r]$ и хотим посчитать $opt[i, j]$ для какого-то $j$ между $l$ и $r$. Тогда, воспользовавшись неравенством выше, мы можем сузить отрезок поиска оптимального индекса для $j$ со всего отрезка $[0, i-1]$ до $[opt[i, l], opt[i, r]]$.
+Пусть мы уже знаем $opt[l, k]$ и $opt[r, k]$ и хотим посчитать $opt[i, k]$ для какого-то $i$ между $l$ и $r$. Тогда, воспользовавшись неравенством выше, мы можем сузить отрезок поиска оптимального индекса для $i$ со всего отрезка $[0, i - 1]$ до $[opt[l, k], opt[r, k]]$.
 
-Будем делать следующее: заведем рекурсивную функцию, которая считает динамики для отрезка $[l, r]$, зная, что их $opt$ лежат между $l'$ и $r'$. Эта функция просто берет середину отрезка $[l, r]$ и линейным проходом считает ответ для неё, а затем рекурсивно запускается от половин, передавая в качестве границ $[l', opt]$ и $[opt, r']$ соответственно.
-
-### Реализация
-
-Один $k$-тый слой целиком пересчитывается из $(k-1)$-го следующим образом:
+Будем делать следующее: заведем рекурсивную функцию, которая считает динамики для отрезка $[l, r]$ на $k$-том слое, зная, что их $opt$ лежат между $l'$ и $r'$. Эта функция просто берет середину отрезка $[l, r]$ и линейным проходом считает ответ для неё, а затем рекурсивно запускается от половин, передавая в качестве границ $[l', opt]$ и $[opt, r']$ соответственно:
 
 ```c++
+// [ l,  r] -- какие динамики на k-том слое посчитать
+// [_l, _r] -- где могут быть их ответы
 void solve(int l, int r, int _l, int _r, int k) {
     if (l > r)
         return; // отрезок пустой -- выходим
     int opt = _l, t = (l + r) / 2;
+    // считаем ответ для f[t][k]
     for (int i = _l; i <= min(_r, t); i++) { 
         int val = f[i + 1][k - 1] + cost(i, t - 1);
         if (val < f[t][k])
             f[t][k] = val, opt = i;
     }
-    solve(l, t - 1, _l, opt, k);
-    solve(t + 1, r, opt, _r, k);
+    solve(l,     t - 1, _l,  opt, k);
+    solve(t + 1, r,     opt, _r,  k);
 }
 ```
 
@@ -56,8 +55,6 @@ for (int k = 1; k <= m; k++)
     solve(0, n - 1, 0, n - 1, k);
 ```
 
-### Асимптотика
-
 Так как отрезок $[l, r]$ на каждом вызове уменьшается примерно в два раза, глубина рекурсии будет $O(\log n)$. Так как отрезки поиска для всех элементов на одном «уровне» могут пересекаться разве что только по границам, то суммарно на каждом уровне поиск проверит $O(n)$ различных индексов. Соответственно, пересчет всего слоя займет $O(n \log n)$ операций вместо $O(n^2)$ в базовом решении.
 
-Таким образом, мы улучшили асимптотику до $O(n m \log n)$.
+Таким образом, мы улучшили асимптотику до $O(n \cdot m \cdot \log n)$.

From d5c5fb5a62c2a5645d9473dda6bec8eb7430a39f Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Apr 2022 05:56:29 +0300
Subject: [PATCH 051/173] fix knuth dp criterion

---
 content/russian/cs/layer-optimizations/knuth.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/content/russian/cs/layer-optimizations/knuth.md b/content/russian/cs/layer-optimizations/knuth.md
index 5c49dbe6..8a184d2d 100644
--- a/content/russian/cs/layer-optimizations/knuth.md
+++ b/content/russian/cs/layer-optimizations/knuth.md
@@ -9,13 +9,13 @@ prerequisites:
 
 Предыдущий метод оптимизации опирался на тот факт, что $opt[i, j] \leq opt[i, j + 1]$.
 
-Асимптотику можно ещё улучшить, заметив, что $opt$ монотонен ещё и по первому параметру:
+Асимптотику можно ещё улучшить, заметив, что $opt$ монотонен также и по второму параметру:
 
 $$
-opt[i-1, j] \leq opt[i, j] \leq opt[i, j+1]
+opt[i - 1, j] \leq opt[i, j] \leq opt[i, j + 1]
 $$
 
-В задаче про покрытие отрезками это выполняется примерно по той же причине: если нам нужно покрывать меньше точек, то новый оптимальный последний отрезок будет начинаться не позже старого.
+В задаче про покрытие отрезками это выполняется примерно по той же причине: если нам доступно больше отрезков, то последний отрезок в оптимальном решении точно не будет длиннее, чем раньше.
 
 ### Алгоритм
 

From ac8906113eee302e9ee6b681909a56d391cf5bb3 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Apr 2022 06:15:41 +0300
Subject: [PATCH 052/173] mark drafts in toc

---
 themes/algorithmica/assets/style.sass             | 5 +++++
 themes/algorithmica/layouts/partials/sidebar.html | 4 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/themes/algorithmica/assets/style.sass b/themes/algorithmica/assets/style.sass
index fe3ebaeb..0a42a2d6 100644
--- a/themes/algorithmica/assets/style.sass
+++ b/themes/algorithmica/assets/style.sass
@@ -157,6 +157,11 @@ body
       &::before
         content: counter(chapter-counter) "." counter(section-counter) ". "
         font-weight: bold
+  
+  .draft, .draft a
+    color: $dimmed
+
+    
 
 #wrapper
   width: 100%
diff --git a/themes/algorithmica/layouts/partials/sidebar.html b/themes/algorithmica/layouts/partials/sidebar.html
index 2276957a..816887f5 100644
--- a/themes/algorithmica/layouts/partials/sidebar.html
+++ b/themes/algorithmica/layouts/partials/sidebar.html
@@ -24,13 +24,13 @@
         {{ if isset .Params "part" }}
           <li class='part'>{{.Params.Part}}</li>
         {{ end }}
-        <li {{ if .Params.IgnoreIndexing }}class='ignore-indexing'{{end}}><a href='{{ .RelPermalink }}'
+        <li class='{{ if .Draft }}draft{{end}} {{ if .Params.IgnoreIndexing }}ignore-indexing{{end}}'><a href='{{ .RelPermalink }}'
           {{ if eq $currentPage . }}id='active-element'{{ end }}
           >{{ .Title }}</a></li>
         {{ if .IsSection }}
           <ol>
             {{ range .Pages }}
-              <li><a href='{{ .RelPermalink }}'
+              <li {{ if .Draft }}class='draft'{{end}}><a href='{{ .RelPermalink }}'
                 {{ if eq $currentPage . }}id='active-element'{{ end }}
                 >{{ .Title }}</a></li>
             {{ end }}

From 16a9a52c12e777103d06cb52728aadc8fcb5c4ce Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Apr 2022 07:22:48 +0300
Subject: [PATCH 053/173] inversions edits

---
 content/russian/cs/sequences/inversions.md | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)
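
A small sketch for checking the three examples this patch adds to the article,
kept in the notes rather than in the commit. It is the same quadratic pair scan
the article describes; the `std::vector` signature is an assumption and differs
from the article's `int *p, int n`.

```cpp
#include <cstddef>
#include <vector>

// Counts pairs (i, j) with i < j and p[i] > p[j] by checking every pair.
long long count_inversions(const std::vector<int> &p) {
    long long count = 0;
    for (std::size_t i = 0; i < p.size(); i++)
        for (std::size_t j = i + 1; j < p.size(); j++)
            if (p[i] > p[j])
                count++;
    return count;
}
```

On [1, 2, 3], [1, 3, 2] and [3, 2, 1] this should give 0, 1 and 3, matching the
list in the new text.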

diff --git a/content/russian/cs/sequences/inversions.md b/content/russian/cs/sequences/inversions.md
index f18d1f4a..2fbec7d9 100644
--- a/content/russian/cs/sequences/inversions.md
+++ b/content/russian/cs/sequences/inversions.md
@@ -4,13 +4,18 @@ title: Число инверсий
 weight: 5
 authors:
 - Сергей Слотин
+draft: true
 ---
 
-Пусть у нас есть некоторая перестановка $p$ (какая-то последовательность чисел от $1$ до $n$, где все числа встречаются ровно один раз). *Инверсией* называется пара индексов $i$ и $j$ такая, что $i < j$ и $p_i > p_j$. Требуется найти количество инверсий в данной перестановке.
+**Определение.** *Инверсией* в перестановке $p$ называется пара индексов $i$ и $j$ такая, что $i < j$ и $p_i > p_j$.
 
-## Наивный алгоритм
+Например:
 
-Эта задача легко решается за $O(n^2)$ обычным перебором всех пар индексов и проверкой каждого на инверсию:
+- в перестановке $[1, 2, 3]$ инверсий нет,
+- в $[1, 3, 2]$ одна инверсия ($3 \leftrightarrow 2$),
+- в $[3, 2, 1]$ три инверсии ($3 \leftrightarrow 2$, $3 \leftrightarrow 1$ и $2 \leftrightarrow 1$).
+
+В этой статье мы рассмотрим, как находить количество инверсий в перестановке. Эта задача легко решается за $O(n^2)$ обычным перебором всех пар индексов и проверкой каждой на инверсию:
 
 ```cpp
 int count_inversions(int *p, int n) {
@@ -23,6 +28,8 @@ int count_inversions(int *p, int n) {
 }
 ```
 
+Решить её быстрее сложнее.
+
 ## Сортировкой слиянием
 
 Внезапно эту задачу можно решить сортировкой слиянием, слегка модифицировав её.

From b402d342b998a1a13d17eea48781845321abcae4 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Apr 2022 07:23:07 +0300
Subject: [PATCH 054/173] quickselect edits

---
 content/russian/cs/sequences/quickselect.md | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)
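
The commented-out `quickselect` stub added by this patch is unfinished. One
possible completion is sketched below in the notes, not in the commit. It
assumes a 0-based k with 0 <= k < n, a copy of the input instead of the stub's
`buffer`, a fixed-seed `std::mt19937` instead of `rand()`, and a three-way
partition; none of these choices come from the article.

```cpp
#include <random>
#include <utility>
#include <vector>

// Returns the k-th smallest element (0-based) of a; works on a private copy.
int quickselect(std::vector<int> a, int k) {
    static std::mt19937 rng(12345); // fixed seed, for reproducibility only
    int l = 0, r = (int) a.size() - 1;
    while (l < r) {
        int pivot = a[l + (int) (rng() % (unsigned) (r - l + 1))];
        // three-way partition of a[l..r]:
        // a[l..lt-1] < pivot, a[lt..gt] == pivot, a[gt+1..r] > pivot
        int lt = l, i = l, gt = r;
        while (i <= gt) {
            if (a[i] < pivot)
                std::swap(a[lt++], a[i++]);
            else if (a[i] > pivot)
                std::swap(a[i], a[gt--]);
            else
                i++;
        }
        if (k < lt)
            r = lt - 1;  // the answer is among the smaller elements
        else if (k > gt)
            l = gt + 1;  // the answer is among the larger elements
        else
            return pivot; // k falls into the block equal to the pivot
    }
    return a[l];
}
```

The result should agree with taking `a[k]` after `std::nth_element` on the same
data.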

diff --git a/content/russian/cs/sequences/quickselect.md b/content/russian/cs/sequences/quickselect.md
index b1606bbd..7e83a267 100644
--- a/content/russian/cs/sequences/quickselect.md
+++ b/content/russian/cs/sequences/quickselect.md
@@ -1,12 +1,12 @@
 ---
-# TODO: реализация
 title: Порядковые статистики
 weight: 4
+draft: true
 ---
 
 Если в [начале предыдущей главы](/cs/interactive/binary-search) мы искали число элементов массива, меньших $x$ — также известное как индекс этого элемента в отсортированном массиве — то теперь нас интересует обратная задача: узнать, какой элемент $k$-тый по возрастанию.
 
-Если массив уже отсортирован, то задача тривиальная — просто берем $k$-тый элемент. Иначе мы его можем отсортировать, но на это потребуется $O(n \log n)$ операций — и мы знаем, что используя только сравнения быстрее не получится.
+Если массив уже отсортирован, то задача тривиальная: просто берем $k$-тый элемент. Иначе мы его можем отсортировать, но на это потребуется $O(n \log n)$ операций — и мы знаем, что если мы используем только сравнения, быстрее не получится.
 
 Есть другой подход — мы можем модифицировать алгоритм быстрой сортировки.
 
@@ -26,4 +26,17 @@ weight: 4
 
 Подумав над тем, что размер отрезка каждый раз убывает приблизительно в 2 раза, над ограниченностью суммы $n + \frac{n}{2} + \frac{n}{4} + \ldots = 2 \cdot n$, и немного помахав руками, получаем, что алгоритм работает за $O(n)$. 
 
+<!--
+```c++
+int buffer[maxn];
+
+int quickselect(int *a, int n) {
+    int t = rand() % n;
+
+    for ()
+
+}
+```
+-->
+
 В C++ этот алгоритм уже реализован и доступен как `nth_element`.

From c10ebb35240390d9bd7fd69769c8b65ed4f0cdfe Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Apr 2022 07:23:21 +0300
Subject: [PATCH 055/173] sequence compression

---
 content/russian/cs/sequences/_index.md      |  3 +-
 content/russian/cs/sequences/compression.md | 50 ++++++++++++++-------
 2 files changed, 35 insertions(+), 18 deletions(-)
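
A standalone check of the order-preserving variant from compression.md (sort,
drop duplicates, then binary search), kept in the notes rather than in the
commit. The function body follows the article; the includes and the sample
input are added here for illustration.

```cpp
#include <algorithm>
#include <vector>

// Replaces each value with its rank among the distinct values (0-based).
std::vector<int> compress(std::vector<int> a) {
    std::vector<int> b = a;

    std::sort(b.begin(), b.end());
    b.erase(std::unique(b.begin(), b.end()), b.end());

    for (int &x : a)
        x = int(std::lower_bound(b.begin(), b.end(), x) - b.begin());

    return a;
}
```

For the input {100, -5, 100, 7} the distinct sorted values are {-5, 7, 100}, so
the result should be {2, 0, 2, 1}.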

diff --git a/content/russian/cs/sequences/_index.md b/content/russian/cs/sequences/_index.md
index d02ed49b..6888831d 100644
--- a/content/russian/cs/sequences/_index.md
+++ b/content/russian/cs/sequences/_index.md
@@ -1,7 +1,6 @@
 ---
 title: Последовательности
 weight: 4
-draft: true
 ---
 
-В этой главе рассматриваются некоторые алгоритмы на неотсортированных последовательностях.
+В этой главе рассматриваются алгоритмы для неотсортированных последовательностей.
diff --git a/content/russian/cs/sequences/compression.md b/content/russian/cs/sequences/compression.md
index 332011b3..58686d5c 100644
--- a/content/russian/cs/sequences/compression.md
+++ b/content/russian/cs/sequences/compression.md
@@ -3,46 +3,64 @@ title: Сжатие координат
 authors:
 - Сергей Слотин
 weight: -1
-draft: true
+date: 2022-04-20
 ---
 
+Часто бывает полезно преобразовать последовательность чисел либо каких-то других объектов в промежуток последовательных целых чисел — например, чтобы использовать её элементы как индексы в массиве либо какой-нибудь другой структуре.
 
-## Сжатие координат
-Это общая идея, которая может оказаться полезной. Пусть, есть $n$ чисел $a_1,\ldots,a_n$. Хотим, преобразовать $a_i$ так, чтобы равные остались равными, разные остались разными, но все они были от 0 до $n-1$. Для этого надо отсортировать числа, удалить повторяющиеся и заменить каждое $a_i$ на его индекс в отсортированном массиве.
+Эта задача эквивалентна нумерации элементов множества, что можно сделать за $O(n)$ через хэш-таблицу:
 
+```c++
+vector<int> compress(vector<int> a) {
+    unordered_map<int, int> m;
 
-```
-int a[n], all[n];
-for (int i = 0; i < n; ++i) {
-    cin >> a[i];
-    all[i] = a[i];
+    for (int &x : a) {
+        // новый элемент получает следующий свободный номер
+        if (!m.count(x))
+            m.emplace(x, (int) m.size());
+        x = m[x];
+    }
+
+    return a;
 }
-sort(all, all + n);
-m = unique(all, all + n) - all; // теперь m - число различных координат
-for (int i = 0; i < n; ++i)
-    a[i] = lower_bound(all, all + m, x[i]) - all;
 ```
 
-```cpp
+Элементам будут присвоены номера в порядке их первого вхождения в последовательность. Если нужно сохранить *порядок*, присвоив меньшим элементам меньшие номера, то задача становится чуть сложнее, и её можно решить разными способами.
+
+Как вариант, можно отсортировать массив, а затем два раза пройтись по нему с хэш-таблицей — в первый раз заполняя её, а во второй раз сжимая сам массив:
+
+```c++
 vector<int> compress(vector<int> a) {
+    vector<int> b = a;
+    sort(b.begin(), b.end());
+
     unordered_map<int, int> m;
-    for (int x : a)
-        if (m.count(x))
+
+    for (int x : b)
+        if (!m.count(x))
             m[x] = m.size();
+
     for (int &x : a)
         x = m[x];
+
     return a;
 }
 ```
 
+Также можно выкинуть из отсортированного массива дубликаты (за линейное время), а затем использовать его для нахождения индекса каждого элемента исходного массива бинарным поиском:
 
-```cpp
+```c++
 vector<int> compress(vector<int> a) {
     vector<int> b = a;
+
     sort(b.begin(), b.end());
     b.erase(unique(b.begin(), b.end()), b.end());
+
     for (int &x : a)
         x = int(lower_bound(b.begin(), b.end(), x) - b.begin());
+
     return a;
 }
 ```
+
+Оба подхода работают за $O(n \log n)$. Используйте тот, который больше нравится.

From ad0c2aa70cfb6e6d3622174e8cbd6fee8399bba7 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Apr 2022 07:28:06 +0300
Subject: [PATCH 056/173] quicksort edits

---
 content/russian/cs/sorting/quicksort.md | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/content/russian/cs/sorting/quicksort.md b/content/russian/cs/sorting/quicksort.md
index f3a6a5d6..e6494cd3 100644
--- a/content/russian/cs/sorting/quicksort.md
+++ b/content/russian/cs/sorting/quicksort.md
@@ -7,13 +7,18 @@ draft: true
 Быстрая сортировка заключается в том, что на каждом шаге мы находим опорный элемент, все элементы, которые меньше его кидаем в левую часть, остальные в правую, а затем рекурсивно спускаемся в обе части.
 
 ```cpp
+// partition - функция, разбивающая элементы
+// на меньшие и большие либо равные a[p];
+// при этом функция возвращает границу разбиения
+int partition(int l, int r, int p) {
+    // TODO: реализация
+}
+
 void quicksort(int l, int r){
     if (l < r){
         int index = (l + r) / 2; /* index - индекс опорного элемента для 
         начала сделаем его равным середине отрезка*/
-        index = divide(l, r, index); /* divide - функция разбивающие элементы 
-        на меньшие и больше/равные a[index], 
-        при этом функция возвращает границу разбиения*/
+        index = partition(l, r, index);
         quicksort(l, index);
         quicksort(index + 1, r);
     }
@@ -25,8 +30,6 @@ void quicksort(int l, int r){
 
 Существуют несколько выходов из этой ситуации :
 
-2. Давайте если быстрая сортировка работает долго, то запустим любую другую сортировку за $NlogN$.
-
-3. Давайте делить массив не на две, а на три части(меньше, равны, больше).
-
-4. Чтобы избавиться от проблемы с максимумом/минимумом в середине, давайте **брать случайный элемент**.
+1. Если быстрая сортировка работает долго, давайте запустим любую другую сортировку за $N \log N$.
+2. Давайте делить массив не на две, а на три части (меньше, равны, больше).
+3. Чтобы избавиться от проблемы с максимумом/минимумом в середине, давайте **брать случайный элемент**.

From 78d207d2d08787ecfecfafb25dfe6adaf347a03c Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Apr 2022 07:31:36 +0300
Subject: [PATCH 057/173] fix ru broken links

---
 content/russian/cs/algebra/matmul.md                     | 2 +-
 content/russian/cs/basic-structures/iterators.md         | 4 ++--
 content/russian/cs/matching/matching-problems.md         | 2 +-
 content/russian/cs/spanning-trees/kruskal.md             | 2 +-
 content/russian/cs/spanning-trees/safe-edge.md           | 2 +-
 content/russian/cs/string-searching/manacher.md          | 2 +-
 content/russian/cs/string-structures/palindromic-tree.md | 2 +-
 content/russian/cs/string-structures/suffix-array.md     | 4 ++--
 content/russian/cs/tree-structures/treap.md              | 2 +-
 9 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/content/russian/cs/algebra/matmul.md b/content/russian/cs/algebra/matmul.md
index bc5ca593..8a633bea 100644
--- a/content/russian/cs/algebra/matmul.md
+++ b/content/russian/cs/algebra/matmul.md
@@ -188,7 +188,7 @@ matrix binpow(matrix a, int p) {
 
 Эту технику можно применить и к другим динамикам, где нужно посчитать количество способов что-то сделать — иногда очень неочевидными способами.
 
-Например, можно решить такую задачу: найти количество строк длины $k \approx 10^{18}$, не содержащих данные маленькие запрещённые подстроки. Для этого нужно построить граф «легальных» переходов в [Ахо-Корасике](/cs/automata/aho-corasick), возвести его матрицу смежности в $k$-тую степень и просуммировать в нём первую строчку.
+Например, можно решить такую задачу: найти количество строк длины $k \approx 10^{18}$, не содержащих данные маленькие запрещённые подстроки. Для этого нужно построить граф «легальных» переходов в [Ахо-Корасике](/cs/string-structures/aho-corasick), возвести его матрицу смежности в $k$-тую степень и просуммировать в нём первую строчку.
 
 В некоторых изощрённых случаях в матричном умножении вместо умножения и сложения нужно использовать другие операции, которые ведут себя как умножение и сложение. Пример задачи: «найти путь от $s$ до $t$ с минимальным весом ребра, использующий ровно $k$ переходов»; здесь нужно возводить в $(k-1)$-ую степень матрицу весов графа, и вместо и сложения, и умножения использовать минимум из двух весов.
 
diff --git a/content/russian/cs/basic-structures/iterators.md b/content/russian/cs/basic-structures/iterators.md
index b2d8269f..c048e0b6 100644
--- a/content/russian/cs/basic-structures/iterators.md
+++ b/content/russian/cs/basic-structures/iterators.md
@@ -71,7 +71,7 @@ for (int x : c)
 
 ### Алгоритмы из STL
 
-Например, итераторы `std::vector` относятся к `random_access_iterator`, и если вызвать функцию `lower_bound` из стандартной библиотеки, то она произведет [бинарный поиск](../../ordered-search/binary-search) по элементам (предполагая, что они отсортированы в порядке неубывания):
+Например, итераторы `std::vector` относятся к `random_access_iterator`, и если вызвать функцию `lower_bound` из стандартной библиотеки, то она произведет [бинарный поиск](/cs/interactive/binary-search/) по элементам (предполагая, что они отсортированы в порядке неубывания):
 
 ```cpp
 vector<int> a = {1, 2, 3, 5, 8, 13};
@@ -93,4 +93,4 @@ array<int, 3> a = {4, 2, 1, 3};
 cout << *min_element(a.begin(), a.end()) << endl;
 ```
 
-Подробнее про разные полезные алгоритмы STL можно прочитать в [ликбезе по C++](../../programming/cpp).
+<!-- Подробнее про разные полезные алгоритмы STL можно прочитать в [ликбезе по C++](../../programming/cpp). -->
diff --git a/content/russian/cs/matching/matching-problems.md b/content/russian/cs/matching/matching-problems.md
index cedfe69d..cd14e54e 100644
--- a/content/russian/cs/matching/matching-problems.md
+++ b/content/russian/cs/matching/matching-problems.md
@@ -81,6 +81,6 @@ $$
 
 Пусть у вершин левой доли есть какие-то веса, и нам нужно набрать максимальное паросочетание минимального веса.
 
-Выясняется, что можно просто отсортировать вершины левой доли по весу и пытаться в таком порядке добавлять их в паросочетание стандартным алгоритмом Куна. Для доказательства этого факта читатель может прочитать про [жадный алгоритм Радо-Эдмондса](/cs/greedy/matroid), частным случаем которого является такая модификация алгоритма Куна.
+Выясняется, что можно просто отсортировать вершины левой доли по весу и пытаться в таком порядке добавлять их в паросочетание стандартным алгоритмом Куна. Для доказательства этого факта читатель может прочитать про [жадный алгоритм Радо-Эдмондса](/cs/combinatorial-optimization/matroid), частным случаем которого является такая модификация алгоритма Куна.
 
 Аналогичную задачу, но когда у *ребер* есть веса, проще всего решать сведением к нахождению [потока минимальной стоимости](/cs/flows/mincost-maxflow).
diff --git a/content/russian/cs/spanning-trees/kruskal.md b/content/russian/cs/spanning-trees/kruskal.md
index ddb9cabf..1f4c98a4 100644
--- a/content/russian/cs/spanning-trees/kruskal.md
+++ b/content/russian/cs/spanning-trees/kruskal.md
@@ -34,4 +34,4 @@ for (auto [a, b, w] : edges) {
 }
 ```
 
-Раз остовные деревья являются частным случаем [матроида](/cs/greedy/matroid), то алгоритм Краскала является частным случаем алгоритма Радо-Эдмондса.
+Раз остовные деревья являются частным случаем [матроида](/cs/combinatorial-optimization/matroid), то алгоритм Краскала является частным случаем алгоритма Радо-Эдмондса.
diff --git a/content/russian/cs/spanning-trees/safe-edge.md b/content/russian/cs/spanning-trees/safe-edge.md
index cc7138c9..19f97006 100644
--- a/content/russian/cs/spanning-trees/safe-edge.md
+++ b/content/russian/cs/spanning-trees/safe-edge.md
@@ -24,4 +24,4 @@ weight: 1
 - Если веса всех рёбер различны, то остов будет уникален.
 - Минимальный остов является также и остовом с минимальным произведением весов рёбер (замените веса всех рёбер на их логарифмы).
 - Минимальный остов является также и остовом с минимальным весом самого тяжелого ребра.
-- Остовные деревья — частный случай [матроидов](/cs/greedy/matroid).
+- Остовные деревья — частный случай [матроидов](/cs/combinatorial-optimization/matroid).
diff --git a/content/russian/cs/string-searching/manacher.md b/content/russian/cs/string-searching/manacher.md
index 8954b653..16d32ccb 100644
--- a/content/russian/cs/string-searching/manacher.md
+++ b/content/russian/cs/string-searching/manacher.md
@@ -32,7 +32,7 @@ vector<int> pal_array(string s) {
 
 Тот же пример $s = aa\dots a$ показывает, что данная реализация работает за $O(n^2)$.
 
-Для оптимизации применим идею, знакомую из алгоритма [z-функции](string-searching): при инициализации $t_i$ будем пользоваться уже посчитанными $t$. А именно, будем поддерживать $(l, r)$ — интервал, соответствующий самому правому из найденных подпалиндромов. Тогда мы можем сказать, что часть наибольшего палиндрома с центром в $s_i$, которая лежит внутри $s_{l:r}$, имеет радиус хотя бы $\min(r-i, \; t_{l+r-i})$. Первая величина равна длине, дальше которой произошел бы выход за пределы $s_{l:r}$, а вторая — значению радиуса в позиции, зеркальной относительно центра палиндрома $s_{l:r}$.
+Для оптимизации применим идею, знакомую из алгоритма [z-функции](/cs/string-searching/z-function/): при инициализации $t_i$ будем пользоваться уже посчитанными $t$. А именно, будем поддерживать $(l, r)$ — интервал, соответствующий самому правому из найденных подпалиндромов. Тогда мы можем сказать, что часть наибольшего палиндрома с центром в $s_i$, которая лежит внутри $s_{l:r}$, имеет радиус хотя бы $\min(r-i, \; t_{l+r-i})$. Первая величина равна длине, дальше которой произошел бы выход за пределы $s_{l:r}$, а вторая — значению радиуса в позиции, зеркальной относительно центра палиндрома $s_{l:r}$.
 
 ```c++
 
diff --git a/content/russian/cs/string-structures/palindromic-tree.md b/content/russian/cs/string-structures/palindromic-tree.md
index 3d70c76b..9b57534a 100644
--- a/content/russian/cs/string-structures/palindromic-tree.md
+++ b/content/russian/cs/string-structures/palindromic-tree.md
@@ -19,7 +19,7 @@ weight: 3
 
 Будем поддерживать наибольший суффикс-палиндром. Когда мы будем дописывать очередной символ $c$, нужно найти наибольший суффикс этого палиндрома, который может быть дополнен символом $c$ — это и будет новый наидлиннейший суффикс-палиндром.
 
-Для этого поступим аналогично [алгоритму Ахо-Корасик](aho-corasick): будем поддерживать для каждого палиндрома суффиксную ссылку $l(v)$, ведущую из $v$ в её наибольший суффикс-палиндром. При добавлении очередного символа, будем подниматься по суффиксным ссылкам, пока не найдём вершину, из которой можно совершить нужный переход.
+Для этого поступим аналогично [алгоритму Ахо-Корасик](../aho-corasick): будем поддерживать для каждого палиндрома суффиксную ссылку $l(v)$, ведущую из $v$ в её наибольший суффикс-палиндром. При добавлении очередного символа, будем подниматься по суффиксным ссылкам, пока не найдём вершину, из которой можно совершить нужный переход.
 
 Если в подходящей вершине этого перехода не существовало, то нужно создать новую вершину, и для неё тоже понадобится своя суффиксная ссылка. Чтобы найти её, будем продолжать подниматься по суффиксным ссылкам предыдущего суффикс-палиндрома, пока не найдём второе такое место, которое мы можем дополнить символом $c$.
 
diff --git a/content/russian/cs/string-structures/suffix-array.md b/content/russian/cs/string-structures/suffix-array.md
index 25b90a3e..a7b90768 100644
--- a/content/russian/cs/string-structures/suffix-array.md
+++ b/content/russian/cs/string-structures/suffix-array.md
@@ -22,7 +22,7 @@ weight: 100
 
 ![Сортировка всех суффиксов строки «mississippi$»](../img/sa-sort.png)
 
-**Где это может быть полезно.** Пусть вы хотите основать ещё один поисковик, и чтобы получить финансирование, вам нужно сделать хоть что-то минимально работающее — хотя бы просто научиться искать по ключевому слову документы, включающие его, а также позиции их вхождения (в 90-е это был бы уже довольно сильный MVP). Простыми алгоритмами — [полиномиальными хешами](/cs/hashing), [z- и префикс-функцией](/cs/string-searching) и даже [Ахо-Корасиком](/cs/automata/aho-corasick) — это сделать быстро нельзя, потому что на каждый раз нужно проходиться по всем данным, а суффиксными структурами — можно.
+**Где это может быть полезно.** Пусть вы хотите основать ещё один поисковик, и чтобы получить финансирование, вам нужно сделать хоть что-то минимально работающее — хотя бы просто научиться искать по ключевому слову документы, включающие его, а также позиции их вхождения (в 90-е это был бы уже довольно сильный MVP). Простыми алгоритмами — [полиномиальными хешами](/cs/hashing), [z- и префикс-функцией](/cs/string-searching) и даже [Ахо-Корасиком](../aho-corasick) — это сделать быстро нельзя, потому что на каждый раз нужно проходиться по всем данным, а суффиксными структурами — можно.
 
 В случае с суффиксным массивом можно сделать следующее: сконкатенировать все строки-документы с каким-нибудь внеалфавитным разделителем (`$`), построить по ним суффиксный массив, а дальше для каждого запроса искать бинарным поиском первый суффикс в суффиксном массиве, который меньше искомого слова, а также последний, который меньше. Все суффиксы между этими двумя будут включать искомую строку как префикс.
 
@@ -132,7 +132,7 @@ vector<int> suffix_array(vector<int> &s) {
 
 Тогда есть мотивация посчитать массив `lcp$` в котором окажутся наибольшие общие префиксы соседних суффиксов, а после как-нибудь считать минимумы на отрезках в этом массиве (например, с помощью [разреженной таблицы](/cs/range-queries/sparse-table)).
 
-Осталось придумать способ быстро посчитать массив `lcp`. Можно воспользоваться идеей из построения суффиксного массива за $O(n \log^2 n)$: с помощью [хешей](hashing) и бинпоиска находить `lcp` для каждой пары соседей. Такой метод работает за $O(n \log n)$, но является не самым удобным и популярным.
+Осталось придумать способ быстро посчитать массив `lcp`. Можно воспользоваться идеей из построения суффиксного массива за $O(n \log^2 n)$: с помощью [хешей](/cs/hashing/polynomial/) и бинпоиска находить `lcp` для каждой пары соседей. Такой метод работает за $O(n \log n)$, но является не самым удобным и популярным.
 
 ### Алгоритм Касаи, Аримуры, Арикавы, Ли, Парка
 
diff --git a/content/russian/cs/tree-structures/treap.md b/content/russian/cs/tree-structures/treap.md
index 724ed15f..ad11c794 100644
--- a/content/russian/cs/tree-structures/treap.md
+++ b/content/russian/cs/tree-structures/treap.md
@@ -100,7 +100,7 @@ $$
 
 Примечательно, что ожидаемая глубина вершин зависит от их позиции: вершина из середины должна быть примерно в два раза глубже, чем крайняя.
 
-**Упражнение.** Выведите по аналогии с этим рассуждением асимптотику [quicksort](/cs/sorting/quicksort).
+**Упражнение.** Выведите по аналогии с этим рассуждением асимптотику quicksort.<!-- [quicksort](/cs/sorting/quicksort). -->
 
 ## Реализация
 

From d184936628da9db13363466ce12f91f7c1af4660 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Apr 2022 07:39:32 +0300
Subject: [PATCH 058/173] fix hpc broken links

---
 content/english/hpc/algorithms/prefix.md      | 2 +-
 content/english/hpc/architecture/assembly.md  | 2 +-
 content/english/hpc/architecture/indirect.md  | 2 +-
 content/english/hpc/cpu-cache/paging.md       | 2 +-
 content/english/hpc/data-structures/b-tree.md | 4 ++--
 content/english/hpc/data-structures/s-tree.md | 2 +-
 content/english/hpc/pipelining/branchless.md  | 4 ++--
 content/english/hpc/pipelining/throughput.md  | 2 +-
 content/english/hpc/simd/shuffling.md         | 2 +-
 9 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/content/english/hpc/algorithms/prefix.md b/content/english/hpc/algorithms/prefix.md
index 5e31570d..f07daaf3 100644
--- a/content/english/hpc/algorithms/prefix.md
+++ b/content/english/hpc/algorithms/prefix.md
@@ -61,7 +61,7 @@ for (int l = 0; l < logn; l++)
 
 We can prove that this algorithm works by induction: if on $k$-th iteration every element $a_i$ is equal to the sum of the $(i - 2^k, i]$ segment of the original array, then after adding $a_{i - 2^k}$ to it, it will be equal to the sum of $(i - 2^{k+1}, i]$. After $O(\log n)$ iterations, the array will turn into its prefix sum.
 
-To implement it in SIMD, we could use [permutations](/hpc/simd/shuffles) to place $i$-th element against $(i-2^k)$-th, but they are too slow. Instead, we will use the `sll` ("shift lanes left") instruction that does exactly that and also replaces the unmatched elements with zeros:
+To implement it in SIMD, we could use [permutations](/hpc/simd/shuffling) to place $i$-th element against $(i-2^k)$-th, but they are too slow. Instead, we will use the `sll` ("shift lanes left") instruction that does exactly that and also replaces the unmatched elements with zeros:
 
 ```c++
 typedef __m128i v4i;
diff --git a/content/english/hpc/architecture/assembly.md b/content/english/hpc/architecture/assembly.md
index 013d2987..5c981547 100644
--- a/content/english/hpc/architecture/assembly.md
+++ b/content/english/hpc/architecture/assembly.md
@@ -57,7 +57,7 @@ Most instructions write their result into the first operand, which can also be i
 
 There are also 32-, 16-bit and 8-bit registers that have similar names (`rax` → `eax` → `ax` → `al`). They are not fully separate but *aliased*: the first 32 bits of `rax` are `eax`, the first 16 bits of `eax` are `ax`, and so on. This is made to save die space while maintaining compatibility, and it is also the reason why basic type casts in compiled programming languages are usually free. 
 
-These are just the *general-purpose* registers that you can, with [some exceptions](../functions), use however you like in most instructions. There is also a separate set of registers for [floating-point arithmetic](/hpc/arithmetic/float), a bunch of very wide registers used in [vector extensions](/hpc/simd), and a few special ones that are needed for [control flow](../jumps), but we'll get there in time.
+These are just the *general-purpose* registers that you can, with [some exceptions](../functions), use however you like in most instructions. There is also a separate set of registers for [floating-point arithmetic](/hpc/arithmetic/float), a bunch of very wide registers used in [vector extensions](/hpc/simd), and a few special ones that are needed for [control flow](../loops), but we'll get there in time.
 
 **Constants** are just integer or floating-point values: `42`, `0x2a`, `3.14`, `6.02e23`. They are more commonly called *immediate values* because they are embedded right into the machine code. Because it may considerably increase the complexity of the instruction encoding, some instructions don't support immediate values or allow just a fixed subset of them. In some cases, you have to load a constant value into a register and then use it instead of an immediate value.
 
diff --git a/content/english/hpc/architecture/indirect.md b/content/english/hpc/architecture/indirect.md
index ce6e86b8..487b81e3 100644
--- a/content/english/hpc/architecture/indirect.md
+++ b/content/english/hpc/architecture/indirect.md
@@ -106,7 +106,7 @@ During a virtual method call, that offset field is fetched from the instance of
 
 Of course, this adds some overhead:
 
-- You may need to spend another 15 cycles or so for the same pipeline flushing reasons as for [branch misprediction](../pipelining).
+- You may need to spend another 15 cycles or so for the same pipeline flushing reasons as for [branch misprediction](/hpc/pipelining).
 - The compiler most likely won't be able to inline the function call itself.
 - Class size increases by a couple of bytes or so (this is implementation-specific).
 - The binary size itself increases a little bit.
diff --git a/content/english/hpc/cpu-cache/paging.md b/content/english/hpc/cpu-cache/paging.md
index 684fcd65..3e6cfd8f 100644
--- a/content/english/hpc/cpu-cache/paging.md
+++ b/content/english/hpc/cpu-cache/paging.md
@@ -81,7 +81,7 @@ Enabling huge pages also improves [latency](../latency) by up to 10-15% for arra
 
 In general, enabling huge pages is a good idea when you have any sort of sparse reads, as they usually slightly improve and ([almost](../aos-soa)) never hurt performance.
 
-That said, you shouldn't rely on huge pages if possible, as they aren't always available due to either hardware or computing environment restrictions. There are [many](../cache-lines) [other](../hw-prefetching) [reasons](../aos-soa) why grouping data accesses spatially may be beneficial, which automatically solves the paging problem.
+That said, you shouldn't rely on huge pages if possible, as they aren't always available due to either hardware or computing environment restrictions. There are [many](../cache-lines) [other](../prefetching) [reasons](../aos-soa) why grouping data accesses spatially may be beneficial, which automatically solves the paging problem.
 
 <!--
 
diff --git a/content/english/hpc/data-structures/b-tree.md b/content/english/hpc/data-structures/b-tree.md
index 96d1a08e..122e1c8e 100644
--- a/content/english/hpc/data-structures/b-tree.md
+++ b/content/english/hpc/data-structures/b-tree.md
@@ -24,7 +24,7 @@ Instead of making small incremental improvements like we usually do in other cas
 - Nodes in the B− tree do not store pointers or any metadata except for the pointers to internal node children (while the B+ tree leaf nodes store a pointer to the next leaf node). This lets us perfectly place the keys in the leaf nodes on cache lines.
 - We define key $i$ to be the *maximum* key in the subtree of the child $i$ instead of the *minimum* key in the subtree of the child $(i + 1)$. This lets us not fetch any other nodes after we reach a leaf (in the B+ tree, all keys in the leaf node may be less than the search key, so we need to go to the next leaf node to fetch its first element).
 
-We also use a node size of $B=32$, which is smaller than typical. The reason why it is not $16$, which was [optimal for the S+ tree](s-tree/#modifications-and-further-optimizations), is because we have the additional overhead associated with fetching the pointer, and the benefit of reducing the tree height by ~20% outweighs the cost of processing twice the elements per node, and also because it improves the running time of the `insert` query that needs to perform a costly node split every $\frac{B}{2}$ insertions on average.
+We also use a node size of $B=32$, which is smaller than typical. The reason why it is not $16$, which was [optimal for the S+ tree](../s-tree/#modifications-and-further-optimizations), is because we have the additional overhead associated with fetching the pointer, and the benefit of reducing the tree height by ~20% outweighs the cost of processing twice the elements per node, and also because it improves the running time of the `insert` query that needs to perform a costly node split every $\frac{B}{2}$ insertions on average.
 
 <!--
 
@@ -83,7 +83,7 @@ To "allocate" a new node, we simply increase `n_tree` by $B$ if it is a leaf nod
 
 Since new nodes can only be created by splitting a full node, each node except for the root will be at least half full. This implies that we need between 4 and 8 bytes per integer element (the internal nodes will contribute $\frac{1}{16}$-th or so to that number), the former being the case when the inserts are sequential, and the latter being the case when the input is adversarial. When the queries are uniformly distributed, the nodes are ~75% full on average, projecting to ~5.2 bytes per element.
 
-B-trees are very memory-efficient compared to the pointer-based binary trees. For example, `std::set` needs at least three pointers (the left child, the right child, and the parent), alone costing $3 \times 8 = 24$ bytes, plus at least another $8$ bytes to store the key and the meta-information due to [structure padding](hpc/cpu-cache/alignment/).
+B-trees are very memory-efficient compared to the pointer-based binary trees. For example, `std::set` needs at least three pointers (the left child, the right child, and the parent), alone costing $3 \times 8 = 24$ bytes, plus at least another $8$ bytes to store the key and the meta-information due to [structure padding](/hpc/cpu-cache/alignment/).
 
 ### Searching
 
diff --git a/content/english/hpc/data-structures/s-tree.md b/content/english/hpc/data-structures/s-tree.md
index a6e3ea57..3fcf97b5 100644
--- a/content/english/hpc/data-structures/s-tree.md
+++ b/content/english/hpc/data-structures/s-tree.md
@@ -301,7 +301,7 @@ It doesn't feel very satisfying so far, but we will reuse these optimization ide
 There are two main problems with the current implementation:
 
 - The `update` procedure is quite costly, especially considering that it is very likely going to be useless: 16 out of 17 times, we can just fetch the result from the last block.
-- We do a non-constant number of iterations, causing branch prediction problems similar to how it did for the [Eytzinger binary search](/binary-search/#removing-the-last-branch); you can also see it on the graph this time, but the latency bumps have a period of $2^4$.
+- We do a non-constant number of iterations, causing branch prediction problems similar to how it did for the [Eytzinger binary search](../binary-search/#removing-the-last-branch); you can also see it on the graph this time, but the latency bumps have a period of $2^4$.
 
 To address these problems, we need to change the layout a little bit.
 
diff --git a/content/english/hpc/pipelining/branchless.md b/content/english/hpc/pipelining/branchless.md
index e8d4d2dd..f84627b5 100644
--- a/content/english/hpc/pipelining/branchless.md
+++ b/content/english/hpc/pipelining/branchless.md
@@ -93,7 +93,7 @@ This way you can eliminate branching, but this comes at the cost of evaluating *
 
 ### When It Is Beneficial
 
-Using predication eliminates [a control hazard](../hazard) but introduces a data hazard. There is still a pipeline stall, but it is a cheaper one: you only need to wait for `cmov` to be resolved and not flush the entire pipeline in case of a mispredict.
+Using predication eliminates [a control hazard](../hazards) but introduces a data hazard. There is still a pipeline stall, but it is a cheaper one: you only need to wait for `cmov` to be resolved and not flush the entire pipeline in case of a mispredict.
 
 However, there are many situations when it is more efficient to leave branchy code as it is. This is the case when the cost of computing *both* branches instead of just *one* outweighs the penalty for the potential branch mispredictions.
 
@@ -103,7 +103,7 @@ In our example, the branchy code wins when the branch can be predicted with a pr
 
 This 75% threshold is commonly used by the compilers as a heuristic for determining whether to use the `cmov` or not. Unfortunately, this probability is usually unknown at the compile-time, so it needs to be provided in one of several ways:
 
-- We can use [profile-guided optimization](/hpc/compilation/pgo) which will decide for itself whether to use predication or not.
+- We can use [profile-guided optimization](/hpc/compilation/situational/#profile-guided-optimization) which will decide for itself whether to use predication or not.
 - We can use [likeliness attributes](../branching#hinting-likeliness-of-branches) and [compiler-specific intrinsics](/hpc/compilation/situational) to hint at the likeliness of branches: `__builtin_expect_with_probability` in GCC and `__builtin_unpredictable` in Clang.
 - We can rewrite branchy code using the ternary operator or various arithmetic tricks, which acts as sort of an implicit contract between programmers and compilers: if the programmer wrote the code this way, then it was probably meant to be branchless.
 
diff --git a/content/english/hpc/pipelining/throughput.md b/content/english/hpc/pipelining/throughput.md
index ffb6b762..27789b28 100644
--- a/content/english/hpc/pipelining/throughput.md
+++ b/content/english/hpc/pipelining/throughput.md
@@ -21,7 +21,7 @@ for (int i = 0; i < n; i++)
     s += a[i];
 ```
 
-Let's assume for a moment that the compiler doesn't [vectorize](/hpc/simd) this loop, [the memory bandwidth](/hpc/memory/bandwidth) isn't a concern, and that the loop is [unrolled](/hpc/architecture/loops) so that we don't pay any additional cost associated with maintaining the loop variables. In this case, the computation becomes very simple:
+Let's assume for a moment that the compiler doesn't [vectorize](/hpc/simd) this loop, [the memory bandwidth](/hpc/cpu-cache/bandwidth) isn't a concern, and that the loop is [unrolled](/hpc/architecture/loops) so that we don't pay any additional cost associated with maintaining the loop variables. In this case, the computation becomes very simple:
 
 ```c++
 int s = 0;
diff --git a/content/english/hpc/simd/shuffling.md b/content/english/hpc/simd/shuffling.md
index b7e13ba1..111c34d5 100644
--- a/content/english/hpc/simd/shuffling.md
+++ b/content/english/hpc/simd/shuffling.md
@@ -227,7 +227,7 @@ The vectorized version takes some work to implement, but it is 6-7x faster than
 
 The loop performance is still relatively low — taking 4 CPU cycles per iteration —  because, on this particular CPU (Zen 2), `movemask`, `permute`, and `store` have low throughput and all have to go through the same execution port (P2). On most other x86 CPUs, you can expect it to be ~2x faster.
 
-Filtering can also be implemented considerably faster on AVX-512: it has a special "[compress](_mm512_mask_compress_epi32)" instruction that takes a vector of data and a mask and writes its unmasked elements contiguously. It makes a huge difference in algorithms that rely on various filtering subroutines, such as quicksort.
+Filtering can also be implemented considerably faster on AVX-512: it has a special "[compress](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=7395,7392,7269,4868,7269,7269,1820,1835,6385,5051,4909,4918,5051,7269,6423,7410,150,2138,1829,1944,3009,1029,7077,519,5183,4462,4490,1944,1395&text=_mm512_mask_compress_epi32)" instruction that takes a vector of data and a mask and writes its unmasked elements contiguously. It makes a huge difference in algorithms that rely on various filtering subroutines, such as quicksort.
 
 <!--
 

From 9174f5477e36916de3823a747de0d02b81711359 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Apr 2022 07:39:51 +0300
Subject: [PATCH 059/173] wget command that finds broken links

---
 scripts/check-links.sh | 2 ++
 1 file changed, 2 insertions(+)
 create mode 100644 scripts/check-links.sh

diff --git a/scripts/check-links.sh b/scripts/check-links.sh
new file mode 100644
index 00000000..1f840e14
--- /dev/null
+++ b/scripts/check-links.sh
@@ -0,0 +1,2 @@
+# huge serve
+wget --spider -r -nd -nv http://localhost:1313/

From c1198b6ac1fadfeaedf737d88e971762c76f9d24 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Apr 2022 07:41:23 +0300
Subject: [PATCH 060/173] fix localhost link

---
 content/english/hpc/data-structures/binary-search.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index 36bb5059..56f1609a 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -20,7 +20,7 @@ The first algorithm achieves that by removing [branches](/hpc/pipelining/branchi
 
 -->
 
-The usual disclaimer: the CPU is a [Zen 2](https://www.7-cpu.com/cpu/Zen2.html), the RAM is a [DDR4-2666](http://localhost:1313/hpc/cpu-cache/), and the compiler we will be using by default is Clang 10. The performance on your machine may be different, so I highly encourage to [go and test it](https://godbolt.org/z/14rd5Pnve) for yourself.
+The usual disclaimer: the CPU is a [Zen 2](https://www.7-cpu.com/cpu/Zen2.html), the RAM is a [DDR4-2666](/hpc/cpu-cache/), and the compiler we will be using by default is Clang 10. The performance on your machine may be different, so I highly encourage you to [go and test it](https://godbolt.org/z/14rd5Pnve) for yourself.
 
 <!--
 

From 5cf11cf2d11401e134eb32c3db117f53c34f575c Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Apr 2022 07:48:47 +0300
Subject: [PATCH 061/173] typo

---
 scripts/check-links.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/check-links.sh b/scripts/check-links.sh
index 1f840e14..9f87cefd 100644
--- a/scripts/check-links.sh
+++ b/scripts/check-links.sh
@@ -1,2 +1,2 @@
-# huge serve
+# hugo serve
 wget --spider -r -nd -nv http://localhost:1313/

From f9f8573364007e037574cb3e50756b27f2c305b9 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 21 Apr 2022 09:33:08 +0300
Subject: [PATCH 062/173] total wordcount script

---
 scripts/list-files.sh | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 scripts/list-files.sh

diff --git a/scripts/list-files.sh b/scripts/list-files.sh
new file mode 100644
index 00000000..47259b5c
--- /dev/null
+++ b/scripts/list-files.sh
@@ -0,0 +1 @@
+find ./ -type f -name "*.md" -exec wc {} +

From 63a47d7ca295976c942fef9e99386f515099edc5 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 21 Apr 2022 16:01:54 +0300
Subject: [PATCH 063/173] adjust top buttons

---
 .../layouts/partials/buttons.html             | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/themes/algorithmica/layouts/partials/buttons.html b/themes/algorithmica/layouts/partials/buttons.html
index ce9d5728..265b63d9 100644
--- a/themes/algorithmica/layouts/partials/buttons.html
+++ b/themes/algorithmica/layouts/partials/buttons.html
@@ -3,16 +3,21 @@
   {{ with .File }}{{ $path = .Path }}{{ end }}
   <div class='left'>
     <a>
-      <img src='/icons/bars-solid.svg' onclick='toggleSidebar()' title='open table of contents'>
+      <img src='/icons/bars-solid.svg'
+           onclick='toggleSidebar()'
+           title='open table of contents'>
     </a>
     <a>
-      <img src='/icons/adjust-solid.svg' onclick='switchTheme(localStorage.getItem("theme") == "dark" ? "light" : "dark")' title='dark theme'>
+      <img src='/icons/adjust-solid.svg'
+           style='position: relative; top: -1px'
+           onclick='switchTheme(localStorage.getItem("theme") == "dark" ? "light" : "dark")'
+           title='dark theme'>
     </a>
-    <!--
     <a>
-      <img src='/icons/search-solid.svg'>
+      <img src='/icons/search-solid.svg'
+           onclick='toggleSearch()'
+           title='search'>
     </a>
-    -->
   </div>
   <div class='title'>{{.Title}}</div>
   <div class='right'>
@@ -20,7 +25,9 @@
       <img src='/icons/print-solid.svg' title='print'>
     </a>
     <a href='https://prose.io/#algorithmica-org/algorithmica/edit/master/{{.Site.Params.ContentDir}}/{{$path}}'>
-      <img src='/icons/edit-solid.svg' title='edit'>
+      <img src='/icons/edit-solid.svg'
+           title='edit'
+           style='width: 18px; position: relative; right: -2px; top: -1px'>
     </a>
     <a href='{{.Site.Params.Repo}}/blob/master/{{.Site.Params.ContentDir}}/{{$path}}' class='github-main'>
       <img src='/icons/github-brands.svg' title='view on github'>

From 5bb09004d6024361734f78e4518b2fb829a7b103 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 21 Apr 2022 16:02:16 +0300
Subject: [PATCH 064/173] search string translations

---
 themes/algorithmica/i18n/en.toml | 9 +++++++++
 themes/algorithmica/i18n/ru.toml | 9 +++++++++
 2 files changed, 18 insertions(+)

diff --git a/themes/algorithmica/i18n/en.toml b/themes/algorithmica/i18n/en.toml
index 9aae4777..6fa12340 100644
--- a/themes/algorithmica/i18n/en.toml
+++ b/themes/algorithmica/i18n/en.toml
@@ -15,6 +15,15 @@ other = "updated"
 [sections]
 other = "sections"
 
+[search]
+other = "Search this book…"
+
+[searchCountPrefix]
+other = "Found"
+
+[searchCountSuffix]
+other = "pages"
+
 [prerequisites]
 other = "prerequisites"
 
diff --git a/themes/algorithmica/i18n/ru.toml b/themes/algorithmica/i18n/ru.toml
index 5e96226c..08d47b66 100644
--- a/themes/algorithmica/i18n/ru.toml
+++ b/themes/algorithmica/i18n/ru.toml
@@ -21,6 +21,15 @@ other = "обновлено"
 [sections]
 other = "статьи раздела"
 
+[search]
+other = "Поиск по сайту…"
+
+[searchCountPrefix]
+other = "Найдено"
+
+[searchCountSuffix]
+other = "страниц"
+
 [prerequisites]
 other = "пререквизиты"
 

From 641a7d6dd401360a778594035d1ddc62ee55d21a Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 21 Apr 2022 16:02:32 +0300
Subject: [PATCH 065/173] add lunr

---
 themes/algorithmica/static/scripts/lunr.multi.min.js           | 1 +
 themes/algorithmica/static/scripts/lunr.ru.min.js              | 1 +
 themes/algorithmica/static/scripts/lunr.stemmer.support.min.js | 1 +
 3 files changed, 3 insertions(+)
 create mode 100644 themes/algorithmica/static/scripts/lunr.multi.min.js
 create mode 100644 themes/algorithmica/static/scripts/lunr.ru.min.js
 create mode 100644 themes/algorithmica/static/scripts/lunr.stemmer.support.min.js

diff --git a/themes/algorithmica/static/scripts/lunr.multi.min.js b/themes/algorithmica/static/scripts/lunr.multi.min.js
new file mode 100644
index 00000000..6f417304
--- /dev/null
+++ b/themes/algorithmica/static/scripts/lunr.multi.min.js
@@ -0,0 +1 @@
+!function(e,t){"function"==typeof define&&define.amd?define(t):"object"==typeof exports?module.exports=t():t()(e.lunr)}(this,function(){return function(e){e.multiLanguage=function(){for(var t=Array.prototype.slice.call(arguments),i=t.join("-"),r="",n=[],s=[],p=0;p<t.length;++p)"en"==t[p]?(r+="\\w",n.unshift(e.stopWordFilter),n.push(e.stemmer),s.push(e.stemmer)):(r+=e[t[p]].wordCharacters,e[t[p]].stopWordFilter&&n.unshift(e[t[p]].stopWordFilter),e[t[p]].stemmer&&(n.push(e[t[p]].stemmer),s.push(e[t[p]].stemmer)));var o=e.trimmerSupport.generateTrimmer(r);return e.Pipeline.registerFunction(o,"lunr-multi-trimmer-"+i),n.unshift(o),function(){this.pipeline.reset(),this.pipeline.add.apply(this.pipeline,n),this.searchPipeline&&(this.searchPipeline.reset(),this.searchPipeline.add.apply(this.searchPipeline,s))}}}});
diff --git a/themes/algorithmica/static/scripts/lunr.ru.min.js b/themes/algorithmica/static/scripts/lunr.ru.min.js
new file mode 100644
index 00000000..f04c9d33
--- /dev/null
+++ b/themes/algorithmica/static/scripts/lunr.ru.min.js
@@ -0,0 +1 @@
+!function(e,n){"function"==typeof define&&define.amd?define(n):"object"==typeof exports?module.exports=n():n()(e.lunr)}(this,function(){return function(e){if(void 0===e)throw new Error("Lunr is not present. Please include / require Lunr before this script.");if(void 0===e.stemmerSupport)throw new Error("Lunr stemmer support is not present. Please include / require Lunr stemmer support before this script.");e.ru=function(){this.pipeline.reset(),this.pipeline.add(e.ru.trimmer,e.ru.stopWordFilter,e.ru.stemmer),this.searchPipeline&&(this.searchPipeline.reset(),this.searchPipeline.add(e.ru.stemmer))},e.ru.wordCharacters="Ѐ-҄҇-ԯᴫᵸⷠ-ⷿꙀ-ꚟ︮︯",e.ru.trimmer=e.trimmerSupport.generateTrimmer(e.ru.wordCharacters),e.Pipeline.registerFunction(e.ru.trimmer,"trimmer-ru"),e.ru.stemmer=function(){var n=e.stemmerSupport.Among,r=e.stemmerSupport.SnowballProgram,t=new function(){function e(){for(;!W.in_grouping(S,1072,1103);){if(W.cursor>=W.limit)return!1;W.cursor++}return!0}function t(){for(;!W.out_grouping(S,1072,1103);){if(W.cursor>=W.limit)return!1;W.cursor++}return!0}function w(){b=W.limit,_=b,e()&&(b=W.cursor,t()&&e()&&t()&&(_=W.cursor))}function i(){return _<=W.cursor}function u(e,n){var r,t;if(W.ket=W.cursor,r=W.find_among_b(e,n)){switch(W.bra=W.cursor,r){case 1:if(t=W.limit-W.cursor,!W.eq_s_b(1,"а")&&(W.cursor=W.limit-t,!W.eq_s_b(1,"я")))return!1;case 2:W.slice_del()}return!0}return!1}function o(){return u(h,9)}function s(e,n){var r;return W.ket=W.cursor,!!(r=W.find_among_b(e,n))&&(W.bra=W.cursor,1==r&&W.slice_del(),!0)}function c(){return s(g,26)}function m(){return!!c()&&(u(C,8),!0)}function f(){return s(k,2)}function l(){return u(P,46)}function a(){s(v,36)}function p(){var e;W.ket=W.cursor,(e=W.find_among_b(F,2))&&(W.bra=W.cursor,i()&&1==e&&W.slice_del())}function d(){var e;if(W.ket=W.cursor,e=W.find_among_b(q,4))switch(W.bra=W.cursor,e){case 1:if(W.slice_del(),W.ket=W.cursor,!W.eq_s_b(1,"н"))break;W.bra=W.cursor;case 2:if(!W.eq_s_b(1,"н"))break;case 3:W.slice_del()}}var 
_,b,h=[new n("в",-1,1),new n("ив",0,2),new n("ыв",0,2),new n("вши",-1,1),new n("ивши",3,2),new n("ывши",3,2),new n("вшись",-1,1),new n("ившись",6,2),new n("ывшись",6,2)],g=[new n("ее",-1,1),new n("ие",-1,1),new n("ое",-1,1),new n("ые",-1,1),new n("ими",-1,1),new n("ыми",-1,1),new n("ей",-1,1),new n("ий",-1,1),new n("ой",-1,1),new n("ый",-1,1),new n("ем",-1,1),new n("им",-1,1),new n("ом",-1,1),new n("ым",-1,1),new n("его",-1,1),new n("ого",-1,1),new n("ему",-1,1),new n("ому",-1,1),new n("их",-1,1),new n("ых",-1,1),new n("ею",-1,1),new n("ою",-1,1),new n("ую",-1,1),new n("юю",-1,1),new n("ая",-1,1),new n("яя",-1,1)],C=[new n("ем",-1,1),new n("нн",-1,1),new n("вш",-1,1),new n("ивш",2,2),new n("ывш",2,2),new n("щ",-1,1),new n("ющ",5,1),new n("ующ",6,2)],k=[new n("сь",-1,1),new n("ся",-1,1)],P=[new n("ла",-1,1),new n("ила",0,2),new n("ыла",0,2),new n("на",-1,1),new n("ена",3,2),new n("ете",-1,1),new n("ите",-1,2),new n("йте",-1,1),new n("ейте",7,2),new n("уйте",7,2),new n("ли",-1,1),new n("или",10,2),new n("ыли",10,2),new n("й",-1,1),new n("ей",13,2),new n("уй",13,2),new n("л",-1,1),new n("ил",16,2),new n("ыл",16,2),new n("ем",-1,1),new n("им",-1,2),new n("ым",-1,2),new n("н",-1,1),new n("ен",22,2),new n("ло",-1,1),new n("ило",24,2),new n("ыло",24,2),new n("но",-1,1),new n("ено",27,2),new n("нно",27,1),new n("ет",-1,1),new n("ует",30,2),new n("ит",-1,2),new n("ыт",-1,2),new n("ют",-1,1),new n("уют",34,2),new n("ят",-1,2),new n("ны",-1,1),new n("ены",37,2),new n("ть",-1,1),new n("ить",39,2),new n("ыть",39,2),new n("ешь",-1,1),new n("ишь",-1,2),new n("ю",-1,2),new n("ую",44,2)],v=[new n("а",-1,1),new n("ев",-1,1),new n("ов",-1,1),new n("е",-1,1),new n("ие",3,1),new n("ье",3,1),new n("и",-1,1),new n("еи",6,1),new n("ии",6,1),new n("ами",6,1),new n("ями",6,1),new n("иями",10,1),new n("й",-1,1),new n("ей",12,1),new n("ией",13,1),new n("ий",12,1),new n("ой",12,1),new n("ам",-1,1),new n("ем",-1,1),new n("ием",18,1),new n("ом",-1,1),new n("ям",-1,1),new n("иям",21,1),new 
n("о",-1,1),new n("у",-1,1),new n("ах",-1,1),new n("ях",-1,1),new n("иях",26,1),new n("ы",-1,1),new n("ь",-1,1),new n("ю",-1,1),new n("ию",30,1),new n("ью",30,1),new n("я",-1,1),new n("ия",33,1),new n("ья",33,1)],F=[new n("ост",-1,1),new n("ость",-1,1)],q=[new n("ейше",-1,1),new n("н",-1,2),new n("ейш",-1,1),new n("ь",-1,3)],S=[33,65,8,232],W=new r;this.setCurrent=function(e){W.setCurrent(e)},this.getCurrent=function(){return W.getCurrent()},this.stem=function(){return w(),W.cursor=W.limit,!(W.cursor<b)&&(W.limit_backward=b,o()||(W.cursor=W.limit,f()||(W.cursor=W.limit),m()||(W.cursor=W.limit,l()||(W.cursor=W.limit,a()))),W.cursor=W.limit,W.ket=W.cursor,W.eq_s_b(1,"и")?(W.bra=W.cursor,W.slice_del()):W.cursor=W.limit,p(),W.cursor=W.limit,d(),!0)}};return function(e){return"function"==typeof e.update?e.update(function(e){return t.setCurrent(e),t.stem(),t.getCurrent()}):(t.setCurrent(e),t.stem(),t.getCurrent())}}(),e.Pipeline.registerFunction(e.ru.stemmer,"stemmer-ru"),e.ru.stopWordFilter=e.generateStopWordFilter("алло без близко более больше будем будет будете будешь будто буду будут будь бы бывает бывь был была были было быть в важная важное важные важный вам вами вас ваш ваша ваше ваши вверх вдали вдруг ведь везде весь вниз внизу во вокруг вон восемнадцатый восемнадцать восемь восьмой вот впрочем времени время все всегда всего всем всеми всему всех всею всю всюду вся всё второй вы г где говорил говорит год года году да давно даже далеко дальше даром два двадцатый двадцать две двенадцатый двенадцать двух девятнадцатый девятнадцать девятый девять действительно дел день десятый десять для до довольно долго должно другая другие других друго другое другой е его ее ей ему если есть еще ещё ею её ж же жизнь за занят занята занято заняты затем зато зачем здесь значит и из или им именно иметь ими имя иногда их к каждая каждое каждые каждый кажется как какая какой кем когда кого ком кому конечно которая которого которой которые который которых кроме кругом кто куда лет ли 
лишь лучше люди м мало между меля менее меньше меня миллионов мимо мира мне много многочисленная многочисленное многочисленные многочисленный мной мною мог могут мож может можно можхо мои мой мор мочь моя моё мы на наверху над надо назад наиболее наконец нам нами нас начала наш наша наше наши не него недавно недалеко нее ней нельзя нем немного нему непрерывно нередко несколько нет нею неё ни нибудь ниже низко никогда никуда ними них ничего но ну нужно нх о об оба обычно один одиннадцатый одиннадцать однажды однако одного одной около он она они оно опять особенно от отовсюду отсюда очень первый перед по под пожалуйста позже пока пор пора после посреди потом потому почему почти прекрасно при про просто против процентов пятнадцатый пятнадцать пятый пять раз разве рано раньше рядом с сам сама сами самим самими самих само самого самой самом самому саму свое своего своей свои своих свою сеаой себе себя сегодня седьмой сейчас семнадцатый семнадцать семь сих сказал сказала сказать сколько слишком сначала снова со собой собою совсем спасибо стал суть т та так такая также такие такое такой там твой твоя твоё те тебе тебя тем теми теперь тех то тобой тобою тогда того тоже только том тому тот тою третий три тринадцатый тринадцать ту туда тут ты тысяч у уж уже уметь хорошо хотеть хоть хотя хочешь часто чаще чего человек чем чему через четвертый четыре четырнадцатый четырнадцать что чтоб чтобы чуть шестнадцатый шестнадцать шестой шесть эта эти этим этими этих это этого этой этом этому этот эту я \ufeffа".split(" ")),e.Pipeline.registerFunction(e.ru.stopWordFilter,"stopWordFilter-ru")}});
diff --git a/themes/algorithmica/static/scripts/lunr.stemmer.support.min.js b/themes/algorithmica/static/scripts/lunr.stemmer.support.min.js
new file mode 100644
index 00000000..abd4475b
--- /dev/null
+++ b/themes/algorithmica/static/scripts/lunr.stemmer.support.min.js
@@ -0,0 +1 @@
+!function(r,t){"function"==typeof define&&define.amd?define(t):"object"==typeof exports?module.exports=t():t()(r.lunr)}(this,function(){return function(r){r.stemmerSupport={Among:function(r,t,i,s){if(this.toCharArray=function(r){for(var t=r.length,i=new Array(t),s=0;s<t;s++)i[s]=r.charCodeAt(s);return i},!r&&""!=r||!t&&0!=t||!i)throw"Bad Among initialisation: s:"+r+", substring_i: "+t+", result: "+i;this.s_size=r.length,this.s=this.toCharArray(r),this.substring_i=t,this.result=i,this.method=s},SnowballProgram:function(){var r;return{bra:0,ket:0,limit:0,cursor:0,limit_backward:0,setCurrent:function(t){r=t,this.cursor=0,this.limit=t.length,this.limit_backward=0,this.bra=this.cursor,this.ket=this.limit},getCurrent:function(){var t=r;return r=null,t},in_grouping:function(t,i,s){if(this.cursor<this.limit){var e=r.charCodeAt(this.cursor);if(e<=s&&e>=i&&(e-=i,t[e>>3]&1<<(7&e)))return this.cursor++,!0}return!1},in_grouping_b:function(t,i,s){if(this.cursor>this.limit_backward){var e=r.charCodeAt(this.cursor-1);if(e<=s&&e>=i&&(e-=i,t[e>>3]&1<<(7&e)))return this.cursor--,!0}return!1},out_grouping:function(t,i,s){if(this.cursor<this.limit){var e=r.charCodeAt(this.cursor);if(e>s||e<i)return this.cursor++,!0;if(e-=i,!(t[e>>3]&1<<(7&e)))return this.cursor++,!0}return!1},out_grouping_b:function(t,i,s){if(this.cursor>this.limit_backward){var e=r.charCodeAt(this.cursor-1);if(e>s||e<i)return this.cursor--,!0;if(e-=i,!(t[e>>3]&1<<(7&e)))return this.cursor--,!0}return!1},eq_s:function(t,i){if(this.limit-this.cursor<t)return!1;for(var s=0;s<t;s++)if(r.charCodeAt(this.cursor+s)!=i.charCodeAt(s))return!1;return this.cursor+=t,!0},eq_s_b:function(t,i){if(this.cursor-this.limit_backward<t)return!1;for(var s=0;s<t;s++)if(r.charCodeAt(this.cursor-t+s)!=i.charCodeAt(s))return!1;return this.cursor-=t,!0},find_among:function(t,i){for(var s=0,e=i,n=this.cursor,u=this.limit,o=0,h=0,c=!1;;){for(var 
a=s+(e-s>>1),f=0,l=o<h?o:h,_=t[a],m=l;m<_.s_size;m++){if(n+l==u){f=-1;break}if(f=r.charCodeAt(n+l)-_.s[m])break;l++}if(f<0?(e=a,h=l):(s=a,o=l),e-s<=1){if(s>0||e==s||c)break;c=!0}}for(;;){var _=t[s];if(o>=_.s_size){if(this.cursor=n+_.s_size,!_.method)return _.result;var b=_.method();if(this.cursor=n+_.s_size,b)return _.result}if((s=_.substring_i)<0)return 0}},find_among_b:function(t,i){for(var s=0,e=i,n=this.cursor,u=this.limit_backward,o=0,h=0,c=!1;;){for(var a=s+(e-s>>1),f=0,l=o<h?o:h,_=t[a],m=_.s_size-1-l;m>=0;m--){if(n-l==u){f=-1;break}if(f=r.charCodeAt(n-1-l)-_.s[m])break;l++}if(f<0?(e=a,h=l):(s=a,o=l),e-s<=1){if(s>0||e==s||c)break;c=!0}}for(;;){var _=t[s];if(o>=_.s_size){if(this.cursor=n-_.s_size,!_.method)return _.result;var b=_.method();if(this.cursor=n-_.s_size,b)return _.result}if((s=_.substring_i)<0)return 0}},replace_s:function(t,i,s){var e=s.length-(i-t),n=r.substring(0,t),u=r.substring(i);return r=n+s+u,this.limit+=e,this.cursor>=i?this.cursor+=e:this.cursor>t&&(this.cursor=t),e},slice_check:function(){if(this.bra<0||this.bra>this.ket||this.ket>this.limit||this.limit>r.length)throw"faulty slice operation"},slice_from:function(r){this.slice_check(),this.replace_s(this.bra,this.ket,r)},slice_del:function(){this.slice_from("")},insert:function(r,t,i){var s=this.replace_s(r,t,i);r<=this.bra&&(this.bra+=s),r<=this.ket&&(this.ket+=s)},slice_to:function(){return this.slice_check(),r.substring(this.bra,this.ket)},eq_v_b:function(r){return this.eq_s_b(r.length,r)}}}},r.trimmerSupport={generateTrimmer:function(r){var t=new RegExp("^[^"+r+"]+"),i=new RegExp("[^"+r+"]+$");return function(r){return"function"==typeof r.update?r.update(function(r){return r.replace(t,"").replace(i,"")}):r.replace(t,"").replace(i,"")}}}}});
\ No newline at end of file

From 4ffb00832e101ca478e2f973cef4afdff72b82aa Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 21 Apr 2022 16:02:50 +0300
Subject: [PATCH 066/173] build search index

---
 themes/algorithmica/layouts/_default/list.searchindex.json | 5 +++++
 1 file changed, 5 insertions(+)
 create mode 100644 themes/algorithmica/layouts/_default/list.searchindex.json

diff --git a/themes/algorithmica/layouts/_default/list.searchindex.json b/themes/algorithmica/layouts/_default/list.searchindex.json
new file mode 100644
index 00000000..6310c263
--- /dev/null
+++ b/themes/algorithmica/layouts/_default/list.searchindex.json
@@ -0,0 +1,5 @@
+{{- $.Scratch.Add "searchindex" slice -}}
+{{- range $index, $element := .Site.Pages -}}
+    {{- $.Scratch.Add "searchindex" (dict "id" $index "title" $element.Title "path" $element.RelPermalink "content" $element.Plain) -}}
+{{- end -}}
+{{- $.Scratch.Get "searchindex" | jsonify -}}

From c387bd73ba6a942b8b922b07ea019eafbb4672d6 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 21 Apr 2022 16:03:16 +0300
Subject: [PATCH 067/173] implement search

---
 config.yaml                                   |  9 ++
 .../algorithmica/layouts/_default/baseof.html |  1 +
 .../algorithmica/layouts/partials/head.html   | 88 ++++++++++++++++++-
 .../algorithmica/layouts/partials/search.html |  6 ++
 4 files changed, 103 insertions(+), 1 deletion(-)
 create mode 100644 themes/algorithmica/layouts/partials/search.html

diff --git a/config.yaml b/config.yaml
index 7e4ca1b7..8fb26a1c 100644
--- a/config.yaml
+++ b/config.yaml
@@ -8,6 +8,15 @@ outputFormats:
     baseName: index
     mediaType: text/html
     isHTML: true
+  SearchIndex:
+    mediaType: "application/json"
+    baseName: "searchindex"
+    isPlainText: true
+    notAlternative: true
+outputs:
+  home:
+  - HTML
+  - SearchIndex
 markup:
   goldmark:
     footnote: false  # katex conflict
diff --git a/themes/algorithmica/layouts/_default/baseof.html b/themes/algorithmica/layouts/_default/baseof.html
index f9056521..dbe71ede 100644
--- a/themes/algorithmica/layouts/_default/baseof.html
+++ b/themes/algorithmica/layouts/_default/baseof.html
@@ -6,6 +6,7 @@
     <div id='wrapper' {{ if .Params.HideSidebar }}class='sidebar-hidden sidebar-toggled'{{end}}>
       {{- partial "buttons.html" . -}}
       <main>
+        {{ partial "search.html" . }}
         {{- partial "header.html" . -}}
         <article>
           {{- block "main" . }}{{- end }}
diff --git a/themes/algorithmica/layouts/partials/head.html b/themes/algorithmica/layouts/partials/head.html
index f87a8873..2f4c3c46 100644
--- a/themes/algorithmica/layouts/partials/head.html
+++ b/themes/algorithmica/layouts/partials/head.html
@@ -10,6 +10,11 @@
   <link rel="stylesheet" type="text/css" href="https://tikzjax.com/v1/fonts.css">
   <script src="https://tikzjax.com/v1/tikzjax.js"></script>
 
+  <script src="https://cdnjs.cloudflare.com/ajax/libs/lunr.js/2.3.9/lunr.min.js"></script>
+  <script src="/scripts/lunr.stemmer.support.min.js"></script>
+  <script src="/scripts/lunr.ru.min.js"></script>
+  <script src="/scripts/lunr.multi.min.js"></script>
+
   {{ $dark := resources.Get "dark.sass" | toCSS | minify | fingerprint }}
   <link rel="stylesheet" id="theme">
 
@@ -18,22 +23,100 @@
       console.log("Toggling sidebar visibility")
       var sidebar = document.getElementById('sidebar')
       var wrapper = document.getElementById('wrapper')
-      if (sidebar.classList.contains('sidebar-toggled') || window.getComputedStyle(sidebar).display == 'block') {
+      if (sidebar.classList.contains('sidebar-toggled') || window.getComputedStyle(sidebar).display == 'block') { 
         sidebar.classList.toggle('sidebar-hidden')
         wrapper.classList.toggle('sidebar-hidden')
       }
       sidebar.classList.add('sidebar-toggled')
       wrapper.classList.add('sidebar-toggled')
     }
+
     function switchTheme(theme) {
       console.log("Changing theme:", theme)
       document.getElementById('theme').href = (theme == 'dark' ? "{{ $dark.RelPermalink }}" : "")
       document.getElementById('syntax-theme').href = (theme == 'dark' ? '/syntax-dark.css' : '/syntax.css')
       localStorage.setItem('theme', theme)
     }
+
+    async function toggleSearch() {
+      console.log("Toggling search")
+      
+      var searchDiv = document.getElementById('search')
+      if (window.getComputedStyle(searchDiv).display == 'none') {
+        searchDiv.style.display = 'block'
+        window.scrollTo({ top: 0 });
+      } else {
+        searchDiv.style.display = 'none'  
+      }
+
+      if (!index) {
+        console.log("Fetching index")
+        const response = await fetch('/searchindex.json')
+        const pages = await response.json()
+        index = lunr(function() {
+          this.use(lunr.multiLanguage('en', 'ru'))
+          this.field('title', {
+            boost: 5
+          })
+          this.field('content', {
+            boost: 1
+          })
+          pages.forEach(function(doc) {
+            this.add(doc)
+            articles.push(doc)
+          }, this)
+        })
+        console.log("Ready to search")
+      }
+    }
+
+    var articles = []
+    var index = undefined
+
+    function search() {
+      var query = document.getElementById('search-bar').value
+      var resultsDiv = document.getElementById('search-results')
+      var countDiv = document.getElementById('search-count')
+      
+      if (query == '') {
+        resultsDiv.innerHTML = ''
+        countDiv.innerHTML = ''
+        return
+      }
+      
+      var results = index.search(query)
+
+      countDiv.innerHTML = '{{ T "searchCountPrefix" }} <b>' + results.length + '</b> {{ T "searchCountSuffix" }}'
+
+      let resultList = ''
+
+      for (const n in results) {
+        const item = articles[results[n].ref]
+        resultList += '<li><a href="' + item.path + '">' + item.title + '</a> <p>'
+        const text = item.content
+
+        const contextLimit = 80
+
+        if (text.includes(query)) {
+          const start = text.indexOf(query)
+          if (start > contextLimit)
+            resultList += '…'
+          resultList += text.substring(start - contextLimit, start)
+                      + '<b>' + query + '</b>' + text.substring(start + query.length, start + query.length + contextLimit)
+
+        } else {
+          resultList += text.substring(0, contextLimit * 2)
+        }
+        resultList += '…</p></li>'
+      }
+
+      resultsDiv.innerHTML = resultList
+    }
+
     if (localStorage.getItem('theme') == 'dark') {
       switchTheme('dark')
     }
+
     window.addEventListener('load', function() {
       var el = document.getElementById("active-element")
       //console.log(el)
@@ -46,6 +129,7 @@
         toggleSidebar()
       }*/
     })
+
     window.addEventListener('scroll', function() {
       var menu = document.getElementById('menu')
       if (window.scrollY < 120) {
@@ -56,8 +140,10 @@
         menu.classList.add('scrolled')
       }
     })
+
     window.addEventListener('keydown', function(e) {
       if (e.altKey) { return }
+      if (document.activeElement.tagName == 'INPUT') { return }
       if (e.key == 'ArrowLeft') {
         document.getElementById('prev-article').click()
       } else if (e.key == 'ArrowRight') {
diff --git a/themes/algorithmica/layouts/partials/search.html b/themes/algorithmica/layouts/partials/search.html
new file mode 100644
index 00000000..ee853dfa
--- /dev/null
+++ b/themes/algorithmica/layouts/partials/search.html
@@ -0,0 +1,6 @@
+<div id="search">
+    <input id="search-bar" type="search" placeholder='{{ T "search" }}' oninput="search()">
+    <div id="search-count"></div>
+    <div id="search-results">
+    </div>
+</div>

From 849e9d1e652b60e4c7bdc8e7d35ba37ed5741ffc Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 21 Apr 2022 16:03:21 +0300
Subject: [PATCH 068/173] search styling

---
 themes/algorithmica/assets/style.sass | 29 ++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/themes/algorithmica/assets/style.sass b/themes/algorithmica/assets/style.sass
index 0a42a2d6..a6835c1e 100644
--- a/themes/algorithmica/assets/style.sass
+++ b/themes/algorithmica/assets/style.sass
@@ -222,7 +222,34 @@ menu
     .title
       opacity: 1
       transition: opacity 0.1s
-    
+
+#search
+  display: none
+  font-family: $font-interface
+
+  input
+    width: 100%
+    padding: 6px
+
+    color: $font-color
+
+    background: $code-background
+    border: $code-border
+
+  #search-count
+    margin-top: 8px
+    color: $dimmed
+  
+  #search-results
+    margin-top: 6px
+    border-bottom: $borders
+
+    li
+      list-style: none
+      margin: 12px 6px
+
+    p
+      margin-top: 0
 
 /*
   .github

From 882df601131e76b86687c2c843b58252c5faec46 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 21 Apr 2022 16:22:12 +0300
Subject: [PATCH 069/173] update readme

---
 README.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 171f5406..959dc025 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,10 @@
 # Algorithmica v3
 
-Algorithmica is a free and open web book about Computer Science.
+Algorithmica is an open-access web book dedicated to the art and science of computing.
 
-If you are concerned with editing, please read the [contributing guide](https://ru.algorithmica.org/contributing/) (in Russian).
+You can contribute via [Prose](https://prose.io/) by clicking the pencil icon at the top right of any page, or by editing the page source directly on GitHub. We use a slightly different Markdown dialect, so if you are not sure that a change is correct (e.g., when editing an intricate LaTeX formula), you can install [Hugo](https://gohugo.io/) and build the site locally, or just create a pull request, and a preview link will be generated for you automatically.
+
+If you happen to speak Russian, please also read the [contributing guidelines](https://ru.algorithmica.org/contributing/).
 
 ---
 
@@ -16,11 +18,11 @@ Key technical changes from the [previous version](https://github.com/algorithmic
 * Rich metadata support (language, sections, TOCs, authors...)
 * Automated global table of contents
 * Theming support
+* Search support (Lunr)
 
 Short-term todo list:
 
-* Search with lunr
-* Themes (especially a better dark theme)
-* Minor style adjustments for mobile and print versions
+* Style adjustments for mobile and print versions
 * A pdf version of the whole website
+* Meta-information support (for Google Scholar and social media)
 * [Sticky table of contents](https://css-tricks.com/table-of-contents-with-intersectionobserver/)

From 75443b0d15f22a3d5f765621d5919ee1aaf2e9a6 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 25 Apr 2022 17:13:20 +0300
Subject: [PATCH 070/173] consistent spelling

---
 content/russian/cs/sequences/compression.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/russian/cs/sequences/compression.md b/content/russian/cs/sequences/compression.md
index 58686d5c..5b469fec 100644
--- a/content/russian/cs/sequences/compression.md
+++ b/content/russian/cs/sequences/compression.md
@@ -8,7 +8,7 @@ date: 2022-04-20
 
 Часто бывает полезно преобразовать последовательность чисел либо каких-то других объектов в промежуток последовательных целых чисел — например, чтобы использовать её элементы как индексы в массиве либо какой-нибудь другой структуре.
 
-Эта задача эквивалентна нумерации элементов множества, что можно сделать за $O(n)$ через хэш-таблицу:
+Эта задача эквивалентна нумерации элементов множества, что можно сделать за $O(n)$ через хеш-таблицу:
 
 ```c++
 vector<int> compress(vector<int> a) {

From aeef2db22cf8692463b39fbdc8f321c080f4356a Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 26 Apr 2022 13:37:36 +0300
Subject: [PATCH 071/173] fix integer overflow issue

---
 content/russian/cs/modular/reciprocal.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
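
Note on the change below, as a minimal sketch rather than the page's full code: the function body is the one touched by this diff, while the comments and the preconditions stated in them are assumptions added here for illustration. The recursive result is smaller than `a`, but multiplying it by `m` in 32-bit arithmetic can overflow, which is what the `1ll` factor prevents.

```cpp
#include <cassert>

// Modular inverse of a modulo m, using the recursion from the patched page.
// Assumed preconditions (not stated in the diff): 1 <= a < m, gcd(a, m) = 1.
int inv(int a, int m) {
    if (a == 1)
        return 1;
    // inv(m % a, a) and m each fit in an int, but their product may not.
    // The 1ll factor promotes the whole expression to 64-bit arithmetic.
    return (1 - 1ll * inv(m % a, a) * m) / a + m;
}
```

For instance, with `m = 1000000007` and `a = 500000004`, the inner call returns `500000003`, and the product `500000003 * 1000000007` is about 5 * 10^17, far outside the 32-bit range. With the 64-bit multiplication, the function returns `2`.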

diff --git a/content/russian/cs/modular/reciprocal.md b/content/russian/cs/modular/reciprocal.md
index 5d0e34e9..7b966de3 100644
--- a/content/russian/cs/modular/reciprocal.md
+++ b/content/russian/cs/modular/reciprocal.md
@@ -99,7 +99,7 @@ $$ ax + my = 1 \iff ax \equiv 1 \iff x \equiv a^{-1} \pmod m $$
 int inv(int a, int m) {
     if (a == 1)
         return 1;
-    return (1 - inv(m % a, a) * m) / a + m;
+    return (1 - 1ll * inv(m % a, a) * m) / a + m;
 }
 ```
 

From 238e3987c9f1c6d6e2716e8bb1b8ce4a9a2cdfee Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 27 Apr 2022 00:01:51 +0300
Subject: [PATCH 072/173] number theory intro

---
 content/english/hpc/number-theory/_index.md | 32 +++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/content/english/hpc/number-theory/_index.md b/content/english/hpc/number-theory/_index.md
index f4936581..d532bcfd 100644
--- a/content/english/hpc/number-theory/_index.md
+++ b/content/english/hpc/number-theory/_index.md
@@ -4,10 +4,38 @@ weight: 7
 draft: true
 ---
 
-In 1940, British mathematician Godfrey Harold Hardy published a famous essay titled [A Mathematician's Apology](https://en.wikipedia.org/wiki/A_Mathematician%27s_Apology) where he discusses the notion that mathematics should be pursued for its own sake rather than for the sake of its applications. As a 62-year-old, he saw the devastation caused by first world war, and was amidst the second one.
+In 1940, the British mathematician [G. H. Hardy](https://en.wikipedia.org/wiki/G._H._Hardy) published a famous essay titled "[A Mathematician's Apology](https://en.wikipedia.org/wiki/A_Mathematician%27s_Apology)", discussing the notion that mathematics should be pursued for its own sake rather than for the sake of its applications.
 
-A scientist faces a moral dilemma because some of its inventions may do more harm than good. One can find calm in pursuing useless math. Hardy himself specialized in number theory, and he was content about it not having any applications: "No one has yet discovered any warlike purpose to be served by the theory of numbers or relativity, and it seems unlikely that anyone will do so for many years."
+I personally don't agree — and I wrote this book partially to show that there are way too few people working on practical algorithm design instead of theoretical computer science — but I understand where Hardy is coming from. Being 62 years old, he witnessed the devastation caused by the First and the ongoing Second World War that was greatly amplified by the weaponization of science.
+
+As a number theorist, Hardy finds calm working in a "useless" field and not having to face any moral dilemmas, writing:
+
+> No one has yet discovered any warlike purpose to be served by the theory of numbers or relativity, and it seems unlikely that anyone will do so for many years.
+
+Ironically, this statement was proved very wrong just 5 years later with the development of the atomic bomb, which would not have been possible without the [understanding](https://en.wikipedia.org/wiki/Einstein%E2%80%93Szil%C3%A1rd_letter) of relativity, and the inception of computer-era cryptography, which extensively builds on number theory.
+
+<!--
+
+One can find calm in pursuing "useless" math and not having to face any moral dilemmas.
+
+Hardy seems somewhat gratified that his own field has no applications:
+
+A scientist faces a moral dilemma because some of their inventions may do more harm than good. One may find calm in pursuing "useless" math.
+
+Hardy seems to find calm in pursuing "useless" math and not having to face any moral dilemmas:
+
+Scientists often face a moral dilemma because some of their inventions may do more harm than good.
+
+One can find calm in pursuing "useless" math and not facing any moral dilemmas. Hardy seems somewhat gratified that his own field has no applications:
+
+If your field has no applications, you don't have to face any moral dilemmas — and Hardy seems to be his own field, number theory, has none:
+
+somewhat proudly pointing out that his field has no practical applications:
+
+A scientist faces a moral dilemma because some of its inventions may do more harm than good. One can find calm in pursuing useless math. Hardy himself specialized in number theory, and he was content about it not having any applications:
 
 It is ironic that within just 5 years number theory was the basis of cracking Enigma and relativity theory developing atomic bomb respectively.
 
 Number theory has many more applications.
+
+-->

From cf6e133d384586f41c59c859fcf1130afd57ef28 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 27 Apr 2022 17:30:50 +0300
Subject: [PATCH 073/173] ignoreIndexing conflicted with drafts

---
 themes/algorithmica/layouts/partials/sidebar.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/themes/algorithmica/layouts/partials/sidebar.html b/themes/algorithmica/layouts/partials/sidebar.html
index 816887f5..652a1f1b 100644
--- a/themes/algorithmica/layouts/partials/sidebar.html
+++ b/themes/algorithmica/layouts/partials/sidebar.html
@@ -24,7 +24,7 @@
         {{ if isset .Params "part" }}
           <li class='part'>{{.Params.Part}}</li>
         {{ end }}
-        <li {{ if .Draft }}class='draft'{{end}} {{ if .Params.IgnoreIndexing }}class='ignore-indexing'{{end}}><a href='{{ .RelPermalink }}'
+        <li class='{{ if .Draft }}draft{{end}} {{ if .Params.IgnoreIndexing }}ignore-indexing{{end}}'><a href='{{ .RelPermalink }}'
           {{ if eq $currentPage . }}id='active-element'{{ end }}
           >{{ .Title }}</a></li>
         {{ if .IsSection }}

From 47df9a54170f32812ac3043aace1ef7f2df2027a Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 27 Apr 2022 17:42:27 +0300
Subject: [PATCH 074/173] modular arithmetic intro

---
 .../english/hpc/number-theory/img/clock.gif   | Bin 0 -> 2331 bytes
 content/english/hpc/number-theory/inverse.md  | 137 +++++++++++++++---
 2 files changed, 116 insertions(+), 21 deletions(-)
 create mode 100644 content/english/hpc/number-theory/img/clock.gif
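
Note on the "week of m days" problem added in this patch: the closed form m / gcd(a, m) can be checked against a direct simulation. The sketch below is an illustration only; the helper names `gcd_euclid`, `distinct_days`, and `cycle_length` are hypothetical and do not appear in the article, and the code assumes `a >= 0` and a small `m >= 1`.

```cpp
#include <cassert>
#include <vector>

// Greatest common divisor by Euclid's algorithm (hypothetical helper).
int gcd_euclid(int a, int b) {
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Counts distinct day numbers visited by d -> (d + a) % m, starting from 0.
int distinct_days(int a, int m) {
    std::vector<bool> seen(m, false);
    int count = 0;
    for (int d = 0; !seen[d]; d = (d + a % m) % m) {
        seen[d] = true;
        count++;
    }
    return count;
}

// Closed form stated in the text: m / gcd(a, m).
int cycle_length(int a, int m) {
    return m / gcd_euclid(a, m);
}
```

With a seven-day week and a 365-day year, both functions give 7. With `m = 10`, a year of 42 days gives 5 and a year of 15 days gives 2, matching the examples in the text.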

diff --git a/content/english/hpc/number-theory/img/clock.gif b/content/english/hpc/number-theory/img/clock.gif
new file mode 100644
index 0000000000000000000000000000000000000000..0d0c65556eafa788280115c63496ef6024ecce91
GIT binary patch
literal 2331
zcmV+$3FP)iNk%w1Vc7uL0K@<Q|NsB*@9*E=-^|R+zrVk)udkVznU9Z;e}8{wW@b-M
zPd`6DGcz+EA0H174*&oF{s9620000000000000000000000000000000000000000
z000000000000000A^8LZ6afDKEC2ui0NDW9000I5ARvxpX`X1Ru59bRa4gSsZQppV
z?|kq7z@TtQEE<o<q;kn@I-k&}bV{vSuh^`1%k6r<;IP>GD4WmdG<wW_s^9Q<yiEtn
z@A!LtZ_Arie<*$re1$7|GJ!CNhI@H6iZOVNY>a(nH<pxJb3Bnaotsy0L7zHsqEo0s
zraP;tOs}k%L!_`twneQ!xwo@xxU@!UzQ4Of$E9h-J;brX!Dq}lq0!IF(=>9|*Us54
z-qPRF-rs=0Bk0WH0OjT-*zn1e?CI+7BKPvpKmrIR0Sq8OKtNZkc=+_~BP1jM1BN3E
zI5@K)fU$G)U@0_FuZKfu0v<RZSYW|}1s5^@@+*c=B}5()MZQe%pa2Pz0_<dL*YQ@%
zLpx(2c+jBaO9LoL3@ykj=+8}0fv&t^z<>yoD`*aAIyFVqLK|81+<`Sk$ObyQdJS8L
ztyr}<%AWYh)JIT`W83DPBDciH13dh>+B?Q=l)fVdJOsSsqbI^(3?F_eg1|(pjUQ45
z1-Xml$t4Vk+`1#?M7y1}e1;+T1Od!5`U-I1fV7UGMG>D$jfFQ1$0i0iNE-X40-_6&
zD~CO!>ebD=?^5yo;^SuFCu{VD^Dl?)!Ek$1Z(;gnfT0tKbNLP)dY08La2vgOazJ)3
z?4yTQHJ&`?z69l?-WK`k=iYSv#lYVGeO(l2plS&!=pKXX9r&1Z(O5{<gac~fV0#*N
zW5j|TX7OQrWJD+fiC>(EB7*yEC`gI;oB&@f7~a5QjV-8%T8uK<NF$GU0g0oLI_yZK
zR3Rec<B}pe@g$T>+;}09R7x?WHtuA}NsiR?;-z?2X7WduV@4s1LBgP!rVaxPWI#?#
zikXr<WxlCG04AB?0CNk8vn4nOeWhm%G%Zj70SHV~7oPWkIRjx)8StnG0U>YzorebS
z00Sft@IV42*vS)?T!yERXx$|?kr{xFDL?{%Qux)NDiOMwXb-dsKmZdF6w{@LiV%_l
zrDlYyAd&(gz;VHn!0HI1l3Hy42{gs3Ym=97%WMJ#^h&~U7LCAw1J_DRVIs}Gg(?XN
z<Rq%E4WI?@1DqBsS_9>#>MgtPnlNvv11^>BPz>bjFIWGnFmO5rKR~RP0no%izuh*x
ztqu<hSc1WNeq?QQT<y?tI}uMzB~2rf+)TSAoV)<K<350~m-d~coN{Icyq3=bi#q~M
z`PO^rP^K;dbCEop94@jV958CSC12RI(B__m08I`kOm&zVOLeuczxGL#ZC}3#cGxor
zKmgiltIZkEacfC;vTWB4qkekpb@!=x_pRpA%L-m4%!M1CVb6;D&3NN&ApSSx+;mHM
zC3G)+`Qd?|y15vVcmBEm6742F`kerae)<tAYQ8$_KD+)pn~+RCyOVj#e)})A2VTHH
zaPNMC00$soz|HRVPJ?x2D1f~1BE(dN12(<TtfVF>%IE`7Lr+3T4jjNJZO=2C`@8n3
zCw1_`XB*K2jP)4%yF)SODcR7+87kK9vJEE4>FA#KPH-Idna_WN2o3y5kiG#12LTOO
zU_KH!DGH7Ne5Covkc5Y{2DEAr2=s#qr1q)ot;>NqdDhebH;D@7tpE`mNjn7Sz!I3G
zd_@W&$M{!6rASOy%9FvY;s*elRL_A7u-DKYU_hp&Y6hO-7p5GbL<b%xe!{7u*q%58
zns_h)6FH&-v6F!Rt%cEg^J81Z%utjbCMJSqXd@cW7dW>yag1*RSP<Q4H#+i>KXg>x
zAA1(aLAs%Iwlky)AE!vpF|v`?upA^wm$^hv5{Qz#B<wn;J544pW1eIaCp#HR8xf9_
zrG(VlQkfP(qLP)edu1tcWx83$QkI_VBrY|nOH0DilC|VzFMS!x#O)H2s63`3O&83~
z5x{G66y!qSG!QXTjwblqSSxS9r!zF*j&TbY52L9Jy3s@di?SHyW>bc2%1@dHv;h_G
z1x~6x1e-2|rzwF(Api=FWhW!&k@8kROf|p(IHX#*Ko!t;>NB3pGYJDsQX9q1rfUE=
z**uzww|AERK!s*tKs}4~km_kPFC4|2h3>H!5fp$n(4<<T2tZS*C4d0sdP+)FxzBop
z%U9JBoY~S)mvZb>rGff{KT#$)fed6Ii9&$FoN-i0F~n3u5u>Lz7`KK_a81M51`6zR
zu_Y2wabC4*Rg;<#FP2fF3GM1Qb4bi>-GwG*sE8Z$#5WnxG_JEVlqgx*)4|pCb!%l3
zVUgO$!+tKY{aY+w%lXjJLDr#_4Gv8=yV;S}=(2Vt!cZx)g#d7trZIh)GJEG(066uj
zTa#cGNO%T!UMaJ;MS??Ucm<ZhR%Iaa!ekMJP%b<K3Bz4n5yEB7-he8stG!YY8o*U4
zRLdX#2=FXx3&T;V0hbC=gJT(5rdr(M&ap=OL8h{s120<20QY2=dC!~KaGg~LFFgZ!
zXAq};1t5YyYON?MOEe$YXC$wQplxeFlkcW(zMJ%=1D@N1<J9FWK3HnTAdK8)*=N5v
z*lK8+8-pa4;5ez31Za&RP0`#J2ciOuX&wB6<G`kYYvVBIG^b%4ECsJ3(N+v{Rz|LA
z*Cjr&(lHU{fFhTH#4#X%c)!_ErL8!^NJ=Sg`jtgE<#C#J)v;lXq~!+yN-L;yiI^MP
zq@9NAQazS#oj%){orZ*L1!ggld33Aj&dfIU(c(4_xMni#qN>7aa%_J%*raC39f=*7
zfZ62OXtky(9MS#Wmpe=gH!X=6SKhIw9}QNPB<)S3=AEboOb@}PWz~vH^^Y<XYQH6S
zj?ld|sa*{=`qmmMr|vanb$zH(3p>ESo<=o}onvC_O%sbGWUvR=!e~F(*n@<&M^(J+
zUq0ezKl}D#2@S$>PkY=dHMhD8o$Yle+m_Box4h>~?|R$&-uTY9zW2@Ve$Orh06Qq$
BG)w>h

literal 0
HcmV?d00001

diff --git a/content/english/hpc/number-theory/inverse.md b/content/english/hpc/number-theory/inverse.md
index aec428fe..beb56611 100644
--- a/content/english/hpc/number-theory/inverse.md
+++ b/content/english/hpc/number-theory/inverse.md
@@ -3,39 +3,79 @@ title: Modular Inverse
 weight: 1
 ---
 
-```c++
-mint inv() const {
-    uint t = x;
-    uint res = 1;
-    while (t != 1) {
-        uint z = mod / t;
-        res = (ull) res * (mod - z) % mod;
-        t = mod - t * z;
-    }
-    return res;
-}
-```
+<!--
 
 In this section, we are going to discuss some preliminaries before discussing more advanced topics.
 
-In computers, we use the 1st of January, 1970 as the start of the "Unix era," and all time computations are usually done relative to that timestamp.
-
-We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. At the moment of writing, approximately 63882260594 seconds have passed since 0 AD.
+we use the 1st of January, 1970 as the start of the "Unix era," and all time computations are usually done relative to that timestamp.
 
-But for daily tasks, we do not really need that information. Depending on the situation, the relevant part may be that it is 2 pm right now and it's time to go to dinner, or that it's Thursday and so Subway's sub of the day is an Italian BMT. What we do is instead of using a timestamp we use its remainder, which contains just the information we need. And the beautiful thing about it is that remainders are small and cyclic. Think the hour clock: after 12 there comes 1 again, so the number is always small.
+And the beautiful thing about it is that remainders are small and cyclic. Think the hour clock: after 12 there comes 1 again, so the number is always small.
 
 ![](../img/clock.gif)
 
-It is much easier to deal with 1- or 2-digit numbers than 11-digit ones. If we encode each day of the weak starting with Monday from 0 to 6 inclusive, Thursday is going to get number 3. But what day of the week is it going to be in one year? We need to add 365 to it and then reduce modulo 7. It is convenient that `365 % 7` is 1, so we will know that it's Friday unless it is a leap year (in which case it will be Saturday).
+-->
+
+Computers usually store time as the number of seconds that have passed since the 1st of January, 1970 — the start of the "Unix era" — and use these timestamps in all computations that have to do with time.
+
+We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. For example, at the moment of writing, approximately 63882260594 seconds have passed since 0 AD.
+
+But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now and it's time to go to dinner, or that it's Thursday and so Subway's sub of the day is an Italian BMT. Instead of the whole timestamp, we use its *remainder* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones.
+
+### Modular Arithmetic
+
+Two integers $a$ and $b$ are said to be *congruent* modulo $m$ if $m$ divides their difference:
+
+$$
+m \mid (a - b) \; \Longleftrightarrow \; a \equiv b \pmod m
+$$
+
+Congruence modulo $m$ is an equivalence relation, which splits all integers into equivalence classes, called *residues*. Each residue class modulo $m$ may be represented by any one of its members — although we commonly use the smallest nonnegative integer of that class (equal to the remainder $x \bmod m$ for all nonnegative $x$).
+
+<!--
+
+Equivalently, the *remainders* of their division by $m$ should be equal:
+
+a \bmod m = b \bmod m
+
+Here are a few examples of how this can be useful.
+
+-->
+
+*Modular arithmetic* studies these sets of residues, which are fundamental for number theory.
 
-Modular arithmetic studies the way these sets of remainders behave, and it has fundamental applications in number theory, cryptography and data compression.
+**Problem.** Today is Thursday. What day of the week will it be exactly one year from now?
 
+If we number the days of the week from $0$ to $6$ inclusive, starting with Monday, Thursday gets number $3$. To find out what day it is going to be a year from now, we need to add $365$ to it and then reduce modulo $7$. Conveniently, $365 \bmod 7 = 1$, so we know that it will be Friday unless it is a leap year (in which case it will be Saturday).
 
-Consider the following problem: our "week" now consists of $m$ days, and we cycle through it with a steps of $a > 0$. How many distinct days there will be?
+**Problem.** Our "week" now consists of $m$ days, and our year consists of $a$ days (no leap years). How many distinct days of the week will there be among one, two, three, and so on whole years from now?
 
-Let's assume that the first day is always Monday. At some point the sequence of day is going to cycle. The days will be representable as $k a \mod m$, so we need to find the first $k$ such as $k a$ is divisible by $m$. In the case of $m=7$, $m$ is prime, so the cycle length will be 7 exactly for any $a$.
+For simplicity, assume that today is Monday, so that the initial day number $d_0$ is zero and after each year, it changes to
+
+$$
+d_{k + 1} = (d_k + a) \bmod m
+$$
+
+After $k$ years, it will be
+
+$$
+d_k = k \cdot a \bmod m
+$$
 
-Now, if $m$ is not prime, but it is still coprime with $a$. For $ka$ to be divisible by $m$, $k$ needs to be divisible by $m$. In general, the answer is $\frac{m}{gcd(a, m)}$. For example, if the week is 10 days long, if the starting number is even, then it will cycle through all even numbers, and if the number is 5, then it will only cycle between 0 and 5. Otherwise it will go through all 10 remainders.
+Since there are only $m$ days in a week, at some point it will be Monday again, and the sequence of day numbers is going to cycle. The number of distinct days is the length of this cycle, so we need to find the smallest $k$ such that
+
+$$
+k \cdot a \equiv 0 \pmod m
+$$
+
+First of all, if $a \equiv 0$, it will be eternal Monday. We now assume the non-trivial case of $a \not \equiv 0$.
+
+For a seven-day week, $m = 7$ is prime. There is no positive $k$ smaller than $m$ such that $k \cdot a$ is divisible by $m$: a prime that divides a product must divide one of its factors, and $m$ divides neither $k$ nor $a$. So, if $m$ is prime, we will cycle through all $m$ days of the week.
+
+If $m$ is not prime, but $a$ is *coprime* with it (that is, $a$ and $m$ do not have common divisors), then the answer is still $m$ for the same reason: the divisors of $a$ do not help in zeroing out the product any faster.
+
+If $a$ and $m$ share some divisors, then it is only possible to get residues that are also divisible by them. For example, if the week is $m = 10$ days long, and the year has $a = 42$ or any other even number of days, then we will cycle through all even day numbers, and if the number of days is a multiple of $5$, then we will only oscillate between $0$ and $5$. Otherwise, we will go through all the $10$ remainders.
+
+Therefore, in general, the answer is $\frac{m}{\gcd(a, m)}$, where $\gcd(a, m)$ is the [greatest common divisor](/hpc/algorithms/gcd/) of $a$ and $m$.
 
 ### Fermat's Theorem
 
@@ -65,6 +105,17 @@ $$
 
 where $\phi(m)$ is called Euler's totient function and is equal to the number of residues of $m$ that is coprime with it. In particular case of when $m$ is prime, $\phi(p) = p - 1$ and we get Fermat's theorem, which is just a special case.
 
+A few reasons to use $10^9 + 7$ as the modulus:
+
+- The expression is fairly easy to type (`1e9+7`).
+- It is a prime number.
+- It is large enough.
+- `int` does not overflow on addition.
+- `long long` does not overflow on multiplication.
+
+By the way, $10^9 + 9$ has all the same properties.
+It is sometimes used as well.
+
 ### Primality Testing
 
 These theorems have a lot of applications. One of them is checking whether a number $n$ is prime or not faster than factoring it. You can pick any base $a$ at random and try to raise it to power $a^{p-1}$ modulo $n$ and check if it is $1$. Such base is called *witness*.
@@ -105,8 +156,27 @@ int binpow(int a, int n) {
 }
 ```
 
+179.64
+
 This helps if `n` or `mod` is a constant.
 
+```c++
+int inverse(int _a) {
+    long long a = _a, r = 1;
+    
+    #pragma GCC unroll(30)
+    for (int l = 0; l < 30; l++) {
+        if ( (M - 2) >> l & 1 )
+            r = r * a % M;
+        a = a * a % M;
+    }
+
+    return r;
+}
+```
+
+171.68
+
 ### Modular Division
 
 "Normal" operations also apply to residues: +, -, *. But there is an issue with division, because we can't just bluntly divide two numbers: $\frac{8}{2} = 4$, but $\frac{8 \\% 5 = 3}{2 \\% 5 = 2} \neq 4$.
@@ -180,8 +250,33 @@ int gcd(int a, int b, int &x, int &y) {
     y = x1;
     return d;
 }
+
+int inverse(int a) {
+    int x, y;
+    gcd(a, M, x, y);
+    if (x < 0)
+        x += M;
+    return x;
+}
 ```
 
+159.28
+
+```c++
+int inverse(int a) {
+    int b = M, x = 1, y = 0;
+    while (a != 1) {
+        y -= b / a * x;
+        b %= a;
+        swap(a, b);
+        swap(x, y);
+    }
+    return x < 0 ? x + M : x;
+}
+```
+
+134.33
+
 Another application is the exact division modulo $2^k$.
 
 **Exercise**. Try to adapt the technique for binary GCD.

From 0d4d13729d5662f715d258fb37bebba5ada2928c Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 27 Apr 2022 17:54:18 +0300
Subject: [PATCH 075/173] reorganize hpc number theory

---
 .../english/hpc/number-theory/cryptography.md |   2 +-
 .../hpc/number-theory/error-correction.md     |   2 +-
 .../hpc/number-theory/exponentiation.md       |  70 ++++++++
 content/english/hpc/number-theory/finite.md   |   2 +-
 content/english/hpc/number-theory/inverse.md  | 170 +-----------------
 content/english/hpc/number-theory/modular.md  | 105 +++++++++++
 .../english/hpc/number-theory/montgomery.md   |  14 +-
 7 files changed, 192 insertions(+), 173 deletions(-)
 create mode 100644 content/english/hpc/number-theory/exponentiation.md
 create mode 100644 content/english/hpc/number-theory/modular.md

diff --git a/content/english/hpc/number-theory/cryptography.md b/content/english/hpc/number-theory/cryptography.md
index 0dd500dc..e552372a 100644
--- a/content/english/hpc/number-theory/cryptography.md
+++ b/content/english/hpc/number-theory/cryptography.md
@@ -1,6 +1,6 @@
 ---
 title: Cryptography
-weight: 6
+weight: 7
 draft: true
 ---
 
diff --git a/content/english/hpc/number-theory/error-correction.md b/content/english/hpc/number-theory/error-correction.md
index 91f1f472..e8774ed8 100644
--- a/content/english/hpc/number-theory/error-correction.md
+++ b/content/english/hpc/number-theory/error-correction.md
@@ -1,6 +1,6 @@
 ---
 title: Error Correction
-weight: 4
+weight: 6
 draft: true
 ---
 
diff --git a/content/english/hpc/number-theory/exponentiation.md b/content/english/hpc/number-theory/exponentiation.md
new file mode 100644
index 00000000..f82af3e6
--- /dev/null
+++ b/content/english/hpc/number-theory/exponentiation.md
@@ -0,0 +1,70 @@
+---
+title: Binary Exponentiation
+weight: 2
+---
+
+### Binary Exponentiation
+
+To perform the Fermat test, we need to raise a number to power $n-1$, preferrably using less than $n-2$ modular multiplications. We can use the fact that multiplication is associative:
+
+$$
+\begin{aligned}
+    a^{2k}       &= (a^k)^2
+\\  a^{2k + 1} &= (a^k)^2 \cdot a
+\end{aligned}
+$$
+
+We essentially group it like this:
+
+$$
+a^8 = (aaaa) \cdot (aaaa) = ((aa)(aa))((aa)(aa))
+$$
+
+This allows using only $O(\log n)$ operations (or, more specifically, at most $2 \cdot \log_2 n$ modular multiplications).
+
+```c++
+int binpow(int a, int n) {
+    int res = 1;
+    while (n) {
+        if (n & 1)
+            res = res * a % mod;
+        a = a * a % mod;
+        n >>= 1;
+    }
+    return res;
+}
+```
+
+179.64
+
+This helps if `n` or `mod` is a constant.
+
+```c++
+int inverse(int _a) {
+    long long a = _a, r = 1;
+    
+    #pragma GCC unroll(30)
+    for (int l = 0; l < 30; l++) {
+        if ( (M - 2) >> l & 1 )
+            r = r * a % M;
+        a = a * a % M;
+    }
+
+    return r;
+}
+```
+
+171.68
+
+
+A few reasons to use $10^9 + 7$ as the modulus:
+
+- The expression is fairly easy to type (`1e9+7`).
+- It is a prime number.
+- It is large enough.
+- `int` does not overflow on addition.
+- `long long` does not overflow on multiplication.
+
+By the way, $10^9 + 9$ has all the same properties.
+It is sometimes used as well.
+
diff --git a/content/english/hpc/number-theory/finite.md b/content/english/hpc/number-theory/finite.md
index fbef0015..cae2f2ef 100644
--- a/content/english/hpc/number-theory/finite.md
+++ b/content/english/hpc/number-theory/finite.md
@@ -1,6 +1,6 @@
 ---
 title: Finite Fields
-weight: 3
+weight: 5
 draft: true
 ---
 
diff --git a/content/english/hpc/number-theory/inverse.md b/content/english/hpc/number-theory/inverse.md
index beb56611..c0d9df08 100644
--- a/content/english/hpc/number-theory/inverse.md
+++ b/content/english/hpc/number-theory/inverse.md
@@ -1,121 +1,8 @@
 ---
-title: Modular Inverse
-weight: 1
+title: Extended Euclidean Algorithm
+weight: 3
 ---
 
-<!--
-
-In this section, we are going to discuss some preliminaries before discussing more advanced topics.
-
-we use the 1st of January, 1970 as the start of the "Unix era," and all time computations are usually done relative to that timestamp.
-
-And the beautiful thing about it is that remainders are small and cyclic. Think the hour clock: after 12 there comes 1 again, so the number is always small.
-
-![](../img/clock.gif)
-
--->
-
-Computers usually store time as the number of seconds that have passed since the 1st of January, 1970 — the start of the "Unix era" — and use these timestamps in all computations that have to do with time.
-
-We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. For example, at the moment of writing, approximately 63882260594 seconds have passed since 0 AD.
-
-But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now and it's time to go to dinner, or that it's Thursday and so Subway's sub of the day is an Italian BMT. Instead of the whole timestamp, we use its *remainer* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones.
-
-### Modular Arithmetic
-
-Two integers $a$ and $b$ are said to be *congruent* modulo $m$ if $m$ divides their difference:
-
-$$
-m \mid (a - b) \; \Longleftrightarrow \; a \equiv b \pmod m
-$$
-
-Congruence modulo $m$ is an equivalence relation, which splits all integers into equivalence classes, called *residues*. Each residue class modulo $m$ may be represented by any one of its members — although we commonly use the smallest nonnegative integer of that class (equal to the remainder $x \bmod m$ for all nonnegative $x$).
-
-<!--
-
-Equivalently, the *remainders* of their division by $m$ should be equal:
-
-a \bmod m = b \bmod m
-
-Here are a few example of how this can be useful.
-
--->
-
-*Modular arithmetic* studies these sets of residues, which are fundamental for number theory.
-
-**Problem.** Today is Thursday. What day of the week it will be exactly in a year?
-
-If we enumerate each day of the week starting with Monday from $0$ to $6$ inclusive, Thursday gets number $3$. To find out what day it is going to be in a year from now, we need to add $365$ to it and then reduce modulo $7$. Conveniently, $365 \bmod 7 = 1$, so we know that it will be Friday unless it is a leap year (in which case it will be Saturday).
-
-**Problem.** Our "week" now consists of $m$ days, and our year consists of $a$ days (no leap years). How many distinct days of the week there will be among one, two, three and so on whole years from now?
-
-For simplicity, assume that today is Monday, so that the initial day number $d_0$ is zero and after each year, it changes to
-
-$$
-d_{k + 1} = (d_k + a) \bmod m
-$$
-
-After $k$ years, it will be
-
-$$
-d_k = k \cdot a \bmod m
-$$
-
-Since there are only $m$ days in a week, at some point it will be Monday again, and the sequence of day numbers is going to cycle. The number of distinct days is the length of this cycle, so we need to find the smallest $k$ such that
-
-$$
-k \cdot a \equiv 0 \pmod m
-$$
-
-First of all, if $a \equiv 0$, it will be ethernal Monday. We now assume the non-trivial case of $a \not \equiv 0$.
-
-For a seven-day week, $m = 7$ is prime. There is no $k$ smaller than $m$ such that $k \cdot a$ is divisible by $m$ because $m$ can not be decomposed in such a product by the definition of primality. So, if $m$ is prime, we will cycle through all of $m$ week days.
-
-If $m$ is not prime, but $a$ is *coprime* with it (that is, $a$ and $m$ do not have common divisors), then the answer is still $m$ for the same reason: the divisors of $a$ do not help in zeroing out the product any faster.
-
-If $a$ and $m$ share some divisors, then it is only possible to get residues that are also divisible by them. For example, if the week is $m = 10$ days long, and the year has $a = 42$ or any other even number of days, then we will cycle through all even day numbers, and if the number of days is a multiple of $5$, then we will only oscillate between $0$ and $5$. Otherwise, we will go through all the $10$ remainders.
-
-Therefore, in general, the answer is $\frac{m}{\gcd(a, m)}$, where $\gcd(a, m)$ is the [greatest common divisor](/hpc/algorithms/gcd/) of $a$ and $m$.
-
-### Fermat's Theorem
-
-Now, consider what happens if instead of adding a number $a$, we repeatedly multiply by it, that is, write numbers in the form $a^n \mod m$. Since these are all finite numbers there is going to be a cycle, but what will its length be? If $p$ is prime, it turns out, all of them.
-
-**Theorem.** $a^p \equiv a \pmod p$ for all $a$ that are not multiple of $p$.
-
-**Proof**. Let $P(x_1, x_2, \ldots, x_n) = \frac{k}{\prod (x_i!)}$ be the *multinomial coefficient*, that is, the number of times the element $a_1^{x_1} a_2^{x_2} \ldots a_n^{x_n}$ would appear after the expansion of $(a_1 + a_2 + \ldots + a_n)^k$. Then
-
-$$
-\begin{aligned}
-a^p &= (\underbrace{1+1+\ldots+1+1}_\text{$a$ times})^p &
-\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by defenition)}
-\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} \frac{p!}{x_1! x_2! \ldots x_a!} & \text{(which terms will not be divisible by $p$?)}
-\\\ &\equiv P(p, 0, \ldots, 0) + \ldots + P(0, 0, \ldots, p) & \text{(everything else will be canceled)}
-\\\ &= a
-\end{aligned}
-$$
-
-and then dividing by $a$ gives us the Fermat's theorem.
-
-Note that this is only true for prime $p$. Euler's theorem handles the case of arbitary $m$, and states that
-
-$$
-a^{\phi(m)} \equiv 1 \pmod m
-$$
-
-where $\phi(m)$ is called Euler's totient function and is equal to the number of residues of $m$ that is coprime with it. In particular case of when $m$ is prime, $\phi(p) = p - 1$ and we get Fermat's theorem, which is just a special case.
-
-A few reasons to use $10^9 + 7$ as the modulus:
-
-- The expression is fairly easy to type (`1e9+7`).
-- It is a prime number.
-- It is large enough.
-- `int` does not overflow on addition.
-- `long long` does not overflow on multiplication.
-
-By the way, $10^9 + 9$ has all the same properties.
-It is sometimes used as well.
-
 ### Primality Testing
 
 These theorems have a lot of applications. One of them is checking whether a number $n$ is prime or not faster than factoring it. You can pick any base $a$ at random and try to raise it to power $a^{p-1}$ modulo $n$ and check if it is $1$. Such base is called *witness*.
@@ -124,59 +11,6 @@ Such probabilistic tests are therefore returning either "no" or "maybe." It may
 
 Unless the input is provided by an adversary, the mistake probability will be low. This test is adequate for finding large primes: there are roughly $\frac{n}{\ln n}$ primes among the first $n$ numbers, which is another fact that we are not going to prove. These primes are distributed more or less evenly, so one can just pick a random number and check numbers in sequence, and after checking $O(\ln n)$ numbers one will probably be found.
 
-### Binary Exponentiation
-
-To perform the Fermat test, we need to raise a number to power $n-1$, preferrably using less than $n-2$ modular multiplications. We can use the fact that multiplication is associative:
-
-$$
-\begin{aligned}
-    a^{2k}       &= (a^k)^2
-\\  a^{2k + 1} &= (a^k)^2 \cdot a
-\end{aligned}
-$$
-
-We essentially group it like this:
-
-$$
-a^8 = (aaaa) \cdot (aaaa) = ((aa)(aa))((aa)(aa))
-$$
-
-This allows using only $O(\log n)$ operations (or, more specifically, at most $2 \cdot \log_2 n$ modular multiplications).
-
-```c++
-int binpow(int a, int n) {
-    int res = 1;
-    while (n) {
-        if (n & 1)
-            res = res * a % mod;
-        a = a * a % mod;
-        n >>= 1;
-    }
-    return res;
-}
-```
-
-179.64
-
-This helps if `n` or `mod` is a constant.
-
-```c++
-int inverse(int _a) {
-    long long a = _a, r = 1;
-    
-    #pragma GCC unroll(30)
-    for (int l = 0; l < 30; l++) {
-        if ( (M - 2) >> l & 1 )
-            r = r * a % M;
-        a = a * a % M;
-    }
-
-    return r;
-}
-```
-
-171.68
-
 ### Modular Division
 
 "Normal" operations also apply to residues: +, -, *. But there is an issue with division, because we can't just bluntly divide two numbers: $\frac{8}{2} = 4$, but $\frac{8 \\% 5 = 3}{2 \\% 5 = 2} \neq 4$.
diff --git a/content/english/hpc/number-theory/modular.md b/content/english/hpc/number-theory/modular.md
new file mode 100644
index 00000000..92e0c687
--- /dev/null
+++ b/content/english/hpc/number-theory/modular.md
@@ -0,0 +1,105 @@
+---
+title: Modular Arithmetic
+weight: -1
+---
+
+
+<!--
+
+In this section, we are going to discuss some preliminaries before discussing more advanced topics.
+
+we use the 1st of January, 1970 as the start of the "Unix era," and all time computations are usually done relative to that timestamp.
+
+And the beautiful thing about it is that remainders are small and cyclic. Think the hour clock: after 12 there comes 1 again, so the number is always small.
+
+![](../img/clock.gif)
+
+-->
+
+Computers usually store time as the number of seconds that have passed since the 1st of January, 1970 — the start of the "Unix era" — and use these timestamps in all computations that have to do with time.
+
+We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. For example, at the moment of writing, approximately 63882260594 seconds have passed since 0 AD.
+
+But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now and it's time to go to dinner, or that it's Thursday and so Subway's sub of the day is an Italian BMT. Instead of the whole timestamp, we use its *remainer* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones.
+
+**Problem.** Today is Thursday. What day of the week it will be exactly in a year?
+
+If we enumerate each day of the week starting with Monday from $0$ to $6$ inclusive, Thursday gets number $3$. To find out what day it is going to be in a year from now, we need to add $365$ to it and then reduce modulo $7$. Conveniently, $365 \bmod 7 = 1$, so we know that it will be Friday unless it is a leap year (in which case it will be Saturday).
+
+**Definition.** Two integers $a$ and $b$ are said to be *congruent* modulo $m$ if $m$ divides their difference:
+
+$$
+m \mid (a - b) \; \Longleftrightarrow \; a \equiv b \pmod m
+$$
+
+For example, day 42 of the year is 161 119 = 17 \times 7.
+
+Congruence modulo $m$ is an equivalence relation, which splits all integers into equivalence classes, called *residues*. Each residue class modulo $m$ may be represented by any one of its members — although we commonly use the smallest nonnegative integer of that class (equal to the remainder $x \bmod m$ for all nonnegative $x$).
+
+<!--
+
+Equivalently, the *remainders* of their division by $m$ should be equal:
+
+a \bmod m = b \bmod m
+
+Here are a few example of how this can be useful.
+
+-->
+
+*Modular arithmetic* studies these sets of residues, which are fundamental for number theory.
+
+**Problem.** Our "week" now consists of $m$ days, and our year consists of $a$ days (no leap years). How many distinct days of the week there will be among one, two, three and so on whole years from now?
+
+For simplicity, assume that today is Monday, so that the initial day number $d_0$ is zero and after each year, it changes to
+
+$$
+d_{k + 1} = (d_k + a) \bmod m
+$$
+
+After $k$ years, it will be
+
+$$
+d_k = k \cdot a \bmod m
+$$
+
+Since there are only $m$ days in a week, at some point it will be Monday again, and the sequence of day numbers is going to cycle. The number of distinct days is the length of this cycle, so we need to find the smallest $k$ such that
+
+$$
+k \cdot a \equiv 0 \pmod m
+$$
+
+First of all, if $a \equiv 0$, it will be ethernal Monday. Now, assuming the non-trivial case of $a \not \equiv 0$:
+
+- For a seven-day week, $m = 7$ is prime. There is no $k$ smaller than $m$ such that $k \cdot a$ is divisible by $m$ because $m$ can not be decomposed in such a product by the definition of primality. So, if $m$ is prime, we will cycle through all of $m$ week days.
+- If $m$ is not prime, but $a$ is *coprime* with it (that is, $a$ and $m$ do not have common divisors), then the answer is still $m$ for the same reason: the divisors of $a$ do not help in zeroing out the product any faster.
+- If $a$ and $m$ share some divisors, then it is only possible to get residues that are also divisible by them. For example, if the week is $m = 10$ days long, and the year has $a = 42$ or any other even number of days, then we will cycle through all even day numbers, and if the number of days is a multiple of $5$, then we will only oscillate between $0$ and $5$. Otherwise, we will go through all the $10$ remainders.
+
+Therefore, in general, the answer is $\frac{m}{\gcd(a, m)}$, where $\gcd(a, m)$ is the [greatest common divisor](/hpc/algorithms/gcd/) of $a$ and $m$.
+
+### Fermat's Theorem
+
+Now, consider what happens if instead of adding a number $a$, we repeatedly multiply by it, that is, write numbers in the form $a^n \mod m$. Since these are all finite numbers there is going to be a cycle, but what will its length be? If $p$ is prime, it turns out, all of them.
+
+**Theorem.** $a^p \equiv a \pmod p$ for all $a$ that are not multiple of $p$.
+
+**Proof**. Let $P(x_1, x_2, \ldots, x_n) = \frac{k}{\prod (x_i!)}$ be the *multinomial coefficient*, that is, the number of times the element $a_1^{x_1} a_2^{x_2} \ldots a_n^{x_n}$ would appear after the expansion of $(a_1 + a_2 + \ldots + a_n)^k$. Then
+
+$$
+\begin{aligned}
+a^p &= (\underbrace{1+1+\ldots+1+1}_\text{$a$ times})^p &
+\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by defenition)}
+\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} \frac{p!}{x_1! x_2! \ldots x_a!} & \text{(which terms will not be divisible by $p$?)}
+\\\ &\equiv P(p, 0, \ldots, 0) + \ldots + P(0, 0, \ldots, p) & \text{(everything else will be canceled)}
+\\\ &= a
+\end{aligned}
+$$
+
+and then dividing by $a$ gives us the Fermat's theorem.
+
+Note that this is only true for prime $p$. Euler's theorem handles the case of arbitary $m$, and states that
+
+$$
+a^{\phi(m)} \equiv 1 \pmod m
+$$
+
+where $\phi(m)$ is called Euler's totient function and is equal to the number of residues of $m$ that is coprime with it. In particular case of when $m$ is prime, $\phi(p) = p - 1$ and we get Fermat's theorem, which is just a special case.
diff --git a/content/english/hpc/number-theory/montgomery.md b/content/english/hpc/number-theory/montgomery.md
index e784dfaf..233e355d 100644
--- a/content/english/hpc/number-theory/montgomery.md
+++ b/content/english/hpc/number-theory/montgomery.md
@@ -1,6 +1,6 @@
 ---
 title: Montgomery Multiplication
-weight: 2
+weight: 4
 ---
 
 When we talked about [integers](../integer) in general, we discussed how to perform division and modulo by multiplication, and, unsurprisingly, in modular arithmetic 90% of its time is spent calculating modulo. Apart from using the general tricks described in the previous article, there is another method specifically for modular arithmetic, called *Montgomery multiplication*.
@@ -79,6 +79,9 @@ Since $x < n \cdot n < r \cdot n$ (as $x$ is a product of multiplicatio) and $q
 Here is an equivalent C implementation for 64-bit integers:
 
 ```c++
+typedef unsigned long long u64;
+typedef __uint128_t u128;
+
 u64 reduce(u128 x) {
     u64 q = u64(x) * nr;
     u64 m = ((u128) q * n) >> 64;
@@ -134,7 +137,6 @@ Transforming a number into the space is just a multiplication inside the space o
 ### Complete Implementation
 
 ```c++
-// TODO fix me and prettify me
 struct montgomery {
     u64 n, nr;
     
@@ -148,6 +150,9 @@ struct montgomery {
         u64 q = u64(x) * nr;
         u64 m = ((u128) q * n) >> 64;
         u64 xhi = (x >> 64);
+        //cout << u64(x>>64) << " " << u64(x) << " " << q << endl;
+        //cout << u64(m>>64) << " " << u64(m) << endl;
+        //exit(0);
         if (xhi >= m)
             return (xhi - m);
         else
@@ -163,3 +168,8 @@ struct montgomery {
     }
 };
 ```
+
+```c++
+montgomery m(n);
+m.transform(x);
+```
\ No newline at end of file

From 013fd0109e05d69a4a30b5b86db74ffa6e784adf Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 27 Apr 2022 21:01:34 +0300
Subject: [PATCH 076/173] publish modular arithmetic

---
 content/english/hpc/number-theory/_index.md   |  1 -
 .../{inverse.md => euclid-extended.md}        | 32 ++-------
 .../hpc/number-theory/exponentiation.md       |  9 ++-
 content/english/hpc/number-theory/modular.md  | 66 ++++++++++++++-----
 .../english/hpc/number-theory/montgomery.md   |  1 +
 5 files changed, 65 insertions(+), 44 deletions(-)
 rename content/english/hpc/number-theory/{inverse.md => euclid-extended.md} (51%)
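
A note on the $\frac{m}{\gcd(a, m)}$ answer in modular.md: the formula can be checked against a direct simulation of the day sequence. The sketch below is only an illustration under stated assumptions; the function names are hypothetical and not part of this patch, and it assumes $a \ge 0$ and $1 \le m < 2^{30}$:

```c++
#include <cassert>
#include <vector>

// Hypothetical helper: greatest common divisor by the Euclidean algorithm.
int gcd_euclid(int a, int b) {
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Closed form from the article: the number of distinct day numbers is m / gcd(a, m).
int distinct_days(int a, int m) {
    return m / gcd_euclid(a, m);
}

// Brute force: start on day 0 and step by a until a day repeats.
int distinct_days_bruteforce(int a, int m) {
    std::vector<bool> seen(m, false);
    int step = a % m, d = 0, count = 0;
    while (!seen[d]) {
        seen[d] = true;
        count++;
        d = (d + step) % m;
    }
    return count;
}

// Compare the two on all small inputs.
void check_distinct_days() {
    for (int m = 1; m <= 40; m++)
        for (int a = 0; a <= 80; a++)
            assert(distinct_days(a, m) == distinct_days_bruteforce(a, m));
}
```

For the article's ten-day week, `distinct_days(42, 10)` gives 5 (the even day numbers) and `distinct_days(5, 10)` gives 2 (days 0 and 5).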

diff --git a/content/english/hpc/number-theory/_index.md b/content/english/hpc/number-theory/_index.md
index d532bcfd..bb6a8b3c 100644
--- a/content/english/hpc/number-theory/_index.md
+++ b/content/english/hpc/number-theory/_index.md
@@ -1,7 +1,6 @@
 ---
 title: Number Theory
 weight: 7
-draft: true
 ---
 
 In 1940, a British mathematician [G. H. Hardy](https://en.wikipedia.org/wiki/G._H._Hardy) published a famous essay titled "[A Mathematician's Apology](https://en.wikipedia.org/wiki/A_Mathematician%27s_Apology)" discussing the notion that mathematics should be pursued for its own sake rather than for the sake of its applications.
diff --git a/content/english/hpc/number-theory/inverse.md b/content/english/hpc/number-theory/euclid-extended.md
similarity index 51%
rename from content/english/hpc/number-theory/inverse.md
rename to content/english/hpc/number-theory/euclid-extended.md
index c0d9df08..ea01588c 100644
--- a/content/english/hpc/number-theory/inverse.md
+++ b/content/english/hpc/number-theory/euclid-extended.md
@@ -1,41 +1,21 @@
 ---
 title: Extended Euclidean Algorithm
 weight: 3
+draft: true
 ---
 
-### Primality Testing
 
-These theorems have a lot of applications. One of them is checking whether a number $n$ is prime or not faster than factoring it. You can pick any base $a$ at random and try to raise it to power $a^{p-1}$ modulo $n$ and check if it is $1$. Such base is called *witness*.
-
-Such probabilistic tests are therefore returning either "no" or "maybe." It may be the case that it just happened to be equal to $1$ but in fact $n$ is composite, in which case you need to repeat the test until you are okay with the false positive probability. Moreover, there exist carmichael numbers, which are composite numbers $n$ that satisfy $a^n \equiv 1 \pmod n$ for all $a$. These numbers are rare, but still [exist](https://oeis.org/A002997).
-
-Unless the input is provided by an adversary, the mistake probability will be low. This test is adequate for finding large primes: there are roughly $\frac{n}{\ln n}$ primes among the first $n$ numbers, which is another fact that we are not going to prove. These primes are distributed more or less evenly, so one can just pick a random number and check numbers in sequence, and after checking $O(\ln n)$ numbers one will probably be found.
-
-### Modular Division
-
-"Normal" operations also apply to residues: +, -, *. But there is an issue with division, because we can't just bluntly divide two numbers: $\frac{8}{2} = 4$, but $\frac{8 \\% 5 = 3}{2 \\% 5 = 2} \neq 4$.
-
-To perform division, we need to find an element that will behave itself like the reciprocal $\frac{1}{a} = a^{-1}$, and instead of "division" multiply by it. This element is called a *modular inverse*.
+If the modulus is not prime, we can still get by with calculating $\phi(m)$ and invoking Euler's theorem. But calculating $\phi(m)$ is as difficult as factoring $m$, which is not fast. There is a more general method.
 
-If the modulo is a prime number, then the solution is $a^{-1} \equiv a^{p-2}$, which follows directly from Fermat's theorem by dividing the equivalence by $a$:
+Note that this is only true for prime $p$. Euler's theorem handles the case of arbitrary $m$ and states that, for every $a$ coprime with $m$,
 
 $$
-a^p \equiv a \implies a^{p-1} \equiv 1 \implies a^{p-2} \equiv a^{-1}
+a^{\phi(m)} \equiv 1 \pmod m
 $$
 
-This means that $a^{p-2}$ "behaves" like $a^{-1}$ which is what we need.
-
-You can calculate $a^{p-2}$ in $O(\log p)$ time using binary exponentiation:
+where $\phi(m)$ is called [Euler's totient function](https://en.wikipedia.org/wiki/Euler%27s_totient_function) and is equal to the number of residues modulo $m$ that are coprime with it. In the particular case when $m = p$ is prime, $\phi(p) = p - 1$, and we get Fermat's theorem as a special case.
 
-```c++
-int inv(int x) {
-    return binpow(x, mod - 2);
-}
-```
-
-If the modulo is not prime, then we can still get by calculating $\phi(m)$ and invoking Euler's theorem. But calculating $\phi(m)$ is as difficult as factoring it, which is not fast. There is a more general method.
-
-### Extended Euclidean Algorithm
+---
 
 *Extended Euclidean algorithm* apart from finding $g = \gcd(a, b)$ also finds integers $x$ and $y$ such that
 
diff --git a/content/english/hpc/number-theory/exponentiation.md b/content/english/hpc/number-theory/exponentiation.md
index f82af3e6..68142c30 100644
--- a/content/english/hpc/number-theory/exponentiation.md
+++ b/content/english/hpc/number-theory/exponentiation.md
@@ -1,9 +1,16 @@
 ---
 title: Binary Exponentiation
 weight: 2
+draft: true
 ---
 
-### Binary Exponentiation
+You can calculate $a^{p-2}$ in $O(\log p)$ time using binary exponentiation:
+
+```c++
+int inv(int x) {
+    return binpow(x, mod - 2);
+}
+```
 
 To perform the Fermat test, we need to raise a number to power $n-1$, preferrably using less than $n-2$ modular multiplications. We can use the fact that multiplication is associative:
 
diff --git a/content/english/hpc/number-theory/modular.md b/content/english/hpc/number-theory/modular.md
index 92e0c687..b6045a3a 100644
--- a/content/english/hpc/number-theory/modular.md
+++ b/content/english/hpc/number-theory/modular.md
@@ -20,11 +20,13 @@ Computers usually store time as the number of seconds that have passed since the
 
 We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. For example, at the moment of writing, approximately 63882260594 seconds have passed since 0 AD.
 
-But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now and it's time to go to dinner, or that it's Thursday and so Subway's sub of the day is an Italian BMT. Instead of the whole timestamp, we use its *remainer* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones.
+But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now and it's time to go to dinner, or that it's Thursday, and so Subway's sub of the day is an Italian BMT. Instead of the whole timestamp, we use its *remainder* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones.
 
-**Problem.** Today is Thursday. What day of the week it will be exactly in a year?
+**Problem.** Today is Thursday. What day of the week will it be in exactly a year?
 
-If we enumerate each day of the week starting with Monday from $0$ to $6$ inclusive, Thursday gets number $3$. To find out what day it is going to be in a year from now, we need to add $365$ to it and then reduce modulo $7$. Conveniently, $365 \bmod 7 = 1$, so we know that it will be Friday unless it is a leap year (in which case it will be Saturday).
+If we enumerate each day of the week, starting with Monday, from $0$ to $6$ inclusive, Thursday gets number $3$. To find out what day it is going to be in a year from now, we need to add $365$ to it and then reduce modulo $7$. Conveniently, $365 \bmod 7 = 1$, so we know that it will be Friday unless it is a leap year (in which case it will be Saturday).
+
+### Residues
 
 **Definition.** Two integers $a$ and $b$ are said to be *congruent* modulo $m$ if $m$ divides their difference:
 
@@ -32,9 +34,9 @@ $$
 m \mid (a - b) \; \Longleftrightarrow \; a \equiv b \pmod m
 $$
 
-For example, day 42 of the year is 161 119 = 17 \times 7.
+For example, the 42nd day of the year is the same weekday as the 161st since $(161 - 42) = 119 = 17 \times 7$.
 
-Congruence modulo $m$ is an equivalence relation, which splits all integers into equivalence classes, called *residues*. Each residue class modulo $m$ may be represented by any one of its members — although we commonly use the smallest nonnegative integer of that class (equal to the remainder $x \bmod m$ for all nonnegative $x$).
+Congruence modulo $m$ is an equivalence relation that splits all integers into equivalence classes called *residues*. Each residue class modulo $m$ may be represented by any one of its members — although we commonly use the smallest nonnegative integer of that class (equal to the remainder $x \bmod m$ for all nonnegative $x$).
 
 <!--
 
@@ -50,7 +52,7 @@ Here are a few example of how this can be useful.
 
 **Problem.** Our "week" now consists of $m$ days, and our year consists of $a$ days (no leap years). How many distinct days of the week there will be among one, two, three and so on whole years from now?
 
-For simplicity, assume that today is Monday, so that the initial day number $d_0$ is zero and after each year, it changes to
+For simplicity, assume that today is Monday, so that the initial day number $d_0$ is zero, and after each year, it changes to
 
 $$
 d_{k + 1} = (d_k + a) \bmod m
@@ -62,15 +64,15 @@ $$
 d_k = k \cdot a \bmod m
 $$
 
-Since there are only $m$ days in a week, at some point it will be Monday again, and the sequence of day numbers is going to cycle. The number of distinct days is the length of this cycle, so we need to find the smallest $k$ such that
+Since there are only $m$ days in a week, at some point, it will be Monday again, and the sequence of day numbers is going to cycle. The number of distinct days is the length of this cycle, so we need to find the smallest $k$ such that
 
 $$
 k \cdot a \equiv 0 \pmod m
 $$
 
-First of all, if $a \equiv 0$, it will be ethernal Monday. Now, assuming the non-trivial case of $a \not \equiv 0$:
+First of all, if $a \equiv 0$, it will be eternal Monday. Now, assuming the non-trivial case of $a \not \equiv 0$:
 
-- For a seven-day week, $m = 7$ is prime. There is no $k$ smaller than $m$ such that $k \cdot a$ is divisible by $m$ because $m$ can not be decomposed in such a product by the definition of primality. So, if $m$ is prime, we will cycle through all of $m$ week days.
+- For a seven-day week, $m = 7$ is prime. There is no $k$ smaller than $m$ such that $k \cdot a$ is divisible by $m$ because $m$ can not be decomposed in such a product by the definition of primality. So, if $m$ is prime, we will cycle through all of $m$ weekdays.
 - If $m$ is not prime, but $a$ is *coprime* with it (that is, $a$ and $m$ do not have common divisors), then the answer is still $m$ for the same reason: the divisors of $a$ do not help in zeroing out the product any faster.
 - If $a$ and $m$ share some divisors, then it is only possible to get residues that are also divisible by them. For example, if the week is $m = 10$ days long, and the year has $a = 42$ or any other even number of days, then we will cycle through all even day numbers, and if the number of days is a multiple of $5$, then we will only oscillate between $0$ and $5$. Otherwise, we will go through all the $10$ remainders.
 
@@ -78,11 +80,21 @@ Therefore, in general, the answer is $\frac{m}{\gcd(a, m)}$, where $\gcd(a, m)$
 
 ### Fermat's Theorem
 
-Now, consider what happens if instead of adding a number $a$, we repeatedly multiply by it, that is, write numbers in the form $a^n \mod m$. Since these are all finite numbers there is going to be a cycle, but what will its length be? If $p$ is prime, it turns out, all of them.
+Now, consider what happens if, instead of adding a number $a$, we repeatedly multiply by it, writing out a sequence of
 
-**Theorem.** $a^p \equiv a \pmod p$ for all $a$ that are not multiple of $p$.
+$$
+d_n = a^n \bmod m
+$$
 
-**Proof**. Let $P(x_1, x_2, \ldots, x_n) = \frac{k}{\prod (x_i!)}$ be the *multinomial coefficient*, that is, the number of times the element $a_1^{x_1} a_2^{x_2} \ldots a_n^{x_n}$ would appear after the expansion of $(a_1 + a_2 + \ldots + a_n)^k$. Then
+Again, since there is a finite number of residues, there is going to be a cycle. But what will its length be? It turns out that if $m$ is prime and $a$ is not a multiple of $m$, the cycle length always divides $m - 1$.
+
+**Theorem.** For any $a$ and a prime $p$:
+
+$$
+a^p \equiv a \pmod p
+$$
+
+**Proof**. Let $P(x_1, x_2, \ldots, x_n) = \frac{k!}{\prod (x_i!)}$ be the *multinomial coefficient:* the number of times the element $a_1^{x_1} a_2^{x_2} \ldots a_n^{x_n}$ appears after the expansion of $(a_1 + a_2 + \ldots + a_n)^k$. Then:
 
 $$
 \begin{aligned}
@@ -94,12 +106,34 @@ a^p &= (\underbrace{1+1+\ldots+1+1}_\text{$a$ times})^p &
 \end{aligned}
 $$
 
-and then dividing by $a$ gives us the Fermat's theorem.
+Note that this is only true for prime $p$. We can use this fact to test whether a given number is prime faster than by factoring it: we can pick a number $a$ at random, calculate $a^{p} \bmod p$, and check whether it is equal to $a$ or not.
+
+This is called the *Fermat primality test*, and it is probabilistic (only returning either "no" or "maybe") since it may be that $a^p$ just happened to be congruent to $a$ despite $p$ being composite, in which case you need to repeat the test with a different random $a$ until you are satisfied with the false positive probability.
+
+Primality tests are commonly used to generate large primes (for cryptographic purposes). There are roughly $\frac{n}{\ln n}$ primes among the first $n$ numbers (a fact that we are not going to prove), and they are distributed more or less evenly. One can just pick a random number from the required range, perform a primality check, and repeat until a prime is found, performing $O(\ln n)$ trials on average.
+
+Extremely bad inputs to the Fermat test are the [Carmichael numbers](https://en.wikipedia.org/wiki/Carmichael_number), which are composite numbers $n$ that satisfy $a^{n-1} \equiv 1 \pmod n$ for all $a$ relatively prime to $n$. But these are [rare](https://oeis.org/A002997), and the chance of randomly bumping into one is low.
+
+### Modular Division
+
+Implementing most "normal" arithmetic operations with residues is straightforward. You only need to take care of integer overflows and remember to take modulo:
+
+```c++
+c = (a + b) % m;
+c = (a - b + m) % m;
+c = a * b % m;
+```
+
+But there is an issue with division: we can't just bluntly divide two residues. For example, $\frac{8}{2} = 4$, but
+
+$$
+\frac{8 \bmod 5}{2 \bmod 5} = \frac{3}{2} \neq 4
+$$
 
-Note that this is only true for prime $p$. Euler's theorem handles the case of arbitary $m$, and states that
+To perform modular division, we need to find an element that "acts" like the reciprocal $\frac{1}{a} = a^{-1}$ and multiply by it. This element is called a *modular multiplicative inverse*, and Fermat's theorem can help us find it when the modulo $p$ is prime and $a$ is not a multiple of $p$. Dividing both sides of its congruence by $a$ twice, we get:
 
 $$
-a^{\phi(m)} \equiv 1 \pmod m
+a^p \equiv a \implies a^{p-1} \equiv 1 \implies a^{p-2} \equiv a^{-1}
 $$
 
-where $\phi(m)$ is called Euler's totient function and is equal to the number of residues of $m$ that is coprime with it. In particular case of when $m$ is prime, $\phi(p) = p - 1$ and we get Fermat's theorem, which is just a special case.
+Therefore, $a^{p-2}$ is like $a^{-1}$ for the purposes of multiplication, which is what we need from a modular inverse of $a$.
diff --git a/content/english/hpc/number-theory/montgomery.md b/content/english/hpc/number-theory/montgomery.md
index 233e355d..0ede37e5 100644
--- a/content/english/hpc/number-theory/montgomery.md
+++ b/content/english/hpc/number-theory/montgomery.md
@@ -1,6 +1,7 @@
 ---
 title: Montgomery Multiplication
 weight: 4
+draft: true
 ---
 
 When we talked about [integers](../integer) in general, we discussed how to perform division and modulo by multiplication, and, unsurprisingly, in modular arithmetic 90% of its time is spent calculating modulo. Apart from using the general tricks described in the previous article, there is another method specifically for modular arithmetic, called *Montgomery multiplication*.

From 056f8e4c9650449fc999432e540b50c668fe9052 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Sun, 1 May 2022 04:19:10 +0300
Subject: [PATCH 077/173] binpow code

---
 .../hpc/number-theory/euclid-extended.md      |   2 +
 .../hpc/number-theory/exponentiation.md       | 106 +++++++++++++-----
 content/english/hpc/number-theory/modular.md  |   2 +-
 content/russian/cs/algebra/binpow.md          |   2 +-
 4 files changed, 82 insertions(+), 30 deletions(-)

diff --git a/content/english/hpc/number-theory/euclid-extended.md b/content/english/hpc/number-theory/euclid-extended.md
index ea01588c..7d7dfeab 100644
--- a/content/english/hpc/number-theory/euclid-extended.md
+++ b/content/english/hpc/number-theory/euclid-extended.md
@@ -91,6 +91,8 @@ int inverse(int a) {
 
 134.33
 
+250 (there is probably a worse input if we change modulo or aternate between inputs to make )
+
 Another application is the exact division modulo $2^k$.
 
 **Exercise**. Try to adapt the technique for binary GCD.
diff --git a/content/english/hpc/number-theory/exponentiation.md b/content/english/hpc/number-theory/exponentiation.md
index 68142c30..ef5f6563 100644
--- a/content/english/hpc/number-theory/exponentiation.md
+++ b/content/english/hpc/number-theory/exponentiation.md
@@ -4,23 +4,77 @@ weight: 2
 draft: true
 ---
 
-You can calculate $a^{p-2}$ in $O(\log p)$ time using binary exponentiation:
+In modular arithmetic and computational algebra in general, you often need to raise a number to the $n$-th power — to do [modular division](../modular/#modular-division), perform [primality tests](../modular/#fermats-theorem), or compute some combinatorial values — ­and you usually want to spend fewer than $\Theta(n)$ operations calculating it.
+
+*Binary exponentiation*, also known as *exponentiation by squaring*, is a method that allows for computation of the $n$-th power using $O(\log n)$ multiplications, relying on the following observation:
+
+$$
+\begin{aligned}
+    a^{2k}       &= (a^k)^2
+\\  a^{2k + 1}   &= (a^k)^2 \cdot a
+\end{aligned}
+$$
+
+To compute $a^n$, we can recursively compute $a^{\lfloor n / 2 \rfloor}$, square it, and then optionally multiply by $a$ if $n$ is odd, corresponding to the following recurrence:
+
+$$
+a^n = f(a, n) = \begin{cases}
+   1,               && n = 0
+\\ f(a, \frac{n}{2})^2,     && 2 \mid n
+\\ f(a, n - 1) \cdot a, && 2 \nmid n
+\end{cases}
+$$
+
+Since $n$ is at least halved every two recursive transitions, the depth of this recurrence and the total number of multiplications will be at most $O(\log n)$.
+
+### Implementation
+
+Since we already have a recurrence, it is natural to implement the algorithm as a case matching recursive function:
 
 ```c++
-int inv(int x) {
-    return binpow(x, mod - 2);
+const int M = 1e9 + 7; // modulo
+typedef unsigned long long u64;
+
+u64 binpow(u64 a, u64 n) {
+    if (n == 0)
+        return 1;
+    if (n % 2 == 1)
+        return binpow(a, n - 1) * a % M;
+    else {
+        u64 b = binpow(a, n / 2) % M;
+        return b * b % m;
+    }
 }
 ```
 
-To perform the Fermat test, we need to raise a number to power $n-1$, preferrably using less than $n-2$ modular multiplications. We can use the fact that multiplication is associative:
+Since $m$ is a compile-time constant, the compiler can optimize the modulo by [replacing it with multiplication](/hpc/arithmetic/division/) (even if it is not a compile-time constant, it is still cheaper to compute the magic constants by hand once).
+
+The execution path and hence the running time depends on the value of $n$. In our benchmark, we use $m = 10^9+7$ and $n = m - 2$ so that we compute the [multiplicative inverse](../modular/#modular-division) of $a$ modulo $m$. This modulo is commonly used in competitive programming to calculate checksums in combinatorial problems because it is prime (allowing inverse via binary exponentiation), sufficiently large, doesn't overflow `int` for addition, doesn't overflow `long long` for multiplication, and is easy to type as `1e9 + 7`.
+
+```c++
+u64 inverse(u64 a) {
+    return binpow(a, M - 2);
+}
+```
+
+For this particular $n$, the recursive implementation takes around 400ns per call.
+
+As recursion introduces some [overhead](/hpc/architecture/functions/)
+
+Эта реализация рекурсивная, что работает долго. Попробуем «развернуть» рекурсию и получить итеративную.
+
+Рассмотрим двоичное представление числа $n$. Результат $a^n$ можно представить как произведение $a$ в степенях каких-то степеней двоек. Например, если $n = 42 = 32 + 8 + 2$, то
 
 $$
-\begin{aligned}
-    a^{2k}       &= (a^k)^2
-\\  a^{2k + 1} &= (a^k)^2 \cdot a
-\end{aligned}
+a^{42} = a^{32+8+2} = a^{32} \cdot a^8 \cdot a^2 
 $$
 
+Чтобы посчитать это произведение итеративно, пройдемся по всем битам числа $n$, поддерживая две переменные: непосредственно результат и текущее значение $a^{2^k}$, где $k$ это номер текущей итерации. На каждом шаге будем домножать $a^{2^k}$ на текущий результат, если $k$-тый бит числа $n$ единичный, и в любом случае возводить её в квадрат, получая $a^{2^k \cdot 2} = a^{2^{k+1}}$ для следующей итерации.
+
+Стоит отметить, что во многих языках программирования бинарное возведение в степень уже реализовано. Но не в C++: функция `pow` из стандартной библиотеки реализована только для действительных чисел и использует приближенные методы, и поэтому не дает точных результатов для целочисленных аргументов.
+
+
+
 We essentially group it like this:
 
 $$
@@ -29,16 +83,25 @@ $$
 
 This allows using only $O(\log n)$ operations (or, more specifically, at most $2 \cdot \log_2 n$ modular multiplications).
 
+
+
+You can calculate $a^{p-2}$ in $O(\log p)$ time using binary exponentiation:
+
+
+To perform the Fermat test, we need to raise a number to power $n-1$, preferrably using less than $n-2$ modular multiplications.
+
 ```c++
-int binpow(int a, int n) {
-    int res = 1;
+u64 binpow(u64 a, u64 n) {
+    u64 r = 1;
+    
     while (n) {
         if (n & 1)
-            res = res * a % mod;
-        a = a * a % mod;
+            r = r * a % M;
+        a = a * a % M;
         n >>= 1;
     }
-    return res;
+    
+    return r;
 }
 ```
 
@@ -47,8 +110,8 @@ int binpow(int a, int n) {
 This helps if `n` or `mod` is a constant.
 
 ```c++
-int inverse(int _a) {
-    long long a = _a, r = 1;
+u64 inverse(u64 a) {
+    u64 r = 1;
     
     #pragma GCC unroll(30)
     for (int l = 0; l < 30; l++) {
@@ -62,16 +125,3 @@ int inverse(int _a) {
 ```
 
 171.68
-
-
-Несколько причин:
-
-Это выражение довольно легко вбивать (1e9+7).
-Простое число.
-Достаточно большое.
-int не переполняется при сложении.
-long long не переполняется при умножении.
-Кстати, 10^9 + 910 
-9
- +9 обладает всеми теми же свойствами. Иногда используют и его.
-
diff --git a/content/english/hpc/number-theory/modular.md b/content/english/hpc/number-theory/modular.md
index b6045a3a..02a84ece 100644
--- a/content/english/hpc/number-theory/modular.md
+++ b/content/english/hpc/number-theory/modular.md
@@ -1,6 +1,6 @@
 ---
 title: Modular Arithmetic
-weight: -1
+weight: 1
 ---
 
 
diff --git a/content/russian/cs/algebra/binpow.md b/content/russian/cs/algebra/binpow.md
index 5c7d2d43..4126061d 100644
--- a/content/russian/cs/algebra/binpow.md
+++ b/content/russian/cs/algebra/binpow.md
@@ -6,7 +6,7 @@ authors:
 weight: -10
 ---
 
-*Бинарное возведение в степень* — приём, позволяющий возводить любое число в $n$-ую степень за $O(\log n)$ умножений (вместо n умножений при обычном подходе).
+*Бинарное возведение в степень* — приём, позволяющий возводить любое число в $n$-ую степень за $O(\log n)$ умножений (вместо $n$ умножений при обычном подходе).
 
 ## Основная идея
 

From 3364b7c4720f41467cdc14ae7e9f0f747793e402 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Sun, 1 May 2022 05:13:05 +0300
Subject: [PATCH 078/173] binpow edits

---
 .../hpc/number-theory/exponentiation.md       | 47 +++++--------------
 1 file changed, 13 insertions(+), 34 deletions(-)

diff --git a/content/english/hpc/number-theory/exponentiation.md b/content/english/hpc/number-theory/exponentiation.md
index ef5f6563..199dbb10 100644
--- a/content/english/hpc/number-theory/exponentiation.md
+++ b/content/english/hpc/number-theory/exponentiation.md
@@ -27,9 +27,9 @@ $$
 
 Since $n$ is at least halved every two recursive transitions, the depth of this recurrence and the total number of multiplications will be at most $O(\log n)$.
 
-### Implementation
+### Recursive Implementation
 
-Since we already have a recurrence, it is natural to implement the algorithm as a case matching recursive function:
+As we already have a recurrence, it is natural to implement the algorithm as a case matching recursive function:
 
 ```c++
 const int M = 1e9 + 7; // modulo
@@ -41,15 +41,13 @@ u64 binpow(u64 a, u64 n) {
     if (n % 2 == 1)
         return binpow(a, n - 1) * a % M;
     else {
-        u64 b = binpow(a, n / 2) % M;
-        return b * b % m;
+        u64 b = binpow(a, n / 2);
+        return b * b % M;
     }
 }
 ```
 
-Since $m$ is a compile-time constant, the compiler can optimize the modulo by [replacing it with multiplication](/hpc/arithmetic/division/) (even if it is not a compile-time constant, it is still cheaper to compute the magic constants by hand once).
-
-The execution path and hence the running time depends on the value of $n$. In our benchmark, we use $m = 10^9+7$ and $n = m - 2$ so that we compute the [multiplicative inverse](../modular/#modular-division) of $a$ modulo $m$. This modulo is commonly used in competitive programming to calculate checksums in combinatorial problems because it is prime (allowing inverse via binary exponentiation), sufficiently large, doesn't overflow `int` for addition, doesn't overflow `long long` for multiplication, and is easy to type as `1e9 + 7`.
+In our benchmark, we use $n = m - 2$ so that we compute the [multiplicative inverse](../modular/#modular-division) of $a$ modulo $m$:
 
 ```c++
 u64 inverse(u64 a) {
@@ -57,38 +55,19 @@ u64 inverse(u64 a) {
 }
 ```
 
-For this particular $n$, the recursive implementation takes around 400ns per call.
+We use $m = 10^9+7$, which is a modulo value commonly used in *competitive programming* to calculate checksums in combinatorial problems because it is prime (allowing inverse via binary exponentiation), sufficiently large while not overflowing `int` in addition or `long long` in multiplication, and easy to type as `1e9 + 7`. Since it is a compile-time constant, the compiler can optimize the modulo by [replacing it with multiplication](/hpc/arithmetic/division/) (even if it is not a compile-time constant, it is still cheaper to compute the magic constants by hand once and use them for fast reduction).
 
-As recursion introduces some [overhead](/hpc/architecture/functions/)
+The execution path, and hence the running time, depends on the value of $n$. For this particular $n$, the baseline implementation takes around 330ns per call. As recursion introduces some [overhead](/hpc/architecture/functions/), it makes sense to unroll the implementation into an iterative procedure.
 
-Эта реализация рекурсивная, что работает долго. Попробуем «развернуть» рекурсию и получить итеративную.
+### Iterative Implementation
 
-Рассмотрим двоичное представление числа $n$. Результат $a^n$ можно представить как произведение $a$ в степенях каких-то степеней двоек. Например, если $n = 42 = 32 + 8 + 2$, то
+The result of $a^n$ can be represented as the product of $a$ to some powers of two — those that correspond to 1s in the binary representation of $n$. For example, if $n = 42 = 32 + 8 + 2$, then
 
 $$
 a^{42} = a^{32+8+2} = a^{32} \cdot a^8 \cdot a^2 
 $$
 
-Чтобы посчитать это произведение итеративно, пройдемся по всем битам числа $n$, поддерживая две переменные: непосредственно результат и текущее значение $a^{2^k}$, где $k$ это номер текущей итерации. На каждом шаге будем домножать $a^{2^k}$ на текущий результат, если $k$-тый бит числа $n$ единичный, и в любом случае возводить её в квадрат, получая $a^{2^k \cdot 2} = a^{2^{k+1}}$ для следующей итерации.
-
-Стоит отметить, что во многих языках программирования бинарное возведение в степень уже реализовано. Но не в C++: функция `pow` из стандартной библиотеки реализована только для действительных чисел и использует приближенные методы, и поэтому не дает точных результатов для целочисленных аргументов.
-
-
-
-We essentially group it like this:
-
-$$
-a^8 = (aaaa) \cdot (aaaa) = ((aa)(aa))((aa)(aa))
-$$
-
-This allows using only $O(\log n)$ operations (or, more specifically, at most $2 \cdot \log_2 n$ modular multiplications).
-
-
-
-You can calculate $a^{p-2}$ in $O(\log p)$ time using binary exponentiation:
-
-
-To perform the Fermat test, we need to raise a number to power $n-1$, preferrably using less than $n-2$ modular multiplications.
+To calculate this product, we can iterate over the bits of $n$, maintaining two variables: the value of $a^{2^k}$ and the current product after considering the $k$ lowest bits. On each step, we multiply the current product by $a^{2^k}$ if the $k$-th bit of $n$ is set, and, in either case, square $a^{2^k}$ to get $a^{2^k \cdot 2} = a^{2^{k+1}}$, which will be used on the next iteration.
 
 ```c++
 u64 binpow(u64 a, u64 n) {
@@ -105,9 +84,9 @@ u64 binpow(u64 a, u64 n) {
 }
 ```
 
-179.64
+The iterative implementation takes about 180ns per call. The heavy calculations are the same; the improvement mainly comes from the reduced dependency chain: `a = a * a % M` needs to finish before the loop can proceed, and it can now execute concurrently with `r = r * a % M`.
 
-This helps if `n` or `mod` is a constant.
+The performance also benefits from $n$ being a constant, [making all branches predictable](/hpc/pipelining/branching/) and letting the scheduler know what needs to be executed in advance. The compiler, however, does not take advantage of it and does not unroll the `while(n) n >>= 1` loop. We can rewrite it as a `for` loop that takes a constant 30 iterations:
 
 ```c++
 u64 inverse(u64 a) {
@@ -124,4 +103,4 @@ u64 inverse(u64 a) {
 }
 ```
 
-171.68
+This forces the compiler to generate only the instructions we need, shaving off another 10ns and making the total running time ~170ns.

From d8d891f593a5c30152962c0fe5b6561ad4a7b39e Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Sun, 1 May 2022 05:13:36 +0300
Subject: [PATCH 079/173] publish binpow

---
 content/english/hpc/number-theory/exponentiation.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/content/english/hpc/number-theory/exponentiation.md b/content/english/hpc/number-theory/exponentiation.md
index 199dbb10..fe9b72df 100644
--- a/content/english/hpc/number-theory/exponentiation.md
+++ b/content/english/hpc/number-theory/exponentiation.md
@@ -1,7 +1,6 @@
 ---
 title: Binary Exponentiation
 weight: 2
-draft: true
 ---
 
 In modular arithmetic and computational algebra in general, you often need to raise a number to the $n$-th power — to do [modular division](../modular/#modular-division), perform [primality tests](../modular/#fermats-theorem), or compute some combinatorial values — ­and you usually want to spend fewer than $\Theta(n)$ operations calculating it.

From 65592f0068af03389e2ec60505a6fdad21fcb49a Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Sun, 1 May 2022 05:55:41 +0300
Subject: [PATCH 080/173] extended euclidean

---
 .../hpc/number-theory/euclid-extended.md      | 42 ++++++++++---------
 1 file changed, 22 insertions(+), 20 deletions(-)

diff --git a/content/english/hpc/number-theory/euclid-extended.md b/content/english/hpc/number-theory/euclid-extended.md
index 7d7dfeab..54fa0b1e 100644
--- a/content/english/hpc/number-theory/euclid-extended.md
+++ b/content/english/hpc/number-theory/euclid-extended.md
@@ -1,23 +1,21 @@
 ---
 title: Extended Euclidean Algorithm
 weight: 3
-draft: true
 ---
 
-
-If the modulo is not prime, then we can still get by calculating $\phi(m)$ and invoking Euler's theorem. But calculating $\phi(m)$ is as difficult as factoring it, which is not fast. There is a more general method.
-
-Note that this is only true for prime $p$. Euler's theorem handles the case of arbitary $m$, and states that
+[Fermat’s theorem](../modular/#fermats-theorem) allows us to calculate modular multiplicative inverses through [binary exponentiation](../exponentiation/) in $O(\log n)$ operations, but it only works with prime moduli. There is a generalization of it, [Euler's theorem](https://en.wikipedia.org/wiki/Euler%27s_theorem), stating that if $m$ and $a$ are coprime, then
 
 $$
 a^{\phi(m)} \equiv 1 \pmod m
 $$
 
-where $\phi(m)$ is called [Euler's totient function](https://en.wikipedia.org/wiki/Euler%27s_totient_function) and is equal to the number of residues of $m$ that is coprime with it. In particular case of when $m$ is prime, $\phi(p) = p - 1$ and we get Fermat's theorem, which is just a special case.
+where $\phi(m)$ is [Euler's totient function](https://en.wikipedia.org/wiki/Euler%27s_totient_function) defined as the number of positive integers $x < m$ that are coprime with $m$. In the particular case when $m$ is prime, all the $m - 1$ nonzero residues are coprime with it, so $\phi(m) = m - 1$, yielding Fermat's theorem.
 
----
+This lets us calculate the inverse of $a$ as $a^{\phi(m) - 1}$ if we know $\phi(m)$, but calculating it is, in turn, not so fast: you usually need to obtain the factorization of $m$. There is a more general method that works by modifying [the Euclidean algorithm](/hpc/algorithms/gcd/).
 
-*Extended Euclidean algorithm* apart from finding $g = \gcd(a, b)$ also finds integers $x$ and $y$ such that
+### Algorithm
+
+*Extended Euclidean algorithm*, apart from finding $g = \gcd(a, b)$, also finds integers $x$ and $y$ such that
 
 $$
 a \cdot x + b \cdot y = g
@@ -29,27 +27,31 @@ $$
 a^{-1} \cdot a + k \cdot m = 1
 $$
 
-Note that if $a$ is not coprime with $m$, then there will be no solution. We can still find *some* element, but it will not work for any dividend.
+Note that, if $a$ is not coprime with $m$, there is no solution since no integer combination of $a$ and $m$ can yield anything that is not a multiple of their greatest common divisor.
 
-The algorithm is also recursive. It makes a recursive call, calculates the coefficients $x'$ and $y'$ for $\gcd(b, a \bmod b)$, and restores the general solution. If we have a solution $(x', y')$ for pair $(b, a \bmod b)$:
+The algorithm is also recursive: it calculates the coefficients $x'$ and $y'$ for $\gcd(b, a \bmod b)$ and restores the solution for the original number pair. If we have a solution $(x', y')$ for the pair $(b, a \bmod b)$
 
 $$
 b \cdot x' + (a \bmod b) \cdot y' = g
 $$
 
-To get the solution for the initial input, rewrite the expression $(a \bmod b)$ as $(a - \lfloor \frac{a}{b} \rfloor \cdot b)$ and subsitute it into the aforementioned equality:
+then, to get the solution for the initial input, we can rewrite the expression $(a \bmod b)$ as $(a - \lfloor \frac{a}{b} \rfloor \cdot b)$ and substitute it into the aforementioned equation:
 
 $$
 b \cdot x' + (a - \Big \lfloor \frac{a}{b} \Big \rfloor \cdot b) \cdot y' = g
 $$
 
-Now let's rearrange the terms (grouping by $a$ and $b$) to get
+Now we rearrange the terms grouping by $a$ and $b$ to get
 
 $$
 a \cdot \underbrace{y'}_x + b \cdot \underbrace{(x' - \Big \lfloor \frac{a}{b} \Big \rfloor \cdot y')}_y = g
 $$
 
-Comparing it with initial expression, we infer that we can just use coefficients by $a$ and $b$ for the initial $x$ and $y$.
+Comparing it with the initial expression, we infer that we can just use coefficients of $a$ and $b$ for the initial $x$ and $y$.
+
+### Implementation
+
+We implement the algorithm as a recursive function. Since its output is not one but three integers, we pass the coefficients to it by reference:
 
 ```c++
 int gcd(int a, int b, int &x, int &y) {
@@ -64,7 +66,11 @@ int gcd(int a, int b, int &x, int &y) {
     y = x1;
     return d;
 }
+```
+
+To calculate the inverse, we simply pass $a$ and $m$ and return the $x$ coefficient the algorithm finds. Since we pass two positive numbers, one of the coefficients will be positive and the other one negative (which one depends on whether the number of iterations is odd or even), so we need to check if $x$ is negative and, if so, add $m$ to get a correct residue:
 
+```c++
 int inverse(int a) {
     int x, y;
     gcd(a, M, x, y);
@@ -74,7 +80,7 @@ int inverse(int a) {
 }
 ```
 
-159.28
+It works in ~160ns, which is 10ns faster than inverting numbers with [binary exponentiation](../exponentiation). To optimize it further, we can similarly turn it iterative, which takes 135ns:
 
 ```c++
 int inverse(int a) {
@@ -89,10 +95,6 @@ int inverse(int a) {
 }
 ```
 
-134.33
-
-250 (there is probably a worse input if we change modulo or aternate between inputs to make )
-
-Another application is the exact division modulo $2^k$.
+Note that, unlike binary exponentiation, the running time depends on the value of $a$. For example, for this particular value of $m$ ($10^9 + 7$), the worst input happens to be 564400443, on which the algorithm performs 37 iterations and takes 250ns.
 
-**Exercise**. Try to adapt the technique for binary GCD.
+**Exercise**. Try to adapt the same technique for binary GCD (it won't give a performance speedup, though, unless you are better than me at optimization).

From 943c110aab72fda174d3b7b0b04ca8df897f373b Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Sun, 1 May 2022 05:55:55 +0300
Subject: [PATCH 081/173] add number theory disclaimer

---
 content/english/hpc/number-theory/_index.md         | 2 ++
 content/english/hpc/number-theory/exponentiation.md | 2 ++
 content/english/hpc/number-theory/modular.md        | 1 +
 3 files changed, 5 insertions(+)

diff --git a/content/english/hpc/number-theory/_index.md b/content/english/hpc/number-theory/_index.md
index bb6a8b3c..6812e14c 100644
--- a/content/english/hpc/number-theory/_index.md
+++ b/content/english/hpc/number-theory/_index.md
@@ -3,6 +3,8 @@ title: Number Theory
 weight: 7
 ---
 
+*Disclaimer: this chapter is a very early draft that is probably not worth reading yet.*
+
 In 1940, a British mathematician [G. H. Hardy](https://en.wikipedia.org/wiki/G._H._Hardy) published a famous essay titled "[A Mathematician's Apology](https://en.wikipedia.org/wiki/A_Mathematician%27s_Apology)" discussing the notion that mathematics should be pursued for its own sake rather than for the sake of its applications.
 
 I personally don't agree — and I wrote this book partially to show that there are way too few people working on practical algorithm design instead of theoretical computer science — but I understand where Hardy is coming from. Being 62 years old, he witnessed the devastation caused by the First and the ongoing Second World War that was greatly amplified by the weaponization of science.
diff --git a/content/english/hpc/number-theory/exponentiation.md b/content/english/hpc/number-theory/exponentiation.md
index fe9b72df..b5964463 100644
--- a/content/english/hpc/number-theory/exponentiation.md
+++ b/content/english/hpc/number-theory/exponentiation.md
@@ -103,3 +103,5 @@ u64 inverse(u64 a) {
 ```
 
 This forces the compiler to generate only the instructions we need, shoving off another 10ns and making the total running time ~170ns.
+
+Note that the performance depends not only on the binary length of $n$, but also on the number of binary 1s. If $n$ is $2^{30}$, it takes around 20ns less not having to perform these off-path multiplications.
diff --git a/content/english/hpc/number-theory/modular.md b/content/english/hpc/number-theory/modular.md
index 02a84ece..2f90bd95 100644
--- a/content/english/hpc/number-theory/modular.md
+++ b/content/english/hpc/number-theory/modular.md
@@ -3,6 +3,7 @@ title: Modular Arithmetic
 weight: 1
 ---
 
+TODO: use it in binary exponentiation.
 
 <!--
 

From 185f738ee3981a7153a1539125b4f2340449f6b2 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 5 May 2022 14:58:53 +0300
Subject: [PATCH 082/173] fix scanline bug

---
 content/russian/cs/decomposition/scanline.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/content/russian/cs/decomposition/scanline.md b/content/russian/cs/decomposition/scanline.md
index 6ea7e2e7..4c9bcdf0 100644
--- a/content/russian/cs/decomposition/scanline.md
+++ b/content/russian/cs/decomposition/scanline.md
@@ -8,7 +8,7 @@ prerequisites:
 weight: 1
 ---
 
-Метод сканирующей прямой (англ. *scanline*) заключается в сортировке точек или каких-то абстрактных *событий* (англ. *event*) и последующему проходу по ним.
+Метод сканирующей прямой (англ. *scanline*) заключается в сортировке точек на координатной прямой либо каких-то абстрактных «событий» по какому-то признаку и последующем проходе по ним.
 
 Он часто используется для решения задач на структуры данных, когда все запросы известны заранее, а также в геометрии для нахождения объединений фигур.
 
@@ -62,15 +62,15 @@ int scanline(vector<pair<int, int>> segments) {
 
 **Задача.** Дан набор из $n$ отрезков на прямой, заданных координатами начал и концов $[l_i, r_i]$. Требуется найти суммарную длину их объединения.
 
-Как и в прошлой задаче, отсортируем интересные точки и при проходе будем поддерживать число отрезков, покрывающих данную точку. Если оно больше 0, то отрезок который мы прошли с прошлой рассмотренной точки принадлежит объединению, и его длину нужно прибавить к ответу:
+Как и в прошлой задаче, отсортируем все интересные точки и при проходе будем поддерживать число отрезков, покрывающих текущую точку. Если оно больше 0, то отрезок, который мы прошли с прошлой рассмотренной точки, принадлежит объединению, и его длину нужно прибавить к ответу:
 
 ```cpp
 int cnt = 0, res = 0, prev = -inf;
 
 for (event e : events) {
-    cnt += e.type;
     if (prev != -inf && cnt > 0)
-        res += prev - e.x;
+        res += e.x - prev; // весь отрезок [prev, e.x] покрыт cnt отрезками
+    cnt += e.type;
     prev = e.x;
 }
 ```

From 3d8985a80861588864d9c2cbae757dc8827bc2e7 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Fri, 6 May 2022 12:50:53 +0300
Subject: [PATCH 083/173] remove tikz

---
 content/english/hpc/algorithms/gcd.md           |  12 ++++++++----
 .../hpc/algorithms/img/gcd-dependency1.png      | Bin 0 -> 14837 bytes
 .../hpc/algorithms/img/gcd-dependency2.png      | Bin 0 -> 13621 bytes
 .../russian/cs/range-queries/img/prefix-sum.png | Bin 0 -> 5403 bytes
 content/russian/cs/range-queries/prefix-sum.md  |   6 ++++--
 5 files changed, 12 insertions(+), 6 deletions(-)
 create mode 100644 content/english/hpc/algorithms/img/gcd-dependency1.png
 create mode 100644 content/english/hpc/algorithms/img/gcd-dependency2.png
 create mode 100644 content/russian/cs/range-queries/img/prefix-sum.png

diff --git a/content/english/hpc/algorithms/gcd.md b/content/english/hpc/algorithms/gcd.md
index 59e55f10..b9e9007a 100644
--- a/content/english/hpc/algorithms/gcd.md
+++ b/content/english/hpc/algorithms/gcd.md
@@ -186,7 +186,7 @@ loop:
 
 Let's draw the dependency graph of this loop:
 
-@@
+<!--
 \node [draw, circle] (diff)  at (3, 10) {diff};
 \node [draw, circle] (min)   at (1.5, 8.9) {min};
 \node [draw, circle] (abs)   at (3, 8.9) {abs};
@@ -201,13 +201,15 @@ Let's draw the dependency graph of this loop:
 \path [->, dotted] (shift) edge (test);
 \path [->, dashed] (shift) edge [bend right=75] (diff);
 \path [->, dashed] (shift) edge [bend left=25] (min);
-@@
+-->
+
+![](../img/gcd-dependency1.png)
 
 Modern processors can execute many instructions in parallel, essentially meaning that the true "cost" of this computation is roughly the sum of latencies on its critical path. In this case, it is the total latency of `diff`, `abs`, `ctz`, and `shift`.
 
 We can decrease this latency using the fact that we can actually calculate `ctz` using just `diff = a - b`, because a negative number divisible by $2^k$ still has $k$ zeros at the end. This lets us not wait for `max(diff, -diff)` to be computed first, resulting in a shorter graph like this:
 
-@@
+<!--
 \node [draw, circle] (diff)  at (3, 10) {diff};
 \node [draw, circle] (min)   at (1.5, 8.9) {min};
 \node [draw, circle] (abs)   at (4.5, 8.9) {abs};
@@ -222,7 +224,9 @@ We can decrease this latency using the fact that we can actually calculate `ctz`
 \path [->, dotted] (diff) edge (test);
 \path [->, dashed] (shift) edge [bend left=25] (min);
 \path [->, dashed] (abs) edge [bend left=25] (diff);
-@@
+-->
+
+![](../img/gcd-dependency2.png)
 
 Hopefully you will be less confused when you think about how the final code will be executed:
 
diff --git a/content/english/hpc/algorithms/img/gcd-dependency1.png b/content/english/hpc/algorithms/img/gcd-dependency1.png
new file mode 100644
index 0000000000000000000000000000000000000000..4e58904c19b0720e58dd26977fc20d0b837a9eb9
GIT binary patch
literal 14837
zcmaL8cQ}`Q{5JlvH<hhqr9?JmWMr3AMv+KHMs|ddk&!Kl6v;?MG7Bv;t7KFn6cHa4
zArvy6^SYn!@%-`o9nW(gM|bzlXI$g`K40g$VvilsXJO=HB#}rg2Mu&g@OLkXL?KE~
zi~sT|xG#mjsC~2!n$hE*V0wob{6DX+u7$6ur=zdG?fJ7LCl62evywjc=g*$?@NxF^
zouR5z$BV>?7iphAYwPRc=^<$5;(nH-ciu%%X0M=;?L|R3896yYSveI21r-GuK_gQ`
zzl_-&5=oGBQ0IVIK-To<i<(B?$24aBMa_25rW8EbAz8t8OX$6DRGeE%#o0oJPj6+?
zd9Izk<#tjy$XF-pn%W^TOS6IZ+2<@DtJi6)PK315UGTge)b(=l#k$GufBR<)J;`5u
zx-KU12t|hFiM7O3@`vSF@>z<<+3`qn>dGdh$K~<p8R{A{=NIIOMfaE(b2K>@ipAMJ
zHsffP`9Ci!W*MY-P<qzMNo40v+QP!ZwzqF}PoCuc@#9B#mHVl-_Vxn@4p0;o6?K37
zsQ27&x<AQqjZHRC>ycMXQc`qXorac{)-f|PYDq~+F$oFn)2FvrR8(BOdR6oIaZdG+
z)tKVqJuia(vX-8HdAX!SikXGQ*29BUSy{R24lRwP_{XT|=*FfdDi;?QnSJ}N<mL)J
ze)8mLpc~%2eSbh$SQu$(Wu<XofY0!TlrBDFDeKlbdPz=0IeGc7Q&S0%k<@!^YPY?=
zXXN@ZV0n32>EiDlN!ye!U-<dawmFWIS+pQPm(8rKzN?FkmX`LIsj0Ssfk7n`4e9f<
z3r*21TQ~9X#ods$vHLUi$>8WwCqc<(c`b$y4H0zh{pI!tEi8C!$lm_`yR@&ciXOeA
zKK^t6hrY=-8WiXK{PY|gb|h_Ai~IHY`TV8Y%zd7XU%qU=fAmgtcJ}sRa=v<g!In+t
z=H`)6QD3J%RbEX`=jwma(9mE};lLnu`bCU!z8-_ymyZdx_qT+_b4!nYd~`}FWVJ>U
zhqkh^qU1F~+O}=m)!VlvyPuwtcy@oZvcKPghnM#&US}mBPQHR8`r7-@e0E{MK~mfG
zz+bcAs!RX;ruxexdbKFR^7BQ`xw?A)_^|tg&tuxhX2HJ8zt3*cI%99&<i|m(Z%N!*
z?m4V}{rdGX{sP_ng3<hZ{p7}mhVW^gZG3!uWo2dc<@T-pN$>J{Oy=6(X6_S`mS(fI
zwyw$#Zut0->-6c<Ii3Z}SF#H_ch%O`W~g6MIhxEsPha1YuWxK&5p#u20Ba(zVLCo8
zZe)C%hJk^>?8J%2{(hdnzdmd0>e6s<a>@q%nJV3$UtFAE+Z6qEW$~wsy!_a1fo6W3
zYsBl<hlZZ|(53D)8vX8#e_wv|NHSaf(n@o-j=nyfgn7xT-Qdzvz~KxTim}$@D=%NF
z4LotB!vcKl+Q)kS{CSHP0dxN3n@Zlomm?xB2AOd5O=icjWvc~6<FxjEm*0Ki!eVJ{
zU+txTrGG3>Rk=01xHvBpFnfliSr@V<C?&<}>gsA{Zf@V5b71)P*!?>pjd*WBr=Itf
z^f<P>{QTVf{HFeZu57h`KY~7e`eb7@+EZ$y;XBp8*~G-;>a}Z)T^bv!pXUM>e?<9B
z4+u$0vc$#3C5+!y&zIg<65PF;fr^%S<ma}xYTY?SMR9*;hUKqEuSgSTJ~ibgCMM?f
z#NF3d;m^!48z(2{hu+>?r><<9)!^aUOG-Wy%=m%#_-jtH<Hs91J6U66W4-+S^^YCf
zM9QhD;n`T7XLECN>we;DJ~dcf+HjGRgJaUgFj&j5nNCzx)XI7w`|4E^6&2O~(DiMH
z4jm$eMbNQ@l)t^D=z7e=<Z@Y=3`rB4eB}Fg2HfzDI~oG<T#}Te7nlBV^_AL?R}}6Z
zxvkQ}7QP%=yK3KdeYdGnXh~t=fx#;Gt(Q8mIOoQaHO@2AjAY(%U7KxUOK?~%t#TjC
zyLT^o&U?3|c+2F^pZ1GCdI?m?-0NPm5_M^5R%qq#jC7c<&BI5JbTjuZL@{b~o5+6a
zvk7gE*(@S+@wdZ=4<B+KKfeC3m0`u?YG~T^>zXG{q^CVSrLtnf$wt6LMTOj}mKNc?
zd%3U!*GI|L&;6pVU*~w{H~qTz00}2@x1eB7!XpA}W3snCSnID@VoZE|Q?B-vGw04R
zIypH3WUyo@<()cpPMqL!dL9!)UsF>vGTD>QqZX)e?b<ay{-8JMlJpD=2M!%#BnR#D
z7$T84xVX;O>eDa(*c5o-f@0CbhjmMQBrMZcz4N%TS{v(YDOp(wH|3p#Wn`{CY&A~I
z4=U2zAu5`S`vkyf!Kc?f|2$9bDtGP?X693%=jG+C@%pNjZ|~~tJc1?N|Ja3HK~Ygi
zNQmP2@#Dn$GQEE!>iKh3N-8SvseV};KqHm_9qYERv^36h=gwibMd4aJc=)iXx0e%l
zw&m`@R2ghF+M}l8Vq%&{j<B+BQz97|8Ch6Z)c5z-91PU}qNUXZAG5Mb*es$q^7*;{
zCT{LpZmN)w5HBAeZM;KETU$B4`|iDaA>KQ#o#XrZkK3tX?cZNs{-gM%h+e`-u#AM8
z$8c?;Zafzdj^erh%tY{TM+bAI<J-nG33H3XyH+eO6n=SeF(QhI+m76Mdv<Ql)_1DV
z^#16N-iHZE?U7ko+lFc{Z8g1jxZz!aVanC3Vbz|)l%&&jAuQO6)HF1<ZC8c*>q0f+
z;^R*q*>kzy!tEH3lB#Ozfe6|Z_FYG2-IKmHT}j0Xop)co*Z(qT-}2hu;jN#g)YY@y
zhH6qUn3x2Qzj^oL(BkY@?Y#T<^|P|F3^3Mg$SNVLN+%yXZ?++i=mrL=e1CtR?cmX)
z<Q3W7yLYcLeb9RHnu>FMeLZ9U#R|;{xxIT^u8A6+@$itky3AK$*Ub3f!GklsMaSLr
zU3x6f7E*u5`^zdTwKX+KD+}M7ER!mox`>mk30YHd<+Wt86sJ%MUXsBj{W?AU_VsGA
zfJSs;B8$gRP0t8t!;AikvrOv2OC+qwwl{C&<s1!rijvnEUjH6{lUr28sH&<O-nF)}
z=(@-H8Hb05M*=6~(hGyhoC6Uu-lIC(l+Q<&d!M4-JKQ~I8W5ngJY7B9pEMVRTX3y8
zo;!7AB{00>@6;#xyn+JV604_;)yh&<HJZ`Pyjq>NRa>i-b&TTo^uE7;w7ou@s{M&;
zUw_i?G=M+5yuAFitgMc{i^0Lc1~Xq?wlDtdTUKEcihT9z)m4Ge)s#|!qcPXG%*8b{
zG~|Mpyhdkcy8N!ssQ69s7@L}ka7&+d2%h@=`+SAV`<=wK7Zw)g5!eX&QWtvkM6rcX
z*>dE`c=6+)9K0GK`;~nsHw_x|ac<sxprfNB2J8Cb?{5qu_JM(c*DxwlhC}`(0{Ud-
zb~Zofy0GB6{Cj-zQ{}FZjWz$Oe6Q)oC??ENrDqp@(lRkI@$m8e_FWD>e3+S%l5%8v
zn#JASow#;QSJ=q*c15z1_sj|k3M`)cJ!+Gx72x5a4-E~CjEp3xi>0OIu@fislDDgM
zm)q-uH2|s^1vEkkeE9M*=!$vi=>sMvTcV?*F)q(6Om6S>#ivQOW0=qWnLdT>BD`-O
z_syF(aU&z1JyZGi?ORSx4hj31dHep1CyG)JW<U`ryuH1>e0^WtR`u^sYD^s(vQkn}
zX>4we>a9FTofxuYIG6b1#eU3dI(>cps(gA5NzP_-08zxPTYM88H!;n_H-9%tyK$qE
zKYX+iTx1TTr_AqHeom}g;gKvQA+zfByL^BC&D7bqCh@m*bvdar4#zqa{+^#F7=V(p
zayft0=)1NQ5n2CVR#_?+;-<xdcU>!}sH}{5@<dTVK>>&5@6?&~bya!iPQAOHAFDh*
zs}$SRWl!@2@8T#ZDlXQswH3n0ZMBwCmX~Mm>FEK*i4A*KE`dE4x;jS(=2&Lm%F4V=
z>FTtBnF#f|@^L#m;q3hvFN1iE0&hUP^xfSh9UL4mdXqr2a`W;i85tRI`J=0<_Ua|_
zH9wUfDY1Tj%-npJ$$eqPK5Rj+=>f%nfB^dDv&&{p)Fwq(URfHU+d)Ufq@-GK5eQyk
zivbR5s)JLn>#wb<;yQWqq`~3CuYUHGZsy{O28H0+y0xjVj~l$iwWqo_AvsweYlRr%
zH*Q>MY}C4Y|Gxc5eR!6<xS5$*&Z9?0rO!CEf+abHsOi|YzXF*!<L1U}XlR(gD}V2x
z{n@i;8}MPEN1)2lU?t?8?<5JR^q!wE8hY-}24dX(?j120K&X<F_*L_aGWXhK{ks9c
z!+Ofa9~lLO$eTBL)y^KYnEko%Hz!9h@8Ltk6OSx)F@$*d|9xvu)iE_q1@_6SNNvR~
z)&irBEiBxv>^(|-^v#vczn2-w?t@jE!G<c_KFMw2<ZQlww?h@bwZHL_p1t<^>rbC<
zCh)7K54{Y!A<)FGMZbH`o*hX^Ne-{C@7@8t^qil4^X7(h`@;A4?Km$3XJ_ZiR*YuR
z)?|TQ0|NsNKR-Ug>b|!uv^<5~=Fpj??ACf_TYf<iUW3)YllX&Ob(nR{8q;`Z1TBl*
zH*V8oI9)wjW**JXI~rX}{dj=`2)ILIB*TvV`?KceZaTWT9lm_|a(_~TSj!=x!9mB)
ztoGh(jfbE6kXemQOoYS3!zu4%ns43fCdkFbMR_tm&vUp|>*<l!P@Hcf4-Zcne}r*D
z+PPmtPaVKV4zsjY>L+jCtgEYgeq;U8M>z#YC;|uI1!bnRu<&JOeigGxbCq}R-;3DV
z+Ts+LMG7om1}dJcbYx-IR(vx(SRH%s-p+%E4?C#7Ha}5fbr6i4THO;1gjdB^>}lX)
zQ9XGg{mK<OiO_wk@1{&H1_WFM{L80AZQgYxOK@Uh!mV|QV2&}^Nv`vC0dtqdcJEFh
zcipK>&lMUUAAj}sZ6Zc^1%K4$x8ldGthO1One6~sK0I#wDvnd44KVX5=>WaZ4ne^%
z+=*z==JH!(-~|q=OLM+yq7=?23f<h@+vmnwZ@zBKzz~%?Q%~B$&E0w~2^3oH+y}Aq
zGhftB_^i6lPrTKtKIdup;T!uoZ7r><X=%M9ESaB#u?#UDMrZ0a92HyF*Z$?+zfY&8
z_TO8esi2PQu@pWfy{k}sl6d08iC2SzneCV5i%Lq8CMPG$s;XbNw+mfbSr7sXF7DV+
zyMN@iFmAu2s$TrZyK2XSgZFRQvPI_7itjzuP98vA$4&VwPYi3iZmZI5-MV#QvL_;K
zxvmwGA|)jy;O{T~L1W9}ooPQN-ezW~To8YMlE=x>krtSIVxrb>n$NK#qv7>XV)^68
zk)@^5gR-@|Wo0*m8lP3wt5hg7%gxP|k&!v0!bGYc9NhZ*_iuwEM?C&m0DmTjo>tZm
zYwPIg<v}-ffB4|`n+MD~L)M;l=i!X{rO#};jBZ5^R(r~;>@>ZWn!0(@rh2~w$FEP#
znwy&q4jpO)J}#%wxeq)Jzk7EFrXCNkCk~#Tjm^nnkB8c&6%LKH1=GqFsJxa~_FaSm
zShCxVy+R0Lkc#{<xrVkjdRbZ7wvG-5)kCRsRz(>ZJkGAJ^CCFfp`mS<6O(<V0wQBB
zuC8O4xd{o)kHBW^C%bdsZr8`ic^SB9C)m@6VSV|*1F_nK&sLZ`{nefVgR-|0$R;mC
zH@1t5i?6mda^rjhHr8Kik`Es~9C_>3Elu*bG)#*v?ChVC=8P~#g<oDjx%ITb)zs7o
zEY7_(Mp$`VQdTjbT%#ZY1joH}No~7YU?jj93XK`tgW&66d!5;8@u2g<0Q!#~KN=i7
z*ic}2Lk8<?b^YIr(xnxJg@py;cE4(F77`Vu!4_%-$L4$cFefJr(&qFX$4ZB@XBpN8
zzT$iuu~NWKjM#S`4g*+%Kli|Is2ZEa`EYV`?|YV)jJXK~C$wwV73jP%@MY)0DoIv;
zl}HdtLT>{LNF;1z4seWDtIP8}&;31zUR-3yIMp*UVj-OY@(V`R1C&8)TA}wf=@0ak
zSj*bCFarXs8ukz`Ja~|y^k`-K>({R#&DwjVwY|M%S-0<xA*360ubrJ8K`629{x<u8
zp?mrHMZ<Q;%gZ}sYfAy}C)EQPzmASl0%<TNMj*>!WqA6AXGmE;8^M7ah2{nCkd#bJ
z5z!xQPu-<8-=HiVAaxuf)O=%dSQx*-HPtdVzcJ7B@%~W(tgka5%URQbe}Antr>Mgk
z!Ii!wn~n{>zDySAv<=2eTKLh+1nrlYeh9Y@$N1ns@w8vVIvW(jd+{$7sidT2c7A^B
z1-TL%CN=fFxzlAGUEQxZ5iopW;*yn5Sz6lt`!kKhB^U3#FwwyaZK<f@4*_`W`0?g9
zrtA+SPgN=WoZOum5OP!$(-|DHq<UtsdWJh0q+?_B#t|#4tcNz2H<IRO#$Y)-eSOTv
zhRxf%VUM?%kWj<oaB^*}I(VFt&Cy#53=lFl?)ySk1CFfkbswn5zIW{_p%J1cj2jFn
zvRX_^%4U$et+$_-l&}Iiti~N89&j-_H)a=M<6<|B>2uYBa%<wCX-<gaZj@|<5bOxk
z4s1YBRdwsZg9o#2ykr5C>R4LxVKbLsoOi@<(1bAsh2-VsRS!13m7iGH{3^a9;=7u+
zW{19qR1`PScLZv5KRUH9{OzvYyJMkrbHUkq+&#a%h=pNuz}R>*;TAws)!Z7xiXy4c
zJOL;}<uL=K2rkjy&H!GXn3R&484ot*qbPt+H|(wUw1EQkej>RX?9%REH}NNENd4C-
z6I0W=!0&CI5Q~Y~o3-d0u!74S-|~UAQj*ByZdkz(26LZVcS1AT)P+nAR!d_B5e5P~
zJNwjdoq)Q!Iz1!fIR*Rk=jGt$)P%z72#_MS`{?K>-_TBV_3eNOT(J7rX%b%{4&n3I
zVeGbbc1Fj>GJxpFU@K75(T%-|W%oLNp3uMv2?@t6Eg9+P=!WV-1u#ennLIn%%#@Lt
zDWayv50Q2NGZ4EY>SfU1y%|^*M@D`oL$V60s61<M$3;G7Y&_R|;O4)-GsIOAla|&$
zcTSxA54S(-@9o~+UM2ru+pMgt1b6O?9vT{wKEjuto*tc)bR?e_7Kfqdi?_RsEOSr5
z>Y<>ZICkvVi2hDJNsBTProN1#B4>XVknswPNATO85kJrZuL~ENYrMw@=LQ#M&#5YA
zU~m7k3lbb095U_$3fq*tNML1~*x9vlp$;1vDbp!UO-<#%eZr-RiH*&D@PHm?8j+CD
z;ZK$Xft>x?l#-Fr*8ly?Dr^x-PfyQ1mXBkm#a_6)UKIZG=TGp-$|RR3PoC(2QI6cG
z2wh!DA!a+KZg}ULM{uZOIuws%>pQJ6dYo4Lel-IF&LHbxFo+-~IW1s^Yw`u$2n?)E
z18*c&S68W@kv&-5-KX_z4l!l!Ond#|14nszc}#M$lL}k@i<d6}!We!qC|*7FAzR#h
zE+lPJtAn$-clc(bA4kc;&%We`RML%i+XcRKWbR`F^Kz79$|pRp$B!RNt6Op2wFz0K
z!5yy&Tojj>5AA|kr~5<gb(w7wZdZdJ^B0B13-%CUJ@4KLm)g`h{vH>zFD@>&u(`Pi
zKhy#9POEZHKW3AzN9_2*g0I~9Zw6KNvv=Erh<#Xm`lY4V%z>b<jWmSU_xg2Q?>3lV
z@)!tZ7k-{N;X@kP#yM2uEiz~v@D3n{JCjpUacx>`$kjvQ(W6JPnVDPJX`FN*sf>>w
z7sbgrs4@zDn=d*7dA|b!#8$EOG7VE3PRQbIw;5C@`>mI1`)*kmB_uGzk`v*m)-Eh9
z-35b0bo=)0Zu-kwZf<*c6`VK0m^^cK9Yi|?v#LVQ-}O}+3*)chx}(?A(oWV1T)Tan
z|I>?$@(+1y$+#psz>ksU0qF4ZiVESy#YM{7)r$60Csd#;bhNc2p4dIIDA$^t*)})X
zBa&BAa^%H}7u0gYrFH9vV7uvzmc&Ui<olZ}EG~+I_;XKgw7_+HH860i*z(DtGd_C>
zeIcu$umk?jPOP8cn`0OkK>SF;rjQ;pWwBq)u7jr~NLbDg84iJyrTZ`bj$B=x?{a`l
z3BD755Ie{LF2FY!lfu{vPHsb3cNI=u+iO3sTpwk%@wmJ8wTaHo)zxs1q~t*z9qO?$
z^W(=i!&JzFguz<l5wJOitxni?F*k3Xtn_%Do|1CF%Gx>>HuXuU;qFJnU#t)X6xZ31
zbyGV{$RC&I$>pINA*&1TkCr`o(mc`-k*Rq4;zebgzSruKH(~3+QlA`nBBAZ_iJgZh
z27bMeurOua>Tf29XhL`czC)j!rX%6j6E;)8@^4XsWPk|4j5-6>u2rXCX^OjIyEOY%
zE7WqRZpl)7Wd4FS?k9m5cpv0eg7;X9_e;MsE-t$X4GFVp6z(zX895OePCeub^eim8
za6d2-ufl!%Gd-A(OPQspj4Xm4qx3;hk=^V_1CP8TbI9^IcUAuU=P(MYS(qvYhK7~=
z1;)YBy`eNfL{OoC)g@WbDH*R3&4uYfZk!+xf*Isy?)&>)9NgR@khoYccw<wt#(E3`
z`_B3gAIx%Yv@k`Cj@rElTx2B11qgO`q3N!arDnSUP)@J!??l6Gnm==_+Eo73z(UC_
zEoCF&l&LWAu4ZQ^L3e&5lOr}=2E-Cbi*Hz3?6c9Ga61>0lA5XwOF~dow0BbYp5#f)
zLE~e`G!1(<(a=QAoK}8;z2@xV^7T`t)8yyp`$T?TGtD!CS^7}%U}X52kJF<Kup<et
zkBW{})7+dJR=uO@7E$WROLJWtcJ4<P)xLlKPJA0h@M$X(jvH$Q8T)J-sZ(~RX|kEk
zj&|p2FD-m$|5$3%<;U?6<hY@yXS4sWAzM|(W}4if<-gL<z~RG`SMoA6iQI{Oz+`di
zQ|iy>PLCZ_{iZba^k~Dn;?GH)tcV=fF|&`XG6M#Jxo|!mFU+_HFC4xr<`_ZSEFi@O
zQy%EiyBCsrtmYG4;3jIpv4b%%OTH7z7|h@r=da~&uvTYjPuLlA3*SvaZ-g)$u?PUh
z2OvhuFim}b4e>&vgahe80BMOBLxd?3i2}e)$>|%te6Ca)1lr(c>IqOISd$G8%}ZzB
zAJrf{M;=wbJy>^w5)$!XEF;7;6BpM4rMI)U|N7%coTy>?ncv^qGn9QC<JISNuK(JQ
z_)u_;1!e{9zJ2=$j^;Db(Ezr}H*^~8K|7lH>rjy*XbRw@th)LN_8XKX!-WeM4(M<{
zz#&3yRPaZrx>>jN4j__CC5$^5rPlxjGIA4e9AaQ19Fyd2<uElh4SzJ<W2JQ|C$At7
zD;0As7qkf`T&Gc4rfd5((Hj4m?FiRwot#>qj@cPvlD(TWdLKn2nHKgLEE%G(3Hzh`
zT(5XbD@dI*=sgn+4NYfCQ+iHW*(U2}zP75g=FWxG$Ry~G9617WnzHZjU-zO%r>aMB
z6Iy`CIzL*-!(cL05LCWDKEzIoS#2I*KezF7ShWo9X7;q05f>*XjgOB{ma-2mbW&>l
z#OEij+|VOYE3DwY22+J_>!NoZ%`W55Z>JZssP^E;$`|d{qNu-Tl-XTkt-88QzC++J
zd#p(!zbY3jB1pc}>Hcym6%~~XMOQ9~`I&(yOkciyp=V;UKfC@>Z^t1;yI1sZ?Ff!B
zEw(>seu4!<v>XoPM>*f`?`WVL!X7@9fG5kIBF0GqMG}N3FwL`H74G5G^z_NlRCgDb
z*nt5{YFgTgTltEcZkiZnDH(qY5t5UW+q7j1<*8Grs`5!BpuX4o>cuMi#{t5o;@aBU
zZEbDDa=~?&|NHmK)2C0nvnm}O9it~FojVs#=Gl$4B*r8pIH=P0>D)6Tc_E|2GKrt&
zv0`Fi2+z&6t!({NW=looqAo8Zqv`A{W<SQD1v?o$^HWlTVv2a0SYb(t9#Fj{f%pAX
zdB&=e?f+;|xBVd!Qc`;E?vD%}-8^d8ybFrEySv-@<IP^!ElaS2uY%K9NWuO`S|Yc1
zd)?=kLEM01Cso>`lLZ37*Rr!&F|JCgB*nz)u^Aw|=wO3y(kcUo4hRe!1*~9ag~PEX
zodFv1V7yLr-j0VN&A55<0CqS0|CpE<VxL3j)v1~zPrG{S*1+m6F+of%VgK6*rte@w
zc`aTmc=n8sm<|Yy=-?F+23yeIUj!GyFpYy3F$VdFd~<cwo6wOk(gj6CBH)JBJbPvb
zp2)%l^ZVVy$Bz{d3gAflvhIs*0#gsio*I-55U!Bkh+KQ?clNBPxj%XK_iru80kRtt
zMga0yx|Ws}7WE_Wd2u&MVdHOZ*kdRlH<UI1bocJvV<%6x_g1;FH=D$M%N7Sm1D0(0
zA)4F#>Q#;9G5B}?h_DPQkQ^j&qS$%Cd!wZ2+w^oI>^dZ6=|9~fH-S}pzf{Zg`1~vi
z+q%lWAuv4?EdyPm3oxs*<F6Tx07-lEfBwpEYaE7^^?74+^Vat@*fz8z06`x7VsOp=
zN+&je;_SrRttBOQ#HFRlQ{y_}ZB}|48*p;Nx(xl{JeBc}J%8pulLS7kSG^Ue1-tOr
zsZ;-&|8A_Y!SNZ5AN^bxdS~7VydGBUCVVY<wHPSWunge~Sq#X1dP$rkS*OYjj7!JM
zO9osK34jsQ31Ou8s$#*z6Qsin$AV|T2eMjOd1OdYQBedWf{7x<V<SOvHYIM|=k@#B
z!A?VE8ylNdq<}1)H?Cig3|U)ITy!zE+7uhlQP+xm>f*(V#JLR`*B*q_>AIugxgd55
z(aMR^(=5163saxi)q?)KsyO>bNn;q0AuJ*SYxpwAa~|j%p@yK^7{U0#I7g951T6mG
zM9|8Y^%Fi5u73jzG~SCzANq8jnZa)3Q9i!|z`XOi|JmyD@&=rX?YVQW{Wy{c5FdRN
zOZfPzhvG%CaKgIis7h?=q^?I%st=D@WONowOGq?f8S2`l^}$JRXl=bR&0{3|u7FbA
z#!0jo3U?DY4=fG>5~OTu6M)%!S18J50)BssHQ>)(l}?dPllj@5OEuATCpmT3(F%T=
zdUE9|!mu}f33}KYU*X~jKR>b``$G~5MRs;JVTe6c#tkK+>4a^{*Lt@N5#bR`RKbcX
zYc7RLhoNcV#x{YfoP0y14umFzq&_Q2C5?gviWZ%I2|HcA3kDFK4Q{TjaA@B`*jdfZ
zx+6Y)A3xfGMfTTTdfpbSj>3dwMn(n!!Iox)$e_YR^pohv7qz6Mq-rkw<ic0%>b_`R
zY!OcQ>aZt($Az$Ka|#NgaG61z+I3cZiaeE3z#tM?Y}X3@aS0$$hN{05(l!~V&TX|_
z<dG2)%!^ep41&p~rKQ0vvjdnvd7#Eg+GAC-3B$X%_R8{hjKnOpAg*7(ej$b&8{7|b
zrL3wd8k@6?lQGPIo6$MEfRYHWG2pbtJl%_Ob14J_1i+Q4k&lf1xO?g}bZyhCSJX&y
zw3gQ^j-MGXv7q$&HKZ)(@@_|!y+RvemIsA}DbU1wgNkA7RbUVjZaYqk$3S24c>Itl
z1re-@yos25kAU23@yBsc3|p9T#JcS=JT(mb1OU{-mG6P5*&i@V($mumK6KbSznTBR
zrsH#SbA;T&2VyVA3u#9qSz3I)?_E-SyfzRVZXPAZct?gT71oy1KNp3Q*4Cf7%{AiX
zEX9Qu*4Kj2cNl|?AnZ4g4%|~z4<c|c^8IJNpv}?*^=4s{f13gRq0HXB6MGy)emEEE
zXu=ZO#?S9f?&Pn^pKXR_ngwBX`}{0zS}b8xdz8(C^n5S{EWU_4cLWBFdEVQxnFYTA
z?;!|NmHn{|?{96J7iNY9fDLZ?wuc{z_g)p!hA5^(sUUe;j6*%tV0vPr<+XCk>W893
zG$h<6dS>QV2ybB)|5(HA&5(Cu#Y(OB<IsEZ%;W+{7(TuQutO7aUShET+qLPx!byT_
z8l5<-%I><*P@--C-d<KwVW(>H5J|b{Z}Rdb!-s>WYlha={7`|LpY|Tz{k2d0ui1u)
zx%n0NgM+dUxbF4-D%5F45*ZF%N~#Aylp#h>XE#pFE!c7%jTShy`f&C9X1=_gTnnX<
zR8ryrz__)Js9F|z&i?u1s7iUIpg<IZ1in!uKnM3JZQ8)d#Mn3-T&JS8=(=FH&qggl
zcQHz_3(GW<j~oZ=00&Wn(nF$!n?sae5Wj!;_%ZL&zdtb9b)ndD@7|3Jr=qh1n-umo
zeT<EE0F#JF#s-a-lG1;u`u!S;g^O(qVnEQ%;R&J0`qoyueV$)*F#ZYg#Lv&q%p()#
z!%hjMO+=9xoj43)RlryQ6`g}UgAq|*Ur&MnZiX0Xt1dRWurO&n^y~r?;%hhk&CKai
zNgNE|d?L6YJJ3j&D}VFmO)N+dq4=4&_6R~t?gEQ5;*$^+joPLjoB>t$ANkz6&qF4w
z>N_%iBU%<d9l{Y!5|(*vmRJ+Iq0TF7N3D#$lD$3FSvnDJ7N4q4Lgg!|+?zp&_#$wF
z6v&QiZEXg6trO6@r7>}Fj0`L+XBDsH^q6=P{x0_FzMw2N0rgE7GjcGce*Ydfq`h`&
z7;Z-lZnJX0te_3qr2~`t<n22w3u<X$`-7J@Jv{N~GRQj?JB=I}Vx|lP@3H$J7wB>%
zd=JeD_vuNY3lud-54iSN5V?9dC*yw#^!xW%^ndI?qS{`+-qJtxh;Q)uA>$iaSwkaT
za32o)`ueuPpr*b)Ik>+5&r=R6=vqoj!?~o7Vd+J@tODwX!C2bB>}!|jPgQ$tW?^CZ
z4G~GP>)&2~e}6lKS%(Y_c?DF(#fy6RuP$x3EOHEl;sy8#0ULM1^30k+rke^bk&2kA
zJmZ%uY**o`s;Yz}UOv7<w5;0<P!wyMtjUW@j}rn9AacLTzFTNDDTl7_e^XfN-jk>6
z04F3NVf@=S9SG-ktb88cQqdg!4IU0&-dNONP>W#e#;hH9!VTqg#@RWhZex9`eM^GN
zJ04>G)?QKtc_$2Ui&LlMvrHn(KmSPI2`cP-?i?t@ma2S(HVnJNH{}WrCL0|*D1<~I
zVS3t<@7aC3e)yFTuhLugoy-KtV-url;+8G2b^CUyEq~b@sn)XM4uGtqSMkh%gK;T1
z)su}ad8b-B*N=bBg*?7<qq-@9p1BVn#=|HV6cd}w)VMFewU0%TQyZkiYpjLE!qO7y
z*e*m-|IW3K!)9;7286BF3fXe<(9UO*LE=bpmquW<DxqxvZ^^{cG7fgt7(8CW2mo<L
zu1t7<9k&!8>JCL|cUvx=J9L=POYn{R?zTdl5K#q^{YY5VP3?R;LEhoi+kL7yO9Vvg
zD?g>Ns4a4ME%{b{CksiRe&PH(^Do$ZvGsE<<e)4b<jKhxc%-ADHda<OUKFgXtQkt)
zcBvlSYS17Z*=k#nH{Xx4BS~^nVNF1<lQ=okcB+LS#p{3S<CH)6#C`D3!)6T;5fMV5
z0AkrNMIMxt(2}xGSikG;)`XSinOiuMC%wS~IvqzaDc1yq278~qt?uj&SYK5}YqAGQ
z2`&>Oi7*l1Q<RXA*y#Zv`UXu|@`J=_zfDZ!VW|~&SQ8%o+{Qnrj@Pd*!~a6nvt~QS
z;C3L&SN#_!Vt*C#fZ$RpD?in%etv$O5g{pBY>BI@!T*yW*toIt^RvM84qjPcfLquC
zWzd6jA!>VAV5*SyN=zmo0-J6~S!>}KAd_S~hLm)4bim@NFc-_p%PB}G9Q=qMU6Y23
zdi}<Y>Ia!8E8}`o?^?#76qO33RID)q#gUbh8x6KTb;|I+AVFV8XA>+D7~GA^*@((N
zJ@wf|{y8#oMp#H_uDJ&Gy90bio8YeY_SlXr<<$FRG&csm^=3Vg?zs`l3={a@kC>qi
zaX199iO{4vKp!3%`O17>_QS%$@Te#n@=tWQ1Z8DY<`>yCR_onnHy)IhUV%+D+~R43
zbt{M~0h&o$ZH(yvDYO&WA)<nYckg~@>NCNSd9z$uQ=^15gtzNo9j>jC#E8jTeZ%H3
zEOY4UvIckjkT5ZXkzaye85kKwp*8yY&70+0lJExF+uK<@OfgIF`dLMNS9JGCuy^#1
zs$as5-@ku<4hj<J0udEH%&Kp1$l9=uvhqVQ(-PCg5zY~5;eLHrUAz&740ZtFTnvAX
z-gFt8$LDlsY<zq^47#jLgN=XLQ*|?M+sNarwM#3f#y<A-h0dudDk|>Sv4cdyMTYrR
zPdmSB#}0+%2-hF3TS6a456aLJ!W2Ob^0pbz_5CQi5XKvZEJ)4j+L|mPK!lrPC_Y&r
zO+WABL%5~zr5Zkc;zd-agJ2j_O9+7yrhy1f8Sb`>oLr+#=voxqI{bnaKbah^(?E-g
z3eZYmJ>ml*vkVVE5b+6Ofej3inO81SAXM=YtD50$A;9Nll>W%Yw8m%|(ZPQKhE<^8
znuBd4htvrbjASVM_{2nA^z|e;B5*V196NXjLlXSUP5)~wQU$<Re^O5#Du`|G-fhA1
zwSzh)D80-3d*%H3y14~*-RO%FRXdEUGX5wDSQoF55a-^z*W!1bD=Izi0JL!vxKO4q
z@*}W>#?PPm{rvn;67$AaR)J=sWHbV|89(X16S~e|Z*L#SbDIsn{8)P_$nDDuUV<MW
zN$F3TH$pDk0+CKYLskA`1GIT%F8nYfVg!Qfqnq&V-Mg}owZN&Nr>P~^U+r3GvQfVU
z_62q+tT2v-&MU|vTRS_gVZ*u2pH1gdP#^?g(qT+h<>%bHN6W-5MI66N?h+c~$twQ8
z5+G0-hld4_AQq29-CYI=2A2yXx>rxOJ)(g|*)1kE0{AB)@bR-F061A-EJVtdg&~)Q
zoPsT30wGHjDELHUC*X<_`B8XB(8_|#;oB;lSbELi)6{aiP|l_w!c2#;#R`OGLJwuK
zzg%R@2Mu0I^cpom`_MRI#Jz4I=hP<uDx`Kp?<rt*ghE_GLKew0K}gHr-)d}ZEDK&f
zPn5wSUthuMch0rY{A;!clcW*k!wZdB;*z0UO^_@iEZaT?bB*{U@ER5^ddIHpL>eZp
zS!sn{F|cp2>j<>ZBcr2*_v~TDhTxGoLv;EgAR)EjZ(zp(q4i)xV|k}syLK5Tir%F?
zD%c2h>d{>74-w~ZE{rts^V^F_diCbbSi;#IA1BONqUM8QJQO=Cgbh>(IN`17nhzVe
zbQE{gFA?EIG%G*UX&lpkZT-N7<p0y<-$DW16tJ?edr($Ps>dWY3~UNL^K$-((JgAn
z4xsKI^#6GTK$H^f^t*V4aAf+SHds<ba}8yqW}<s)+eBL>Vw{LCK2H=J5T)SM%Fe#w
z%D8o_f<N{gwc1@NO3a#Qm>W1bDqKOg)?B^=hN2EY!A+nUICG?%L{<(`iu#pawZ~^c
za60foEd)wn3_R+=N`ond8+a2nNcPvUv55M5O`=o`m5EDC3k#nPbi{YEJKQJ54pg0Z
zh5()coTV42HsCu6hab`8|Guae2%oQj*)+X>&yG}(_#J|#06q~74FTBE;UFc9OiV<r
z*5c&Jx!tnOESBQw2#C*sEoEqgs&h%7X1;iF*)e$;t}%L()Q}qe2&X`{-)fq~nLjq8
zOh<wXkwWMsqN|HaB<~#p&<Mta(oOmK@k1@G_xbeL3Ym`KQ2>mlocw$$X=&+N4n`7u
zOX6~&a6+`|rG1VK>l=y+0xn(^pR6qNo3;#}QVaYu#ekP1Jg|qHh6z6YFYANT!uTCh
z<drM-XjP%s?KDpR8U9!bW<HWX`ruz5W76Zg{<pMGLKN53(UI_E8=P58wI<M6i#Oet
z-6nINV71ElO<5FMlwY0}Bl`Kca%BN?cEAs3;Ku^G)nD=U+1c4Ms*GV>Zw_ze;Tgdf
z5r*ENx8)v7arJ{oj<iHFu+O3|i|pgZw9+;L=0Flq4Q~g`#cy95V+Vv3TRk;-z(_*^
zi0Ggpg1QD2pdFqYAfySVYt)}F&kWP|G@u#vDXE9}1rdl4L=mNia{lqmJ$v>LcN&Z6
zf6&bfR>TRmJ3jU(q!HG`pseK%OY!LD*4F&m+U-#DK+UV*YRK@q{1Uig)8lM6Z{EB#
z*DCOIApQH|&i!g?n*{{AdaLsF5*sm~X3++exfPyQU}uFi1$}Dtf{|lGN+oozEzQ|y
zSM`1Ppo0yU&x?)UQgH7cR_)UUcPzm(PELfz)SD;<HAqfoL)sM(XhjseQSmXL{ju@b
zR#nq56-xz4mO@QpWu-jf4hsmdq25HQhY2JowZ6Qe#Ldl(9g^!q?nD9T6(~Q!m4a87
zi0Mxh6!A>N>6d|oNe=gzcvuC75V0OXk0N)Q-m7nH%xl=D7Pvr^7(}`<SJ<><&%uS*
zv2R~>&os=HHXwZ6$~r1`F;AZ=lU?zBb}lXl>4hA|U#EQoAI`dSF)(loVr(sXA>t7M
z3Qdgmn$AGT{0blPk&^2BHQF}ovc>2~0;`My-4N}qug)Ls9~d}a>jWblqAd>v*#Bv2
zAa9`{oDQ@^+|Qo?8w<S6DbNjem3sZU4(#RB?OQoHqtMAD=Jw2R9UbXIUtbf{DcDFq
z!d7NxW<kIOP*q7u=?p4)MxB<x+r93SrKI<h!(vTvk5T!%OjIb=GGMF0;nTv+BcXYn
z1hL{3QX9B<jueJ*kA{c$-(S`ksP!mcU``Cba9D?z2GQg0^#uwB^6vcY)KS(;%dKU&
zOX!F^@0kX>QPSVQ;6z5q40(;@f&tRZ#_b<mR!~uR6BM`*GP1G_xLX8XVLNvQkd*At
z&o_8r!{yT*7yHs<B8w_)NY)Jx78IS~vQm<W4g#{ZNE{h-Jp&sX+Yx0%h+d<w7>Hm1
zObIDPmMpOv02d?kr;8zwvSb;!%@+uugCRGztZc%92Ss@96&A)~hw4=m;tCd0neTTK
zlreK){g6mOK|!5%k^o?E(^gpZ6eMsyf@o*mzAc2zBQH<b4XsG|30w*ysDd`$`z+1`
z#V&x2w)BZ_RDmhNO~fgBe%YT}H@664NHl_%mY2)<3z~C@?+44oCIN_nL>+*$zh7F`
z=)Xr`kTgYwuz9VmTnH?cP|qzGjo_7c96=cr;P0sVdY|=+fXzr|60`FS6rtt>5Feev
z)K&^y5QEwKeQ9pYqU;O>_Vu2?oXbav!aGtq<Sbf7Mu`BL(2ccyUurK&JdW~Ya4uYG
z!{cFiMy3^3%cmqt67kd-Xg~7nNwR6Mj?Ma9&uDXe4|vnm=jS)3#ek&E2>OW5v85Ou
z!y1*xZlI>7E~~AssIG2>iS^x@izY1FxWNelS2)NJ4lM$L(thBv3uV{Hw{O(wV7d+0
zT95ct>tPb)<mQHhKHGn7j6#In1lB{d_b>XHXgVbmCaGKJ$E4a-_N#;h1nE->U4NnZ
zhX<d5N1C(|960AvC!`+=!$T}1WbzegC{UoZhwPr2qp+X}){7Xv%(YLdw&TDh0kylG
z*I<&_6N5^|mJ)+1fr5%wA*ciM5(vNnQ0hlKgkm<2!J#=g1G-=fjnJ!r<ygX+1!xj)
zJZe6M*3eg+*6iY<?iE)4G7Q7OBLZ*0NJv7|82*GdCY)M4LUBM_n+iM$rDuJ(`{3E(
zMMdJsx5qGI(U;c2!$?ppIi;n^sIS?7cwp?N!O8g8OzS@I7{wMmN}~-+3!q?($~N$l
zaA0n$T+qT3Qn{#E0T7U{mGR$P5%~Xcng7RgiT~Hf7dO}r^0>BI-_R+*BblUwx<_<M
Iw9Z`qUmeg`5dZ)H

literal 0
HcmV?d00001

diff --git a/content/english/hpc/algorithms/img/gcd-dependency2.png b/content/english/hpc/algorithms/img/gcd-dependency2.png
new file mode 100644
index 0000000000000000000000000000000000000000..b045ada453b656e30b2aab102d8da4f9efd2ce3c
GIT binary patch
literal 13621
zcmbVTcRZGD``1#E%*<?Qkrmmp6_KcrJ+t==l@YRONJeNW$;haTh#O^Qk8G)sQACm7
z_j=zy-~WEk=kq*Ixo_8fUFUh6$9Elf)LCtHs@<%+$;im4PHCu|BO}{Fhd;mEMUMZ9
zFzCwSf7?BkPU-K$KY_cfBk^|@FI59CJy%;VUkeW#GCLPnXB$CJD-RnR7thPCUh~_k
z<?*H?#G8~oY%IL&U0t~J?VWAN)I99DgeADnSom{^3X6(ziHJ%cKQ4Xz1lJio&AM`d
z0x~i#vQsLG`hID1pZ#Soe*3GiaPji;DP7JNN|Cdl)YvHOJ9a%f?EJX0>*-}qXIrDE
zTG8h?b6=f(^+~NG@)q5rJld?;BG=#Z@xhhVEQTY~8}fJluK6`vd8B!o^%fep2AHw$
z;@-}ns%uPRF0<ukwvmF7#-W?ztb%+Ds_wCjbT_lh)hH?Oa!z<y_M!X^B_&;k-pkxs
zhi*E)N;u0<FaAGo=dbhGs^9tO@#FEi!7`20r)%#DnVmH-h<x_!h}VxHmh9~8uOrn8
zlarGV^7G^3*ky{+k9F_$df>WQx_4}BtUTzi#QMgDuCehR8ylNF9J0(oe|{<c{_&an
z(4okpB2oMvm$>-8!otGH$jEo)?i1#hFYltKr}vs4PM`byVBnCEhEe(7-}CRgyZ7C@
zcW-s-u9>;HIhm?zy_)kSBcq!Q@thN_=`!V>(|Z{O46f}vdf}w8aax0+-W-)6pP;^_
zCC~o-`x8=AZ{9w7v9_--<@w^L&QA9~zwFE3zJ1@*!>D=pn9U)*o7v$HRhNGKs=Fh2
z$=u2+6mJ%l_R-SXv-0C}^j>yp^UZ%7Upt>r=S=9$(c`+!*SGB)otdFjS6AP|CJ|%j
zzL!&ekB*Mcez|~KPu!obTpup6SNyj+L0wl@_pYH~r`(m5N&7tSnI4PLch^^!d(Fna
zd?6PU6wDyy3`pP!ckSG{v#!2=tTvpIhHkvAtiHK<kI5Ua>$h+3J7!k3=luEeRdtUB
zIPgv~UteEY|E290Uz$!`qFr+tcx8V`Kp?E7MBHVl{5pqRz=F7Zw?w-mpTL#n@B5mY
zo86{c*$*8$RB)#u^Xk8-cus}!x+wa*sgot9uM!HJzBJzx7885_vijc#pCiH|A~m!9
z#TFJ8H9kl5k1CIpSaiJhoL<P>Tz{g<KugwF;V-tIon6<^u<1XqQ&L158ygqcwy>6V
zjxtMm@R4FzLiL_#7pCv-{?M*eo8;J=uc)oPmsc&uT;?__qo7gP&*`?}baK4<RzhHL
zagq4A`>$eOz7)cR64KLSM@L7=d*kEcZp6n^OM8FcuCz<*%o&u?_3Uh(=2UUY)YR0H
zQe8bgWo*dU{5-XsTqXaMdWluo_}if&V;){!<s(Or-miFZ_b$`g@3J!{Ch2)5Z%av{
z*c(xwS%akHWaVAlbzNPzc2Q6yK7D$~(!t^MW1UP>yBORzt&GfD{%N%mvtN;6JF>E}
zn8j@OY~Qg1mn$mrcwv(+<C`zu^X&Zd(dp@(y1Ke#*B(CPyFtyl7ll3h`7Kj-cegi*
z`Gm`Uxyr26Fh)8?6*V;_W#!PqMf}$>(D?lM@aAT%J<P(_inWZ4SSctd$jJQt*8&>!
z&YnH#<0IkW;lahvPcq2RA5;}8E#i`w=fIC-*)oL}E+?d=?SHaxR$uumEd?bi*m1vL
z*YxsCubYQQNM4@6F7B^`Wv;PpoU(p2c5NBqSS?vOIjNa#?^4Abxda6Ekoj%?lRKxU
zSLimT6vHmVSiP}wsnQ+iD*o56U$(RC-5ecCwIS>4>uQys+34xF1qB6#g@-GiKE0bv
zQg!?-6`KUti4#n18FJee=JG$@6SH&sF(kFVHl3Mp`}Phpls!Z5OS6XlVmr3mR(Mxp
zX6DFhdiH>LfW!NPU%q@vyL$KT-5e}2l}oWz!`{pG_U<=#kzICl>@}LN-rNx4mHPM8
zsv+*_)2HTv#?PFy!^3o5y?RyPKA~o9XQ!;8@l?0^;pX~JYI16h=7?E`=XrUPfA2Ty
z&!|#TWcKy-om5rbfs60p#%}G19mu~_<|5lVJH31N?yq@2^>aEDr1s#?N){H}**Q71
z?Ci0d8}nDi<%NzNi_=b*R>V2~_U+qQUEN&YibAUf9x17NcMta-Ms4+ko7{XxDadCW
zfaUu!QXP*K%X#yLb>q)+C{d9uEsM)D7hBDi+uP4d%E%B+%GA^}GCn@y`gO%htNFgd
zZOQy+Y9pvQxda8N$jbeH@{@6ii6whFhtZsyxZ~>Y?;pE)b<@}0-ahB|QGWg#jC9XB
zE_08a^YfF$-WS+>ypw6VYvULn-*(MA0@o`lWN+QRJvKevJKA^p^y!6_6%rcOgS@;L
zqDM6*aMRGzUN0;Z(N2@Rj(V7wY)WoyZ6y`id@S(#cFt+Ig2lnX!Jslg+{xK_alD>#
zs49r<(W6Hx63(op&Vx5Pa@2cSP8>c=v2Wi#H*fFo&`{;#s$=?pl@3Z+DypfG|5{$A
z-M6pa=ZL}-PGmuW@O~+ehomLc-}=g+YkBa7wC24Mv-<k_)uk!Qs^E>AH*fCtn(d?I
z;o%{g=NqqYwLSsAA|HxyC<O0eXJ;poW(LcIfB-Z!G}of&4~j}0)BExH?M^YVVxx~&
ze*Z9eUQYIIpu`b-viH~b?r6?$C6po}B7W<uBDJ-(3tw8;P#rn0BUQv>CnV6|=%euB
zV`BD1Mn@|f8nW=J$A!hk?bXRl&8L*($yL|aXF#Kl!Lb!EtEwOsa}9a;`l=fkFax$o
zeJNTRT3U56EMlxduWsGF8-ex^!*(K{<bf{#^w~2dGqX%4`%*&=4vun<sokPtVp&g~
z>_ihMUQ<%aZgoa?p=Du-!kx$luF`BH-!r;->t0@^?@;aB=eKtc9XXQJ`9$ZZ^*UfL
z`{6^C;p(ex$&2?L|E|nUwkd2*2Be`oCJUKWbs6}~4U{mJRSVYFM>9p;xx-jkRK&Ao
z%NFC<<<XjLML5HvE<=*zQ&aV4Z!G78Pnv~<gfyb9H~m?jId$OzqmGr7j7+k)W6#2I
zH9bHOjiI4o#@z1=xhsumVh#eD+S>dF4jhn;>Sg8Q=cm)u)U<SQ(Z)kCv;S>MK8W84
z6ELkvdHC?5u=GlGhtd-L3D@DtO9I%l@#*P?GQVZR3O^Ak_wns!RsmP8&|kT7rHbTq
z*kz>Z%;eOR|37hGA0PfBM~+MmS6-2gR-~l($RQu7p%%leh*dLendr*ZQ_<9nzTF34
z6PMRMb&8vxpFeN1R>-b3jr;u#DqiBOn~m?-zMTfW)#Kl~e@2WB<ta+Hg;7OO*J0V7
zafQIu1S}{W`c#q6oY6Ub{W_y#BM12TdyGnnzZ)1dDoHeZxVv*pOH1>~%V)fF&&kP&
zDlL_8_w=+1bOMq_0*<>=eQ$GOS=6j;Y?|IjP*ZS+^$iXtQ_%A#jE<UDO`pr0opoAT
zU7Rrbn4Xr_n4=!Qnt4{tq{1%|P4o`m!Gpzg3TPrR%pxJil>xUq+y|Yx{r{~?^_Ms@
z4cd1V1g?2lIy<+N`RQcHa-#)JVuQr*jFh{F64ijqEeYQJf*L%Cz9oLg<gHI^9f?$Q
zOZRg~XlNA9Rn*O!mg4bDLZ%6dq2!4R3vMKm)7@}BKE5zOI*qHVYpGstJ&AOq`JPzB
z!$*%6kGD>Jyc>1v7Q^E5^2WF6<mBY0q1T>GuGJeCW+YrR%dY%(8K0jwvodsY5;!a<
zXmdEnwe0Uft(21>4>d-66+nUb$jQlLVtaq=Zt8x1G3w15>DS&fjDxvd-xkK+tDHIW
zq9<`HJ<j&&+1c61sw#Q^<?rW1XMj5immM6E&>qDyC>4kWS_jL?s~;(}+Sy-ZL+QA9
z)fP?DdVaVvCZ;=o_bK$^j*bp)ZEbC*C!Zc1)X7Ld;n5#GdX%(R@$=A7<c%BT{e@QQ
zFJulZ`>!r=vw&Unm$`EC`WHL`<0Lh>ySo!GKJCuS!$VNfwLi;8+pNS@ZEg7z@7}eW
z<W^KfRbKfgr~Wa2<mx6ns<EKB*h=O$ITh>iT|d9HDC3;qn1#Q4cd~Vfq}91<;^lR`
zHA5~MWs#Mgy%mMW6-IPzoa1Yeks6^^1bKb@`0>KhQtcznyAN`6`=3hLU1HsGJ@uzw
zR8*A3>cUvsu);_`@JL7JWzV0M5-c|}8=EppJ#jJR(~C<dXp}}lT9lQQ^GOaT<_A!9
z789gcH<YBzZQIvqpf1A|QP-}KVJ($_X`wr*)iULS+AFX4?%KUOJ1=jKl$4alnKP<D
zEYqsMD5C2xO*W-SdGZJR{t+wX`Q;rb?Z1EjR)2k?mz9-Gy%iB1O&PaeYAg6<N=nMY
zW!L^>u&l;Ljj!Lo_ZS3EHqyrImnv$iAN~7Qla`jY#UYa9sH&=Z_TokLvu9~gt-Z7E
z182DHkAC?0adCakvm)p(<Ec}pR@c|WfKLhc?or@xk*Lyl*R~`iCx@b5>=^yYcJ0|y
zSK`<^Iy>8ZelM6*A4-4}%NBv|<MCGWMJ<Sei0=<$fJ0(&Gx+hdXU`H66LSlHjJ=QO
zc<bxheE$SBN(0Qc4y}1L-6O{t;6+DAXYks?=I73_Ka-J`pYNY=Z6vz2u`77z<jUM&
zOr3zB-~kB<3F|+<zU7UyT7QHxaFlbUXJI*&As=-51{IqcdUVwF>syFJs>vduefI3x
zoZ#T^%9N+r+WVB2a(>T`h<Vii!=X`8QVOw)H!YKJadDYM?N6c$r~m@MQI0#j#lncC
zs)Mv9509~kTJL%?7U<b-I60V$jyEx0AMFS7eR0CT$f&lrmnkD7!=O4??!$)<%pVH8
z|HyOS%FN^h8iwF$Pn1(}Z?CDTiTSb!wB~>DGB-C0O31dsu`@hO!tdFom!_2k_MQC1
zaWam71gOVjD*>|f9&3$xZ8*l`9v0Sq+Lf1=n}gXiOM6|eoSB}UwnB-cAh+=@QgHjN
zE!iZ$Y(mT6l8`9vcORV9kA9+)spjV9_PkQ_JvH7M3c~#F&y0}Mz^ka4MVyNroSd8j
z1_fj@JulubO?_l(Eb{Izu-G;|lAD`L#gPd0Gy!aCY-?+PX4yh^U&0v}w3Td>&|(40
zN6?Hw6?WhR9y)wD3yf*!-u>4As02i(r0h~uRLnCf+1=XOikGjyeJcZoob}*A`{?hB
zN=jRCm|8{_Z8I54%pv(y6cw*m`2V~yr@xvl)vHlOfEVh;xb4GoPF<afrG<qeNRV1l
zaWM;9S&3n}8@E4TB6Q~S8}H~}-#VZ4EUCwH&>lW~xHZ%XrD)y$K=I+LSGQJIS7~r6
zTHm06t(U)dr^Snli{JN(o&3^r-@5+hzF5=Ek%7tFmYe&I3IlR*pM~kCqfcCJ=b9a<
z=43nJ0(MW%Yjs*vbK=t@&4fF5ZX_nsZr!?-nWe1c@UdfGp6O**R?6A7rbTV8Pb-jJ
z)6G%avg)(Jd-N!!yu5s5Ow4vNLQMDdmG>9E@t(OcK0a>iI&wtY8P9gs*f{3<cL&Jp
z*Tdg_D+K)sgXSJb0R}cF$N{*4nVOn}1iF9xMphO#m^zDVq>Y{3PAV#@zJUSD)-)-i
z;BqF)Y{F>5!op61_<)0c5$OH&DdfQe2m%2&fa=rSTt#E!{n-BZz(9kKW%c`#CMT`8
z@7y^77>SIEs)4#jH{A(E==HUI3tA$W+DUvaH=Y6{F@{zAX6HiF=g;hvi~^%xS&lQH
zk)(Sy8>pCax3R4t1fM6}Npb8Sy&TBMwr$(?^!anEk9PzMUcCwfxvUh+E;|6wdvX3N
z+XU2UO^S#W7Z=wx91AP#UP%wuQO(?(3&ezmfg!AV^B)_k59I6g3vcvL5zD&BfXqC>
z!-vP7DQHI@;=>&+EG%4ubcE6o310U@8zt5vAt~uux6q^Va%MDBB|x8Pb#N>sInmwK
z)YQuT7J0x(n_J@w%s)LmRqi{_@#)j2(Xp{DP>rQ}&s4ESI4&g6P;+Zyb#~_OeLs_Y
z>sDBq>&Q4*1JMl$x=T1B-@7PqWtHmS4KAz^_zjkEkuPFiMO8I1Ik|H*0LSDh<V)cC
z>Y^2)#3mZz8^F~a^X`Lk>SW4~fdQw;`inyUJb3gdd?fhq3sUZSycJ+sNl&kMSP32M
zeRpr@vXaMapGcalKOGR{Jyg{2Q#F08S}2qvD7CQncYU&V*!0o4IVyDa{gQ5E&o93G
zV!Y<^F(xL4hLMpRgmQi3_lR+N^3MUNT1oYTcwda11P3GI4JfFZkr7Tv+O*#ZP+fcQ
z5&=OK75iVmeywrp)b;9W1@L+<US4eLw)LftLd?f6)9}7tzYK>1uiLv-ax359I(Tp=
zAgl4?$M>T(p<BrI?%gYV{CJ-Gn*bn>fXN$%Fq-``J~5uVM}VlTXTD8MO-Yq`e*E~+
zQFdTc3}=>E(#^bhh2DJ8<UU}LP}#~6du@K*&7j%J#Kg2;+G`ie$<cFq3MK0I=htCk
zXT!p_TU%Q{0EVM*4haf|r=&20G6{QqJ-dffem_0G*5#Fw_>Xsmem-nE=k6|2>eL^C
zvX_*UJieNzY7!U}H1_@bO>{FL=DT<A$dqsFvH!Kg&CR{=^XGezA`$yfe8jz@pSOIN
zvLhgy5L+O%RQWTbqt`&w8XFtUL3Ps1)=@a)jfuP{!L8ieLH`?IaV`8FxymafM1A}A
z?XNvA*dXT%E$eq<k+!X_6pf;rWWh;;9<uy!>+muuQ{k#SaP9s`dtU$VE9C`OH-kk{
z`O<dNmP)V@466dAL{6MIh&Co0{hk^Pa}rBU&}*#>*?V{>8U+OfI!;cVo6*rIvfH;+
zYu~?bg00i|z597ndzNz4qenK$)jf%ci7GHFjH?3gHJRkrw6w%x7ZanIge2_WVgZZH
zYqpA_rqqVn(ZhKUYNv5Qfm%<urk_cD{z$&5smU5}6BASKh#yZv$g~V+%!gY~eYkB6
zS4-Aae)%(=h@f@;*e0`ak(Q2*MB)=(Tq=|J`#J?yQKDJ!S~yWje8R#79fOs@8>#3Q
z$wx2dA3y41rJ|yea9HngLM+>fB&<AnZ#5bW3u;cv`#ZzL>&Y))=q_ElB#OORg0fX3
zb%9;JG^@6?7BHotqB?+cZX8=vrgZN7c~z`tRC+plwUWmWFakFr@~3^@OB~x6W@cs@
zM@L6e+|!E6%BH8fIjLO<M|gNbz~Yiocp^#4Ca4xA2w^GL55eoJg%>i@@q08l0(2}a
zO`XPYU7}Ebu(rkm1kfoiJn?giQTVqSI%BwP?Scj@FcqRjzH}>*^;@_`eDpA5BGWO`
z3JsJjvn$KV#<Fkxz4<1Zwk@fuqN1X#C1(Nxr0<~wAOXj}B|n3`pmO2Do$@ztl0Xw!
z&lG`&H?FV!T<h^XC?F7X`SN8NYHI3yQdC@=OUIKFC@~d7!?h<tVW}ZP;dQO8aW9=2
zzk5uOsIqf&Pss<bH~Jj$mWGgHfp66|y($^FD%{cC?fi+l{olslQ}}G^;y){Y-oAb<
zWmsUorN1gj2B(r6^?op#QE=RTDAWJv7pji-b}roXghQHYG-FAhJ)^OS$?JgQ|70?l
zl8Wux=&Y=)Oe5@HLGL7iqZXu-bkgO6Qb9qND#m{L=I7_DT)bG;q0YaakdmU2Z(4a8
zuz3m_-UucIGC*O)bn^c9wV%@}=guX`URlZOw*sgoz^*7RrS?92<OmNdD{Jg`{&t!S
z#-!C%KT&}6(#qT=(;N1pP!t~MHKm<TYkGU{VtEp<Jyj$*JSR@gv}5xhW)_x2=fP51
z^b?}Up)P9w&^|0HI{|$%_DpDR@(a$CTI=XNxHg(p5^SnPQI%r80|#hv<nr4n?<L*2
zqo%K~A3Tth#MzT?GWd|PV3U!Zy(PKT2j-812-Txfm(Qp1@4)O_adC0f<HxoyNG48@
zXa@!c24YM+#|^nD_Uz$%^X5$jYldUQUUnXESf?k7yViU(hj2_vRzb<Fe|_y}Ic9xz
zbK^wf?b`>gtjt>E`9>EPiy4+U?Blf($to{TA-w44gSBwid9fmS3cU>U^vVqT_OWcd
zuP;`DZ&1sxaJfzg4vqEK_N+Y4+pB;7CgWHZrEcR610iw~-?A{>me~Y8)d=d;1Sop#
z@CMJO!oa{_WV~s5;=~C{8=LxW=3~c>)ipLo;?#-WI^P-iZ~flaZ{KL~vRcE`DekD~
zXyOJcm{WD%`p(;Ka8!qf@96d2kz#3SS-YH^4a?@h)s1DQj;<~~&*@g@9%cwYK2Xhx
zBh;oW@_~}5-Y*U@8-IV(qjZw)-D^5~A@=9D&V!=<OV%02oRg=xrM$k<T$~<Sr`xwr
z74-kT(RyU$t_u^J_ksTRPPlSdO^BEI7G`BxL<iCNk5mQ47$5ic@ZiB32RcqIE=_3?
zcCg2<ueupAZOKZOPEN+pM|SEF$G5y((&`BxY=gXI8|V~jDYsFwqlGhsV<R8*r>>J3
zW(q+P`4kkYd!}LHn9W4>JchpHfzgRVzPqS43wPKW03Q?MJyyGeM^Nz0F_SlSjeg#1
z!=iPaoym40%$W{_1qC!%h;VokoyyzDP|>V`2i_b8?i<`+UPp1aTpe86sh};{EGQ=@
zr=4kGiXsU^U3PVM1MRkeTN5@D;DH2OA?Qv%X&!I_M>Cw7Q=YIHv+}BSFI-R~oUr1m
zwsaXLqz2&;5j(NJq;7#93O{j33=9oxpul{83|+?|&zabaTvS?uwFXy9R8AlQrw?R`
zTTrm2(GTXKA5J?jWH!Vzy3>Qlk4fE<VPhGGa@1mL(ezo^=IwBHDNqP7GAtp(z7CWe
zg{r0mUrUn@VuiUgA*}|7Sj7At8F|r*Gk~1u9Ks5$5>Cu6pWo~N2S=}ck3J<v?dVw<
zxJEbA`?3KkfrvxbA-^kEdQ+Z<F)?TM9pdKBQmgL21Cf~D6#MEP4LoZcl6(@ONI)f<
z40IUC&YnA0i>|@K_EH{9AVtR4CDsP}Dj2ME1jZ}!77^($A8sGDu(u~}IM1MvIQ%Kc
z?djCwIjGK`?+!Sahi$FJ9njLzZM}N+Dgz@UjK{55{lNQQZ!Rh&Wn>JGPQITXsg6KW
zgI{)oX}Nv-_G>gDuoY$p%c$4Z*8EmxSpf0C*XvlJ9Mn4$eE6eB2a=MK+@Q=rqIa*X
ztVjm^Iqv1{ZDDB{(${AIzVZMLDC~U);SqFVUO~aw+|RDBZr!@2*7gBieFwk}o+u^g
zv+DWt`y`wP^j&psp&4`Y@N9*f27*Vz`Jv_Dh=UyjsOVzRJa?{0w>V(&G$NUcV&03P
zVPPuCGX3Ie=oc2w&Qt(Cg8DB`G_3w?Q&>a|z(&mhCH)_J75wrfii+5D$aBe`Oh4%-
z!in>6!FS~<(u{X7F4~LjbbwOhphaK%3VESX3jG!>^GqvQAz;ZD=DH>z5)6PT<a_r=
z0#B6n^>1~mwha$+;OWYJ=2(G%;b-nk5aJPSLs(qA9`fLOiz>7XcCUFhkRG?ul^VTb
z2XBI~VzDC4sBv^R8YZR)xWIWYUWoWDm_t00BNTa2Xma<M$-->E*e*&+Uhq~((swnX
zJ7)xW-j(Tvz|Kdb6Alo3CbEmWv8BbiYbXm;7dxN?Xp?e2zA`^z=D2kNt4YvM(0IR%
zKjK<iTEdP!0%RjqJ5h>Ifoi(Cv=FlvwzfBrlJu}$!tLTREkhqkEMm4=v6-K7f!6)6
zMygNz*A~}_K2#CIlmt(hkb5=DgcigRwm|zx<LpDu$RjPyio%)BJ7NCr8Y}xdtfV7*
z{_<=;J)Sg0K1iC(&GueA$X!7Ku5j;ma0x8^*wm80U<Df5+F!K%``|ng{2v@>+2tT8
zISQC;Wbb|P<HrlEo08O~dLC&q-MVvUsWRdCirr2!WPI*m2#_pLte<UgW+elDU5;G_
zxu$~Wh>-ljvuC@3U0h-2E3^H5FgtDAGN@pMlbwv`kUMK)^4WZL-i$LiDIwu<NAbLv
zsRj-O>vGHLj|QPQ@ObnEy@`l#f2mLEk=Cj{$BTLPKEgI{-MT(c-_y4d@aror3<Sq!
zvMS5^o8)DGhigZ1PRqf3laC#`c=2Md%!K62<5PC(W@eo1(q5y1vNj(WB&DT$jC9o1
zBYtk|V_>Mm(_7ftMHm*_5@Hqj687rV@jokb(eN23K>n#nwlg!GNg}2iSWsQ~$G-|>
z_3*#7k1aZ%VL^43L_&fm@He+@z??Uq?JN8`^jfT_sEEiriKBh)+&QRt6{ul2R`b$j
zFA)RaP#i7p@crHu)%o^m7B68WpF}}GMmIpgqbz%$K3>GR*U;Xnrh(}LhHzpQKwAwl
zHTc19xY@sEda0L|mSTTgG=_)$p-oGZw3;~%S4k{*KI3jx;9BizF+q3tMYWQ~{fH;W
zCng-(O)C8sx4vt;k{wtUu<WphU77)$c``6Rt9li|QjG7A<8I<N*!KbA%|RoO*@ahD
z28BKe7zuJUIDl{oOa?)~I6SlzY40OWPEKy{lE@nfVT(4&w(n)7c}0STh6b0E)N9?D
zBv+lDo;7!?yW&1?Gf!!2+fEI6E{xV7_HuQr1Orrrhl2&56riM_P&{{z?pmsxE*(9+
zl$>s0V4&Mfk6@aV=MB%@Zdf{Up#@Sy!wJ8_w7W<f9E|DzvMBA!%gZx&bfg5SCL3kw
zM2Zw3Lxp!KL+yl;Q`I8UY=<?hl`g$6E-sE}704}Fn-`(P6K>sdsx>VMieaKdkEpo1
zSvBkpgr>%&7?_v{Q$h0TKUr921>Q3k9zJ<u^}Q>1b#;*!A@kzb4qn_Bd1+~BWNd6W
zY!&$R#9qq=uQSW|{-{gjRbPOzfG8Y=W*3%_P=gDL;ez0R;q@Hf`QgH)p^&5`I>>W`
zP-gQ1_^|hQ<05KZz_4h$jL+;z6_xEs<jKeYoqb>ycLWRx6`OMZeke@{(i$Sclf1Gb
z(DTv^!M2=i(4SW|o_~Sai@&bd-aumDaqK}4?C-6xfCyZ*>&Nv0@Hf!A;3p&&%fdYI
z6nqAs7$KFU-vXDk*Vk)pnF{YeevB^uQGH*ld#dIB31fwAT|EmA74M{`vi=yZj1J!T
zBO-Tv0@`G|n3$M$gM6?NOf)Qsl<$3`5Ogt1Jc_Zx<6&bwCFp~CD0Mw>t)`=c7Fzlm
z&hulfdm(U)l=bxJiN^%pvT$+Pi`Xnp;VK6_F;vbq#F9@;z9Nc*We`dPgD{j}4fMVC
zw5#k@fQ=?%-ADt{;L&}wtYxg8QkJOUwP+137_MAcu)G#3%mFCg5i)B%J)3z7wFo+r
z_XSp&7A*Z=olJj>Lf~4^BFrl7yzK7hG1A%IPE%D?MX*qU-GFBWtjso@4#ac`cyI9^
zzHjmQ#;<FoY%Dd=;YEV}c;S2$q^g_bn^m`d=y23*_zy9RL9#9OPkw7>MMZ@<_^Z)+
zzR6q5nvfj?VmnF&-i4tqX4m@e)2F04eF6+F3}1C<L=r+ItgXp~U7KV5$p#~4CmnL(
zvfV)Vc!82)VJ=rsNJuDuWOedRax(pww#*TpA8Ik23LL<?g<rp{F)x8KHMg-jKcOi3
zeRHB)O3&1kg9Y|(w4>@`I_QIx?~nb1_B2Yn+>sw4m&mKHhI7^N=@SpiBX4TK6|S=N
zWMg7U>ciqG&`Fg`m+tFdyjTnuv#yvo38NyT@F5zig8usL%kBsB)jfN*ooDaF7W04W
zYZxaG^7%2;^t+E!Avg?T*=K(EbM_A<ME38otl%q7PePWKJZ6Vp(*w^RtMht~c|R&D
zDBvOvzITe6=IPTsI6!>Rq;vWXa8J2mSaZWR7`@aCYlW~95E<S?i?y`2RvG&I*|6pP
zz(A_gNL70Gej{lJN9Z(BST;*Im!=U83@A;H^;P%M&edEDu_U3hqMb5jG<;}n1>ND#
zYs(%<Kz-7J1rizM=|HG&wS#po&sRUGtfPOb=;|iIvR&dePQUQd^b8Vk!<H;=!nZ(z
zVhsu;8@>2v*-_MeT*U+aaYnuQIV2#7w{G#_s#A_bpbIn*BlzLF-trf;X1kFU8o%+1
zg2z~NK?^Oj5w&d)^;thWJlq<QS6yvwNX3=kaUF$-vl`!B-xdYpoinjy1Va;^>hz_~
z5<+$A>FE)M-d(>J>?643$jvD95HlT59Ehf*1DZzqe;Y_1-rfhmsQBTY@f|%H{k#x+
zr-IpoD5xqzek-#qU9`65yC><+vo;a0;K#^|z|+9MAmRM;3yB{-oX)7XcU(2nfSgPf
z!-`fJUkh4WIv^q}e0g8Xo9`g8+VCXQKsg(ckPsm!YY?UAfdk>|^H(={NhA^xd-(bL
zFG^k-K6>mJKg6kuj!yidUON3jZ5lvFm2r-Pqu<geQ)MJHL}&H-_GyQYj}No7*P#g!
zB5_BbvceLl$X$_`nD}`K>H+?W`~0w6MZofIC`BNWDq)#|yCNC@72%PQVMFEB0a+2z
zkeQg6Xx*2f!sl6EVPf_M%ib>^$PB^_C%Nv{VZBswZAHZ`sA2b&S;Ip|kG?07$YI-A
zNDh|9-&R`soU8X_baIktZvw{U35gPn@U7nChj7AUaCTosM!Da+fSEQV6j%nr)WkqY
zWLzAHaHKJo0UD;HsF-E4u7nmrICSV}?;f5?e(>T2Arz$l5~P(|LV^+LeO>ZF9XC8$
zbD9)AVXa5f@~)-@4c0htZ@Gp64uTVdn=nK0z=svKs0~wV`*hyK1fx&p@Px>SIT{#G
zM3=;#TjBt>cXoy%fXq24?-Dy)6%-A}Y#hm))05g%Lg#UDk%7+<!VIo;`x6~bjLxwW
z&Zm%&I$DN@rzb&xyuH0Kbz=d4`|EI}%;2}OfIOxoPLt|jc0^tcxMkuovy>w&5s$25
z-zjm;*1!zH>_yZ72OP6iS&tu+1DA-M0Qn)>9~kR4Y?#(P(H&?y25=$<-!_Q7O6a;^
zY8r>N7=?Bs#v`6+r?=;*Q^Qk;nClvN$ar2O7}i86gbq-jFv>w?pXTSQVOu(&8we7F
zmQPINi0dwbznfcFTnBPZKnS*Xby?dBt%hQh(K)7#r;QK;uue}nULx#b78ck%OlZCh
z7~x}Xjmv(@xmcY=$%$Xw3Cb=Rye<hz=e9VmTB<kBf(FuBt*oSEZfDnEDWQL*+fv6l
zWeMg0kw3y7&nq<?4Mz;0CgZyosVd<mBR!d&o6CCfqQ%ESU#9a5B?TE65~keZ=8pT>
zt}RW$Bo0NO<>lqoryb!CF`&wAuPGn6N`zl^4Gp}Qb?oo&M-+F`#pM|2=tp}2!2<^n
z|L=f#lczRFesg>Wh7*W!meoIhJOW-Jq_F@mNJvOH2}hY5vs}0_jHmD_C~&f{un-v^
zG1LT|hlHsHcD0yo^V2eugR0_87*y=TGK~KC5r=33y4nicf2Jz=ysqwb=+jkT_~0NL
zprZiZVv{3<{$EuLS^QcUQwEpL2JgVnL-O;F@lG%@FpMG-mGocYvu#e<0;L5-DJ&|{
zT~+c?sR-sZiUiY?chKsw-L>G6wEOq(!8o05b!1OMZLJb6I1bMQXn*kWnfs(<N&UYa
z$0$#sKEX2xVF=}Rcg~9}a7NL|%4!>Otl5DQN+qR=s~3b19ohw8uZNH<L=c+MkTa0)
zW2kY>HU2hy1U5(`yNyOZya}Yf0T&m(`Y*7G)CwN}Qi_PqF_`sJT%z0ORnx&sY|5uj
zQ9x}rgi|sRBWyiO3%`Fi;L+T_wsTi8(T-jkq=6Se#5fp{V;h;!b5?#U$Fp{fpPw9r
zgy=N|!5aZpq`edrHK2>5U%rs~cAR%{8Y-tn4NLmWG7sjSeWCSMi0@rlRas@g^0;F&
z6dz%07nGGX-~_C(zLm$+X+0o;7;Gxl%QB)h(l9`dc^%T*3XZwclXJ@U_J=W-7<WMP
zPP=VCM%XZ&2LpNxlAhRv!Q6FyjEDtqEMEkbDQn6-a`;P!(z$rZ+$?xXpuyHo&A_di
zzCPyb*RK-;x>ggSbBM7)(i=bq`nAIyb_~p!+$RzC1bhc#B$h<-^6?>|wOsO|-Es|c
z5Zp(O?8fAZh0MNZ`yPZp<ml<?Sy*0H1uhbd4;iSN_%K4Hh40-*F;JD3wa4EPw~x^=
z8yTwq{Kghg07>wa3oneq@yv?q>N_#wW8Y%{eoAQn_bf}TIGVJaobi}RAeKThAogJt
z!IO}O@+<t-J$G&^5Irm+LRm{IM!jF8t*ME^tU9=!^XfV=q)Mq4uaSDW>#43cV<K9K
zbQu7HgI_zX?-}L`iAEJ191JJ5#O+mrFuDXKoU#abrW~?<XD2+-xmq&FMR5F(3;QjO
z@7QK2(UQQe++N~%5rYnQ<_mLs&}Yb<vAGDim2hIy($htp`h+nZ#f<RU50feIiazJL
z!vcy&reZl*>b#s^h6yClg7Fc%=WrZCFq^w)&mK1jdj1AZ%&{C*jd74Oax%N$<ItQU
zLJXOb5lJMGuu)N|0#Az%J}50JQUXc>==J9V*kye3#5Yc2!zEmYV-Us=^OlIp$gW{7
za};S$?9G?i*-Ap?=aw-!8S&@$eA}9ktZdrzLd-+58Bg1V6Kw;zGNE9nA7Q*p1@c;4
zx)dJC+_-vobjHAfBCLK#>IVoc6;F<CR<$5#H7IprdG`D{n;hxdmTd_c8LiR3o=Svp
zBrYi8*uorOGoFVkN7Qu^^f@sjBM!mWP=$XnNty__qoN2I_{MjBFB}K5ksrIAN2=Ii
zR;qb;h~n(NHmeF`!Lu4bYVd2NT!Zb*D<{W};Dq=I*(j=F`OASLM{W)c4LzQh>~_IC
z@lF^UUje?AGuF1Y1c(yD<=0%*>1=In#e4iQqKT2=u>bLlh>3FaOkjh1k6D=uJDgTG
zOf6{_8oatpE8*04t*fhxoU+ju1J{bs%M-pS4!&P=&rlMxgqZ6@8_tF~0rV(&%nZi^
zk`Z4Z*s|r<{0JwekkZ;Ve^+l#83q$ZM1P<Dj5;L+r`O!TJ^<o!KH~mpSdfIy{MwPT
z6Q0%%L&JX78Ssa$RfCf!$-o<8rvJL)%~Tw^x^rxTe1UMz+Q?su=$}77hPiAauim>~
z@){CK1SS#&*UE2Wsj-Ivy-kEmCC>3|1G`R5%<7EJ^}q!O!%%|7hjyrb`SKyc@<1mL
z$%qGZO>6K&!Ph7)!Al3o#xKe~Ek%XPGt2r3gS&k<iJ%F=Agz0Fjf~e<GPu~S->A4B
z_Rv4yG+inFEkwax;<604jTcPkU0Ku}qypbt1Tm)!uwKqrE^!TkFj#Rq*wN#SZw3%%
z^PM{v($7>VT>S^Gvj;}Y7#5=(17(oG2xvnv4@8VL5KRfU3sj|od<(mAgGW@94y8!U
zk%C#H_Tqqt%&e@n{rxQJar<j<%4z86Lg3y3y|*H9L^iF2)r51jB`9bk;3p9d!p<WM
z>S!VAZt`~f?W@)Z%FkZ7;Bx+FV&Dnb<;c#HBsiXYSJ81^CER+ds-vBBs_N<(@J(f;
zeTMn?g(%nn7{yzh`p8tOXP*S%==E!i%?`$#uKc&ITzY_)`_zYhc#RHf4%R>chTCh=
ztkq%jhQEFLS7$!|mCA@v5lBZ7rsDte68_xY)8n$Q(sCC!K}hZ#Gr{n^{?U{30}F<D
z986>{U^oEbU}3s;I&4Q`qP2Cfp5Z(vaKa6#e+B#=j3qfeS<*u+ZScnHFur2Jee77P
zk2Ef$<7`mi(0v42Zvycyz(g}RSRMv(7?R3%9wl*n7es&vHK1pSbAS=Tnt=gU$g>(u
zQ}qEDRKM1#2_O$^ASND=G+0=SF)rqj2;b%TMWMd_{{Ig6^U^WeU%otw*~5a;(l2jb
z3YVeH?@-!RS$<j^Gt&ZWGCl$rLMw3WJ(eczeRF<MOjK0Zu5~}`*Rq>~pY%VffNcnS
zOllav^`SzE)?<hv4w3EBfghDuehWJf9FO6UV+EI+*3=?q!XJhsMGrv$g5np3Ov^A0
z!=MQz0?-1xHV(p)Lx?<ODDW)kL0jUhlaRkKZ;`rRxNsqjBJ=fa81UGYF^pM(*VJMh
ztphHS8tL)C2@tLVV&VaVETB}dc5Yy@@%ZuMsmImO>b7p%hLP(C%u<Z^nq3Wt(~t1a
z|CTs0GyiEfH03ykp8Vih!sQ1=zJa-ilh{UJe+Xuz(SmhBhS%2vk?@Yfqk^m;{60*@
z5DphuSq?thX?|}vIWG4haBXP^&K*vKx`V?J3^n1KGyj<d0waN8OgP)16Zi+;wrTbG
qpf7Xszu)HZb^iZ;%V%zL3(MFgbJx15czmUa?3AjuO1_eL$o~N{mQqas

literal 0
HcmV?d00001

diff --git a/content/russian/cs/range-queries/img/prefix-sum.png b/content/russian/cs/range-queries/img/prefix-sum.png
new file mode 100644
index 0000000000000000000000000000000000000000..4e00190ac1d929b15e8b97c26ea1bb8abe73beed
GIT binary patch
literal 5403
zcmb_gcTiJrlnq5dng|x)NACimN^c6%dodt_h>;cz0YXPQ0!o)3Rf<Tj(wo#E2?&Zx
z?;S)+0O_IZ$L#*QJ3F(R$-MV6N#48Pz2}~D?)S#XK$C`&nGynl&}eHtFo8gb?7_7m
z1sOPw?V_x}i4?7>ZAJlJuPB~Jg3nCu>K5*%D0_D=TQ@t1gEI<g2Svl(?ChM;jwp9L
zNxL%mk>KTz)ZFZB-JMX*9A-{PJBWsx6Ni{IhrX>3hlH4f1c$hU{9Re_=Fm6Qy{=gy
z0)gD{(tdE?%sXRq+RyA?J2vud(ZkQe0*DauyW0AXX!Mk(Azm_Fx42O+kh~v+${Gce
z($b3r<R0XB((#HeL%Sbwqkbbowvv;f9GomhKg7A{mzAv|(i|@4wksB?`-8B5qtBFA
zk8lK%CvAUOX4&zIc!Dw<%LwNPqvN7R&k#{?hB@Y9^(Z()a{k9-)kg2N&G8!Y+oDgx
za&s3$zyG}_b@%Smz284ADxU_M5sp{0tDL9h=W9rUOp3phJS-J+{J~M2{O(;yUy3MI
zwoS3wv(uv;3M#5ZM1k_bdugPqw)WMRNTeK}q+YQZW_Fgn#$h1&@;iw%>YF!jR#sNd
zrS^;Eo4n~w<mcw$S!z3gk0g@oVWXKsx3^uWS!B1{0**hG7Zi|Py?RwZL?jvE<m5Eg
z=C82QpVlRw=6bL(9$Hx`)}Jb_R-$w;29J!54KFPf!aWavXlBO2%X`JV!YTnlIK)R-
ziQt4jwP3)1JU7OIfB$yWOOsG9@$vED<l`fwrKM#_WP({*zEz8$o&S}uYl0*kY|5fn
zs0$3Uzk~n0OH8ct+_WSmCDp3aWb+{f(NtDfFWl~@;N|6&^*_1$JyXHUhOqXN3EduW
zCT92bhPR*J?5OBP-PvxR=vWiF&|rFU@}I!tJ{wt{KUw%Tf<fU(1Ik7(Q~u`G))osr
z6aHj>UG~*crGfuaPXeF#L0Vdxyp)53L(A!IpNp&Oe0+~LSj&A4jc|i$91h2WRaR26
zE?ik!3S*MAr?_xlCg9sOVXzT@e8fuXRn7-uL;r`PlBLk%eT1{~h|hUze*P6{Y3U65
zGtZ)7gc%H`ns0Ky9Hy5069OqKFRz0gaE0xOY{kgQ%QMT#$!+iNUx8DD3s4E^H{GzU
zBmj=^aI%);mG>>VjCNlluUiV?Jt$~tu|IP)<#;66;MBdny^Wrm%x|Nje*XGJoTf*n
zqM|bMGumx)LW+l<Kew%o4U{QW!j2MFX2E&-$0<|JYl~&ZcfFjvI%wcic|`>+`>Smt
z%g2v98ACO4?^{`A7*X-KqWbR<1YNQNPVQWyX}Hv!TR<SvE6-U-R1{Zj*SEd(8NVC*
z>|ZQaBl;g1T31)s%G%nnCLcl>%*2FV8a2}|muY+F&K<BIbu+UYkh!zNS<QR*Al22?
z+hg!JSA$V00NcDg9v(hE3<g8ofSSdQRl$v6uy8}=<1nulXF2!ZbiW1%bIb&dO-wKf
zvBn_$V5t`)A8gtKAVH$-r}R;`p!A51rUCKC78dt%j6M_>U!4sglxs`dL5_|V<HVo$
z(9MWPCue8JHKA9Zjg&n~l}0kFs;bWY`SZ5ev@re4I(Q=ayLa#MHUXe(>*{R#lA$c7
zOsgB?HDe7Z_Uf7%B8aoI^U>t&dARjw&-sJR$&5jdfwD(6+v5(|^8<nBl6V0PW5J}J
z@86TtB_Ls8VUFe2P~xqaV_TQK-^MN57%Dy~(XGWrlEA<~r_W7IOp^^L9OgBZT!BOu
zXf(u&7x_p91qEtGu@0W^6DNCXBz2ZC$;r2hSC*HrfXnwXC^x->ZAlp!0M6tQ`AwCx
zy+oJw_f+CZ2^ksD6BCwGO=xleFpKtp*1Zu+_x=6-p5EThKf|Rynrq2^tAq3m4B~<3
zuM(LK2MY~fARFrIKbDnIcXoDW&@+jOiP1?(Nm<tYBjemLHN}#Zm6c&~COtMe$t@v4
z|99nE7f4ejV62|!5;+Cp5Tfs?NF;J+$kfpGcz5~ecWzvKQqt?OF^ie1=z5>O_QIC6
zkzUaR6I0XBM1I2skfUYReB6?ooG=tx<-PN4#<)2~JUo>A8keFX3rK9sqT(KiZ)ddq
z{N}~!s*1SVlGYJ{K;?z)Zxs(bnvY;q@GPrUf@7%xN)WBz7={~W1ELbRH7PHQ)xtrQ
z$QQ9Oi>$dgW@hF}`)|xIry8$!h0`pu7EUT3FEOOPf6vXy8SM4%Y#3O3S7#>?zd=Sv
zJf|8R8(VB+qwGvotqZFE(9Od`-QQn{l8OqX?43?N20TD-p$keg-X5qDevSRRf<_$F
zs*%pv*f=sedWkiEvcjq<1P~S%H#bQ;VH)p^^Z7m9=t{CaQmzKTU!a#3!X9u$G(SHd
z`Q}ZqYR(+1A*N2;!QLJ-J#D=nq7R32Lppw$M6Tft>>M0&D=X=8a&msPfBeYRv+NcY
z9{#bpIo<0)?$Gcsr;w2Dx?tU%@mePaK7))%UfpDf;{~0Csqv#nZ&Fhk^9l<W6ExZG
z+u7xhT+ugXQ3+s)jg7Uu5EK;j#{Eec8Zy&+FRkBfGxhW5bp@}jFOma_+(Vg6R-bL`
z?05lVMkxEAXw@w&JOff;bR^k$m4S}~3e_(K=zE<e>5$&v+7F3ONg=PZq#-9K2jzn*
zDY3#}u*k^Bcm$w$H9b9QoEg@**vz8Ijkg{-$#-U~7;q{}OiVntu>mNczN6!wRokma
zrI&1N;Iw7vaJxY>nvruzQF0V}ilGvno4q)1*t=x_SWzT}TUq(Zk)*x<RzTUfnfsK?
z3ncQJoc7_;>MFOi^es+K&iLNvkcnC3o4q}lImrI}+*}aoOYJ&IhwpRyqc*FlIw~N}
zQ1l^l43_5{-+uC^l!B5{RZA;!FTKL1J+`DoQ1NVUFk|qHeRON}_wUG_9<9?ed|*Ut
zt5Qj6X_fCjBGHe9usz?UrK5u}R5=U#^5wzlVf#gQVRnGvR9c)e0B?9iM8vjX-}DQ>
zC;|ck9bU8kg%1t&^_Tzc@pfynD&_QIlyl4LovIj{U+PV~q?q%+*%u!G4=>{4A{rWG
z4z{L+mR{^^Z+8Tfkkc_Ra9h_lR8@7xvMR{(4`R=1_jYjXx1EieUjRPf^Sl0IVBlJQ
zetrww{c%8mGM++78oi<?hV{mIbA*J1=xsSo11VViJD{@4nyb5Trp+A|KP!vq2i}G8
z|LK^@1|>YQwvLIqExLeX4}A5*hph;t5&(2<NLN=E*oa58Z<<%ZS6@sGmGU2!mX~*r
z+5|#RHSZ^-q^zHLLtm@f(!*_FumV$4!6ezhbH$8|3|StDEY+NQ0J{cf0E<AhmzS62
z>wuy>x3{Of2&Pq6SNFxWefAwn?ZRMOmU^z2$A!@wo0>8J!Kg*VDW3@3Wu^g~jJuPx
zGchsYeyj2t6;;=5n^#1BetrhU(WW*wali(Zr^m%zTXJjCj&6PZ`ii2WVl4vnTYTR#
zMkXRXojJ2ATg;Z&AVU^Jh(e*RnVXxJTRJ)l(6Y!vsHmte?_+IUK8W_kL@Nmkc`~T>
zsD=D@7GX+-moPdqHAVgS@nfg=;$%rWyr)SXsy$sm@^h=J=>?N05p~SWF=UXZNMy|Y
zP_nf%rrfQsM}Pod17tlLtF-Ou>+1pzYJPR~@af=%4iH}wFE1}(hc1OYMbrk{8A|Tq
ztj3U_ksGYXr{?1$2U2NdY#a=bnu9VetEkX;`BFHXmi2Am=_0$lji`tS7G(Eldtl($
z_;@GqkcoWy3-M+tfatcSKf7*C$+b2lu>$5`Vq&VSskyx4jQ1i9@M?PjKm5e7@jz!@
zj8^0{HZnq?P|$&vxFX<QCsFN}3&<%cq4f0hjEss3ZE9jfrPIXtTUs&$D3gJxr&@d%
zfTVU68RyFK4BPjQfx6^2Hd?GYgocK?0$Xh0HI*L{LM-hvcMoQ2+QkSLmXS%>UmuNo
z`0ydqi7_Xz7O_oDP2!rGng_phMXN9v)tQ9odiOPx1K<mFCnxKXaM1fpsYU2--@XY8
z3r9nN1nqB5rXZ&pjaU7gry4si8NA|2>sN4dAO=3Zz6Wk5XCoEXeJkJI<Fa1uYG2-Y
z?Uj;{P*gV#r#l%F6C=#Q5o8SnRR;d2&SM?sIgh#Gq#z;^3rt!yVCmw6i5H)$fb2Q>
zP1HJBPAW1n6X3X5aYI8xn2pT<BfPA#GM<J_F%@X*zNZ;$IJPY(C-_^I();{^g6gh_
zYc;jCcaxKovFF<nD(=UDxTo+e85A284b5;#468yia=N*$@Ke4y!2eF{5ALL-r2dJC
z^v&td<|TJ-P}<wuW1Xj)yT#LNfnNLmgLZKdo2;}At@Zyibf?~Pb7Cgi{zrx!dQUn6
zsO9@ic@Le^1HdRXK(P*BwFbw>$5rMw{tqXdU%q@^Utj<7=STHuajcH6F7V7urUEy9
z9lknVPI)=k5!{+pCzKh}nw`!5UcsvZ_N}iks&{#J=IzKGHF)6KFJ8};ex~wlInd6N
zEk5WJg4B`n3Q%Qv6q7~_vmJe$XL|Le5SEvhyP(mD2;kHqLH+%DPpW+Na*XBxGXipc
zhX7VV<3Dwzrn(yAhi{E|`<AlUbNTB{#q*;@7SlR35NIT_ycX%;5J5yj_8~8i6cV41
zaM|VFzP>uMl87MzVDm+L1_to^Ez>!$(uBmsTn`pV5MXLLrJf&SgrlR5*;$(tUo={>
z9yJ^Ln(DS^)2xW-d0k!IJiviGI^Dp)0JPrvLDU=|2coK~s>=F$4P)c$e1=)k<24R{
zW9^J^-P)d`yg~K#(h$1yr@#1t{NE#-@B;5!R9s91rv^h=a`J*rXqJMP2++cC<6I3}
z;0YX&$nHwpE-}}I2QxoiPWGJ2H%CTb|0X4I%E~fAj7&^E9x{>zm70H|6gQt_LZi_Z
zmX-)s@_U3T4vgzl#|?bj4g~UFHZBxRux7jN;o<Sa@ve)D%a0NTdHFxqA2c3-g;-QS
zzk2B)fv@IN3;z&!agMQa`4j!=lW0a}rtA6$v>0Y&1i><HO*M6iKN`4f&U*xf;3Oe)
zVn{Odad=UYKwe(ncC`J!n(Y2J3Jo$b+2?!w{TXr$0RaKBKD#`43*X*lHxS*korP}5
zglNwng>Z#7J_$)lNx&A4tmMEPHu~-hl2J2uh^GY{#-2?h5D1suCG9ecD$HC!^UFzA
zs_-eos7}6#wgc)TVIfL1ujUG9ZpDk!zmaiqs~KO#0CRr9;TC~3{@RoRh^(xtx*ouJ
z+h6`k*L3Of6*97CqZQUiV3PLfmax1mCrI@4XD@hX$Q9tiv5AJtcdQP;!~{$_8L+iq
zK;|K}?6{LyTL_;k3u>&AXmuc4^Wnpr5DqD+n`~@s7*_HV?76ew!A57J>tghTc}mRq
z_&C$Z$cQ5X@#`w^F;g);3oafW-TAsH-Z(Lx2M<U;e*9R87#0M9!T~`W6`O7E?k=#F
zPg+@ljS(;d50`*?aPsp<m<#7ZpR_QFwjFVS!K2cy@7AmRVRJ_(r%o{adut9>yUeRT
zP*>-akT59$M7O3CPL%q^bz^-!id!>wuIrjgm>yqfO3Lk%gH53+*AJ)-e0|_W05!Yv
zRFOYZ#?;^4T~#NiVpw@eiOxENVK|)TAD7=hIG;N>boKXB-y<|v6o?g$1Dsi;e-acG
zjnK%w9>A=T4g3O$V|(gg&Z2O0U0sqcAKd9KGmD%X>Ej0HBm~e0%xZREEWkr-Z!L17
z_9_Fv!2_`Bpo9IC2p}u~U2`MlkHNCR5KjzQ+ZenN3;%G&1#~v9)*1}aP9#e?HC1>c
zbMuI@GGSrs=Aoi9rWT<^P&Li__k&?j8JSU@bhSs1m^6SFl(PG9Mnry{jZzFdRBXC4
z<A>Ltboidpd9<^*^M%3CMQ>PgK<SWWV`HOM*(&GbN8**0l~f?vV3z{mV=a@GQT$n$
zdm%MyZEXz%^p2SXvUWd4<$NdZX>?9ek=BW&v$F`Oj2PItcrq|y&F;T<J<D%P&gn6j
z8Nn>{sr_6YXLd0<Jmv)$Ak?yaByAlZ-D!%Gneiq8$xu~SzjAVjk3+Q`lY!2oprP4(
z;!6v5P(rD0i)c+!z9t5hYxLgX0#3NJq5_K#yzql`Y)&>@Zh7Pt6u24KrE26Rd=FO`
zo0#C@;v&xWKS;Yfz?aeo2D4KGcHQOsHlCn|T6Vt*UE1j95@}|$!0^xbtY*%{7ICYn
zumd`xxOz2VS4vu1$BVSFsfklSfa0?kUPD!tIOyQ`I~ZTrs4YoPuxyShpYgJ^jBw!4
zgY@_q7EE7J$M{Epsgoq=r|x~meXWC;R=<SgWHK8Yo28YNXJC#4Bl_mni#aYRlm^U4
z@oY{3>*Wo4e10(4t@@Ya#QgmH|E8qCLCbretQGp;{tSVU)7Ziy60E+XyZg%d`MLLI
z-Kd!Vi92M|^p<Y2aDs?zQ*-k(VCq!$puBv13#FghR5dinfLkLbA=&Ad9MD4d03UR@
z?LbLMIk&KIxy1n{6`IRFSzX#Z<e4Q@+;VK!l72BF4Cw&ERr$1&n3`4ohQ7W&@F<dL
zRi8>qFuTirws3gp*RMLHfEH$2d@@WMB<(kX58!e<JfycptUBg8Lus~Blsm)dR7tt9
jlziaK^S|HNxge3-b^F2*)qE3dTtc+f4IWgeJ`MQ~Iec%@

literal 0
HcmV?d00001

diff --git a/content/russian/cs/range-queries/prefix-sum.md b/content/russian/cs/range-queries/prefix-sum.md
index 861200a1..f4e02570 100644
--- a/content/russian/cs/range-queries/prefix-sum.md
+++ b/content/russian/cs/range-queries/prefix-sum.md
@@ -52,13 +52,15 @@ $$
 
 Для ответа на запрос поиска суммы на произвольном полуинтервале нужно просто вычесть друг из друга две предподсчитанные префиксные суммы.
 
-@@
+<!--
 \foreach \n [count=\x] in {5, 4, 7, 2, 2, -1, 8}
   \node[rectangle, minimum size=9mm, draw] at (\x, 0) {\n};
 
 \foreach \s [count=\x] in {0, 5, 9, 16, 18, 20, 19, 27}
   \node[below] at (\x-.5, -.5) {\s};
-@@
+-->
+
+![](../img/prefix-sum.png)
 
 ### Другие операции
 

From 6b3df0447d0d22ba55d4ad7496f7a3965b7f8ff2 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Sat, 7 May 2022 00:46:05 +0300
Subject: [PATCH 084/173] typo

---
 content/english/hpc/architecture/functions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/architecture/functions.md b/content/english/hpc/architecture/functions.md
index f7a74cc6..02614f94 100644
--- a/content/english/hpc/architecture/functions.md
+++ b/content/english/hpc/architecture/functions.md
@@ -16,7 +16,7 @@ Both of these concerns can be solved by having a dedicated location in memory wh
 The hardware stack works the same way software stacks do and is similarly implemented as just two pointers:
 
 - The *base pointer* marks the start of the stack and is conventionally stored in `rbp`.
-- The *stack pointer* marks the last element on the stack and is conventionally stored in `rsp`.
+- The *stack pointer* marks the last element of the stack and is conventionally stored in `rsp`.
 
 When you need to call a function, you push all your local variables onto the stack (which you can also do in other circumstances, e. g. when you run out of registers), push the current instruction pointer, and then jump to the beginning of the function. When exiting from a function, you look at the pointer stored on top of the stack, jump there, and then carefully read all the variables stored on the stack back into their registers.
 

From 2808896da9ff9168952ef3e4b5725dd172dd400f Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Sat, 7 May 2022 00:46:13 +0300
Subject: [PATCH 085/173] extra space

---
 content/english/hpc/algorithms/argmin.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/algorithms/argmin.md b/content/english/hpc/algorithms/argmin.md
index 0a9531c1..2089d083 100644
--- a/content/english/hpc/algorithms/argmin.md
+++ b/content/english/hpc/algorithms/argmin.md
@@ -164,7 +164,7 @@ int argmin(int *a, int n) {
 
 The compiler [optimized the machine code layout](/hpc/architecture/layout), and the CPU is now able to execute the loop at around 2 GFLOPS — a slight but sizeable improvement from 1.5 GFLOPS of the non-hinted loop.
 
-Here is the idea: if we are only updating the minimum a dozen or so times during the entire computation, we can ditch all the vector-blending and index updating and just maintain the minimum and regularly check if it has changed. Inside this check, we can use however slow method of updating the argmin we want because it will only be called a few times. 
+Here is the idea: if we are only updating the minimum a dozen or so times during the entire computation, we can ditch all the vector-blending and index updating and just maintain the minimum and regularly check if it has changed. Inside this check, we can use however slow method of updating the argmin we want because it will only be called a few times.
 
 To implement it with SIMD, all we need to do on each iteration is a vector load, a comparison, and a test-if-zero:
 

From ece7674101f421484943c6df14c142e30059abde Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Sat, 7 May 2022 03:29:13 +0300
Subject: [PATCH 086/173] typo

---
 content/english/hpc/pipelining/branchless.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/pipelining/branchless.md b/content/english/hpc/pipelining/branchless.md
index f84627b5..280498b1 100644
--- a/content/english/hpc/pipelining/branchless.md
+++ b/content/english/hpc/pipelining/branchless.md
@@ -41,7 +41,7 @@ sar  ebx, 31    ; t >>= 31
 imul  eax, ebx   ; x *= t
 ```
 
-Another, more complicated way to implement this whole sequence is to convert this sign bit into a mask and then use bitwise `and` instead of multiplication: `((a[i] - 50) >> 31 - 1) & a`. This makes the whole sequence one cycle faster, considering that, unlike other instructions, `imul` takes 3 cycles:
+Another, more complicated way to implement this whole sequence is to convert this sign bit into a mask and then use bitwise `and` instead of multiplication: `((a[i] - 50) >> 31 - 1) & a[i]`. This makes the whole sequence one cycle faster, considering that, unlike other instructions, `imul` takes 3 cycles:
 
 ```nasm
 mov  ebx, eax   ; t = x

From e6d9601a8dcb8a41d6776f57cde82e0abfe22732 Mon Sep 17 00:00:00 2001
From: yatancuyu <45235844+yatancuyu@users.noreply.github.com>
Date: Wed, 11 May 2022 15:40:33 +0300
Subject: [PATCH 087/173] Add missing return value

---
 content/russian/cs/graph-traversals/cycle.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/content/russian/cs/graph-traversals/cycle.md b/content/russian/cs/graph-traversals/cycle.md
index 5347e9cd..7a274da1 100644
--- a/content/russian/cs/graph-traversals/cycle.md
+++ b/content/russian/cs/graph-traversals/cycle.md
@@ -60,6 +60,7 @@ int dfs(int v, int p = -1) {
             }
         }
     }
+    return -1;
 }
 ```
 

From 912a24172441950b850629814c0893ebea8c6915 Mon Sep 17 00:00:00 2001
From: yatancuyu <45235844+yatancuyu@users.noreply.github.com>
Date: Wed, 11 May 2022 15:45:14 +0300
Subject: [PATCH 088/173] Prevent infinite loop

---
 content/russian/cs/graph-traversals/connectivity.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/russian/cs/graph-traversals/connectivity.md b/content/russian/cs/graph-traversals/connectivity.md
index 45ceec28..17628308 100644
--- a/content/russian/cs/graph-traversals/connectivity.md
+++ b/content/russian/cs/graph-traversals/connectivity.md
@@ -31,7 +31,7 @@ void dfs(int v, int num) {
 int num = 0;
 for (int v = 0; v < n; v++)
     if (!component[v])
-        dfs(v, num++);
+        dfs(v, ++num);
 ```
 
 После этого переменная `num` будет хранить число компонент связности, а массив `component` — номер компоненты для каждой вершины, который, например, можно использовать, чтобы быстро проверять, существует ли путь между заданной парой вершин.

From 63526ca0348b0abd5359d53cd91f24c92ddbf654 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Fri, 13 May 2022 16:59:54 +0300
Subject: [PATCH 089/173] slides theme

---
 assets/slides.sass                            |  50 +++
 config.yaml                                   |   6 +-
 content/english/hpc/slides/01-intro/_index.md | 297 ++++++++++++++++++
 content/english/hpc/slides/_index.md          |  10 +
 4 files changed, 360 insertions(+), 3 deletions(-)
 create mode 100644 content/english/hpc/slides/01-intro/_index.md
 create mode 100644 content/english/hpc/slides/_index.md

diff --git a/assets/slides.sass b/assets/slides.sass
index e69de29b..671ababe 100644
--- a/assets/slides.sass
+++ b/assets/slides.sass
@@ -0,0 +1,50 @@
+$font-text: 'Source Sans', serif !default
+$font-code: 'Inconsolata', monospace !default
+$font-headings: 'Garamond', serif !default
+
+$borders: 1px solid #eaecef !default
+    
+/* fonts */
+@font-face
+  font-family: 'CMU'
+  src: url(fonts/cmu.woff2)
+
+@font-face
+  font-family: 'Merriweather'
+  src: url(fonts/merriweather.woff2)
+
+@font-face
+  font-family: 'Inconsolata'
+  src: url(fonts/inconsolata.woff2)
+
+@font-face
+  font-family: 'Garamond'
+  src: url(fonts/garamond.woff2)
+
+@font-face
+  font-family: "Open Sans"
+  src: url(fonts/opensans.woff2)
+
+@font-face
+  font-family: "Source Sans"
+  src: url(fonts/sourcesans.ttf)
+
+@font-face
+  font-family: "Crimson"
+  src: url(fonts/crimson.ttf)
+
+body
+    font-family: $font-text
+    font-size: 24px
+
+h1
+  font-size: 2em
+  text-align: center
+  margin-top: 0
+  margin-bottom: 20px
+
+h2
+  font-size: 1.5em
+
+h3
+  font-size: 1.25em
diff --git a/config.yaml b/config.yaml
index 8fb26a1c..1f196de4 100644
--- a/config.yaml
+++ b/config.yaml
@@ -42,8 +42,8 @@ languages:
 params:
   repo: "https://github.com/algorithmica-org/algorithmica"
   reveal_hugo:
-    theme: white
+    #theme: white
     slide_number: true
     transition: none
-    #custom_theme: "slides.sass"
-    #custom_theme_compile: true
+    custom_theme: "slides.sass"
+    custom_theme_compile: true
diff --git a/content/english/hpc/slides/01-intro/_index.md b/content/english/hpc/slides/01-intro/_index.md
new file mode 100644
index 00000000..492ceb6a
--- /dev/null
+++ b/content/english/hpc/slides/01-intro/_index.md
@@ -0,0 +1,297 @@
+---
+title: Why Go Beyond Big O?
+outputs: [Reveal]
+---
+
+# Performance Engineering
+
+Sergey Slotin
+
+$x + y$
+
+May 7, 2022
+
+---
+
+### About me
+
+- Former [competitive programmer](https://codeforces.com/profile/sslotin)
+- Created [Algorithmica.org](https://ru.algorithmica.org/cs) and "co-founded" [Tinkoff Generation](https://algocode.ru/)
+- Wrote [Algorithms for Modern Hardware](https://en.algorithmica.org/hpc/), on which these lectures are based
+- Twitter: [@sergey_slotin](https://twitter.com/sergey_slotin); Telegram: [@bydlokoder](https://t.me/bydlokoder); anywhere else: @sslotin
+
+----
+
+### About this mini-course
+
+- Low-level algorithm optimization
+- Two days, six lectures
+- **Day 1:** CPU architecture & assembly, pipelining, SIMD programming
+- **Day 2:** CPU caches & memory, binary search, tree data structures
+- Prerequisites: CS 102, C/C++
+- No assignments, but you are encouraged to reproduce case studies: https://github.com/sslotin/amh-code
+
+---
+
+## Lecture 0: Why Go Beyond Big O
+
+*(AMH chapter 1)*
+
+---
+
+## The RAM Model of Computation
+
+- There is a set of *elementary operations* (read, write, add, multiply, divide)
+- Each operation is executed sequentially and has some constant *cost*
+- Running time ≈ sum of all elementary operations weighted by their costs
+
+----
+
+![](https://en.algorithmica.org/hpc/complexity/img/cpu.png =400x)
+
+- The “elementary operations” of a CPU are called *instructions*
+- Their “costs” are called *latencies* (measured in cycles)
+- Instructions modify the state of the CPU stored in a number of *registers*
+- To convert to real time, sum up all latencies of executed instructions and divide by the *clock frequency* (the number of cycles a particular CPU does per second) <!-- .element: class="fragment" data-fragment-index="1" -->
+- Clock speed is volatile, so counting cycles is more useful for analytical purposes <!-- .element: class="fragment" data-fragment-index="1" -->
+
+----
+
+![](https://external-preview.redd.it/6PIp0RLbdWFGFUOT6tFuufpMlplgWdnXWOmjuqkpMMU.jpg?auto=webp&s=9bed495f3dbb994d7cdda33cc114aba1cebd30e2 =400x)
+
+http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/
+
+----
+
+### Asymptotic complexity
+
+![](https://en.algorithmica.org/hpc/complexity/img/complexity.jpg =400x)
+
+For sufficiently large $n$, we only care about asymptotic complexity: $O(n) = O(1000 \cdot n)$
+
+$\implies$ The costs of basic ops don't matter since they don't affect complexity <!-- .element: class="fragment" data-fragment-index="1" -->
+
+But can we handle "sufficiently large" $n$? <!-- .element: class="fragment" data-fragment-index="2" -->
+
+---
+
+When complexity theory was developed, computers were different
+
+![](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4e/Eniac.jpg/640px-Eniac.jpg =500x)
+
+Bulky, costly, and fundamentally slow (due to speed of light)
+
+----
+
+![](https://researchresearch-news-wordpress-media-live.s3.eu-west-1.amazonaws.com/2022/02/microchip_fingertip-738x443.jpg =500x)
+
+Micro-scale circuits allow signals to propagate faster
+
+----
+
+<style>
+.randomname{
+    display: flex;
+    flex: 1em 5em;
+}
+</style>
+
+<div class="randomname">
+
+<div style="flex: 1; margin-top: -30px">
+
+![](https://en.algorithmica.org/hpc/complexity/img/lithography.png =450x)    
+
+</div>
+
+<div style="flex: 2">
+
+Microchips are "printed" on a slice of silicon using a process called [photolithography](https://en.wikipedia.org/wiki/Photolithography):
+
+1. grow and slice a [very pure silicon crystal](https://en.wikipedia.org/wiki/Wafer_(electronics))
+2. cover it with a layer of [photoresist](https://en.wikipedia.org/wiki/Photoresist)
+3. hit it with photons in a set pattern
+4. chemically [etch](https://en.wikipedia.org/wiki/Etching_(microfabrication)) the exposed parts
+5. remove the remaining photoresist
+
+(…plus another 40-50 steps over several months to complete the rest of the CPU)
+    
+</div>
+
+</div>
+
+----
+
+The development of microchips and photolithography enabled:
+
+- higher clock rates
+- the ability to scale the production
+- **much** lower material and power usage (= lower cost)
+
+----
+
+![](https://upload.wikimedia.org/wikipedia/commons/4/49/MOS_6502AD_4585_top.jpg =500x)
+
+MOS Technology 6502 (1975), Atari 2600 (1977), Apple II (1977), Commodore 64 (1982)
+
+----
+
+Also a clear path to improvement: just make lenses stronger and chips smaller
+
+**Moore’s law:** transistor count doubles every two years. <!-- .element: class="fragment" data-fragment-index="1" -->
+
+----
+
+**Dennard scaling:** reducing die dimensions by 30%
+
+- doubles the transistor density ($0.7^2 \approx 0.5$)
+- increases the clock speed by 40% ($\frac{1}{0.7} \approx 1.4$)
+- leaves the overall *power density* the same
+  (we have a mechanical limit on how much heat can be dissipated)
+
+$\implies$ Each new "generation" should have roughly the same total cost, but 40% higher clock and twice as many transistors
+
+(which can be used e. g. to add new instructions or increase the word size) <!-- .element: class="fragment" data-fragment-index="1" -->
+
+----
+
+Around 2005, Dennard scaling stopped — due to *leakage* issues:
+
+- transistors became very small
+- $\implies$ their magnetic fields started to interfere with the neighboring circuitry
+- $\implies$ unnecessary heating and occasional bit flipping
+- $\implies$ have to increase voltage to fix it
+- $\implies$ have to reduce clock frequency to balance off power consumption
+
+----
+
+![](https://en.algorithmica.org/hpc/complexity/img/dennard.ppm =600x)
+
+A limit on the clock speed
+
+---
+
+Clock rates have plateaued, but we still have more transistors to use:
+
+- **Pipelining:** overlapping the execution of sequential instructions to keep different parts of the CPU busy
+- **Out-of-order execution:** no waiting for the previous instructions to complete
+- **Superscalar processing:** adding duplicates of execution units
+- **Caching:** adding layers of faster memory on the chip to speed up RAM access
+- **SIMD:** adding instructions that handle a block of 128, 256, or 512 bits of data
+- **Parallel computing:** adding multiple identical cores on a chip
+- **Distributed computing:** multiple chips on a motherboard or multiple computers
+- **FPGAs** and **ASICs:** using custom hardware to solve a specific problem
+
+----
+
+![](https://en.algorithmica.org/hpc/complexity/img/die-shot.jpg =500x)
+
+For modern computers, the “let’s count all operations” approach for predicting algorithm performance is off by several orders of magnitude
+
+---
+
+### Matrix multiplication
+
+```python
+n = 1024
+
+a = [[random.random()
+      for row in range(n)]
+      for col in range(n)]
+
+b = [[random.random()
+      for row in range(n)]
+      for col in range(n)]
+
+c = [[0
+      for row in range(n)]
+      for col in range(n)]
+
+for i in range(n):
+    for j in range(n):
+        for k in range(n):
+            c[i][j] += a[i][k] * b[k][j]
+```
+
+630 seconds or 10.5 minutes to multiply two $1024 \times 1024$ matrices in plain Python
+
+~880 cycles per multiplication
+
+----
+
+```java
+public class Matmul {
+    static int n = 1024;
+    static double[][] a = new double[n][n];
+    static double[][] b = new double[n][n];
+    static double[][] c = new double[n][n];
+
+    public static void main(String[] args) {
+        Random rand = new Random();
+
+        for (int i = 0; i < n; i++) {
+            for (int j = 0; j < n; j++) {
+                a[i][j] = rand.nextDouble();
+                b[i][j] = rand.nextDouble();
+                c[i][j] = 0;
+            }
+        }
+
+        for (int i = 0; i < n; i++)
+            for (int j = 0; j < n; j++)
+                for (int k = 0; k < n; k++)
+                    c[i][j] += a[i][k] * b[k][j];
+    }
+}
+```
+
+Java needs 10 seconds, 63 times faster
+
+~13 cycles per multiplication
+
+----
+
+```c
+#define n 1024
+double a[n][n], b[n][n], c[n][n];
+
+int main() {
+    for (int i = 0; i < n; i++) {
+        for (int j = 0; j < n; j++) {
+            a[i][j] = (double) rand() / RAND_MAX;
+            b[i][j] = (double) rand() / RAND_MAX;
+        }
+    }
+
+    for (int i = 0; i < n; i++)
+        for (int j = 0; j < n; j++)
+            for (int k = 0; k < n; k++)
+                c[i][j] += a[i][k] * b[k][j];
+    
+    return 0;
+}
+```
+
+`GCC -O3` needs 9 seconds, but if we include `-march=native` and `-ffast-math`, the compiler vectorizes the code, and it drops down to 0.6s.
+
+----
+
+```python
+import time
+import numpy as np
+
+n = 1024
+
+a = np.random.rand(n, n)
+b = np.random.rand(n, n)
+
+start = time.time()
+
+c = np.dot(a, b)
+
+duration = time.time() - start
+print(duration)
+```
+
+BLAS needs ~0.12 seconds
+(~5x over auto-vectorized C and ~5250x over plain Python)
diff --git a/content/english/hpc/slides/_index.md b/content/english/hpc/slides/_index.md
new file mode 100644
index 00000000..794e67a6
--- /dev/null
+++ b/content/english/hpc/slides/_index.md
@@ -0,0 +1,10 @@
+---
+title: Slides
+ignoreIndexing: true
+weight: 1000
+draft: true
+---
+
+This is an attempt to make a university course out of the book.
+
+Work in progress.

From 498100cf79e8d6d512edcedbb09e274f40030d38 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Fri, 13 May 2022 19:00:33 +0300
Subject: [PATCH 090/173] typos

---
 content/english/hpc/external-memory/locality.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/english/hpc/external-memory/locality.md b/content/english/hpc/external-memory/locality.md
index 569d9437..eca83766 100644
--- a/content/english/hpc/external-memory/locality.md
+++ b/content/english/hpc/external-memory/locality.md
@@ -23,7 +23,7 @@ In this article, we continue designing algorithms for the external memory model
 
 In this context, we can talk about the degree of cache reuse primarily in two ways:
 
-- *Temporal locality* refers to the repeated access of the same data within a relatively small time duration, such that the data likely remains cached between the requests.
+- *Temporal locality* refers to the repeated access of the same data within a relatively small time period, such that the data likely remains cached between the requests.
 - *Spatial locality* refers to the use of elements relatively close to each other in terms of their memory locations, such that they are likely fetched in the same memory block.
 
 In other words, temporal locality is when it is likely that this same memory location will soon be requested again, while spatial locality is when it is likely that a nearby location will be requested right after.
@@ -136,7 +136,7 @@ $$
 t[k][i] = \min(t[k-1][i], t[k-1][i+2^{k-1}])
 $$
 
-Now, there are two design choices to make: whether the log-size $k$ should be the first or the second dimension, and whether to iterate over $k$ and then $i$ or the other way around. This means that there are of $2×2=4$ ways to build it, and here is the optimal one:
+Now, there are two design choices to make: whether the log-size $k$ should be the first or the second dimension, and whether to iterate over $k$ and then $i$ or the other way around. This means that there are $2×2=4$ ways to build it, and here is the optimal one:
 
 ```cpp
 int mn[logn][maxn];

From 457960740ed133df92f23d0002176a66c8abd923 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Fri, 13 May 2022 19:48:07 +0300
Subject: [PATCH 091/173] "great-grandfather"

---
 content/english/hpc/data-structures/binary-search.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index 56f1609a..d9a3dcf6 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -175,7 +175,7 @@ With prefetching, the performance on large arrays becomes roughly the same:
 
 ![](../img/search-branchless-prefetch.svg)
 
-The graph still grows faster as the branchy version also prefetches "grandchildren," "grand-grandchildren," and so on — although the usefulness of each new speculative read diminishes exponentially as the prediction is less and less likely to be correct.
+The graph still grows faster as the branchy version also prefetches "grandchildren," "great-grandchildren," and so on — although the usefulness of each new speculative read diminishes exponentially as the prediction is less and less likely to be correct.
 
 In the branchless version, we could also fetch ahead by more than one layer, but the number of fetches we'd need also grows exponentially. Instead, we will try a different approach to optimize memory operations.
 
@@ -359,9 +359,9 @@ This observation extends to the grand-children of node $k$ — they are also sto
 \end{aligned}
 -->
 
-Their cache line can also be fetched with one instruction. Interesting… what if we continue this, and instead of fetching direct children, we fetch ahead as many descendants as we can cramp into one cache line? That would be $\frac{64}{4} = 16$ elements, our grand-grand-grandchildren with indices from $16k$ to $(16k + 15)$.
+Their cache line can also be fetched with one instruction. Interesting… what if we continue this, and instead of fetching direct children, we fetch ahead as many descendants as we can cram into one cache line? That would be $\frac{64}{4} = 16$ elements, our great-great-grandchildren with indices from $16k$ to $(16k + 15)$.
 
-Now, if we prefetch just one of these 16 elements, we will probably only get some but not all of them, as they may cross a cache line boundary. We can prefetch the first *and* the last element, but to get away with just one memory request, we need to notice that the index of the first element, $16k$, is divisible by $16$, so its memory address will be the base address of the array plus something divisible by $16 \cdot 4 = 64$, the cache line size. If the array were to begin on a cache line, then these $16$ grand-gran-grandchildren elements will be guaranteed to be on a single cache line, which is just what we needed.
+Now, if we prefetch just one of these 16 elements, we will probably only get some but not all of them, as they may cross a cache line boundary. We can prefetch the first *and* the last element, but to get away with just one memory request, we need to notice that the index of the first element, $16k$, is divisible by $16$, so its memory address will be the base address of the array plus something divisible by $16 \cdot 4 = 64$, the cache line size. If the array were to begin on a cache line boundary, then these $16$ great-great-grandchildren elements would be guaranteed to be on a single cache line, which is just what we needed.
 
 Therefore, we only need to [align](/hpc/cpu-cache/alignment) the array:
 

From 01f16643633967f4b2ac68f889316263e5239b8e Mon Sep 17 00:00:00 2001
From: hectonit <48787141+hectonit@users.noreply.github.com>
Date: Sun, 15 May 2022 19:57:02 +0300
Subject: [PATCH 092/173] Update fenwick.md

Wrong variable naming
---
 content/russian/cs/range-queries/fenwick.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/russian/cs/range-queries/fenwick.md b/content/russian/cs/range-queries/fenwick.md
index f07a1ed4..9e37fc8d 100644
--- a/content/russian/cs/range-queries/fenwick.md
+++ b/content/russian/cs/range-queries/fenwick.md
@@ -84,7 +84,7 @@ int sum (int r1, int r2) {
     int res = 0;
     for (int i = r1; i > 0; i -= i & -i)
         for (int j = r2; j > 0; j -= j & -j)
-            ans += t[i][j];
+            res += t[i][j];
     return res;
 }
 ```

From 0339dbbd098c1cfd2943d443cbdc0b78ee6849f6 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Sun, 15 May 2022 22:39:35 +0300
Subject: [PATCH 093/173] elaborate on b-tree insert performance

---
 content/english/hpc/data-structures/b-tree.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/data-structures/b-tree.md b/content/english/hpc/data-structures/b-tree.md
index 122e1c8e..d69a814e 100644
--- a/content/english/hpc/data-structures/b-tree.md
+++ b/content/english/hpc/data-structures/b-tree.md
@@ -305,7 +305,7 @@ The relative speedup varies with the structure size — 7-18x/3-8x over STL and
 
 ![](../img/btree-relative.svg)
 
-Insertions are only 1.5-2 faster than for `absl::btree`, which uses scalar code to do everything. I don't know (yet) why insertions are *that* slow, but I guess it has something to do with data dependencies between queries.
+Insertions are only 1.5-2x faster than for `absl::btree`, which uses scalar code to do everything. My best guess for why insertions are *that* slow is data dependency: since the tree nodes may change, the CPU can't start processing the next query before the previous one finishes (the [true latency](../s-tree/#comparison-with-stdlower_bound) of both queries is roughly equal and ~3x the reciprocal throughput of `lower_bound`).
 
 ![](../img/btree-absl.svg)
 

From eefefe42b7db3cdb8dc5ab74cbfc864e08ad0dff Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 16 May 2022 08:55:12 +0300
Subject: [PATCH 094/173] elaborate on why ctz of negative diff works

---
 content/english/hpc/algorithms/gcd.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/algorithms/gcd.md b/content/english/hpc/algorithms/gcd.md
index b9e9007a..63efdec9 100644
--- a/content/english/hpc/algorithms/gcd.md
+++ b/content/english/hpc/algorithms/gcd.md
@@ -207,7 +207,7 @@ Let's draw the dependency graph of this loop:
 
 Modern processors can execute many instructions in parallel, essentially meaning that the true "cost" of this computation is roughly the sum of latencies on its critical path. In this case, it is the total latency of `diff`, `abs`, `ctz`, and `shift`.
 
-We can decrease this latency using the fact that we can actually calculate `ctz` using just `diff = a - b`, because a negative number divisible by $2^k$ still has $k$ zeros at the end. This lets us not wait for `max(diff, -diff)` to be computed first, resulting in a shorter graph like this:
+We can decrease this latency using the fact that we can actually calculate `ctz` using just `diff = a - b`, because a [negative number](../hpc/arithmetic/integer/#signed-integers) divisible by $2^k$ still has $k$ zeros at the end of its binary representation. This lets us not wait for `max(diff, -diff)` to be computed first, resulting in a shorter graph like this:
 
 <!--
 \node [draw, circle] (diff)  at (3, 10) {diff};

From 9ee7345602333c8cb4a2e8187a69124b8a9ca2f2 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 17 May 2022 07:11:15 +0300
Subject: [PATCH 095/173] divisibility check

---
 content/english/hpc/arithmetic/division.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/content/english/hpc/arithmetic/division.md b/content/english/hpc/arithmetic/division.md
index 41b35e30..e3f699db 100644
--- a/content/english/hpc/arithmetic/division.md
+++ b/content/english/hpc/arithmetic/division.md
@@ -199,7 +199,7 @@ This works perfectly because what we do here can be interpreted as just three ch
 ```c++
 uint32_t y;
 
-uint64_t m = uint64_t(-1) / y + 1; // ceil(2^64 / d)
+uint64_t m = uint64_t(-1) / y + 1; // ceil(2^64 / y)
 
 uint32_t mod(uint32_t x) {
     uint64_t lowbits = m * x;
@@ -211,6 +211,14 @@ uint32_t div(uint32_t x) {
 }
 ```
 
+We can also check divisibility of $x$ by $y$ with just one multiplication using the fact that the remainder of the division is zero if and only if the fractional part (the lower 64 bits of $m \cdot x$) is less than $m$ (otherwise, it would become a nonzero number when multiplied back by $y$ and right-shifted by 64).
+
+```c++
+bool is_divisible(uint32_t x) {
+    return m * x < m;
+}
+```
+
 The only downside of this method is that it needs integer types four times the original size to perform the multiplication, while other reduction methods can work with just the double.
 
 There is also a way to compute 64x64 modulo by carefully manipulating the halves of intermediate results; the implementation is left as an exercise to the reader.

From 283025ae28cc2c449b6696f95d0453c3b91d61f5 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 17 May 2022 10:32:07 +0300
Subject: [PATCH 096/173] montgomery multiplication intro

---
 .../english/hpc/number-theory/montgomery.md   | 168 ++++++++++++++----
 1 file changed, 134 insertions(+), 34 deletions(-)

diff --git a/content/english/hpc/number-theory/montgomery.md b/content/english/hpc/number-theory/montgomery.md
index 0ede37e5..c5d47d9d 100644
--- a/content/english/hpc/number-theory/montgomery.md
+++ b/content/english/hpc/number-theory/montgomery.md
@@ -4,53 +4,100 @@ weight: 4
 draft: true
 ---
 
-When we talked about [integers](../integer) in general, we discussed how to perform division and modulo by multiplication, and, unsurprisingly, in modular arithmetic 90% of its time is spent calculating modulo. Apart from using the general tricks described in the previous article, there is another method specifically for modular arithmetic, called *Montgomery multiplication*.
+Unsurprisingly, a large fraction of the computation in [modular arithmetic](../modular) is often spent on calculating the modulo operation, which is as slow as general integer division and typically takes 15-20 cycles, depending on the operand size.
 
-As all other fast reduction methods, it doesn't come for free. It works only in *Montgomery space*, so we need to transform our numbers in and out of it before doing the multiplications. This means that on top of doing some compile-time computations, we would also need to do some operations before the multiplication.
+The best way to deal with this nuisance is to avoid the modulo operation altogether, delaying or replacing it with [predication](/hpc/pipelining/branchless), which can be done when calculating sums, for example:
 
-For the space we need a positive integer $r \ge n$ coprime to $n$. In practice we always choose $r$ to be $2^m$ (with $m$ usually being equal 32 or 64), since multiplications, divisions and modulo $r$ operations can then be efficiently implemented using shifts and bitwise operations. Therefore $n$ needs to be an odd number so that every power of $2$ will be coprime to $n$. And if it is not, we can make it odd (?).
+```cpp
+const int M = 1e9 + 7;
 
-The representative $\bar x$ of a number $x$ in the Montgomery space is defined as
+// input: array of n integers in the [0, M) range
+// output: sum modulo M
+int slow_sum(int *a, int n) {
+    int s = 0;
+    for (int i = 0; i < n; i++)
+        s = (s + a[i]) % M;
+    return s;
+}
+
+int fast_sum(int *a, int n) {
+    int s = 0;
+    for (int i = 0; i < n; i++) {
+        s += a[i]; // s < 2 * M
+        s = (s >= M ? s - M : s); // will be replaced with cmov
+    }
+    return s;
+}
+
+int faster_sum(int *a, int n) {
+    long long s = 0; // 64-bit integer to handle overflow
+    for (int i = 0; i < n; i++)
+        s += a[i]; // will be vectorized
+    return s % M;
+}
+```
+
+However, sometimes you only have a chain of modular multiplications, and there is no good way to get around computing the remainder of the division, other than with the [integer division tricks](../hpc/arithmetic/division/), which require a constant modulo and some precomputation.
+
+But there is another technique designed specifically for modular arithmetic, called *Montgomery multiplication*.
+
+### Montgomery Space
+
+Montgomery multiplication works by first transforming the multipliers into *Montgomery space*, where modular multiplication can be performed cheaply, and then transforming them back when their actual values are needed. Unlike general integer division methods, Montgomery multiplication is not efficient for performing just one modular reduction and only becomes worthwhile when there is a chain of modular operations.
+
+The space is defined by the modulo $n$ and a positive integer $r \ge n$ coprime to $n$. The algorithm involves division and modulo by $r$, so in practice, $r$ is chosen to be $2^m$ with $m$ equal to 32 or 64, so that these operations can be done with a right-shift and a bitwise AND respectively.
+
+<!-- Therefore $n$ needs to be an odd number so that every power of $2$ will be coprime to $n$. And if it is not, we can make it odd (?). -->
+
+**Definition.** The *representative* $\bar x$ of a number $x$ in the Montgomery space is defined as
 
 $$
 \bar{x} = x \cdot r \bmod n
 $$
 
-Note that the transformation is actually such a multiplication that we want to optimize, so it is still an expensive operation. However, we will only need to transform a number into the space once, perform as many operations as we want efficiently in that space and at the end transform the final result back, which should be profitable if we are doing lots of operations modulo $n$.
+Computing this transformation involves a multiplication and a modulo (an expensive operation that we wanted to optimize away in the first place), which is why we don't use this method for general modular multiplication and only use it for long sequences of operations where transforming numbers to and from the Montgomery space is worth it.
+
+<!-- Note that the transformation is actually such a multiplication that we want to optimize, so it is still an expensive operation. However, we will only need to transform a number into the space once, perform as many operations as we want efficiently in that space and at the end transform the final result back, which should be profitable if we are doing lots of operations modulo $n$. -->
 
-Inside the Montgomery space addition, substraction and checking for equality is performed as usual ($x \cdot r + y \cdot r \equiv (x + y) \cdot r \bmod n$). However, this is not the case for multiplication. Denoting multiplication in Montgomery space as $*$ and normal multiplication as $\cdot$, we expect the result to be:
+Inside the Montgomery space, addition, subtraction, and checking for equality are performed as usual:
+
+$$
+x \cdot r + y \cdot r \equiv (x + y) \cdot r \bmod n
+$$
+
+However, this is not the case for multiplication. Denoting multiplication in the Montgomery space as $*$ and the "normal" multiplication as $\cdot$, we expect the result to be:
 
 $$
 \bar{x} * \bar{y} = \overline{x \cdot y} = (x \cdot y) \cdot r \bmod n
 $$
 
-But the normal multiplication will give us:
+But the normal multiplication in the Montgomery space yields:
 
 $$
 \bar{x} \cdot \bar{y} = (x \cdot y) \cdot r \cdot r \bmod n
 $$
 
-Therefore the multiplication in the Montgomery space is defined as
+Therefore, the multiplication in the Montgomery space is defined as
 
 $$
 \bar{x} * \bar{y} = \bar{x} \cdot \bar{y} \cdot r^{-1} \bmod n
 $$
 
-This means that whenever we multiply two numbers, after the multiplication we need to *reduce* them. Therefore, we need to have an efficient way of calculating $x \cdot r^{-1} \bmod n$.
+This means that, after we normally multiply two numbers in the Montgomery space, we need to *reduce* the result by multiplying it by $r^{-1}$ and taking the modulo, and there is an efficient way to do this particular operation.
 
 ### Montgomery reduction
 
-Assume that $r=2^{64}$, the modulo $n$ is 64-bit and the number $x$ we need to reduce (multiply by $r^{-1}$) is 128-bit (the product of two 64-bit numbers).
+Assume that $r=2^{32}$, the modulo $n$ is 32-bit, and the number $x$ we need to reduce (multiply by $r^{-1}$ and take modulo $n$) is the 64-bit product of two 32-bit numbers.
 
-Because $\gcd(n, r) = 1$, we know that there are two numbers $r^{-1}$ and $n'$ in the $[0, n)$ range such that
+By definition, $\gcd(n, r) = 1$, so we know that there are two numbers $r^{-1}$ and $n'$ in the $[0, n)$ range such that
 
 $$
 r \cdot r^{-1} + n \cdot n' = 1
 $$
 
-and both $r^{-1}$ and $n'$ can be computed using the extended Euclidean algorithm.
+and both $r^{-1}$ and $n'$ can be computed using the [extended Euclidean algorithm](../euclid-extended).
 
-Using this identity we can express $r \cdot r^{-1}$ as $(-n \cdot n' + 1)$ and write $x \cdot r^{-1}$ as
+Using this identity, we can express $r \cdot r^{-1}$ as $(-n \cdot n' + 1)$ and write $x \cdot r^{-1}$ as
 
 $$
 \begin{aligned}
@@ -75,7 +122,13 @@ def reduce(x):
     return a
 ```
 
-Since $x < n \cdot n < r \cdot n$ (as $x$ is a product of multiplicatio) and $q \cdot n < r \cdot n$, we know that $-n < (x - q \cdot n) / r < n$. Therefore the final modulo operation can be implemented using a single bound check and addition.
+Since $x < n \cdot n < r \cdot n$ and $q \cdot n < r \cdot n$, we know that
+
+$$
+-n < (x - q \cdot n) / r < n
+$$
+
+Therefore, the final modulo operation can be implemented using a single bound check and addition.
 
 Here is an equivalent C implementation for 64-bit integers:
 
@@ -138,39 +191,86 @@ Transforming a number into the space is just a multiplication inside the space o
 ### Complete Implementation
 
 ```c++
+typedef __uint32_t u32;
+typedef __uint64_t u64;
+
 struct montgomery {
-    u64 n, nr;
+    u32 n, nr;
     
-    montgomery(u64 n) : n(n) {
-        nr = 1;
+    constexpr montgomery(u32 n) : n(n), nr(1) {
         for (int i = 0; i < 6; i++)
             nr *= 2 - n * nr;
     }
 
-    u64 reduce(u128 x) {
-        u64 q = u64(x) * nr;
-        u64 m = ((u128) q * n) >> 64;
-        u64 xhi = (x >> 64);
-        //cout << u64(x>>64) << " " << u64(x) << " " << q << endl;
-        //cout << u64(m>>64) << " " << u64(m) << endl;
-        //exit(0);
-        if (xhi >= m)
-            return (xhi - m);
-        else
-            return (xhi - m) + n;
+    u32 reduce(u64 x) const {
+        u32 q = u32(x) * nr;
+        u32 m = ((u64) q * n) >> 32;
+        u32 xhi = (x >> 32);
+        return xhi + n - m;
+        
+        // if you need the result in the [0, n) range:
+        // u32 t = xhi - m;
+        // return xhi >= m ? t : t + n;
     }
 
-    u64 mult(u64 x, u64 y) {
-        return reduce((u128) x * y);
+    u32 multiply(u32 x, u32 y) const {
+        return reduce((u64) x * y);
     }
 
-    u64 transform(u64 x) {
-        return (u128(x) << 64) % n;
+    u32 transform(u32 x) const {
+        return (u64(x) << 32) % n;
     }
 };
 ```
 
 ```c++
 montgomery m(n);
-m.transform(x);
-```
\ No newline at end of file
+
+a = m.transform(a);
+b = m.transform(b);
+c = m.multiply(a, b);
+c = m.reduce(c);
+```
+
+```c++
+int inverse(int _a) {
+    u32 a = space.transform(_a);
+    u32 r = space.transform(1);
+    
+    int n = M - 2;
+    while (n) {
+        if (n & 1)
+            r = space.multiply(r, a);
+        a = space.multiply(a, a);
+        n >>= 1;
+    }
+    
+    return space.reduce(r);
+}
+```
+
+SIMD
+
+166.79 ns
+
+207.04 ns
+
+```c++
+constexpr montgomery space(M);
+
+int inverse(int _a) {
+    u32 a = space.transform(_a);
+    u32 r = space.transform(1);
+    
+    #pragma GCC unroll(30)
+    for (int l = 0; l < 30; l++) {
+        if ( (M - 2) >> l & 1 )
+            r = space.multiply(r, a);
+        a = space.multiply(a, a);
+    }
+
+    return space.reduce(r);
+}
+```
+
+**Exercise.** Implement efficient *modular* [matrix multiplication](/hpc/algorithms/matmul).

From acfb5c857b2bf915adf6f28c17bc2d2ba5adef91 Mon Sep 17 00:00:00 2001
From: Project Nayuki <me@nayuki.io>
Date: Wed, 18 May 2022 05:40:26 +0000
Subject: [PATCH 097/173] Improved spelling and word choice.

---
 content/english/hpc/_index.md                  |  4 ++--
 content/english/hpc/architecture/assembly.md   |  6 +++---
 content/english/hpc/architecture/functions.md  |  8 ++++----
 content/english/hpc/architecture/isa.md        |  2 +-
 content/english/hpc/architecture/layout.md     |  6 +++---
 content/english/hpc/architecture/loops.md      |  4 ++--
 content/english/hpc/arithmetic/division.md     |  2 +-
 content/english/hpc/arithmetic/float.md        |  2 +-
 content/english/hpc/compilation/_index.md      |  2 +-
 content/english/hpc/complexity/_index.md       |  2 +-
 content/english/hpc/complexity/hardware.md     | 10 +++++-----
 content/english/hpc/complexity/languages.md    |  4 ++--
 content/english/hpc/external-memory/sorting.md |  2 +-
 content/english/hpc/pipelining/_index.md       |  8 ++++----
 content/english/hpc/pipelining/branchless.md   |  4 ++--
 content/english/hpc/pipelining/hazards.md      |  4 ++--
 content/english/hpc/pipelining/tables.md       |  4 ++--
 content/english/hpc/pipelining/throughput.md   |  4 ++--
 18 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md
index 92d0cd91..942c9f6a 100644
--- a/content/english/hpc/_index.md
+++ b/content/english/hpc/_index.md
@@ -33,7 +33,7 @@ A "release" for an open-source book like this essentially means:
 - mostly freezing the table of contents (except for the case studies),
 - doing one final round of heavy copyediting (hopefully, with the help of a professional editor — I still haven’t figured out how commas work in English),
 - drawing illustrations (I stole a lot of those that are currently displayed),
-- making a print-optimized pdf and figuring out the best way to distribute it.
+- making a print-optimized PDF and figuring out the best way to distribute it.
 
 After that, I will mostly be fixing errors and only doing some minor edits reflecting the changes in technology or new algorithm advancements. The e-book/printed editions will most likely be sold on a "pay what you want" basis, and in any case, the web version will always be fully available online.
 
@@ -51,7 +51,7 @@ However, as the book is still evolving, it is probably not the best idea to star
 
 There are two highly impactful textbooks on which most computer science courses are built. Both are undoubtedly outstanding, but [one of them](https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming) is 50 years old, and [the other](https://en.wikipedia.org/wiki/Introduction_to_Algorithms) is 30 years old, and [computers have changed a lot](/hpc/complexity/hardware) since then. Asymptotic complexity is not the sole deciding factor anymore. In modern practical algorithm design, you choose the approach that makes better use of different types of parallelism available in the hardware over the one that theoretically does fewer raw operations on galaxy-scale inputs.
 
-And yet, the computer science curricula in most colleges completely ignore this shift. Although there are some great courses that aim to correct that — such as "[Performance Engineering of Software Systems](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/)" from MIT, "[Programming Parallel Computers](https://ppc.cs.aalto.fi/)" from Aalto University, and some non-academic ones like Denis Bakhvalov's "[Performance Ninja](https://github.com/dendibakh/perf-ninja)" — most computer science graduates still treat the hardware like something from the 90s.
+And yet, the computer science curricula in most colleges completely ignore this shift. Although there are some great courses that aim to correct that — such as "[Performance Engineering of Software Systems](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/)" from MIT, "[Programming Parallel Computers](https://ppc.cs.aalto.fi/)" from Aalto University, and some non-academic ones like Denis Bakhvalov's "[Performance Ninja](https://github.com/dendibakh/perf-ninja)" — most computer science graduates still treat the hardware like something from the 1990s.
 
 What I really want to achieve is that performance engineering becomes taught right after introduction to algorithms. Writing the first comprehensive textbook on the subject is a large part of it, and this is why I rush to finish it by the summer so that the colleges can pick it up in the next academic year. But creating a new course requires more than that: you need a balanced curriculum, course infrastructure, lecture slides, lab assignments… so for some time after finishing the main book, I will be working on course materials and tools for *teaching* performance engineering — and I'm looking forward to collaborating with other people who want to make it a reality as well.
 
diff --git a/content/english/hpc/architecture/assembly.md b/content/english/hpc/architecture/assembly.md
index 5c981547..00c7caac 100644
--- a/content/english/hpc/architecture/assembly.md
+++ b/content/english/hpc/architecture/assembly.md
@@ -19,7 +19,7 @@ Jumping right into it, here is how you add two numbers (`*c = *a + *b`) in Arm a
 ldr w0, [x0]    ; load 4 bytes from wherever x0 points into w0
 ldr w1, [x1]    ; load 4 bytes from wherever x1 points into w1
 add w0, w0, w1  ; add w0 with w1 and save the result to w0
-str w0, [x2]    ; write contents of w0 to wherever x2 points/
+str w0, [x2]    ; write contents of w0 to wherever x2 points
 ```
 
 Here is the same operation in x86 assembly:
@@ -33,7 +33,7 @@ mov DWORD PTR [rdx], eax  ; write contents of eax to wherever rdx points
 
 Assembly is very simple in the sense that it doesn't have many syntactical constructions compared to high-level programming languages. From what you can observe from the examples above:
 
-- A program is a sequence of instructions, each written as its name followed by a variable amount of operands.
+- A program is a sequence of instructions, each written as its name followed by a variable number of operands.
 - The `[reg]` syntax is used for "dereferencing" a pointer stored in a register, and on x86 you need to prefix it with size information (`DWORD` here means 32 bit).
 - The `;` sign is used for line comments, similar to `#` and `//` in other languages.
 
@@ -55,7 +55,7 @@ Most instructions write their result into the first operand, which can also be i
 
 **Registers** are named `rax`, `rbx`, `rcx`, `rdx`, `rdi`, `rsi`, `rbp`, `rsp`, and `r8`-`r15` for a total of 16 of them. The "letter" ones are named like that for historical reasons: `rax` is "accumulator," `rcx` is "counter," `rdx` is "data" and so on — but, of course, they don't have to be used only for that.
 
-There are also 32-, 16-bit and 8-bit registers that have similar names (`rax` → `eax` → `ax` → `al`). They are not fully separate but *aliased*: the first 32 bits of `rax` are `eax`, the first 16 bits of `eax` are `ax`, and so on. This is made to save die space while maintaining compatibility, and it is also the reason why basic type casts in compiled programming languages are usually free. 
+There are also 32-, 16-, and 8-bit registers that have similar names (`rax` → `eax` → `ax` → `al`). They are not fully separate but *aliased*: the lowest 32 bits of `rax` are `eax`, the lowest 16 bits of `eax` are `ax`, and so on. This is done to save die space while maintaining compatibility, and it is also the reason why basic type casts in compiled programming languages are usually free. 
 
 These are just the *general-purpose* registers that you can, with [some exceptions](../functions), use however you like in most instructions. There is also a separate set of registers for [floating-point arithmetic](/hpc/arithmetic/float), a bunch of very wide registers used in [vector extensions](/hpc/simd), and a few special ones that are needed for [control flow](../loops), but we'll get there in time.
 
diff --git a/content/english/hpc/architecture/functions.md b/content/english/hpc/architecture/functions.md
index 02614f94..412fc027 100644
--- a/content/english/hpc/architecture/functions.md
+++ b/content/english/hpc/architecture/functions.md
@@ -18,7 +18,7 @@ The hardware stack works the same way software stacks do and is similarly implem
 - The *base pointer* marks the start of the stack and is conventionally stored in `rbp`.
 - The *stack pointer* marks the last element of the stack and is conventionally stored in `rsp`.
 
-When you need to call a function, you push all your local variables onto the stack (which you can also do in other circumstances, e. g. when you run out of registers), push the current instruction pointer, and then jump to the beginning of the function. When exiting from a function, you look at the pointer stored on top of the stack, jump there, and then carefully read all the variables stored on the stack back into their registers.
+When you need to call a function, you push all your local variables onto the stack (which you can also do in other circumstances, e.g. when you run out of registers), push the current instruction pointer, and then jump to the beginning of the function. When exiting from a function, you look at the pointer stored on top of the stack, jump there, and then carefully read all the variables stored on the stack back into their registers.
 
 <!--
 
@@ -94,7 +94,7 @@ Note that the data in the stack is written top-to-bottom. This is just a convent
 
 ### Calling Conventions
 
-The people who develop compilers and operating systems eventually came up with [conventions](https://wiki.osdev.org/Calling_Conventions) on how to write and call functions. These conventions enable some important [software engineering marvels](/hpc/compilation/stages/) such as splitting compilation into separate units, re-using already compiled libraries, and even writing them in different programming languages.
+The people who develop compilers and operating systems eventually came up with [conventions](https://wiki.osdev.org/Calling_Conventions) on how to write and call functions. These conventions enable some important [software engineering marvels](/hpc/compilation/stages/) such as splitting compilation into separate units, reusing already-compiled libraries, and even writing them in different programming languages.
 
 Consider the following example in C:
 
@@ -142,7 +142,7 @@ length:
 ```
 -->
 
-By convention, a function should take its arguments in `rdi`, `rsi`, `rdx`, `rcx`, `r8`, `r9` (and the rest in the stack if that wasn't enough), put the return value into `rax`, and then return. Thus, `square`, being a simple one-argument function, can be implemented like this:
+By convention, a function should take its arguments in `rdi`, `rsi`, `rdx`, `rcx`, `r8`, `r9` (and the rest in the stack if those weren't enough), put the return value into `rax`, and then return. Thus, `square`, being a simple one-argument function, can be implemented like this:
 
 ```nasm
 square:             ; x = edi, ret = eax
@@ -190,7 +190,7 @@ distance:
     ret
 ```
 
-This is better, but we are still implicitly accessing stack memory: you need to push and pop the instruction pointer on each function call. In simple cases like this, we can *inline* function calls by stitching callee's code into the caller and resolving conflicts over registers. In our example:
+This is better, but we are still implicitly accessing stack memory: you need to push and pop the instruction pointer on each function call. In simple cases like this, we can *inline* function calls by stitching the callee's code into the caller and resolving conflicts over registers. In our example:
 
 ```nasm
 distance:
diff --git a/content/english/hpc/architecture/isa.md b/content/english/hpc/architecture/isa.md
index a1a4e66c..4862efb3 100644
--- a/content/english/hpc/architecture/isa.md
+++ b/content/english/hpc/architecture/isa.md
@@ -14,7 +14,7 @@ Abstractions help us in reducing all this complexity down to a single *interface
 
 Hardware engineers love abstractions too. An abstraction of a CPU is called an *instruction set architecture* (ISA), and it defines how a computer should work from a programmer's perspective. Similar to software interfaces, it gives computer engineers the ability to improve on existing CPU designs while also giving its users — us, programmers — the confidence that things that worked before won't break on newer chips.
 
-An ISA essentially defines how the hardware should interpret the machine language. Apart from instructions and their binary encodings, ISA importantly defines counts, sizes, and purposes of registers, the memory model, and the input/output model. Similar to software interfaces, ISAs can be extended too: in fact, they are often updated, mostly in a backward-compatible way, to add new and more specialized instructions that can improve performance.
+An ISA essentially defines how the hardware should interpret the machine language. Apart from instructions and their binary encodings, an ISA importantly defines the counts, sizes, and purposes of registers, the memory model, and the input/output model. Similar to software interfaces, ISAs can be extended too: in fact, they are often updated, mostly in a backward-compatible way, to add new and more specialized instructions that can improve performance.
 
 ### RISC vs CISC
 
diff --git a/content/english/hpc/architecture/layout.md b/content/english/hpc/architecture/layout.md
index 11735951..9ddebfd5 100644
--- a/content/english/hpc/architecture/layout.md
+++ b/content/english/hpc/architecture/layout.md
@@ -16,7 +16,7 @@ During the **fetch** stage, the CPU simply loads a fixed-size chunk of bytes fro
 
 <!-- todo: what happens when an instruction crosses the boundary? -->
 
-Next comes the **decode** stage: the CPU looks at this chunk of bytes, discards everything that comes before the instruction pointer, and splits the rest of them into instructions. Machine instructions are encoded using a variable amount of bytes: something simple and very common like `inc rax` takes one byte, while some obscure instruction with encoded constants and behavior-modifying prefixes may take up to 15. So, from a 32-byte block, a variable number of instructions may be decoded, but no more than a certain machine-dependant limit called the *decode width*. On my CPU (a [Zen 2](https://en.wikichip.org/wiki/amd/microarchitectures/zen_2)), the decode width is 4, which means that on each cycle, up to 4 instructions can be decoded and passed to the next stage.
+Next comes the **decode** stage: the CPU looks at this chunk of bytes, discards everything that comes before the instruction pointer, and splits the rest of them into instructions. Machine instructions are encoded using a variable number of bytes: something simple and very common like `inc rax` takes one byte, while some obscure instruction with encoded constants and behavior-modifying prefixes may take up to 15. So, from a 32-byte block, a variable number of instructions may be decoded, but no more than a certain machine-dependent limit called the *decode width*. On my CPU (a [Zen 2](https://en.wikichip.org/wiki/amd/microarchitectures/zen_2)), the decode width is 4, which means that on each cycle, up to 4 instructions can be decoded and passed to the next stage.
 
 The stages work in a pipelined fashion: if the CPU can tell (or [predict](/hpc/pipelining/branching/)) which instruction block it needs next, then the fetch stage doesn't wait for the last instruction in the current block to be decoded and loads the next one right away.
 
@@ -49,12 +49,12 @@ The instructions are stored and fetched using largely the same [memory system](/
 The instruction cache is crucial in situations when you either
 
 - don't know what instructions you are going to execute next, and need to fetch the next block with [low latency](/hpc/cpu-cache/latency),
-- or executing a long sequence of verbose-but-quick-to-process instructions, and need [high bandwidth](/hpc/cpu-cache/bandwidth).
+- or are executing a long sequence of verbose-but-quick-to-process instructions, and need [high bandwidth](/hpc/cpu-cache/bandwidth).
 
 The memory system can therefore become the bottleneck for programs with large machine code. This consideration limits the applicability of the optimization techniques we've previously discussed:
 
 - [Inlining functions](../functions) is not always optimal, because it reduces code sharing and increases the binary size, requiring more instruction cache.
-- [Unrolling loops](../loops) is only beneficial up to some extent, even if the number of loops is known during compile-time: at some point, the CPU would have to fetch both instructions and data from the main memory, in which case it will likely be bottlenecked by the memory bandwidth.
+- [Unrolling loops](../loops) is only beneficial up to some extent, even if the number of iterations is known at compile time: at some point, the CPU would have to fetch both instructions and data from the main memory, in which case it will likely be bottlenecked by the memory bandwidth.
 - Huge [code alignments](#code-alignment) increase the binary size, again requiring more instruction cache. Spending one more cycle on fetch is a minor penalty compared to missing the cache and waiting for the instructions to be fetched from the main memory.
 
 Another aspect is that placing frequently used instruction sequences on the same [cache lines](/hpc/cpu-cache/cache-lines) and [memory pages](/hpc/cpu-cache/paging) improves [cache locality](/hpc/external-memory/locality). To improve instruction cache utilization, you should  group hot code with hot code and cold code with cold code, and remove dead (unused) code if possible. If you want to explore this idea further, check out Facebook's [Binary Optimization and Layout Tool](https://engineering.fb.com/2018/06/19/data-infrastructure/accelerate-large-scale-applications-with-bolt/), which was recently [merged](https://github.com/llvm/llvm-project/commit/4c106cfdf7cf7eec861ad3983a3dd9a9e8f3a8ae) into LLVM.
diff --git a/content/english/hpc/architecture/loops.md b/content/english/hpc/architecture/loops.md
index b441ae67..9dc1faba 100644
--- a/content/english/hpc/architecture/loops.md
+++ b/content/english/hpc/architecture/loops.md
@@ -23,11 +23,11 @@ Assembly doesn't have if-s, for-s, functions, or other control flow structures t
 
 **Jump** moves the instruction pointer to a location specified by its operand. This location may be either an absolute address in memory, relative to the current address or even [computed during runtime](../indirect). To avoid the headache of managing these addresses directly, you can mark any instruction with a string followed by `:`, and then use this string as a label which gets replaced by the relative address of this instruction when converted to machine code.
 
-Labels can be any strings, but compilers don't get creative and [typically](https://godbolt.org/z/T45x8GKa5) just use the line numbers in the source code and function names with their signatures when picking names for labels.
+Labels can be any string, but compilers don't get creative and [typically](https://godbolt.org/z/T45x8GKa5) just use the line numbers in the source code and function names with their signatures when picking names for labels.
 
 **Unconditional** jump `jmp` can only be used to implement `while (true)` kind of loops or stitch parts of a program together. A family of **conditional** jumps is used to implement actual control flow.
 
-It is reasonable to think that these conditions are computed as `bool`-s somewhere and passed to conditional jumps as operands: after all, this is how it works in programming languages. But that is not how it is implemented in hardware. Conditional operations use a special `FLAGS` register, which first needs to be populated by executing instructions that perform some kind of checks.
+It is reasonable to think that these conditions are computed as `bool`-s somewhere and passed to conditional jumps as operands: after all, this is how it works in programming languages. But that is not how it is implemented in hardware. Conditional operations use a special `FLAGS` register, which first needs to be populated by executing instructions that perform some kind of check.
 
 In our example, `cmp rax, rcx` compares the iterator `rax` with the end-of-array pointer `rcx`. This updates the FLAGS register, and now it can be used by `jne loop`, which looks up a certain bit there that tells whether the two values are equal or not, and then either jumps back to the beginning or continues to the next instruction, thus breaking the loop.
 
diff --git a/content/english/hpc/arithmetic/division.md b/content/english/hpc/arithmetic/division.md
index e3f699db..ad1cf525 100644
--- a/content/english/hpc/arithmetic/division.md
+++ b/content/english/hpc/arithmetic/division.md
@@ -45,7 +45,7 @@ You can also divide 128-bit integer (stored in `rdx:rax`) by a 64-bit integer:
 ```nasm
 div(u128, u64):
     ; a = rdi + rsi, b = rdx
-    mov  rcx, rdx ;
+    mov  rcx, rdx
     mov  rax, rdi
     mov  rdx, rsi
     div  edx 
diff --git a/content/english/hpc/arithmetic/float.md b/content/english/hpc/arithmetic/float.md
index cda42944..70217a91 100644
--- a/content/english/hpc/arithmetic/float.md
+++ b/content/english/hpc/arithmetic/float.md
@@ -139,7 +139,7 @@ $$
 \{ \pm \; (1 + m) \cdot 2^e \; | \; m = \frac{x}{2^{32}}, \; x \in [0, 2^{32}) \}
 $$
 
-Since $m$ is now a nonnegative value, we will now make it unsigned integer, and instead add a separate boolean field for the sign of the number:
+Since $m$ is now a nonnegative value, we will now make it an unsigned integer, and instead add a separate Boolean field for the sign of the number:
 
 ```cpp
 struct fp {
diff --git a/content/english/hpc/compilation/_index.md b/content/english/hpc/compilation/_index.md
index cbc0f691..07b0e07f 100644
--- a/content/english/hpc/compilation/_index.md
+++ b/content/english/hpc/compilation/_index.md
@@ -8,4 +8,4 @@ The main benefit of [learning assembly language](../architecture/assembly) is no
 
 There are rare cases where we *really* need to switch to handwritten assembly for maximal performance, but most of the time compilers are capable of producing near-optimal code all by themselves. When they do not, it is usually because the programmer knows more about the problem than what can be inferred from the source code, but failed to communicate this extra information to the compiler.
 
-In this chapter, we will discuss the intricacies of getting compiler to do exactly what we want and gathering useful information that can guide further optimizations.
+In this chapter, we will discuss the intricacies of getting the compiler to do exactly what we want and gathering useful information that can guide further optimizations.
diff --git a/content/english/hpc/complexity/_index.md b/content/english/hpc/complexity/_index.md
index 69cebf4c..c537c4ce 100644
--- a/content/english/hpc/complexity/_index.md
+++ b/content/english/hpc/complexity/_index.md
@@ -11,7 +11,7 @@ Complexity is an old concept. It was [systematically formulated](http://www.cs.a
 
 ### Classical Complexity Theory
 
-The "elementary operations" of a CPU are called *instructions*, and their "costs" are called *latencies*. Instructions are stored in *memory* and executed one by one by the processor, which has some internal *state* stored in a number of *registers*. One of these registers is the *instruction pointer* that indicates the address of the next instruction to read and execute. Each instruction changes the state of the processor in a certain way (including moving the instruction pointer), possibly modifies the main memory, and takes a different amount of *CPU cycles* to complete before the next one can be started.
+The "elementary operations" of a CPU are called *instructions*, and their "costs" are called *latencies*. Instructions are stored in *memory* and executed one by one by the processor, which has some internal *state* stored in a number of *registers*. One of these registers is the *instruction pointer* that indicates the address of the next instruction to read and execute. Each instruction changes the state of the processor in a certain way (including moving the instruction pointer), possibly modifies the main memory, and takes a different number of *CPU cycles* to complete before the next one can be started.
 
 To estimate the real running time of a program, you need to sum all latencies for its executed instructions and divide it by the *clock frequency*, that is, the number of cycles a particular CPU does per second. 
 
diff --git a/content/english/hpc/complexity/hardware.md b/content/english/hpc/complexity/hardware.md
index 1d59d101..d1c950b6 100644
--- a/content/english/hpc/complexity/hardware.md
+++ b/content/english/hpc/complexity/hardware.md
@@ -4,9 +4,9 @@ weight: 1
 ignoreIndexing: true
 ---
 
-The main disadvantage of the supercomputers of the 1960s wasn't that they were slow — relatively speaking, they weren't — but that they were giant, complex to use, and so expensive that only the governments of the world superpowers could afford them. Their size was the reason they were so expensive: they required a lot of custom components that had to be very carefully assembled in the macro-world, by people holding advanced degrees in electrical engineering, in a process that couldn't be up-scaled for mass production.
+The main disadvantage of the supercomputers of the 1960s wasn't that they were slow — relatively speaking, they weren't — but that they were giant, complex to use, and so expensive that only the governments of the world superpowers could afford them. Their size was the reason they were so expensive: they required a lot of custom components that had to be very carefully assembled in the macro-world, by people holding advanced degrees in electrical engineering, in a process that couldn't be scaled up for mass production.
 
-The turning point was the development of *microchips* — single, tiny, complete circuits — which revolutionized the industry and turned out to be probably the most important invention of the 20th century. What was a multimillion-dollar cupboard of computing machinery in 1965 could in 1975 fit on a [4×4 mm slice of silicon](https://en.wikipedia.org/wiki/MOS_Technology_6502)[^size] that you can buy for $25. This dramatic improvement in affordability started the home computer revolution during the following decade, with computers like Apple II, Atari 2600, Commodore 64, and IBM PC becoming available to the masses.
+The turning point was the development of *microchips* — single, tiny, complete circuits — which revolutionized the industry and turned out to be probably the most important invention of the 20th century. What was a multimillion-dollar cupboard of computing machinery in 1965 could in 1975 fit on a [4mm × 4mm slice of silicon](https://en.wikipedia.org/wiki/MOS_Technology_6502)[^size] that you can buy for $25. This dramatic improvement in affordability started the home computer revolution during the following decade, with computers like Apple II, Atari 2600, Commodore 64, and IBM PC becoming available to the masses.
 
 [^size]: Actual sizes of CPUs are about centimeter-scale because of power management, heat dissipation, and the need to plug it into the motherboard without excessive swearing.
 
@@ -17,7 +17,7 @@ Microchips are "printed" on a slice of crystalline silicon using a process calle
 1. growing and slicing a [very pure silicon crystal](https://en.wikipedia.org/wiki/Wafer_(electronics)),
 2. covering it with a layer of [a substance that dissolves when photons hit it](https://en.wikipedia.org/wiki/Photoresist),
 3. hitting it with photons in a set pattern,
-4. chemically [etching](https://en.wikipedia.org/wiki/Etching_(microfabrication)) the now exposed parts,
+4. chemically [etching](https://en.wikipedia.org/wiki/Etching_(microfabrication)) the now-exposed parts,
 5. removing the remaining photoresist,
 
 …and then performing another 40-50 steps over several months to complete the rest of the CPU.
@@ -56,11 +56,11 @@ Throughout most of the computing history, optical shrinking was the main driving
 
 Both Dennard scaling and Moore's law are not actual laws of physics, but just observations made by savvy engineers. They are both destined to stop at some point due to fundamental physical limitations, the ultimate one being the size of silicon atoms. In fact, Dennard scaling already did — due to power issues.
 
-Thermodynamically, a computer is just a very efficient device for converting electrical power into heat. This heat eventually needs to be removed, and there are physical limits to how much power you can dissipate from a millimeter-scale crystal. Computer engineers, aiming to maximize performance, essentially just choose the maximum possible clock rate so that the overall power consumption stays the same. If transistors become smaller, they have less capacity, meaning less required voltage to flip them, which in turn allows increasing the clock rate.
+Thermodynamically, a computer is just a very efficient device for converting electrical power into heat. This heat eventually needs to be removed, and there are physical limits to how much power you can dissipate from a millimeter-scale crystal. Computer engineers, aiming to maximize performance, essentially just choose the maximum possible clock rate so that the overall power consumption stays the same. If transistors become smaller, they have less capacitance, meaning less required voltage to flip them, which in turn allows increasing the clock rate.
 
 Around 2005–2007, this strategy stopped working because of *leakage* effects: the circuit features became so small that their magnetic fields started to make the electrons in the neighboring circuitry move in directions they are not supposed to, causing unnecessary heating and occasional bit flipping.
 
-The only way to mitigate this is to increase voltage; and to balance off power consumption you need to reduce clock frequency, which in turn makes the whole process progressively less profitable as transistor density increases. At some point, clock rates could no longer be increased by scaling, and the miniaturization trend started to slow down.
+The only way to mitigate this is to increase the voltage; and to balance out power consumption you need to reduce clock frequency, which in turn makes the whole process progressively less profitable as transistor density increases. At some point, clock rates could no longer be increased by scaling, and the miniaturization trend started to slow down.
 
 <!--
 
diff --git a/content/english/hpc/complexity/languages.md b/content/english/hpc/complexity/languages.md
index 435b450d..a24eb34f 100644
--- a/content/english/hpc/complexity/languages.md
+++ b/content/english/hpc/complexity/languages.md
@@ -47,7 +47,7 @@ Since running machine code in an interpreter doesn't make sense, this makes a to
 - Compiled languages with a runtime, such as Java, C#, or Erlang (and languages that work on their VMs, such as Scala, F#, or Elixir).
 - Compiled native languages, such as C, Go, or Rust.
 
-There is no "right" way of executing computer programs: each approach has its own gains and drawbacks. Interpreters and virtual machines provide flexibility and enable some nice high-level programming features such as dynamic typing, run-time code alteration, and automatic memory management, but this comes with some unavoidable performance trade-offs, which we will now talk about.
+There is no "right" way of executing computer programs: each approach has its own gains and drawbacks. Interpreters and virtual machines provide flexibility and enable some nice high-level programming features such as dynamic typing, run-time code alteration, and automatic memory management, but these come with some unavoidable performance trade-offs, which we will now talk about.
 
 ### Interpreted languages
 
@@ -94,7 +94,7 @@ This is not surprising if you consider the things that Python needs to do to fig
 - looks up its type, figures out that it's a `float`, and fetches the method implementing `*` operator;
 - does the same things for `b` and `c` and finally add-assigns the result to `c[i][j]`.
 
-Granted, the interpreters of widely-used languages such as Python are well-optimized, and they can skip through some of these steps on repeated execution of the same code. But still, some quite significant overhead is unavoidable due to the language design. If we get rid of all this type checking and pointer chasing, perhaps we can get cycles per multiplication ratio closer to 1, or whatever the "cost" of native multiplication is?
+Granted, the interpreters of widely used languages such as Python are well-optimized, and they can skip through some of these steps on repeated execution of the same code. But still, some quite significant overhead is unavoidable due to the language design. If we get rid of all this type checking and pointer chasing, perhaps we can get the cycles-per-multiplication ratio closer to 1, or whatever the "cost" of native multiplication is?
 
 ### Managed Languages
 
diff --git a/content/english/hpc/external-memory/sorting.md b/content/english/hpc/external-memory/sorting.md
index 8274788c..c7effc46 100644
--- a/content/english/hpc/external-memory/sorting.md
+++ b/content/english/hpc/external-memory/sorting.md
@@ -44,7 +44,7 @@ In the external memory model, when we read a block of size $M$, we can sort its
 
 ![](../img/k-way.png)
 
-This effectively means that, in terms of IO operations, the first $O(\log M)$ layers of mergesort are free, and there are only $O(\log_2 \frac{N}{M})$ non-zero-cost layers, each mergeable in $O(\frac{N}{B})$ IOPS in total. This brings total I/O complexity to
+This effectively means that, in terms of I/O operations, the first $O(\log M)$ layers of mergesort are free, and there are only $O(\log_2 \frac{N}{M})$ non-zero-cost layers, each mergeable in $O(\frac{N}{B})$ I/O operations in total. This brings total I/O complexity to
 
 $$
 O\left(\frac{N}{B} \log_2 \frac{N}{M}\right)
diff --git a/content/english/hpc/pipelining/_index.md b/content/english/hpc/pipelining/_index.md
index 8c388a94..e18a31cc 100644
--- a/content/english/hpc/pipelining/_index.md
+++ b/content/english/hpc/pipelining/_index.md
@@ -5,7 +5,7 @@ weight: 3
 
 When programmers hear the word *parallelism*, they mostly think about *multi-core parallelism*, the practice of explicitly splitting a computation into semi-independent *threads* that work together to solve a common problem.
 
-This type of parallelism is mainly about reducing *latency* and achieving *scalability*, but not about improving *efficiency*. You can solve a problem ten times as big with a parallel algorithm, but it would take at least ten times as many computational resources. Although parallel hardware is becoming [ever more abundant](/hpc/complexity/hardware), and parallel algorithm design is becoming an increasingly more important area, for now, we will consider the use of more than one CPU core cheating.
+This type of parallelism is mainly about reducing *latency* and achieving *scalability*, but not about improving *efficiency*. You can solve a problem ten times as big with a parallel algorithm, but it would take at least ten times as many computational resources. Although parallel hardware is becoming [ever more abundant](/hpc/complexity/hardware), and parallel algorithm design is becoming an increasingly important area, for now, we will consider the use of more than one CPU core cheating.
 
 But there are other types of parallelism, already existing inside a CPU core, that you can use *for free*.
 
@@ -42,16 +42,16 @@ Pipelining does not reduce *actual* latency but functionally makes it seem like
 
 Having this in mind, hardware manufacturers prefer to use *cycles per instruction* (CPI) instead of something like "average instruction latency" as the main performance indicator for CPU designs. It is a [pretty good metric](/hpc/profiling/benchmarking) for algorithm designs too, if we only consider *useful* instructions.
 
-CPI of a perfectly pipelined processor should tend to one, but it can actually be even lower if we make each stage of the pipeline "wider" by duplicating it, so that more than one instruction can be processed at a time. Because the cache and most of the ALU can be shared, this ends up being cheaper than adding a fully separate core. Such architectures, capable of executing more than one instruction per cycle, are called *superscalar*, and most modern CPUs are.
+The CPI of a perfectly pipelined processor should tend to one, but it can actually be even lower if we make each stage of the pipeline "wider" by duplicating it, so that more than one instruction can be processed at a time. Because the cache and most of the ALU can be shared, this ends up being cheaper than adding a fully separate core. Such architectures, capable of executing more than one instruction per cycle, are called *superscalar*, and most modern CPUs are.
 
-You can only take advantage of superscalar processing if the stream of instructions contains groups of logically independent operations that can be processed separately. The instructions don't always arrive in the most convenient order, so, when possible, modern CPUs can execute them *out-of-order* to improve overall utilization and minimize pipeline stalls. How this magic works is a topic for a more advanced discussion<!--[a more advanced discussion](scheduling)-->, but for now, you can assume that the CPU maintains a buffer of pending instructions up to some distance in the future, and executes them as soon as the values of its operands are computed and there is an execution unit available.
+You can only take advantage of superscalar processing if the stream of instructions contains groups of logically independent operations that can be processed separately. The instructions don't always arrive in the most convenient order, so, when possible, modern CPUs can execute them *out of order* to improve overall utilization and minimize pipeline stalls. How this magic works is a topic for a more advanced discussion<!--[a more advanced discussion](scheduling)-->, but for now, you can assume that the CPU maintains a buffer of pending instructions up to some distance in the future, and executes them as soon as the values of their operands are computed and there is an execution unit available.
 
 ### An Education Analogy
 
 Consider how our education system works:
 
 1. Topics are taught to groups of students instead of individuals as broadcasting the same things to everyone at once is more efficient.
-2. An intake of students is split into groups lead by different teachers; assignments and other course materials are shared between groups.
+2. An intake of students is split into groups led by different teachers; assignments and other course materials are shared between groups.
 3. Each year the same course is taught to a new intake so that the teachers are kept busy.
 
 These innovations greatly increase the *throughput* of the whole system, although the *latency* (time to graduation for a particular student) remains unchanged (and maybe increases a little bit because personalized tutoring is more effective).
diff --git a/content/english/hpc/pipelining/branchless.md b/content/english/hpc/pipelining/branchless.md
index 280498b1..0f87da83 100644
--- a/content/english/hpc/pipelining/branchless.md
+++ b/content/english/hpc/pipelining/branchless.md
@@ -32,7 +32,7 @@ Suddenly, the loop now takes ~7 cycles per element instead of the original ~14.
 
 But wait… shouldn't there still be a branch? How does `(a[i] < 50)` map to assembly?
 
-There are no boolean types in assembly, nor any instructions that yield either one or zero based on the result of the comparison, but we can compute it indirectly like this: `(a[i] - 50) >> 31`. This trick relies on the [binary representation of integers](/hpc/arithmetic/integer), specifically on the fact that if the expression `a[i] - 50` is negative (implying `a[i] < 50`), then the highest bit of the result will be set to one, which we can then extract using a right-shift.
+There are no Boolean types in assembly, nor any instructions that yield either one or zero based on the result of the comparison, but we can compute it indirectly like this: `(a[i] - 50) >> 31`. This trick relies on the [binary representation of integers](/hpc/arithmetic/integer), specifically on the fact that if the expression `a[i] - 50` is negative (implying `a[i] < 50`), then the highest bit of the result will be set to one, which we can then extract using a right-shift.
 
 ```nasm
 mov  ebx, eax   ; t = x
@@ -101,7 +101,7 @@ In our example, the branchy code wins when the branch can be predicted with a pr
 
 ![](../img/branchy-vs-branchless.svg)
 
-This 75% threshold is commonly used by the compilers as a heuristic for determining whether to use the `cmov` or not. Unfortunately, this probability is usually unknown at the compile-time, so it needs to be provided in one of several ways:
+This 75% threshold is commonly used by compilers as a heuristic for determining whether to use `cmov` or not. Unfortunately, this probability is usually unknown at compile time, so it needs to be provided in one of several ways:
 
 - We can use [profile-guided optimization](/hpc/compilation/situational/#profile-guided-optimization) which will decide for itself whether to use predication or not.
 - We can use [likeliness attributes](../branching#hinting-likeliness-of-branches) and [compiler-specific intrinsics](/hpc/compilation/situational) to hint at the likeliness of branches: `__builtin_expect_with_probability` in GCC and `__builtin_unpredictable` in Clang.
diff --git a/content/english/hpc/pipelining/hazards.md b/content/english/hpc/pipelining/hazards.md
index 02a0869d..d4a2d7df 100644
--- a/content/english/hpc/pipelining/hazards.md
+++ b/content/english/hpc/pipelining/hazards.md
@@ -20,6 +20,6 @@ Different hazards have different penalties:
 
 - In structural hazards, you have to wait (usually one more cycle) until the execution unit is ready. They are fundamental bottlenecks on performance and can't be avoided — you have to engineer around them.
 - In data hazards, you have to wait for the required data to be computed (the latency of the *critical path*). Data hazards are solved by restructuring computations so that the critical path is shorter.
-- In control hazards, you generally have to flush the entire pipeline and start over, wasting whole 15-20 cycles. They are solved by either removing branches completely, or making them predictable so that the CPU can effectively *speculate* on what is going to be executed next.
+- In control hazards, you generally have to flush the entire pipeline and start over, wasting a whole 15-20 cycles. They are solved by either removing branches completely, or making them predictable so that the CPU can effectively *speculate* on what is going to be executed next.
 
-As they have very different impact on performance, we are going to go in the reversed order and start with the more grave ones.
+As they have very different impacts on performance, we are going to go in reverse order and start with the more grave ones.
diff --git a/content/english/hpc/pipelining/tables.md b/content/english/hpc/pipelining/tables.md
index 24678270..5f69c579 100644
--- a/content/english/hpc/pipelining/tables.md
+++ b/content/english/hpc/pipelining/tables.md
@@ -14,7 +14,7 @@ In this context, it makes sense to use two different "[costs](/hpc/complexity)"
 
 <!-- alternative throughput definitions, maybe in scheduling? -->
 
-You can get latency and throughput numbers for a specific architecture from special documents called [instruction tables](https://www.agner.org/optimize/instruction_tables.pdf). Here are some samples values for my Zen 2 (all specified for 32-bit operands, if there is any difference):
+You can get latency and throughput numbers for a specific architecture from special documents called [instruction tables](https://www.agner.org/optimize/instruction_tables.pdf). Here are some sample values for my Zen 2 (all specified for 32-bit operands, if there is any difference):
 
 | Instruction | Latency | RThroughput |
 |-------------|---------|:------------|
@@ -34,7 +34,7 @@ Some comments:
 - If a certain instruction is especially frequent, its execution unit could be duplicated to increase its throughput — possibly to even more than one, but not higher than the [decode width](/hpc/architecture/layout).
 - Some instructions have a latency of 0. This means that these instruction are used to control the scheduler and don't reach the execution stage. They still have non-zero reciprocal throughput because the [CPU front-end](/hpc/architecture/layout) still needs to process them.
 - Most instructions are pipelined, and if they have the reciprocal throughput of $n$, this usually means that their execution unit can take another instruction after $n$ cycles (and if it is below 1, this means that there are multiple execution units, all capable of taking another instruction on the next cycle). One notable exception is the [integer division](/hpc/arithmetic/division): it is either very poorly pipelined or not pipelined at all.
-- Some instructions have variable latency, depending on not only the size, but also the values of the operands. For memory operations (including fused ones like `add`), latency is usually specified for the best case (an L1 cache hit).
+- Some instructions have variable latency, depending on not only the size, but also the values of the operands. For memory operations (including fused ones like `add`), the latency is usually specified for the best case (an L1 cache hit).
 
 There are many more important little details, but this mental model will suffice for now.
 
diff --git a/content/english/hpc/pipelining/throughput.md b/content/english/hpc/pipelining/throughput.md
index 27789b28..03562291 100644
--- a/content/english/hpc/pipelining/throughput.md
+++ b/content/english/hpc/pipelining/throughput.md
@@ -6,7 +6,7 @@ weight: 4
 Optimizing for *latency* is usually quite different from optimizing for *throughput*:
 
 - When optimizing data structure queries or small one-time or branchy algorithms, you need to [look up the latencies](../tables) of its instructions, mentally construct the execution graph of the computation, and then try to reorganize it so that the critical path is shorter. <!-- [Binary GCD](/hpc/algorithms/gcd) is a good example of that. -->
-- When optimizing hot loops and large-dataset algorithms, you need to look up the throughputs of its instructions, count how many times each one is used per iteration, determine which of them is the bottleneck, and then try to restructure the loop so that it is used less often.
+- When optimizing hot loops and large-dataset algorithms, you need to look up the throughputs of their instructions, count how many times each one is used per iteration, determine which of them is the bottleneck, and then try to restructure the loop so that it is used less often.
 
 The last advice only works for *data-parallel* loops, where each iteration is fully independent of the previous one. When there is some interdependency between consecutive iterations, there may potentially be a pipeline stall caused by a [data hazard](../hazards) as the next iteration is waiting for the previous one to complete.
 
@@ -64,7 +64,7 @@ If an instruction has a latency of $x$ and a throughput of $y$, then you would n
 
 This technique is mostly used with [SIMD](/hpc/simd) and not in scalar code. You can [generalize](/hpc/simd/reduction) the code above and compute sums and other reductions faster than the compiler.
 
-In general, when optimizing loops, you usually have just one or a few *execution ports* that you want to utilize to their fullest, and you engineer the rest of the loop around them. As different instructions may use different sets of ports, it is not always clear which one is going to be the overused. In situations like this, [machine code analyzers](/hpc/profiling/mca) can be very helpful for finding bottlenecks of small assembly loops.
+In general, when optimizing loops, you usually have just one or a few *execution ports* that you want to utilize to their fullest, and you engineer the rest of the loop around them. As different instructions may use different sets of ports, it is not always clear which one is going to be overused. In situations like this, [machine code analyzers](/hpc/profiling/mca) can be very helpful for finding the bottlenecks of small assembly loops.
 
 <!--
 

From 73fbdf4a3f32095b1684453d5bf673ce8d1e33d2 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 18 May 2022 09:45:27 +0300
Subject: [PATCH 098/173] montgomery multiplication

---
 .../english/hpc/number-theory/montgomery.md   | 184 ++++++++++--------
 1 file changed, 100 insertions(+), 84 deletions(-)
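
Reviewer note (not part of the patch): a minimal, self-contained sketch for sanity-checking the `Montgomery` structure this patch adds. The structure mirrors the patched code. The `montgomery_mulmod` helper and the final normalization step are assumptions added for illustration only and do not exist in the article. The sketch assumes an odd modulus below $2^{30}$, so that products of unnormalized values still fit the bound `reduce` requires.

```cpp
#include <cassert>
#include <cstdint>

typedef uint32_t u32;
typedef uint64_t u64;

struct Montgomery {
    u32 n, nr;

    // n must be odd; after the loop, nr == n^{-1} mod 2^32
    constexpr Montgomery(u32 n) : n(n), nr(1) {
        // each step doubles the number of correct low bits: 1 -> 2 -> 4 -> 8 -> 16 -> 32
        for (int i = 0; i < 5; i++)
            nr *= 2 - n * nr;
    }

    // returns a value congruent to x * r^{-1} mod n (r = 2^32), not fully normalized
    u32 reduce(u64 x) const {
        u32 q = u32(x) * nr;          // q = x * n^{-1} mod r
        u32 m = ((u64) q * n) >> 32;  // upper half of q * n
        return (x >> 32) + n - m;     // adding n keeps the difference non-negative
    }

    u32 multiply(u32 x, u32 y) const {
        return reduce((u64) x * y);
    }

    // maps x to x * r mod n (into the Montgomery space)
    u32 transform(u32 x) const {
        return (u64(x) << 32) % n;
    }
};

// Hypothetical helper (an assumption, not from the patch):
// computes a * b mod n for a, b < n by going through the Montgomery space.
u32 montgomery_mulmod(const Montgomery &space, u32 a, u32 b) {
    u32 c = space.multiply(space.transform(a), space.transform(b));
    u32 y = space.reduce(c);           // leave the Montgomery space
    return y >= space.n ? y - space.n : y; // normalize into [0, n)
}
```

Comparing `montgomery_mulmod(space, a, b)` with `(u64) a * b % n` for a handful of inputs is a quick way to confirm that five iterations are enough to compute `nr` and that the branchless `reduce` only needs one final correction.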

diff --git a/content/english/hpc/number-theory/montgomery.md b/content/english/hpc/number-theory/montgomery.md
index c5d47d9d..3488474b 100644
--- a/content/english/hpc/number-theory/montgomery.md
+++ b/content/english/hpc/number-theory/montgomery.md
@@ -1,7 +1,6 @@
 ---
 title: Montgomery Multiplication
 weight: 4
-draft: true
 ---
 
 Unsurprisingly, large fractions of computations in [modular arithmetic](../modular) are often spent on calculating the modulo operation, which is as slow as general integer division and typically taking 15-20 cycles, depending on the operand size.
@@ -87,73 +86,122 @@ This means that, after we normally multiply two numbers in the Montgomery space,
 
 ### Montgomery reduction
 
-Assume that $r=2^{32}$, the modulo $n$ is 32-bit, and the number $x$ we need to reduce (multiply by $r^{-1}$ and take it modulo $n$) is the 64-bit the product of two 32-bit numbers.
+Assume that $r=2^{32}$, the modulus $n$ is 32-bit, and the number $x$ we need to reduce is 64-bit (the product of two 32-bit numbers). Our goal is to calculate $y = x \cdot r^{-1} \bmod n$.
 
-By definition, $\gcd(n, r) = 1$, so we know that there are two numbers $r^{-1}$ and $n'$ in the $[0, n)$ range such that
+Since $r$ is coprime with $n$, we know that there are two numbers $r^{-1}$ and $n^\prime$ in the $[0, n)$ range such that
 
 $$
-r \cdot r^{-1} + n \cdot n' = 1
+r \cdot r^{-1} + n \cdot n^\prime = 1
 $$
 
-and both $r^{-1}$ and $n'$ can be computed using the [extended Euclidean algorithm](../euclid-extended).
+and both $r^{-1}$ and $n^\prime$ can be computed e. g. using the [extended Euclidean algorithm](../euclid-extended).
 
-Using this identity, we can express $r \cdot r^{-1}$ as $(-n \cdot n' + 1)$ and write $x \cdot r^{-1}$ as
+Using this identity, we can express $r \cdot r^{-1}$ as $(1 - n \cdot n^\prime)$ and write $x \cdot r^{-1}$ as
 
 $$
 \begin{aligned}
 x \cdot r^{-1} &= x \cdot r \cdot r^{-1} / r
-\\             &= x \cdot (-n \cdot n^{\prime} + 1) / r
-\\             &= (-x \cdot n \cdot n^{\prime} + x) / r
-\\             &\equiv (-x \cdot n \cdot n^{\prime} + l \cdot r \cdot n + x) / r \bmod n
-\\             &\equiv ((-x \cdot n^{\prime} + l \cdot r) \cdot n + x) / r \bmod n
+\\             &= x \cdot (1 - n \cdot n^{\prime}) / r
+\\             &= (x - x \cdot n \cdot n^{\prime}    ) / r
+\\             &\equiv (x - x \cdot n \cdot n^{\prime} + k \cdot r \cdot n) / r &\pmod n &\;\;\text{(for any integer $k$)}
+\\             &\equiv (x - (x \cdot n^{\prime} - k \cdot r) \cdot n) / r &\pmod n
 \end{aligned}
 $$
 
-The equivalences hold for any integer $l$. This means that we can add or subtract an arbitrary multiple of $r$ to $x \cdot n'$, or in other words, we can compute $q = x \cdot n'$ modulo $r$.
+Now, if we choose $k$ to be $\lfloor x \cdot n^\prime / r \rfloor$ (the upper 64 bits of the $x \cdot n^\prime$ product), it will cancel out, and $(x \cdot n^{\prime} - k \cdot r)$ will simply be equal to $x \cdot n^{\prime} \bmod r$ (the lower 32 bits of $x \cdot n^\prime$), implying:
 
-This gives us the following algorithm to compute $x \cdot r^{-1} \bmod n$:
+$$
+x \cdot r^{-1} \equiv (x - (x \cdot n^{\prime} \bmod r) \cdot n) / r \pmod n
+$$
+
+The algorithm itself just evaluates this formula: it performs two multiplications to calculate $q = x \cdot n^{\prime} \bmod r$ and $m = q \cdot n$, subtracts $m$ from $x$, and right-shifts the result to divide it by $r$.
+
+The only remaining thing to handle is that the result may not be in the $[0, n)$ range; but since
+
+$$
+x < n \cdot n < r \cdot n \implies x / r < n
+$$
+
+and
+
+$$
+m = q \cdot n < r \cdot n \implies m / r < n
+$$
+
+it is guaranteed that
+
+$$
+-n < (x - m) / r < n
+$$
+
+Therefore, we can simply check if the result is negative and in that case, add $n$ to it, giving the following algorithm:
+
+```c++
+typedef __uint32_t u32;
+typedef __uint64_t u64;
+
+const u32 n = 1e9 + 7, nr = inverse(n, 1ull << 32);
 
-```python
-def reduce(x):
-    q = (x % r) * nr % r
-    a = (x - q * n) / r
-    if a < 0:
-        a += n
-    return a
+u32 reduce(u64 x) {
+    u32 q = u32(x) * nr;      // q = x * n' mod r
+    u64 m = (u64) q * n;      // m = q * n
+    u32 y = (x - m) >> 32;    // y = (x - m) / r
+    return x < m ? y + n : y; // if y < 0, add n to make it be in the [0, n) range
+}
 ```
 
-Since $x < n \cdot n < r \cdot n$ and $q \cdot n < r \cdot n$, we know that
+This last check is relatively cheap, but it is still on the critical path. If we are fine with the result being in the $[0, 2 \cdot n - 2]$ range instead of $[0, n)$, we can remove it and add $n$ to the result unconditionally:
+
+```c++
+u32 reduce(u64 x) {
+    u32 q = u32(x) * nr;
+    u64 m = (u64) q * n;
+    u32 y = (x - m) >> 32;
+    return y + n;
+}
+```
+
+We can also move the `>> 32` operation one step earlier in the computation graph and compute $\lfloor x / r \rfloor - \lfloor m / r \rfloor$ instead of $(x - m) / r$. This is correct because the lower 32 bits of $x$ and $m$ are equal anyway since
 
 $$
--n < (x - q \cdot n) / r < n
+m = x \cdot n^\prime \cdot n \equiv x \pmod r
 $$
 
-Therefore, the final modulo operation can be implemented using a single bound check and addition.
+But why would we voluntarily choose to perform two right-shifts instead of just one? This is beneficial because for `((u64) q * n) >> 32` we need to do a 32-by-32 multiplication and take the upper 32 bits of the result (which the x86 `mul` instruction [already writes](../hpc/arithmetic/integer/#128-bit-integers) in a separate register, so it doesn't cost anything), and the other right-shift `x >> 32` is not on the critical path.
+
+```c++
+u32 reduce(u64 x) {
+    u32 q = u32(x) * nr;
+    u32 m = ((u64) q * n) >> 32;
+    return (x >> 32) + n - m;
+}
+```
 
-Here is an equivalent C implementation for 64-bit integers:
+One of the main advantages of Montgomery multiplication over other modular reduction methods is that it doesn't require very large data types: it only needs a multiplication of two $\log_2 r$-bit numbers that extracts the lower and upper $\log_2 r$ bits of the result, which [has special support](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=7395,7392,7269,4868,7269,7269,1820,1835,6385,5051,4909,4918,5051,7269,6423,7410,150,2138,1829,1944,3009,1029,7077,519,5183,4462,4490,1944,5055,5012,5055&techs=AVX,AVX2&text=mul) on most hardware. This also makes it easily generalizable to [SIMD](../hpc/simd/) and larger data types:
 
 ```c++
-typedef unsigned long long u64;
 typedef __uint128_t u128;
 
-u64 reduce(u128 x) {
+u64 reduce(u128 x) {
     u64 q = u64(x) * nr;
     u64 m = ((u128) q * n) >> 64;
-    u64 xhi = (x >> 64);
-    if (xhi >= m)
-        return (xhi - m);
-    else
-        return (xhi - m) + n;
+    return (x >> 64) + n - m;
 }
 ```
 
-We also need to implement calculating calculating the inverse of $n$ (`nr`) and transformation of numbers in and our of Montgomery space. Before providing complete implementation, let's discuss how to do that smarter, although they are just done once.
+Note that a 128-by-64 modulo is not possible with general integer division tricks: the compiler [falls back](https://godbolt.org/z/fbEE4v4qr) to calling a slow [long arithmetic library function](https://github.com/llvm-mirror/compiler-rt/blob/69445f095c22aac2388f939bedebf224a6efcdaf/lib/builtins/udivmodti4.c#L22) to support it.
+
+### Faster Inverse and Transform
 
-To transfer a number back from the Montgomery space we can just use Montgomery reduction.
+Montgomery multiplication itself is fast, but it requires some precomputation:
 
-### Fast inverse
+- inverting $n$ modulo $r$ to compute $n^\prime$,
+- transforming a number *to* the Montgomery space,
+- transforming a number *from* the Montgomery space.
 
-For computing the inverse $n' = n^{-1} \bmod r$ more efficiently, we can use the following trick inspired from the Newton's method:
+The last operation is already efficiently performed with the `reduce` procedure we just implemented, but the first two can be slightly optimized.
+
+**Computing the inverse** $n^\prime = n^{-1} \bmod r$ can be done faster than with the extended Euclidean algorithm by taking advantage of the fact that $r$ is a power of two and using the following identity:
 
 $$
 a \cdot x \equiv 1 \bmod 2^k
@@ -163,7 +211,7 @@ a \cdot x \cdot (2 - a \cdot x)
 1 \bmod 2^{2k}
 $$
 
-This can be proven this way:
+Proof:
 
 $$
 \begin{aligned}
@@ -176,41 +224,36 @@ a \cdot x \cdot (2 - a \cdot x)
 \end{aligned}
 $$
 
-This means we can start with $x = 1$ as the inverse of $a$ modulo $2^1$, apply the trick a few times and in each iteration we double the number of correct bits of $x$.
-
-### Fast transformation
+We can start with $x = 1$ as the inverse of $a$ modulo $2^1$ and apply this identity exactly $\log_2 \log_2 r$ times (5 times for $r = 2^{32}$), each time doubling the number of correct bits in the inverse, which is somewhat reminiscent of [Newton's method](../hpc/arithmetic/newton/).
 
-Although we can just multiply a number by $r$ and compute one modulo the usual way, there is a faster way that makes use of the following relation:
+**Transforming** a number into the Montgomery space can be done by multiplying it by $r$ and computing modulo [the usual way](../hpc/arithmetic/division/), but we can also take advantage of this relation:
 
 $$
 \bar{x} = x \cdot r \bmod n = x * r^2
 $$
 
-Transforming a number into the space is just a multiplication inside the space of the number with $r^2$. Therefore we can precompute $r^2 \bmod n$ and just perform a multiplication and reduction instead.
+Transforming a number into the space is just a Montgomery multiplication by $r^2$ (a regular multiplication followed by a reduction). Therefore, we can precompute $r^2 \bmod n$ and perform a multiplication and reduction instead, which may or may not actually be faster because multiplying a number by $r=2^{k}$ can be implemented with a left-shift, while multiplication by $r^2 \bmod n$ cannot.
 
 ### Complete Implementation
 
-```c++
-typedef __uint32_t u32;
-typedef __uint64_t u64;
+It is convenient to wrap everything into a single `constexpr` structure:
 
-struct montgomery {
+```c++
+struct Montgomery {
     u32 n, nr;
     
-    constexpr montgomery(u32 n) : n(n), nr(1) {
-        for (int i = 0; i < 6; i++)
+    constexpr Montgomery(u32 n) : n(n), nr(1) {
+        // log_2(32) = 5: each iteration doubles the number of correct bits
+        for (int i = 0; i < 5; i++)
             nr *= 2 - n * nr;
     }
 
     u32 reduce(u64 x) const {
         u32 q = u32(x) * nr;
         u32 m = ((u64) q * n) >> 32;
-        u32 xhi = (x >> 32);
-        return xhi + n - m;
-        
-        // if you need 
-        // u32 t = xhi - m;
-        // return xhi >= m ? t : t + n;
+        return (x >> 32) + n - m;
+        // returns a number in the [0, 2 * n - 2] range
+        // (add a "x < n ? x : x - n" type of check if you need a proper modulo)
     }
 
     u32 multiply(u32 x, u32 y) const {
@@ -219,44 +262,15 @@ struct montgomery {
 
     u32 transform(u32 x) const {
         return (u64(x) << 32) % n;
+        // can also be implemented as multiply(x, r^2 mod n)
     }
 };
 ```
 
-```c++
-montgomery m(n);
-
-a = m.transform(a);
-b = m.transform(b);
-c = m.multiply(a, b);
-c = m.reduce(c);
-```
-
-```c++
-int inverse(int _a) {
-    u32 a = space.transform(_a);
-    u32 r = space.transform(1);
-    
-    int n = M - 2;
-    while (n) {
-        if (n & 1)
-            r = space.multiply(r, a);
-        a = space.multiply(a, a);
-        n >>= 1;
-    }
-    
-    return space.reduce(r);
-}
-```
-
-SIMD
-
-166.79 ns
-
-207.04 ns
+To test its performance, we can plug Montgomery multiplication into the [binary exponentiation](../hpc/number-theory/exponentiation/):
 
 ```c++
-constexpr montgomery space(M);
+constexpr Montgomery space(M);
 
 int inverse(int _a) {
     u64 a = space.transform(_a);
@@ -273,4 +287,6 @@ int inverse(int _a) {
 }
 ```
 
+While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158ns if we omit `transform` and `reduce` (a reasonable use case in modular arithmetic is for `inverse` to be used as a subprocedure in a bigger computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.
+
 **Exercise.** Implement efficient *modular* [matix multiplication](/hpc/algorithms/matmul).

From 9cb7479ea6f7d6869711866369da08d94e2bf3af Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 18 May 2022 10:20:23 +0300
Subject: [PATCH 099/173] consistent "run/compile(-)time" hyphenation

---
 content/english/hpc/architecture/layout.md    |  2 +-
 content/english/hpc/compilation/contracts.md  |  2 +-
 content/english/hpc/compilation/flags.md      |  2 +-
 content/english/hpc/compilation/precalc.md    | 12 ++++++------
 content/english/hpc/complexity/languages.md   |  2 +-
 content/english/hpc/data-structures/s-tree.md |  2 +-
 content/english/hpc/parallel/gpu/_index.en.md |  2 +-
 content/english/hpc/simd/moving.md            |  2 +-
 8 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/content/english/hpc/architecture/layout.md b/content/english/hpc/architecture/layout.md
index 9ddebfd5..37779325 100644
--- a/content/english/hpc/architecture/layout.md
+++ b/content/english/hpc/architecture/layout.md
@@ -54,7 +54,7 @@ The instruction cache is crucial in situations when you either
 The memory system can therefore become the bottleneck for programs with large machine code. This consideration limits the applicability of the optimization techniques we've previously discussed:
 
 - [Inlining functions](../functions) is not always optimal, because it reduces code sharing and increases the binary size, requiring more instruction cache.
-- [Unrolling loops](../loops) is only beneficial up to some extent, even if the number of iterations is known during compile-time: at some point, the CPU would have to fetch both instructions and data from the main memory, in which case it will likely be bottlenecked by the memory bandwidth.
+- [Unrolling loops](../loops) is only beneficial up to some extent, even if the number of iterations is known during compile time: at some point, the CPU would have to fetch both instructions and data from the main memory, in which case it will likely be bottlenecked by the memory bandwidth.
 - Huge [code alignments](#code-alignment) increase the binary size, again requiring more instruction cache. Spending one more cycle on fetch is a minor penalty compared to missing the cache and waiting for the instructions to be fetched from the main memory.
 
 Another aspect is that placing frequently used instruction sequences on the same [cache lines](/hpc/cpu-cache/cache-lines) and [memory pages](/hpc/cpu-cache/paging) improves [cache locality](/hpc/external-memory/locality). To improve instruction cache utilization, you should  group hot code with hot code and cold code with cold code, and remove dead (unused) code if possible. If you want to explore this idea further, check out Facebook's [Binary Optimization and Layout Tool](https://engineering.fb.com/2018/06/19/data-infrastructure/accelerate-large-scale-applications-with-bolt/), which was recently [merged](https://github.com/llvm/llvm-project/commit/4c106cfdf7cf7eec861ad3983a3dd9a9e8f3a8ae) into LLVM.
diff --git a/content/english/hpc/compilation/contracts.md b/content/english/hpc/compilation/contracts.md
index 2337ddcc..cea2c198 100644
--- a/content/english/hpc/compilation/contracts.md
+++ b/content/english/hpc/compilation/contracts.md
@@ -45,7 +45,7 @@ T at(size_t k) {
 }
 ```
 
-Interestingly, these checks are rarely actually executed during runtime because the compiler can often prove — during compile-time — that each access will be within bounds. For example, when iterating in a `for` loop from 1 to the array size and indexing $i$-th element on each step, nothing illegal can possibly happen, so the bounds checks can be safely optimized away.
+Interestingly, these checks are rarely actually executed during runtime because the compiler can often prove — during compile time — that each access will be within bounds. For example, when iterating in a `for` loop from 1 to the array size and indexing $i$-th element on each step, nothing illegal can possibly happen, so the bounds checks can be safely optimized away.
 
 ### Assumptions
 
diff --git a/content/english/hpc/compilation/flags.md b/content/english/hpc/compilation/flags.md
index 1f70a622..21db0134 100644
--- a/content/english/hpc/compilation/flags.md
+++ b/content/english/hpc/compilation/flags.md
@@ -35,7 +35,7 @@ This is useful when you need to optimize a single high-performance procedure wit
 
 ### Multiversioned Functions
 
-Sometimes you may also want to provide several architecture-specific implementations in a single library. You can use attribute-based syntax to select between multiversioned functions automatically during compile-time:
+Sometimes you may also want to provide several architecture-specific implementations in a single library. You can use attribute-based syntax to select between multiversioned functions automatically during compile time:
 
 ```c++
 __attribute__(( target("default") )) // fallback implementation
diff --git a/content/english/hpc/compilation/precalc.md b/content/english/hpc/compilation/precalc.md
index de592d00..29b31cd6 100644
--- a/content/english/hpc/compilation/precalc.md
+++ b/content/english/hpc/compilation/precalc.md
@@ -3,13 +3,13 @@ title: Precomputation
 weight: 8
 ---
 
-When compilers can infer that a certain variable does not depend on any user-provided data, they can compute its value during compile-time and turn it into a constant by embedding it into the generated machine code.
+When compilers can infer that a certain variable does not depend on any user-provided data, they can compute its value during compile time and turn it into a constant by embedding it into the generated machine code.
 
 This optimization helps performance a lot, but it is not a part of the C++ standard, so compilers don't *have to* do that. When a compile-time computation is either hard to implement or time-intensive, they have a full legal right to pass on that opportunity.
 
 ### Constant Expressions
 
-In modern C++, you can mark a function as `constexpr`, and if it is called by passing constants, its value is guaranteed to be computed during compile-time:
+In modern C++, you can mark a function as `constexpr`, and if it is called by passing constants, its value is guaranteed to be computed during compile time:
 
 ```c++
 constexpr int fibonacci(int n) {
@@ -23,7 +23,7 @@ static_assert(fibonacci(10) == 55);
 
 These functions have some restrictions like that they only call other `constexpr` functions and can't do memory allocation, but otherwise, they are executed "as is."
 
-Note that while they don't cost anything during the run-time, they still increase compilation time, so at least remotely care about their efficiency and don't put something NP-complete in them:
+Note that while they don't cost anything during the run time, they still increase compilation time, so at least remotely care about their efficiency and don't put something NP-complete in them:
 
 ```c++
 constexpr int fibonacci(int n) {
@@ -54,20 +54,20 @@ constexpr Precalc P;
 static_assert(P.isqrt[42] == 6);
 ```
 
-Note that when you call `constexpr` functions while passing non-constants, the compiler may or may not compute them during compile-time:
+Note that when you call `constexpr` functions while passing non-constants, the compiler may or may not compute them during compile time:
 
 ```c++
 for (int i = 0; i < 100; i++)
     cout << fibonacci(i) << endl;
 ```
 
-In this example, even though technically we perform a constant number of iterations and call `fibonacci` with parameters known at compile-time, they are technically not compile-time constants. It's up to the compiler whether to optimize this loop or not — and for heavy computations, it often chooses not to.
+In this example, even though technically we perform a constant number of iterations and call `fibonacci` with parameters known at compile time, they are technically not compile-time constants. It's up to the compiler whether to optimize this loop or not — and for heavy computations, it often chooses not to.
 
 <!--
 
 ### Code Generation
 
-There are plenty of languages that support computing *data* during compile-time, but none can produce efficient code at all times.
+There are plenty of languages that support computing *data* during compile time, but none can produce efficient code at all times.
 
 One huge example is generating lexers and parsers: which is usually done in.
 
diff --git a/content/english/hpc/complexity/languages.md b/content/english/hpc/complexity/languages.md
index a24eb34f..abb80979 100644
--- a/content/english/hpc/complexity/languages.md
+++ b/content/english/hpc/complexity/languages.md
@@ -47,7 +47,7 @@ Since running machine code in an interpreter doesn't make sense, this makes a to
 - Compiled languages with a runtime, such as Java, C#, or Erlang (and languages that work on their VMs, such as Scala, F#, or Elixir).
 - Compiled native languages, such as C, Go, or Rust.
 
-There is no "right" way of executing computer programs: each approach has its own gains and drawbacks. Interpreters and virtual machines provide flexibility and enable some nice high-level programming features such as dynamic typing, run-time code alteration, and automatic memory management, but these come with some unavoidable performance trade-offs, which we will now talk about.
+There is no "right" way of executing computer programs: each approach has its own gains and drawbacks. Interpreters and virtual machines provide flexibility and enable some nice high-level programming features such as dynamic typing, run-time code alteration, and automatic memory management, but these come with some unavoidable performance trade-offs, which we will now talk about.
 
 ### Interpreted languages
 
diff --git a/content/english/hpc/data-structures/s-tree.md b/content/english/hpc/data-structures/s-tree.md
index 3fcf97b5..bf3b2805 100644
--- a/content/english/hpc/data-structures/s-tree.md
+++ b/content/english/hpc/data-structures/s-tree.md
@@ -325,7 +325,7 @@ The disadvantage is that this layout is not *succinct*: we need some additional
 
 ### Implicit B+ Tree
 
-To be more explicit with pointer arithmetic, we will store the entire tree in a single one-dimensional array. To minimize index computations during run-time, we will store each layer sequentially in this array and use compile-time computed offsets to address them: the keys of the node number `k` on layer `h` start with `btree[offset(h) + k * B]`, and its `i`-th child will at `btree[offset(h - 1) + (k * (B + 1) + i) * B]`.
+To be more explicit with pointer arithmetic, we will store the entire tree in a single one-dimensional array. To minimize index computations during run time, we will store each layer sequentially in this array and use compile-time computed offsets to address them: the keys of the node number `k` on layer `h` start with `btree[offset(h) + k * B]`, and its `i`-th child will be at `btree[offset(h - 1) + (k * (B + 1) + i) * B]`.
 
 To implement all that, we need slightly more `constexpr` functions:
 
diff --git a/content/english/hpc/parallel/gpu/_index.en.md b/content/english/hpc/parallel/gpu/_index.en.md
index f08d10c7..540d5d5c 100644
--- a/content/english/hpc/parallel/gpu/_index.en.md
+++ b/content/english/hpc/parallel/gpu/_index.en.md
@@ -167,7 +167,7 @@ There is also `drv.InOut` function, which makes it available for both reading an
 
 Most of the operations here are memory operations, so measuring performance here is useless. Don't worry, we will get to more complex examples soon enough.
 
-GPUs have very specific operations. However, in case of NVIDIA GPUs managing it is quite simple: the cards have *compute capabilities* (1.0, 1.1, 1.2, 1.3, 2.0, etc.) and all features added at capability $x$ is also available at later versions. These can be checked at run-time or compile-time.
+GPUs have very specific operations. However, in the case of NVIDIA GPUs, managing them is quite simple: the cards have *compute capabilities* (1.0, 1.1, 1.2, 1.3, 2.0, etc.) and all features added at capability $x$ are also available in later versions. These can be checked at run time or compile time.
 
 You can check differences in this Wikipedia article: https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications
 
diff --git a/content/english/hpc/simd/moving.md b/content/english/hpc/simd/moving.md
index 20c03327..948c31c4 100644
--- a/content/english/hpc/simd/moving.md
+++ b/content/english/hpc/simd/moving.md
@@ -39,7 +39,7 @@ for (int i = 0; i < n; i += 8) {
 
 In the first version, assuming that arrays `a`, `b` and `c` are all 64-byte *aligned* (the addresses of their first elements are divisible by 64, and so they start at the beginning of a cache line), roughly half of reads and writes will be "bad" because they cross a cache line boundary.
 
-Note that the performance difference is caused by the cache system and not by the instructions themselves. On most modern architectures, the `loadu` / `storeu` intrinsics should be equally as fast as `load` / `store` given that in both cases the blocks only span one cache line. The advantage of the latter is that they can act as free run-time assertions that all reads and writes are aligned.
+Note that the performance difference is caused by the cache system and not by the instructions themselves. On most modern architectures, the `loadu` / `storeu` intrinsics should be equally as fast as `load` / `store` given that in both cases the blocks only span one cache line. The advantage of the latter is that they can act as free run-time assertions that all reads and writes are aligned.
 
 This makes it important to properly [align](/hpc/cpu-cache/alignment) arrays and other data on allocation, and it is also one of the reasons why compilers can't always [auto-vectorize](../auto-vectorization) efficiently. For most purposes, we only need to guarantee that any 32-byte SIMD block will not cross a cache line boundary, and we can specify this alignment with the `alignas` specifier:
 

From 8fa3559e2537d9c35d4c8581deac5a85d3efc708 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 18 May 2022 10:41:45 +0300
Subject: [PATCH 100/173] consistent "i.e." and "e.g." punctuation

---
 README.md                                            | 2 +-
 content/english/hpc/algorithms/matmul.md             | 2 +-
 content/english/hpc/architecture/assembly.md         | 2 +-
 content/english/hpc/architecture/functions.md        | 2 +-
 content/english/hpc/architecture/loops.md            | 2 +-
 content/english/hpc/arithmetic/bit-hacks.md          | 4 ++--
 content/english/hpc/arithmetic/ieee-754.md           | 2 +-
 content/english/hpc/arithmetic/integer.md            | 4 ++--
 content/english/hpc/arithmetic/newton.md             | 2 +-
 content/english/hpc/arithmetic/rsqrt.md              | 4 ++--
 content/english/hpc/compilation/contracts.md         | 2 +-
 content/english/hpc/compilation/flags.md             | 2 +-
 content/english/hpc/compilation/limitations.md       | 2 +-
 content/english/hpc/complexity/_index.md             | 2 +-
 content/english/hpc/cpu-cache/alignment.md           | 4 ++--
 content/english/hpc/cpu-cache/bandwidth.md           | 2 +-
 content/english/hpc/cpu-cache/paging.md              | 2 +-
 content/english/hpc/data-structures/b-tree.md        | 2 +-
 content/english/hpc/data-structures/binary-search.md | 2 +-
 content/english/hpc/data-structures/segment-trees.md | 6 +++---
 content/english/hpc/external-memory/hierarchy.md     | 4 ++--
 content/english/hpc/external-memory/oblivious.md     | 2 +-
 content/english/hpc/external-memory/virtual.md       | 2 +-
 content/english/hpc/number-theory/cryptography.md    | 2 +-
 content/english/hpc/number-theory/montgomery.md      | 2 +-
 content/english/hpc/parallel/concurrency/fibers.md   | 2 +-
 content/english/hpc/parallel/gpu/_index.en.md        | 4 ++--
 content/english/hpc/pipelining/branching.md          | 2 +-
 content/english/hpc/pipelining/hazards.md            | 2 +-
 content/english/hpc/pipelining/scheduling.md         | 2 +-
 content/english/hpc/pipelining/throughput.md         | 2 +-
 content/english/hpc/profiling/_index.md              | 2 +-
 content/english/hpc/profiling/benchmarking.md        | 4 ++--
 content/english/hpc/profiling/events.md              | 2 +-
 content/english/hpc/profiling/mca.md                 | 2 +-
 content/english/hpc/profiling/noise.md               | 4 ++--
 content/english/hpc/profiling/simulation.md          | 2 +-
 content/english/hpc/simd/intrinsics.md               | 4 ++--
 content/english/hpc/simd/reduction.md                | 4 ++--
 content/english/hpc/slides/01-intro/_index.md        | 2 +-
 content/english/hpc/stats.md                         | 4 ++--
 41 files changed, 54 insertions(+), 54 deletions(-)

diff --git a/README.md b/README.md
index 959dc025..7d298284 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
 
 Algorithmica is an open-access web book dedicated to the art and science of computing.
 
-You can contribute via [Prose](https://prose.io/) by clicking on the pencil icon on the top right on any page or by editing its source directly on GitHub. We use a slightly different Markdown dialect, so if you are not sure that the change is correct (e. g. editing an intricate LaTeX formula), you can install [Hugo](https://gohugo.io/) and build the site locally — or just create a pull request, and a preview link will be automatically generated for you.
+You can contribute via [Prose](https://prose.io/) by clicking on the pencil icon in the top right of any page or by editing its source directly on GitHub. We use a slightly different Markdown dialect, so if you are not sure that the change is correct (for example, editing an intricate LaTeX formula), you can install [Hugo](https://gohugo.io/) and build the site locally, or just create a pull request, and a preview link will be automatically generated for you.
 
 If you happen to speak Russian, please also read the [contributing guidelines](https://ru.algorithmica.org/contributing/).
 
diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index c692a227..02c68f36 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -226,7 +226,7 @@ Instead of designing a kernel that computes an $h \times w$ submatrix of $C$ fro
 
 <!--
 
-We follow this approach and design a general kernel that updates a $h \times w$ submatrix of C using columns from $l$ to $r$ of $A$ and rows from $l$ to $r$ of $B$ (i. e. not a full computation, but only a partial update — it will be clear why later). 
+We follow this approach and design a general kernel that updates an $h \times w$ submatrix of $C$ using columns from $l$ to $r$ of $A$ and rows from $l$ to $r$ of $B$ (i.e., not a full computation, but only a partial update; it will be clear why later).
 
 -->
 
diff --git a/content/english/hpc/architecture/assembly.md b/content/english/hpc/architecture/assembly.md
index 00c7caac..de94e4cf 100644
--- a/content/english/hpc/architecture/assembly.md
+++ b/content/english/hpc/architecture/assembly.md
@@ -128,7 +128,7 @@ movl %eax, (%rdx)
 The key differences can be summarized as follows:
 
 1. The *last* operand is used to specify the destination.
-2. Registers and constants need to be prefixed by `%` and `$` respectively (e. g. `addl    $1, %rdx` increments `rdx`).
+2. Registers and constants need to be prefixed by `%` and `$` respectively (e.g., `addq $1, %rdx` increments `rdx`).
 3. Memory addressing looks like this: `displacement(%base, %index, scale)`.
 4. Both `;` and `#` can be used for line comments, and also `/* */` can be used for block comments.
 
diff --git a/content/english/hpc/architecture/functions.md b/content/english/hpc/architecture/functions.md
index 412fc027..3f98a381 100644
--- a/content/english/hpc/architecture/functions.md
+++ b/content/english/hpc/architecture/functions.md
@@ -18,7 +18,7 @@ The hardware stack works the same way software stacks do and is similarly implem
 - The *base pointer* marks the start of the stack and is conventionally stored in `rbp`.
 - The *stack pointer* marks the last element of the stack and is conventionally stored in `rsp`.
 
-When you need to call a function, you push all your local variables onto the stack (which you can also do in other circumstances, e.g. when you run out of registers), push the current instruction pointer, and then jump to the beginning of the function. When exiting from a function, you look at the pointer stored on top of the stack, jump there, and then carefully read all the variables stored on the stack back into their registers.
+When you need to call a function, you push all your local variables onto the stack (which you can also do in other circumstances, e.g., when you run out of registers), push the current instruction pointer, and then jump to the beginning of the function. When exiting from a function, you look at the pointer stored on top of the stack, jump there, and then carefully read all the variables stored on the stack back into their registers.
 
 <!--
 
diff --git a/content/english/hpc/architecture/loops.md b/content/english/hpc/architecture/loops.md
index 9dc1faba..5da022f5 100644
--- a/content/english/hpc/architecture/loops.md
+++ b/content/english/hpc/architecture/loops.md
@@ -105,7 +105,7 @@ It is a notoriously difficult math problem that seems ridiculously simple.
 
 Make use of [lea instruction](../assembly).
 
-E. g. if you want to make a computational experiment [Collatz conjecture](https://en.wikipedia.org/wiki/Collatz_conjecture), you may use `lea rax, [rax + rax * 2 + 1]`, and then try to `sar` it.
+E.g., if you want to run a computational experiment on the [Collatz conjecture](https://en.wikipedia.org/wiki/Collatz_conjecture), you may use `lea rax, [rax + rax * 2 + 1]`, and then try to `sar` it.
 
 Another way is to check add.
 
diff --git a/content/english/hpc/arithmetic/bit-hacks.md b/content/english/hpc/arithmetic/bit-hacks.md
index 5d54b1c1..44a365eb 100644
--- a/content/english/hpc/arithmetic/bit-hacks.md
+++ b/content/english/hpc/arithmetic/bit-hacks.md
@@ -24,11 +24,11 @@ Left or right-shifting negative numbers invokes undefined behavior in C/C++.
 
 `__builtin_popcount` `popcnt` Returns the number of 1-bits in x.
 
-`__builtin_parity` Returns the parity of x, i.e. the number of 1-bits in x modulo 2.
+`__builtin_parity` Returns the *parity* of x (that is, the number of 1-bits in x modulo 2).
 
 This is presumably for [error detection](https://en.wikipedia.org/wiki/Parity_bit).
 
-`__builtin_clrsb` Returns the number of leading redundant sign bits in x, i.e. the number of bits following the most significant bit that are identical to it. There are no special cases for 0 or other values.
+`__builtin_clrsb` Returns the number of leading redundant sign bits in x, i.e., the number of bits following the most significant bit that are identical to it. There are no special cases for 0 or other values.
 
 `__builtin_ffs` Returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero.
 
diff --git a/content/english/hpc/arithmetic/ieee-754.md b/content/english/hpc/arithmetic/ieee-754.md
index 6b1e2a24..06d58e4d 100644
--- a/content/english/hpc/arithmetic/ieee-754.md
+++ b/content/english/hpc/arithmetic/ieee-754.md
@@ -55,7 +55,7 @@ Their availability ranges from chip to chip:
 - Half-precision arithmetic only supports a small subset of operations and is generally used for machine learning applications, especially neural networks, because they tend to do a large amount of calculation, but don't require a high level of precision.
 - Half-precision is being gradually replaced by bfloat, which trades off 3 mantissa bits to have the same range as single-precision, enabling interoperability with it. It is mostly being adopted by specialized hardware: TPUs, FGPAs, and GPUs. The name stands for "[Brain](https://en.wikipedia.org/wiki/Google_Brain) float."
 
-Lower-precision types need less memory bandwidth to move them around and usually take fewer cycles to operate on (e. g. the division instruction may take $x$, $y$, or $z$ cycles depending on the type), which is why they are preferred when error tolerance allows it.
+Lower-precision types need less memory bandwidth to move them around and usually take fewer cycles to operate on (e.g., the division instruction may take $x$, $y$, or $z$ cycles depending on the type), which is why they are preferred when error tolerance allows it.
 
 Deep learning, emerging as a very popular and computationally-intensive field, created a huge demand for low-precision matrix multiplication, which led to manufacturers developing separate hardware or at least adding specialized instructions that support these types of computations — most notably, Google developing a custom chip called TPU (*tensor processing unit*) that specializes on multiplying 128-by-128 bfloat matrices, and NVIDIA adding "tensor cores," capable of performing 4-by-4 matrix multiplication in one go, to all their newer GPUs.
 
diff --git a/content/english/hpc/arithmetic/integer.md b/content/english/hpc/arithmetic/integer.md
index bd70314b..47f5bd32 100644
--- a/content/english/hpc/arithmetic/integer.md
+++ b/content/english/hpc/arithmetic/integer.md
@@ -19,7 +19,7 @@ $$
 \end{aligned}
 $$
 
-When the result of an operation can't fit into the word size (e. g. is more or equal to $2^{32}$ for 32-bit unsigned integers), it *overflows* by leaving only the lowest 32 bits of the result. Similarly, if the result is a negative value, it *underflows* by adding it to $2^{32}$, so that it always stays in the $[0, 2^{32})$ range.
+When the result of an operation can't fit into the word size (e.g., is greater than or equal to $2^{32}$ for 32-bit unsigned integers), it *overflows* by leaving only the lowest 32 bits of the result. Similarly, if the result is a negative value, it *underflows* by adding it to $2^{32}$, so that it always stays in the $[0, 2^{32})$ range.
 
 This is equivalent to performing all operations modulo a power of two:
 
@@ -90,7 +90,7 @@ The bits of an integer are simply stored sequentially. The only ambiguity here i
 
 This seems like an important architecture aspect, but in most cases, it doesn't make a difference: just pick one style and stick with it. But in some cases it does:
 
-- Little-endian has the advantage that you can cast a value to a smaller type (e. g. `long long` to `int`) by just loading fewer bytes, which in most cases means doing nothing — thanks to *register aliasing*, `eax` refers to the first 4 bytes of `rax`, so conversion is essentially free. It is also easier to read values in a variety of type sizes — while on big-endian architectures, loading an `int` from a `long long` array would require shifting the pointer by 2 bytes.
+- Little-endian has the advantage that you can cast a value to a smaller type (e.g., `long long` to `int`) by just loading fewer bytes, which in most cases means doing nothing: thanks to *register aliasing*, `eax` refers to the lowest 4 bytes of `rax`, so conversion is essentially free. It is also easier to read values in a variety of type sizes, whereas on big-endian architectures, loading an `int` from a `long long` array would require shifting the pointer by 4 bytes.
 - Big-endian has the advantage that higher bytes are loaded first, which in theory can make highest-to-lowest routines such as comparisons and printing faster. You can also perform certain checks such as finding out whether a number is negative by only loading its first byte.
 
 Big-endian is also more "natural" — this is how we write binary numbers on paper — but the advantage of having faster type conversions outweigh it. For this reason, little-endian is used by default on most hardware, although some CPUs are "bi-endian" and can be configured to switch modes on demand.
diff --git a/content/english/hpc/arithmetic/newton.md b/content/english/hpc/arithmetic/newton.md
index 8d6ddbae..38bcddda 100644
--- a/content/english/hpc/arithmetic/newton.md
+++ b/content/english/hpc/arithmetic/newton.md
@@ -62,7 +62,7 @@ double sqrt(double n) {
 }
 ```
 
-The algorithm converges for many functions, although it does so reliably and provably only for a certain subset of them (e. g. convex functions). Another question is how fast the convergence is, if it occurs.
+The algorithm converges for many functions, although it does so reliably and provably only for a certain subset of them (e.g., convex functions). Another question is how fast the convergence is, if it occurs.
 
 ### Rate of Convergence
 
diff --git a/content/english/hpc/arithmetic/rsqrt.md b/content/english/hpc/arithmetic/rsqrt.md
index 0fa4d209..f1529d42 100644
--- a/content/english/hpc/arithmetic/rsqrt.md
+++ b/content/english/hpc/arithmetic/rsqrt.md
@@ -3,13 +3,13 @@ title: Fast Inverse Square Root
 weight: 4
 ---
 
-The inverse square root of a floating-point number $\frac{1}{\sqrt x}$ is used in calculating normalized vectors, which are in turn extensively used in various simulation scenarios such as computer graphics, e. g. to determine angles of incidence and reflection to simulate lighting.
+The inverse square root of a floating-point number $\frac{1}{\sqrt x}$ is used in calculating normalized vectors, which are in turn extensively used in various simulation scenarios such as computer graphics (e.g., to determine angles of incidence and reflection to simulate lighting).
 
 $$
 \hat{v} = \frac{\vec v}{\sqrt {v_x^2 + v_y^2 + v_z^2}}
 $$
 
-Calculating inverse square root directly — by first calculating square root and then dividing by it — is extremely slow, because both of these operations are slow even though they are implemented in hardware.
+Calculating an inverse square root directly — by first calculating a square root and then dividing $1$ by it — is extremely slow because both of these operations are slow even though they are implemented in hardware.
 
 But there is a surprisingly good approximation algorithm that takes advantage of the way floating-point numbers are stored in memory. In fact, it is so good that it has been [implemented in hardware](https://www.felixcloutier.com/x86/rsqrtps), so the algorithm is no longer relevant by itself for software engineers, but we are nonetheless going to walk through it for its intrinsic beauty and great educational value.
 
diff --git a/content/english/hpc/compilation/contracts.md b/content/english/hpc/compilation/contracts.md
index cea2c198..e3db2955 100644
--- a/content/english/hpc/compilation/contracts.md
+++ b/content/english/hpc/compilation/contracts.md
@@ -17,7 +17,7 @@ There are two major groups of actions that cause undefined behavior:
 
 - Operations that have slightly different observable behavior on different platforms. For example, the result of left-shifting an integer by more than 31 bits is undefined, because the instruction that does it is implemented differently on Arm and x86 CPUs. If you standardize one specific behavior, then all programs compiled for the other platform will have to spend a few more cycles checking for that edge case, so it is best to leave it undefined.
 
-  Sometimes, when there is a legitimate use case for some platform-specific behavior, instead of declaring it undefined, it can be left *implementation-defined*. For example, the result of right-shifting a [negative integer](/hpc/arithmetic/integer) depends on the platform: it either shifts in zeros or ones (e. g. right shifting `11010110 = -42` by one may mean either `01101011 = 107` or `11101011 = -21`, both use cases being realistic).
+  Sometimes, when there is a legitimate use case for some platform-specific behavior, instead of declaring it undefined, it can be left *implementation-defined*. For example, the result of right-shifting a [negative integer](/hpc/arithmetic/integer) depends on the platform: it either shifts in zeros or ones (e.g., right-shifting `11010110 = -42` by one may mean either `01101011 = 107` or `11101011 = -21`, both use cases being realistic).
 
 Designating something as undefined instead of implementation-defined behavior also helps compilers in optimization. Consider the case of signed integer overflow. On almost all architectures, [signed integers](/hpc/arithmetic/integer) overflow the same way as unsigned ones, with `INT_MAX + 1 == INT_MIN`, and yet, this is undefined behavior according to the C++ standard. This is very much intentional: if you disallow signed integer overflow, then `(x + 1) > x` is guaranteed to be always true for `int`, but not for `unsigned int`, because `(x + 1)` may overflow. For signed types, this lets compilers optimize such checks away.
 
diff --git a/content/english/hpc/compilation/flags.md b/content/english/hpc/compilation/flags.md
index 21db0134..ceae9e87 100644
--- a/content/english/hpc/compilation/flags.md
+++ b/content/english/hpc/compilation/flags.md
@@ -14,7 +14,7 @@ There are 4 *and a half* main levels of optimization for speed in GCC:
 - `-O1` (also aliased as `-O`) does a few "low-hanging fruit" optimizations, almost not affecting the compilation time.
 - `-O2` enables all optimizations that are known to have little to no negative side effects and take reasonable time to complete (this is what most projects use for production builds).
 - `-O3` does very aggressive optimization, enabling almost all *correct* optimizations implemented in GCC.
-- `-Ofast` does everything in `-O3`, plus a few more optimizations flags that may break strict standard compliance, but not in a way that would be critical for most applications (e. g. floating-point operations may be rearranged so that the result is off by a few bits of the mantissa).
+- `-Ofast` does everything in `-O3`, plus a few more optimization flags that may break strict standard compliance, but not in a way that would be critical for most applications (e.g., floating-point operations may be rearranged so that the result is off by a few bits in the mantissa).
 
 There are also many other optimization flags that are not included even in `-Ofast`, because they are very situational, and enabling them by default is more likely to hurt performance rather than improve it — we will talk about some of them in [the next section](../situational).
 
diff --git a/content/english/hpc/compilation/limitations.md b/content/english/hpc/compilation/limitations.md
index 1aab8936..521f78e7 100644
--- a/content/english/hpc/compilation/limitations.md
+++ b/content/english/hpc/compilation/limitations.md
@@ -21,7 +21,7 @@ In general, when an optimization doesn't happen, it is usually because one of th
 
 - The compiler doesn't have enough information to know it will be beneficial.
 - The optimization is actually not always correct: there is an input on which the result doesn't comply with the spec, even if it is correct on every input that the programmer expects.
-- It isn't implemented in the compiler yet, either because it is too hard to implement in general, too costly to compute or too rare to be worth the trouble (e. g. writing a tiny library for some specific algorithm is usually better than hardcoding it into compiler).
+- It isn't implemented in the compiler yet, either because it is too hard to implement in general, too costly to compute, or too rare to be worth the trouble (e.g., writing a tiny library for some specific algorithm is usually better than hardcoding it into the compiler).
 
 In addition, optimization sometimes fails just due to the source code being overly complicated.
 
diff --git a/content/english/hpc/complexity/_index.md b/content/english/hpc/complexity/_index.md
index c537c4ce..f38545e0 100644
--- a/content/english/hpc/complexity/_index.md
+++ b/content/english/hpc/complexity/_index.md
@@ -19,7 +19,7 @@ To estimate the real running time of a program, you need to sum all latencies fo
 
 The clock frequency is a volatile and often unknown variable that depends on the CPU model, operating system settings, current microchip temperature, power usage of other components, and quite a few other things. In contrast, instruction latencies are static and even somewhat consistent across different CPUs when expressed in clock cycles, so counting them instead is much more useful for analytical purposes.
 
-For example, the by-definition matrix multiplication algorithm requires the total of $n^2 \cdot (n + n - 1)$ arithmetic operations: specifically, $n^3$ multiplications and $n^2 \cdot (n - 1)$ additions. If we look up the latencies for these instructions (in special documents called *instruction tables*, like [this one](https://www.agner.org/optimize/instruction_tables.pdf)), we can find that multiplication takes e. g. 3 cycles, while addition takes 1, so we need a total of $3 \cdot n^3 + n^2 \cdot (n - 1) = 4 \cdot n^3 - n^2$ clock cycles for the entire computation (bluntly ignoring everything else that needs to be done to "feed" these instructions with the right data).
+For example, the by-definition matrix multiplication algorithm requires a total of $n^2 \cdot (n + n - 1)$ arithmetic operations: specifically, $n^3$ multiplications and $n^2 \cdot (n - 1)$ additions. If we look up the latencies for these instructions (in special documents called *instruction tables*, like [this one](https://www.agner.org/optimize/instruction_tables.pdf)), we can find that, e.g., multiplication takes 3 cycles, while addition takes 1, so we need a total of $3 \cdot n^3 + n^2 \cdot (n - 1) = 4 \cdot n^3 - n^2$ clock cycles for the entire computation (bluntly ignoring everything else that needs to be done to "feed" these instructions with the right data).
 
 Similar to how the sum of instruction latencies can be used as a clock-independent proxy for total execution time, computational complexity can be used to quantify the intrinsic time requirements of an abstract algorithm, without relying on the choice of a specific computer.
 
diff --git a/content/english/hpc/cpu-cache/alignment.md b/content/english/hpc/cpu-cache/alignment.md
index 32c54b6d..83d62310 100644
--- a/content/english/hpc/cpu-cache/alignment.md
+++ b/content/english/hpc/cpu-cache/alignment.md
@@ -33,7 +33,7 @@ struct alignas(64) Data {
 };
 ```
 
-Whenever an instance of `Data` is allocated, it will be at the beginning of a cache line. The downside is that the effective size of the structure will be rounded up to the nearest multiple of 64 bytes. This has to be done so that, e. g. when allocating an array of `Data`, not just the first element is properly aligned.
+Whenever an instance of `Data` is allocated, it will be at the beginning of a cache line. The downside is that the effective size of the structure will be rounded up to the nearest multiple of 64 bytes. This has to be done so that, e.g., when allocating an array of `Data`, not just the first element is properly aligned.
 
 ### Structure Alignment
 
@@ -94,7 +94,7 @@ Now, each of them is aligned without any padding, and the size of the structure
 
 As a rule of thumb, place your type definitions from largest data types to smallest — this greedy algorithm is guaranteed to work unless you have some weird non-power-of-two type sizes such as the [10-byte](/hpc/arithmetic/ieee-754#float-formats) `long double`[^extended].
 
-[^extended]: The 80-bit `long double` takes *at least* 10 bytes, but the exact format is up to the compiler — e. g. it may pad it to 12 or 16 bytes to minimize alignment issues (64-bit GCC and Clang use 16 bytes by default; you can override this by specifying one of `-mlong-double-64/80/128` or `-m96/128bit-long-double` [options](https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html)).
+[^extended]: The 80-bit `long double` takes *at least* 10 bytes, but the exact format is up to the compiler — for example, it may pad it to 12 or 16 bytes to minimize alignment issues (64-bit GCC and Clang use 16 bytes by default; you can override this by specifying one of `-mlong-double-64/80/128` or `-m96/128bit-long-double` [options](https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html)).
 
 <!--
 
diff --git a/content/english/hpc/cpu-cache/bandwidth.md b/content/english/hpc/cpu-cache/bandwidth.md
index a28570f5..472a5689 100644
--- a/content/english/hpc/cpu-cache/bandwidth.md
+++ b/content/english/hpc/cpu-cache/bandwidth.md
@@ -106,4 +106,4 @@ Theoretically, both requests should use the same bandwidth: a read request sends
 
 Also, for these reasons, a single CPU core usually [can't fully saturate the memory bandwidth](../sharing).
 
-The same technique generalizes to `memcpy`: it also just moves 32-byte blocks with SIMD load/store instructions, and it can be similarly made non-temporal, increasing the throughput twofold for large arrays. There is also a non-temporal load instruction (`_mm256_stream_load_si256`) for when you want to *read* without polluting cache (e. g. when you don't need the original array after a `memcpy`, but will need some data that you had accessed before calling it).
+The same technique generalizes to `memcpy`: it also just moves 32-byte blocks with SIMD load/store instructions, and it can be similarly made non-temporal, increasing the throughput twofold for large arrays. There is also a non-temporal load instruction (`_mm256_stream_load_si256`) for when you want to *read* without polluting cache (e.g., when you don't need the original array after a `memcpy`, but will need some data that you had accessed before calling it).
diff --git a/content/english/hpc/cpu-cache/paging.md b/content/english/hpc/cpu-cache/paging.md
index 3e6cfd8f..8320d437 100644
--- a/content/english/hpc/cpu-cache/paging.md
+++ b/content/english/hpc/cpu-cache/paging.md
@@ -94,7 +94,7 @@ For sparse reads, it often makes sense to increase page size, which improves the
 
 Typical size of a page is 4KB, but it can be up to 1G or so for large databases, but enabling it by default is not a good idea as scenarios when we have a VPS with 256M or RAM and more than 256 processes are not uncommon.
 
-Typical page sizes are 4K, 2M and 1G (e. g. allowing for 256K, 128M, 64G memory regions to be stored in a 64-entry L1 TLB respectively).
+Typical page sizes are 4K, 2M and 1G (e.g., allowing for 256K, 128M, 64G memory regions to be stored in a 64-entry L1 TLB respectively).
 
 
 - There are other types of cache inside CPUs that are used for things other than data. The most important for us are *instruction cache* (I-cache), which is used to speed up the fetching of machine code from memory, and *translation lookaside buffer* (TLB), which is used to store physical locations of virtual memory pages, which is instrumental to the efficiency of virtual memory.
diff --git a/content/english/hpc/data-structures/b-tree.md b/content/english/hpc/data-structures/b-tree.md
index d69a814e..0189a185 100644
--- a/content/english/hpc/data-structures/b-tree.md
+++ b/content/english/hpc/data-structures/b-tree.md
@@ -289,7 +289,7 @@ We start at the size $10^4$ and end at $10^7$, for around $50$ data points in to
 
 <!--
 
-Keys are uniform, but we should not rely on that fact (e. g. using ).
+Keys are uniform, but we should not rely on that fact (e.g., using interpolation search).
 
 It is common that >90% of operations are lookups. Optimizing searches is important because every other operation starts with locating a key.
 
diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index d9a3dcf6..7c408228 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -311,7 +311,7 @@ eytzinger:  4 2 5 1 6 3 7 8
 
 Here we query the array of $[1, …, 8]$ for the lower bound of $x=4$. We compare it against $4$, $2$, and $5$, go left-right-right, and end up with $k = 11$, which isn't even a valid array index.
 
-The trick is to notice that, unless the answer is the last element of the array, we compare $x$ against it at some point, and after we've learned that it is not less than $x$, we start comparing $x$ against elements to the left, and all these comparisons evaluate true (i. e. leading to the right). Therefore, to restore the answer, we just need to "cancel" some number of right turns.
+The trick is to notice that, unless the answer is the last element of the array, we compare $x$ against it at some point, and after we've learned that it is not less than $x$, we start comparing $x$ against elements to the left, and all these comparisons evaluate to true (that is, they lead to the right). Therefore, to restore the answer, we just need to "cancel" some number of right turns.
 
 This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing ones in the binary representation and right-shift $k$ by exactly that amount. To do this, we can invert the number (`~k`) and call the "find first set" instruction:
 
diff --git a/content/english/hpc/data-structures/segment-trees.md b/content/english/hpc/data-structures/segment-trees.md
index a61dcd71..54b32b7f 100644
--- a/content/english/hpc/data-structures/segment-trees.md
+++ b/content/english/hpc/data-structures/segment-trees.md
@@ -79,7 +79,7 @@ There are many things segment trees can do. Persistent structures, computational
 
 Segment trees are used for windowing queries or range queries in general, either by themselves or as part of a larger algorithm.
 
-Functional programming, e. g. for implementing persistent arrays and derived structures.
+Functional programming, e.g., for implementing persistent arrays and derived structures.
 
 -->
 
@@ -249,7 +249,7 @@ Apart from requiring much less memory, which is good for fitting into the CPU ca
 
 To improve the performance further, we can:
 
-- manually optimize the index arithmetic (e. g. noticing that we need to multiply `v` by `2` either way),
+- manually optimize the index arithmetic (e.g., noticing that we need to multiply `v` by `2` either way),
 - replace division by two with an explicit binary shift (because [compilers aren't always able to do it themselves](/hpc/compilation/contracts/#arithmetic)),
 - and, most importantly, get rid of [recursion](/hpc/architecture/functions) and make the implementation fully iterative.
 
@@ -724,7 +724,7 @@ This makes both queries much slower — especially the reduction — but this sh
 
 **Minimum** is a nice exception where the update query can be made slightly faster if the new value of the element is less than the current one: we can skip the horizontal reduction part and just update $\log_B n$ nodes using a scalar procedure.
 
-This works very fast when we mostly have such updates, which is the case e. g. for the sparse-graph Dijkstra algorithm when we have more edges than vertices. For this problem, the wide segment tree can serve as an efficient fixed-universe min-heap.
+This works very fast when we mostly have such updates, which is the case, e.g., for the sparse-graph Dijkstra algorithm when we have more edges than vertices. For this problem, the wide segment tree can serve as an efficient fixed-universe min-heap.
 
 **Lazy propagation** can be done by storing a separate array for the delayed operations in a node. To propagate the updates, we need to go top to bottom (which can be done by simply reversing the direction of the `for` loop and using `k >> (h * b)` to calculate the `h`-th ancestor), [broadcast](/hpc/simd/moving/#broadcast) and reset the delayed operation value stored in the parent of the current node, and apply it to all values stored in the current node with SIMD.
 
diff --git a/content/english/hpc/external-memory/hierarchy.md b/content/english/hpc/external-memory/hierarchy.md
index 35670da9..f0ca9c65 100644
--- a/content/english/hpc/external-memory/hierarchy.md
+++ b/content/english/hpc/external-memory/hierarchy.md
@@ -40,8 +40,8 @@ Everything up to the RAM level is called *volatile memory* because it does not p
 
 From fastest to slowest:
 
-- **CPU registers**, which are the zero-time access data cells CPU uses to store all its intermediate values, can also be thought of as a memory type. There is only a limited number of them (e. g. 16 "general purpose" ones), and in some cases, you may want to use all of them for performance reasons.
-- **CPU caches.** Modern CPUs have multiple layers of cache (L1, L2, often L3, and rarely even L4). The lowest layer is shared between cores and is usually scaled with their number (e. g. a 10-core CPU should have around 10M of L3 cache).
+- **CPU registers**, which are the zero-time access data cells the CPU uses to store all its intermediate values, can also be thought of as a memory type. There is only a limited number of them (e.g., just 16 "general purpose" ones), and in some cases, you may want to use all of them for performance reasons.
+- **CPU caches.** Modern CPUs have multiple layers of cache (L1, L2, often L3, and rarely even L4). The lowest layer is shared between cores and is usually scaled with their number (e.g., a 10-core CPU should have around 10M of L3 cache).
 - **Random access memory,** which is the first scalable type of memory: nowadays you can rent machines with half a terabyte of RAM on the public clouds. This is the one where most of your working data is supposed to be stored.
 
 The CPU cache system has an important concept of a *cache line*, which is the basic unit of data transfer between the CPU and the RAM. The size of a cache line is 64 bytes on most architectures, meaning that all main memory is divided into blocks of 64 bytes, and whenever you request (read or write) a single byte, you are also fetching all its 63 cache line neighbors whether your want them or not.
diff --git a/content/english/hpc/external-memory/oblivious.md b/content/english/hpc/external-memory/oblivious.md
index a0327855..93c4f2fc 100644
--- a/content/english/hpc/external-memory/oblivious.md
+++ b/content/english/hpc/external-memory/oblivious.md
@@ -118,7 +118,7 @@ It seems like we can't do better, but it turns out we can.
 
 ### Algorithm
 
-Cache-oblivious matrix multiplication relies on essentially the same trick as the transposition. We need to divide the data until it fits into lowest cache (i. e. $N^2 \leq M$). For matrix multiplication, this equates to using this formula:
+Cache-oblivious matrix multiplication relies on essentially the same trick as the transposition. We need to divide the data until it fits into the lowest cache (i.e., $N^2 \leq M$). For matrix multiplication, this equates to using this formula:
 
 $$
 \begin{pmatrix}
diff --git a/content/english/hpc/external-memory/virtual.md b/content/english/hpc/external-memory/virtual.md
index 6535283d..92bb454c 100644
--- a/content/english/hpc/external-memory/virtual.md
+++ b/content/english/hpc/external-memory/virtual.md
@@ -19,7 +19,7 @@ Virtual memory gives each process the impression that it fully controls a contig
 
 To achieve this, the memory address space is divided into *pages* (typically 4KB in size), which are the base units of memory that the programs can request from the operating system. The memory system maintains a special hardware data structure called the *page table*, which contains the mappings of virtual page addresses to the physical ones. When a process accesses data using its virtual memory address, the memory system calculates its page number (by right-shifting it by $12$ if $4096=2^{12}$ is the page size), looks up in the page table that its physical address is, and forwards the read or write request to where that data is actually stored.
 
-Since the address translation needs to be done for each memory request, and the number of memory pages itself may be large (e. g. 16G RAM / 4K page size = 4M pages), address translation poses a difficult problem in itself. One way to speed it up is to use a special cache for the page table itself called *translation lookaside buffer* (TLB), and the other is to [increase the page size](/hpc/cpu-cache/paging) so that the total number of memory pages is made smaller at the cost of reduced granularity.
+Since the address translation needs to be done for each memory request, and the number of memory pages itself may be large (e.g., 16G RAM / 4K page size = 4M pages), address translation poses a difficult problem in itself. One way to speed it up is to use a special cache for the page table itself called *translation lookaside buffer* (TLB), and the other is to [increase the page size](/hpc/cpu-cache/paging) so that the total number of memory pages is made smaller at the cost of reduced granularity.
 
 <!--
 
diff --git a/content/english/hpc/number-theory/cryptography.md b/content/english/hpc/number-theory/cryptography.md
index e552372a..0b8c6b76 100644
--- a/content/english/hpc/number-theory/cryptography.md
+++ b/content/english/hpc/number-theory/cryptography.md
@@ -30,7 +30,7 @@ There is an issue when establishing initial communication that the attacker coul
 
 Between your browser and a bank. "Hey this is a message from a bank."
 
-Trust networks. E. g. everyone can trust Google or whoever makes the device or operating system.
+Trust networks. E.g., everyone can trust Google or whoever makes the device or operating system.
 
 ## Symmetric Cryptography
 
diff --git a/content/english/hpc/number-theory/montgomery.md b/content/english/hpc/number-theory/montgomery.md
index 3488474b..038391dd 100644
--- a/content/english/hpc/number-theory/montgomery.md
+++ b/content/english/hpc/number-theory/montgomery.md
@@ -94,7 +94,7 @@ $$
 r \cdot r^{-1} + n \cdot n^\prime = 1
 $$
 
-and both $r^{-1}$ and $n^\prime$ can be computed e. g. using the [extended Euclidean algorithm](../euclid-extended).
+and both $r^{-1}$ and $n^\prime$ can be computed, e.g., using the [extended Euclidean algorithm](../euclid-extended).
 
 Using this identity, we can express $r \cdot r^{-1}$ as $(1 - n \cdot n^\prime)$ and write $x \cdot r^{-1}$ as
 
diff --git a/content/english/hpc/parallel/concurrency/fibers.md b/content/english/hpc/parallel/concurrency/fibers.md
index 2ec2806c..cce7b860 100644
--- a/content/english/hpc/parallel/concurrency/fibers.md
+++ b/content/english/hpc/parallel/concurrency/fibers.md
@@ -28,4 +28,4 @@ func main() {
 
 The way they work is that the language maintains a group of threads ready to pick up from where they left. This is called N:M scheduling.
 
-Similar runtimes exist for other languages, e. g. for C++ and Rust.
+Similar runtimes exist for other languages, e.g., for C++ and Rust.
diff --git a/content/english/hpc/parallel/gpu/_index.en.md b/content/english/hpc/parallel/gpu/_index.en.md
index 540d5d5c..ac2a4aa9 100644
--- a/content/english/hpc/parallel/gpu/_index.en.md
+++ b/content/english/hpc/parallel/gpu/_index.en.md
@@ -195,7 +195,7 @@ Some tasks, especially in cryptography, cannot be parallelized. But some can.
 
 ## Summing arrays in $O(\log n)$ time
 
-Assume we want to perform some associative (i. e. $A*(B*C) = (A*B)*C$) operation on an array of $n$ elements. Say, sum it up.
+Assume we want to perform some associative (i.e., $A*(B*C) = (A*B)*C$) operation on an array of $n$ elements. Say, sum it up.
 
 Normally, we would do that with a simple loop:
 
@@ -418,7 +418,7 @@ Intrinsics for that.
 
 Now, a lot of value comes from cryptocurrency and deep learning. The latter relies on two specific operations: matrix multiplications for linear layers and convolutions for convolutional layers used in computer vision.
 
-First, they introduced "multiply-accumulate" operation (e. g. `x += y * z`) per 1 GPU clock cycle.
+First, they introduced a "multiply-accumulate" operation (e.g., `x += y * z`) that takes one GPU clock cycle.
 
 Google uses Tensor Processing Units. Nobody really knows how they work (proprietary hardware that they rent, not sell).
 
diff --git a/content/english/hpc/pipelining/branching.md b/content/english/hpc/pipelining/branching.md
index 2a3b81af..08d7887d 100644
--- a/content/english/hpc/pipelining/branching.md
+++ b/content/english/hpc/pipelining/branching.md
@@ -47,7 +47,7 @@ body:
 
 Our goal is to simulate a completely unpredictable branch, and we successfully achieve it: the code takes ~14 CPU cycles per element. For a very rough estimate of what it is supposed to be, we can assume that the branches alternate between `<` and `>=`, and the pipeline is mispredicted every other iteration. Then, every two iterations:
 
-- We discard the pipeline, which is 19 cycles deep on Zen 2 (i. e. it has 19 stages, each taking one cycle).
+- We discard the pipeline, which is 19 cycles deep on Zen 2 (i.e., it has 19 stages, each taking one cycle).
 - We need a memory fetch and a comparison, which costs ~5 cycles. We can check the conditions of even and odd iterations concurrently, so let's assume we only pay it once per 2 iterations.
 - In the case of the `<` branch, we need another ~4 cycles to add `a[i]` to a volatile (memory-stored) variable `s`.
 
diff --git a/content/english/hpc/pipelining/hazards.md b/content/english/hpc/pipelining/hazards.md
index d4a2d7df..872b59de 100644
--- a/content/english/hpc/pipelining/hazards.md
+++ b/content/english/hpc/pipelining/hazards.md
@@ -8,7 +8,7 @@ published: true
 
 There are multiple ways this may happen:
 
-* A *structural hazard* happens when two or more instructions need the same part of CPU (e. g. an execution unit).
+* A *structural hazard* happens when two or more instructions need the same part of the CPU (e.g., an execution unit).
 * A *data hazard* happens when you have to wait for an operand to be computed from some previous step.
 * A *control hazard* happens when a CPU can't tell which instructions it needs to execute next.
 
diff --git a/content/english/hpc/pipelining/scheduling.md b/content/english/hpc/pipelining/scheduling.md
index b4857a0c..0cda777c 100644
--- a/content/english/hpc/pipelining/scheduling.md
+++ b/content/english/hpc/pipelining/scheduling.md
@@ -14,7 +14,7 @@ As there are many different instructions, It is very common for programs to have
 
 <!-- Pipeline of a superscalar CPU with the width of 2 img/superscalar.png -->
 
-Interleaving the stages of execution is a general idea in digital electronics, and it is applied not only in the main CPU pipeline, but also on the level of separate instructions and [memory](/hpc/cpu-cache/mlp). Most execution units have their own little pipelines, and can take another instruction just one or two cycles after the previous one. If a certain instruction is frequently used, it makes sense to duplicate its execution unit also, and also place frequently jointly used instructions on the same execution unit: e. g. not using the same for arithmetic and memory operation.
+Interleaving the stages of execution is a general idea in digital electronics, and it is applied not only in the main CPU pipeline, but also on the level of separate instructions and [memory](/hpc/cpu-cache/mlp). Most execution units have their own little pipelines, and can take another instruction just one or two cycles after the previous one. If a certain instruction is frequently used, it makes sense to duplicate its execution unit, and also to place instructions that are frequently used together on different execution units: e.g., not using the same unit for arithmetic and memory operations.
 
 ### Microcode
 
diff --git a/content/english/hpc/pipelining/throughput.md b/content/english/hpc/pipelining/throughput.md
index 03562291..0b596404 100644
--- a/content/english/hpc/pipelining/throughput.md
+++ b/content/english/hpc/pipelining/throughput.md
@@ -84,7 +84,7 @@ Bandwidth is the rate at which data can be read or stored. For the purpose of de
 
 In the previous version, we have an inherently sequential chain of operations in the innermost loop. We accumulate the minimum in variable v by a sequence of min operations. There is no way to start the second operation before we know the result of the first operation; there is no room for parallelism here:
 
-The result will be clearly the same, but we are calculating the operations in a different order. In essence, we split the work in two independent parts, calculating the minimum of odd elements and the minimum of even elements, and finally combining the results. If we calculate the odd minimum v0 and even minimum v1 in an interleaved manner, as shown above, we will have more opportunities for parallelism. For example, the 1st and 2nd operation could be calculated simultaneously in parallel (or they could be executed in a pipelined fashion in the same execution unit). Once these results are available, the 3rd and 4th operation could be calculated simultaneously in parallel, etc. We could potentially obtain a speedup of a factor of 2 here, and naturally the same idea could be extended to calculating e.g. 4 minimums in an interleaved fashion.
+The result will be clearly the same, but we are calculating the operations in a different order. In essence, we split the work in two independent parts, calculating the minimum of odd elements and the minimum of even elements, and finally combining the results. If we calculate the odd minimum v0 and even minimum v1 in an interleaved manner, as shown above, we will have more opportunities for parallelism. For example, the 1st and 2nd operation could be calculated simultaneously in parallel (or they could be executed in a pipelined fashion in the same execution unit). Once these results are available, the 3rd and 4th operation could be calculated simultaneously in parallel, etc. We could potentially obtain a speedup of a factor of 2 here, and naturally the same idea could be extended to calculating, e.g., 4 minimums in an interleaved fashion.
 
 Instruction-level parallelism is automatic Now that we know how to reorganize calculations so that there is potential for parallelism, we will need to know how to realize the potential. For example, if we have these two operations in the C++ code, how do we tell the computer that the operations can be safely executed in parallel?
 
diff --git a/content/english/hpc/profiling/_index.md b/content/english/hpc/profiling/_index.md
index 0b7ca30f..ceca0f2f 100644
--- a/content/english/hpc/profiling/_index.md
+++ b/content/english/hpc/profiling/_index.md
@@ -10,7 +10,7 @@ There are many different types of profilers. I like to think about them by analo
 
 - When objects are on a micrometer scale, they use optical microscopes.
 - When objects are on a nanometer scale, and light no longer interacts with them, they use electron microscopes.
-- When objects are smaller than that (e. g. the insides of an atom), they resort to theories and assumptions about how things work (and test these assumptions using intricate and indirect experiments).
+- When objects are smaller than that (e.g., the insides of an atom), they resort to theories and assumptions about how things work (and test these assumptions using intricate and indirect experiments).
 
 Similarly, there are three main profiling techniques, each operating by its own principles, having distinct areas of applicability, and allowing for different levels of precision:
 
diff --git a/content/english/hpc/profiling/benchmarking.md b/content/english/hpc/profiling/benchmarking.md
index d873ca62..dd543bcc 100644
--- a/content/english/hpc/profiling/benchmarking.md
+++ b/content/english/hpc/profiling/benchmarking.md
@@ -59,7 +59,7 @@ Although *efficient* in terms of execution speed, C and C++ are not the most *pr
 
 One way to improve modularity and reusability is to separate all testing and analytics code from the actual implementation of the algorithm, and also make it so that different versions are implemented in separate files, but have the same interface.
 
-In C/C++, you can do this by creating a single header file (e. g. `gcd.hh`) with a function interface and all its benchmarking code in `main`:
+In C/C++, you can do this by creating a single header file (e.g., `gcd.hh`) with a function interface and all its benchmarking code in `main`:
 
 ```c++
 int gcd(int a, int b); // to be implemented
@@ -93,7 +93,7 @@ int main() {
 }
 ```
 
-Then you create many implementation files for each algorithm version (e. g. `v1.cc`, `v2.cc` and so on, or some meaningful names if applicable) that all include that single header file:
+Then you create many implementation files for each algorithm version (e.g., `v1.cc`, `v2.cc`, and so on, or some meaningful names if applicable) that all include that single header file:
 
 ```c++
 #include "gcd.hh"
diff --git a/content/english/hpc/profiling/events.md b/content/english/hpc/profiling/events.md
index 71ae9cd3..eb2ba613 100644
--- a/content/english/hpc/profiling/events.md
+++ b/content/english/hpc/profiling/events.md
@@ -93,7 +93,7 @@ Overhead  Command  Shared Object        Symbol
    0.80%  run      libc-2.33.so         [.] rand
 ```
 
-Note that, for each function, just its *overhead* is listed and not the total running time (e. g. `setup` includes `std::__introsort_loop` but only its own overhead is accounted as 3.43%). There are tools for constructing [flame graphs](https://www.brendangregg.com/flamegraphs.html) out of perf reports to make them more clear. You also need to account for possible inlining, which is apparently what happened with `std::lower_bound` here. Perf also tracks shared libraries (like `libc`) and, in general, any other spawned processes: if you want, you can launch a web browser with perf and see what's happening inside.
+Note that, for each function, just its *overhead* is listed and not the total running time (e.g., `setup` includes `std::__introsort_loop` but only its own overhead is accounted as 3.43%). There are tools for constructing [flame graphs](https://www.brendangregg.com/flamegraphs.html) out of perf reports to make them more clear. You also need to account for possible inlining, which is apparently what happened with `std::lower_bound` here. Perf also tracks shared libraries (like `libc`) and, in general, any other spawned processes: if you want, you can launch a web browser with perf and see what's happening inside.
 
 Next, you can "zoom in" on any of these functions, and, among others things, it will offer to show you its disassembly with an associated heatmap. For example, here is the assembly for `query`:
 
diff --git a/content/english/hpc/profiling/mca.md b/content/english/hpc/profiling/mca.md
index 4634ba25..99cfe2ed 100644
--- a/content/english/hpc/profiling/mca.md
+++ b/content/english/hpc/profiling/mca.md
@@ -40,7 +40,7 @@ First, it outputs general information about the loop and the hardware:
 - It "ran" the loop 100 times, executing 400 instructions in total in 108 cycles, which is the same as executing $\frac{400}{108} \approx 3.7$ [instructions per cycle](/hpc/complexity/hardware) on average (IPC).
 - The CPU is theoretically capable of executing up to 6 instructions per cycle ([dispatch width](/hpc/architecture/layout)).
 - Each cycle in theory can be executed in 0.8 cycles on average ([block reciprocal throughput](/hpc/pipelining/tables)).
-- The "uOps" here are the micro-operations that CPU splits each instruction into (e. g. fused load-add is composed of two uOps).
+- The "uOps" here are the micro-operations that the CPU splits each instruction into (e.g., fused load-add is composed of two uOps).
 
 Then it proceeds to give information about each individual instruction: 
 
diff --git a/content/english/hpc/profiling/noise.md b/content/english/hpc/profiling/noise.md
index 8dcdb032..74ff0272 100644
--- a/content/english/hpc/profiling/noise.md
+++ b/content/english/hpc/profiling/noise.md
@@ -87,7 +87,7 @@ for (int i = 0; i < N; i++)
     checksum ^= lower_bound(checksum ^ q[i]);
 ```
 
-It usually makes the most difference in algorithms with possible pipeline stall issues, e. g. when comparing branchy and branch-free algorithms.
+It usually makes the most difference in algorithms with possible pipeline stall issues, e.g., when comparing branchy and branch-free algorithms.
 
 **Cold cache.** Another source of bias is the *cold cache effect*, when memory reads initially take longer time because the required data is not in cache yet.
 
@@ -130,7 +130,7 @@ The issues we've described produce *bias* in measurements: they consistently giv
 These type of issues are caused by side effects and some sort of external noise, mostly due to noisy neighbors and CPU frequency scaling:
 
 - If you benchmark a compute-bound algorithm, measure its performance in cycles using `perf stat`: this way it will be independent of clock frequency, fluctuations of which is usually the main source of noise.
-- Otherwise, set core frequency to the what you expect it to be and make sure nothing interferes with it. On Linux you can do it with `cpupower` (e. g. `sudo cpupower frequency-set -g powersave` to put it to minimum or `sudo cpupower frequency-set -g ondemand` to enable turbo boost). I use a [convenient GNOME shell extension](https://extensions.gnome.org/extension/1082/cpufreq/) that has a separate button to do it.
+- Otherwise, set the core frequency to what you expect it to be and make sure nothing interferes with it. On Linux, you can do it with `cpupower` (e.g., `sudo cpupower frequency-set -g powersave` to put it to minimum or `sudo cpupower frequency-set -g ondemand` to enable turbo boost). I use a [convenient GNOME shell extension](https://extensions.gnome.org/extension/1082/cpufreq/) that has a separate button to do it.
 - If applicable, turn hyper-threading off and attach jobs to specific cores. Make sure no other jobs are running on the system, turn off networking and try not to fiddle with the mouse.
 
 You can't remove noises and biases completely. Even a program's name can affect its speed: the executable's name ends up in an environment variable, environment variables end up on the call stack, and so the length of the name affects stack alignment, which can result in data accesses slowing down due to crossing cache line or memory page boundaries.
diff --git a/content/english/hpc/profiling/simulation.md b/content/english/hpc/profiling/simulation.md
index 2f6c6dc6..75401b8a 100644
--- a/content/english/hpc/profiling/simulation.md
+++ b/content/english/hpc/profiling/simulation.md
@@ -50,7 +50,7 @@ Mispred rate:         22.0% (      22.5%     +        0.0%   )
 
 We've fed Cachegrind exactly the same example code as in [the previous section](../events): we create an array of a million random integers, sort it, and then perform a million binary searches on it. Cachegrind shows roughly the same numbers as perf does, except that that perf's measured numbers of memory reads and branches are slightly inflated due to [speculative execution](/hpc/pipelining): they really happen in hardware and thus increment hardware counters, but are discarded and don't affect actual performance, and thus ignored in the simulation.
 
-Cachegrind only models the first (`D1` for data, `I1` for instructions) and the last (`LL`, unified) levels of cache, the characteristics of which are inferred from the system. It doesn't limit you in any way as you can also set them from the command line, e. g. to model the L2 cache: `--LL=<size>,<associativity>,<line size>`.
+Cachegrind only models the first (`D1` for data, `I1` for instructions) and the last (`LL`, unified) levels of cache, the characteristics of which are inferred from the system. It doesn't limit you in any way as you can also set them from the command line, e.g., to model the L2 cache: `--LL=<size>,<associativity>,<line size>`.
 
 It seems like it only slowed down our program so far and hasn't provided us any information that `perf stat` couldn't. To get more out of it than just the summary info, we can inspect a special file with profiling info, which it dumps by default in the same directory named as `cachegrind.out.<pid>`. It is human-readable, but is expected to be read via the `cg_annotate` command:
 
diff --git a/content/english/hpc/simd/intrinsics.md b/content/english/hpc/simd/intrinsics.md
index e091ddb6..4e9c6804 100644
--- a/content/english/hpc/simd/intrinsics.md
+++ b/content/english/hpc/simd/intrinsics.md
@@ -95,7 +95,7 @@ for (int i = 0; i < 100; i += 4) {
 
 The main challenge of using SIMD is getting the data into contiguous fixed-sized blocks suitable for loading into registers. In the code above, we may in general have a problem if the length of the array is not divisible by the block size. There are two common solutions to this:
 
-1. We can "overshoot" by iterating over the last incomplete segment either way. To make sure we don't segfault by trying to read from or write to a memory region we don't own, we need to pad the arrays to the nearest block size (typically with some "neutral" element, e. g. zero).
+1. We can "overshoot" by iterating over the last incomplete segment either way. To make sure we don't segfault by trying to read from or write to a memory region we don't own, we need to pad the arrays to the nearest block size (typically with some "neutral" element, e.g., zero).
 2. Make one iteration less and write a little loop in the end that calculates the remainder normally (with scalar operations).
 
 Humans prefer #1 because it is simpler and results in less code, and compilers prefer #2 because they don't really have another legal option.
@@ -135,7 +135,7 @@ Also, some of the intrinsics don't map to a single instruction but a short seque
 
 <!--
 
-For example, the group of `extract` intrinsics that are used to get individual elements out of vectors: e. g. `_mm256_extract_epi32(x, 0)` returns the first element out of 8-integer vector. t is quite slow (~5 cycles) to move data between "normal" and SIMD registers in general.
+For example, the group of `extract` intrinsics that are used to get individual elements out of vectors: e.g., `_mm256_extract_epi32(x, 0)` returns the first element out of an 8-integer vector. It is quite slow (~5 cycles) to move data between "normal" and SIMD registers in general.
 
 -->
 
diff --git a/content/english/hpc/simd/reduction.md b/content/english/hpc/simd/reduction.md
index 078983d2..c67c1942 100644
--- a/content/english/hpc/simd/reduction.md
+++ b/content/english/hpc/simd/reduction.md
@@ -3,7 +3,7 @@ title: Sums and Other Reductions
 weight: 3
 ---
 
-*Reduction* (also known as *folding* in functional programming) is the action of computing the value of some associative and commutative operation (i.e. $(a \circ b) \circ c = a \circ (b \circ c)$ and $a \circ b = b \circ a$) over a range of arbitrary elements.
+*Reduction* (also known as *folding* in functional programming) is the action of computing the value of some associative and commutative operation (i.e., $(a \circ b) \circ c = a \circ (b \circ c)$ and $a \circ b = b \circ a$) over a range of arbitrary elements.
 
 The simplest example of reduction is calculating the sum an array:
 
@@ -68,7 +68,7 @@ int hsum(__m256i x) {
 }
 ```
 
-There are [other similar instructions](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2&ig_expand=3037,3009,5135,4870,4870,4872,4875,833,879,874,849,848,6715,4845&text=horizontal), e. g. for integer multiplication or calculating absolute differences between adjacent elements (used in image processing).
+There are [other similar instructions](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2&ig_expand=3037,3009,5135,4870,4870,4872,4875,833,879,874,849,848,6715,4845&text=horizontal), e.g., for integer multiplication or calculating absolute differences between adjacent elements (used in image processing).
 
 There is also one specific instruction, `_mm_minpos_epu16`, that calculates the horizontal minimum and its index among eight 16-bit integers. This is the only horizontal reduction that works in one go: all others are computed in multiple steps.
 
diff --git a/content/english/hpc/slides/01-intro/_index.md b/content/english/hpc/slides/01-intro/_index.md
index 492ceb6a..615a89aa 100644
--- a/content/english/hpc/slides/01-intro/_index.md
+++ b/content/english/hpc/slides/01-intro/_index.md
@@ -151,7 +151,7 @@ Also a clear path to improvement: just make lenses stronger and chips smaller
 
 $\implies$ Each new "generation" should have roughly the same total cost, but 40% higher clock and twice as many transistors
 
-(which can be used e. g. to add new instructions or increase the word size) <!-- .element: class="fragment" data-fragment-index="1" -->
+(which can be used, e.g., to add new instructions or increase the word size) <!-- .element: class="fragment" data-fragment-index="1" -->
 
 ----
 
diff --git a/content/english/hpc/stats.md b/content/english/hpc/stats.md
index 6e436d15..15d81e39 100644
--- a/content/english/hpc/stats.md
+++ b/content/english/hpc/stats.md
@@ -18,7 +18,7 @@ A **random variable** is any variable whose value depends on an outcome of a ran
 2. $\forall x \in X, 0 \leq P \leq 1$.
 3. $\sum_{x \in X} P(x) = 1$.
 
-For example, consider a random variable $X$ with $k$ discrete states (e. g. the result of a die toss). We can place a *uniform distribution* on $X$ — that is, make each of its states equally likely — by setting its probability distribution to:
+For example, consider a random variable $X$ with $k$ discrete states (e.g., the result of a die toss). We can place a *uniform distribution* on $X$ — that is, make each of its states equally likely — by setting its probability distribution to:
 
 $$
 P(x=x_i) = \frac{1}{k}
@@ -121,7 +121,7 @@ The last transition is true because it is a sum of harmonic series.
 
 ### Order Statistics
 
-There is a slight modification of quicksort called quickselect that allows finding the $k$-th smallest element in $O(n)$ time, which is useful when we need to quickly compute order statistics, e. g. medians or 75-th quantiles.
+There is a slight modification of quicksort called quickselect that allows finding the $k$-th smallest element in $O(n)$ time, which is useful when we need to quickly compute order statistics, e.g., medians or 75th percentiles.
 
 1. Select a random element $p$ from the array.
 2. Partition the array into two arrays $L$ and $R$ using the predicate $a_i > p$.

From 3bb8fad0b2f4c9bfeca09d3dfd8e2c9d24763184 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 18 May 2022 10:57:34 +0300
Subject: [PATCH 101/173] amount/number and much/many

---
 content/english/hpc/algorithms/gcd.md                | 2 +-
 content/english/hpc/arithmetic/float.md              | 2 +-
 content/english/hpc/arithmetic/ieee-754.md           | 4 ++--
 content/english/hpc/compilation/precalc.md           | 2 +-
 content/english/hpc/cpu-cache/alignment.md           | 2 +-
 content/english/hpc/data-structures/binary-search.md | 2 +-
 content/english/hpc/external-memory/locality.md      | 2 +-
 content/english/hpc/external-memory/sorting.md       | 4 ++--
 content/english/hpc/profiling/benchmarking.md        | 2 +-
 content/english/hpc/simd/shuffling.md                | 2 +-
 10 files changed, 12 insertions(+), 12 deletions(-)
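
Note on the gcd.md hunk (not part of the commit): the reworded sentence says that `__builtin_ctz` gives the exact number of bits to right-shift by. A minimal sketch of that step follows, assuming GCC or Clang; the function name `binary_gcd` and the use of unsigned operands are assumptions made for illustration and are not taken from the article.

```cpp
#include <cassert>
#include <utility>

// Binary GCD sketch: every "divide by 2" step is replaced with a shift
// by the number of trailing zeros, which strips all factors of two at once.
// __builtin_ctz is undefined for 0, so zero operands are handled up front.
unsigned binary_gcd(unsigned a, unsigned b) {
    if (a == 0) return b;
    if (b == 0) return a;
    int shift = __builtin_ctz(a | b);  // power of two shared by a and b
    a >>= __builtin_ctz(a);            // make a odd
    while (b != 0) {
        b >>= __builtin_ctz(b);        // make b odd
        assert((a & 1u) && (b & 1u));  // invariant: both operands are odd here
        if (a > b) std::swap(a, b);
        b -= a;                        // odd - odd is even, possibly zero
    }
    return a << shift;                 // restore the shared power of two
}
```

For example, `binary_gcd(12, 18)` strips the shared factor of two once, reduces the odd parts 3 and 9 to 3, and returns 6.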

diff --git a/content/english/hpc/algorithms/gcd.md b/content/english/hpc/algorithms/gcd.md
index 63efdec9..d56be8f7 100644
--- a/content/english/hpc/algorithms/gcd.md
+++ b/content/english/hpc/algorithms/gcd.md
@@ -135,7 +135,7 @@ int gcd(int a, int b) {
 
 Let's run it, and… it sucks. The difference in speed compared to `std::gcd` is indeed 2x, but on the other side of the equation. This is mainly because of all the branching needed to differentiate between the cases. Let's start optimizing.
 
-First, let's replace all divisions by 2 with divisions by whichever highest power of 2 we can. We can do it efficiently with `__builtin_ctz`, the "count trailing zeros" instruction available on modern CPUs. Whenever we are supposed to divide by 2 in the original algorithm, we will call this function instead, which will give us the exact amount to right-shift the number by. Assuming that the we are dealing with large random numbers, this is expected to decrease the number of iterations by almost a factor 2, because $1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \ldots \to 2$.
+First, let's replace all divisions by 2 with divisions by whichever highest power of 2 we can. We can do it efficiently with `__builtin_ctz`, the "count trailing zeros" instruction available on modern CPUs. Whenever we are supposed to divide by 2 in the original algorithm, we will call this function instead, which will give us the exact number of bits to right-shift the number by. Assuming that we are dealing with large random numbers, this is expected to decrease the number of iterations by almost a factor of 2, because $1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \ldots \to 2$.
 
 Second, we can notice that condition 2 can now only be true once — in the very beginning — because every other identity leaves at least one of the numbers odd. Therefore we can handle this case just once in the beginning and not consider it in the main loop.
 
diff --git a/content/english/hpc/arithmetic/float.md b/content/english/hpc/arithmetic/float.md
index 70217a91..dcc33039 100644
--- a/content/english/hpc/arithmetic/float.md
+++ b/content/english/hpc/arithmetic/float.md
@@ -9,7 +9,7 @@ The users of floating-point arithmetic deserve one of these IQ bell curve memes
 - Then they discover that `0.1 + 0.2 != 0.3` or some other quirk like that, freak out, start thinking that some random error term is added to every computation, and for many years avoid any real data types completely.
 - Then they finally man up, read the specification of how IEEE-754 floats work and start using them appropriately.
 
-Too many people are unfortunately still at stage 2, breeding various misconceptions about floating-point arithmetic — thinking that it is fundamentally imprecise and unstable, and slower than integer arithmetic.
+Unfortunately, too many people are still at stage 2, breeding various misconceptions about floating-point arithmetic — thinking that it is fundamentally imprecise and unstable, and slower than integer arithmetic.
 
 ![](../img/iq.svg)
 
diff --git a/content/english/hpc/arithmetic/ieee-754.md b/content/english/hpc/arithmetic/ieee-754.md
index 06d58e4d..65cc5f48 100644
--- a/content/english/hpc/arithmetic/ieee-754.md
+++ b/content/english/hpc/arithmetic/ieee-754.md
@@ -52,7 +52,7 @@ Their availability ranges from chip to chip:
 - Most CPUs support single- and double-precision — which is what `float` and `double` types refer to in C.
 - Extended formats are exclusive to x86, and are available in C as the `long double` type, which falls back to double precision on arm. The choice of 64 bits for mantissa is so that every `long long` integer can be represented exactly. There is also a 40-bit format that similarly allocates 32 mantissa bits.
 - Quadruple as well as the 256-bit "octuple" formats are only used for specific scientific computations and are not supported by general-purpose hardware.
-- Half-precision arithmetic only supports a small subset of operations and is generally used for machine learning applications, especially neural networks, because they tend to do a large amount of calculation, but don't require a high level of precision.
+- Half-precision arithmetic only supports a small subset of operations and is generally used for applications such as machine learning, especially neural networks, because they tend to perform large amounts of calculations but don't require high levels of precision.
 - Half-precision is being gradually replaced by bfloat, which trades off 3 mantissa bits to have the same range as single-precision, enabling interoperability with it. It is mostly being adopted by specialized hardware: TPUs, FGPAs, and GPUs. The name stands for "[Brain](https://en.wikipedia.org/wiki/Google_Brain) float."
 
 Lower-precision types need less memory bandwidth to move them around and usually take fewer cycles to operate on (e.g., the division instruction may take $x$, $y$, or $z$ cycles depending on the type), which is why they are preferred when error tolerance allows it.
@@ -77,7 +77,7 @@ This is a complex mechanism that deserves an article of its own, but since this
 
 ### NaNs, Zeros and Infinities
 
-Floating-point arithmetic often deals with noisy, real-world data, and exceptions there are much more common than in the integer case. For this reason, the default behavior is different. Instead of crashing, the result is substituted with a special value without interrupting the executing, unless the programmer explicitly wants to.
+Floating-point arithmetic often deals with noisy, real-world data. Exceptions there are much more common than in the integer case, and for this reason, the default behavior when handling them is different. Instead of crashing, the result is substituted with a special value without interrupting the program execution (unless the programmer explicitly wants it to).
 
 The first type of such value is the two infinities: a positive and a negative one. They are generated if the result of an operation can't fit within the representable range, and they are treated as such in arithmetic.
 
diff --git a/content/english/hpc/compilation/precalc.md b/content/english/hpc/compilation/precalc.md
index 29b31cd6..4a7cb7b7 100644
--- a/content/english/hpc/compilation/precalc.md
+++ b/content/english/hpc/compilation/precalc.md
@@ -37,7 +37,7 @@ constexpr int fibonacci(int n) {
 }
 ```
 
-There used to be much more limitations in earlier C++ standards, like you could not use any sort of state inside them and had to rely on recursion, so the whole process felt more like Haskell programming rather than C++. Since C++17, you can even compute static arrays using the imperative style, which is useful for precomputing lookup tables:
+There used to be many more limitations in earlier C++ standards: for example, you could not use any sort of state inside `constexpr` functions and had to rely on recursion, so the whole process felt more like Haskell programming than C++. Since C++17, you can even compute static arrays using the imperative style, which is useful for precomputing lookup tables:
 
 ```c++
 struct Precalc {
diff --git a/content/english/hpc/cpu-cache/alignment.md b/content/english/hpc/cpu-cache/alignment.md
index 83d62310..59579467 100644
--- a/content/english/hpc/cpu-cache/alignment.md
+++ b/content/english/hpc/cpu-cache/alignment.md
@@ -77,7 +77,7 @@ This potentially wastes space but saves a lot of CPU cycles. This trade-off is m
 
 ### Optimizing Member Order
 
-Padding is only inserted before a not-yet-aligned member or at the end of the structure. By changing the ordering of members in a structure, it is possible to change the required amount of padding bytes and the total size of the structure.
+Padding is only inserted before a not-yet-aligned member or at the end of the structure. By changing the ordering of members in a structure, it is possible to change the required number of padding bytes and the total size of the structure.
 
 In the previous example, we could reorder the structure members like this:
 
diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index 7c408228..48bf07b4 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -313,7 +313,7 @@ Here we query the array of $[1, …, 8]$ for the lower bound of $x=4$. We compar
 
 The trick is to notice that, unless the answer is the last element of the array, we compare $x$ against it at some point, and after we've learned that it is not less than $x$, we start comparing $x$ against elements to the left, and all these comparisons evaluate true (that is, leading to the right). Therefore, to restore the answer, we just need to "cancel" some number of right turns.
 
-This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing ones in the binary representation and right-shift $k$ by exactly that amount. To do this, we can invert the number (`~k`) and call the "find first set" instruction:
+This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing ones in the binary representation and right-shift $k$ by exactly that. To do this, we can invert the number (`~k`) and call the "find first set" instruction:
 
 ```c++
 int lower_bound(int x) {
diff --git a/content/english/hpc/external-memory/locality.md b/content/english/hpc/external-memory/locality.md
index eca83766..e61cb5a3 100644
--- a/content/english/hpc/external-memory/locality.md
+++ b/content/english/hpc/external-memory/locality.md
@@ -174,7 +174,7 @@ The AoS layout is usually preferred for data structures, but SoA still has good
 
 This difference in design is important in data processing applications. For example, databases can be either *row-* or *column-oriented* (also called *columnar*):
 
-- *Row-oriented* storage formats are used when you need to search for a limited amount of objects in a large dataset and fetch all or most of their fields. Examples: PostgreSQL, MongoDB.
+- *Row-oriented* storage formats are used when you need to search for a limited number of objects in a large dataset and/or fetch all or most of their fields. Examples: PostgreSQL, MongoDB.
 - *Columnar* storage formats are used for big data processing and analytics, where you need to scan through everything anyway to calculate certain statistics. Examples: ClickHouse, Hbase.
 
 Columnar formats have the additional advantage that you can only read the fields that you need, as different fields are stored in separate external memory regions.
diff --git a/content/english/hpc/external-memory/sorting.md b/content/english/hpc/external-memory/sorting.md
index c7effc46..299da78f 100644
--- a/content/english/hpc/external-memory/sorting.md
+++ b/content/english/hpc/external-memory/sorting.md
@@ -34,7 +34,7 @@ So far the examples have been simple, and their analysis doesn't differ too much
 
 In the standard RAM model, the asymptotic complexity would be multiplied $k$, since we would need to perform $O(k)$ comparisons to fill each next element. But in the external memory model, since everything we do in-memory doesn't cost us anything, its asymptotic complexity would not change as long as we can fit $(k+1)$ full blocks in memory, that is, if $k = O(\frac{M}{B})$.
 
-Remember [the $M \gg B$ assumption](../model) when we introduced the computational model? If we have $M \geq B^{1+ε}$ for $\epsilon > 0$, then we can fit any sub-polynomial amount of blocks in memory, certainly including $O(\frac{M}{B})$. This condition is called *tall cache assumption*, and it is usually required in many other external memory algorithms.
+Remember [the $M \gg B$ assumption](../model) when we introduced the computational model? If we have $M \geq B^{1+\epsilon}$ for $\epsilon > 0$, then we can fit any sub-polynomial number of blocks in memory, certainly including $O(\frac{M}{B})$. This condition is called the *tall cache assumption*, and it is usually required in many other external memory algorithms.
 
 ### Merge Sorting
 
@@ -58,7 +58,7 @@ Half of a page ago we have learned that in the external memory model, we can mer
 
 Let's sort each block of size $M$ in-memory just as we did before, but during each merge stage, we will split sorted blocks not just in pairs to be merged, but take as many blocks we can fit into our memory during a $k$-way merge. This way the height of the merge tree would be greatly reduced, while each layer would still be done in $O(\frac{N}{B})$ IOPS.
 
-How many sorted arrays can we merge at once? Exactly $k = \frac{M}{B}$, since we need memory for one block for each array. Since the total amount of layers will be reduced to $\log_{\frac{M}{B}} \frac{N}{M}$, the total complexity will be reduced to
+How many sorted arrays can we merge at once? Exactly $k = \frac{M}{B}$, since we need memory for one block for each array. Since the total number of layers will be reduced to $\log_{\frac{M}{B}} \frac{N}{M}$, the total complexity will be reduced to
 
 $$
 SORT(N) \stackrel{\text{def}}{=} O\left(\frac{N}{B} \log_{\frac{M}{B}} \frac{N}{M} \right)
diff --git a/content/english/hpc/profiling/benchmarking.md b/content/english/hpc/profiling/benchmarking.md
index dd543bcc..2be61235 100644
--- a/content/english/hpc/profiling/benchmarking.md
+++ b/content/english/hpc/profiling/benchmarking.md
@@ -186,4 +186,4 @@ plt.plot(ns, [x / y for x, y in zip(baseline, results)])
 plt.show()
 ```
 
-Once established, this workflow makes you iterate much faster and just focus on optimizing the algorithm itself.
+Once established, this workflow makes you iterate much faster and focus on optimizing the algorithm itself.
diff --git a/content/english/hpc/simd/shuffling.md b/content/english/hpc/simd/shuffling.md
index 111c34d5..6ff3b749 100644
--- a/content/english/hpc/simd/shuffling.md
+++ b/content/english/hpc/simd/shuffling.md
@@ -175,7 +175,7 @@ The general idea of our algorithm is as follows:
 - use this mask to index a lookup table that returns a permutation moving the elements that satisfy the predicate to the beginning of the vector (in their original order);
 - use the `_mm256_permutevar8x32_epi32` intrinsic to permute the values;
 - write the whole permuted vector to the buffer — it may have some trailing garbage, but its prefix is correct;
-- calculate the population count of the scalar mask and move the buffer pointer by that amount.
+- calculate the population count of the scalar mask and move the buffer pointer by that number.
 
 First, we need to precompute the permutations:
 

From 893772a2538f1592fb1fdc55611267a7effd5868 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 18 May 2022 11:49:06 +0300
Subject: [PATCH 102/173] fix eytzinger example (tnx @tmp-coder)

---
 content/english/hpc/data-structures/binary-search.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)
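
Note on the corrected example (not part of the commit): the new `eytzinger:` row and the `k = 10` trace in the hunk below can be checked with a small sketch. It assumes GCC or Clang for `__builtin_ffs`; the `Eytzinger` struct, its member names, and the `-1` sentinel are hypothetical and only loosely follow the article's code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch of the one-indexed Eytzinger layout and the index-restoring search.
struct Eytzinger {
    int n;
    std::vector<int> t;  // t[1..n] holds the layout, t[0] is the "not found" sentinel

    explicit Eytzinger(const std::vector<int>& sorted, int sentinel = -1)
        : n((int) sorted.size()), t(sorted.size() + 1, sentinel) {
        assert(std::is_sorted(sorted.begin(), sorted.end()));
        int i = 0;
        build(sorted, i, 1);
    }

    // In-order traversal of the implicit tree: left child 2k, right child 2k + 1.
    void build(const std::vector<int>& a, int& i, int k) {
        if (k > n) return;
        build(a, i, 2 * k);
        t[k] = a[i++];
        build(a, i, 2 * k + 1);
    }

    // Index in t of the first element not less than x, or 0 if there is none.
    int lower_bound_index(int x) const {
        int k = 1;
        while (k <= n)
            k = 2 * k + (t[k] < x);  // go left (0) or right (1)
        k >>= __builtin_ffs(~k);     // drop trailing right turns and the last left turn
        return k;
    }
};
```

For the array 1..8 this produces the layout `5 3 7 2 4 6 8 1`, and querying `x = 4` walks `k = 1, 2, 5, 10` before the final shift brings it back to `k = 5`, where the value 4 is stored.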

diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index 48bf07b4..babe0092 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -286,7 +286,9 @@ This function takes the current node number `k`, recursively writes out all elem
 
 Despite being recursive, it is actually quite fast as all the memory reads are sequential, and the memory writes are only in $O(\log n)$ different memory blocks at a time.
 
-Note that the Eytzinger array is one-indexed — this will be important for performance later. You can put in the zeroth element the value that you want to be returned in the case when the lower bound doesn't exist (similar to `a.end()` for `std::lower_bound`).
+Note that this traversal and the resulting permutation are not exactly equivalent to the "tree" of vanilla binary search: for example, the left child subtree may be larger than the right child subtree (and by more than just one node), but it doesn't matter since both approaches result in the same logarithmic tree depth.
+
+Also note that the Eytzinger array is one-indexed — this will be important for performance later. You can put in the zeroth element the value that you want to be returned in the case when the lower bound doesn't exist (similar to `a.end()` for `std::lower_bound`).
 
 ### Search Implementation
 
@@ -302,18 +304,18 @@ The only problem arises when we need to restore the index of the resulting eleme
 
 ```
     array:  1 2 3 4 5 6 7 8
-eytzinger:  4 2 5 1 6 3 7 8
+eytzinger:  5 3 7 2 4 6 8 1
 1st range:  ---------------  k := 1
 2nd range:  -------          k := 2*k      (=2)
 3rd range:      ---          k := 2*k + 1  (=5)
-4th range:        -          k := 2*k + 1  (=11)
+4th range:        -          k := 2*k      (=10)
 ```
 
-Here we query the array of $[1, …, 8]$ for the lower bound of $x=4$. We compare it against $4$, $2$, and $5$, go left-right-right, and end up with $k = 11$, which isn't even a valid array index.
+Here we query the array of $[1, …, 8]$ for the lower bound of $x=4$. We compare it against $5$, $3$, and $4$, go left-right-left, and end up with $k = 10$, which isn't even a valid array index.
 
 The trick is to notice that, unless the answer is the last element of the array, we compare $x$ against it at some point, and after we've learned that it is not less than $x$, we start comparing $x$ against elements to the left, and all these comparisons evaluate true (that is, leading to the right). Therefore, to restore the answer, we just need to "cancel" some number of right turns.
 
-This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing ones in the binary representation and right-shift $k$ by exactly that. To do this, we can invert the number (`~k`) and call the "find first set" instruction:
+This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing 1s in the binary representation and right-shift $k$ by exactly that number of bits. To do this, we can invert the number (`~k`) and call the "find first set" instruction:
 
 ```c++
 int lower_bound(int x) {

From b82fb8fa10e5eaac97e6111016f9886464b3135c Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 19 May 2022 13:44:01 +0300
Subject: [PATCH 103/173] on optimizing latency and efficiency

---
 content/english/hpc/complexity/levels.md | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/content/english/hpc/complexity/levels.md b/content/english/hpc/complexity/levels.md
index 281bdea2..b1e29e2e 100644
--- a/content/english/hpc/complexity/levels.md
+++ b/content/english/hpc/complexity/levels.md
@@ -40,3 +40,25 @@ Programmers can be put in several "levels" in terms of their software optimizati
 In this book, we expect that the average reader is somewhere around stage 1, and hopefully by the end of it will get to 4.
 
 You should also go through these levels when designing algorithms. First get it working in the first place, then select a bunch of reasonably asymptotically optimal algorithm. Then think about how they are going to work in terms of their memory operations or ability to execute in parallel (even if you consider single-threaded programs, there is still going to be plenty of parallelism inside a core, so this model is extremely ), and then proceed toward actual implementation. Avoid premature optimization, as Knuth once said.
+
+---
+
+For most web services, efficiency doesn't matter, but *latency* does.
+
+Increasing efficiency is not how it is done nowadays.
+
+A pageview usually generates somewhere on the order of 0.1 to 1 cent. This is a typical rate at which you monetize user attention. Say, if I simply installed AdSense, I'd be getting something like that, depending on where most of my readers are from and how many of them are using an ad blocker.
+
+At the same time, a server with a dedicated core and 1 GB of RAM (which is an absurdly large amount of resources for a simple web service) costs around one millionth per second when amortized. You could fetch 100 photos with that.
+
+Amazon had an experiment where they A/B tested their service with artificial delays and found out that a 100ms delay decreased revenue. The same holds for most other services: say, if you lose your "flow" on Twitter, you are likely to start thinking about something else and leave. If the delay at Google is more than a few seconds, people will just think that Google isn't working and quit.
+
+Latency can usually be minimized with parallel computing, which is why distributed systems focus more on scalability. This part of the book is concerned with improving the *efficiency* of algorithms, which lowers latency as a by-product.
+
+However, there are still use cases where there is a trade-off between quality and the cost of servers.
+
+- Search is hierarchical. There are usually many layers of more accurate but slower models. The more documents you rank on each layer, the better the final quality.
+- Games. They are more enjoyable at a large scale, but the required computational power also increases. This includes AI.
+- AI workloads: those that have large quantities of data, such as language models. Heavier models require more compute. The bottleneck in them is not the amount of data, but efficiency.
+
+Inherently sequential algorithms, or cases when the resources are constrained. Ctrl+f'ing a large PDF is painful. Factorization.

From 25333d550985213cdde5a743f5d6e4862207e4ce Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Fri, 20 May 2022 06:37:07 +0300
Subject: [PATCH 104/173] estimating performance engineering impact

---
 content/english/hpc/complexity/levels.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/content/english/hpc/complexity/levels.md b/content/english/hpc/complexity/levels.md
index b1e29e2e..e2e8b58f 100644
--- a/content/english/hpc/complexity/levels.md
+++ b/content/english/hpc/complexity/levels.md
@@ -62,3 +62,17 @@ However, there are still use cases when there is a trade-off between quality and
 - AI workloads: those that have large quantities of data, such as language models. Heavier models require more compute. The bottleneck in them is not the amount of data, but efficiency.
 
 Inherently sequential algorithms, or cases when the resources are constrained. Ctrl+f'ing a large PDF is painful. Factorization.
+
+## Estimating the impact
+
+Sometimes the optimization needs to happen in the calling layer.
+
+simdjson speeds up JSON parsing, but it may be better not to use JSON in the first place.
+
+Protobuf or flat binary formats.
+
+There is also a chicken and egg problem: people don't use an approach that much because it is slow and not feasible.
+
+Cost to implement, bugs, maintainability. It is perfectly fine that most software in the world is inefficient.
+
+What does it mean to be a better programmer? Faster programs? Faster speed of work? Fewer bugs? It is a combination of those.

From bf8a1e817963151180f6f7352e3d979b1f6f7f33 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Fri, 20 May 2022 06:50:32 +0300
Subject: [PATCH 105/173] how to read this book

---
 content/english/hpc/complexity/levels.md |  2 ++
 content/english/hpc/preface.md           | 16 ++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/content/english/hpc/complexity/levels.md b/content/english/hpc/complexity/levels.md
index e2e8b58f..981d467c 100644
--- a/content/english/hpc/complexity/levels.md
+++ b/content/english/hpc/complexity/levels.md
@@ -76,3 +76,5 @@ There is also a chicken and egg problem: people don't use an approach that much
 Cost to implement, bugs, maintainability. It is perfectly fine that most software in the world is inefficient.
 
 What does it mean to be a better programmer? Faster programs? Faster speed of work? Fewer bugs? It is a combination of those.
+
+Implementing compiler optimizations or databases are examples of high-leverage activities because they act as a tax on everything else — which is why you see most people writing books on these particular topics rather than software optimization in general.
diff --git a/content/english/hpc/preface.md b/content/english/hpc/preface.md
index 28adae07..2e18e715 100644
--- a/content/english/hpc/preface.md
+++ b/content/english/hpc/preface.md
@@ -19,6 +19,22 @@ There are a lot of forward references I couldn't get rid of.
 
 Read some of the SIMD and memory chapter first.
 
+Chapter 1 is a "why you should care" sort of read.
+
+Chapter 2 is an introduction to computer architectures from the perspective of performance. There is a high chance that you already know it from a college course, but I still advise reading it for context, as we will cover assembly-level optimization techniques there.
+
+Chapter 3 is where experienced programmers should start.
+
+Chapter 4 discusses compilation with the example of C++ and GCC/Clang. Chapter 5 discusses language-agnostic profiling methods. You are free to skip both.
+
+Chapter 6 discusses arithmetic, and chapter 7 discusses modular arithmetic and its applications. They also act as a sort of reference for algorithms in the case studies.
+
+Chapter 8 introduces the external memory model and how the memory system works. Chapter 9 follows up with experimental studies of how it can affect performance.
+
+Chapter 10 discusses SIMD programming, which is a major part. It is not *that* intertwined with the previous ones, and if you are feeling comfortable, I'd suggest that you start reading with it because it will teach you powerful techniques right away.
+
+Chapters 11-12 contain case studies of complex algorithms. Performance engineering is a practical field, so you should learn from major examples.
+
 The first 5 chapters build up general understanding of performance.
 
 Chapters 6-10 go deeper into modern features. Arithmetic, number theory (the techniques that are also relevant outside of it). Some are theoretic, and then applied in practice.

From 22ad3b1ff984da97081d26b4ef60f9b7c7137a24 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 23 May 2022 10:25:34 +0300
Subject: [PATCH 106/173] fix formatting

---
 .../russian/cs/factorization/eratosthenes.md    | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/content/russian/cs/factorization/eratosthenes.md b/content/russian/cs/factorization/eratosthenes.md
index 02e72c0e..acf47749 100644
--- a/content/russian/cs/factorization/eratosthenes.md
+++ b/content/russian/cs/factorization/eratosthenes.md
@@ -12,10 +12,10 @@ published: true
 
 Основная идея соответствует названию алгоритма: запишем ряд чисел $1, 2,\ldots, n$, а затем будем вычеркивать
 
-* сначала числа, делящиеся на $2$, кроме самого числа $2$,
-* потом числа, делящиеся на $3$, кроме самого числа $3$,
-* с числами, делящимися на $4$, ничего делать не будем — мы их уже вычёркивали,
-* потом продолжим вычеркивать числа, делящиеся на $5$, кроме самого числа $5$,
+- сначала числа, делящиеся на $2$, кроме самого числа $2$,
+- потом числа, делящиеся на $3$, кроме самого числа $3$,
+- с числами, делящимися на $4$, ничего делать не будем — мы их уже вычёркивали,
+- потом продолжим вычеркивать числа, делящиеся на $5$, кроме самого числа $5$,
 
 …и так далее.
 
@@ -23,10 +23,10 @@ published: true
 
 ```c++
 vector<bool> sieve(int n) {
-    vector<bool> is_prime(n+1, true);
+    vector<bool> is_prime(n + 1, true);
     for (int i = 2; i <= n; i++)
         if (is_prime[i])
-            for (int j = 2*i; j <= n; j += i)
+            for (int j = 2 * i; j <= n; j += i)
                 is_prime[j] = false;
     return is_prime;            
 }
@@ -49,7 +49,6 @@ $$
 У исходного алгоритма асимптотика должна быть ещё лучше. Чтобы найти её точнее, нам понадобятся два факта про простые числа:
 
 1. Простых чисел от $1$ до $n$ примерно $\frac{n}{\ln n}$ .
-
 2. Простые числа распределены без больших «разрывов» и «скоплений», то есть $k$-тое простое число примерно равно $k \ln k$.
 
 Мы можем упрощённо считать, что число $k$ является простым с «вероятностью» $\frac{1}{\ln n}$. Тогда, время работы алгоритма можно более точнее оценить как
@@ -65,11 +64,11 @@ $$
 
 ## Линейное решето
 
-Основная проблема решета Эратосфена состоит в том, что некоторые числа мы будем помечать как составные несколько раз — а именно столько раз, сколько у них различных простых делителей. Чтобы достичь линейного времени работы, нам нужно придумать способ, как рассматривать все составные числа ровно один раз.
+Основная проблема решета Эратосфена состоит в том, что некоторые числа мы будем помечать как составные несколько раз — столько, сколько у них различных простых делителей. Чтобы достичь линейного времени работы, нам нужно придумать способ, как рассматривать все составные числа ровно один раз.
 
 Обозначим за $d(k)$ минимальный простой делитель числа $k$ и заметим следующий факт: у составного числа $k$ есть единственное представление $k = d(k) \cdot r$, и при этом у числа $r$ нет простых делителей меньше $d(k)$.
 
-Идея оптимизации состоит в том, чтобы перебирать этот $r$, и для каждого перебирать только нужные множители — а именно все от $2$ до $d(r)$ включительно.
+Идея оптимизации состоит в том, чтобы перебирать этот $r$, и для каждого перебирать только нужные множители — а именно, все от $2$ до $d(r)$ включительно.
 
 ### Алгоритм
 

From a80dc4f9e0aeb28a873c5c3347626477cb45ac42 Mon Sep 17 00:00:00 2001
From: Timofey <molney239@gmail.com>
Date: Tue, 24 May 2022 13:11:02 +0300
Subject: [PATCH 107/173] Update products.md

---
 content/russian/cs/geometry-basic/products.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/content/russian/cs/geometry-basic/products.md b/content/russian/cs/geometry-basic/products.md
index a4e1a3d5..ca0a5dd3 100644
--- a/content/russian/cs/geometry-basic/products.md
+++ b/content/russian/cs/geometry-basic/products.md
@@ -1,6 +1,7 @@
 ---
 title: Скалярное и векторное произведение
 weight: 2
+published: true
 ---
 
 Помимо очевидных сложения, вычитания и умножения на константу, у векторов можно ввести и свои особенные операции, которые нам упростят жизнь.
@@ -42,7 +43,7 @@ $$
 
 Так же, как и со скалярным произведением, доказательство координатной формулы оставляется упражнением читателю. Если кто-то захочет это сделать: это следует из линейности обоих произведений (что в свою очередь тоже нужно доказать) и разложения и разложения по базисным векторам $\overline{(0, 1)}$ и $\overline{(1, 0)}$.
 
-Геометрически, это ориентированный объем параллелограмма, натянутого на вектора $a$ и $b$:
+Геометрически, это ориентированная площадь параллелограмма, натянутого на вектора $a$ и $b$:
 
 ![](../img/cross.jpg)
 

From 9ef98a1c9f5b4a103e68d9db4d44634753e9378e Mon Sep 17 00:00:00 2001
From: Timofey <molney239@gmail.com>
Date: Tue, 24 May 2022 13:27:57 +0300
Subject: [PATCH 108/173] 
 http://www.gramota.ru/slovari/dic/?word=%D0%B2%D0%B5%D0%BA%D1%82%D0%BE%D1%80&all=x

---
 content/russian/cs/geometry-basic/vectors.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/content/russian/cs/geometry-basic/vectors.md b/content/russian/cs/geometry-basic/vectors.md
index 05051396..ee1a052a 100644
--- a/content/russian/cs/geometry-basic/vectors.md
+++ b/content/russian/cs/geometry-basic/vectors.md
@@ -1,6 +1,7 @@
 ---
-title: Точки и векторы
+title: Точки и вектора
 weight: 1
+published: true
 ---
 
 Отрезок, для которого указано, какой из его концов считается началом, а какой концом, называется *вектором*. Вектор на плоскости можно задать двумя числами — его координатами по горизонтали и вертикали.

From 689bc2ee0615285809b0086df388dc5b3dafcfbc Mon Sep 17 00:00:00 2001
From: Timofey <molney239@gmail.com>
Date: Tue, 24 May 2022 14:47:45 +0300
Subject: [PATCH 109/173] =?UTF-8?q?=D0=9E=D0=BF=D0=B8=D1=81=D0=BA=D0=B0?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 content/russian/cs/geometry-basic/products.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/russian/cs/geometry-basic/products.md b/content/russian/cs/geometry-basic/products.md
index ca0a5dd3..488dbca6 100644
--- a/content/russian/cs/geometry-basic/products.md
+++ b/content/russian/cs/geometry-basic/products.md
@@ -41,7 +41,7 @@ $$
 a \times b = |a| \cdot |b| \cdot \sin \theta = x_a y_b - y_a x_b
 $$
 
-Так же, как и со скалярным произведением, доказательство координатной формулы оставляется упражнением читателю. Если кто-то захочет это сделать: это следует из линейности обоих произведений (что в свою очередь тоже нужно доказать) и разложения и разложения по базисным векторам $\overline{(0, 1)}$ и $\overline{(1, 0)}$.
+Так же, как и со скалярным произведением, доказательство координатной формулы оставляется упражнением читателю. Если кто-то захочет это сделать: это следует из линейности обоих произведений (что в свою очередь тоже нужно доказать) и разложения по базисным векторам $\overline{(0, 1)}$ и $\overline{(1, 0)}$.
 
 Геометрически, это ориентированная площадь параллелограмма, натянутого на вектора $a$ и $b$:
 
@@ -66,7 +66,7 @@ int operator^(r a, r b) { return a.x*b.y - b.x*a.y; }
 
 Скалярное и векторное произведения тесно связаны с углами между векторами и могут использоваться для подсчета величин вроде ориентированных углов и площадей, которые обычно используются для разных проверок.
 
-Когда они уже реализованы, использовать произведения гораздо проще, чем опираться на алгебру. Например, можно легко угол между двумя векторами, подставив в знакомый нам `atan2` векторное и скалярное произведение:
+Когда они уже реализованы, использовать произведения гораздо проще, чем опираться на алгебру. Например, можно легко вычислить угол между двумя векторами, подставив в знакомый нам `atan2` векторное и скалярное произведение:
 
 ```c++
 double angle(r a, r b) {

From 2638bf74c962ab81b0515318e832d0a9451e4b7c Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 24 May 2022 21:04:56 +0300
Subject: [PATCH 110/173] fix formatting

---
 content/russian/cs/interactive/answer-search.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/russian/cs/interactive/answer-search.md b/content/russian/cs/interactive/answer-search.md
index 28e4b4bc..0b38ce24 100644
--- a/content/russian/cs/interactive/answer-search.md
+++ b/content/russian/cs/interactive/answer-search.md
@@ -66,7 +66,7 @@ int solve() {
 Здесь, в отличие от предыдущей задачи, кажется, существует прямое решение с формулой. Но вместо того, чтобы о нем думать, можно просто свести задачу к обратной. Давайте подумаем, как по числу минут $t$ (ответу) понять, сколько листов напечатается за это время? Очень легко:
 
 $$
-\lfloor\frac{t}{x}\rfloor + \lfloor\frac{t}{y}\rfloor
+\left \lfloor \frac{t}{x} \right \rfloor + \left \lfloor \frac{t}{y} \right \rfloor
 $$
 
-Ясно, что за $0$ минут $n$ листов распечатать нельзя, а за $xn$ минут один только первый принтер успеет напечатать $n$ листов. Поэтому $0$ и $xn$ — это подходящие изначальные границы для бинарного поиска.
+Ясно, что за $0$ минут $n$ листов распечатать нельзя, а за $x \cdot n$ минут один только первый принтер успеет напечатать $n$ листов. Поэтому $0$ и $x \cdot n$ — это подходящие изначальные границы для бинарного поиска.

From f81a9ea614b579811ba6b8e7edf9a112c601b441 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 24 May 2022 21:05:09 +0300
Subject: [PATCH 111/173] factorization code

---
 .../english/hpc/algorithms/factorization.md   | 243 +++++++++++++++++-
 1 file changed, 242 insertions(+), 1 deletion(-)
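
Reviewer note (not part of the commit): the `magic` snippet in this patch is only a fragment, so here is a minimal self-contained sketch of the divisibility check it appears to rely on. The helper names `reciprocal`, `divisible`, and `recover_divisor` are hypothetical and do not come from the patch, and the check is assumed to be exact only when both $n$ and the divisor fit in 32 bits, which covers the 30-bit benchmark inputs:

```c++
#include <cassert>
#include <cstdint>

typedef uint64_t u64;

// ceil(2^64 / d) for a divisor d >= 2 (the "magic" constant from the patch)
constexpr u64 reciprocal(u64 d) {
    return u64(-1) / d + 1;
}

// For n, d < 2^32: n is divisible by d iff (m * n mod 2^64) < m,
// where m = reciprocal(d); the multiplication wraps around on purpose
constexpr bool divisible(u64 n, u64 m) {
    return m * n < m;
}

// Recovers the divisor d from its magic constant, as the patch does
constexpr u64 recover_divisor(u64 m) {
    return u64(-1) / m + 1;
}
```

For wider inputs the comparison may no longer be exact, so a 60-bit variant would probably need a wider reciprocal or a fallback to the `%` operator.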

diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md
index 4ff8061d..7c2d8aa7 100644
--- a/content/english/hpc/algorithms/factorization.md
+++ b/content/english/hpc/algorithms/factorization.md
@@ -14,7 +14,6 @@ Integer factorization is interesting because of RSA problem.
 - Less than 10^100: Quadratic Sieve
 - More than 10^100: General Number Field Sieve
 
-
 and do other computations such as computing the greatest common multiple (given that it is not even so that ) (since $\gcd(n, r) = 1$)
 
 For all methods, we will implement `find_factor` function which returns one divisor ot 1. You can apply it recurively to get the factorization, so whatever asymptotic you had won't affect it:
@@ -32,6 +31,23 @@ vector<u64> factorize(u64 n) {
 }
 ```
 
+0.056024
+2043.968140
+
+```c++
+typedef __uint16_t u16;
+typedef __uint32_t u32;
+typedef __uint64_t u64;
+typedef __uint128_t u128;
+
+u64 find_factor(u64 n) {
+    for (u64 d = 2; d * d <= n; d++)
+        if (n % d == 0)
+            return d;
+    return 1;
+}
+```
+
 ## Trial division
 
 This is the most basic algorithm to find a prime factorization.
@@ -193,3 +209,228 @@ This is exactly the type of problem when we need specific knowledge, because we
 ## Further optimizations
 
 Существуют также [субэкспоненциальные](https://ru.wikipedia.org/wiki/%D0%A4%D0%B0%D0%BA%D1%82%D0%BE%D1%80%D0%B8%D0%B7%D0%B0%D1%86%D0%B8%D1%8F_%D1%86%D0%B5%D0%BB%D1%8B%D1%85_%D1%87%D0%B8%D1%81%D0%B5%D0%BB#%D0%A1%D1%83%D0%B1%D1%8D%D0%BA%D1%81%D0%BF%D0%BE%D0%BD%D0%B5%D0%BD%D1%86%D0%B8%D0%B0%D0%BB%D1%8C%D0%BD%D1%8B%D0%B5_%D0%B0%D0%BB%D0%B3%D0%BE%D1%80%D0%B8%D1%82%D0%BC%D1%8B), но не полиномиальные алгоритмы факторизации. Человечество [умеет](https://en.wikipedia.org/wiki/Integer_factorization_records) факторизовывать числа порядка $2^{200}$.
+
+
+---
+
+If you have limited time, you should probably compute as much forward as possible, and then half the time computing the other.
+
+How to optimize for the *average* case is unclear.
+
+0.087907
+3964.321045
+
+```c++
+u64 find_factor(u64 n) {
+    if (n % 2 == 0)
+        return 2;
+    for (u64 d = 3; d * d <= n; d += 2)
+        if (n % d == 0)
+            return d;
+    return 1;
+}
+```
+
+0.199740
+7615.217773
+
+```c++
+u64 find_factor(u64 n) {
+    for (u64 d : {2, 3, 5})
+        if (n % d == 0)
+            return d;
+    u64 increments[] = {0, 4, 6, 10, 12, 16, 22, 24};
+    for (u64 d = 7; d * d <= n; d += 30) {
+        for (u64 k = 0; k < 8; k++) {
+            u64 x = d + increments[k];
+            if (n % x == 0)
+                return x;
+        }
+    }
+    return 1;
+}
+```
+
+19430.058594
+
+```c++
+const int N = (1 << 16);
+
+struct Precalc {
+    u16 primes[6542]; // # of primes under N=2^16
+
+    constexpr Precalc() : primes{} {
+        bool marked[N] = {};
+        int n_primes = 0;
+
+        for (int i = 2; i < N; i++) {
+            if (!marked[i]) {
+                primes[n_primes++] = i;
+                for (int j = 2 * i; j < N; j += i)
+                    marked[j] = true;
+            }
+        }
+    }
+};
+
+constexpr Precalc P{};
+
+u64 find_factor(u64 n) {
+    for (u16 p : P.primes)
+        if (n % p == 0)
+            return p;
+    return 1;
+}
+```
+
+352997.656250
+
+```c++
+u64 magic[6542];
+magic[n_primes++] = u64(-1) / i + 1;
+
+u64 find_factor(u64 n) {
+    for (u64 m : P.magic)
+        if (m * n < m)
+            return u64(-1) / m + 1;
+    return 1;
+}
+```
+
+Except that it is constant, so the speedup should be twice as much.
+
+---
+
+```c++
+u64 find_factor(u64 n) {
+    while (true) {
+        if (u64 g = gcd(randint(2, n - 1), n); g != 1)
+            return g;
+    }
+}
+```
+
+99.292641
+25720.164062 almost 15x slower
+
+```c++
+u64 f(u64 x, u64 a, u64 mod) {
+    return ((u128) x * x + a) % mod;
+}
+
+u64 diff(u64 a, u64 b) {
+    // a and b are unsigned and so is their difference, so we can't just call abs(a - b)
+    return a > b ? a - b : b - a;
+}
+
+u64 rho(u64 n, u64 x0 = 2, u64 a = 1) {
+    u64 x = x0, y = x0, g = 1;
+    while (g == 1) {
+        x = f(x, a, n);
+        y = f(y, a, n);
+        y = f(y, a, n);
+        g = gcd(diff(x, y), n);
+    }
+    return g;
+}
+
+u64 find_factor(u64 n) {
+    return rho(n);
+}
+```
+
+56.745281
+
+```c++
+u64 rho(u64 n, u64 x0 = 2, u64 a = 1) {
+    u64 x = x0, y = x0;
+    
+    for (int l = 256; l < (1 << 20); l *= 2) {
+        x = y;
+        for (int i = 0; i < l; i++) {
+            y = f(y, a, n);
+            if (u64 g = gcd(diff(x, y), n); g != 1)
+                return g;
+        }
+    }
+
+    return 1;
+}
+```
+
+426.389160
+
+```c++
+const int M = 1024;
+
+u64 rho(u64 n, u64 x0 = 2, u64 a = 1) {
+    u64 x = x0, y = x0, p = 1;
+    
+    for (int l = M; l < (1 << 20); l *= 2) {
+        x = y;
+        for (int i = 0; i < l; i += M) {
+            for (int j = 0; j < M; j++) {
+                y = f(y, a, n);
+                p = (u128) p * diff(x, y) % n;
+            }
+            if (u64 g = gcd(p, n); g != 1)
+                return g;
+        }
+    }
+
+    return 1;
+}
+```
+
+2948.260986
+
+```c++
+struct Montgomery {
+    u64 n, nr;
+    
+    Montgomery(u64 n) : n(n) {
+        nr = 1;
+        for (int i = 0; i < 6; i++)
+            nr *= 2 - n * nr;
+    }
+
+    u64 reduce(u128 x) const {
+        u64 q = u64(x) * nr;
+        u64 m = ((u128) q * n) >> 64;
+        return (x >> 64) + n - m;
+    }
+
+    u64 multiply(u64 x, u64 y) {
+        return reduce((u128) x * y);
+    }
+};
+
+u64 f(u64 x, u64 a, Montgomery m) {
+    return m.multiply(x, x) + a;
+}
+
+const int M = 1024;
+
+u64 rho(u64 n, u64 x0 = 2, u64 a = 1) {
+    Montgomery m(n);
+    u64 y = x0;
+    
+    for (int l = M; l < (1 << 20); l *= 2) {
+        u64 x = y, p = 1;
+        for (int i = 0; i < l; i += M) {
+            for (int j = 0; j < M; j++) {
+                y = f(y, a, m);
+                p = m.multiply(p, diff(x, y));
+            }
+            if (u64 g = gcd(p, n); g != 1)
+                return g;
+        }
+    }
+
+    return 1;
+}
+```
+
+There are slightly more errors because we are a bit loose with modular arithmetic here. The error rate grows higher when we increase and decrease (due to overflows).
+
+788.4861246275735

From 968abd50c4b267ab6a7b27e991b02267c1d08518 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 24 May 2022 23:02:47 +0300
Subject: [PATCH 112/173] factorization intro

---
 .../english/hpc/algorithms/factorization.md   | 70 +++++++++++++------
 1 file changed, 48 insertions(+), 22 deletions(-)
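
Reviewer note (not part of the commit): the benchmark section describes the semiprime inputs only in words, so here is a minimal sketch of how such a generator might look. The names `is_prime`, `random_prime`, and `random_semiprime` are hypothetical, the primality test is deliberately naive, and "a $b$-bit prime" is assumed to mean a prime with its top bit set:

```c++
#include <cassert>
#include <cstdint>
#include <random>

typedef uint64_t u64;

// Naive primality check; good enough for the <= 32-bit primes needed here
bool is_prime(u64 x) {
    if (x < 2)
        return false;
    for (u64 d = 2; d * d <= x; d++)
        if (x % d == 0)
            return false;
    return true;
}

// Samples a prime with exactly `bits` bits (bits >= 2) by rejection
u64 random_prime(int bits, std::mt19937_64 &rng) {
    std::uniform_int_distribution<u64> dist(u64(1) << (bits - 1),
                                            (u64(1) << bits) - 1);
    while (true) {
        u64 x = dist(rng);
        if (is_prime(x))
            return x;
    }
}

// Product of two random floor(k / 2)-bit primes, as described in the text
u64 random_semiprime(int k, std::mt19937_64 &rng) {
    u64 p = random_prime(k / 2, rng);
    u64 q = random_prime(k / 2, rng);
    return p * q;
}
```

Note that under this assumption the product of two $\lfloor k / 2 \rfloor$-bit primes has either $k - 1$ or $k$ bits (for even $k$), so "a $k$-bit semiprime" is slightly approximate.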

diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md
index 7c2d8aa7..8baf4aaf 100644
--- a/content/english/hpc/algorithms/factorization.md
+++ b/content/english/hpc/algorithms/factorization.md
@@ -4,42 +4,59 @@ weight: 3
 draft: true
 ---
 
-Integer factorization is interesting because of RSA problem.
+The problem of factoring integers into primes is central to computational [number theory](/hpc/number-theory/). It has been [studied](https://www.cs.purdue.edu/homes/ssw/chapter3.pdf) since at least the 3rd century BC, and [many methods](https://en.wikipedia.org/wiki/Category:Integer_factorization_algorithms) have been developed that are efficient for different inputs.
 
-"How big are your numbers?" determines the method to use:
+In this case study, we specifically consider the factorization of *word-sized* integers: those on the order of $10^9$ and $10^{18}$. Unusually for this book, here you may actually learn an asymptotically better algorithm: we start with a few basic approaches, and then gradually build up to the $O(\sqrt[4]{n})$-time *Pollard's rho algorithm* and optimize it to the point where it can factorize 60-bit semiprimes in 0.3-0.4ms, which is almost 4x faster than the previous state-of-the-art.
 
-- Less than 2^16 or so: Lookup table.
-- Less than 2^70 or so: Richard Brent's modification of Pollard's rho algorithm.
-- Less than 10^50: Lenstra elliptic curve factorization
-- Less than 10^100: Quadratic Sieve
-- More than 10^100: General Number Field Sieve
+<!--
+Integer factorization is interesting because of the RSA problem.
+Unlike other case studies of this book, in this one you will actually learn an asymptotically better algorithm that you've never known before — Pollard's rho algorithm — which we optimize so that it is almost 4 times faster than the existing implementation, to the best of my knowledge.
+-->
 
-and do other computations such as computing the greatest common multiple (given that it is not even so that ) (since $\gcd(n, r) = 1$)
+### Benchmark
 
-For all methods, we will implement `find_factor` function which returns one divisor ot 1. You can apply it recurively to get the factorization, so whatever asymptotic you had won't affect it:
+For all methods, we will implement `find_factor` function that takes a positive integer $n$ and returns either its smallest divisor (or `1` if the number is prime):
 
 ```c++
-typedef uint32_t u32;
-typedef uint64_t u64;
+// I don't feel like typing "unsigned long long" each time
+typedef __uint16_t u16;
+typedef __uint32_t u32;
+typedef __uint64_t u64;
 typedef __uint128_t u128;
 
+u64 find_factor(u64 n);
+```
+
+To find full factorization, you can apply it to $n$, reduce it, and continue until a new factor can no longer be found:
+
+```c++
 vector<u64> factorize(u64 n) {
-    vector<u64> res;
-    while (int d = find_factor(n); d > 1) // does it work?
-        res.push_back(d);
-    return res;
+    vector<u64> factorization;
+    for (u64 d; (d = find_factor(n)) != 1; n /= d)
+        factorization.push_back(d);
+    // whatever is left is prime (or 1 if n was 1 to begin with)
+    if (n != 1)
+        factorization.push_back(n);
+    return factorization;
 }
 ```
 
+Since after each removed factor the problem becomes considerably smaller and simpler, the worst-case running time of full factorization is equal to the worst-case running time of a `find_factor` call. 
+
+For many factorization algorithms, including those presented in this article, the running time scales with the least prime factor. Therefore, to provide worst-case input, we use *semiprimes:* products of two prime numbers $p \le q$ that are on the same order of magnitude. To generate a $k$-bit semiprime, we generate two random $\lfloor k / 2 \rfloor$-bit primes.
+
+Since some of the algorithms are inherently randomized, we also tolerate a small (<1%) percentage of errors, although they can be reduced to almost zero without significant performance penalties.
+
+### Trial division
+
+Trial division was first described by Fibonacci in 1202. Although it was probably known to animals. Perhaps some animals can factor? The scientific priority probably belongs to dinosaurs or ancient fish trying to divvy stuff up.
+
+I tried finding references to who invented trial division, but probably it was known to animals long before to split into equal parts.
+
 0.056024
 2043.968140
 
 ```c++
-typedef __uint16_t u16;
-typedef __uint32_t u32;
-typedef __uint64_t u64;
-typedef __uint128_t u128;
-
 u64 find_factor(u64 n) {
     for (u64 d = 2; d * d <= n; d++)
         if (n % d == 0)
@@ -48,8 +65,6 @@ u64 find_factor(u64 n) {
 }
 ```
 
-## Trial division
-
 This is the most basic algorithm to find a prime factorization.
 
 We divide by each possible divisor $d$.
@@ -434,3 +449,14 @@ u64 rho(u64 n, u64 x0 = 2, u64 a = 1) {
 There are slightly more errors because we are a bit loose with modular arithmetic here. The error rate grows higher when we increase and decrease (due to overflows).
 
 788.4861246275735
+
+### Larger Numbers
+
+"How big are your numbers?" determines the method to use:
+
+
+- Less than 2^16 or so: Lookup table.
+- Less than 2^70 or so: Richard Brent's modification of Pollard's rho algorithm.
+- Less than 10^50: Lenstra elliptic curve factorization
+- Less than 10^100: Quadratic Sieve
+- More than 10^100: General Number Field Sieve

From 6f0850d4fc30ba584623340bc8ad3a774b8ff8e9 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 25 May 2022 14:10:08 +0300
Subject: [PATCH 113/173] centered codeblock

---
 content/english/hpc/arithmetic/newton.md             |  6 +++---
 content/english/hpc/data-structures/binary-search.md | 12 ++++++------
 content/russian/cs/numerical/newton.md               |  6 +++---
 themes/algorithmica/assets/style.sass                |  8 +++++++-
 .../_default/_markup/render-codeblock-center.html    |  3 +++
 5 files changed, 22 insertions(+), 13 deletions(-)
 create mode 100644 themes/algorithmica/layouts/_default/_markup/render-codeblock-center.html

diff --git a/content/english/hpc/arithmetic/newton.md b/content/english/hpc/arithmetic/newton.md
index 38bcddda..de42104c 100644
--- a/content/english/hpc/arithmetic/newton.md
+++ b/content/english/hpc/arithmetic/newton.md
@@ -68,9 +68,9 @@ The algorithm converges for many functions, although it does so reliably and pro
 
 Let's run a few iterations of Newton's method to find the square root of $2$, starting with $x_0 = 1$, and check how many digits it got correct after each iteration:
 
-<pre>
-<b>1</b>
-<b>1</b>.5
+<pre class='center-pre'>
+<b>1</b>.0000000000000000000000000000000000000000000000000000000000000
+<b>1</b>.5000000000000000000000000000000000000000000000000000000000000
 <b>1.41</b>66666666666666666666666666666666666666666666666666666666675
 <b>1.41421</b>56862745098039215686274509803921568627450980392156862745
 <b>1.41421356237</b>46899106262955788901349101165596221157440445849057
diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index babe0092..8a4924ea 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -302,12 +302,12 @@ while (k <= n)
 
 The only problem arises when we need to restore the index of the resulting element, as $k$ may end up not pointing to a leaf node. Here is an example of how that can happen:
 
-```
-    array:  1 2 3 4 5 6 7 8
-eytzinger:  5 3 7 2 4 6 8 1
-1st range:  ---------------  k := 1
-2nd range:  -------          k := 2*k      (=2)
-3rd range:      ---          k := 2*k + 1  (=5)
+```center
+    array:  1 2 3 4 5 6 7 8                     
+eytzinger:  5 3 7 2 4 6 8 1                     
+1st range:  ---------------  k := 1             
+2nd range:  -------          k := 2*k      (=2) 
+3rd range:      ---          k := 2*k + 1  (=5) 
 4th range:        -          k := 2*k      (=10)
 ```
 
diff --git a/content/russian/cs/numerical/newton.md b/content/russian/cs/numerical/newton.md
index 248e1b4e..5426cff5 100644
--- a/content/russian/cs/numerical/newton.md
+++ b/content/russian/cs/numerical/newton.md
@@ -66,9 +66,9 @@ double sqrt(double n) {
 
 Запустим метод Ньютона для поиска квадратного корня $2$, начиная с $x_0 = 1$, и посмотрим, сколько первых цифр оказались правильными после каждой итерации:
 
-<pre>
-<b>1</b>
-<b>1</b>.5
+<pre class='center-pre'>
+<b>1</b>.0000000000000000000000000000000000000000000000000000000000000
+<b>1</b>.5000000000000000000000000000000000000000000000000000000000000
 <b>1.41</b>66666666666666666666666666666666666666666666666666666666675
 <b>1.41421</b>56862745098039215686274509803921568627450980392156862745
 <b>1.41421356237</b>46899106262955788901349101165596221157440445849057
diff --git a/themes/algorithmica/assets/style.sass b/themes/algorithmica/assets/style.sass
index a6835c1e..eb5e2410 100644
--- a/themes/algorithmica/assets/style.sass
+++ b/themes/algorithmica/assets/style.sass
@@ -492,7 +492,13 @@ pre
   padding-left: 8px
   font-size: 0.85em
   text-align: left
-  
+
+pre.center-pre
+  text-align: center
+  font-size: 1em
+  background: none
+  border: none
+
 .highlight
   margin: 0px
 
diff --git a/themes/algorithmica/layouts/_default/_markup/render-codeblock-center.html b/themes/algorithmica/layouts/_default/_markup/render-codeblock-center.html
new file mode 100644
index 00000000..d263bb5a
--- /dev/null
+++ b/themes/algorithmica/layouts/_default/_markup/render-codeblock-center.html
@@ -0,0 +1,3 @@
+<pre class='center-pre'>
+{{.Inner}}
+</pre>

From 7297d591846a63f1615ec5415db99d0e5d447e26 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 25 May 2022 14:11:59 +0300
Subject: [PATCH 114/173] bump hugo version

---
 netlify.toml | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/netlify.toml b/netlify.toml
index 1b5ed16e..fb612037 100644
--- a/netlify.toml
+++ b/netlify.toml
@@ -2,7 +2,7 @@
 command = "hugo --gc --minify"
 
 [context.production.environment]
-HUGO_VERSION = "0.87.0"
+HUGO_VERSION = "0.96.0"
 HUGO_ENV = "production"
 HUGO_ENABLEGITINFO = "true"
 
@@ -10,20 +10,20 @@ HUGO_ENABLEGITINFO = "true"
 command = "hugo --gc --minify --enableGitInfo"
 
 [context.split1.environment]
-HUGO_VERSION = "0.87.0"
+HUGO_VERSION = "0.96.0"
 HUGO_ENV = "production"
 
 [context.deploy-preview]
 command = "hugo --gc --minify --buildFuture -b $DEPLOY_PRIME_URL"
 
 [context.deploy-preview.environment]
-HUGO_VERSION = "0.87.0"
+HUGO_VERSION = "0.96.0"
 
 [context.branch-deploy]
 command = "hugo --gc --minify -b $DEPLOY_PRIME_URL"
 
 [context.branch-deploy.environment]
-HUGO_VERSION = "0.87.0"
+HUGO_VERSION = "0.96.0"
 
 [context.next.environment]
 HUGO_ENABLEGITINFO = "true"

From 251dd08c54db23dac6a977a84ea4a60ced3c9532 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 25 May 2022 16:38:32 +0300
Subject: [PATCH 115/173] wheel and lookup factorization

---
 .../english/hpc/algorithms/factorization.md   | 259 ++++++++----------
 1 file changed, 118 insertions(+), 141 deletions(-)
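
Reviewer note (not part of the commit): the wheel section hard-codes the offsets `{0, 4, 6, 10, 12, 16, 22, 24}`, so here is a small sketch of how they could be derived for an arbitrary set of excluded primes. The function name `wheel_offsets` and its parameters are hypothetical, and `std::gcd` requires C++17:

```c++
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

typedef uint64_t u64;

// Returns the offsets (relative to `start`) of all numbers in
// [start, start + period) that are coprime to every excluded prime,
// where period is the product of those primes
std::vector<u64> wheel_offsets(const std::vector<u64> &primes, u64 start) {
    u64 period = 1;
    for (u64 p : primes)
        period *= p;
    std::vector<u64> offsets;
    for (u64 x = start; x < start + period; x++)
        if (std::gcd(x, period) == 1)
            offsets.push_back(x - start);
    return offsets;
}
```

Calling `wheel_offsets({2, 3, 5}, 7)` should reproduce the eight offsets used in the patch, and adding $7$ to the excluded primes grows the list to $48$ entries per $210$ numbers, which illustrates the diminishing returns mentioned in the text.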

diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md
index 8baf4aaf..9f7958ed 100644
--- a/content/english/hpc/algorithms/factorization.md
+++ b/content/english/hpc/algorithms/factorization.md
@@ -15,7 +15,7 @@ Unlike other case studies of this book, in this one you will actually learn an a
 
 ### Benchmark
 
-For all methods, we will implement `find_factor` function that takes a positive integer $n$ and returns either its smallest divisor (or `1` if the number is prime):
+For all methods, we will implement a `find_factor` function that takes a positive integer $n$ and returns any of its non-trivial divisors (or `1` if the number is prime):
 
 ```c++
 // I don't feel like typing "unsigned long long" each time
@@ -45,35 +45,30 @@ Since after each removed factor the problem becomes considerably smaller and sim
 
 For many factorization algorithms, including those presented in this article, the running time scales with the least prime factor. Therefore, to provide worst-case input, we use *semiprimes:* products of two prime numbers $p \le q$ that are on the same order of magnitude. To generate a $k$-bit semiprime, we generate two random $\lfloor k / 2 \rfloor$-bit primes.
 
-Since some of the algorithms are inherently randomized, we also tolerate a small (<1%) percentage of errors, although they can be reduced to almost zero without significant performance penalties.
+Since some of the algorithms are inherently randomized, we also tolerate a small (<1%) percentage of false negative errors (when `find_factor` returns `1` despite number $n$ being composite), although this rate can be reduced to almost zero without significant performance penalties.
 
 ### Trial division
 
-Trial division was first described by Fibonacci in 1202. Although it was probably known to animals. Perhaps some animals can factor? The scientific priority probably belongs to dinosaurs or ancient fish trying to divvy stuff up.
+<!--
 
-I tried finding references to who invented trial division, but probably it was known to animals long before to split into equal parts.
+Trial division was first described by Fibonacci in 1202. Although it was probably known to animals. Perhaps some animals can factor? The scientific priority probably belongs to dinosaurs or ancient fish trying to divvy stuff up.
 
 0.056024
-2043.968140
+
+-->
+
+The most basic approach is to try every number less than $n$ as a divisor:
 
 ```c++
 u64 find_factor(u64 n) {
-    for (u64 d = 2; d * d <= n; d++)
+    for (u64 d = 2; d < n; d++)
         if (n % d == 0)
             return d;
     return 1;
 }
 ```
 
-This is the most basic algorithm to find a prime factorization.
-
-We divide by each possible divisor $d$.
-We can notice, that it is impossible that all prime factors of a composite number $n$ are bigger than $\sqrt{n}$.
-Therefore, we only need to test the divisors $2 \le d \le \sqrt{n}$, which gives us the prime factorization in $O(\sqrt{n})$.
-
-The smallest divisor has to be a prime number.
-We remove the factor from the number, and repeat the process.
-If we cannot find any divisor in the range $[2; \sqrt{n}]$, then the number itself has to be prime.
+One simple optimization is to notice that it is enough to check only the divisors that do not exceed $\sqrt n$. This works because if $n$ is divisible by $d > \sqrt n$, then it is also divisible by $\frac{n}{d} < \sqrt n$, so we don't have to check $d$ separately.
 
 ```c++
 u64 find_factor(u64 n) {
@@ -84,13 +79,43 @@ u64 find_factor(u64 n) {
 }
 ```
 
+In our benchmark, $n$ is a semiprime, and we always find the lesser divisor, so both $O(n)$ and $O(\sqrt n)$ implementations perform the same and are able to factorize ~2k 30-bit numbers per second, while taking a whole ~20 seconds to factorize a single 60-bit number.
+
+### Lookup Table
+
+Nowadays, you can type `factor 57` in your Linux terminal or Google search bar to get the factorization of any number. But before computers were invented, it was more practical to use *factorization tables:* special books containing factorizations of the first $N$ numbers.
+
+We can also use this approach to compute these lookup tables [during compile time](/hpc/compilation/precalc/). To save space, it is convenient to only store the smallest divisor of a number, requiring just one byte for a 16-bit integer:
+
+```c++
+template <int N = (1<<16)>
+struct Precalc {
+    unsigned char divisor[N];
+
+    constexpr Precalc() : divisor{} {
+        for (int i = 0; i < N; i++)
+            divisor[i] = 1;
+        for (int i = 2; i * i < N; i++)
+            if (divisor[i] == 1)
+                for (int k = i * i; k < N; k += i)
+                    if (divisor[k] == 1) divisor[k] = i; // keep the smallest divisor
+    }
+};
+
+constexpr Precalc P{};
+
+u64 find_factor(u64 n) {
+    return P.divisor[n];
+}
+```
+
+This approach can process 3M 16-bit integers per second, although it [probably gets slower](../hpc/cpu-cache/bandwidth/) for larger numbers. While it requires just a few milliseconds and 64KB of memory to calculate and store the divisors of the first $2^{16}$ numbers, it does not scale well for larger inputs.
+
 ### Wheel factorization
 
-This is an optimization of the trial division.
-The idea is the following.
-Once we know that the number is not divisible by 2, we don't need to check every other even number.
-This leaves us with only $50\%$ of the numbers to check.
-After checking 2, we can simply start with 3 and skip every other number.
+To save paper space, pre-computer-era factorization tables typically excluded numbers divisible by 2 and 5: in the decimal numeral system, you can quickly determine whether a number is divisible by 2 or 5 (by looking at its last digit) and keep dividing the number $n$ by 2 or 5 while it is possible, eventually arriving at some entry in the factorization table. This makes the factorization table just ½ × ⅘ = 0.4 of its original size.
+
+We can apply a similar trick to trial division, first checking if the number is divisible by $2$, and then only check for odd divisors:
 
 ```c++
 u64 find_factor(u64 n) {
@@ -103,24 +128,27 @@ u64 find_factor(u64 n) {
 }
 ```
 
-This method can be extended.
-If the number is not divisible by 3, we can also ignore all other multiples of 3 in the future computations.
-So we only need to check the numbers $5, 7, 11, 13, 17, 19, 23, \dots$.
-We can observe a pattern of these remaining numbers.
-We need to check all numbers with $d \bmod 6 = 1$ and $d \bmod 6 = 5$.
-So this leaves us with only $33.3\%$ percent of the numbers to check.
-We can implement this by checking the primes 2 and 3 first, and then start checking with 5 and alternatively skip 1 or 3 numbers.
+With 50% fewer divisions to do, this algorithm works twice as fast, but it can be extended. If the number is not divisible by $3$, we can also ignore all multiples of $3$, and the same goes for all other divisors. 
+
+The problem is, as we increase the number of primes to exclude, it becomes less straightforward to iterate only over the numbers not divisible by them as they follow an irregular pattern — unless the number of primes is small. For example, if we consider $2$, $3$, and $5$, then, among the first $90$ numbers, we only need to check:
+
+```center
+(1,) 7, 11, 13, 17, 19, 23, 29,
+31, 37, 41, 43, 47, 49, 53, 59,
+61, 67, 71, 73, 77, 79, 83, 89…
+```
+
+You can notice a pattern: the sequence repeats itself every $30$ numbers because remainder modulo $2 \times 3 \times 5 = 30$ is all we need to determine whether a number is divisible by $2$, $3$, or $5$. This means that we only need to check $8$ specific numbers in every $30$, proportionally improving the performance:
 
 ```c++
 u64 find_factor(u64 n) {
     for (u64 d : {2, 3, 5})
         if (n % d == 0)
             return d;
-    u64 increments[] =   {0, 4, 6, 10, 12, 16, 22, 24};
-    u64 sum = 30;
-    for (u64 d = 7; d * d <= n; d += sum) {
-        for (u64 k = 0; k < 8; k++) {
-            u64 x = d + increments[k];
+    u64 offsets[] = {0, 4, 6, 10, 12, 16, 22, 24};
+    for (u64 d = 7; d * d <= n; d += 30) {
+        for (u64 offset : offsets) {
+            u64 x = d + offset;
             if (n % x == 0)
                 return x;
         }
@@ -129,38 +157,80 @@ u64 find_factor(u64 n) {
 }
 ```
 
-We can extend this even further.
-Here is an implementation for the prime number 2, 3 and 5.
-It's convenient to use an array to store how much we have to skip.
+As expected, it works $\frac{30}{8} = 3.75$ times faster than the naive trial division, processing about 7.6k 30-bit numbers per second. The performance can be improved by considering more primes, but the returns are diminishing: adding a new prime $p$ reduces the number of iterations by $\frac{1}{p}$, but increases the size of the skip-list by a factor of $p$, requiring proportionally more memory.
 
-### Lookup table
+### Precomputed Primes
 
-We will choose to store smallest factors of first $2^16$ — because this way they all fit in just one byte, so we are sort of saving on memory here.
+If we keep increasing the number of primes we exclude in wheel factorization, we eventually exclude all composite numbers and only check for prime factors. In this case, we don't need this array of offsets, but we need to precompute primes, which we can do during compile time like this:
 
 ```c++
-template<int N = (1<<16)>
+const int N = (1 << 16);
+
 struct Precalc {
-    char divisor[N];
+    u16 primes[6542]; // # of primes under N=2^16
 
-    constexpr Precalc() : divisor{} {
-        for (int i = 0; i < N; i++)
-            divisor[i] = 1;
-        for (int i = 2; i * i < N; i++)
-            if (divisor[i] == 1)
-                for (int k = i * i; k < N; k += i)
-                    divisor[k] = i;
+    constexpr Precalc() : primes{} {
+        bool marked[N] = {};
+        int n_primes = 0;
+
+        for (int i = 2; i < N; i++) {
+            if (!marked[i]) {
+                primes[n_primes++] = i;
+                for (int j = 2 * i; j < N; j += i)
+                    marked[j] = true;
+            }
+        }
     }
 };
 
-constexpr Precalc precalc{};
+constexpr Precalc P{};
+
+u64 find_factor(u64 n) {
+    for (u16 p : P.primes)
+        if (n % p == 0)
+            return p;
+    return 1;
+}
+```
+
+This approach lets us process almost 20k 30-bit integers per second, but it does not work for larger (64-bit) numbers unless they have small ($< 2^{16}$) factors. Note that this is actually an asymptotic optimization: there are $O(\frac{n}{\ln n})$ primes among the first $n$ numbers, so this algorithm performs $O(\frac{\sqrt n}{\ln \sqrt n})$ operations, while wheel factorization only eliminates a large but fixed fraction of divisors. If we extend it to 64-bit numbers and precompute every prime under $2^{32}$ (storing which would require several hundred megabytes of memory), the relative speedup would grow by a factor of $\frac{\ln \sqrt{n^2}}{\ln \sqrt n} = 2 \cdot \frac{1/2}{1/2} \cdot \frac{\ln n}{\ln n} = 2$.
+
+All variants of trial division, including this one, are bottlenecked by the speed of integer division, which can be [optimized](../hpc/arithmetic/division/) if we know the divisors in advice and allow for some precomputation:
+
+```c++
+// ...precomputation is the same as before,
+// but we store the reciprocal instead of the prime number itself
+u64 magic[6542];
+// for each prime i:
+magic[n_primes++] = u64(-1) / i + 1;
 
 u64 find_factor(u64 n) {
-    return precalc.divisor[n];
+    for (u64 m : P.magic)
+        if (m * n < m)
+            return u64(-1) / m + 1;
+    return 1;
 }
 ```
 
+This makes the algorithm ~18x faster: we can now process ~350k 30-bit numbers per second. This is actually the most efficient algorithm we have 
+
+
+$\tilde{O}(\sqrt n)$ territory
+
 ### Pollard's Rho Algorithm
 
+---
+
+```c++
+u64 find_factor(u64 n) {
+    while (true) {
+        if (u64 g = gcd(randint(2, n - 1), n); g != 1)
+            return g;
+    }
+}
+```
+
+
 The algorithm is probabilistic. This means that it may or may not work. You would also need to 
 
 Ро-алгоритм Полларда — рандомизированный алгоритм факторизации целых чисел, работающий за время $O(n^\frac{1}{4})$ и основывающийся не следствии из парадокса дней рождений:
@@ -232,99 +302,6 @@ If you have limited time, you should probably compute as much forward as possibl
 
 How to optimize for the *average* case is unclear.
 
-0.087907
-3964.321045
-
-```c++
-u64 find_factor(u64 n) {
-    if (n % 2 == 0)
-        return 2;
-    for (u64 d = 3; d * d <= n; d += 2)
-        if (n % d == 0)
-            return d;
-    return 1;
-}
-```
-
-0.199740
-7615.217773
-
-```c++
-u64 find_factor(u64 n) {
-    for (u64 d : {2, 3, 5})
-        if (n % d == 0)
-            return d;
-    u64 increments[] = {0, 4, 6, 10, 12, 16, 22, 24};
-    for (u64 d = 7; d * d <= n; d += 30) {
-        for (u64 k = 0; k < 8; k++) {
-            u64 x = d + increments[k];
-            if (n % x == 0)
-                return x;
-        }
-    }
-    return 1;
-}
-```
-
-19430.058594
-
-```c++
-const int N = (1 << 16);
-
-struct Precalc {
-    u16 primes[6542]; // # of primes under N=2^16
-
-    constexpr Precalc() : primes{} {
-        bool marked[N] = {};
-        int n_primes = 0;
-
-        for (int i = 2; i < N; i++) {
-            if (!marked[i]) {
-                primes[n_primes++] = i;
-                for (int j = 2 * i; j < N; j += i)
-                    marked[j] = true;
-            }
-        }
-    }
-};
-
-constexpr Precalc P{};
-
-u64 find_factor(u64 n) {
-    for (u16 p : P.primes)
-        if (n % p == 0)
-            return p;
-    return 1;
-}
-```
-
-352997.656250
-
-```c++
-u64 magic[6542];
-magic[n_primes++] = u64(-1) / i + 1;
-
-u64 find_factor(u64 n) {
-    for (u64 m : P.magic)
-        if (m * n < m)
-            return u64(-1) / m + 1;
-    return 1;
-}
-```
-
-Except that it is contant, so the speedup should be twice as much.
-
----
-
-```c++
-u64 find_factor(u64 n) {
-    while (true) {
-        if (u64 g = gcd(randint(2, n - 1), n); g != 1)
-            return g;
-    }
-}
-```
-
 99.292641
 25720.164062 almost 15x slower
 

From dd88f5e0bdc1fbf03ac4f62c35ac458b085eeafb Mon Sep 17 00:00:00 2001
From: arnu152 <36503815+arnu152@users.noreply.github.com>
Date: Wed, 25 May 2022 17:36:10 +0200
Subject: [PATCH 116/173] Fix a typo in the prefix sum code sample

---
 content/english/hpc/algorithms/prefix.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/english/hpc/algorithms/prefix.md b/content/english/hpc/algorithms/prefix.md
index f07daaf3..81d31900 100644
--- a/content/english/hpc/algorithms/prefix.md
+++ b/content/english/hpc/algorithms/prefix.md
@@ -76,7 +76,7 @@ v4i prefix(v4i x) {
     // x = 1, 3, 5, 7
     //   + 0, 0, 1, 3
     //   = 1, 3, 6, 10
-    return s;
+    return x;
 }
 ```
 
@@ -91,7 +91,7 @@ v8i prefix(v8i x) {
     x = _mm256_add_epi32(x, _mm256_slli_si256(x, 8));
     x = _mm256_add_epi32(x, _mm256_slli_si256(x, 16)); // <- this does nothing
     // x = 1, 3, 6, 10, 5, 11, 18, 26
-    return s;
+    return x;
 }
 ```
 

From 88b757a7ceb792b7ca1435a06f4e17617e443381 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 25 May 2022 19:45:20 +0300
Subject: [PATCH 117/173] pollard rho

---
 .../english/hpc/algorithms/factorization.md   | 147 ++++++++----------
 content/english/hpc/algorithms/img/rho.jpg    | Bin 0 -> 14570 bytes
 2 files changed, 67 insertions(+), 80 deletions(-)
 create mode 100644 content/english/hpc/algorithms/img/rho.jpg

diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md
index 9f7958ed..90a1bf43 100644
--- a/content/english/hpc/algorithms/factorization.md
+++ b/content/english/hpc/algorithms/factorization.md
@@ -195,7 +195,7 @@ u64 find_factor(u64 n) {
 
 This approach lets us process almost 20k 30-bit integers per second, but it does not work for larger (64-bit) numbers unless they have small ($< 2^{16}$) factors. Note that this is actually an asymptotic optimization: there are $O(\frac{n}{\ln n})$ primes among the first $n$ numbers, so this algorithm performs $O(\frac{\sqrt n}{\ln \sqrt n})$ operations, while wheel factorization only eliminates a large but fixed fraction of divisors. If we extend it to 64-bit numbers and precompute every prime under $2^{32}$ (storing which would require several hundred megabytes of memory), the relative speedup would grow by a factor of $\frac{\ln \sqrt{n^2}}{\ln \sqrt n} = 2 \cdot \frac{1/2}{1/2} \cdot \frac{\ln n}{\ln n} = 2$.
 
-All variants of trial division, including this one, are bottlenecked by the speed of integer division, which can be [optimized](../hpc/arithmetic/division/) if we know the divisors in advice and allow for some precomputation:
+All variants of trial division, including this one, are bottlenecked by the speed of integer division, which can be [optimized](/hpc/arithmetic/division/) if we know the divisors in advance and allow for some precomputation. In particular, we can use the [Lemire division check](/hpc/arithmetic/division/#lemire-reduction):
 
 ```c++
 // ...precomputation is the same as before,
@@ -212,14 +212,13 @@ u64 find_factor(u64 n) {
 }
 ```
 
-This makes the algorithm ~18x faster: we can now process ~350k 30-bit numbers per second. This is actually the most efficient algorithm we have 
-
-
-$\tilde{O}(\sqrt n)$ territory
+This makes the algorithm ~18x faster: we can now process ~350k 30-bit numbers per second. This is actually the most efficient algorithm we have for this number range. While it can probably be optimized even further by performing these checks in parallel with [SIMD](/hpc/simd), we will stop here and consider a different, asymptotically better approach.
 
 ### Pollard's Rho Algorithm
 
----
+<!--
+
+Consider this weird code snippet:
 
 ```c++
 u64 find_factor(u64 n) {
@@ -230,84 +229,41 @@ u64 find_factor(u64 n) {
 }
 ```
 
+It also searches for a factor, but it does so by repeatedly trying to compute the [GCD](../gcd) of $n$ and its random remainder, which would yield a valid divisor of $n$ if this remainder is not coprime with it. Surprisingly, this algorithm is not *that* terrible: it needs expected $O(\sqrt n)$ iterations in the worst case (times $\log n$ from GCD) because on each trial, it can hit not only $p$ or $q = \frac{n}{p}$, but also $\frac{n}{p} + \frac{n}{q} = O(\sqrt n)$ of their multiples.
 
-The algorithm is probabilistic. This means that it may or may not work. You would also need to 
-
-Ро-алгоритм Полларда — рандомизированный алгоритм факторизации целых чисел, работающий за время $O(n^\frac{1}{4})$ и основывающийся не следствии из парадокса дней рождений:
-
-> В мультимножество нужно добавить $O(\sqrt{n})$ случайных чисел от 1 до $n$, чтобы какие-то два совпали.
-
-## $\rho$-алгоритм Полларда
-
-Итак, мы хотим факторизовать число $n$. Предположим, что $n = p q$ и $p \approx q$. Понятно, что труднее случая, наверное, нет. Алгоритм итеративно ищет наименьший делитель и таким образом сводит задачу к как минимум в два раза меньшей.
-
-Возьмём произвольную «достаточно случайную» с точки зрения теории чисел функцию. Например $f(x) = (x+1)^2 \mod n$.
-
-Граф, в котором из каждой вершины есть единственное ребро $x \to f(x)$, называется *функциональным*. Если в нём нарисовать «траекторию» произвольного элемента — какой-то путь, превращающийся в цикл — то получится что-то похожее на букву $\rho$ (ро). Алгоритм из-за этого так и назван.
-
-![](https://upload.wikimedia.org/wikipedia/commons/4/47/Pollard_rho_cycle.jpg)
-
-Рассмотрим траекторию какого-нибудь элемента $x_0$: {$x_0$, $f(x_0)$, $f(f(x_0))$, $\ldots$}. Сделаем из неё новую последовательность, мысленно взяв каждый элемент по модулю $p$ — наименьшего из простых делителей $n$. 
+By itself, this algorithm is just an esoteric way of computing a factorization, but it can be made useful. If, instead of random numbers, we apply this $\gcd$ trick to a particular number sequence, we get an $O(n^\frac{1}{4})$ approach known as Pollard's rho algorithm.
 
-**Утверждение**. Ожидаемая длина цикла в этой последовательности $O(\sqrt[4]{n})$.
-
-*Доказательство:* так как $p$ — меньший делитель, то $p \leq \sqrt n$. Теперь просто подставлим в следствие из парадокса дней рождений: в множество нужно добавить $O(\sqrt{p}) = O(\sqrt[4]{n})$ элементов, чтобы какие-то два совпали, а значит последовательность зациклилась.
-
-Если мы найдём цикл в такой последовательности — то есть такие $i$ и $j$, что $f^i(x_0) \equiv f^j(x_0) \pmod p$ — то мы сможем найти и какой-то делитель $n$, а именно $\gcd(|f^i(x_0) - f^j(x_0)|, n)$ — это число меньше $n$ и делится на $p$.
-
-Алгоритм по сути находит цикл в этой последовательности, используя для этого стандартный алгоритм («черепаха и заяц»): будем поддерживать два удаляющихся друг от друга указателя $i$ и $j$ ($i = 2j$) и проверять, что $f^i(x_0) \equiv f^j(x_0) \pmod p$, что эквивалентно проверке $\gcd(|f^i(x_0) - f^j(x_0)|, n) \not \in \{ 1, n \}$.
-
-```c++
-typedef long long ll;
-
-inline ll f(ll x) { return (x+1)*(x+1); }
-
-ll find_divisor(ll n, ll seed = 1) {
-    ll x = seed, y = seed;
-    ll divisor = 1;
-    while (divisor == 1 || divisor == n) {
-        // двигаем первый указатель на шаг
-        y = f(y) % n;
-        // а второй -- на два
-        x = f(f(x) % n) % n;
-        // пытаемся найти общий делитель
-        divisor = __gcd(abs(x-y), n);
-    }
-    return divisor;
-}
-```
-
-Так как алгоритм рандомизированный, при полной реализации нужно учитывать разные детали. Например, что иногда делитель не находится (нужно запускать несколько раз), или что при попытке факторизовать простое число он будет работать за $O(\sqrt n)$ (нужно добавить отсечение по времени).
+-->
 
-### Brent's Method
+To construct this sequence, we need a "seemingly random" function that maps remainders modulo $n$ to remainders modulo $n$. A typical choice is $f(x) = (x + 1)^2 \mod n$.
 
-Another idea is to accumulate the product and instead of calculating GCD on each step to calculate it every log n steps.
+Now, consider a graph where each vertex $x$ has an edge pointing to $f(x)$. Such graphs are called *functional*. The "trajectory" of any element (the path we walk starting from that element and following the edges) eventually loops around. This trajectory resembles the Greek letter $\rho$ (rho), which is how the algorithm got its name.
 
-### Optimizing division
+![](../img/rho.jpg)
 
-The next step is to actually apply Montgomery Multiplication.
+Apart from this trick, Pollard's rho algorithm relies on a consequence of the birthday paradox: we need to add an expected $O(\sqrt{n})$ random numbers from $1$ to $n$ to a set before we get a collision.
 
-This is exactly the type of problem when we need specific knowledge, because we have 64-bit modulo by not-compile-constants, and compiler can't really do much to optimize it.
+Now, consider a trajectory of some element $x_0$: {$x_0$, $f(x_0)$, $f(f(x_0))$, $\ldots$}.
 
-...
+Make another sequence out of it by virtually taking each element modulo $p$, the smallest prime divisor of $n$.
 
-## Further optimizations
+**Lemma.** The expected length of the cycle in that sequence is $O(\sqrt[4]{n})$.
 
-Существуют также [субэкспоненциальные](https://ru.wikipedia.org/wiki/%D0%A4%D0%B0%D0%BA%D1%82%D0%BE%D1%80%D0%B8%D0%B7%D0%B0%D1%86%D0%B8%D1%8F_%D1%86%D0%B5%D0%BB%D1%8B%D1%85_%D1%87%D0%B8%D1%81%D0%B5%D0%BB#%D0%A1%D1%83%D0%B1%D1%8D%D0%BA%D1%81%D0%BF%D0%BE%D0%BD%D0%B5%D0%BD%D1%86%D0%B8%D0%B0%D0%BB%D1%8C%D0%BD%D1%8B%D0%B5_%D0%B0%D0%BB%D0%B3%D0%BE%D1%80%D0%B8%D1%82%D0%BC%D1%8B), но не полиномиальные алгоритмы факторизации. Человечество [умеет](https://en.wikipedia.org/wiki/Integer_factorization_records) факторизовывать числа порядка $2^{200}$.
+**Proof.** Each time we walk a new edge, we generate a random number. It has some chance of looping around.
 
+As $p$ is the smallest divisor, $p \leq \sqrt n$. Now we plug it into the [birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem): we need to add $O(\sqrt{p}) = O(\sqrt[4]{n})$ elements to the set to get a collision, which means that the sequence loops around.
 
----
+Another observation: the lengths of the "tail" and the cycle are equal in expectation, since when we loop around, we choose any vertex of the path we walked independently.
 
-If you have limited time, you should probably compute as much forward as possible, and then half the time computing the other.
+Now, if we find a cycle in this sequence — $i$ and $j$ such that $f^i(x_0) \equiv f^j(x_0) \pmod p$ — we can find some divisor of $n$ using the $\gcd$ trick: $\gcd(|f^i(x_0) - f^j(x_0)|, n)$ would be less than $n$ and divisible by $p$.
 
-How to optimize for the *average* case is unclear.
+Floyd's cycle-finding algorithm
 
-99.292641
-25720.164062 almost 15x slower
+The algorithm itself just finds a loop in this sequence using Floyd's cycle-finding algorithm, also known as the "tortoise and hare" technique: we maintain two pointers $i$ and $j$ ($i = 2j$) and check that $f^i(x_0) \equiv f^j(x_0) \pmod p$, which is equivalent to checking $\gcd(|f^i(x_0) - f^j(x_0)|, n) \neq 1$.
 
 ```c++
-u64 f(u64 x, u64 a, u64 mod) {
-    return ((u128) x * x + a) % mod;
+u64 f(u64 x, u64 a, u64 mod) {
+    return ((u128) x * x + a) % mod;
 }
 
 u64 diff(u64 a, u64 b) {
@@ -315,7 +271,7 @@ u64 diff(u64 a, u64 b) {
     return a > b ? a - b : b - a;
 }
 
-u64 rho(u64 n, u64 x0 = 2, u64 a = 1) {
+u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) {
     u64 x = x0, y = x0, g = 1;
     while (g == 1) {
         x = f(x, a, n);
@@ -325,16 +281,16 @@ u64 rho(u64 n, u64 x0 = 2, u64 a = 1) {
     }
     return g;
 }
-
-u64 find_factor(u64 n) {
-    return rho(n);
-}
 ```
 
-56.745281
+While it processes only about 25k 30-bit numbers per second, almost 15 times slower than the fastest algorithm we have, it dramatically outperforms every $\tilde{O}(\sqrt n)$ algorithm for 60-bit numbers, processing around 90 of them per second.
+
+### Pollard-Brent Algorithm 
+
+Floyd's cycle-finding algorithm has a problem in that it does more iterator increments than necessary. One way to solve it is to memorize the values that the faster iterator visits and compute the gcd using the difference of $x_i$ and $x_{\lfloor i / 2 \rfloor}$, but it can also be done without extra memory using this trick:
 
 ```c++
-u64 rho(u64 n, u64 x0 = 2, u64 a = 1) {
+u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) {
     u64 x = x0, y = x0;
     
     for (int l = 256; l < (1 << 20); l *= 2) {
@@ -350,12 +306,14 @@ u64 rho(u64 n, u64 x0 = 2, u64 a = 1) {
 }
 ```
 
-426.389160
+It actually does *not* improve performance and even makes it ~1.5x *slower*, which probably has something to do with the fact that $x$ is stale. It spends most of the time computing the GCD and not advancing the iterator. In fact, the asymptotic complexity of the algorithm is currently $O(\sqrt[4]{n} \log n)$ because of it.
+
+We can remove the logarithm from the asymptotic complexity using the fact that if one of $a$ and $b$ contains the factor $p$, then $a \cdot b \bmod n$ will also contain it, so instead of computing $\gcd(a, n)$ and $\gcd(b, n)$, we can compute $\gcd(a \cdot b \bmod n, n)$. This way, we can batch the GCD calculations in groups of $M = O(\log n)$ and compute one GCD per group, which removes the $\log n$ factor:
 
 ```c++
 const int M = 1024;
 
-u64 rho(u64 n, u64 x0 = 2, u64 a = 1) {
+u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) {
     u64 x = x0, y = x0, p = 1;
     
     for (int l = M; l < (1 << 20); l *= 2) {
@@ -374,7 +332,13 @@ u64 rho(u64 n, u64 x0 = 2, u64 a = 1) {
 }
 ```
 
-2948.260986
+It now works at 425 factorizations per second, bottlenecked by the speed of the modulo operation.
+
+### Optimizing Modulo
+
+The next step is to actually apply [Montgomery Multiplication](/hpc/number-theory/montgomery/).
+
+This is exactly the type of problem where we need domain-specific knowledge: the modulus is a 64-bit value that is not a compile-time constant, so the compiler can't really do much to optimize the reduction.
 
 ```c++
 struct Montgomery {
@@ -403,7 +367,7 @@ u64 f(u64 x, u64 a, Montgomery m) {
 
 const int M = 1024;
 
-u64 rho(u64 n, u64 x0 = 2, u64 a = 1) {
+u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) {
     Montgomery m(n);
     u64 y = x0;
     
@@ -423,15 +387,38 @@ u64 rho(u64 n, u64 x0 = 2, u64 a = 1) {
 }
 ```
 
+It processes around 3000 numbers per second, which is ~3.8x faster than what the [PARI](https://pari.math.u-bordeaux.fr/) library can do (invoked via [Sage](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html)).
+
+### Further Optimization
+
+There might be a way to .
+
+It may be beneficial to start multiplying only after a certain threshold since there is little probability that we enter a cycle in the beginning.
+
+It may be worth it to run a few versions in parallel and stop as soon as one of them finishes. If we do $p$ independent runs, the first one is expected to finish $\sqrt p$ times faster. This can be done either in scalar code, taking advantage of there being multiple execution ports for multiplication, or with [SIMD](/hpc/simd) instructions that do 4 or 8 multiplications in parallel.
+
+I would not be surprised to see another 3x improvement and a throughput of 10k/sec.
+
+If you have limited time, you should probably compute as much forward as possible, and then half the time computing the other.
+
+How to optimize for the *average* case is unclear.
+
+### Reducing Errors
+
 There are slightly more errors because we are a bit loose with modular arithmetic here. The error rate grows higher when we increase and decrease (due to overflows).
 
-788.4861246275735
+Our implementation has less than 0.7% error rate, but it grows higher if the numbers are lower than $10^{18}$.
+
+Since Pollard's rho algorithm is randomized, you need to account for errors. There may be several sources:
+
+- Factors not being found (need to perform a primality test and start again if it's negative).
+- The `p` variable can get zeroed out (need to either restart or roll back and do it iteration-by-iteration).
+- Overflows in Montgomery multiplication (our implementation is pretty loose).
 
 ### Larger Numbers
 
 "How big are your numbers?" determines the method to use:
 
-
 - Less than 2^16 or so: Lookup table.
 - Less than 2^70 or so: Richard Brent's modification of Pollard's rho algorithm.
 - Less than 10^50: Lenstra elliptic curve factorization
diff --git a/content/english/hpc/algorithms/img/rho.jpg b/content/english/hpc/algorithms/img/rho.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..d7f01ad81ee48c90ae02e9b248cc880cc3665e9e
GIT binary patch
literal 14570
zcmc(F1y~$=vTvh<TX1&>?hssp2TOum2oNl|ySqaI!QDN`;1FCwfPvsna2p(gJ<i*^
z_uajF_H1_VdGGan^>u&KJw5;G>Z)H={pw-nVFkcakdc=GU|?W?H;*6SVIGhIkPs0;
zhzLj^5C|C=2?Y%o9Ss!~jR@xnCN3E<1vwcpDJdl_8v`XZ3k@kLBmYwt4o)5(9twtM
zA_82(Y}`Ctzug1|85tQ36^#%bosf%)l#1(L-X1yuY$TW!SR^<YY5*1+1`ZqMp$DJ<
z02qYF)&92N|JYz);ouPvK}g6bsE-%aV*#)*aB#5ja0m$S@Q+t}Kb{BRu@P{nI3*CD
zs2PE%?QyyM<8zQ`Bx}Cly&Q+qa=&#5Kt{nQAS5EDqi0}z%EZIV$1m_qQ0lp~jI5lz
z!mHQn8k$<#I>siZ@660CEFGPkU0mJVJpzM*LqfyCBN7rnd`wFIl#-g8mtRm=R9sT}
zxwfvp0o>Ts+|}LF+t)uZI0TuP{5~~3Gds7sw*F&db8CBN_xR-W?EK>L>iXwzdcgp2
ze^Ki{HT!RRVL$2x3l9$m5Bg0n7+BXwfy0JJpyEWtkx&B}**~G?@<+mzjL)h0f=t8x
z5{mcMVH^dYmS>gj_&3%5w`RYlSirxf*?%hbfApFI(BNPmjR%Jfhyj<ojJbX&|2Kp=
zvike2l2{nccwx(UbZ=WvtjG8qheiV|SEk|7AWp_!BbMY5{P9LgGjgj`p{}0c{D^Qk
zYryl<G6@Fim(M*#4VUmunP6?`LY7)p5ex7_gbj1<0+!Ae2jY}ZikEGfKD*nzl!;e3
zjK&TjT0qc?ijU7gI#-mhm#^E-lx+;=S8QNl%)ppRB!OG~u+QLvB=BVf(94Zr1pePh
zj&H?|;7fu@XCWbJGNqn9!aF6vFt>5<C}Ig=Vlt&w{@v99;!ejIV37r_;~{{b*V)q)
z2~8|ecJM}gtqQV`w)cg8E*IN|S@XSpGJKEg+ciAV&meIuKsE@WTR+RjfJCg(UsbK@
zK4<r)&wCe0<QdJiDn8h+pIPrRy`~vZh09~WFo8>PeIO9_0IXkSW0XdO%$>}gh;dD=
z5sf_nxXOe+Z<}n1z2u>n{x3K!6O>pdyFGN~vfcT%NMhsw6Y`>TQm1Uxi+j{@R-gV2
zI_M4A{L+_}mrd5SV#~%U82cNaF=W5pdL4Y)sGiO{W&@XK_Z9{<PEdnCTNF8XZAX<C
zK1bpRE~H!y@4`ItDqDE~NX;_^CVn)JZ#j%((Fd5$(>_02;F1*(4v+;wngxNi^DG1<
zzUX`l50m3-d_JY3kS2$6tWw7KhDtgVn(OR0G|4nBd=WLP4*-f%?p4gmwRFCn!lm*N
z;`8Ka;VyM<GyqM?$<pB(LuK$%MOzxcRr_-VPx(}1!s#Yb*K<)4GUo}NwI!Wk;u(qK
z@Jp{5hVkDOvD|RHXN<VmyWyJ}R?gJ$tbf-LypCSvKP8Q^s~tvX2Zvs8$sD`bh*+3b
zX3;#-qG#a|TyTc6L9D2qq>k=qr;xxMfFaVWS4plLt*dmUUZ%`aG*-_l3R%+{;8ZX%
zK~&*S`ar?)qx$Vgl5I*1>8=WW;gWD|L*(nk{kKbs;RnVcToHauGL>>{WQ!Re0~x7d
zTaXtfVIXRHh{5G<$vP1BgsDTsjwA@=aIsRdFXx@v(lMS)51s0@fn}Y)|0!-1-1N5Q
zV9Ax|i;f8SA2GX-rymdTo<|YEAar@<#JJD*dDx2+<Ep)Owl-55D<{8EZwrCU3Jx;A
z@dlM#kd7hj#(0KVzj3dmk0FdR&6GhY6&b9=bp!XbKZWt4%Dqb+j1jA^cjnA7cosu-
zDn^FfS;E>*G-ohEwjAj!G>W}gNXR<z9xT(xRb7eeEaE7Rv*4vXZqjI_kGOQxL!jXD
zVS<vKIP0Q7KwRX+k}hpcxlTgSnxZ@DD~8u5e=^)eaF@p=@7%o3{Oo)|TRFA_3L1+&
zrMlWa2(};_gV0Mb={j)lo^`sO28@pD3On^Av&iI)uV9-Ibo7gdr9UW?g)}hZ68rif
zed}2weRDG-EUKrFu3Dt0xe;8Qeek`y2(BuO^rO7Cb<$^6i_UAGcz>wYTt3DecpcFu
z4cLPXQEo}>1sA0ADx$m0-Z$X1wP4<?`2W-lztz4ntJq0}816g(b93LaXpNrNa_cD~
z6I~PfXXMq0Wq6B7177RUMEPUaK>FrsQ)p6-tEn?)!gyY}BO8V*QWip+kaLwWM3lzO
zN~0NH=DeI21LpIjG2h9Li|#j1&K{RL`u&WR!Jz(zucAoTKzjCF6rX5lv6uod2BM)U
z!0UVfx=_omoE5K_dvNBT=WR^za@nvdgHZA_j$<;AA1y)6mts$r9Cemb#~f;1=AtEU
zMs*M_yZDVeq>mA<IX4BjW)x=_H<H$dt1^@_R@LGRP?N^}GX=)rXV>Vf+JtZS5Y!fJ
z&UT`89GW))Tt$HaFt>~i)wQ+r7rJtk^El$d#a(MW-+I=sG#%eK5ee*RFsC2)M^<<}
zSq(o=`;_{&oF)Xpx=@=WcdL3O!5d}|;>SL^T7kH`r0W^|L@5>727NMq&@!rulaLEb
z1~tC#u)gqXaY0HzX;gwTfVV<UahH2&`n^ZzA|6(2U`F#Znn~;gWnu@k|E%&*guvh`
zU~Q`>*P5ZbXOk75uLlfQt0%B6v`?PL5F^gA<hFaG-<+rnrTgIVA;Lwe^qa!$eAYYC
z0q^9g3JVHMmet88TF2-M%@=jg2nc+4SB33y4V@8Xn=UIWhlZw=Ya<{-OU$0;$?xr-
z=cvn8<B0%djJXlO9|;q4q&eHc($Xg3W{5=ndb|`frX;3LQUtI<2W87!P076MObwmI
zZP?-Q>@%xNPS0ptV-RWF8PmuLZFROTHFy9_EE^KGn@^(ZgU2bvT2&y@KYi4vG-Shs
zx%01bdvtai${1fYKYe|WfWJS`{qgaon$B%K;LY${Ju9<alc)QBE0iu>PW{fxC=-5H
z6xx3rt1K(*dI8qBY^c%K86mwmDR$e44rDIbmLQV`6o2!fKS!pWBKQ(yUP}|bGxh_O
z@Bq8eswf}knqw2YD{jdKYZcWSoh&}mkMggvYlJvqV%EoHU!{WARl<m4BWRVLbSa9>
zvgLEoeKllB{wfN6#7Z=&I0(5A%Kkn1?XzmAdcX*EuV@vub<zq3jX}Gf`wuo>eBo5!
zBo^d_Fr9l$R<P85QLQzv@VU-@xNT$3vF^<qv~&9Qn15ogzuJ<EJL*K}uTC^eBX4Z}
zL)ytGM1u(41KYsWHFq})<B;yVFuN_?MZmsE>o{)0*pFL??oW>o+n$~tytpt~xmF&o
zVRfH)Mc6iMSm}v04YDucU1%D4bMz3KR`(tNR7;1etS$}$&9^@@<3f?0ga!!@7$1QC
z_BM|LRK4nB{8-ykiwR`bO}*Hor-$W29KzV*#616F<kX(qY$?35l}~&1-R9-e6n4VM
z_gGW4h0HELmSw{^izIG?Wj$G2Ym~zTZ4#8ch3PoAydGCkCdkzN32K@zni5lQkwwXg
z-kw9^$yT#;rnlS{q(NtDz2{ggqSdZQ8M+b2kDvNHD605}q8-fjrwirTe!VTJFF#F3
z!|H9XlzE3;B4nk=<4qZEo946D`Xy*$<j({yvQb6hpqdwG)@7DuXsHL2Mkbt;S_jnk
zgj5C*kq46z2j;nc%*&B36P40G$h}L#gT~$I4LLbn**cNEqN1}&&%RU%Lv1n&b*<*~
zqF3iz99Z)8pZ=IIkuoCqw2_<gRRKO9M+#rqhKMdInLPOs4=#Qlg^`cyf+Ob8rC!w}
z39P<uu#%!Q4vCXy$>Om({pid2JeW*oF=tNxd_+TSHKlGZgU?Dyebw;UK|X~!I*Aeu
zQ!*m<hD@*HCeGJVNk6S^Ok@j18)%uOcuSe^`s=3Cd2_1Q!<)8@{G7-w<}Auwjyl*o
z8XOBcD(H3gA{i|@^I`YuG4zZI2&BfB7W2`q#;HIzG&byrclqiz2H~{sMoF21PT|^q
z!9DHm`^2W?p=b7Wd`HcVTgByf+|MWKeEIkB?=@5W9QNnc9UFKTln166EZOiJ>=2dA
zf8B=-oGW*dMd79jkKYBO%I7ce9at%aIa8b-6Yh6k`R$4cp6N)!J=FjLI3$k5v9t_z
zc3rRxwO=I8D+kEsOT`U&5Z4&kzZYB8am!(;?(XvF*%}b*E?TI$$sF&n>HO(v5G#&=
ztM+e-hGtl+bdj5(T;j{g0%qcj65^=VN)Kv`eRJTLAgt~PbqB{%=R<+<t9lx><d37z
zB^IL{i3viTDX=XY?w=kz8PP|w7FmcG*MjmoE(`j7mIu9rRu`r8zONsfMN*@>wC>Cq
zJyXOL87(i0%3K~+={KNi9IJaCs$|HmJz0S7;pH}Zq<b}b!^G8dw&uriF788l^>zCV
z2s#x>-Ja`Fu{f|{Boa}D(00YU?FRsGfChW8un;J?^_f+H<fcp%e(F2gItZBUN2Y_k
zSb_DgTjh5mFrp7QPwOgFrWsOC_`Ctks5W_VMl%)|fccg=70OoQjSJ4inKnT$+Yva~
z6g#<m?-pZN1qC-=R!%i$g%PVys?A=3<s|1C!d`^%j{a1tc>r{=`wgCv2iZ^g30X2_
zCYI{#=_nNz88f6<IpYkIcw#S)T~Js4`kM8!BH5gUNN%*x?iBW~lTLFcSl6R!<3gVV
zZpd9pYTA7PJ@p$2+t_0kJ(V0kNB!dNS)jGBh?EM$Gqrt4^G1q(jk_-7jjUAT&oM+n
z%jss3(9j%i-!7@UoOZsB7`Zj~bP3vRV`*5favD=2(>_hQCf+QDSa#O;w|gVx2)6a3
zu%B!d^Y7)=or3Bt4O|?y<MH(GU&YOQ5lXz$8EB7TF-d#@hEx)-k?0cU%B7@e%b7CT
zp|l2W%wG=f1u&1URPe1U)>c~_c1hUMm4625U#bq*efylynxZYmT{-<t&L}YsmDn)c
z=zFVT`3{!uz>^np@=dh3Dd_`*^nyX=Ly1|fWOjU~O<ETQ1mkB%SWA0JE*uix?l`^N
z{WxrP`}?zSLZ7M^<hR~>s4I3*Tq*P&u0ghX1if7G=YwieLR72N9{@LF!o5sJI7Pi_
zo+Av4C%dTyU0<wVH$I&+RKqoi1UvYUjMrm6u^hYV=o$MI5#Ut8+K(URyO9_80I<sS
zPh3mnT1@+hOj+NmSL(J3iqv4{StMKvKLGg(Vy&`^WF4=zKLDCpSN1BAx30x*xu%On
z7B7DKBoNx^=1*XjTuYoG*85QH@e9fZn?x#$9rI3_=%Qk$Q}9VecL!d|=t3)EG0l^f
zNTKH;u@?7*5WFZ((wf8mh4idGKc4Mh#MZxJ?BDHaXg~$HsvwwEUQwLmn7oy>7M}TF
zYgzeyJo&qQn!70>zcc!lpC=DMwBYxuP@V%+Z5bM^x(*y|g#K@hJ*CV0cme`jv>LYE
z%Q2u{1FA^r?{(W1^i2ccR1lpaOh!4T@XBdcm!D7-#e&6DJAu11x{fzDPSNZ}z)rRW
z{pc_e)r13>%HI^mVLs93ylPdbvFHH^3a+TSpRMS3Sv&00m;MnlC6N)ILw6zL+e$!t
zh=5h~A;S&5H9=+p4b1`b+15#+3W<^}i#$)9P>y;MM?qZtWyp$*tbqi1_85qrBGR_<
zeJii4b)L-lp1e)}*Oocep%9tZU1S?H@?>E5H-4~5#&z<bNH%s}eOcu?Nb6pt^_sx_
z7~Y3+e{MUZyP@_YPNXFR-gj`6Pn!zd4CWQU<JknRuJZZ1ZEfH=6mleW4XW4ZyAf(y
zfJ;kYM$e=#G;FpU43T#CM%C-J$EANNnWj?5cb!U)n(=i-RI34kxR@6KG_5XbSlf1A
zc%)B_&!^>cQOiD!*Wps`M^jfOa%n&b;D6&MQK1lnL{$%tSXyFS4WLCPwJ%neh64Zr
z0E6@8FNFz^$a+JoA&gaF=Kx!KFMIOFmWxYPON`@A%=xi`G|NOeW6VjdH4>s$)(em6
zc^`sfX>i1TW#xf2JL@{Q$<y|%xGR|nW}%URd34)yZCg;#((IK!eUByR-XOTyz2kC;
z`{x*}n~%^;D}O<M3F&e{x}t;p=#eD)a;23_M9iAofptM+N<%LWfy)_Sk_xOY-!q9t
z4y7!buBPetn5--+3tR{pQ}|mRF-Ke(+OlL4B<eN4WIG=&;YH?hP}f`I{RxkHJ#F=G
zqS@b~Ai4Lk+3Nn1$k>5&)Oc_4Op7qNEz<7HLViq^vz+MCHMub~Z8hVdsz)(}w<u<5
zUD71JLdwR>?4_DNsr9boE~laFG-`#J)KyvbZdtr-xdeT(gJb`vj@GdFT_4kIjNJ&v
zY}j9w8f&jt1qN=Fm))ug2PLzTwaS^=D5a9|r%ntQZ6cOa1(g=5!c+E^1l99{o*V&-
zq^HaY0N_J&T2_W@MI%>J9DjcbGx#$V@cs?!?K^NQO(Q#9lQepzr;2EGoP=6Z(y1O(
zTSZM%J^skaT~RTpM%-I%>X<ce1^wkNqZd^DH={pQL8iuL7+FTvrZ)bR;PAK6;+-hj
zfEd-Q@js_qgxBZwvevy7-M&~dVp`|LxTPxo{y2y6*|jTXUyEQxv376q4p@Vm(Y(^g
zO<!(+sWb3)8|EUJb4mM)GdAM0JZ-F)m8jUQSDIwd&xg8Ip~RBPA<qtIte&Z+ADDmY
zj~$eD;rXSS`y~#ZA}0Wopkk?0R$_7Svmx2FNWa9&s@Q<&H7QOxl7L7f4-DS7jDkHe
zlcuKn`qmra04obJsUZ9D4;P~6KE>N}B1!&v^A<W@)?@{r_oKCkvA8pfG7GEz;)Z(9
zQKmbH4>#C9ogH8Z1lV^EKYi=dD50Ox+ryW1qNg`P%F$PlDNo2$OeW9Z$2{tVTDmjW
z(PKzrsAH+BkDLh+z@tO|v>pjxJgW57F^2m*8&f8?`c}tg9eP_kYk6Xk*tNSyEfc$W
z8q17MlHNom^Zng|Cd#sN$u@W{uJ(I9(U11YLVi<Y2<&`*gE;oh=l)@Y>Mw)uXe{*0
z*D)zUYSNB7Yd6RpH8_Mt?&r~cOQd}jHJqEG+5yd#5_hxSL#~w0n|X<SbDHzc%vQEs
z(4A4zCv7c#oqBS+@m~T~!L=MDVMrJ&dU$UDK?SDF<PxbeE!G8QYS<%^eoC0Uc+$L_
zFEmU;!*Ox~1GI0x5B*_@{&z6+J218;O8&IA?bzj$$pY*7U_rFo^-Fba;C5KkR!=rm
zt-^=d?Hk#_=S>ol0;w^wsG!)&as9S`(&p65*a~;v^M2{g#|jHkthI}L7jieLr<;D1
zUL6#Kd=m0f9)7({dJ}N|p#F^4nvd)y8JoW3I4^Rdmqs*KxW4<wt7kMDHsbr<Od9}R
zu-WQOnJwWi@qX&}g{eypT270R`%*Mx7Lh^F#1ZLMlvuCg8sZqeA(}|3Wx)rTX&OdE
zne}rxh7L#r&GJT33fKqnppAPO7x*jlAe?s*pfAx*-e=99X4)#{S}W#=9(7|gf3?=i
zuo~qNl!e2?zQ4@<O*fzIRh@5tmUT<*fuZRju@Z?aHFMue88cdd2vU{Um2rXg4tR-E
zj@3HPY5H9>G2=(R-WnnrLQ<x!A_d@}o|Qpq%6e+~^~3>twkAf+qU7h|_t<fS*y%Xj
zhn5+mq5F{0m-=k|3U$qszU!Ov@8QGk6I|U(3(!)~?sTuhx8htFvoQ?BPz_jPvI_Z%
zU?x}#V%Fr6#>zvUr+rfoZ|$n-D#z9kDVJnhTzp(l0J&^4=W?g!9p!BK)Ym2X#zw_(
z?TCSbPl9kHR8d7B7IEGgHF2V!^s%<tJ6IMCRl@PTjVU&CT=LAd_^vJ@+9@fqO)D_H
z=KteD{ILjiF)8ZPSBJSa8E0K=&(#JhA&o7D*Q~^`cYfIqz&J$H;*X)m;nD|Hw`I7j
zyUMa=QOOk#@*-f_yCaE^#l9vio}A48Rr*Hl{b!{+X<V2Pr^4huWH0Ny7|7CMJ_cV_
zo!(#&o2=?_FP3;eZTW@-faT^J^#EWyvm?yT%xurow6|gOtxocPi1a@R{6d>~$2pNH
z^0Kvd-7xh1tyIyrOSZri8~fDM;0Gw&DV5QpWTj8zWj2P{Dn)ObtKso$55zWVebjnK
z_%4p9s8X%KC|Hh)>Gbu~%hXfMP(u*QEoNy$g<~<9<Smt)T;#Bv|4JN}Jgt3}q}9as
ze{w;U?49dF)3;VIl=0=)S(lTXsRzuV(6OZXwXEoh^jJ3M&n;oDo7B-i3jo(YlT@&G
zp=i76uMm|zbn?`Kiu{OgdNfTeDl(fWo!i=R7$?o_oUH$ZiFDfD@U-n8+vqnywA1lf
zesZhPKB*A$8M<Kjw!sF8p9^hBwz=|=i*sscn$Cgt2cMI@^TMkSYAz92y2JCur}W`f
z?9?Hh0|o9M9L}`e-c(lUt+NV;I&P5%twXahlKv~UH@ZvB+**MX?XY$|X!^lqCLyY`
zqc@dSjg<ubyA6U5K%nzFk(kMMpP@rBe>SW>`HOFJz+v(41iD`dhJSxet(R0uW#np`
z%Io)QkSY0U!GuydxRY!QtJQ4JG!?Af%GHiK-j)&?O5b5#YF>(~4VUxb+h>rD8XvN4
z4H+@k-rPaz%$((|_xCJT#s4I2c}nyVw-yT*Ex{87SQXY=hQ_%qcQ4_-;B9^3Y%LB4
z5%@;^s$VAVkzA`^(gZ<wbhWc|j?;5G*6bZWkGlBiYJHA}de~5Xy0dEi+ka~E4Qp21
zGsHYklanF|_8Ud)4mqWvng-PjMeH(CVo4Ao`f+=Sl+m}Wj_|8T8QB~pyeIGdfFlMM
zucO}O#V0naSm`D|fSzZ_AJ=hqz<xv{e5d7uw>b8v^i%c3dWnel#A~3p5?@m{bI?;}
z?4Emq2Yk|A$M&^EtYYv!08xrjyx@?1dC_TpidEwKI+#W3QtI3&rYJxNMdzO+gN@7K
z9?`uW!F?a!^jHcm>=o(8Av15<9rRPeFF4M%Fs-bBda?-2(*34Hi5pIg2UZqeaP^R`
zZ-|4Gk+U&|UrZdFWw1>s*XuW2)Ua^IN~`&66#qSgEx?YQxH=k6Ob`YdHZ6S$u~l}%
zP~Ikt^7VB1aY5ZO*K_RR*UkW~%*juhr5)cS65C2IJEieKgDJg7JciGtg;vxrLfmzN
z=qjpmk4TR*Dv<{J<OVLN{R9R7$6R?njmS1^LqW0bj*jV6j>`?}L9Ru<t+ZO=ZCTQl
znXw~92!+<<6Puh$J<&V?1|g}t-faOSg4hc6tALTJU1He)*7;}!l{!m0c?6bq%IcK>
zem-1^3fHhmVkYWfbwa=i)(`)`y6o>9qun`b1WJxo)$3-R^Ti2&6|v*?KlJe74Tdhu
zW+_0Oo){G0k5A@%OG1Oca1FV>xr2d&mWqlQ9vtY3Lmi^W4fVV$Z{J{Fn9IGzycw<5
zba|XjQTW}|hk101=8GXtf^SovTSkuqj$QRK<F(3Noq@=VX10cG_^p`jnlGmQjwYJ%
zX)@<fhsRS8<N5*?I$$-fSLAmq<ydZkqqW;9Te;)lova!uqUSveQu)|NNKxVJ^h+u5
zUpwpneXsfH*6=FPeZ@+MS)Yk9O|=c9OJ38VC(=JoC_!Z+&>+J=;#jJ?<d~$_!!_h=
zDn@8Tkcf?)U6;;!ymOWauH-HEv>*2f;79@Tq14+m#9=4UQ?oCztVUXGMg<RJGxI0m
z+Tr@Cko9{_kA45~0HC*{RRx?^9aJc)KGN3M3Gm@aa$q}Mh6MZgc_I<3F=2J`4F2GA
zq;q{fZ{^-xYDFP8Vhj2$zlheuZyauO?gM?|h`o4UuBacUR&-k|&QfT;rw%QkZcJn%
zr7lnh2XRJlkhv{zAYZ$ww!h|5S);)5?V?X^I4VHhn53_^pasYEouJlQVPS`JmG+~J
zv_<&YRb>J~5zLX_TaCO6_rOy-))p-IgVf~?@*6g$4cv$PXkK-+(V6ljUH4}hA8S^<
zb+$A<H(7PC=7SqKt|M(OHYI8(SmA?OWw-jKpHYG+m@5FA|NfZO^s=H^t=VjNIGhIA
zc%QdvRJY)>3%gjyH~nDBhwnqg;Btu@1b>aCk(<nc9tD(F+={H0H9DitPc`6wo{Myt
zCee|nr<Oo+nkC^{p`>JjA?NTT50!oeu_aI&FSkut6AEeq$C#!Hr1Hjb1vb*)ye8d(
zPY@&?I9_UE2yG=QUe*nmHWf}&hodWWtOv-hDerM%qE~taFQbHI*Jhtw8rWW_`@iA-
z(dFX;8qXiRWo|7uL^kKVGVAhq7frq}*IBxs2a{FSATpD9HTC@;X4-ZJJfEAP_cfrH
zx=*RH4H6g???dlJS-CShU7dDCYDWnxr@DU5zp4)S`OP$SZWTU$por>2RN$*PB86&r
z>;O!U)5w=TL~haVX$he*WzTMyFD5DpfAo3N7>`|uWdI)1xXzDJj(M2Zn_nuihd63<
z)vC#<wq(BkmPYpD+~H%;+zAUe|B#i19(o<tw+0qTpR2e7N3+%q()yR%teM3=GGwa7
zH@ObIba8P%Wbov~zp4zu`8nZgVm<)KF16&5p^)1}A8)Dsr$x`hL<cr0#4!4-6Ar=%
z7Ru93%EAVPOdwuHj>UTod;=HB_?l0N%g4KpuhS}m?(N>a#Z5oB;ryE8>!v&h9Q-|K
z{}FLnYU%MYGq!NSlHjx)zhT=zo+KOY_ADDMi^wfQm_3&yY?WUyxI5C5MEYjp>jKlA
zP?pgPsAh;^x5x9W_h|=$^=E~9s!a@{CKZ#I(r|tvZF>Y=I?h|+^)>6!4vRTMxZ~yd
zdsZ-eG#C(p+fWMSpMRNl>Yun}&)<pH?YBnTus;0J7B=F_()-TY*=4v!IGlo}P%GVJ
zvw<-it?5=MU*B-p)83dAq7ZB%k8Gaxw7MoLBY4Ow?YPBkIHOfPnqRRFVm(LlJPzb=
zBK*1av)l|E+o2+7CxD|2>pupg`8$YMva6$CSYcDWm>)zF8ol;$ErIU~k*p+<GBSJs
z7hz0LM3Xe&=)*pJaErQ<R-ue4_tnKdU*lrjFASLqTb!^fDh6cK>$Q?a;Jj$u)2!4V
z_QEyQx!|WhszB~*m=N(Vr0Hffv?mDO&mM#8>{U^Bax?F>DY)qcfmA>cmK0_VFztUU
za>Yy#M$YPR`U*2QG6kM9-_&3OdnklKKy~IIQ%n%}DX+y90-}tH&;dF-cm8o_1W!dB
zvR_}pwyrcuseNFBl^<WdE0#khv2VG73$rnG&ziOxnv!$XY;njlv(8>%Y1y*m_62S*
z4)fYPJBl2J{A10^Q6xFhab@<Lci_Uj!OCX|w}@y)(u@La|NcV9F9MyFp6mTV^|m<X
z#LiQRXkb<&niSStl9Z2K(Vfn;)(5**L!yLk#;<FRor}VF6+NTb<skE;jjhemx#G`M
zv*W+X1M4-}nB`V%zL-0gnkP#m7vPMs;*tINw?pq)iI3)=9MIZ|SDf;TagOZi7PD7S
zwbXy42d6wbX(-dKj09)`j~6h09Gzg#e*nz9BL}l=)3&Z30ITE!@^^ekr<d+&4Yw(a
z8CmR$Yx#lQ&C(UOnhh76-L!J}JW(IRInv;ezRZv^40ez-cs0%UTK;W&?vF{U*Zi(7
z7x)Q_i=(9Y&=PmL_=Zx(UHNQ~iZEZpV}nH@UWz!`ope=QmNs!lGef)E{#Z;a@>Oei
zF4X%&L3u^GV~Jc5xmE1-rdY7&x#)+m<Sab4yiD_=TBx2<vq095#Mf`!-jQjrae-dl
zwvjXD1>i=|YOPD76z9Q+Zv)w&vtJmNgZpq5U8YpvMt!p%bk}@p{V^K^Km2rt^N>qr
zRo4UmBSvucjD@OiS)D%>XxhBbm%M(+&F=xQ@lt~xh81yG*o#&<@R`^=0Gddn;tpE(
zzmmzS8qDhKYP_6Xn(w!_<~m4tO5jj^MqQNa$6ZAah8~4(rY;T6&GjvOew=4ptd~WJ
z?A+!+U<)rT1n16KTG)zwUZ8&f=vVm$GQWygNv1Hyb*)k7Div86#bLf5gWE2*MUBNm
z#{-2&r%YZptea)A+xQr?9qXd(_6AhbDI<Y$vJw9Alab^?`X<r~u;OBX^lrxgqiWM6
ztGqg4wtl|tLtbZXWgtxArgB*EI0%%g9469yaU@PMcNyk1inbG5YzbC`N9`PCIha-V
zsJX!Fn>SCb>S6v=&yG7U5)(**%*rDpCOq>@v+3T)@crpgL5>@>;eo)d#9B-Jcjxbh
zi+-H8W|?Iu6n97`_H!+Qz;o$nJZ#Cw{M_FURKWTxV;6X#>#P=DLMw}g8Z4z#Q{To9
z1rQ}U0*~K<*2M`8;J#H<x)t?|CV*X+jT>lxxbv;Qp_yqo)gN)Z2aSuZl0ju+3%Hnu
z5k6n%c({jd`rf*wZvXn&Ww}16tJyH*(D~Wf%!bUQ4f@wz2)_Q0#FLo7k~<M*%Lt%5
zw8xq~%0^gpPC17#VnG=^Lh(~LLsYf%{Kt9!qUh{GZgqVJ#z@&7u}ltrZL!v)*AD4t
zgA=hA;9@^cLOrn081}?T>NHfsog1Ksf%YYp8Lq{IE)PK79Q_UnEe0riRP?ce-8Puf
zT|_DfQz9OqNkI`bF+(egleII4^>|vmLpj7%JxKH_Q(EwEr)MSYkk8Hva{T}YZfR+I
zM}&NIk+0+3IB0bV9O@J=tzw(5GI9s)ta*^wg{9mu&^%0}$GzuX1;H)T`w8i6wMwsB
zqjn>_ZExl>7`!hXntf3_<Ou|3^{i-MPYAoO<x|Y128j|@+n?&}jBM#+$x?om!8-je
zv7-+Py(P(UGjrT&U}7_QWL#aJaw$<AYR@w~HoFDhx0<fM^{q4UMKh0vOf88|*H(#n
z_hSceb`I`x>$?Q6-Z6LUSX%s){-DvYHw$9%);c#nv{L>f*#6sd%t2RC%X9ph1W?H5
ztykZ1<l~|a17J2A`zKjlox63i4t%4)kdQ@;&_cS##|*y6`p4Gv+u~buxTb&XMF04O
z$v4%MkparkuskDl7~u%I$?+hqRE2yQdHpOdBTqk+C-QtU-bw6Sak_UJpg!g}j{aJ^
zFvA~;-iXyz{Ctm{i1v)8%4a<WS*T7Gksw@Hnm=@^vp3I@IIg)-i*UB^h?Zi_Xquvk
zG``o$U}=75ihF5sh`IqBou4GLU5Y4wTk0XBI;8$MWQjx*zz7=>Ua_-84Ke!8Q|51P
z#^)jjLxzDwd4eJ9aWWdVOp?`=dYeigZG`Z>w|%)zNr)_>$0QpyM!f&o-h9OQ6=ld2
ze83XpS}-Dxz$8=g8Q@ffdX)jy)tDAmv9EM<ulPX_#B*^XmZW|h(0N9=&d{y|IFl1z
zZLCJWlH&JR9J1h#Z<u8ru;BOLvWAa8nDnWy_F*!2l-S%!^=T^&;Ock_*|6@Osa9KX
zt@K?z`eMy5Uac65JwI=_+qG&I{h?uiFZQFX(n5l=40ntGV%u9k0heRj$SQpTMJOJG
z#yRa}TLh4$PVxzY&b-D}DmKw7RbW1zG^oTi_Vvb_f5;BAQRmP+8-sp+)x~duwXdG_
zzQZBEI`MUr#p@^A>6GL%m0fdFOG5|&Q%j6gYQZtID0$l;Jl=nOXz+=DBBaqU1b&zJ
zUF{lvoE&eohWpwZv5g3dr=WINif;|Aj%5$Pn=2BX$-q)!zSxagr&B^}m0P!ntS%kR
z3yL6~XpVK~`h5E)I3nvj`~GPjnA+bNfB09MD2QYaKxCaa*ScE3p<<o4;G#`s#LTxn
z(ie6V4moIoXAC5UH__2y^tx$oMG}<qSc2g+CU1~NnL6=Adtim88P-HddWb7aFZmjN
zY78@etm{j4K`1)K{`9q7(dXy~p}I{mN$3o8uxgjtwDOAk06adczp&KGrCYFvw1IBT
zrU^2SL-xu>*85S(p$iXJp4$qo7%2>2)0ZSud6@*~iC3z`JO9xn`a2sfZPuBciB*tm
z6yx(KL53lIvpLJu*5;6Kv`1Puap*YBe6<CQ60F+<e=<bmpNR-YidP=$jR?6WmuQ5W
z-2OLOc&qochwY)=b2tdL2O<UtAGOXyxJ-TT08M*6-nD4Wb<s5BNGA`rkF4KLqgWCf
z#YJ1xqFsND+4{RW^Y^Au#lJoK`_>T3A2e|O%)u|C?YV;U&DyfrU2NI+7vCxns$YQP
zw7=7Ffv5tn-%B3IyugxJjgjqHCgu0LW#dJnls{tlAOS0XDTO8f?ob>q`y6k4CzUlR
zeltxyBR?5tqp2xjO$JUt|Ad=wJc-KDPdki=L4X3MmBD!J2(+MU@NMb=FddnE730EH
zvMUtz<X#$Z19#g-G=Ip#F-g{rU{0DQCb6K4cB!LtD5-OrKP8Q6JMF`qz&m|RZDRuy
z+~;H?ovj8^$x2+Ut+1Zao)irBaenFiX0?s-yRDsnt<L*f!1z~w%N%=^=Sm0euFzm4
z@a|n>{h)Za`7y|MFxmyL>uR4ZZ_N;4+cfL)XG)v1-bvE7VU~5T#~8@CXtuw|BB_C+
zv|M8>pAq*BkPQ<%RQ@M1y#Fo*fY%-UGr6x&>}jCfxZln|l7-mShsw1^6#%~D7!=Jr
zPK3CnenZ{ji=&$@F5BR$3Ud7uBGkVc(f;f82-cD(B0?{k+fzQiZ3y0U=m!ws)9=-X
zZ&~V+rdRvoPMZ=u8|O|mX-HBvxXezBWP625tnW#8Yji7bB6}4X69pC^&HcBZ?$sw}
z=46eE=Cyl(7^EaxxNH=FV#tr>m;CtmeEZM&4yNl0`j~`=dlBR$*)H#foPT!=QFI$V
zjd{f8%H2lCO-nK24_<}K^5fx`I3o-Xz!Ngpe;v90R3#Oq0lRkidEYf#;=MP|PU|_}
zM$K+mucP)iY}$NW_|D2KS2J!+JgpKq7+m0<;#!My*!yWCTsXn<9H7IkQ-xJ~yuttf
J1JJ|l{{rESw}}7%

literal 0
HcmV?d00001


From e796f1669013f46eb3e7d31db9f227534b78a2ed Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 25 May 2022 19:50:36 +0300
Subject: [PATCH 118/173] pollard refactor

---
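For anyone who wants to compile the refactored Brent-style loop (second hunk) in isolation, here is a minimal sketch. It assumes $f(x) = x^2 + 1 \bmod n$ as in the article, `std::gcd` from `<numeric>`, and the GCC/Clang `__uint128_t` extension. The name `find_factor_brent` is made up for this note and is not part of the patch.

```c++
#include <cassert>
#include <cstdint>
#include <numeric>

typedef uint64_t u64;
typedef __uint128_t u128;

const u64 SEED = 42;

// f(x) = x^2 + 1 (mod n); the 128-bit intermediate avoids overflow
u64 f(u64 x, u64 n) {
    return ((u128) x * x + 1) % n;
}

u64 diff(u64 a, u64 b) {
    return a > b ? a - b : b - a;
}

// Brent-style cycle detection: y is a snapshot of x taken at the start
// of each round of l steps, and only x advances inside the round.
// Hypothetical name; returns 1 if no factor is found within the step limit.
u64 find_factor_brent(u64 n) {
    assert(n > 1);
    u64 x = SEED;
    for (int l = 256; l < (1 << 20); l *= 2) {
        u64 y = x;
        for (int i = 0; i < l; i++) {
            x = f(x, n);
            if (u64 g = std::gcd(diff(x, y), n); g != 1)
                return g;
        }
    }
    return 1;
}
```

Under these assumptions, `find_factor_brent(8051)` returns `97` (8051 = 83 × 97). In general the result can also be `n` itself when both prime factors collide on the same step, which is one source of the false negatives the article tolerates.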
 .../english/hpc/algorithms/factorization.md   | 33 ++++++++++---------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md
index 90a1bf43..18d46824 100644
--- a/content/english/hpc/algorithms/factorization.md
+++ b/content/english/hpc/algorithms/factorization.md
@@ -271,12 +271,13 @@ u64 diff(u64 a, u64 b) {
     return a > b ? a - b : b - a;
 }
 
+const u64 SEED = 42;
+
 u64 find_factor(u64 n) {
-    u64 x = x0, y = x0, g = 1;
+    u64 x = SEED, y = SEED, g = 1;
     while (g == 1) {
-        x = f(x, a, n);
-        y = f(y, a, n);
-        y = f(y, a, n);
+        x = f(f(x, n), n); // advance x twice
+        y = f(y, n);       // advance y once
         g = gcd(diff(x, y));
     }
     return g;
@@ -290,13 +291,13 @@ While it processes 25k 30-bit numbers — almost 15 times slower than the fastes
 Floyd's cycle-finding algorithm has a problem in that it does more iterator increments than necessary. One way to solve it is to memorize the values that the faster iterator visits and compute the gcd using the difference of $x_i$ and $x_{\lfloor i / 2 \rfloor}$, but it can also be done without extra memory using this trick:
 
 ```c++
-u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) {
-    u64 x = x0, y = x0;
+u64 find_factor(u64 n) {
+    u64 x = SEED;
     
     for (int l = 256; l < (1 << 20); l *= 2) {
-        x = y;
+        u64 y = x;
         for (int i = 0; i < l; i++) {
-            y = f(y, a, n);
+            x = f(x, n);
             if (u64 g = gcd(diff(x, y), n); g != 1)
                 return g;
         }
@@ -313,14 +314,14 @@ We can remove the logarithm from the asymptotic using the fact that if one of $a
 ```c++
 const int M = 1024;
 
-u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) {
-    u64 x = x0, y = x0, p = 1;
+u64 find_factor(u64 n) {
+    u64 x = SEED;
     
     for (int l = M; l < (1 << 20); l *= 2) {
-        x = y;
+        u64 y = x, p = 1;
         for (int i = 0; i < l; i += M) {
             for (int j = 0; j < M; j++) {
-                y = f(y, a, n);
+                x = f(x, n);
                 p = (u128) p * diff(x, y) % n;
             }
             if (u64 g = gcd(p, n); g != 1)
@@ -340,6 +341,8 @@ The next step is to actually apply [Montgomery Multiplication](/hpc/number-theor
 
 This is exactly the type of problem when we need specific knowledge, because we have 64-bit modulo by not-compile-constants, and compiler can't really do much to optimize it.
 
+We do not need to convert numbers out of Montgomery representation before computing the GCD.
+
 ```c++
 struct Montgomery {
     u64 n, nr;
@@ -369,13 +372,13 @@ const int M = 1024;
 
 u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) {
     Montgomery m(n);
-    u64 y = x0;
+    u64 x = SEED;
     
     for (int l = M; l < (1 << 20); l *= 2) {
-        u64 x = y, p = 1;
+        u64 y = x, p = 1;
         for (int i = 0; i < l; i += M) {
             for (int j = 0; j < M; j++) {
-                y = f(y, a, m);
+                x = f(x, m);
                 p = m.multiply(p, diff(x, y));
             }
             if (u64 g = gcd(p, n); g != 1)

From 54fe1ba3afb88fdd5b2ec9a041c74be42469afb5 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 25 May 2022 20:23:30 +0300
Subject: [PATCH 119/173] trial division edits

---
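The wheel paragraph edited in this patch (8 residues out of every 30) can be sketched as a standalone function. This is an illustration rather than the article's code: the name `find_factor_wheel` and the input bound are assumptions made here, and the function follows the article's convention of returning `1` when no divisor is found.

```c++
#include <cassert>
#include <cstdint>

typedef uint64_t u64;

// Trial division over a 2-3-5 wheel: after ruling out 2, 3 and 5,
// only the 8 residues modulo 30 that are coprime to 30 are tried.
// Hypothetical helper; like the article's version it returns the prime
// itself for n = 2, 3, 5 and 1 for any other prime.
u64 find_factor_wheel(u64 n) {
    assert(n > 1 && n < (1ULL << 62)); // keeps d * d below 2^64
    const u64 small[] = {2, 3, 5};
    for (u64 d : small)
        if (n % d == 0)
            return d;
    const u64 offsets[] = {1, 7, 11, 13, 17, 19, 23, 29};
    for (u64 base = 0; base * base <= n; base += 30) {
        for (u64 off : offsets) {
            u64 d = base + off;
            if (d == 1)
                continue;   // 1 divides everything, skip it
            if (d * d > n)
                return 1;   // no divisor up to sqrt(n)
            if (n % d == 0)
                return d;
        }
    }
    return 1;
}
```

For example, `find_factor_wheel(91)` returns `7` and `find_factor_wheel(97)` returns `1`. The offsets table is the "skip-list" the article refers to; adding the prime 7 to the wheel would grow it from 8 to 48 entries.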
 .../english/hpc/algorithms/factorization.md   | 50 +++++++++++--------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md
index 18d46824..9e886375 100644
--- a/content/english/hpc/algorithms/factorization.md
+++ b/content/english/hpc/algorithms/factorization.md
@@ -6,7 +6,7 @@ draft: true
 
 The problem of factoring integers into primes is central to computational [number theory](/hpc/number-theory/). It has been [studied](https://www.cs.purdue.edu/homes/ssw/chapter3.pdf) since at least the 3rd century BC, and [many methods](https://en.wikipedia.org/wiki/Category:Integer_factorization_algorithms) have been developed that are efficient for different inputs.
 
-In this case study, we specifically consider the factorization of *word-sized* integers: those on the order of $10^9$ and $10^{18}$. Untypical for this book, in this one, you may actually learn an asymptotically better algorithm: we start with a few basic approaches, and then gradually build up to the $O(\sqrt[4]{n})$-time *Pollard's rho algorithm* and optimize it to the point where it can factorize 60-bit semiprimes in 0.3-0.4ms, which is almost 4x faster than the previous state-of-the-art.
+In this case study, we specifically consider the factorization of *word-sized* integers: those on the order of $10^9$ and $10^{18}$. Unusually for this book, here you may actually learn an asymptotically better algorithm: we start with a few basic approaches, gradually build up to the $O(\sqrt[4]{n})$-time *Pollard's rho algorithm*, and optimize it to the point where it can factorize 60-bit semiprimes in 0.3-0.4ms, almost 4 times faster than the previous state-of-the-art.
 
 <!--
 Integer factorization is interesting because of the RSA problem.
@@ -27,7 +27,7 @@ typedef __uint128_t u128;
 u64 find_factor(u64 n);
 ```
 
-To find full factorization, you can apply it to $n$, reduce it, and continue until a new factor can no longer be found:
+To find the full factorization, you can apply it to $n$, reduce it, and continue until a new factor can no longer be found:
 
 ```c++
 vector<u64> factorize(u64 n) {
@@ -41,11 +41,11 @@ vector<u64> factorize(u64 n) {
 }
 ```
 
-Since after each removed factor the problem becomes considerably smaller and simpler, the worst-case running time of full factorization is equal to the worst-case running time of a `find_factor` call. 
+After each removed factor, the problem becomes considerably smaller, so the worst-case running time of full factorization is equal to the worst-case running time of a `find_factor` call. 
 
-For many factorization algorithms, including those presented in this article, the running time scales with the least prime factor. Therefore, to provide worst-case input, we use *semiprimes:* products of two prime numbers $p \le q$ that are on the same order of magnitude. To generate a $k$-bit semiprime, we generate two random $\lfloor k / 2 \rfloor$-bit primes.
+For many factorization algorithms, including those presented in this article, the running time scales with the smaller prime factor. Therefore, to provide worst-case input, we use *semiprimes:* products of two prime numbers $p \le q$ that are on the same order of magnitude. We generate a $k$-bit semiprime as the product of two random $\lfloor k / 2 \rfloor$-bit primes.
 
-Since some of the algorithms are inherently randomized, we also tolerate a small (<1%) percentage of false negative errors (when `find_factor` returns `1` despite number $n$ being composite), although this rate can be reduced to almost zero without significant performance penalties.
+Since some of the algorithms are inherently randomized, we also tolerate a small (<1%) percentage of false-negative errors (when `find_factor` returns `1` despite number $n$ being composite), although this rate can be reduced to almost zero without significant performance penalties.
 
 ### Trial division
 
@@ -57,7 +57,7 @@ Trial division was first described by Fibonacci in 1202. Although it was probabl
 
 -->
 
-The most basic approach is to try every number less than $n$ as a divosor:
+The most basic approach is to try every integer smaller than $n$ as a divisor:
 
 ```c++
 u64 find_factor(u64 n) {
@@ -68,7 +68,7 @@ u64 find_factor(u64 n) {
 }
 ```
 
-One simple optimization is to notice that it is enough to only check divisors that do not exceed $\sqrt n$. This works because if $n$ is divided by $d > \sqrt n$, then it is also divided by $\frac{n}{d} < \sqrt n$, so we can don't have to check it separately.
+Notice that if $n$ is divisible by $d < \sqrt n$, then it is also divisible by $\frac{n}{d} > \sqrt n$, so there is no need to check for the larger divisor separately. This lets us stop trial division early and only check potential divisors that do not exceed $\sqrt n$:
 
 ```c++
 u64 find_factor(u64 n) {
@@ -79,13 +79,13 @@ u64 find_factor(u64 n) {
 }
 ```
 
-In our benchmark, $n$ is a semiprime, and we always find the lesser divisor, so both $O(n)$ and $O(\sqrt n)$ implementations perform the same and are able to factorize ~2k 30-bit numbers per second, while taking whole ~20 seconds to factorize a single 60-bit number.
+In our benchmark, $n$ is a semiprime, and we always find the lesser divisor, so both $O(n)$ and $O(\sqrt n)$ implementations perform the same and are able to factorize ~2k 30-bit numbers per second, while taking a whole 20 seconds to factorize a single 60-bit number.
 
 ### Lookup Table
 
 Nowadays, you can type `factor 57` in your Linux terminal or Google search bar to get the factorization of any number. But before computers were invented, it was more practical to use *factorization tables:* special books containing factorizations of the first $N$ numbers.
 
-We can also use this approach to compute these lookup tables [during compile time](/hpc/compilation/precalc/). To save space, it is convenient to only store the smallest divisor of a number, requiring just one byte for a 16-bit integer:
+We can also use this approach to compute these lookup tables [during compile time](/hpc/compilation/precalc/). To save space, we can store only the smallest divisor of a number. Since the smallest divisor of a composite number does not exceed $\sqrt n$, we need just one byte per 16-bit integer:
 
 ```c++
 template <int N = (1<<16)>
@@ -109,13 +109,13 @@ u64 find_factor(u64 n) {
 }
 ```
 
-This approach can process 3M 16-bit integers per second, although it [probably gets slower](../hpc/cpu-cache/bandwidth/) for larger numbers. While it requires just a few milliseconds and 64KB of memory to calculate and store the divisors of the first $2^{16}$ numbers, it does not scale well for larger inputs.
+With this approach, we can process 3M 16-bit integers per second, although it would probably [get slower](../hpc/cpu-cache/bandwidth/) for larger numbers. While it requires just a few milliseconds and 64KB of memory to calculate and store the divisors of the first $2^{16}$ numbers, it does not scale well for larger inputs.
 
 ### Wheel factorization
 
-To save paper space, pre-computer era factorization tables typically excluded numbers divisible by 2 and 5: in decimal numeral system, you can quickly determine whether a number is divisible by 2 or 5 (by looking at its last digit) and keep dividing the number $n$ by 2 or 5 while it is possible, eventually arriving to some entry in the factorization table. This makes the factorization table just ½ × ⅘ = 0.4 its original size.
+To save paper space, pre-computer era factorization tables typically excluded numbers divisible by $2$ and $5$, making the factorization table ½ × ⅘ = 0.4 of its original size. In the decimal numeral system, you can quickly determine whether a number is divisible by $2$ or $5$ (by looking at its last digit) and keep dividing the number $n$ by $2$ or $5$ while it is possible, eventually arriving at some entry in the factorization table.
 
-We can apply a similar trick to trial division, first checking if the number is divisible by $2$, and then only check for odd divisors:
+We can apply a similar trick to trial division by first checking if the number is divisible by $2$ and then only considering odd divisors:
 
 ```c++
 u64 find_factor(u64 n) {
@@ -128,9 +128,11 @@ u64 find_factor(u64 n) {
 }
 ```
 
-With 50% fewer divisions to do, this algorithm works twice as fast, but it can be extended. If the number is not divisible by $3$, we can also ignore all multiples of $3$, and the same goes for all other divisors. 
+With 50% fewer divisions to perform, this algorithm works twice as fast.
 
-The problem is, as we increase the number of primes to exclude, it becomes less straightforward to iterate only over the numbers not divisible by them as they follow an irregular pattern — unless the number of primes is small. For example, if we consider $2$, $3$, and $5$, then, among the first $90$ numbers, we only need to check:
+This method can be extended: if the number is not divisible by $3$, we can also ignore all multiples of $3$, and the same goes for all other divisors. The problem is, as we increase the number of primes to exclude, it becomes less straightforward to iterate only over the numbers not divisible by them as they follow an irregular pattern — unless the number of primes is small.
+
+For example, if we consider $2$, $3$, and $5$, then, among the first $90$ numbers, we only need to check:
 
 ```center
 (1,) 7, 11, 13, 17, 19, 23, 29,
@@ -138,7 +140,7 @@ The problem is, as we increase the number of primes to exclude, it becomes less
 61, 67, 71, 73, 77, 79, 83, 89…
 ```
 
-You can notice a pattern: the sequence repeats itself every $30$ numbers because remainder modulo $2 \times 3 \times 5 = 30$ is all we need to determine whether a number is divisible by $2$, $3$, or $5$. This means that we only need to check $8$ specific numbers in every $30$, proportionally improving the performance:
+You can notice a pattern: the sequence repeats itself every $30$ numbers. This is not surprising since the remainder modulo $2 \times 3 \times 5 = 30$ is all we need to determine whether a number is divisible by $2$, $3$, or $5$. This means that we only need to check $8$ numbers with specific remainders out of every $30$, proportionally improving the performance:
 
 ```c++
 u64 find_factor(u64 n) {
@@ -157,11 +159,11 @@ u64 find_factor(u64 n) {
 }
 ```
 
-As expected, it works $\frac{30}{8} = 3.75$ times faster than the naive trial division, processing about 7.6k 30-bit numbers per second. The performance can be improved by considering more primes, but the returns are diminishing: adding a new prime $p$ reduces the number of iterations by $\frac{1}{p}$, but increases the size of the skip-list by a factor of $p$, requiring proportionally more memory.
+As expected, it works $\frac{30}{8} = 3.75$ times faster than the naive trial division, processing about 7.6k 30-bit numbers per second. The performance can be improved further by considering more primes, but the returns are diminishing: adding a new prime $p$ reduces the number of iterations by $\frac{1}{p}$ but increases the size of the skip-list by a factor of $p$, requiring proportionally more memory.
 
 ### Precomputed Primes
 
-If we keep increasing the number of primes we exclude in wheel factorization, we eventually exclude all composite numbers and only check for prime factors. In this case, we don't need this array of offsets, but we need to precompute primes, which we can do during compile time like this:
+If we keep increasing the number of primes in wheel factorization, we eventually exclude all composite numbers and only check for prime factors. In this case, we don't need this array of offsets but just the array of primes:
 
 ```c++
 const int N = (1 << 16);
@@ -193,9 +195,11 @@ u64 find_factor(u64 n) {
 }
 ```
 
-This approach lets us process almost 20k 30-bit integers per second, but it does not work for larger (64-bit) numbers unless they have small ($< 2^{16}$) factors. Note that this is actually an asymptotic optimization: there are $O(\frac{n}{\ln n})$ primes among the first $n$ numbers, so this algorithm performs $O(\frac{\sqrt n}{\ln \sqrt n})$ operations, while wheel factorization only eliminates a large but fixed fraction of divisors. If we extend it to 64-bit numbers and precompute every prime under $2^{32}$ (storing which would require several hundred megabytes of memory), the relative speedup would grow by a factor of $\frac{\ln \sqrt{n^2}}{\ln \sqrt n} = 2 \cdot \frac{1/2}{1/2} \cdot \frac{\ln n}{\ln n} = 2$.
+This approach lets us process almost 20k 30-bit integers per second, but it does not work for larger (64-bit) numbers unless they have small ($< 2^{16}$) factors.
+
+Note that this is actually an asymptotic optimization: there are $O(\frac{n}{\ln n})$ primes among the first $n$ numbers, so this algorithm performs $O(\frac{\sqrt n}{\ln \sqrt n})$ operations, while wheel factorization only eliminates a large but constant fraction of divisors. If we extend it to 64-bit numbers and precompute every prime under $2^{32}$ (storing which would require several hundred megabytes of memory), the relative speedup would grow by a factor of $\frac{\ln \sqrt{n^2}}{\ln \sqrt n} = 2 \cdot \frac{1/2}{1/2} \cdot \frac{\ln n}{\ln n} = 2$.
 
-All variants of trial division, including this one, are bottlenecked by the speed of integer division, which can be [optimized](/hpc/arithmetic/division/) if we know the divisors in advice and allow for some precomputation. In particular, we can use [Lemire division check](/hpc/arithmetic/division/#lemire-reduction):
+All variants of trial division, including this one, are bottlenecked by the speed of integer division, which can be [optimized](/hpc/arithmetic/division/) if we know the divisors in advance and allow for some additional precomputation. In our case, it is suitable to use [the Lemire division check](/hpc/arithmetic/division/#lemire-reduction):
 
 ```c++
 // ...precomputation is the same as before,
@@ -212,7 +216,7 @@ u64 find_factor(u64 n) {
 }
 ```
 
-This makes the algorithm ~18x faster: we can now process ~350k 30-bit numbers per second. This is actually the most efficient algorithm we have for this number range. While it can probably be even further optimized by performing these checks in parallel with [SIMD](/hpc/simd), we will stop there and consider a different, asymptotically better approach.
+This makes the algorithm ~18x faster: we can now factorize **~350k** 30-bit numbers per second, which is actually the most efficient algorithm we have for this number range. While it can probably be optimized even further by performing these checks in parallel with [SIMD](/hpc/simd), we will stop there and try a different, asymptotically better approach.
 
 ### Pollard's Rho Algorithm
 
@@ -235,6 +239,8 @@ By itself, this algorithm is just an esoteric way of computing factorization, bu
 
 -->
 
+Pollard's rho algorithm is a randomized $O(\sqrt[4]{n})$ integer factorization algorithm.
+
 To construct this sequence, we need a "seemingly random" function that maps the remainders of $n$. Typical choice is $f(x) = (x + 1)^2 \mod n$.
 
 Now, consider a graph where each vertex $x$ has an edge pointing to $f(x)$. Such graphs are called *functional*. The "trajectory" of any element — the path we walk starting from that element and following edges — eventually loop around. This trajectory resembles the greek letter $\rho$ (rho), which is why the algorithm is named so.
@@ -427,3 +433,7 @@ Since Pollard's rho algorithm is randomized, you need to account for errors. The
 - Less than 10^50: Lenstra elliptic curve factorization
 - Less than 10^100: Quadratic Sieve
 - More than 10^100: General Number Field Sieve
+
+Requiring about 100KB of memory.
+
+6542 * 8

From 002b4aece30f3a63c2dc06a8a1b016afb55c4904 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 25 May 2022 21:46:07 +0300
Subject: [PATCH 120/173] pollard rho description

---
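The tortoise-and-hare description added in this patch can be tried out with the following sketch. It assumes $f(x) = x^2 + 1 \bmod n$, `std::gcd` from `<numeric>`, the GCC/Clang `__uint128_t` extension, and the `SEED = 42` constant from the previous patch. The name `find_factor_floyd` is hypothetical.

```c++
#include <cassert>
#include <cstdint>
#include <numeric>

typedef uint64_t u64;
typedef __uint128_t u128;

const u64 SEED = 42;

// f(x) = x^2 + 1 (mod n), treated as a "deterministically random" map
u64 f(u64 x, u64 n) {
    return ((u128) x * x + 1) % n;
}

u64 diff(u64 a, u64 b) {
    return a > b ? a - b : b - a;
}

// Floyd's cycle detection on the implicit mod-p sequence:
// the tortoise holds f^i(x0), the hare holds f^{2i}(x0),
// and a collision modulo a prime divisor p shows up as gcd != 1.
// Hypothetical name; always terminates because the sequence is
// eventually periodic modulo n, but may return n itself.
u64 find_factor_floyd(u64 n) {
    assert(n > 1);
    u64 tortoise = SEED, hare = SEED, g = 1;
    while (g == 1) {
        tortoise = f(tortoise, n);      // one step
        hare = f(f(hare, n), n);        // two steps
        g = std::gcd(diff(tortoise, hare), n);
    }
    return g;
}
```

For $n = 8051 = 83 \times 97$ and this seed, the sequence modulo $97$ enters a cycle of length $3$ after three steps, while modulo $83$ the cycle has length $5$, so the first non-trivial GCD appears at $i = 3$ and `find_factor_floyd(8051)` returns `97`.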
 .../english/hpc/algorithms/factorization.md   | 47 +++++++++++++------
 1 file changed, 32 insertions(+), 15 deletions(-)

diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md
index 9e886375..d44ca6af 100644
--- a/content/english/hpc/algorithms/factorization.md
+++ b/content/english/hpc/algorithms/factorization.md
@@ -237,35 +237,49 @@ It also searches for a factor, but it does so by repeatedly trying to compute th
 
 By itself, this algorithm is just an esoteric way of computing factorization, but can be made useful. If, instead of random numbers, we apply this $\gcd$ trick to a particular number sequence, we get a $O(n^\frac{1}{4})$ approach known as Pollard's rho algorithm.
 
+Apart from this trick, Pollard's rho algorithm relies on a consequence from the Birthday paradox: we need to add $O(\sqrt{n})$ random numbers from $1$ to $n$ to a set until we get a collision. 
+
 -->
 
-Pollard's rho algorithm is a randomized $O(\sqrt[4]{n})$ integer factorization algorithm.
+Pollard's rho algorithm is a randomized $O(\sqrt[4]{n})$ integer factorization algorithm that makes use of the [birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem): one only needs to draw $\Theta(\sqrt{n})$ random numbers between $1$ and $n$ to get a collision with high probability.
 
-To construct this sequence, we need a "seemingly random" function that maps the remainders of $n$. Typical choice is $f(x) = (x + 1)^2 \mod n$.
+Consider some function $f(x)$ that takes a remainder $x \in [0, n)$ and maps it to some other remainder of $n$ in a way that seems random from the number theory point of view. Specifically, we will use $f(x) = x^2 + 1 \bmod n$, which is random enough for our purposes.
 
-Now, consider a graph where each vertex $x$ has an edge pointing to $f(x)$. Such graphs are called *functional*. The "trajectory" of any element — the path we walk starting from that element and following edges — eventually loop around. This trajectory resembles the greek letter $\rho$ (rho), which is why the algorithm is named so.
+Now, consider a graph where each number-vertex $x$ has an edge pointing to $f(x)$. Such graphs are called *functional*. In functional graphs, the "trajectory" of any element — the path we walk if we start from that element and keep following the edges — is a path that eventually loops around (because the set of vertices is limited, and at some point we have to go to a vertex we have already visited).
 
-![](../img/rho.jpg)
+![The trajectory of an element resembles the Greek letter ρ (rho), which is what the algorithm is named after](../img/rho.jpg)
 
-Apart from this trick, Pollard's rho algorithm relies on a consequence from the Birthday paradox: we need to add $O(\sqrt{n})$ random numbers from $1$ to $n$ to a set until we get a collision.
+Consider a trajectory of some particular element $x_0$:
 
-Now, consider a trajectory of some element $x_0$: {$x_0$, $f(x_0)$, $f(f(x_0))$, $\ldots$}.
+$$
+x_0, \; f(x_0), \; f(f(x_0)), \; \ldots
+$$
 
-Make another sequence out of it, virtually taking each element modulo $p$, the lesser of prime divisors of $n$.
+Now, let's make another sequence out of this one by reducing each element modulo $p$, the smallest prime divisor of $n$.
 
-**Lemma.** The expected length in that sequence is $O(\sqrt[4]{n})$.
+**Lemma.** The expected length of that sequence before it turns into a cycle is $O(\sqrt[4]{n})$.
 
-**Proof.** Each time we walk a new edge, we generate a random number. It has some chance if looping around.
+**Proof.** Since $p$ is the smallest divisor, $p \leq \sqrt n$. Each time we follow a new edge, we essentially generate a random number between $0$ and $p$ (we treat $f$ as a "deterministically-random" function). The birthday paradox states that we only need to generate $O(\sqrt p) = O(\sqrt[4]{n})$ numbers until we get a collision and thus enter a loop.
 
-As $p$ is the lesser divisor, $p \leq \sqrt n$. Now we need to plug it into the [Birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem): we need to add $O(\sqrt{p}) = O(\sqrt[4]{n})$ elements to the set to get a collision, which means that the.
+Since we don't know $p$, this mod-$p$ sequence is only imaginary, but if we find a cycle in it, that is, $i$ and $j$ such that
 
-Another observation: the length of the "tail" and the cycle is equal in expectation, since when we loop around, we choose any vertex of the path we walked independently.
+$$
+f^i(x_0) \equiv f^j(x_0) \pmod p
+$$
+
+then we can also find $p$ itself as
+
+$$
+p = \gcd(|f^i(x_0) - f^j(x_0)|, n)
+$$
 
-Now, if we find a cycle in this sequence — $i$ and $j$ such that $f^i(x_0) \equiv f^j(x_0) \pmod p$ — we can find some divisor of $n$ using the $\gcd$ trick: $\gcd(|f^i(x_0) - f^j(x_0)|, n)$ would be less than $n$ and divisible by $p$.
+The algorithm itself just finds this cycle and $p$ using this GCD trick and Floyd's "[tortoise and hare](https://en.wikipedia.org/wiki/Cycle_detection#Floyd's_tortoise_and_hare)" algorithm: we maintain two pointers $i$ and $j = 2i$ and check that 
 
-Floyd's cycle-finding algorithm
+$$
+\gcd(|f^i(x_0) - f^j(x_0)|, n) \neq 1
+$$
 
-The algorithm itself just finds a loop in this sequence using the Ford algorithms, also known as the "hare and turtle" technique: we maintain two pointers $i$ and $j$ ($i = 2j$) and check that $f^i(x_0) \equiv f^j(x_0) \pmod p$, which is equivalent to checking $\gcd(|f^i(x_0) - f^j(x_0)|, n) \neq 1$.
+which is equivalent to comparing $f^i(x_0)$ and $f^j(x_0)$ modulo $p$. Since $j$ (hare) is increasing at twice the rate of $i$ (tortoise), their difference is increasing by $1$ each iteration and eventually will become equal to (or a multiple of) the cycle length, with $i$ and $j$ pointing to the same elements. And as we proved half a page ago, reaching a cycle would only require $O(\sqrt[4]{n})$ iterations:
 
 ```c++
 u64 f(u64 x, u64 mod) {
@@ -290,7 +304,7 @@ u64 find_factor(u64 n) {
 }
 ```
 
-While it processes 25k 30-bit numbers — almost 15 times slower than the fastest algorithm we have — it drammatically outperforms every $\tilde{O}(\sqrt n)$ algorithm for 60-bit numbers, processing around 90 of them per second.
+While it processes only ~25k 30-bit integers per second (almost 15 times slower than the fastest algorithm we have), it dramatically outperforms every $\tilde{O}(\sqrt n)$ algorithm for 60-bit numbers, factorizing around 90 of them per second.
 
 ### Pollard-Brent Algorithm 
 
@@ -412,6 +426,9 @@ If you have limited time, you should probably compute as much forward as possibl
 
 How to optimize for the *average* case is unclear.
 
+Another observation: the length of the "tail" and the cycle is equal in expectation, since when we loop around, we choose any vertex of the path we walked independently.
+
+
 ### Reducing Errors
 
 There are slightly more errors because we are a bit loose with modular arithmetic here. The error rate grows higher when we increase and decrease (due to overflows).

From 428407e09d0461d13b55a5bae555d98bea75d320 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 26 May 2022 15:32:28 +0300
Subject: [PATCH 121/173] factorization edits

---
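The birthday-paradox claim quoted in this patch is easy to check empirically. The sketch below is only an illustration: the function names, the fixed seed, and the use of `std::mt19937` are choices made here, not anything from the article. It draws uniform numbers from $[0, n)$ until one repeats and averages the number of draws over many trials.

```c++
#include <cassert>
#include <random>
#include <vector>

// Draws uniformly random numbers from [0, n) until one repeats and
// returns the number of draws made, including the repeated one.
int draws_until_collision(int n, std::mt19937 &rng) {
    assert(n > 0);
    std::vector<bool> seen(n, false);
    std::uniform_int_distribution<int> dist(0, n - 1);
    int draws = 0;
    while (true) {
        int v = dist(rng);
        draws++;
        if (seen[v])
            return draws;
        seen[v] = true;
    }
}

// Averages the collision time over several independent trials
// (hypothetical helper with a fixed default seed for repeatability).
double average_draws(int n, int trials, unsigned seed = 42) {
    std::mt19937 rng(seed);
    long long total = 0;
    for (int t = 0; t < trials; t++)
        total += draws_until_collision(n, rng);
    return (double) total / trials;
}
```

For $n = 10^4$ the average should land somewhat above $\sqrt n = 100$, in line with the known asymptotic $\sqrt{\pi n / 2} \approx 125$ for the expected number of draws. Quadrupling $n$ should roughly double the average.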
 .../english/hpc/algorithms/factorization.md   | 76 ++++++++-----------
 1 file changed, 33 insertions(+), 43 deletions(-)

diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md
index d44ca6af..fd61d441 100644
--- a/content/english/hpc/algorithms/factorization.md
+++ b/content/english/hpc/algorithms/factorization.md
@@ -1,7 +1,6 @@
 ---
 title: Integer Factorization
 weight: 3
-draft: true
 ---
 
 The problem of factoring integers into primes is central to computational [number theory](/hpc/number-theory/). It has been [studied](https://www.cs.purdue.edu/homes/ssw/chapter3.pdf) since at least the 3rd century BC, and [many methods](https://en.wikipedia.org/wiki/Category:Integer_factorization_algorithms) have been developed that are efficient for different inputs.
@@ -241,7 +240,11 @@ Apart from this trick, Pollard's rho algorithm relies on a consequence from the
 
 -->
 
-Pollard's rho algorithm is a randomized $O(\sqrt[4]{n})$ integer factorization algorithm that makes use of the [birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem): one only needs to draw $\Theta(\sqrt{n})$ random numbers between $1$ and $n$ to get a collision with high probability.
+Pollard's rho is a randomized $O(\sqrt[4]{n})$ integer factorization algorithm that makes use of the [birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem):
+
+> One only needs to draw $d = \Theta(\sqrt{n})$ random numbers between $1$ and $n$ to get a collision with high probability.
+
+You can look up the formal proof on Wikipedia, but the informal reasoning behind it is that each of the $d$ added numbers has a chance of approximately $\frac{d}{n}$ of colliding with any other number, meaning that the expected number of collisions is about $\frac{d^2}{n}$. If $d$ is asymptotically smaller than $\sqrt n$, this ratio tends to zero as $n$ grows, and if $d$ is asymptotically larger, it tends to infinity.
 
 Consider some function $f(x)$ that takes a remainder $x \in [0, n)$ and maps it to some other remainder of $n$ in a way that that seems random from the number theory point of view. Specifically, we will use $f(x) = x^2 + 1 \bmod n$, which is random enough for our purposes.
 
@@ -308,7 +311,9 @@ While it processes only ~25k 30-bit integers — almost 15 times slower than the
 
 ### Pollard-Brent Algorithm 
 
-Floyd's cycle-finding algorithm has a problem in that it does more iterator increments than necessary. One way to solve it is to memorize the values that the faster iterator visits and compute the gcd using the difference of $x_i$ and $x_{\lfloor i / 2 \rfloor}$, but it can also be done without extra memory using this trick:
+Floyd's cycle-finding algorithm has a problem in that it moves iterators more than necessary: at least half of the vertices are visited one additional time by the slower iterator.
+
+One way to solve it is to memorize the values $x_i$ that the faster iterator visits and every two iterations compute the GCD using the difference of $x_i$ and $x_{\lfloor i / 2 \rfloor}$, but it can also be done without extra memory using a different principle: the tortoise doesn't move on every iteration, but it gets reset to the value of the faster iterator when the iteration number becomes a power of two. This lets us save additional iterations while still using the same GCD trick to compare $x_i$ and $x_{2^{\lfloor \log_2 i \rfloor}}$ on each iteration:
 
 ```c++
 u64 find_factor(u64 n) {
@@ -327,9 +332,11 @@ u64 find_factor(u64 n) {
 }
 ```
 
-It actually does *not* improve performance and even makes it ~1.5x *slower*, which probably has something to do with the fact that $x$ is stale. It spends most of the time computing the GCD and not advancing the iterator — in fact, the asymptotic of the algorithm is currently $O(\sqrt[4]{n} \log n)$ because of it.
+Note that we also set an upper limit on the number of iterations so that the algorithm finishes in reasonable time and returns `1` if $n$ turns out to be a prime.
+
+It actually does *not* improve performance and even makes the algorithm ~1.5x *slower*, which probably has something to do with the fact that $x$ is stale. It spends most of the time computing the GCD and not advancing the iterator — in fact, the asymptotic of the algorithm is currently $O(\sqrt[4]{n} \log n)$ because of it.
 
-We can remove the logarithm from the asymptotic using the fact that if one of $a$ and $b$ contains factor $p$, then $a \cdot b \bmod n$ will also contain it, so instead of computing $\gcd(a, n)$ and $\gcd(b, n)$, we can compute $\gcd(a \cdot b \bmod n, n)$. This way, we can group the calculations of GCP in groups of $M = O(\log n)$, we remove $\log n$ out of the asymptotic:
+Instead of [optimizing the GCD itself](../gcd), we can optimize the number of its invocations. We can use the fact that if one of $a$ and $b$ contains factor $p$, then $a \cdot b \bmod n$ will also contain it, so instead of computing $\gcd(a, n)$ and $\gcd(b, n)$, we can compute $\gcd(a \cdot b \bmod n, n)$. This way, we can group the calculations of GCP in groups of $M = O(\log n)$, we remove $\log n$ out of the asymptotic:
 
 ```c++
 const int M = 1024;
@@ -357,11 +364,7 @@ It now works at 425 factorizations per second, bottlenecked by the speed of modu
 
 ### Optimizing Modulo
 
-The next step is to actually apply [Montgomery Multiplication](/hpc/number-theory/montgomery/).
-
-This is exactly the type of problem when we need specific knowledge, because we have 64-bit modulo by not-compile-constants, and compiler can't really do much to optimize it.
-
-We do not need to convert numbers out of Montgomery representation before computing the GCD.
+The final step is to apply [Montgomery multiplication](/hpc/number-theory/montgomery/): the modulo is constant, so we can perform all computations — advancing the iterator, multiplication, and even computing the GCD — in the Montgomery space where reduction is cheap.
 
 ```c++
 struct Montgomery {
@@ -410,47 +413,34 @@ u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) {
 }
 ```
 
-It processes around 3000 per second, which is ~3.8 faster than what [PARI](https://pari.math.u-bordeaux.fr/) library can do (invocated via [sage](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html)).
-
-### Further Optimization
-
-There might be a way to .
-
-It may be beneficial to start multiplying only after a certain threshold since there is little probability that we enter a cycle in the beginning.
-
-It may be worth it to run a few versions in parallel and stop whichever finishes first. If we run $p$ runs, it is expected to finish $\sqrt p$ times faster. Either scalar code and taking advantage of there being multiple execution ports for multiplication, or using [SIMD](/hpc/simd) instructions to do 4 or 8 multiplications in parallel.
-
-Would not be surprised to see another 3x improvement and throughputs of 10k/sec.
-
-If you have limited time, you should probably compute as much forward as possible, and then half the time computing the other.
-
-How to optimize for the *average* case is unclear.
-
-Another observation: the length of the "tail" and the cycle is equal in expectation, since when we loop around, we choose any vertex of the path we walked independently.
+This implementation can processes around 3k 60-bit integers per second, which is ~3.8 faster than the [PARI](https://pari.math.u-bordeaux.fr/) library (invoked via [sage](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html)).
 
+### Further Improvements
 
-### Reducing Errors
+I belive there is still a lot of potential for optimization in our implementation of the Pollard's algorithm:
 
-There are slightly more errors because we are a bit loose with modular arithmetic here. The error rate grows higher when we increase and decrease (due to overflows).
+- There is probably be a better cycle-finding algorithm that exploits the fact that the graph is random. It is currently bottlenecked by advancing the iterator (the latency of Montgomery multiplication is much higher than its reciprocal throughput), and while we do that, we could calculate more than one multiplication of the values we've seen to detect a loop sooner. On the other hand, there is little chance that we enter the loop in within the first few iterations, so we may just advance the iterator for some time before starting the trials with the GCD trick.
+- If we run $p$ independent instances of the algorithm with different seeds in parallel and stop when one of them finds the answer, it would finish $\sqrt p$ times faster (try to prove it). We don't have to use multiple cores for that: there is a lot of untapped [instruction-level parallelism](/hpc/pipelining/), so we could run two or three pairs of operations on the same thread, or use [SIMD](/hpc/simd) instructions to perform 4 or 8 multiplications in parallel.
 
-Our implementation has less than 0.7% error rate, but it grows higher if the numbers are lower than $10^{18}$.
+I would not be surprised to see another 3x improvement and a throughput of ~10k/sec.
 
-Since Pollard's rho algorithm is randomized, you need to account for errors. There may be several sources:
+<!-- Another observation: the length of the "tail" and the cycle is equal in expectation, since when we loop around, we choose any vertex of the path we walked independently. How to optimize for the *average* case is unclear. -->
 
-- Factors not being found (need to perform a primality test and start again if it's negative).
-- The `p` variable can get zeroed out (need to either restart or roll back and do it iteration-by-iteration).
-- Overflows in Montgomery multiplication (our implementation is pretty loose).
+Another aspect that we need to handle in a practical implementation is possible errors. Our current implementation has a 0.7% error rate which grows higher if the numbers are lower than $10^{18}$. They come from three main sources:
 
-### Larger Numbers
+- Factors simply not being found (the algorithm is inherently randomized, and there is no guarantee that they will be found). In this case, we need to perform a primality test and optionally start again.
+- The `p` variable becoming zero (because both $p$ and $q$ can get into the product). It becomes increasingly more likely as we decrease size of the inputs or increase the constant `M`. In this case, we need to either restart the process or (better) roll back the last $M$ iterations and perform the trials one-by-one.
+- Overflows in the Montgomery multiplication. Our current implementation is pretty loose with them, and if $n$ is large, we need to add more `x > mod ? x - mod : x` kind of statements to deal with overflows.
 
-"How big are your numbers?" determines the method to use:
+These issues become less important if we exclude small numbers and numbers with small prime factors using the algorithms we've implemented before. In general the optimal approach should depend on the size of the numbers:
 
-- Less than 2^16 or so: Lookup table.
-- Less than 2^70 or so: Richard Brent's modification of Pollard's rho algorithm.
-- Less than 10^50: Lenstra elliptic curve factorization
-- Less than 10^100: Quadratic Sieve
-- More than 10^100: General Number Field Sieve
+- Smaller than $2^{16}$: use a lookup table
+- Smaller than $2^{32}$: use a list of precomputed primes with a fast divsibility check
+- Smaller than $2^{64}$ or so: use Pollard's rho algorithm with Montgomery multiplication
+- Smaller than $10^{50}$: switch to [Lenstra elliptic curve factorization](https://en.wikipedia.org/wiki/Lenstra_elliptic-curve_factorization)
+- Smaller than $10^{100}$: switch to [Quadratic Sieve](https://en.wikipedia.org/wiki/Quadratic_sieve)
+- Larger than $10^{100}$: switch to [General Number Field Sieve](https://en.wikipedia.org/wiki/General_number_field_sieve)
 
-Requiring about 100KB of memory.
+<!-- Requiring about 100KB of memory. 6542 * 8 -->
 
-6542 * 8
+If you [implement](https://github.com/sslotin/amh-code/tree/main/factor) some of these ideas, please [let me know](http://sereja.me/).

From 19143a513bdc88a564391fa4b71f5d01e3ef6a0b Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 26 May 2022 18:44:58 +0300
Subject: [PATCH 122/173] factorization improvements

---
 .../english/hpc/algorithms/factorization.md   | 21 ++++++++++---------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md
index fd61d441..7fc51f93 100644
--- a/content/english/hpc/algorithms/factorization.md
+++ b/content/english/hpc/algorithms/factorization.md
@@ -362,7 +362,7 @@ u64 find_factor(u64 n) {
 
 It now works at 425 factorizations per second, bottlenecked by the speed of modulo.
 
-### Optimizing Modulo
+### Optimizing the Modulo
 
 The final step is to apply [Montgomery multiplication](/hpc/number-theory/montgomery/): the modulo is constant, so we can perform all computations — advancing the iterator, multiplication, and even computing the GCD — in the Montgomery space where reduction is cheap.
 
@@ -413,26 +413,27 @@ u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) {
 }
 ```
 
-This implementation can processes around 3k 60-bit integers per second, which is ~3.8 faster than the [PARI](https://pari.math.u-bordeaux.fr/) library (invoked via [sage](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html)).
+This implementation can processes around 3k 60-bit integers per second, which is ~3.8 faster than what [PARI](https://pari.math.u-bordeaux.fr/) / [SageMath](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html)'s `factor` function measures.
 
 ### Further Improvements
 
-I belive there is still a lot of potential for optimization in our implementation of the Pollard's algorithm:
+**Optimizations.** There is still a lot of potential for optimization in our implementation of Pollard's algorithm:
 
-- There is probably be a better cycle-finding algorithm that exploits the fact that the graph is random. It is currently bottlenecked by advancing the iterator (the latency of Montgomery multiplication is much higher than its reciprocal throughput), and while we do that, we could calculate more than one multiplication of the values we've seen to detect a loop sooner. On the other hand, there is little chance that we enter the loop in within the first few iterations, so we may just advance the iterator for some time before starting the trials with the GCD trick.
-- If we run $p$ independent instances of the algorithm with different seeds in parallel and stop when one of them finds the answer, it would finish $\sqrt p$ times faster (try to prove it). We don't have to use multiple cores for that: there is a lot of untapped [instruction-level parallelism](/hpc/pipelining/), so we could run two or three pairs of operations on the same thread, or use [SIMD](/hpc/simd) instructions to perform 4 or 8 multiplications in parallel.
+- We could probably use a better cycle-finding algorithm, exploiting the fact that the graph is random. For example, there is little chance that we enter the loop within the first few iterations (the length of the cycle and the path we walk before entering it should be equal in expectation since before we loop around, we choose the vertex of the path we've walked independently), so we may just advance the iterator for some time before starting the trials with the GCD trick.
+- Our current approach is bottlenecked by advancing the iterator (the latency of Montgomery multiplication is much higher than its reciprocal throughput), and while we are waiting for it to complete, we could perform more than just one trial using the previous values.
+- If we run $p$ independent instances of the algorithm with different seeds in parallel and stop when one of them finds the answer, it would finish $\sqrt p$ times faster (the reasoning is similar to the Birthday paradox; try to prove it yourself). We don't have to use multiple cores for that: there is a lot of untapped [instruction-level parallelism](/hpc/pipelining/), so we could concurrently run two or three of the same operations on the same thread, or use [SIMD](/hpc/simd) instructions to perform 4 or 8 multiplications in parallel.
 
-I would not be surprised to see another 3x improvement and a throughput of ~10k/sec.
+I would not be surprised to see another 3x improvement and a throughput of ~10k/sec. If you [implement](https://github.com/sslotin/amh-code/tree/main/factor) some of these ideas, please [let me know](http://sereja.me/).
 
 <!-- Another observation: the length of the "tail" and the cycle is equal in expectation, since when we loop around, we choose any vertex of the path we walked independently. How to optimize for the *average* case is unclear. -->
 
-Another aspect that we need to handle in a practical implementation is possible errors. Our current implementation has a 0.7% error rate which grows higher if the numbers are lower than $10^{18}$. They come from three main sources:
+**Errors.** Another aspect that we need to handle in a practical implementation is possible errors. Our current implementation has a 0.7% error rate for 60-bit integers, and it grows higher if the numbers are lower. These errors come from three main sources:
 
-- Factors simply not being found (the algorithm is inherently randomized, and there is no guarantee that they will be found). In this case, we need to perform a primality test and optionally start again.
+- A cycle simply not being found (the algorithm is inherently random, and there is no guarantee that it will be found). In this case, we need to perform a primality test and optionally start again.
 - The `p` variable becoming zero (because both $p$ and $q$ can get into the product). It becomes increasingly more likely as we decrease size of the inputs or increase the constant `M`. In this case, we need to either restart the process or (better) roll back the last $M$ iterations and perform the trials one-by-one.
 - Overflows in the Montgomery multiplication. Our current implementation is pretty loose with them, and if $n$ is large, we need to add more `x > mod ? x - mod : x` kind of statements to deal with overflows.
 
-These issues become less important if we exclude small numbers and numbers with small prime factors using the algorithms we've implemented before. In general the optimal approach should depend on the size of the numbers:
+**Larger numbers.** These issues become less important if we exclude small numbers and numbers with small prime factors using the algorithms we've implemented before. In general, the optimal approach should depend on the size of the numbers:
 
 - Smaller than $2^{16}$: use a lookup table
 - Smaller than $2^{32}$: use a list of precomputed primes with a fast divsibility check
@@ -443,4 +444,4 @@ These issues become less important if we exclude small numbers and numbers with
 
 <!-- Requiring about 100KB of memory. 6542 * 8 -->
 
-If you [implement](https://github.com/sslotin/amh-code/tree/main/factor) some of these ideas, please [let me know](http://sereja.me/).
+The last three approaches are very different from what we've been doing and require much more advanced number theory, and they deserve an article (or a full-length university course) of their own.

From 709340d509d45719c5f9d76432d273a1d84d44c5 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 26 May 2022 19:22:45 +0300
Subject: [PATCH 123/173] pollard edits

---
 .../english/hpc/algorithms/factorization.md   | 44 +++++++++----------
 1 file changed, 22 insertions(+), 22 deletions(-)
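
Reviewer note on the "roll back the last $M$ iterations and perform the trials one by one" bullet in this patch: a minimal sketch of how that recovery could look, not the article's implementation. It uses plain `unsigned __int128` reduction (a GCC/Clang extension) instead of Montgomery multiplication, and the batch size `M = 64`, the `limit` parameter, and the per-batch reset of the product are assumptions made for illustration.

```c++
#include <cassert>
#include <cstdint>
#include <numeric>

typedef uint64_t u64;
typedef unsigned __int128 u128; // assumption: GCC/Clang extension

// f(x) = x^2 + 1 mod n, the iteration function used in the article
u64 f(u64 x, u64 n) { return (u64) (((u128) x * x + 1) % n); }

u64 diff(u64 a, u64 b) { return a > b ? a - b : b - a; }

const int M = 64; // hypothetical batch size

// Returns a nontrivial factor of n, or 1 if none was found.
u64 find_factor(u64 n, u64 x0 = 2, int limit = 1 << 20) {
    u64 x = x0, y = x0;
    for (int l = M; l < limit; l *= 2) {
        x = y; // reset the tortoise when the phase length doubles
        for (int i = 0; i < l; i += M) {
            u64 saved = y; // state to roll back to
            u64 p = 1;
            for (int j = 0; j < M; j++) {
                y = f(y, n);
                p = (u64) ((u128) p * diff(x, y) % n);
            }
            u64 g = std::gcd(p, n);
            if (g == n) {
                // The product became zero modulo n: replay the batch
                // and test every difference separately.
                y = saved;
                for (int j = 0; j < M; j++) {
                    y = f(y, n);
                    g = std::gcd(diff(x, y), n);
                    if (g != 1)
                        break;
                }
                // g == n here means the full cycle closed at once
                return g == n ? 1 : g;
            }
            if (g != 1)
                return g;
        }
    }
    return 1;
}
```

With `n = 15`, the batched product hits zero on the third step, and the replay recovers the factor 3 instead of reporting a failure. For a prime such as 97, the replay ends with a GCD equal to `n`, so the function still returns `1`.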

diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md
index 7fc51f93..07bf7408 100644
--- a/content/english/hpc/algorithms/factorization.md
+++ b/content/english/hpc/algorithms/factorization.md
@@ -244,11 +244,11 @@ Pollard's rho is a randomized $O(\sqrt[4]{n})$ integer factorization algorithm t
 
 > One only needs to draw $d = \Theta(\sqrt{n})$ random numbers between $1$ and $n$ to get a collision with high probability.
 
-You can look up formal proof on Wikipedia, but the informal reasoning behind it is that that each of $d$ added numbers has a chance of approximately $\frac{d}{n}$ of colliding with anythin else, meaning that the expected number of collisions is $\frac{d^2}{n}$. If $d$ is asymptotically smaller than $\sqrt n$, then this ratio grows to zero as $n$ rises and to infinity otherwise.
+The reasoning behind it is that each of the $d$ added elements has roughly a $\frac{d}{n}$ chance of colliding with some other element, implying that the expected number of collisions is about $\frac{d^2}{n}$. If $d$ is asymptotically smaller than $\sqrt n$, then this ratio tends to zero as $n \to \infty$, and if it is asymptotically larger, the ratio tends to infinity.
 
-Consider some function $f(x)$ that takes a remainder $x \in [0, n)$ and maps it to some other remainder of $n$ in a way that that seems random from the number theory point of view. Specifically, we will use $f(x) = x^2 + 1 \bmod n$, which is random enough for our purposes.
+Consider some function $f(x)$ that takes a remainder $x \in [0, n)$ and maps it to some other remainder of $n$ in a way that seems random from the number theory point of view. Specifically, we will use $f(x) = x^2 + 1 \bmod n$, which is random enough for our purposes.
 
-Now, consider a graph where each number-vertex $x$ has an edge pointing to $f(x)$. Such graphs are called *functional*. In functional graphs, the "trajectory" of any element — the path we walk if we start from that element and keep following the edges — is a path that eventually loops around (because the set of vertices is limited, and at some point we have to go to a vertex we have already visited).
+Now, consider a graph where each number-vertex $x$ has an edge pointing to $f(x)$. Such graphs are called *functional*. In functional graphs, the "trajectory" of any element — the path we walk if we start from that element and keep following the edges — is a path that eventually loops around (because the set of vertices is limited, and at some point, we have to go to a vertex we have already visited).
 
 ![The trajectory of an element resembles the greek letter ρ (rho), which is what the algorithm is named after](../img/rho.jpg)
 
@@ -258,11 +258,11 @@ $$
 x_0, \; f(x_0), \; f(f(x_0)), \; \ldots
 $$
 
-Now, let's make another sequence out of this one by reducing each element modulo $p$, the smallest prime divisor of $n$.
+Let's make another sequence out of this one by reducing each element modulo $p$, the smallest prime divisor of $n$.
 
-**Lemma.** The expected length of that sequence before it turns into a cycle is $O(\sqrt[4]{n})$.
+**Lemma.** The expected length of the reduced sequence before it turns into a cycle is $O(\sqrt[4]{n})$.
 
-**Proof:** Since $p$ is the smallest divisor, $p \leq \sqrt n$. Each time we follow a new edge, we essentially generate a random number between $0$ and $p$ (we treat $f$ as a "deterministically-random" function). The birthday paradox states that we only need to generate $O(\sqrt p) = O(\sqrt[4]{n})$ numers until we get a collision and thus enter a loop.
+**Proof:** Since $p$ is the smallest divisor, $p \leq \sqrt n$. Each time we follow a new edge, we essentially generate a random number between $0$ and $p$ (we treat $f$ as a "deterministically-random" function). The birthday paradox states that we only need to generate $O(\sqrt p) = O(\sqrt[4]{n})$ numbers until we get a collision and thus enter a loop.
 
 Since we don't know $p$, this mod-$p$ sequence is only imaginary, but if find a cycle in it — that is, $i$ and $j$ such that
 
@@ -307,13 +307,13 @@ u64 find_factor(u64 n) {
 }
 ```
 
-While it processes only ~25k 30-bit integers — almost 15 times slower than the fastest algorithm we have — it drammatically outperforms every $\tilde{O}(\sqrt n)$ algorithm for 60-bit numbers, factorizing around 90 of them per second.
+While it processes only ~25k 30-bit integers per second, which is almost 15 times slower than checking each prime using a fast division trick, it dramatically outperforms every $\tilde{O}(\sqrt n)$ algorithm for 60-bit numbers, factorizing around 90 of them per second.
 
 ### Pollard-Brent Algorithm 
 
 Floyd's cycle-finding algorithm has a problem in that it moves iterators more than necessary: at least half of the vertices are visited one additional time by the slower iterator.
 
-One way to solve it is to memorize the values $x_i$ that the faster iterator visits and every two iterations compute the GCD using the difference of $x_i$ and $x_{\lfloor i / 2 \rfloor}$, but it can also be done without extra memory using a different principle: the tortoise doesn't move on every iteration, but it gets reset to the value of the faster iterator when the iteration number becomes a power of two. This lets us save additional iterations while still using the same GCD trick to compare $x_i$ and $x_{2^{\lfloor \log_2 i \rfloor}}$ on each iteration:
+One way to solve it is to memorize the values $x_i$ that the faster iterator visits and, every two iterations, compute the GCD using the difference of $x_i$ and $x_{\lfloor i / 2 \rfloor}$. But it can also be done without extra memory using a different principle: the tortoise doesn't move on every iteration, but it gets reset to the value of the faster iterator when the iteration number becomes a power of two. This lets us save additional iterations while still using the same GCD trick to compare $x_i$ and $x_{2^{\lfloor \log_2 i \rfloor}}$ on each iteration:
 
 ```c++
 u64 find_factor(u64 n) {
@@ -332,11 +332,11 @@ u64 find_factor(u64 n) {
 }
 ```
 
-Note that we also set an upper limit on the number of iterations so that the algorithm finishes in reasonable time and returns `1` if $n$ turns out to be a prime.
+Note that we also set an upper limit on the number of iterations so that the algorithm finishes in a reasonable amount of time and returns `1` if $n$ turns out to be a prime.
 
-It actually does *not* improve performance and even makes the algorithm ~1.5x *slower*, which probably has something to do with the fact that $x$ is stale. It spends most of the time computing the GCD and not advancing the iterator — in fact, the asymptotic of the algorithm is currently $O(\sqrt[4]{n} \log n)$ because of it.
+It actually does *not* improve performance and even makes the algorithm ~1.5x *slower*, which probably has something to do with the fact that $x$ is stale. It spends most of the time computing the GCD and not advancing the iterator — in fact, the time requirement of this algorithm is currently $O(\sqrt[4]{n} \log n)$ because of it.
 
-Instead of [optimizing the GCD itself](../gcd), we can optimize the number of its invocations. We can use the fact that if one of $a$ and $b$ contains factor $p$, then $a \cdot b \bmod n$ will also contain it, so instead of computing $\gcd(a, n)$ and $\gcd(b, n)$, we can compute $\gcd(a \cdot b \bmod n, n)$. This way, we can group the calculations of GCP in groups of $M = O(\log n)$, we remove $\log n$ out of the asymptotic:
+Instead of [optimizing the GCD itself](../gcd), we will optimize the number of its invocations. We can use the fact that if one of $a$ and $b$ contains factor $p$, then $a \cdot b \bmod n$ will also contain it, so instead of computing $\gcd(a, n)$ and $\gcd(b, n)$, we can compute $\gcd(a \cdot b \bmod n, n)$. This way, by grouping the GCD calculations in batches of $M = O(\log n)$, we remove $\log n$ from the asymptotic:
 
 ```c++
 const int M = 1024;
@@ -360,11 +360,11 @@ u64 find_factor(u64 n) {
 }
 ```
 
-It now works at 425 factorizations per second, bottlenecked by the speed of modulo.
+Now it performs 425 factorizations per second, bottlenecked by the speed of modulo.
 
 ### Optimizing the Modulo
 
-The final step is to apply [Montgomery multiplication](/hpc/number-theory/montgomery/): the modulo is constant, so we can perform all computations — advancing the iterator, multiplication, and even computing the GCD — in the Montgomery space where reduction is cheap.
+The final step is to apply [Montgomery multiplication](/hpc/number-theory/montgomery/). Since the modulo is constant, we can perform all computations — advancing the iterator, multiplication, and even computing the GCD — in the Montgomery space where reduction is cheap:
 
 ```c++
 struct Montgomery {
@@ -413,7 +413,7 @@ u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) {
 }
 ```
 
-This implementation can processes around 3k 60-bit integers per second, which is ~3.8 faster than what [PARI](https://pari.math.u-bordeaux.fr/) / [SageMath](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html)'s `factor` function measures.
+This implementation can processes around 3k 60-bit integers per second, which is ~3.8 faster than what [PARI](https://pari.math.u-bordeaux.fr/) / [SageMath's `factor`](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html) function measures.
 
 ### Further Improvements
 
@@ -423,24 +423,24 @@ This implementation can processes around 3k 60-bit integers per second, which is
 - Our current approach is bottlenecked by advancing the iterator (the latency of Montgomery multiplication is much higher than its reciprocal throughput), and while we are waiting for it to complete, we could perform more than just one trial using the previous values.
 - If we run $p$ independent instances of the algorithm with different seeds in parallel and stop when one of them finds the answer, it would finish $\sqrt p$ times faster (the reasoning is similar to the Birthday paradox; try to prove it yourself). We don't have to use multiple cores for that: there is a lot of untapped [instruction-level parallelism](/hpc/pipelining/), so we could concurrently run two or three of the same operations on the same thread, or use [SIMD](/hpc/simd) instructions to perform 4 or 8 multiplications in parallel.
 
-I would not be surprised to see another 3x improvement and a throughput of ~10k/sec. If you [implement](https://github.com/sslotin/amh-code/tree/main/factor) some of these ideas, please [let me know](http://sereja.me/).
+I would not be surprised to see another 3x improvement and throughput of ~10k/sec. If you [implement](https://github.com/sslotin/amh-code/tree/main/factor) some of these ideas, please [let me know](http://sereja.me/).
 
 <!-- Another observation: the length of the "tail" and the cycle is equal in expectation, since when we loop around, we choose any vertex of the path we walked independently. How to optimize for the *average* case is unclear. -->
 
 **Errors.** Another aspect that we need to handle in a practical implementation is possible errors. Our current implementation has a 0.7% error rate for 60-bit integers, and it grows higher if the numbers are lower. These errors come from three main sources:
 
 - A cycle simply not being found (the algorithm is inherently random, and there is no guarantee that it will be found). In this case, we need to perform a primality test and optionally start again.
-- The `p` variable becoming zero (because both $p$ and $q$ can get into the product). It becomes increasingly more likely as we decrease size of the inputs or increase the constant `M`. In this case, we need to either restart the process or (better) roll back the last $M$ iterations and perform the trials one-by-one.
+- The `p` variable becoming zero (because both $p$ and $q$ can get into the product). This becomes increasingly likely as we decrease the size of the inputs or increase the constant `M`. In this case, we need to either restart the process or (better) roll back the last $M$ iterations and perform the trials one by one.
 - Overflows in the Montgomery multiplication. Our current implementation is pretty loose with them, and if $n$ is large, we need to add more `x > mod ? x - mod : x` kind of statements to deal with overflows.
 
 **Larger numbers.** These issues become less important if we exclude small numbers and numbers with small prime factors using the algorithms we've implemented before. In general, the optimal approach should depend on the size of the numbers:
 
-- Smaller than $2^{16}$: use a lookup table
-- Smaller than $2^{32}$: use a list of precomputed primes with a fast divsibility check
-- Smaller than $2^{64}$ or so: use Pollard's rho algorithm with Montgomery multiplication
-- Smaller than $10^{50}$: switch to [Lenstra elliptic curve factorization](https://en.wikipedia.org/wiki/Lenstra_elliptic-curve_factorization)
-- Smaller than $10^{100}$: switch to [Quadratic Sieve](https://en.wikipedia.org/wiki/Quadratic_sieve)
-- Larger than $10^{100}$: switch to [General Number Field Sieve](https://en.wikipedia.org/wiki/General_number_field_sieve)
+- Smaller than $2^{16}$: use a lookup table;
+- Smaller than $2^{32}$: use a list of precomputed primes with a fast divisibility check;
+- Smaller than $2^{64}$ or so: use Pollard's rho algorithm with Montgomery multiplication;
+- Smaller than $10^{50}$: switch to [Lenstra elliptic curve factorization](https://en.wikipedia.org/wiki/Lenstra_elliptic-curve_factorization);
+- Smaller than $10^{100}$: switch to [Quadratic Sieve](https://en.wikipedia.org/wiki/Quadratic_sieve);
+- Larger than $10^{100}$: switch to [General Number Field Sieve](https://en.wikipedia.org/wiki/General_number_field_sieve).
 
 <!-- Requiring about 100KB of memory. 6542 * 8 -->
 

From ab5ffcb7135a3848720535b47694d95acb27d504 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 26 May 2022 19:44:35 +0300
Subject: [PATCH 124/173] elaborate on benchmarking

---
 content/english/hpc/algorithms/factorization.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md
index 07bf7408..acfd0b0c 100644
--- a/content/english/hpc/algorithms/factorization.md
+++ b/content/english/hpc/algorithms/factorization.md
@@ -5,7 +5,7 @@ weight: 3
 
 The problem of factoring integers into primes is central to computational [number theory](/hpc/number-theory/). It has been [studied](https://www.cs.purdue.edu/homes/ssw/chapter3.pdf) since at least the 3rd century BC, and [many methods](https://en.wikipedia.org/wiki/Category:Integer_factorization_algorithms) have been developed that are efficient for different inputs.
 
-In this case study, we specifically consider the factorization of *word-sized* integers: those on the order of $10^9$ and $10^{18}$. Untypical for this book, in this one, you may actually learn an asymptotically better algorithm: we start with a few basic approaches and gradually build up to the $O(\sqrt[4]{n})$-time *Pollard's rho algorithm* and optimize it to the point where it can factorize 60-bit semiprimes in 0.3-0.4ms and almost 4 times faster than the previous state-of-the-art.
+In this case study, we specifically consider the factorization of *word-sized* integers: those on the order of $10^9$ and $10^{18}$. Untypical for this book, in this one, you may actually learn an asymptotically better algorithm: we start with a few basic approaches and gradually build up to the $O(\sqrt[4]{n})$-time *Pollard's rho algorithm* and optimize it to the point where it can factorize 60-bit semiprimes in 0.3-0.4ms and ~3 times faster than the previous state-of-the-art.
 
 <!--
 Integer factorization is interesting because of the RSA problem.
@@ -413,7 +413,7 @@ u64 find_factor(u64 n, u64 x0 = 2, u64 a = 1) {
 }
 ```
 
-This implementation can processes around 3k 60-bit integers per second, which is ~3.8 faster than what [PARI](https://pari.math.u-bordeaux.fr/) / [SageMath's `factor`](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html) function measures.
+This implementation can process around 3k 60-bit integers per second, which is ~3x faster than what [PARI](https://pari.math.u-bordeaux.fr/) / [SageMath's `factor`](https://doc.sagemath.org/html/en/reference/structure/sage/structure/factorization.html) / `cat semiprimes.txt | time factor` measures.
 
 ### Further Improvements
 

From 20f8043d36a60ab730eb8705ab299a137789ac3c Mon Sep 17 00:00:00 2001
From: Mike Koltsov <6823298+ItsLastDay@users.noreply.github.com>
Date: Sat, 28 May 2022 20:58:42 +0200
Subject: [PATCH 125/173] Update factorization.md

---
 content/english/hpc/algorithms/factorization.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md
index acfd0b0c..eb12946c 100644
--- a/content/english/hpc/algorithms/factorization.md
+++ b/content/english/hpc/algorithms/factorization.md
@@ -1,6 +1,7 @@
 ---
 title: Integer Factorization
 weight: 3
+published: true
 ---
 
 The problem of factoring integers into primes is central to computational [number theory](/hpc/number-theory/). It has been [studied](https://www.cs.purdue.edu/homes/ssw/chapter3.pdf) since at least the 3rd century BC, and [many methods](https://en.wikipedia.org/wiki/Category:Integer_factorization_algorithms) have been developed that are efficient for different inputs.
@@ -198,7 +199,7 @@ This approach lets us process almost 20k 30-bit integers per second, but it does
 
 Note that this is actually an asymptotic optimization: there are $O(\frac{n}{\ln n})$ primes among the first $n$ numbers, so this algorithm performs $O(\frac{\sqrt n}{\ln \sqrt n})$ operations, while wheel factorization only eliminates a large but constant fraction of divisors. If we extend it to 64-bit numbers and precompute every prime under $2^{32}$ (storing which would require several hundred megabytes of memory), the relative speedup would grow by a factor of $\frac{\ln \sqrt{n^2}}{\ln \sqrt n} = 2 \cdot \frac{1/2}{1/2} \cdot \frac{\ln n}{\ln n} = 2$.
 
-All variants of trial division, including this one, are bottlenecked by the speed of integer division, which can be [optimized](/hpc/arithmetic/division/) if we know the divisors in advice and allow for some additional precomputation. In our case, it is suitable to use [the Lemire division check](/hpc/arithmetic/division/#lemire-reduction):
+All variants of trial division, including this one, are bottlenecked by the speed of integer division, which can be [optimized](/hpc/arithmetic/division/) if we know the divisors in advance and allow for some additional precomputation. In our case, it is suitable to use [the Lemire division check](/hpc/arithmetic/division/#lemire-reduction):
 
 ```c++
 // ...precomputation is the same as before,

From 7f26b7590e635cd8ab2cbcf4db35f6ffb070ca8b Mon Sep 17 00:00:00 2001
From: bkmurali <68402765+bkmurali@users.noreply.github.com>
Date: Mon, 30 May 2022 17:17:39 -0700
Subject: [PATCH 126/173] Update segment-trees.md

---
 content/english/hpc/data-structures/segment-trees.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/content/english/hpc/data-structures/segment-trees.md b/content/english/hpc/data-structures/segment-trees.md
index 54b32b7f..f4c6fb7f 100644
--- a/content/english/hpc/data-structures/segment-trees.md
+++ b/content/english/hpc/data-structures/segment-trees.md
@@ -1,6 +1,7 @@
 ---
 title: Segment Trees
 weight: 4
+published: true
 ---
 
 The lessons learned from [optimizing](../s-tree) [binary search](../binary-search) can be applied to a broad range of data structures.
@@ -329,7 +330,7 @@ int sum(int l, int r) {
     int s = 0;
     while (l <= r) {
         if ( l & 1) s += t[l++]; // l is a right child: add it and move to a cousin
-        if (~r & 1) s += t[r--]; // r is a light child: add it and move to a cousin
+        if (~r & 1) s += t[r--]; // r is a left child: add it and move to a cousin
         l >>= 1, r >>= 1;
     }
     return s;

From 1acb0daf3d6c18372d657b5b76c810a34f2eeb44 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Sun, 5 Jun 2022 15:26:59 +0300
Subject: [PATCH 127/173] todo notes on benchmarking

---
 content/english/hpc/complexity/levels.md         | 4 ++++
 content/english/hpc/profiling/instrumentation.md | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/content/english/hpc/complexity/levels.md b/content/english/hpc/complexity/levels.md
index 981d467c..9a792917 100644
--- a/content/english/hpc/complexity/levels.md
+++ b/content/english/hpc/complexity/levels.md
@@ -78,3 +78,7 @@ Cost to implement, bugs, maintainability. It is perfectly fine that most softwar
 What does it mean to be a better programmer? Faster programs? Faster speed of work? Fewer bugs? It is a combination of those.
 
 Implementing compiler optimizations or databases are examples of high-leverage activities because they act as a tax on everything else — which is why you see most people writing books on these particular topics rather than software optimization in general.
+
+---
+
+Factorization is kind of useless by itself, but it helps with understanding how to optimize number theoretic computations in general. Same goes for sorting and binary trees: most people hold some metainformation.
diff --git a/content/english/hpc/profiling/instrumentation.md b/content/english/hpc/profiling/instrumentation.md
index e31208c8..a622e24a 100644
--- a/content/english/hpc/profiling/instrumentation.md
+++ b/content/english/hpc/profiling/instrumentation.md
@@ -4,6 +4,8 @@ weight: 1
 published: true
 ---
 
+<!-- pv in Linux, pipes -->
+
 *Instrumentation* is an overcomplicated term that means inserting timers and other tracking code into programs. The simplest example is using the `time` utility in Unix-like systems to measure the duration of execution for the whole program.
 
 More generally, we want to know *which parts* of the program need optimization. There are tools shipped with compilers and IDEs that can time designated functions automatically, but it is more robust to do it by hand using any methods of interacting with time that the language provides:

From 1cd629fa9dde73de0d810890effbc4c7cdac4db8 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Fri, 10 Jun 2022 15:32:41 +0300
Subject: [PATCH 128/173] add anagrams problem

---
 content/russian/cs/programming/bayans.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/content/russian/cs/programming/bayans.md b/content/russian/cs/programming/bayans.md
index d35880cc..9faf6139 100644
--- a/content/russian/cs/programming/bayans.md
+++ b/content/russian/cs/programming/bayans.md
@@ -307,6 +307,10 @@ def query(y):
 
 Даны $3 \cdot 10^5$ точек на плоскости. Выберите среди них любое подмножество из 500 точек и решите для него задачу коммивояжера: найдите минимальный по длине цикл, проходящий через все эти точки.
 
+## Анаграммы
+
+Найдите в строке $s$ первую подстроку, являющуюся анаграммой (перестановкой символов) строки $t$ за $O(n)$.
+
 <!--
 
 ## Случайная перестановка

From 6353bbc08b32f57b27b9c77ad531e5a0644a0a79 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Fri, 24 Jun 2022 23:45:40 +0300
Subject: [PATCH 129/173] change preposition

---
 content/english/hpc/data-structures/s-tree.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/data-structures/s-tree.md b/content/english/hpc/data-structures/s-tree.md
index bf3b2805..105608b1 100644
--- a/content/english/hpc/data-structures/s-tree.md
+++ b/content/english/hpc/data-structures/s-tree.md
@@ -3,7 +3,7 @@ title: Static B-Trees
 weight: 2
 ---
 
-This article is a follow-up to the [previous one](../binary-search), where we optimized binary search by the means of removing branching and improving the memory layout. Here, we will also be searching over sorted arrays, but this time we are not limited to fetching and comparing only one element at a time.
+This article is a follow-up to the [previous one](../binary-search), where we optimized binary search by the means of removing branching and improving the memory layout. Here, we will also be searching in sorted arrays, but this time we are not limited to fetching and comparing only one element at a time.
 
 In this article, we generalize the techniques we developed for binary search to *static B-trees* and accelerate them further using [SIMD instructions](/hpc/simd). In particular, we develop two new implicit data structures:
 

From 336f8345207628e6b7858d029a49845ca53158de Mon Sep 17 00:00:00 2001
From: sharkov63 <39223464+sharkov63@users.noreply.github.com>
Date: Fri, 1 Jul 2022 19:50:01 +0300
Subject: [PATCH 130/173] Kahan Summation section: the number in the example is
 2^24, not 2^23

---
 content/english/hpc/arithmetic/errors.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/content/english/hpc/arithmetic/errors.md b/content/english/hpc/arithmetic/errors.md
index df62e91d..47d5d42d 100644
--- a/content/english/hpc/arithmetic/errors.md
+++ b/content/english/hpc/arithmetic/errors.md
@@ -1,6 +1,7 @@
 ---
 title: Rounding Errors
 weight: 2
+published: true
 ---
 
 The way rounding works in hardware floats is remarkably simple: it occurs if and only if the result of the operation is not representable exactly, and by default gets rounded to the nearest representable number (in case of a tie preferring the number that ends with a zero).
@@ -141,7 +142,7 @@ for (int i = 0; i < n; i++)
 
 Since we are performing summations and not multiplications, its relative error is no longer just bounded by $O(\epsilon \cdot n)$, but heavily depends on the input.
 
-In the most ridiculous case, if the first value is $2^{23}$ and the other values are equal to $1$, the sum is going to be $2^{23}$ regardless of $n$, which can be verified by executing the following code and observing that it simply prints $16777216 = 2^{23}$ twice:
+In the most ridiculous case, if the first value is $2^{24}$ and the other values are equal to $1$, the sum is going to be $2^{24}$ regardless of $n$, which can be verified by executing the following code and observing that it simply prints $16777216 = 2^{24}$ twice:
 
 ```cpp
 const int n = (1<<24);
@@ -154,7 +155,7 @@ for (int i = 0; i < n; i++)
 printf("%f\n", s);
 ```
 
-This happens because `float` has only 23 mantissa bits, and so $2^{23} + 1$ is the first integer number that can't be represented exactly and has to be rounded down, which happens every time we try to add $1$ to $s = 2^{23}$. The error is indeed $O(n \cdot \epsilon)$ but in terms of the absolute error, not the relative one: in the example above, it is $2$, and it would go up to infinity if the last number happened to be $-2^{23}$.
+This happens because `float` has only 23 mantissa bits, and so $2^{24} + 1$ is the first integer number that can't be represented exactly and has to be rounded down, which happens every time we try to add $1$ to $s = 2^{24}$. The error is indeed $O(n \cdot \epsilon)$ but in terms of the absolute error, not the relative one: in the example above, it is $2$, and it would go up to infinity if the last number happened to be $-2^{24}$.
 
 The obvious solution is to switch to a larger type such as `double`, but this isn't really a scalable method. An elegant solution is to store the parts that weren't added in a separate variable, which is then added to the next variable:
 

From c76cf6d11bb1b119d5047a87e04ab83895f864b1 Mon Sep 17 00:00:00 2001
From: Matt Pharr <matt@pharr.org>
Date: Fri, 1 Jul 2022 12:27:23 -0700
Subject: [PATCH 131/173] Assorted minor edits / fixups.

---
 content/english/hpc/algorithms/factorization.md      |  2 +-
 content/english/hpc/algorithms/gcd.md                |  2 +-
 content/english/hpc/algorithms/prefix.md             |  2 +-
 content/english/hpc/architecture/indirect.md         |  2 +-
 content/english/hpc/architecture/isa.md              |  2 +-
 content/english/hpc/architecture/layout.md           |  6 +++---
 content/english/hpc/architecture/loops.md            | 12 ++++++------
 content/english/hpc/arithmetic/division.md           | 12 ++++++------
 content/english/hpc/arithmetic/ieee-754.md           |  2 +-
 content/english/hpc/arithmetic/newton.md             |  2 +-
 content/english/hpc/compilation/_index.md            |  2 +-
 content/english/hpc/compilation/contracts.md         |  4 ++--
 content/english/hpc/compilation/precalc.md           |  6 +++---
 content/english/hpc/compilation/situational.md       |  4 ++--
 content/english/hpc/compilation/stages.md            | 10 +++++-----
 content/english/hpc/complexity/_index.md             |  2 +-
 content/english/hpc/cpu-cache/_index.md              |  4 ++--
 content/english/hpc/data-structures/binary-search.md |  6 +++---
 content/english/hpc/data-structures/s-tree.md        |  4 ++--
 content/english/hpc/external-memory/_index.md        |  2 +-
 content/english/hpc/external-memory/hierarchy.md     |  2 +-
 content/english/hpc/pipelining/_index.md             |  2 +-
 content/english/hpc/pipelining/branchless.md         |  6 +++---
 content/english/hpc/pipelining/tables.md             |  2 +-
 24 files changed, 50 insertions(+), 50 deletions(-)

diff --git a/content/english/hpc/algorithms/factorization.md b/content/english/hpc/algorithms/factorization.md
index eb12946c..b900eb8c 100644
--- a/content/english/hpc/algorithms/factorization.md
+++ b/content/english/hpc/algorithms/factorization.md
@@ -43,7 +43,7 @@ vector<u64> factorize(u64 n) {
 
 After each removed factor, the problem becomes considerably smaller, so the worst-case running time of full factorization is equal to the worst-case running time of a `find_factor` call. 
 
-For many factorization algorithms, including those presented in this article, the running time scales with the smaller prime factor. Therefore, to provide worst-case input, we use *semiprimes:* products of two prime numbers $p \le q$ that are on the same order of magnitude. We generate a $k$-bit semiprime as the product of two random $\lfloor k / 2 \rfloor$-bit primes.
+For many factorization algorithms, including those presented in this section, the running time scales with the smaller prime factor. Therefore, to provide worst-case input, we use *semiprimes:* products of two prime numbers $p \le q$ that are on the same order of magnitude. We generate a $k$-bit semiprime as the product of two random $\lfloor k / 2 \rfloor$-bit primes.
 
 Since some of the algorithms are inherently randomized, we also tolerate a small (<1%) percentage of false-negative errors (when `find_factor` returns `1` despite number $n$ being composite), although this rate can be reduced to almost zero without significant performance penalties.
 
diff --git a/content/english/hpc/algorithms/gcd.md b/content/english/hpc/algorithms/gcd.md
index d56be8f7..7941edd0 100644
--- a/content/english/hpc/algorithms/gcd.md
+++ b/content/english/hpc/algorithms/gcd.md
@@ -14,7 +14,7 @@ $$
 \gcd(a, b) = \max_{g: \; g|a \, \land \, g | b} g
 $$
 
-You probably already know this algorithm from a CS textbook, but let me briefly remind it anyway. It is based on the following formula, assuming that $a > b$:
+You probably already know this algorithm from a CS textbook, but I will summarize it here. It is based on the following formula, assuming that $a > b$:
 
 $$
 \gcd(a, b) = \begin{cases}
diff --git a/content/english/hpc/algorithms/prefix.md b/content/english/hpc/algorithms/prefix.md
index 81d31900..43bfd560 100644
--- a/content/english/hpc/algorithms/prefix.md
+++ b/content/english/hpc/algorithms/prefix.md
@@ -146,7 +146,7 @@ Another interesting data point: if we only execute the `prefix` phase, the perfo
 
 ### Blocking
 
-So, we have a memory bandwidth problem for large arrays. We can avoid re-fetching the entire array from the RAM if we split it into blocks that fit in the cache and process them separately. All we need to pass to the next block is the sum of the previous ones, so we can design a `local_prefix` function with an interface similar to `accumulate`:
+So, we have a memory bandwidth problem for large arrays. We can avoid re-fetching the entire array from RAM if we split it into blocks that fit in the cache and process them separately. All we need to pass to the next block is the sum of the previous ones, so we can design a `local_prefix` function with an interface similar to `accumulate`:
 
 ```c++
 const int B = 4096; // <- ideally should be slightly less or equal to the L1 cache
diff --git a/content/english/hpc/architecture/indirect.md b/content/english/hpc/architecture/indirect.md
index 487b81e3..1bd96c06 100644
--- a/content/english/hpc/architecture/indirect.md
+++ b/content/english/hpc/architecture/indirect.md
@@ -102,7 +102,7 @@ There are many ways to implement this behavior, but C++ does it using a *virtual
 
 For all concrete implementations of `Animal`, compiler pads all their methods (that is, their instruction sequences) so that they have the exact same length for all classes (by inserting some [filler instructions](../layout) after `ret`) and then just writes them sequentially somewhere in the instruction memory. Then it adds a *run-time type information* field to the structure (that is, to all its instances), which is essentially just the offset in the memory region that points to the right implementation of the virtual methods of the class.
 
-During a virtual method call, that offset field is fetched from the instance of a structure, and a normal function call is made with it, using the fact that all methods and other fields of every derived class have exactly the same offsets.
+With a virtual method call, that offset field is fetched from the instance of a structure and a normal function call is made with it, using the fact that all methods and other fields of every derived class have exactly the same offsets.
 
 Of course, this adds some overhead:
 
diff --git a/content/english/hpc/architecture/isa.md b/content/english/hpc/architecture/isa.md
index 4862efb3..b902f69c 100644
--- a/content/english/hpc/architecture/isa.md
+++ b/content/english/hpc/architecture/isa.md
@@ -14,7 +14,7 @@ Abstractions help us in reducing all this complexity down to a single *interface
 
 Hardware engineers love abstractions too. An abstraction of a CPU is called an *instruction set architecture* (ISA), and it defines how a computer should work from a programmer's perspective. Similar to software interfaces, it gives computer engineers the ability to improve on existing CPU designs while also giving its users — us, programmers — the confidence that things that worked before won't break on newer chips.
 
-An ISA essentially defines how the hardware should interpret the machine language. Apart from instructions and their binary encodings, an ISA importantly defines the counts, sizes, and purposes of registers, the memory model, and the input/output model. Similar to software interfaces, ISAs can be extended too: in fact, they are often updated, mostly in a backward-compatible way, to add new and more specialized instructions that can improve performance.
+An ISA essentially defines how the hardware should interpret the machine language. Apart from instructions and their binary encodings, an ISA also defines the counts, sizes, and purposes of registers, the memory model, and the input/output model. Similar to software interfaces, ISAs can be extended too: in fact, they are often updated, mostly in a backward-compatible way, to add new and more specialized instructions that can improve performance.
 
 ### RISC vs CISC
 
diff --git a/content/english/hpc/architecture/layout.md b/content/english/hpc/architecture/layout.md
index 37779325..df414512 100644
--- a/content/english/hpc/architecture/layout.md
+++ b/content/english/hpc/architecture/layout.md
@@ -30,7 +30,7 @@ Loop Stream Detector (LSD)
 
 ### Code Alignment
 
-Other things being equal, compilers typically prefer instructions with shorter machine code, because this way more instructions can fit in a single 32B fetch block, and also because it reduces the size of the binary. But sometimes the reverse advice applies, caused by the fact that the fetched instructions blocks have to be aligned.
+Other things being equal, compilers typically prefer instructions with shorter machine code, because this way more instructions can fit in a single 32B fetch block, and also because it reduces the size of the binary. But sometimes the reverse is preferable, due to the fact that the fetched instructions' blocks must be aligned.
 
 Imagine that you need to execute an instruction sequence that starts on the last byte of a 32B-aligned block. You may be able to execute the first instruction without additional delay, but for the subsequent ones, you have to wait for one additional cycle to do another instruction fetch. If the code block was aligned on a 32B boundary, then up to 4 instructions could be decoded and then executed concurrently (unless they are extra long or interdependent).
 
@@ -46,7 +46,7 @@ In GCC, you can use `-falign-labels=n` flag to specify a particular alignment po
 
 The instructions are stored and fetched using largely the same [memory system](/hpc/cpu-cache) as for the data, except maybe the lower layers of cache are replaced with a separate *instruction cache* (because you wouldn't want a random data read to kick out the code that processes it).
 
-The instruction cache is crucial in situations when you either
+The instruction cache is crucial in situations when you either:
 
 - don't know what instructions you are going to execute next, and need to fetch the next block with [low latency](/hpc/cpu-cache/latency),
 - or are executing a long sequence of verbose-but-quick-to-process instructions, and need [high bandwidth](/hpc/cpu-cache/bandwidth).
@@ -153,7 +153,7 @@ length:
     ret
 ```
 
-This is a very important issue, and we will spend [much of the next chapter](/hpc/pipelining/branching) discussing it in more detail.
+Eliminating branches is an important topic, and we will spend [much of the next chapter](/hpc/pipelining/branching) discussing it in more detail.
 
 <!--
 
diff --git a/content/english/hpc/architecture/loops.md b/content/english/hpc/architecture/loops.md
index 5da022f5..ad59d890 100644
--- a/content/english/hpc/architecture/loops.md
+++ b/content/english/hpc/architecture/loops.md
@@ -29,7 +29,7 @@ Labels can be any string, but compilers don't get creative and [typically](https
 
 It is reasonable to think that these conditions are computed as `bool`-s somewhere and passed to conditional jumps as operands: after all, this is how it works in programming languages. But that is not how it is implemented in hardware. Conditional operations use a special `FLAGS` register, which first needs to be populated by executing instructions that perform some kind of check.
 
-In our example, `cmp rax, rcx` compares the iterator `rax` with the end-of-array pointer `rcx`. This updates the FLAGS register, and now it can be used by `jne loop`, which looks up a certain bit there that tells whether the two values are equal or not, and then either jumps back to the beginning or continues to the next instruction, thus breaking the loop.
+In our example, `cmp rax, rcx` compares the iterator `rax` with the end-of-array pointer `rcx`. This updates the `FLAGS` register, and now it can be used by `jne loop`, which looks up a certain bit there that tells whether the two values are equal or not, and then either jumps back to the beginning or continues to the next instruction, thus breaking the loop.
 
 ### Loop Unrolling
 
@@ -61,15 +61,15 @@ loop:
 
 Now we only need 3 loop control instructions for 4 useful ones (an improvement from $\frac{1}{4}$ to $\frac{4}{7}$ in terms of efficiency), and this can be continued to reduce the overhead almost to zero.
 
-In practice though, unrolling loops isn't always necessary for performance because modern processors don't actually execute instructions one-by-one, but maintain a [queue of pending instructions](/hpc/pipelining) so that two independent operations can be executed concurrently without waiting for each other to finish.
+In practice, unrolling loops isn't always necessary for performance because modern processors don't actually execute instructions one-by-one, but maintain a [queue of pending instructions](/hpc/pipelining) so that two independent operations can be executed concurrently without waiting for each other to finish.
 
 This is our case too: the real speedup from unrolling won't be fourfold, because the operations of incrementing the counter and checking if we are done are independent from the loop body, and can be scheduled to run concurrently with it. But may still be beneficial to [ask the compiler](/hpc/compilation/situational) to unroll it to some extent.
 
 ### An Alternative Approach
 
-You don't have to explicitly use `cmp` or a similar instruction to make a conditional jump. Many other instructions either read or modify the FLAGS register, sometimes as a by-product enabling optional exception checks.
+You don't have to explicitly use `cmp` or a similar instruction to make a conditional jump. Many other instructions either read or modify the `FLAGS` register, sometimes as a by-product enabling optional exception checks.
 
-For example, `add` always sets a bunch of flags, denoting whether the result is zero, is negative, whether an overflow or an underflow occurred, and so on. Taking advantage of this mechanism, compilers often produce loops like this:
+For example, `add` always sets a number of flags, denoting whether the result is zero, is negative, whether an overflow or an underflow occurred, and so on. Taking advantage of this mechanism, compilers often produce loops like this:
 
 ```nasm
     mov  rax, -100  ; replace 100 with the array size
@@ -79,7 +79,7 @@ loop:
     jnz  loop       ; checks if the result is zero
 ```
 
-This code is a bit harder to read for a human, but it is one instruction shorter in the repeated part, which isn't huge, but non-negligible for performance.
+This code is a bit harder to read for a human, but it is one instruction shorter in the repeated part, which may meaningfully affect performance.
 
 <!--
 
@@ -117,7 +117,7 @@ cmov
 
 Need to somehow link it to branchless programming and layout article. We now have 3 places introducing the concept.
 
-Many other operations set something in the FLAGS register. For example, add often. It is useful to, and then decrement or increment it to save on instruction. Like a while loop:
+Many other operations set something in the `FLAGS` register. For example, add often. It is useful to, and then decrement or increment it to save on instruction. Like a while loop:
 
 ```
 while (n--) {
diff --git a/content/english/hpc/arithmetic/division.md b/content/english/hpc/arithmetic/division.md
index ad1cf525..0bf44da8 100644
--- a/content/english/hpc/arithmetic/division.md
+++ b/content/english/hpc/arithmetic/division.md
@@ -118,15 +118,15 @@ This method requires some precomputation, including performing one actual divisi
 It is not very clear why such $m$ and $s$ always exist, let alone how to find them. But given a fixed $s$, intuition tells us that $m$ should be as close to $2^s/y$ as possible for $2^s$ to cancel out. So there are two natural choices: $\lfloor 2^s/y \rfloor$ and $\lceil 2^s/y \rceil$. The first one doesn't work, because if you substitute
 
 $$
-\lfloor \frac{x \cdot \lfloor 2^s/y \rfloor}{2^s} \rfloor
+\Bigl \lfloor \frac{x \cdot \lfloor 2^s/y \rfloor}{2^s} \Bigr \rfloor
 $$
 
 then for any integer $\frac{x}{y}$ where $y$ is not even, the result will be strictly less than the truth. This only leaves the other case, $m = \lceil 2^s/y \rceil$. Now, let's try to derive the lower and upper bounds for the result of the computation:
 
 $$
   \lfloor x / y \rfloor
-= \lfloor \frac{x \cdot m}{2^s} \rfloor
-= \lfloor \frac{x \cdot \lceil  2^s /y \rceil}{2^s} \rfloor
+= \Bigl \lfloor \frac{x \cdot m}{2^s} \Bigr \rfloor
+= \Bigl \lfloor \frac{x \cdot \lceil  2^s /y \rceil}{2^s} \Bigr \rfloor
 $$
 
 Let's start with the bounds for $m$:
@@ -144,7 +144,7 @@ And now for the whole expression:
 $$
 x / y - 1
 <
-\lfloor \frac{x \cdot \lceil  2^s /y \rceil}{2^s} \rfloor
+\Bigl \lfloor \frac{x \cdot \lceil  2^s /y \rceil}{2^s} \Bigr \rfloor
 <
 x / y + x / 2^s
 $$
@@ -182,8 +182,8 @@ Now, for 32-bit integers, we can set $s = 64$ and look at the computation that w
 
 $$
   \lfloor x / y \rfloor
-= \lfloor \frac{x \cdot m}{2^s} \rfloor
-= \lfloor \frac{x \cdot \lceil  2^s /y \rceil}{2^s} \rfloor
+= \Bigl \lfloor \frac{x \cdot m}{2^s} \Bigr \rfloor
+= \Bigl \lfloor \frac{x \cdot \lceil  2^s /y \rceil}{2^s} \Bigr \rfloor
 $$
 
 What we really do here is we multiply $x$ by a floating-point constant ($x \cdot m$) and then truncate the result $(\lfloor \frac{\cdot}{2^s} \rfloor)$.
diff --git a/content/english/hpc/arithmetic/ieee-754.md b/content/english/hpc/arithmetic/ieee-754.md
index 65cc5f48..7787b589 100644
--- a/content/english/hpc/arithmetic/ieee-754.md
+++ b/content/english/hpc/arithmetic/ieee-754.md
@@ -50,7 +50,7 @@ IEEE 754 and a few consequent standards define not one but *several* representat
 Their availability ranges from chip to chip:
 
 - Most CPUs support single- and double-precision — which is what `float` and `double` types refer to in C.
-- Extended formats are exclusive to x86, and are available in C as the `long double` type, which falls back to double precision on arm. The choice of 64 bits for mantissa is so that every `long long` integer can be represented exactly. There is also a 40-bit format that similarly allocates 32 mantissa bits.
+- Extended formats are exclusive to x86, and are available in C as the `long double` type, which falls back to double precision on Arm CPUs. The choice of 64 bits for mantissa is so that every `long long` integer can be represented exactly. There is also a 40-bit format that similarly allocates 32 mantissa bits.
 - Quadruple as well as the 256-bit "octuple" formats are only used for specific scientific computations and are not supported by general-purpose hardware.
 - Half-precision arithmetic only supports a small subset of operations and is generally used for applications such as machine learning, especially neural networks, because they tend to perform large amounts of calculations but don't require high levels of precision.
 - Half-precision is being gradually replaced by bfloat, which trades off 3 mantissa bits to have the same range as single-precision, enabling interoperability with it. It is mostly being adopted by specialized hardware: TPUs, FGPAs, and GPUs. The name stands for "[Brain](https://en.wikipedia.org/wiki/Google_Brain) float."
diff --git a/content/english/hpc/arithmetic/newton.md b/content/english/hpc/arithmetic/newton.md
index de42104c..510312aa 100644
--- a/content/english/hpc/arithmetic/newton.md
+++ b/content/english/hpc/arithmetic/newton.md
@@ -5,7 +5,7 @@ weight: 3
 
 Reaching the maximum possible precision is rarely required from a practical algorithm. In real-world data, modeling and measurement errors are usually several orders of magnitude larger than the errors that come from rounding floating-point numbers and such, and we are often perfectly happy with picking an approximate method that trades off precision for speed.
 
-In this section, we introduce one of the most important building blocks in such approximate, numerical algorithms: *the Newton's method*.
+In this section, we introduce one of the most important building blocks in such approximate, numerical algorithms: *Newton's method*.
 
 ## Newton's Method
 
diff --git a/content/english/hpc/compilation/_index.md b/content/english/hpc/compilation/_index.md
index 07b0e07f..e32ba624 100644
--- a/content/english/hpc/compilation/_index.md
+++ b/content/english/hpc/compilation/_index.md
@@ -6,6 +6,6 @@ weight: 4
 
 The main benefit of [learning assembly language](../architecture/assembly) is not the ability to write programs in it, but the understanding of what is happening during the execution of compiled code and its performance implications.
 
-There are rare cases where we *really* need to switch to handwritten assembly for maximal performance, but most of the time compilers are capable of producing near-optimal code all by themselves. When they do not, it is usually because the programmer knows more about the problem than what can be inferred from the source code, but failed to communicate this extra information to the compiler.
+There are rare cases where we *really* need to switch to handwritten assembly for maximal performance, but most of the time compilers are capable of producing near-optimal code all by themselves. When they do not, it is usually because the programmer knows more about the problem than what can be inferred from the source code but failed to communicate this extra information to the compiler.
 
 In this chapter, we will discuss the intricacies of getting the compiler to do exactly what we want and gathering useful information that can guide further optimizations.
diff --git a/content/english/hpc/compilation/contracts.md b/content/english/hpc/compilation/contracts.md
index e3db2955..56a50d6b 100644
--- a/content/english/hpc/compilation/contracts.md
+++ b/content/english/hpc/compilation/contracts.md
@@ -21,7 +21,7 @@ There are two major groups of actions that cause undefined behavior:
 
 Designating something as undefined instead of implementation-defined behavior also helps compilers in optimization. Consider the case of signed integer overflow. On almost all architectures, [signed integers](/hpc/arithmetic/integer) overflow the same way as unsigned ones, with `INT_MAX + 1 == INT_MIN`, and yet, this is undefined behavior according to the C++ standard. This is very much intentional: if you disallow signed integer overflow, then `(x + 1) > x` is guaranteed to be always true for `int`, but not for `unsigned int`, because `(x + 1)` may overflow. For signed types, this lets compilers optimize such checks away.
 
-As a more naturally occurring example, consider the case of a loop with an integer control variable. Modern C++ and languages like Rust are advocating for using an unsigned integer (`size_t` / `usize`), while C programmers stubbornly keep using `int`. To understand why, consider the following `for` loop:
+As a more naturally occurring example, consider the case of a loop with an integer control variable. Modern C++ and languages like Rust encourage programmers to use an unsigned integer (`size_t` / `usize`), while C programmers stubbornly keep using `int`. To understand why, consider the following `for` loop:
 
 ```cpp
 for (unsigned int i = 0; i < n; i++) {
@@ -157,7 +157,7 @@ void add(int *a, int *b, int n) {
 
 Since each iteration of this loop is independent, it can be executed in parallel and [vectorized](/hpc/simd). But is it, technically?
 
-There may be a problem if the arrays `a` and `b` intersect. Consider the case when `b == a + 1`, that is, if `b` is just a memory view of `a` starting from its second element. In this case, the next iteration depends on the previous one, and the only correct solution is to execute the loop sequentially. The compiler has to check for such possibilities, even if the programmer knows they can't happen.
+There may be a problem if the arrays `a` and `b` intersect. Consider the case when `b == a + 1`, that is, if `b` is just a memory view of `a` starting from its second element. In this case, the next iteration depends on the previous one, and the only correct solution is to execute the loop sequentially. The compiler has to check for such possibilities even if the programmer knows they can't happen.
 
 This is why we have `const` and `restrict` keywords. The first one enforces that we won't modify memory with the pointer variable, and the second is a way to tell the compiler that the memory is guaranteed to not be aliased.
 
diff --git a/content/english/hpc/compilation/precalc.md b/content/english/hpc/compilation/precalc.md
index 4a7cb7b7..7de4c8fb 100644
--- a/content/english/hpc/compilation/precalc.md
+++ b/content/english/hpc/compilation/precalc.md
@@ -5,11 +5,11 @@ weight: 8
 
 When compilers can infer that a certain variable does not depend on any user-provided data, they can compute its value during compile time and turn it into a constant by embedding it into the generated machine code.
 
-This optimization helps performance a lot, but it is not a part of the C++ standard, so compilers don't *have to* do that. When a compile-time computation is either hard to implement or time-intensive, they have a full legal right to pass on that opportunity.
+This optimization helps performance a lot, but it is not a part of the C++ standard, so compilers don't *have to* do that. When a compile-time computation is either hard to implement or time-intensive, a compiler may pass on that opportunity.
 
 ### Constant Expressions
 
-In modern C++, you can mark a function as `constexpr`, and if it is called by passing constants, its value is guaranteed to be computed during compile time:
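+For a more reliable solution, in modern C++ you can mark a function as `constexpr`; if it is called with constant arguments in a context that requires a constant expression (such as a `static_assert` or the initializer of a `constexpr` variable), its value is guaranteed to be computed at compile time: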
 
 ```c++
 constexpr int fibonacci(int n) {
@@ -23,7 +23,7 @@ static_assert(fibonacci(10) == 55);
 
 These functions have some restrictions like that they only call other `constexpr` functions and can't do memory allocation, but otherwise, they are executed "as is."
 
-Note that while they don't cost anything during the run time, they still increase compilation time, so at least remotely care about their efficiency and don't put something NP-complete in them:
+Note that while `constexpr` functions don't cost anything during run time, they still increase compilation time, so at least remotely care about their efficiency and don't put something NP-complete in them:
 
 ```c++
 constexpr int fibonacci(int n) {
diff --git a/content/english/hpc/compilation/situational.md b/content/english/hpc/compilation/situational.md
index ee758f06..bec2a255 100644
--- a/content/english/hpc/compilation/situational.md
+++ b/content/english/hpc/compilation/situational.md
@@ -63,7 +63,7 @@ This is a new feature that only appeared in C++20. Before that, there were compi
 
 ```c++
 int factorial(int n) {
-    if (likely(n > 1))
+    if (__builtin_expect(n > 1, 1))
         return n * factorial(n - 1);
     else
         return 1;
@@ -102,7 +102,7 @@ After we run the program — preferably on input that is as representative of re
 g++ -fprofile-use [other flags] source.cc -o binary
 ```
 
-It usually improves performance by 10-20% for large codebases, and for this reason it is commonly included in the build process of performance-critical projects. One more reason to invest in solid benchmarking code.
+It usually improves performance by 10-20% for large codebases, and for this reason it is commonly included in the build process of performance-critical projects. This is one more reason to invest in solid benchmarking code.
 
 <!--
 
diff --git a/content/english/hpc/compilation/stages.md b/content/english/hpc/compilation/stages.md
index d321076d..95c40050 100644
--- a/content/english/hpc/compilation/stages.md
+++ b/content/english/hpc/compilation/stages.md
@@ -7,10 +7,10 @@ Before jumping straight to compiler optimizations, which is what most of this ch
 
 1. **Preprocessing** expands macros, pulls included source from header files, and strips off comments from source code: `gcc -E source.c` (outputs preprocessed source to stdout)
 2. **Compiling** parses the source, checks for syntax errors, converts it into an intermediate representation, performs optimizations, and finally translates it into assembly language: `gcc -S file.c` (emits an `.s` file)
-3. **Assembly** turns it into machine code, except that any external function calls like `printf` are substituted with placeholders: `gcc -c file.c` (emits an `.o` file, called *object file*)
+3. **Assembly** turns assembly language into machine code, except that any external function calls like `printf` are substituted with placeholders: `gcc -c file.c` (emits an `.o` file, called *object file*)
 4. **Linking** finally resolves the function calls by plugging in their actual addresses, and produces an executable binary: `gcc -o binary file.c`
 
-There are possibilities to gain something for performance in each of these stages.
+There are possibilities to improve program performance in each of these stages.
 
 ### Interprocedural Optimization
 
@@ -19,13 +19,13 @@ We have the last [stage](../stages), linking, because it is is both easier and f
 It also gives the ability to distribute code as *libraries*, which can be either *static* or *shared*:
 
 - *Static* libraries are simply collections of precompiled object files that are merged with other sources by the compiler to produce a single executable, just as it normally would.
-- *Dynamic* or *shared* libraries are precompiled executables that have additional meta-information about where their callables are, references to which are resolved during runtime. As the name suggests, this allows *sharing* the compiled binaries between multiple users.
+- *Dynamic* or *shared* libraries are precompiled executables that have additional meta-information about where their callables are, references to which are resolved during runtime. As the name suggests, this allows *sharing* the compiled binaries between multiple programs.
 
 The main advantage of using static libraries is that you can perform various *interprocedural optimizations* that require more context than just the signatures of library functions, such as [function inlining](/hpc/architecture/functions) or dead code elimination. To force the linker to look for and only accept static libraries, you can pass the `-static` option.
 
-This process is called *link-time optimization*, and it is possible because modern compilers also store some form of *intermediate representation* in object files, which allows them to perform certain lightweight optimizations on the program as a whole. This also allows using different compiled languages in the same program, which can even be optimized across language barriers if their compilers use the same intermediate representation.
+This process is called *link-time optimization (LTO)*, and it is possible because modern compilers also store some form of *intermediate representation* in object files, which allows them to perform certain lightweight optimizations on the program as a whole. This also allows using different compiled languages in the same program, which can even be optimized across language barriers if their compilers use the same intermediate representation.
 
-LTO is a relatively recent feature (it appeared in GCC only around 2014), and it is still far from perfect. In C and C++, the way to make sure no performance is lost is to create a *header-only library*. As the name suggests, they are just header files that contain full definitions of all functions, and so by simply including them, the compiler gets access to all optimizations possible. Although you do have to recompile them from scratch each time, this approach retains full control and makes sure that no performance is lost.
+LTO is a relatively recent feature (it appeared in GCC only around 2014), and it is still far from perfect. In C and C++, the way to make sure no performance is lost due to separate compilation is to create a *header-only library*. As the name suggests, such libraries are just header files that contain full definitions of all functions, and so by simply including them, the compiler gets access to all optimizations possible. Although you do have to recompile the library code from scratch each time, this approach retains full control over how the code is optimized.
 
 ### Inspecting the Output
 
diff --git a/content/english/hpc/complexity/_index.md b/content/english/hpc/complexity/_index.md
index f38545e0..64b8a2f2 100644
--- a/content/english/hpc/complexity/_index.md
+++ b/content/english/hpc/complexity/_index.md
@@ -11,7 +11,7 @@ Complexity is an old concept. It was [systematically formulated](http://www.cs.a
 
 ### Classical Complexity Theory
 
-The "elementary operations" of a CPU are called *instructions*, and their "costs" are called *latencies*. Instructions are stored in *memory* and executed one by one by the processor, which has some internal *state* stored in a number of *registers*. One of these registers is the *instruction pointer* that indicates the address of the next instruction to read and execute. Each instruction changes the state of the processor in a certain way (including moving the instruction pointer), possibly modifies the main memory, and takes a different number of *CPU cycles* to complete before the next one can be started.
+The "elementary operations" of a CPU are called *instructions*, and their "costs" are called *latencies*. Instructions are stored in *memory* and executed one by one by the processor, which has some internal *state* stored in a number of *registers*. One of these registers is the *instruction pointer*, which indicates the address of the next instruction to read and execute. Each instruction changes the state of the processor in a certain way (including moving the instruction pointer), possibly modifies the main memory, and takes a different number of *CPU cycles* to complete before the next one can be started.
 
 To estimate the real running time of a program, you need to sum all latencies for its executed instructions and divide it by the *clock frequency*, that is, the number of cycles a particular CPU does per second. 
 
diff --git a/content/english/hpc/cpu-cache/_index.md b/content/english/hpc/cpu-cache/_index.md
index 484a39dc..ef1bbd6f 100644
--- a/content/english/hpc/cpu-cache/_index.md
+++ b/content/english/hpc/cpu-cache/_index.md
@@ -5,7 +5,7 @@ weight: 9
 
 In the [previous chapter](../external-memory), we studied computer memory from a theoretical standpoint, using the [external memory model](../external-memory/model) to estimate the performance of memory-bound algorithms.
 
-While it is more or less accurate for computations involving HDDs and network storage, where in-memory arithmetic is negligibly fast compared to the external I/O operations, it is too imprecise for lower levels in the cache hierarchy, where the costs of these operations become comparable.
+While the external memory model is more or less accurate for computations involving HDDs and network storage, where the cost of arithmetic operations on in-memory values is negligible compared to external I/O operations, it is too imprecise for lower levels in the cache hierarchy, where the costs of these operations become comparable.
 
 To perform more fine-grained optimization of in-memory algorithms, we have to start taking into account the many specific details of the CPU cache system. And instead of studying loads of boring Intel documents with dry specs and theoretically achievable limits, we will estimate these parameters experimentally by running numerous small benchmark programs with access patterns that resemble the ones that often occur in practical code.
 
@@ -34,7 +34,7 @@ Although the CPU can be clocked at 4.1GHz in boost mode, we will perform most ex
 
 -->
 
-Due to difficulties in [refraining the compiler from cheating](/hpc/profiling/noise/), the code snippets in this article are slightly simplified for exposition purposes. Check the [code repository](https://github.com/sslotin/amh-code/tree/main/cpu-cache) if you want to reproduce them yourself.
+Due to difficulties in [preventing the compiler from optimizing away unused values](/hpc/profiling/noise/), the code snippets in this article are slightly simplified for exposition purposes. Check the [code repository](https://github.com/sslotin/amh-code/tree/main/cpu-cache) if you want to reproduce them yourself.
 
 ### Acknowledgements
 
diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index 8a4924ea..6e73d32d 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -9,7 +9,7 @@ Instead, the most fascinating showcases of performance engineering are multifold
 
 <!-- Yet, with remarkable periodicity, these can be optimized to ridiculous levels of performance. -->
 
-In this article, we focus on one such fundamental algorithm — *binary search* — and implement two of its variants that are, depending on the problem size, up to 4x faster than `std::lower_bound`, while being under just 15 lines of code.
+In this section, we focus on one such fundamental algorithm — *binary search* — and implement two of its variants that are, depending on the problem size, up to 4x faster than `std::lower_bound`, while being under just 15 lines of code.
 
 The first algorithm achieves that by removing [branches](/hpc/pipelining/branching), and the second also optimizes the memory layout to achieve better [cache system](/hpc/cpu-cache) performance. This technically disqualifies it from being a drop-in replacement for `std::lower_bound` as it needs to permute the elements of the array before it can start answering queries — but I can't recall a lot of scenarios where you obtain a sorted array but can't afford to spend linear time on preprocessing.
 
@@ -401,7 +401,7 @@ Also, note that the last few prefetch requests are actually not needed, and in f
 
 This prefetching technique allows us to read up to four elements ahead, but it doesn't really come for free — we are effectively trading off excess memory [bandwidth](/hpc/cpu-cache/bandwidth) for reduced [latency](/hpc/cpu-cache/latency). If you run more than one instance at a time on separate hardware threads or just any other memory-intensive computation in the background, it will significantly [affect](/hpc/cpu-cache/sharing) the benchmark performance.
 
-But we can do better. Instead of fetching four cache lines at a time, we could fetch four times *fewer* cache lines. And in the [next article](../s-tree), we will explore the approach.
+But we can do better. Instead of fetching four cache lines at a time, we could fetch four times *fewer* cache lines. In the [next section](../s-tree), we will explore this approach.
 
 <!--
 
@@ -413,7 +413,7 @@ But that was a small detour. Let's get back to optimizing for *large* arrays.
 
 ### Removing the Last Branch
 
-Just the finishing touch. Did you notice the bumpiness of the Eytzinger search? This isn't random noise — let's zoom in:
+Just one finishing touch: did you notice the bumpiness of the Eytzinger search? This isn't random noise — let's zoom in:
 
 ![](../img/search-eytzinger-small.svg)
 
diff --git a/content/english/hpc/data-structures/s-tree.md b/content/english/hpc/data-structures/s-tree.md
index 105608b1..d241aed5 100644
--- a/content/english/hpc/data-structures/s-tree.md
+++ b/content/english/hpc/data-structures/s-tree.md
@@ -3,9 +3,9 @@ title: Static B-Trees
 weight: 2
 ---
 
-This article is a follow-up to the [previous one](../binary-search), where we optimized binary search by the means of removing branching and improving the memory layout. Here, we will also be searching in sorted arrays, but this time we are not limited to fetching and comparing only one element at a time.
+This section is a follow-up to the [previous one](../binary-search), where we optimized binary search by removing branches and improving the memory layout. Here, we will also be searching in sorted arrays, but this time we are not limited to fetching and comparing only one element at a time.
 
-In this article, we generalize the techniques we developed for binary search to *static B-trees* and accelerate them further using [SIMD instructions](/hpc/simd). In particular, we develop two new implicit data structures:
+In this section, we generalize the techniques we developed for binary search to *static B-trees* and accelerate them further using [SIMD instructions](/hpc/simd). In particular, we develop two new implicit data structures:
 
 - The [first](#b-tree-layout) is based on the memory layout of a B-tree, and, depending on the array size, it is up to 8x faster than `std::lower_bound` while using the same space as the array and only requiring a permutation of its elements.
 - The [second](#b-tree-layout-1) is based on the memory layout of a B+ tree, and it is up to 15x faster than `std::lower_bound` while using just 6-7% more memory — or 6-7% **of** the memory if we can keep the original sorted array.
diff --git a/content/english/hpc/external-memory/_index.md b/content/english/hpc/external-memory/_index.md
index 2255945b..fe53c83a 100644
--- a/content/english/hpc/external-memory/_index.md
+++ b/content/english/hpc/external-memory/_index.md
@@ -19,7 +19,7 @@ When you fetch anything from memory, the request goes through an incredibly comp
 
 -->
 
-When you fetch anything from memory, there is always some non-zero latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through an incredibly complex system of address translation units and caching layers designed to both help in memory management and reduce the latency.
+When you fetch anything from memory, there is always some latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through a complex system of address translation units and caching layers designed to both help in memory management and reduce the latency.
 
 Therefore, the only correct answer to this question is "it depends" — primarily on where the operands are stored:
 
diff --git a/content/english/hpc/external-memory/hierarchy.md b/content/english/hpc/external-memory/hierarchy.md
index f0ca9c65..da1f5bb6 100644
--- a/content/english/hpc/external-memory/hierarchy.md
+++ b/content/english/hpc/external-memory/hierarchy.md
@@ -58,7 +58,7 @@ There are other caches inside CPUs that are used for something other than data.
 
 ### Non-Volatile Memory
 
-While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data to be persisted for prolonged periods of time without power but comes at the cost of performance and durability — because when you have more electrons, you also have more opportunities for them colliding with silicon atoms.
+While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data persist for prolonged periods of time without power but comes at the cost of performance and durability, because when you have more electrons, you also have more opportunities for them to collide with silicon atoms.
 
 <!-- error correction -->
 
diff --git a/content/english/hpc/pipelining/_index.md b/content/english/hpc/pipelining/_index.md
index e18a31cc..aab72d79 100644
--- a/content/english/hpc/pipelining/_index.md
+++ b/content/english/hpc/pipelining/_index.md
@@ -5,7 +5,7 @@ weight: 3
 
 When programmers hear the word *parallelism*, they mostly think about *multi-core parallelism*, the practice of explicitly splitting a computation into semi-independent *threads* that work together to solve a common problem.
 
-This type of parallelism is mainly about reducing *latency* and achieving *scalability*, but not about improving *efficiency*. You can solve a problem ten times as big with a parallel algorithm, but it would take at least ten times as much computational resources. Although parallel hardware is becoming [ever more abundant](/hpc/complexity/hardware), and parallel algorithm design is becoming an increasingly important area, for now, we will consider the use of more than one CPU core cheating.
+This type of parallelism is mainly about reducing *latency* and achieving *scalability*, but not about improving *efficiency*. You can solve a problem ten times as big with a parallel algorithm, but it would take at least ten times as many computational resources. Although parallel hardware is becoming [ever more abundant](/hpc/complexity/hardware) and parallel algorithm design is becoming an increasingly important area, for now, we will limit ourselves to considering only a single CPU core.
 
 But there are other types of parallelism, already existing inside a CPU core, that you can use *for free*.
 
diff --git a/content/english/hpc/pipelining/branchless.md b/content/english/hpc/pipelining/branchless.md
index 0f87da83..d7416f35 100644
--- a/content/english/hpc/pipelining/branchless.md
+++ b/content/english/hpc/pipelining/branchless.md
@@ -28,7 +28,7 @@ for (int i = 0; i < N; i++)
     s += (a[i] < 50) * a[i];
 ```
 
-Suddenly, the loop now takes ~7 cycles per element instead of the original ~14. Also, the performance remains constant if we change `50` to some other threshold, so it doesn't depend on the branch probability.
+The loop now takes ~7 cycles per element instead of the original ~14. Also, the performance remains constant if we change `50` to some other threshold, so it doesn't depend on the branch probability.
 
 But wait… shouldn't there still be a branch? How does `(a[i] < 50)` map to assembly?
 
@@ -182,7 +182,7 @@ int abs(int a) {
 
 **Strings.** Oversimplifying things, an `std::string` is comprised of a pointer to a null-terminated char array (also known as "C-string") allocated somewhere on the heap and one integer containing the string size.
 
-A very common value for strings is the empty string — which is also its default value. You also need to handle them somehow, and the idiomatic thing to do is to assign `nullptr` as the pointer and `0` as the string size, and then check if the pointer is null or if the size is zero at the beginning of every procedure involving strings.
+A common value for strings is the empty string, which is also the default value. Empty strings need to be handled somehow, and the idiomatic thing to do is to assign `nullptr` as the pointer and `0` as the string size, and then check if the pointer is null or if the size is zero at the beginning of every procedure involving strings.
 
 However, this requires a separate branch, which is costly unless most strings are empty. What we can do to get rid of it is to allocate a "zero C-string," which is just a zero byte allocated somewhere, and then simply point all empty strings there. Now all string operations with empty strings have to read this useless zero byte, but this is still much cheaper than a branch misprediction.
 
@@ -216,7 +216,7 @@ That there are no substantial reasons why compilers can't do this on their own,
 
 -->
 
-**Data-parallel programming.** Branchless programming is very important for [SIMD](/hpc/simd) applications, including GPU programming, because they don't have branching in the first place.
+**Data-parallel programming.** Branchless programming is very important for [SIMD](/hpc/simd) applications because they don't have branching in the first place.
 
 In our array sum example, if you remove the `volatile` type qualifier from the accumulator, the compiler becomes able to [vectorize](/hpc/simd/auto-vectorization) the loop:
 
diff --git a/content/english/hpc/pipelining/tables.md b/content/english/hpc/pipelining/tables.md
index 5f69c579..ad90c400 100644
--- a/content/english/hpc/pipelining/tables.md
+++ b/content/english/hpc/pipelining/tables.md
@@ -33,7 +33,7 @@ Some comments:
 - Because our minds are so used to the cost model where "more" means "worse," people mostly use *reciprocals* of throughput instead of throughput.
 - If a certain instruction is especially frequent, its execution unit could be duplicated to increase its throughput — possibly to even more than one, but not higher than the [decode width](/hpc/architecture/layout).
 - Some instructions have a latency of 0. This means that these instruction are used to control the scheduler and don't reach the execution stage. They still have non-zero reciprocal throughput because the [CPU front-end](/hpc/architecture/layout) still needs to process them.
-- Most instructions are pipelined, and if they have the reciprocal throughput of $n$, this usually means that their execution unit can take another instruction after $n$ cycles (and if it is below 1, this means that there are multiple execution units, all capable of taking another instruction on the next cycle). One notable exception is the [integer division](/hpc/arithmetic/division): it is either very poorly pipelined or not pipelined at all.
+- Most instructions are pipelined, and if they have the reciprocal throughput of $n$, this usually means that their execution unit can take another instruction after $n$ cycles (and if it is below 1, this means that there are multiple execution units, all capable of taking another instruction on the next cycle). One notable exception is [integer division](/hpc/arithmetic/division): it is either very poorly pipelined or not pipelined at all.
 - Some instructions have variable latency, depending on not only the size, but also the values of the operands. For memory operations (including fused ones like `add`), the latency is usually specified for the best case (an L1 cache hit).
 
 There are many more important little details, but this mental model will suffice for now.

From 59ca0451a59c0b3c81e1e542d2f8aff3588207c6 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 18 Jul 2022 01:17:15 +0300
Subject: [PATCH 132/173] four new theoretical problems

---
 content/russian/cs/programming/bayans.md | 37 ++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/content/russian/cs/programming/bayans.md b/content/russian/cs/programming/bayans.md
index 9faf6139..aee5deda 100644
--- a/content/russian/cs/programming/bayans.md
+++ b/content/russian/cs/programming/bayans.md
@@ -311,6 +311,43 @@ def query(y):
 
 Find the first substring of $s$ that is an anagram (a permutation of the characters) of the string $t$ in $O(n)$ time.
 
+## Functional graph
+
+Given a directed graph with $n < 10^5$ vertices in which every vertex has exactly one outgoing edge, answer $q < 10^5$ queries of the form "which vertex do we reach if we start at vertex $v_i$ and make $k_i < 10^{18}$ transitions" in $O(q + n)$ time.
+
+## Asynchronous Hat
+
+Seryozha and his $(n - 1)$ friends decided to play "Hat", a game in which one player has a limited time to explain as many words as possible for their partner to guess.
+
+Each player must talk to every other player once; the game is usually run like this:
+
+- player 1 explains words to player 2 for one minute,
+- player 2 explains words to player 3,
+- ...,
+- player $n$ explains words to player 1,
+- player 1 explains words to player 3,
+- player 2 explains words to player 4…
+
+…and so on, until player $(n-1)$ finishes explaining words to player $(n-2)$.
+
+With many friends, the game can take quite a while. Seryozha wants to know the minimum time it can last if pairs of participants are allowed to talk simultaneously and in any order.
+
+For a given $n \le 500$, find the minimum amount of time $k$ and a schedule that achieves it.
+
+## Random coffee
+
+The company you work at employs an unknown number of people: anywhere from one to infinity, with equal probability. To fight loneliness, every employee takes part in "random coffee": each week you meet a random person from the company to have coffee and talk about anything.
+
+You have taken part in random coffee $n$ times and talked to $k$ distinct people (some of them more than once). What is the most likely number of people working at the company?
+
+## Mafia
+
+Thirteen people are playing "Mafia": 10 civilians and 3 mafia members. The roles were dealt using a standard deck of playing cards: 10 red and 3 black cards were picked in advance and shuffled, and whoever drew a black card is mafia. All the cards are distinct and known to everyone. The game starts with a daytime vote.
+
+How can the civilians guarantee a win?
+
+<!-- Exams -->
+
 <!--
 
 ## Random permutation

From 3a84b4c98feec9ca77c64db02e2b6504dbef1242 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 18 Jul 2022 01:22:46 +0300
Subject: [PATCH 133/173] bayans page edits

---
 content/russian/cs/programming/bayans.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/content/russian/cs/programming/bayans.md b/content/russian/cs/programming/bayans.md
index aee5deda..89fbbe7f 100644
--- a/content/russian/cs/programming/bayans.md
+++ b/content/russian/cs/programming/bayans.md
@@ -4,11 +4,12 @@ weight: 100
 authors:
 - Сергей Слотин
 created: 2017-2019
+date: 2022-07-17
 ---
 
 Unless stated otherwise, the required running time is $O(n)$, and where concrete numbers are given, the time limit is 1 second.
 
-The problems are listed in the order I recalled them, that is, fairly randomly.
+The problems are listed in the order I recalled or invented them, that is, fairly randomly.
 
 ## Parrots
 
@@ -121,7 +122,7 @@ int lower_bound(int x) {
 
 ## Zero sum
 
-Given  a multiset of $n$ integers, find any subset of it whose sum is divisible by $n$.
+Given a multiset of $n$ integers, find any non-empty subset of it whose sum is divisible by $n$.
 
 ## Meta-problem
 

From fd74d6427f4c775a26b854f4246c020350aa93f3 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 18 Jul 2022 18:07:28 +0300
Subject: [PATCH 134/173] two new meta-problems

---
 content/russian/cs/programming/bayans.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/content/russian/cs/programming/bayans.md b/content/russian/cs/programming/bayans.md
index 89fbbe7f..d7b42267 100644
--- a/content/russian/cs/programming/bayans.md
+++ b/content/russian/cs/programming/bayans.md
@@ -128,6 +128,18 @@ int lower_bound(int x) {
 
 The input is an arbitrary string, from which a yes/no answer is generated in a way known only to the authors. There are 100 tests and you have 20 attempts. As feedback, you get the verdict on every test. There are only two verdicts: OK (the answer matched) and WA. Attempts to divide by zero, allocate a terabyte of memory, and the like also count as WA. "Solve" the problem.
 
+## Meta-problem 2
+
+The statement is the same as in "Meta-problem", but only the number of passed tests is reported.
+
+100 tests, 70 attempts.
+
+## Meta-problem 3
+
+The statement is the same as in "Meta-problem", but only the index of the first failed test is reported.
+
+10 tests, 100 attempts.
+
 ## Thread
 
 $n$ nails of radius $r$ are hammered into a flat board so that the corresponding points in the plane form the vertices of a convex polygon. A thread is stretched around the nails so that it wraps around each nail along its circular edge. Find the length of the thread, that is, the perimeter of this polygon with the rounding taken into account.

From 5ea51266b9d6ce04c0ed4707feef4ffdec3954c4 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 18 Jul 2022 23:55:10 +0300
Subject: [PATCH 135/173] change eytzinger figure

---
 .../hpc/data-structures/img/eytzinger.png     | Bin 28730 -> 31441 bytes
 .../hpc/data-structures/img/eytzinger_old.png | Bin 0 -> 28730 bytes
 .../hpc/data-structures/img/src/eytzinger.svg | 454 ++++++++++++++++++
 3 files changed, 454 insertions(+)
 create mode 100644 content/english/hpc/data-structures/img/eytzinger_old.png
 create mode 100644 content/english/hpc/data-structures/img/src/eytzinger.svg

diff --git a/content/english/hpc/data-structures/img/eytzinger.png b/content/english/hpc/data-structures/img/eytzinger.png
index 97237c734cf59c1deda7cc16192b6f432c3afaf0..901efdd2f52e196b958cbd3082f4ca38fcb9810b 100644
GIT binary patch
literal 31441
zcmbrlXH-*N&<2`>CLmxyq=ST}G-=YC8UzGHN&x9fQ+k!&ReG0Rr4u>`gdRjdx*(k(
zkuDwS{hs)K-*?x#fA0@g;Bd}9GqY!&d1huO?8S3MG7@?c5C}x3tOVBtfgpMy5IFV*
zA#la{A#)1wLF}mX(gg&frosJyf5|#80vG9A74%%S94uYkUpZTV+}+*z-`d-_n7wke
z;CFDgO5c{I2Z8Q_l;MxGJu)`u+<iVFCeVBHo}>3Ze;^;tCk}_fitmLRRln|xHZp0p
z>G(C#R#Cy<)YMebTHd<Dooo684vz&rBUO+jW@l&J&p3{eA_fx^r&_I~stz6Rhdqz)
zZRIlFSsL)&cBv@l?xr=16yy{)o&w{3>CUHDsRBM|%^;S<Nw`l{G;fCx?k3_C4!wUj
zgKP2o|2<@yWJ-s727iE(7x$SG@G{5(_b`>f|DQz)LYM-8#q79XY5||oAX6e!@4zI;
zG-~qbF^dj}G~gv#+cX2oPt0R!woaunUS%fGO{;|mi^t1gcjT@Lc}y=)xOOD3659>0
z<uQG@=8&V1SfO)ciMxeFx(Mm=_~c<^2@OuC6m~AyT%ZlX)BENt0si<NodK-!V)cF!
zuUQ3%PZ&K|E+qJWYhul?{D@2Lq4prb!)TlV67u2s1dhq*Q<AqKAR0&rUaH`i^M|_+
zu%16OZ(*shLWoRimml4<Xq-)ypTV3nh^}`(k2_+?F_xuw`5@Z_%tVK;g1=@5HX|NV
zI_h?hN#V?>N(qhQ#XiC=Y@u23erY&uyKp<Qu8G}+EZ%-zJ-FNbt^414l(2Zd-K4L9
zir`_81o65Q1VfAgb61Up1S1Yt@!cF$`>Ju96CV)z<@Sm8=j(EGIciWN?1<knem1?X
zJ0w{9Fyz)(fx@AGd-#|W7DXV7M<_>P$mlF>0uBi@xl!sn)i2u4P`gzY-z?buUI~ux
z@$%K~<@;>IK%UE8W)Hg9#SZ_x!-hz~*?%e$Pd-G)B|~&Wn3OrrA;W<uKL}3TMW7<x
zW3fgsk?xe((C?tY9m2;m__2D{23`FY=)CRU97svdG1JJBC;#^0R%j^Kt{vy5C{xS!
zD<bU=l%~2rQmckSjFD&FQ-X9_G*cXp0srf{_hHUKbzzC*E~&>&_-A5A^{G8tE!`;+
zz)#szLi0TZnweZv@r-q*z~g8qw81TnhH#4zID;0M0dBXyNsWZqPfo-wkzk+7vy!1i
zH}xV*NNouKb5W|oyF~+ilLIEj$+sRrI3YY>mNx`9@g;kvw!vuedk3GRkaxPKG6e7Q
z$!vlm1q)@Nar}_;OngdL=R<n?gt1S2hY>3Eubl`5?$G1)fFAt2caR@SZ>IX1DerC(
z;gsopN6iDatPmn*BJ)5<Pj@wgnP$*Gf70uu589?S4YmSZhsj1){UP;^eeic0ZG4xR
z@h<G>xHF}^iuiS;An{U1n>Uf3_JJA}fmxPw+H&C#kU6WnArKP&`m^BdeBCT$jp{{0
zcRu@UFc5^qFxXbWN;BcqN-)ac9EP%Szs4V$-jtZTu^riY#lUjZwd~ZL!dHd!7Rn;`
zE;)z**@@(tIN4ON9H;K1jn=M^?v%l4oZ*!-qHlROzISG+*;9~)7{2)9y~mCIX|dS3
zwJ`QU@P5ZX$IRfQzP^FJE$Rz>fZmF;roZ%3!FrfJmcIm>H&JwLd5U#w|Idw2u_Vd_
zzZG-JMpu(TCJ>vv>`KTl@JC1k(HNJ!S+XfXA{Tu^{6EL|cE99A5tO96f9YxnDMZ_$
zn#qD+Y+lvL=Bz|2EB_0dg|gr%9@XlZi!yrfQ;36tTy({O3<a&)rgtq&zS@Sr*rYS%
zles%?Cl<;;dh{>M3etVA#IU2!vRQ(q<aSjs&nBcXnIV_2*Ie#XE6;?}rU`N`Pyh3&
zA>M#g@WY9!(ZxF`8NGBV!iIHi{)s$S?k-uL?!rjrxBp2nKz;9$Bd6P$TgLu<GaS|U
z&erRtfLymMPZzUx)*IDlHUDf3N3fAl^#roOGm&#n1W&<9Rm*A`DpMkR)_2UZO;AVA
z9K{6b&5))z)McMoRq5y_%lZ5nwNKblt|`;_Vb!O<oBgI~2<_Q^U^TzQg+5}7|FK#9
z*MG(@ivbyv=6u$gK6qY`?leBv!SUq5!ZYDdzjdnh@tVK=m=`xKIB(w5xH(9x^*|4Y
z(Zu=4c@Kf}R1iN^J$^+OhkAH(siXK05?#y7FN0bBeCZXIcBQMBb|36AV30!N|NLYi
z#A-G8_|>6$`Ggw!7wTKdg?&e&Z5d>J(Rl3`uOjkub^*GSKjvH;LDT)?li=Zh4(>rk
zAc~<wrw))gPHym{%gu>np_jJ|7Y_YLR=?GF9qQ&xJ@oYF-J{ns3AJt+G86;gJT-I^
zgdt#zFfe@Povy}H(es*h7c4-NtZGNWNa7k<Jx72h$I^7dP27fQ(}n*>z6DjdB%}dk
zwOGN}0=}3IzJ3=l(TF8*1G!2w615N@1>lHwVku^`bx1S(O{Aa`48{jvGBy=;l-+*X
zGN*kAM`W&d){%BJbB^6>*tWo`$^SM<GNjXTsmGZNOcnPOWd^S7%zVAiC!mjLpW5^c
zG8mLRLSVnaiS(mhqCvZpoe-9@QtPr0*r*b*e=ZylswBBBge{jRG9`2@MYgpiX20F;
z_@uffa;L?78vBKDit0k*md2C{g6jiQf}mU2KhOk)evd=DVu=mA-jfFS8K_i%wB4d8
zI#oWKil1mv-Li3;0EGk&fy3Ge<{yp)M3w~qdlzl!_gm0ZP)kJVr3C$20a6`lZN4G5
zRlM%X@^SS^?~7foZ^Bq>UdR#QHIe+>GmHpU8R`jZdTbS00>&lm!aHztE~qD23h01n
z8?UK!-506fQodEPJ}G<CfYgQ@-<NwWzy`}>dh4Veb!NC)x}J$*qnPiq8}&DVzm|D0
zlj8L+#-|BkqYCgWkRK$vBeNRw1fkf|L~Mv?-+xu9nc~U3UDiQe9b#Cni5*6_<+OL|
zf5<}^MHeAA0-z}Tzqd?-LS0ql^}5I2*7wfQnpw-k_3`{66(FI&Ymiv+dZp7^sh|L%
zcrC%nae%vMg{1e7D^f$kE7e_mPVS6+<UTTVgApre%IAn#aaW)>6Rn{J67e(%wD(Tj
zM!>wI1x+^|US3@pL>(|{-3x>wIApoAg*1>ED;M2w`<lALXn&UK;FhUq86JPsFFtVx
z9w!;H&6v{1*CtRpYI$)0VE?5PzFTpsrneoreh{XFmRcNo1}(k;Wo&e~$=Xida>f5X
zg2Kx%n5@c#VVFxP0+Z1ugjqx%+6?Y{zd$70JyD%=lF&E7Lb1;fhYi0cg*cc0F{i}P
z&=4u671<ny*a)eW6*>9^QcIEFQsm-BSkn(8V6UL<8>4vcq-HLdumOUR7?Z%%oSy-j
z$9(vlJkmtcmB><SI&~^62lfW$<tQ7X?r#13Xv)|t9HPL(z6YX%(L<=(gq?3pP^~-G
zrs>c)9h8b&XFwqHCicnEoorHufa!J7<rq9!t?_wCI2)O}$Dtcq8+#9t$pWA49&^LN
z-MS2PnhHgu8*fVOD7PDmte|*UWp@|0705XfBZ6r|x}kH^CKkg#@9YP*<DH^!AedN&
zuYf#@bI>sZ*eDH(OdQ;x_D2J2j4-0IG<4H&J=}Si19i+o%Ca~sb-(We5KXm}4VD?m
z5a{CQ{!IoCQ-O?lLRJ<{6~~PBmUqB#W6A1I=f-sZi4(Q<tDrQnKakguuap7d_O3Xv
z;1;wOY}~U2^~Mu_f{6GU)tB-O$959_=6;ZR0JaCEL9!Q~@n77sHoed7W%;`k-G#nw
zXmLXD`>BQMYur94zl6nuGYBx`U+_4=lcnoLp9yu;;bQGE;3RNQ@%k}qh`H)lJq~v>
z)jlXx9sV3-2~r1jp$X8YDixH9i0p6Uz1BB<?!$`SGsX)Nn*sLcEP(G=o`c;L4YW)}
zei$IKzmQFa6(O@h!2twlQkZjPsM7aWbX4x|!hX;uP{PdxDw4+iP-p690nOSELl@6g
zBjNcR+l4`wj=ej@P1;riDRvgXvW3s7;$%@WIz5ZXq-zG0r@_#Ka2Ql(dzBbN7c&p(
zL9Xw7CbUq^z(IaV0z99}k*Xpo5&H~c2oT{XWCPkNaDJjI65fuvS8v>1(mvrtEqop}
z@v`ZTZa)%SvGg@t+s(Q-Fx3tQ3jtSRqYow25cA{`>}C*)bWD_>TTrOTe^C`es!sS3
zx&A20a9jp?2CWjSf_P}b?McO8W<eG_dfh4cIweWtcJkh)lG}Relbw<M_-y({*J`FL
zSPQHzf{$E`+w8T~zZ=<yh|v#FQ6mhFb!G3Wm$==E*GaH<u@YEwtT6N~gS~e5w;`O5
zjD}6rq`hy>Y?r7=LR7xtD^)7)NJW=e6K@ZF9YHgr*`d=9$8sXW;&6z{P6nN3((SkQ
z7oDhSGC+_%CL3Un3}p;482lTkrG3~M`$VWv2ge1eB%y!!V3pwc-=eZ|#0M|NM*x)`
zAkPS^XsSp&c)ZMl1>K^{fk1f724e~|2&i>(IQa#7TDqR`0#EoeRUA2b7QKzuLU$qc
z=>bv4G60U3@}4U04oZ_W8BFF>S5~1he*ZI}Wqf$32%1B;2tpMAMK(!eiX%(ta3D;~
z28#+%aq2~oVw~>hAr2#eb(ML%1P<!ZGH6Du5+V}{tgAhYGXW+nlqlolIomWi1&py?
zzN4z9pz=<d^b?q?be#zB(gB@mPQXhI6;$|DJ1(6fvDPmSYnS?LBi61TbOt=eGC>=8
zmtrHQxp6OcVud|{n@jJ?VUUUAH9~?K<2wO?sP~r+u)pvxqwRvqJvcrK;ebi~-o7af
zZkXA1$1)=ZDIN+W2*L9OccmS9`2*!N5R>m25(U$7q=)JT^bZ6B%7|#beR9(=9Vpp=
zd*gG{=N+Q_E=U+ZBZWJG;EcjS!#DW<2ju|^trI}$!c<lw0;sgzY`?S8;Y~3EI`f4o
zz5CL4<bbLnT?`CFdgXHHa%yNFQe#G*F$Ac?mllAnB&-53e?XKnP8HM9;tsJ32tyy%
zJDYxGhs}a_Ak7QW`b~O>egGN9tZ=El1Pa~DtqXMy<Z`s?%Dl1psyC4skQV>rLVVeN
z;ITsfOO!_H8Lr4xt;?+|^TRjB6Drp3e`ZIAVJAiy5k-Qw@yNlPp-K&oc|wPYfWL7*
z%5|4<!}C&*OT&u_Bt$tA+5||O-ia+meqjBe{|J7Bzui@SvkbCL#H=xX2<WQ%5We(Z
z)-ueEHVYrxACjlIOZ4WW!N9#pKVFcfLc*hjuYjRAqv81yj?#`*MCa!}YM#oi_A#nz
z(vr4RAmzG+hvW_(-~bkg-5VaaQIpnWsZDVBN07ts_zY6*!+EE5D+)~bv_Rmab#j4u
zZi+i60_0*n^IGGGz0kq;J%D}-@^C`1E-_<XY~r$Klhyu8jr;z}!QfD#d*0I}bMqg(
zGE4!l=dJA6zh>mZEbmb7J$VGqE;s!1fdBp;2)~d=nq0aZ+2s`3EbPeUNS=dk512)`
zIklXLaAv{trR+^~{wL?*y=kC!{7jVRvqt_yQuyb5kpgL@`eOHi^;qb;moG)0tK<c)
zRu`5<x>;BZ7`11mDk!M(-}}KrLK0dWE?@HX98HBho7XcF?#1y92s;eVF?o#q8P!Oq
zL0!DCl&h)7#)Qajy>I{Gf!l_MUoT~rLCchhrsjettlww6IZ*J?FIMVHGpC1+60sjj
zE!+~2LnKHFv#&6q<B%M{77!B<htz-3U#|}q^Kkr3LDu}xb}f7QbFjP5Y&~F&r}dx9
zgKC6i4eG*al6Zf&TJ*aHtJM3waKF{WWNFj`q!3k5?gOklv0=t<Pv_@7(av_06|Q!A
zTZuvR(>Z++DoKvV84_2UxiEFpw67nKXNX8AF+%<hgsB6GPdHDb!2N)#C}nzBJR#;o
z>9i=d_%p<+L+3E5V+EIUcIC`#rTrDJ#iQB&&*l1dTAGN&<${6>yABBR%PO>qfzAqe
z2b^5Gjsc~Ku>!O`g*NPa1#`>xljBPlXWE}d6jCDgjLB_%+H?uYJ!-X9=O5y3tXpf$
zHI|2PDN=lwAw@C0A-aexZM*W@)nG`FhA;VGPE-D6a+Bcfw`9^pj5Fo0p7^U~%l>**
zuHVJcHqIdG__MP8uZM9~xKdoDR~xeJcJh#_A<hmr;pX}1c80<N3+Z1)G44-8$cg`m
zRiaYjw2IazZ6<E)vTj$+TJK2ShEj-=qlg+D%#9EBv>t7$2o~~(CNX)Ey}{2YLJm7D
zH#U31c1o<zd752+S7gc9xHbA4s0|ETEsd32VGcKBFmtiVb{h=$srUgbE8&1P!m)8%
z3=c<--RH(5CXon4pLXNB30~$jIhMd3YI5CiYJ`m3nc}jaVZp{Sz>J<C_0w`<aaIQy
z7F~syy)mZhH2o%20!`&v<fO0PNppU<A8k7)`B`u!Inin2%iESZTROB9x>pBQu#K~`
z{;fFD0RNkBX0+w#1>|SUEkEArXCuAis9;~QaAZL?*YYVS^k%nR%RGf1Q6zX<bG(k2
zVj_hd29rQ<plWZ<gN9y==dK-pB=*5b3JYH*%)Q$`6MIYfEV7L|d3i;M^8p3WT~b2K
z-FhZqnGR%h9KJyzmB%)t?;p6&&+$B{yH%)2!;7uK7~*$xGa``;@O%pUnDISSu;Vou
zrVQEtHF7+~90vP?{zrMRZ*mg3{wnlGMm{wmqhC6-9lJM13$93g7|}kEPP7YiY|6}u
zy#O%Bk|9*dfr8WBdaA0UA(+r}tET2FdCAwPS}T`hOe1!P__iS*Pe-e8`_T4Np+f7C
z!5y;z>a5Vo@B8|`5^rJ~Y?OF()3vZ=Z-08c+j|{ggE$NXFnSzkGkVH6-oP-rtjV49
zN5SUuKMP-ld?7YT?Br5kNl4_OPS*1!xll$FePm3-aSg`k*^N(E-Tci0xpuQvPiEjH
zJHqA*@XOCRcPZz{ch#l@Q<2lJhP%fw4@sov=Sy`*Idnl;H-?Wl6w?P(58(iN({l#w
z_7<FPf+^9${WmqM$EJA`HvQ*M7SV91l1I9nS{<NrAt5}!6di?^E;_Cv<whcv!Cf~r
zuvFRV?cHY0fL*Ct6fCT`6=%tOQD>%;<nK^v{3=ANz@Ci3**%6iYiHZ9g?rOObsV>l
zr9D2q^h8mN2PJ%N<wT;u*Zb)02qdsyf&KK?M}~xqii6LC^Y5lQ`k$+iN+p#y6>D(;
zyPnI8p_O^@i=7qc%pXH-xg1w^C(5T=U)EN2mL!21-t9y`#>4gja&S(EDVeD+jb^5c
zjF`8Gpc+M>_B$g_e7+O%lSwhxTKt|ju~+@u{DRfYwZT5{p}5th*Bxq_T2ac+o?ent
zVs9VE?xazC?|t>6)>5)02m2V>sD_BZv%#c{xg=Glx%@nuojfyqWRiGMV<+b?)4%^l
zJEGCwd}HFEe`w;*#)J_+>{#EfYzfC44On3dVHp#p8f;9XEuN~Ag!|zUr?Wu>EiLB6
z5h~9_PmR-Vgfv)p&B=JzXuW%<Dwr(QV5XRwl0AP@9CD(5H;O(X2>)r_z`#Mw+n<FK
zx^77hQ}WGDi}k0gSCv{^-3Nba2j#fXr?Ylk$tx@M_~c!P0x{)p04u3=d*U<BZ}o2N
zmX`BjOpCPF^AvRF{^~VrBP6u=??^cH=FwJzPfKp!Py6~_;;osuWlxgGoB5B-G-E>#
zbf&hx9Bs8m;_p<dMpU=QZOtaI=AuhF;O0T*IIZ08qLxo)(q8I)w`gqqbRr8K=W8qW
zf_*B{EbLR(Dyp%5vRQgL`19v9ki_<|0c)(o-6?XCH?UKpc96HcYVRYhW&0dHLGQ|Z
zWG_WL{g}Q8BbsGYmj$SPO8IOh!m@B)OsaX}fpIH?*h*HGOrzHXFM7_>7f2a5Z6_jm
zp>2c1tLrao@7?2J%k7(~5i&Ncd9Nbn9xZW#=j?pqy4<%S?{DfnKt`Y7_0hsJi`;sG
zw{SVX`|Vpp-C*Ce%8M(hi13I#C!+eEv?;Rw-dCGQ1Ydw3T3zs=B?MPQ82k>~Y_h7g
zG%lkduUMZ+-1)mt_v6cbUAI4v3mXJYHqtZInr>J<PbWmVqvdYj{x+3Y+tc05%m)Ai
zsWh!f@k`;i*sgler_7}8RvB;mLy3MB70)+UK7HDI@U`h;==sXhx3L{rKK)N}7$%jf
z_;7#`Yd?a|fYsdJV8oB=118^17xe8MBgaA@9BnKv*;t+7J{H2!9Kri=I6Z{v3}2wf
zW?`>Ix-Zwo0MS^ZlB_>pr&lq$q4ifsvAk$Aty~Z>%bcT__ta@;t$(~!%=7aW(+^U=
z^D<ugf`U`HKaJS{XA_?BF#f<E8B#haS@HBw`}Etl853>316%cj1Nq8cO%%4irvvOb
zvaVfzUp^u-66N+{{am{C4Yc9n%%*GdHtlqUk*TkUx;l~46-h}qyD@2iPU@cA=uRSs
zik`lH_(DB6qjv6{2zl|5k;*ascy7~b>bDw|_rQ9@x88t%5Zg>R_D0gdwO?xKbT7nH
z`nc`9?bof^Ui+1KmtQ5xCMk;_h><k>i5$Tawp|J1#1I<ZyK3)4x!bsD>>G?s?Zi~n
zk3+BcIXwaBSA{G)Orh0nChhmUNem~;8N}=zboV2bV?{#CwI8eb)LD`x<cU?}M^yPg
ze{MPp0tvxyh0bd>mrw(ZFV*o0NLd6N;vg=gPoXG!K6B}^>O4M>N_{R86l<bW_@BWh
z9{H?zSWpaSo5X!fEFshkDyX@r5klNpnJ}ZG1sWiQ)~BFSynAOaD_lB1X0Wu1yeRCe
z8P(*Zxm|_I2gD|k7V(#5s;NOW*-Xlx87>Yf_EY2wnrW(lUz7^t+gNKT6Fcx3JcUz(
zCCx3I^h$LCkE**{!)ow(dT|PluYfM}wQuVMU)I!m<~q!)wH2b<Y4{d0{`}-=d$*b)
zVfd?nw3>OYIzRt?uF^L=zlai6GtHkq0TSS$y+Iivf4{EqFCbmlKa~4WiA3tBu(Er6
zhyTonn<XG~kSa29L71;!QAcd(mrPR5j{Qm0if?XzO5Agy_O7e%8i=YM4sQhHjYL-L
z!Y2kIwSo}0!*;)Y{Xi{NWA4Fh<+e%thDDf^w4Vx<8YpV}8aFL7_+9$^ic@Z$!dYD#
z;VdL1ni^|^H%0ST&95u5gT>5#Am#Q_>GM?DB8e-^LKk&1%TERIR^Ri#@g#w#ev2g(
zazry#VX1%*+RmY<3krS#<kfl({dZJzrOc##eD$7ypj@`l+=<XUD^KZ!ZxTbZH8pct
zNAwSYLT(e@2YiQ?-voL}|9%f5pe`M?KpIY!XkmRLjp|4OHL)9L1~EZkG?Y?)ejD4m
z+;1}{Z|o%4%B5;L_T_D)No!iD{h6^uds$?^v&$)8+I!+3{5(7>B!6xC`e@s|>dtmc
zkc2Kr(Sjvbyl6!p9sHR!f87f6r>|JsxZlDyr&}bNAKKBsVJT=EzSgNYpl;I{FK>?!
zCbj~)1k+8GGI6KFy!vCOXZCZ}t-^JxiK}-WtW0-Byv2IDI*f&8MBe-x9`RDo@FFDN
z21rwa*&0VPJMCHv=a3dgGp+9Y8se<)3#-YzZ?B6gjPPcCjg8Yjt0snWCleT^fekfL
zUY&##9(=pqY6CzCpHlt;afryyHqvINnkWk^FJEhq@XZEtm7w3D^-8&k`pYlUc&PDG
zw%ISg<3ao!(sgg*ykMtjB0(`r)o@M3ysqzFQJsy2^CN*oa!VTOy}fX1>f{Rp3rnF~
zwRHKJip|un2<TJ*hn{jTP<<ee-^>4zaakE#50QHK1`RDY&@EF+q@h!zVwE`qZ=<56
zybiEYM(@H-MYpfZ4SSDIPZvsME07!j{;IBfYogm7eznNcdz*Izsf80<n-o1a_v*N6
zytJ5;{AE<HN^Pt)o){=<4`D@yq2&5fn$+M~x5anTJA&yjW$#uwhYi}ZhIPtcZ=W4r
z7;8R$^pcKjbo5PfYA@DuP&^+|DnA6*GTjNII-GrI^UP;07uo;Xek|oS7wmzTW8YZo
z(d(Yg&6F80-Ul&tiQ&mv)1UFSAgH$mH3>z}m&fP*bw6ZkA12E)hz0(mYqx;0m+(8g
znJ*pRI)XXAKMs84ZNK+=C_|JCYmaCZ-1GoCeenDk^w_1d^Jq6YV@eAZ<B9X3_0jp2
z_iba%ajnki9@|u)KrSTIc5D0kGl7(3DUN{jD|ykg{%0z<)MtVj5%;ISHR(wZnV)xm
ztd{EZ?nlssFg;HU_szae7kr{%&g7^zwU>g#)fZ%;-y6}5@=HxGYY)yZmZWBeb(hDo
z{U?BuDEG(BlgIEgo%7)C6lZr2ft_4B;frY}X<4p2x`P?LKyZE2Z8b(Dh*60Eoy6LQ
zuu*w<7XB6S0A6&s{K%f%<<f)u$`!4PVi#vT*%gNO(U<66c}1)gRDvJ#J+g#B+bxrV
z>y82}ZqfftnRLP9hO3=)yN6)R>N2>6ERtA)QHu=6-hx68Ky3s=c*YPy<a$Wk`diPg
zF2=u|6iuAU#sMrZs{+nfh7_>=Xf@;+i2%MHxCLNT;h}@>X2{RTPvN}y|9ci-p2tws
zZA)LmvhOzEUP9i}9UuUgHvjw_51vN9v}Pzy$ZEyX*Al>-*#)3eP}*{25z?A90NGaV
zI5NC3(JM>EwZ0ImTXR)!3f5g(L82*0*tF|-=~r)?x?^3il2{q&TV~6$?lA)783lGY
zyHcNntx)yzy#SX-X1EAo6t)Q8>C|K)*BPG*|C{<A=MK-{>_%|=PQDHr$a(R)Z@Qd1
zA{(DSv(29ATs33unli*?09pGevERj|5hg}sn){U%*`$0I*#r^CM^T6~y7yM8e+_<E
zxZOx_vJt|g<A}~zZYujC^675Y%P*m9{udf5KK><@UOC7iQjS}-gJr70f*P<<LihW-
zAk3+GQSthywciRnaFsqEC@%qyErxs&0)O9kakxPz;fC}}&whYDM}PlIU0?pqi`)Le
zOFGAo$hORY7f~fY;Y<41U-hY}XMICb-sj6p?D_MDZNmk^mb(YJ3LXN&P0ooe4Fc(M
zfSR=O@Q;v6(3WHRH;3Xs;9l71B-Lq5<&2fyb8@q@gU>+U!YQm)HBr#h6C>_jZ*n^<
z$C1miNA#N{$b)q4o9I<m_n0osIfv^n5?L=^_k_q3<?f(wY5uLeeOR}A0b}B~w|bk}
z6LsoUZ*-^38z>qM(?V}J((Qs?fHDe^ER(sJoajtQvbjf=yG{tXvr&V+tId=^thiY5
z-;0GV%tNbaIe^uxn<8<o1U0_bc2QEzwsKe*%nsuKt8I7wJ?QBT_eLXq3(K>-=68qu
zYh(9gf70~%W@is)x~%N}`n!DN6Sq0&3Bv8MX=ciMAWPsH^|w9f#hFv^Gj45M5kjX4
zP&DEqYH$jh@n_E+=Cj!J8wxe}j6?nPo0Djgd4GT4P`TjX;K*7PZM>p8Q9QLg((g_U
z<AHlI$C5JUA%$&bvjj_2;p8CME~D4iKZ4s&e#mgKb+U{DJr%DI@&kv_d>y;W2)KvK
zAI&VImXzvJ9o83xyEk|5?UD(Qx)arwi%be8nWlOv!_V-p!A*GTkYuE_X|tY!`CZ3E
zbTtW?5ZrZmWzb@0W$+I`Sa-UkdV9a;D0oR?j||HAc+7^*4hQb>nmYsSstTZ8#m9FD
z!uQkhVDxwjR6CLTJ{ZzdNvt{I(&2?2Ljnn$o#_O+cJn}Id|rW_Ok4wN{Q(S`qZd7(
zu$>;AE%J0ZGq?KLtihHBUv-x-NhIj%8sO$$2nm98SLk=WA1|$&I_?U7%#ww-{lGtu
z3)TQ?z{}G<oW{SJ;RNBdP=pVmIe26pjlWoGx13p%M1sio;N-gqgUPj3O!L%^WBS&0
z9#U(OcJY>B%``m<dJJ)RaI+f(bEe+C>-YhAWk)(c4;4pX1wLe!xVrl#t-gJypC-C1
zJ)r<wN=~$MUTq_?@c~+U_`fvAa|dVaZw!FwK{5z8J8XMS7=uccYAY*@ETA2{iYFYn
zh;Y*nGFu}#fOz2z2HIjGR`oH=Ri`Kwh_IM{b(D)CQ=nKt2Ws$YmA%H_EMDVaXdxB-
zI;S0JU%YZEftqn?4P-`%@h!Xfkl7b_u=2k>BS_JoE0BXq5O_dQS_p@*KQuzbK&4Sa
z1J5VjmE7*y;WOk|T9SGp0EfjBNE2p|UEnj8B3*FJ;N8&gKT)TS#YjI&&6|+O9;5Kh
zJmh|7Xs4|dkBGHqRw!fO?2~Gq3hwh8Div2s2o>!^Ml2-f!LF~XOC@kZCv=lA67uT?
z4^a3SS6oxsQ*UlpEeOm41A~MbI6I$y;w6(I;^?yhvd><Wx?IyARB`uD56syJXfG$c
z7(wK<<8h!LvW5gB{fJs_X7IwiLzrT4eR<Vq;S*kGv?%Hmh<p0J=wu|aQrk^h;8w$5
z_sY<gTXT*!s=kh#RYcn#W6T>>{g__hUvQ*>(%7wtlawW3H5B#~BVd&92|056JHx&r
zvgsBri8N^ji4XN$zolwl_#Lt~hVPx!cb}7toSxj>e)>u4F?r8efSt40eYRrc=1NTN
zVNExvG>{76W{85)QL(`O(qOWRs1S#TRf{FOP2tj14dSECP5X*!D1ZXxzh7$}nhXdD
zEJ1H484_?cJ=j366<yc*zT~71Q&PuLnnEf>4K;z*GkQ>7(U9^@d`h`#s-!yn3~McZ
z$V!fw9()QLWzt>AWjUIk$)|;$-V&#!y*0ehRvlU~<{yzBELFT71T=9FK%-r&DD3+m
zlnZ|I!t;!F^cA)z26;ue`N7-WUv{@K^a}-Uvcm;a#0vqN#XoEe3a=s4#r6RQ0-VJy
zc>r0*lt*$?ju(#%2frdLt|c(V>oIH1Yc@{lf&N6%TH_8lub2sA_LC)hLwuk)p2!-a
z%?I_Q76(r~L2!L!iaBKisu8#7P*=jXZydS~c)`4l&v@Oo@uETW;6?aSMr@~0VSNSI
z3o<$RVTpU)+Eiw@D$!2qIS_}E9QujAB->M=BS^0@gat(m(hvV8#1AivGL8$*&b9<(
z@9H&)j6A1@^I7B)8T4A?(wJ$aH-zgWWArrc48b}+RJn6b#zlA=JD#9@xB;~Pmvh~n
zWJwt#n(;O~N@Xn9Ay7|ZTH?(jq{&C7;rd%~wwOC%LYR;4QaLhVz{w?LuMy-X%Wq_j
ze^_aseJ_GiW4sRSgh~u(4N!*Fc%wAvYTaN$_{oN?dXw=yAcRqf_s?nKhP^p<D4@17
zpV#(8sk}F)@Q498OlGfbX_K4mWCE%QhydFCzn_{9umBSxAEM0wgUgq<ot5e+FJ(9(
ztxgOF$-rnb448~3e90Z}>{+;aL7|JAbaBWbd|BKxdZe**v&k0mV+5#u(H&%IFLeO&
z3`Mpxjm~~T-lP4RN00Ll{CHnFs+qu3$X-sph7o0Ltprbl4C5*OCdbHXAd)`-L!i4b
zK1c85zky7MI`*a5*PJj1P!A{=;id_k+rmlD35278q4S}1RRf1E(1gE(HQsgTVWO8&
zhQA{P+y<nZQ%mirz|F3!F$rR*sv%SE1mai(z9fS6&S7wjMrJ`}y!*#AM!a4&{1&BL
z1(}3?CkCdt>%!1!sJrK`|49Gz{YQvZPz>maqx*hZEO4BOO0u-iST6MY5bA4{Q8mLI
zRVDb63nsHds&svzYLQse)w_!~=f+z4<5Cy5cpxZVD3sRt2dY&3>Oj^Mm93rHH{)f4
zC&4?4$6sj^Zo!->t5|j&ksP`&H29ZAO}(*y9rFZ%nG!G^><nBDC`7BFFRibi6t=t>
znQC@ao~T3%A|+)o#i~fExR2<Y7}nlatwa3KwTraKS?duMgg=EuzE{6amFm|prL<5C
zVM~ORij5qibvVBC>=A+uDM>F*um<wsfw4U%@6T^z2;fK8m82O4rkkJ><*6{%NyK4*
z(C;bZbkGs#4zxJd+&cu}AExAw<-@SFod+`Hy7@*DM~2!`sP<%viRa6<q*BGX2X+R6
zNBZ|UJAim(S3Ow!q>O%q_C*H*qfZB<BtgT24HpbYr~7|{u-gZBW5am*WB>Ld#u8D&
zBrN1ov1h>4KH(PN%2WX>XeXV-xldZ-)M!>zzPGr5pgzT5N=n#wFv`nTy>$Is)uQrk
z#3h->O<GQv;s+*mU-RS}YK96941Ofy&<1ILpD)G{g8sM4KOC`Fm@d#E&z^s0O_fxI
zk2+OKWjZm;Zj6UFhKdkbCOoFz*q{Xb$cYUmg+<n6hnDuqev9X+eTY1R00U@34;A=!
z9deA5Fc~ZbZWwD=B9TW-VxMcSkzbg6|2-xJY}3MH_;Z3R&}aaVNB#hK<`x?23cAAE
z3~Qb}I9MBz>K+RX{Q|axL<C+Tn^>1(A4B<O1qi0d*XUZ=9VpSjKIOpKRp3{IEdk!I
z7;JkJr*)7`w6q_aTYPQ{uzGQ&2tIV9upeGU9R4rzT^V5R*oz!S0K0?^RpP=vG^GF%
zlEw&!wV5@aSMawyw4`Z1^6J4-tZIA&Gq)_%Lj@5-A(dr*7sj&Cw|JmC{`OUm77G9!
z#P5NIxWPx}8<B+3bxVEtj{x$0MD#2$2ENn;sG`i<WVN#+Pwc4n{Wp6;DTsw<HddfM
zC@~10&$Hc9ekJdqo-YL(d|`L<-isTzPmr~NhK8>X8j<{E3MLRkQVba{__^kIWzdOC
zB*5GxAr6q>0Nnr=?YiPSTdo|B_B*v7tnJ;3(?AQOw~$^2EsK>Z@Ffi(wV;rGdR;C}
zulLz*C7BigBVe+O?XZmX(sz8<Uj3YE7fh0_yJ;e}N(mpy!1wL}=anzLoN4&}>eAfN
zDSPb&tMwhH30$>tZkV@3Ot+(r+DhEr*HgaXPRBZ7c`>Bt)kw)oLb>GK(^tqlm3yH{
z9{{3RHLZvsr8=?UXkGigUK_ZK9$NPPY3uhbghS6QBXl!D%51|lK$b~reHV;pKJ{Fx
zT5&jQ$=u4}a8<X=s9&tsQt;i3$gfP1-t1b}k`Ms1WF3={e!2?nc!85niW$-*hDO<9
z@oC?}xBJheUIZ9Cyetb}{U9VakksWZ9k3XP%gl#|s4F42Poaij>GwRZ1gX<moXm)I
zm^fgq_}%vyQ&O`ktjHKRSGU{c@S5FYO6Dr9hL*CkeY;5)du(2^!%&0ky4uDZ#b5rQ
zXvI|!^N}T5X4a43zzq6=5zC2atsFgmoX)ZmLn+QIyPJvJrdGn<SAq*UmijOS@TRlu
z;oijtV|Hu}#VmxG8~S`_6*@~Xdo!P_kaNkaR~^*+%O{~o_WWkflx{27z5hZkFu`#=
zPCb;l$5i!Ru6spWSaaio0v3V|k6r2MJY36YrYHV$hy-e6W`-EKViD#BFW4VZxrKhp
zGD$dmhNLCG>Y7=`d!1@E`+=1EAIeGD_UkRNA^!#9Od*FXtbM|8H^<B55^E<zDKXqQ
z1-kpdaG{;>4fE)Q2Zgv6Lbs>xrzDKnxYAadyLwMaTBT|o@RgM?A0%?t#B1u+#d4(Z
zcO?}SrFjMz)ZTBIdfR)I3BKfxU@AkBQjRtXKD-N?jre|Cyz>5`+_w9^Sz)D@qi;4#
z@2;1{Vs%`UqBw5{5)U}grfzQQrLSL_%-?Pp%^)rQNj&Rvfpp08!U-|X(YEflZvXQk
zr?#kEd_<I@bM@Yh7@)879Ny#wGqJKT|8j{v+h~0ywM}a+&7zt$5ZZpB&<~3vXj{)h
zHnEQdsJ2?}vo%h6u+uQaIKJdKX6J;dNyn*658E71Je+(lD*^gQL5o&NHsP1If}YC5
z>EAOYlw{lb;qG60y$F>B`r)%wZPt}s!*V9%(-z59W+jdoY=^W*H(f5Wd11YTmWnW)
zrMIga_m=kqlx6H3up%&5%#4HF5WOhxrsEYh{C6ka?}}xUdpI<WDKYuK)#Dg#DRd|)
z<L+dAABdKa;IopxsG1b#;n|F>f8+Lwwe>+e#-#Dx7R<ygWF~{uYM{R6G|tdg8fS>V
zRn%omt`U_pXCt-4&u4nUKXRSSn57qQ4(o2cm)f09fmi+Cz428*&o)$v0ov%;&g4Fp
z=RPm9>4<cjkv*@PJ(?}$axMrJA(diIm;nS!_K5C5x@~+nXpA4NsX@<t*(=UmAY;J&
zO>HwtIKi2J<)Igp=)<o&PAC>d%ewc*N9`4>OA^1u&}c!~P1Bmt8)DJ4tZmj>*xV^L
z5zEpANx2{$iWSA2MO*}vw{GyKWqNY*4a=qv(HOyx{uI?-Z#C?9EcFUDI4o+oAp6qi
zeS2cl9SZl}7LA<dM!fMIlK3KVY*^uU<*7q+r<>6h856SA@eBrwUkT$~pKNqVzfU23
z>x3ot`s#p+jl~c3`n93o$!}?CYwdxpGvop4K-2%~@AAISIP>9F%l!n}N5e76oT_&n
zmiw%;8?=g^-((TUnYhefzqsKgZSQ%AxVq?O@W@j9b}V*ktbbsW%<Of~UuuIFZatqF
zrFiIrz(r6Nv(%m8{oMuuW{K(P$mT84GhhUu>QuDu0m=uy95MZ^qs@9&<9v(eO&+pu
z&Y`u7ud+>lug>)Ws!K)vz3|c8Wb@_WWCPI=sExSH$xM{Q)ep0OG^=Xx;xdfMYP;D%
zl~{*lI0oIlIJnccI3)H9ba&xA-<#QDL~cEiMeb~`XLF|9K-Il_+CzJ%tzmxUPqi6%
z*Ejp%i%1^~YHC)~+*B2Nxq}K$rI1steHvvwr%MtOVgyqaQjEC47IKrgtmEN_4~%Z}
zzL<kPo)hHBh!|0I3mzt3>*!bW#qSZIVLtFQa45#&^mJU7i!VnEc3)Bs!Vu#<@5k`q
zER7uvZQklG)iv$~eqQ=XwV!9*IBq|;AGieMu>)M!9iA}fY;}{V&n4OL>*8u=&pSSY
z_!ZAj3P79x9(H2VIHE*OX?hQEVqn}&v^iFiP;c`~I(E}){o<7I5{<W1t_QaDS%p%x
z&NLLcl<N5X>PtNDDZta(A&`;NsWc!$x&QE^*5wFFdr%@S7er5ClE|_;R8AXQVK=ke
zJ+7<EZpgJ+O(cAWZ()^y^aTRQU$1@N-X{|lwwWu6QXK!4CLXsis)@cFMD6ATpPWmq
zYWrMIJqn)sgDlmnvuA?NuL2fr+*nEoTaND1W~2Zl&1Z$#A?kXrq_%E<XNd{jaanjm
zA?Pq&6V7q$k)73c{anhkc$*V8?X>lMObT$ax{b-&&!ZeMaZf((&gR9~?=1%YRHw-8
z|CGk^D7t6m+*@*KNt<1Fq;0BVbn?9^r3pg}-}$?Vvha3l=)BdwO0-@2!^N4rn8FLb
z-lm1uve}}16ZTrx(tR(35~a^To1+CPccThCnTL&>mAl}7epN6zm@v$1vNgN@32M47
z^6IW*(_$fzluM$%tu3vCky_iDt)E*^+$dDT`y<@7qG-ApmzDH1S=5eO_S$KP@lLru
zV()bH!l!$6MW@WMZn&4V`O&80)U*fCQn|QLqnxr&>@LLolG>C%H9im-HXF^55IjzQ
zy$A;IK-CNHVEYGa6(P~#T1TE$_MiqH(cep%;~$NPQuNHP_-EC}(Kd>OvVfnL6@M~K
zRDgZ4yK_w`eP2`#!W1L?k-S{*GdIu$$~Lh7o&K`+#A(~`<1a6sjlASCY3R_LTuP~u
zD6_$eUA>&U3e7N2`=4ESO1qEez-8f-2dJKiC-n=_RE0&qJVPrsCs_O%jR%D;_#WF7
z;Paz+M`*czoK_1pT_X)=R|a1;`^i|esYi52;r%gl`S1NaA1PK-hvlq(BN2dy0>0N_
zEH?K(G?w%6w`1jfX)m`m^g~4g7uU?Fo+P({Ue>|=-Mr)<T*IlI9qCMWF84m&r;cH3
z8%_gRzb?SZ1K1-CBw{xuu=_E%&8&)BO|%<0DZrF=#tsP5GD>m>O-v!H^iz14eYLFi
zyY#8~QQPvJtJ7(#Kpm)h^(l1C19&)g*uL?@gj)JSnD!@&7vteQcuj6*UzyzTojq^a
zN+X57ILwD%hW!Di$opSOd82+gl_Fa%#&r$Pzmno#>T+~-<Kd!f^hS;bWi}Hc^VtKR
z`?bm=>{0ToJ>fw?YC3K_m0R|q@Ds}w)waID<&hZKh2V|KA}tUP9?mNMtzPb<>`BY3
z&0_D7>6p3H6;IXvezV8+{EB3Zv`YTKcDy@42YzuF2?9U@Fq`Udf8HVi&5r^E4}jBG
zpM#oLJrfi1^i`_&K(+)tsh>cfY^QF$W31C@Q-|oWSTb^Q0M1#%AOb>+O2-^LbDu2q
z&T5vXyzG=U8-bCd7Rkf$*2Dlm1v1A<f;#H&v{l%NUl-v9z!KwAvy3&23(Y$TqkXa9
ztNpEpz=OAo)fO-5=vq9cEj(b<*IU4bWLWfb$Cvi$D!ez<8Pc?s1_8;uS3jVxtLs&d
zh7G|6&x%47R}ekb?ga{Po0IuLF&7}U8@h~XL`tkOUb&4ZU|cayo3jlLUHbs6`dwbS
zUtR6>2p*>b+_`E`FDZN2?Lo_ny~fbp-C~W43vBR!-%?!orQEXej;Fe`Mjotsyn3&A
zri#hke6ze^G|#9xyfTps0c_f)-XD()q4^@K^CR4Hi{L4aujVz3roHRqu5U(Qn|jvE
z#}94#Si1dG!UWSveEb-HRI(5qY&3N3%w0|xU{3z_Q)jn}`$_Avc$%wT7EE2-u$2aQ
zQ-s<!Wy@J{9_nJZ6Wp+O8a$FGak=BanpoKeTR6Xcn|g0-sv6_CSuJ>Ivsmrsy;FtP
zE`4w-!S6f@08k4fCqX6`=P97Z_dE%v_zpCBYn1R21^$y2g2%}?DM@sMQV9fH_7<uZ
zi|a575q&H}H-*U);wdHNDrEmE)d&fJ8mh&_0{paW{qRMZfByZuo2y89ip*%*o4SoI
z&%pRCgi$AxE4&BZ7S#wq-nu93=K4r5DY;~*9Fr$Ikefc+)mpYlz18gU$oDqBORN(f
zMPy2>Bh*Wfl9fyY_|H-Q*AL4ST9!d6`Efr^<5IKQl*_Y6$rzB*&x;B*|GhRnIi^?X
zdhtWzpp7tr)6@JY#jTdTIrjm*FE^sLUt!u!+Ed8A`Ed#X*d?Ovmc>=G6k*~=EcHqM
z+-F9FI?fPbgt)AYDrwAFRdjp!8`ylEGGOGJukh}O<?6dbsrEk%KeC(Ei45WTTD#b7
z_lqXR?_h-U`tn?722a~Slo`QOu|U>El)xoyJ&+P!9eN_GOU}ZqlwBgiyo{rLdx##t
zV<6zljM|NZP+Je<{7h;;n%v>P{h|6*&udMs*!{&&@15nLI5xT;<+FF%ufLw`r89ao
z>sA^I9wvwHvlZ`&TOQ9fE*?(y&n~sO|N6cE=O^7*9#I>&$@0qIpBJa;7X!-PThv;c
zK&*<{pSeDjIid~OtRdY>7qbF#w8?N<*B1iP-CXzSLQm$9ttOt3qa7BP%@T2Dz}_{d
zRlJlEhvi{Z5@z$hozWKQEmwnL%XRN%Z;DYSe1)X5%wL}IG{-Ox0u`J17k~dVw{6wD
zGG3e(53B~TpO5D&i#u=c`95cm6e+yYboIj*V!mvCXE9dwN?OX^unXCHJQ6i7ukd-q
zIWI<nbA{~{fNBk{gY?FXa&52ckUoq)^F900?>5K6tRA$9ulU`y=;;LL^?p5<<_U(z
zL?qfhi2C5O#IWyafftO&cl??V^F?qO5DXb!*{7&gxO+t*XDh@aJ=|IOCD8rFa28{G
z4>Mr8TwLs)^mwlR4%<DLs?Jxmd61|}mHbAAGk`%c5?fF8JXFGwAto*CL^f{C3}AV=
zU7o{qqLD<y;%u>yyx5w>SAaWejKiRwo||9Zu3oj6f-fcAqYX?f;*<{6k)Itb2Vajm
zv=i^PASdRYm2j32NXSVKv(4?-i&OZ_yZb8z7&W^sZrN0Xy}d>thhpybunEg#7V)>l
zs*uZkT5Z3Gcd#0^+(RNlZ)ED%o4=WF%aF09otya*CSeZ*%%Sak*3YBq5WrW=PiN{x
zfcgATxqh`(rEw&q(EDw>S;9N+$Y`Q~Z|5M>T03}t2zmgZARZ5l{5*5)Xv$26xf03$
zY)-fN-ffs9Z(n7WO%p0mMR*slU#$CoO_?FK_PeNYWIb-V6y15A27jb6!qPV1{Dg>K
zYExiHYFA!}8{t(MrJcpH=hGUFtCHeJaR&JdjnkEmqiSzHx2KzG^`81|JHGm}$OXOC
zkhiN2oa6+K+pUv2`<);MOxiM95_#_P@#HB76WY$#Z*g+azt&(`328Ovz^w-?2Gm4D
zSxWZT6-ErIEvwALxhpEtK#DnNKhpQ_{`x_o&DsG`-i**x%Y5~J)!kBD#`W@s1i}bg
z`b)-h+puvb(;YXq7Hj=W@^k;Mw8>$&b_OG-#Po#{zbG@VVDkjfES@s%CiHZaicb1y
z!ocE9wk&YIF^t0hB93jlkq9n}7=NNYUU^j%N&ImTR}cOxzk<t6IqoCoa*$jl)>qXO
z(pX)|qWi8|D}#&qA<TajfC^ZK55X0pZ2W&Us}gyVDM$61=1!aq38@(4!K&Q(eo~14
z_mhFH2#KYN=8N3;_`|k!M}!gmJJmwJ|F_IL;q7?ljFjxqm2nzDBI0BmT1K)Z<yxBc
zxUSB4yysjK+H-SEgDbthAfJ=-(V77@pY4D3Aqs#PCRSL82h2gVug`k7GJ^l_^C_3%
z_ea+|PK&6ulA;R$GAGWEipceo_k^YcI#d5=7se#haaLGOI$He0CJvX+pUWg#M$vcG
z4Fe^h^K$$3AkYzEgYh8TOc0sOlh30{K2)uSa0$)o%1o?;u)%<Xzz%1~OgrhX$JbkP
z?jG9X6X@UbvQ(7uvG{*RX9?i9p#|Y2M?txv)YR0oh;Jl)Aq3Q<#U9nB9T0AkJG?X<
zgi{~_?OTb?#I02Eu}}Lp4=)811Ur5RC1j-bdS!=%C|lMHOG-$%Ol-Ce{qmCL&gRsE
zexr5a*=4LEe53Jzqbr(D{0H4sDb44sjr<=Q%XQPWbb=P)YQw!w0p;>CSVL$k-DuQ(
zY6yQeivylmgwi`DvD?2R1%a>7fN9qD9J85Y`U;43-V1cQ;@<(422`M5Mhh;ME9E5}
zZp?b0AG+{wOefHfmlD-w4qyy_U0r!+U7V;rMM?heSpc(UR=z=}NaZ?M1JIf@30y~4
zkJRS}hTGa_vz&c4yN(p3S>L<C0*D?0h>oUBz~_a~LwX{<J6|7D1u`(1ZCAs{7R=1-
zfDp)ZiIz$F#U%Mcj$=7zW<DcpXw#zq#C8NG`x5oyZ$x;c*#IHWU;N#)(53SRcgkI|
zs$1fMGj~?B<L(;{QPlZ+Nq>!CjtMU!*1;F(Rt59pRosqi==YE%U!%Aw;`lMZEWpxj
zW9D5Hy@VZuD7%DgR{n_x{C$~>_xFK8i>{SH-uW-R74wZh_Zqa=VcB>N;OtL{@^og*
z_{<Pynm8zR!{jrj3`a@1vA{ojHy!{gMRV%C8ln}(r=@7&g^4kJ+pxjoI=^;vj+FQs
zKE}OK7cD4N{yiJ*fVRaNKzqgI&&dTy#i`f0VA;%Hx9?GF{(k3I{3Nf3tz~WR2A^8h
z*QsA_9I$xZp?C2CY4T+90k(JZS_ptIKYP8ANpgOP%3|Rb|4+xiKk<1~w@NG64RQa`
zLS?f&yI&yO7*d%WD0*UVN7ft>66!-L4ROVL+Q=A<^!j>s@#5M2tqzhWA^N%tcglM-
z?L$B9Ke>;ndmBV;`52+A`SJ_13ZhU+O74>;BAFce5$bQ$E0S{qB~HoD469*6F)D$7
z33qF|NU&2E!1jRE5JqXHACg1Mct*T<5z4T?xJzo1aG;akS^?jv)>a5Vz`fw`O%OF<
zZoz+-LX-ZzbM4mW{@jv(mx>jBo__v>)EDV+HqjnW1hgTkE|$az5kd}ax~ISK43O)?
z<1bk(&q|PwsCbj%-bP^)W+kTZbKtpfKS}sn5b}j??x7yC1^M;gr8LvVyk}#$rMQc@
zo&4?<<ffBDb!w10xVzS0Xx9NUP`8S7AmNBQoD5?f#mLnVO>W6Qw|+ko=w`<^c%q%S
z@$9it4DfXv%g8VaWk+sa?Pc6knXfVAcMWOHUI+5aJLCil7M2SZrq+oP2!Ns&Z_}Fz
zmDTYCP@`X)4n3L(yFHm`_>y|#8GOmWY=8~@<Yc}2>E_O1H!a?m9(U<zf*FbEpZaiX
z$oNHNb@ZZn_;<<c9oPE;G<x@73e@$&h);rrbdHoi!1MFYW<5^U;4m~Z?t7y~Cqq$*
zXmvCM+A9~jQhI<>Sr*|H`h#|{Lzls|z--`y4m)#>mtn@2w<AGGYH2B-^4TdD7Wceb
z%~w}yc-VrMOG@5d7OF)bUT-t@Ous4jB~3A$=z6sA%!j;?xu4eMwqxC*xUQZJ14ulY
zZID4M3Fv4@h7{z-(z3+#EV6QmB@dMGUNzWfmH{U&g1$aBf!V6oQ=fuTZUVEQ`oBj)
zO(!uuh-g8)GNsyu9WD;e86CBB$=-+7|4(ag85ZRe_6?I_5CYPnAgDAe(k0R$pmf(F
zs7u$<Aq`7+tB7=Wr@J7jNOyNE-8{qpe&3JJr{_Kn_F!Qz_L`ZUIp_SvoIIqU2nvo$
zj{g;5(6Dvhml|F*a_a3B8QhCIbLVH6oHf<t;O!tg<(qw3-Yu&C3t16k`}=zl-L?Ka
z&A49#Ep_g=t9&t1WS4donX$4x-f?N<{h|L0uE*goiv87_YYPk@k7P-kWq9l9YN@t2
zZ<HY*o(Razb4E{$2*HKB>+p$BKj9}kj@x6=wPRyf)tg6dXP(OnE9}~>2tkJ=9&a&C
z98^+EVb@1sK|*Wp?MUN#@6HP%lZb%=J(pW;wTv)qdb?{qy*Bj?`)Kpu`0BtaFqty)
z#_s~w<+dea@Z!G73KaTz^8ohZWe3vjcKT_?-=1l+5Q;eXQdRxoO2}y|ebjJM->D7v
zRjwX?t}_Lf9#T>5C--_*<d4$aE{w<6H+XOHK#<%V42p!TFwM~zGuU$MvD>=oTNdi`
z8~fC)@AnAiTTOngF0<RzoVnbS%<x<!NKJ^IKP4?ny;76S^#t)8f^2-jm`+McvUQZG
zzo0+VA}(74)7s&yaBw(V3*&N`->z9+WmW`1mHsxUGVWpj*V=dS@}RXWtvGY#9Wk-n
zPvJA`_|2ya*vAOk{BXUR{jP%n9=i>7sULD4PS%NGT@4J=@BDd?B6mg15G?kzD2JX_
zU-QmB{@ogjA0qY#`$^m%?h}PSwyAKi$KMm;R+qAWQ~&MmYRe~Mc3X#|T7kf6TR)z|
zje<Ol3Uj%dgT6*rgv<G-GuLhsD5CpO^rJ%%SfszPMuT&X5l*zL*f6WG0<OlWoSlY9
zp&&ZVZ<Ckc(&DBQ!!zhGE4Ig7eKy7hHpeXDc1krGBc=Z2bYaJ+QF`wwN0Ei#`_n54
z+mlr1T~8au9iBx%dWyON827aTtY5vMgicUFe};6u_q;fqufu!%LnljytXdStQPOlH
z1{c4s?fcy=A$3jUqMF~H8m*fDJNO*<v4&7~X)22aT9Mvo(Xwf5{o*ZutDBySXU;>B
z++Kf~5D@R+E-J?IyuVm?C6P&i5sTm`t}J2Q1}k#LovUn^)rKs+8|03QUD4HFOSva?
z-HU@gj6Ts2fs2bH7JCtBm(LMYVpI5ZuX4c~oAi)#)@}18-)xrLalxyVk-E>ufsq}T
z|N4D87#&Wq=M@<>RIm2P$EF!?y8Yh#1)$PM5IqUBR|sasMLfQy@Pv~GS!qwZDvV0%
zF!J7)clglaRosalieXOkoOa@S%2P2In#RRwR7(=v1Ewq$R7{9nQ1JL*)CgYh43IqJ
z5jmONRWR|}^R=f4h*7YdIPtjJy6=l@^<Zf$aC~vy3XwcsY0~l8Vb&G(X{NTU0R8sj
z(t)#Ok&rlgd1Qk0@&H$Z3Yl~VbS@0-lHz^15Vn#w(_|2CakephNG~vGt)kddGjdEG
zovBZd8$$(k6GD1;x-_|!8L4D!Rl<gUS7}T6yJxt3$uV*#@aR}iT%*fW$djXp2d3wr
z;fn%cu+hR~?%k`C^pU<4y3-)mw$0s^fjQ!*t^E8$EE>y<ric(8<mFvjVW&$Mp1Hz`
zDZ+a`vs#5;N3#<r3e(c?VcQVK^0w_HRuJAnAPiHDiHY$_<LeV_|Fx9=BlWS1PvXzF
zn5U7!Q#;Ng3M^!ZTs>A4rP?$l4+nBlJ`G}LI>w}hF-l0;+NEbLA*I~TpX#as(UcO3
zq!JUcjA8T1nyCJ8?thQaHVLvK3Vr*p#Bxr>Imeauxht|W#3UAIb~kf=X`7qlw4V{{
zyZ_%k6ia<%rG6}qx$p9#9(z_!mR)xENe0^5*I1hx`$K-_Orq6aL3^oMqnbU>ZAYC(
z>*&A<%$<?&_mV->#Os(-tIi_v#+AX2)I1)w>oa+=lVx!WXqTi{&g@Er;!_|r)kA)M
zn}o1rtGh}MKLiKMUO;<GJnP}`=9E|6;tWow<F!s~Rw}}MAnsNG4OVuu-W8d&5wUm~
z9TWG@V|mm_m`s#c##INlE@te*u3umIT52LpE`JiDfdhRj3+-J{b33$c_BiM(6%}vR
ztBs1~YM~`gW@SvQr;Aioh?iAA0#i7nh@5n>ac@x4#2t<Caq25(Rv@91)Y8W}x0=Pg
z*OEn59GSFy?3S@?7%01>y}b@(i~nD3@X`xp(rW&%P64h6YO;}4R0!(vfChy=hHZ%V
zh<A2esKEPJ@aFYla+#@|EmM_N<A1kp2|uoVmG~vea6$!TwPDOvFh$7Q(ys0RH(~WC
z0p*;t0G{Jkw5&Z?A9re7DTms&UEcIjI+O<=)^BeV^tGh$!-dnZ-4-(3bLlCbUsK)0
ztFkUgDSg4kFCiy=3MZo9a;C-0)_seAGUFHvEwaMW2eXB2jv8i_PRu{MMf-7nYrku(
zDYc4*KPHx<zRGvEVf`7c(P;l(6fd>dPIy>&yltT$o1$O6Q{8opA|m*Z59#P2OrISm
zr2hUrsH*eH7%}dmGuxa2&v1Qaw*Aw>{N-}RfOf4Ti}27;o>l{&s4#pb#-ab4%$qlZ
z+pF8-l780HmYT}_kzf@{KC~YLy&ZL{f<10RqaLSIMm}e<;ym5vf7W<FnCWrFo2;XQ
zK?z+O{7F!QPun#Y#)TwZsy?5w#m*c^j<u2A2$M5^vEZBv`6(FXHtmCHzOWeCK-<@*
z*I@UQ!~KMu*wsLaaWOVlRpCrpk!WO@aREJQ<==V8)$&8W#1Nl~H<TsQ>-Va|>i#k6
zG0$o9#YeBd^1T`2;%io&S|$C=t7(hCKXV;stj>+3WoAU-i`-mBF=^Ggy6QWfOdnEm
z`+&cvmMd=MmtTI&re!~Lb>gtXI~ry5?%}`c;`%wPLOEL7_faN6z-{2rCr#F%2s%_L
zc%&X9<jj-@mDN32CslU`$<4|SP6NcEK;kJymZm~pFfqqH&DjZF*o2z<;az#?Y;UUU
zjh;0F7Z(xnx6<4fd*ZKU%%52pIuJZa<Roy@OU!WR^ug@ES6{aGhhnb0+0)PZ1>5r=
zdh9!U{W+P8jE`KK;@(b}iqIFfP&gld!JXpgtED~yOZ1^pQ~j?NqCc&a4iZCw1XhvI
z$fCjRDTv#vyv@<r^jJw}Wy1C3HaLc%-SwF^42IG60j9vB(ak8<wy(lI(WG*+qg>h{
ziGdpsfZw>uNeeAy6o`=QTQaMMZ%Bb}w5T>12V*-Kd?uudFL(QBYbtre&JL@#!TIgS
zqvGT=OW`V;$BvTnAt7nDQ+B~O+8Al@jl4$mQf%WvFAu)J#a1h<s7mB!L@(Q|mQVw1
z)5;>(E?~2pK_(3%tm?Zc)fA;7#Im%wau|1MFFkhjXHMs^8znM`+=Z|FNkbpY{Fl~a
zf%am7Vi~ou4FNw1oi`9vmeXa}B~_g+B91$Qz69czt&JzM79B+5gUf4KA!wHu`))c_
z&K!@uk2Ky*hJ}S^TdejY?b32<>82?c1(1g#b+FMWvYMOK>kC0}bJfrPj6o>%IBkD`
zzcHKHGawwv0xZyOC5FC+MeW}iy|gP|fZp$n5WuWnJ|uvX9^<6NHaHiR68x|*x9{7V
z7l`mfEw+)ir&6czS;~byQ!0e4EJg9=$En6Yp(ZJOiV<Nw$;TA#3FHRfFsJCz9uR;D
z9#xLY$<2*l?B;ZB>b>DcF3$TU^Z3oIytkS(Eh>T{^5`6)JFw5v3z62NoMr@WBJ@8+
zPiR{P(&Sr)lrK;RIgY#Zl+SR+%AvCFCWSY4Te1hY{oNyQ1fJnEg<QN`nh%m!R6*&6
zjc#%{vA_AIc5xixdu$@=ny9L<%?w&qjbWu*31+8*_c=?{Yi8zWe>ZCv6+~yNpt2jo
z$IBAZhFlfRH)3m6)mOsDgJvN+`*e{H2dd)YlH{_^k9s_czm`VDXw4Y!?y*B6bzd=Q
zThXDSL5{8|w+IHFY}aG|o*@hyZX0;szf{ekYb?FsE5l%=&63Uf4qSW$H8uSquzpX*
zT$8-v<x4E1nw!y<7qFxP(9TERNy-DgW)^O4?<TFLzDl4w?p(MV_~#N0p6+$9S@{`x
z-H67mP2H_NF6czR5Y7r?C)gY<=CCU(W3X|ta1-*AMYP({(xMnxAkm?6ODX);`j@5X
zYr~v4)p;h+^B>gr?lGKoHJbD!v_hfyPOxFt11G1?!>k$>G4)=a?<SS<-)EhggkLpr
zzG3n$LFK)?Ed5A4%cbi-y7_BS;C9X#TkJ=Id8PXnF*lodH!&_IPMtv#cH=dN`O<k8
zsT)?6koed$uJd6pQB4hrx8fgL5r9RdK)=!Vh6{pze48F0_eIg87?|_6P7dOzd0*%d
zlJpwDFhbD2EMBb7+48i}bE%63-#;ub*&dae_@)S!R84tr7BPGhW)L$B{<3I-$cs*j
zdbp1Rn^?^}xb98%?L1zn7+3)h<mO{TqHL0Tny)-C-YaMjL#**0V!7|{A4vXvH<Kai
zi$W%O9=HEsVt!xV!#=jGYkxCiZT*fQAV?I${f*<><6pWxUrjo+K@1)-x*2MKY4r;)
z$3<0UrY2f|8#6Gv??1SYb5|#(3OdSZ?KS7yy638(N-`jNYK%U9Q<dyPvU-wn)uhal
z-7XJ0tTQes$WZ6f*%^23^oX21;jI4h%NJFxg3pk)D{A=7G&Z@umbOb5meH=Nbp6#I
z*;`P7F>4T5rK;*;F(vTEE=$Kt0y98b`p($xg6Nq^PpA8fvx$s;+THv6elO<91~ig5
zl)B5a`@F$Ikh-5JTb+!UK+=Emepu++(twqBpK}Re+hRf23WUs9Or3Z&p{SMZ$*{-P
z5Awh5l@b^ajTDhp`=Xw`ovZ7P#k8djqKpiRG;(7xqc|f$K`*WiDHxo$<P;x_r)1f9
zUP6XiE?;uLav8>@u$`x}@c*&xIi8T9d1-4AU@gaz{pJO=Td5}4Hk7iJSNkP+)mxO)
zTcIUYW;^k6WQr#g$K(%=+nYU(HWq@|zI=WmavFh3;U)WfTA-Hma(_Js6RSyM_i9z#
z-<~Szg)n^q_{{o{MS7n|meSKxYiw3Q7TylI1I33xAcvyMkD7tD)(W$?zoFOQU<c<D
zwQ4jugy`+0>Z%(UdC8y%Eh$yM2~3dLOZ|y5zQWzR2qinjf7$j~y+sVH6<=JxQOhaq
zmCX^G3H8tNQ6x3di9O{S1-7&niKCn0fhud82#9+^kv`Qv>j{xX-@?y}-4mtmCs>S)
zhih3Ml*OnKAK0|W#r5g`22PjJ&P8ezd{Hh%HtydSbaibGk&@mcf5dGp&QE3r=J3R@
z7WUw7<xrBC$(y?B3sk;_sc{k6J!@v0|C+}wRSWilz3l;Uyy>V#0b2K^Xo2gD=zm9>
zihW_Y&0mhQ*{*lSKK`k-%lr0AQsbR=7f~3w4s08~Yz>Ou1ZzpH3}lGU-f2*V$~%y=
z&+QBEyqLLzp@L6?1fZ3q+$wG=0s5A3&%KML#Nz=*^Mkgt=+s0C0e(VhZYMh(LBY%8
z)lBW^$G@k8RTb)<EReDjcaufqf_<AJ=E3>+^Gw>L%D=ijKh}p7Ma9puBC$mEA@oE!
zCmUq{j#qBZ*<9U^1zOX-Qs3vQ7~<hcj|ut6O@4JPylUv(WJ5)M1EF{A;lOX04~(ib
zQk5`anB~y_Sky&J%yR$i4r_Ao<p9Kdi2q<U=*QiH(zQWv=C*&{o|}3)Z4V6qp_G3=
zIFR9u&&|$Q)n=z4sJ@^ONiV_sD%aav25bi-x(EouWrHP#gj8@RH|sWWgNt4ZAStE%
z8A}p3@8IA4?5<|fi}BD{2CVSJ8?4B;=@38YZymUrrIySc3w&xRFA0e}lW|!i7hmu&
z$rL}iSj`;zrj#l<GXv<1uI@gky=_VLQbrzfP6M}&iEtV}>i%YQd5^%hKk4yBim+^?
zdDhLHAP`d<ua{ymF7#K{Ue9C-l0p$=(Of;lK43(d%-|Gt-IPOJloi*%y{^*ap{P64
zGjx$+zfa|3Z*vZ{{I5-?ut-#g&hv~e3t;`(*<PY}H277IKQ%f2D7K2PwwJqCAmZR+
z-f6Si=;rD~8BOi9jBat-Jk=6hsb;nL;?&m1f6zT@L=h~9O?>Zq<R7;&Vjb7xq>B#D
zy`P}R>Zm6RGF<QLU2lwk3i1qA=mq<l_}E0*CQbuOYr~mQM%A(!R@PepdijvSC)koC
zS_h09SQKN4lcXJie-3IFxsjzOrCC+vk&A=*pVYo6tvpl7R7UPhM+7(0t4o^5Nf}Ny
zdNv%i73wg28dyZX<9N2>a|13DOBL0VT3Y-lUwr9hi_j6BPlsE!iLE$4khSsH!vx7F
zA@}&UcAV<QJL#T>46U(exVRSLqIQfbk(9*hl3=BJ9E|9RTuwRMCQ@cw*Y3^Tou@9v
z&3kV$DEZiPkB&nd4e7ntnKuRc`C!|661cP(&j1u_g>p4(AAtRUZ<)hUx%j;w4QJC~
zly|XaUrNv=k4odADq59Gg?67P2trH@?M>>NtZ5N(J$Jlg<Y5<J)pMF-%iYZNds^A#
zJu%eK^_NgY#klT;R~MGLLY@K%nWX<OTGSb5wQd#ZbrNq+<>!f4Ojav^#<K6DuHK%K
zpVqB9)jC*_P+`A+OJM|x?cZ&EM@;qD(a)rdl83})sSg6JJ#T+7O=!{LWmNV>c{TmH
zVF^4i_|Qo>;~fvOQ!pad&1JftK-1%@inKSq<l+EOMtGk~!Ay7EXukM8++WK-GuQto
z3)7lBI`Iv~H&q3Vh+G9gI;A5ZVf{kuOcnqLTL)FB=)t$X_T}YFq)ko5Yc72x4oS01
zl{EejsFmolu!L)QT~#?~D}+um#LG*qIjv{E+)XsPFTM1FSjTlPGgJhb#O6~IH0$zI
zWy&FPa(+t<eHNXb+Mq8L5i)YiG){m2ey1QQ_>z^83ov(_C5<N;lk;)ix;?xdL9y|%
z(88|1KOVqXtPk5R0;npTMmEAxIL*`TylV8+4sH6B5t_wYx1VZ?NWM7i47t7eR7Cb$
z#&!TvWHpr|qF;yuBKr?VJ_&n^dpN{<=6J8ex!cIQj;=>s1ijzWd%h#=V<BZ$+%vB(
zKzxSpa&(o<<9c1R$Htp_(!({y?_5X;CC4Ocpi_}1$%GZjjWE|tn-3kQNb%DmT$jnj
zGKDGTbcjPjyW#Pw8fxkU#N<3k+I7l`Qm=F&y94pl)=wk3(UT0Ado$iQJ<ISZhE^3u
zdsB@wM{7g~Kk{-(Ksjig^jQ+XBeirS7;K}3-%vi7e4Zye>@ch6#Cb`$k2(;Ll{a&K
z8R^0Y!$0JoHM@0b@v<ha8+mn0*UYdomcNs73*ijUiVHRVGv*f%d}$(mZZlmsl;M0d
z%=bu*9|wj6$K%0p0CY_jeUPtUkNbqtmC?RhyWwnsL2_&B>|mNO1YltM^RE(oqC6E{
zy6%P=^2HcZOHYlLBEu+B6|EaMN?-b>nh#_Se5je_Y;_}E&*=tW`>*teFW}91q|F9z
zMvAg7q$$S9^74v*y#y@k8hl7;p~^X|>e!KZd|sEElQ=qDbLsF)d%J;!fqRh$<@d9)
z-%#*L{Mr5<cFROJX)eX&3p`?GjKo|NPZdqAC}~hGx3||1Tkg=Q5ZJ8Wv^07DNBQQg
z>{pd>oGPo-cfK~PhtY7Ur7z)n)sN`C+i;J;N5|%L0a9<kbq;{}6ISJ%T~*x>T${LF
z_g3n<a>SWs-%Sk<pQ?ROmY!vF!ny6#c7)x{z%6e&Iw4_qSN^hA75gq1IGtxPkZI#a
zNcklGilI}tCyT4G^;gaGGihn*&7vyN@0ZW8L7qes&#UdugFLiFxEg((Zz-w$@&`UG
zt)QwJV0~~x##X!2d|@{KHycUGqbLQc!x)g&|G8^|_=M4jm*!}i$hp6<_UETk7&%#e
zUnK&6$93V}EqtRz=|GFNxfokC*JPghLsn#<y{G=Cr%C9O*<F2yKhkk<(=op)>!D}G
zk0D=5ye>iP5g}gZV$=h4D*&e<`BGnSK_O-AM#94)qnrp&K#rvOUY!$@cK&7kFkS|u
zh*U~yb%=c)R#h!0wS$M0(>RlU%mhW?xeQYBd<Pl1!^vFR%1)c#y(4`cfNUH2tf+5J
zdU?Ra7{dIC#UCnX{y*X+3Ixe!K&21#r+>J_CQ5=vio9H!?C$GX9Os{1ieAmEIXgR(
zF~;Xsm13=7Qv<+78<5(eRM2zb2%Z_;$x_!v<Z;Gk37m6xGBKsQTcQf2UGa&to{5Pw
zo}+-_TBL+F6G!VwnMnt~iBv!)WrahyLgfaiLqd)de6iC(JdOj(3eK>#uFM@9rObXc
z(&rOOlhuOPh6Q@kO9u}F=$%0Iyw<zWUeVWEf=%u`Tj!gCZ7rM@+|D1&_BQ_IiSXKe
zpH9x$p@e#&2ON$Z<i2Pr<Z-Jlld#{_mn(+TS|n9fn*g>yZ89fPaPvp(+E9})2Q|N#
zF^yqKqI6FJ#Qoair}tU*Ezx#*|B#MbSE{+N%QN`db-?`rUL@jCG$)eMjn=&|&3hy7
zCtwR4KhM&m1a0f)@~dFUqTXWQylqdfO-nog$%_L0La^<{)>OtAqj_OcTm+<Ok6BkM
zRIZ9#DP@ceXu$nz!(p6s!t{<MyC4HUGoKL=I|t=Jdy^N1>5+w{j9I=cj0;7Bbng{d
zC^Wip3X&kt$OR~Be^WYC2diTXurvRn|C1j-RIkGW_5gr>f`g;V-1z}`)Gjj*FVwFH
zy!?WNJtxqw4BLJ*$3d*Kw7@AX0)0zl{1q@WIxrYkC+e4cY68cn$?pMq$!WBdGQGe1
zp8=vJ9yU@c!aHnk{z34X5BL$(<~WRh5{k!1LVn(}4Ttt>X#OC3{l=qld#pHps91wd
zm+#Y?aDN;ihheXx`VSp7fBY=B9A<4W?+9F2d_N~nAL~oSXlgLU0dm_%8bWQX-ITF+
z53!W7goKclq|xT(qoA@(l8AA(_nHj@H_fRM7xlO0l0O^xhnPY%&<QLnDJJWZQCDAH
z#&TM1tX^ENN70Q6go620P$8@)-&i_m1o_8jj31oYz@9v*dqJ^B+I#>l9%*FB&Zkx?
zC2SQ>OTu}CF|P6H%b@Mju2Ugq=bSKo9cPd_^I4?i(<fB3=rL?F>~+S4>)vJsmgSd9
zrE#i2_<h5k|D0L_Jc+aSd!~K+U6tYIy@cDB)Pp~L0Y3ts+BC)Z_63E2r2o}R%zB=q
zFfgcvZvRKX|JzNq5tEp=k_uL^gK)${0b0=6Ip-%5`9$h?BXx9gQ5zWDuk$B;aq`ID
zk<Oy`>S-0ZRedH-`y<~ZLhs3tyrUSy-?9@-FlNFa0M>`%o-&*)p}~h^m}Yh0%6v>E
zpI1+YmbKq7C`mcj(n*yff@|p{-vNaNb9*X6^1m@d`v(J}{9#+@B|SU0l8@dLA(I>t
z25-IwR6C5^I*24rR!xniJRwN8##uP-E5e{5AsrW2#Pt{1?Q5$ReWsGS)I@}ouisb}
zf;?h!(_y95^uT;)XEL$AyW2WQ&3b5Mz{2h`>y6kVe#&s(qO-s<-$Bi6(go`AEq(P=
zx5{2P&0ep<?kSg2@|#3ph`vd-sHan|yQ*^IwedP-SUue71LXED_``(s)SjmU;=9h2
z45a9L<frHGpgKCKC>iV^jRIupO5Zdg^NYoi;txbpmtKw4c29*L+^=12Jm5ywg7@=0
z9a6EBQ(PS38Tg^B8gtv&De)UKV1d9FTYfabXSoLMaEwBRvnwkthmtCPm7T1<ZGS^~
zMKWltJ~$F`xIKj0w*9@`5b(hD)9*(SpC=Pc_@@AGzI;3mim;r`2c=C_vHeGc3oKV#
zt@eWqud>VFcAQx|U$x!N&A9uJ|9=9%C1t;zn|Tk1a1{R+E(r_$l@YJ1ARRdjoarv(
zFP(qpsFgTd0SQYkwu1;RTqt=AfGj(xH3PQ47}&qB4nc&M1+ES?9rr^(9FI17hw)n9
zl0AH56fXFEFYBJ>BSkj$rOnD$noCQiO<#RnB7+jp=rqj+z9LeRDMA%YDnY@m-~D%<
zFf;g0$jfueR(gBNgqtoWa9ntXLjqHkg~M+e_SzLfSp|~o>MqO82@0w$P2h&~;jrQF
zqr|d<xtbg`lEO@rFL}=F&g`AdKfKR!bBcfshw>38#92<&$HRbeJJt2zVRY*W9cuo|
zkjm<^?Bi_bz7$Rw7h{lhlqjVlfuz!#5>9i2_DGDNO9t}DEgI*PqM!~-<jB&yxdv=B
zhNIBFoY&bBX=2uW@f2gaRxA)P2}_B)us&mL`=DFVI|0=y-sTnugB!yhwVuM7IFS(F
zvN^_^)RD8J4Yd8m@xQQeC$7P(Q<n?p>I-o?x<aYQAU4LeFZ;_e%=+hB6+c#&Ft`u5
zDqNihpTPZue2hj48F={NEhToe&!b)iR)_`!AUt+OGcpBZFTXRkCenu9MN*#Ad5a7f
z!RZKygnO6ePHezBs%b&r8KU51kB`G}yc%su;&$Oq*R8Z=srRb^GN@2GoIbe$eghRR
ztnvWlEZ7I*jKr>9S;iJ}Ox6tT{!Be}T~|dwJ~W%7%Re}_tT8sbZgMCK?9?hmaJvAU
zrtdmxklk!?6_UWJL|W)MMyQW>NioZnm)q$;bxKmA-p<wL_Nn7eV6nC@S=;&Sk4DC=
ztJRzF2)ax3!NBEI5e+Z2eRp84?4LUM&Wxv=_|OwhPDT=T;`y@mAy?y#f2}8`X?t8j
zhq^sGjg8pN*1KU)0oHvtL5IVS%|O0wb;a^VaWZ;Te<XgORboEB7&-LhE!SE>ja3$^
zD;i8C&V!R?0p9qXr?M{>XWRvsbF=h=_vQ?|P;_QHGfL5;iU<_x8XYE<k<CPPI4p>0
zH8#=68S=M0<nHRU%hNExv+g<BwHx*>qUVm>B>a280hyb=3?<8|&(;C@oVPjZ{=a|0
z0t3_J*Si~>V|NZQ%+Fgmu|!^@SEFsOvpIVDf41s|N(H06zl5H+d{k``8or{i^W$3A
zO;1EY^YS=l8f>aX#7Qxqd&npgu8N*9J)LzgQ736EQE8a}u_WD?f|5+_AhLLUvU76^
ze-G+}SdR2Ae>~X@;A)-|_O@R>^*)J!03E=GYo4P#p=f=53YfjE1M5{7+nE=UWo*Ke
ziE&%;Kip}903GtCB=K5Bye~rXvQ;&?*{C+^F*(QguFYR|w5l<&$ydjdX4l6w6|-Kc
zAenQa#VZ-@$(i-&;PSqlcka3|MwqmB5UGiB5<>CIhV#*D>G1O7Nt<rn`Zz9d5E0m8
z{!>e6sx9Yxc3a@WTax=g55EOiM*C~d+{8~V>p9z7HClCB!*(f{T9Z#>iiKLujXCGf
zdffA^^qQ4aEOeatwqgH_^(U*-9LG#t5Z~ndA+XVX?*dKCW`DE4D}9JhL&)Xan{L5Q
z{|-oC$KnF(i<Cj_dTYAXNV=ZDbYaX{+(B4RtU^81-Q0YxjBNflsfwA|%{8Ztx|dCF
zGN}tYW21lzx1Wk2E{d}G><WG!PN(=@0#kZq|N8Q*(QPrjGicoQTuU*reC!6Zdr{z@
zyt4T|Fyd!B!T54=4nWnVv+I(UJzd4)3{jEu`Jl_y6qtTQz2&aQ`KPamzjb?fp2`e1
z=<#E?!8ef(L}Fb4Y!RNiRXp=Ne6Kyi4Gz=2Z@k#5m=d?9wGqF1MHeZ+xHkoSIjxd4
z(@215Fg-5N2Y<6~^Hsfh`j|#hcS@JlbOGcO_oLpt@&IVWT>h>P*HxtGYo|GyuVFA^
zx|EHGjte^Bm+pXzww$6=36Ww(Qp!|*Y;yoHn^*j?&o(aZl1mfQxp-YrST~(el!K4m
z6$QWLT)0x|Ll25eZ~jlJf^G4Bqq>~*{1H`?HQ3GcF+j)sPEwezqvMM*gmpNNZ5bEN
zRCCz+el#qh=g;(Vo2wfVc3%3zX^_O{Ee66SuC%n}L{-=b0d*~zb+VXF<%_5|m!b~~
zO8j;ce`K+nX`YGEh^6sK29@(Z1-ezEAe<9=hmeUAKj=p@7{%5o<EkV|PN9Qd)d3^0
zDKa5Zv4ENaI+3@<z=}A)O)|N7{{VX`up6gl`<wFEX0-gn+-^UmqzvT=>yFhD&p;H=
zEu;{S{KIVNx9xx7T=xq#wVE+LuVWqIghb!=k+qJMrMNT1#^<w%fXFcTWq)_~#U<QP
zl_c}R{R>8nscC!t_3uT2x6MbI`Ddi#ngrB)=u#6-S*MQ*&$Ka)o(i-Hu|ptxE-EJT
z;u!<o16)V<L}_<-^wUzP<rMG-b0mXQ2S#%Z(`i&hdAZ{%gt;O?%+8y7LiFaAU~Jf)
zsJ4`U<KdAv+NoK9gfP4LjaW{}^-KK2uFo!9x^L{Rt2<X#m2mE~p{^+>kqn3nk&Bg7
zZDOE$);n!TrE|BOJ*EBlMdJ<M9%ge28+Ikuc^&opHP6h9exVzhSSUVNIZybo?iZdM
zw^dtN_6^8lL_($pYwgC4+P8}oH%CO^+f`}sDLZzLOS=@Au2fiAWfMsu7dXd(m!q_J
zou}qHAI5cRwXF-sVhd8{(8c+jQ2it4z%7KK<BUyffA&`xS~m<5h^&GoDs;6=cQ3x@
zft2+!^hWfhCgvaL(FXD+Zt~l3#`p3~$f}rR-67aJ9k3%dIjVPxCoy`VbhcITIz4lA
zBDgk;`#D!6IIvC%#Ya+_nwq*?tmi0B`+xoYQ_`QAeR;P5SF}~I)u?r~)$PeWfja^Z
za_41yf<77;(i?i%$fWKcoI#%Y{mIL7L9}ghl{T~WhDDz&Cs)A~w^0P)G!1lrCF;<X
zL$&k&)NbtH@0rHR+~IlkWfU!!e@7KR;~T<=i6bok`co7)M8r<izy&9+8VYQT#0)C$
z?s*PUikM&>02j-r>H6E{?K>KBm&njDTF+!h%QPgV19)5&$hIih&f-ya3d4T_!VH*A
z_gzH(P08Ar+{(pJEVxw82Q&3Ij^B6yjg@q^T@!ql9`((tCn3$nqjoyzTID2yssm6Z
zW!DGS>%>n#4p)AfgH_q27{{05%+kcnDOQSMXB>DBfJ%A1SF?}Tl#}p^HdVySiV$@_
zLT4RqtR~_l!l^Z^-!YUv;fa;Lo(Q0X3RHpmZBm1sdC|5gV;zw7#4FsFX5uc<En80I
z`{Vxa#SHyR{~i^AZ{?CAAD`3P;nZY05DxSw5wFd=LU;5F<?=im_d7}Je!^{P+NVmM
zQu8iEhSLC^1$-q}SnA~E0XF@fkr!3&&`aA^!QQDTJ=5F)NZ=Cn2X7ChaW?dC!Vb^z
zL<1@+efvqPX5qV6ybeV8R3yi{xD~+K7f>@Rb1CWfoEA>YZ{$@sX(Em2qM^xu4SSJp
zSKN`PuJbjj#l7aGFB^K;BA~+4Y3}}@5QA?jHGf2xo>itn4JS9Qbo5=x6h0@x@A>oP
z8KP%eAwo5$V`HjKFPPnHiapWNBc~6}JBDMm^d*8w_+=iQw37)}OFuAdIWreA5%Txa
z%u)6=NuflU42E&-uYZ^RVLgeA1l7q_ppocbG0fS4_U&b@o;DW>LdV6h&^Y=yV8^@z
zD5eBD;XSt|7uulaoV4~|7Ma`5A1?rBj;cDD30oAPga(I&fA~9AW|H|RkaW9^l>t^#
z5~eas_!@hT{|SN77gHZ^?A7_rKlf!X8rTc_da@3HwQ;QM>FPfqX+dyd4qjr9Ehhs<
z#fcP$PF?y74JNls^ZhxC^ePVXa<t2>>8uFF(;j)3m}nZ(8{J$B^EdC`2hag5!%^Lg
zRKZw_;^v9v+@1s@?XF*oNVIIo!e=PLMu_w#;9|52RyUWI<Fb1^smL5NzwkFh)YW$4
z72l8S_huY=e}@FUdy`rZ+=(c3iDQmg>51@|A50Y8Z8#~r-#P=IVh)^d-s<~L1x>0j
zmiS7M3z*Qi)I86&NaKa+J>v-Us&Poh(L^Gzf$72^g0=j=xYWFK&*Ppt<DOWvU@6AF
z?{zB$*!b*yKeZCX)2v8be1*#H7iGWp2m28i3+Tp3QDIm2H#wXK`sI!+Ft?-c)=7Qn
z3NEffBb9E6yvY<k(IvqO^|N2dPFs%x&FtpHM~??^4eH<VK%&L6WH8^EaTu}dEXV}o
zEf;skp6xFD@^wr4Oi$F&FP<4tpv42cc6Nxi(7m&gmX|K=I5S5BZI6^qp{A#1W87|$
z)W3G7re2$B_fldro0HazsKOq1*jV41^!CPd8wPM-Cx#WnqwHvh`Hg-Q{o?e17`BIK
zBYL=iXEU7tmWf&JI^GT2ZLwIW{M3BVvB*#-L*TqM@hg_ue30vf-O2WGWo4@a@J!DA
z%MfJ~U9tUfPxBt+(jb$mHTE)HAF~rP5{u{3$ld4h-C3LhEvlM7GqdH)*r4=$1)cB<
z5{A*u<#Z0cT2ZXf>4T!>>~I=vULL!Mjwe@AWPHk>XESP(l3MG%Qip~NB62O+_-t0P
zOPbryV<Y`*1y)2(pQM8iXf7+$YaxF%^Lfcv3>e35Un)xDd3vW#x=<bMnK<2aA(bLB
zV!7?pP?C@hz2NoPHoh%@B?1v0Q{vZO2REgCo_6e+N3Dd@ED8{gh)Xqmcr4iigoGEo
zHu?wLQ;9J$EvJk(t_Mo3qp52GgqdJ!(zV-~t~P~dzG8_=a&I#H%fnL#-x~95_UehC
z^IZ+ncZz?e?R6XIPAo!qJ~aLmZT`j|Rzd+&lg1-6rH9sLe^UE-G21DEEPaJVzJlul
z>mheO6X*R!*`Yx|qsoT$St0*E%9{pWwqfFCDaKOvqCA*;V<8N?o2)g!putp{a<7-i
zr=)p(d$)zC2j{_8yX}caXJCZ&=pX<f{VBhA=E~7VG$jR!g-9Gx<C1d6V~Ec`b?o!-
zM=%&{R}PN1^Z?*Xvq#n{M)o%|z3)Wm^#P!o5wyMxpZ)OTW*2q3Tj>cjdF7pn;pHyE
zh^lIRUTcVnn1=F)2Ll^QiIv|EzxV6)xP&ME?Ff!e`j(K!F#S-_8)tE(=m9<+d5End
z<-inuxR-XbN6WxXvzMpwUu7tM(Z*;AqLgqPY0#iRRt)FLlAUi{jCjBlCOr~UZ>5+0
zOUn?5tsEP}i(Su@<Pc#{2e*2gCTW-*;+rHWuP&HL(VF}`gR)j&w|l^1rDit4?RwMo
z0E<`qR-!#zd%q5$FN|zC5A_y5HO?cVDM8Tp`Be1VL-K!rbAZN^hnz_StpA?vj>VgO
zMc|%EQgH01RNF@<W<4#$nX47Ka>0U;Ugcn6Ovu7D_!8@<z^VBt=k?)9r7Pl_p2B_E
zp!uzdKZOIi@9~As%w+Uk82V1?A*Nc+eEd!4pN*r<p!gQF4hY5Q9iumvmU%+W<tu>`
zNeSd@9CnC%J;_rivQo7Ue}AZoIx#0=-{(&}+U^08D8{V5HXK2(lykiOX`++?<P|Ho
z(*G?AQ;{bwhR=CkEk^2#nB9Fn2cNw`@O&I8WiYxqg(iyw;g*Kle>_~qM&yjC&i#1n
zgcp_O&Iu%L4B&bVr6PTfwdIf!o-apQND_ffy-0I>fK`IYk1ZST(svL?&xl_Wz0#e^
z&tti)a6Vcx;Oew#@1by?z!ZwcDzLYAfgYF}f(F_85)=1_`%wKto#mLnAZcT3w-Vx&
z%&uT;Jn!SbeN=*xi|9*<en^2jPZ!onOHq(`Pi*LIBd`A!pFhJbf07|_eJp|FCS#^`
zasFn;GZAdM_uopCF>r)$8oIG?dp~LgMa}ETcdrZg2ENP`cX~Jjs$;&HMeTGiClkZS
z#x#6mxbC|aS53RfS8Eh)N4G)4e#dudz)e_=H6|xK&+Ikfg~en<`$F!N+@f4bP@AW2
z55Y4<Soan?5WlTqbZ-K$!W<36+KbENTelZ5LB<o$NC+qSmzWV5=vRHN{6`$%J&03)
zgID{YcIc)G*A0Uo)5)jqb@vbxCgr;jG#!4J&FNN$)nrorQKnxSJ@k6VmnRVOXfw!D
zWiX{t#C&Ip>V<U)4NVWOiO{9GM&WpKx|6W;DW?ZB=d*~#CnNjFmK&ue?k2U3G-zbD
zwf)OaIy#^S?6Zx@04u2SK!{P4&L1rBV%SiW;A|6><36DO41U^p-K6Uox`(^5u>4a}
zHV0787<pyN*dFGG7fZV`ME%xlgCzn~0CBdkz$2MVZ*a!VIB^jHp$%_9{Bd=a7;q~{
z^<L`dFGwq#DP`A@36Rh384)+<(PZGtU>NNlO`z_z2$teEcYfNXst$9N#=>iI?@M_r
zvfrf;R@8W1VQnzCql}`o{`8lLog$Y|hKW5zt?@D(W#<9+48e&hSdcD+jpnxHbObrb
zQ;v2qA%8*E6ZC8mZHofFVXgq@H(YiMN?+e&z<f6A#zxTYVY0cMoP1ekf-QXMq+kd`
zpFR(!<*|K&-BEjzuML>C)tpGN?~;?%K3qD$7E79BTxsWb=aSor6%~kf*Q_0|B2!>K
zsbbwe`8G&j*wi_fy1Mb!n7A3gvUYF!Jo5zW1XFVA?jDAr^=!uTk;3_dbYYe;)bl)Z
zG<s^hkcyxG(}JENC*lJc8N_brXKVU9LaE+*zaU1EP{|h8IZX0m!#-~cF7x$zr^2II
zzRZ}Y*OP$+>qFnL!Rg%a;PLi>jej9^pjP`(K}osgmD?;{EJ@fNIZ;EC<JFOgu!y!Q
z0(Efm(&)M+8>&}1mse1{(D^jNmbY5+50Emvt`601sR-AW%l7JY;XJ(tJ`^aQz#{5)
z3FSMnJ?Q5dh}-xhN@*bMT=?}P>qzc5$??TZa;B-zUF`wSv_Iq$uwxjaYH0gyPM?t#
zJxIfv3M_ddM(d;`Z*mSnMOknF8m<CxjdZxUcWt$+K06k^_cJ#UQ-AlDCY3+2ezVN`
z>YSE7u;O!~?BYaKvNMX#w?vUslC(I&W_aT<X}C;&hSlW$g_;U_GliF3yDF4peO;TF
z`d^{^37?gM()7dPbNHsr(U|s#<*c%~H774A$w$>3Lea_UY)>5ThwDz4=)+K-fT$Dl
z;wlp|6X2G%4tjX}s%-*~$hPUmh$$3KD<s?c7KLYTkR_w9!-$r<J2VltFBcIBNS|_Z
zJ+=5ewJ>tW&wtvtxqoMSDcm0oTsQrvpy8BaoL8E#hK{0?mIoYdUOq!3UL|<KlFb<P
zzoJZNtM@J$;%qqz2S2%IEIp>9z2^KvqU`jT`R_@V%TY(=mZjF>T}^a*O9P{a`2A21
z#bOJvS@N*;EPwpXHS29?wz48~=(R-l2qP*!y$`5FiCHWyZ*L%7TrYyxuDbR!XMyvB
zxao1jOP*;$h9Cc<BhW$<9#lFB6aJ^LbdoaKXHH7sur0Wk;uzv?G|<W3rN!>V5^n%9
zh7d+~t4Iy@JypH0n0rOHdIs|#&MZ23$@<r`)8h{eQJePcZ6AGhZ8o)xYi>mhLR+R%
zd^rc8Z2XvW-!&k#ynV`qfx<5E4d+CW+>gF)ejp`5^gnV>?H@+?|0a|Ezr)A>-`{2H
b1~cGcMG9%1v^Ds5)EM&8%2LH|jKBOJJy%RR

literal 28730
zcmb4rWl)w~*ER}>ba#VvcXv0^-AE(dAxJj}0@B?jsdR&MNK1Ej$Gf>_p6~zXj3Zpv
z89UcnM{L8D6{Qg2@ZcaIAP{Aw#Z@36-c*6VsPEr_Pln5=p};>-&Z07E@4+9R_hw<>
zcU%_<Ef>|#pIqFHoy;LD>_6L?GdP<%nVZ`?TYh#qfovCqfFOa85f@SONI(2%p@pG|
z|8CXdO#zK8Qv?Y~7#T_6ds<qz-b@^hzRH^Z=icu)=o~ful{LM~-eMoc-=M~mWveBX
zeqrCMZa%pocO<9v{r)f5*$~_Rdur`i^<j(ZXnHp}+1qYrgzf<p9vK)ToDk!bQj^|D
zW1Nt9o^lTsgl=ZmP)w!}V;srt>h<2&>j*Sjz*RV@-4WQ}K5}QRGZXN;84{fVxHxLq
z`2Xu>j1PogjQ44Cl*6!@3|6;>lBsEF19Nl9<mKg!tgK*pc-)o-m^nJZs~n2+^PzUe
zviiRhHlw25V#r$KDNi#)ZDK;aA(p^}rS{2sgWrs##cV+-8HMu(8v2^UETwe4CN&W^
z1iwFquzTq)l4Q0A;{5wlj+rH-XMtoE0~PhKgs~4Z6l$aJ$;#^32tKphp%(0iPdM|v
z;Dyu5pFUAa#u6g8ukI)xal%A+RnWHDPb^Yz{#!vRLek)N++?2q`?ohG(Rd#Tj7i};
zR#w*4rFy<)nzU@ko$%E`u3{(r57Z(tW2BatEkP&U598zGC}JuhCAwhDQmp@sjfG~2
zQRSD_OPUtJa@aHV-!RuXk)KV!psJ~<Srf$>?=#PAg&UVzEeA??P*ruytiAoxH;NwD
zH`Ny?A;)bMYrN0;`eMA6mKKNeL+E0+)IO``QvIT}+mc$3)73XcN38Pt-HiCG@*Ldm
zM*%64gyRE~Nf!;ynJd8(BFrtV$4<+f-P4$0qUA9q#Kk+>RIx2%g_QkZ5DmoY+}XQH
z6iA&sWtk_S-VzD+uP{Y^TuzR;rH~e2g%x>@HL|xK_4ErVfuA!M3n|I{R-`C0(V`nH
z03S_YcNXc$>9!o!v2D|-wNKTdX;?2oymh{~L&N?h!P7mR*Y!vDG=gg5>ylF6+&Y4R
z{x`G5)T1hDA?J#bzeRSiGF86RW81@}7A%8q5PXu8Z5r?(7|Sg^#O#U=7MiLAX{U(t
zbj7NO**bZd@cRw7zTLl<Stfs{wFuu?(RjpZoF(0dcvYYf;0#FUg_L}o%eDqL&8MD>
z*TsK=mOFDiS(8I}7$s;|{2XLYdNQHRKzyJiPOhhX=1pB}O~S{P@vQ|x6}yIwDWrs!
z?2G1EL3>xe7JYL<ka!yz0pX84XJ^+rF3rj%RB|OG@h&_m^OWBaV4swbNtmm@k36T<
ziU`GOXqem93yk*WXR%ab9yKppYpyzC>6dQnx<SF(6491OO6x#Sg{gTR12g-1tlpBE
zEvEeP*m>VHvP(Y<A7|LRlF!w5;?(w=rCs%Q-{4mAxH+)4ziSXsd`kKKx+_a^85Luw
zWy2ZWaXmTI*=@9<63bug9X^UsicSR&Fs*Ce67`l$?Cx)BZImOZYX0{o_BlI<(}l(*
zbEwq$uwl)B=o&O}_kN7O6P1JBO9&$+rvkf4Oz(an7JF)il;q95#|Psk9<<$8n61c`
zGhMEbI%L(z+P^ROLmE+`BhMDPj#FLCSPSD?0yDRz5+^v^IL`kk@)=WJXPJ0EN=Qkp
zmZm2d6{6VCf=KN39fpS-%TwJ{Xi5fNI5q>0pm=+7OO-c1T?XVuf%fO(LAga#rw8l7
z<qj<JJaeho-D+Qzse|;Ts6Ld;=hPgLj4q78{*J=(96Rk9gK=^?EEVid49@Umk>?l~
z99%mbG_7{tg@V4?R_3%e+~9;?4^+l+K`MeIs_p7!#xz<1?J3l#&{><gq8n&!#=Wb`
znklY$oP9IU+K7A1RoOK>9F-S@pz2;T6oQ~C*4EbMfX_;BPN|>0&yD2-he+328D|*P
z(l15ovG?^rld)@XP$o(SxA54=0xU_f(b3VcnD?R1Y$O|-^exCO??sI5e;vGsb3PS3
zC1X__72Vl(3Jdocg#+Ds!*^kVgLHRyXZ%!H(d4<oi?^2+$SzN6IsRVMk8~4{TC9y5
zrX@;ivn{;$O8{hz*_sbG@6XV9+``h=cpKB|dGT(x%CP_LV(0Ak%oGCyLs(Ojn1<$a
zc6oWAVg><=&#gn#uU{fcN=lQCxRR8wQ_qo@nTZ5G@bL0lUvBl5R#1>=wEwrdzYnLW
zsaa=rvGR3zZEY=g3q+IbygYK9O7ENF@#%8QQp^xzh}Sst=Vxv0+SV2Yt*E1;<L+bu
zyq>=PM4lA#+}xa6X<4%}wYaqO`{RGjFt0E1_U@jW!x8Yh{8Cm%r>Cbko-fn;^_ErM
zw+F;lH4-K!<mKh%TrzZ2)Bwc{-tdU1)>eVdExOF@d@>(FVx=s>-d67$hm&Oie!oYY
z3f)F5%-%}<W>E==;I9lOIP?ABDL%Qy#qYr^<+7E<#i7j2&xbRf@bL2H8h8J353Q<P
zUpKkEy9;-MdiSoYuTL_udvFjoA|k?Mtdn^%43nvFnU030^Ba|txCOsbCVxOuQqpcz
zk@L=|vWCWY7kMzG9sT`MF7I%Nh>&1lU?y;lKpM%(?szw1ybq(%lHw&rjX}oDtgHs*
z<Lhe-as};M{bo0bjlO6K!dg)t9v%>Z19pe$IXSVw%E-;Nv9Rb5Nj2H02lw*EhN0l$
z(Jl-M`P`zsd-pCk!{_GsUruEu<6a8K_tezzhnthg2qz~eRxonn67zrlL`18NP2++<
z?W!U!E?&a8r;x^piCIP@;3X_2B{hlK%Sb^{pvE%v=Z`3Ol0@%tSC{A{DsJHdST6fP
znpFnjZj+grnU>?(0hUxp`}+}}Mt;cf=JrrfQbs&Wf?FW7#+a_SSUEYryNHLBu!H$d
zO-xMuumJ1r<psuBx{myKA`b@S<qw1(_gj7LRHkKmr>4+lWo2{Qv%nN1ARv4oj5q!R
zrk5Oja&l4@vn(nqN+nOL&i<dyB&zvxBsP<{{|4xon5t^+bb*GQot=!784fCVQ%Pb{
zQt)e+-3r6Q!*>_z%PcLd?d(Q)iNjD)*XC<2a=!+LA`^Fwk1M-q%v9)dvD+=EO|qnV
zMPV{Y(r#{Vudj87<_>AGu&_8dIiVfV9~~ZY<MsxY5CsT-DN@RcK%xW#1?B8~p_1p%
zZappidT$W8mkbUbUQWOP9}n+-9<mV;k3G&sz7T<eg2LnRwx&1*8VV{K%xUgW6PQ6|
z6_o_=Es|+(WC<p?q#%UX>;96ayrM!v>e$wTY7snZ?cjip7svDZ5clxtNCJjDCMM?4
zC}<6X&LB$5r)l`o*y4J3f?P{WD-o|(R9#(tk_DIL&H?wfyHi$84I@20T^xpvkk`d1
zIUp4e6jEzYsJ@)7p*cA@yQ~S3k?_r(hOy@6=AA=Ba+fdhad9L+B>VpUO>*rpE(Ie|
zxR5F2Z(xl9&%jWjqF+{DR|h5}Ud77FDtAH%Y)dMy%eC&*Uj+pP`Q)rBG2n(t6H`;U
zXp@}sa=radqibb4RsF%i!2-z_P=)=sj9zpL3kwrs2BdMpqGk2PQQYlTlpsM#q}8zH
zCY97PX!q0K_dn@&FeH9{`N8#>aW5<?2!xuj{QP{-Loz<T=4qtvr6s*d)Ko88R@R@>
zyTl4V*l9Qt)qQa+Q)!|xtGwvR$@8YEu|6-i1VlzgS{8K_jvM8pqLR|n%e%x~t#(4<
z;NXP8Y3rx1htLqy8C(s74#ErR#dzMGQ}DVU)89`veqPB`Fc=s}NB$P+$Nm1|>MFRT
zq-2+5xVKk=k`mjzGz7DXw>&!=0>qnw!swH^n!IA~pI^U1rg3JLIN$mRK0ZB3yxcrL
zKU*#~RP4UA{l5@U7@~DnDi#qLc?KqJ0@YGk8O7GlPC_bMQBhG@O)b$yKBPolLPEmh
zayK3<78scB(NRU0-W=5;Ma#MBoP1MO^`3}igpG?kQ1PSazq`D-2?e9SD>q*thg}Tz
z6fb4BbW%$TKM0nQNati^WUITo)c4EW+}za@LS_(j2CRW&I~(U)Lk{g50aCF~sOY!>
zU4cFjUm~o@wHoZ;8{a&!cus!6!NZf~X6POWxxMuu;CFeiS+yeGbK$VLv-K7tF;RWI
zw4Q?3)kU~nu$9H*)W|Jd|4;Qg#THnuHGbXi4(ER1M_(Zrk9T777HW*MHwZ&gV7`a;
z*=+OvI=#!BF9}W7_dp54Q1z!P+PQ@VSzf_z>4NNR68aY_Od>&__qB{qIi|)iglu%|
zlarz8EpKJvF5;hg(MoEJrTx5+d1%5@*lgZ?r#Bd8Y?Xu6Xvt4$Zx;e}b!f=%zG^}i
z^dFKHw0cGP%*3df$*4Z|*xRF-!C4xWMtPoW!|fovoDS_9)|k%P&qbvK%m{>J2dMNW
z&rEIPqJV27R5u*<Z$E+on-<k6<S+M-DPgl1HQJBT-eq^B(Q!RC+`778om_3Tpt!`R
z>3Hhnm}!nExq*X&8!vhhZW+PO9l9yYDb3N};>`86?>ZUhJzS}UH}rq1$5CNT>WxUo
z;Pt+yFC0ma_8YnpnJT`|(}`1lU!^~kLpt1flr}r>qb=WhmE&MOSk`I2jdyB&%D>S(
z8{bs_3Ge!3KDeswKB7BOUL6d#`G2#!1P$rOtDM($^#(S$snMMD_hO3i%_t$k2Q!@c
zi>Ihm50q=T$_vh!Vh7*TW7iM%&%Dq|k3RF&R%l(IEvWCIcE@YbfAAuFQD6C(Uay;Y
zL$rg9)P1Mg|C)wJ7tC?AUo)l;hZnu1?QbXv^<&9}_sy<oCy@Wndd8AnE0Q5NvEndz
zT=y}4$|)Cg{T2QQyZu9!`XuIiZYdtGf=)=PvwyLtbVxi4+s-3Ik<1VegC4|q!wIz+
z7M$b#rIq5MIV~tWX?_}CXw*5TY3*%!BI21kNSQ{lXh=#ZW!JelNJzRd`uO-*V<i}y
z4RnVh$0un*n1)Vj5_rHN8ZV1tZ)E)Fe2&a;^t3E32xf`-q;+f^P%tWqv(R!WLKfTq
zwP4Z1cImu`n=60@lFmSdjGi7LeDJf;QjRXgY~0uIkEZNfkSJ(SnDiqXXGY5p_=j_?
zjUh6)l5U4fwaDA8Z8RKwf$7$!E*xqS<PsD(WdkC|@aR;hYV6F@;&0+7!f*L*5s7xV
zgdb)`Sd^uRh?$lr+hEqw(9Rz3oX<WZsKSAlcJ=qej4beI+<$WVN{MHwnGkLP4W;HZ
zldwIbcsLwcqS@lxGO^TGGHuh<Lcs5l6h3}Kv6@_CfbL3U5Y*l~6(#!NQ%iqqHAWB?
zU5C6bF7e90^})fM)Fko;0~xTbRk~)!h!vt2HSCQu$p{d?bOe!2;A)!hCrdi3<`=X`
zXlQ7-nzBX>4-5~Bd+Fd!BkE{){wvRS_ej5)fPW;*$3(jo?J5#j`q5wuX(;r9XKr2!
z!Z{5pCVFJyNEBUt`6o8l;`~n}23qwutAm3gWxrTnL-m#q0=bPrs~7aqc3-+nbZ8)1
zUD9os{NT$WTB}#>_ppfYiaT#+b+FVDUtze<*N7tXRT_?fZ?Z%_lVw`Jo++xvA$ogz
z|LP<go5`5Bt0v|Al_wlwojfJkxS)#MoO72Y(>hB)14R}}qS_^NTxdZ&+~l_lQd<(_
zt#{{^Brsp7V3I=MEc2Dmdc{O!SkjuqO0a2fGD=fgO5fdEW39^ec$@jJ-E<|OU1uaf
zaQ3w)R$H2fXogoASstlvvpPj{jOTfx?>1US1Qht?MH2tsBRO&<a1+N)oWD?}Mp@NY
zApjK$xBMvc<X}$^lvQ|@h^TQr6X7`?r)axI#CzZ8cU6CO|HY?XqG4@+;*kEM1_Ndz
zxAq%(;OG*qJFc-Jp4+m*!!9Pz+<Zfq(JwVyxIdz#e@<AlQg6{fx>r_J-T1)PEnQov
zkR~c81(Pbnlw+(Fy`GyLMP&DbgpU(<3~tDP{@}{lh3H7*DO<k2dMHFCS1y@R8o?JT
zCp(v_L68zg!js73oSzycCqFQt4NcT%yg)7)Z_5lvd$L|m^8?g_-<dl<5wg>&$g`Zc
z2#*AJ<5Z~QsP~6Ibe&rIm=#yk6J^nihk`#uB&bnHST0#!(_ki|5vaV96RYd$f*d2U
zJf3=L(c~_&tBS%ivAi5?-wtqSgr)^^5^!MrIb>B55Bu$m=ln@=hn2Hix_+LmO=>zr
zm6{`{ED)_cx{hT)8WYjuPc_$UKx7;1bD0dB<!m2{F9)Wv$s~a)K@`icRza^_c42!R
zEHe0KoBgC0{geiF$%B~+o6mVT5Vl4JB~JU(G>^B7?-2BS>!!mX49TddZj>ydW8;Lw
z)>#Ed=A;_qJsv&>y@PQ|{BCR}rfZ{Hw<`o&nBf&^u<_>hmQFHAW$d%>t-wj%ivL1;
za+rf%Z{u9GHH~Bp0ZrzqeCk|kG6pV>(}y~I8nVOfGq_|-xM1q_<Y$^P2h44=&eu4;
zTob*~gFyb|b}|>zA3Br6ywe_f0sAHH`Qh%e>|*UE(BHr9(nxmqDsSAT^w0KX0t!e?
zVd?1vibQ;YaTnh4SVJRPR-yqA$nI4=O*N)-PmF~IPL7V9jd|^z-2y`DXPHj(lO|ak
z^S$j3=X0n7j%O(Bs^*7PjON3`!w#FLB286k1{-fFG2T1)KfD+f&p_Zm-(~=ON$$;A
z`2snDKiF;a5X-@6WK}b^)@IKe^XwZY;6^5Ln3<gT+i44~EsYr#$rL-sHI`J}K!}O~
zciCGbx4;cuuifcC7$ZH%{4);#AV?VG-m<!zUDgdv_L<2lCw!1cPdDc%2?TslzQQ15
ziOClJ1iR9Nl#~?b>c}01eKPPU40oKRw$vdbw}qhPlsj~s4(6=A7yhiBLT-+}3gLnY
zl+7zIw~z;1_L9w{-;C#-#S17)<in9927(2Y2ju!oHl&R0?DXNSQqNyYiPtQ>leJiF
zzUW5Ht|C>7B#>>lfb~NX%d)r}F6T>ZdL5T^G;D8kF-ec$wQ#%NQ_cNOlI2ukbp8fg
zlo}i|a=%jhBb@t<F0<<)B17agXq>1igYUo~U3L=*@}6V7bR?pbN{7Jv^I8`FV}juf
zJ{UwiAIEFi!tXWf6y!mor1t-Q2QL!2pN;opkBNyXt7*N@Oeb);LReiYVAD&_I33|y
z0RUO^KP=qh`+#LJn%)VyB6WIoE%Im44CHhP3W~b=1_^8ns>`E)@V*b%=7end2oLOL
zL10M$oNK()h4L9~o!MiY1PU73cxZd%OB{0Mmsk)^YXG^Ro?TQFnknG@_7Uw=`b&jw
z0|_%T%Gaf(ID!{EhG?A*P^Dus()0+T;wtOuMg41b&r>hcI6Xh_0?5_@Bvt^SjgFUo
zg+)gP1LzIjcXrgcLdMIhGZ~L7xesbg<KYyR-T7K7fDus8(ZBLIZ+Fbh{MU_<G_JM1
zJy<q2w!`^a7ywKXVupPHxS{FWxo}+VcmvSJ>i*xdCYL=lMuV2y`4JNUA?S7L-UGA*
zDqaQ<KB8p4QP#*%;cml&?gWRUQ@utk%h^go6BCnahc!`ZI=W<?QGlT@Gn#$wScgYO
z98Xt8PVdYSn8v1Q!M=%8rtSeHh_RJbH~`e^J>lq0jvMcIU3Qg^y`u3rx_+7up%4-x
ziiwHMEiRhgU+ziE%a<2yB9OPDp`o2V-K=zXcD|96lne_G@0=`<n<)O6=QLZ}*w_y`
z=lFQ*NFd}#@csLDZns06pssy|_d^arI3=VQaC?_~Q)X6HG619%0Q@*x+<JNGa(~uO
zETgLWZF@9B4#0{vYMSnW7_QID<LNxk(%|#&jQX;BdwU$#V^dS#e!J|6+AY-OfU*&E
zVw0Li6|kW);Q*{A)Yo$WHevE;v607)kAfl?i(X3@4-c=-a)L}&PHv;lHzZyXAdNaO
zK$GL+*}?GGD%RGtMMure&8mPZ80_l{VbuTKW3?#X3-(|2!EBXkC=y{YSX(Yv`|34q
znkp)2BR@DI**Q3@=6{+e#>bP++?e9X3=9lpgN%?+SV%dNDUg=%?OX4?pJt^VsjjZ>
z&&I|gd@lR2tJahku=7ZXh=?{fH^o80;PU)<r_*|gf{iT&@COA01H=7RvLQ8qU^B~U
z#yu2#d})CClF-r70pOWEA~v?F{ss#!*yVVM#~Q2x9zMP)uq<tkS`X+nTCC?m<S5o{
zv^TT2SEN!*CkYTHsiCi_sbTkf^q8EO$Z<WG`Dkxn+4GZG-gRqxdvc-PMn+dRjZ!{E
zn8~m$Q41?TShzoyn9AANd4IW;Pr&C^*+hGv7d&}zqd&F;&^u-(CZYh-QwnA=?MaiN
zqM`y^PCYp_l?RB8k0K%g`Q$9}!RHqjR)A2+FDsMFlZq`I7?75cLR-7IbJ-mywVwY;
zadotq4B#6DVacd*uiWrkNXWsFk!YPdOOc1`qr8L!G#M#cRZB~1L}J189~?H4d~S!?
zKwXfLGD#W3z{H%mK3W{?==cH@0}*3m<AfNsLO>cVG`ln5;^9qZi+rV~rk0+J-$&4B
z>Dd$`WoBm12CG5U*f=MjTwTY=hy;;<TbUA$QPm$%CF6ioE0KuA0(1nWYH)USbzFnp
zA|-7T;M5XgK3r&>t!tT?eTXI!&~brf6cRF=j4wR`duF=X{WB{EM?w3`vu-J)E9(_Z
z)>sj!o4<vJ!(Q!8g+4w$ejqGa5Nw;~`813JN~;8y52-+(kP>w98=sk}bj~0q;`b0{
zWMnimHWr~$$}~eZXRQ|{K^X`8c%Z8*@VD!MMiH+NSSHd6ROs5h5t@j-{r!=@Gpcmz
zOBLPu+MRchT=%EL!4yhiGwQ{;5ERCsfrat>==t;KPmsB(&QDL{Tt3tcj*b>)`9G(=
zuGfV+t3<#*>`AkJ#jLWHRa7hlEY0i?jGqv{mYs!#)NG~xH}D{&UtU+V^`BSTmIY4J
zGBX*ys=Y?jd2Fd0WsQ2vB2-jWtvUmt^Gix(BC#0OeFku~ozB)JtU=&;ZO{d2Maq5e
zzkeA7goJ|&3%?DJ>o1L^=@b+cbebGU@&8m1lzXV7@v+{s<|s=SDQ4sXX6PGWNjyD0
zWq8rZ#g13nl|W#ikdu?+^}T0r&bqI<_c1R5$*@#dLj&vS@nS5$pg>~9|HZcgiHBSc
zPyvKIPC@bt3MGIqy9AgYtdsndsw$?d!})I@sqgMi<e?E-N&W!iH&tgfwcjgn1;ny4
zgkQgY)qNVlo1B>`1cUx_SvW~eMMV?@Br0-pay(`u2m#-FwaIuK8Z;CX;irciDpONa
zUWZi(?~v*QgEk)-bV_*=K!@oxyV+&zkLc;!vi?y5{G-uSp#n1phlGa*cS~#QYsjs$
zomZKhFbfO}d`qd68C6_NjX=Qtn&taW*Sb*wqqt`o1r;0;Vlr24jKk|f{SFZ?2y7bp
zAMCII^CrYlqU$u+hKq$FbuKRA16qlhjjeloB<-WJGRdALT6e^3J6K@Vz7K9Gsi~b~
zW616&%e`kC{l@n8jC+=>CC1$$DLl@UAS<k{uLlbH-WyF9e+&x?i*sQw!~*)!YTD;E
zk(D;ziB@lSaVaU&z9`)Ph>*O&4<Hbvb348T@#+*1G^we$ZijOLV6(|cmAPKEc_bw!
z8i5;`xw!f>_}m?y?p@Obys_*S8<Je=ctvDnU@x{u#B6Qp4-XH=0sj|8EJOsNQ9?{#
z*hlbbUo?IwAh^1hmx(}9yIl(<&d$&OuxIHiy4Yan2qK(byC324@v-IKQq1T31E$kE
zAMjp|s|F8PsJ8@yJ~&udSWOOVZ$~o)BIz_ML1xvMoY?<>iHS+C{Syj25{#1=SYgbr
z`)WP?{c^l`-8oH7-14cc&>6h0fxmw7<mTnoc$}I1N9yQgnAzAwmT6Y452vzmyB`zK
z)6<{ctoVzOph)xnxRtpt)U|u?Xv+YUWL#roV?Oy`xzyCuI+I?w$?56*R5lA0z%+vJ
z#}Sv9D9-k2nB;ZE|8k-*|MW?gSEcpxbWQ9vrfO?xN&B=3JC<?v9&Ds4^0^<&0~+_6
z%ZHFK5Q(Hf2BQQw{t)sPEXdE7=0#J52K%tU^I|(5Bo+qkpF!X%sy%50)PJqs$>|vx
z6s)S=c5A4pGWUl{1g~is=nDL>?}hdB_<ej@KE`FmihO-%d^lILJ6A)F$7N5ymxA!+
z^whz_!((-|Q!FgZ<IQZQfU%L$m+OXx>c&PXknF~@MIdGR%Wqx_#LGQZC@3frYHl1E
znZeT*ckBvXU6aZD@ZG;<*v*>s6wB@Y$o~HR1bl7?npOH=OCxC0`P@U|Kf-g*Ep|^#
zpoWBmB*bWwyOzFAnqHd^uI|fHNcAbR*8D<uT{OofVzjcFGK42IbW4n1K+35FkXiyz
z$wN$+k%c-y=`uNSkO7#;*V$flI}{cXA*HwD?I@4tz@w$jar{;|;F%SijJg&E2QMk9
zLtChr5va{((bdr*{CeZ&rrF_AJ^aJ<6SM-Mev0mVSH7z*yM|3f`XN0AHYVBj9UDI~
z(wEXwIKW;-gB_FiF;6lf#^59<D9Gp_%YXgx4xYKTIw&;>F=|9dtYb&)qc$!mCs7_A
zak~N`jsDKSUF<x4Ar%V|R#d$HEi|3FjtPX3nvK4kV(~Ci)1r=(&@qz#c>xN>gsBsq
z>oQZHk+{YP`;)$fviQs~+b%S{DJ?CnPicB~a(C~~AN*##Jc{&HO6iM&)Q(eftUqov
z9Tn2X^T^Qc-A)#>`zXPwM~jZEu0Ja3^HKw1ndVCb5oCzTX$mZ}t(C?xZ|8Gclh-;*
zZueKm(~dWw&?@>qIrA9=SFLIXR7Qi}L7jP2Mjf*&xg{YQsY4fXvZt%pu6lwE#IlH6
zj)w!NpzfSDS^rUn-g__*PoBc%FW5Jt9oU)Ir)l+XDg9>d;z7v+brbnUw{gx{-$Yy-
zT=8Z!Lns!y-CpuRQNOg#Pi2YDR_hwF?gwYzmR$0jxgpD+!NJZ|Qn9!*LFC98TM4$j
zQh|Kp+g(!5D!ni9hW^nEX<Rj55m_RlqQu*upN!|1QK$SLp`-E2)UBTAsE{`XVi0eg
z4>Wvk{<)jT4hp0)8NeVD_58%7JL?aa!X(1{zB3c7%H7~he(b$$3N4Qi_I|1@3+bi9
z=&@l3M-5-S>hyiEsO^g@jmmniWB<PlOlB4!Jc?8;w;CH+hiebbsy!sA7}z6e$P-5@
zUc1fJdSlZWnV{g}h93*)o&8Iv*yTD*vu;FxVZ{?t#0_4Vt?UefMQ?(qmSoj$dBrfW
z-zZI=I%_TMZB;g|mzXm9&h*nz?&+cS_xAxwD5!NerPSoqO=xqI`Nav5X$%du;pI2R
zuakr|#|KU+SlrdDc016RoI;y09=h4TO|6eDdh}_W@4A=XttK04{N*PhPl%5XaGT-0
zwZAsG#O3|jcSO+=(VHM+xUstuYQFANHH=j?JS<;VZyhRe(9qcs=uN3!<2(XK4F~VA
ze(N~8P=50E?c4FH7l^}Ot8WnP2o)DF-b1AbxiNoI%!mTP%W0Al(src=@&SpUZ`(AP
zl!<9fCT*{TCPlCX+rD{46hsAXt8Z^4qlpkjx+PO;E;UCN5Y8|9wlObIh_FHcvEnpS
z$vt3ucJFtOkiUgxdL6-$-k_@$7rWQ(y*B>bK3;2~q_|jFwJ<+tG~hn9w$E8EdjK0w
zqB&Yn7^KMpnnF!RZXm3y+g!+U&^kS>R@Y$rm7$-pySjj;$Zn<8?Tu#aQFraCk>X-S
z6~J6yX{HKI$0U%cX=ya8;uB0-yi_4FWALJqo*y#pUhHr$rxR<uXBA^6BSH_aVta%v
z^g$-*noGSHGE`O$iu<DV=tKs^h9ubilg{x}eoIVilvef+Ee?k(Bd6_BkJa}5A0U%K
z2lpm1>Geph4{_Z-9Mh$H#m9CzVR6`6g^UV37@xHHrkhoqTp!Lyj<(+Wsx-a7OPxPG
zjp<WzUtW8{Ms^LBj*;m&UaUI1dD1FxzlS5DDd?X&*yAMdybA^>5^aZa8Xuq2zN^^E
zSq__Vot#Gyv;Q-H|1mbj*71%og~vW&czvTP%EO?v%8FuEEIO4U2_2mr!2}Qax88%*
z9I@T$R?T1fB5%6V#y#|gJ_lBLryAR9GVE>_ZuA^ML1BIO|GCHOK>e6?y8%2pOYf>e
zpO@MQ?enPAthJ2|I2o71`UWv1F83GEOc#oCbB*<2VpGEaoDid?zbX5pCQO22G`8G|
z!~bbgstPT^uyEBt1(tQvIbKKpc9M%qhy}G7682v+zNwEpx_mv#dWaNRCRU=sg59)-
z?Y{<v^2LwDJ$-%aAO!8QIPCf}!NH@Tq=#Tb_ey)H1_i=Z7Hv@#kWp(A0m{sv^71bt
zB7+_G@22(f95EB=F~U@O1AzWSXWPF1F_hMdvh+ijKg<p0<SnhXHUC1BF+I^l0yC!a
z=fw;kWCGPjgVWQ~Rc6;GG4ZAyN9=#J_vH}pFKQjvCEub&MX}7vK#tJzn1)y#G+2%+
z!!jAP6dNzx9|0X+Sq(UQB+_CHwmLh+k?l3`n61?MDgz_dUg**JO%>{_Z^wA|890n7
zh662ZWlAKPZ-}G3iQGx^+-dXsK+!W^BK4Www)l1>(&<n2=c?Gu!cyg`T@4B$Y$B$u
zHIoGtbN$4V1$PwtC8r(QpIQTEXRq?3j;=cMeRDdQ-l40}=>!&S8J)<tezC8zOQD=L
zrjYIZu{KzBnUN6@fjU!#mbsN5p99qbh1)%#*k`)u8vZEL`|pi`bhbM9jq2~{AjGQ<
zF41f~T&$cBxl;m_Q7DU|Ku1#r<;x+eCu;zDS8O9%UxNn<K<o5I!3%Go%r9tZI@^(=
zbxnzEFIX}zaQXQ$8&8a41t?`{aG6xZ{aQ~-Pe<_dytBU2p&O<bm&~Kw&DK5p)M#xQ
z9A{yBz4GHdB4TiAfqL)|Z0)4u17=P|Dy<6n%4(1%!Dx^f>971Q^*0a<3=3qN^R;+|
z6)U2mnVn73+L~MMo#5j8lBG<yVq?~}@PzeKL)QbxggP|r%IC@#I4%#hspWMwA6xIy
zj5T=q;nQ-Hm;I$q4S7*#CMO^RP66`U@5A6rr85A<ySiT>0R#zsPxsDUb@i;+fo#-S
zY3>5fou?-z(?B8(7gr-1GC4Xc`U<(?`+SN|WFHk#NB+K#7h$S7)?qcFV`<F%SxS38
zXQ58c^NUB>PVMRh6BAw;ifB{rNXEmGWDN2)Yb8`o!!8JbxqTS(=Q{)Soi!iY5JN4}
zg*sI+^UONi>EaqS|0=UekHMiKG3xStaU0PO!4#zOMbK>=c3LITSa8s+JX!_p@yW@d
z*<CRhL3KC1DCdde69&YsBOI%f%kQ0eIo_P;;~n>EuGdnU<Z(?aU$tpMD2aeLi{f+6
z0w3P-&~@dB1~+e^F81Q1p?;Jt?(xD49k0M{%t$3{PfG0R^H!J|1P{8bSf#OV!J;{8
zCy<!52I9XhZEVN|7%JfWwM5-~&_>NmO?_tjlq)uGw)OXQcXu5wgzbC_^76i3CPpyv
zoq9kZ^HYaiyYd&;D}O~j5%_ry4e_U4CN=Z>TXg+@csmtimoBcTn&D&ns{H2A#euQu
zD*qOt2IwfMOMLvo$;laleXFdcb){fVRF*q>mVTbo!NNP-SmF5q9q*#}vT8IFvG<OW
z_jW~d#Q6S1Q-K79giU}@M`!D5_d7k9Pit%I@u^uUA6xb)U3Gpqr{_g3ZnOy$F)vHu
zGB_C2k2#&Ya}zo5WKxj`c}Z9waf_GO$cbFiXhq+=-NbxzdsBnS!$Y*`$zM#ZzSK-f
z$q%(cZ_uHAd_MpvDt9i;R{ZAlpKv<t?+nbaMIe&=k=URlq(|bvIu1=@7lpk?H6ELz
z#^>0an|Sllm~?mUZ~ZylV2RP3rrYhU)zD)Y4*e-*6oFktWkg3mWNA;YWM{TETL}@L
zn%5ZY(8NRyc~{^gonGaxKHRn7fP^y}W@7%%s)Q)kV^@O1+>U<mfh0eCkZFI6UQb11
zU0vO{S}V=EN$wx*2pA4iL@f?_ahG#{WR30i5sAPSTAhQW7be=(3sh7*1aEKeQKwlc
zBqSt{GkTP_&&l?zXtGdHXlQ~D6)m*s@=0;u<AU@3t9Z}ZmN$a4VSnQ!3v&7q62c-X
zAX-|gjXxyPXi`Z@i3cT4pes{D8NNS-#uSJVrlv;?BoPn)68a9#sUSkcV;sFi$Rl}h
z{px6U!a>OY=aPAlCoZ=WAMBd;ucl+vXrW5+4W+8a_rrD&_>q)bm1zlyUk0jCE)Qm7
zhvu#(Qzr5pFMRGc#Q&678p(4pU<9@Hl~g0}hfDVvK6^4B%^~RHCHZIEtYzR=|Fnh?
z5D;KAo#nf~zXxnGLyLg@cP{}U2FS+D4;|#%??o2)(y#sDvl@w42-k9g72pv*j&z|`
zHJ-+JMCQd_j5o-!&Cw)3AKGW~#eD9MDhaMAS-ihG2*n-?j7vzrohb8ZiIM4)$YK{i
zhjl~^Z$e3rNnzrz(7R-WoXz~OzI(y{KIrcBT&AbD`exwK^5e|~RsM5r!e9+zuqdg#
zyh6?GO-6NH4V@`Xpz`flL4?1(+&Kx;SeaV=v&}};xTb2?_K0m*@ef&P=`Z!sh%w7&
zp%K>5YzN;l$OS7<PDqjM6fk#bHGIB0J<pNedd|Nv16NlJ+N|NuXh!#stF-F9f!-H`
z4}o#ALa;p>XCycCQsm+ql7h=vgCe@BxIYl(nBuaMBWqfD$@u9p0M&>-3+e=bq@}Eh
zh*hx|v0EuI`~=^^DILQ<@aLce`_OXuBof4ih9ExkJisBtb!=k_t(wb6skv9i53(@<
zOVeYW6NTGjLwVE?0Yt~cPFr{j0mkOmcy>bYfvd|0dqdu@(D1Z4T=69sfa*oS=R|nm
z(lFUPc!xOAC5))?t~7~OuaPMuAiD`dECgYVgoX_;#2DC{>!V4|FRcR)H{O>MV>`j{
z5S!y_d8j%EWj-2B5JoE@hNpiu$+gzj2fBjcSN+Z~4En_*3?43>6JAu-?kz;9ITAzv
zND0ZlU@`j1z`9jv$7E?%Chm7Z4_KbP{Q|h{&f;H!F(>nXw9_bhd*`YTVIJG;8Gv79
zRJ-}=-wcB;%BnB2Qi}kO4cnCh+VjFCT@10npZRIRs|-N2fS)6L@%=;=7R^Q%d>>aC
z1AT~%0RNv}ue}0jLX)E(E7gwk{D~OY&NIg~|8$L(a#Q8yrD<w=;USO+`NO<5$JYN0
zc5~R5<u~D!?I;8!74QIfsdM2F753u?o+5!z%$3lnh?Y?=p&jix;}yd*m(>RiM#DGw
z`1nX@ti-@`VRAQVuElEgVNzAqddWa5TZoe}D-^vyR>pOVL;LcTKA&7dXdlh`BlD)&
zX&vR*P$>2*43a~f;-lGbkG+F~?t^DvJ?<8Ks%!hQ>w)$8)V*8JQSsf<vbD`{bb=x?
zh>qS~=px0;9ws6AKJqcXzxF%oYEW3Q!~v;O?fosDLJlc$cSfCocPWCda99dB0kaeF
zm{rQ}496C4w?YTGo%b<*yB+;87Tx|jRir2?r+`_uEidl&NwR(c1umGo=0-{H)*G3~
z!&Ai2Fl!@mowVKhZ>f3$(noZFYi{?;s;WQtXEnK9L($vuWrzR%R50uB@POy{gX7KD
zhnsG|Qvkb(b8Oonf~u;j%h4qC5ca6ZWu$|at3&yvxEGjHZf6JVhKgs@gPG?LKtzV~
z*gehE+hDV++eG!&7K&11bank{m`t{)Wn-(}Xe?;*edZp`;)B7)#%3w1Xb05j>jFc$
z<0r8B&J<V~KpG`KLf7i|IAr<A>dt!3y1ag0Z^O*Wie{=~KWO&3n~9X{@s}OG-_sSB
zme#yc2iJztzfRQXX~7e$7fzZtvHT~i3nksk>Ka{bt=_ALhx7r{SE>dEMBg7zq+a<A
zeXlF2m9;!0dvjH<2$)A_aU5)P=rMe3$Ad>1m%}+!Z?_Xsz=(cGY<UikjqSggvLGQL
zfq0UPl8k-RJ>FIK*CKE6m3E+g#9bQ83cC5Hi^JzeKl5_Nr>0KF7D3bvO-xMeBjo$;
z96oZD)B0`t@0&n-w(ha-L%vzj7H|BT9Ji@+yRwY8{$Qr?S;S1ckKd|g&=?S7U}3FK
zm#A_(oT7k*pOaI7=&E1`Y(zOVKky8C#YJn3#hRL2x(~J4=0D~@0e>cv_svl^0DY*K
zn6P=@3*)|Neo-d*lAAzgLQbBYR~8un31c#PsLjk(^@vK*T>+>_01eT(?3aYCW#Qx*
zbn=R;-=D*sqnzG_d1JymJHz{2j7ZC;@v4`GVwN5NUi%9npF6O-!Bae<;iUZ_-|^*C
zQ^O}^6F|g0nH+CxB`lQB3<6Y($j6$}ZzwXMyqYb7lo)Vp$Hy^_8+{_Gs^~zl;r2Yw
zm(ReP#Rf9kT&)GN-}9XXAnJ@(I|AmK9La$n?Ipme@Y>wM!W6L4O`aE&=jZ3uURT!g
zS%UhX+yh;;_4EiNqw#{v%jxrriZ%e7u{%{1tA^a9tNpVbdMhk4vK!DNHTM5_gMx!a
zG&Ql|vFIV%hb_8sWX^!Z2Z(jYhif|^8@{rxz*L*$Lca~~cCs8T6-z|Q&5Z*bY<=LO
zGeGRg^qUBF?p(0=Kl|(+9FViKW7y1;cY>$Cy8ZSnm$G?j1qCw!i7V6Wc7)cdp{6Dd
z34@dmh5$8dEZK;6+Z1qs<bX>)KRZKh{SFAkPoF;3nGa#XQkn|ZTTM}djfY7=MHOjq
zaeYk*uKVr((gyX7g&3IB*qt_|ot&NX04qCj0F)9EbaZr(n&OOj?WaLPH|}o)s?V68
zCg2@??=N$}0LKfH#0i6UA`=UKAS5IN!*+k@es*^Ds_g(iGEtayTXnJVGnK096w~qT
zD<99x%e#4W#1LRM+z<8kO*A3jcaU(Uf`fyzp8STsGh`8d8OQ~bxZLI|2RPX}zo&cF
z<iW8qDxlr|;P+&O4ervB@Ao+bx=DUso)}=0%^5HD|I&ez7rE`v_hne$ShD_;Z}9P_
zfZjHw2Q)yw*DL+}{OVJp?oyPg^8rJzGL|V&lAWDByyf@u<Zk`%HwU0dr%ib6|CX#L
z{rDjZc;|u@zlVx0b3j{{*vwY4gAVIR_xbDtob>uQ;F~UWm`_Q;v0iTBAr$m!!b)Jt
z;`KNs0SQ6H*48$7Xt~MptBZ>ZH3>;J_>LGIF8Eb;1O9;=V6TQtegWFt0C4pJqpUS4
z;myAMnXHl$m5q%JAbQOUW`PTn6bLUr_}nXTK$ie_Po}T0{~N5yg4yt}uxw!7od7Ij
zFX+;AjcI?ecI_t&TwLuvS4P7&`QN{PV-OI~_n8eBeHzPD0M^ySZ{Jk+Ts6wIL^wG)
zfeW0m&usl8RfHxmEzi%-Ps9p6Yi+sK{`?6_+yE>Foev@4R#7oA86~AyM!hEGE!Xbu
zZW$1|0Y6Un`ki8ypaO6$Uq0Sl&{lZ^p=fe;wy3(AndJjmWJaBD_W?r;wv(pkKMJ|;
z%7F0!0<s*!V2{*xq^6r2r|aQdJP=dCp&Br6uX-quHCT~AKfRtGd2n%X#sSbT`zHy^
z<UppK=#9kYkq1K$azEI8Qb4u%YrnBM_|447NEpb51YlubZNv&CWMmKmhFqgL`jz4a
zB6_R;OJ=v(I>l5k9Tin5Ah@jnzrtaE0BDKnN_{?^dTUIW!4xGB6h78|8UduXDTh5B
zGjkl6HQ+$DApm`G1v>5I-@m}^n2p!yx;cRQI^7adQqFN;+?9l$#Ajw^()F&>fF|O4
zyp#$y505mZAFFN)*a@%d>Fbn_dLb0%=f@QkP*mtQfB$x!;t!~dgq$2wLVl09T25ji
zKl#^9aeO?gorW+0>;TeIQlXjro?Y5$KtL)2(=k4j%*12!jr1N^0*Zk;n)u_#9~p1a
zoVY^yR8i2IqQ^%cz`{`ho&y>hdK%DUKe!#qcJ%+Pw0MebZ*SK*Zb-gzUBCjS(AU>j
z=YC=^x_EtdM*6z_-A{;i>}I9FPI_I}z)YXs2;#c^tGL?R+v~XO<_k8V*U#$eY`~#7
zEK~}UWP^pLda~R)bY#th2mmW47M2Q-ZuS;Gw_`4f^#KF7nYFbn5MI#b9<C%rMc<~T
zq>KZ&-Fr=BU|^76QK5iL#E-J*Jq=TT<_<JCp~30tL^O1C@cmUMknje<DU5`Kgr3PZ
z@Gch9KINC^rxIY6r9bI}PNg*ftpW<GB)L=!#>1&1IEY|oYb)=4eTaJSFs%;u&8uRa
z#%YK4;O*(*5!cYrpf!rK|8;@?>%VH_udhEx0!{EEh&42Vf(GR$-8&b1WmRptonK*+
z@eE{t49K6+($M4prE3DvM<u|>tg)<0aAUpy75N5fwnA6a=7c83o8v0r!!co{c5TQM
z_*uekg^!+*u@KPr<6)@e)^=ICn5-XLfS%}kF~S8}t8q|NL+mw3DBxA6v7)H~Jizhb
zw1K#nozuKrrQmzuoCmu*{Bq%OTn@-s?)A^hKX*^sU%pFMjR3X!RX}`|jrZTNCEtTM
z95>kJ`{3soH@I-R-W!Q0l%+;&W@Z*wQNh4q*rsQKE6AU9c{tA&ZXO3soSEG0vX{`q
zO0WMraiQ_EVcyQSeMlK0CaQ|BsdBDorH3Tn>Pgwz6ECl>$~aiXedCgnB(ATo>wuOe
zBb75%r1Y(=&97cEq!>BhDn9{O95lSm!tPg>emSeKsL1&F3100^h?RndEv&2m6``~d
zbM4zKLk3cg67Wc0b|VW(^F~Pj_ZjfN4gjfLF`b)qkDAPRd${C0#pkLsl!Z)8RP+O3
zaM&~Ma0{@VsIaB{n#QWyM_|>)7mNz=yVzEck&1^SmfXo=F%8Mj^V#mU+y#<cLJS!u
zP~8fgT>OCi3q;&jHNs-qg=l;(X-&=KOhI4TJ?erM5CeE!fszM$YksCvZS;nRn_K#*
zAn4Dqm^!iaMnRC$(ZA+oa8Th-Mrf(vEi=;s7~2noTakiGWDc&bN#PTOJ$qpBzHW|K
zVxjLYiA?tnk;0SUdHQFsMicXO%449`1M|Z}Bpbvm%%m6kja<X$6$6Vx!KxTw$Ob#9
zV0uiN0WqWY!w4{()YaD)_(|>~gI{@4eyL$HX&(q4?Cr_C27!$Y@;&PM(&FN<cSr=%
z!13_?zU!(J2($-1(>v-wBmG}Cmys$-j3thYp9wYX-2!JHUYmDwvXbnQC>wuiw5_7r
z(nMtTvqAcSMsPO&sk8UIT(?niyGXDq0@NA_!NJf^py>%QD!?s7!45zr9;a=}@KdaM
za`|#Z+$MOy$CjBYLnRPASozF`uMTDl&bMi2&5Q?9BFc5@WedTR$^o<fY8<IBXw{kg
zSX#gt8u+d5%NK~$W2o=x>GBo8TyWJnOkvQ{0M36<y*p?D#vu^SWfU4UDuIB92lm1=
z@{Bev2yV&qC1Z5t0s<M}EQM8X1SV)0IVcIHs|<x|kh&PGh>6;NIc-Jc=S@qCii%bs
ze`5R(KNOn-3HbElq8kK_8W7Z8?N1;e$0a9M9Z{!G{QVm&Ows{Lr<}sV5MbgeE-4WK
z!G3b0G<t03wXy<^{}eV0r1g!B8s{BFP;Pzg-;m)os0YI8D<7)W>cs&LwVXabd0Sdr
z=K}FFAx6Uv9N7Y9{x4Zs#K4OP3les@R?S;LIlX@OS~fN`^Z~r$2y6`OR+CVm7vqzY
zfxxttzoD+{<;4pKKnJj*qwu+qz%heft`#3hFa<!2<^-m!l(aO{wXR@LPeMYx%1Qu3
z+nx-PlcDJzypWMCwfW*VyPtqSg`AyR8o_B-PJ)33^@x^VSg`(gNUN^4p|1xP{q&!^
zX-;V=?Xece_5GF2;nFYoWJdiiP=!rYW*~NKfQO=MYp1^V#X@j(bwwcJ#{pYWvP`4m
z1K}@G&|Gfd4M!s24g<m`4qn5T@NirwC+BRiB2#v$%|3m~2CYsJeDVfo%LaR5dRA75
zL7~_K*z1m+i;2+%tk({YkN2c;$#|Uc6M)(ioB(47nhqS0qGhBi$3X|GKd%_Rz6FrT
zM)QRXmse?|<m80Fckmt_F#y00;12r=FkEM6r^sYH-3JUfN05cV++u^|q^vnBOy>*_
ztUEdJ*5oZdxL<>UJix&(<3EXmI2_g(;0p@EX3>uyKZ0~lRjKzY@KwvIZIGBM+y~$T
z=JnLoTR<AUQWGHQ2gS$7=aXwvgTrW2F$BG9?JttMwHB6ECUAs=`s~V<5T{>u#<~HA
z@`@CJ0gn)rfos><noC7)z!^aS@XF5yE2TwuqU_>in#5GGtfvnJn-h0($D5p%I98UH
zllNEn`PV8UIy!hDfQPVi)LYJ^2X7wTL`~L0XW6Otd6Nntd5FMcF_427;<1_)Pa6;c
zD{Gz2tUWN3<QVwgPYtqZ(g@x?VGEVb&7N4EJezK=j)zA@>&jkH#Deos;EWihsVVxs
z>xiV+HUDwsVq$`U_%Tr%`*pL-bX^JWNLIVBbru&mYg`2CB2_RBuiv;l9FxnAP(QD`
z1qZbk8qV5&AkX{0dAO7csy+(BcKKJgnd!#&*H>L13rW^gt@t{ot$$m2ZB2=0)-%ZB
zVgP`Ace!K*2goLY{12*@WGqGl#aKuK<v#B<i^arD29nS17VI{oqQKUWJCWBfedMbR
zY+(nppGgoA6`b(dqecy#it-a%nEW*2Yd%{o#ym}rE_|n_*Ky`OikXYh{P+=dB%LSn
z2M1UF)J~<pWA~qXScLGLN=5;&J5qsFk)kr}uM+^Vz&2L&>*9OjU4ZZ2&QY_d&Tk(M
zE1QGyvyDkMcH{{mE<I2nfV>v7c|fdDFt)#ssAN4qnb`JVPXNqovR<mP+KtEVpcXGY
z;zRa1Tg@%cO3}2s{5!xhU7GZ3TA^nB<V5Myr;6Q!&us<Zo22E!OWQYgi#1t*g5A4W
zDJZ*nwaED1{NrY2VaY4wI@a8K8JiV~0uD&&R<En?zmW|4W3zzAhft?sUDNBKvU+kc
zF)=~P*-uN}%j;f+3*#T~i@f$q@Aog2@!C0BUyY(M{i07Jh*mS@^XW@m*m7t0J3J&@
zufuk7`19NUyZ{DNpVX}&g#>Nx?2af|*=<QVx*rGxyrM2I5eNQtH|i{cWqH{_6{b_i
zTu8{5zY#Fup9yI#A?NoPzHEM0R<98L*?BMR<4Sy_99IdA5eNnafEQQUjl8w?*9eVG
zrvPm(@ie;3g4uB0J;~cCZB@u&S_f9JfbG$w7)R<CT!Z0YAm(ZoeC>?&2amKmd(luS
z)7Umi({CyTCLb9|Nrt_E&kprMbneK8lT%#TfW4OUat;flrlbr(K-3-o;hNU&ajlh_
zH4B@{K5j?->lb3F<C~4gOl(x+-Erf%Z=b{3kutvo28u>6@Hu%10rxbfpI@wzkx|FT
zKcxU?@^24kM9OLM8%-?yl(g&ZZ|%!|BX9~TOJiU!A)qM#9TkPZ&Fj|XzENay+|tWo
z9T|4oeOM9}8yOthpMmFp%mKVRXiM!c5IamOm5!mo51B+MoOilQEuN6zG~Oz;@n3Kf
z(5T}}!0o{*tjoc2t)$&6ThJ3H2rRiU&$k=yF=!fuSz}UTnXN`kKQSBZS2Q;sNd4`#
zV>hIbJH}4pwfNxF%wC=$Z;u)|fPjzApfi2Atlks3gTtB6<z~AQ1Dq<(iPig{@Lk>9
zjOu-PfL)2s(#vZAMP#8o0U`=9`2P`#uXjd0&p?F&6GJQ%ZqwMpy%<pO@}vp%?sk2q
zA}zg6Q(Z<z#&TsD$%1JqD7u1>%C4_ZJkRiC6nL)UwWMMIO(3SPihS@q*CpdE9u(gV
zlWn7E@pw5UtGH19I+8}`i6tJp=uwnFg`eCmh_pC+_>qM|K=2`+$QWRY>YAF~^CTN)
zvpLQ5y3~xMV8&5SZl7Heuj_JcmhwvVGQWTma{qKe_vl*uJGtaxF*gl7Q1Fe)c&h_1
zRq$v=bH~4KY|6q<>2<hh1YuR7H&PA`aXm<@z^w`M5Tjpx>rV&%TZ6G3QzJuckR!1`
zj4s!}PsgePhP)h8DJexE?$s>BqmBI_fOw70H)X&fC~ZiX-<^f6<li3$j*vW8mI<4D
z)-Zr$iN~&y6`<EIz)IUa6uL+VA|={y_nNPv{e)V&y3RB9=AbgqRY+@?YdbKC9#a(L
z@Lab{!NE7_4!Ht`16+7)Mp2)2{72U>i-bRSTKQ7AENC-@kgKkiTY;5+_g^zI(d|1i
zC8exdH+XQKtisv5-0+1j1c~sEc5~7em_$^=*xp2|zl)^FTQ~LXB>a~&l?JU{R?SH@
z{x5GpLPI`w_vP_Tr=jG9dwOsICy}3-IXNrO??HD#@5JhB-hHpo8d3j7_C|?l<ALVY
z4qmF!{0Row&rjoW-g{vU-4!sCiS8)Q%+)+8HU0kWwBRPO(Ae00o;8!JdEy6<ob$t{
zpiseztGtz?MfcUr9Y_LRr;v+=sY#WIiAoNGwy5CWzj=3#k}1f^AqxvXd7$CIGT0rl
zCZhx4zsUj(4UZav=T{gxqWtmah?wXhp*&onYr4z5Bb}qXm3=%8Yj)cOK9jz3&-6y$
z#iw_EhP7w`0ZKfdoB08ryft2a>mdZU_Q)eGn1#IPM%m<dL<_hPZo6iF(Hs=;I&Ozs
z_CsEWm*U>fWt!Q=s{){I*4T38p)&wd1E5<&s{FNrrBu2C^=_g1*QTe)niMT#QKM;Z
zL~ageP=R473h;|gsJ~Pw6@DOy<!Pj}pY}T?(@D$qRi?2tWyHl2Kx4k1<H^muXRgd%
zCR}QVL2k9*Snq`=C}hlvjYS4_WvG{qFsF7iinI6>585}M1@Dp925n59%dc^Y_1qrj
zwY975ECR#Au*d)Y-D38<(`jtz*%rF#{N+5$`hB(gX$jIc40C<T-z(DxT|Uz?cH)Qm
zi$I|gA<P33r&gUMsYNQ?Prs*kP@zGlv&-G)Og&>=ZDO5aZo5H3wn-$RXt#zpy|E`z
z4!eA&I7!asJ5S^Aa#*817dva|7ED8+YBRQ)f7fYT1y0X@d+{Ts?;P3{Pxnm!a>*HU
z+iifFB~jJGDh}nX>kMdRC;+)kmVdk;;Id)}cp-HJ$xI0<eoe1q*YkbE!*;%^b~l5!
z^ucT5{=e$ZGa9a_5BEwEM2%iDL=U1fIuR{F1|fRyz4sa<(TNtKi{1&Mjy6U&dheq}
zXSC?|<X!K*_tRbX+g<B^v-UW1_St8jwa<Q@-}4`ehC|6-BNLss>`5|i>z1k0E6)S!
z6tg$&=c~z`5)4fq?%TWDIO+7EzGT3v)Cp2)|4{YN4n2*H+ij%2sBPJWYB4dPClI~O
zfnZHUwbY=`G`7bonYKGdtTWv|XNBk!FVvheA9_XykJ|H4D)UpCsY<{7#1#4Z6Vo@<
zh2*+<&+VH~Q}j2w@4P5DR>Y-F!>TN|hnn0h5lQ@REk`i23;Wql^~dt|psZ#VJd6I}
zv5bX?%F0U9#bSY2P#HjA7AaV$J3Kv|(lUw{v$b!C9n{r=DTt{|<jkn6I?3$>M~fXc
z$BjrCO*VvFpvkFeH@mYntjrC6C@@_;t?zZA1EQTpRmg0TR^i6H)D+z4rClMXPEff~
zJ8j-rv2Ig~+1O9`qVLVpQY=9-n1uG^M4ZQl=ThbQPyopjm!qWM4R<$|NvRi~e$-Z)
z02>ze@6C3AxD+~NyVvD{9{EEwk48i!v@r;|dirqq{uC}GnotsK&>r`nUQvjQ+@k@q
zPWOSACscY_H!w-gBT!lCM0g>`pr7)Y^yrwPrlyBO@OqC*w>h8shYvq>f3%ogp*Dxo
z0uH{!4o-J{Jz}4mnulDCfptYWeoNS(#mK`PycS_U<)T0z?h$}pZwC8{zsUD+L7+m$
zegU!^Ga)I0fSre!xTpkzbC6+o<<2qdF};P=TUMcz&Zw8%<&uzdx$7mD8R$RYH5o?Q
zY`(UgGM^m~(3n#qr!@ViQ{21lfpQ2o*XiyKyj_oNL{q~6px##>Je2{k-aZ=}u{+>Q
zC2OWj<w!H%IxI(tr1(s+G(hWICab0Z;{5=eEN^YCJ%jV!(o`X4bvwe)yQw)b4K!jd
z^`+Tz;&EOq;uuh**?e9CTLjID#ibgKk+$X!%1_rKFn^@b^xmquyrKk+gq*#2b5NbO
zpE0&SzuDS6P^-Xd(n%`$jg_T`KWLIFb8B0c*mjb~!QZ^W*TURVy*WtAvQ`Hg;uYL)
z{<FNiJbjb;OQ||%ops$G(Z7Z%MBG0G9rmks+sfxe5Dr?I7i6Mzg%HZ2+ck7qeaHH)
zECH+je$X%%>#l~jcHd}{355Yx5wi)9)uAJUJGT!~r6l*P3L9oT??LkdN%Ph$en$qw
zM<I7|hCqHjV|A5iU~mvr(8COPDDnKeUCqC$2r24xC8*MAdO5dm;)E+BP8+*-%&jgf
zt5k=5$VkZ(vf>LgJu?$BGJx^`WI%|6AlWSz*c9Ekqwp+kk#WhU6Eww>UtI0J(p@Zm
zPsxgZ4=CG4k+wqM^cW~Jv#0F)r@G3zw@|?xZ5!~suYeL{q|Y@Cf_8QFIC{Ub(*1r>
zOycV|0!=58kNt|wX+CO8PO?COhXfw$>Iq9WU@MA8n=t$Sb<9^ZC#||dmZy+gVT}*3
zp0c1HSlTtX%3GSpr)+mO|7tKe2iRJEw&hwZxt)lqi!7jv@`ac3tXh0wU!Hj)Sjb4i
zAa+z@zGyOHE~5y0EnjFB;B_V>___$Ztr;(MW7ft3*HS^RP_^*0=vOFQ-!`cd-A8W>
z95~n5<?%hfLW&q)3a0}4)1`MnozUd;G%>kjJ0!CIZGSMytJ}7$&f1A4$?moqX>V_)
z`g#v2r)3-uC0=7=<Lvo8-T_Nw!@))x^RPyJsMc#DA!V8##u7<)uXKG`b5aiv#25}@
zaw^qD-S2`C-nYvclMkU*uFixthKIi`+WDxbXQ#lF_4QjWX6o?)ttOE0-iQ|FlXy=`
zmYJhG_HAq8?pj_}mg!2Yp+wzUthlU#nPI$!eg)W_TGs>V8r=+6k&_bjdRf^sX;!%6
zd4z7;yvygzHMZ7_wF#T;=d^3Bo*qEr3s5rrs9spXN9Vh-=<K(4s2MrmC}`;8#Ub9A
z%K!O-2|qUc%{FCe82K}2N?7u#YT1~m!?2bb=?w?O#$svu>2JYZN>Y05T4x$U`%(p`
zUX@HA89x<wRCM@i?2jL>{5(Zzf+Ry|0?z&0TE7L+9y#)!4aJNjsL$UJ?(M$%Difx?
zMhxRzyLdP(($>;#w>E6_W5Q(^DqpC}ZitG;i$=Z4J7;!yed!j2kGOmD+G)2`$hux?
ztPM&0sP}?)D-qM7G3?4@LA~Raknr14tHUv5CmkZ1@B!(AoW`;DaeNx(7|EQ!^>Tz;
z^6iOS6gsqm6}TG7wzfa#!?qS)I%n6GMMKT2Lw|;Iq>PPBD}-tb@x^YWc$uOc#B1u4
zMn3c$tIEziG3NOb5$TMG`B%MxGDhAWtS4jz7LaA>aduohB7X+wcC@xS65F)rc3P{U
zeBcW7V?mD2%xKlSdMU&mjLdfzrJrvgs61`Xv?wUge2X$8Q0M1zEB4uttXws5@X>ys
z5pRuUzlptzOe$y`YQ{&IQACn4R@PKfas%CvG_KdRadPUNsEw-DCObJjEx2%Zx{<=K
zv3ZvfWz{CBSZ%y!Vz}_Eb32hcL-4n7O{M?Mc#$3Oe7hVj?Y&igzjZnM{?)8%@9%lp
z@t=YA?$=w(=R2kNLE|6P*_PL5b3%bI)Eb~>TT^TsOfqqD#vqZ)e>M~yU3t*8d1iTi
z+1Zv4>M)@|bfFs#SJRz89kn)ijRw2zG~El7|F|!ki6)<4aNcqMDA{Mzf0JZzSnWax
zT~2Mbq3ZMG<m58hudw{W!e3raE`$mNA<e&B?|ml2{H*XgXwM@r_bt1zx2HZHEhOJ1
z9j*)lX(wm5JsWN9qqj;*d9uuzx5Qojs1WkJ&4hN5T7xfPVR=)wC8eUdZyc$An%&~e
zywq>S*VAuJuQRM;o669FK)`Y>pq6-TjZ_uO?`A^2VhLGU5FZ^jmc)|1&ykHxsh#G(
zO*~E+nalU$tB68=E<H9EQL7p%f8|MW#z3sZJ26^F|0?HZZ2z#!zi2lpGt;8sj_oV;
z@)e@5Q7vA|F=?SQ7{%8GgW;&|%$FQu1D+TY)8|mE{TQxVS8FevxC%Rq>qYm&@aiM*
z=bsGQ(;tD5PYW(+K}&ey9<eidy2;(iT(+l4;}99!{Rh{IktH!A6l(y85u0ZZ%#-ld
z{(AYLq_VuiGEI<H_ObpfD>>O#V&cRp=4h!%72rsnp3i@*YiL+VHmqTrZJfmQf=6yf
z{SYye=T@3^6PJ+a?;qYQo8oABTj2b+WvNxwexsHBAjZErFE~{VNBQJ%-p66SF$;)|
zbag#%`0UDCYc*^RM^%n=Xxqi2JWclEL`~M61DRJMj0;&jL6s)8bAM4$1w&~C_XnU9
zMj@dTAhzSYuWcd3#<>yT_sOAcY;kf42Nk7V<Il^VVI719L>SGlejOq>bVNLlUsiFl
zA(zzA#XX$32ni22J~^>!&`P|4ugL(kL$R8EODd4BF{;#`zsf3B<>A1Hs*jq3f^5EX
z>uqS<M1_aH05<0UP4AA<3Z7#j0)L4j<kGeyJRKUaEgf-U)SIon!~P=evY6~@BBpn~
z=?W+S2a`8~R>K+nsKGO`6yPOpz$@ugPA2S{mqz<{89B3L!U>2#fF?$MIslqxGf{Zn
zXkThbDCl?bd`q7{W#RA-ve4a(#E2rp^k!LcS_2oU{)CT}En`pN5%H{?U0F$qrM)A2
za<W2ZbILFOE7yC}z@Goz@a7R+ouKAv(7l1Yjw<$#`;fE=@;3=2-Y;2;7w(d-gp6Rr
zZdKdMem9=#ZYn(K!`<E8zifxei4zmVRLrMNce*6XC$8d=RdDC#_@9=9TSLm5xlAg}
z)Du<ru8*DC&!38U`X7Bm8O5?@{WY@h`a)rNdulPEZA)-m>lH^VY$xty={GOo%bz}+
zUw~&0D7$!|`G%^eXP*IAlh<QcAo5ZRx$E1sic2F{FE1V%fc<A}%56mIlJv0h7~o#c
zn#v{Ug}g}PB<?x3^aGaWSjjKQ=(ZO^r(SVQo8d>Set%}7JP>W!eTAMQVUyEZp0rn)
zH3S%N&E8hKr&U4F!2xc~TIUmr?QE*jQXNCh!RySaBz8&V{s_xAD74B5iM`Y$aGT+e
zk#s@BjUGQ!2Ycxw8pH?-A=BMYJq}A{Dp*zEd7b)^{p8*`-gC;{*wz3k`L_A7Fg=Lo
z@!xT3eTF-n8J%MB<-T_h*7ujqvOPBs#WO{8?IX?g96C_nk8y*ArKj@%Msd0F@M7go
z)ql(c8jEconUkjxBOeM`ILF(hh+9-q&EXEZ+uO4xo8htt(?mqPyWQb-;tS~6A7rij
z?Y1h*A<xZC93pX2pGPux1-o1f*d7LwI(;@YgWKL~Umj$n*JvfL>~$zg>?bqrk<J|6
zH&;Buv!D@2U%x1L*G9zL?O{W^$)Uc7<LibLHwqc2WzCLH=)9S%eCfr4b~wkb@CvLw
zf}o&N!QV~`ee?Aa?KnAAfTEj-lUR6qoTpNpb_9%SsyKbw3`O3>7r?eY4okW3?%`}f
zh*8Qtdr9GmXI-C$`-Ji$ziEA%roSJtdbK^(guY}rKR-_?V9i7K;729WWK~u0;SqT3
z#1ema?!I!J+X3ip;rRZ*CmtsTaW26j52GNs+sJ(a3dQ%4ySF~f=oXnGu($AcLGMBI
zrW+x-xQGa5$Bbztrba4dR2_!+0#&dqZvQ-MeYkJY_n7)9!IiR`E!%X8XgNMUTj#EV
z7U*WJlp76BJlyawGs7WK0{ea6IrSP3F2rWr@>0=~xhx!j-9T4reCx$LhrZX6$I`=V
z3uP7(Dykpi_Zx0QP@iDitF|-bY0|-po1A0&lfUb958Yp5(sl0PzEh}I+#D8&AsnDW
zBLm=;*83k6{QL>H8T!+e53%$S?@Bk=KCgbYE8=zQ5y%R-p!$f~XLH-qA3m;A4m@AP
zE!}J~Ke1}Pnn_`M^~zbdT3!J_5IQGRqz0#tqBLl}fx!Yip6=99l16<%3Pz*9mX#9k
z$yY*^&2ypq{-ly!&UC4(ZkYD4ga6rB81x#t8^*cpe~Ld{r8f-Bz$(SSf`S|G{WZU<
zV;Es&_&P?afDLiOqCdY>%c<qZ!xwehuf)Aq4gekLB2t)3n8f36KP%m7Sy7O5;4<VR
zO1}vyb$gR|>2tXh<dHoqEq*#QQAeY9b>Nhx$=p6aufFJU^ZxcEE~u3u7WV}`eQ=7Q
zZ-kg5G#L1(bmMAaMLY(+h)`FNC{YZAu0b!CHxuKQH>1duKK64<IkApN+z4m;qfW?N
z^wESCoiPD757~L+*lT;x`Ssp{hdO34V1{!jTj=ONedBu=Q=mij$#`9|D>m^<Uj2~(
zmm9%ooNg;-e2NcSfym8l^%~7R3We)`CcPqB1NDn)I+?e*J#Y##8_$2YOIn=*+~G33
z@Ys!MpP6xBQttT!xA}5Wkg!6&>1U?+Yk>8j3%Y&QfN1${Am(+(R<$`u{%<fk0RRVx
zaH^JkmQW~~r~&f2zXMu=x3m39Rl+pvY)hLwFFq~?-Z+RKcigdm|FtdtcNO!MHc$56
zmh#p{f12So<L6bO_P~UL1@OOvi@2pdnE3b-JoehQeiYGKz)@`hJ(k;qD{gzOc2^sT
zas7R&+o3hkO<=$tn`~yaY<bX?EItNyI<FcxsZF2AiUg>jOe>pr*$sp4JRDm+sc$aM
z4}9wgR~}fa?l<MYj*HG~n?n+E!r{Pr7}4F$<T5dgD=TLKwHCl`d6*Jf=#JusvIB3|
z@C4=u7ma6r(ZiM`gpz-y7$gI~{oUQGkQgr+d(mC!bSmrW8mV2qs%ZjFuc{N>)@kvq
z0I~=qBlp*j)n=@B;vv1I=SW1$Aqgf{PXrQ)<fs_%0dTTzptJ$J@g%W33##C&uXPmg
zw+R~;m!hf_wVSNDiD4LcH3AeZ()T|;LDKGNXndgWiA{eABviS9V=L>}w*hY}E<6VG
zGSKr96X_k3s~|cBPKEQgUuXV1I{+I&$4fJM&G{m(PAS4Sa$((D6;~3KWgqQ+hc`Ap
zI#~^UV6+D>Bx`Pq0};<>ot^YZ!XW?mlR^u=`Rc$hG`bx6$ET(;0V@9F_AlWf9~^pF
zAesMnlbeph#AFRwVwFX4L#W|VqW?Zqt@Zrk#?*Tko7V-&1Z?Zq`jF>oKw*lA;$mx;
zrE@oUsbG3bMcp}#S9tpN5=$41u^Z4Bs@zY+3Q9{AhtdS~n3(a&s@m=$4VitNk-8^~
zr$I#VYGcUz8#*Fi=)Gc@GY&ZYeJg69j6qmfcJZNu(Ns%XCq*>V@mHjZlpa@)N~0nk
zD2)Ng`B&9F&HQCI!xf()uA!VNtmA(@=jzhjKVM)dcB%xwBNS%{36<#vP>qYkUA&Rp
znLRi#gIF2ze!`ssZxV#H)fNgMy>@uW<e6pih=_jEi+z!w&CS<}Wut7k6j@Y{mc)<1
zHG%5{Vj0tx|E^6=L!%C`c+Zo$0*ip9MjMC}OXnBHkOl>9W2&iJ?SCi=*ktWi9}(y7
z>gs|6fDwJPd>;Fzs)TVKKT!VR>Skrd5-@9-@Tir2<-x#Q>>XU7DT8;7<&o`k4VihF
zNCNTaHQ=c>0aDNqs6u5C;1wO8^kyGlEI3V!&jTV47vn@NHUWWj-WXuyjb#1Ac#87M
z9qRz7i6M#0Mc~E15^4c}Dy>&rdiL{m5p~aY@~zDItw)LUmlmyssM$D{6vhty2L_dX
zh<S0vNnVGmx9`u>02$D~y;AqwN6<^!PWoH`4dEAeU;;3XmtdNysgFSC@?He70xt5_
zzs>;xI0%02Asczr7eFXSZKXQ_^FUhOl3<5wPF7AXZ)%5!@@$`z>t{7j-1UUVV$j2#
z_}S<K5!vP}uFH)gV9_~2&%J;(NB+aO>Nx+ecZ!fs*aS(`+|U{fYtA55;UsGb$Vq=u
zpbOZDbYgQ?0GLbwZ9zY3JtfGYXXH-@WCS3aE|1;0GvPscfGEl_rjT4-7Qa1)Eqcsk
zYT3uF(0(QY@P$~wD~|$+d49`^mjede#RaFav3Y!YUeRX9(34`zB_GKHV3(`FHPMS)
z@hQ0Z2CHYxh(yc(BMCM(H`R1?<<rN))5|yjcEsGuDn2vwU=r4Bt7d2j0Wg-wkG}{A
z2%xP3KA*h2sE&>f0RG(OgIuhDRsbmI1tcyFO-)n2^pmsW7reaj0MpCI&mS}%Q^x;_
z9QPlSw7spauOG5(+%X0J0;HmU{o(^mG8QoSLXo1j7f@%TfjWNxAW%|K0XMBjZtp4S
z>azbkPeH-mzw-c-LseDvwV|OT7Qyq5+2z=lbQ*x{WMXAa-`}?n3CVDyE7rxPpKTA4
z0uTgJ^25TbeQ2_@v@~GK|ATfuB#4gXBl3U$zG%Gz0o77!;UOtStZZyD{R0CGjEpdP
zh2V_cRwm*ISq3%C-k_A%4DoN7h?C<HO8@@2BiQ-RXX8iT{{L&BsOS3!^8ri9KV1R1
z<8QD5-SIhBS#EAH0FH+ZqFdYBGx6|@uDK3b@*f-=099c{^fk@Q7}3$utp%uy`doW%
zYUk|#ock8_0Vqe;%nU^)WZPK>;T!Y@J1s3OgF&fkW_@ifI6XZbW+Ae7C!(&Yr$_Ti
z{aP2`tSg(g(rEAG89x&Rkm&<FE<kyh*d*SheQcOKSOAJga=gm}ahL7i#C+~1U@(83
zA+srvS&o#|d8WRtf21mcw&I{02De`;{h=kMKYcO0LPSg}{Xwx*W)mQVP<LBS8K`?t
zhX?Wibw;7#%p+$VyP5JXKzatqIN<+(6-1@FoqGTod{GTVENDf|z@v-p$mzT^+1tm5
z*Jr>zZ1CE2^8>IwQ}XuCm)YE3&{Ni#3oL4F=Wf@I1<JP8T3!y7oPD)WqEOS&=mzMu
z{rebpL|IPGCn}-kuPg7XE(OaROAJv0HtQ_${6{9h_pMr)HEnqvJ)@SI%ylhQ!kzm&
zU{*d`Gm`6y7Wfc#gN=>1mD2uD3$x&CWKU!@y5{&{#v-f%F}t(bdaq~|(C-p9fbP8q
zH!^c{T&<Pk0C91MNkmv>`*pH`KEC~g3#_^XbQTZi=P5cZ*SOP8^J~B$k@Q+RFJxXG
zcZ9|X>uW0K)2$rMD*-ixWHw*jjcnhZ5RRxfZhw>6MAMhBBL<Xn<ebsN^4c$!*uhm8
z;WAK#19rI5C&3NO{`fPo{L=(r?!Rl75&b_k+)KF{Qs3USQX$J=Cq=RMQO__o6;nEM
zXRV7J3iA=yMwLaf<Lzn(-&|F&sVRG*jL@3Y7q2F1VrjSPk9JFr+(z#=PY9wmf(4{@
z2MJyJMlWy0TLa$4;V=yjcVNNa*BK_fk$L@=rxxom(NTqrjZ@LtEZcOZb?+=||I;E4
zm0vI0gHbFTqE6iMg>w8}#X^z<rvo9=Us=FM4nO6$VM!UP#QHbi^Gx{`swb_`&<KnF
z{YJpA%hw>9!E$qb(1`4a60Vbty|GRpiNm3;Zu+oyw}RV(vy(6R5T~`><mp|T?r&;L
zA6HxKz?1+Li~zXHeO*m_^%#tjeBKSpg&0xQ>MhMw&>17{lyZt-?vsYenHdNkl}mJA
zuRBocxAswrd}b|@B`)&C)tap}!BQcktnu*PP2)*o7E;Jt9`oYZ`|h}z40}r9g-e|&
zb-A1g4>2agH{?~TUVl+W?f!Qh0=MZ)P3>C3IShEdwqS<IJ{Mt{i(JL|zAp|5P~-|i
zG}rCh0&Vstern87PnDJLbwq^>#;=5zAsL6Qb)7D)?lMGWD;yhT<Q<IBEBF3PDuNE1
zbHq_*pPRjTS%$uN(pR$~6M9C~-@rnswjAM0Gf~ER+gYmXWZmaDTuINxJ2e;IM&p~A
zqR{k*6tp$%_B^PP3Uwv9>b_;tjJ%48fHF^&&lv{8&0b!zG!{!GYYruCX#HxPPSi0o
zZ4xa$YLJ+vrn$Nfe;iaBkFw-s6!W99lu&9D!7)p0WYrBD*Bz1iu2ntC@xzZ;|6o#G
z)xagP3rsP;t5@uS^XD3xj&Cn)+oO)jrTd<4rQi*`?J1ag(MDvhgLmdwa$#vxj_R-f
zfy<(TMU822EbNL|ys0uZ1rx)Ybw=GHok*bGx}RDiHU2OMn+U_7A@oE|J=(Mo@kbs4
z5wHP00V}Sv-A|DAPrepZumxWP=yl*0JwYvVj(xW2RIP~<OWE(dX|vx7JJb9w0nSOt
zoYkGJzGEs1c$Gxw0Ap;bMHZW}PB&9a&RH?c<`?A{_U1#>MgBALUBs|=wE`&3g*mK3
zmW%9ZUZF`6(0AO{I#KKJnDe~5&sl}e9@MS|4ib8!JmN2&bMa%I5KgK#A2B_yKs(47
z#SQ6^<rkhLF>t`0Hhf1ekfA{-!uA~}n7Sifnx5J=<$3OzHtz4wXdYthCMF}7VcOQ%
zYc)SiWXsGsn>$`_JqjuLnGDZt!;xOkRlm(S***#9M|{`=b<C!+FE6dUHS?#uz;70m
zWtJuS^5(vT9bE5~thuK8jxkFH6Q|PpV<$m96CRB+2DnW`Q<1J{!N81<M9lV4B8kK(
zM#<cX^0HcQr6mb?(#x<yQ>ACChqu>YDw><v?7t=xkt+lp-Yvvo_X8}MpemTmH-(x*
zkbgA<L1yRt1F+=z=Wej>)n~hJcc9S7ep(vUeu%rIawZSukl_ARb3!3vyx5$TD>r_T
zA1B(6#wTX*z3_7<R>;b!twX&reuQaYBr8|OkKAZHdlMy}D9q;+fBZ4}bAHh2KFSff
z%!b*NiVh=U?IJWuJPt7IvMmFt-z4Ovj=N0m>`LdyJ;{U|l#>35%UE3@a~z_gD@|Mf
zMfe3(3MQ2zWI(8#)yY0-G?Q2wc!ot8S>G0Wz>wN64U)gwjdH=u>Qt6kphurQd&2IK
zyO_4-)l~EGNUgkXS086Oc-UUv$#c~I!}bVtLQI&e!?iRaSyABh=OlD*Y~Fg+y5$Ll
zLzrj5DGQh=R!gvqeJ>JaWXTH2_MuWFJXO;KOSRj)3QZ_u^K#klPQ$qWxEWsA6*EK$
zx*?#hao%!8aFCa;AtA+K$tW3zT~Ql5F~N$;=)Mch8MI{2-(*52R6O54Z~3+@PAKLf
zLMQWy_%J~JYUQTLB*RYr$%0S_HAbt!s1Bz<`lH6;rj;?o?*L6c*8u0D!JRo3YPfm+
zMh6agpwY<c+>SmSGQ6gQOva9c&cFIckzI5hHb~8dS6q}ef-0Aw4)=fCK?Aln(8c2k
z-8sKH9CbyWL$t_{e)f*>7QuvUz6VZE<gK-g5<lI86}KHVSqK)~_23(te9vQdN|Um%
zAoCaW=lIXBab}I)y1y91_po<Rb$i56%vP4i#}_#_w%hi)V5;MO0;9rE#W3`1wXv7|
z6zf$l1=bZZDRi6ZO{ev<KXnvoMVdu&VIl9PD%7O!a&;m2*k-AIy+wC;1<4+I9JABK
zW~!f!hr1lat4nC1yCLpO1>L9J6(qB$MN;x5Ql+=UYFS49-%PQ7^TTpw%AF(DzB`|{
zysA)nG{e%QmfW7Q;vfaj+I<2(5bsD4#Gv3-sEjd0zmMAQ-r?sqa2DWRCRVzeKNu=;
zx7gj#$f8Y0(9CBpHdU(CSD?p~aj#<xlRke$s0I2{XN_I~32UIy<^<1YPRi`GhNeEV
z-xxHNu;qHAO>L{=yO_RE^n^Gcl{9Y2qsA>DOAw1}UURe#&Y+~weQ=9@f0Ne4wHTaj
zK5FgaHD;{N<(?2p^oc<5h6Fy_-E2$PvuSH>*ASx9`Q#-z$5*hkix+(1#E5A}OxMZs
zJ&or@km+0U1#eS3w(NrFwOnj6G5-oyGk(O9cok4V>BowYP>rz+Xu+1`U6gBw_w4^P
zgtk(x-&bJOI`zrj9H46^Jzq_Rp%2j>jG&diVCg`|sCw}r=ZQ4m-W)h9>Cexcg)N4%
z*r#eYRH7wrYu{%O_gCY`Y?Rd%l2g&Vfh1u8X%ntn8RUyBkiDs1;N!iF>Wm7PQ<E(s
zFU5-)XIXfB$@uipxWU1Oayavx!5v(6crK`!;08^XJLqee?d-x<PVwkzm%;J*CvYAa
zu2fHVsxr9TvGQ+AX+hQM>~f;NUq@ddCM(zN1meTJSBls~6=?U{Seuy(vye#Yr=L&o
zV)01PU(2{hORc4e4iWtD!C^lgxEf&nP6%3~99i3f4cSl)Wq8V4-eoPda))^Hyq}^R
z$fmqofMo2o*3ou9X&zNI>-InV!K3c~lgutDS(D7_%~V+2aSIGY+xXY5xBIRB0O85I
z`eBil6Rs@%WTeXJnr-CNq9Sj7we6k_2LF2BxyKM!7PupnCo^Z*e7*H%)OysWa~R%f
zndf3+Tuq~x9HILdS%>zt-j|<?KDfIKOfpWCB9Rni>E&%MI!u#I+so%m*{<v=ZG98G
z{^8MpT<#r9Eq+-#9h*t)%4YrVG@EH}BC?#eS8!FB*a{Vum)XYE+t&(tl#r9@Eng(-
zhAXs`{z@zLfltiEaZqFarGyO&3pku^B^lQSQ~V`I<=Q7qP^5ejoXmO5MwK-8&FJT?
zp>o;01OG=k9#+b>OfF9(Slwy6ANwXgf7Umd<ji-C>St1-*%Q)qeIXp~ogR<*ahMcT
zYH`)jdsX>P;;8Bz&sZZGm(>n0V`O@<tJ4?%L~S=({>yu5X=7;qGO*0xFb#|jWqqv;
zs;gK30H9iP?NKs4DaoJVEJTriF2X}h)p!`Ibsy_ea8?F5(qq)Ve_=lmhY9|uCVac(
zX!sI*U##cHn0;vd!cqf6y>)6oiS)mo%2WRT<*AfrCp3J!JV5__CGVDaT~r}fa6XA+
zcw3i50#iV3^9B~5;#>OJzh)MD_S90&xG7()#ECWXedpKle&qoXPvnM$49x=g&|)qI
zP~?4uG-+*n#rRV53;6<1GXe1cWxe!{=8r;jJ2x-&3(*h}F3zMZ>y7FXFCE{%SMg-K
zGPXHSGFd*&$p|D(fooc>GM;ed7k2Y7&XO7Qp%D|mo_(c}Jzd*U7=c=bjD7iGus~3|
z|M~Gk1Y_$jn$adc9`H9_nUHrM-N*}Fn7xKx$2t@v?~Wd?GuBN$7ly|lThl7<nk9nk
zf7O=%NPuysl6C$DO=&;5<_`=f+}0TV^4ZO*FrM%?2xd`CIG-5v97HrIadJVcDk-C9
zfT<U^+h3+x%~qt-s$|mLVca9HPnvGyupWkFVYj6zQGK06YZX5i7tfZPe>RRz`Ct2l
z5nMjtVz`pSL+^t&(_KpSD2JI#5aS*ix}f-`?~GzH<2Ml>L#U7|hT3IIpDr=dz{hTl
z0IYCLQdz6&(;A<6k9h^F=PiboT9t@&rGK&Hv&1*94!s1$oO_fndW2;9684PH4p*&<
zYeX#9_dc1dA;G<gj5|H(v_)}Z*h#ZF>A+HxsD`nf@Jr_-Y9gv|AvFO`5(nVU$f`kJ
z)-P^%;jc03S6_*__;!5dnLN3S+rz=TU|v>mMi~p@f~Z7tQ(sDGsx%d5;cw&8S6_6h
z|K2;d7y5J6_VfC-7Pno&!QyCHW`E(*%({(;y<Ed~u`f*^W9S_QJwmqIMo?En3h6;!
z(I9nNd78WCP9rFfO_SJNf6sN+MCWqJAT||MKKb@g^5!k3HTUr|QRMiOlSO03-n8;Y
z1y&xI*35bVM%B01g;-d8oNlXER@CoV5q<E)v$ezgXNYqUZOa8c(~1yZ!U7{hqr-=J
zk^aoVL&oNgmxkUi!!cQPd3}B2iS{F^LPh;%!XMq=zb3^h#S=g({z(!*`l@kz_CKZV
zGs+HU?I^Rt{Jxws<-_BP#3uJH;rREW)eF(55*5$NH*pbnMtl;at^-jg+j=5*M(TJ2
zl|d4pS28IInsVq|zh>&kx8O_X*0lZT@!ft~J3DC<C%!Gtb>qe5X}#;*Lz&q`P{sJN
zJLSd4(@Ena&4iW|R+;NxL#1Q-8>HMuEM7v%e*P6tiVQFhkkY0ZVQ%J4KHn&h9_ct9
zelDqXoZ&n~Z|fFeF#FF#H)bnn-7=^^CN`!|)aRk|+@AH$SV;Gu;*x;-f0tfHvBxe`
z8-Y0Pzxt%J)h@P4B}%)BUO7`cZ-=t}IO|}YHasIGQ*xf{I3&3YE^M~7B{($es?#Az
z#yo~izt<kJ>`}?wan@=19i(Ke$|LiYz^gdp;x`zO6Q<dYs*b}B>L|~Kny#;b!{dk3
z+kkj-dd4CzLCX2mZnAPFm;@*zC;x0~`>vxzoi!76-=@<}?&T%KqMU(r*4b+_eD5ny
zZesFmiS-Z@aAJyu{YTl3j`r%y#&+D5(+~f8{`(qI3W~(EG}}uj4HrYS6~MjuC)1cu
zF#|FTHvRFL#^qlND*+*)^`oOBXAwk_MU6Jk59Pmqh3^qyt2Qa9VO(~$FI{f?bAWI9
z{++z&Pr=OI$;E|yk=yNX2co<TC}oI&@jPg)qSLqo2E)|Snrhedz(M?yayB%eim`H3
zca}HWTVIC()jEC8Xr?k!00H1F6XMOG&pSX1%U_|-{7=g@X#D?4!vmo@mr2!0?RE_C
Q<9}%H<RG$@Z;V6!2jD)|YybcN

diff --git a/content/english/hpc/data-structures/img/eytzinger_old.png b/content/english/hpc/data-structures/img/eytzinger_old.png
new file mode 100644
index 0000000000000000000000000000000000000000..97237c734cf59c1deda7cc16192b6f432c3afaf0
GIT binary patch
literal 28730
zcmb4rWl)w~*ER}>ba#VvcXv0^-AE(dAxJj}0@B?jsdR&MNK1Ej$Gf>_p6~zXj3Zpv
z89UcnM{L8D6{Qg2@ZcaIAP{Aw#Z@36-c*6VsPEr_Pln5=p};>-&Z07E@4+9R_hw<>
zcU%_<Ef>|#pIqFHoy;LD>_6L?GdP<%nVZ`?TYh#qfovCqfFOa85f@SONI(2%p@pG|
z|8CXdO#zK8Qv?Y~7#T_6ds<qz-b@^hzRH^Z=icu)=o~ful{LM~-eMoc-=M~mWveBX
zeqrCMZa%pocO<9v{r)f5*$~_Rdur`i^<j(ZXnHp}+1qYrgzf<p9vK)ToDk!bQj^|D
zW1Nt9o^lTsgl=ZmP)w!}V;srt>h<2&>j*Sjz*RV@-4WQ}K5}QRGZXN;84{fVxHxLq
z`2Xu>j1PogjQ44Cl*6!@3|6;>lBsEF19Nl9<mKg!tgK*pc-)o-m^nJZs~n2+^PzUe
zviiRhHlw25V#r$KDNi#)ZDK;aA(p^}rS{2sgWrs##cV+-8HMu(8v2^UETwe4CN&W^
z1iwFquzTq)l4Q0A;{5wlj+rH-XMtoE0~PhKgs~4Z6l$aJ$;#^32tKphp%(0iPdM|v
z;Dyu5pFUAa#u6g8ukI)xal%A+RnWHDPb^Yz{#!vRLek)N++?2q`?ohG(Rd#Tj7i};
zR#w*4rFy<)nzU@ko$%E`u3{(r57Z(tW2BatEkP&U598zGC}JuhCAwhDQmp@sjfG~2
zQRSD_OPUtJa@aHV-!RuXk)KV!psJ~<Srf$>?=#PAg&UVzEeA??P*ruytiAoxH;NwD
zH`Ny?A;)bMYrN0;`eMA6mKKNeL+E0+)IO``QvIT}+mc$3)73XcN38Pt-HiCG@*Ldm
zM*%64gyRE~Nf!;ynJd8(BFrtV$4<+f-P4$0qUA9q#Kk+>RIx2%g_QkZ5DmoY+}XQH
z6iA&sWtk_S-VzD+uP{Y^TuzR;rH~e2g%x>@HL|xK_4ErVfuA!M3n|I{R-`C0(V`nH
z03S_YcNXc$>9!o!v2D|-wNKTdX;?2oymh{~L&N?h!P7mR*Y!vDG=gg5>ylF6+&Y4R
z{x`G5)T1hDA?J#bzeRSiGF86RW81@}7A%8q5PXu8Z5r?(7|Sg^#O#U=7MiLAX{U(t
zbj7NO**bZd@cRw7zTLl<Stfs{wFuu?(RjpZoF(0dcvYYf;0#FUg_L}o%eDqL&8MD>
z*TsK=mOFDiS(8I}7$s;|{2XLYdNQHRKzyJiPOhhX=1pB}O~S{P@vQ|x6}yIwDWrs!
z?2G1EL3>xe7JYL<ka!yz0pX84XJ^+rF3rj%RB|OG@h&_m^OWBaV4swbNtmm@k36T<
ziU`GOXqem93yk*WXR%ab9yKppYpyzC>6dQnx<SF(6491OO6x#Sg{gTR12g-1tlpBE
zEvEeP*m>VHvP(Y<A7|LRlF!w5;?(w=rCs%Q-{4mAxH+)4ziSXsd`kKKx+_a^85Luw
zWy2ZWaXmTI*=@9<63bug9X^UsicSR&Fs*Ce67`l$?Cx)BZImOZYX0{o_BlI<(}l(*
zbEwq$uwl)B=o&O}_kN7O6P1JBO9&$+rvkf4Oz(an7JF)il;q95#|Psk9<<$8n61c`
zGhMEbI%L(z+P^ROLmE+`BhMDPj#FLCSPSD?0yDRz5+^v^IL`kk@)=WJXPJ0EN=Qkp
zmZm2d6{6VCf=KN39fpS-%TwJ{Xi5fNI5q>0pm=+7OO-c1T?XVuf%fO(LAga#rw8l7
z<qj<JJaeho-D+Qzse|;Ts6Ld;=hPgLj4q78{*J=(96Rk9gK=^?EEVid49@Umk>?l~
z99%mbG_7{tg@V4?R_3%e+~9;?4^+l+K`MeIs_p7!#xz<1?J3l#&{><gq8n&!#=Wb`
znklY$oP9IU+K7A1RoOK>9F-S@pz2;T6oQ~C*4EbMfX_;BPN|>0&yD2-he+328D|*P
z(l15ovG?^rld)@XP$o(SxA54=0xU_f(b3VcnD?R1Y$O|-^exCO??sI5e;vGsb3PS3
zC1X__72Vl(3Jdocg#+Ds!*^kVgLHRyXZ%!H(d4<oi?^2+$SzN6IsRVMk8~4{TC9y5
zrX@;ivn{;$O8{hz*_sbG@6XV9+``h=cpKB|dGT(x%CP_LV(0Ak%oGCyLs(Ojn1<$a
zc6oWAVg><=&#gn#uU{fcN=lQCxRR8wQ_qo@nTZ5G@bL0lUvBl5R#1>=wEwrdzYnLW
zsaa=rvGR3zZEY=g3q+IbygYK9O7ENF@#%8QQp^xzh}Sst=Vxv0+SV2Yt*E1;<L+bu
zyq>=PM4lA#+}xa6X<4%}wYaqO`{RGjFt0E1_U@jW!x8Yh{8Cm%r>Cbko-fn;^_ErM
zw+F;lH4-K!<mKh%TrzZ2)Bwc{-tdU1)>eVdExOF@d@>(FVx=s>-d67$hm&Oie!oYY
z3f)F5%-%}<W>E==;I9lOIP?ABDL%Qy#qYr^<+7E<#i7j2&xbRf@bL2H8h8J353Q<P
zUpKkEy9;-MdiSoYuTL_udvFjoA|k?Mtdn^%43nvFnU030^Ba|txCOsbCVxOuQqpcz
zk@L=|vWCWY7kMzG9sT`MF7I%Nh>&1lU?y;lKpM%(?szw1ybq(%lHw&rjX}oDtgHs*
z<Lhe-as};M{bo0bjlO6K!dg)t9v%>Z19pe$IXSVw%E-;Nv9Rb5Nj2H02lw*EhN0l$
z(Jl-M`P`zsd-pCk!{_GsUruEu<6a8K_tezzhnthg2qz~eRxonn67zrlL`18NP2++<
z?W!U!E?&a8r;x^piCIP@;3X_2B{hlK%Sb^{pvE%v=Z`3Ol0@%tSC{A{DsJHdST6fP
znpFnjZj+grnU>?(0hUxp`}+}}Mt;cf=JrrfQbs&Wf?FW7#+a_SSUEYryNHLBu!H$d
zO-xMuumJ1r<psuBx{myKA`b@S<qw1(_gj7LRHkKmr>4+lWo2{Qv%nN1ARv4oj5q!R
zrk5Oja&l4@vn(nqN+nOL&i<dyB&zvxBsP<{{|4xon5t^+bb*GQot=!784fCVQ%Pb{
zQt)e+-3r6Q!*>_z%PcLd?d(Q)iNjD)*XC<2a=!+LA`^Fwk1M-q%v9)dvD+=EO|qnV
zMPV{Y(r#{Vudj87<_>AGu&_8dIiVfV9~~ZY<MsxY5CsT-DN@RcK%xW#1?B8~p_1p%
zZappidT$W8mkbUbUQWOP9}n+-9<mV;k3G&sz7T<eg2LnRwx&1*8VV{K%xUgW6PQ6|
z6_o_=Es|+(WC<p?q#%UX>;96ayrM!v>e$wTY7snZ?cjip7svDZ5clxtNCJjDCMM?4
zC}<6X&LB$5r)l`o*y4J3f?P{WD-o|(R9#(tk_DIL&H?wfyHi$84I@20T^xpvkk`d1
zIUp4e6jEzYsJ@)7p*cA@yQ~S3k?_r(hOy@6=AA=Ba+fdhad9L+B>VpUO>*rpE(Ie|
zxR5F2Z(xl9&%jWjqF+{DR|h5}Ud77FDtAH%Y)dMy%eC&*Uj+pP`Q)rBG2n(t6H`;U
zXp@}sa=radqibb4RsF%i!2-z_P=)=sj9zpL3kwrs2BdMpqGk2PQQYlTlpsM#q}8zH
zCY97PX!q0K_dn@&FeH9{`N8#>aW5<?2!xuj{QP{-Loz<T=4qtvr6s*d)Ko88R@R@>
zyTl4V*l9Qt)qQa+Q)!|xtGwvR$@8YEu|6-i1VlzgS{8K_jvM8pqLR|n%e%x~t#(4<
z;NXP8Y3rx1htLqy8C(s74#ErR#dzMGQ}DVU)89`veqPB`Fc=s}NB$P+$Nm1|>MFRT
zq-2+5xVKk=k`mjzGz7DXw>&!=0>qnw!swH^n!IA~pI^U1rg3JLIN$mRK0ZB3yxcrL
zKU*#~RP4UA{l5@U7@~DnDi#qLc?KqJ0@YGk8O7GlPC_bMQBhG@O)b$yKBPolLPEmh
zayK3<78scB(NRU0-W=5;Ma#MBoP1MO^`3}igpG?kQ1PSazq`D-2?e9SD>q*thg}Tz
z6fb4BbW%$TKM0nQNati^WUITo)c4EW+}za@LS_(j2CRW&I~(U)Lk{g50aCF~sOY!>
zU4cFjUm~o@wHoZ;8{a&!cus!6!NZf~X6POWxxMuu;CFeiS+yeGbK$VLv-K7tF;RWI
zw4Q?3)kU~nu$9H*)W|Jd|4;Qg#THnuHGbXi4(ER1M_(Zrk9T777HW*MHwZ&gV7`a;
z*=+OvI=#!BF9}W7_dp54Q1z!P+PQ@VSzf_z>4NNR68aY_Od>&__qB{qIi|)iglu%|
zlarz8EpKJvF5;hg(MoEJrTx5+d1%5@*lgZ?r#Bd8Y?Xu6Xvt4$Zx;e}b!f=%zG^}i
z^dFKHw0cGP%*3df$*4Z|*xRF-!C4xWMtPoW!|fovoDS_9)|k%P&qbvK%m{>J2dMNW
z&rEIPqJV27R5u*<Z$E+on-<k6<S+M-DPgl1HQJBT-eq^B(Q!RC+`778om_3Tpt!`R
z>3Hhnm}!nExq*X&8!vhhZW+PO9l9yYDb3N};>`86?>ZUhJzS}UH}rq1$5CNT>WxUo
z;Pt+yFC0ma_8YnpnJT`|(}`1lU!^~kLpt1flr}r>qb=WhmE&MOSk`I2jdyB&%D>S(
z8{bs_3Ge!3KDeswKB7BOUL6d#`G2#!1P$rOtDM($^#(S$snMMD_hO3i%_t$k2Q!@c
zi>Ihm50q=T$_vh!Vh7*TW7iM%&%Dq|k3RF&R%l(IEvWCIcE@YbfAAuFQD6C(Uay;Y
zL$rg9)P1Mg|C)wJ7tC?AUo)l;hZnu1?QbXv^<&9}_sy<oCy@Wndd8AnE0Q5NvEndz
zT=y}4$|)Cg{T2QQyZu9!`XuIiZYdtGf=)=PvwyLtbVxi4+s-3Ik<1VegC4|q!wIz+
z7M$b#rIq5MIV~tWX?_}CXw*5TY3*%!BI21kNSQ{lXh=#ZW!JelNJzRd`uO-*V<i}y
z4RnVh$0un*n1)Vj5_rHN8ZV1tZ)E)Fe2&a;^t3E32xf`-q;+f^P%tWqv(R!WLKfTq
zwP4Z1cImu`n=60@lFmSdjGi7LeDJf;QjRXgY~0uIkEZNfkSJ(SnDiqXXGY5p_=j_?
zjUh6)l5U4fwaDA8Z8RKwf$7$!E*xqS<PsD(WdkC|@aR;hYV6F@;&0+7!f*L*5s7xV
zgdb)`Sd^uRh?$lr+hEqw(9Rz3oX<WZsKSAlcJ=qej4beI+<$WVN{MHwnGkLP4W;HZ
zldwIbcsLwcqS@lxGO^TGGHuh<Lcs5l6h3}Kv6@_CfbL3U5Y*l~6(#!NQ%iqqHAWB?
zU5C6bF7e90^})fM)Fko;0~xTbRk~)!h!vt2HSCQu$p{d?bOe!2;A)!hCrdi3<`=X`
zXlQ7-nzBX>4-5~Bd+Fd!BkE{){wvRS_ej5)fPW;*$3(jo?J5#j`q5wuX(;r9XKr2!
z!Z{5pCVFJyNEBUt`6o8l;`~n}23qwutAm3gWxrTnL-m#q0=bPrs~7aqc3-+nbZ8)1
zUD9os{NT$WTB}#>_ppfYiaT#+b+FVDUtze<*N7tXRT_?fZ?Z%_lVw`Jo++xvA$ogz
z|LP<go5`5Bt0v|Al_wlwojfJkxS)#MoO72Y(>hB)14R}}qS_^NTxdZ&+~l_lQd<(_
zt#{{^Brsp7V3I=MEc2Dmdc{O!SkjuqO0a2fGD=fgO5fdEW39^ec$@jJ-E<|OU1uaf
zaQ3w)R$H2fXogoASstlvvpPj{jOTfx?>1US1Qht?MH2tsBRO&<a1+N)oWD?}Mp@NY
zApjK$xBMvc<X}$^lvQ|@h^TQr6X7`?r)axI#CzZ8cU6CO|HY?XqG4@+;*kEM1_Ndz
zxAq%(;OG*qJFc-Jp4+m*!!9Pz+<Zfq(JwVyxIdz#e@<AlQg6{fx>r_J-T1)PEnQov
zkR~c81(Pbnlw+(Fy`GyLMP&DbgpU(<3~tDP{@}{lh3H7*DO<k2dMHFCS1y@R8o?JT
zCp(v_L68zg!js73oSzycCqFQt4NcT%yg)7)Z_5lvd$L|m^8?g_-<dl<5wg>&$g`Zc
z2#*AJ<5Z~QsP~6Ibe&rIm=#yk6J^nihk`#uB&bnHST0#!(_ki|5vaV96RYd$f*d2U
zJf3=L(c~_&tBS%ivAi5?-wtqSgr)^^5^!MrIb>B55Bu$m=ln@=hn2Hix_+LmO=>zr
zm6{`{ED)_cx{hT)8WYjuPc_$UKx7;1bD0dB<!m2{F9)Wv$s~a)K@`icRza^_c42!R
zEHe0KoBgC0{geiF$%B~+o6mVT5Vl4JB~JU(G>^B7?-2BS>!!mX49TddZj>ydW8;Lw
z)>#Ed=A;_qJsv&>y@PQ|{BCR}rfZ{Hw<`o&nBf&^u<_>hmQFHAW$d%>t-wj%ivL1;
za+rf%Z{u9GHH~Bp0ZrzqeCk|kG6pV>(}y~I8nVOfGq_|-xM1q_<Y$^P2h44=&eu4;
zTob*~gFyb|b}|>zA3Br6ywe_f0sAHH`Qh%e>|*UE(BHr9(nxmqDsSAT^w0KX0t!e?
zVd?1vibQ;YaTnh4SVJRPR-yqA$nI4=O*N)-PmF~IPL7V9jd|^z-2y`DXPHj(lO|ak
z^S$j3=X0n7j%O(Bs^*7PjON3`!w#FLB286k1{-fFG2T1)KfD+f&p_Zm-(~=ON$$;A
z`2snDKiF;a5X-@6WK}b^)@IKe^XwZY;6^5Ln3<gT+i44~EsYr#$rL-sHI`J}K!}O~
zciCGbx4;cuuifcC7$ZH%{4);#AV?VG-m<!zUDgdv_L<2lCw!1cPdDc%2?TslzQQ15
ziOClJ1iR9Nl#~?b>c}01eKPPU40oKRw$vdbw}qhPlsj~s4(6=A7yhiBLT-+}3gLnY
zl+7zIw~z;1_L9w{-;C#-#S17)<in9927(2Y2ju!oHl&R0?DXNSQqNyYiPtQ>leJiF
zzUW5Ht|C>7B#>>lfb~NX%d)r}F6T>ZdL5T^G;D8kF-ec$wQ#%NQ_cNOlI2ukbp8fg
zlo}i|a=%jhBb@t<F0<<)B17agXq>1igYUo~U3L=*@}6V7bR?pbN{7Jv^I8`FV}juf
zJ{UwiAIEFi!tXWf6y!mor1t-Q2QL!2pN;opkBNyXt7*N@Oeb);LReiYVAD&_I33|y
z0RUO^KP=qh`+#LJn%)VyB6WIoE%Im44CHhP3W~b=1_^8ns>`E)@V*b%=7end2oLOL
zL10M$oNK()h4L9~o!MiY1PU73cxZd%OB{0Mmsk)^YXG^Ro?TQFnknG@_7Uw=`b&jw
z0|_%T%Gaf(ID!{EhG?A*P^Dus()0+T;wtOuMg41b&r>hcI6Xh_0?5_@Bvt^SjgFUo
zg+)gP1LzIjcXrgcLdMIhGZ~L7xesbg<KYyR-T7K7fDus8(ZBLIZ+Fbh{MU_<G_JM1
zJy<q2w!`^a7ywKXVupPHxS{FWxo}+VcmvSJ>i*xdCYL=lMuV2y`4JNUA?S7L-UGA*
zDqaQ<KB8p4QP#*%;cml&?gWRUQ@utk%h^go6BCnahc!`ZI=W<?QGlT@Gn#$wScgYO
z98Xt8PVdYSn8v1Q!M=%8rtSeHh_RJbH~`e^J>lq0jvMcIU3Qg^y`u3rx_+7up%4-x
ziiwHMEiRhgU+ziE%a<2yB9OPDp`o2V-K=zXcD|96lne_G@0=`<n<)O6=QLZ}*w_y`
z=lFQ*NFd}#@csLDZns06pssy|_d^arI3=VQaC?_~Q)X6HG619%0Q@*x+<JNGa(~uO
zETgLWZF@9B4#0{vYMSnW7_QID<LNxk(%|#&jQX;BdwU$#V^dS#e!J|6+AY-OfU*&E
zVw0Li6|kW);Q*{A)Yo$WHevE;v607)kAfl?i(X3@4-c=-a)L}&PHv;lHzZyXAdNaO
zK$GL+*}?GGD%RGtMMure&8mPZ80_l{VbuTKW3?#X3-(|2!EBXkC=y{YSX(Yv`|34q
znkp)2BR@DI**Q3@=6{+e#>bP++?e9X3=9lpgN%?+SV%dNDUg=%?OX4?pJt^VsjjZ>
z&&I|gd@lR2tJahku=7ZXh=?{fH^o80;PU)<r_*|gf{iT&@COA01H=7RvLQ8qU^B~U
z#yu2#d})CClF-r70pOWEA~v?F{ss#!*yVVM#~Q2x9zMP)uq<tkS`X+nTCC?m<S5o{
zv^TT2SEN!*CkYTHsiCi_sbTkf^q8EO$Z<WG`Dkxn+4GZG-gRqxdvc-PMn+dRjZ!{E
zn8~m$Q41?TShzoyn9AANd4IW;Pr&C^*+hGv7d&}zqd&F;&^u-(CZYh-QwnA=?MaiN
zqM`y^PCYp_l?RB8k0K%g`Q$9}!RHqjR)A2+FDsMFlZq`I7?75cLR-7IbJ-mywVwY;
zadotq4B#6DVacd*uiWrkNXWsFk!YPdOOc1`qr8L!G#M#cRZB~1L}J189~?H4d~S!?
zKwXfLGD#W3z{H%mK3W{?==cH@0}*3m<AfNsLO>cVG`ln5;^9qZi+rV~rk0+J-$&4B
z>Dd$`WoBm12CG5U*f=MjTwTY=hy;;<TbUA$QPm$%CF6ioE0KuA0(1nWYH)USbzFnp
zA|-7T;M5XgK3r&>t!tT?eTXI!&~brf6cRF=j4wR`duF=X{WB{EM?w3`vu-J)E9(_Z
z)>sj!o4<vJ!(Q!8g+4w$ejqGa5Nw;~`813JN~;8y52-+(kP>w98=sk}bj~0q;`b0{
zWMnimHWr~$$}~eZXRQ|{K^X`8c%Z8*@VD!MMiH+NSSHd6ROs5h5t@j-{r!=@Gpcmz
zOBLPu+MRchT=%EL!4yhiGwQ{;5ERCsfrat>==t;KPmsB(&QDL{Tt3tcj*b>)`9G(=
zuGfV+t3<#*>`AkJ#jLWHRa7hlEY0i?jGqv{mYs!#)NG~xH}D{&UtU+V^`BSTmIY4J
zGBX*ys=Y?jd2Fd0WsQ2vB2-jWtvUmt^Gix(BC#0OeFku~ozB)JtU=&;ZO{d2Maq5e
zzkeA7goJ|&3%?DJ>o1L^=@b+cbebGU@&8m1lzXV7@v+{s<|s=SDQ4sXX6PGWNjyD0
zWq8rZ#g13nl|W#ikdu?+^}T0r&bqI<_c1R5$*@#dLj&vS@nS5$pg>~9|HZcgiHBSc
zPyvKIPC@bt3MGIqy9AgYtdsndsw$?d!})I@sqgMi<e?E-N&W!iH&tgfwcjgn1;ny4
zgkQgY)qNVlo1B>`1cUx_SvW~eMMV?@Br0-pay(`u2m#-FwaIuK8Z;CX;irciDpONa
zUWZi(?~v*QgEk)-bV_*=K!@oxyV+&zkLc;!vi?y5{G-uSp#n1phlGa*cS~#QYsjs$
zomZKhFbfO}d`qd68C6_NjX=Qtn&taW*Sb*wqqt`o1r;0;Vlr24jKk|f{SFZ?2y7bp
zAMCII^CrYlqU$u+hKq$FbuKRA16qlhjjeloB<-WJGRdALT6e^3J6K@Vz7K9Gsi~b~
zW616&%e`kC{l@n8jC+=>CC1$$DLl@UAS<k{uLlbH-WyF9e+&x?i*sQw!~*)!YTD;E
zk(D;ziB@lSaVaU&z9`)Ph>*O&4<Hbvb348T@#+*1G^we$ZijOLV6(|cmAPKEc_bw!
z8i5;`xw!f>_}m?y?p@Obys_*S8<Je=ctvDnU@x{u#B6Qp4-XH=0sj|8EJOsNQ9?{#
z*hlbbUo?IwAh^1hmx(}9yIl(<&d$&OuxIHiy4Yan2qK(byC324@v-IKQq1T31E$kE
zAMjp|s|F8PsJ8@yJ~&udSWOOVZ$~o)BIz_ML1xvMoY?<>iHS+C{Syj25{#1=SYgbr
z`)WP?{c^l`-8oH7-14cc&>6h0fxmw7<mTnoc$}I1N9yQgnAzAwmT6Y452vzmyB`zK
z)6<{ctoVzOph)xnxRtpt)U|u?Xv+YUWL#roV?Oy`xzyCuI+I?w$?56*R5lA0z%+vJ
z#}Sv9D9-k2nB;ZE|8k-*|MW?gSEcpxbWQ9vrfO?xN&B=3JC<?v9&Ds4^0^<&0~+_6
z%ZHFK5Q(Hf2BQQw{t)sPEXdE7=0#J52K%tU^I|(5Bo+qkpF!X%sy%50)PJqs$>|vx
z6s)S=c5A4pGWUl{1g~is=nDL>?}hdB_<ej@KE`FmihO-%d^lILJ6A)F$7N5ymxA!+
z^whz_!((-|Q!FgZ<IQZQfU%L$m+OXx>c&PXknF~@MIdGR%Wqx_#LGQZC@3frYHl1E
znZeT*ckBvXU6aZD@ZG;<*v*>s6wB@Y$o~HR1bl7?npOH=OCxC0`P@U|Kf-g*Ep|^#
zpoWBmB*bWwyOzFAnqHd^uI|fHNcAbR*8D<uT{OofVzjcFGK42IbW4n1K+35FkXiyz
z$wN$+k%c-y=`uNSkO7#;*V$flI}{cXA*HwD?I@4tz@w$jar{;|;F%SijJg&E2QMk9
zLtChr5va{((bdr*{CeZ&rrF_AJ^aJ<6SM-Mev0mVSH7z*yM|3f`XN0AHYVBj9UDI~
z(wEXwIKW;-gB_FiF;6lf#^59<D9Gp_%YXgx4xYKTIw&;>F=|9dtYb&)qc$!mCs7_A
zak~N`jsDKSUF<x4Ar%V|R#d$HEi|3FjtPX3nvK4kV(~Ci)1r=(&@qz#c>xN>gsBsq
z>oQZHk+{YP`;)$fviQs~+b%S{DJ?CnPicB~a(C~~AN*##Jc{&HO6iM&)Q(eftUqov
z9Tn2X^T^Qc-A)#>`zXPwM~jZEu0Ja3^HKw1ndVCb5oCzTX$mZ}t(C?xZ|8Gclh-;*
zZueKm(~dWw&?@>qIrA9=SFLIXR7Qi}L7jP2Mjf*&xg{YQsY4fXvZt%pu6lwE#IlH6
zj)w!NpzfSDS^rUn-g__*PoBc%FW5Jt9oU)Ir)l+XDg9>d;z7v+brbnUw{gx{-$Yy-
zT=8Z!Lns!y-CpuRQNOg#Pi2YDR_hwF?gwYzmR$0jxgpD+!NJZ|Qn9!*LFC98TM4$j
zQh|Kp+g(!5D!ni9hW^nEX<Rj55m_RlqQu*upN!|1QK$SLp`-E2)UBTAsE{`XVi0eg
z4>Wvk{<)jT4hp0)8NeVD_58%7JL?aa!X(1{zB3c7%H7~he(b$$3N4Qi_I|1@3+bi9
z=&@l3M-5-S>hyiEsO^g@jmmniWB<PlOlB4!Jc?8;w;CH+hiebbsy!sA7}z6e$P-5@
zUc1fJdSlZWnV{g}h93*)o&8Iv*yTD*vu;FxVZ{?t#0_4Vt?UefMQ?(qmSoj$dBrfW
z-zZI=I%_TMZB;g|mzXm9&h*nz?&+cS_xAxwD5!NerPSoqO=xqI`Nav5X$%du;pI2R
zuakr|#|KU+SlrdDc016RoI;y09=h4TO|6eDdh}_W@4A=XttK04{N*PhPl%5XaGT-0
zwZAsG#O3|jcSO+=(VHM+xUstuYQFANHH=j?JS<;VZyhRe(9qcs=uN3!<2(XK4F~VA
ze(N~8P=50E?c4FH7l^}Ot8WnP2o)DF-b1AbxiNoI%!mTP%W0Al(src=@&SpUZ`(AP
zl!<9fCT*{TCPlCX+rD{46hsAXt8Z^4qlpkjx+PO;E;UCN5Y8|9wlObIh_FHcvEnpS
z$vt3ucJFtOkiUgxdL6-$-k_@$7rWQ(y*B>bK3;2~q_|jFwJ<+tG~hn9w$E8EdjK0w
zqB&Yn7^KMpnnF!RZXm3y+g!+U&^kS>R@Y$rm7$-pySjj;$Zn<8?Tu#aQFraCk>X-S
z6~J6yX{HKI$0U%cX=ya8;uB0-yi_4FWALJqo*y#pUhHr$rxR<uXBA^6BSH_aVta%v
z^g$-*noGSHGE`O$iu<DV=tKs^h9ubilg{x}eoIVilvef+Ee?k(Bd6_BkJa}5A0U%K
z2lpm1>Geph4{_Z-9Mh$H#m9CzVR6`6g^UV37@xHHrkhoqTp!Lyj<(+Wsx-a7OPxPG
zjp<WzUtW8{Ms^LBj*;m&UaUI1dD1FxzlS5DDd?X&*yAMdybA^>5^aZa8Xuq2zN^^E
zSq__Vot#Gyv;Q-H|1mbj*71%og~vW&czvTP%EO?v%8FuEEIO4U2_2mr!2}Qax88%*
z9I@T$R?T1fB5%6V#y#|gJ_lBLryAR9GVE>_ZuA^ML1BIO|GCHOK>e6?y8%2pOYf>e
zpO@MQ?enPAthJ2|I2o71`UWv1F83GEOc#oCbB*<2VpGEaoDid?zbX5pCQO22G`8G|
z!~bbgstPT^uyEBt1(tQvIbKKpc9M%qhy}G7682v+zNwEpx_mv#dWaNRCRU=sg59)-
z?Y{<v^2LwDJ$-%aAO!8QIPCf}!NH@Tq=#Tb_ey)H1_i=Z7Hv@#kWp(A0m{sv^71bt
zB7+_G@22(f95EB=F~U@O1AzWSXWPF1F_hMdvh+ijKg<p0<SnhXHUC1BF+I^l0yC!a
z=fw;kWCGPjgVWQ~Rc6;GG4ZAyN9=#J_vH}pFKQjvCEub&MX}7vK#tJzn1)y#G+2%+
z!!jAP6dNzx9|0X+Sq(UQB+_CHwmLh+k?l3`n61?MDgz_dUg**JO%>{_Z^wA|890n7
zh662ZWlAKPZ-}G3iQGx^+-dXsK+!W^BK4Www)l1>(&<n2=c?Gu!cyg`T@4B$Y$B$u
zHIoGtbN$4V1$PwtC8r(QpIQTEXRq?3j;=cMeRDdQ-l40}=>!&S8J)<tezC8zOQD=L
zrjYIZu{KzBnUN6@fjU!#mbsN5p99qbh1)%#*k`)u8vZEL`|pi`bhbM9jq2~{AjGQ<
zF41f~T&$cBxl;m_Q7DU|Ku1#r<;x+eCu;zDS8O9%UxNn<K<o5I!3%Go%r9tZI@^(=
zbxnzEFIX}zaQXQ$8&8a41t?`{aG6xZ{aQ~-Pe<_dytBU2p&O<bm&~Kw&DK5p)M#xQ
z9A{yBz4GHdB4TiAfqL)|Z0)4u17=P|Dy<6n%4(1%!Dx^f>971Q^*0a<3=3qN^R;+|
z6)U2mnVn73+L~MMo#5j8lBG<yVq?~}@PzeKL)QbxggP|r%IC@#I4%#hspWMwA6xIy
zj5T=q;nQ-Hm;I$q4S7*#CMO^RP66`U@5A6rr85A<ySiT>0R#zsPxsDUb@i;+fo#-S
zY3>5fou?-z(?B8(7gr-1GC4Xc`U<(?`+SN|WFHk#NB+K#7h$S7)?qcFV`<F%SxS38
zXQ58c^NUB>PVMRh6BAw;ifB{rNXEmGWDN2)Yb8`o!!8JbxqTS(=Q{)Soi!iY5JN4}
zg*sI+^UONi>EaqS|0=UekHMiKG3xStaU0PO!4#zOMbK>=c3LITSa8s+JX!_p@yW@d
z*<CRhL3KC1DCdde69&YsBOI%f%kQ0eIo_P;;~n>EuGdnU<Z(?aU$tpMD2aeLi{f+6
z0w3P-&~@dB1~+e^F81Q1p?;Jt?(xD49k0M{%t$3{PfG0R^H!J|1P{8bSf#OV!J;{8
zCy<!52I9XhZEVN|7%JfWwM5-~&_>NmO?_tjlq)uGw)OXQcXu5wgzbC_^76i3CPpyv
zoq9kZ^HYaiyYd&;D}O~j5%_ry4e_U4CN=Z>TXg+@csmtimoBcTn&D&ns{H2A#euQu
zD*qOt2IwfMOMLvo$;laleXFdcb){fVRF*q>mVTbo!NNP-SmF5q9q*#}vT8IFvG<OW
z_jW~d#Q6S1Q-K79giU}@M`!D5_d7k9Pit%I@u^uUA6xb)U3Gpqr{_g3ZnOy$F)vHu
zGB_C2k2#&Ya}zo5WKxj`c}Z9waf_GO$cbFiXhq+=-NbxzdsBnS!$Y*`$zM#ZzSK-f
z$q%(cZ_uHAd_MpvDt9i;R{ZAlpKv<t?+nbaMIe&=k=URlq(|bvIu1=@7lpk?H6ELz
z#^>0an|Sllm~?mUZ~ZylV2RP3rrYhU)zD)Y4*e-*6oFktWkg3mWNA;YWM{TETL}@L
zn%5ZY(8NRyc~{^gonGaxKHRn7fP^y}W@7%%s)Q)kV^@O1+>U<mfh0eCkZFI6UQb11
zU0vO{S}V=EN$wx*2pA4iL@f?_ahG#{WR30i5sAPSTAhQW7be=(3sh7*1aEKeQKwlc
zBqSt{GkTP_&&l?zXtGdHXlQ~D6)m*s@=0;u<AU@3t9Z}ZmN$a4VSnQ!3v&7q62c-X
zAX-|gjXxyPXi`Z@i3cT4pes{D8NNS-#uSJVrlv;?BoPn)68a9#sUSkcV;sFi$Rl}h
z{px6U!a>OY=aPAlCoZ=WAMBd;ucl+vXrW5+4W+8a_rrD&_>q)bm1zlyUk0jCE)Qm7
zhvu#(Qzr5pFMRGc#Q&678p(4pU<9@Hl~g0}hfDVvK6^4B%^~RHCHZIEtYzR=|Fnh?
z5D;KAo#nf~zXxnGLyLg@cP{}U2FS+D4;|#%??o2)(y#sDvl@w42-k9g72pv*j&z|`
zHJ-+JMCQd_j5o-!&Cw)3AKGW~#eD9MDhaMAS-ihG2*n-?j7vzrohb8ZiIM4)$YK{i
zhjl~^Z$e3rNnzrz(7R-WoXz~OzI(y{KIrcBT&AbD`exwK^5e|~RsM5r!e9+zuqdg#
zyh6?GO-6NH4V@`Xpz`flL4?1(+&Kx;SeaV=v&}};xTb2?_K0m*@ef&P=`Z!sh%w7&
zp%K>5YzN;l$OS7<PDqjM6fk#bHGIB0J<pNedd|Nv16NlJ+N|NuXh!#stF-F9f!-H`
z4}o#ALa;p>XCycCQsm+ql7h=vgCe@BxIYl(nBuaMBWqfD$@u9p0M&>-3+e=bq@}Eh
zh*hx|v0EuI`~=^^DILQ<@aLce`_OXuBof4ih9ExkJisBtb!=k_t(wb6skv9i53(@<
zOVeYW6NTGjLwVE?0Yt~cPFr{j0mkOmcy>bYfvd|0dqdu@(D1Z4T=69sfa*oS=R|nm
z(lFUPc!xOAC5))?t~7~OuaPMuAiD`dECgYVgoX_;#2DC{>!V4|FRcR)H{O>MV>`j{
z5S!y_d8j%EWj-2B5JoE@hNpiu$+gzj2fBjcSN+Z~4En_*3?43>6JAu-?kz;9ITAzv
zND0ZlU@`j1z`9jv$7E?%Chm7Z4_KbP{Q|h{&f;H!F(>nXw9_bhd*`YTVIJG;8Gv79
zRJ-}=-wcB;%BnB2Qi}kO4cnCh+VjFCT@10npZRIRs|-N2fS)6L@%=;=7R^Q%d>>aC
z1AT~%0RNv}ue}0jLX)E(E7gwk{D~OY&NIg~|8$L(a#Q8yrD<w=;USO+`NO<5$JYN0
zc5~R5<u~D!?I;8!74QIfsdM2F753u?o+5!z%$3lnh?Y?=p&jix;}yd*m(>RiM#DGw
z`1nX@ti-@`VRAQVuElEgVNzAqddWa5TZoe}D-^vyR>pOVL;LcTKA&7dXdlh`BlD)&
zX&vR*P$>2*43a~f;-lGbkG+F~?t^DvJ?<8Ks%!hQ>w)$8)V*8JQSsf<vbD`{bb=x?
zh>qS~=px0;9ws6AKJqcXzxF%oYEW3Q!~v;O?fosDLJlc$cSfCocPWCda99dB0kaeF
zm{rQ}496C4w?YTGo%b<*yB+;87Tx|jRir2?r+`_uEidl&NwR(c1umGo=0-{H)*G3~
z!&Ai2Fl!@mowVKhZ>f3$(noZFYi{?;s;WQtXEnK9L($vuWrzR%R50uB@POy{gX7KD
zhnsG|Qvkb(b8Oonf~u;j%h4qC5ca6ZWu$|at3&yvxEGjHZf6JVhKgs@gPG?LKtzV~
z*gehE+hDV++eG!&7K&11bank{m`t{)Wn-(}Xe?;*edZp`;)B7)#%3w1Xb05j>jFc$
z<0r8B&J<V~KpG`KLf7i|IAr<A>dt!3y1ag0Z^O*Wie{=~KWO&3n~9X{@s}OG-_sSB
zme#yc2iJztzfRQXX~7e$7fzZtvHT~i3nksk>Ka{bt=_ALhx7r{SE>dEMBg7zq+a<A
zeXlF2m9;!0dvjH<2$)A_aU5)P=rMe3$Ad>1m%}+!Z?_Xsz=(cGY<UikjqSggvLGQL
zfq0UPl8k-RJ>FIK*CKE6m3E+g#9bQ83cC5Hi^JzeKl5_Nr>0KF7D3bvO-xMeBjo$;
z96oZD)B0`t@0&n-w(ha-L%vzj7H|BT9Ji@+yRwY8{$Qr?S;S1ckKd|g&=?S7U}3FK
zm#A_(oT7k*pOaI7=&E1`Y(zOVKky8C#YJn3#hRL2x(~J4=0D~@0e>cv_svl^0DY*K
zn6P=@3*)|Neo-d*lAAzgLQbBYR~8un31c#PsLjk(^@vK*T>+>_01eT(?3aYCW#Qx*
zbn=R;-=D*sqnzG_d1JymJHz{2j7ZC;@v4`GVwN5NUi%9npF6O-!Bae<;iUZ_-|^*C
zQ^O}^6F|g0nH+CxB`lQB3<6Y($j6$}ZzwXMyqYb7lo)Vp$Hy^_8+{_Gs^~zl;r2Yw
zm(ReP#Rf9kT&)GN-}9XXAnJ@(I|AmK9La$n?Ipme@Y>wM!W6L4O`aE&=jZ3uURT!g
zS%UhX+yh;;_4EiNqw#{v%jxrriZ%e7u{%{1tA^a9tNpVbdMhk4vK!DNHTM5_gMx!a
zG&Ql|vFIV%hb_8sWX^!Z2Z(jYhif|^8@{rxz*L*$Lca~~cCs8T6-z|Q&5Z*bY<=LO
zGeGRg^qUBF?p(0=Kl|(+9FViKW7y1;cY>$Cy8ZSnm$G?j1qCw!i7V6Wc7)cdp{6Dd
z34@dmh5$8dEZK;6+Z1qs<bX>)KRZKh{SFAkPoF;3nGa#XQkn|ZTTM}djfY7=MHOjq
zaeYk*uKVr((gyX7g&3IB*qt_|ot&NX04qCj0F)9EbaZr(n&OOj?WaLPH|}o)s?V68
zCg2@??=N$}0LKfH#0i6UA`=UKAS5IN!*+k@es*^Ds_g(iGEtayTXnJVGnK096w~qT
zD<99x%e#4W#1LRM+z<8kO*A3jcaU(Uf`fyzp8STsGh`8d8OQ~bxZLI|2RPX}zo&cF
z<iW8qDxlr|;P+&O4ervB@Ao+bx=DUso)}=0%^5HD|I&ez7rE`v_hne$ShD_;Z}9P_
zfZjHw2Q)yw*DL+}{OVJp?oyPg^8rJzGL|V&lAWDByyf@u<Zk`%HwU0dr%ib6|CX#L
z{rDjZc;|u@zlVx0b3j{{*vwY4gAVIR_xbDtob>uQ;F~UWm`_Q;v0iTBAr$m!!b)Jt
z;`KNs0SQ6H*48$7Xt~MptBZ>ZH3>;J_>LGIF8Eb;1O9;=V6TQtegWFt0C4pJqpUS4
z;myAMnXHl$m5q%JAbQOUW`PTn6bLUr_}nXTK$ie_Po}T0{~N5yg4yt}uxw!7od7Ij
zFX+;AjcI?ecI_t&TwLuvS4P7&`QN{PV-OI~_n8eBeHzPD0M^ySZ{Jk+Ts6wIL^wG)
zfeW0m&usl8RfHxmEzi%-Ps9p6Yi+sK{`?6_+yE>Foev@4R#7oA86~AyM!hEGE!Xbu
zZW$1|0Y6Un`ki8ypaO6$Uq0Sl&{lZ^p=fe;wy3(AndJjmWJaBD_W?r;wv(pkKMJ|;
z%7F0!0<s*!V2{*xq^6r2r|aQdJP=dCp&Br6uX-quHCT~AKfRtGd2n%X#sSbT`zHy^
z<UppK=#9kYkq1K$azEI8Qb4u%YrnBM_|447NEpb51YlubZNv&CWMmKmhFqgL`jz4a
zB6_R;OJ=v(I>l5k9Tin5Ah@jnzrtaE0BDKnN_{?^dTUIW!4xGB6h78|8UduXDTh5B
zGjkl6HQ+$DApm`G1v>5I-@m}^n2p!yx;cRQI^7adQqFN;+?9l$#Ajw^()F&>fF|O4
zyp#$y505mZAFFN)*a@%d>Fbn_dLb0%=f@QkP*mtQfB$x!;t!~dgq$2wLVl09T25ji
zKl#^9aeO?gorW+0>;TeIQlXjro?Y5$KtL)2(=k4j%*12!jr1N^0*Zk;n)u_#9~p1a
zoVY^yR8i2IqQ^%cz`{`ho&y>hdK%DUKe!#qcJ%+Pw0MebZ*SK*Zb-gzUBCjS(AU>j
z=YC=^x_EtdM*6z_-A{;i>}I9FPI_I}z)YXs2;#c^tGL?R+v~XO<_k8V*U#$eY`~#7
zEK~}UWP^pLda~R)bY#th2mmW47M2Q-ZuS;Gw_`4f^#KF7nYFbn5MI#b9<C%rMc<~T
zq>KZ&-Fr=BU|^76QK5iL#E-J*Jq=TT<_<JCp~30tL^O1C@cmUMknje<DU5`Kgr3PZ
z@Gch9KINC^rxIY6r9bI}PNg*ftpW<GB)L=!#>1&1IEY|oYb)=4eTaJSFs%;u&8uRa
z#%YK4;O*(*5!cYrpf!rK|8;@?>%VH_udhEx0!{EEh&42Vf(GR$-8&b1WmRptonK*+
z@eE{t49K6+($M4prE3DvM<u|>tg)<0aAUpy75N5fwnA6a=7c83o8v0r!!co{c5TQM
z_*uekg^!+*u@KPr<6)@e)^=ICn5-XLfS%}kF~S8}t8q|NL+mw3DBxA6v7)H~Jizhb
zw1K#nozuKrrQmzuoCmu*{Bq%OTn@-s?)A^hKX*^sU%pFMjR3X!RX}`|jrZTNCEtTM
z95>kJ`{3soH@I-R-W!Q0l%+;&W@Z*wQNh4q*rsQKE6AU9c{tA&ZXO3soSEG0vX{`q
zO0WMraiQ_EVcyQSeMlK0CaQ|BsdBDorH3Tn>Pgwz6ECl>$~aiXedCgnB(ATo>wuOe
zBb75%r1Y(=&97cEq!>BhDn9{O95lSm!tPg>emSeKsL1&F3100^h?RndEv&2m6``~d
zbM4zKLk3cg67Wc0b|VW(^F~Pj_ZjfN4gjfLF`b)qkDAPRd${C0#pkLsl!Z)8RP+O3
zaM&~Ma0{@VsIaB{n#QWyM_|>)7mNz=yVzEck&1^SmfXo=F%8Mj^V#mU+y#<cLJS!u
zP~8fgT>OCi3q;&jHNs-qg=l;(X-&=KOhI4TJ?erM5CeE!fszM$YksCvZS;nRn_K#*
zAn4Dqm^!iaMnRC$(ZA+oa8Th-Mrf(vEi=;s7~2noTakiGWDc&bN#PTOJ$qpBzHW|K
zVxjLYiA?tnk;0SUdHQFsMicXO%449`1M|Z}Bpbvm%%m6kja<X$6$6Vx!KxTw$Ob#9
zV0uiN0WqWY!w4{()YaD)_(|>~gI{@4eyL$HX&(q4?Cr_C27!$Y@;&PM(&FN<cSr=%
z!13_?zU!(J2($-1(>v-wBmG}Cmys$-j3thYp9wYX-2!JHUYmDwvXbnQC>wuiw5_7r
z(nMtTvqAcSMsPO&sk8UIT(?niyGXDq0@NA_!NJf^py>%QD!?s7!45zr9;a=}@KdaM
za`|#Z+$MOy$CjBYLnRPASozF`uMTDl&bMi2&5Q?9BFc5@WedTR$^o<fY8<IBXw{kg
zSX#gt8u+d5%NK~$W2o=x>GBo8TyWJnOkvQ{0M36<y*p?D#vu^SWfU4UDuIB92lm1=
z@{Bev2yV&qC1Z5t0s<M}EQM8X1SV)0IVcIHs|<x|kh&PGh>6;NIc-Jc=S@qCii%bs
ze`5R(KNOn-3HbElq8kK_8W7Z8?N1;e$0a9M9Z{!G{QVm&Ows{Lr<}sV5MbgeE-4WK
z!G3b0G<t03wXy<^{}eV0r1g!B8s{BFP;Pzg-;m)os0YI8D<7)W>cs&LwVXabd0Sdr
z=K}FFAx6Uv9N7Y9{x4Zs#K4OP3les@R?S;LIlX@OS~fN`^Z~r$2y6`OR+CVm7vqzY
zfxxttzoD+{<;4pKKnJj*qwu+qz%heft`#3hFa<!2<^-m!l(aO{wXR@LPeMYx%1Qu3
z+nx-PlcDJzypWMCwfW*VyPtqSg`AyR8o_B-PJ)33^@x^VSg`(gNUN^4p|1xP{q&!^
zX-;V=?Xece_5GF2;nFYoWJdiiP=!rYW*~NKfQO=MYp1^V#X@j(bwwcJ#{pYWvP`4m
z1K}@G&|Gfd4M!s24g<m`4qn5T@NirwC+BRiB2#v$%|3m~2CYsJeDVfo%LaR5dRA75
zL7~_K*z1m+i;2+%tk({YkN2c;$#|Uc6M)(ioB(47nhqS0qGhBi$3X|GKd%_Rz6FrT
zM)QRXmse?|<m80Fckmt_F#y00;12r=FkEM6r^sYH-3JUfN05cV++u^|q^vnBOy>*_
ztUEdJ*5oZdxL<>UJix&(<3EXmI2_g(;0p@EX3>uyKZ0~lRjKzY@KwvIZIGBM+y~$T
z=JnLoTR<AUQWGHQ2gS$7=aXwvgTrW2F$BG9?JttMwHB6ECUAs=`s~V<5T{>u#<~HA
z@`@CJ0gn)rfos><noC7)z!^aS@XF5yE2TwuqU_>in#5GGtfvnJn-h0($D5p%I98UH
zllNEn`PV8UIy!hDfQPVi)LYJ^2X7wTL`~L0XW6Otd6Nntd5FMcF_427;<1_)Pa6;c
zD{Gz2tUWN3<QVwgPYtqZ(g@x?VGEVb&7N4EJezK=j)zA@>&jkH#Deos;EWihsVVxs
z>xiV+HUDwsVq$`U_%Tr%`*pL-bX^JWNLIVBbru&mYg`2CB2_RBuiv;l9FxnAP(QD`
z1qZbk8qV5&AkX{0dAO7csy+(BcKKJgnd!#&*H>L13rW^gt@t{ot$$m2ZB2=0)-%ZB
zVgP`Ace!K*2goLY{12*@WGqGl#aKuK<v#B<i^arD29nS17VI{oqQKUWJCWBfedMbR
zY+(nppGgoA6`b(dqecy#it-a%nEW*2Yd%{o#ym}rE_|n_*Ky`OikXYh{P+=dB%LSn
z2M1UF)J~<pWA~qXScLGLN=5;&J5qsFk)kr}uM+^Vz&2L&>*9OjU4ZZ2&QY_d&Tk(M
zE1QGyvyDkMcH{{mE<I2nfV>v7c|fdDFt)#ssAN4qnb`JVPXNqovR<mP+KtEVpcXGY
z;zRa1Tg@%cO3}2s{5!xhU7GZ3TA^nB<V5Myr;6Q!&us<Zo22E!OWQYgi#1t*g5A4W
zDJZ*nwaED1{NrY2VaY4wI@a8K8JiV~0uD&&R<En?zmW|4W3zzAhft?sUDNBKvU+kc
zF)=~P*-uN}%j;f+3*#T~i@f$q@Aog2@!C0BUyY(M{i07Jh*mS@^XW@m*m7t0J3J&@
zufuk7`19NUyZ{DNpVX}&g#>Nx?2af|*=<QVx*rGxyrM2I5eNQtH|i{cWqH{_6{b_i
zTu8{5zY#Fup9yI#A?NoPzHEM0R<98L*?BMR<4Sy_99IdA5eNnafEQQUjl8w?*9eVG
zrvPm(@ie;3g4uB0J;~cCZB@u&S_f9JfbG$w7)R<CT!Z0YAm(ZoeC>?&2amKmd(luS
z)7Umi({CyTCLb9|Nrt_E&kprMbneK8lT%#TfW4OUat;flrlbr(K-3-o;hNU&ajlh_
zH4B@{K5j?->lb3F<C~4gOl(x+-Erf%Z=b{3kutvo28u>6@Hu%10rxbfpI@wzkx|FT
zKcxU?@^24kM9OLM8%-?yl(g&ZZ|%!|BX9~TOJiU!A)qM#9TkPZ&Fj|XzENay+|tWo
z9T|4oeOM9}8yOthpMmFp%mKVRXiM!c5IamOm5!mo51B+MoOilQEuN6zG~Oz;@n3Kf
z(5T}}!0o{*tjoc2t)$&6ThJ3H2rRiU&$k=yF=!fuSz}UTnXN`kKQSBZS2Q;sNd4`#
zV>hIbJH}4pwfNxF%wC=$Z;u)|fPjzApfi2Atlks3gTtB6<z~AQ1Dq<(iPig{@Lk>9
zjOu-PfL)2s(#vZAMP#8o0U`=9`2P`#uXjd0&p?F&6GJQ%ZqwMpy%<pO@}vp%?sk2q
zA}zg6Q(Z<z#&TsD$%1JqD7u1>%C4_ZJkRiC6nL)UwWMMIO(3SPihS@q*CpdE9u(gV
zlWn7E@pw5UtGH19I+8}`i6tJp=uwnFg`eCmh_pC+_>qM|K=2`+$QWRY>YAF~^CTN)
zvpLQ5y3~xMV8&5SZl7Heuj_JcmhwvVGQWTma{qKe_vl*uJGtaxF*gl7Q1Fe)c&h_1
zRq$v=bH~4KY|6q<>2<hh1YuR7H&PA`aXm<@z^w`M5Tjpx>rV&%TZ6G3QzJuckR!1`
zj4s!}PsgePhP)h8DJexE?$s>BqmBI_fOw70H)X&fC~ZiX-<^f6<li3$j*vW8mI<4D
z)-Zr$iN~&y6`<EIz)IUa6uL+VA|={y_nNPv{e)V&y3RB9=AbgqRY+@?YdbKC9#a(L
z@Lab{!NE7_4!Ht`16+7)Mp2)2{72U>i-bRSTKQ7AENC-@kgKkiTY;5+_g^zI(d|1i
zC8exdH+XQKtisv5-0+1j1c~sEc5~7em_$^=*xp2|zl)^FTQ~LXB>a~&l?JU{R?SH@
z{x5GpLPI`w_vP_Tr=jG9dwOsICy}3-IXNrO??HD#@5JhB-hHpo8d3j7_C|?l<ALVY
z4qmF!{0Row&rjoW-g{vU-4!sCiS8)Q%+)+8HU0kWwBRPO(Ae00o;8!JdEy6<ob$t{
zpiseztGtz?MfcUr9Y_LRr;v+=sY#WIiAoNGwy5CWzj=3#k}1f^AqxvXd7$CIGT0rl
zCZhx4zsUj(4UZav=T{gxqWtmah?wXhp*&onYr4z5Bb}qXm3=%8Yj)cOK9jz3&-6y$
z#iw_EhP7w`0ZKfdoB08ryft2a>mdZU_Q)eGn1#IPM%m<dL<_hPZo6iF(Hs=;I&Ozs
z_CsEWm*U>fWt!Q=s{){I*4T38p)&wd1E5<&s{FNrrBu2C^=_g1*QTe)niMT#QKM;Z
zL~ageP=R473h;|gsJ~Pw6@DOy<!Pj}pY}T?(@D$qRi?2tWyHl2Kx4k1<H^muXRgd%
zCR}QVL2k9*Snq`=C}hlvjYS4_WvG{qFsF7iinI6>585}M1@Dp925n59%dc^Y_1qrj
zwY975ECR#Au*d)Y-D38<(`jtz*%rF#{N+5$`hB(gX$jIc40C<T-z(DxT|Uz?cH)Qm
zi$I|gA<P33r&gUMsYNQ?Prs*kP@zGlv&-G)Og&>=ZDO5aZo5H3wn-$RXt#zpy|E`z
z4!eA&I7!asJ5S^Aa#*817dva|7ED8+YBRQ)f7fYT1y0X@d+{Ts?;P3{Pxnm!a>*HU
z+iifFB~jJGDh}nX>kMdRC;+)kmVdk;;Id)}cp-HJ$xI0<eoe1q*YkbE!*;%^b~l5!
z^ucT5{=e$ZGa9a_5BEwEM2%iDL=U1fIuR{F1|fRyz4sa<(TNtKi{1&Mjy6U&dheq}
zXSC?|<X!K*_tRbX+g<B^v-UW1_St8jwa<Q@-}4`ehC|6-BNLss>`5|i>z1k0E6)S!
z6tg$&=c~z`5)4fq?%TWDIO+7EzGT3v)Cp2)|4{YN4n2*H+ij%2sBPJWYB4dPClI~O
zfnZHUwbY=`G`7bonYKGdtTWv|XNBk!FVvheA9_XykJ|H4D)UpCsY<{7#1#4Z6Vo@<
zh2*+<&+VH~Q}j2w@4P5DR>Y-F!>TN|hnn0h5lQ@REk`i23;Wql^~dt|psZ#VJd6I}
zv5bX?%F0U9#bSY2P#HjA7AaV$J3Kv|(lUw{v$b!C9n{r=DTt{|<jkn6I?3$>M~fXc
z$BjrCO*VvFpvkFeH@mYntjrC6C@@_;t?zZA1EQTpRmg0TR^i6H)D+z4rClMXPEff~
zJ8j-rv2Ig~+1O9`qVLVpQY=9-n1uG^M4ZQl=ThbQPyopjm!qWM4R<$|NvRi~e$-Z)
z02>ze@6C3AxD+~NyVvD{9{EEwk48i!v@r;|dirqq{uC}GnotsK&>r`nUQvjQ+@k@q
zPWOSACscY_H!w-gBT!lCM0g>`pr7)Y^yrwPrlyBO@OqC*w>h8shYvq>f3%ogp*Dxo
z0uH{!4o-J{Jz}4mnulDCfptYWeoNS(#mK`PycS_U<)T0z?h$}pZwC8{zsUD+L7+m$
zegU!^Ga)I0fSre!xTpkzbC6+o<<2qdF};P=TUMcz&Zw8%<&uzdx$7mD8R$RYH5o?Q
zY`(UgGM^m~(3n#qr!@ViQ{21lfpQ2o*XiyKyj_oNL{q~6px##>Je2{k-aZ=}u{+>Q
zC2OWj<w!H%IxI(tr1(s+G(hWICab0Z;{5=eEN^YCJ%jV!(o`X4bvwe)yQw)b4K!jd
z^`+Tz;&EOq;uuh**?e9CTLjID#ibgKk+$X!%1_rKFn^@b^xmquyrKk+gq*#2b5NbO
zpE0&SzuDS6P^-Xd(n%`$jg_T`KWLIFb8B0c*mjb~!QZ^W*TURVy*WtAvQ`Hg;uYL)
z{<FNiJbjb;OQ||%ops$G(Z7Z%MBG0G9rmks+sfxe5Dr?I7i6Mzg%HZ2+ck7qeaHH)
zECH+je$X%%>#l~jcHd}{355Yx5wi)9)uAJUJGT!~r6l*P3L9oT??LkdN%Ph$en$qw
zM<I7|hCqHjV|A5iU~mvr(8COPDDnKeUCqC$2r24xC8*MAdO5dm;)E+BP8+*-%&jgf
zt5k=5$VkZ(vf>LgJu?$BGJx^`WI%|6AlWSz*c9Ekqwp+kk#WhU6Eww>UtI0J(p@Zm
zPsxgZ4=CG4k+wqM^cW~Jv#0F)r@G3zw@|?xZ5!~suYeL{q|Y@Cf_8QFIC{Ub(*1r>
zOycV|0!=58kNt|wX+CO8PO?COhXfw$>Iq9WU@MA8n=t$Sb<9^ZC#||dmZy+gVT}*3
zp0c1HSlTtX%3GSpr)+mO|7tKe2iRJEw&hwZxt)lqi!7jv@`ac3tXh0wU!Hj)Sjb4i
zAa+z@zGyOHE~5y0EnjFB;B_V>___$Ztr;(MW7ft3*HS^RP_^*0=vOFQ-!`cd-A8W>
z95~n5<?%hfLW&q)3a0}4)1`MnozUd;G%>kjJ0!CIZGSMytJ}7$&f1A4$?moqX>V_)
z`g#v2r)3-uC0=7=<Lvo8-T_Nw!@))x^RPyJsMc#DA!V8##u7<)uXKG`b5aiv#25}@
zaw^qD-S2`C-nYvclMkU*uFixthKIi`+WDxbXQ#lF_4QjWX6o?)ttOE0-iQ|FlXy=`
zmYJhG_HAq8?pj_}mg!2Yp+wzUthlU#nPI$!eg)W_TGs>V8r=+6k&_bjdRf^sX;!%6
zd4z7;yvygzHMZ7_wF#T;=d^3Bo*qEr3s5rrs9spXN9Vh-=<K(4s2MrmC}`;8#Ub9A
z%K!O-2|qUc%{FCe82K}2N?7u#YT1~m!?2bb=?w?O#$svu>2JYZN>Y05T4x$U`%(p`
zUX@HA89x<wRCM@i?2jL>{5(Zzf+Ry|0?z&0TE7L+9y#)!4aJNjsL$UJ?(M$%Difx?
zMhxRzyLdP(($>;#w>E6_W5Q(^DqpC}ZitG;i$=Z4J7;!yed!j2kGOmD+G)2`$hux?
ztPM&0sP}?)D-qM7G3?4@LA~Raknr14tHUv5CmkZ1@B!(AoW`;DaeNx(7|EQ!^>Tz;
z^6iOS6gsqm6}TG7wzfa#!?qS)I%n6GMMKT2Lw|;Iq>PPBD}-tb@x^YWc$uOc#B1u4
zMn3c$tIEziG3NOb5$TMG`B%MxGDhAWtS4jz7LaA>aduohB7X+wcC@xS65F)rc3P{U
zeBcW7V?mD2%xKlSdMU&mjLdfzrJrvgs61`Xv?wUge2X$8Q0M1zEB4uttXws5@X>ys
z5pRuUzlptzOe$y`YQ{&IQACn4R@PKfas%CvG_KdRadPUNsEw-DCObJjEx2%Zx{<=K
zv3ZvfWz{CBSZ%y!Vz}_Eb32hcL-4n7O{M?Mc#$3Oe7hVj?Y&igzjZnM{?)8%@9%lp
z@t=YA?$=w(=R2kNLE|6P*_PL5b3%bI)Eb~>TT^TsOfqqD#vqZ)e>M~yU3t*8d1iTi
z+1Zv4>M)@|bfFs#SJRz89kn)ijRw2zG~El7|F|!ki6)<4aNcqMDA{Mzf0JZzSnWax
zT~2Mbq3ZMG<m58hudw{W!e3raE`$mNA<e&B?|ml2{H*XgXwM@r_bt1zx2HZHEhOJ1
z9j*)lX(wm5JsWN9qqj;*d9uuzx5Qojs1WkJ&4hN5T7xfPVR=)wC8eUdZyc$An%&~e
zywq>S*VAuJuQRM;o669FK)`Y>pq6-TjZ_uO?`A^2VhLGU5FZ^jmc)|1&ykHxsh#G(
zO*~E+nalU$tB68=E<H9EQL7p%f8|MW#z3sZJ26^F|0?HZZ2z#!zi2lpGt;8sj_oV;
z@)e@5Q7vA|F=?SQ7{%8GgW;&|%$FQu1D+TY)8|mE{TQxVS8FevxC%Rq>qYm&@aiM*
z=bsGQ(;tD5PYW(+K}&ey9<eidy2;(iT(+l4;}99!{Rh{IktH!A6l(y85u0ZZ%#-ld
z{(AYLq_VuiGEI<H_ObpfD>>O#V&cRp=4h!%72rsnp3i@*YiL+VHmqTrZJfmQf=6yf
z{SYye=T@3^6PJ+a?;qYQo8oABTj2b+WvNxwexsHBAjZErFE~{VNBQJ%-p66SF$;)|
zbag#%`0UDCYc*^RM^%n=Xxqi2JWclEL`~M61DRJMj0;&jL6s)8bAM4$1w&~C_XnU9
zMj@dTAhzSYuWcd3#<>yT_sOAcY;kf42Nk7V<Il^VVI719L>SGlejOq>bVNLlUsiFl
zA(zzA#XX$32ni22J~^>!&`P|4ugL(kL$R8EODd4BF{;#`zsf3B<>A1Hs*jq3f^5EX
z>uqS<M1_aH05<0UP4AA<3Z7#j0)L4j<kGeyJRKUaEgf-U)SIon!~P=evY6~@BBpn~
z=?W+S2a`8~R>K+nsKGO`6yPOpz$@ugPA2S{mqz<{89B3L!U>2#fF?$MIslqxGf{Zn
zXkThbDCl?bd`q7{W#RA-ve4a(#E2rp^k!LcS_2oU{)CT}En`pN5%H{?U0F$qrM)A2
za<W2ZbILFOE7yC}z@Goz@a7R+ouKAv(7l1Yjw<$#`;fE=@;3=2-Y;2;7w(d-gp6Rr
zZdKdMem9=#ZYn(K!`<E8zifxei4zmVRLrMNce*6XC$8d=RdDC#_@9=9TSLm5xlAg}
z)Du<ru8*DC&!38U`X7Bm8O5?@{WY@h`a)rNdulPEZA)-m>lH^VY$xty={GOo%bz}+
zUw~&0D7$!|`G%^eXP*IAlh<QcAo5ZRx$E1sic2F{FE1V%fc<A}%56mIlJv0h7~o#c
zn#v{Ug}g}PB<?x3^aGaWSjjKQ=(ZO^r(SVQo8d>Set%}7JP>W!eTAMQVUyEZp0rn)
zH3S%N&E8hKr&U4F!2xc~TIUmr?QE*jQXNCh!RySaBz8&V{s_xAD74B5iM`Y$aGT+e
zk#s@BjUGQ!2Ycxw8pH?-A=BMYJq}A{Dp*zEd7b)^{p8*`-gC;{*wz3k`L_A7Fg=Lo
z@!xT3eTF-n8J%MB<-T_h*7ujqvOPBs#WO{8?IX?g96C_nk8y*ArKj@%Msd0F@M7go
z)ql(c8jEconUkjxBOeM`ILF(hh+9-q&EXEZ+uO4xo8htt(?mqPyWQb-;tS~6A7rij
z?Y1h*A<xZC93pX2pGPux1-o1f*d7LwI(;@YgWKL~Umj$n*JvfL>~$zg>?bqrk<J|6
zH&;Buv!D@2U%x1L*G9zL?O{W^$)Uc7<LibLHwqc2WzCLH=)9S%eCfr4b~wkb@CvLw
zf}o&N!QV~`ee?Aa?KnAAfTEj-lUR6qoTpNpb_9%SsyKbw3`O3>7r?eY4okW3?%`}f
zh*8Qtdr9GmXI-C$`-Ji$ziEA%roSJtdbK^(guY}rKR-_?V9i7K;729WWK~u0;SqT3
z#1ema?!I!J+X3ip;rRZ*CmtsTaW26j52GNs+sJ(a3dQ%4ySF~f=oXnGu($AcLGMBI
zrW+x-xQGa5$Bbztrba4dR2_!+0#&dqZvQ-MeYkJY_n7)9!IiR`E!%X8XgNMUTj#EV
z7U*WJlp76BJlyawGs7WK0{ea6IrSP3F2rWr@>0=~xhx!j-9T4reCx$LhrZX6$I`=V
z3uP7(Dykpi_Zx0QP@iDitF|-bY0|-po1A0&lfUb958Yp5(sl0PzEh}I+#D8&AsnDW
zBLm=;*83k6{QL>H8T!+e53%$S?@Bk=KCgbYE8=zQ5y%R-p!$f~XLH-qA3m;A4m@AP
zE!}J~Ke1}Pnn_`M^~zbdT3!J_5IQGRqz0#tqBLl}fx!Yip6=99l16<%3Pz*9mX#9k
z$yY*^&2ypq{-ly!&UC4(ZkYD4ga6rB81x#t8^*cpe~Ld{r8f-Bz$(SSf`S|G{WZU<
zV;Es&_&P?afDLiOqCdY>%c<qZ!xwehuf)Aq4gekLB2t)3n8f36KP%m7Sy7O5;4<VR
zO1}vyb$gR|>2tXh<dHoqEq*#QQAeY9b>Nhx$=p6aufFJU^ZxcEE~u3u7WV}`eQ=7Q
zZ-kg5G#L1(bmMAaMLY(+h)`FNC{YZAu0b!CHxuKQH>1duKK64<IkApN+z4m;qfW?N
z^wESCoiPD757~L+*lT;x`Ssp{hdO34V1{!jTj=ONedBu=Q=mij$#`9|D>m^<Uj2~(
zmm9%ooNg;-e2NcSfym8l^%~7R3We)`CcPqB1NDn)I+?e*J#Y##8_$2YOIn=*+~G33
z@Ys!MpP6xBQttT!xA}5Wkg!6&>1U?+Yk>8j3%Y&QfN1${Am(+(R<$`u{%<fk0RRVx
zaH^JkmQW~~r~&f2zXMu=x3m39Rl+pvY)hLwFFq~?-Z+RKcigdm|FtdtcNO!MHc$56
zmh#p{f12So<L6bO_P~UL1@OOvi@2pdnE3b-JoehQeiYGKz)@`hJ(k;qD{gzOc2^sT
zas7R&+o3hkO<=$tn`~yaY<bX?EItNyI<FcxsZF2AiUg>jOe>pr*$sp4JRDm+sc$aM
z4}9wgR~}fa?l<MYj*HG~n?n+E!r{Pr7}4F$<T5dgD=TLKwHCl`d6*Jf=#JusvIB3|
z@C4=u7ma6r(ZiM`gpz-y7$gI~{oUQGkQgr+d(mC!bSmrW8mV2qs%ZjFuc{N>)@kvq
z0I~=qBlp*j)n=@B;vv1I=SW1$Aqgf{PXrQ)<fs_%0dTTzptJ$J@g%W33##C&uXPmg
zw+R~;m!hf_wVSNDiD4LcH3AeZ()T|;LDKGNXndgWiA{eABviS9V=L>}w*hY}E<6VG
zGSKr96X_k3s~|cBPKEQgUuXV1I{+I&$4fJM&G{m(PAS4Sa$((D6;~3KWgqQ+hc`Ap
zI#~^UV6+D>Bx`Pq0};<>ot^YZ!XW?mlR^u=`Rc$hG`bx6$ET(;0V@9F_AlWf9~^pF
zAesMnlbeph#AFRwVwFX4L#W|VqW?Zqt@Zrk#?*Tko7V-&1Z?Zq`jF>oKw*lA;$mx;
zrE@oUsbG3bMcp}#S9tpN5=$41u^Z4Bs@zY+3Q9{AhtdS~n3(a&s@m=$4VitNk-8^~
zr$I#VYGcUz8#*Fi=)Gc@GY&ZYeJg69j6qmfcJZNu(Ns%XCq*>V@mHjZlpa@)N~0nk
zD2)Ng`B&9F&HQCI!xf()uA!VNtmA(@=jzhjKVM)dcB%xwBNS%{36<#vP>qYkUA&Rp
znLRi#gIF2ze!`ssZxV#H)fNgMy>@uW<e6pih=_jEi+z!w&CS<}Wut7k6j@Y{mc)<1
zHG%5{Vj0tx|E^6=L!%C`c+Zo$0*ip9MjMC}OXnBHkOl>9W2&iJ?SCi=*ktWi9}(y7
z>gs|6fDwJPd>;Fzs)TVKKT!VR>Skrd5-@9-@Tir2<-x#Q>>XU7DT8;7<&o`k4VihF
zNCNTaHQ=c>0aDNqs6u5C;1wO8^kyGlEI3V!&jTV47vn@NHUWWj-WXuyjb#1Ac#87M
z9qRz7i6M#0Mc~E15^4c}Dy>&rdiL{m5p~aY@~zDItw)LUmlmyssM$D{6vhty2L_dX
zh<S0vNnVGmx9`u>02$D~y;AqwN6<^!PWoH`4dEAeU;;3XmtdNysgFSC@?He70xt5_
zzs>;xI0%02Asczr7eFXSZKXQ_^FUhOl3<5wPF7AXZ)%5!@@$`z>t{7j-1UUVV$j2#
z_}S<K5!vP}uFH)gV9_~2&%J;(NB+aO>Nx+ecZ!fs*aS(`+|U{fYtA55;UsGb$Vq=u
zpbOZDbYgQ?0GLbwZ9zY3JtfGYXXH-@WCS3aE|1;0GvPscfGEl_rjT4-7Qa1)Eqcsk
zYT3uF(0(QY@P$~wD~|$+d49`^mjede#RaFav3Y!YUeRX9(34`zB_GKHV3(`FHPMS)
z@hQ0Z2CHYxh(yc(BMCM(H`R1?<<rN))5|yjcEsGuDn2vwU=r4Bt7d2j0Wg-wkG}{A
z2%xP3KA*h2sE&>f0RG(OgIuhDRsbmI1tcyFO-)n2^pmsW7reaj0MpCI&mS}%Q^x;_
z9QPlSw7spauOG5(+%X0J0;HmU{o(^mG8QoSLXo1j7f@%TfjWNxAW%|K0XMBjZtp4S
z>azbkPeH-mzw-c-LseDvwV|OT7Qyq5+2z=lbQ*x{WMXAa-`}?n3CVDyE7rxPpKTA4
z0uTgJ^25TbeQ2_@v@~GK|ATfuB#4gXBl3U$zG%Gz0o77!;UOtStZZyD{R0CGjEpdP
zh2V_cRwm*ISq3%C-k_A%4DoN7h?C<HO8@@2BiQ-RXX8iT{{L&BsOS3!^8ri9KV1R1
z<8QD5-SIhBS#EAH0FH+ZqFdYBGx6|@uDK3b@*f-=099c{^fk@Q7}3$utp%uy`doW%
zYUk|#ock8_0Vqe;%nU^)WZPK>;T!Y@J1s3OgF&fkW_@ifI6XZbW+Ae7C!(&Yr$_Ti
z{aP2`tSg(g(rEAG89x&Rkm&<FE<kyh*d*SheQcOKSOAJga=gm}ahL7i#C+~1U@(83
zA+srvS&o#|d8WRtf21mcw&I{02De`;{h=kMKYcO0LPSg}{Xwx*W)mQVP<LBS8K`?t
zhX?Wibw;7#%p+$VyP5JXKzatqIN<+(6-1@FoqGTod{GTVENDf|z@v-p$mzT^+1tm5
z*Jr>zZ1CE2^8>IwQ}XuCm)YE3&{Ni#3oL4F=Wf@I1<JP8T3!y7oPD)WqEOS&=mzMu
z{rebpL|IPGCn}-kuPg7XE(OaROAJv0HtQ_${6{9h_pMr)HEnqvJ)@SI%ylhQ!kzm&
zU{*d`Gm`6y7Wfc#gN=>1mD2uD3$x&CWKU!@y5{&{#v-f%F}t(bdaq~|(C-p9fbP8q
zH!^c{T&<Pk0C91MNkmv>`*pH`KEC~g3#_^XbQTZi=P5cZ*SOP8^J~B$k@Q+RFJxXG
zcZ9|X>uW0K)2$rMD*-ixWHw*jjcnhZ5RRxfZhw>6MAMhBBL<Xn<ebsN^4c$!*uhm8
z;WAK#19rI5C&3NO{`fPo{L=(r?!Rl75&b_k+)KF{Qs3USQX$J=Cq=RMQO__o6;nEM
zXRV7J3iA=yMwLaf<Lzn(-&|F&sVRG*jL@3Y7q2F1VrjSPk9JFr+(z#=PY9wmf(4{@
z2MJyJMlWy0TLa$4;V=yjcVNNa*BK_fk$L@=rxxom(NTqrjZ@LtEZcOZb?+=||I;E4
zm0vI0gHbFTqE6iMg>w8}#X^z<rvo9=Us=FM4nO6$VM!UP#QHbi^Gx{`swb_`&<KnF
z{YJpA%hw>9!E$qb(1`4a60Vbty|GRpiNm3;Zu+oyw}RV(vy(6R5T~`><mp|T?r&;L
zA6HxKz?1+Li~zXHeO*m_^%#tjeBKSpg&0xQ>MhMw&>17{lyZt-?vsYenHdNkl}mJA
zuRBocxAswrd}b|@B`)&C)tap}!BQcktnu*PP2)*o7E;Jt9`oYZ`|h}z40}r9g-e|&
zb-A1g4>2agH{?~TUVl+W?f!Qh0=MZ)P3>C3IShEdwqS<IJ{Mt{i(JL|zAp|5P~-|i
zG}rCh0&Vstern87PnDJLbwq^>#;=5zAsL6Qb)7D)?lMGWD;yhT<Q<IBEBF3PDuNE1
zbHq_*pPRjTS%$uN(pR$~6M9C~-@rnswjAM0Gf~ER+gYmXWZmaDTuINxJ2e;IM&p~A
zqR{k*6tp$%_B^PP3Uwv9>b_;tjJ%48fHF^&&lv{8&0b!zG!{!GYYruCX#HxPPSi0o
zZ4xa$YLJ+vrn$Nfe;iaBkFw-s6!W99lu&9D!7)p0WYrBD*Bz1iu2ntC@xzZ;|6o#G
z)xagP3rsP;t5@uS^XD3xj&Cn)+oO)jrTd<4rQi*`?J1ag(MDvhgLmdwa$#vxj_R-f
zfy<(TMU822EbNL|ys0uZ1rx)Ybw=GHok*bGx}RDiHU2OMn+U_7A@oE|J=(Mo@kbs4
z5wHP00V}Sv-A|DAPrepZumxWP=yl*0JwYvVj(xW2RIP~<OWE(dX|vx7JJb9w0nSOt
zoYkGJzGEs1c$Gxw0Ap;bMHZW}PB&9a&RH?c<`?A{_U1#>MgBALUBs|=wE`&3g*mK3
zmW%9ZUZF`6(0AO{I#KKJnDe~5&sl}e9@MS|4ib8!JmN2&bMa%I5KgK#A2B_yKs(47
z#SQ6^<rkhLF>t`0Hhf1ekfA{-!uA~}n7Sifnx5J=<$3OzHtz4wXdYthCMF}7VcOQ%
zYc)SiWXsGsn>$`_JqjuLnGDZt!;xOkRlm(S***#9M|{`=b<C!+FE6dUHS?#uz;70m
zWtJuS^5(vT9bE5~thuK8jxkFH6Q|PpV<$m96CRB+2DnW`Q<1J{!N81<M9lV4B8kK(
zM#<cX^0HcQr6mb?(#x<yQ>ACChqu>YDw><v?7t=xkt+lp-Yvvo_X8}MpemTmH-(x*
zkbgA<L1yRt1F+=z=Wej>)n~hJcc9S7ep(vUeu%rIawZSukl_ARb3!3vyx5$TD>r_T
zA1B(6#wTX*z3_7<R>;b!twX&reuQaYBr8|OkKAZHdlMy}D9q;+fBZ4}bAHh2KFSff
z%!b*NiVh=U?IJWuJPt7IvMmFt-z4Ovj=N0m>`LdyJ;{U|l#>35%UE3@a~z_gD@|Mf
zMfe3(3MQ2zWI(8#)yY0-G?Q2wc!ot8S>G0Wz>wN64U)gwjdH=u>Qt6kphurQd&2IK
zyO_4-)l~EGNUgkXS086Oc-UUv$#c~I!}bVtLQI&e!?iRaSyABh=OlD*Y~Fg+y5$Ll
zLzrj5DGQh=R!gvqeJ>JaWXTH2_MuWFJXO;KOSRj)3QZ_u^K#klPQ$qWxEWsA6*EK$
zx*?#hao%!8aFCa;AtA+K$tW3zT~Ql5F~N$;=)Mch8MI{2-(*52R6O54Z~3+@PAKLf
zLMQWy_%J~JYUQTLB*RYr$%0S_HAbt!s1Bz<`lH6;rj;?o?*L6c*8u0D!JRo3YPfm+
zMh6agpwY<c+>SmSGQ6gQOva9c&cFIckzI5hHb~8dS6q}ef-0Aw4)=fCK?Aln(8c2k
z-8sKH9CbyWL$t_{e)f*>7QuvUz6VZE<gK-g5<lI86}KHVSqK)~_23(te9vQdN|Um%
zAoCaW=lIXBab}I)y1y91_po<Rb$i56%vP4i#}_#_w%hi)V5;MO0;9rE#W3`1wXv7|
z6zf$l1=bZZDRi6ZO{ev<KXnvoMVdu&VIl9PD%7O!a&;m2*k-AIy+wC;1<4+I9JABK
zW~!f!hr1lat4nC1yCLpO1>L9J6(qB$MN;x5Ql+=UYFS49-%PQ7^TTpw%AF(DzB`|{
zysA)nG{e%QmfW7Q;vfaj+I<2(5bsD4#Gv3-sEjd0zmMAQ-r?sqa2DWRCRVzeKNu=;
zx7gj#$f8Y0(9CBpHdU(CSD?p~aj#<xlRke$s0I2{XN_I~32UIy<^<1YPRi`GhNeEV
z-xxHNu;qHAO>L{=yO_RE^n^Gcl{9Y2qsA>DOAw1}UURe#&Y+~weQ=9@f0Ne4wHTaj
zK5FgaHD;{N<(?2p^oc<5h6Fy_-E2$PvuSH>*ASx9`Q#-z$5*hkix+(1#E5A}OxMZs
zJ&or@km+0U1#eS3w(NrFwOnj6G5-oyGk(O9cok4V>BowYP>rz+Xu+1`U6gBw_w4^P
zgtk(x-&bJOI`zrj9H46^Jzq_Rp%2j>jG&diVCg`|sCw}r=ZQ4m-W)h9>Cexcg)N4%
z*r#eYRH7wrYu{%O_gCY`Y?Rd%l2g&Vfh1u8X%ntn8RUyBkiDs1;N!iF>Wm7PQ<E(s
zFU5-)XIXfB$@uipxWU1Oayavx!5v(6crK`!;08^XJLqee?d-x<PVwkzm%;J*CvYAa
zu2fHVsxr9TvGQ+AX+hQM>~f;NUq@ddCM(zN1meTJSBls~6=?U{Seuy(vye#Yr=L&o
zV)01PU(2{hORc4e4iWtD!C^lgxEf&nP6%3~99i3f4cSl)Wq8V4-eoPda))^Hyq}^R
z$fmqofMo2o*3ou9X&zNI>-InV!K3c~lgutDS(D7_%~V+2aSIGY+xXY5xBIRB0O85I
z`eBil6Rs@%WTeXJnr-CNq9Sj7we6k_2LF2BxyKM!7PupnCo^Z*e7*H%)OysWa~R%f
zndf3+Tuq~x9HILdS%>zt-j|<?KDfIKOfpWCB9Rni>E&%MI!u#I+so%m*{<v=ZG98G
z{^8MpT<#r9Eq+-#9h*t)%4YrVG@EH}BC?#eS8!FB*a{Vum)XYE+t&(tl#r9@Eng(-
zhAXs`{z@zLfltiEaZqFarGyO&3pku^B^lQSQ~V`I<=Q7qP^5ejoXmO5MwK-8&FJT?
zp>o;01OG=k9#+b>OfF9(Slwy6ANwXgf7Umd<ji-C>St1-*%Q)qeIXp~ogR<*ahMcT
zYH`)jdsX>P;;8Bz&sZZGm(>n0V`O@<tJ4?%L~S=({>yu5X=7;qGO*0xFb#|jWqqv;
zs;gK30H9iP?NKs4DaoJVEJTriF2X}h)p!`Ibsy_ea8?F5(qq)Ve_=lmhY9|uCVac(
zX!sI*U##cHn0;vd!cqf6y>)6oiS)mo%2WRT<*AfrCp3J!JV5__CGVDaT~r}fa6XA+
zcw3i50#iV3^9B~5;#>OJzh)MD_S90&xG7()#ECWXedpKle&qoXPvnM$49x=g&|)qI
zP~?4uG-+*n#rRV53;6<1GXe1cWxe!{=8r;jJ2x-&3(*h}F3zMZ>y7FXFCE{%SMg-K
zGPXHSGFd*&$p|D(fooc>GM;ed7k2Y7&XO7Qp%D|mo_(c}Jzd*U7=c=bjD7iGus~3|
z|M~Gk1Y_$jn$adc9`H9_nUHrM-N*}Fn7xKx$2t@v?~Wd?GuBN$7ly|lThl7<nk9nk
zf7O=%NPuysl6C$DO=&;5<_`=f+}0TV^4ZO*FrM%?2xd`CIG-5v97HrIadJVcDk-C9
zfT<U^+h3+x%~qt-s$|mLVca9HPnvGyupWkFVYj6zQGK06YZX5i7tfZPe>RRz`Ct2l
z5nMjtVz`pSL+^t&(_KpSD2JI#5aS*ix}f-`?~GzH<2Ml>L#U7|hT3IIpDr=dz{hTl
z0IYCLQdz6&(;A<6k9h^F=PiboT9t@&rGK&Hv&1*94!s1$oO_fndW2;9684PH4p*&<
zYeX#9_dc1dA;G<gj5|H(v_)}Z*h#ZF>A+HxsD`nf@Jr_-Y9gv|AvFO`5(nVU$f`kJ
z)-P^%;jc03S6_*__;!5dnLN3S+rz=TU|v>mMi~p@f~Z7tQ(sDGsx%d5;cw&8S6_6h
z|K2;d7y5J6_VfC-7Pno&!QyCHW`E(*%({(;y<Ed~u`f*^W9S_QJwmqIMo?En3h6;!
z(I9nNd78WCP9rFfO_SJNf6sN+MCWqJAT||MKKb@g^5!k3HTUr|QRMiOlSO03-n8;Y
z1y&xI*35bVM%B01g;-d8oNlXER@CoV5q<E)v$ezgXNYqUZOa8c(~1yZ!U7{hqr-=J
zk^aoVL&oNgmxkUi!!cQPd3}B2iS{F^LPh;%!XMq=zb3^h#S=g({z(!*`l@kz_CKZV
zGs+HU?I^Rt{Jxws<-_BP#3uJH;rREW)eF(55*5$NH*pbnMtl;at^-jg+j=5*M(TJ2
zl|d4pS28IInsVq|zh>&kx8O_X*0lZT@!ft~J3DC<C%!Gtb>qe5X}#;*Lz&q`P{sJN
zJLSd4(@Ena&4iW|R+;NxL#1Q-8>HMuEM7v%e*P6tiVQFhkkY0ZVQ%J4KHn&h9_ct9
zelDqXoZ&n~Z|fFeF#FF#H)bnn-7=^^CN`!|)aRk|+@AH$SV;Gu;*x;-f0tfHvBxe`
z8-Y0Pzxt%J)h@P4B}%)BUO7`cZ-=t}IO|}YHasIGQ*xf{I3&3YE^M~7B{($es?#Az
z#yo~izt<kJ>`}?wan@=19i(Ke$|LiYz^gdp;x`zO6Q<dYs*b}B>L|~Kny#;b!{dk3
z+kkj-dd4CzLCX2mZnAPFm;@*zC;x0~`>vxzoi!76-=@<}?&T%KqMU(r*4b+_eD5ny
zZesFmiS-Z@aAJyu{YTl3j`r%y#&+D5(+~f8{`(qI3W~(EG}}uj4HrYS6~MjuC)1cu
zF#|FTHvRFL#^qlND*+*)^`oOBXAwk_MU6Jk59Pmqh3^qyt2Qa9VO(~$FI{f?bAWI9
z{++z&Pr=OI$;E|yk=yNX2co<TC}oI&@jPg)qSLqo2E)|Snrhedz(M?yayB%eim`H3
zca}HWTVIC()jEC8Xr?k!00H1F6XMOG&pSX1%U_|-{7=g@X#D?4!vmo@mr2!0?RE_C
Q<9}%H<RG$@Z;V6!2jD)|YybcN

literal 0
HcmV?d00001

diff --git a/content/english/hpc/data-structures/img/src/eytzinger.svg b/content/english/hpc/data-structures/img/src/eytzinger.svg
new file mode 100644
index 00000000..da565f0d
--- /dev/null
+++ b/content/english/hpc/data-structures/img/src/eytzinger.svg
@@ -0,0 +1,454 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!-- Created with Inkscape (http://www.inkscape.org/) -->
+
+<svg
+   width="420"
+   height="280"
+   viewBox="0 0 111.12511 74.083332"
+   version="1.1"
+   id="svg5"
+   inkscape:version="1.2.1 (9c6d41e410, 2022-07-14)"
+   sodipodi:docname="eytzinger.svg"
+   inkscape:export-filename="../eytzinger2.png"
+   inkscape:export-xdpi="137.14285"
+   inkscape:export-ydpi="137.14285"
+   xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
+   xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
+   xmlns="http://www.w3.org/2000/svg"
+   xmlns:svg="http://www.w3.org/2000/svg">
+  <sodipodi:namedview
+     id="namedview7"
+     pagecolor="#ffffff"
+     bordercolor="#000000"
+     borderopacity="0.25"
+     inkscape:showpageshadow="2"
+     inkscape:pageopacity="0.0"
+     inkscape:pagecheckerboard="0"
+     inkscape:deskcolor="#d1d1d1"
+     inkscape:document-units="mm"
+     showgrid="true"
+     inkscape:zoom="4"
+     inkscape:cx="241.875"
+     inkscape:cy="237.375"
+     inkscape:window-width="2560"
+     inkscape:window-height="1011"
+     inkscape:window-x="0"
+     inkscape:window-y="32"
+     inkscape:window-maximized="1"
+     inkscape:current-layer="layer1">
+    <inkscape:grid
+       type="xygrid"
+       id="grid23683"
+       spacingx="0.1322917"
+       spacingy="0.1322917"
+       originx="0"
+       originy="0" />
+  </sodipodi:namedview>
+  <defs
+     id="defs2">
+    <rect
+       x="-90"
+       y="-80"
+       width="690"
+       height="530"
+       id="rect27770" />
+  </defs>
+  <g
+     inkscape:label="Layer 1"
+     inkscape:groupmode="layer"
+     id="layer1">
+    <circle
+       style="fill:#ff0000;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="path221"
+       cx="69.291191"
+       cy="7.9003606"
+       r="4.5881357" />
+    <circle
+       style="fill:#ff4646;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="path221-4"
+       cx="41.714005"
+       cy="23.151033"
+       r="4.5881357" />
+    <circle
+       style="fill:#ff4646;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="path221-4-4"
+       cx="87.836967"
+       cy="23.180923"
+       r="4.5881357" />
+    <circle
+       style="fill:#ffdcdc;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="path221-4-4-8"
+       cx="13.676345"
+       cy="51.400574"
+       r="4.5881357" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="15.16085"
+       y="53.399693"
+       id="text2027"><tspan
+         sodipodi:role="line"
+         id="tspan2025"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="15.16085"
+         y="53.399693">0</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="89.296989"
+       y="25.13541"
+       id="text2027-3-9-7"><tspan
+         sodipodi:role="line"
+         id="tspan2025-1-2-1"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="89.296989"
+         y="25.13541">7</tspan></text>
+    <circle
+       style="fill:#ffdcdc;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="path221-4-4-8-7"
+       cx="32.24707"
+       cy="51.400574"
+       r="4.5881357" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="33.896336"
+       y="53.318047"
+       id="text2027-3-9"><tspan
+         sodipodi:role="line"
+         id="tspan2025-1-2"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="33.896336"
+         y="53.318047">2</tspan></text>
+    <circle
+       style="fill:#ffdcdc;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="path221-4-4-8-7-2"
+       cx="50.70089"
+       cy="51.425514"
+       r="4.5881357" />
+    <circle
+       style="fill:#ffaaaa;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="path221-4-4-0"
+       cx="22.96171"
+       cy="38.635292"
+       r="4.5881357" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="24.498793"
+       y="40.485737"
+       id="text2027-3"><tspan
+         sodipodi:role="line"
+         id="tspan2025-1"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="24.498793"
+         y="40.485737">1</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="42.994846"
+       y="25.13541"
+       id="text2027-3-9-6"><tspan
+         sodipodi:role="line"
+         id="tspan2025-1-2-8"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="42.994846"
+         y="25.13541">3</tspan></text>
+    <circle
+       style="fill:#ffaaaa;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="path221-4-4-0-9"
+       cx="60.003384"
+       cy="38.635292"
+       r="4.5881357" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="52.321949"
+       y="53.507244"
+       id="text2027-3-9-6-2"><tspan
+         sodipodi:role="line"
+         id="tspan2025-1-2-8-6"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="52.321949"
+         y="53.507244">4</tspan></text>
+    <circle
+       style="fill:#ffaaaa;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="path221-4-4-0-9-6"
+       cx="78.524223"
+       cy="38.635292"
+       r="4.5881357" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="79.889954"
+       y="40.358799"
+       id="text2027-3-9-7-2"><tspan
+         sodipodi:role="line"
+         id="tspan2025-1-2-1-7"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="79.889954"
+         y="40.358799">8</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="61.449898"
+       y="40.618031"
+       id="text2027-3-9-6-2-4"><tspan
+         sodipodi:role="line"
+         id="tspan2025-1-2-8-6-9"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="61.449898"
+         y="40.618031">5</tspan></text>
+    <circle
+       style="fill:#ffaaaa;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="path221-4-4-0-9-6-5"
+       cx="97.045059"
+       cy="38.635292"
+       r="4.5881357" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="98.401169"
+       y="40.353977"
+       id="text2027-3-9-7-2-2"><tspan
+         sodipodi:role="line"
+         id="tspan2025-1-2-1-7-6"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="98.401169"
+         y="40.353977">9</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="70.737701"
+       y="9.9268551"
+       id="text2027-3-9-6-2-4-0"><tspan
+         sodipodi:role="line"
+         id="tspan2025-1-2-8-6-9-4"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="70.737701"
+         y="9.9268551">6</tspan></text>
+    <rect
+       style="opacity:1;fill:#ff0000;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="rect19616"
+       width="9.2742929"
+       height="9.2742929"
+       x="9.0694494"
+       y="60.684288" />
+    <rect
+       style="fill:#ff4646;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="rect19616-6"
+       width="9.2742929"
+       height="9.2742929"
+       x="18.343744"
+       y="60.684288" />
+    <rect
+       style="fill:#ff4646;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="rect19616-3"
+       width="9.2742929"
+       height="9.2742929"
+       x="27.618038"
+       y="60.684288" />
+    <rect
+       style="fill:#ffaaaa;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="rect19616-6-8"
+       width="9.2742929"
+       height="9.2742929"
+       x="36.89233"
+       y="60.684288" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="43.01965"
+       y="67.340981"
+       id="text2027-1-7"><tspan
+         sodipodi:role="line"
+         id="tspan2025-0-3"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="43.01965"
+         y="67.340981">1</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="24.175987"
+       y="67.309631"
+       id="text2027-1-7-1"><tspan
+         sodipodi:role="line"
+         id="tspan2025-0-3-2"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="24.175987"
+         y="67.309631">3</tspan></text>
+    <rect
+       style="fill:#ffaaaa;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="rect19616-9"
+       width="9.2742929"
+       height="9.2742929"
+       x="46.166618"
+       y="60.684288" />
+    <rect
+       style="fill:#ffdcdc;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="rect19616-4"
+       width="9.2742929"
+       height="9.2742929"
+       x="83.263794"
+       y="60.684288" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="89.305939"
+       y="67.155983"
+       id="text2027-1-9"><tspan
+         sodipodi:role="line"
+         id="tspan2025-0-4"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="89.305939"
+         y="67.155983">2</tspan></text>
+    <rect
+       style="fill:#ffdcdc;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="rect19616-6-6"
+       width="9.2742929"
+       height="9.2742929"
+       x="92.538086"
+       y="60.684288" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="98.788048"
+       y="67.177864"
+       id="text2027-1-3"><tspan
+         sodipodi:role="line"
+         id="tspan2025-0-9"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="98.788048"
+         y="67.177864">4</tspan></text>
+    <rect
+       style="fill:#ffaaaa;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="rect19616-6-0"
+       width="9.2742929"
+       height="9.2742929"
+       x="55.440918"
+       y="60.684288" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="61.383411"
+       y="67.336502"
+       id="text2027-1-4"><tspan
+         sodipodi:role="line"
+         id="tspan2025-0-47"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="61.383411"
+         y="67.336502">8</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="51.990688"
+       y="67.204216"
+       id="text2027-1-7-8"><tspan
+         sodipodi:role="line"
+         id="tspan2025-0-3-8"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="51.990688"
+         y="67.204216">5</tspan></text>
+    <rect
+       style="fill:#ffaaaa;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="rect19616-3-5"
+       width="9.2742929"
+       height="9.2742929"
+       x="64.715202"
+       y="60.684288" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="70.595444"
+       y="67.341507"
+       id="text2027-1-7-3"><tspan
+         sodipodi:role="line"
+         id="tspan2025-0-3-1"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="70.595444"
+         y="67.341507">9</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="15.05302"
+       y="67.42263"
+       id="text2027-1-9-0"
+       inkscape:transform-center-x="0.081921321"
+       inkscape:transform-center-y="-0.075522315"><tspan
+         sodipodi:role="line"
+         id="tspan2025-0-4-9"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="15.05302"
+         y="67.42263">6</tspan></text>
+    <rect
+       style="fill:#ffdcdc;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-opacity:1;stroke-dasharray:none"
+       id="rect19616-6-8-6"
+       width="9.2742929"
+       height="9.2742929"
+       x="73.989502"
+       y="60.684288" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="79.983849"
+       y="67.158051"
+       id="text2027-1"><tspan
+         sodipodi:role="line"
+         id="tspan2025-0"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="79.983849"
+         y="67.158051">0</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:0.265643;stroke-dasharray:none;stroke-opacity:1"
+       x="33.420914"
+       y="67.478096"
+       id="text2027-1-7-1-3"><tspan
+         sodipodi:role="line"
+         id="tspan2025-0-3-2-8"
+         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93889px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;text-anchor:end;opacity:1;fill:#000000;fill-opacity:1;stroke-width:0.265643;stroke-dasharray:none"
+         x="33.420914"
+         y="67.478096">7</tspan></text>
+    <text
+       xml:space="preserve"
+       transform="scale(0.26458333)"
+       id="text27768"
+       style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:18.6667px;font-family:'Droid Sans';-inkscape-font-specification:'Droid Sans';text-align:end;white-space:pre;shape-inside:url(#rect27770);display:inline;opacity:1;fill:#ffaaaa;fill-opacity:1;stroke:#000000;stroke-width:1.00401;stroke-dasharray:none;stroke-opacity:1" />
+    <path
+       style="opacity:1;fill:#ffffff;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1"
+       d="m 20.24065,42.333349 -3.968755,5.291675"
+       id="path43298" />
+    <path
+       style="opacity:1;fill:#ffffff;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1"
+       d="m 57.282363,42.333349 -3.968754,5.291675"
+       id="path43378" />
+    <path
+       style="opacity:1;fill:#ffffff;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1"
+       d="m 25.664616,42.333349 3.968755,5.291675"
+       id="path43380" />
+    <path
+       style="opacity:1;fill:#ffffff;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1"
+       d="M 37.921684,25.72447 25.797251,35.056853"
+       id="path43382" />
+    <path
+       style="opacity:1;fill:#ffffff;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1"
+       d="m 45.55494,25.723093 11.594882,9.333899"
+       id="path43384" />
+    <path
+       style="opacity:1;fill:#ffffff;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1"
+       d="m 85.328232,26.987498 -4.497921,7.672924"
+       id="path43388" />
+    <path
+       style="opacity:1;fill:#ffffff;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1"
+       d="m 90.355323,26.987498 4.36563,7.672924"
+       id="path43390" />
+    <path
+       style="opacity:1;fill:#ffffff;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1"
+       d="m 72.760509,10.980185 11.377097,9.525012"
+       id="path43392" />
+    <path
+       style="opacity:1;fill:#ffffff;fill-opacity:1;stroke:#000000;stroke-width:0.3000003;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1"
+       d="M 65.182847,9.9895775 45.507975,20.505879"
+       id="path43394" />
+  </g>
+</svg>

From f3fb1ae8eceaaf73d231763b3bcf0fb3f4b964eb Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 19 Jul 2022 01:10:19 +0300
Subject: [PATCH 136/173] typos

---
 content/english/hpc/algorithms/gcd.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/english/hpc/algorithms/gcd.md b/content/english/hpc/algorithms/gcd.md
index 7941edd0..6a4f8ca7 100644
--- a/content/english/hpc/algorithms/gcd.md
+++ b/content/english/hpc/algorithms/gcd.md
@@ -252,9 +252,9 @@ int gcd(int a, int b) {
 }
 ```
 
-It runs in 91ns — which is good enough to leave it there.
+It runs in 91ns, which is good enough to leave it there.
 
-If somebody wants to try to shove off a few more nanoseconds by re-writing assembly by hand or trying a lookup table to save a few last iterations, please [let me know](http://sereja.me/).
+If somebody wants to try to shave off a few more nanoseconds by rewriting the assembly by hand or trying a lookup table to save a few last iterations, please [let me know](http://sereja.me/).
 
 ### Acknowledgements
 

From 9d626692f78d3e173644d1bbbf8dbbca7d9c2d79 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 19 Jul 2022 01:28:13 +0300
Subject: [PATCH 137/173] improve wording

---
 content/english/hpc/algorithms/matmul.md   | 2 +-
 content/english/hpc/cpu-cache/alignment.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 02c68f36..5f2847d2 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -438,7 +438,7 @@ There is also an approach that performs asymptotically fewer arithmetic operatio
 
 FMA also supports 64-bit floating-point numbers, but it does not support integers: you need to perform addition and multiplication separately, which results in decreased performance. If you can guarantee that all intermediate results can be represented exactly as 32- or 64-bit floating-point numbers (which is [often the case](/hpc/arithmetic/errors/)), it may be faster to just convert them to and from floats.
 
-You can also apply the same trick to other similar computations. One example is the "min-plus matrix multiplication," which is defined as:
+This approach can also be applied to some similar-looking computations. One example is the "min-plus matrix multiplication," defined as:
 
 $$
 (A \circ B)_{ij} = \min_{1 \le k \le n} (A_{ik} + B_{kj})
diff --git a/content/english/hpc/cpu-cache/alignment.md b/content/english/hpc/cpu-cache/alignment.md
index 59579467..e9c5f4d3 100644
--- a/content/english/hpc/cpu-cache/alignment.md
+++ b/content/english/hpc/cpu-cache/alignment.md
@@ -185,4 +185,4 @@ int load(int *p) {
 }
 ```
 
-Compilers usually don't do that because this is not technically always legal: that 4th byte may be on a memory page that you don't own, so the operating system won't let you load it even if you are going to discard it right away.
+Compilers usually don't do that because it's technically not legal: that 4th byte may be on a memory page that you don't own, so the operating system won't let you load it even if you are going to discard it right away.

From 05f05c5b4eb587ff533769f3fca83486b0307890 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 19 Jul 2022 03:27:19 +0300
Subject: [PATCH 138/173] elaborating on eytzinger layout

---
 .../hpc/data-structures/binary-search.md      | 33 ++++++++++---------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index 6e73d32d..d2f237cb 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -248,7 +248,7 @@ Apart from being compact, it has some nice properties, like that all even-number
 
 Here is how this layout looks when applied to binary search:
 
-![](../img/eytzinger.png)
+![Note that the tree is slightly imbalanced (because the last layer is contiguous)](../img/eytzinger.png)
 
 When searching in this layout, we just need to start from the first element of the array, and then on each iteration jump to either $2 k$ or $(2k + 1)$, depending on how the comparison went:
 
@@ -278,15 +278,15 @@ void eytzinger(int k = 1) {
 }
 ```
 
-This function takes the current node number `k`, recursively writes out all elements to the left of the middle of the search interval, writes out the current element we'd compare against, and then recursively writes out all the elements on the right. It seems a bit complicated, but to convince ourselves that it works, we only need three observations:
+This function takes the current node number `k`, recursively writes out all elements to the left of the middle of the search interval, writes out the current element we'd compare against, and then recursively writes out all the elements on the right. It seems a bit complicated, but to convince yourself that it works, you only need three observations:
 
 - It writes exactly `n` elements as we enter the body of `if` for each `k` from `1` to `n` just once.
 - It writes out sequential elements from the original array as it increments the `i` pointer each time.
-- By the time we write the element at node `k`, we have already written all the elements to its left (exactly `i`).
+- By the time we write the element at node `k`, we will have already written all the elements to its left (exactly `i`).
 
-Despite being recursive, it is actually quite fast as all the memory reads are sequential, and the memory writes are only in $O(\log n)$ different memory blocks at a time.
+Despite being recursive, it is actually quite fast as all the memory reads are sequential, and the memory writes are only in $O(\log n)$ different memory blocks at a time. The permutation is both logically and computationally harder to maintain, though: adding an element to a sorted array only requires shifting a suffix of its elements one position to the right, while an Eytzinger array practically needs to be rebuilt from scratch.
 
-Note that this traversal and the resulting permutation are not exactly equivalent to the "tree" of vanilla binary search: for example, the left child subtree may be larger than the right child subtree — and even more than just by one node — but it doesn't matter since both approaches result in the same logarithmic tree depth.
+Note that this traversal and the resulting permutation are not exactly equivalent to the "tree" of vanilla binary search: for example, the left child subtree may be larger than the right child subtree — up to twice as large — but it doesn't matter much since both approaches result in the same $\lceil \log_2 n \rceil$ tree depth.
 
 Also note that the Eytzinger array is one-indexed — this will be important for performance later. You can put in the zeroth element the value that you want to be returned in the case when the lower bound doesn't exist (similar to `a.end()` for `std::lower_bound`).
 
@@ -300,22 +300,25 @@ while (k <= n)
     k = 2 * k + (t[k] < x);
 ```
 
-The only problem arises when we need to restore the index of the resulting element, as $k$ may end up not pointing to a leaf node. Here is an example of how that can happen:
+The only problem arises when we need to restore the index of the resulting element, as $k$ does not directly point to it. Consider this example (its corresponding tree is listed above):
 
 ```center
-    array:  1 2 3 4 5 6 7 8                     
-eytzinger:  5 3 7 2 4 6 8 1                     
-1st range:  ---------------  k := 1             
-2nd range:  -------          k := 2*k      (=2) 
-3rd range:      ---          k := 2*k + 1  (=5) 
-4th range:        -          k := 2*k      (=10)
+    array:  0 1 2 3 4 5 6 7 8 9                           
+eytzinger:  6 3 8 1 5 7 9 0 2 4                           
+1st range:  -------------------  k := 1                    
+2nd range:  -------------        k := 2*k     = 2   (6 ≥ 3)
+3rd range:  -------              k := 2*k     = 4   (3 ≥ 3)
+4th range:      ---              k := 2*k + 1 = 9   (1 < 3)
+5th range:        -              k := 2*k + 1 = 19  (2 < 3)
 ```
 
-Here we query the array of $[1, …, 8]$ for the lower bound of $x=4$. We compare it against $5$, $3$, and $4$, go left-right-left, and end up with $k = 10$, which isn't even a valid array index.
+<!-- do we need the last comparison? -->
 
-The trick is to notice that, unless the answer is the last element of the array, we compare $x$ against it at some point, and after we've learned that it is not less than $x$, we start comparing $x$ against elements to the left, and all these comparisons evaluate true (that is, leading to the right). Therefore, to restore the answer, we just need to "cancel" some number of right turns.
+Here we query the array of $[0, …, 9]$ for the lower bound of $x=3$. We compare it against $6$, $3$, $1$, and $2$, go left-left-right-right, and end up with $k = 19$, which isn't even a valid array index.
 
-This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing 1s in the binary representation and right-shift $k$ by exactly that number of bits. To do this, we can invert the number (`~k`) and call the "find first set" instruction:
+The trick is to notice that, unless the answer is the last element of the array, we compare $x$ against it at some point, and after we've learned that it is not less than $x$, we go left exactly once and then keep going right until we reach a leaf (because we will only be comparing $x$ against lesser elements). Therefore, to restore the answer, we just need to "cancel" some number of right turns and then one more.
+
+This can be done in an elegant way by observing that the right turns are recorded in the binary representation of $k$ as 1-bits, and so we just need to find the number of trailing 1s in the binary representation and right-shift $k$ by exactly that number of bits plus one. To do this, we can invert the number (`~k`) and call the "find first set" instruction:
 
 ```c++
 int lower_bound(int x) {
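
Note on the hunk above: the index-restoring step can be checked in isolation with a minimal sketch like the one below. It assumes GCC or Clang (for `__builtin_ffs`); the names `build`, `pos`, and `lower_bound_index` are hypothetical and not part of the article, and the sorted input is copied into a global `a` purely for brevity.

```cpp
#include <vector>

// Hypothetical globals for this sketch: `a` is the sorted input,
// `t` is the one-indexed Eytzinger array, `n` is the element count.
static std::vector<int> a, t;
static int n, pos;

// In-order traversal of the implicit tree: node k has children 2k and 2k + 1.
void eytzinger(int k = 1) {
    if (k <= n) {
        eytzinger(2 * k);      // everything to the left of node k
        t[k] = a[pos++];       // the element we would compare against at node k
        eytzinger(2 * k + 1);  // everything to the right of node k
    }
}

void build(const std::vector<int>& sorted) {
    a = sorted;
    n = (int) a.size();
    t.assign(n + 1, 0);  // t[0] is the "not found" slot
    pos = 0;
    eytzinger();
}

// Returns the Eytzinger index of the first element not less than x,
// or 0 if every element is less than x.
int lower_bound_index(int x) {
    int k = 1;
    while (k <= n)
        k = 2 * k + (t[k] < x);
    // __builtin_ffs(~k) is the number of trailing 1-bits of k plus one:
    // this cancels the final run of right turns and the left turn before it.
    k >>= __builtin_ffs(~k);
    return k;
}
```

For the array $[0, …, 9]$ and $x = 3$, the descent ends at $k = 19$ (binary `10011`), `__builtin_ffs(~k)` is 3, and the shift yields index 2, which holds the value 3.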

From c98fcddab8225ab707a71820da7ff45e744da04d Mon Sep 17 00:00:00 2001
From: song-jx <79297685+song-jx@users.noreply.github.com>
Date: Tue, 19 Jul 2022 22:05:52 +0800
Subject: [PATCH 139/173] Fixed a problem that could cause out of bounds.

Calling add(32, 0) when N = 33 will be out of bounds.
---
 content/english/hpc/data-structures/segment-trees.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/data-structures/segment-trees.md b/content/english/hpc/data-structures/segment-trees.md
index f4c6fb7f..e98c16cb 100644
--- a/content/english/hpc/data-structures/segment-trees.md
+++ b/content/english/hpc/data-structures/segment-trees.md
@@ -594,7 +594,7 @@ constexpr int offset(int h) {
     int s = 0, n = N;
     while (h--) {
         s += (n + B - 1) / B * B;
-        n /= B;
+        n = (n + B - 1) / B;
     }
     return s;
 }

From dad89c8d3155433d45875a057039bcc944ca98f8 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 19 Jul 2022 18:23:56 +0300
Subject: [PATCH 140/173] new branchless binary search

---
 .../hpc/data-structures/binary-search.md      | 61 ++++++++++++++++---
 1 file changed, 51 insertions(+), 10 deletions(-)

diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index d2f237cb..f2e61ffb 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -3,6 +3,8 @@ title: Binary Search
 weight: 1
 ---
 
+<!-- mention interpolation search and radix trees? -->
+
 While improving the speed of user-facing applications is the end goal of performance engineering, people don't really get excited over 5-10% improvements in some databases. Yes, this is what software engineers are paid for, but these types of optimizations tend to be too intricate and system-specific to be readily generalized to other software.
 
 Instead, the most fascinating showcases of performance engineering are multifold optimizations of textbook algorithms: the kinds that everybody knows and deemed so simple that it would never even occur to try to optimize them in the first place. These optimizations are simple and instructive and can very much be adopted elsewhere. And they are surprisingly not as rare as you'd think.
@@ -71,7 +73,7 @@ int lower_bound(int x) {
 
 Find the middle element of the search range, compare it to `x`, shrink the range in half. Beautiful in its simplicity.
 
-A similar approach is employed by `std::lower_bound`, except that it needs to be more generic to support containers with non-random-access iterators and thus uses the first element and the size of the search interval instead of the two of its ends. Implementations from both [Clang](https://github.com/llvm-mirror/libcxx/blob/78d6a7767ed57b50122a161b91f59f19c9bd0d19/include/algorithm#L4169) and [GCC](https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algobase.h#L1023) use this metaprogramming monstrosity:
+A similar approach is employed by `std::lower_bound`, except that it needs to be more generic to support containers with non-random-access iterators and thus uses the first element and the size of the search interval instead of its two ends. To this end, implementations from both [Clang](https://github.com/llvm-mirror/libcxx/blob/78d6a7767ed57b50122a161b91f59f19c9bd0d19/include/algorithm#L4169) and [GCC](https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algobase.h#L1023) use this metaprogramming monstrosity:
 
 ```c++
 template <class _Compare, class _ForwardIterator, class _Tp>
@@ -131,23 +133,60 @@ Now, let's try to get rid of these obstacles one by one.
 
 ## Removing Branches
 
-We can replace branching with [predication](/hpc/pipelining/branchless). To do this, we need to adopt the STL approach and rewrite the loop using the first element and the size of the search interval — instead of its first and last element. This way we only need to update the first element of the search interval with a `cmov` instruction and halve its size on each iteration:
+We can replace branching with [predication](/hpc/pipelining/branchless). To make the task easier, we can adopt the STL approach and rewrite the loop using the first element and the size of the search interval (instead of its first and last element):
 
 ```c++
 int lower_bound(int x) {
     int *base = t, len = n;
     while (len > 1) {
         int half = len / 2;
-        base = (base[half] < x ? &base[half] : base);
+        if (base[half - 1] < x) {
+            base += half;
+            len = len - half;
+        } else {
+            len = half;
+        }
+    }
+    return *base;
+}
+```
+
+Note that, on each iteration, `len` is essentially just halved and then either floored or ceiled, depending on how the comparison went. This conditional update seems unnecessary; to avoid it, we can simply say that it's always ceiled:
+
+```c++
+int lower_bound(int x) {
+    int *base = t, len = n;
+    while (len > 1) {
+        int half = len / 2;
+        if (base[half - 1] < x)
+            base += half;
+        len -= half; // = ceil(len / 2)
+    }
+    return *base;
+}
+```
+
+This way, we only need to update the first element of the search interval with a [conditional move](/hpc/pipelining/branchless/) and halve its size on each iteration:
+
+```c++
+int lower_bound(int x) {
+    int *base = t, len = n;
+    while (len > 1) {
+        int half = len / 2;
+        base += (base[half - 1] < x) * half; // will be replaced with a "cmov"
         len -= half;
     }
-    return *(base + (*base < x));
+    return *base;
 }
 ```
 
-Note that this loop is not always equivalent to the standard binary search — it always rounds *up* the size of the search interval, so it accesses slightly different elements and may perform one comparison more than what is needed. We do this to make the number of iterations constant and remove the need for branching completely, although it does require an awkward `(*base < x)` check at the end.
+<!-- pre-compute base pointer for next iteration? -->
 
-As typical for predication, this trick is very fragile to compiler optimizations. It doesn't make a difference on Clang — for some reason, it replaces the ternary operator with a branch anyway — but it works fine on GCC (9.3), yielding a 2.5-3x improvement on small arrays:
+Note that this loop is not always equivalent to the standard binary search. Since it always rounds *up* the size of the search interval, it accesses slightly different elements and may perform one comparison more than needed. Apart from simplifying computations on each iteration, it also makes the number of iterations constant if the array size is constant, removing branch mispredictions completely.
+
+As is typical for predication, this trick is very fragile to compiler optimizations: depending on the compiler and how the function is invoked, it may still leave a branch or generate suboptimal code. It works fine on Clang 10, yielding a 2.5-3x improvement on small arrays:
+
+<!-- todo: update numbers -->
 
 ![](../img/search-branchless.svg)
 
@@ -162,15 +201,17 @@ int lower_bound(int x) {
     int *base = t, len = n;
     while (len > 1) {
         int half = len / 2;
-        __builtin_prefetch(&base[(len - half) / 2]);
-        __builtin_prefetch(&base[half + (len - half) / 2]);
-        base = (base[half] < x ? &base[half] : base);
         len -= half;
+        __builtin_prefetch(&base[len / 2 - 1]);
+        __builtin_prefetch(&base[half + len / 2 - 1]);
+        base += (base[half - 1] < x) * half;
     }
-    return *(base + (*base < x));
+    return *base;
 }
 ```
 
+<!-- todo: rerun this too -->
+
 With prefetching, the performance on large arrays becomes roughly the same:
 
 ![](../img/search-branchless-prefetch.svg)
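The branchless loop introduced in the hunks above can be sanity-checked against `std::lower_bound`. The following is only a minimal sketch: the function takes the array and its size as parameters (the article uses the globals `t` and `n`), the names `lower_bound_branchless` and `matches_std` are hypothetical, and the array is assumed to be sorted, non-empty, and to contain at least one element not less than `x`.

```c++
#include <algorithm>
#include <cassert>
#include <vector>

// Branchless lower bound over a sorted, non-empty array, following the loop in the patch.
// Hypothetical signature: the article keeps the array in globals `t` and `n`.
int lower_bound_branchless(const int *t, int n, int x) {
    assert(n > 0);
    const int *base = t;
    int len = n;
    while (len > 1) {
        int half = len / 2;
        base += (base[half - 1] < x) * half; // ideally compiled to a cmov
        len -= half;                         // len becomes ceil(len / 2)
    }
    return *base;
}

// Compares the result with std::lower_bound for every array size up to max_n
// and every query that is guaranteed to have an answer.
bool matches_std(int max_n) {
    for (int n = 1; n <= max_n; n++) {
        std::vector<int> a(n);
        for (int i = 0; i < n; i++)
            a[i] = 2 * i + 1; // 1, 3, 5, ...: queries hit both gaps and exact matches
        for (int x = 0; x <= a.back(); x++) {
            int expected = *std::lower_bound(a.begin(), a.end(), x);
            if (lower_bound_branchless(a.data(), n, x) != expected)
                return false;
        }
    }
    return true;
}
```

Queries larger than every element are left out on purpose: in that case the loop returns the last element, so callers would presumably need a sentinel or an extra check.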

From 3b2037f968fec31bf7f0ffef74a80df432338a51 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Jul 2022 05:31:43 +0300
Subject: [PATCH 141/173] simplify code

---
 content/english/hpc/data-structures/segment-trees.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/data-structures/segment-trees.md b/content/english/hpc/data-structures/segment-trees.md
index e98c16cb..90435a38 100644
--- a/content/english/hpc/data-structures/segment-trees.md
+++ b/content/english/hpc/data-structures/segment-trees.md
@@ -593,8 +593,8 @@ constexpr int height(int n) {
 constexpr int offset(int h) {
     int s = 0, n = N;
     while (h--) {
-        s += (n + B - 1) / B * B;
         n = (n + B - 1) / B;
+        s += n * B;
     }
     return s;
 }
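The reordering in this hunk should not change the computed offsets, since the added term is the already-updated `n` multiplied by `B`. A minimal compile-time sketch of that check follows; the values of `B` and `N` and the names `offset_old`, `offset_new`, and `same_offsets` are assumptions made only for illustration.

```c++
constexpr int B = 16;      // node width; the article uses B = 16 as an example
constexpr int N = 1000000; // assumed array size, chosen only for illustration

// Version before the patch: accumulate first, then shrink n.
constexpr int offset_old(int h) {
    int s = 0, n = N;
    while (h--) {
        s += (n + B - 1) / B * B;
        n = (n + B - 1) / B;
    }
    return s;
}

// Version after the patch: shrink n first, then accumulate n * B.
constexpr int offset_new(int h) {
    int s = 0, n = N;
    while (h--) {
        n = (n + B - 1) / B;
        s += n * B;
    }
    return s;
}

// True if both versions agree for every height from 0 to max_h.
constexpr bool same_offsets(int max_h) {
    for (int h = 0; h <= max_h; h++)
        if (offset_old(h) != offset_new(h))
            return false;
    return true;
}

static_assert(same_offsets(8), "both versions must produce identical layer offsets");
```

Because everything is `constexpr`, a mismatch would show up as a compilation error rather than at run time.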

From b8e8ede0ad7a040478df5d985c5bdf417758385b Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Jul 2022 05:37:35 +0300
Subject: [PATCH 142/173] change wording

---
 content/english/hpc/data-structures/segment-trees.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/content/english/hpc/data-structures/segment-trees.md b/content/english/hpc/data-structures/segment-trees.md
index 90435a38..9ad14608 100644
--- a/content/english/hpc/data-structures/segment-trees.md
+++ b/content/english/hpc/data-structures/segment-trees.md
@@ -603,14 +603,14 @@ constexpr int H = height(N);
 alignas(64) int t[offset(H)]; // an array for storing nodes
 ```
 
-This way we effectively reduce the height of the tree by approximately $\frac{\log_B n}{\log_2 n} = \log_2 B$ times ($\sim4$ times if $B = 16$), but it becomes non-trivial to implement in-node operations efficiently. For our problem, we have two main options:
+This way, we effectively reduce the height of the tree by approximately $\frac{\log_B n}{\log_2 n} = \log_2 B$ times ($\sim4$ times if $B = 16$), but it becomes non-trivial to implement in-node operations efficiently. For our problem, we have two main options:
 
 1. We could store $B$ *sums* in each node (for each of its $B$ children).
 2. We could store $B$ *prefix sums* in each node (the $i$-th being the sum of the first $(i + 1)$ children).
 
 If we go with the first option, the `add` query would be largely the same as in the bottom-up segment tree, but the `sum` query would need to add up to $B$ scalars in each node it visits. And if we go with the second option, the `sum` query would be trivial, but the `add` query would need to add `x` to some suffix on each node it visits.
 
-In either case, one operation will perform $O(\log_B n)$ operations, touching just one scalar in each node, while the other will perform $O(B \cdot \log_B n)$ operations, touching up to $B$ scalars in each node. However, it is 21st century, and we can use [SIMD](/hpc/simd) to accelerate the slower operation. Since there are no fast [horizontal reductions](/hpc/simd/reduction) in SIMD instruction sets, but it is easy to add a vector to a vector, we will choose the second approach and store prefix sums in each node.
+In either case, one operation would perform $O(\log_B n)$ operations, touching just one scalar in each node, while the other would perform $O(B \cdot \log_B n)$ operations, touching up to $B$ scalars in each node. We can, however, use [SIMD](/hpc/simd) to accelerate the slower operation. Since there are no fast [horizontal reductions](/hpc/simd/reduction) in SIMD instruction sets, but it is easy to add a vector to a vector, we will choose the second approach and store prefix sums in each node.
 
 This makes the `sum` query extremely fast and easy to implement:
 
@@ -623,7 +623,7 @@ int sum(int k) {
 }
 ```
 
-The `add` query is more complicated and slower. We need to add a number to only a suffix of a node, and we can do this by [masking out](/hpc/simd/masking) the positions that need not be modified.
+The `add` query is more complicated and slower. We need to add a number only to a suffix of a node, and we can do this by [masking out](/hpc/simd/masking) the positions that should not be modified.
 
 We can pre-calculate a $B \times B$ array corresponding to $B$ such masks that tell, for each of $B$ positions within a node, whether a certain prefix sum value needs to be updated or not:
 
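The prefix-sum option and the mask table described in the paragraphs above can be illustrated on a single node. This is a scalar sketch under stated assumptions, not the article's implementation: `Node`, `Masks`, `add`, `sum`, and `example` are hypothetical names, there is no tree around the node, and the inner loop is written so that a compiler could vectorize it rather than using intrinsics.

```c++
#include <cassert>

constexpr int B = 16; // node width, matching the B = 16 example in the text

// Hypothetical single node storing B prefix sums: p[i] is the sum of children 0..i.
struct Node {
    int p[B] = {};
};

// m[k][i] is all ones when prefix sum i includes child k (i >= k), and zero otherwise.
struct Masks {
    int m[B][B];
    constexpr Masks() : m{} {
        for (int k = 0; k < B; k++)
            for (int i = 0; i < B; i++)
                m[k][i] = (i >= k ? -1 : 0);
    }
};

constexpr Masks masks;

// Adds x to child k by updating only the suffix of prefix sums starting at k.
void add(Node &node, int k, int x) {
    assert(0 <= k && k < B);
    for (int i = 0; i < B; i++)
        node.p[i] += x & masks.m[k][i]; // masked-out positions receive zero
}

// Sum of the first k children of the node, for k in [0, B].
int sum(const Node &node, int k) {
    assert(0 <= k && k <= B);
    return k == 0 ? 0 : node.p[k - 1];
}

// Example: after adding 5 to child 3 and 2 to child 0, the first four children sum to 7.
int example() {
    Node node;
    add(node, 3, 5);
    add(node, 0, 2);
    return sum(node, 4);
}
```

The point of the mask is that `add` touches all `B` positions unconditionally, which keeps the loop free of branches and maps naturally onto a vector addition.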

From 6a06a065e37eb052a4774a2415cfe9fd356acfce Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Jul 2022 05:48:13 +0300
Subject: [PATCH 143/173] adjust header padding

---
 themes/algorithmica/assets/style.sass | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/themes/algorithmica/assets/style.sass b/themes/algorithmica/assets/style.sass
index eb5e2410..b91f9a5f 100644
--- a/themes/algorithmica/assets/style.sass
+++ b/themes/algorithmica/assets/style.sass
@@ -187,10 +187,10 @@ menu
   display: flex
   font-family: $font-headings
   
-  height: 30px
+  height: 26px
   background-color: $background
   justify-content: space-between
-  padding: 12px
+  padding: 14px
   margin: 0
   text-align: center
 

From 1c8c455097f458dcb86f67e40c77fd1ca15830c6 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Jul 2022 05:57:43 +0300
Subject: [PATCH 144/173] change simd titles

---
 content/english/hpc/_index.md         | 4 ++--
 content/english/hpc/simd/moving.md    | 2 +-
 content/english/hpc/simd/reduction.md | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md
index 942c9f6a..7a0068ff 100644
--- a/content/english/hpc/_index.md
+++ b/content/english/hpc/_index.md
@@ -163,8 +163,8 @@ Planned table of contents:
  9.11. AoS and SoA
 10. SIMD Parallelism
  10.1. Intrinsics and Vector Types
- 10.2. Loading and Writing Data
- 10.3. Sums and Other Reductions
+ 10.2. Moving Data
+ 10.3. Reductions
  10.4. Masking and Blending
  10.5. In-Register Shuffles
  10.6. Auto-Vectorization
diff --git a/content/english/hpc/simd/moving.md b/content/english/hpc/simd/moving.md
index 948c31c4..72cbbd33 100644
--- a/content/english/hpc/simd/moving.md
+++ b/content/english/hpc/simd/moving.md
@@ -1,5 +1,5 @@
 ---
-title: Loading and Writing Data
+title: Moving Data
 aliases: [/hpc/simd/vectorization]
 weight: 2
 ---
diff --git a/content/english/hpc/simd/reduction.md b/content/english/hpc/simd/reduction.md
index c67c1942..5a0ace1e 100644
--- a/content/english/hpc/simd/reduction.md
+++ b/content/english/hpc/simd/reduction.md
@@ -1,5 +1,5 @@
 ---
-title: Sums and Other Reductions
+title: Reductions
 weight: 3
 ---
 

From af2c2b90dedcd2ab701cff977383529244219dbc Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Jul 2022 06:11:31 +0300
Subject: [PATCH 145/173] improving search

---
 themes/algorithmica/assets/style.sass          | 3 +++
 themes/algorithmica/layouts/partials/head.html | 1 +
 2 files changed, 4 insertions(+)

diff --git a/themes/algorithmica/assets/style.sass b/themes/algorithmica/assets/style.sass
index b91f9a5f..00a420cf 100644
--- a/themes/algorithmica/assets/style.sass
+++ b/themes/algorithmica/assets/style.sass
@@ -236,6 +236,9 @@ menu
     background: $code-background
     border: $code-border
 
+    &:focus
+      outline: 1px solid $dimmed
+
   #search-count
     margin-top: 8px
     color: $dimmed
diff --git a/themes/algorithmica/layouts/partials/head.html b/themes/algorithmica/layouts/partials/head.html
index 2f4c3c46..c5013dba 100644
--- a/themes/algorithmica/layouts/partials/head.html
+++ b/themes/algorithmica/layouts/partials/head.html
@@ -45,6 +45,7 @@
       if (window.getComputedStyle(searchDiv).display == 'none') {
         searchDiv.style.display = 'block'
         window.scrollTo({ top: 0 });
+        document.getElementById('search-bar').focus()
       } else {
         searchDiv.style.display = 'none'  
       }

From 72e00452f0bbd5d30d436ff207fc3f91ee3c678d Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Jul 2022 07:47:49 +0300
Subject: [PATCH 146/173] spmd

---
 content/english/hpc/_index.md                 |  2 +-
 content/english/hpc/simd/_index.md            |  2 +-
 .../english/hpc/simd/auto-vectorization.md    | 26 ++++++++++++++-----
 3 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md
index 7a0068ff..8d73bcb0 100644
--- a/content/english/hpc/_index.md
+++ b/content/english/hpc/_index.md
@@ -167,7 +167,7 @@ Planned table of contents:
  10.3. Reductions
  10.4. Masking and Blending
  10.5. In-Register Shuffles
- 10.6. Auto-Vectorization
+ 10.6. Auto-Vectorization and SPMD
 11. Algorithm Case Studies
  11.1. Binary GCD
 (11.2. Prime Number Sieves)
diff --git a/content/english/hpc/simd/_index.md b/content/english/hpc/simd/_index.md
index 5e05da8e..50f6e3ed 100644
--- a/content/english/hpc/simd/_index.md
+++ b/content/english/hpc/simd/_index.md
@@ -43,6 +43,6 @@ In particular, AVX2 has instructions for working with 256-bit registers, while b
 
 ![](img/intel-extensions.webp)
 
-Compilers often do a good job rewriting simple loops with SIMD instructions, like in the case above. This optimization is called [auto-vectorization](auto-vectorization), and it is the preferred way to use SIMD.
+Compilers often do a good job rewriting simple loops with SIMD instructions, like in the case above. This optimization is called [auto-vectorization](auto-vectorization), and it is the most popular way of using SIMD.
 
 The problem is that it only works with certain types of loops, and even then it often yields suboptimal results. To understand its limitations, we need to get our hands dirty and explore this technology on a lower level, which is what we are going to do in this chapter.
diff --git a/content/english/hpc/simd/auto-vectorization.md b/content/english/hpc/simd/auto-vectorization.md
index 5fc568c3..b7b8a45f 100644
--- a/content/english/hpc/simd/auto-vectorization.md
+++ b/content/english/hpc/simd/auto-vectorization.md
@@ -1,15 +1,17 @@
 ---
-title: Auto-Vectorization
+title: Auto-Vectorization and SPMD
 weight: 10
 ---
 
-SIMD-parallelism is most often used for *embarrassingly parallel* computations: the kinds where all you do is apply some elementwise function to all elements of an array and write it back somewhere else. In this setting, you don't even need to know how SIMD works: the compiler is perfectly capable of optimizing such loops by itself — you just need to be aware that such optimization exists and that it usually yields a 5-10x speedup.
+SIMD parallelism is most often used for *embarrassingly parallel* computations: the kinds where all you do is apply some elementwise function to all elements of an array and write it back somewhere else. In this setting, you don't even need to know how SIMD works: the compiler is perfectly capable of optimizing such loops by itself — you just need to be aware that such optimization exists and that it usually yields a 5-10x speedup.
 
-Doing nothing and relying on auto-vectorization is actually the preferred way of using SIMD. Whenever you can, you should always stick with the scalar code for its simplicity and maintainability. But often even the loops that seem straightforward to vectorize are not optimized because of some technical nuances. [As in many other cases](/hpc/compilation/contracts), the compiler may need some additional input from the programmer as he may know a bit more about the problem than what can be inferred from static analysis.
+Doing nothing and relying on auto-vectorization is actually the most popular way of using SIMD. In fact, in many cases, it is even advised to stick with the plain scalar code for its simplicity and maintainability.
+
+But often even the loops that seem straightforward to vectorize are not optimized because of some technical nuances. [As in many other cases](/hpc/compilation/contracts), the compiler may need some additional input from the programmer, who may know a bit more about the problem than what can be inferred from static analysis.
 
 ### Potential Problems
 
-Consider the "a + b" example:
+Consider the "a + b" example we [started with](../intrinsics/#simd-intrinsics):
 
 ```c++
 void sum(int *a, int *b, int *c, int n) {
@@ -47,8 +49,18 @@ for (int i = 0; i < n; i++)
 
 To help the compiler eliminate this corner case, we can use the `alignas` specifier on static arrays and the `std::assume_aligned` function to mark pointers aligned.
 
-**Checking if vectorization happened.**  In either case, it is useful to check if the compiler vectorized the loop the way you intended. You can either [compiling it to assembly](/hpc/compilation/stages) and look for blocks for instructions that start with a "v" or add the `-fopt-info-vec-optimized` compiler flag so that the compiler indicates where auto-vectorization is happening and what SIMD width is being used. If you swap `optimized` for `missed` or `all`, you may also get some reasoning behind why it is not happening in other places.
+**Checking if vectorization happened.** In either case, it is useful to check if the compiler vectorized the loop the way you intended. You can either [compile it to assembly](/hpc/compilation/stages) and look for blocks of instructions that start with a "v" or add the `-fopt-info-vec-optimized` compiler flag so that the compiler indicates where auto-vectorization is happening and what SIMD width is being used. If you swap `optimized` for `missed` or `all`, you may also get some reasoning behind why it is not happening in other places.
 
----
+There are [many other ways](https://software.intel.com/sites/default/files/m/4/8/8/2/a/31848-CompilerAutovectorizationGuide.pdf) of telling the compiler exactly what we mean, but in especially complex cases — e.g., when there are a lot of branches or function calls inside the loop — it is easier to go one level of abstraction down and vectorize manually.
+
+### SPMD
+
+There is a neat compromise between auto-vectorization and the manual use of SIMD intrinsics: "single program, multiple data" (SPMD). This is a model of computation in which the programmer writes what appears to be a regular serial program, but that is actually executed in parallel on the hardware.  
+
+The programming experience is largely the same, and there is still the fundamental limitation in that the computation must be data-parallel, but SPMD ensures that the vectorization will happen regardless of the compiler and the target CPU architecture. It also allows for the computation to be automatically parallelized across multiple cores and, in some cases, even offloaded to other types of parallel hardware.
+
+There is support for SPMD in some modern languages ([Julia](https://docs.julialang.org/en/v1/base/base/#Base.SimdLoop.@simd)), multiprocessing APIs ([OpenMP](https://www.openmp.org/spec-html/5.0/openmpsu42.html)), and specialized compilers (Intel [ISPC](https://ispc.github.io/)), but it has seen the most success in the context of GPU programming where both problems and hardware are massively parallel.
+
+We will cover this model of computation in much more depth in Part II.
 
-There are [many other ways](https://software.intel.com/sites/default/files/m/4/8/8/2/a/31848-CompilerAutovectorizationGuide.pdf) of telling the compiler what we meant exactly, but in especially complex cases — when inside the loop there are a lot of branches or some functions are called — it is easier to go down to the intrinsics level and write it yourself.
+<!-- This approach is especially popular with [game developers](https://twitter.com/pbrubaker/status/1537041398037303296) because they need to support many platforms and have reliable performance, and also because it resembles the way graphics programming is done. -->
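As one possible illustration of the OpenMP route mentioned in the SPMD section above, the "a + b" loop can be annotated with the `simd` construct. This is a sketch rather than the article's code: the `const` qualifiers and the `example` helper are additions, and the pragma only takes effect when the compiler is invoked with `-fopenmp` or `-fopenmp-simd` (GCC and Clang); otherwise it is ignored and the loop stays an ordinary scalar loop.

```c++
#include <cassert>
#include <vector>

// Element-wise c = a + b. The pragma asks the compiler to vectorize the loop
// and asserts that its iterations are independent of each other.
void sum(const int *a, const int *b, int *c, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// Hypothetical helper for trying it out: fills a[i] = i and b[i] = 2 * i,
// runs the loop, and returns c[k].
int example(int n, int k) {
    assert(0 <= k && k < n);
    std::vector<int> a(n), b(n), c(n);
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = 2 * i;
    }
    sum(a.data(), b.data(), c.data(), n);
    return c[k];
}
```

Since the pragma is a promise by the programmer, passing overlapping arrays that create a dependency between iterations would make the result unreliable. Compiling with `-fopt-info-vec-optimized` on GCC is one way to confirm whether the loop was vectorized.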

From af8d237fcc253fd5a1d32d281e26fa94f2cae948 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Jul 2022 08:39:36 +0300
Subject: [PATCH 147/173] update index

---
 content/english/hpc/_index.md | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md
index 8d73bcb0..a1ff7f42 100644
--- a/content/english/hpc/_index.md
+++ b/content/english/hpc/_index.md
@@ -39,11 +39,11 @@ After that, I will mostly be fixing errors and only doing some minor edits refle
 
 **Pre-ordering / financially supporting the book.** Due to my unfortunate citizenship and place of birth, you can't — that is, until I find a way that at the same time complies with international sanctions, doesn't sponsor [the war](https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukraine), and won't put me in prison for tax evasion.
 
-So, don't bother. If you want to support this book, just share the articles you like on link aggregators and social media and help fix typos — that would be enough.
+So, don't bother. If you want to support this book, just share it and help fix typos — that would be enough.
 
 **Translations.** The website has a separate functionality for creating and managing translations — and I've already been contacted by some nice people willing to translate the book into Italian and Chinese (and I will personally translate at least some of it into my native Russian).
 
-However, as the book is still evolving, it is probably not the best idea to start translating it at least until Part I is finished. That said, you are very much encouraged to make translations of any articles and publish them in your blogs — just send me the link so that we can merge it back when a centralized translation process starts.
+However, as the book is still evolving, it is probably not the best idea to start translating it at least until Part I is finished. That said, you are very much encouraged to make translations of any articles and publish them in your blogs — just send me the link so that we can merge it back when centralized translation starts.
 
 **"Translating" the Russian version.** The articles hosted at [ru.algorithmica.org/cs/](https://ru.algorithmica.org/cs/) are not about advanced performance engineering but mostly about classical computer science algorithms — without discussing how to speed them up beyond asymptotic complexity. Most of the information there is not unique and already exists in English on some other places on the internet: for example, the similar-spirited [cp-algorithms.com](https://cp-algorithms.com/).
 
@@ -51,7 +51,7 @@ However, as the book is still evolving, it is probably not the best idea to star
 
 There are two highly impactful textbooks on which most computer science courses are built. Both are undoubtedly outstanding, but [one of them](https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming) is 50 years old, and [the other](https://en.wikipedia.org/wiki/Introduction_to_Algorithms) is 30 years old, and [computers have changed a lot](/hpc/complexity/hardware) since then. Asymptotic complexity is not the sole deciding factor anymore. In modern practical algorithm design, you choose the approach that makes better use of different types of parallelism available in the hardware over the one that theoretically does fewer raw operations on galaxy-scale inputs.
 
-And yet, the computer science curricula in most colleges completely ignore this shift. Although there are some great courses that aim to correct that — such as "[Performance Engineering of Software Systems](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/)" from MIT, "[Programming Parallel Computers](https://ppc.cs.aalto.fi/)" from Aalto University, and some non-academic ones like Denis Bakhvalov's "[Performance Ninja](https://github.com/dendibakh/perf-ninja)" — most computer science graduates still treat the hardware like something from the 1990s.
+And yet, the computer science curricula in most colleges completely ignore this shift. Although there are some great courses that aim to correct that — such as "[Performance Engineering of Software Systems](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2018/)" from MIT, "[Programming Parallel Computers](https://ppc.cs.aalto.fi/)" from Aalto University, and some non-academic ones like Denis Bakhvalov's "[Performance Ninja](https://github.com/dendibakh/perf-ninja)" — most computer science graduates still treat modern hardware like something from the 1990s.
 
 What I really want to achieve is that performance engineering becomes taught right after introduction to algorithms. Writing the first comprehensive textbook on the subject is a large part of it, and this is why I rush to finish it by the summer so that the colleges can pick it up in the next academic year. But creating a new course requires more than that: you need a balanced curriculum, course infrastructure, lecture slides, lab assignments… so for some time after finishing the main book, I will be working on course materials and tools for *teaching* performance engineering — and I'm looking forward to collaborating with other people who want to make it a reality as well.
 
@@ -76,7 +76,7 @@ Competitive programming is, in my opinion, misguided. They are doing useless thi
 
 The first part covers the basics of computer architecture and optimization of single-threaded algorithms.
 
-It walks through the main CPU optimization topics such as caching, SIMD and pipelining, and provides brief examples in C++, followed by large case studies where we usually achieve a significant speedup over some STL algorithm or data structure.
+It walks through the main CPU optimization topics such as caching, SIMD, and pipelining, and provides brief examples in C++, followed by large case studies where we usually achieve a significant speedup over some STL algorithm or data structure.
 
 Planned table of contents:
 
@@ -94,7 +94,7 @@ Planned table of contents:
  1.4. Functions and Recursion
  1.5. Indirect Branching
  1.6. Machine Code Layout
- 1.7. Interrupts and System Calls
+ 1.7. System Calls
  1.8. Virtualization
 3. Instruction-Level Parallelism
  3.1. Pipeline Hazards
@@ -215,7 +215,7 @@ Among the cool things that we will speed up:
 - optimal Karatsuba Algorithm
 - optimal FFT
 
-This work is largely based on blog posts, research papers, conference talks and other work authored by a lot of people:
+This work is largely based on blog posts, research papers, conference talks, and other work authored by a lot of people:
 
 - [Agner Fog](https://agner.org/optimize/)
 - [Daniel Lemire](https://lemire.me/en/#publications)
@@ -248,29 +248,33 @@ This work is largely based on blog posts, research papers, conference talks and
 - [Creel](https://www.youtube.com/c/WhatsACreel)
 
 Volume: 450-600 pages  
-Release date: Q2 2022
+Release date: Q3 2022
 
 ### Part II: Parallel Algorithms
 
-Concurrency, models of parallelism, green threads and concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking and graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication and sorting.
+Concurrency, models of parallelism, green threads, concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking, graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication, sorting.
 
 Volume: 150-200 pages  
-Release date: 2023?
+Release date: 2023-2024?
 
 ### Part III: Distributed Computing
 
-Communication-constrained algorithms, message passing, actor model, partitioning, MapReduce, consistency and reliability at scale, storage, compression, scheduling and cloud computing, distributed deep learning.
+(I might need some help from here on.)
+
+Metworking, message passing, actor model, communication-constrained algorithms, distributed primitives, all-reduce, MapReduce, stream processing, query planning, storage, sharding, compression, consistency, reliability, scheduling, cloud computing.
 
 Release date: ??? (more likely to be completed than not)
 
 ### Part IV: Compilers and Domain-Specific Architectures
 
-LLVM IR, compiler optimizations, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++ and oneAPI, XLA,  Verilog, FPGAs, ASICs, TPUs and other AI accelerators.
+(TODO: come up with a better title — one that emphasizes that this part is mainly about the software-hardware boundary and not PL/IC design.)
+
+LLVM IR, compiler optimizations & back-end, interpreters, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++, oneAPI, XLA, (basic) Verilog, FPGAs, ASICs, TPUs and other AI accelerators.
 
 Release date: ??? (less likely to be completed than not)
 
 ### Disclaimer: Technology Choices
 
-The examples in this book use C++, GCC, x86-64, CUDA, and Spark, although the underlying principles we aim to convey are not specific to them.
+The examples in this book use C++, GCC, x86-64, CUDA, and Spark, although the underlying principles conveyed are not specific to them.
 
 To clear my conscience, I'm not happy with any of these choices: these technologies just happen to be the most widespread and stable at the moment and thus more helpful to the reader. I would have respectively picked C / Rust, LLVM, arm, OpenCL, and Dask; maybe there will be a 2nd edition in which some of the tech stack is changed.

From 20b8479c5ac2ed627cd86baa12e4e6656074c8ae Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Jul 2022 22:52:44 +0300
Subject: [PATCH 148/173] update hpc index

---
 content/english/hpc/_index.md | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md
index a1ff7f42..8c5e9ef2 100644
--- a/content/english/hpc/_index.md
+++ b/content/english/hpc/_index.md
@@ -239,6 +239,10 @@ This work is largely based on blog posts, research papers, conference talks, and
 - [Geoff Langdale](https://branchfree.org/)
 - [Matt Kulukundis](https://twitter.com/JuvHarlequinKFM)
 - [Georg Sauthoff](https://gms.tf/)
+- [Danila Kutenin](https://danlark.org/author/kutdanila/)
+- [Ivica Bogosavljević](https://johnysswlab.com/author/ibogi/)
+- [Matt Pharr](https://pharr.org/matt/)
+- [Jan Wassenberg](https://research.google/people/JanWassenberg/)
 - [Marshall Lochbaum](https://mlochbaum.github.io/publications.html)
 - [Pavel Zemtsov](https://pzemtsov.github.io/)
 - [Nayuki](https://www.nayuki.io/category/programming)
@@ -252,22 +256,22 @@ Release date: Q3 2022
 
 ### Part II: Parallel Algorithms
 
-Concurrency, models of parallelism, green threads, concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking, graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication, sorting.
+Concurrency, models of parallelism, context switching, green threads, concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking, graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication, sorting.
 
 Volume: 150-200 pages  
 Release date: 2023-2024?
 
 ### Part III: Distributed Computing
 
-(I might need some help from here on.)
+<!-- (I might need some help from here on.) -->
 
-Metworking, message passing, actor model, communication-constrained algorithms, distributed primitives, all-reduce, MapReduce, stream processing, query planning, storage, sharding, compression, consistency, reliability, scheduling, cloud computing.
+Networking, message passing, actor model, communication-constrained algorithms, distributed primitives, all-reduce, MapReduce, stream processing, query planning, storage, sharding, compression, distributed databases, consistency, reliability, scheduling, workflow engines, cloud computing.
 
 Release date: ??? (more likely to be completed than not)
 
-### Part IV: Compilers and Domain-Specific Architectures
+### Part IV: Software & Hardware
 
-(TODO: come up with a better title — one that emphasizes that this part is mainly about the software-hardware boundary and not PL/IC design.)
+<!-- (TODO: come up with a better title — one that emphasizes that this part is mainly about the software-hardware boundary and not PL/IC design.) -->
 
 LLVM IR, compiler optimizations & back-end, interpreters, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++, oneAPI, XLA, (basic) Verilog, FPGAs, ASICs, TPUs and other AI accelerators.
 
@@ -277,4 +281,4 @@ Release date: ??? (less likely to be completed than not)
 
 The examples in this book use C++, GCC, x86-64, CUDA, and Spark, although the underlying principles conveyed are not specific to them.
 
-To clear my conscience, I'm not happy with any of these choices: these technologies just happen to be the most widespread and stable at the moment and thus more helpful to the reader. I would have respectively picked C / Rust, LLVM, arm, OpenCL, and Dask; maybe there will be a 2nd edition in which some of the tech stack is changed.
+To clear my conscience, I'm not happy with any of these choices: these technologies just happen to be the most widespread and stable at the moment and thus more helpful to the reader. I would have respectively picked C / Rust / [Carbon?](https://github.com/carbon-language/carbon-lang), LLVM, arm, OpenCL, and Dask; maybe there will be a 2nd edition in which some of the tech stack is changed.

From 6b522385797429bf1a1b5c0295f33ac73350e1a1 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Jul 2022 23:55:19 +0300
Subject: [PATCH 149/173] edit number theory intro

---
 content/english/hpc/number-theory/_index.md | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/content/english/hpc/number-theory/_index.md b/content/english/hpc/number-theory/_index.md
index 6812e14c..d66f85fd 100644
--- a/content/english/hpc/number-theory/_index.md
+++ b/content/english/hpc/number-theory/_index.md
@@ -3,17 +3,15 @@ title: Number Theory
 weight: 7
 ---
 
-*Disclaimer: this chapter is a very early draft that is probably not worth reading yet.*
-
 In 1940, a British mathematician [G. H. Hardy](https://en.wikipedia.org/wiki/G._H._Hardy) published a famous essay titled "[A Mathematician's Apology](https://en.wikipedia.org/wiki/A_Mathematician%27s_Apology)" discussing the notion that mathematics should be pursued for its own sake rather than for the sake of its applications.
 
-I personally don't agree — and I wrote this book partially to show that there are way too few people working on practical algorithm design instead of theoretical computer science — but I understand where Hardy is coming from. Being 62 years old, he witnessed the devastation caused by the First and the ongoing Second World War that was greatly amplified by the weaponization of science.
+Similar to mathematics, the various fields of computer science also form a spectrum, with mathematical logic and computability theory on one end and web programming and application development on the other. I assume that you, the reader, are more on the applied side: this book was written to show that there are way too few people working on practical algorithm design instead of theoretical computer science — and since you got to Chapter 7, you probably also believe in that statement.
 
-As a number theorist, Hardy finds calm working in a "useless" field and not having to face any moral dilemmas, writing:
+But, regardless of the personal views on the matter, one can see where Hardy is coming from. Being 62 years old at the moment of writing, he witnessed the devastation caused by the First and the ongoing Second World War — which was greatly amplified by the weaponization of science. As a number theorist, Hardy finds calm working in a "useless" field and not having to face any moral dilemmas, writing:
 
 > No one has yet discovered any warlike purpose to be served by the theory of numbers or relativity, and it seems unlikely that anyone will do so for many years.
 
-Ironically, this statement was proved very wrong just 5 years later with the development of the atomic bomb, which would not have been possible without the [understanding](https://en.wikipedia.org/wiki/Einstein%E2%80%93Szil%C3%A1rd_letter) of relativity, and the inception of computer-era cryptography, which extensively builds on number theory.
+Ironically, this statement was proved very wrong just 5 years later with the development of the atomic bomb, which would not have been possible without the [understanding](https://en.wikipedia.org/wiki/Einstein%E2%80%93Szil%C3%A1rd_letter) of relativity, and the inception of computer-era cryptography, which extensively builds on number theory — the computational aspect of which is the topic of this chapter.
 
 <!--
 

From da0ea49775ac5bd8c4fa0347abb7589bd60ccbcc Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 20 Jul 2022 23:57:27 +0300
Subject: [PATCH 150/173] change wording

---
 content/english/hpc/number-theory/_index.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/english/hpc/number-theory/_index.md b/content/english/hpc/number-theory/_index.md
index d66f85fd..e91fa1fb 100644
--- a/content/english/hpc/number-theory/_index.md
+++ b/content/english/hpc/number-theory/_index.md
@@ -7,11 +7,11 @@ In 1940, a British mathematician [G. H. Hardy](https://en.wikipedia.org/wiki/G._
 
 Similar to mathematics, the various fields of computer science also form a spectrum, with mathematical logic and computability theory on one end and web programming and application development on the other. I assume that you, the reader, are more on the applied side: this book was written to show that there are way too few people working on practical algorithm design instead of theoretical computer science — and since you got to Chapter 7, you probably also believe in that statement.
 
-But, regardless of the personal views on the matter, one can see where Hardy is coming from. Being 62 years old at the moment of writing, he witnessed the devastation caused by the First and the ongoing Second World War — which was greatly amplified by the weaponization of science. As a number theorist, Hardy finds calm working in a "useless" field and not having to face any moral dilemmas, writing:
+But, regardless of the personal views on the matter, one can see where Hardy is coming from. Being 62 years old at the date of writing, he witnessed the devastation caused by the First and the ongoing Second World War — which was greatly amplified by the weaponization of science. As a number theorist, Hardy finds calm working in a "useless" field and not having to face any moral dilemmas, writing:
 
 > No one has yet discovered any warlike purpose to be served by the theory of numbers or relativity, and it seems unlikely that anyone will do so for many years.
 
-Ironically, this statement was proved very wrong just 5 years later with the development of the atomic bomb, which would not have been possible without the [understanding](https://en.wikipedia.org/wiki/Einstein%E2%80%93Szil%C3%A1rd_letter) of relativity, and the inception of computer-era cryptography, which extensively builds on number theory — the computational aspect of which is the topic of this chapter.
+Ironically, this statement was proved very wrong just 5 years later with the development of the atomic bomb, which would not have been possible without the [understanding](https://en.wikipedia.org/wiki/Einstein%E2%80%93Szil%C3%A1rd_letter) of relativity, and the inception of computer-era cryptography, which extensively builds on number theory — the computational aspect of which is the main topic of this chapter.
 
 <!--
 

From 4a0fdc358ade810cd7e462df3b6fef88f270d5e3 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 21 Jul 2022 00:43:23 +0300
Subject: [PATCH 151/173] number theory chapter edits

---
 .../english/hpc/number-theory/euclid-extended.md |  8 ++++----
 .../english/hpc/number-theory/exponentiation.md  | 16 +++++++++-------
 content/english/hpc/number-theory/modular.md     |  8 ++++----
 content/english/hpc/number-theory/montgomery.md  | 10 +++++-----
 4 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/content/english/hpc/number-theory/euclid-extended.md b/content/english/hpc/number-theory/euclid-extended.md
index 54fa0b1e..a37c1b29 100644
--- a/content/english/hpc/number-theory/euclid-extended.md
+++ b/content/english/hpc/number-theory/euclid-extended.md
@@ -9,9 +9,9 @@ $$
 a^{\phi(m)} \equiv 1 \pmod m
 $$
 
-where $\phi(m)$ is [Euler's totient function](https://en.wikipedia.org/wiki/Euler%27s_totient_function) defined as the number of positive integers $x < m$ that are coprime with $m$. In particular case when $m$ is a prime, then all the $m - 1$ residues are coprime and $\phi(m) = m - 1$, yielding the Fermat's theorem.
+where $\phi(m)$ is [Euler's totient function](https://en.wikipedia.org/wiki/Euler%27s_totient_function) defined as the number of positive integers $x < m$ that are coprime with $m$. In the special case when $m$ is prime, all the $m - 1$ nonzero residues are coprime with $m$ and $\phi(m) = m - 1$, yielding Fermat's theorem.
 
-This lets us calculate the inverse of $a$ as $a^{\phi(m) - 1}$ if we know $\phi(m)$, but calculating it is, in turn, not so fast: you usually need to obtain the factorization of $m$. There is a more general method that works by modifying the [the Euclidean algorthm](/hpc/algorithms/gcd/).
+This lets us calculate the inverse of $a$ as $a^{\phi(m) - 1}$ if we know $\phi(m)$, but calculating it is, in turn, not so fast: you usually need to obtain the [factorization](/hpc/algorithms/factorization/) of $m$ to do it. There is a more general method that works by modifying [the Euclidean algorithm](/hpc/algorithms/gcd/).
 
 ### Algorithm
 
@@ -95,6 +95,6 @@ int inverse(int a) {
 }
 ```
 
-Note that, unlike binary exponentiation, the running time depends on the value of $a$. For example, for this particular value of $m$ ($10^9 + 7$), the worst input happens to be 564400443, on which the algorithm performs 37 iterations and taking 250ns.
+Note that, unlike binary exponentiation, the running time depends on the value of $a$. For example, for this particular value of $m$ ($10^9 + 7$), the worst input happens to be 564400443, for which the algorithm performs 37 iterations and takes 250ns.
 
-**Exercise**. Try to adapt the same technique for binary GCD (it won't give performance speedup though unless you are better than me at optimization).
+**Exercise**. Try to adapt the same technique for the [binary GCD](/hpc/algorithms/gcd/#binary-gcd) (it won't give a performance speedup, though, unless you are better than me at optimization).
diff --git a/content/english/hpc/number-theory/exponentiation.md b/content/english/hpc/number-theory/exponentiation.md
index b5964463..8806257d 100644
--- a/content/english/hpc/number-theory/exponentiation.md
+++ b/content/english/hpc/number-theory/exponentiation.md
@@ -3,7 +3,7 @@ title: Binary Exponentiation
 weight: 2
 ---
 
-In modular arithmetic and computational algebra in general, you often need to raise a number to the $n$-th power — to do [modular division](../modular/#modular-division), perform [primality tests](../modular/#fermats-theorem), or compute some combinatorial values — ­and you usually want to spend fewer than $\Theta(n)$ operations calculating it.
+In modular arithmetic (and computational algebra in general), you often need to raise a number to the $n$-th power — to do [modular division](../modular/#modular-division), perform [primality tests](../modular/#fermats-theorem), or compute some combinatorial values — and you usually want to spend fewer than $\Theta(n)$ operations calculating it.
 
 *Binary exponentiation*, also known as *exponentiation by squaring*, is a method that allows for computation of the $n$-th power using $O(\log n)$ multiplications, relying on the following observation:
 
@@ -54,9 +54,11 @@ u64 inverse(u64 a) {
 }
 ```
 
-We use $m = 10^9+7$, which is a modulo value is commonly used in *competitive programming* to calculate checksums in combinatorial problems because it is prime (allowing inverse via binary exponentiation), sufficiently large while not overflowing `int` in addition or `long long` in multiplication, and is easy to type as `1e9 + 7`. Since it is a compile-time constant, the compiler can optimize the modulo by [replacing it with multiplication](/hpc/arithmetic/division/) (even if it is not a compile-time constant, it is still cheaper to compute the magic constants by hand once and use them for fast reduction).
+We use $m = 10^9+7$, which is a modulo value commonly used in competitive programming to calculate checksums in combinatorial problems — because it is prime (allowing inverse via binary exponentiation), sufficiently large, not overflowing `int` in addition, not overflowing `long long` in multiplication, and easy to type as `1e9 + 7`.
 
-The execution path and hence the running time depends on the value of $n$. For this particular $n$, the baseline implementation takes around 330ns per call. As recursion introduces some [overhead](/hpc/architecture/functions/), it makes sense to unroll it into an iterative one.
+Since we use it as a compile-time constant in the code, the compiler can optimize the modulo by [replacing it with multiplication](/hpc/arithmetic/division/) (even if it is not a compile-time constant, it is still cheaper to compute the magic constants by hand once and use them for fast reduction).
+
+The execution path — and consequently the running time — depends on the value of $n$. For this particular $n$, the baseline implementation takes around 330ns per call. As recursion introduces some [overhead](/hpc/architecture/functions/), it makes sense to unroll the implementation into an iterative procedure.
 
 ### Iterative Implementation
 
@@ -66,7 +68,7 @@ $$
 a^{42} = a^{32+8+2} = a^{32} \cdot a^8 \cdot a^2 
 $$
 
-To calculate this product, we can iterate over the bits of $n$ maintaining two variables: the value of $a^{2^k}$ and the current product after considering $k$ lowest bits. On each step, we multiply the current product by $a^{2^k}$ if the $k$-th bit of $n$ is set, and, in either case, square $a^k$ to get $a^{2^k \cdot 2} = a^{2^{k+1}}$ that will be used on the next iteration.
+To calculate this product, we can iterate over the bits of $n$ maintaining two variables: the value of $a^{2^k}$ and the current product after considering the $k$ lowest bits of $n$. On each step, we multiply the current product by $a^{2^k}$ if the $k$-th bit of $n$ is set, and, in either case, square $a^{2^k}$ to get $a^{2^k \cdot 2} = a^{2^{k+1}}$ that will be used on the next iteration.
 
 ```c++
 u64 binpow(u64 a, u64 n) {
@@ -85,7 +87,7 @@ u64 binpow(u64 a, u64 n) {
 
 The iterative implementation takes about 180ns per call. The heavy calculations are the same; the improvement mainly comes from the reduced dependency chain: `a = a * a % M` needs to finish before the loop can proceed, and it can now execute concurrently with `r = res * a % M`.
 
-The performance also benefits from $n$ being a constant, [making all branches predictable](/hpc/pipelining/branching/) and letting the scheduler know what needs to be executed in advance. The compiler, however, does not take advantage of it and does not unroll the `while(n) n >>= 1` loop. We can rewrite it as a `for` loop that takes constant 30 iterations:
+The performance also benefits from $n$ being a constant, [making all branches predictable](/hpc/pipelining/branching/) and letting the scheduler know what needs to be executed in advance. The compiler, however, does not take advantage of it and does not unroll the `while(n) n >>= 1` loop. We can rewrite it as a `for` loop that performs a constant 30 iterations:
 
 ```c++
 u64 inverse(u64 a) {
@@ -102,6 +104,6 @@ u64 inverse(u64 a) {
 }
 ```
 
-This forces the compiler to generate only the instructions we need, shoving off another 10ns and making the total running time ~170ns.
+This forces the compiler to generate only the instructions we need, shaving off another 10ns and making the total running time ~170ns.
 
-Note that the performance depends not only on the binary length of $n$, but also on the number of binary 1s. If $n$ is $2^{30}$, it takes around 20ns less not having to perform these off-path multiplications.
+Note that the performance depends not only on the binary length of $n$, but also on the number of binary 1s. If $n$ is $2^{30}$, it takes around 20ns less as we don't have to perform any off-path multiplications.
diff --git a/content/english/hpc/number-theory/modular.md b/content/english/hpc/number-theory/modular.md
index 2f90bd95..47310780 100644
--- a/content/english/hpc/number-theory/modular.md
+++ b/content/english/hpc/number-theory/modular.md
@@ -3,10 +3,10 @@ title: Modular Arithmetic
 weight: 1
 ---
 
-TODO: use it in binary exponentiation.
-
 <!--
 
+TODO: use it in binary exponentiation.
+
 In this section, we are going to discuss some preliminaries before discussing more advanced topics.
 
 we use the 1st of January, 1970 as the start of the "Unix era," and all time computations are usually done relative to that timestamp.
@@ -19,9 +19,9 @@ And the beautiful thing about it is that remainders are small and cyclic. Think
 
 Computers usually store time as the number of seconds that have passed since the 1st of January, 1970 — the start of the "Unix era" — and use these timestamps in all computations that have to do with time.
 
-We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. For example, at the moment of writing, approximately 63882260594 seconds have passed since 0 AD.
+We humans also keep track of time relative to some point in the past, which usually has a political or religious significance. For example, at the moment of writing, approximately 63882260594 seconds have passed since 1 AD — [6th century Eastern Roman monks' best estimate](https://en.wikipedia.org/wiki/Anno_Domini) of the day Jesus Christ was born.
 
-But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now and it's time to go to dinner, or that it's Thursday, and so Subway's sub of the day is an Italian BMT. Instead of the whole timestamp, we use its *remainder* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones.
+But unlike computers, we do not always need *all* that information. Depending on the task at hand, the relevant part may be that it's 2 pm right now, and it's time to go to dinner; or that it's Thursday, and so Subway's sub of the day is an Italian BMT. Instead of the whole timestamp, we use its *remainder* containing just the information we need: it is much easier to deal with 1- or 2-digit numbers than 11-digit ones.
 
 **Problem.** Today is Thursday. What day of the week will be exactly in a year?
 
diff --git a/content/english/hpc/number-theory/montgomery.md b/content/english/hpc/number-theory/montgomery.md
index 038391dd..669e39ba 100644
--- a/content/english/hpc/number-theory/montgomery.md
+++ b/content/english/hpc/number-theory/montgomery.md
@@ -3,9 +3,9 @@ title: Montgomery Multiplication
 weight: 4
 ---
 
-Unsurprisingly, large fractions of computations in [modular arithmetic](../modular) are often spent on calculating the modulo operation, which is as slow as general integer division and typically taking 15-20 cycles, depending on the operand size.
+Unsurprisingly, a large fraction of computation in [modular arithmetic](../modular) is often spent on calculating the modulo operation, which is as slow as [general integer division](/hpc/arithmetic/division/) and typically takes 15-20 cycles, depending on the operand size.
 
-The best way to deal this nuisance is to avoid modulo operation altogether, delaying or replacing it with [predication](/hpc/pipelining/branchless), which can be done when calculating sums, for example:
+The best way to deal with this nuisance is to avoid the modulo operation altogether, delaying or replacing it with [predication](/hpc/pipelining/branchless), which can be done, for example, when calculating modular sums:
 
 ```cpp
 const int M = 1e9 + 7;
@@ -44,7 +44,7 @@ But there is another technique designed specifically for modular arithmetic, cal
 
 Montgomery multiplication works by first transforming the multipliers into *Montgomery space*, where modular multiplication can be performed cheaply, and then transforming them back when their actual values are needed. Unlike general integer division methods, Montgomery multiplication is not efficient for performing just one modular reduction and only becomes worthwhile when there is a chain of modular operations.
 
-The space is defined by the modulo $n$ and a positive integer $r \ge n$ coprime to $n$. The algorithm involves division and modulo by $r$, so in practice, $r$ is chosen to be $2^m$ with $m$ being equal 32 or 64, so that these operations can be done with a right-shift and a bitwise AND respectively.
+The space is defined by the modulo $n$ and a positive integer $r \ge n$ coprime to $n$. The algorithm involves modulo and division by $r$, so in practice, $r$ is chosen to be $2^{32}$ or $2^{64}$, so that these operations can be done with a bitwise AND and a right-shift respectively.
 
 <!-- Therefore $n$ needs to be an odd number so that every power of $2$ will be coprime to $n$. And if it is not, we can make it odd (?). -->
 
@@ -54,7 +54,7 @@ $$
 \bar{x} = x \cdot r \bmod n
 $$
 
-Computing this transformation involves a multiplication and a modulo — an expensive operation that we wanted to optimize away in the first place — which is why we don't use this method for general modular multiplication and only long sequences of operations where transforming numbers to and from the Montgomery space is worth it.
+Computing this transformation involves a multiplication and a modulo — an expensive operation that we wanted to optimize away in the first place — which is why we only use this method when the overhead of transforming numbers to and from the Montgomery space is worth it and not for general modular multiplication.
 
 <!-- Note that the transformation is actually such a multiplication that we want to optimize, so it is still an expensive operation. However, we will only need to transform a number into the space once, perform as many operations as we want efficiently in that space and at the end transform the final result back, which should be profitable if we are doing lots of operations modulo $n$. -->
 
@@ -287,6 +287,6 @@ int inverse(int _a) {
 }
 ```
 
-While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158s we omit `transform` and `reduce` (a reasonable use case in modular arithmetic is for `inverse` to be used as a subprocedure in a bigger computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.
+While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158ns if we omit `transform` and `reduce` (a reasonable use case is for `inverse` to be used as a subprocedure in a bigger modular computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.
 
 **Exercise.** Implement efficient *modular* [matix multiplication](/hpc/algorithms/matmul).

From a05f571a1762a1f6f8d8b6b329cdbde03b2f56a6 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 21 Jul 2022 02:49:51 +0300
Subject: [PATCH 152/173] move acknowledgements section

---
 content/english/hpc/_index.md | 58 +++++++++++++++++++----------------
 1 file changed, 31 insertions(+), 27 deletions(-)

diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md
index 8c5e9ef2..ed71792a 100644
--- a/content/english/hpc/_index.md
+++ b/content/english/hpc/_index.md
@@ -215,7 +215,35 @@ Among the cool things that we will speed up:
 - optimal Karatsuba Algorithm
 - optimal FFT
 
-This work is largely based on blog posts, research papers, conference talks, and other work authored by a lot of people:
+Volume: 450-600 pages  
+Release date: Q3 2022
+
+### Part II: Parallel Algorithms
+
+Concurrency, models of parallelism, context switching, green threads, concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking, graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication, sorting.
+
+Volume: 150-200 pages  
+Release date: 2023-2024?
+
+### Part III: Distributed Computing
+
+<!-- (I might need some help from here on.) -->
+
+Networking, message passing, actor model, communication-constrained algorithms, distributed primitives, all-reduce, MapReduce, stream processing, query planning, storage, sharding, compression, distributed databases, consistency, reliability, scheduling, workflow engines, cloud computing.
+
+Release date: ??? (more likely to be completed than not)
+
+### Part IV: Software & Hardware
+
+<!-- (TODO: come up with a better title — one that emphasizes that this part is mainly about the software-hardware boundary and not PL/IC design.) -->
+
+LLVM IR, compiler optimizations & back-end, interpreters, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++, oneAPI, XLA, (basic) Verilog, FPGAs, ASICs, TPUs and other AI accelerators.
+
+Release date: ??? (less likely to be completed than not)
+
+### Acknowledgements
+
+The book is largely based on blog posts, research papers, conference talks, and other work authored by a lot of people:
 
 - [Agner Fog](https://agner.org/optimize/)
 - [Daniel Lemire](https://lemire.me/en/#publications)
@@ -245,38 +273,14 @@ This work is largely based on blog posts, research papers, conference talks, and
 - [Jan Wassenberg](https://research.google/people/JanWassenberg/)
 - [Marshall Lochbaum](https://mlochbaum.github.io/publications.html)
 - [Pavel Zemtsov](https://pzemtsov.github.io/)
+- [Gustavo Duarte](https://manybutfinite.com/)
+- [Nyaan](https://nyaannyaan.github.io/library/)
 - [Nayuki](https://www.nayuki.io/category/programming)
 - [InstLatX64](https://twitter.com/InstLatX64)
 - [ridiculous_fish](https://ridiculousfish.com/blog/)
 - [Z boson](https://stackoverflow.com/users/2542702/z-boson)
 - [Creel](https://www.youtube.com/c/WhatsACreel)
 
-Volume: 450-600 pages  
-Release date: Q3 2022
-
-### Part II: Parallel Algorithms
-
-Concurrency, models of parallelism, context switching, green threads, concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking, graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication, sorting.
-
-Volume: 150-200 pages  
-Release date: 2023-2024?
-
-### Part III: Distributed Computing
-
-<!-- (I might need some help from here on.) -->
-
-Metworking, message passing, actor model, communication-constrained algorithms, distributed primitives, all-reduce, MapReduce, stream processing, query planning, storage, sharding, compression, distributed databases, consistency, reliability, scheduling, workflow engines, cloud computing.
-
-Release date: ??? (more likely to be completed than not)
-
-### Part IV: Software & Hardware
-
-<!-- (TODO: come up with a better title — one that emphasizes that this part is mainly about the software-hardware boundary and not PL/IC design.) -->
-
-LLVM IR, compiler optimizations & back-end, interpreters, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++, oneAPI, XLA, (basic) Verilog, FPGAs, ASICs, TPUs and other AI accelerators.
-
-Release date: ??? (less likely to be completed than not)
-
 ### Disclaimer: Technology Choices
 
 The examples in this book use C++, GCC, x86-64, CUDA, and Spark, although the underlying principles conveyed are not specific to them.

From 19bb6305fb564080bc8f0e8995bfeb51038116bd Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Fri, 22 Jul 2022 01:49:24 +0300
Subject: [PATCH 153/173] links to floyd-warshall

---
 content/english/hpc/algorithms/matmul.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/english/hpc/algorithms/matmul.md b/content/english/hpc/algorithms/matmul.md
index 5f2847d2..cf976045 100644
--- a/content/english/hpc/algorithms/matmul.md
+++ b/content/english/hpc/algorithms/matmul.md
@@ -474,9 +474,9 @@ for (int k = 0; k < n; k++)
             d[i][j] = min(d[i][j], d[i][k] + d[k][j]);
 ```
 
-Interestingly, vectorizing the distance product and executing it $O(\log n)$ times in $O(n^3 \log n)$ total operations is faster than naively executing the Floyd-Warshall algorithm in $O(n^3)$ operations, although not by a lot.
+Interestingly, similarly vectorizing the distance product and executing it $O(\log n)$ times ([or possibly fewer](https://arxiv.org/pdf/1904.01210.pdf)) in $O(n^3 \log n)$ total operations is faster than naively executing the Floyd-Warshall algorithm in $O(n^3)$ operations, although not by a lot.
 
-As an exercise, try to speed up this "for-for-for" computation. It is harder to do than in the matrix multiplication case because now there is a logical dependency between the iterations, and you need to perform updates in a particular order, but it is still possible to design a similar kernel and a block iteration order that achieves a 30-50x total speedup.
+As an exercise, try to speed up this "for-for-for" computation. It is harder to do than in the matrix multiplication case because now there is a logical dependency between the iterations, and you need to perform updates in a particular order, but it is still possible to design [a similar kernel and a block iteration order](https://github.com/sslotin/amh-code/blob/main/floyd/blocked.cc) that achieves a 30-50x total speedup.
 
 ## Acknowledgements
 

From fd9bdbea9477ed7e4e0c749f2967bf5997bb73a8 Mon Sep 17 00:00:00 2001
From: Rinat Valiullov <9755333+RinatValiullov@users.noreply.github.com>
Date: Tue, 26 Jul 2022 01:54:42 +0500
Subject: [PATCH 154/173] fix typo (duplicate text)

---
 content/russian/cs/sorting/bubble.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/content/russian/cs/sorting/bubble.md b/content/russian/cs/sorting/bubble.md
index 2d9af9b5..38fa5c8a 100644
--- a/content/russian/cs/sorting/bubble.md
+++ b/content/russian/cs/sorting/bubble.md
@@ -1,9 +1,10 @@
 ---
 title: Сортировка пузырьком
 weight: 1
+published: true
 ---
 
-Наш первый подход будет заключаться в следующем: обозначим за $n$ длину массива и $n$ раз пройдёмся раз пройдемся по нему слева направо, меняя два соседних элемента, если первый больше второго.
+Наш первый подход будет заключаться в следующем: обозначим за $n$ длину массива и $n$ раз пройдёмся по нему слева направо, меняя два соседних элемента, если первый больше второго.
 
 Каждую итерацию максимальный элемент «всплывает» как пузырек к концу массива — отсюда и название.
 

From 326755608c2464b2fddf960cf972b03d2f8a684f Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 28 Jul 2022 09:05:21 +0300
Subject: [PATCH 155/173] underline eytzinger search example

---
 .../english/hpc/data-structures/binary-search.md   | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index f2e61ffb..7401712e 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -343,7 +343,7 @@ while (k <= n)
 
 The only problem arises when we need to restore the index of the resulting element, as $k$ does not directly point to it. Consider this example (its corresponding tree is listed above):
 
-```center
+<!--
     array:  0 1 2 3 4 5 6 7 8 9                           
 eytzinger:  6 3 7 1 5 8 9 0 2 4                           
 1st range:  -------------------  k := 1                    
@@ -351,7 +351,17 @@ eytzinger:  6 3 7 1 5 8 9 0 2 4
 3rd range:  -------              k := 2*k     = 4   (3 ≥ 3)
 4th range:      ---              k := 2*k + 1 = 9   (1 < 3)
 5th range:        -              k := 2*k + 1 = 19  (2 < 3)
-```
+-->
+
+<pre class='center-pre'>
+    array:  0 1 2 3 4 5 6 7 8 9                           
+eytzinger:  <u>6</u> <u>3</u> 7 <u>1</u> 5 8 9 0 <u>2</u> 4                           
+1st range:  -------------------  k := 1                    
+2nd range:  -------------        k := 2*k     = 2   (6 ≥ 3)
+3rd range:  -------              k := 2*k     = 4   (3 ≥ 3)
+4th range:      ---              k := 2*k + 1 = 9   (1 < 3)
+5th range:        -              k := 2*k + 1 = 19  (2 < 3)
+</pre>
 
 <!-- do we need the last comparison? -->
 

From da216d6c81f59d334f3c9c26cf2ce768871314bb Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 28 Jul 2022 09:17:31 +0300
Subject: [PATCH 156/173] fix example

---
 .../english/hpc/data-structures/binary-search.md   | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index 7401712e..85f9ef52 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -354,13 +354,13 @@ eytzinger:  6 3 7 1 5 8 9 0 2 4
 -->
 
 <pre class='center-pre'>
-    array:  0 1 2 3 4 5 6 7 8 9                           
-eytzinger:  <u>6</u> <u>3</u> 7 <u>1</u> 5 8 9 0 <u>2</u> 4                           
-1st range:  -------------------  k := 1                    
-2nd range:  -------------        k := 2*k     = 2   (6 ≥ 3)
-3rd range:  -------              k := 2*k     = 4   (3 ≥ 3)
-4th range:      ---              k := 2*k + 1 = 9   (1 < 3)
-5th range:        -              k := 2*k + 1 = 19  (2 < 3)
+    array:  0 1 2 3 4 5 6 7 8 9                            
+eytzinger:  <u>6</u> <u>3</u> 8 <u>1</u> 5 7 9 0 <u>2</u> 4                            
+1st range:  ------------?------  k := 2*k     = 2   (6 ≥ 3)
+2nd range:  ------?------        k := 2*k     = 4   (3 ≥ 3)
+3rd range:  --?----              k := 2*k + 1 = 9   (1 < 3)
+4th range:      ?--              k := 2*k + 1 = 19  (2 < 3)
+5th range:        !                                        
 </pre>
 
 <!-- do we need the last comparison? -->

From 0d811cc49a1784a813f071a2aeb5755e5dfd958a Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Thu, 28 Jul 2022 13:50:38 +0300
Subject: [PATCH 157/173] add s-tree rank example

---
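Reviewer note: the illustrated rank example can be checked with a scalar stand-in for the vector comparison. This is only a sketch under assumptions: `mask_ge` and `first_ge` are hypothetical names, and the loop emulates what the comparison plus `movemask` would produce rather than using AVX2 intrinsics. `__builtin_ffs` is a GCC/Clang builtin.

```cpp
// Hypothetical scalar stand-in for the vectorized comparison: bit i of the
// result is set when y[i] >= x, which is the mask that movemask would extract.
int mask_ge(const int *y, int count, int x) {
    int mask = 0;
    for (int i = 0; i < count; i++)
        mask |= (y[i] >= x) << i;
    return mask;
}

// 1-based position of the first key that is >= x, or 0 if all keys are smaller.
int first_ge(const int *y, int count, int x) {
    return __builtin_ffs(mask_ge(y, count, x));
}
```

With y = {4, 17, 65, 103} and x = 42, bits 2 and 3 are set, so `first_ge` returns 3, matching the `ffs` row of the diagram; the child number used in the text is that value minus one.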
 content/english/hpc/data-structures/s-tree.md | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/content/english/hpc/data-structures/s-tree.md b/content/english/hpc/data-structures/s-tree.md
index d241aed5..875f72ec 100644
--- a/content/english/hpc/data-structures/s-tree.md
+++ b/content/english/hpc/data-structures/s-tree.md
@@ -102,7 +102,19 @@ int i = __builtin_ffs(mask) - 1;
 // now i is the number of the correct child node
 ```
 
-Unfortunately, the compilers are not smart enough yet to auto-vectorize this code, so we need to manually vectorize it with intrinsics:
+Unfortunately, the compilers are not smart enough to [auto-vectorize](/hpc/simd/auto-vectorization/) this code yet, so we have to optimize it manually. In AVX2, we can load 8 elements, compare them against the search key, producing a [vector mask](/hpc/simd/masking/), and then extract the scalar mask from it with `movemask`. Here is a minimized illustrated example of what we want to do:
+
+```center
+       y = 4        17       65       103     
+       x = 42       42       42       42      
+   y ≥ x = 00000000 00000000 11111111 11111111
+           ├┬┬┬─────┴────────┴────────┘       
+movemask = 0011                               
+           ┌─┘                                
+     ffs = 3                                  
+```
+
+Since we are limited to processing 8 elements at a time (half our block / cache line size), we have to split the elements into two groups and then combine the two 8-bit masks. To do this, it will be slightly easier to swap the condition for `x > y` and compute the inverted mask instead:
 
 ```c++
 typedef __m256i reg;
@@ -114,7 +126,7 @@ int cmp(reg x_vec, int* y_ptr) {
 }
 ```
 
-This function works for 8-element vectors, which is half our block / cache line size. To process the entire block, we need to call it twice and then combine the masks:
+Now, to process the entire block, we need to call it twice and combine the masks:
 
 ```c++
 int mask = ~(
@@ -123,7 +135,7 @@ int mask = ~(
 );
 ```
 
-Now, to descend down the tree, we use `ffs` on that mask to get the correct child number and just call the `go` function we defined earlier:
+To descend down the tree, we use `ffs` on that mask to get the correct child number and just call the `go` function we defined earlier:
 
 ```c++
 int i = __builtin_ffs(mask) - 1;

From f01a7d3df6e6a885fb5b63376df5a0399981bbe2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Anti=20R=C3=A4is?= <antirais@gmail.com>
Date: Fri, 29 Jul 2022 18:23:51 +0300
Subject: [PATCH 158/173] Improve wording.

---
 content/english/hpc/profiling/noise.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/content/english/hpc/profiling/noise.md b/content/english/hpc/profiling/noise.md
index 74ff0272..243f3600 100644
--- a/content/english/hpc/profiling/noise.md
+++ b/content/english/hpc/profiling/noise.md
@@ -1,6 +1,7 @@
 ---
 title: Getting Accurate Results
 weight: 10
+published: true
 ---
 
 It is not an uncommon for there to be two library algorithm implementations, each maintaining its own benchmarking code, and each claiming to be faster than the other. This confuses everyone involved, especially the users, who have to somehow choose between the two.
@@ -111,7 +112,7 @@ for (int i = 0; i < N; i++)
     checksum ^= lower_bound(q[i]);
 ```
 
-It is also sometimes convenient to combine the warm-up run with answer validation, it if is more complicated than just computing some sort of checksum.
+It is also sometimes convenient to combine the warm-up run with answer validation, if it is more complicated than just computing some sort of checksum.
 
 **Over-optimization.** Sometimes the benchmark is outright erroneous because the compiler just optimized the benchmarked code away. To prevent the compiler from cutting corners, you need to add checksums and either print them somewhere or add the `volatile` qualifier, which also prevents any sort of interleaving of loop iterations.
 

From 20d53920f54959981cbab0c17b877a4025763cf4 Mon Sep 17 00:00:00 2001
From: Pasha <mail@pechhenka.ru>
Date: Sun, 31 Jul 2022 16:33:16 +0300
Subject: [PATCH 159/173] =?UTF-8?q?=D0=BD=D0=B5=D1=81=D0=BE=D0=B3=D0=BB?=
 =?UTF-8?q?=D0=B0=D1=81=D0=BE=D0=B2=D0=B0=D0=BD=D0=BE=D1=81=D1=82=D1=8C=20?=
 =?UTF-8?q?=D1=81=D0=BB=D0=BE=D0=B2=20=D0=B2=20=D0=BA=D0=BE=D0=BD=D1=86?=
 =?UTF-8?q?=D0=B5=20=D0=B0=D0=B1=D0=B7=D0=B0=D1=86=D0=B0?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 content/russian/cs/decomposition/scanline.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/content/russian/cs/decomposition/scanline.md b/content/russian/cs/decomposition/scanline.md
index 4c9bcdf0..3bc99afd 100644
--- a/content/russian/cs/decomposition/scanline.md
+++ b/content/russian/cs/decomposition/scanline.md
@@ -1,11 +1,12 @@
 ---
 title: Сканирующая прямая
 authors:
-- Сергей Слотин
+  - Сергей Слотин
 prerequisites:
-- /cs/range-queries
-- /cs/segment-tree
+  - /cs/range-queries
+  - /cs/segment-tree
 weight: 1
+published: true
 ---
 
 Метод сканирующей прямой (англ. *scanline*) заключается в сортировке точек на координатной прямой либо каких-то абстрактных «событий» по какому-то признаку и последующему проходу по ним.
@@ -22,7 +23,7 @@ weight: 1
 
 Это решение можно улучшить. Отсортируем интересные точки по возрастанию координаты и пройдем по ним слева направо, поддерживая количество отрезков `cnt`, которые покрывают данную точку. Если в данной точке начинается отрезок, то надо увеличить `cnt` на единицу, а если заканчивается, то уменьшить. После этого пробуем обновить ответ на задачу текущим значением `cnt`. 
 
-Как такое писать: нужно представить интересные точки в виде структур с полями «координата» и «тип» (начало / конец) и отсортировать со своим компаратором. Удобно начало отрезка обозначать +1, а конец -1, чтобы просто прибавлять к `cnt` это значение и на разбирать случае.
+Как такое писать: нужно представить интересные точки в виде структур с полями «координата» и «тип» (начало / конец) и отсортировать со своим компаратором. Удобно начало отрезка обозначать +1, а конец -1, чтобы просто прибавлять к `cnt` это значение и не разбивать на случаи.
 
 Единственный нюанс — если координаты двух точек совпали, чтобы получить правильный ответ, сначала надо рассмотреть все начала отрезков, а только потом концы (чтобы при обновлении ответа в этой координате учлись и правые, и левые граничные отрезки).
 

From 6661563a59217abbe6f69c38d27a6af2cd69aeb4 Mon Sep 17 00:00:00 2001
From: Iago-lito <iago-lito@users.noreply.github.com>
Date: Fri, 5 Aug 2022 16:39:01 +0200
Subject: [PATCH 160/173] Update integer.md

---
 content/english/hpc/arithmetic/integer.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/arithmetic/integer.md b/content/english/hpc/arithmetic/integer.md
index 47f5bd32..686db686 100644
--- a/content/english/hpc/arithmetic/integer.md
+++ b/content/english/hpc/arithmetic/integer.md
@@ -93,7 +93,7 @@ This seems like an important architecture aspect, but in most cases, it doesn't
 - Little-endian has the advantage that you can cast a value to a smaller type (e.g., `long long` to `int`) by just loading fewer bytes, which in most cases means doing nothing — thanks to *register aliasing*, `eax` refers to the first 4 bytes of `rax`, so conversion is essentially free. It is also easier to read values in a variety of type sizes — while on big-endian architectures, loading an `int` from a `long long` array would require shifting the pointer by 2 bytes.
 - Big-endian has the advantage that higher bytes are loaded first, which in theory can make highest-to-lowest routines such as comparisons and printing faster. You can also perform certain checks such as finding out whether a number is negative by only loading its first byte.
 
-Big-endian is also more "natural" — this is how we write binary numbers on paper — but the advantage of having faster type conversions outweigh it. For this reason, little-endian is used by default on most hardware, although some CPUs are "bi-endian" and can be configured to switch modes on demand.
+Big-endian is also more "natural" — this is how we write binary numbers on paper — but the advantage of having faster type conversions outweighs it. For this reason, little-endian is used by default on most hardware, although some CPUs are "bi-endian" and can be configured to switch modes on demand.
 
 ### 128-bit Integers
 

From 387715b6c648a722b2fce506aedf2b79a38d18aa Mon Sep 17 00:00:00 2001
From: psn2706 <69345823+psn2706@users.noreply.github.com>
Date: Wed, 10 Aug 2022 23:19:44 +0300
Subject: [PATCH 161/173] Correction of typos

---
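Reviewer note: a minimal sketch of the stack-based rollback that the first fix touches, to show why the push has to go to `s`. The array size `N = 100` and the body of `rollback` are assumptions made for illustration; only `change` mirrors the patched snippet.

```cpp
#include <stack>
#include <utility>

const int N = 100; // hypothetical size, chosen only for the example
int a[N];
std::stack< std::pair<int, int> > s; // (index, previous value) for each change

void change(int k, int x) {
    s.push({k, a[k]}); // remember the old value before overwriting it
    a[k] = x;
}

// Assumed counterpart: undoes the most recent change.
void rollback() {
    std::pair<int, int> last = s.top();
    s.pop();
    a[last.first] = last.second;
}
```

Calling `rollback` as many times as `change` was called restores the array to its initial state.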
 content/russian/cs/persistent/persistent-array.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/content/russian/cs/persistent/persistent-array.md b/content/russian/cs/persistent/persistent-array.md
index e476c355..018c287a 100644
--- a/content/russian/cs/persistent/persistent-array.md
+++ b/content/russian/cs/persistent/persistent-array.md
@@ -2,8 +2,9 @@
 title: Структуры с откатами
 weight: 1
 authors:
-- Сергей Слотин
-date: 2021-09-12
+  - Сергей Слотин
+date: {}
+published: true
 ---
 
 Состояние любой структуры как-то лежит в памяти: в каких-то массивах, или в более общем случае, по каким-то определенным адресам в памяти. Для простоты, пусть у нас есть некоторый массив $a$ размера $n$, и нам нужно обрабатывать запросы присвоения и чтения, а также иногда откатывать изменения обратно.
@@ -20,7 +21,7 @@ int a[N];
 stack< pair<int, int> > s;
 
 void change(int k, int x) {
-    l.push({k, a[k]});
+    s.push({k, a[k]});
     a[k] = x;
 }
 
@@ -84,7 +85,7 @@ void rollback() {
 
 ```cpp
 int t = 0;
-vector<int> versions[N];
+vector< pair<int, int> > versions[N];
 
 void change(int k, int x) {
     versions[k].push_back({t++, x});

From 155891c5ed8502decd64047a21736c6285d8edcd Mon Sep 17 00:00:00 2001
From: zh Wang <rekind133@outlook.com>
Date: Fri, 12 Aug 2022 05:36:02 +0800
Subject: [PATCH 162/173] Fix typo

---
 content/english/hpc/profiling/noise.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/english/hpc/profiling/noise.md b/content/english/hpc/profiling/noise.md
index 243f3600..b1b186ae 100644
--- a/content/english/hpc/profiling/noise.md
+++ b/content/english/hpc/profiling/noise.md
@@ -128,10 +128,10 @@ https://github.com/sosy-lab/benchexec
 
 The issues we've described produce *bias* in measurements: they consistently give advantage to one algorithm over the other. There are other types of possible problems with benchmarking that result in either unpredictable skews or just completely random noise, thus increasing *variance*.
 
-These type of issues are caused by side effects and some sort of external noise, mostly due to noisy neighbors and CPU frequency scaling:
+These types of issues are caused by side effects and some sort of external noise, mostly due to noisy neighbors and CPU frequency scaling:
 
 - If you benchmark a compute-bound algorithm, measure its performance in cycles using `perf stat`: this way it will be independent of clock frequency, fluctuations of which is usually the main source of noise.
-- Otherwise, set core frequency to the what you expect it to be and make sure nothing interferes with it. On Linux you can do it with `cpupower` (e.g., `sudo cpupower frequency-set -g powersave` to put it to minimum or `sudo cpupower frequency-set -g ondemand` to enable turbo boost). I use a [convenient GNOME shell extension](https://extensions.gnome.org/extension/1082/cpufreq/) that has a separate button to do it.
+- Otherwise, set core frequency to what you expect it to be and make sure nothing interferes with it. On Linux you can do it with `cpupower` (e.g., `sudo cpupower frequency-set -g powersave` to put it to minimum or `sudo cpupower frequency-set -g ondemand` to enable turbo boost). I use a [convenient GNOME shell extension](https://extensions.gnome.org/extension/1082/cpufreq/) that has a separate button to do it.
 - If applicable, turn hyper-threading off and attach jobs to specific cores. Make sure no other jobs are running on the system, turn off networking and try not to fiddle with the mouse.
 
 You can't remove noises and biases completely. Even a program's name can affect its speed: the executable's name ends up in an environment variable, environment variables end up on the call stack, and so the length of the name affects stack alignment, which can result in data accesses slowing down due to crossing cache line or memory page boundaries.

From adcdf626b1408ad0635ace91dc2a1facdad127d1 Mon Sep 17 00:00:00 2001
From: ar1emicus <87391584+ar1emicus@users.noreply.github.com>
Date: Tue, 16 Aug 2022 03:52:35 +0500
Subject: [PATCH 163/173] Update sqrt-structures.md

---
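Reviewer note: the corrected `find_block` loop can be checked in isolation. The sketch below assumes that `pos` is a valid 0-based position (smaller than the total number of elements); the casts to `int` are an addition made here to avoid mixing signed and unsigned values.

```cpp
#include <utility>
#include <vector>

std::vector< std::vector<int> > blocks;

// Returns the index of the block and the index of the element inside that block.
// Assumes 0 <= pos < total number of elements.
std::pair<int, int> find_block(int pos) {
    int idx = 0;
    // skip whole blocks while the position lies beyond the current one
    while ((int) blocks[idx].size() <= pos)
        pos -= (int) blocks[idx++].size();
    return {idx, pos};
}
```

With the old condition (`>=` and `idx--`) the loop would step to index -1 on the very first iteration, which is why both the comparison and the increment needed to change.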
 content/russian/cs/range-queries/sqrt-structures.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/content/russian/cs/range-queries/sqrt-structures.md b/content/russian/cs/range-queries/sqrt-structures.md
index bac0da16..8d2cfd6f 100644
--- a/content/russian/cs/range-queries/sqrt-structures.md
+++ b/content/russian/cs/range-queries/sqrt-structures.md
@@ -1,10 +1,11 @@
 ---
 title: Корневые структуры
 authors:
-- Сергей Слотин
-- Иван Сафонов
+  - Сергей Слотин
+  - Иван Сафонов
 weight: 6
-date: 2021-09-13
+date: {}
+published: true
 ---
 
 Корневые оптимизации можно использовать много для чего, в частности в контексте структур данных.
@@ -68,6 +69,7 @@ void upd(int l, int r, int x) {
             l += c;
         }
         else {
+          	b[l / c] += x;
             a[l] += x;
             l++;
         }
@@ -111,8 +113,8 @@ vector< vector<int> > blocks;
 // возвращает индекс блока и индекс элемента внутри блока
 pair<int, int> find_block(int pos) {
     int idx = 0;
-    while (blocks[idx].size() >= pos)
-        pos -= blocks[idx--].size();
+    while (blocks[idx].size() <= pos)
+        pos -= blocks[idx++].size();
     return {idx, pos};
 }
 ```

From b80dafe5a8efe0389b1395919dc1770df7408d9f Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Tue, 16 Aug 2022 07:39:08 +0300
Subject: [PATCH 164/173] code style

---
 content/russian/cs/range-queries/sqrt-structures.md | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/content/russian/cs/range-queries/sqrt-structures.md b/content/russian/cs/range-queries/sqrt-structures.md
index 8d2cfd6f..25fe3b5e 100644
--- a/content/russian/cs/range-queries/sqrt-structures.md
+++ b/content/russian/cs/range-queries/sqrt-structures.md
@@ -4,8 +4,7 @@ authors:
   - Сергей Слотин
   - Иван Сафонов
 weight: 6
-date: {}
-published: true
+date: 2022-08-16
 ---
 
 Корневые оптимизации можно использовать много для чего, в частности в контексте структур данных.
@@ -24,16 +23,15 @@ published: true
 ```c++
 // c это и количество блоков, и также их размер; оно должно быть чуть больше корня
 const int maxn = 1e5, c = 330;
-int a[maxn], b[c];
-int add[c];
+int a[maxn], b[c], add[c];
 
 for (int i = 0; i < n; i++)
     b[i / c] += a[i];
 ```
 
-Заведем также массив `add` размера $\sqrt n$, который будем использовать для отложенной операции прибавления на блоке. Будем считать, что реальное значение $i$-го элемента равно `a[i] + add[i / c]`.
+Заведем также массив `add` размера $\sqrt n$, который будем использовать для отложенной операции прибавления на блоке: будем считать, что реальное значение $i$-го элемента равно `a[i] + add[i / c]`.
 
-Теперь мы можем отвечать на запросы первого типа за $O(\sqrt n)$ на запрос:
+Теперь мы можем отвечать на запросы первого типа за $O(\sqrt n)$ операций на запрос:
 
 1. Для всех блоков, лежащих целиком внутри запроса, просто возьмём уже посчитанные суммы и сложим.
 2. Для блоков, пересекающихся с запросом только частично (их максимум два — правый и левый), проитерируемся по нужным элементам и поштучно прибавим к ответу.
@@ -69,7 +67,7 @@ void upd(int l, int r, int x) {
             l += c;
         }
         else {
-          	b[l / c] += x;
+            b[l / c] += x;
             a[l] += x;
             l++;
         }

From a9e98c13f2373883145b951af55cc881671e1804 Mon Sep 17 00:00:00 2001
From: Vladislav Shirshakov <loykopp@gmail.com>
Date: Tue, 16 Aug 2022 18:59:47 +0500
Subject: [PATCH 165/173] =?UTF-8?q?=D0=94=D0=BB=D1=8F=20=D0=BF=D0=B5=D1=80?=
 =?UTF-8?q?=D0=B5=D0=BC=D0=B5=D0=BD=D0=BD=D0=BE=D0=B9=20=D0=BD=D0=B5=20?=
 =?UTF-8?q?=D1=83=D0=BA=D0=B0=D0=B7=D0=B0=D0=BD=20=D1=82=D0=B8=D0=BF?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 content/russian/cs/sorting/selection.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/content/russian/cs/sorting/selection.md b/content/russian/cs/sorting/selection.md
index b47f2320..30854b5f 100644
--- a/content/russian/cs/sorting/selection.md
+++ b/content/russian/cs/sorting/selection.md
@@ -1,6 +1,7 @@
 ---
 title: Сортировка выбором
 weight: 2
+published: true
 ---
 
 Похожим методом является **сортировка выбором** (минимума или максимума).
@@ -10,7 +11,7 @@ weight: 2
 ```cpp
 void selection_sort(int *a, int n) {
     for (int k = 0; k < n - 1; k++)
-        for (j = k + 1; j < n; j++)
+        for (int j = k + 1; j < n; j++)
             if (a[k] > a[j])
                 swap(a[j], a[k]);
 }

From 7fd943e685a0d3ab4c9073cd704bdb25f2455606 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Wed, 17 Aug 2022 09:40:56 +0300
Subject: [PATCH 166/173] improve wording in branchless programming section

---
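Reviewer note: the new loop body in the binary search snippet can be tried out as a standalone function. This sketch passes the array and its length as parameters instead of using the globals `t` and `n`, and the name `lower_bound_branchless` is hypothetical. It assumes a sorted, non-empty array.

```cpp
// Returns the first element not less than x in the sorted array t[0..n-1].
// If every element is smaller than x, the last element is returned.
int lower_bound_branchless(const int *t, int n, int x) {
    const int *base = t;
    int len = n;
    while (len > 1) {
        int half = len / 2;
        // skip the first half when its last element is still too small;
        // the multiplication replaces a branch and may compile to a cmov
        base += (base[half - 1] < x) * half;
        len -= half;
    }
    return *base;
}
```

The invariant is that the answer always lies within `[base, base + len)`, so no extra comparison is needed once `len` reaches 1.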
 content/english/hpc/pipelining/branchless.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/content/english/hpc/pipelining/branchless.md b/content/english/hpc/pipelining/branchless.md
index d7416f35..31bd5a39 100644
--- a/content/english/hpc/pipelining/branchless.md
+++ b/content/english/hpc/pipelining/branchless.md
@@ -91,7 +91,7 @@ $$
 
 This way you can eliminate branching, but this comes at the cost of evaluating *both* branches and the `cmov` itself. Because evaluating the ">=" branch costs nothing, the performance is exactly equal to [the "always yes" case](../branching/#branch-prediction) in the branchy version.
 
-### When It Is Beneficial
+### When Predication Is Beneficial
 
 Using predication eliminates [a control hazard](../hazards) but introduces a data hazard. There is still a pipeline stall, but it is a cheaper one: you only need to wait for `cmov` to be resolved and not flush the entire pipeline in case of a mispredict.
 
@@ -180,11 +180,11 @@ int abs(int a) {
 
 ### Larger Examples
 
-**Strings.** Oversimplifying things, an `std::string` is comprised of a pointer to a null-terminated char array (also known as "C-string") allocated somewhere on the heap and one integer containing the string size.
+**Strings.** Oversimplifying things, an `std::string` is comprised of a pointer to a null-terminated `char` array (also known as a "C-string") allocated somewhere on the heap and one integer containing the string size.
 
-A common value for strings is the empty string — which is also its default value. You also need to handle them somehow, and the idiomatic thing to do is to assign `nullptr` as the pointer and `0` as the string size, and then check if the pointer is null or if the size is zero at the beginning of every procedure involving strings.
+A common value for a string is the empty string — which is also its default value. Empty strings also need to be handled somehow, and the idiomatic approach is to assign `nullptr` as the pointer and `0` as the string size, and then check if the pointer is null or if the size is zero at the beginning of every procedure involving strings.
 
-However, this requires a separate branch, which is costly unless most strings are empty. What we can do to get rid of it is to allocate a "zero C-string," which is just a zero byte allocated somewhere, and then simply point all empty strings there. Now all string operations with empty strings have to read this useless zero byte, but this is still much cheaper than a branch misprediction.
+However, this requires a separate branch, which is costly (unless the majority of strings are either empty or non-empty). To remove the check and thus also the branch, we can allocate a "zero C-string," which is just a zero byte allocated somewhere, and then simply point all empty strings there. Now all string operations with empty strings have to read this useless zero byte, but this is still much cheaper than a branch misprediction.
 
 **Binary search.** The standard binary search [can be implemented](/hpc/data-structures/binary-search) without branches, and on small arrays (that fit into cache) it works ~4x faster than the branchy `std::lower_bound`:
 
@@ -193,10 +193,10 @@ int lower_bound(int x) {
     int *base = t, len = n;
     while (len > 1) {
         int half = len / 2;
-        base = (base[half] < x ? &base[half] : base);
+        base += (base[half - 1] < x) * half; // will be replaced with a "cmov"
         len -= half;
     }
-    return *(base + (*base < x));
+    return *base;
 }
 ```
 
@@ -218,7 +218,7 @@ That there are no substantial reasons why compilers can't do this on their own,
 
 **Data-parallel programming.** Branchless programming is very important for [SIMD](/hpc/simd) applications because they don't have branching in the first place.
 
-In our array sum example, if you remove the `volatile` type qualifier from the accumulator, the compiler becomes able to [vectorize](/hpc/simd/auto-vectorization) the loop:
+In our array sum example, removing the `volatile` type qualifier from the accumulator allows the compiler to [vectorize](/hpc/simd/auto-vectorization) the loop:
 
 ```c++
 /* volatile */ int s = 0;
@@ -230,7 +230,7 @@ for (int i = 0; i < N; i++)
 
 It now works in ~0.3 per element, which is mainly [bottlenecked by the memory](/hpc/cpu-cache/bandwidth).
 
-The compiler is usually able to vectorize any loop that doesn't have branches or dependencies between the iterations — and some specific deviations from that, such as [reductions](/hpc/simd/reduction) or simple loops that contain just one if-without-else. Vectorization of anything more complex is a very nontrivial problem, which may involve various techniques such as [masking](/hpc/simd/masking) and [in-register permutations](/hpc/simd/shuffling).
+The compiler is usually able to vectorize any loop that doesn't have branches or dependencies between the iterations — and some specific small deviations from that, such as [reductions](/hpc/simd/reduction) or simple loops that contain just one if-without-else. Vectorization of anything more complex is a very nontrivial problem, which may involve various techniques such as [masking](/hpc/simd/masking) and [in-register permutations](/hpc/simd/shuffling).
 
 <!--
 

From a0707a409d0e7165acbe13d7906be6135dace695 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 29 Aug 2022 01:36:40 +0300
Subject: [PATCH 167/173] reorganize simd reduction

---
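Reviewer note: a self-contained sketch of the multi-accumulator sum with remainder handling, for checking the loop bounds. It differs from the patched snippet in ways that are assumptions of this note: the function takes a plain `const int*`, and vectors are loaded with `memcpy` so that the input needs no particular alignment. The `v8si` typedef relies on the GCC/Clang vector extension.

```cpp
#include <cstring>

typedef int v8si __attribute__ (( vector_size(32) ));

const int B = 2; // how many vector accumulators to use

int sum_simd(const int *a, int n) {
    v8si acc[B] = {};

    int i = 0;
    // process 8 * B scalars per iteration, one vector per accumulator
    for (; i + 8 * B <= n; i += 8 * B)
        for (int j = 0; j < B; j++) {
            v8si v;
            std::memcpy(&v, a + i + 8 * j, sizeof v);
            acc[j] += v;
        }

    // sum all vector accumulators into one
    for (int j = 1; j < B; j++)
        acc[0] += acc[j];

    // sum 8 scalar accumulators into one
    int s = 0;
    for (int k = 0; k < 8; k++)
        s += acc[0][k];

    // add the remainder of a, one scalar at a time
    for (; i < n; i++)
        s += a[i];

    return s;
}
```

Reusing the loop counter for the tail makes the remainder start exactly where the vectorized part stopped, which avoids recomputing the boundary.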
 content/english/hpc/simd/reduction.md | 64 +++++++++++++++------------
 1 file changed, 35 insertions(+), 29 deletions(-)

diff --git a/content/english/hpc/simd/reduction.md b/content/english/hpc/simd/reduction.md
index 5a0ace1e..b74ef3b8 100644
--- a/content/english/hpc/simd/reduction.md
+++ b/content/english/hpc/simd/reduction.md
@@ -48,56 +48,62 @@ int sum_simd(v8si *a, int n) {
 
 You can use this approach for for other reductions, such as for finding the minimum or the xor-sum of an array.
 
-### Horizontal Summation
-
-The last part, where we sum up the 8 accumulators stored in a vector register into a single scalar to get the total sum, is called "horizontal summation."
-
-Although extracting and adding every scalar one by one only takes a constant number of cycles, it can be computed slightly faster using a [special instruction](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2&text=_mm256_hadd_epi32&expand=2941) that adds together pairs of adjacent elements in a register.
-
-![Horizontal summation in SSE/AVX. Note how the output is stored: the (a b a b) interleaving is common for reducing operations](../img/hsum.png)
-
-Since it is a very specific operation, it can only be done with SIMD intrinsics — although the compiler probably emits roughly the same procedure for the scalar code anyway:
-
-```c++
-int hsum(__m256i x) {
-    __m128i l = _mm256_extracti128_si256(x, 0);
-    __m128i h = _mm256_extracti128_si256(x, 1);
-    l = _mm_add_epi32(l, h);
-    l = _mm_hadd_epi32(l, l);
-    return _mm_extract_epi32(l, 0) + _mm_extract_epi32(l, 1);
-}
-```
-
-There are [other similar instructions](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2&ig_expand=3037,3009,5135,4870,4870,4872,4875,833,879,874,849,848,6715,4845&text=horizontal), e.g., for integer multiplication or calculating absolute differences between adjacent elements (used in image processing).
-
-There is also one specific instruction, `_mm_minpos_epu16`, that calculates the horizontal minimum and its index among eight 16-bit integers. This is the only horizontal reduction that works in one go: all others are computed in multiple steps.
-
 ### Instruction-Level Parallelism
 
-Our implementation matches what the compiler produces automatically, but it is actually [suboptimal](/hpc/pipelining/throughput): when we use just one accumulator, we have to wait one cycle between the loop iterations for vector addition to complete, while its throughput is 2 on this microarchitecture.
+Our implementation matches what the compiler produces automatically, but it is actually suboptimal: when we use just one accumulator, [we have to wait](/hpc/pipelining/throughput) one cycle between the loop iterations for a vector addition to complete, while the [throughput](/hpc/pipelining/tables/) of the corresponding instruction is 2 on this microarchitecture.
 
 If we again divide the array in $B \geq 2$ parts and use a *separate* accumulator for each, we can saturate the throughput of vector addition and increase the performance twofold:
 
 ```c++
-const int B = 2;
+const int B = 2; // how many vector accumulators to use
 
 int sum_simd(v8si *a, int n) {
     v8si b[B] = {0};
 
-    for (int i = 0; i < n / 8; i += B)
+    for (int i = 0; i + (B - 1) < n / 8; i += B)
         for (int j = 0; j < B; j++)
             b[j] += a[i + j];
-    
+
+    // sum all vector accumulators into one
     for (int i = 1; i < B; i++)
         b[0] += b[i];
     
     int s = 0;
 
+    // sum 8 scalar accumulators into one
     for (int i = 0; i < 8; i++)
         s += b[0][i];
 
+    // add the remainder of a (indexed as scalars, since a points to 8-element vectors)
+    for (int i = n / (8 * B) * (8 * B); i < n; i++)
+        s += ((int*) a)[i];
+
     return s;
 }
 ```
 
-If you have more than 2 relevant execution ports, you can increase `B` accordingly. But the n-fold performance increase will only apply to arrays that fit L1 cache — [memory bandwidth](/hpc/cpu-cache/bandwidth) will be the bottleneck for anything larger.
+If you have more than 2 relevant execution ports, you can increase the `B` constant accordingly, but the $B$-fold performance increase will only apply to arrays that fit into L1 cache — [memory bandwidth](/hpc/cpu-cache/bandwidth) will be the bottleneck for anything larger.
+
+### Horizontal Summation
+
+The part where we sum up the 8 accumulators stored in a vector register into a single scalar to get the total sum is called "horizontal summation."
+
+Although extracting and adding every scalar one by one only takes a constant number of cycles, it can be computed slightly faster using a [special instruction](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2&text=_mm256_hadd_epi32&expand=2941) that adds together pairs of adjacent elements in a register.
+
+![Horizontal summation in SSE/AVX. Note how the output is stored: the (a b a b) interleaving is common for reducing operations](../img/hsum.png)
+
+Since it is a very specific operation, it can only be done with SIMD intrinsics — although the compiler probably emits roughly the same procedure for the scalar code anyway:
+
+```c++
+int hsum(__m256i x) {
+    __m128i l = _mm256_extracti128_si256(x, 0);
+    __m128i h = _mm256_extracti128_si256(x, 1);
+    l = _mm_add_epi32(l, h);
+    l = _mm_hadd_epi32(l, l);
+    return _mm_extract_epi32(l, 0) + _mm_extract_epi32(l, 1);
+}
+```
+
+There are [other similar instructions](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2&ig_expand=3037,3009,5135,4870,4870,4872,4875,833,879,874,849,848,6715,4845&text=horizontal), e.g., for integer multiplication or calculating absolute differences between adjacent elements (used in image processing).
+
+There is also one specific instruction, `_mm_minpos_epu16`, that calculates the horizontal minimum and its index among eight 16-bit integers. This is the only horizontal reduction that works in one go: all others are computed in multiple steps.

From a7ade20d0e10dc55a036f904abbfa9f0db776212 Mon Sep 17 00:00:00 2001
From: Sergey Slotin <me@sereja.me>
Date: Mon, 29 Aug 2022 04:25:54 +0300
Subject: [PATCH 168/173] credit Konstantin

---
 content/english/hpc/_index.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/content/english/hpc/_index.md b/content/english/hpc/_index.md
index ed71792a..9b6aa606 100644
--- a/content/english/hpc/_index.md
+++ b/content/english/hpc/_index.md
@@ -276,6 +276,7 @@ The book is largely based on blog posts, research papers, conference talks, and
 - [Gustavo Duarte](https://manybutfinite.com/)
 - [Nyaan](https://nyaannyaan.github.io/library/)
 - [Nayuki](https://www.nayuki.io/category/programming)
+- [Konstantin](http://const.me/)
 - [InstLatX64](https://twitter.com/InstLatX64)
 - [ridiculous_fish](https://ridiculousfish.com/blog/)
 - [Z boson](https://stackoverflow.com/users/2542702/z-boson)

From e0fd9f8a61bce117fa24690dbd5822b8366bc975 Mon Sep 17 00:00:00 2001
From: zh Wang <rekind133@outlook.com>
Date: Mon, 29 Aug 2022 16:58:42 +0800
Subject: [PATCH 169/173] Fix typo

Signed-off-by: zh Wang <rekind133@outlook.com>
---
 content/english/hpc/simd/reduction.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/english/hpc/simd/reduction.md b/content/english/hpc/simd/reduction.md
index b74ef3b8..89678103 100644
--- a/content/english/hpc/simd/reduction.md
+++ b/content/english/hpc/simd/reduction.md
@@ -46,7 +46,7 @@ int sum_simd(v8si *a, int n) {
 }
 ```
 
-You can use this approach for for other reductions, such as for finding the minimum or the xor-sum of an array.
+You can use this approach for other reductions, such as for finding the minimum or the xor-sum of an array.
 
 ### Instruction-Level Parallelism
 

From b1935821725b6589561ad359e4d65d48d471c6a5 Mon Sep 17 00:00:00 2001
From: zh Wang <rekind133@outlook.com>
Date: Mon, 29 Aug 2022 20:02:40 +0800
Subject: [PATCH 170/173] Fix typo

Signed-off-by: zh Wang <rekind133@outlook.com>
---
 content/english/hpc/compilation/flags.md         | 2 +-
 content/english/hpc/compilation/situational.md   | 2 +-
 content/english/hpc/external-memory/_index.md    | 4 ++--
 content/english/hpc/external-memory/hierarchy.md | 2 +-
 content/english/hpc/external-memory/model.md     | 2 +-
 content/english/hpc/number-theory/modular.md     | 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/content/english/hpc/compilation/flags.md b/content/english/hpc/compilation/flags.md
index ceae9e87..74383237 100644
--- a/content/english/hpc/compilation/flags.md
+++ b/content/english/hpc/compilation/flags.md
@@ -12,7 +12,7 @@ There are 4 *and a half* main levels of optimization for speed in GCC:
 
 - `-O0` is the default one that does no optimizations (although, in a sense, it does optimize: for compilation time).
 - `-O1` (also aliased as `-O`) does a few "low-hanging fruit" optimizations, almost not affecting the compilation time.
-- `-O2` enables all optimizations that are known to have little to no negative side effects and take reasonable time to complete (this is what most projects use for production builds).
+- `-O2` enables all optimizations that are known to have little to no negative side effects and take a reasonable time to complete (this is what most projects use for production builds).
 - `-O3` does very aggressive optimization, enabling almost all *correct* optimizations implemented in GCC.
 - `-Ofast` does everything in `-O3`, plus a few more optimizations flags that may break strict standard compliance, but not in a way that would be critical for most applications (e.g., floating-point operations may be rearranged so that the result is off by a few bits in the mantissa).
 
diff --git a/content/english/hpc/compilation/situational.md b/content/english/hpc/compilation/situational.md
index bec2a255..41620c70 100644
--- a/content/english/hpc/compilation/situational.md
+++ b/content/english/hpc/compilation/situational.md
@@ -96,7 +96,7 @@ The whole process is automated by modern compilers. For example, the `-fprofile-
 g++ -fprofile-generate [other flags] source.cc -o binary
 ```
 
-After we run the program — preferably on input that is as representative of real use case as possible — it will create a bunch of `*.gcda` files that contain log data for the test run, after which we can rebuild the program, but now adding the `-fprofile-use` flag:
+After we run the program — preferably on input that is as representative of the real use case as possible — it will create a bunch of `*.gcda` files that contain log data for the test run, after which we can rebuild the program, but now adding the `-fprofile-use` flag:
 
 ```
 g++ -fprofile-use [other flags] source.cc -o binary
diff --git a/content/english/hpc/external-memory/_index.md b/content/english/hpc/external-memory/_index.md
index fe53c83a..0af587b3 100644
--- a/content/english/hpc/external-memory/_index.md
+++ b/content/english/hpc/external-memory/_index.md
@@ -19,7 +19,7 @@ When you fetch anything from memory, the request goes through an incredibly comp
 
 -->
 
-When you fetch anything from memory, there is always some latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through a complex system of address translation units and caching layers designed to both help in memory management and reduce the latency.
+When you fetch anything from memory, there is always some latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through a complex system of address translation units and caching layers designed to both help in memory management and reduce latency.
 
 Therefore, the only correct answer to this question is "it depends" — primarily on where the operands are stored:
 
@@ -27,7 +27,7 @@ Therefore, the only correct answer to this question is "it depends" — primaril
 - If it was accessed recently, it is probably *cached* and will take less than that to fetch, depending on how long ago it was accessed — it could be ~50 cycles for the slowest layer of cache and around 4-5 cycles for the fastest.
 - But it could also be stored on some type of *external memory* such as a hard drive, and in this case, it will take around 5ms, or roughly $10^7$ cycles (!) to access it.
 
-Such high variance of memory performance is caused by the fact that memory hardware doesn't follow the same [laws of silicon scaling](/hpc/complexity/hardware) as CPU chips do. Memory is still improving through other means, but if 50 years ago memory timings were roughly on the same scale with the instruction latencies, nowadays they lag far behind.
+Such a high variance of memory performance is caused by the fact that memory hardware doesn't follow the same [laws of silicon scaling](/hpc/complexity/hardware) as CPU chips do. Memory is still improving through other means, but if 50 years ago memory timings were roughly on the same scale as instruction latencies, nowadays they lag far behind.
 
 ![](img/memory-vs-compute.png)
 
diff --git a/content/english/hpc/external-memory/hierarchy.md b/content/english/hpc/external-memory/hierarchy.md
index da1f5bb6..26dfc144 100644
--- a/content/english/hpc/external-memory/hierarchy.md
+++ b/content/english/hpc/external-memory/hierarchy.md
@@ -58,7 +58,7 @@ There are other caches inside CPUs that are used for something other than data.
 
 ### Non-Volatile Memory
 
-While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data to persist for prolonged periods of time without power but comes at the cost of performance and durability — because when you have more electrons, you also have more opportunities for them colliding with silicon atoms.
+While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data persist for prolonged periods of time without power but comes at the cost of performance and durability — because when you have more electrons, you also have more opportunities for them to collide with silicon atoms.
 
 <!-- error correction -->
 
diff --git a/content/english/hpc/external-memory/model.md b/content/english/hpc/external-memory/model.md
index 35cba4ea..9ab86eba 100644
--- a/content/english/hpc/external-memory/model.md
+++ b/content/english/hpc/external-memory/model.md
@@ -18,7 +18,7 @@ Similar in spirit, in the *external memory model*, we simply ignore every operat
 
 In this model, we measure the performance of an algorithm in terms of its high-level *I/O operations*, or *IOPS* — that is, the total number of blocks read or written to external memory during execution.
 
-We will mostly focus on the case where the internal memory is RAM and external memory is SSD or HDD, although the underlying analysis techniques that we will develop are applicable to any layer in the cache hierarchy. Under these settings, reasonable block size $B$ is about 1MB, internal memory size $M$ is usually a few gigabytes, and $N$ is up to a few terabytes.
+We will mostly focus on the case where the internal memory is RAM and the external memory is SSD or HDD, although the underlying analysis techniques that we will develop are applicable to any layer in the cache hierarchy. Under these settings, a reasonable block size $B$ is about 1MB, internal memory size $M$ is usually a few gigabytes, and $N$ is up to a few terabytes.
 
 ### Array Scan
 
diff --git a/content/english/hpc/number-theory/modular.md b/content/english/hpc/number-theory/modular.md
index 47310780..3d05e2f9 100644
--- a/content/english/hpc/number-theory/modular.md
+++ b/content/english/hpc/number-theory/modular.md
@@ -100,7 +100,7 @@ $$
 $$
 \begin{aligned}
 a^p &= (\underbrace{1+1+\ldots+1+1}_\text{$a$ times})^p &
-\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by defenition)}
+\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by definition)}
 \\\ &= \sum_{x_1+x_2+\ldots+x_a = p} \frac{p!}{x_1! x_2! \ldots x_a!} & \text{(which terms will not be divisible by $p$?)}
 \\\ &\equiv P(p, 0, \ldots, 0) + \ldots + P(0, 0, \ldots, p) & \text{(everything else will be canceled)}
 \\\ &= a

From 88ed77156863353ad37e486195ce0ba3ef682afb Mon Sep 17 00:00:00 2001
From: trasua <andrew@trasua.dev>
Date: Mon, 5 Sep 2022 15:59:32 +0700
Subject: [PATCH 171/173] fix typo

---
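Note for reviewers: the paragraph touched below describes a branchless search loop that rounds the interval size up and runs a fixed number of iterations for a given array size. A minimal sketch of such a loop follows. The name `lower_bound_branchless` and the explicit array parameters are assumptions made for this note, and whether the comparison compiles to a conditional move depends on the compiler.

```cpp
// Index of the first element >= x in the sorted array t[0..n), for n >= 1.
// If every element is smaller than x, this returns n - 1, so the caller
// has to check that case separately. Hypothetical helper, for illustration.
int lower_bound_branchless(const int *t, int n, int x) {
    const int *base = t;
    int len = n;
    while (len > 1) {
        int half = len / 2;
        // Multiplying by the comparison result replaces an if-statement,
        // which gives the compiler a chance to emit a conditional move.
        base += (base[half - 1] < x) * half;
        // The remaining length is rounded up, so the iteration count
        // depends only on n and not on the data.
        len -= half;
    }
    return (int) (base - t);
}
```

For example, on `{1, 3, 5, 7, 9}` a query for 4 inspects 3, then 5 twice, and ends at index 2.
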
 content/english/hpc/data-structures/binary-search.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index 85f9ef52..6426ddde 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -1,6 +1,7 @@
 ---
 title: Binary Search
 weight: 1
+published: true
 ---
 
 <!-- mention interpolation search and radix trees? -->
@@ -184,7 +185,7 @@ int lower_bound(int x) {
 
 Note that this loop is not always equivalent to the standard binary search. Since it always rounds *up* the size of the search interval, it accesses slightly different elements and may perform one comparison more than needed. Apart from simplifying computations on each iteration, it also makes the number of iterations constant if the array size is constant, removing branch mispredictions completely.
 
-As typical for predication, this trick is very fragile to compiler optimizations — depending on the compiler and how the funciton is invoked, it may still leave a branch or generate suboptimal code. It works fine on Clang 10, yielding a 2.5-3x improvement on small arrays:
+As typical for predication, this trick is very fragile to compiler optimizations — depending on the compiler and how the function is invoked, it may still leave a branch or generate suboptimal code. It works fine on Clang 10, yielding a 2.5-3x improvement on small arrays:
 
 <!-- todo: update numbers -->
 

From 4e00ee7cc5769cd650be6d215e0efacf2f14de51 Mon Sep 17 00:00:00 2001
From: novikov-vladimir <99834014+novikov-vladimir@users.noreply.github.com>
Date: Fri, 11 Nov 2022 19:57:50 +0300
Subject: [PATCH 172/173] The automaton transition must lead to the vertex
 corresponding to the maximal suffix accepted by the trie (not the minimal one).
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 content/russian/cs/string-structures/aho-corasick.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/content/russian/cs/string-structures/aho-corasick.md b/content/russian/cs/string-structures/aho-corasick.md
index 369f5171..2ca1da65 100644
--- a/content/russian/cs/string-structures/aho-corasick.md
+++ b/content/russian/cs/string-structures/aho-corasick.md
@@ -1,10 +1,11 @@
 ---
 title: Алгоритм Ахо-Корасик
 authors:
-- Сергей Слотин
+  - Сергей Слотин
 weight: 2
 prerequisites:
-- trie
+  - trie
+published: true
 ---
 
 Представим, что мы работаем журналистами в некотором авторитарном государстве, контролирующем СМИ, и в котором время от времени издаются законы, запрещающие упоминать определенные политические события или использовать определенные слова. Как эффективно реализовать подобную цензуру программно?
@@ -36,7 +37,7 @@ prerequisites:
 
 **Определение.** *Суффиксная ссылка* $l(v)$ ведёт в вершину $u \neq v$, которая соответствует наидлиннейшему принимаемому бором суффиксу $v$.
 
-**Определение.** *Автоматный переход* $\delta(v, c)$ ведёт в вершину, соответствующую минимальному принимаемому бором суффиксу строки $v + c$.
+**Определение.** *Автоматный переход* $\delta(v, c)$ ведёт в вершину, соответствующую максимальному принимаемому бором суффиксу строки $v + c$.
 
 **Наблюдение.** Если переход и так существует в боре (будем называть такой переход *прямым*), то автоматный переход будет вести туда же.
 

From 0fa54119101693a9670972a3c27657d2ee1c59d1 Mon Sep 17 00:00:00 2001
From: DavideGianessi <118054693+DavideGianessi@users.noreply.github.com>
Date: Sat, 12 Nov 2022 12:49:11 +0100
Subject: [PATCH 173/173] typo

---
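Note for reviewers: the sentence fixed below compares against "vanilla binary exponentiation" as the baseline for `inverse`. A minimal sketch of that baseline follows, assuming a prime modulus of 10^9 + 7. The constant `M` and the helper `binpow` are chosen here for illustration, and the timings quoted in the article were not reproduced with this code.

```cpp
// Assumed prime modulus. Because M is a compile-time constant, the compiler
// can replace `% M` with a multiplication-based sequence.
const int M = 1000000007;

// a^n mod M by repeated squaring, for 0 <= a < M and n >= 0.
int binpow(int a, int n) {
    int r = 1;
    while (n) {
        if (n & 1)
            r = (long long) r * a % M; // 64-bit product avoids overflow
        a = (long long) a * a % M;
        n >>= 1;
    }
    return r;
}

// Modular inverse via Fermat's little theorem: a^(M-2) * a = 1 (mod M)
// for prime M and a not divisible by M.
int inverse(int a) {
    return binpow(a, M - 2);
}
```

The Montgomery version discussed in the article keeps the same loop structure and replaces each `% M` with a Montgomery reduction.
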
 content/english/hpc/number-theory/montgomery.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/content/english/hpc/number-theory/montgomery.md b/content/english/hpc/number-theory/montgomery.md
index 669e39ba..0eeef0b0 100644
--- a/content/english/hpc/number-theory/montgomery.md
+++ b/content/english/hpc/number-theory/montgomery.md
@@ -1,6 +1,7 @@
 ---
 title: Montgomery Multiplication
 weight: 4
+published: true
 ---
 
 Unsurprisingly, a large fraction of computation in [modular arithmetic](../modular) is often spent on calculating the modulo operation, which is as slow as [general integer division](/hpc/arithmetic/division/) and typically takes 15-20 cycles, depending on the operand size.
@@ -287,6 +288,6 @@ int inverse(int _a) {
 }
 ```
 
-While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158s we omit `transform` and `reduce` (a reasonable use case is for `inverse` to be used as a subprocedure in a bigger modular computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.
+While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158ns if we omit `transform` and `reduce` (a reasonable use case is for `inverse` to be used as a subprocedure in a bigger modular computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.
 
 **Exercise.** Implement efficient *modular* [matix multiplication](/hpc/algorithms/matmul).