[stdlib] Greatly optimize unicode casing #5395

martinvuyk · Oct 5, 2025

Greatly optimize unicode casing by adding fast paths for ASCII and Latin-1 encodings.

Benchmark results

CPU: Intel® Core™ i7-7700HQ

bench	old value (ms)	new value (ms)	markdown perc	magnitude
bench_string_lower[10]	72.42	15.50	0.79	4.67
bench_string_upper[10]	121.40	22.11	0.82	5.49
bench_string_lower[30]	74.95	15.39	0.79	4.87
bench_string_upper[30]	125.36	21.35	0.83	5.87
bench_string_lower[50]	77.73	14.76	0.81	5.27
bench_string_upper[50]	132.22	20.05	0.85	6.59
bench_string_lower[100]	86.46	14.48	0.83	5.97
bench_string_upper[100]	119.76	19.96	0.83	6.00
bench_string_lower[1000]	69.18	14.03	0.80	4.93
bench_string_upper[1000]	125.47	19.48	0.84	6.44
bench_string_lower[10000]	67.78	13.90	0.79	4.88
bench_string_upper[10000]	125.44	19.44	0.85	6.45
bench_string_lower[100000]	67.40	14.68	0.78	4.59
bench_string_upper[100000]	125.52	20.99	0.83	5.98
bench_string_lower[1000000]	67.39	15.11	0.78	4.46
bench_string_upper[1000000]	125.58	20.58	0.84	6.10

mojo/stdlib/stdlib/collections/string/_unicode.mojo

Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>

barcharcraz · Oct 17, 2025

This is obviously an improvement, but I think we should optimize the whole algorithm as well. _unicode_lookups appears to be a very simple translation of the UCD case folding data, and we could probably get better performance (and better code size) by preprocessing that data to something more compact.

A starting point would be figuring out the offsets for each character, then merging ranges with the same offsets and storing just the range start and length. Obviously this is more "work" to look up but it's a lot less thrashing. If you start the search by looking at the first range then you end up doing basically the same thing as this optimization except you might not get the bitwise version (which probably won't matter, arithmetic has the same TP as bitwise ops and there aren't any dependency chains on the output).

You'd probably want to first check at the beginning of your map (in case of ascii) then start searching from wherever the last processed character was.

martinvuyk requested a review from a team as a code owner October 5, 2025 20:40

martinvuyk force-pushed the fix-unicode-lowercasing branch 3 times, most recently from c86ae88 to 46a2469 Compare October 5, 2025 20:47

NathanSWard reviewed Oct 6, 2025

View reviewed changes

mojo/stdlib/stdlib/collections/string/_unicode.mojo Show resolved Hide resolved

Greatly optimize unicode casing

f321b0b

Signed-off-by: martinvuyk <martin.vuyklop@gmail.com>

martinvuyk force-pushed the fix-unicode-lowercasing branch from 46a2469 to f321b0b Compare October 6, 2025 12:50

martinvuyk mentioned this pull request Oct 12, 2025

[stdlib] Add performant String.is_like() to replace lhs.lower() == rhs.lower() patterns #5389

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[stdlib] Greatly optimize unicode casing #5395

[stdlib] Greatly optimize unicode casing #5395

martinvuyk commented Oct 5, 2025

Uh oh!

Uh oh!

barcharcraz commented Oct 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Search code, repositories, users, issues, pull requests...

[stdlib] Greatly optimize unicode casing #5395

Are you sure you want to change the base?

[stdlib] Greatly optimize unicode casing #5395

Conversation

martinvuyk commented Oct 5, 2025

Benchmark results

Uh oh!

Uh oh!

barcharcraz commented Oct 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants