Conversation

@yuzawa-san (Contributor) commented Jul 23, 2024

I found that CPU and memory usage in the decode hot path were very high, so I did some light refactoring to alleviate this.

  • Added a BitString and BitStringBuilder for efficient bitstring operations.
  • Made the substring operation zero-copy (a BitString slices the underlying data; see the sketch after this list).
  • BitString also removes the need to validate 0/1 characters with a Pattern.
  • Optimized base64 decoding by assembling a reverse lookup table from char to BitString (previously it was a HashMap from Character (boxed) to Integer, which then needed to be converted to a bit string).
  • Reduced the number of substring operations, e.g. FixedIntegerEncoder.decode(bitString, fromIndex, length).
  • Made FixedBitfieldEncoder return BitString directly, which does fulfill List<Boolean>. This is a lot smaller than an ArrayList<Boolean>; fixed bit ranges are backed by bitsets.
  • Used more StringBuilder.
  • Presized things whose exact or approximate sizes we know.
  • Used more constants. NOTE: I used String constants in the + usages since that is optimized (in JDK 8) into a StringBuilder; if it used char constants, it appears those chars have to be converted to Strings each time. In later JDKs this is not needed.
  • NOTE: the encode flow could technically use the BitString too, but I held off on that for now. The encode path already accumulates efficiently; toString is only called at the last minute.
  • Added an IntegerCache which is large enough to contain all of the vendor IDs in the global vendor list. This cuts down on a lot of allocations.
  • Added JMH microbenchmarking.
  • Made constants final.
  • Did more presizing in segment initializeData.
  • Made GppModel more DRY and used switch statements instead of if/elses.
  • Collapsed containsKey + get call pairs into a single get call with a null check; this avoids reading the Map twice.
  • Made initializeSegments more memory efficient with Arrays.asList or Collections.singletonList.
  • Used CharSequence for zero-copy string splits.
  • Used Instant instead of ZonedDateTime since Instant is more lightweight. (breaking)
  • Added rudimentary toString implementations for easy debugging.
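
To illustrate the zero-copy slicing and range-based decoding ideas, here is a minimal sketch (the class, fields, and method names are illustrative and not the exact code in this PR; the reverse lookup shown uses the URL-safe base64 alphabet):

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative sketch only -- not the actual BitString class in this PR.
final class BitStringSketch {

    // Reverse base64 lookup indexed by char, replacing a HashMap<Character, Integer>:
    // a single array read, no boxing or hashing.
    private static final int[] REVERSE_BASE64 = new int[128];
    static {
        String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
        Arrays.fill(REVERSE_BASE64, -1);
        for (int i = 0; i < alphabet.length(); i++) {
            REVERSE_BASE64[alphabet.charAt(i)] = i;
        }
    }

    private final BitSet bits; // backing storage, shared between views
    private final int from;    // start offset of this view (inclusive)
    private final int length;  // number of bits visible through this view

    BitStringSketch(BitSet bits, int from, int length) {
        this.bits = bits;
        this.from = from;
        this.length = length;
    }

    boolean getBit(int index) {
        return bits.get(from + index);
    }

    int length() {
        return length;
    }

    // Zero-copy slice: no bits are copied, only the offsets change.
    BitStringSketch substring(int start, int end) {
        return new BitStringSketch(bits, from + start, end - start);
    }

    // Decode a fixed-width integer straight out of a bit range, avoiding an
    // intermediate substring allocation entirely.
    int decodeFixedInteger(int fromIndex, int width) {
        int value = 0;
        for (int i = 0; i < width; i++) {
            value = (value << 1) | (getBit(fromIndex + i) ? 1 : 0);
        }
        return value;
    }
}
```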

microbenchmark results:

before
Benchmark                                  Mode  Cnt        Score      Error   Units
MyBenchmark.decodeGpp                     thrpt   25     3425.619 ±  103.495   ops/s
MyBenchmark.decodeGpp:gc.alloc.rate       thrpt   25     6099.516 ±  186.780  MB/sec
MyBenchmark.decodeGpp:gc.alloc.rate.norm  thrpt   25  1867059.215 ± 4110.174    B/op
MyBenchmark.decodeGpp:gc.count            thrpt   25     2632.000             counts
MyBenchmark.decodeGpp:gc.time             thrpt   25     2700.000                 ms

after
Benchmark                                  Mode  Cnt      Score     Error   Units
MyBenchmark.decodeGpp                     thrpt   25  19205.372 ± 485.226   ops/s
MyBenchmark.decodeGpp:gc.alloc.rate       thrpt   25   1037.076 ±  26.204  MB/sec
MyBenchmark.decodeGpp:gc.alloc.rate.norm  thrpt   25  56624.003 ±   0.001    B/op
MyBenchmark.decodeGpp:gc.count            thrpt   25    866.000            counts
MyBenchmark.decodeGpp:gc.time             thrpt   25    842.000                ms

This seems to be around 6x faster than the last released version, and it uses about 97% less memory.

Ad-hoc benchmark code against https://github.com/InteractiveAdvertisingBureau/iabtcf-java (partially ported into JMH; a rough sketch of the JMH port follows the code below):

// Assumed import paths -- adjust to match the iabtcf-java and iabgpp-java artifacts in use.
import com.iabtcf.decoder.TCString;

import com.iab.gpp.encoder.field.TcfEuV2Field;
import com.iab.gpp.encoder.section.TcfEuV2;

public class TcfBench {

	private static final String in = "CQCDewAQCDewAPoABABGA9EMAP-AAB4AAIAAKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-ClbVfxv_215b_l-9-n7ZHmNX_f4e-7EMQYXybPuBcy71vycF99hMzRN-ramCpkSALt2wSEDbRyY1E1QoWqIFa8w7GpNnKE7SifgZIhzEXtnWAheb5sLY_kCmeb6__d3edkf7f3a_t3c8s-VZ792vf_m9ViYnSufrR_7_20Tm_vyPvf-fv-L_Pzf6xNv3k9bf7Xr7e9_fvLb__f___f___-______9__gAAAAA.QKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-.IKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-";


	public static void main(String[] args) {
		while (true) {
			// iabtcf-java: decode and read fields
			TCString old = TCString.decode(in);
			old.getPubPurposesConsent();
			old.getPurposesConsent();
			old.getVendorConsent();
			old.getPurposesLITransparency();
			old.getVendorLegitimateInterest();
			old.getSpecialFeatureOptIns();
			old.getCmpId();
			old.getPublisherRestrictions();
			// iab-gpp: decode, read fields, mutate, re-encode
			TcfEuV2 nu = new TcfEuV2(in);
			nu.getPublisherConsents();
			nu.getPurposeConsents();
			nu.getVendorConsents();
			nu.getPurposeLegitimateInterests();
			nu.getVendorLegitimateInterests();
			nu.getSpecialFeatureOptins();
			nu.getCmpId();
			nu.getPublisherRestrictions();
			nu.setFieldValue(TcfEuV2Field.CMP_ID, 14);
			nu.encode();
		}
	}
}
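
For reference, the JMH port of this loop looks roughly like the following. This is only a sketch: the class and method names mirror the result tables above, the TcfEuV2 import path is an assumption, and the IN constant is a placeholder to be filled with the same encoded string as `in` above.

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import com.iab.gpp.encoder.section.TcfEuV2;

@State(Scope.Benchmark)
public class MyBenchmark {

    // Placeholder: paste the same encoded TCF EU v2 string used as `in` in TcfBench.
    private static final String IN = "...";

    @Benchmark
    public Object decodeGpp() {
        // Decode and read a couple of consent fields so the JIT cannot eliminate
        // the work; the real benchmark exercises more getters.
        TcfEuV2 section = new TcfEuV2(IN);
        section.getPurposeConsents();
        section.getVendorConsents();
        return section;
    }
}
```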

benchmarked using async-profiler: `asprof -d 20 -e cpu,alloc -f ~/Desktop/dump16.jfr TcfBench`

Flame Charts:

memory before: [flame chart image]

memory after (note how the TCString parse was a small teal sliver in the before graph, but the iab-gpp portion has shrunk so much that the TCString parse is now a larger percentage of the icicle chart): [flame chart image]

cpu before: [flame chart image]

cpu after: [flame chart image]

Future Additional Ideas (not in PR). I'll likely open issues to discuss:

  • List<Integer> is still a little bulky in the charts above. I was thinking of making a specialty class backed by int[], like https://github.com/InteractiveAdvertisingBureau/iabtcf-java/blob/master/iabtcf-decoder/src/main/java/com/iabtcf/utils/IntIterable.java, but that would likely break API stability. Since this is a beta candidate, I have broken compatibility and introduced an IntegerSet (a rough sketch follows this list).
  • Should the keys of the fields map be enums instead of strings? An EnumMap would be a lot more lightweight. The fields are now stored in a more efficient manner using a list.
  • Is there too much that is public which should not be public?
  • Should List be replaced with Set, since set containment is worth optimizing (i.e. "is this vendor present in the set")? Once again, IntegerSet extends AbstractSet.
  • Are the defensive copies in some getValue implementations necessary, or could we achieve the same by returning read-only views (Collections.unmodifiableList)? I introduced a managed IntegerSet which can detect modification and update the dirty flags, so that one could do tcfeuv2.getPurposes().addInt(10).
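
To make the IntegerSet idea concrete, here is a rough sketch (a hypothetical class, not the final API in this PR) of a primitive-backed set that still fulfills the java.util.Set contract:

```java
import java.util.AbstractSet;
import java.util.BitSet;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch: contains() is a single bit test instead of boxing plus
// hashing, and membership for dense, non-negative vendor IDs stays compact.
public class IntegerSetSketch extends AbstractSet<Integer> {

    private final BitSet bits = new BitSet();
    private boolean dirty; // could feed the "dirty flag" idea mentioned above

    public boolean addInt(int value) {
        boolean changed = !bits.get(value);
        bits.set(value);
        dirty |= changed;
        return changed;
    }

    public boolean containsInt(int value) {
        return value >= 0 && bits.get(value);
    }

    @Override
    public boolean add(Integer value) {
        return addInt(value);
    }

    @Override
    public boolean contains(Object o) {
        return o instanceof Integer && containsInt((Integer) o);
    }

    @Override
    public int size() {
        return bits.cardinality();
    }

    @Override
    public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            private int next = bits.nextSetBit(0);

            @Override
            public boolean hasNext() {
                return next >= 0;
            }

            @Override
            public Integer next() {
                if (next < 0) {
                    throw new NoSuchElementException();
                }
                int current = next;
                next = bits.nextSetBit(current + 1);
                return current;
            }
        };
    }
}
```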

fixes #25

supersedes #45

@ChristopherWirt

This looks great 👍

@yuzawa-san force-pushed the cpu-memory-optimizations branch from 5e4b971 to 02f53c4 on July 25, 2024 16:34
@yuzawa-san (Contributor, Author)

@iabmayank @chuff can you please take a look at this?

@yuzawa-san force-pushed the cpu-memory-optimizations branch from 8bac901 to c30373e on November 6, 2024 00:27
@yuzawa-san (Contributor, Author)

@iabmayank @chuff I have rebased off of the most recent master.

@lamrowena (Collaborator)

@yuzawa-san thank you for submitting this; we will be reviewing it in the working group.

@chuff (Contributor) commented Dec 9, 2024

@yuzawa-san
This does look pretty great. Could you remove the benchmark code from the PR?

@yuzawa-san (Contributor, Author)

@chuff removed

I would like to try adding it again in a separate PR in the future, since I feel it is quite important that contributors can easily generate benchmarks. Just curious: why did you favor removing it?

@yuzawa-san (Contributor, Author)

@chuff @AntoxaAntoxic @ChristopherWirt this is now ready for review. I have added some more changes to this PR. I am coordinating with the maintainers (@lamrowena), and this PR is allowed to have breaking changes, so I have gone forward with the more aspirational (yet breaking) changes I mentioned in the original PR description. Preliminary benchmarks show it is even faster than the earlier numbers, so I'll try to get those posted soon.

@yuzawa-san (Contributor, Author) commented Apr 16, 2025

before
Benchmark                                  Mode  Cnt        Score       Error   Units
MyBenchmark.decodeGpp                     thrpt    5     7110.163 ±   360.715   ops/s
MyBenchmark.decodeGpp:gc.alloc.rate       thrpt    5    12518.891 ±   810.618  MB/sec
MyBenchmark.decodeGpp:gc.alloc.rate.norm  thrpt    5  1846188.899 ± 31226.192    B/op
MyBenchmark.decodeGpp:gc.count            thrpt    5      907.000              counts
MyBenchmark.decodeGpp:gc.time             thrpt    5      420.000                  ms

after
Benchmark                                  Mode  Cnt      Score      Error   Units
MyBenchmark.decodeGpp                     thrpt    5  43948.717 ± 1426.909   ops/s
MyBenchmark.decodeGpp:gc.alloc.rate       thrpt    5    764.478 ±   24.820  MB/sec
MyBenchmark.decodeGpp:gc.alloc.rate.norm  thrpt    5  18240.016 ±    0.001    B/op
MyBenchmark.decodeGpp:gc.count            thrpt    5    114.000             counts
MyBenchmark.decodeGpp:gc.time             thrpt    5     52.000                 ms


98.7% less memory (which means less GC activity, which means less CPU)
6.18x faster

@iabmayank force-pushed the master branch 2 times, most recently from 897eb42 to 5b10385 on June 9, 2025 21:08
@lamrowena changed the base branch from master to 4.X on June 26, 2025 16:23
@yuzawa-san force-pushed the cpu-memory-optimizations branch from 623ee38 to 8a87187 on June 26, 2025 23:04
@lamrowena merged commit 6ac876f into IABTechLab:4.X on July 17, 2025
2 checks passed