Conversation

@yuzawa-san (Contributor) commented Jul 23, 2024

I found that CPU and memory usage in the decode hot path were very high, so I did some light refactoring to alleviate this.

  • Added a BitString and BitStringBuilder for efficient bitstring operations.
  • Made the substring operation zero-copy (a BitString slices the underlying data; see the sketch after this list).
  • BitString also removes the need to validate 0/1 characters with a Pattern.
  • Optimized base64 decoding by assembling a reverse lookup table from char to BitString (previously it was a HashMap from Character (boxed) to Integer, which then needed to be converted to a bit string).
  • Reduced the number of substring operations, e.g. FixedIntegerEncoder.decode(bitString, fromIndex, length).
  • Made FixedBitfieldEncoder return BitString directly, which does fulfill List<Boolean>. This is a lot smaller than an ArrayList<Boolean>; fixed bit ranges are backed by bitsets.
  • Used more StringBuilder.
  • Presized things whose exact or approximate sizes we know.
  • Used more constants. NOTE: I used String constants in the + usages since that is optimized (in JDK 8) into a StringBuilder; if it used char constants, it appears those chars have to be converted to Strings each time. In later JDKs this is not needed.
  • NOTE: the encode flow could technically use the BitString too, but I held off on that for now. The encode path already accumulates efficiently; toString is only called at the last minute.
  • Added an IntegerCache which is large enough to contain all of the vendor IDs in the global vendor list. This cuts down on a lot of allocations.
  • Added JMH microbenchmarking.
  • Made constants final.
  • Did more presizing in segment initializeData.
  • Made GppModel more DRY and used switch statements instead of if/elses.
  • Collapsed containsKey + get call pairs into a single get call with a null check; this avoids reading the Map twice.
  • Made initializeSegments more memory efficient with Arrays.asList or Collections.singletonList.
  • Used CharSequence for zero-copy string splits.
  • Used Instant instead of ZonedDateTime since Instant is more lightweight. (breaking)
  • Added rudimentary toString implementations for easy debugging.
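
To illustrate the zero-copy slicing and range-based decoding ideas, here is a minimal sketch (the class, fields, and method names are illustrative and not the exact code in this PR; the reverse lookup shown uses the URL-safe base64 alphabet):

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative sketch only -- not the actual BitString class in this PR.
final class BitStringSketch {

    // Reverse base64 lookup indexed by char, replacing a HashMap<Character, Integer>:
    // a single array read, no boxing or hashing.
    private static final int[] REVERSE_BASE64 = new int[128];
    static {
        String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
        Arrays.fill(REVERSE_BASE64, -1);
        for (int i = 0; i < alphabet.length(); i++) {
            REVERSE_BASE64[alphabet.charAt(i)] = i;
        }
    }

    private final BitSet bits; // backing storage, shared between views
    private final int from;    // start offset of this view (inclusive)
    private final int length;  // number of bits visible through this view

    BitStringSketch(BitSet bits, int from, int length) {
        this.bits = bits;
        this.from = from;
        this.length = length;
    }

    boolean getBit(int index) {
        return bits.get(from + index);
    }

    int length() {
        return length;
    }

    // Zero-copy slice: no bits are copied, only the offsets change.
    BitStringSketch substring(int start, int end) {
        return new BitStringSketch(bits, from + start, end - start);
    }

    // Decode a fixed-width integer straight out of a bit range, avoiding an
    // intermediate substring allocation entirely.
    int decodeFixedInteger(int fromIndex, int width) {
        int value = 0;
        for (int i = 0; i < width; i++) {
            value = (value << 1) | (getBit(fromIndex + i) ? 1 : 0);
        }
        return value;
    }
}
```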

microbenchmark results:

before
Benchmark                                  Mode  Cnt        Score      Error   Units
MyBenchmark.decodeGpp                     thrpt   25     3425.619 ±  103.495   ops/s
MyBenchmark.decodeGpp:gc.alloc.rate       thrpt   25     6099.516 ±  186.780  MB/sec
MyBenchmark.decodeGpp:gc.alloc.rate.norm  thrpt   25  1867059.215 ± 4110.174    B/op
MyBenchmark.decodeGpp:gc.count            thrpt   25     2632.000             counts
MyBenchmark.decodeGpp:gc.time             thrpt   25     2700.000                 ms

after
Benchmark                                  Mode  Cnt      Score     Error   Units
MyBenchmark.decodeGpp                     thrpt   25  19205.372 ± 485.226   ops/s
MyBenchmark.decodeGpp:gc.alloc.rate       thrpt   25   1037.076 ±  26.204  MB/sec
MyBenchmark.decodeGpp:gc.alloc.rate.norm  thrpt   25  56624.003 ±   0.001    B/op
MyBenchmark.decodeGpp:gc.count            thrpt   25    866.000            counts
MyBenchmark.decodeGpp:gc.time             thrpt   25    842.000                ms

This seems to be around 6x faster than the last released version, and it uses about 97% less memory.

Ad-hoc benchmark code against https://github.com/InteractiveAdvertisingBureau/iabtcf-java (partially ported into JMH; a rough sketch of the JMH port follows the code below):

// Assumed import paths -- adjust to match the iabtcf-java and iabgpp-java artifacts in use.
import com.iabtcf.decoder.TCString;

import com.iab.gpp.encoder.field.TcfEuV2Field;
import com.iab.gpp.encoder.section.TcfEuV2;

public class TcfBench {

	private static final String in = "CQCDewAQCDewAPoABABGA9EMAP-AAB4AAIAAKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-ClbVfxv_215b_l-9-n7ZHmNX_f4e-7EMQYXybPuBcy71vycF99hMzRN-ramCpkSALt2wSEDbRyY1E1QoWqIFa8w7GpNnKE7SifgZIhzEXtnWAheb5sLY_kCmeb6__d3edkf7f3a_t3c8s-VZ792vf_m9ViYnSufrR_7_20Tm_vyPvf-fv-L_Pzf6xNv3k9bf7Xr7e9_fvLb__f___f___-______9__gAAAAA.QKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-.IKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-";


	public static void main(String[] args) {
		while (true) {
			// iabtcf-java: decode and read fields
			TCString old = TCString.decode(in);
			old.getPubPurposesConsent();
			old.getPurposesConsent();
			old.getVendorConsent();
			old.getPurposesLITransparency();
			old.getVendorLegitimateInterest();
			old.getSpecialFeatureOptIns();
			old.getCmpId();
			old.getPublisherRestrictions();
			// iab-gpp: decode, read fields, mutate, re-encode
			TcfEuV2 nu = new TcfEuV2(in);
			nu.getPublisherConsents();
			nu.getPurposeConsents();
			nu.getVendorConsents();
			nu.getPurposeLegitimateInterests();
			nu.getVendorLegitimateInterests();
			nu.getSpecialFeatureOptins();
			nu.getCmpId();
			nu.getPublisherRestrictions();
			nu.setFieldValue(TcfEuV2Field.CMP_ID, 14);
			nu.encode();
		}
	}
}
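
For reference, the JMH port of this loop looks roughly like the following. This is only a sketch: the class and method names mirror the result tables above, the TcfEuV2 import path is an assumption, and the IN constant is a placeholder to be filled with the same encoded string as `in` above.

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import com.iab.gpp.encoder.section.TcfEuV2;

@State(Scope.Benchmark)
public class MyBenchmark {

    // Placeholder: paste the same encoded TCF EU v2 string used as `in` in TcfBench.
    private static final String IN = "...";

    @Benchmark
    public Object decodeGpp() {
        // Decode and read a couple of consent fields so the JIT cannot eliminate
        // the work; the real benchmark exercises more getters.
        TcfEuV2 section = new TcfEuV2(IN);
        section.getPurposeConsents();
        section.getVendorConsents();
        return section;
    }
}
```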

benchmarked using async-profiler: `asprof -d 20 -e cpu,alloc -f ~/Desktop/dump16.jfr TcfBench`

Flame Charts:

memory before: [flame chart image]

memory after (note how the TCString parse was a small teal sliver in the before graph, but the iab-gpp portion has shrunk so much that the TCString parse is now a larger percentage of the icicle chart): [flame chart image]

cpu before: [flame chart image]

cpu after: [flame chart image]

Future Additional Ideas (not in PR). I'll likely open issues to discuss:

  • List<Integer> is still a little bulky in the charts above. I was thinking of making a specialty class backed by int[], like https://github.com/InteractiveAdvertisingBureau/iabtcf-java/blob/master/iabtcf-decoder/src/main/java/com/iabtcf/utils/IntIterable.java, but that would likely break API stability. Since this is a beta candidate, I have broken compatibility and introduced an IntegerSet (a rough sketch follows this list).
  • Should the keys of the fields map be enums instead of strings? An EnumMap would be a lot more lightweight. The fields are now stored in a more efficient manner using a list.
  • Is there too much that is public which should not be public?
  • Should List be replaced with Set, since set containment is worth optimizing (i.e. "is this vendor present in the set")? Once again, IntegerSet extends AbstractSet.
  • Are the defensive copies in some getValue implementations necessary, or could we achieve the same by returning read-only views (Collections.unmodifiableList)? I introduced a managed IntegerSet which can detect modification and update the dirty flags, so that one could do tcfeuv2.getPurposes().addInt(10).
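
To make the IntegerSet idea concrete, here is a rough sketch (a hypothetical class, not the final API in this PR) of a primitive-backed set that still fulfills the java.util.Set contract:

```java
import java.util.AbstractSet;
import java.util.BitSet;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch: contains() is a single bit test instead of boxing plus
// hashing, and membership for dense, non-negative vendor IDs stays compact.
public class IntegerSetSketch extends AbstractSet<Integer> {

    private final BitSet bits = new BitSet();
    private boolean dirty; // could feed the "dirty flag" idea mentioned above

    public boolean addInt(int value) {
        boolean changed = !bits.get(value);
        bits.set(value);
        dirty |= changed;
        return changed;
    }

    public boolean containsInt(int value) {
        return value >= 0 && bits.get(value);
    }

    @Override
    public boolean add(Integer value) {
        return addInt(value);
    }

    @Override
    public boolean contains(Object o) {
        return o instanceof Integer && containsInt((Integer) o);
    }

    @Override
    public int size() {
        return bits.cardinality();
    }

    @Override
    public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            private int next = bits.nextSetBit(0);

            @Override
            public boolean hasNext() {
                return next >= 0;
            }

            @Override
            public Integer next() {
                if (next < 0) {
                    throw new NoSuchElementException();
                }
                int current = next;
                next = bits.nextSetBit(current + 1);
                return current;
            }
        };
    }
}
```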

fixes #25

supersedes #45

@ChristopherWirt

This looks great 👍

@yuzawa-san force-pushed the cpu-memory-optimizations branch from 5e4b971 to 02f53c4 on July 25, 2024 16:34
@yuzawa-san (Contributor, Author)

@iabmayank @chuff can you please take a look at this?

@yuzawa-san force-pushed the cpu-memory-optimizations branch from 8bac901 to c30373e on November 6, 2024 00:27
@yuzawa-san (Contributor, Author)

@iabmayank @chuff I have rebased off of the most recent master.

@lamrowena (Collaborator)

@yuzawa-san thank you for submitting this; we will be reviewing it in the working group.

@chuff (Contributor) commented Dec 9, 2024

@yuzawa-san
This does look pretty great. Could you remove the benchmark code from the PR?

@yuzawa-san (Contributor, Author)

@chuff removed

I would like to try adding it again in a separate PR in the future, since I feel it is quite important that contributors can easily generate benchmarks. Just curious: why did you favor removing it?

@yuzawa-san (Contributor, Author)

@chuff @AntoxaAntoxic @ChristopherWirt this is now ready for review. I have added some more changes to this PR. I am coordinating with the maintainers (@lamrowena), and this PR is allowed to have breaking changes, so I have gone forward with the more aspirational (yet breaking) changes I mentioned in the original PR description. Preliminary benchmarks show it is even faster than the earlier numbers, so I'll try to get those posted soon.

@yuzawa-san (Contributor, Author) commented Apr 16, 2025

before
Benchmark                                  Mode  Cnt        Score       Error   Units
MyBenchmark.decodeGpp                     thrpt    5     7110.163 ±   360.715   ops/s
MyBenchmark.decodeGpp:gc.alloc.rate       thrpt    5    12518.891 ±   810.618  MB/sec
MyBenchmark.decodeGpp:gc.alloc.rate.norm  thrpt    5  1846188.899 ± 31226.192    B/op
MyBenchmark.decodeGpp:gc.count            thrpt    5      907.000              counts
MyBenchmark.decodeGpp:gc.time             thrpt    5      420.000                  ms

after
Benchmark                                  Mode  Cnt      Score      Error   Units
MyBenchmark.decodeGpp                     thrpt    5  43948.717 ± 1426.909   ops/s
MyBenchmark.decodeGpp:gc.alloc.rate       thrpt    5    764.478 ±   24.820  MB/sec
MyBenchmark.decodeGpp:gc.alloc.rate.norm  thrpt    5  18240.016 ±    0.001    B/op
MyBenchmark.decodeGpp:gc.count            thrpt    5    114.000             counts
MyBenchmark.decodeGpp:gc.time             thrpt    5     52.000                 ms


98.7% less memory (which means less GC activity, which means less CPU)
6.18x faster

@iabmayank force-pushed the master branch 2 times, most recently from 897eb42 to 5b10385 on June 9, 2025 21:08
@lamrowena changed the base branch from master to 4.X on June 26, 2025 16:23
@yuzawa-san force-pushed the cpu-memory-optimizations branch from 623ee38 to 8a87187 on June 26, 2025 23:04
@lamrowena merged commit 6ac876f into IABTechLab:4.X on July 17, 2025
2 checks passed