Timeline for SIMD instructions lowering CPU frequency
Current License: CC BY-SA 4.0
37 events
| when | what | by | license | comment | |
|---|---|---|---|---|---|
| Mar 4 at 3:41 | history | edited | Peter Cordes |
edited tags
|
|
| Aug 2, 2019 at 14:15 | audit | Triage | |||
| Aug 2, 2019 at 14:48 | |||||
| Jul 30, 2019 at 19:13 | audit | Triage | |||
| Jul 30, 2019 at 19:43 | |||||
| Jul 30, 2019 at 15:16 | audit | Close votes | |||
| Jul 31, 2019 at 8:23 | |||||
| Jul 29, 2019 at 17:54 | audit | Close votes | |||
| Jul 30, 2019 at 0:18 | |||||
| Jul 29, 2019 at 14:12 | audit | Triage | |||
| Jul 29, 2019 at 14:36 | |||||
| Jul 26, 2019 at 6:18 | audit | Close votes | |||
| Jul 26, 2019 at 6:18 | |||||
| Jul 25, 2019 at 15:00 | audit | Triage | |||
| Jul 25, 2019 at 15:12 | |||||
| Jul 25, 2019 at 7:37 | audit | Triage | |||
| Jul 25, 2019 at 7:57 | |||||
| Jul 22, 2019 at 14:06 | audit | Triage | |||
| Jul 22, 2019 at 14:53 | |||||
| Jul 19, 2019 at 6:21 | audit | Triage | |||
| Jul 19, 2019 at 6:21 | |||||
| Jul 17, 2019 at 9:45 | audit | Triage | |||
| Jul 17, 2019 at 10:01 | |||||
| Jul 16, 2019 at 5:45 | audit | Triage | |||
| Jul 16, 2019 at 6:13 | |||||
| Jul 4, 2019 at 0:16 | comment | added | BeeOnRope | @HCSF important routines in libc are generally compiled multiple times for different ISAs and then the version appropriate for the current CPU is selected at runtime using the dynamic loader's IFUNC capability. So you'll usually get a version optimized for your CPU (unless your libc is quite old and your CPU quite new). | |
| Jul 3, 2019 at 15:06 | comment | added | HCSF | @BeeOnRope you brought up an interesting point -- "other libraries, at a minimum libc - and these have 256-bit instructions". I thought most libraries that come with Linux distros are not compiled for a specific x86 CPU, and since some x86 CPUs don't support 256-bit AVX, a library like libc shouldn't contain any 256-bit instructions. No? | |
| Jul 3, 2019 at 14:31 | comment | added | BeeOnRope | Yeah, vzeroupper makes more sense after using ymm registers, to avoid transition penalties for "dirty upper", and probably isn't needed for xmm-only code. I think there is a flag to turn its emission off. | |
| Jul 3, 2019 at 14:30 | comment | added | BeeOnRope | Yeah, that's a reasonable way to check the binary. Keep in mind that at runtime you'll likely use other libraries, at a minimum libc, and these have 256-bit instructions, e.g. in their memcpy implementation. So you really have to do a runtime check to be sure you aren't executing any "forbidden" instructions. I don't think the 256b instructions in libc are likely to be a problem wrt the licenses, since they are light. | |
| Jul 3, 2019 at 7:34 | comment | added | HCSF |
I tried compiling with -march=skylake-avx512 -mtune=skylake-avx512 -mprefer-vector-width=128, then disassembled the result with objdump -d my_binary > binary.asm and ran grep -i ymm binary.asm, which found nothing. I guess it is safe to conclude that the binary doesn't use any 256-bit or 512-bit registers, and so no AVX-256 or AVX-512 instructions are emitted? @BeeOnRope Though, I still see many vzeroupper instructions. I thought it was only used with ymm registers. No?
|
|
| Jul 3, 2019 at 6:59 | comment | added | HCSF | I can try it now. But is there a way to check whether L1/L2 instructions are in the binary? | |
| Jul 3, 2019 at 6:57 | comment | added | BeeOnRope | Try those options and then check if any L1/L2 instructions pop up using the performance counter events for L1 and L2 licenses. | |
| Jul 3, 2019 at 6:57 | comment | added | BeeOnRope |
Peter mentioned the -mprefer-vector-width=256 option. I don't know if it prevents gcc from ever producing AVX-512 instructions (outside of direct intrinsic use), but it is certainly possible. I am not aware of any option that distinguishes between "heavy" and "light" instructions, however. Usually this isn't a problem, since if you turn off AVX-512 and don't have a bunch of FP ops, you are probably targeting L0 anyway, and AVX-512 light is still L1.
|
|
| Jul 3, 2019 at 6:54 | comment | added | HCSF | @BeeOnRope Based on your answer, is there any way to tell GCC not to generate any AVX-512 or AVX-256-heavy instructions, while still allowing all other instructions? | |
| Jul 3, 2019 at 6:51 | history | edited | BeeOnRope | CC BY-SA 4.0 |
edited title
|
| Jul 3, 2019 at 6:25 | history | edited | HCSF | CC BY-SA 4.0 |
added 5 characters in body
|
| Jul 3, 2019 at 3:37 | history | edited | phuclv | CC BY-SA 4.0 |
Improved Formatting
|
| Jul 3, 2019 at 0:03 | comment | added | BeeOnRope |
@HCSF - you can avoid the ldd related penalty by issuing a vzeroupper at the start of your program.
|
|
| Jul 3, 2019 at 0:02 | answer | added | BeeOnRope | timeline score: 99 | |
| Jul 2, 2019 at 20:35 | history | edited | Peter Cordes | CC BY-SA 4.0 |
edited tags
|
| Jul 2, 2019 at 20:34 | answer | added | Peter Cordes | timeline score: 18 | |
| Jul 2, 2019 at 14:52 | comment | added | Margaret Bloom | @HCSF You can make three builds, one without AVX, one with AVX/AVX2 and one with AVX-512 (if applicable) and profile them. Then take the fastest one. | |
| Jul 2, 2019 at 14:33 | comment | added | HCSF | @MargaretBloom thanks for sharing your thoughts and all the links. I also read Beeonrope's post about the penalty in ld. Given that my ld is very old, I think it is best for me to avoid AVX and AVX-512 related instructions. And as you pointed out, the ratio of vector to scalar instructions is also important. Since I write high-level C++ code, it is hard to figure out the ratio unless I check the assembly output each time, which slows down development... | |
| Jul 2, 2019 at 13:10 | comment | added | Margaret Bloom | ... two cycles and drop the frequency according to their tier (e.g. AVX-512 heavy instrs drop the frequency to the AVX-512 base). Travis also shared the code he used to test here. You can find the behaviour of each instruction with a bit of patience or by his rule of thumb. Finally, note that this frequency scaling is a problem only if the ratio of vector to scalar instructions is low enough that the drop in frequency is not offset by the larger width at which data is processed. Check the final binary to see if you really gained anything. | |
| Jul 2, 2019 at 13:06 | comment | added | Margaret Bloom | Travis Down (aka Beeonrope on SO) wrote about this in the comments in this post and continued the discussion here. He found that each tier (scalar, AVX/AVX2, AVX-512) has "cheap" (no FP, simple operations) instructions and "heavy" instructions. Cheap instructions drop the frequency to that of the next higher tier (e.g. cheap AVX-512 insts use the AVX/AVX2 tier) even if used sparsely. Heavy insts must be used more than 1 every ... | |
| Jul 2, 2019 at 12:51 | comment | added | HCSF | @500-InternalServerError in order to avoid jitter in the system. Think about a laser arm getting jitters. | |
| Jul 2, 2019 at 12:50 | history | edited | HCSF | CC BY-SA 4.0 |
added 99 characters in body
|
| Jul 2, 2019 at 12:49 | comment | added | 500 - Internal Server Error |
instructions to avoid in order to accomplish what exactly?
|
|
| Jul 2, 2019 at 12:45 | history | asked | HCSF | CC BY-SA 4.0 |
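
The static check discussed in the comments above (disassemble with objdump -d, then grep for ymm) can be sketched as a small script. This is only an illustrative sketch: the function names `count_wide_registers` and `disassemble` are hypothetical, and it assumes binutils objdump is on the PATH and prints register names in its usual form (e.g. %ymm0). As noted in the comments, it only inspects the given binary, not libc or other libraries loaded at runtime, and it cannot tell "light" from "heavy" instructions.

```python
import re
import subprocess
from collections import Counter

# Matches ymm/zmm register operands as objdump prints them
# (AT&T syntax "%ymm0" or Intel syntax "ymm0").
WIDE_REG = re.compile(r"\b([yz]mm)\d+\b", re.IGNORECASE)
VZEROUPPER = re.compile(r"\bvzeroupper\b", re.IGNORECASE)


def count_wide_registers(disassembly: str) -> Counter:
    """Count disassembly lines that mention ymm or zmm registers, or vzeroupper."""
    counts = Counter()
    for line in disassembly.splitlines():
        # A line is counted once per register class, however many operands it has.
        kinds = {m.group(1).lower() for m in WIDE_REG.finditer(line)}
        for kind in kinds:
            counts[kind] += 1
        if VZEROUPPER.search(line):
            counts["vzeroupper"] += 1
    return counts


def disassemble(path: str) -> str:
    """Return the output of `objdump -d` for a binary (assumes objdump is installed)."""
    result = subprocess.run(
        ["objdump", "-d", path], check=True, capture_output=True, text=True
    )
    return result.stdout
```

Calling `count_wide_registers(disassemble("my_binary"))` on a build should report zero for "ymm" and "zmm" if no 256-bit or 512-bit registers appear in the binary itself; a non-zero "vzeroupper" count on its own does not indicate wide-register use.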