If you remove the first word from the string "hello world", what should the result be? This is the story of how we discovered that the answer could be your root password!

Introduction

All x86-64 CPUs have a set of 128-bit vector registers called the XMM registers. You can never have enough bits, so recent CPUs have extended the width of those registers to 256 bits and even 512 bits. The 256-bit extended registers are called YMM, and the 512-bit registers are called ZMM.

These big registers are useful in lots of situations, not just number crunching! They're even used by standard C library functions, like strcmp, memcpy, strlen and so on.

Here are the first few instructions of glibc's AVX2 optimized strlen:

    (gdb) x/20i __strlen_avx2

The full routine is complicated and handles lots of cases, but let's step through this simple case.

The first step is to initialize ymm0 to zero, which is done by just xoring xmm0 with itself. This clears the whole of ymm0, because a VEX-encoded instruction that writes an XMM register also zeroes the upper bits of the corresponding YMM register.

    vpxor xmm0, xmm0, xmm0

Here rdi contains a pointer to our string, so vpcmpeqb will check which bytes in ymm0 match our string, and store the result in ymm1. As we've already set ymm0 to all zero bytes, only nul bytes will match.

    vpcmpeqb ymm1, ymm0, [rdi]

Now we can extract the result into a general purpose register like eax with vpmovmskb. Any nul byte will create a 1 bit, and any other value will create a 0 bit.

    vpmovmskb eax, ymm1

Finding the first nul byte is now just a case of counting the number of trailing zero bits. That's a common enough operation that there's an instruction for it: tzcnt (Trailing Zero Count).

Now we have the position of the first nul byte, in just four machine instructions!

You can probably imagine just how often strlen is running on your system right now, but suffice to say, bits and bytes are flowing into these vector registers from all over your system constantly.

You might have noticed that I missed one instruction, and that's vzeroupper. You guessed it, vzeroupper will zero the upper bits of the vector registers.

The reason we do this is that if you mix XMM and YMM registers, the XMM registers automatically get promoted to full width. This works fine, but superscalar processors need to track dependencies so that they know which operations can be parallelized. This promotion adds a dependency on those upper bits, and that causes unnecessary stalls while the processor waits for results it didn't really need.

These stalls are what glibc is trying to avoid with vzeroupper. Now any future results won't depend on what those bits are, so we safely avoid that bottleneck!

The Vector Register File

Now that we know what vzeroupper does, how does it do it?

Your processor doesn't have a single physical location where each register lives. Instead, it has what's called a Register File and a Register Allocation Table (RAT). This is a bit like managing the heap with malloc and free, if you think of each register as a pointer. The RAT keeps track of what space in the register file is assigned to which register.
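To make the four-instruction strlen walkthrough concrete, here is a minimal Python sketch that emulates each instruction on a single 32-byte block. The helper names (`vpcmpeqb`, `vpmovmskb`, `tzcnt`, `strlen_block`, `demo`) are illustrative and chosen only to mirror the instructions they imitate. This is not glibc's code, and it assumes the string ends inside one block; the real routine also has to deal with alignment, page boundaries and longer strings.

```python
# Toy emulation of vpxor / vpcmpeqb / vpmovmskb / tzcnt on one 32-byte block.
# All function names are hypothetical helpers, not a real API.

YMM_BYTES = 32  # a YMM register holds 256 bits = 32 bytes


def vpcmpeqb(a: bytes, b: bytes) -> bytes:
    """Byte-wise compare: 0xFF where the bytes are equal, 0x00 elsewhere."""
    return bytes(0xFF if x == y else 0x00 for x, y in zip(a, b))


def vpmovmskb(v: bytes) -> int:
    """Gather the top bit of every byte into an integer (byte 0 -> bit 0)."""
    mask = 0
    for i, byte in enumerate(v):
        mask |= (byte >> 7) << i
    return mask


def tzcnt(x: int, width: int = 32) -> int:
    """Count trailing zero bits; yields the operand width when x is zero."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


def strlen_block(block: bytes) -> int:
    """Offset of the first nul byte in a 32-byte block (32 if there is none)."""
    assert len(block) == YMM_BYTES
    ymm0 = bytes(YMM_BYTES)        # vpxor     xmm0, xmm0, xmm0
    ymm1 = vpcmpeqb(ymm0, block)   # vpcmpeqb  ymm1, ymm0, [rdi]
    eax = vpmovmskb(ymm1)          # vpmovmskb eax, ymm1
    return tzcnt(eax)              # tzcnt     eax, eax


def demo() -> int:
    # Assumed input: "hello world" padded with nul bytes to fill the block.
    block = b"hello world".ljust(YMM_BYTES, b"\x00")
    return strlen_block(block)
```

Calling `demo()` returns 11, the length of "hello world": bytes 0 to 10 are non-zero, so the lowest set bit in the mask is bit 11.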