Programming

x86 어셈블리에서 xor, mov 또는 and에서 레지스터를 0으로 설정하는 가장 좋은 방법은 무엇입니까?

procodes 2020. 8. 5. 21:29
반응형

x86 어셈블리에서 xor, mov 또는 and에서 레지스터를 0으로 설정하는 가장 좋은 방법은 무엇입니까?


다음 명령어는 모두 동일한 작업 %eax을 수행합니다. 0으로 설정 합니다. 어떤 방법이 최적입니까 (가장 적은 기계 사이클이 필요함)?

xorl   %eax, %eax
mov    $0, %eax
andl   $0, %eax

TL; DR 요약 : xor same, same는 IS 모든 CPU를위한 최선의 선택 . 다른 방법은 다른 방법보다 유리하며 다른 방법보다 장점이 있습니다. 인텔과 AMD가 공식적으로 권장합니다. 32 비트 reg를 쓰면 상위 32가 0xor r32, r32 이므로 64 비트 모드에서는 여전히 사용 합니다. REX 접두사가 필요하기 때문에 바이트 낭비입니다.xor r64, r64

그보다 더 나쁜 Silvermont는 xor r32,r3264 비트 피연산자 크기가 아니라 뎁 브레이킹으로 만 인식 합니다. 따라서 r8..r15를 0으로 지정하여 REX 접두사가 여전히 필요한 경우에도 xor r10d,r10d, not을 사용하십시오xor r10,r10 .

GP 정수 예 :

xor   eax, eax       ; RAX = 0.  Including AL=0 etc.
xor   r10d, r10d     ; R10 = 0
xor   edx, edx       ; RDX = 0

; small code-size alternative:    cdq    ; zero RDX if EAX is already zero

; SUB-OPTIMAL
xor   rax,rax       ; waste of a REX prefix, and extra slow on Silvermont
mov   eax, 0        ; doesn't touch FLAGS, but not faster and takes more bytes

xor   al, al        ; false dep on some CPUs, not a zeroing idiom.
mov   al, 0         ; only 2 bytes, and probably better than xor al,al *if* you need to leave the rest of EAX/RAX unmodified

벡터 레지스터를 제로화하는 것은 일반적으로로하는 것이 가장 좋습니다 pxor xmm, xmm. 이는 일반적으로 gcc의 기능입니다 (FP 명령어와 함께 사용하기 전에도).

xorps xmm, xmm이해할 수 있습니다. 1 바이트보다 짧지 만 Intel Nehalem에서 실행 포트 5가 필요 pxor하지만 모든 포트 (0/1/5)에서 실행할 수 있습니다. 정수와 FP 사이의 Nehalem의 2c 바이 패스 지연 대기 시간은 일반적으로 관련이 없습니다. 비 순차적 실행은 일반적으로 새로운 종속성 체인의 시작에서이를 숨길 수 있기 때문입니다.xorpspxor

SnB 계열 마이크로 아키텍처에서 xor-zeroing의 특징은 실행 포트가 필요하지 않습니다. AMD에, 및 P6 / 코어 2 인텔, - 네 할렘 사전 xorpspxor(벡터 정수의 지시로) 같은 방식으로 처리됩니다.

128b 벡터 명령어의 AVX 버전을 사용하면 reg의 상단 부분도 0으로 설정되므로 vpxor xmm, xmm, xmmYMM (AVX1 / AVX2) 또는 ZMM (AVX512) 또는 향후 벡터 확장을 제로화하는 데 적합합니다. vpxor ymm, ymm, ymm그러나 인코딩하는 데 여분의 바이트가 필요하지 않으며 Intel에서는 동일하게 실행되지만 Zen2 (2 uops) 이전의 AMD에서는 느립니다. AVX512 ZMM 제로화에는 여분의 바이트 (EVEX 접두어의 경우)가 필요하므로 XMM 또는 YMM 제로화가 선호됩니다.

XMM / YMM / ZMM 예

#Good:
 xorps   xmm0, xmm0         ; smallest code size (for non-AVX)
 pxor    xmm0, xmm0         ; costs an extra byte, runs on any port on Nehalem.
 xorps   xmm15, xmm15       ; Needs a REX prefix but that's unavoidable if you need to use high registers without AVX.  Code-size is the only penalty.

   # with AVX
 vpxor xmm0, xmm0, xmm0    ; zeros X/Y/ZMM0
 vpxor xmm15, xmm0, xmm0   ; zeros X/Y/ZMM15, still only 2-byte VEX prefix

 vpxord xmm30, xmm30, xmm30  ; EVEX is unavoidable when zeroing high 16, but still prefer XMM or YMM for fewer uops on probable future AMD.  May be worth it to avoid needing vzeroupper.
 vpxord zmm30, zmm30, zmm30  ; Without AVX512VL you have to use a 512-bit instruction.

#sub-optimal:
 vpxor   xmm15, xmm15, xmm15   ; 3-byte VEX prefix for high source reg
 vpxord  zmm0, zmm0, zmm0      ; EVEX prefix (4 bytes), and a 512-bit uop.

참조 되어-vxorps 제로 빠른 YMM보다 XMM 레지스터와 AMD 재규어 / 불도저 / 선에?
단일 또는 기사 상륙에 대한 몇 가지 ZMM 등록을 취소 할 수있는 가장 효율적인 방법은 무엇입니까?

세미 관련 : 빠른 방법 모두 ONE __m256 비트 값을 설정 하고
(1) CPU 레지스터에 설정된 모든 비트를 효율적으로 또한 AVX512 커버 k0..7마스크 레지스터.


다양한 우치에서 xor와 같은 관용구 제로화에 대한 특별한 점

일부 CPU sub same,same는 제로 관용구처럼 인식xor 하지만 제로 관용구 를 인식하는 모든 CPU는 인식합니다xor . 그냥 사용 xor하면 CPU는 관용구를 제로화하는 인식하는 걱정할 필요가 없습니다.

xor (being a recognized zeroing idiom, unlike mov reg, 0) has some obvious and some subtle advantages (summary list, then I'll expand on those):

  • smaller code-size than mov reg,0. (All CPUs)
  • avoids partial-register penalties for later code. (Intel P6-family and SnB-family).
  • doesn't use an execution unit, saving power and freeing up execution resources. (Intel SnB-family)
  • smaller uop (no immediate data) leaves room in the uop cache-line for nearby instructions to borrow if needed. (Intel SnB-family).
  • doesn't use up entries in the physical register file. (Intel SnB-family (and P4) at least, possibly AMD as well since they use a similar PRF design instead of keeping register state in the ROB like Intel P6-family microarchitectures.)

Smaller machine-code size (2 bytes instead of 5) is always an advantage: Higher code density leads to fewer instruction-cache misses, and better instruction fetch and potentially decode bandwidth.


The benefit of not using an execution unit for xor on Intel SnB-family microarchitectures is minor, but saves power. It's more likely to matter on SnB or IvB, which only have 3 ALU execution ports. Haswell and later have 4 execution ports that can handle integer ALU instructions, including mov r32, imm32, so with perfect decision-making by the scheduler (which doesn't always happen in practice), HSW could still sustain 4 uops per clock even when they all need ALU execution ports.

See my answer on another question about zeroing registers for some more details.

Bruce Dawson's blog post that Michael Petch linked (in a comment on the question) points out that xor is handled at the register-rename stage without needing an execution unit (zero uops in the unfused domain), but missed the fact that it's still one uop in the fused domain. Modern Intel CPUs can issue & retire 4 fused-domain uops per clock. That's where the 4 zeros per clock limit comes from. Increased complexity of the register renaming hardware is only one of the reasons for limiting the width of the design to 4. (Bruce has written some very excellent blog posts, like his series on FP math and x87 / SSE / rounding issues, which I do highly recommend).


On AMD Bulldozer-family CPUs, mov immediate runs on the same EX0/EX1 integer execution ports as xor. mov reg,reg can also run on AGU0/1, but that's only for register copying, not for setting from immediates. So AFAIK, on AMD the only advantage to xor over mov is the shorter encoding. It might also save physical register resources, but I haven't seen any tests.


Recognized zeroing idioms avoid partial-register penalties on Intel CPUs which rename partial registers separately from full registers (P6 & SnB families).

xor will tag the register as having the upper parts zeroed, so xor eax, eax / inc al / inc eax avoids the usual partial-register penalty that pre-IvB CPUs have. Even without xor, IvB only needs a merging uop when the high 8bits (AH) are modified and then the whole register is read, and Haswell even removes that.

From Agner Fog's microarch guide, pg 98 (Pentium M section, referenced by later sections including SnB):

The processor recognizes the XOR of a register with itself as setting it to zero. A special tag in the register remembers that the high part of the register is zero so that EAX = AL. This tag is remembered even in a loop:

    ; Example    7.9. Partial register problem avoided in loop
    xor    eax, eax
    mov    ecx, 100
LL:
    mov    al, [esi]
    mov    [edi], eax    ; No extra uop
    inc    esi
    add    edi, 4
    dec    ecx
    jnz    LL

(from pg82): The processor remembers that the upper 24 bits of EAX are zero as long as you don't get an interrupt, misprediction, or other serializing event.

pg82 of that guide also confirms that mov reg, 0 is not recognized as a zeroing idiom, at least on early P6 designs like PIII or PM. I'd be very surprised if they spent transistors on detecting it on later CPUs.


xor sets flags, which means you have to be careful when testing conditions. Since setcc is unfortunately only available with an 8bit destination, you usually need to take care to avoid partial-register penalties.

It would have been nice if x86-64 repurposed one of the removed opcodes (like AAM) for a 16/32/64 bit setcc r/m, with the predicate encoded in the source-register 3-bit field of the r/m field (the way some other single-operand instructions use them as opcode bits). But they didn't do that, and that wouldn't help for x86-32 anyway.

Ideally, you should use xor / set flags / setcc / read full register:

...
call  some_func
xor     ecx,ecx    ; zero *before* the test
test    eax,eax
setnz   cl         ; cl = (some_func() != 0)
add     ebx, ecx   ; no partial-register penalty here

This has optimal performance on all CPUs (no stalls, merging uops, or false dependencies).

Things are more complicated when you don't want to xor before a flag-setting instruction. e.g. you want to branch on one condition and then setcc on another condition from the same flags. e.g. cmp/jle, sete, and you either don't have a spare register, or you want to keep the xor out of the not-taken code path altogether.

There are no recognized zeroing idioms that don't affect flags, so the best choice depends on the target microarchitecture. On Core2, inserting a merging uop might cause a 2 or 3 cycle stall. It appears to be cheaper on SnB, but I didn't spend much time trying to measure. Using mov reg, 0 / setcc would have a significant penalty on older Intel CPUs, and still be somewhat worse on newer Intel.

Using setcc / movzx r32, r8 is probably the best alternative for Intel P6 & SnB families, if you can't xor-zero ahead of the flag-setting instruction. That should be better than repeating the test after an xor-zeroing. (Don't even consider sahf / lahf or pushf / popf). IvB can eliminate movzx r32, r8 (i.e. handle it with register-renaming with no execution unit or latency, like xor-zeroing). Haswell and later only eliminate regular mov instructions, so movzx takes an execution unit and has non-zero latency, making test/setcc/movzx worse than xor/test/setcc, but still at least as good as test/mov r,0/setcc (and much better on older CPUs).

Using setcc / movzx with no zeroing first is bad on AMD/P4/Silvermont, because they don't track deps separately for sub-registers. There would be a false dep on the old value of the register. Using mov reg, 0/setcc for zeroing / dependency-breaking is probably the best alternative when xor/test/setcc isn't an option.

Of course, if you don't need setcc's output to be wider than 8 bits, you don't need to zero anything. However, beware of false dependencies on CPUs other than P6 / SnB if you pick a register that was recently part of a long dependency chain. (And beware of causing a partial reg stall or extra uop if you call a function that might save/restore the register you're using part of.)


and with an immediate zero isn't special-cased as independent of the old value on any CPUs I'm aware of, so it doesn't break dependency chains. It has no advantages over xor, and many disadvantages.

See http://agner.org/optimize/ for microarch documentation, including which zeroing idioms are recognized as dependency breaking (e.g. sub same,same is on some but not all CPUs, while xor same,same is recognized on all.) mov does break the dependency chain on the old value of the register (regardless of the source value, zero or not, because that's how mov works). xor only breaks dependency chains in the special-case where src and dest are the same register, which is why mov is left out of the list of specially recognized dependency-breakers. (Also, because it's not recognized as a zeroing idiom, with the other benefits that carries.)

Interestingly, the oldest P6 design (PPro through Pentium III) didn't recognize xor-zeroing as a dependency-breaker, only as a zeroing idiom for the purposes of avoiding partial-register stalls, so in some cases it was worth using both mov and then xor-zeroing in that order to break the dep and then zero again + set the internal tag bit that the high bits are zero so EAX=AX=AL. (See Agner Fog's Example 6.17. in his microarch pdf. He says this also applies to P2, P3, and even (early?) PM. A comment on the linked blog post says it was only PPro that had this oversight, but I've tested on Katmai PIII, and @Fanael tested on a Pentium M, and we both found that it didn't break a dependency for a latency-bound imul chain. This confirms Agner Fog's results, unfortunately.)


If it really makes your code nicer or saves instructions, then sure, zero with mov to avoid touching the flags, as long as you don't introduce a performance problem other than code size. Avoiding clobbering flags is the only sensible reason for not using xor, though.

참고URL : https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and

반응형