Performance
The performance of SSE code is tricky to estimate, especially in real-world scenarios. Properties observed in a microbenchmark often do not transfer to production code. For example, aggressive register usage generally improves microbenchmark results but may not make a real-world application run faster.
On this page, assembly snippets for common routines are provided for inspection, along with analysis produced by LLVM-MCA. Users are encouraged to benchmark their own code and to create an issue if they believe they have found a performance problem. The analysis provided by LLVM-MCA cannot be used as a proxy for predicting performance, although it is a useful tool for comparing alternatives.
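As a starting point for benchmarking your own code, a minimal timing harness might look like the sketch below. It is not part of Klein: the name `ns_per_call` and the convention that the callable takes the loop index and returns a `float` are assumptions made for illustration. The results are accumulated and published through a `volatile` so the optimizer cannot discard the loop, but as noted above, numbers from a harness like this should be read with caution.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper (not a Klein API): calls `fn(i)` for i in
// [0, iterations) and returns the mean wall-clock time per call in
// nanoseconds. `fn` is assumed to return a float.
template <typename Fn>
double ns_per_call(Fn&& fn, std::size_t iterations)
{
    using clock = std::chrono::steady_clock;

    float sink = 0.f;
    auto const start = clock::now();
    for (std::size_t i = 0; i != iterations; ++i)
    {
        // Accumulate every result so each call contributes to `sink`.
        sink += fn(i);
    }
    auto const stop = clock::now();

    // Publish the accumulated value so the loop is not optimized away.
    volatile float keep = sink;
    (void)keep;

    std::chrono::duration<double, std::nano> const elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

A call such as `ns_per_call([&](std::size_t) { return routine_under_test(); }, 1000000)` (with `routine_under_test` standing in for your own code) gives a rough per-call figure. Run it several times and compare alternatives on the same machine rather than relying on a single absolute number.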
Only a few routines are provided here as an indicator of the performance and implementation characteristics of the rest of the code. To understand the implications of the various counters and resource estimates provided, please refer to the excellent analysis provided at uops.info.
Rotor Composition
kln::rotor ab(kln::rotor const& a, kln::rotor const& b)
{
    return a * b;
}
Klein LLVM-MCA assembly and analysis
Iterations: 100
Instructions: 2400
Total Cycles: 821
Total uOps: 2400
Dispatch Width: 6
uOps Per Cycle: 2.92
IPC: 2.92
Block RThroughput: 8.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 6 0.50 * movaps (%rdi), %xmm0
1 6 0.50 * movaps (%rsi), %xmm1
1 1 0.33 movaps %xmm0, %xmm2
1 1 1.00 shufps $0, %xmm0, %xmm2
1 4 0.50 mulps %xmm1, %xmm2
1 1 0.33 movaps %xmm0, %xmm3
1 1 1.00 shufps $121, %xmm0, %xmm3
1 1 0.33 movaps %xmm1, %xmm4
1 1 1.00 shufps $157, %xmm1, %xmm4
1 4 0.50 mulps %xmm3, %xmm4
1 4 0.50 subps %xmm4, %xmm2
1 1 0.33 movaps %xmm0, %xmm3
1 1 1.00 shufps $230, %xmm0, %xmm3
1 1 0.33 movaps %xmm1, %xmm4
1 1 1.00 shufps $2, %xmm1, %xmm4
1 4 0.50 mulps %xmm3, %xmm4
1 1 1.00 shufps $159, %xmm0, %xmm0
1 1 1.00 shufps $123, %xmm1, %xmm1
1 4 0.50 mulps %xmm0, %xmm1
1 4 0.50 addps %xmm4, %xmm1
1 1 0.25 movl $-2147483648, %eax
1 1 1.00 movd %eax, %xmm0
1 1 0.33 pxor %xmm1, %xmm0
1 4 0.50 addps %xmm2, %xmm0
Resources:
[0] - SKLDivider
[1] - SKLFPDivider
[2] - SKLPort0
[3] - SKLPort1
[4] - SKLPort2
[5] - SKLPort3
[6] - SKLPort4
[7] - SKLPort5
[8] - SKLPort6
[9] - SKLPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 6.49 6.50 1.00 1.00 - 8.01 1.00 -
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - - - - 1.00 - - - - movaps (%rdi), %xmm0
- - - - 1.00 - - - - - movaps (%rsi), %xmm1
- - 0.50 0.50 - - - - - - movaps %xmm0, %xmm2
- - - - - - - 1.00 - - shufps $0, %xmm0, %xmm2
- - 0.03 0.97 - - - - - - mulps %xmm1, %xmm2
- - 0.50 0.50 - - - - - - movaps %xmm0, %xmm3
- - - - - - - 1.00 - - shufps $121, %xmm0, %xmm3
- - 0.51 0.48 - - - 0.01 - - movaps %xmm1, %xmm4
- - - - - - - 1.00 - - shufps $157, %xmm1, %xmm4
- - 0.98 0.02 - - - - - - mulps %xmm3, %xmm4
- - 0.94 0.06 - - - - - - subps %xmm4, %xmm2
- - 0.50 0.50 - - - - - - movaps %xmm0, %xmm3
- - - - - - - 1.00 - - shufps $230, %xmm0, %xmm3
- - 0.50 0.50 - - - - - - movaps %xmm1, %xmm4
- - - - - - - 1.00 - - shufps $2, %xmm1, %xmm4
- - 0.51 0.49 - - - - - - mulps %xmm3, %xmm4
- - - - - - - 1.00 - - shufps $159, %xmm0, %xmm0
- - - - - - - 1.00 - - shufps $123, %xmm1, %xmm1
- - 0.53 0.47 - - - - - - mulps %xmm0, %xmm1
- - 0.02 0.98 - - - - - - addps %xmm4, %xmm1
- - - - - - - - 1.00 - movl $-2147483648, %eax
- - - - - - - 1.00 - - movd %eax, %xmm0
- - 0.48 0.52 - - - - - - pxor %xmm1, %xmm0
- - 0.49 0.51 - - - - - - addps %xmm2, %xmm0
For comparison, here is the assembly and analysis corresponding to semantically identical code from RTM.
rtm::quatf ab(rtm::quatf const& a, rtm::quatf const& b)
{
    return rtm::quat_mul(a, b);
}
RTM LLVM-MCA assembly and analysis
Iterations: 100
Instructions: 2300
Total Cycles: 824
Total uOps: 2600
Dispatch Width: 6
uOps Per Cycle: 3.16
IPC: 2.79
Block RThroughput: 7.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 6 0.50 * movaps (%rdi), %xmm0
1 6 0.50 * movaps (%rsi), %xmm2
1 1 0.33 movaps %xmm2, %xmm1
1 1 1.00 shufps $0, %xmm2, %xmm1
1 1 0.33 movaps %xmm2, %xmm3
1 1 1.00 shufps $85, %xmm2, %xmm3
1 1 0.33 movaps %xmm2, %xmm4
1 1 1.00 shufps $170, %xmm2, %xmm4
1 1 1.00 shufps $255, %xmm2, %xmm2
1 4 0.50 mulps %xmm0, %xmm2
1 1 0.33 movaps %xmm0, %xmm5
1 1 1.00 shufps $27, %xmm0, %xmm5
1 4 0.50 mulps %xmm5, %xmm1
1 1 1.00 shufps $177, %xmm5, %xmm5
1 4 0.50 mulps %xmm3, %xmm5
2 7 0.50 * xorps .LCPI0_0(%rip), %xmm1
1 1 1.00 shufps $177, %xmm0, %xmm0
1 4 0.50 mulps %xmm4, %xmm0
2 7 0.50 * xorps .LCPI0_1(%rip), %xmm5
2 7 0.50 * xorps .LCPI0_2(%rip), %xmm0
1 4 0.50 addps %xmm2, %xmm1
1 4 0.50 addps %xmm5, %xmm0
1 4 0.50 addps %xmm1, %xmm0
Resources:
[0] - SKLDivider
[1] - SKLFPDivider
[2] - SKLPort0
[3] - SKLPort1
[4] - SKLPort2
[5] - SKLPort3
[6] - SKLPort4
[7] - SKLPort5
[8] - SKLPort6
[9] - SKLPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 6.49 6.50 2.50 2.50 - 8.01 - -
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - - - 0.50 0.50 - - - - movaps (%rdi), %xmm0
- - - - 0.49 0.51 - - - - movaps (%rsi), %xmm2
- - 0.03 0.96 - - - 0.01 - - movaps %xmm2, %xmm1
- - - - - - - 1.00 - - shufps $0, %xmm2, %xmm1
- - 0.95 0.05 - - - - - - movaps %xmm2, %xmm3
- - - - - - - 1.00 - - shufps $85, %xmm2, %xmm3
- - 0.04 0.96 - - - - - - movaps %xmm2, %xmm4
- - - - - - - 1.00 - - shufps $170, %xmm2, %xmm4
- - - - - - - 1.00 - - shufps $255, %xmm2, %xmm2
- - 0.49 0.51 - - - - - - mulps %xmm0, %xmm2
- - 0.95 0.05 - - - - - - movaps %xmm0, %xmm5
- - - - - - - 1.00 - - shufps $27, %xmm0, %xmm5
- - 0.52 0.48 - - - - - - mulps %xmm5, %xmm1
- - - - - - - 1.00 - - shufps $177, %xmm5, %xmm5
- - 0.49 0.51 - - - - - - mulps %xmm3, %xmm5
- - 0.48 0.52 0.50 0.50 - - - - xorps .LCPI0_0(%rip), %xmm1
- - - - - - - 1.00 - - shufps $177, %xmm0, %xmm0
- - 0.52 0.48 - - - - - - mulps %xmm4, %xmm0
- - 0.48 0.52 0.50 0.50 - - - - xorps .LCPI0_1(%rip), %xmm5
- - - - 0.51 0.49 - 1.00 - - xorps .LCPI0_2(%rip), %xmm0
- - 0.51 0.49 - - - - - - addps %xmm2, %xmm1
- - 0.52 0.48 - - - - - - addps %xmm5, %xmm0
- - 0.51 0.49 - - - - - - addps %xmm1, %xmm0
Finally, for good measure, here is the same procedure and analysis for GLM.
glm::quat rotor_composition(glm::quat const& a, glm::quat const& b)
{
    return a * b;
}
GLM LLVM-MCA assembly and analysis
Iterations: 100
Instructions: 5700
Total Cycles: 1522
Total uOps: 5800
Dispatch Width: 6
uOps Per Cycle: 3.81
IPC: 3.75
Block RThroughput: 14.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 5 0.50 * movss (%rdi), %xmm4
1 5 0.50 * movss 4(%rdi), %xmm3
1 5 0.50 * movss 8(%rdi), %xmm2
1 5 0.50 * movss 12(%rdi), %xmm0
1 5 0.50 * movss (%rsi), %xmm9
1 5 0.50 * movss 4(%rsi), %xmm8
1 5 0.50 * movss 8(%rsi), %xmm7
1 5 0.50 * movss 12(%rsi), %xmm10
1 1 0.33 movaps %xmm0, %xmm5
1 1 0.33 movaps %xmm4, %xmm1
1 1 0.33 movaps %xmm0, %xmm6
1 4 0.50 mulss %xmm10, %xmm1
1 1 0.33 movaps %xmm2, %xmm11
1 4 0.50 mulss %xmm9, %xmm5
1 4 0.50 mulss %xmm8, %xmm6
1 4 0.50 mulss %xmm10, %xmm11
1 4 0.50 addss %xmm1, %xmm5
1 1 0.33 movaps %xmm3, %xmm1
1 4 0.50 mulss %xmm7, %xmm1
1 4 0.50 addss %xmm1, %xmm5
1 1 0.33 movaps %xmm2, %xmm1
1 4 0.50 mulss %xmm8, %xmm1
1 4 0.50 subss %xmm1, %xmm5
1 1 0.33 movaps %xmm3, %xmm1
1 4 0.50 mulss %xmm10, %xmm1
1 4 0.50 addss %xmm1, %xmm6
1 1 0.33 movaps %xmm2, %xmm1
1 4 0.50 mulss %xmm9, %xmm1
1 4 0.50 mulss %xmm7, %xmm2
1 4 0.50 addss %xmm1, %xmm6
1 1 0.33 movaps %xmm4, %xmm1
1 4 0.50 mulss %xmm7, %xmm1
1 4 0.50 subss %xmm1, %xmm6
1 1 0.33 movaps %xmm0, %xmm1
1 4 0.50 mulss %xmm7, %xmm1
1 4 0.50 mulss %xmm10, %xmm0
1 1 1.00 unpcklps %xmm6, %xmm5
1 1 0.33 movaps %xmm5, %xmm7
1 4 0.50 addss %xmm11, %xmm1
1 1 0.33 movaps %xmm4, %xmm11
1 4 0.50 mulss %xmm8, %xmm11
1 4 0.50 mulss %xmm9, %xmm4
1 4 0.50 addss %xmm11, %xmm1
1 1 0.33 movaps %xmm3, %xmm11
1 4 0.50 mulss %xmm8, %xmm3
1 4 0.50 subss %xmm4, %xmm0
1 4 0.50 mulss %xmm9, %xmm11
1 4 0.50 subss %xmm3, %xmm0
1 4 0.50 subss %xmm11, %xmm1
1 4 0.50 subss %xmm2, %xmm0
1 1 1.00 unpcklps %xmm0, %xmm1
1 1 1.00 movlhps %xmm1, %xmm7
2 1 1.00 * movaps %xmm7, -40(%rsp)
1 5 0.50 * movq -32(%rsp), %rax
1 5 0.50 * movq -40(%rsp), %xmm0
1 1 1.00 movq %rax, %xmm1
1 1 1.00 * movq %rax, -16(%rsp)
Resources:
[0] - SKLDivider
[1] - SKLFPDivider
[2] - SKLPort0
[3] - SKLPort1
[4] - SKLPort2
[5] - SKLPort3
[6] - SKLPort4
[7] - SKLPort5
[8] - SKLPort6
[9] - SKLPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 15.00 15.00 5.01 5.01 2.00 15.00 - 1.98
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - - - - 1.00 - - - - movss (%rdi), %xmm4
- - - - 1.00 - - - - - movss 4(%rdi), %xmm3
- - - - 0.01 0.99 - - - - movss 8(%rdi), %xmm2
- - - - 0.99 0.01 - - - - movss 12(%rdi), %xmm0
- - - - - 1.00 - - - - movss (%rsi), %xmm9
- - - - 1.00 - - - - - movss 4(%rsi), %xmm8
- - - - 0.98 0.02 - - - - movss 8(%rsi), %xmm7
- - - - 0.02 0.98 - - - - movss 12(%rsi), %xmm10
- - 0.48 - - - - 0.52 - - movaps %xmm0, %xmm5
- - - 0.49 - - - 0.51 - - movaps %xmm4, %xmm1
- - 0.98 0.01 - - - 0.01 - - movaps %xmm0, %xmm6
- - 0.98 0.02 - - - - - - mulss %xmm10, %xmm1
- - 0.01 0.01 - - - 0.98 - - movaps %xmm2, %xmm11
- - 0.50 0.50 - - - - - - mulss %xmm9, %xmm5
- - 1.00 - - - - - - - mulss %xmm8, %xmm6
- - 0.50 0.50 - - - - - - mulss %xmm10, %xmm11
- - 0.49 0.51 - - - - - - addss %xmm1, %xmm5
- - - 0.01 - - - 0.99 - - movaps %xmm3, %xmm1
- - 0.99 0.01 - - - - - - mulss %xmm7, %xmm1
- - - 1.00 - - - - - - addss %xmm1, %xmm5
- - - - - - - 1.00 - - movaps %xmm2, %xmm1
- - 1.00 - - - - - - - mulss %xmm8, %xmm1
- - 0.01 0.99 - - - - - - subss %xmm1, %xmm5
- - 0.01 - - - - 0.99 - - movaps %xmm3, %xmm1
- - 0.50 0.50 - - - - - - mulss %xmm10, %xmm1
- - 0.49 0.51 - - - - - - addss %xmm1, %xmm6
- - - - - - - 1.00 - - movaps %xmm2, %xmm1
- - 0.99 0.01 - - - - - - mulss %xmm9, %xmm1
- - 0.01 0.99 - - - - - - mulss %xmm7, %xmm2
- - 0.50 0.50 - - - - - - addss %xmm1, %xmm6
- - - - - - - 1.00 - - movaps %xmm4, %xmm1
- - 1.00 - - - - - - - mulss %xmm7, %xmm1
- - - 1.00 - - - - - - subss %xmm1, %xmm6
- - - - - - - 1.00 - - movaps %xmm0, %xmm1
- - 1.00 - - - - - - - mulss %xmm7, %xmm1
- - 0.50 0.50 - - - - - - mulss %xmm10, %xmm0
- - - - - - - 1.00 - - unpcklps %xmm6, %xmm5
- - - - - - - 1.00 - - movaps %xmm5, %xmm7
- - 0.50 0.50 - - - - - - addss %xmm11, %xmm1
- - - - - - - 1.00 - - movaps %xmm4, %xmm11
- - 0.99 0.01 - - - - - - mulss %xmm8, %xmm11
- - 0.02 0.98 - - - - - - mulss %xmm9, %xmm4
- - 0.02 0.98 - - - - - - addss %xmm11, %xmm1
- - - - - - - 1.00 - - movaps %xmm3, %xmm11
- - 0.98 0.02 - - - - - - mulss %xmm8, %xmm3
- - 0.01 0.99 - - - - - - subss %xmm4, %xmm0
- - 0.51 0.49 - - - - - - mulss %xmm9, %xmm11
- - 0.02 0.98 - - - - - - subss %xmm3, %xmm0
- - - 1.00 - - - - - - subss %xmm11, %xmm1
- - 0.01 0.99 - - - - - - subss %xmm2, %xmm0
- - - - - - - 1.00 - - unpcklps %xmm0, %xmm1
- - - - - - - 1.00 - - movlhps %xmm1, %xmm7
- - - - 0.01 - 1.00 - - 0.99 movaps %xmm7, -40(%rsp)
- - - - - 1.00 - - - - movq -32(%rsp), %rax
- - - - 1.00 - - - - - movq -40(%rsp), %xmm0
- - - - - - - 1.00 - - movq %rax, %xmm1
- - - - - 0.01 1.00 - - 0.99 movq %rax, -16(%rsp)
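For orientation, the operation measured in the three listings above corresponds to a quaternion product. The sketch below writes it out in scalar form, assuming an (x, y, z, w) component layout and the standard Hamilton convention; the names `Quat` and `hamilton` are hypothetical and are not the Klein, RTM, or GLM API, and each library may differ in sign or ordering conventions.

```cpp
// Hypothetical scalar quaternion type with (x, y, z, w) layout.
struct Quat
{
    float x, y, z, w;
};

// Hamilton product a * b: 16 multiplications and 12 additions/subtractions.
inline Quat hamilton(Quat const& a, Quat const& b)
{
    return Quat{
        a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y, // x
        a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x, // y
        a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w, // z
        a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z  // w
    };
}
```

The GLM listing contains 16 `mulss` and 12 `addss`/`subss` instructions, which lines up with this scalar form, whereas the Klein and RTM listings each pack the same arithmetic into four `mulps` and three `addps`/`subps` instructions plus shuffles.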
Motor-Point Application (Dual Quat Application)
kln::point motor_application(kln::motor const& m, kln::point const& p)
{
    return m(p);
}
Klein LLVM-MCA assembly and analysis
Iterations: 100
Instructions: 5900
Total Cycles: 1831
Total uOps: 5900
Dispatch Width: 6
uOps Per Cycle: 3.22
IPC: 3.22
Block RThroughput: 13.5
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 6 0.50 * movaps (%rdi), %xmm3
1 6 0.50 * movaps 16(%rdi), %xmm6
1 1 0.33 movaps %xmm3, %xmm12
1 1 1.00 shufps $0, %xmm3, %xmm12
1 1 0.33 movaps %xmm3, %xmm11
1 1 0.33 movaps %xmm3, %xmm10
1 1 0.33 movaps %xmm12, %xmm8
1 1 0.33 movaps %xmm12, %xmm9
1 1 0.33 movaps %xmm3, %xmm4
1 1 0.33 movaps %xmm3, %xmm1
1 1 0.33 movaps %xmm6, %xmm7
1 4 0.50 mulps %xmm6, %xmm12
1 1 0.33 movaps %xmm6, %xmm0
1 1 1.00 shufps $0, %xmm6, %xmm6
1 4 0.50 mulps %xmm3, %xmm6
1 1 1.00 shufps $156, %xmm3, %xmm3
1 4 0.50 mulps %xmm3, %xmm11
1 1 1.00 shufps $120, %xmm10, %xmm10
1 4 0.50 mulps %xmm10, %xmm8
1 4 0.50 subps %xmm8, %xmm11
1 4 0.50 mulps %xmm3, %xmm9
1 4 0.50 mulps %xmm10, %xmm4
1 4 0.50 addps %xmm9, %xmm4
1 4 0.50 mulps %xmm1, %xmm1
1 1 0.33 movaps %xmm1, %xmm5
1 1 1.00 shufps $1, %xmm1, %xmm5
1 4 0.50 addps %xmm1, %xmm5
1 1 0.33 movaps %xmm1, %xmm2
1 1 1.00 shufps $158, %xmm1, %xmm2
1 1 1.00 shufps $123, %xmm1, %xmm1
1 4 0.50 addps %xmm2, %xmm1
1 1 0.25 movl $-2147483648, %eax
1 1 1.00 movd %eax, %xmm2
1 1 0.33 pxor %xmm1, %xmm2
1 6 0.50 * movaps .LCPI3_0(%rip), %xmm1
1 4 0.50 mulps %xmm1, %xmm11
1 4 0.50 mulps %xmm1, %xmm4
1 4 0.50 subps %xmm2, %xmm5
1 1 1.00 shufps $156, %xmm7, %xmm7
1 4 0.50 mulps %xmm10, %xmm7
1 4 0.50 subps %xmm12, %xmm7
1 1 1.00 shufps $120, %xmm0, %xmm0
1 4 0.50 mulps %xmm3, %xmm0
1 4 0.50 subps %xmm0, %xmm7
1 4 0.50 subps %xmm6, %xmm7
1 4 0.50 mulps %xmm1, %xmm7
1 6 0.50 * movaps (%rsi), %xmm0
1 1 0.33 movaps %xmm0, %xmm1
1 1 1.00 shufps $156, %xmm0, %xmm1
1 4 0.50 mulps %xmm11, %xmm1
1 1 0.33 movaps %xmm0, %xmm2
1 1 1.00 shufps $120, %xmm0, %xmm2
1 4 0.50 mulps %xmm4, %xmm2
1 4 0.50 addps %xmm1, %xmm2
1 4 0.50 mulps %xmm0, %xmm5
1 4 0.50 addps %xmm2, %xmm5
1 1 1.00 shufps $0, %xmm0, %xmm0
1 4 0.50 mulps %xmm7, %xmm0
1 4 0.50 addps %xmm5, %xmm0
Resources:
[0] - SKLDivider
[1] - SKLFPDivider
[2] - SKLPort0
[3] - SKLPort1
[4] - SKLPort2
[5] - SKLPort3
[6] - SKLPort4
[7] - SKLPort5
[8] - SKLPort6
[9] - SKLPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 18.01 18.01 2.00 2.00 - 17.98 1.00 -
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - - - - 1.00 - - - - movaps (%rdi), %xmm3
- - - - 1.00 - - - - - movaps 16(%rdi), %xmm6
- - 0.01 0.98 - - - 0.01 - - movaps %xmm3, %xmm12
- - - - - - - 1.00 - - shufps $0, %xmm3, %xmm12
- - - 1.00 - - - - - - movaps %xmm3, %xmm11
- - 0.01 0.49 - - - 0.50 - - movaps %xmm3, %xmm10
- - 0.49 0.01 - - - 0.50 - - movaps %xmm12, %xmm8
- - - 0.50 - - - 0.50 - - movaps %xmm12, %xmm9
- - - 0.51 - - - 0.49 - - movaps %xmm3, %xmm4
- - 0.50 0.01 - - - 0.49 - - movaps %xmm3, %xmm1
- - 0.01 0.50 - - - 0.49 - - movaps %xmm6, %xmm7
- - - 1.00 - - - - - - mulps %xmm6, %xmm12
- - - - - - - 1.00 - - movaps %xmm6, %xmm0
- - - - - - - 1.00 - - shufps $0, %xmm6, %xmm6
- - 0.50 0.50 - - - - - - mulps %xmm3, %xmm6
- - - - - - - 1.00 - - shufps $156, %xmm3, %xmm3
- - 0.98 0.02 - - - - - - mulps %xmm3, %xmm11
- - - - - - - 1.00 - - shufps $120, %xmm10, %xmm10
- - 0.02 0.98 - - - - - - mulps %xmm10, %xmm8
- - 0.03 0.97 - - - - - - subps %xmm8, %xmm11
- - 0.01 0.99 - - - - - - mulps %xmm3, %xmm9
- - 0.49 0.51 - - - - - - mulps %xmm10, %xmm4
- - 0.49 0.51 - - - - - - addps %xmm9, %xmm4
- - 0.51 0.49 - - - - - - mulps %xmm1, %xmm1
- - 0.49 - - - - 0.51 - - movaps %xmm1, %xmm5
- - - - - - - 1.00 - - shufps $1, %xmm1, %xmm5
- - 0.99 0.01 - - - - - - addps %xmm1, %xmm5
- - 0.01 0.50 - - - 0.49 - - movaps %xmm1, %xmm2
- - - - - - - 1.00 - - shufps $158, %xmm1, %xmm2
- - - - - - - 1.00 - - shufps $123, %xmm1, %xmm1
- - 0.03 0.97 - - - - - - addps %xmm2, %xmm1
- - - - - - - - 1.00 - movl $-2147483648, %eax
- - - - - - - 1.00 - - movd %eax, %xmm2
- - 0.99 0.01 - - - - - - pxor %xmm1, %xmm2
- - - - - 1.00 - - - - movaps .LCPI3_0(%rip), %xmm1
- - 0.52 0.48 - - - - - - mulps %xmm1, %xmm11
- - 0.01 0.99 - - - - - - mulps %xmm1, %xmm4
- - 0.98 0.02 - - - - - - subps %xmm2, %xmm5
- - - - - - - 1.00 - - shufps $156, %xmm7, %xmm7
- - 0.49 0.51 - - - - - - mulps %xmm10, %xmm7
- - 0.50 0.50 - - - - - - subps %xmm12, %xmm7
- - - - - - - 1.00 - - shufps $120, %xmm0, %xmm0
- - 0.52 0.48 - - - - - - mulps %xmm3, %xmm0
- - 0.50 0.50 - - - - - - subps %xmm0, %xmm7
- - 1.00 - - - - - - - subps %xmm6, %xmm7
- - 1.00 - - - - - - - mulps %xmm1, %xmm7
- - - - 1.00 - - - - - movaps (%rsi), %xmm0
- - 0.48 0.52 - - - - - - movaps %xmm0, %xmm1
- - - - - - - 1.00 - - shufps $156, %xmm0, %xmm1
- - 0.99 0.01 - - - - - - mulps %xmm11, %xmm1
- - - 1.00 - - - - - - movaps %xmm0, %xmm2
- - - - - - - 1.00 - - shufps $120, %xmm0, %xmm2
- - 0.50 0.50 - - - - - - mulps %xmm4, %xmm2
- - 0.99 0.01 - - - - - - addps %xmm1, %xmm2
- - 0.01 0.99 - - - - - - mulps %xmm0, %xmm5
- - 0.99 0.01 - - - - - - addps %xmm2, %xmm5
- - - - - - - 1.00 - - shufps $0, %xmm0, %xmm0
- - 0.98 0.02 - - - - - - mulps %xmm7, %xmm0
- - 0.99 0.01 - - - - - - addps %xmm5, %xmm0
glm::vec4 motor_application(glm::dualquat const& a, glm::vec4 const& b)
{
    return glm::mat3x4_cast(a) * b;
}
GLM LLVM-MCA assembly and analysis
Iterations: 100
Instructions: 14100
Total Cycles: 5435
Total uOps: 15700
Dispatch Width: 6
uOps Per Cycle: 2.89
IPC: 2.59
Block RThroughput: 38.5
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 5 0.50 * movss 4(%rdi), %xmm1
1 5 0.50 * movss (%rdi), %xmm2
1 5 0.50 * movss 8(%rdi), %xmm15
1 5 0.50 * movss 12(%rdi), %xmm7
1 1 0.33 movaps %xmm2, %xmm3
1 1 0.33 movaps %xmm1, %xmm0
1 1 0.33 movaps %xmm1, %xmm6
1 5 0.50 * movss (%rsi), %xmm10
1 4 0.50 mulss %xmm1, %xmm0
1 1 0.33 movaps %xmm7, %xmm5
1 1 0.33 movaps %xmm2, %xmm4
1 5 0.50 * movss 4(%rsi), %xmm9
1 4 0.50 mulss %xmm2, %xmm3
1 1 0.33 movaps %xmm15, %xmm14
1 1 0.33 movaps %xmm2, %xmm11
1 5 0.50 * movss 8(%rsi), %xmm8
1 1 0.33 movaps %xmm15, %xmm12
1 1 0.33 movaps %xmm15, %xmm13
1 4 0.50 addss %xmm0, %xmm3
1 1 0.33 movaps %xmm15, %xmm0
1 4 0.50 mulss %xmm15, %xmm0
1 4 0.50 addss %xmm0, %xmm3
1 1 0.33 movaps %xmm7, %xmm0
1 4 0.50 mulss %xmm7, %xmm0
1 4 0.50 addss %xmm0, %xmm3
1 1 0.33 movaps %xmm15, %xmm0
1 11 3.00 divss %xmm3, %xmm5
1 11 3.00 divss %xmm3, %xmm6
1 11 3.00 divss %xmm3, %xmm4
1 11 3.00 divss %xmm3, %xmm14
1 1 0.33 movaps %xmm5, %xmm3
1 1 0.33 movaps %xmm1, %xmm5
1 4 0.50 mulss %xmm6, %xmm5
1 4 0.50 addss %xmm6, %xmm6
1 4 0.50 mulss %xmm3, %xmm7
1 4 0.50 addss %xmm3, %xmm3
1 4 0.50 mulss %xmm4, %xmm11
1 4 0.50 addss %xmm4, %xmm4
1 4 0.50 mulss %xmm6, %xmm13
2 1 1.00 * movss %xmm5, -56(%rsp)
1 4 0.50 mulss %xmm3, %xmm2
1 1 0.33 movaps %xmm1, %xmm5
1 4 0.50 mulss %xmm3, %xmm1
1 4 0.50 mulss %xmm4, %xmm12
2 1 1.00 * movss %xmm11, -40(%rsp)
1 4 0.50 mulss %xmm4, %xmm5
1 5 0.50 * movss 24(%rdi), %xmm11
2 1 1.00 * movss %xmm13, -16(%rsp)
1 4 0.50 mulss %xmm3, %xmm15
1 5 0.50 * movss 16(%rdi), %xmm13
2 1 1.00 * movss %xmm2, -24(%rsp)
2 1 1.00 * movss %xmm1, -20(%rsp)
1 5 0.50 * movss 28(%rdi), %xmm1
2 1 1.00 * movss %xmm12, -28(%rsp)
1 5 0.50 * movss 20(%rdi), %xmm12
2 1 1.00 * movss %xmm1, -12(%rsp)
1 4 0.50 mulss %xmm14, %xmm0
1 4 0.50 addss %xmm14, %xmm14
1 5 0.50 * movss -40(%rsp), %xmm1
1 1 0.33 movaps %xmm5, %xmm2
1 4 0.50 subss %xmm15, %xmm5
1 4 0.50 addss %xmm15, %xmm2
1 5 0.50 * movss -16(%rsp), %xmm15
1 4 0.50 addss %xmm7, %xmm1
2 9 0.50 * subss -56(%rsp), %xmm1
1 4 0.50 mulss %xmm10, %xmm5
1 4 0.50 mulss %xmm9, %xmm2
1 4 0.50 subss %xmm0, %xmm1
1 4 0.50 mulss %xmm10, %xmm1
1 4 0.50 addss %xmm2, %xmm1
1 5 0.50 * movss -28(%rsp), %xmm2
2 9 0.50 * subss -20(%rsp), %xmm2
1 4 0.50 mulss %xmm8, %xmm2
1 4 0.50 addss %xmm2, %xmm1
1 5 0.50 * movss -56(%rsp), %xmm2
1 4 0.50 addss %xmm7, %xmm2
2 9 0.50 * subss -40(%rsp), %xmm2
1 4 0.50 subss %xmm0, %xmm2
1 4 0.50 addss %xmm7, %xmm0
1 1 0.33 movaps %xmm3, %xmm7
2 9 0.50 * subss -40(%rsp), %xmm0
2 9 0.50 * subss -56(%rsp), %xmm0
1 4 0.50 mulss %xmm13, %xmm7
1 4 0.50 mulss %xmm9, %xmm2
1 4 0.50 addss %xmm5, %xmm2
1 5 0.50 * movss -24(%rsp), %xmm5
1 4 0.50 addss %xmm15, %xmm5
2 9 0.50 * subss -24(%rsp), %xmm15
1 4 0.50 mulss %xmm8, %xmm5
1 4 0.50 mulss %xmm9, %xmm15
1 4 0.50 addss %xmm5, %xmm2
1 5 0.50 * movss -28(%rsp), %xmm5
2 9 0.50 * addss -20(%rsp), %xmm5
1 4 0.50 mulss %xmm10, %xmm5
1 1 1.00 unpcklps %xmm2, %xmm1
1 4 0.50 addss %xmm15, %xmm5
1 1 0.33 movaps %xmm0, %xmm15
1 4 0.50 mulss %xmm8, %xmm15
1 4 0.50 addss %xmm15, %xmm5
1 5 0.50 * movss -12(%rsp), %xmm15
1 1 0.33 movaps %xmm15, %xmm0
1 4 0.50 mulss %xmm4, %xmm0
1 4 0.50 subss %xmm7, %xmm0
1 1 0.33 movaps %xmm14, %xmm7
1 4 0.50 mulss %xmm12, %xmm7
1 4 0.50 addss %xmm7, %xmm0
1 1 0.33 movaps %xmm6, %xmm7
1 4 0.50 mulss %xmm11, %xmm7
1 4 0.50 subss %xmm7, %xmm0
1 1 0.33 movaps %xmm14, %xmm7
1 4 0.50 mulss %xmm13, %xmm7
1 4 0.50 mulss %xmm15, %xmm14
1 4 0.50 mulss %xmm0, %xmm10
1 1 0.33 movaps %xmm15, %xmm0
1 4 0.50 mulss %xmm6, %xmm0
1 4 0.50 mulss %xmm13, %xmm6
1 4 0.50 subss %xmm7, %xmm0
1 1 0.33 movaps %xmm3, %xmm7
1 4 0.50 mulss %xmm12, %xmm7
1 4 0.50 addss %xmm6, %xmm14
1 1 0.33 movaps %xmm1, %xmm6
1 4 0.50 mulss %xmm11, %xmm3
1 4 0.50 subss %xmm7, %xmm0
1 1 0.33 movaps %xmm4, %xmm7
1 4 0.50 mulss %xmm12, %xmm4
1 4 0.50 mulss %xmm11, %xmm7
1 4 0.50 subss %xmm4, %xmm14
1 4 0.50 addss %xmm7, %xmm0
2 7 0.50 * xorps .LC0(%rip), %xmm0
1 4 0.50 subss %xmm3, %xmm14
1 4 0.50 mulss %xmm0, %xmm9
1 4 0.50 mulss %xmm14, %xmm8
1 4 0.50 subss %xmm10, %xmm9
1 4 0.50 subss %xmm8, %xmm9
1 1 1.00 unpcklps %xmm9, %xmm5
1 1 1.00 movlhps %xmm5, %xmm6
2 1 1.00 * movaps %xmm6, -56(%rsp)
1 5 0.50 * movq -48(%rsp), %rax
1 5 0.50 * movq -56(%rsp), %xmm0
1 1 1.00 movq %rax, %xmm1
1 1 1.00 * movq %rax, -40(%rsp)
Resources:
[0] - SKLDivider
[1] - SKLFPDivider
[2] - SKLPort0
[3] - SKLPort1
[4] - SKLPort2
[5] - SKLPort3
[6] - SKLPort4
[7] - SKLPort5
[8] - SKLPort6
[9] - SKLPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- 12.00 41.03 41.04 15.49 15.51 9.00 29.93 - 6.00
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - - - 0.50 0.50 - - - - movss 4(%rdi), %xmm1
- - - - 0.50 0.50 - - - - movss (%rdi), %xmm2
- - - - 0.50 0.50 - - - - movss 8(%rdi), %xmm15
- - - - 0.50 0.50 - - - - movss 12(%rdi), %xmm7
- - - - - - - 1.00 - - movaps %xmm2, %xmm3
- - - 0.01 - - - 0.99 - - movaps %xmm1, %xmm0
- - 0.01 - - - - 0.99 - - movaps %xmm1, %xmm6
- - - - 0.50 0.50 - - - - movss (%rsi), %xmm10
- - 0.99 0.01 - - - - - - mulss %xmm1, %xmm0
- - - - - - - 1.00 - - movaps %xmm7, %xmm5
- - 0.01 - - - - 0.99 - - movaps %xmm2, %xmm4
- - - - 0.50 0.50 - - - - movss 4(%rsi), %xmm9
- - 0.99 0.01 - - - - - - mulss %xmm2, %xmm3
- - - - - - - 1.00 - - movaps %xmm15, %xmm14
- - 0.01 - - - - 0.99 - - movaps %xmm2, %xmm11
- - - - 0.49 0.51 - - - - movss 8(%rsi), %xmm8
- - 0.99 - - - - 0.01 - - movaps %xmm15, %xmm12
- - - 0.01 - - - 0.99 - - movaps %xmm15, %xmm13
- - 0.99 0.01 - - - - - - addss %xmm0, %xmm3
- - 0.01 - - - - 0.99 - - movaps %xmm15, %xmm0
- - 0.99 0.01 - - - - - - mulss %xmm15, %xmm0
- - 0.99 0.01 - - - - - - addss %xmm0, %xmm3
- - - - - - - 1.00 - - movaps %xmm7, %xmm0
- - 0.99 0.01 - - - - - - mulss %xmm7, %xmm0
- - 0.01 0.99 - - - - - - addss %xmm0, %xmm3
- - 0.01 - - - - 0.99 - - movaps %xmm15, %xmm0
- 3.00 1.00 - - - - - - - divss %xmm3, %xmm5
- 3.00 1.00 - - - - - - - divss %xmm3, %xmm6
- 3.00 1.00 - - - - - - - divss %xmm3, %xmm4
- 3.00 1.00 - - - - - - - divss %xmm3, %xmm14
- - - - - - - 1.00 - - movaps %xmm5, %xmm3
- - - - - - - 1.00 - - movaps %xmm1, %xmm5
- - - 1.00 - - - - - - mulss %xmm6, %xmm5
- - 1.00 - - - - - - - addss %xmm6, %xmm6
- - - 1.00 - - - - - - mulss %xmm3, %xmm7
- - 1.00 - - - - - - - addss %xmm3, %xmm3
- - 1.00 - - - - - - - mulss %xmm4, %xmm11
- - - 1.00 - - - - - - addss %xmm4, %xmm4
- - - 1.00 - - - - - - mulss %xmm6, %xmm13
- - - - - 0.99 1.00 - - 0.01 movss %xmm5, -56(%rsp)
- - - 1.00 - - - - - - mulss %xmm3, %xmm2
- - 0.01 - - - - 0.99 - - movaps %xmm1, %xmm5
- - 1.00 - - - - - - - mulss %xmm3, %xmm1
- - - 1.00 - - - - - - mulss %xmm4, %xmm12
- - - - 0.99 - 1.00 - - 0.01 movss %xmm11, -40(%rsp)
- - 1.00 - - - - - - - mulss %xmm4, %xmm5
- - - - 0.51 0.49 - - - - movss 24(%rdi), %xmm11
- - - - - 0.01 1.00 - - 0.99 movss %xmm13, -16(%rsp)
- - - 1.00 - - - - - - mulss %xmm3, %xmm15
- - - - 0.49 0.51 - - - - movss 16(%rdi), %xmm13
- - - - 0.01 0.99 1.00 - - - movss %xmm2, -24(%rsp)
- - - - - - 1.00 - - 1.00 movss %xmm1, -20(%rsp)
- - - - 0.51 0.49 - - - - movss 28(%rdi), %xmm1
- - - - - - 1.00 - - 1.00 movss %xmm12, -28(%rsp)
- - - - 0.49 0.51 - - - - movss 20(%rdi), %xmm12
- - - - - - 1.00 - - 1.00 movss %xmm1, -12(%rsp)
- - 1.00 - - - - - - - mulss %xmm14, %xmm0
- - - 1.00 - - - - - - addss %xmm14, %xmm14
- - - - 0.51 0.49 - - - - movss -40(%rsp), %xmm1
- - - - - - - 1.00 - - movaps %xmm5, %xmm2
- - - 1.00 - - - - - - subss %xmm15, %xmm5
- - - 1.00 - - - - - - addss %xmm15, %xmm2
- - - - 0.50 0.50 - - - - movss -16(%rsp), %xmm15
- - 1.00 - - - - - - - addss %xmm7, %xmm1
- - - 1.00 0.99 0.01 - - - - subss -56(%rsp), %xmm1
- - - 1.00 - - - - - - mulss %xmm10, %xmm5
- - - 1.00 - - - - - - mulss %xmm9, %xmm2
- - - 1.00 - - - - - - subss %xmm0, %xmm1
- - 0.01 0.99 - - - - - - mulss %xmm10, %xmm1
- - 0.01 0.99 - - - - - - addss %xmm2, %xmm1
- - - - 0.50 0.50 - - - - movss -28(%rsp), %xmm2
- - 1.00 - 0.49 0.51 - - - - subss -20(%rsp), %xmm2
- - 0.01 0.99 - - - - - - mulss %xmm8, %xmm2
- - 0.01 0.99 - - - - - - addss %xmm2, %xmm1
- - - - 0.51 0.49 - - - - movss -56(%rsp), %xmm2
- - 1.00 - - - - - - - addss %xmm7, %xmm2
- - - 1.00 0.50 0.50 - - - - subss -40(%rsp), %xmm2
- - - 1.00 - - - - - - subss %xmm0, %xmm2
- - 1.00 - - - - - - - addss %xmm7, %xmm0
- - - - - - - 1.00 - - movaps %xmm3, %xmm7
- - 1.00 - 0.50 0.50 - - - - subss -40(%rsp), %xmm0
- - 1.00 - 0.50 0.50 - - - - subss -56(%rsp), %xmm0
- - 1.00 - - - - - - - mulss %xmm13, %xmm7
- - 0.01 0.99 - - - - - - mulss %xmm9, %xmm2
- - 0.01 0.99 - - - - - - addss %xmm5, %xmm2
- - - - 0.50 0.50 - - - - movss -24(%rsp), %xmm5
- - 0.99 0.01 - - - - - - addss %xmm15, %xmm5
- - 1.00 - 0.50 0.50 - - - - subss -24(%rsp), %xmm15
- - 0.99 0.01 - - - - - - mulss %xmm8, %xmm5
- - - 1.00 - - - - - - mulss %xmm9, %xmm15
- - - 1.00 - - - - - - addss %xmm5, %xmm2
- - - - 0.50 0.50 - - - - movss -28(%rsp), %xmm5
- - - 1.00 0.50 0.50 - - - - addss -20(%rsp), %xmm5
- - - 1.00 - - - - - - mulss %xmm10, %xmm5
- - - - - - - 1.00 - - unpcklps %xmm2, %xmm1
- - - 1.00 - - - - - - addss %xmm15, %xmm5
- - - - - - - 1.00 - - movaps %xmm0, %xmm15
- - 0.01 0.99 - - - - - - mulss %xmm8, %xmm15
- - 0.01 0.99 - - - - - - addss %xmm15, %xmm5
- - - - 0.50 0.50 - - - - movss -12(%rsp), %xmm15
- - - - - - - 1.00 - - movaps %xmm15, %xmm0
- - - 1.00 - - - - - - mulss %xmm4, %xmm0
- - - 1.00 - - - - - - subss %xmm7, %xmm0
- - - - - - - 1.00 - - movaps %xmm14, %xmm7
- - 1.00 - - - - - - - mulss %xmm12, %xmm7
- - - 1.00 - - - - - - addss %xmm7, %xmm0
- - - - - - - 1.00 - - movaps %xmm6, %xmm7
- - 1.00 - - - - - - - mulss %xmm11, %xmm7
- - 0.99 0.01 - - - - - - subss %xmm7, %xmm0
- - - - - - - 1.00 - - movaps %xmm14, %xmm7
- - - 1.00 - - - - - - mulss %xmm13, %xmm7
- - 1.00 - - - - - - - mulss %xmm15, %xmm14
- - 0.99 0.01 - - - - - - mulss %xmm0, %xmm10
- - - - - - - 1.00 - - movaps %xmm15, %xmm0
- - - 1.00 - - - - - - mulss %xmm6, %xmm0
- - 1.00 - - - - - - - mulss %xmm13, %xmm6
- - 1.00 - - - - - - - subss %xmm7, %xmm0
- - - - - - - 1.00 - - movaps %xmm3, %xmm7
- - 1.00 - - - - - - - mulss %xmm12, %xmm7
- - 1.00 - - - - - - - addss %xmm6, %xmm14
- - - - - - - 1.00 - - movaps %xmm1, %xmm6
- - 1.00 - - - - - - - mulss %xmm11, %xmm3
- - 0.99 0.01 - - - - - - subss %xmm7, %xmm0
- - - - - - - 1.00 - - movaps %xmm4, %xmm7
- - - 1.00 - - - - - - mulss %xmm12, %xmm4
- - 1.00 - - - - - - - mulss %xmm11, %xmm7
- - - 1.00 - - - - - - subss %xmm4, %xmm14
- - 0.99 0.01 - - - - - - addss %xmm7, %xmm0
- - - - 0.50 0.50 - 1.00 - - xorps .LC0(%rip), %xmm0
- - - 1.00 - - - - - - subss %xmm3, %xmm14
- - - 1.00 - - - - - - mulss %xmm0, %xmm9
- - 0.01 0.99 - - - - - - mulss %xmm14, %xmm8
- - 1.00 - - - - - - - subss %xmm10, %xmm9
- - - 1.00 - - - - - - subss %xmm8, %xmm9
- - - - - - - 1.00 - - unpcklps %xmm9, %xmm5
- - - - - - - 1.00 - - movlhps %xmm5, %xmm6
- - - - - - 1.00 - - 1.00 movaps %xmm6, -56(%rsp)
- - - - 0.50 0.50 - - - - movq -48(%rsp), %rax
- - - - 0.50 0.50 - - - - movq -56(%rsp), %xmm0
- - - - - - - 1.00 - - movq %rax, %xmm1
- - - - - 0.01 1.00 - - 0.99 movq %rax, -40(%rsp)
Note
An RTM implementation of the dual-quat application is not provided. Internally, RTM uses a type called qvvf, which stores scale, translation, and rotation as separate components. A qvvf cannot be interpolated the same way as a motor or dual quaternion, so it is omitted from this portion of the analysis.