

## Programming Lab 4C Optimization



Topics: Address alignment, address and data dependencies, overlapped execution.

Prerequisite Reading: Chapters 1-4

Revised: February 17, 2021

Run-time performance can suffer from improper address alignment, from the order in which instructions are executed, or from failing to overlap the execution of floating-point divide or square root instructions with the execution of integer instructions. Faster execution can sometimes be achieved by simply rearranging instructions. In this lab you will create the following functions and measure their execution times, from which you can determine the corresponding performance penalties in clock cycles per instruction<sup>1</sup>.
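As a rough illustration of how a penalty in clock cycles per instruction might be derived from two measured execution times, consider the sketch below. The helper `penalty_per_instruction` and its parameter names are hypothetical and not part of the lab code. It assumes the penalty is the extra cycles of the slower function over its faster counterpart, divided by the number of instructions taken to incur the penalty; deciding which instructions those are is left to your analysis of the source code.

```c
#include <assert.h>

/* Hypothetical helper, not part of the lab code.
 * Estimates a penalty in clock cycles per instruction: the extra cycles the
 * slower function needs over its faster counterpart, divided by the number
 * of instructions assumed to incur the penalty. */
double penalty_per_instruction(unsigned slow_cycles,
                               unsigned fast_cycles,
                               unsigned penalized_instructions)
{
    assert(penalized_instructions > 0);   /* avoid division by zero */
    return ((double)slow_cycles - (double)fast_cycles)
           / (double)penalized_instructions;
}
```

For example, with made-up measurements of 310 and 210 cycles and 100 penalized instructions, `penalty_per_instruction(310, 210, 100)` evaluates to 1.0 cycle per instruction.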

Address Alignment: Extra memory cycles are required when 16-bit operands are not located at addresses that are a multiple of 2, and 32-bit or 64-bit operands are not at a multiple of 4.

    FullWordAccess:
            .rept       100
            LDR         R1,[R0]
            .endr
            BX          LR

    HalfWordAccess:
            .rept       100
            LDRH        R1,[R0]
            .endr
            BX          LR

Address Dependency: Any load or store instruction whose address depends on a register that was modified by the preceding instruction will always be delayed while the register is updated.

    AddressDependency:
            .rept       100
            LDR         R1,[R0]
            LDR         R0,[R1]
            .endr
            BX          LR

    NoAddressDependency:
            .rept       100
            LDR         R1,[R0]
            LDR         R2,[R0]
            .endr
            BX          LR

Data Dependency: A floating-point instruction (e.g., VADD.F32) will always be delayed if one of its input operands is the output of the preceding floating-point arithmetic instruction. (See footnote<sup>2</sup> about the VMOV instructions.)

    DataDependency:
            .rept       100
            VADD.F32    S1,S0,S0
            VADD.F32    S0,S1,S1
            .endr
            VMOV        S1,S0
            BX          LR

    NoDataDependency:
            .rept       100
            VADD.F32    S1,S0,S0
            VADD.F32    S2,S0,S0
            .endr
            VMOV        S1,S0
            BX          LR

Concurrent Execution: The slow execution of VDIV.F32 or VSQRT.F32 may be overlapped with a sequence of several integer-only instructions. Determine the amount of overlap possible by increasing the repetitions of the NOP until the displayed execution time of the function begins to increase. (See footnote<sup>2</sup> about the VMOV instruction.)

    VDIVOverlap:
            VDIV.F32    S2,S1,S0
            .rept       1
            NOP
            .endr
            VMOV        S3,S2
            BX          LR

Analyze the measured execution times and source code to determine:

| Quantity to determine                | Units              |
|--------------------------------------|--------------------|
| Half word address = X...XX1 penalty: | cycles/instruction |
| Full word address = X...X01 penalty: | cycles/instruction |
| Full word address = X...X10 penalty: | cycles/instruction |
| Full word address = X...X11 penalty: | cycles/instruction |
| Address dependency penalty:          | cycles/instruction |
| Data dependency penalty:             | cycles/instruction |
| Maximum VDIV/VSQRT overlap:          | clock cycles       |
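The maximum VDIV/VSQRT overlap is found by adding NOPs until the execution time starts to rise. The sketch below shows one way that search could be expressed once the measurements have been written down. The function `max_overlapped_nops` and its parameters are hypothetical and not part of the lab code, and the measurements are assumed to be taken at consecutive NOP counts. Treating the result as clock cycles further assumes that each NOP takes one cycle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the lab code.
 * cycles[i] is the measured execution time of the function when it is
 * assembled with (first_nops + i) NOPs. Returns the largest NOP count whose
 * measured time has not yet risen above the first measurement, i.e. the last
 * count before the execution time begins to increase. */
unsigned max_overlapped_nops(const unsigned *cycles, size_t count,
                             unsigned first_nops)
{
    assert(cycles != NULL && count > 0);
    size_t i = 1;
    while (i < count && cycles[i] <= cycles[0])
        i++;                               /* still hidden behind the divide */
    return first_nops + (unsigned)(i - 1);
}
```

With made-up measurements of 30, 30, 30, 30, 31, and 32 cycles for 1 through 6 NOPs, the function returns 4, the last NOP count before the time increased.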

[Figure: sample program display. Under Full Word Access and Half Word Access, the entries run from "Adrs = X...X00: TBD cycles" to "Adrs = X...X11: TBD cycles". These are followed by Address Dependency, No Dependency, and VDIV Overlap entries, each shown as "TBD cycles".]
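The address patterns X...X00 through X...X11 refer to the two least-significant bits of the operand address. The following is a minimal sketch of how those bits relate to alignment, with hypothetical helper names that are not part of the lab code. It only classifies an address; it does not measure anything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers, not part of the lab code. */

/* True when addr is a multiple of size; size must be a power of two
 * (2 for a half word, 4 for a full word). */
bool is_aligned(uintptr_t addr, unsigned size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (uintptr_t)(size - 1)) == 0;
}

/* Low two address bits: 0 (X...X00), 1 (X...X01), 2 (X...X10), 3 (X...X11). */
unsigned low_two_bits(uintptr_t addr)
{
    return (unsigned)(addr & 3u);
}
```

Under this classification, an address ending in binary 10 is aligned for a half word access (`is_aligned(addr, 2)` is true) but not for a full word access (`is_aligned(addr, 4)` is false).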

<sup>1</sup> The "TBD's" shown in the figure will be replaced by the cycle counts required to execute the assembly language functions.

<sup>2</sup> The VMOV instructions in functions DataDependency and VDIVOverlap are used to force the previous floating-point instruction to complete before executing the BX LR return. The VMOV in NoDataDependency provides measurement consistency.