//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend (Thumb specific).
//===---------------------------------------------------------------------===//

* Add support for compiling functions in both ARM and Thumb mode, then taking
  the smaller of the two.

* Add support for compiling individual basic blocks in Thumb mode when they
  sit inside a larger ARM function. This can be used for presumed-cold code,
  like paths to abort (the failure path of asserts), EH handling code, etc.

* Thumb doesn't have normal pre/post-increment addressing modes, but we can
  load/store 32-bit integers with pre/post-increment by using load/store
  multiple instructions with a single register; see the sketch after this
  list.

* Make better use of the high registers r8, r10, r11, and r12 (ip). Some
  variants of the add and cmp instructions can use high registers, and we can
  also use them as temporaries to spill values into.

* In Thumb mode, the preferred alignments of short, byte, and bool are
  currently set to 4 to accommodate an ISA restriction (i.e. for
  "add sp, #imm", imm must be a multiple of 4).

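For the pre/post-increment bullet above, a minimal sketch of the trick
(an illustration of the idea, not current codegen):

	ldmia	r0!, {r1}	@ r1 = *r0; r0 += 4: a post-increment load
	stmia	r2!, {r1}	@ *r2 = r1; r2 += 4: a post-increment store
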
//===---------------------------------------------------------------------===//

Potential jumptable improvements:

* If we know the function size is less than (1 << 16) * 2 bytes, we can use
  16-bit jumptable entries (e.g. (L1 - L2) >> 1), or even smaller entries if
  the function is smaller still. This also applies to ARM. For instance:
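
  With the jumptable base in r1 and the switch index in r0, the dispatch for
  16-bit entries could look like this (a sketch: the register assignments and
  the exact entry encoding are illustrative assumptions):

	lsl	r2, r0, #1	@ scale the index; each entry is 2 bytes
	ldrh	r2, [r1, r2]	@ entry = (LBBn - LJTIn) >> 1
	lsl	r2, r2, #1	@ rescale the entry to a byte offset
	add	r2, r1		@ target = table base + offset
	mov	pc, r2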

* Thumb jumptable codegen can improve given some help from the assembler. This
  is what we generate right now:

	.set PCRELV0, (LJTI1_0_0-(LPCRELL0+4))
LPCRELL0:
	mov r1, #PCRELV0
	add r1, pc
	ldr r0, [r0, r1]
	mov pc, r0
	.align	2
LJTI1_0_0:
	.long	 LBB1_3
        ...

Note there is another pc-relative add that we can take advantage of:
     add r1, pc, #imm_8 * 4

We should be able to generate:

LPCRELL0:
	add r1, LJTI1_0_0
	ldr r0, [r0, r1]
	mov pc, r0
	.align	2
LJTI1_0_0:
	.long	 LBB1_3

if the assembler can translate the add to:
       add r1, pc, #((LJTI1_0_0-(LPCRELL0+4))&0xfffffffc)

The assembler could do something similar for constant pool loads:
LPCRELL0:
     ldr r0, LCPI1_0
=>
     ldr r0, [pc, #((LCPI1_0-(LPCRELL0+4))&0xfffffffc)]


//===---------------------------------------------------------------------===//

We compile the following:

define i16 @func_entry_2E_ce(i32 %i) {
        switch i32 %i, label %bb12.exitStub [
                 i32 0, label %bb4.exitStub
                 i32 1, label %bb9.exitStub
                 i32 2, label %bb4.exitStub
                 i32 3, label %bb4.exitStub
                 i32 7, label %bb9.exitStub
                 i32 8, label %bb.exitStub
                 i32 9, label %bb9.exitStub
        ]

bb12.exitStub:
        ret i16 0

bb4.exitStub:
        ret i16 1

bb9.exitStub:
        ret i16 2

bb.exitStub:
        ret i16 3
}

into:

_func_entry_2E_ce:
        mov r2, #1
        lsl r2, r0
        cmp r0, #9
        bhi LBB1_4      @bb12.exitStub
LBB1_1: @newFuncRoot
        mov r1, #13
        tst r2, r1
        bne LBB1_5      @bb4.exitStub
LBB1_2: @newFuncRoot
        ldr r1, LCPI1_0
        tst r2, r1
        bne LBB1_6      @bb9.exitStub
LBB1_3: @newFuncRoot
        mov r1, #1
        lsl r1, r1, #8
        tst r2, r1
        bne LBB1_7      @bb.exitStub
LBB1_4: @bb12.exitStub
        mov r0, #0
        bx lr
LBB1_5: @bb4.exitStub
        mov r0, #1
        bx lr
LBB1_6: @bb9.exitStub
        mov r0, #2
        bx lr
LBB1_7: @bb.exitStub
        mov r0, #3
        bx lr
LBB1_8:
        .align  2
LCPI1_0:
        .long   642


GCC compiles this to:

	cmp	r0, #9
	@ lr needed for prologue
	bhi	L2
	ldr	r3, L11
	mov	r2, #1
	mov	r1, r2, asl r0
	ands	r0, r3, r2, asl r0
	movne	r0, #2
	bxne	lr
	tst	r1, #13
	beq	L9
L3:
	mov	r0, r2
	bx	lr
L9:
	tst	r1, #256
	movne	r0, #3
	bxne	lr
L2:
	mov	r0, #0
	bx	lr
L12:
	.align 2
L11:
	.long	642

GCC is doing a few clever things here:
  1. It is predicating one of the returns. This isn't a clear win though: in
     cases where that return isn't taken, it is replacing one conditional
     branch with two 'ne' predicated instructions.
  2. It is sinking the shift of "1 << i" into the tst, using ands instead of
     tst. This will probably require whole-function isel.
  3. Where GCC emits a single instruction:
  	tst	r1, #256
     we emit three:
        mov r1, #1
        lsl r1, r1, #8
        tst r2, r1

//===---------------------------------------------------------------------===//

When spilling in Thumb mode and the sp offset is too large to fit in the ldr /
str offset field, we load the offset from a constpool entry and add it to sp:

ldr r2, LCPI
add r2, sp
ldr r2, [r2]

These instructions preserve the condition codes, which is important if the
spill is between a cmp and a bcc instruction. However, we could use the
(potentially) cheaper sequence below if we know it's ok to clobber the
condition register:

add r2, sp, #255 * 4
add r2, #132
ldr r2, [r2, #7 * 4]

This is especially bad when dynamic alloca is used: all fixed-size stack
objects are then referenced off the frame pointer with negative offsets. See
oggenc for an example.


//===---------------------------------------------------------------------===//

Poor codegen for test/CodeGen/ARM/select.ll, function f7:

	ldr r5, LCPI1_0
LPC0:
	add r5, pc
	ldr r6, LCPI1_1
	ldr r2, LCPI1_2
	mov r3, r6
	mov lr, pc
	bx r5

//===---------------------------------------------------------------------===//

Make the register allocator / spiller smarter so we can re-materialize
"mov r, imm", etc. This is tricky because almost all Thumb instructions
clobber the condition codes.

//===---------------------------------------------------------------------===//

Add ldmia, stmia support.
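
For example, merging a run of adjacent loads (a sketch of the opportunity;
note the ldmia form also writes the incremented base back to r0, so it only
applies when that update is wanted or r0 is dead afterwards):

	ldr	r1, [r0]
	ldr	r2, [r0, #4]
	ldr	r3, [r0, #8]
=>
	ldmia	r0!, {r1, r2, r3}	@ one 16-bit instruction for all three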

//===---------------------------------------------------------------------===//

Thumb load / store address mode offsets are scaled. The values kept in the
instruction operands are pre-scale values. This probably ought to be changed
to avoid extra work when we convert Thumb2 instructions to Thumb1 instructions.

//===---------------------------------------------------------------------===//

We need to make (some of the) Thumb1 instructions predicable. That will allow
shrinking of predicated Thumb2 instructions. To allow this, we need to be able
to toggle the 's' bit on these instructions, since they do not set CPSR when
they are inside IT blocks.

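For example, the same 16-bit add encoding behaves differently depending on
whether it sits inside an IT block (a sketch):

	adds	r0, r0, r1	@ outside an IT block: updates CPSR
	it	eq
	addeq	r0, r0, r1	@ inside an IT block: same 16-bit encoding,
				@ but CPSR is left untouched
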
//===---------------------------------------------------------------------===//

Make use of hi register variants of cmp: tCMPhir / tCMPZhir.
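
For instance (a sketch; pairing this mnemonic form with tCMPhir is our
assumption):

	cmp	r0, r8		@ compare a lo register against a hi register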

//===---------------------------------------------------------------------===//

Thumb1 immediate fields sometimes keep pre-scaled values. See
Thumb1RegisterInfo::eliminateFrameIndex. This is inconsistent with ARM and
Thumb2.

//===---------------------------------------------------------------------===//

Rather than having tBR_JTr print a ".align 2" and having the constant island
pass pad it, add a target-specific ALIGN instruction instead. That way,
GetInstSizeInBytes won't have to over-estimate. It could also be used by a
loop alignment pass.

//===---------------------------------------------------------------------===//

We generate conditional code for icmp when we don't need to. This code:

  int foo(int s) {
    return s == 1;
  }

produces:

foo:
        cmp     r0, #1
        mov.w   r0, #0
        it      eq
        moveq   r0, #1
        bx      lr

when it could use subs + adcs. This is GCC PR46975.
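
A branchless possibility (a sketch of the flag trick; not necessarily the
exact sequence GCC settled on for PR46975):

foo:
        subs    r0, r0, #1      @ r0 = s - 1; zero iff s == 1
        rsbs    r1, r0, #0      @ carry set iff r0 was zero, i.e. iff s == 1
        adcs    r0, r0, r1      @ r0 = (s - 1) + (1 - s) + carry = carry
        bx      lr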