Program Flow Control


Program Flow Control


Many programs (or parts of programs) execute a sequence of instructions stored in memory in the exact order they are stored. In this case, the program counter is simply incremented after each instruction. But the linear flow of a program may need to be broken for any one of several reasons, including: reaching the terminal instruction of a “for”, “while” or “infinite” loop construct (for loops iterate a predetermined number of times, while loops iterate for as long as some condition is true, and infinite loops iterate until a reset condition); building if-then constructs; performing subroutine calls and returns; processing interrupts; and processing exceptions (these last two topics are dealt with in a later project).

In the ARM processor, the branch instruction (mnemonic B) is used to redirect program flow. The B instruction loads a new address value into the PC. In assembly code, the B mnemonic is always followed by a label. The Assembler builds the opcode by calculating the distance between the branch instruction and the label, and inserting that calculated number into B opcode as an immediate operand. At run time, the ALU adds or subtracts the immediate value from the PC, and loads the result back into the PC. Since branch instructions use only the top 8 bits in the opcode, 24 opcode bits are left for the immediate operand. This relative address calculation scheme means a branch instruction can redirect program flow anywhere from -33554432 to 33554428 from the current program counter.

Branch instructions can be conditional, meaning any of the extended mnemonics can be used (see table below). In fact, it is far more common to use extended mnemonics with a branch instruction than not. As an example, a for-loop construct might load a loop count value into a register, and then decrement that register until it equals zero (see code below).

Table 2. Condition code bits (Arm Architecture Reference Manual, page A8-288)
Condition Mnemonic Extension Meaning (integer) Meaning (Floating-point) Condition Flags
0000 EQ Equal Equal Z == 1
0001 NE Not Equal Not Equal or unordered Z == 0
0010 CS Carry Set Greater than, equal, or unordered C== 1
0011 CC Carry Clear Less than C == 0
0100 MI Minus, negative Less than N == 1
0101 PL Plus, positive or zero Greater than, equal, or unordered N == 0
0110 VS Overflow Unordered V == 1
0111 VC No overflow Not unordered V == 0
1000 HI Unsigned Higher Greater than, or unordered C == 1 and Z == 0
1001 LS Unsigned lower or same Less than, or equal C == 0 or Z == 1
1010 GE Signed greater than or equal Greater than or equal N == V
1011 LT Signed less than Less than, or unordered N != V
1100 GT Signed greater than CGreater than Z == 0 and N == V
1101 LE Signed less than or equal Less than, equal, or unordered Z == 1 or N != V
1110 None (AL) Always (unconditional) Always (unconditional) Any
        MOV    R2,#10	    @ R2 <- 10; R2 is loop index
myloop:	ADD    R1,R1, #1    @ Increment R1
        SUBS   R2,R2, #1    @ Decrement R2
        BNE    myloop       @ If R2 != 0, branch to myloop
        MOV    R4, 0xABC    @ Continue on with program...

Notice the SUBS instruction in the example above. This subtracts one from the loop index value that was placed in R2, and due to the “S” extension, the ALU status bits are updated. Immediately after the SUBS instruction, the BNE instruction checks the status bits, and then loads the PC with the address of the instruction adjacent to the “myloop” label if the Z bit is zero. A condition check for a while-loop construct is entirely similar. An infinite loop is created by using an unconditional branch to an instruction above the B instruction.


Subroutines are compartmentalized blocks of code that perform particular functions. They are used in several situations: to create a reusable module that performs a certain function that may be needed at various points in a higher-level program; to abstract away tedious implementation details from the “main” body of code in order to make the main body more readable; to create modules that can be stored in external libraries for use by other programs; to create a more understandable overall design partition; to create software objects that present uniform interfaces to calling programs; and for many other reasons as well.

Using (or “calling”) a subroutine amounts to branching to the location of the subroutines first instruction. But, unlike other branches, the expectation is that when the subroutine is complete, execution will resume at the instruction immediately following the subroutine call. This means the address of the instruction following the subroutine call must be preserved. ARM designates R14 as the “link register” (LR), and its job is to preserve the return address when a subroutine is called. ARM has a special “Branch with Link” (BL) instruction that is used for this purpose. The BL instruction branches to a given label, and then also stores the current PC into the LR. Then, at the end of the subroutine, the “Branch and Exchange” (BX) instruction loads the PC from the LR. See the example below. By the time the program is stuck in the infinite loop, R0=10, R1=20, R2=3, and R3=R4=30.

Main:   LDR R0, =#1
        LDR R1, =#2
        ADD R2, R1, R0
        BL  Mysub
        ADD R3, R1, R0
Loop:   B   Loop
Mysub:  MOV R0, #10
        MOV R1, #20
        ADD R4,R0,R1
        BX  R14

Subroutines are just sections of Assembly code that can appear in the same source file as the main/calling program. They look just the same as any other code listings, except for two distinguishing features: the first instruction of the subroutine has a label to designate the “branch to” location; and the last line must be “BX LR” to load the contents of the LR into the PC.

Calling a subroutine often involves a “context switch”. A context switch occurs when processing is disrupted for some reason, and a new, potentially unrelated and temporary process gets control of the CPU. This can happen for any number of reasons. As examples, an expected but infrequent (and potentially unrelated) event may occur that must be dealt with in a timely fashion, or a process that is common to many programs might occur (like parsing a text string), or some process that isn’t critical to the main program needs to be performed (like resorting a data set), or any number of other partitioned tasks. Whatever the subroutines purpose is, it may need to use some registers, just like any other segment of assembly code.

In the case of a subroutine, the calling program may expect that the subroutine executes and returns without changing any register contents. But if the subroutine does use registers, it will overwrite the register contents, and any values that were placed there by the calling program will be lost. There is no way around this – if the subroutine needs registers, it must overwrite any contents placed there by the calling program.

This potential problem can be dealt with in two ways: the calling program could store all register contents in temporary locations prior to every subroutine call, and then restore the registers after the subroutine ends; or the subroutine could temporarily store any registers it intends to use, and then restore them before returning control to the calling program. It is standard practice to place this burden on the subroutine, for several reasons. The subroutine can “modularize” the storing and restoring so that the necessary instructions aren’t in the main code (better partitioning); the subroutine can store only those registers it uses (more efficient); when subroutines are called for exceptional events (discussed later), the main program may not even be able to store the registers; and there are other reasons as well. By convention, when the subroutine process is complete, and the original processing resumes, the register contents should be what they were when the subroutine was called, and the burden of keeping the register contents intact falls to the subroutine.

Saving register contents to a predetermined memory area on subroutine entry, and then restoring the contents immediately prior to exiting a subroutine is generally accepted as “good practice”. But this is not required – if a simple subroutine in a simple program only needs one or two registers that are not otherwise being used, it may be acceptable to skip the storing and restoring steps. But, preserving context (i.e., saving registers) is common enough that a generally accepted mechanism has evolved, and that mechanism involves a data structure known as a “stack”.


The stack is a memory structure in main memory that is organized as a “last-in, first-out” storage area. There is nothing special about the address range used for the stack – it can be located anywhere, and it can be any size. Stacks can use ascending memory, meaning new values are stored in the next larger address, or descending. The ARM defines a descending stack (a descending stack is the most typical). Most processors designate a register (called the “stack pointer”, or SP) to hold the current stack address – by definition, the SP points to the most recently added data. When writing data to an ascending stack, the SP would first be incremented and then a new data item stored; in a descending stack the address would first be decremented and then new data stored. When reading data, the data would be transferred from the location pointed at by the SP, and then the SP would be updated. Writing the stack is typically called a “push”, and reading from the stack is typically called a “pop”. In the ARM processor, any register can point to the stack. But certain instructions (like PUSH and POP) assume R13 holds the stack address, so R13 is designated as the stack pointer.

Figure 1. Stack
Figure 1. Stack

The stack is typically used to store register contents (including the LR) and local variables during a context switch (e.g., when calling a subroutine). But this is not required, it is simply a typical practice. As mentioned earlier, if a subroutine does not need to use many registers, and/or if no local variables need to be passed, then the registers may not be saved, and the stack may not be used at all.

In an ARM-published document called the “ARM Architecture Procedure Call Standard” (AAPCS), proper register usage for subroutine calls is formally defined. This definition should be followed so that any code you write will be compatible with other code from other sources. The AAPCS defines R0-R3 as parameter registers that are used to exchange data with a subroutine, and the calling program should not expect that these registers are preserved. This means you can put data into R0-R3 to pass into the subroutine, and the subroutine can put new data in those same registers to pass back to the calling program. If more that four data words must be exchanged, the stack can be used, or a seperate memory structure can be established. R4-R12 and R14 (the LR) should be preserved. The LR is preserved to allow nesting of subroutines – since there is only one LR, if a subroutine is called from within a subroutine, the first subroutine return address needs to be saved. R13 (the SP) and R15 (the PC) are not preserved.

The ARM instruction set includes LDM “load multiple” instruction and STM “store multiple” instructions that can be used for stack operations. These instructions act on multiple registers at the same time, and they can use pre-indexing or post-indexing extended mnemonics:

  • IA for pre-indexing increment (“increment after”, which means access memory first, then increment);
  • IB for post-indexing increment (“increment before”, which means first increment, then access memory);
  • DA for pre-indexing decrement (“decrement after” which means access memory first, then decrement);
  • DB for post-indexing decrement (“decrement before”, which means decrement the address, then access memory).

To implement a descending stack, the following instructions could be used:

@ push the used registers onto the stack at the beginning of the subroutine
STMDB R13!, {R4, R5, R6, R14}
@ pop registers from stack at the end of the subroutine (only 4 regs used in this example)
LDMIA R13!, {R4, R5, R6, R14}

The following example illustrates the use of these instructions. By the time the program is stuck in the infinite loop, R0=1, R1=2, R2=R3=3, and R4=30.

Main:   LDR R13, =Stacktop
        LDR R0, =#1
        LDR R1, =#2
        ADD R2, R1, R0
        BL  Mysub
        ADD R3, R1, R0
Loop:   B   Loop
Mysub:  STMDB R13!, {R0,R1}	@push R0 and R1 onto stack
        MOV R0, #10
        MOV R1, #20
        ADD R4, R0, R1
        LDMIA R13!, {R0,R1}
        BX  R14

Stack operations are fundamental to most processors, including the ARM. Because of this, stack-related instructions have been added to the ARM instruction set. The PUSH and POP instructions assume that R13 holds the current stack pointer, and they assume a descending stack. The “user program” is responsible for setting a memory area aside that can be used as the stack, and loading R13 with that address. The PUSH instruction is the same as STMDB R13! {Register List}, and the POP instruction is the same as LDMIA R13! {Register List}. In the example above, PUSH could replace STMDB and POP could replace LDMIA.