Preface

Fifteen years after the introduction of the Atari 8-bit computer, virtually the entire base of machines is still using the same CPU, and thus has the same basic processing power as when the machine was first shipped.  Not that we need to be like other platforms -- which seem to get CPU upgrades before you've had enough time to pay off the credit card bill on the last system -- but we have missed out on a great amount of satisfaction that comes when you find out that you can do tasks faster, easier, and better than before.  Or when you find that the machine is now capable of feats that it previously could not achieve.

When I met the 65816 processor for the first time, I was immediately captivated by the range of enhancements that it provided.  It was a feeling surpassed only by the still cherished memory of my first successful Assembly language program.  Some of these enhancements are brilliant, and are implemented in very clever ways.  For the designers to have achieved this much, and yet remain completely compatible with the 6502 at the same time, is truly a work of art.  It is a processor that plain and simply _belongs_ inside the Atari computer; for the Atari is unquestionably in my mind the best 8-bit computer that has ever been created, or will ever be created.  It is itself a work of art, that more than anything else, is responsible for the attachment I have to computers, and to programming in particular.  There is no other machine more deserving, or more able to utilize the extra power of the 65816.  As an Atari enthusiast since 1979, it gives me great pleasure to be a part of this new upgrade for the machine.

John Harris
Ahwahnee, California


Programming notes for the 65816.

The 65816 is a wonderful improvement over the 6502, containing 78 new opcodes, 9 new addressing modes, block memory moves, 16-bit registers, relocatable stack and zero-page, and a 16 megabyte linear address space.  This is intended to be an introduction to the many new features and techniques provided by the 65816 processor.
 
- Native Mode -

The 816 powers-up in a state known as "6502 emulation mode".  This is its default state, where it remains virtually 100% compatible with the 6502.  Most of the 816's enhancements are still available in emulation mode, including 24-bit addressing, and the new opcodes and addressing modes.  The primary missing ingredient is 16-bit registers.  These are not available until the processor is switched into a different state, known as "native mode".  This mode switching was necessary so that enhancements could be provided for the CPU, while still maintaining a way to provide complete 6502 compatibility.  As such, there are a number of operational differences when the 816 is in native mode, and these will be noted in the upcoming sections.

Switching between native and emulation modes is accomplished with a new instruction, called XCE, for eXchange Carry and Emulation bits.  The emulation bit is like a shadow of the carry flag, and so to change modes, you set the carry for emulation mode, or clear the carry for native mode, and then do an XCE.  The previous state of the emulation flag will be in the carry after the exchange has been made.

; Change to Native Mode
        CLC
        XCE

; Change to Emulation Mode
        SEC
        XCE

Running the 816 in native mode requires changes to the interrupt processing system in the OS.  A new OS is being worked on, but until this is available, the Atari will crash if you attempt to enter native mode without completely disabling interrupts.  You will find that there is still a considerable amount of power available even in emulation mode, and you still have the option of entering native mode with all interrupts disabled.

The wait for an 816-aware version of the OS won't be too much longer, and in the meantime, you can create a macro which enters native mode and disables interrupts at the same time.  Later on, when the OS supports native mode with the interrupts still active, you can just remove the interrupt disable from the macro, and reassemble your code.  Here's an example of the native and emulation mode macros supplied with the MAE assembler as part of the MACROS include file.

!!!NAT  .MD              ;Macro definition for NATive
        STZ $D40E
        SEI
        CLC
        XCE
        .ME

!!!EMU  .MD              ;Macro definition for EMUlation
        SEC
        XCE
        LDA #$C0         ;Change this to #$40 if you aren't using DLIs
        STA $D40E
        CLI
        .ME

- B accumulator -

The 816 has a second "hidden" accumulator known as the B accumulator.  Though there are no operations that work directly with it, there is an instruction called XBA to exchange the A and B accumulators.  Thus, B makes a great temporary holding spot.  The B accumulator is actually the upper 8 bits of a 16-bit wide accumulator.  When A and B are used together as a 16-bit register, it is referred to as the C accumulator.  The B accumulator and XBA instruction are accessible in emulation mode.  Only the 16-bit wide C accumulator requires the native mode before it can be accessed.  Later on, we will learn how to configure the CPU for 16-bit register use, but it can also be useful to load 16 bit values into both accumulators while still in 8-bit register mode.  These three instructions perform this operation.

        LDA #>address    ;The high byte of the address is loaded first,
        XBA              ; then swapped into the upper accumulator,
        LDA #<address    ; and then the low byte is loaded.

There are some instructions that work with the C accumulator, even when not in 16-bit register mode, so it helps to know how to set up 16-bit values in the 8-bit mode.  There are also conditions where the programmer might not want to be in 16-bit register mode, and the A+B accumulators can still make 16-bit calculations easier.  For example, once a 16-bit value has been loaded as in the above routine, here are some sample opertaions.

        ASL              ;16-bit shift
        XBA
        ROL
        XBA

        CLC              ;16-bit add
        ADC #<value
        XBA
        ADC #>value
        XBA

The A+B accumulator is also a great place for passing a 16-bit value to a subroutine.  

- Changing Register Sizes -

Register size is controlled by two new bits in the native mode processor status register.  The new "M" bit controls the size of the accumulator and all memory accesses, and the "X" bit controls the size of the X and Y index registers.  The native mode status register looks like this:

bit  7  6  5  4  3  2  1  0
     N  V  M  X  D  I  Z  C
           |  |
           |   Index register size
           |
            Accumulator and memory size

These bits are set to 1 when the register size is 8 bits.  To enable 16-bit registers, set the desired bit to 0.  It is possible to have one bit set to 1, and the other bit set to 0.  When transferring between registers of unequal size, a general rule is that the size of the destination register controls the width of the data moved.  The following table will summarize this.

      A=8-bit,X=16-bit                   A=16-bit,X=8-bit
      ----------------                   ----------------
TAX - Moves A and B into X               Moves 8-bit A into X

TXA - Moves lower 8 bits of X into A.    Moves 8 bit X into A,
      B remains unchanged.               and 0 into B.

Two new instructions are provided for setting and resetting bits in the status register.  Use the instruction SEP to set processor status bits.  The operand should be an immediate quantity, containing binary 1's for each of the bits you wish to set.  Consider this like an ORA instruction into the status register.  Use the instruction REP to reset or clear processor status bits.  The operand should be an immediate quantity, containing binary 1's for each of the bits you wish to clear.

        REP #$30    ;Set to all 16-bit registers
        REP #$20    ;Set 16-bit accumulator and memory
        REP #$10    ;Set 16-bit index registers
        SEP #$30    ;Set to all 8-bit registers
        SEP #$20    ;Set 8-bit accumulator and memory
        SEP #$10    ;Set 8-bit index registers

Note that the SEP and REP instructions can set or reset any of the status register bits, so their use is not restricted to only changing register sizes.  Other uses could include:

        SEP #$09    ;Setting or resetting multiple bits at a time.
        SEP #$40    ;Set the V flag.
        SEP #$80    ;Setting or clearing N or Z flags...
        REP #$02    ;...without changing register values.

There are many things to watch out for when changing register sizes.  Switching the accumulator from 16 to 8 bits is a reversible operation, since the upper 8 bits are preserved in the B register.  However, switching the index register size from 16 to 8 bits will zero the upper 8 bits of both X and Y.  Even if only one register is going to be used, the downsizing operation itself can clobber the other register, so the previous value must be saved and restored if the contents need to be preserved.  

When pushing and pulling registers from the stack, make sure the register size is in the same state for both operations, or else the incorrect number of bytes will be pulled from the stack.

When assembling code with immediate operands, the assembler needs to know what the register sizes are, in order to know how many bytes to assemble into the object code.  The CPU will want to fetch 16 bits if the register size is 16-bit, so the assembler better have put a 16-bit quantity there or else the code will not run properly.  The programmer must use psuedo-ops to keep the assembler in sync with the state of the register sizes in the CPU.  Four such psuedo-ops are provided in the MAE assembler.

        .AW          ;Accumulator is word sized
        .AB          ;Accumulator is byte sized
        .IW          ;Index registers are word sized
        .IB          ;Index registers are byte sized

So after changing the register sizes with SEP or REP, always follow with the appropriate psuedo-ops to match the new register sizes.  It is best to code both the CPU instruction and the psuedo-ops into macros, to reduce the chances of any mismatches occurring.  The example MACROS include file provided with the MAE assembler has macros for this purpose.  Here is what they look like:

!!!R16  .MD          ;Macro define to set 16-bit registers
        REP #$30
        .AW
        .IW
        .ME

!!!R8   .MD          ;Set 8-bit registers
        SEP #$30
        .AB
        .IB
        .ME

!!!A16  .MD          ;Set 16-bit accumulator
        REP #$20
        .AW
        .ME

!!!A8   .MD          ;Set 8-bit accumulator
        SEP #$20
        .AB
        .ME

There are also I16 and I8 macros for setting index register size by itself.

If your code goes into native mode to use 16-bit registers, and plans to return to emulation mode when the 16-bit need is finished, then you do not need to explicitly return to 8-bit registers with SEP.  Returning to emulation mode will automatically return the registers to 8-bit size.  Remember that you still need to use the .AB and .IB pseudo-ops.

Be extremely careful when creating code that branches or JSRs out of a routine that changes register sizes.  Consider the following example:

        R16               ; Macro to set 16 bit registers
        do some code
        CMP #VALUE
        BEQ YES
        do the "no" code
RETURN  R8                ; Macro to return to 8 bit registers
        RTS
;
YES     do the "yes" code
        BRA RETURN

The problem with this code, is that the assembler will not know that the CPU is in 16-bit register mode at the label YES.  So if the code contained any immediate references, they would not be assembled properly.  This simple example could easily be rewritten by placing the YES portion of the routine above the label RETURN, and having the "no" section branch around it.  This is a preferable arrangement if it is feasible, but in some cases this will not apply as easily.  If you need to create a structure similar to the above, you will need to add additional sizing psuedo-ops to keep the assembler in sync with the code.  Add the following lines before and after the YES segment:

;
        .AW               ;Since the yes code executes in 16-bit mode,
        .IW               ; set the assembler register sizes to match.
YES     do the "yes" code
        BRA RETURN
        .AB               ;Set the assembler back to 8-bit mode.
        .IB

- Bank Registers -

The 816 still has a 16-bit program counter, and while it does offer a few 24-bit data addressing modes, most modes are still 16-bit.  Thus, the CPU creates 24-bit addresses by using "bank registers" added to a 16-bit address.  Much like a "page" of memory is an even 256 bytes, a "bank" of memory is an even 64K.  The upper 8 bits of a 24-bit address is called the bank byte.  The 816 provides two 8-bit bank registers, one for the PC and one for data addressing.  By storing different values into these bank registers, the CPU can address the full 16 megabytes provided by the 24-bit address space.

The bank register for the PC is called the PBR, or Program Bank Register.  The bank register for data addressing is called the DBR, or Data Bank Register.  Things get slightly confusing when looking at the mnemonics to access these bank registers.  Mnemonics referring to the DBR will use the letter B; and the letter K is used to refer to the PBR.  Access to the DBR is provided only through the stack.  The instruction "PHB" pushes the DBR onto the stack, while "PLB" pulls a byte from the stack and puts it into the DBR.  Thus, to set the DBR to another value, push that value onto the stack, and then do a PLB.  The following code will set the DBR to $40.

        LDA #$40
        PHA
        PLB

There is a "PHK" instruction to push the PBR onto the stack, but no "PLK" instruction.  Such an instruction would make little sense, as it would cause the program counter to immediately jump to a new location in the 24-bit address space.  To set the PBR, there is a new 24-bit jump instruction, "JML", or jump-long.  The operand is a 3 byte, full 24-bit address.

It is sometimes useful to set the DBR to the same bank as the PBR, so that data access is aligned to the same bank where the code is running.  You can do this with these simple instructions:

        PHK          ;Push the PBR,
        PLB          ; and pull it into the DBR

- Direct Page -

The concept of zero-page from the 6502 has been extended for the 816 processor.  By adding a 16-bit register that can point to any location in the first 64K of memory, you can now relocate zero-page to any 256 byte block of memory within the first 64K.  This register is called the Direct Page Register, and the zero-page addressing modes are now referred to as Direct Page modes.  The direct page register does not need to point to an even page of memory, but odd memory alignments will slow down all direct page addressing by one cycle.

There is one important difference about the way the 816 handles direct page indexed references that overflow the end of the page boundary.  Such as referencing $F0,X when X=$20.  On the 6502, any references above $FF would wrap around to $00 again, keeping all addresses within the page.  On the 816 in native mode, indexes will cross the page boundary and reference locations outside of the direct page.  In 6502 emulation mode, in order to maintain compatibility, index references will wrap around to $00, but, (and this is where it gets a bit strange), only for instructions that existed on the 6502, and only when the direct page register is set to 0.  All operations that are new to the 816 will always generate addresses outside of the direct page in cases where the index calculation overflows the page boundary.  I think a number of examples would best demonstrate the conditions here.

; X=$20, E=emu, dp=0

        LDA $F0,X    ;Gets byte at $10
        STZ $F0,X    ;Clears byte at $110 (non-6502 instruction)

; X=$20, E=emu, dp=$4000

        LDA $F0,X    ;Gets byte at $4110 (dp is not 0)

; X=$20, E=nat, dp=0

        LDA $F0,X    ;Gets byte at $110 (native mode)

; X=$FFFF, E=nat, dp=0, 16-bit register mode

        LDA $F0,X    ;Gets byte at $100EF (yes, past the first 64K)

The last example demonstrates the interesting effect that with 16-bit index registers, direct page indexed addresses can be up to 64K bytes away from the location of the direct page, all with no cycle penalties!

Setting the direct page register is similar to setting the DBR, except that the direct page register is 16 bits.  There are PHD and PLD instructions, but they will be two byte accesses.  Thus to set the direct page register, you will need to push a two byte address onto the stack.  There is a new opcode which is handy for this use.  It's called PEA, for Push Effective Absolute address.  The operand is an absolute address, and it is the address itself that gets pushed onto the stack, rather than the contents at that address.  It's as if the operand is treated like an immediate constant, but the assembler syntax does not use the "#" immediate identifier.  For example, if you want to set the direct page register to $4000, you would use:

        PEA $4000
        PLD

The direct page register can also be transferred directly to and from the accumulator.  The instructions are TCD and TDC.  The mnemonic choice of C was used to indicate that this is always a 16-bit transfer of the A+B accumulators, regardless of whether the accumulator is set to 8-bit or 16-bit mode.

There are all sorts of unique possibilities for this relocatable direct page.  If you need to do some very fast graphic routines, you can point the direct page register into screen memory!  Or even point the direct page register at $D000, which would let you access the hardware graphic registers at direct page speed!  Be creative, but keep in mind that if interrupts are occurring when the direct page is pointing somewhere else, direct page accesses within the interrupt would go to the wrong place.  The interrupt routine would need to push the direct page, reset it back to 0, and then restore the direct page on exit.  A new version of the OS, when it becomes available, may take care of this automatically.  Until then, the programmer will be responsible for not allowing interrupts to occur when the direct page register is non-zero, or else providing the enhancements necessary to the interrupt service routines to handle the situation properly.  More information will be given in the section on interrupts, near the end of this document.

- Stack Register -

The 816 now provides a 16-bit stack pointer, which can point to any location in the first 64K of memory.  It also allows the stack to be longer than 256 bytes, and can be up to as much memory as you want to set aside for it.  The 16-bit stack pointer is only available in native mode.  When the CPU is in 6502 emulation mode, the upper byte of the stack pointer will be set to 1, forcing all stack accesses to page 1 just like the 6502.

In addition to the TSX and TXS instructions, the 816 also offers TSC and TCS instructions for transferring the accumulator to and from the stack pointer.  The mnemonic uses the letter C to indicate that the full 16-bit value in the A+B accumulator will be transferred, regardless of the setting of the M bit in the CPU status.  Of course, since the high byte of the stack pointer is forced to 1 when the CPU is in emulation mode, only the lower 8 bits of the accumulator are transferred in emulation mode.

Be careful of using the TXS instruction in native mode.  The stack pointer will be 16 bits, but if the index registers are set to 8 bits, the upper byte of the X register will be zero.  A TXS instruction from native mode with the index registers set to 8 bits, will wind up positioning the stack into page zero!  This is a really sad occurrence, because it often winds up being the only problem when trying to run existing 6502 code in native mode.  To set the stack pointer from native mode and 8 bit registers, this is the easiest method:

     TSC         ;This will place the current 16 bit stack pointer in A & B
     LDA xxx     ;Replace the lower 8 bits with the new desired value.
     TCS         ;The upper 8 bits was kept in B, and will be unchanged

- New Addressing Modes -

The 816 provides many new addressing modes, and in some cases, expands the allowable modes of older instructions.  For example, the BIT instruction now accepts immediate operands, plus indexed,X modes for absolute and direct page.  INC and DEC can operate on the accumulator, in the same manner as the shift and rotate instructions.  Just write "INC A", or simply "INC".

New addressing modes include:

        JMP (address,X)   ;Absolute Indexed Indirect

Available only on JMP and JSR instructions, this addressing mode allows easy access into 2 byte jump tables.  It's too bad the 6502 didn't have this mode.  We wouldn't have been subjected to the -1 addresses in the $E400 jump tables if it had.  In fact, a neat enhancement to an 816 version of the Atari OS would be to add a NOP instruction in front of every one of the $E400 routines.  In this way, the current RTS method would still work, with its +1 adjustment skipping over the NOP and winding up in the same place the routine starts now.  The difference, is that now a JSR (indirect,X) would enter the routine at the NOP, providing an easier method for programmer access.

        LDA (dp)          ;Direct Page Indirect
This is similar to the indirect indexed form of addressing, but without the ",Y" index.  This is very handy when you only need one byte at the indirect address, and it does not require the use of a register set to 0.  This mode is available for all instructions that currently allow indirect indexed.

The following modes generate 24-bit addresses, and are only useful when memory is expanded beyond the initial 64K.  Unless otherwise noted, all modes are available with the types of instructions that previously allowed the full range of addressing modes.  Such as LDA, STA, and math operations like ADC and ORA.  Operations that modify the operand, such as DEC, do not support these new addressing modes, nor do operations with the X or Y registers.

        LDA >long          ;Absolute Long
Simply a 24-bit version of absolute addressing, taking one extra address byte for the operand, and thus, one extra cycle as well.  The ">" symbol is required by the MAE assembler to denote the long addressing mode.

        LDA >long,X        ;Absolute Long Indexed,X
A 24-bit version of absolute indexed,X, again taking one extra address byte for the operand, and one extra cycle.

        JMP [address]      ;Absolute Indirect Long
The operand is a 16-bit address, but this address should contain a 24-bit pointer to the destination jump location.  This mode is only valid for the JMP instruction.  Square brackets "[]" are used to denote 24-bit indirect pointers.

        LDA [dp]           ;Direct Page Indirect Long
Much like direct page indirect, and the operand will still be a one byte direct page location.  However, the address at that location should be a 24-bit address.

        LDA [dp],Y         ;DP Indirect Long Indexed,Y
Adds a ",Y" index to direct page indirect long addressing.

The last two addressing modes deserve a topic by themselves.

- Stack Relative Addressing -

The most significant improvement to the addressing modes in the 816, is the addition of stack relative addressing.  This gives the programmer easy and direct access to any byte or address within the last $FF bytes pushed onto the stack.  The operand of stack relative addressing is a one byte index into the stack.  Since the stack pointer always points to the next available location, a relative pointer of $1 should be used to access the last byte pushed onto the stack.  Higher numbers will access bytes successively deeper in the stack.  Note that a relative pointer of $0 will reference the last byte that was pulled off the stack, but usage of this byte will be dangerous unless the conditions are such that no interrupts could have occurred.  Interrupt processing will push additional bytes onto the stack, overwriting the byte that was "left over" from the last stack use.

The direct form of stack relative addressing is written in Assembly as:

        LDA $1,S

There is also an indirect indexed form of stack relative addressing.  This operates much like the "(indirect),Y" type of addressing, except that instead of the address being stored in zero page, the address is pushed onto the stack.  This is best explained by an example showing the same operation in an already familiar method, along side the new stack relative mode.

        Indirect Indexed    Stack Relative Indirect Indexed
        ----------------    -------------------------------

        LDA #$40            PEA $4000
        STA $81             LDY #0
        LDY #0              LDA ($1,S),Y    ;"$1,S" defines the location
        STY $80                             ; of the address bytes
        LDA ($80),Y

Note that in the stack relative example, the programmer is responsible for pulling the address off of the stack when it is no longer needed.  This example used an immediate address, for which the 816 can easily push all 16 bits using the PEA instruction.  In cases where the address is calculated, or taken from some other memory variable, push the high byte of the address first, and then the low byte.

The stack relative form of indirect addressing is 1-2 cycles slower than using the direct page, so it is not ideal for speed critical applications.  Still, there are many excellent uses for stack relative addressing.  In its simplest form, it can be a handy method for allocating temporary variables or pointers, without requiring the use of other memory locations.  This can be especially helpful if you need indirect addressing capabilities, and do not want to overwrite any direct page locations.  You can push the desired address onto the stack, and then use the stack relative form of indirect indexed addressing.

At some point, you may need to use a direct page variable, modifying its contents, and then be responsible for restoring it to its original value.  Traditionally, you would push the contents onto the stack, or transfer it to another memory location.  Stack relative gives you another option.  Once you push the original contents onto the stack, you may use that stacked address as your temporary pointer.  In fact, there is a single instruction that can push the original pointer, called Push Effective Indirect address, or PEI.  PEI is written in Assembly as:

        PEI ($80)

This instruction will take the 16-bit address at the location of the direct page operand, and push it onto the stack.  Much like PEA, it pushes the effective indirect address rather than the contents stored at the address.  For example, lets say that the direct page location $80 contains the bytes $00,$40, (basically, the address $4000), and location $4000 contains $FF.  The instruction "LDA ($80)" would return the contents $FF, while the instruction "PEA ($80)" would push $4000, being the effective address that the operand points to.

The stack can also be a useful method for passing parameters to subroutines.  You can push as many arguments as you need onto the stack before the subroutine call, and access them from within the subroutine by using stack relative addressing.  Since there will be a two byte return address on the very top of the stack, the first parameter will be at an index of $3 when using stack relative.  If your subroutine needs to preserve registers on the stack first, the index will increase accordingly.

Some of the more exotic uses are in the form of reentrant or recursive programming.  A routine is said to be reentrant if it can be called from another source, like an interrupt, while it is in the middle of execution.  A routine is recursive if it can call itself.  For a routine to be able to have these properties, it must preserve all registers and use no external memory references.  One of the example code routines at the end of this document shows an example of recursive programming.

- Relocatable Code -

Relocatable code has the ability to be moved anywhere in memory, and still execute correctly.  Improvements in addressing modes and opcodes make it much easier for the 816 programmer to write fully relocatable code.  In combination with the stack relative addressing modes, the final key ingredient is an instruction called PER, Push Effective address Relative.  Since relocatable code cannot rely on absolute addressing, PER provides a method for referencing other parts of a program or its data by using a 16-bit signed offset relative to the current PC.  The PC always contains the absolute address of where a program is running.  So by knowing how far it is from the currently executing instruction to another part of the program, PER can generate an absolute address by adding or subtracting this distance from the PC.  In the assembler, you only need to provide the label name for where you want to reference, as the operand of the PER instruction.  The assembler will calculate the proper offset and store it in the object code.

To write a relocatable program, the programmer will need to replace situations that use absolute addresses, with relative address references.  Instead of JMP instructions, use BRA for short jumps within 128 bytes, or BRL for long jumps of up to 32K.  There is no BSR instruction for relative subroutine calls, but such an instruction can easily be created in a macro by using two other instructions.

!!!BSR  .MD (ZLOC)  ;BSR macro definition, with 1 parameter for the adr
        PER *+5     ;Push return adr, adj -1 for RTS; Points after the BRL
        BRL ZLOC    ;Branch to the subroutine.  RTS returns to next instr.
	.ME        

In the above macro, we have used the PER instruction to push the return address for the subroutine as an offset of +6 from the current PC.  The RTS is going to increment the pushed address before continuing execution, so we actually have to pass a relative address of +5 to get the RTS to return where we want.  Then, it is a simple matter to jump to the subroutine with a relative branch.

Since some subroutine calls could be within 128 bytes, it would be more efficient to allow a short BRA to be used in these situations.  This is not practical for the assembler to do automatically, since it will not know how far away forward references will be.  Thus it can not determine how many bytes to reserve for the branch, and changing this during the second pass would cause an error in the location of all subsequent addresses.  Therefore, we recommend creating two macros and allowing the programmer to specify the size of the branch with the macro name.  The MACROS include file supplied with the MAE assembler uses the name BSL for the long branch to subroutine as shown in the example above.  Another macro called BSR uses the BRA instruction, and would be used only for subroutine calls within 128 bytes.

That covers the absolute JMP and JSR instructions.  Using relocatable structures for data storage is more complex, because there are more varieties in the ways programs need to access data.  We can explore a few generalizations, after which, you should be capable of creating approaches that fit your individual needs.

The basic building block is the PER instruction.  Anytime you need to access an address that is within part of the program's space, you need to find the runtime location for that address using PER.  PER pushes the address onto the stack, and what you do from there depends upon how often your program needs to use that address.  For references that are confined to fairly local use, it is probably best to leave the address on the stack, and use it for stack relative indirect addressing.  Let's look at a fairly simple table look-up routine to see how this works.  We'll examine a key code look-up with a JMP table, first in standard absolute addressing, and then in a relocatable form using PER and stack relative.

Absolute addressing - 6502 method
---------------------------------
        LDA $2FC       ;Get key code
        LDY #MAXCMD
?FND    CMP KEYCMDS,Y
        BEQ ?HV
        DEY
        BPL ?FND
        SEC            ;Return with "Not Found"
        RTS
?HV     TYA            ;When key code is found,
        ASL            ; multiply by 2 for word indexing
        TAY
        LDA CMDADR+1,Y
        PHA
        LDA CMDADR,Y
        PHA
        RTS            ; and jump to routine
;
KEYCMDS .HE BF 95 92
MAXCMD  = *-KEYCMDS-1
CMDADR  .WO C.A-1
        .WO C.B-1
        .WO C.C-1

With the 816, the above example could be shortened a bit by using the
JMP (indirect,X) instruction.  The ending code would look like this, plus the CMDADR table would not use -1 after the addresses with this method.


?HV     TYA
        ASL
        TAX
        JMP (CMDADR,X)

Now, here's a relocatable version
---------------------------------
        LDA $2FC
        PER KEYCMDS    ;Push address of KEYCMDS
        LDY #MAXCMD
?FND    CMP ($1,S),Y
        BEQ ?HV
        DEY
        BPL ?FND
        PLA            ;Pull KEYCMDS address off stack
	PLA
        SEC            ; and return "Not Found"
        RTS
?HV     R16            ;Set 16-bit registers
        PLA            ;Pull KEYCMDS adr off stack
        PER CMDADR     ; and push CMDADR
        TYA
        ASL
        TAY
        LDA ($1,S),Y   ;Addresses in the table are offsets from CMDADR
        ADC $1,S       ;Add it to CMDADR, which is still on the stack, and
        STA $1,S       ; it points to the absolute address of the routine
        R8             ;8-bit register mode
        RTS            ;And go
;
KEYCMDS .HE BF 95 92
MAXCMD  = *-KEYCMDS-1
CMDADR  .WO C.A-CMDADR-1 ;Addresses are stored as offsets from CMDADR
        .WO C.B-CMDADR-1 ;
        .WO C.C-CMDADR-1 ;Also includes -1 for RTS adjust

This example used 16-bit registers, so keep in mind the current restrictions regarding native mode and interrupts.  Until the new OS is in place, you may want to stay in emulation mode, and just use a double byte add with 8-bit registers.

Not all program data is going to fit this type of example either.  If the data is accessed at many locations throughout the program, the above process can get inefficient very quickly.  In some cases, you may be able to leave the address on the stack, and continue to use it with stack relative addressing.  This can quickly become impractical though when needing to access the address from different subroutine levels or with other items pushed onto the stack.  A better solution is to do the PER at the start of the program, and then pull off the address and store it permanently in a direct page variable.  Then use direct page indirect addressing to access the data.  For single bytes and other small amounts of temporary storage, put the most often used variables directly into the direct page, or into free absolute memory areas like $400-$6FF.

- Block Memory Moves -

The 816 provides two instructions for moving blocks of memory, one incremental and one decremental.  This is to handle the potentially destructive operation of moving overlapping blocks of memory.  The incremental instruction is MVN, or Move Next, and the decremental instruction is MVP, or Move Previous.  The block moves are about twice as fast, or more, as normal move operations using 6502 methods.

Block moves are set up by placing the source address in the X register, the destination address in the Y register, and the number of bytes -1 to be moved in the C accumulator.  The C register decrements to $FFFF in the move operation, thus the reason to subtract 1 from the desired byte count.  The registers should be set to 16-bit mode of course, otherwise block moves would take place in page 0 only.  This also means that block moves have very little use in 6502 emulation mode.

Two operands follow the block move mnemonic in the assembler syntax, which are the bank bytes for the source and destination addresses respectively.  Thus block moves can move data from one bank to another if desired.  Here is an example that will move $1000 bytes from $4000-$4FFF to $8000-$8FFF.

        LDX #$4000
        LDY #$8000
        LDA #$FFF
        MVN 0 0

The MVN process will move one byte at a time, using 7 cycles per byte, incrementing both X and Y registers while decrementing the count in C.  When the C register gets to $FFFF, the move will be completed, and X and Y will contain the address one past the range that was just moved.  In this example, X would contain $5000, and Y would contain $9000.

When the destination address is higher than the source address, and the blocks overlap, the programmer should use the MVP instruction (Move Previous).  MVP moves the bytes using decrementing registers, so X and Y need to be initialized to point at the highest byte in the source and destination blocks respectively.  These values can be found very easily with 16-bit ADC instructions.  Assuming the C, X, and Y registers are initialized to the start of the memory blocks, as for the MVN process, the following lines of code will set X and Y to the right locations to start a MVP.

        PHA            ;Save byte count
        TXA
        CLC
        ADC $1,S       ;Add the stacked byte count to X
        TAX
        TYA
        ADC $1,S       ; ...and Y
        TAY
        PLA            ;Restore byte count
        MVP 0 0        ;All ready for the MVP

- Other New Instructions -

There are a number of other new instructions that haven't been introduced yet, or haven't been discussed formally.  These include:

        PHX            ;Push X
        PHY            ; or Y
        PLX            ;and Pull
        PLY
        TXY            ;Transfer X register to Y
        TYX            ; or Y to X

        JSL >$123456   ;Jump Subroutine Long, also pushes 24-bit address
        RTL            ;Return subroutine Long, pulls 24-bit address.

Be careful to always use RTL when using JSL, and use RTS when using JSR.  Mixing these up will cause an improper number of bytes to be pulled from the stack, and will most likely crash the machine.

        STZ operand    ;Store Zero to memory.  No register required.

A handy instruction for clearing memory.  Usable in addressing modes of:
 direct page
 direct page indexed,X
 absolute
 absolute indexed,X

        TRB operand    ;Test and Reset Bits

Similarly to the BIT instruction, TRB will logically AND the value in the accumulator with the operand, with the Z flag indicating the results.  TRB goes one step further, by resetting the corresponding bits in the operand.  Technically, it logically ANDs the complement of the accumulator with the operand, and stores the result.

In other words, TRB resets bits in the operand, and the Z flag lets you know if they were clear before the operation.

        TSB operand    ;Test and Set Bits

The counterpart to TRB, the TSB instruction sets bits in the operand, and the Z flag will indicate if the bits were clear before the operation.

TRB and TSB support only absolute and direct page addressing modes.  They are useful for setting or clearing bits in the $D301 register, and other bit-mapped locations that can be both read and written.  Because this is a read-modify-write operation, it must be able to read the current value at the address, and thus can not be used on any hardware locations that do not support this.  Locations like $D40E, $D208, $D00C, etc., can not be used with TRB and TSB.

- Interrupts -

This last section is not for the faint at heart.  Early on it was mentioned that without changes to the interrupt processing in the Atari OS, the 816 will crash if you try to enter native mode with interrupts still active.  We are about to learn why that happens, and why 816 interrupt processing routines on the Atari are a considerable challenge.  Keep in mind that once a new 816-aware version of the OS is available, most of the problems we are about to look at will have already been addressed.  The information here is provided for those of you that may need to write your own interrupt handlers, or are simply curious about what goes on behind-the-scenes.

The 6502 standard interrupt vectors at $FFFA and above are used whenever the 816 is in emulation mode.  When in native mode however, the 816 utilizes a second set of CPU interrupt vectors.  Since the Atari OS does not contain valid addresses at these vector locations, this is the primary reason the system will crash on any native mode interrupts.  The following table shows the locations of the native mode interrupt vectors.

$FFEE - IRQ
(note there is no native RESET vector, since RESET forces emulation mode)
$FFEA - NMI
$FFE8 - ABORT (not used in the Atari)
$FFE6 - BRK instruction
$FFE4 - COP instruction

Where the 6502 vectored software BRK instructions through the IRQ vector, the 816 in native mode provides a separate BRK vector.  This is necessary because the B status register bit is no longer available to distinguish an interrupt caused by the BRK instruction.  The other new vector is designed to allow a co-processor to be used with the 816.  Though that is an unlikely scenario with the Atari, the COP vector operates the same as the BRK vector, but in response to the COP processor instruction.  Thus, a programmer could use the COP instruction and COP vector for applications similar to where BRK is used.

Interrupts in native mode on the 816 are a significant programming concern.  Since the processor can be in many states, a great deal of care must be used to ensure that the CPU is in the proper state for the interrupt handling routine, and restored to its original state when the interrupt exits.  The 6502 had only the decimal mode to worry about.  When an interrupt occurred, the programmer could not be sure whether the CPU was in decimal mode or not, and so the decimal flag needed to be cleared if the interrupt routine needed to do any math.  Interestingly, the 816 always clears the decimal mode when responding to an interrupt, so 816 programmers don't need to be concerned with it.  In its place however, the 816 programmer must be concerned with the size of the accumulator and index registers, and the contents of the DBR and direct page bank registers. 

This is why retrofitting the 816 into the Atari creates some serious problems.  It is more than just providing the native mode vectors, because the programmer will not need to use the native mode unless he is also planning to use 16-bit registers.  In order to utilize 816 enhancements while leaving the interrupts active, the interrupt service routines need to be aware of the many available states of the 816.  They need to save the current state, reset everything to the default state that the interrupt routine expects, and then restore the state back to where it started when the interrupt occurred.  The first two of these three steps are comparatively simple, because all of the interrupts start at the main interrupt vectors, and a new OS could provide the required code at the start of interrupt processing.  The following code segment is an example of what is required to save the state of the DBR, direct page register, and register sizes, and also reset them to what would be expected as default conditions.

        PHD           ;Save DBR and Direct
        PHB
        PEA $0000     ;Provides both operands for the following pulls
        PLD           ;Set DBR and Direct to 0
        PLB
        PHX           ;X&Y always need to be saved, because resetting them
        PHY           ; to 8 bits would destroy the upper half contents.
        PHP           ;Save M & X status bits, so we can restore
;                     ; the registers at their proper size.
        SEP #$30      ;Set 8 bit regs

Now you're ready to begin the interrupt routine with the CPU in its default state.  The problem, is that the interrupt service routine needs to restore all of these things when it exits, and current Atari interrupt routines do not expect all of this stuff to be pushed on the stack.  While the built-in OS routines could be fixed to provide this, there is no way to control what happens when the interrupt vectors get patched by external programs.  All of these external programs, and even PBI devices, are going to exit with everything still on the stack, which would bring the system to a crashing halt.

There are no perfect solutions to this this problem.  If execution speed were of no concern, a completely safe solution is to push an extra 'dummy' interrupt return address onto the stack, so that the RTI from the interrupt service routines would return to the OS, where it could pull everything else from the stack and restore the state of the machine.  Then a second RTI would resume main program execution.  We already had to push the status register as part of the above operation, so we would only need to push the dummy return address right before it to set the stack up properly for an RTI return.  Since we are in native mode, we must also push the PBR bank byte to create a 24-bit return address.  The PER instruction works ideal for pushing the remaining 16-bit portion of the address, like we used in the BSR example in a previous section.  The RTI does not add $1 to the return address, so in this case we may push the exact address with PER.

Here is the completed interrupt pre-handler, along with the code to restore the processor state and exit.

        PHD           ;Save DBR and Direct
        PHB
        PEA $0000     ;Provides both operands for the following pulls
        PLD           ;Set DBR and Direct to 0
        PLB
        PHX           ;X&Y always need to be saved, because resetting them
        PHY           ; to 8 bits would destroy the upper half contents.
        PHK           ;Push bank byte,
	PER IRETURN   ; and return address, and status for the RTI
        PHP           ;Save M & X status bits, so we can restore
;                     ; the registers at their proper size.
        SEP #$30      ;Set 8 bit regs
;       now jump to the appropriate interrupt vector, such as...
        JMP ($216)    ;Its RTI will return to IRETURN

IRETURN PLY           ;Status byte was pulled as part of the RTI, so
        PLX           ; now restore X&Y
        PLB           ;Restore DBR and Direct
        PLD
        RTI           ;Return to main program

This adds up to 66 or 70 cycles, (depending on index register size).  A delay like this could be detrimental to PBI and serial IRQs, and immediate VBLANK, and would be completely unacceptable for DLIs.  The solution in the other extreme, would be to require interrupts to be completely disabled whenever the programmer needed to change the state of the CPU.  In this way, the interrupt routines would not need any of the above maintenance code, and would be just as fast as existing routines.  This method however would be too restrictive to the programmer.  Everything else is a compromise somewhere, but by taking a few points into consideration, a workable solution can be reached.

Coding improvements to the interrupt routines by using 816-specific features can regain part of the 66-70 cycle overhead.  Another point to keep in mind, is that the emulation mode interrupt vectors could be kept as-is, and thus can be used if the response time of the native mode vectors is too slow.  All this means, is that the program can switch to emulation mode when doing disk I/O, or other interrupt speed sensitive tasks.  Until the 816-OS is finished and receives considerable testing, we won't know if and where speed problems will occur.

DLI routines need access as soon as possible.  Thus, the NMI service routine should always plan to get to the DLI without doing any of the safety measures.  In practice, this won't be that detrimental to the DLI programmer.  DLI's rarely need direct page access, often don't require the index registers, and could get around changes in the DBR by coding all accesses to the hardware with 24-bit absolute addressing.  In any case, exceptions to these conditions can be handled by the programmer in many ways, and in a much more efficient manner than the "save everything" method shown above.  Since there simply is not enough time to save everything, there is little choice but to handle DLIs as a special case.  The big drawback, is that it forces the programmer to be aware of the potential dangers of DLI programming in native mode, since the CPU may not be in a known state.  The main NMI vector will still need to set the accumulator to 8 bits, which is needed for checking the NMIST location to determine the cause of the interrupt.  So the native mode NMI handler will likely look like this:

        SEP #$20      ;Set 8 bit accumulator/memory
        BIT $D40F
        BPL NODLI
        JMP ($200)
NODLI   would then do the safety routine as above, and JMP ($222)

This is going to be 4 cycles slower getting to the DLI than the normal Atari handler, (3 for the SEP and 1 extra for the bank byte push on the NMI itself), but this is the best that can be done.

Finally, the OS should provide a native mode IRQ master vector, similar to the vector at $216, which would be in front of all of the safety code.  The programmer can use this vector to customize the interrupt needs, if necessary.  Providing a similar master vector for the NMI would add another 5 cycle delay to the DLI, and thus be undesirable.  A compromise, would be to provide a native mode VBI vector that would be called at the NODLI location in the above NMI handler.  This would give programming access to the native VBI before any of the safety measures.  Such a vector is currently being considered.

There are a few more considerations for interrupts on the 816.  When the 816 responds to an interrupt in native mode, it will push an additional byte onto the stack between the status byte and the return address.  This byte is the PBR; the current bank register for the program counter.  Also, one extra cycle of delay will occur due to this byte being pushed.  The PBR is not pushed when the CPU is in emulation mode, which can set up a potential interrupt problem if the programmer tries to run code outside of bank 0 while in emulation mode.  Because the interrupt will set the PBR to 0, there will be no way for the processor to know where to return to when the interrupt exits.  The best way to avoid this scenario is to never execute code outside of bank 0 unless the CPU is in native mode, or unless interrupts are fully disabled.

816 interrupts in both native and emulation modes do not exhibit the 6502's interrupt bugs.  On the 6502, if an interrupt occurs while processing a BRK instruction, the BRK gets ignored.  The 6502 can also skip NMIs and IRQs under certain conditions.  Interrupts on the 816 appear to be completely stable.

- Sample Code: Testing for the presence of the 65816.

If you want to write a program that will run on both the 6502 and 65816 CPUs, yet take advantage of extra 816 abilities if it is present, then you need a way to determine which processor the code is running on.  The following short code segment will do this, based on the code's behavior in decimal mode.  When doing BCD math on the 6502, the N and Z flags are not set correctly -- basically because the 6502 is setting the flags before the final BCD adjustment is made internally.  The 816 sets the flags properly, so we only have to do a simple BCD add where the flags come out differently.  Here's the code:

        SED
        LDA #$99
        CLC
        ADC #1
        CLD
        BEQ HV816

; 6502 here           ;flags show N set, and Z clear, even though A=0

HV816                 ;816 here

- Sample Code: Printing a line of text to the screen.

Many of you may already use a routine that you can call like this:

        JSR PRINT
        .BY "This text is printed" 0
        ...code continues here

On the 6502, this routine would normally pull the return address from the stack, and place it in a zero-page variable.  After incrementing it by one, this would now point to the start of the text string.  The bytes of text can then be output one by one, until reaching the $0 end-of-string marker.  Then the address is pushed back onto the stack, where an RTS returns to the program code following the text string.  The stack relative mode of the 816 makes this routine more elegant, as you will see in the following comparison.

6502 version           65816 version             65816 16-bit version
------------           -------------             --------------------
PRINT   PLA            PRINT   LDA $1,S          PRINT   REP #$30
        STA ZP                 INC                       LDA $1,S
        PLA                    STA $1,S                  INC
        STA ZP+1               BNE ?1                    STA $1,S
PBYTE   INC ZP                 LDA $2,S                  SEP #$30
        BNE ?1                 INC                       LDY #0
        INC ZP+1               STA $2,S                  LDA ($1,S),Y
?1      LDY #0         ?1      LDY #0                    BEQ PEND
        LDA (ZP),Y             LDA ($1,S),Y              JSR PUTE
        BEQ PEND               BEQ PEND                  BRA PRINT
        JSR PUTE               JSR PUTE          PEND    RTS
        JMP BPYTE              BRA PRINT         
PEND    LDA ZP+1       PEND    RTS
        PHA
        LDA ZP      
        PHA
        RTS

The 65816 versions do not require a direct page variable, and do not need to pull and push the address from the stack.  The stack relative addressing allows the code to operate directly on the address automatically pushed by the JSR.  Also, when this stacked address is incremented through the text string, it is automatically in the right place for the RTS.  Sadly, INC does not work with a stack relative operand, forcing us to LDA the value to the accumulator, increment it, and STA it back.  The 16-bit version makes this process a little smaller, but again require having resident OS interrupt routines that allow size changes to the register.

In all routines, "JSR PUTE" can be considered to mean any suitable routine to output a byte to the screen.

- Sample Code: Block Move Patch

The 816 block move instructions can move memory at least twice as fast as 6502 implementations.  It will usually take less code as well, which sets up an opportunity to patch existing 6502 programs with a block move instruction.  All we need to do is find the existing routine that does memory moves, and replace the instructions with an 816 equivalent block move.

For a demonstration of this, I chose the word processing program TextPro, version 4.55.  It is best if the application needs to do a lot of memory management, and this should be a good choice.  I'll load the file into memory using the debugger.  Since I want the entire file contents sequential, and at the address I choose, I'll use the "-" option for the load which does a straight CIO transfer.

L -3000 TP.COM

The first step is to find the existing memory move code, and while you could browse through the disassembly, a little detective work with the debugger is a better choice.  A memory move should have a searchable sequence of an LDA instruction, followed immediately by an STA.  The most common type will be using (indirect),Y addressing, and a quick check of the opcode table shows the hex values to be $B1 for the LDA, and $91 for the STA.  Since we don't know what the operand will be, we will use a wildcard for that.

H 3000 7800 B1 ? 91

This found one occurrence, but examination of the code showed it to not be what we were looking for.  To search for other choices, you need to think of the other ways memory can be moved.  Self-modifying code is sometimes used, where you would see an LDA absolute,X or absolute,Y followed by an STA.  Since the operands in this case are two bytes, we would use two wildcards.  It turns out that the routines we want use the absolute,Y addressing, and the search looks like this:

H 3000 7800 B9 ? ? 99

This actually finds quite a few matches, being a common programming structure for other uses.  We luck out, and the routine we are looking for happens to be the very first match, at $36C4.  Even if it weren't though, it would have only taken a few minutes to examine the code at each location to find the right one.  Anyway, here's what the code looks like, and I have adjusted the addresses in the left column to be the actual runtime locations, to match up with the self-modifying operands.

$394B   LDA $80
        STA $396B
        LDA $81
        STA $396C
        LDA $82
        STA $396E
        LDA $83
        STA $396F
        LDX $85
        BEQ $3983
$3963   LDA #0
$3965   STA $B73C
        LDY #0
$396A   LDA $FFFF,Y
$396D   STA $FFFF,Y
	INY
	CPY $B73C
        BNE $396A
        INC $396C
        INC $396F
        CPX #0
        BEQ $3987
        DEX
        BNE $3963
$3983   LDA $84
        BNE $3965
$3987   RTS

The next step is to determine the locations where we can get the parameters for the move.  The source address is easy to detect, being the first four lines that store into the LDA $FFFF,Y instruction.  Likewise, the destination address is the next four lines, making the source address found in $80-$81, and the destination in $82-$83.  Finding the length takes a bit more examination, but winds up at $84-$85.

All we have to do now, is replace this subroutine with an 816 code segment that uses the block move instruction.  This code gets placed at $36A5, and although it can be entered with the single line assembler in the debugger, it will get confusing since almost every instruction that you enter will be an 816-specific instruction.  The disassembler will not display any of these properly, so the display will look like a real mess.  The debugger will also have problems giving you the proper address to enter the next line of code, since it doesn't know the length of 816 instructions.  It will assume all unknown instructions to be one byte, which is sometimes okay, but if you enter a three byte instruction like "MVN 0 0", the debugger will give you the next line with the address pointing to the first 0 operand.  You need to be aware of how long the 816 instructions really are, and move down to where the next instruction needs to start.  I apologize for not providing a debugger that is 816-aware yet, but this *will* be fixed soon.  I need it for my own work, so that ensures it a high priority in the 'to-do' list.

For now, entering code segments like this is safer in the main assembler.  You can assemble a disk file with an origin of $36A5, then return to the debugger, re-load TP.COM, and then load your 816 patch on top of it.  I purposely won't give detailed instructions on how to enter the following code segment within the debugger.  If you aren't comfortable with what is going on, and with what is required to get around it, entering code this way is probably too error-prone.  I will give a nice tip though for those of you that are going to do it anyway!  First fill the target area with all $EA's.  In this case, "F 36A5 36E0 EA".  This does two things for you.  After entering an instruction, the next line should display a NOP.  If it doesn't, then you are still within the operand of the previous instruction.  Thus, it makes a great visual safety measure.  Also, since you will need to move down to a lower address, having all NOPs at one byte per line will ensure that the address you need will be displayed.  (Even if the address you need appears in the operand of the current line, there will still be another copy of it further down the screen). If you had left the previous code there, you might need an address that is in the operand field of an old instruction, requiring you to enter the address by hand.  You also should move down with the cursor keys, and not the Return key.  Okay, with all that out of the way, here is the 816 code segment.

$36A5   STZ $D40E     ;Until Atari OS can handle interrupts in native mode
        SEI           ; we need to disable NMIs and IRQs
        CLC
        XCE           ;Set native mode,
        REP #$30      ; and 16-bit regs.
        LDA $84       ;All parameters loaded as 16-bit quantities.
        BEQ $36B9     ;Catch 0 size condition
        DEC           ;Adjust size for MVN count being -1
        LDX $80
        LDY $82
        MVN 0 0
$36B9   XCE           ;Carry was not changed, so this returns where we
        CLI           ; were, including 8-bit regs.
        LDA #$C0
        STA $D40E     ;Interrupts are back on.
	LDY $84       ;These last two lines are a safety measure, just
        LDX #0        ; to match the closing register and flag values from
        TXA           ; the previous routine.  If you have enough space,
        RTS           ; it's a good idea to do this in case the programmer
;                       relied on these values being there after the call.

Most programs that do memory management will need to have two routines, for doing both incremental and decremental type moves, just as the 816 CPU has two block move instructions.  Right after the routine we just patched, is another routine starting at $36E2.  The requirements are similar, but with the added task of needing to compute the ending address for both the source and destination memory blocks, which is where the move needs to begin.  I'll skip right to the 816 code in this case.

$36E2   STZ $D40E
        SEI
        CLC
        XCE
        REP #$30
        LDA $84
        BEQ $36FE
        DEC           ;Same so far
	TAY
        CLC
        ADC $80       ;Gets to end of source block
        TAX
        TYA
        PHA
        ADC $82       ;End of destination block
        TAY
        PLA
        MVP 0 0
        SEC           ;We changed the carry this time.
$36FE   XCE
        CLI
        LDA #$C0
        STA $D40E
        LDY #$FF
        LDX #0
        TXA
        RTS

That completes the patches.  Now we need to save to new version back to disk.  Obviously, you should use a different file name in case something went wrong.  If you are using SpartaDOS, it is a simple matter of checking the file size in the disk directory.  The file shows as 17993 bytes long, so you can enter the save command: (remember the "-", and the relationship between file size and end address -1)

S -3000 3000+#17992 TP.COM

If you are using a DOS that does not give you the true file sizes, then you need to find the end of the TP.COM file another way.  If you have been following this example, for now you should just enter the save command listed above.  The method I will suggest for finding the end of the file needs to be done before loading TP.COM, which we are too late for.  What we could have done, was fill memory with some known value before loading the TP.COM file.  Such as "F 3000 8000 11".  Then load TP.COM, and search for a long string of 11's.  If you have picked a number that is not used in long sequential quantities within the file, then the match address you find should be the end of the file.  Save from 3000 to one byte lower than where the 11's start.

That completes this project.  Operations like inserting and deleting text will be twice as fast in this new version.  Note however, there will be some screen flickering caused by when we turned off the interrupts -- including DLIs -- during the move.  When the 816 OS is available, we will not have to disable interrupts, and this patch will be completely clean.  At this time, you could go back and put NOPs over the STZ $D40E and SEI instructions.

- Sample Code: Sieve of Eratosthenes Benchmark

This is a slightly improved version of the routine that appeared in "Programming The 65816", by David Eyes and Ron Lichty.  The Sieve program calculates all of the prime numbers between 3 and 16,384.  The basic procedure is to eliminate every multiple of a number N, since that number can't be prime while having N as a factor.  All even numbers are non-prime, so we can start N at 3, and step up to the mid-point in the table skipping all even numbers.  In the following routine, even numbers are effectively skipped by using 1/2 the table size, and considering the first table entry to be for the number 3, 5 in the second entry, 7 in the third, and so on.  The routine will set the direct page to $3000, so that the entire 8K table can be accessed with dp,X addressing and 16-bit index registers.

SIZE   =   8192
       .OR $00       ;This is added to the dp register of $3000,
ITER   .DS 2         ; putting these 3 variables at $3000-$3005
COUNT  .DS 2
PRIME  .DS 2
FLAGS  =   $3008

;Note: by setting FLAGS to $3008, we can use the label in absolute
; addressing, which is required when we need to index by Y.  Using
; "<FLAGS" in the assembler forces the dp addressing mode in
; situations where we can use it.

       .OR $2800
;
       STZ $D40E
       STZ $22F
       STZ $D400
       SEI
       CLC
       XCE
       R16
       PEA $3000     ;Set dp register to $3000, so that we can access
       PLD           ; an 8K array with dp addressing
       LDA #100
       STA ITER      ;Number of times to run the routine, for timing ease
AGAIN  STZ COUNT
       LDX #SIZE-2
       LDA #$FFFF
LOOP   STA <FLAGS,X  ;Set all entries as prime initially
       DEX
       DEX
       BPL LOOP
       LDY #0
       BRA MAIN
;
SKIP   INY
       CPY #SIZE/2
       BEQ QUIT
MAIN   LDA FLAGS-1,Y
       BPL SKIP      ;Skip entries which have already been cleared.
       TYA
       ASL           ;Adjustment for no even entries in the table
       ADC #3        ; and +3
       STA PRIME     ; gives us the increment to step through the table
       TYA
       BRA TOP
INNER  TAX
       SEP #$20      ;8-bit access to only clear one byte
       STZ <FLAGS,X  ;Clear non-prime entries
       REP #$21
TOP    ADC PRIME     ;Get to next multiple
       CMP #SIZE+1
       BCC INNER     ;Continue until end of table
SKIP2  INC COUNT
       INY
       CPY #SIZE/2
       BNE MAIN
QUIT   DEC ITER
       BNE AGAIN
       PEA !0
       PLD
       SEC
       XCE
       .AB
       .IB
       LDA #$40
       STA $D40E
       CLI
       LDA #$22
       STA $22F
       STA $D400
       BRK
       .EN

For those that are curious, here is the original routine as it appeared in "Programming the 65816", with notes as to where the improvements were made.

SIZE   =   8192
       .OR $80
ITER   .DS 2
COUNT  .DS 2
PRIME  .DS 2
FLAGS  =   $3000
       .OR $2800
;
       CLC
       XCE
       REP #$30
       .AW
       .IW
       LDA #100
       STA ITER
AGAIN  STZ COUNT
       LDY #SIZE-1   ;This was chged to SIZE-2 so Y decrements to 0...
       LDA #$FFFF
       STA FLAGS     ;...which makes this instruction unneeded
LOOP   STA FLAGS,Y   ;This loop was chged to dp,X addressing,
       DEY           ; using X since there is no STA dp,y
       DEY
       BPL LOOP
       LDY #0
MAIN   LDA FLAGS-1,Y
       BPL SKIP      ;Moved the label SKIP above MAIN, so code is reduced
       TYA           ; from two branches, to one branch and a fall-thru
       ASL
       INC           ;Why would they have done 3 INCs here
       INC           ; instead of just ADC #3???
       INC
       STA PRIME
       TYA
       CLC           ;Not needed -- C already clear
       ADC PRIME
TOP    CMP #SIZE+1
       BCS SKIP2     ;80+% of the speedup came from moving the SKIP2 label
       TAX           ;This innermost loop has a fall-thru SKIP2, & BRA TOP
       SEP #$20      ;That structure can be optimized to 1 branch, saving
       STZ FLAGS,X   ; 2 cycles, plus 1 more for dp addressing of FLAGS
       REP #$21
       ADC PRIME     ;They had two ADC PRIMES, which can be avoided simply
       BRA TOP       ; by deleting this one, & moving label TOP up 1 line
SKIP2  INC COUNT
SKIP   INY
       CPY #SIZE     ;All of the primes will be found by the time Y=SIZE/2
       BNE MAIN      ;It was unecessary to keep the loop going this far.
QUIT   DEC ITER
       BNE AGAIN
       SEC
       XCE
       .AB
       .IB
       RTS
       .EN

Execution time for the original routine was listed in "Programming the 65816" as 1.56 seconds @4MHz for ten iterations, but I measured its results (after converting from 1.79MHz and 100 iterations with DMA and interrupts off) at 1.73 seconds.  The optimized version clocked in at 1.31 -- an improvement of 25%.  Even the original version is over twice as fast as the 6502 could perform this operation at the same clock speed.

To see how other processors match up, I'll quote some additional figures from a January 1983 BYTE article, all normalized to the 4MHz speed.

8088  - 5.0 Seconds  ;Speeds for the Intel processors were about as
8086  - 3.8          ; slow as I would expect.  <grin>
68000 - .98

I can get the 65816 version down to about 1.2 seconds, but that requires doubling the size of the FLAGS table, and stepping Y by twos.  The advantage here, is that each entry is two bytes, and thus you can eliminate the overhead of needing the set 8-bit memory access for the "STZ FLAGS,X".  This creates a table in a different format than the above routine however, so I did not consider it valid as a speed upgrade only.  Regardless, the 816 gets close to the performance of the 68000, even though the 68000 has an external 16-bit path to memory, and tons of registers to replace the memory variables needed in the 816 version.  For many applications, I think the 816 would even be faster than the 68000 at the same clock speed.  And for small code size, it blows the 68000 away.

- Sample Code: String Matching Decompression

This next example shows a form of recursive programming.  It may be difficult to follow, but useful examples of recursive programming are rare, so I thought I would put this in.

The code is a message decompression routine, based on a string matching type of compression.  If the data contains sequences of letters that are repeated in multiple locations, the compressed data will contain pointers for the repeated strings that identify a length and a location to find the original occurrence of the string.  When decompressing the data stream, if you find any pointers, then you need to go to the location pointed to, and copy the number of bytes indicated by the length portion of the pointer.

Usually, these compression methods use pointers that refer back to previous data in the decompression buffer -- in other words, the data that has already been decompressed.  In my application, all of the individual messages were small, and they would not generate significant amounts of repetition within themselves to get a good compression ratio.  However, there are lots of these messages, and so if I could use backward pointers into the source data, instead of the destination, then pointers could refer to string matches in previous messages, and obtain a much higher compression ratio.  The challenge with this approach, is that the previous messages are already compressed.  When you try to copy a matched string from a previous occurrence, you might run into additional compression pointers.  To create an optimum approach, the routine should follow any additional pointers that come along, into deeper and deeper levels, and also be able to work itself back out of the pointers as the deeper levels get used up.  Since I know this is going to be confusing, I need to start with an example to show how this decompression process works.

The format I use for compressed pointers, is a $7F 'ID' byte, followed by the length of the match, followed by the backwards relative distance where the match was found.  (The ID byte can also be $7E, which is used for match distances greater than $100 bytes away.  That allows the match distance to be up to $1FF).

First, a simple example.  Consider the text string:

BY THE WAY THE PATH

When compressing this, the second occurrence of the letters " THE ", including the spaces, is repeated, and can be stored as a pointer.  Thus, the compressed data would look like this:

 .BY "BY THE WAY" $7F 5 8 "PATH"

To decompress this data, you would start by copying bytes into the output buffer.  When you get to $7F, you consider the next byte the length of a match, and the byte after that to be a backward pointer.  Subtracting 8 from the location of the $7F, gets you to the space after the word BY.  Copying five bytes from that location yields " THE ", and from there, you are ready to continue with the rest of the source data "PATH".  This recreates the original string.

Now we need to do a complex example to show what happens when we run into more compression pointers during the string copy phase.

 .BY "BY THE WAY THE PATHWAY THE PARTY"

Don't judge my English here.  I'm just looking for matching words.  Now let's see how this string would compress.  It will start out identical to the previous example, until you get to the W in PATHWAY.  From there, you can match the string "WAY THE PA" -- a nice chunk of 10 bytes.  The distance would be 12 if it were based on the original string, but for this routine we need to count backwards using our compressed data, which makes it only 10.  Here's the final compressed string for the second example, which marks several positions below it that will help identify and keep track of where we are at different places in the decompression process.

 .BY "BY THE WAY" $7F 5 8 "PATH" $7F 10 10 "RTY"
        ^   ^^     ^       ^^     ^         ^
        4   52     3       67     1         8

Let's follow through the decompression of this one.  Up to the point where we reach the second $7F, everything is the same as the first example, and we will have "BY THE WAY THE PATH" in our decompression output buffer.  We are now at location #1, and the $7F pointer directs us 10 bytes backwards, to the word WAY at location #2.  We can't just copy 10 bytes from there though, because we would wind up with the bytes $7F 5 8 in our output string.  We copy the first 3 bytes, "WAY", and then find another $7F byte at position #3.  We need to copy the bytes that this pointer refers to, and so we subtract 8 from there, reaching point #4.  We've already copied 3 of the 10 required bytes from the original pointer, but what happens if we copy the 7 remaining bytes from this position?  That would be the string " THE WA", and that's not the right data.  If we did this, our output would come out, "BY THE WAY THE PATHWAY THE WA", which has the last two letters wrong.  The problem is, when we went backwards the second time, the length of that string match is only 5.  Thus we can only copy five bytes from there -- the locations #4 thru #5 -- and then we have to return back one level, to after the second string pointer at location #6.  We have 2 bytes left to copy, to reach the requirement of 10 from the original pointer, and that takes us to position #7.  The original pointer is now completed, so we can return to position #8 and copy the remaining bytes.  Basically, the $7F match pointer at position #1 refers to the string from #2 thru #7.  That string happens to have a substring inside of it from #4 thru #5.  If you stop to think about this going into additional levels, and at each level we need to keep track of how long this substring is, where is the previous return point, how many bytes are left in the previous substring, and how many bytes are left in the original string where we started, it can quickly add up to a formidable programming challenge.

The problem is a lot more manageable if you can realize that no matter how many levels deep you are, the program's requirements are the same.

You are copying text bytes to the output string.  You need to know the length of the current string to know how many bytes to copy.  You need to return to the previous level when you reach either the end of this string, or the end of the previous string, whichever comes first.  (If you're on the top when you reach this point, you're done).  You can also reach a $7F byte at any time, at which point you need to save the current string parameters and go one level deeper.  Once you do this though, you can just return to the same code, as in the top of this paragraph.

This is the essence of recursive programming -- The ability to execute the same block of code again and again, but at deeper levels.  The 816 processor provides a good method for doing this, in the form of stack relative addressing.  If you keep all of the parameters on the stack, going one level deeper simply means pushing another set of parameters.  The first set will be preserved, and the routine operates on the new set that is on the top of the stack.  Returning back to previous levels is accomplished by pulling the 'finished' parameters off the top, re-exposing the previous set.

With that in mind, I hope you can follow though the completed code example.

; This subroutine enters with the address of the compressed
; source message in the A and X registers, LO, HI respectively.
; It will place the decompressed output string in the buffer MSGBUF.
; The end of the source string is marked with the bytes $7E FF
;
; During the routine, the byte on top of the stack will be the
; remaining length of the current level of string lookup.  This starts
; as $FF, since there is no string lookup in progress at the root level.
; The address at two bytes down in the stack, is the pointer where bytes
; are being copied to the output buffer.  This starts at the source of
; the compressed data, but new addresses will get pushed when string
; lookups are in progress.

MVMSG  PHX           ;Push the start address of the source...
       PHA
       LDX #$FF
       PHX           ;...and the FF 'root' marker
       INX           ;=0
       I16
CKLEN  LDA 1,S       ;If there are no more bytes to copy at this level,
       BEQ RET       ; return to the previous level.
       LDY #0
BYTE   LDA (2,S),Y   ;Get a byte from the source string
       CMP #$7E
       BEQ MC7E
       CMP #$7F
       BEQ MC7F      ;If 7F or 7E, initiate a new string match process
       STA MSGBUF,X  ;Else, store a byte
       INX
       INY
       LDA 1,S
       BMI BYTE      ;At root level, we can just keep copying
       TYA
       CMP 1,S       ;Else, we need to check the string length
       BNE BYTE
RET    PLA           ;When a string lookup is complete,
       PLA           ; return to the previous level
       PLA
       BRA CKLEN
;
MC7E   INY
       LDA (2,S),Y   ;Get next byte after $7E
       DEY
       INC
       BEQ ENDMV     ;If it is $FF, that marks end of data
       SEC
MC7F   LDA 1,S
       BMI ?ROOT
       STY TEMP      ;When string length is not root, subtract the
       SBC TEMP      ; number of bytes already moved,
       STA 1,S       ; and STA the number of remaining bytes
?ROOT  TYA
       ADC 2,S       ;+1
       STA SCNADR    ;Put address of the $7E/$7F into SCNADR
       LDA (2,S),Y
       AND #1        ;This adds $100 if the byte was $7F.  Later on,
       ADC 3,S       ; it will get subtracted
       STA SCNADR+1
       INY
       LDA SCNADR
       SBC (2,S),Y   ;-1  Subtract the pointer distance...
       STA SCNADR
       LDA SCNADR+1
       SBC #1        ;...-$100 more.  Cancels earlier +$100 if byte was
       STA SCNADR+1  ; 7F, or puts 7E pointers in the range -$100-$1FF
       INY
       LDA (2,S),Y
       STA TEMP      ;Match size
       TYA
       SEC
       ADC 2,S       ;Update current string address to the point
       STA 2,S       ; after the match pointer
       LDA 3,S
       ADC #0
       STA 3,S
       LDA 1,S
       BMI ?OK       ;If root level, use full match size
       SEC
       SBC TEMP      ;Else, we need to check if the bytes we need at the
       BPL ?OK       ; current level is more than the match size.
       LDA 1,S       ;If not, set the new level to go only as far as the
       STA LINMAP+2  ; number of bytes needed to complete current level
       LDA #0        ;And current level will be finished when that's done
?OK    STA 1,S       ;Update current level size,
       LDA TEMP
       LDY SCNADR
       PHY           ; and push to new level
       PHA
       JMP CKLEN
;
ENDMV  PLA           ;Msg is complete.  Clean off stack usage.
       PLA
       PLA
       LDA #0
       STA MSGBUF,X  ;Mark end of message, and exit
       I8
       RTS
;