Speed up RAMLink transfers with the Double-DMA Technique

by Doug Cotton (cmd-doug@genie.com) and Mark Fellows

Introduction

When CMD designed the RAMLink, we tried to make the system as fast as possible, but cost and complexity prohibited us from duplicating the DMA operation found in the Commodore RAM Expansion Unit (REU). The 8726 DMA controller found in the REU is a very complex item that allows the REU to transfer one byte per 1 MHz CPU clock cycle (1 microsecond). The RAMLink, on the other hand, uses 6510/8502 CPU load and store operations to transfer data between RAMLink memory and main memory. For the user who uses RL-DOS and RAMDOS, the difference is not noticeable: although the RAMLink transfer is slower, RAMDOS continually pages its code in and out of main memory, which slows its effective transfer speed significantly.

But, what if the programmer isn't using RAMDOS? Then, the speed of the RAMLink becomes an issue. The RAMLink takes about 8 cycles to perform a transfer of a byte, while the REU does it in 1. This is significant. However, if a user owns both a RAMLink and an REU, there is a way to boost the transfer rate of the RAMLink via software. The method is called Double-DMA.

Double-DMA Description

Basically, the process is quite simple. Since the REU can transfer memory at 1 byte/microsecond, you can use the REU DMA to transfer memory from the RAMLink to main memory. To understand how we can do this, remember that the normal RL-DOS transfer routines use the CPU to perform the memory transfer. To do that, at least some of the RAMLink RAM must be mapped into main memory. To be exact, 256 bytes are mapped in. So, to utilize the Double-DMA technique, the programmer simply makes the appropriate 256 bytes of RAMLink memory visible in the main memory map, uses the REU to transfer those 256 bytes to the REU, and then uses the REU to transfer the 256 bytes in the REU to their destination in the main memory map. Since every byte passes through the DMA controller twice, the Double-DMA technique allows the RAMLink to transfer data at roughly 1/2 the speed of the REU, or 3-4 times faster than using the CPU to perform transfers.

The RAMLink memory map

To achieve this transfer speed gain, the programmer must forego RL-DOS usage and write specialized transfer routines. To do that, we need to discuss how the RAMLink maps itself into main memory and detail the various RAMLink registers needed to make this feat possible:

Address Description
------- -----------
$de00   256 bytes of data (See $dfc0-$dfc3 for more information)
$df7e   write to this location to activate the RAMLink hardware
$df7f   write to this location to deactivate the RAMLink hardware.
$dfa0   lo byte of requested RAMCard memory page
$dfa1   hi byte of requested RAMCard memory page
$dfc0   write to this location to show RL variable RAM at $de00 (default)
$dfc1   write to this location to show RAMCard memory at $de00
$dfc2   write to this location to show the RAM Port device $de00 page at $de00
$dfc3   write to this location to show Pass-Thru Port dev. $de00 page at $de00

For all locations that have the description "write to this location...", the program can safely write any byte to those locations, as the RAMLink hardware simply waits for an access, not for any particular byte to be written.
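
As an illustration of these registers, the following is a minimal sketch of a CPU copy of one RAMCard page into main memory, the kind of load-and-store transfer that Double-DMA is meant to replace. It is not taken from RL-DOS. The names ramPage and dest and their values are assumptions chosen for the example, and the sketch assumes that I/O is already visible in the memory map.

```asm
; Sketch only: copy one 256-byte RAMCard page to main memory with the CPU.
; ramPage and dest are hypothetical values, not part of RL-DOS or ACE.

rlActivate     = $df7e
rlDeactivate   = $df7f
rlPageSelect   = $dfa0
rlSram         = $dfc0
rlPageActivate = $dfc1
rlPageData     = $de00

ramPage = $0123            ; hypothetical: RAMCard page number to read
dest    = $4000            ; hypothetical: 256-byte buffer in main memory

copyPageCpu = *
   php                     ; remember the caller's interrupt state
   sei
   sta rlActivate          ; any write activates the RAMLink hardware
   lda #<ramPage
   sta rlPageSelect+0      ; lo byte of requested page
   lda #>ramPage
   sta rlPageSelect+1      ; hi byte of requested page
   sta rlPageActivate      ; $de00 now shows the requested RAMCard page
   ldx #0
-  lda rlPageData,x        ; read a byte through the $de00 window
   sta dest,x              ; store it in main memory
   inx
   bne -                   ; 256 iterations
   sta rlSram              ; show RL variable RAM at $de00 again (default)
   sta rlDeactivate        ; any write deactivates the RAMLink hardware
   plp
   rts
```

Every byte costs a load, a store, and loop overhead here, which is the cost that the REU's one-cycle-per-byte DMA avoids.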

Order of Operations

Although the Double-DMA technique relies on use of the REU, it is beyond the scope of this article to detail how to access the REU RAM under programmatic control. For more information on transferring data between the Commodore 128/64 and the 17XX REU, refer to the back of an REU owner's manual.

The following steps will realize the Double-DMA method:

Notes:

  P = PAGE in RAMCard RAM to be transferred to/from
  A = PAGE of RAM in main memory to be transferred to/from
  X = single page of memory in REU used as temp RAM

  1. if computer = 128, set up correct RAM bank
  2. make I/O visible in main memory
  3. sei
  4. sta $df7e - activate RAMLink
  5. lda #<P
  6. sta $dfa0
  7. lda #>P
  8. sta $dfa1
  9. sta $dfc1 - make $de00 show PAGE of RAM on RAMCard

Now, with the RAMLink hardware enabled in this way, the REU registers are also visible, so one can do a double DMA transfer at this point. There are two choices:

Transfer A->P:

  1. set up REU for A->X transfer
  2. initiate REU DMA transfer
  3. set up REU for X->$de00 transfer
  4. initiate REU DMA transfer

Transfer P->A:

  1. set up REU for $de00->X transfer
  2. initiate REU DMA transfer
  3. set up REU for X->A transfer
  4. initiate REU DMA transfer

Now, to go on:

  1. If more bytes need transferring, A=A+1, P=P+1, go back to step 5 of the setup steps
  2. sta $dfc0 - restore RL variable RAM at $de00 (default)
  3. sta $df7f - deactivate RAMLink hardware
  4. if computer = 128, restore bank
  5. restore I/O visibility if needed
  6. cli
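
The steps above can be sketched for a single page as follows. This is only an illustration for a C64, not the ACE code: it fetches one full 256-byte page (P->A), and ramPage, dest, scratchPage, and scratchBank are hypothetical values. The REU register offsets and the $90/$91 command bytes are the ones used in the listing at the end of this article. The sketch assumes that I/O is visible and that the REU address control register still holds its default value.

```asm
; Sketch only: Double-DMA fetch of one RAMCard page (P) into main memory (A)
; through a scratch page (X) in the REU.  All values below are hypothetical.

reu            = $df00
rlActivate     = $df7e
rlDeactivate   = $df7f
rlPageSelect   = $dfa0
rlSram         = $dfc0
rlPageActivate = $dfc1
rlPageData     = $de00

ramPage     = $0123        ; hypothetical: RAMCard page number P
dest        = $4000        ; hypothetical: main-memory page A
scratchPage = $00          ; hypothetical: free page X in the REU (address bits 8-15)
scratchBank = $00          ; hypothetical: REU bank holding page X

fetchPage = *
   php
   sei
   sta rlActivate          ; activate RAMLink
   lda #<ramPage
   sta rlPageSelect+0
   lda #>ramPage
   sta rlPageSelect+1
   sta rlPageActivate      ; $de00 now shows page P

   ;** pass 1: $de00 -> X
   lda #<rlPageData
   ldy #>rlPageData
   ldx #$90                ; main memory -> REU
   jsr reuPage

   ;** pass 2: X -> A
   lda #<dest
   ldy #>dest
   ldx #$91                ; REU -> main memory
   jsr reuPage

   sta rlSram              ; show RL variable RAM at $de00 again
   sta rlDeactivate        ; deactivate RAMLink
   plp
   rts

reuPage = *  ;( .AY=main memory address, .X=REU command )
   sta reu+2               ; main-memory address lo
   sty reu+3               ; main-memory address hi
   lda #0
   sta reu+4               ; REU address lo (start of page X)
   lda #scratchPage
   sta reu+5               ; REU address hi
   lda #scratchBank
   sta reu+6               ; REU bank
   lda #0
   sta reu+7               ; length lo
   lda #1
   sta reu+8               ; length hi: $0100 = 256 bytes
   stx reu+1               ; writing the command starts the DMA
   rts
```

The address and length registers are reloaded before each pass because the REU updates them during a transfer. A store in the other direction (A->P) would swap the two passes: A->X with $90, then X->$de00 with $91.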

Address Translation

To effectively use the Double-DMA technique, a programmer will want to set up a DACC partition in the RAMLink for use as external RAM. The programmer will need to determine the start address of the partition with the RL-DOS G-P command (or its sister command, G-[shift]P). This command will return the address of the DACC partition, or will it?

The answer is: Maybe. If a user has inserted an REU into the RAMLink RAM port and has the Normal/Direct switch set to Normal, RL-DOS uses REU memory as the lowest RAM in the RAMLink memory map. However, when directly accessing the RAMLink and bypassing RL-DOS, the REU is not mapped into the RAMLink memory map. So, for such a condition, the code that determines the start of the DACC partition must SUBTRACT the size of the REU from the address returned by the G-P command. It's non-utopian, but the program need only do this once. However, for such an REU configuration, one must take care to ensure that at least 256 bytes of REU RAM are available and not already in use before utilizing the Double-DMA technique.
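
A minimal sketch of that adjustment follows. It assumes that the program has already reduced the address returned by G-P to a 16-bit page number (256-byte pages) in partStart, and that the REU size in pages is known. The names partStart, directStart, and reuPages are hypothetical, and the 512K value is only an example.

```asm
; Sketch only: compute the partition start page to use for direct access
; when an REU sits in the RAM Port with the switch set to Normal.

reuPages = $0800           ; hypothetical: 512K REU = 2048 pages of 256 bytes

partStart   .buf 2         ; start page derived from G-P (lo, hi)
directStart .buf 2         ; start page for direct RAMLink access (lo, hi)

fixPartitionStart = *
   sec
   lda partStart+0
   sbc #<reuPages
   sta directStart+0
   lda partStart+1
   sbc #>reuPages
   sta directStart+1       ; carry clear here means partStart < reuPages
   rts
```

The result plays the same role as aceRamlinkStart in the listing at the end of this article, which is added to the page bytes of the far pointer before each transfer.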

Performance

Craig Bruce, who has implemented this technique in his ACE operating system, provides the following performance figures for different access techniques:

Type            Bandwidth   Latency Notes
                (bytes/sec) (~usec)
-------------   ---------   ------- -----
REU             1,007,641      65.8 REU in Direct mode
REU thru RL     1,007,641      77.8 REU in RAM Port in Normal mode
RAMLink           105,792     199.2 Regular RAMLink access
RL with REU       372,827     319.8 Double-DMA
Internal RAM0     120,181      44.2 Zero-page
Internal RAM1      80,283      56.3 All main memory except zero-page

So, using this technique in ACE results in roughly a 3.5x increase in transfer speed (372,827 versus 105,792 bytes/sec), at the cost of higher latency. For some applications, that is well worth the trouble.

Conclusion

CMD recommends that RL-DOS be used for most operations, but we realize that some programmers simply need faster transfer rates. The Double-DMA technique should provide the speed needed from the RAMLink. Since this technique bypasses RL-DOS, code using it can potentially corrupt RAMLink memory if errors occur or if the technique is improperly used. When using the technique, we recommend extensive testing using various DACC partitions and different REU configurations to ensure proper operation.

Double-DMA Code

Following is a set of functions that will perform transfers using Double-DMA. They are copied from the routines used in Craig Bruce's ACE operating system, Release 14, which incorporates the Double-DMA method. We thank Craig for the code below:

; Name:        Double-DMA memory transfer
; Author:      Craig Bruce
; Date:        1995-12-4
; Description: The following routines use the Double-DMA technique to transfer
;              memory to/from main RAM and the RAMLink.  If no RL is present,
;              normal CPU transfer methods are utilized.
;
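; Variables:   [mp] holds the address of RAMCard memory to transfer
;              ramlinkNearPtr holds the address of main memory to transfer
;              ramlinkLength is length of data to transfer
;              ramlinkOpcode = $90: main memory -> RL
;                            = $91: RL -> main memory
;
; Externals:   mp, zp, temp1, vic, computer, aceRamlinkStart, and
;              aceReuRlSpeedPage are not defined in this listing; they are
;              assumed to come from the rest of the ACE source.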

reu = $df00
rlActivate   = $df7e
rlDeactivate = $df7f
rlSram       = $dfc0
rlPageSelect = $dfa0
rlPageActivate = $dfc1
rlPageData   = $de00

ramlinkOpcode .buf 1
ramlinkLength .buf 2
ramlinkNearPtr .buf 2
ramlinkMpSave .buf 3
ramlinkZpSave .buf 2

ramlinkOp = *  ;( [mp]=farPtr, ramlinkNearPtr, ramlinkLength, ramlinkOpcode )
   lda mp+0
   ldy mp+1
   ldx mp+2
   sta ramlinkMpSave+0
   sty ramlinkMpSave+1
   stx ramlinkMpSave+2
   lda zp+0
   ldy zp+1
   sta ramlinkZpSave+0
   sty ramlinkZpSave+1
   lda ramlinkNearPtr+0
   ldy ramlinkNearPtr+1
   sta zp+0
   sty zp+1
   clc
   lda mp+1
   adc aceRamlinkStart+0
   sta mp+1
   lda mp+2
   adc aceRamlinkStart+1
   sta mp+2
-  lda ramlinkLength+0
   ora ramlinkLength+1
   beq +
   jsr rlTransferChunk
   jmp -
+  lda ramlinkMpSave+0
   ldy ramlinkMpSave+1
   ldx ramlinkMpSave+2
   sta mp+0
   sty mp+1
   stx mp+2
   lda ramlinkZpSave+0
   ldy ramlinkZpSave+1
   sta zp+0
   sty zp+1
   clc
   rts

rlTrSize .buf 1

rlTransferChunk = *  ;( [mp]=rlmem, (zp)=nearmem, ramlinkLength, ramlinkOpcode )
   ;** figure maximum page operation
   lda ramlinkLength+1
   beq +
   lda #0
   ldx mp+0
   beq rlTrDo
   sec
   sbc mp+0
   jmp rlTrDo
+  lda mp+0
   beq +
   lda #0
   sec
   sbc mp+0
   cmp ramlinkLength+0
   bcc rlTrDo
+  lda ramlinkLength+0

   ;** do the transfer
rlTrDo = *
   tay
   sty rlTrSize
   jsr rlPageOp

   ;** update the pointers and remaining length
   clc
   lda rlTrSize
   bne +
   inc mp+1
   inc zp+1
   dec ramlinkLength+1
   rts
+  adc mp+0
   sta mp+0
   bcc +
   inc mp+1
+  clc
   lda zp+0
   adc rlTrSize
   sta zp+0
   bcc +
   inc zp+1
+  sec
   lda ramlinkLength+0
   sbc rlTrSize
   sta ramlinkLength+0
   bcs +
   dec ramlinkLength+1
+  rts

rlPageOp = *  ;( [mp]=rlmem, (zp)=nearmem, .Y=bytes, ramlinkOpcode )
   php
   sei
   sta rlActivate
   lda mp+1
   sta rlPageSelect+0
   lda mp+2
   sta rlPageSelect+1
   sta rlPageActivate
   lda aceReuRlSpeedPage+3
   bne rlPageOpReu  ;xxx dependency on aceMemNull==0
rlPageOpNonReu = *
   tya
   clc
   adc mp+0
   tax

   lda ramlinkOpcode
   cmp #$91
   bne rlPageOpWrite
   dex
   dey
   beq +
-  lda rlPageData,x
   sta (zp),y
   dex
   dey
   bne -
+  lda rlPageData,x
   sta (zp),y
   jmp rlPageOpContinue

rlPageOpWrite = *
   dex
   dey
   beq +
-  lda (zp),y
   sta rlPageData,x
   dex
   dey
   bne -
+  lda (zp),y
   sta rlPageData,x

rlPageOpContinue = *
   sta rlSram
   sta rlDeactivate
   plp
   rts

rlPageOpReu = * ;( [mp]=rlmem, (zp)=nearmem, .Y=bytes, ramlinkOpcode )
   ;** ramlink hardware already switched in
   ldx #1
   tya
   beq +
   ldx #0
   cmp #0  ;xx cut-off value
   bcc rlPageOpNonReu
+  ldy ramlinkOpcode
   cpy #$90
   beq +
   ldy #$90            ;rl->reu->intern
   jsr rlPageOpReuRl
   ldy #$91
   jsr rlPageOpReuIntern
   jmp ++
+  ldy #$90            ;intern->reu->rl
   jsr rlPageOpReuIntern
   ldy #$91
   jsr rlPageOpReuRl
+  sta rlSram
   sta rlDeactivate
   plp
   rts

rlPageOpReuIntern = *  ;( .AX=bytes, .Y=op )
   sta reu+7  ;len
   stx reu+8
   sty temp1
   pha
   lda zp+0
   ldy zp+1
   sta reu+2
   sty reu+3
   lda aceReuRlSpeedPage+0
   ldy aceReuRlSpeedPage+1
   sta reu+4
   sty reu+5
   lda aceReuRlSpeedPage+2
   sta reu+6
.if computer-64
   ldy vic+$30
   lda #0
   sta vic+$30
.ife
   lda temp1
   sta reu+1
.if computer-64
   sty vic+$30
.ife
   pla
   rts

rlPageOpReuRl = *  ;( .AX=bytes, .Y=op )
   sta reu+7  ;len
   stx reu+8
   sty temp1
   pha
   lda mp+0
   ldy #>rlPageData
   sta reu+2
   sty reu+3
   lda aceReuRlSpeedPage+0
   ldy aceReuRlSpeedPage+1
   sta reu+4
   sty reu+5
   lda aceReuRlSpeedPage+2
   sta reu+6
.if computer-64
   ldy vic+$30
   lda #0
   sta vic+$30
.ife
   lda temp1
   sta reu+1
.if computer-64
   sty vic+$30
.ife
   pla
   rts

Last Updated: 1995-12-4