SOLVING LARGE SYSTEMS OF LINEAR EQUATIONS ON A C64 WITHOUT MEMORY

by Alan Jones (alan.jones@qcs.org)

OK, now that I have your attention, I lied. You can't solve dense linear systems of equations by direct methods without using memory to store the problem data. However, I'll come back to this memory-free assertion later. The main purpose of this article is to rescue a useful numerical algorithm, "Quartersolve", and also to provide a brief look at the COMAL programming language and the BLAS routines.

Linear systems of equations, A(,)*x()=b(), where A is a square matrix and x and b are vectors (or arrays), must often be solved for x in the solution of a variety of problems. The size or dimension of the problem is n, and just storing A requires 5*n*n bytes of memory, assuming C64/128 5 byte real variables. The preferred solution method is a form of Gaussian Elimination which requires 1/3 n*n*n multiplications. I'll ignore the additional n*n and n multiplies. For large problems our C64 has two serious limitations, small memory size and slow floating point arithmetic. Problems with n=10 can be computed easily. Problems with n=100 will require 100 times more memory and 1000 times more computing time. The computing time is not a real problem. I don't mind letting my computer run while I watch a movie, sleep, or go on a trip. Calculating or setting up the problem may take much longer than its solution anyway. Available memory is the practical limiting factor. After we use up available RAM we have to resort to other algorithms that use the disk drive to move data in and out of the computer. The 1541 drive is particularly slow and I would not want to subject it to undue wear and tear.

How big a problem do we need to be able to solve? In many cases the problem itself will fix n and there is no way to reduce it. In other cases you might be modeling a real continuous problem with a discrete number of elements. n should really be infinite, but n=50 might be big enough for a practical solution. Consider calculating the aerodynamic potential flowfield around a body of revolution. You could fix points on the surface of the body (a meridian) and have a series of short line segments make up elements that approximate the shape. The larger n is, the closer the smooth shape is approximated and the more accurate the computed solution becomes. n=100 might be a good choice for a simple shape. We could also use a "higher order" method. In this case we can substitute a curved line element for the straight line segment. Calculating the matrix elements will be more difficult, but n=40 curved elements might give a more accurate solution than 100 flat elements. Another consideration is the resolution of the solution. You might want to plot the solution on the 320 x 200 pixel hi-res C64 screen. 40 points might be too coarse and 320 might be overkill. We might also need to calculate the slope or derivatives from the computed solution, which will require more closely spaced solution points. There are often choices that you can make in modeling a system and selecting a solution algorithm so that a problem can be solved within the limits of a C64. There are often interesting tradeoffs in memory requirements and execution speed.

How big a problem can we solve with a C64? Using Quartersolve with assembly language we can probably do n=200 or more. If we are going to store the problem data on a single 1541 diskette and read it in a row at a time we can only do n=182 or so. Actually I think n should be well under 100. Different operating systems and languages limit the amount of usable RAM: BASIC 40K, COMAL 2.0 30K, GEOS 23K, the initial disk loaded COMAL 0.14 10K... Solving a linear system may only be a small subproblem inside a large application program. The idea is to be able to solve reasonable sized problems using your preferred computing environment without having to do a lot of chaining or loading of separate programs. Quartersolve can free up a lot of memory for other routines or allow n to be doubled.

SPEED

There are a few things that we can do to speed up the calculations. First we can select a fast programming language. I prefer COMAL 2.0, which is a fast three pass interpreter. Using an assembler could be the fastest and provide the most usable memory. A true compiler such as C or Pascal could also be a good choice. BASIC is a poor choice except that it is built in and free. In most cases execution can be sped up with some machine language routines like the BLAS (Basic Linear Algebra Subroutines). Calculation speed is measured in FLOPS/sec (Floating Point OPerationS per second), where c(i#,j#):=c(i#,j#) + a(i#,k#)*b(k#,j#) is the operation. It is one FP multiply, one FP add, and some indexing overhead. With some interpreters the indexing and interpreting overhead can far exceed the actual FP multiply time. With assembled code the FP multiply time should dominate. I use a ML level 1 BLAS package with COMAL 2.0. For example:

    c(i#,j#):+sdot(n#,a(i#,1),1,b(1,j#),sdb#)
    FOR k#:=1 TO n# DO c(i#,j#):+a(i#,k#)*b(k#,j#)

both calculate the same thing, a dot product costing n# FLOPS. For large n# on a C64 the BLAS approaches about 320 FLOPS/sec. The overhead of calling the procedure from the interpreter is about the equivalent of 4 FLOPS. Of course modern computer performance is measured in MegaFLOPS/sec. with 8 byte reals (supercomputers run hundreds or thousands of MFLOPS/sec.). They also inflate the performance by counting the multiply and add as two FLOPS. In this article I use the "old flops", or number of multiplies.
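
If you want to check the speed of your own setup, a rough timing loop is enough. The sketch below is mine, not from any package, and it assumes COMAL 2.0's TIME function returns the system jiffy clock (1/60 second ticks); adjust the divisor if your timer differs.

    USE blas
    n#:=100; trials#:=100
    DIM a(n#), b(n#)
    FOR i#:=1 TO n# DO a(i#):=1
    scopy(n#,a(1),1,b(1),1)
    t0:=TIME // assumed jiffy clock, see above
    FOR t#:=1 TO trials# DO s:=sdot(n#,a(1),1,b(1),1)
    secs:=(TIME-t0)/60
    PRINT "about";trials#*n#/secs;"FLOPS/sec"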

It may also be possible to code 6502 FP arithmetic routines using lookup tables that may perform faster than the built in routines. We could also use the CPU in the disk drives to do distributed processing. But this is beyond the scope of this article.

SOLUTION METHODS

Consider the following choices for numerical solution algorithms:

METHOD                   MEMORY        FLOPS
Gaussian Elimination        n*n      1/3 n*n*n
Cholesky Decomposition  1/2 n*n      1/6 n*n*n
QR decomposition            n*n      2/3 n*n*n
QR updating             1/2 n*n        2 n*n*n
Gauss-Jordan                n*n      1/2 n*n*n
Quartersolve            1/4 n*n      1/2 n*n*n

Gaussian Elimination is the preferred method when enough memory is available. In modern terminology this is LU decomposition, where A is decomposed or factored into a lower triangular matrix and an upper triangular matrix. Partial pivoting of rows or columns is an additional complication often required for a stable solution. After the LU decomposition you can readily solve for any number of right hand side vectors in n*n flops each. In addition you can calculate matrix condition number estimates and use iterative improvement techniques. The LU decomposition is done in place, overwriting the problem matrix A.
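
For comparison with the Quartersolve code given later, here is a minimal sketch of in-place LU decomposition with partial pivoting in the same Comal/BLAS style. It follows the LINPACK sgefa pattern (multipliers are stored negated below the diagonal); the name lu and its argument list are my own invention, not part of any package.

    PROC lu(n#,REF a(,),REF sw#(),REF fail#) CLOSED
      // Overwrites a(,) with the L and U factors, pivot rows in sw#()
      USE blas
      fail#:=TRUE
      FOR k#:=1 TO n#-1 DO
        p#:=k#-1+isamax#(n#-k#+1,a(k#,k#),n#) // pivot search down column k#
        sw#(k#):=p#
        IF a(p#,k#)=0 THEN RETURN
        IF p#<>k# THEN sswap(n#,a(k#,1),1,a(p#,1),1) // swap the two rows
        r:=-1/a(k#,k#)
        sscal(n#-k#,r,a(k#+1,k#),n#) // negated multipliers below the diagonal
        FOR j#:=k#+1 TO n# DO // add a multiple of row k# to each lower row
          saxpy(n#-k#,a(j#,k#),a(k#,k#+1),1,a(j#,k#+1),1)
        ENDFOR j#
      ENDFOR k#
      sw#(n#):=n#
      IF a(n#,n#)<>0 THEN fail#:=FALSE
    ENDPROC lu

A companion forward and back substitution routine (like LINPACK's sgesl) would then solve for each right hand side in n*n flops.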

Cholesky Decomposition is a specialized version of Gaussian Elimination for symmetric positive definite matrices only. Since A is symmetric we only need n*(n+1)/2 memory storage locations. The L and U triangular matrices are transposes of each other, so only one needs to be stored, and it is computed in place overwriting the original storage used for A. No pivoting is required. This algorithm cannot solve general nonsymmetric problems and is included only for comparison.
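
Cholesky is short enough to sketch in the same style. For clarity this works on the lower triangle of a full n# by n# array; a packed version that really uses only n*(n+1)/2 locations would only change the indexing. The routine name chol is mine.

    PROC chol(n#,REF a(,),REF fail#) CLOSED
      // Overwrites the lower triangle of a(,) with L, where A=L*L'
      USE blas
      fail#:=TRUE
      FOR j#:=1 TO n# DO
        FOR i#:=j# TO n# DO // subtract the part already factored
          a(i#,j#):-sdot(j#-1,a(i#,1),1,a(j#,1),1)
        ENDFOR i#
        IF a(j#,j#)<=0 THEN RETURN // not positive definite
        r:=1/SQR(a(j#,j#))
        sscal(n#-j#+1,r,a(j#,j#),n#) // scale column j#, diagonal included
      ENDFOR j#
      fail#:=FALSE
    ENDPROC chol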

QR decomposition factors A into an orthogonal matrix Q and a triangular matrix R. QR decomposition is very stable and can be performed without pivoting. Since Q is orthogonal its inverse is just Q transpose. To solve the linear system we multiply the right hand side vector by Q transpose then solve the triangular system R. Q is computed in a special compact form and stored in the space below R. The decomposition is done in place in the storage used for A, plus an additional n storage locations. QR decomposition requires about twice as many flops as Gaussian Elimination.

There is a variation of the QR solution known as QR updating. The problem is solved a row at a time. A row of A can be read in from disk storage or calculated as needed. Only R needs to be stored in main memory, n*(n+1)/2 memory locations. R is initially the identity matrix and is updated as each row of A and its right hand side element are processed. Q is not stored, but the right hand side vector is sequentially multiplied by Q transpose. After all n rows have been processed the solution is found by simply solving the triangular system R. Since this method only needs half as much memory storage as LU decomposition, we can solve problems 40% larger in a limited memory space. However, the cost in flops is high. QR updating is actually best used for solving large overdetermined least squares problems.

Gauss-Jordan is a variation of Gaussian Elimination that reduces A to the identity matrix instead of to LU factors. By applying the same transformations to the right hand side that reduce A to the identity matrix, the right hand side becomes the solution at completion. Pivoting is required. Gauss-Jordan requires about 50% more flops than Gaussian Elimination and most codes use n*n memory storage. Since the LU factors are not computed we can't solve additional right hand side vectors later, or estimate the matrix condition number, or use iterative improvement techniques. It will solve multiple right hand sides that are available from the start.
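
The inner workings are easy to see in a short sketch. Assuming the matrix and right hand side are stored together in an augmented array a(n#,n#+1) and USE blas is in effect, a Gauss-Jordan reduction with partial pivoting might look like this:

    FOR k#:=1 TO n# DO
      p#:=k#-1+isamax#(n#-k#+1,a(k#,k#),n#+1) // pivot search, column stride n#+1
      IF p#<>k# THEN sswap(n#+2-k#,a(k#,k#),1,a(p#,k#),1) // swap the row tails
      IF a(k#,k#)=0 THEN STOP // singular
      sscal(n#+2-k#,1/a(k#,k#),a(k#,k#),1) // normalize the pivot row
      FOR i#:=1 TO n# DO // clear column k# above and below the pivot
        IF i#<>k# THEN saxpy(n#+2-k#,-a(i#,k#),a(k#,k#),1,a(i#,k#),1)
      ENDFOR i#
    ENDFOR k#
    // column n#+1 of a(,) now holds the solution x

Every pass over the k# loop touches all n rows, not just the ones below the pivot, which is where the extra flops relative to Gaussian Elimination come from.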

Quartersolve is a clever implementation of Gauss-Jordan(?) that solves the problem a row at a time like QR updating but only requires 1/4 n*n memory storage. With fixed available memory Quartersolve can solve a problem twice as large as Gaussian Elimination, but with a modest performance penalty: solving a problem of size 2n with Quartersolve takes about 12 times as long as Gaussian Elimination takes on a size n problem (simply doubling n under Gaussian Elimination would take 8 times as long).

My recommendation is to use Gaussian Elimination for solving dense general systems of linear equations when enough main memory is available and switch to Quartersolve for larger problems. For solving huge problems requiring external storage, a blocked version of QR decomposition might work best. Cholesky decomposition should be used for symmetric positive definite problems. Large problems are often sparse, containing lots of zeros that need not be stored. Specialized code exists for solving many sparse problems, particularly banded matrices, and many of these methods can be used on a C64. Codes for solving unstructured sparse problems are not very suitable for the C64 since they are complex and reduce the amount of memory available for solving the problem. However, large sparse problems can also be solved on the C64 by iterative methods such as the Gauss-Seidel and Conjugate Gradient algorithms.
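
As a taste of the iterative approach, a Gauss-Seidel sweep is only a few lines. The sketch below uses a dense array a(,) for clarity, with x() holding a starting guess; for a genuinely sparse problem the sdot call would be replaced by a sum over the stored nonzeros of row i#, and convergence is only guaranteed for suitable matrices, for example diagonally dominant ones.

    REPEAT // sweep until the largest correction is small
      big:=0
      FOR i#:=1 TO n# DO
        // x(j#) for j#<i# is already the new value, for j#>i# still the old one
        t:=(b(i#)-sdot(n#,a(i#,1),1,x(1),1)+a(i#,i#)*x(i#))/a(i#,i#)
        IF ABS(t-x(i#))>big THEN big:=ABS(t-x(i#))
        x(i#):=t
      ENDFOR i#
    UNTIL big<1E-6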

QUARTERSOLVE

Quartersolve is a useful method for solving general dense systems of linear equations that I discovered almost by accident while doing random research in the library. I have not seen any recent texts or papers mentioning this algorithm, nor any reference to it in the C64 literature. At least one older text mentioned it in passing, saying that the code was too long or complex. This is a valid point, since usually the code size directly subtracts from the problem storage. The code is longer than the Gaussian Elimination code, but in my implementation it only takes about 2K of main memory, which is a real advantage on the C64. With a C64 we can also put the entire code in an EPROM on a cartridge so the code size is of little concern.

I found Quartersolve in Ref. 1 (R. A. Zambardino, 1974), which credited the algorithm to Ref. 2 (A. Orden, 1960). I am a little uneasy describing the algorithm since I have not seen Ref. 2 or analyzed the algorithm. I have coded the algorithm, tested it, and used it to solve some large problems on a C64, up to n=90. Zambardino makes two interesting statements in Ref. 1. First, "The number of arithmetic operations is the same as for the Gaussian Elimination method." I am reasonably sure from the description that he meant Gauss-Jordan, which requires about 50% more arithmetic than Gaussian Elimination. Second, after processing the ith row only i*(n-i) storage locations are required to store the reduced matrix. The maximum of i*(n-i) is n*n/4, which occurs at i = n/2; as i increases further, the memory required shrinks again. Although n*n/4 memory locations must be allocated and dimensioned in an array at the start, Quartersolve always uses the first portion of the array contiguously and does not free up memory in holes scattered throughout the array. The C language could possibly use "heap storage" and release the memory for any other use as the procedure advances.

Now back to my initial memory-free claim. The large problem that I actually wanted to solve was: A*x=b, B*x=r, for r, given b and the square matrices A and B. Elements of A and B are most efficiently calculated at the same time. I could write B to the drive and read it back in after x is computed to calculate r, but I actually wanted to solve this repeatedly inside another loop and I did not want to read and write to a lousy 1541 that much. Using Gaussian elimination would require 2n*n storage. Using Quartersolve could require 1.25n*n storage. However, only n*n storage is needed, that for B. At the ith step the ith rows of A and B are calculated. The row of A is processed into the n*n dimensioned array B, filling it from the front. The corresponding row of B is stored in the same array B, filling from the end. As the process continues the storage Quartersolve needs never collides with the stored rows of B: after step i the front of the array holds at most i*(n-i) locations of reduced matrix while the back holds i*n locations of B, and i*(n-i) + i*n = n*n - (n-i)*(n-i), which never exceeds n*n. At the end we have computed x and all of B is stored in the array B. Simple multiplication produces r. So I can say with pride, at the expense of honesty, that I have solved A*x=b without any additional memory storage for A.

PROC slv(n#,nr#,i#,REF a(),REF c(),REF b(,),sdb#,REF sw#(),REF fail#) CLOSED
  // This routine solves a system of equations using the quartersolve
  // algorithm with partial pivoting.
  // It is called a "line at a time" and uses only
  // 0.25*n*n memory locations which enables larger problems to be solved.
  // The LU factorization is not available, nor a condition estimate.
  // n# is the dimension of the problem
  // nr# is the number of right hand vectors to be solved for.
  // b(,) is the right hand side columns
  // sdb# is the second dimension of the array b(,)
  USE blas
  USE strings
  q#:=i#-1; m#:=n#-q#; mm1#:=m#-1; fail#:=TRUE; ip1#:=i#+1
  IF i#=1 THEN //initialize pivot array
    FOR j#:=1 TO n# DO sw#(j#):=j#
  ENDIF
  FOR j#:=1 TO q# DO //adjust for previous pivoting
    k#:=sw#(j#)
    WHILE k#<j# DO k#:=sw#(k#)
    IF k#>j# THEN swap'real(c(j#),c(k#))
  ENDFOR j#
  FOR j#:=i# TO n# DO c(j#):-sdot(q#,c(1),1,a(j#-q#),m#)
  p#:=q#+isamax#(m#,c(i#),1)
  r:=ABS(c(p#))
  IF r=0 THEN RETURN
  fail#:=FALSE
  IF p#<>i# THEN
    swap'real(c(i#),c(p#))
    swap'integer(sw#(i#),sw#(p#))
    sswap(q#,a(1),m#,a(p#-q#),m#)
  ENDIF
  r:=1/c(i#)
  IF mm1#<>0 THEN sscal(mm1#,r,c(ip1#),1)
  FOR j#:=1 TO nr# DO b(i#,j#):=r*(b(i#,j#)-sdot(q#,c(1),1,b(1,j#),sdb#))
  FOR k#:=1 TO nr# DO saxpy(q#,-b(i#,k#),a(1),m#,b(1,k#),sdb#)
  IF mm1#>0 THEN
    t#:=1
    FOR j#:=1 TO q# DO
      r:=a(t#); t#:+1
      scopy(mm1#,a(t#),1,a(t#-j#),1)
      saxpy(mm1#,-r,c(ip1#),1,a(t#-j#),1)
      t#:+mm1#
    ENDFOR j#
    scopy(mm1#,c(ip1#),1,a(mm1#*q#+1),1)
  ELSE //unscramble solution from pivoting
    FOR j#:=1 TO nr# DO
      FOR k#:=1 TO n# DO c(sw#(k#)):=b(k#,j#)
      scopy(n#,c(1),1,b(1,j#),sdb#)
    ENDFOR j#
  ENDIF
ENDPROC slv
//
n#:=8; sdrh#:=1; nrh#:=1; nr#:=1
// a is of dimension n*n/4
DIM a(16), b(n#), rhs(n#,nrh#), sw#(n#)
FOR i#:=1 TO n# DO
  FOR j#:=1 TO n# DO b(j#):=2297295/(i#+j#-1)
  s:=0
  FOR j#:=n# TO 1 STEP -1 DO s:+b(j#)
  rhs(i#,1):=s
  slv(n#,nr#,i#,a(),b(),rhs(,),sdrh#,sw#(),fail#)
  IF fail# THEN
    PRINT "singularity detected at i=";i#
    STOP
  ENDIF
ENDFOR i#
FOR j#:=1 TO n# DO PRINT rhs(j#,1);
END

The Quartersolve algorithm is presented here as a COMAL 2.0 procedure "slv". COMAL is pretty much a dead language and I don't expect anyone to run this code. However, COMAL is a structured algorithmic language that is easy to read. You can readily translate it into the programming language of your choice. Slv is coded as a CLOSED procedure as a personal matter of style. An open procedure would execute faster. The arrays are passed by REFERENCE and do not allocate additional local storage. The main program is just an example for testing. It calls slv n times to solve the linear system. The test problem solves a scaled Hilbert matrix which is ill conditioned. In the absence of roundoff error the solution should be a vector of ones. I usually dimension a() to n#*(n#+1)/4. Slv is presented in its full generality, but you may want to make some simplifications.

Slv can handle multiple right hand side vectors with the two dimensional array b(,). In most applications you will only use a single vector, nr#=1, and you can make some simplifications by just using a one dimensional array.
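
For example, with nr#=1 and a one dimensional right hand side b(), the two loops over right hand side vectors inside slv collapse to single statements, roughly

    b(i#):=r*(b(i#)-sdot(q#,c(1),1,b(1),1))
    saxpy(q#,-b(i#),a(1),m#,b(1),1)

and the unscrambling loop at the end becomes

    FOR k#:=1 TO n# DO c(sw#(k#)):=b(k#)
    scopy(n#,c(1),1,b(1),1)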

Pivoting also complicates the code. Problems which are positive definite, or diagonally dominant, or sometimes just well conditioned can be safely solved without pivoting. Stripping out the pivoting code is straightforward and will shorten the code and speed execution.

Anything following // is a comment and can be deleted from your running code.

In COMAL 2.0 you can also "protect" the code which will strip out comments and other information to make a shorter running version.

The remaining discussion will concern COMAL 2.0 and the BLAS.

COMAL 2.0

COMAL 2.0 is an excellent programming language for the C64/128 and I can't describe all of its virtues here. It has one serious limitation. It does not use "continuation" lines, so line length is limited. This is most restrictive in function and procedure lines where it limits the number of parameters that can be passed. Line length is limited to 80 characters. However, if you use a separate text editor or word processor you can enter 120 character lines. Comal will actually execute tokenized lines up to 256 characters so the limitation is really in the editor rather than COMAL. Procedure and variable names can be quite long in Comal, but are kept short because of the line length limitation. "Quartersolve" was shortened to "slv" for this reason.

a:+t is a shorter, faster version of a:=a+t, and a:-t is a shorter, faster version of a:=a-t. This is most useful when "a" is an array element or an integer.

Comal 2.0 supports ML packages. A package is a collection of functions or procedures that can be called and executed. A package can be ROMMED and stored in EPROM on the Comal 2.0 cartridge. A package can also be loaded from disk and will normally be stored in a RAM location that COMAL does not use for normal programs. LINK "filename" will load and link the ML package to a Comal program. It will stay attached to the program when the program is saved and loaded, unless it is marked as ROMMED. The entire slv procedure could be coded in assembly language and placed in a package. The slv procedure uses two packages, strings and blas. The command USE packagename makes all of the functions and procedures of the package known. Alternatively, you could place the USE packagename command in the main program and put IMPORT procedurename inside all of the closed procedures that call procedurename.
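
For example, the second style looks something like this (the procedure name demo is just a placeholder):

    USE blas // in the main program: sdot, saxpy, ... known from here on
    PROC demo(n#,REF x(),REF y()) CLOSED
      IMPORT sdot // makes the package routine visible inside this CLOSED procedure
      PRINT sdot(n#,x(1),1,y(1),1)
    ENDPROC demo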

Slv calls the swap'real and swap'integer procedures from the strings package. The strings package is a ROMMED package on the Super Chip ROM.

Each does exactly what it says, e.g. swap'real(a,b) is the same as: t:=a; a:=b; b:=t.

Slv calls the sdot, isamax#, sswap, sscal, saxpy, and scopy routines from the blas package. The blas package is LINKed to the program, but it could, and should, be placed on EPROM.

Basic Linear Algebra Subroutines, BLAS

The BLAS were originally written for the Fortran language to speed execution and streamline code used for solving linear algebra and other matrix problems. The LINPACK routines, Ref. 3, use the BLAS and are perhaps the best known. The idea is that the BLAS routines will be highly optimized for a particular computer, coded in ML or a high order language. Some operating systems even include BLAS-like routines. Writing fast efficient programs is then a simple matter of selecting the best solution algorithm and coding it in a manner that makes best use of the blas routines. There are blas routines for single precision, double precision, and complex numbers. The level 1 BLAS perform operations on rows or columns of an array and typically do n scalar operations, replacing the innermost loop of code. There are also level 2 BLAS that perform n*n operations and level 3 BLAS that perform n*n*n operations. Nicholas Higham has coded most of the single precision level 1 blas routines and put them in a Comal 2.0 package. The Comal blas package is included on the Packages Library Volume 2 disk. I am not aware of ML blas routines coded for any other C64/128 languages although this is certainly possible and recommended.
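
For example, a level 2 operation such as the matrix-vector product y():=A(,)*x() can be built from n# level 1 calls, one sdot per row:

    FOR i#:=1 TO n# DO y(i#):=sdot(n#,a(i#,1),1,x(1),1)

A true level 2 or level 3 routine would do the same work in a single ML call with even less interpreter overhead.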

The Comal blas routines behave exactly the same way that the Fortran blas routines do, except that Fortran can pass the starting address of an array with just "a", while Comal requires "a(1)". The Comal blas will allow you to pass an array, by reference, of single or multiple dimensions and start from any position in the array. If you code the blas routines as ordinary Comal routines you have to pass additional parameters and have separate routines for single dimensioned arrays and two dimensional arrays. Note also that Fortran stores two dimensional arrays by columns, and Comal (like many other languages) stores two dimensional arrays by rows. If you translate code between Fortran and Comal using blas routines you will have to change the increment variables.

            Fortran                          Comal
    dimension c(n), a(ilda,isda)     DIM c(n#), a(lda#,sda#)
    scopy(n,c,1,a(i,1),ilda)         scopy(n#,c(1),1,a(i#,1),1)
    scopy(n,c,1,a(1,j),1)            scopy(n#,c(1),1,a(1,j#),sda#)

The first scopy copies array c into the ith row of array a. The second scopy copies array c into the jth column of array a. This is what scopy does in Fortran:
    subroutine scopy(n,sx,incx,sy,incy)
    real sx(1),sy(1)
    ix=1
    iy=1
    do 10 i = 1,n
      sy(iy) = sx(ix)
      ix = ix + incx
      iy = iy + incy
 10 continue
    return
    end

The Comal BLAS does exactly the same thing. If scopy were coded entirely in COMAL rather than as an ML package it would have to be different; the call would change.

scopy(n#,c(1),1,a(1,j#),sda#)
would have to become,
scopy(n#,c(),1,1,a(,),1,j#,sda#,sda#)
and the Comal procedure might be:
PROC scopy(n#, REF x(), ix#, incx#, REF y(,), iy#, jy#, sdy#, incy#) CLOSED
  iyinc#:=incy# DIV sdy#  //assuming y is dimensioned y(?,sdy#)
  jyinc#:=incy# MOD sdy#
  FOR i#:=1 TO n# DO
    y(iy#,jy#):=x(ix#)
    ix#:+incx#; iy#:+iyinc#; jy#:+jyinc#
  ENDFOR
ENDPROC scopy

Note that more information has to be passed to the procedure and used, information that the ML blas picks up automatically. Also we would need separate procedures to handle every combination of single and multi dimensional arrays. The Comal ML blas are indeed wonderful. For speed considerations this should also be left as an open procedure, or better yet just use in line code.

Here is a very simplified description of what each of the routines in the Comal BLAS package does.

sum:=sasum(n#,x(1),1)  Returns sum of absolute values in x().
  sum:=0
  FOR i#:=1 TO n# DO sum:+ABS(x(i#))

saxpy(n#,sa,x(1),1,y(1),1)  Add a multiple of x() to y().
  FOR i#:=1 TO n# DO y(i#):+sa*x(i#)

prod:=sdot(n#,x(1),1,y(1),1)  Returns dot product of x() and y().
  prod:=0
  FOR i#:=1 TO n# DO prod:+x(i#)*y(i#)

sswap(n#,x(1),1,y(1),1)  Swaps x() and y().
  FOR i#:=1 TO n# DO t:=x(i#); x(i#):=y(i#); y(i#):=t

scopy(n#,x(1),1,y(1),1)  Copy x() to y().
  FOR i#:=1 TO n# DO y(i#):=x(i#)

max#:=isamax#(n#,x(1),1)  Returns index of the element of x() with the
                          largest absolute value.
  t:=0; max#:=1
  FOR i#:=1 TO n# DO
    IF ABS(x(i#))>t THEN
      t:=ABS(x(i#)); max#:=i#
    ENDIF
  ENDFOR i#

sscal(n#,sa,x(1),1)  Scale x() by a constant sa.
  FOR i#:=1 TO n# DO x(i#):=sa*x(i#)

norm2:=snrm2(n#,x(1),1)  Returns the 2 norm of x().
  norm2:=0
  FOR i#:=1 TO n# DO norm2:+x(i#)*x(i#)
  norm2:=SQR(norm2)

srot(n#,x(1),1,y(1),1,c,s)  Apply Givens rotation.
  FOR i#:=1 TO n# DO
    t:=c*x(i#) + s*y(i#)
    y(i#):=c*y(i#) - s*x(i#)
    x(i#):=t
  ENDFOR i#

Bear in mind that each of these simple examples can be more complex as was given for scopy. You now have enough information to write your own BLAS routines in ML or the programming language of your choice, or to expand the BLAS routine calls in slv to ordinary in line code.
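
As an example of the last option, the statement in slv

    c(j#):-sdot(q#,c(1),1,a(j#-q#),m#)

could be expanded to the equivalent in line code

    t:=0
    FOR k#:=1 TO q# DO t:+c(k#)*a(j#-q#+(k#-1)*m#)
    c(j#):-t

which computes the same dot product, only more slowly.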

You can also apply the BLAS routines in creative ways besides just operating on rows or columns. For example you could create the identity matrix with:

  DIM a(n#,n#)
  a(1,1):=1; a(1,2):=0
  scopy(n#*n#-2,a(1,2),0,a(1,3),1) // zero the rest of the matrix
  scopy(n#-1,a(1,1),0,a(2,2),n#+1) // copy ones to the diagonal.
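
The same stride tricks work elsewhere. For instance, scopy(n#,a(1,1),n#+1,d(1),1) would pull the diagonal of a(,) into a vector d() (dimensioned to n#), and sum:=sasum(n#,a(1,j#),n#) would add up the absolute values of column j#.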

References

1. Zambardino, R. A., "Solutions of Systems of Linear Equations with Partial Pivoting and Reduced Storage Requirements", The Computer Journal Vol. 17, No. 4, 1974, pp. 377-378.

2. Orden A., "Matrix Inversion and Related Topics by Direct Methods", in Mathematical Methods for Digital Computers, Vol. 1, Edited by A. Ralston and H. Wilf, John Wiley and Sons Inc., 1960.

3. Dongarra, J. J., Moler, C. B., Bunch, J. R., Stewart, G. W., LINPACK Users' Guide, SIAM Press, Philadelphia, 1979.