OK, now that I have your attention, I lied. You can't solve dense linear systems of equations by direct methods without using memory to store the problem data. However, I'll come back to this memory-free assertion later. The main purpose of this article is to rescue a useful numerical algorithm, "Quartersolve", and also to provide a brief look at the COMAL programming language and the BLAS routines.
Linear systems of equations, A(,)*x()=b(), where A is a square matrix and x and b are vectors (or arrays), must often be solved for x in the solution of a variety of problems. The size or dimension of the problem is n, and just storing A requires 5*n*n bytes of memory, assuming C64/128 5-byte real variables. The preferred solution method is a form of Gaussian Elimination which requires 1/3 n*n*n multiplications. I'll ignore the additional n*n and n multiplies. For large problems our C64 has two serious limitations: small memory size and slow floating point arithmetic. Problems with n=10 can be computed easily. Problems with n=100 will require 100 times more memory and 1000 times more computing time. The computing time is not a real problem. I don't mind letting my computer run while I watch a movie, sleep, or go on a trip. Calculating or setting up the problem may take much longer than its solution anyway. Available memory is the practical limiting factor. After we use up available RAM we have to resort to other algorithms that use the disk drive to move data in and out of the computer. The 1541 drive is particularly slow and I would not want to subject it to undue wear and tear.
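To put numbers on that: an n=10 problem stores 5*10*10 = 500 bytes and needs about 10*10*10/3, or roughly 333, multiplies; an n=100 problem stores 5*100*100 = 50,000 bytes, which already exceeds the roughly 40K that BASIC leaves free, and needs about 333,000 multiplies.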
How big a problem do we need to be able to solve? In many cases the problem itself will fix n and there is no way to reduce it. In other cases you might be modeling a real continuous problem with a discrete number of elements. Ideally n would be infinite, but n=50 might be big enough for the problem solution. Consider calculating the aerodynamic potential flowfield around a body of revolution. You could fix points on the surface of the body (a meridian) and have a series of short line segments make up elements to approximate the shape. The larger n is, the more closely the smooth shape is approximated and the more accurate the computed solution becomes. n=100 might be a good choice for a simple shape. We could also use a "higher order" method. In this case we can substitute a curved line element for the straight line segment. Calculating the matrix elements will be more difficult, but n=40 curved elements might give a more accurate solution than 100 flat elements. Another consideration is the resolution of the solution. You might want to plot the solution on the 320 x 200 pixel hi-res C64 screen. 40 points might be too coarse and 320 might be overkill. We might also need to calculate slopes or derivatives from the computed solution, which will require more closely spaced solution points. There are often choices that you can make in modeling a system and selecting a solution algorithm so that a problem can be solved within the limits of a C64. There are often interesting tradeoffs in memory requirements and execution speed.
How big a problem can we solve with a C64? Using Quartersolve with assembly language we can probably do n=200 or more. If we are going to store the problem data on a single 1541 diskette and read it in a row at a time, we can only do n=182 or so. In practice I think n should be well under 100. Different operating systems and languages limit the amount of usable RAM: BASIC 40K, COMAL 2.0 30K, GEOS 23K, the initial disk loaded COMAL 0.14 10K... Solving a linear system may be only a small subproblem inside a large application program. The idea is to be able to solve reasonable sized problems using your preferred computing environment without having to do a lot of chaining or loading of separate programs. Quartersolve can free up a lot of memory for other routines or allow n to be doubled.
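The diskette figure follows from the capacity of the drive: assuming the usual 664 free blocks of 254 data bytes each, a 1541 holds about 168,656 bytes, and 5*n*n bytes of matrix data fit only for n up to about 183, a little less once file overhead is counted.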
The BLAS statement
c(i#,j#):+sdot(n#,a(i#,1),1,b(1,j#),sdb#)
and the loop
FOR k#:=1 TO n# DO c(i#,j#):+a(i#,k#)*b(k#,j#)
both calculate the same thing, a dot product with n# FLOPS. For large n# on a C64 the BLAS approach runs at about 320 FLOPS/sec. The overhead of calling the procedure from the interpreter is about the equivalent of 4 FLOPS. Of course modern computer performance is measured in MegaFLOPS/sec. with 8 byte reals (supercomputers run hundreds or thousands of MFLOPS/sec.). They also inflate the performance by counting the multiply and add as two FLOPS. In this article I use the "old flops", the number of multiplies.
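At roughly 320 FLOPS/sec. those counts translate into real time: an n=100 Gaussian Elimination needs about 333,000 multiplies, or around 17 minutes, while an n=200 Quartersolve needs about 1/2*200*200*200 = 4,000,000 multiplies, or roughly three and a half hours. These are rough figures, since interpreter overhead outside the BLAS calls is not counted.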
It may also be possible to code 6502 floating point arithmetic routines using lookup tables that perform faster than the built-in routines. We could also use the CPUs in the disk drives to do distributed processing. But this is beyond the scope of this article.
METHOD                   MEMORY     FLOPS
Gaussian Elimination     n*n        1/3 n*n*n
Cholesky Decomposition   1/2 n*n    1/6 n*n*n
QR decomposition         n*n        2/3 n*n*n
QR updating              1/2 n*n    2 n*n*n
Gauss-Jordan             n*n        1/2 n*n*n
Quartersolve             1/4 n*n    1/2 n*n*n

Gaussian Elimination is the preferred method when enough memory is available. In modern terminology this is LU decomposition, where A is decomposed or factored into a lower triangular matrix and an upper triangular matrix. Partial pivoting of rows or columns is an additional complication often required for a stable solution. After the LU decomposition you can readily solve for any number of right hand side vectors in n*n flops each. In addition you can calculate matrix condition number estimates and use iterative improvement techniques. The LU decomposition is done in place, overwriting the problem matrix A.
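To make the n*n figure concrete, here is a minimal sketch of the two triangular solves that follow an LU decomposition. It assumes the factors have overwritten a(,) in place with a unit lower triangle below the diagonal and no pivoting; the name lu'solve is mine, and this is not the slv routine of this article.

PROC lu'solve(n#, REF a(,), REF b()) CLOSED
  // forward substitution: solve L*y=b using the unit lower triangle
  FOR i#:=2 TO n# DO
    FOR k#:=1 TO i#-1 DO b(i#):-a(i#,k#)*b(k#)
  ENDFOR i#
  // back substitution: solve U*x=y using the upper triangle
  b(n#):=b(n#)/a(n#,n#)
  FOR i#:=n#-1 TO 1 STEP -1 DO
    FOR k#:=i#+1 TO n# DO b(i#):-a(i#,k#)*b(k#)
    b(i#):=b(i#)/a(i#,i#)
  ENDFOR i#
ENDPROC lu'solve

Each inner loop is a dot product, so with the blas package the FOR k# loops could be replaced by sdot calls; either way the cost is about n*n/2 multiplies per triangle, or n*n in all.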
Cholesky Decomposition is a specialized version of Gaussian Elimination for symmetric positive definite matrices only. Since A is symmetric we only need n*(n+1)/2 memory storage locations. The L and U triangular matrices are simply transposes of each other, so only one needs to be stored, and it is computed in place overwriting the original storage used for A. No pivoting is required. This algorithm cannot solve general nonsymmetric problems and is included only for comparison.
QR decomposition factors A into an orthogonal matrix Q and a triangular matrix R. QR decomposition is very stable and can be performed without pivoting. Since Q is orthogonal its inverse is just Q transpose. To solve the linear system we multiply the right hand side vector by Q transpose then solve the triangular system R. Q is computed in a special compact form and stored in the space below R. The decomposition is done in place in the storage used for A, plus an additional n storage locations. QR decomposition requires about twice as many flops as Gaussian Elimination.
There is a variation of the QR solution known as QR updating. The problem is solved a row at a time. A row of A can be read in from disk storage or calculated as needed. Only R needs to be stored in main memory, n*(n+1)/2 memory locations. R is initially the identity matrix and is updated as each row of A and its right hand side element are processed. Q is not stored, but the right hand side vector is sequentially multiplied by Q transpose. After all n rows have been processed the solution is found by simply solving the triangular system R. Since this method only needs half as much memory storage as LU decomposition, we can solve problems 40% larger in a limited memory space. However, the cost in flops is high. Actually QR updating is best used for solving large overdetermined least squares problems.
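As an illustration only (this is my own sketch, not code from the article), one new row can be folded into R with Givens rotations using the srot routine described later. Here r(,) holds the triangle in a full square array for simplicity, qtb() holds the right hand side already multiplied by Q transpose, a() is the new row and beta its right hand side element; the procedure name is made up.

PROC qr'update'row(n#, REF r(,), REF a(), REF qtb(), REF beta) CLOSED
  IMPORT srot
  FOR j#:=1 TO n# DO
    // choose a rotation that zeros a(j#) against the diagonal element r(j#,j#)
    d:=SQR(r(j#,j#)*r(j#,j#)+a(j#)*a(j#))
    IF d>0 THEN
      c:=r(j#,j#)/d; s:=a(j#)/d
      r(j#,j#):=d; a(j#):=0
      // rotate the rest of row j# of R against the rest of the new row
      IF j#<n# THEN srot(n#-j#,r(j#,j#+1),1,a(j#+1),1,c,s)
      // apply the same rotation to the right hand side
      t:=c*qtb(j#)+s*beta; beta:=c*beta-s*qtb(j#); qtb(j#):=t
    ENDIF
  ENDFOR j#
ENDPROC qr'update'row

After all n rows have been folded in, a back substitution on R and qtb() (as in the LU sketch above) produces x. The four multiplies per rotated element are where the 2 n*n*n count in the table comes from.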
Gauss-Jordan is a variation of Gaussian Elimination that reduces A to the identity matrix instead of to LU factors. By applying the same transformations that reduce A to the identity matrix to the right hand side, the right hand side becomes the solution at completion. Pivoting is required. Gauss-Jordan requires about 50% more flops than Gaussian Elimination and most codes use n*n memory storage. Since the LU factors are not computed we can't solve additional right hand side vectors later, or estimate the matrix condition number, or use iterative improvement techniques. It will solve multiple right hand sides that are available from the start.
Quartersolve is a clever implementation of Gauss-Jordan(?) that solves the problem a row at a time like QR updating but only requires 1/4 n*n memory storage. With fixed available memory Quartersolve can solve a problem twice as large as Gaussian Elimination, but with a modest performance penalty. Solving a 2n problem with Quartersolve would take 12 times longer (instead of 8) than Gaussian Elimination on a size n problem.
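The factor of 12 comes straight from the table: Quartersolve on a 2n problem costs 1/2*(2n)*(2n)*(2n) = 4 n*n*n flops, while Gaussian Elimination on the size n problem costs 1/3 n*n*n, and 4 divided by 1/3 is 12.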
My recommendation is to use Gaussian Elimination for solving dense general systems of linear equations when enough main memory is available and switch to Quartersolve for larger problems. For solving huge problems requiring external storage a blocked version of QR decomposition might work best. Cholesky decomposition should be used for symmetric positive definite problems. Large problems are often sparse, containing lots of zeros that need not be stored. Specialized code exists for solving many sparse problems, particularly banded matrices, and many of these methods can be used on a C64. Codes for solving unstructured sparse problems are not very suitable for the C64 since they are complex and reduce the amount of memory available for solving the problem. However, large sparse problems can also be solved on the C64 by iterative methods such as the Gauss-Seidel and Conjugate Gradient algorithms.
I found Quartersolve in Ref. 1 (R. A. Zambardino, 1974), which credited the algorithm to Ref. 2 (A. Orden, 1960). I am a little uneasy describing the algorithm since I have not seen Ref. 2 or analyzed the algorithm. I have coded the algorithm, tested it, and used it to solve some large problems on a C64, up to n=90. Zambardino makes two interesting statements in Ref. 1. "The number of arithmetic operations is the same as for the Gaussian Elimination method." I am reasonably sure from the description that he meant Gauss-Jordan, which requires about 50% more arithmetic than Gaussian Elimination. After processing the ith row only i(n-i) storage locations are required to store the reduced matrix. Max[i(n-i)] = n*n/4, and this maximum memory requirement occurs at i = n/2. As i increases further, the memory required is reduced. Although n*n/4 memory locations must be allocated and dimensioned in an array at the start, Quartersolve always uses the first portion of the array contiguously and does not free up memory in holes scattered throughout the array. The C language could possibly use "heap storage" and release the memory for any other use as the procedure advances.
Now back to my initial memory-free claim. The large problem that I actually wanted to solve was: A*x=b, B*x=r, for r, given b and the square matrices A and B. Elements of A and B are most efficiently calculated at the same time. I could write B to the drive and read it back in after x is computed to calculate r, but I actually wanted to solve this repeatedly inside another loop and I did not want to read and write to a lousy 1541 that much. Using Gaussian Elimination would require 2 n*n storage. Using Quartersolve could require 1.25 n*n storage. However, only n*n storage is needed, that for B. At the ith step the ith rows of A and B are calculated. The row of A is processed into the n*n dimensioned array B, filling it from the front. The corresponding row of B is stored in the same array B, filling it from the end. As the process continues Quartersolve "disuses" array B so that the rows of B never overwrite storage needed by Quartersolve. At the end we have computed x and all of B is stored in the array B. Simple multiplication produces r. So I can say with pride, at the expense of honesty, that I have solved A*x=b without any additional memory storage for A.
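A quick check shows why the two ends never collide: after the ith step Quartersolve's workspace at the front needs at most i*(n-i) locations and the i stored rows of B at the back need i*n, and i*(n-i) + i*n = n*n - (n-i)*(n-i), which never exceeds n*n.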
Slv can handle multiple right hand side vectors with the two dimensional
array b(,). In most applications you will only use a single vector,
nr#=1, and you can make some simplifications by just using a one
dimensional array.
Pivoting also complicates the code. Problems which are positive
definite, or diagonally dominant, or sometimes just well conditioned
can be safely solved without pivoting. Stripping out the pivoting code
is straightforward and will shorten the code and speed execution.
Anything following // is a comment and can be deleted from your running
code.
In COMAL 2.0 you can also "protect" the code which will strip out
comments and other information to make a shorter running version.
The remaining discussion will concern COMAL 2.0 and the BLAS.
a:+t is a shorter, faster version of a:=a+t, and a:-t is a shorter,
faster version of a:=a-t. This is most useful when "a" is an array
element or an integer.
Comal 2.0 supports ML packages. A package is a collection of functions
or procedures that can be called and executed. A package can be ROMMED
and stored in EPROM on the Comal 2.0 cartridge. A package can also be
loaded from disk and will normally be stored in a RAM location that
COMAL does not use for normal programs. LINK "filename" will load and
link the ML package to a Comal program. It will stay attached to the
program when the program is saved and loaded, unless it is marked as
ROMMED. The entire slv procedure could be coded in assembly language
and be placed in a package. The slv procedure uses two packages,
strings and blas. The command USE packagename makes all of the
functions and procedures of the package known. Alternatively, you could
place the USE packagename command in the main program and put IMPORT
procedurename inside all of the closed procedures that call
procedurename.
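Here is a small made-up example (not part of the slv listing) of that
second pattern, after LINK "blas" has attached the package; dot'demo
and x() are placeholder names of my own.

USE blas                      // makes sdot, scopy, ... known to the main program
DIM x(3)
x(1):=1; x(2):=2; x(3):=3
PRINT dot'demo(3,x())         // prints the dot product 14
FUNC dot'demo(n#, REF x()) CLOSED
  IMPORT sdot                 // a CLOSED procedure must import the routine it calls
  RETURN sdot(n#,x(1),1,x(1),1)
ENDFUNC dot'demo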
Slv calls the swap'real and swap'integer procedures from the strings
package. The strings package is a ROMMED package on the Super Chip ROM.
It does exactly what it says, e.g. swap'real(a,b) is the same as:
t:=a; a:=b; b:=t.
Slv calls the sdot, isamax#, sswap, sscal, saxpy, and scopy routines
from the blas package. The blas package is LINKed to the program, but
it could, and should, be placed on EPROM.
The Comal blas routines behave exactly the same way that the Fortran
blas routines do except that Fortran can pass the starting address of an
array with just "a", while Comal requires "a(1)". The Comal blas will
allow you to pass an array, by reference, of single or multiple dimensions
and start from any position in the array. If you code the blas routines
as ordinary Comal routines you have to pass additional parameters and
have separate routines for single dimensioned arrays and two dimensional
arrays. Note also that Fortran stores two dimensional arrays by
columns, and Comal (like many other languages) stores two dimensional
arrays by rows. If you translate code between Fortran and Comal using
blas routines you will have to change the increment variables.
COMAL 2.0
COMAL 2.0 is an excellent programming language for the C64/128 and I
can't describe all of its virtues here. It has one serious limitation.
It does not use "continuation" lines, so line length is limited. This
is most restrictive in function and procedure lines where it limits the
number of parameters that can be passed. Line length is limited to 80
characters. However, if you use a separate text editor or word
processor you can enter 120 character lines. Comal will actually
execute tokenized lines up to 256 characters so the limitation is really
in the editor rather than COMAL. Procedure and variable names can be
quite long in Comal, but are kept short because of the line length
limitation. "Quartersolve" was shortened to "slv" for this reason.Basic Linear Algebra Subroutines, BLAS
The BLAS were originally written for the Fortran language to speed
execution and streamline code used for solving linear algebra and other
matrix problems. The LINPACK routines, Ref. 3, use the BLAS and are
perhaps the best known. The idea is that the BLAS routines will be
highly optimized for a particular computer, coded in ML or a High Order
Language. Some operating systems even include BLAS like routines.
Writing fast efficient programs is then a simple matter of selecting the
best solution algorithm and coding it in a manner that makes best use of
the blas routines. There are blas routines for single precision, double
precision, and complex numbers. The level 1 BLAS perform operations on
rows or columns of an array and typically do n scalar operations,
replacing the innermost loop of code. There are also level 2 BLAS that
perform n*n operations and Level 3 BLAS that perform n*n*n operations.
Nicholas Higham has coded most of the single precision level 1 blas
routines and put them in a Comal 2.0 package. The Comal blas package is
included on the Packages Library Volume 2 disk. I am not aware of ML
blas routines coded for any other C64/128 languages although this is
certainly possible and recommended.
Fortran                          Comal
dimension c(n), a(ilda,isda)     DIM c(n#), a(lda#,sda#)
scopy(n,c,1,a(i,1),ilda)         scopy(n#,c(1),1,a(i#,1),1)
scopy(n,c,1,a(1,j),1)            scopy(n#,c(1),1,a(1,j#),sda#)
The first scopy copies array c into the ith row of array a. The second
scopy copies array c into the jth column of array a.
This is what scopy does in Fortran:
      subroutine scopy(n,sx,incx,sy,incy)
      real sx(1),sy(1)
      ix=1
      iy=1
      do 10 i = 1,n
        sy(iy) = sx(ix)
        ix = ix + incx
        iy = iy + incy
   10 continue
      return
      end
The Comal BLAS does exactly the same thing. If coded entirely in COMAL
rather than as a package it would have to be different. The call
scopy(n#,c(1),1,a(1,j#),sda#)
would have to become
scopy(n#,c(),1,1,a(,),1,j#,sda#,sda#)
and the Comal procedure might be:
PROC scopy(n#, REF x(), ix#, incx#, REF y(,), iy#, jy#, sdy#, incy#) CLOSED
  iyinc#:=incy# DIV sdy#  // assuming y is dimensioned y(?,sdy#)
  jyinc#:=incy# MOD sdy#
  FOR i#:=1 TO n# DO
    y(iy#,jy#):=x(ix#)
    ix#:+incx#; iy#:+iyinc#; jy#:+jyinc#
  ENDFOR i#
ENDPROC scopy
Note that more information has to be passed to the procedure and used,
information that the ML blas picks up automatically. Also we would need
separate procedures to handle every combination of single and multi
dimensional arrays. The Comal ML blas are indeed wonderful. For speed
considerations this should also be left as an open procedure or, better
yet, just replaced with in line code.
Here is a very simplified description of what each of the routines in
the Comal BLAS package does.
sum:=sasum(n#,x(1),1) Returns sum of absolute values in x().
sum:=0
FOR i#:=1 TO n# DO sum:+ABS(x(i#))
saxpy(n#,sa,x(1),1,y(1),1) Add a multiple of x() to y().
FOR i#:=1 TO n# DO y(i#):+sa*x(i#)
prod:=sdot(n#,x(1),1,y(1),1) Returns dot product of x() and y().
prod:=0
FOR i#:=1 TO n# DO prod:+x(i#)*y(i#)
sswap(n#,x(1),1,y(1),1) Swaps x() and y().
FOR i#:=1 TO n# DO t:=x(i#); x(i#):=y(i#); y(i#):=t
scopy(n#,x(1),1,y(1),1) Copy x() to y().
FOR i#:=1 TO n# DO y(i#):=x(i#)
max#:=isamax#(n#,x(1),1) Returns index of the element of x() with the
largest absolute value.
t:=0; max#:=1
FOR i#:=1 TO n# DO
  IF ABS(x(i#))>t THEN t:=ABS(x(i#)); max#:=i#
ENDFOR i#
sscal(n#,sa,x(1),1) Scale x() by a constant sa.
FOR i#:=1 TO n# DO x(i#):=sa*x(i#)
norm2:=snrm2(n#,x(1),1) Returns the 2 norm of x().
norm2:=0
FOR i#:=1 TO n# DO norm2:+x(i#)*x(i#)
norm2:=SQR(norm2)
srot(n#,x(1),1,y(1),1,c,s) Apply a Givens rotation to x() and y().
FOR i#:=1 TO n# DO
  t:=c*x(i#) + s*y(i#)
  y(i#):=c*y(i#) - s*x(i#)
  x(i#):=t
ENDFOR i#
Bear in mind that each of these simple examples can be more complex, as
was shown for scopy. You now have enough information to write your own
BLAS routines in ML or the programming language of your choice, or to
expand the BLAS routine calls in slv to ordinary in line code.
You can also apply the BLAS routines in creative ways besides just
operating on rows or columns. For example you could create the identity
matrix with:
DIM a(n#,n#)
a(1,1):=1; a(1,2):=0
scopy(n#*n#-2,a(1,2),0,a(1,3),1) // zero the rest of the matrix
scopy(n#-1,a(1,1),0,a(2,2),n#+1) // copy ones to the diagonal.
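This works because an increment of 0 keeps reading the same source
element (ix never advances in the scopy loop shown earlier), so the
single stored 0 or 1 is replicated, and because Comal stores a(,) by
rows, a destination increment of n#+1 steps one row down and one column
across, landing exactly on the diagonal.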
References
1. Zambardino, R. A., "Solutions of Systems of Linear Equations with
Partial Pivoting and Reduced Storage Requirements", The Computer Journal
Vol. 17, No. 4, 1974, pp. 377-378.
2. Orden, A., "Matrix Inversion and Related Topics by Direct Methods",
in Mathematical Methods for Digital Computers, Vol. 1, Edited by A.
Ralston and H. Wilf, John Wiley and Sons Inc., 1960.
3. Dongarra, J. J., Moler, C. B., Bunch, J. R., Stewart, G. W.,
Linpack Users' Guide, SIAM Press, Philadelphia, 1979.