Woe be to Commodore,
The marketers have finally killed it,
With a little bit of spending here,
And not much over there,
You know that Commodore has finally died.

And that's the way life should be,
The Commodore fanatics cried.
We'll probably be better off, they yell,
Let Commodore go to hell.
So the question is, who'll purchase Commodore?

Yes, for those of you who are unaware, Commodore has declared bankruptcy.
There are numerous rumours abound about who's interested in which divisions of
Commodore and such - there's still no definite word on the net. Several
factors can be blamed for Commodore's demise: Commodore never was successful
in marketing its products. The engineers would often turn out miracle machines
that the marketing department (I wonder if there even was one) promoted badly,
if at all. What put the nail in the coffin for Commodore was the lack of
financial capital to keep the company in operation. A lot of this news has
been discussed on GEnie, the newsgroup comp.sys.cbm, and in various magazines
as well, so I won't elaborate any further.
Speaking of magazines, Creative Micro Designs has started a magazine for
Commodore 8-bits called "Commodore World". The magazine is very well done.
There are other magazines that also deserve mention: dieHard, the Underground,
and many others. Commodore may be bankrupt, but the Commodore will live on
forever.
Speaking of living forever, Commodore Hacking is looking for articles on any
subject dealing with any aspect of technical programming or hardware on the
Commodore. If you've got an article already written, or an idea for one,
please feel free to e-mail me via duck@pembvax1.pembroke.edu. Many thanks to
the authors whose works make up this and previous issues.
It is time for another dose of trivia! As some of you may know, the Commodore
Trivia Editions are posted every month to the USENET newsgroups comp.sys.cbm,
alt.folklore.computers, and comp.sys.amiga.advocacy. This article is a
compilation of Trivia Editions 2-8, with only the questions appearing for
Edition 8. These are part of a Trivia Contest in which anyone may enter. If
you wish to participate in the newest Trivia contest (which is Edition 8 as I
write this), please send your answers to me at 'brain@mail.msen.com'.
The following article contains the answers to the January edition of trivia
($00A - $01F), the questions and answers for February ($020 - $02F), March
($030 - $03F), April ($040 - $04F), May ($050 - $05F), June ($060 - $06F),
and the questions for the July edition ($070 - $07F). Enjoy them!
Here are the answers to the Commodore trivia questions for January, 1994.
Commodore Trivia Edition #8!
[Editor's note: I'm wary of there being no voltage translation but am including
it because I do think you can get away with it... However, because this
magazine is free you get what you pay for... ]
Here's a modem interface schematic for the C=64/128. With it, and around
$5.00, you can use almost any Hayes-compatible external modem. To the best of
my knowledge, the 64 has a maximum baud rate (through the user port) of 2400,
and the 128's is 9600.
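Once the interface is built, the modem should be driven just like the normal
KERNAL RS-232 device (device 2). As a minimal sketch (not part of the original
write-up), a 2400 baud, 8N1 channel can be opened from BASIC like this:

   OPEN 2,2,0,CHR$(10)+CHR$(0) : REM 2400 BAUD, 8 DATA BITS, 1 STOP, NO PARITY
   PRINT#2,"ATZ"               : REM TALK TO THE HAYES-COMPATIBLE MODEM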
I DO NOT know who the original author of this is, but I re-wrote it in my
own words, hoping it will help someone. I CLAIM NO RIGHTS TO THIS ARTICLE.
The following article, initially written for a mailing list, describes
the Commodore REUs and explains how to program them.
The external memory cannot be directly addressed by the C64 with its 16-bit
address space--data has to be transferred to and from the main memory of the
C64. For that purpose, there is a built-in RAM Expansion Controller (REC)
which transfers memory between the C64 and the REU using Direct Memory Access
(DMA). It can also be used for other purposes.
Another problem is the recognition of the number of RAM banks (64 KByte
units) installed. The SIZE bit only tells if there are at least 2 (1700) or
4 (1764, 1750) banks installed. By trying to access and verify bytes in as
many RAM banks as possible, the real size can be determined. This can be
seen in the source to "Dynamic memory allocation for the 128" in Commodore
Hacking Issue 2.
In any case, the user of a program should be able to choose whether and which
REU banks are to be used.
The following code transfers one KByte containing the screen memory
($0400..$07FF) to address 0 in the REU:
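(The original listing is not reproduced here; the following is a minimal
reconstruction, assuming the usual REC register assignments: $DF01 command,
$DF02/$DF03 C64 base address, $DF04-$DF06 REU base address and bank,
$DF07/$DF08 transfer length, $DF0A address control.)

         lda #$00
         sta $df02      ; C64 base address  = $0400   (low byte)
         lda #$04
         sta $df03      ;                              (high byte)
         lda #$00
         sta $df04      ; REU base address  = $000000  (low byte)
         sta $df05      ;                              (high byte)
         sta $df06      ;                              (bank)
         sta $df0a      ; no fixed addresses: both sides count up
         sta $df07      ; transfer length   = $0400 = 1 KByte (low byte)
         lda #$04
         sta $df08      ;                              (high byte)
         lda #%10010000
         sta $df01      ; execute immediately, transfer C64 -> REU
         rts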
I think this subset of 17xx functions would be enough for a reasonable RAM
expansion. However, if full compatibility with 17xx REUs is desired, the more
complicated functions also have to be implemented.
Example:
One other application (used in GEOS) is copying C64 RAM areas by first
transferring them to the REU and then transferring them back into the desired
position in C64 memory. Due to the fast DMA, this is about 5 times faster
than copying memory with machine language instructions.
Interesting things can be done by fixing base addresses: by fixing the REU
base address, large C64 areas can be filled very quickly with a single byte
value. It is also possible to find the end of an area containing equal bytes
very quickly, e.g. for data compression.
Fixing the C64 base address is interesting if it points to an I/O-port.
Then, data can be written out faster than normally possible. It would be
possible to use real bitmap graphics in the upper and lower screen border by
changing the "magic byte" (the byte with the highest address accessible by the
VIC) in every clock cycle. For this, of course, the vertical border would
have to be switched off.
Generally, the REC could be used as a graphics accelerator, e.g. to copy
bitmap areas or other display data quickly into the VIC-addressable 16 KByte
area.
So this is a little article which attempts to take some of the mystery out
of three dimensional graphics. Most of the mathematics involved are very
simple, and the geometric concepts are very intuitive. Coding it up on a
C-64 is more of a challenge, especially when you want to make it fast, but
even then it's not too tough. George and I wrote the code in about a week
(and talked about it for about a week before that). Perhaps you will
appreciate this aspect more if you know that I haven't written 6510 code
since 1988, and until the last two days George had no computer on which to
test his ideas (and on the last day it died)!
The goal of this article is that by the time you reach the end of it you
will be able to do your own cool-looking 3d graphics on a C64. Some of you
may find it patronizing at times, but I hope that it is at a level that
everyone can enjoy and learn something from. And feel free to write to us!
The first part explains some of the fundamental theoretical concepts
involved. Mostly what is required is some geometric imagination, although
you need to know a little trigonometry, as well as how to multiply two
matrices together.
The second part deals with implementing the algorithms on a computer; since
this is C=Hacking, it is a good assumption that the implementation is on the
C-64! Most of the code is designed for speed, and lots of it is also
designed so that it can be called from BASIC!
Finally, an example program which uses all of the techniques presented here
is included, including source. The program rotates a cube in three
dimensions.
By itself the code is not the fastest in the world; what is important here
are the concepts. With a little fiddling, and maybe some loop unrolling,
you can get these routines going quite fast; for instance, a 26 cycle line
drawing routine is not hard at all using more sophisticated versions of
these algorithms. This time around the code is designed for clarity over
quality.
There are lots and lots of little details that are not specifically covered
by this article. But if you understand all of the concepts here it
shouldn't be too hard to figure out the problem when something goes wrong.
This material is the result of a week's worth of discussions on comp.sys.cbm
between George, myself, and several other people. So a big thank you to
everyone who helped us to knock these ideas out, and we hope you find this
to be a useful reference!
Incidentally, the ideas and techniques in this article aren't just for
drawing neat pictures; for example, a good application is the stabilization
of an orbiting satellite. The mathematical ideas behind linear
transformations are important in, for instance, the study of dynamical
systems (which leads to Chaos and all sorts of advanced mathematical
subjects).
But it also makes you look really cool in front of your friends.
Second, you need to know a little math. You need to know about polar
coordinates, and you need to know how to multiply two matrices together.
The ideas are all geometric, but the computations are all (of course)
mathematical.
Now, let us start thinking about a cube!
Let's first center our cube at the origin. Not only does this make it easy
to visualize, but to make our cube do things (like rotate) the way we want
it to we are going to have to require this. A cube has eight corners, all
connected to each other in a particular way.
There's no reason to make things complicated already, so let's put the
corners of the cube at x=+/-1, y=+/-1, and z=+/-1. This gives us eight
points to work with: P1=[1 1 1] P2=[1 -1 1] P3=[-1 -1 1] etc.
Minimalists may disagree, but a cube all by itself isn't all that exciting.
So how do we do stuff to it? For that matter, what kinds of stuff can we do
to it?
Before starting, we need to know some important trig formulas (of course,
everyone knows important formulas like these, but let me just remind you of
them):
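   cos(A+B) = cos(A)cos(B) - sin(A)sin(B)
   sin(A+B) = cos(A)sin(B) + sin(A)cos(B)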
We can write the point [x1 y1] as the point (r,t), where r is the distance
from the origin to the point, and t is the angle from the x-axis, measured
counter-clockwise. Therefore, x1 = r*cos(t) and y1=r*sin(t). If we then
rotate this vector by an amount B,
You may be wondering why we write this all in matrix form. The above matrix
equations are called linear transformations of the vector [x1 y1 z1]. There
are lots of deep mathematical concepts sitting right behind what looks to be
an easy way of writing several equations. Entire books are written on the
subject, and that is as good a reason as any for me not to go into detail.
But writing things in this way also offers us several computational
advantages. Rotations aren't the only linear transformations; let's say
that I want to rotate a point about the x-axis, shear it in the y-direction,
reflect it through the line theta=pi/5, and rotate it through the z-axis.
You could have one subroutine which did the rotation, and one that did the
shear, etc. But by writing it in matrix form, the entire process is simply
a series of matrix multiplications.
If you think about it you might realize that it really is the same thing no
matter which way you do it, but there is a fundamental difference in the
viewpoint of each method: one views it as a series of unrelated mathematical
operations each with its own individual function, while the other method
views it as a series of matrix multiplications so that it's basically the
same thing, over and over.
What this means for you is that if you want to rotate a point around the
x-axis, the y-axis, and the z-axis, you can take the matrix for each
transformation and multiply them all together, and then apply this one big
matrix to the point. One thing to be very aware of: in general, matrix
multiplication is not commutative. That is, if X is a rotation matrix about
the x-axis and Y is a rotation about the y-axis, it will almost never be
true that XY = YX. What this means geometrically is that if you take a
point, rotate it around the x-axis by an amount A, then rotate it around the
y-axis by an amount B, you will usually end up at a different point than if
you had first rotated it around the y-axis and then around the x-axis.
If you are interested in learning more about rotations and their uses, a
good place to start is almost any book on mechanics, for instance "Classical
Mechanics" by Goldstein. If you want to learn more about linear
transformations you can find it in any decent book on linear algebra, as
well as in a lot of physics texts. There is a good introduction in Margenau
and Murphy "The Mathematics of Physics and Chemistry", and there is a
semi-OK book on linear algebra by Goldberg.
Now we know the geometric and mathematical principles behind rotations in
three dimensions. But we want to visualize this on a computer, on a
two-dimensional screen: we need some way of taking a point in
three-dimensions and bringing it down to two dimensions, but in a way that
fools us into thinking that it really is three-dimensional.
What we need are projections.
Sit back in your chair and imagine for a minute or two. Imagine the three
coordinate axes. Now imagine that there is a pinhole camera, with its
pinhole lens at the origin, and its film at the plane z=1, parallel to
the x-y plane. Now we are going to take a snapshot of something.
Maybe a little picture would help:
By the way, if you stare at the above picture for a while, you may realize
that, in that geometric model, the object gets turned upside-down on the
film.
Now that we have a physical model of the equations that have been thrown
around, let's look at what we've been doing.
Consider a cube centered at the origin. Already there is a problem above if
z=0. What if one side of the cube has part of its face below the x-y plane
(negative z) and part above the x-y plane? If you draw another picture and
trace rays through the origin, you'll see one part of the face at one end of
the film (negative z, say), and the other part way the heck out at the other
end! And the two parts don't touch, either!
So we need to be careful. In the geometric picture above, we assumed the
object was a fair distance away from the lens. Currently we have our lens
at the center of our cube, so something needs to move! Since rotations are
defined about the origin we can't just redefine the cube so that the
vertices are out at, say, z=2 and z=3. So what we need to do is to move the
camera away from the cube. Or, if you want to think of it another way, we
need to move the cube away from the camera before we take a picture of it.
In this case the translation needs to be done in the z-direction. The new
projection equations are then
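   x' = d*x / (z - z0)
   y' = d*y / (z - z0)

(at least in one consistent form: here d is the distance from the pinhole to
the film, and z0 is the amount by which the object has been pushed away from
the camera).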
Now not only have we eliminated possible problems with dividing by zero, but
the mathematics now match the physical model.
Some of you might want to think about the less-physical situation of putting
the object behind the film, i.e. to the right of the film in the above
picture.
As usual, there are some deeper mathematics lurking behind these equations,
called projective geometry. Walter Taylor has written a book with a fine
introduction to the subject (at least, I think the book was published; my
copy is an old preprint).
One thing you need to understand is 8-bit signed and unsigned numbers. Here
is a quick review: an 8-bit unsigned number ranges from 0..255. An 8-bit
signed number ranges from -128..127 and is written in two's-complement form.
In an 8-bit two's-complement number bits zero through six work like they
usually do, but the seventh (high) bit represents the sign of the number in
a special way. To find the 8-bit two's-complement of a number subtract it
from 256. Example: what is -21 in two's complement notation? It is 256-21
= 235 = $EB. What is the complement of -21? It is 256-235 = 21 -- like
magic. Another way to think about it is like a tape counter: 2 is $02, 1 is
$01, 0 is $00, -1 is $FF, -2 is $FE, etc. And what is 24-21 in two's
complement? It is: 24 + -21 = $18 + $EB = $0103. Throw away the carry
(subtract 256) and we come out with... $03!
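On the 6510, by the way, negating a value is exactly this "subtract it from
256" operation; a tiny sketch (the value is assumed to be in the accumulator):

         eor #$ff       ; flip all the bits (one's complement)
         clc
         adc #$01       ; add one: A now holds 0 - A in two's complement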
Onwards!
First, we need to decide what language to use. You and I both know the
answer here: BASIC! Or maybe not. We need speed here, and speed on a
Commodore 64 is spelled a-s-s-e-m-b-l-y.
Next, we need to decide what kind of math we want to use, signed or
unsigned. Since the cosines and sines are going to generate negative and
positive numbers in funny ways, we definitely want to use signed numbers.
The alternative is to have lots of code and overhead to handle all the
cases, and if we put it in two's-complement form the computer does most of
the work for us.
How big should the numbers be? Since we are going for speed here, the
obvious choice is 8-bits. But this restricts us to numbers between
-128..127, is that OK? The size of our grid is 0..127 x 0..127, so this is
perfect! But it does mean that we need to be very careful. For instance,
consider the expression (a+b)/2. What happens if a=b=64? These are two
numbers within our range of numbers, and the expression evaluates to 64,
which is also in our range, BUT: if you evaluate the above in two's
complement form, you will find different answers depending on how you
evaluate it (i.e. (a+b)/2 will not give the same answer as a/2 + b/2, which
will give the correct answer).
Now we've got another problem: sine and cosine range between negative one
and one. To represent these floating point numbers as 8-bit signed integers
the idea will be to multiply all floating point numbers by a fixed amount.
That is, instead of dealing with the number 0.2, we use the number 64*0.2 =
12.8 = 13, and divide the end result by 64. As usual, we are trading
accuracy for speed, although it will turn out to make little difference
here.
Why did I pick 64? Obviously we want to pick some factor of two to make the
division at the end simple (just an LSR). 128 is too big. 32 doesn't give
us much accuracy. We also have to consider problems in expression
evaluation (see the above example of (a+b)/2), but as we shall see 64 will
work out nicely.
Now that we have accomplished the difficult task of decision making, we need
to move on to the simple task of implementation, starting with
rotations.
Well, yes, that would work, but... each rotation is nine multiplications.
Each multiplication involves a lot of work, plus we have to shift the result
by our fixed amount each time. We would not only be using huge amounts of
time, but we would lose a lot of accuracy in the process. Computationally
speaking, this is called a "bad idea".
Once again, mathematics saves the day: here is where we get the payoff for
writing the equations as an algebraic system (a matrix). If X is the
transformation around the x-axis, Y the transformation around the y-axis,
and Z the transformation around the z-axis, then this is the equation to
transform a vector v by rotating the point first around the z-axis, then the
y-axis, then the x-axis:
To multiply the three rotation matrices together, we need to take advantage
of a few trigonometric properties. We need the two identities mentioned
earlier:
Here is also where we need to be extremely careful. The first entry in the
matrix M is the example I gave earlier about evaluating signed numbers. How
do we overcome this?
Easy! Notice in the matrix M that, apart from element C, every term is a
sine or a cosine divided by two. This is the only part of the program which
uses sines and cosines, so why not use the offset floating-point values
divided by two? This will make more sense in a minute.
The question arises: the above is all well and good, but how do we take the
sine of a number and make it fast? The answer of course is to use a table.
We used a BASIC routine to calculate the table for us (and to store the
numbers in two's-complement form). Calculate the sine and cosine of every
angle you want ahead of time, and then just look up the number.
The tables contain the values of sine and cosine multiplied by 64 (our
floating-point offset) and then divided by 2. Since the value is already
divided by two, the above calculation becomes at the same time faster and
safer: faster because I don't have to keep dividing by two, and safer
because I don't have to worry so much about overflow. (It can still happen,
but it won't if you're careful).
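The table builder itself isn't listed here, but a minimal BASIC sketch of the
idea looks something like this (the table addresses 49152 and 49408, and the
choice of 256 angle steps per full circle, are just assumptions for
illustration):

   10 REM BUILD SIN/COS TABLES: VALUES ARE SIN(A)*64/2 IN TWO'S COMPLEMENT
   20 FOR I=0 TO 255
   30 A=I*3.14159265/128
   40 S=INT(32*SIN(A)+.5):C=INT(32*COS(A)+.5)
   50 IF S<0 THEN S=S+256
   60 IF C<0 THEN C=C+256
   70 POKE 49152+I,S:POKE 49408+I,C
   80 NEXT I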
Here is an example of how to calculate elements A and B above:
That's all there is to calculating the rotation matrix. Next we have to
actually rotate the points. We have another decision to make: do we take
the rotated object and rotate it by a little amount, or do we take the
original object and rotate it by a big amount? Because of the way we have
set things up, the answer is clear: we want to increment the angle at each
step, and rotate the original object by this large angle (besides,
geometrically you can see that it will look much nicer this way).
For a cube this is easy. The points are P1=[1 1 1] P2=[1 -1 1]
P3=[-1 -1 1] P4=[-1 1 1] P5=[1 1 -1] P6=[1 -1 -1] P7=[-1 -1 -1] P8=[-1 1 -1].
This means that the rotations are just a series of additions and/or
subtractions of A,B,C,...,I! The code implements this in a funny way,
partly to make these procedures easy to see, but mostly to make debugging
the code much easier. It is much faster to do each rotation separately,
i.e.
Still worried about overflow? If you think about it geometrically, you will
see that the maximum value any part of a rotated coordinate can have is
sqrt(3). Since we have offset our numbers by 64, this means that, for
instance, the maximum possible value for A+B+C is 64*sqrt(3) which is about
111 -- in range of a signed 8-bit number with a little cushion for
additions. In other words, we ought to be safe from overflow.
So far we have managed to rotate all the coordinates -- a complicated series
of matrix operations involving trigonometric functions -- by just using a
bunch of additions and a bunch of table lookups! Not too bad! Now we just
need to project the point.
Well, wait a minute, maybe we can do something. How about using a table for
1/(z-z0), and then just use a multiply? Oh yeah, that's a really small
number. As long as we're using a table, why not incorporate the d into it?
Come to think of it, if the number weren't multiplied by the offset 64 it
would be a pretty reasonable number!
So, what we want to do is to construct a table of numbers such that when the
program calls
Since z is a signed number, we ought to add 128 to it to convert it into an
index. Does this have any meaning in two's-complement arithmetic? Yup. We
also need to remember that floats are offset by 64, and that the highest
value a signed number can have is 127.
Here is how the table is generated:
You may have noticed that z0-z is used, and not z-z0 like in the projection
equation. If you put on your geometric thinking cap for a moment, you will
realize that the way the projection equations were set up causes the image
to become inverted. To uninvert it, we need to multiply by negative one.
So we just add that step into the table.
But we still need to do a multiplication!
So, let's say we want to multiply the number P by the number R. If P has a
high bit in position N, we can start out with 256*R, and bit-shift it to the
right 8-N times. Why in the world would we do this? Because we can
pipeline the process in a way somewhat similar to the way a Cray
supercomputer multiplies two vectors together -- yes, I'm comparing your
C-64 to a Cray! Watch:
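As a small worked illustration of the idea (not the original trace): say
P = %00000101 (5) and R = 20, so 256*R = 5120. Shifting 5120 to the right once
per bit position, and adding the current value whenever the corresponding bit
of P is set, picks up 20*4 = 80 for bit 2 and 20*1 = 20 for bit 0, giving
100 -- which is indeed 5*20.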
See the source code for an implementation of this.
Note that we could do a divide in a similar fashion, except shifting left
instead of right. Since we don't need a divide routine for our calculations
we don't need to worry about this.
Now we have all the tools we need to implement the mathematics. There is
still one part of the program left: drawing the thing!
To do this, we first need to find out which is larger:
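   dx = x2-x1   or   dy = y2-y1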
If we take k steps in x before taking a step in y, then we want to choose k
such that
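   k = dx/dy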
Of course, if dy were larger than dx, the idea would be the same, but now k
= dy/dx. k is never smaller than one.
In the code fragment which follows it is assumed that x2>x1, y2>y1, and
dx>dy. Obviously, then, any self-respecting line drawing routine needs to
handle all of these cases. One way is to have eight different routines, one
for each case. Another way (the way used by the program), is to force
x2>x1, so that there are only four cases to deal with. For the plotting
routine which we use, this turns out to be necessary. If you think about
it, you can come up with some more clever ways to deal with this.
Note that you also need to figure out what column the first point is in:
this algorithm knows how to walk forwards, but it doesn't know where it
should start.
The code is next to some similar BASIC7.0 code to make it easier to
understand.
The code can be sped up in a lot of ways. For one thing it could be made
self-modifying. All variables could be stored in zero page. In fact, the
entire routine could be stored in zero page! Also, with a little change in
the logic (and a subsequent change in the plotting routine) you can
eliminate the branching instruction. For the sake of clarity we don't do
that here; maybe in another paper ;-).
Also note that the largest value x can take on in this routine is 255. For
the way we are going to plot things, this won't matter. But a more general
routine needs a way to overcome this. One way would be to draw two separate
lines.
Note also that this could easily be used in a BASIC program; even a BASIC2.0
program. (If you would like the DATA statements to do this just drop us a
line, er... contact us).
Now, this routine works fine, but for drawing a line on a computer it
doesn't always look great. For instance, what happens if we draw a line
from 1,1 to 11,3? k=dx/dy=5, so we will take five steps in x and then a
step in y, then five more steps and... a step in y at the very last point!
So our line doesn't look so good -- we have a little square edge at the
endpoint.
One way to fix this is to trick the computer into thinking it needs to take
an extra step in y by letting k=dx/(dy+1), and being careful in keeping
track of our counter. The big problem with this method is that it produces
the square end-pixels when dx and dy are nearly the same (slope ~= 1).
A better way to fix this is to initialize the counter not to 0 (in our case,
256-dx), but instead to DX/2 (256-DX/2 in our case). This has the effect of
splitting one of the line segments between the two endpoints, and looks good
for all slopes. This is what the program does. In fact, as far as I can
tell, this is what BASIC7.0 does too!
There is still a part of our routine missing, however...
For this project, speed is the name of the game, and for speed we don't want
to use normal bitmapped graphics. Instead, we want to use character
graphics. The advantages of using a custom character set are:
For the second advantage, it is much faster to poke a character into screen
memory than it is to calculate and plot all 64 bits in a character. This
way, VIC does all the hard work for us. Also, if we are clever, we can
exploit several aspects of our cleverness to make plotting a single point
much easier.
Character graphics also give us a very simple means of double buffering: we
can just plot into two different character sets and tell VIC-II to move
between them. No raster interrupts here! If the two character sets were at
$3000 and $3800, here is how to switch between them:
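The original fragment isn't reproduced here; a minimal sketch, assuming the
screen sits at $0400 in the default VIC bank, is just two different values
for $D018:

         ; show the buffer at $3000:
         lda #$1c       ; upper nibble: screen at $0400; lower: charset at $3000
         sta $d018

         ; show the buffer at $3800:
         lda #$1e       ; charset at $3800
         sta $d018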
The last is less obvious. A normal hires bitmap is organized like the
following:
This brings us to the primary disadvantage of using a character set: our
pictures are pretty small. TANSTAAFL.
Now we could just go merrily plotting into our character bitmap, but as
usual a little thought can yield some impressive return. The first thing to
notice is that the maximum value for y is 127; the only thing that sets the
high bit is the x-coordinate, and then only when it crosses a column (just
look at the above memory map if you don't see it).
Therefore, if we could keep track of the bit position of x, we could tell
when x crossed a column, and just add 128 to the base address. Not only
that, but we also know to increase the high byte of the pointer by one when
we have crossed two columns.
The logic is as follows:
In BASIC:
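The original BASIC isn't reproduced here; a rough sketch of the idea, with
made-up variable names (XB = bit count within the current column, BL/BH = low
and high bytes of the base pointer), might look like:

   100 XB=XB+1                  : REM TOOK A STEP IN X
   110 IF XB<8 THEN 140         : REM STILL INSIDE THE SAME COLUMN
   120 XB=0:BL=BL+128           : REM CROSSED A COLUMN: ADD 128 TO THE BASE
   130 IF BL>255 THEN BL=BL-256:BH=BH+1 : REM EVERY SECOND COLUMN BUMPS THE HIGH BYTE
   140 REM ... PLOT INTO THE BYTE AT (BH*256+BL)+Y ...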
When combined with the earlier line drawing routine, this gives an average
time of 38 cycles or so (with a best time of 34 cycles); six of those cycles
are for PHA and PLA, since the line drawing routine uses A for other things.
Like most of the code, you can improve on this method if you think about it
a little. Most of the time is spent checking the special cases, so how can
you avoid them? Maybe if we do another article, we'll show you our
solution(s).
Now, this method has a few subtleties about it. First, what happens if the
first point to be plotted is x=0, or x=8? The above routine will increase
the base pointer right off the bat. This case needs to be taken care of.
Second, the above assumes that you always take a step in x. What happens if
we are taking a big step in y? Let's say that we take ten steps in y for
every step in x. What will the above plotter do if x takes a step across a
column, and then doesn't change for a while? Look to the source code for one
solution to this problem.
So that's all there is to it!
What's next? In the future you will undoubtedly see lots of things from
George and myself, both the written word and the coded byte. Maybe we will
see something from you as well?
Also, this article may be a bit too weird for some people to grasp. Please
bear with me. This article is as much a scratchpad for my rough ideas about
the kind of system I want to build as it is an explanatory article. You may
need a Master's degree in software systems to understand some of the things I
talk about. This article makes references to the ACE operating system, which
is available via anonymous FTP from "ccnga.uwaterloo.ca". ACE is a
uni-tasking OS that has a Unix-like flavor. (Yeah, yeah, yeah, I'm still
working on the next release...).
One more note about the article: it is written in the present tense ("is")
rather than the future tense ("will"), since the present tense is easier to
write and understand. The system, however, does not presently exist and the
design may change in many ways if the system ever is made to exist.
A "distributed" OS is one that runs on a collection of independent computers
that are connected by a network. Unlike a "network" OS, a distributed OS
makes all of the independent computers look like ONE big computer. In
general, a distributed system, as compared to a centralized one (like MS-DOS
or Unix), gives you "a higher performance/price ratio (`more bang for the
buck'), potentially increased reliability and availability because of partial
failure modes (if protocols are implemented correctly), shared resources,
incremental growth and online extensibility, and [a closer modelling of] the
fact that some applications are inherently distributed." This is quoted from
my Ph.D. thesis about distributed systems of powerful workstations. To us, a
distributed system means increased modularity and ease of construction,
sharing devices like disk drives and resources like memory between multiple
computers, and the true parallelism of running different processes on
multiple computers at the same time. Not to mention "coolness".
A "microkernel" OS is one that has the smallest kernel possible by pushing
higher-level functionality (such as the file system) into the domain of user
processes. The kernel ends up being small, fast, and easy to construct
(relative to a monolithic kernel).
So why would we want our OS to have the features of Multitasking,
Distributed, and Microkernel? Because I'm designing it, and that's what
interests me. The ease-of-construction thing is important too. Another
important question is "can it be done?". The answer is "yes." And it will
be done, whenever I get around to it (one of these lifetimes).
The C-128 also has relocatable zero-page and stack-page pointers. This
feature is absolutely essential, and you could not make an effective
multitasking OS for any 6502 machine without it. I wonder if Commodore
thought about this prospect when designing the MMU chip...
The last C-128 feature is *speed*. The C128 has a 2 MHz clock speed when
used with the 80-column VDC display. This is enough speed, when harnessed
properly, to make your applications zip along. For an example of speed that
is not harnessed properly, see Microsloth Windoze. The VDC display is very
nice, too. Only the VDC display should be supported by a "real" OS, not
the VIC display.
The required network connects the user ports of C-128's into a bus. I'm not
completely sure how to connect more than two C-128's to this bus (I'd
probably need some diodes or logic gates), so the initial version of this
network hardware will have a maximum of two hosts. We will still be careful
to make the software of the system easily reconfigurable for any number of
hosts.
You will need two appropriate connectors and some 14-conductor ribbon cable
to build the network. One of my connectors is a 44-conductor connector of
the type used with the VIC-20 expansion port that I sawed in half and the
cable is some old junk ribbon cable that was lying around that I removed some
of the conductors from. Any old junk will do. You're probably best off if
your cable is less than six feet long (2 metres). The network is wired up as
follows:
You can also write your own applications for this network, since programming
it is quite easy; the hardware takes care of all of the handshaking. To
blast 256 bytes over the network from C128-A to C128-B, you would:
There is probably no need to do error checking on the data transmitted over
the network since the cable should be about as reliable as any of the other
cables hanging out the back of your computer (and none of them have error
checking (except maybe your modem cable)).
In reality, there is only 1 CPU in the 128 (well, that we are interested in
using), so its time is divided up and given out in small chunks to execute so
many instructions of each program before moving onto the next one. The act
of changing from executing one program to executing another is called
"context switching", and is a bit of a sticky business because there is only
one set of processor registers, so these must be saved and restored every
time we switch between processes. Effectively, a process' complete "state"
must be restored and saved every time it is activated and deactivated
(respectively). Since the 8502 has precious few internal registers, context
switching can be done quite efficiently (unlike with some RISC processors).
The maximum period of time between context switches is called the "quantum"
time. In our system, the quantum is 1/60 of a second. It is more than just
a coincidence that this period is the same as the keyboard-scanning period.
Depending on priorities and ready processes, a new or the same old process
may be selected for execution after the context switch of the 60-Hz
interrupt.
Splitting the time of one processor among N processes may sound like we're
simply making each one run N times slower, which may be unbearably slow, but
that is not generally the case. One thing that a CPU spends a lot of its
time doing is *waiting*. Executing instructions of a program requires the
full attention of the CPU, but waiting requires absolutely no CPU attention.
As an example, your speedy computer spends a lot of its time waiting for its
slow-as-molasses-launching-into-orbit user to type a key. If we were to put
the process that asks the OS for a keystroke into a state of suspended
animation, then the CPU time that process would have consumed in a
busy-waiting loop can be better spent on executing the other processes that
are "ready" to execute. In practice, many processes spend a lot of their
time waiting, so "multi-programming" is a big win.
There are a number of things other than keystrokes that processes may wait
for in our envisioned system: modem characters, disk drive operations (if
they are custom-programmed correctly), mouse & joystick movements, real-time
delays, and interactions with other processes. The OS provides facilities
for processes to communicate with one another when they cannot perform some
operation in isolation (i.e., when they become lonely).
A process has the following things: a program loaded into the internal memory
of the 128, its own zero page and processor stack page, and the global
variables of its program. A process can also own "far" memory (below) and
various other resources of servers throughout the distributed system. The
process is the unit of ownership, as well as execution. Processes also have
priorities that determine how much execution time they are to be given
relative to other processes in the system.
Processes are allocated memory at the time of startup at a random location on
some random bank of internal memory on the 128. The biggest challenge here
is to relocate the user program to execute at the chosen address. The kernel
interface is available to programs on all internal banks of memory.
Some useful software already exists for ACE, and ACE has a well-defined
interface and well-behaved programs. The ACE interface may need to evolve a
little too. The ultimate goal would be to have the same API for both systems
so you could run software with the more functional BOS if you have a C128 and
80-column monitor, or you could use the less functional ACE if you didn't
have all this hardware.
The software wouldn't be "binary-identical" since the operating systems
provide quite different program environments and requirements, but the two
systems should be application-source-code compatible.
Because of the vast differences between a microkernel and a monolithic
kernel, all of the ACE system calls would be redirected to user-library calls
in BOS. This user library would then carry out the operations accessing
whatever system services are needed.
Only the basic memory-accessing code is provided by the kernel; higher-level
memory management, such as dynamic memory allocation and deallocation, is
handled by the Memory Server (below).
Unlike ACE, BOS provides the fundamental concept of "distributed memory".
The Fetch and Stash primitives can also access the memory of a remote machine
in a completely user-transparent way. Thus, a far-memory pointer can be
passed between processes on different machines, and the memory that the
pointer refers to can be read and written with exactly the same programming by
both
processes. This feature can be dangerous without a synchronization mechanism,
so this memory sharing is intended to be used only with the communication
mechanism.
There should not be an unacceptable overhead in accessing remote memory on
the 128 (as there would be with bigger computers), because far-memory
fetching is quite expensive even for local memory (relative to near
memory), so an application will optimize its far memory accessing, and the
necessary interrupt handling on the remote machine can be done with very
little latency because of the "responsiveness" of the 6502 processor design.
Send( processId, requestBuffer, reqLength, replyBuffer, maxRepLength ) : err;
Receive( ) : processId, requestBuffer, reqLength, replyBuffer, maxRepLength;
Reply( processId ) : err;
Send() is used to transmit a message to a remote process and get back a reply
message. The sending process suspends its execution while it is waiting for
the remote process to execute its request. A message consists of an arbitrary
sequence of bytes whose meaning is completely defined by the user. The
message contents are stored in a buffer (hunk of memory) before sending, and
a length is specified at the time of sending. A buffer to receive the reply
message must also be allocated by the sender and specified at the time of
sending. To save us from the overhead of copying message contents to and fro
unnecessarily, only pointers to the buffers are passed around and the far
memory primitives are used to access message contents. This also works
across machine boundaries because of the distributed-memory mechanism
described above.
Receive() is used to receive a message transmitted by a remote process to the
current process. The receiver blocks until another process does a
corresponding Send() operation, and then the request and reply buffer
pointers and lengths are returned. The receiver is expected to fetch the
contents of the request message, process the request, prepare the reply
message in the far-memory reply buffer, and then execute the Reply()
primitive. There are no restrictions on what the receiver can do between
receiving a message from a process and issuing the corresponding reply
message. So, it could, for example, receive and process messages from other
processes until it gets what it needs, compute pi to 10,000 decimal places,
and then reply to the process that sent a message to it a long time ago.
Reply() is used to re-awaken a process that sent a message that was
Receive()d by the current process. The current process is expected to have
set up the far-memory reply buffer in whatever way the sending process
requires prior to issuing the Reply().
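To make the flow concrete, here is a sketch of one complete exchange, written
in the same notation as the primitives above (the process ids, buffer names
and lengths are made up for illustration):

   client:  err = Send( serverId, reqBuf, 32, repBuf, 64 )
            ...the client is now blocked...

   server:  clientId, reqBuf, reqLen, repBuf, maxRepLen = Receive( )
            ...fetch the request from reqBuf via the far-memory primitives,
               do the work, stash the result into repBuf...
            err = Reply( clientId )

   client:  ...Send() returns; the reply is waiting in repBuf...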
The expected usage of buffers is for the sender to use near memory for the
request and reply buffers and access them as regular near memory to construct
and interpret request and reply messages. The receiver will access the
buffers as far memory (which they may very well be since processes are
allowed to execute on different banks of internal memory and even on
different machines), and may wish to fetch parts of messages into near memory
for processing. The use of far pointers makes it so that data is copied only
when necessary.
And that's it. You only have this RPC mechanism for communicating with other
processes and for all I/O. Well, that's not entirely true; the RPC stuff is
hidden behind the application program interface, which provides such facades
as the Open and Read system calls, and a very low-level interrupt
notification mechanism which a user process will not normally use.
A useful implication of using servers rather than having user processes
execute inside of the kernel is mutual exclusion. Servers effectively
serialize user requests. I.e., user requests are serviced strictly
one-at-a-time. This is important because some of the variables that need to be
manipulated in order to provide service must not be manipulated by multiple
processes simultaneously or you may get inconsistent results. To provide
mutually exclusive access to shared variables in a monolithic system, either
ugly and problematic semaphores must be used, or more-restrictive, simpler
mechanisms like allowing only one user process to enter the kernel.
The server is highly integrated with the kernel, and it is able to do things
that regular user processes cannot (like manipulate kernel data structures),
but it still functions as an independent entity, as a regular user process.
Its code is physically a part of the kernel for bootstrapping purposes, since
it can hardly be used to start itself.
When you wish to run a new program, a request message is sent to the process
server. This message includes the filename of the program to run, the
arguments to the new program, environmental variables, and a synchronous/
asynchronous flag. If you want to run a sub-process synchronously, the
process server does not reply to your request until the new process
terminates. If you select asynchronous mode, the process server replies to
your request as soon as the new process is created. Both of these modes are
quite useful in Unix (although Unix has a more complicated mechanism for
providing the service) (think "&" and no-"&" on command lines), so they are
provided here.
The process server allocates and initializes the kernel data structures
necessary for process management, and then starts the process running
bootstrapping code in the kernel. Since this code is in the kernel, it is
known to be trustworthy. The process then bootstraps itself by opening the
program file, reading the memory requirements, allocating sufficient memory,
reading in the program file, relocating the program for whatever memory
address it happened to load in at (bank relocation is no problem) and
far-calling the main routine (finally). The return is set up on the stack to
kill the process.
Since the process bootstraps itself, the process server's involvement in the
process creation procedure is minimal, and the process server is ready to
process new requests with minimal delay (maximal responsiveness). This
self-bootstrapping user process concept comes from my Master's Thesis.
Another advantage of having a process server is that you can start a process
running on any machine from any other machine in exactly the same way you
would start a process on the local machine; we have achieved transparency.
The process server also takes care of process destruction (exit or kill) and
provides other less-significant services, like reading and setting the
current date and time. The mechanism by which process destruction is done is
similar to the self-bootstrapping idea and is discussed, probably
inappropriately, in the next section.
The server is located by having a well-known address. That is, the process
id is a constant and hard-coded into clients. Well-known addresses are small
integer values, for each machine (a machine-id is encoded into process ids),
and these integers are indexes into a small look-up table with the actual
addresses for well-known addresses, so the process ids aren't pinned but can
be used as if they were pinned.
There is also a call that deallocates all memory owned by a certain user
process. This call is normally only called by the process server*, since the
memory of the user program is deallocated along with the rest of the
process' memory. A record is kept internally for each process about what
types and banks (later) of memory it has used so that bulk deallocation can
be done efficiently when the process exits.
A client process can also ask that far memory be allocated on a remote
machine. Remote memory is relatively slow to access, but it can be
convenient when you need LOTS of memory for a process. The obvious way to
get at this remote memory is to simply send a message directly to the remote
memory server of the machine you want to allocate memory on, and this does
indeed work, so this is what we will do. But this by itself doesn't record
the fact that you have allocated memory on a far machine, and we don't want
to waste effort when freeing all of the memory, both local and remote, that
a process owns when it terminates; i.e., we don't want to send deallocation
requests to all remote memory servers just to be sure.
There are a few alternatives for solving this problem, but I think this is a
good place for a quick-and-dirty hack. Whenever a user process sends a
message to a memory server (whether local or remote, for whatever reason),
through the memory servers' well-known addresses, the bit corresponding to
the machine number (0-15) in a special 16-bit field of the sender's process
control block is set. Then, when the process terminates, the termination
procedure (next) peeks at this special field and sends free-all messages to
all remote memory servers that the process in question has interacted with.
This insures that all memory in the entire distributed system that is
allocated to a process is tidied up when the process terminates. Like the
process server, the memory server is integrated with the kernel.
The process server then suspends the doomed process' execution and rigs the
process' context so that the next thing it executes is the process shutdown
code inside of the kernel. This shutdown code closes all of the files that
the process has opened through the standard library calls (and other server
resources held), deallocates all memory held by the process, maybe does some
other cleanup work, and then sends a special message to the process server to
remove the process control block. The process server will only accept this
special message from the process that is terminating after the first phase of
the process shutdown has been completed, to insure a proper termination. The
process control block is then deallocated and may be used again. The process
server is the only process that is allowed to manipulate process control
blocks.
Come to think of it, there is a slight problem with process initialization:
getting a copy of the arguments and environmental variables for an
asynchronously started new process. We don't want the sender to continue
before the new process has had a chance to make a copy of the arguments and
environment, so we will rig things so that it is the newly started process
that sends the reply message back to the parent process. Another dirty hack.
One possibility is to have "stateless servers". In other words, each server
does not keep track of, for example, which files a process has open or the
current file positions. Each time a read request comes in, the server opens
the file to be used, positions to the section of the file, performs the
operation, and closes the file again. This sounds like a lot of work, but
some intelligent caching makes it work efficiently. And if a user process
dies without closing all of its files, it doesn't matter since the files will
be closed anyway, logically at the completion of each operation. But, this
approach doesn't really work well with Commodore-DOS, which we will be using
for devices for which we don't have a custom device driver, so we won't use
it.
Another possibility is to have "semi-state" or "acknowledgementless" servers
(my own invention). Here, the server keeps track of, for example, which
files are open but doesn't keep the file position. When a request comes in,
the already-opened file is positioned according to the request and the file
operation takes place. If a client dies unexpectedly, the open file control
block (FCB) is left behind, but the FCB will be closed and reused after a
certain period of time. If the client actually hasn't died, then the
situation will be detected (through details not explained here) and the file
will be reopened as if nothing has happened. Other contingencies like a dead
process' name being reused are handled too. And the model works well with an
unreliable communication service. But, again, this doesn't model the
Commodore-DOS very well.
The final design considered is to associate with each process a registry of
the servers on which that process currently has resources allocated.
When a client makes an open request to the server (or some equivalent
resource-grabbing operation), the server checks to see if the client is
currently holding any other of the server's resources. If so, then the
request is processed normally. If not, then the server (or some agent on the
server's behalf) sends a message to the process server on the client's
machine telling the process server to record the fact that the client is (or
may be) holding some of the server's resources. The process server records
the server's process id in the process control block of the client, and when
the client terminates, it will send a standard "release all of the resources
that I am (may be) holding on this server" to the server as part of the
client's shutdown procedure. All of the client's open files will be closed,
etc.
In this "registry" design, servers can be completely "stateful", e.g., they
would contain both an open file entry and the file position information, and
files would always open and close when we intuitively expect them to. It is
assumed that the communication mechanism is reliable, which it is here. This
mechanism *does* model Commodore-DOS well. In fact, this idea is so nice
that I may redesign the memory allocation recovery mechanism to use this.
There is a slight possibility of a "race condition" in this mechanism, but
nothing bad can happen because of it. (This is just a note to myself: make
it so that if a process is killed while it is receive- or reply-blocked, then
ignore the reply from the server if the process id is reused; damn, there's
still a potential problem; I'll have to figure it out later; also watch out
for a distributed deadlock on the PCB list).
So, our server supports the regular file operations and implements them in
pretty much the expected way, since it is a "stateful" server. The main loop
of the server accepts a request, determines which type it is, extracts the
arguments, calls the appropriate local procedure, prepares the reply message,
replies, and goes back to the top of the loop. Each opened file is
identified by a user process by a file control block number that has meaning
inside of the server, as per usual. But, unlike with ACE, we need a special
"Dup" operation for passing open files to children. Dup increments the
"reference count" of a FCB, and the reference count is decremented every time
a close operation takes place. A file will only be "really" closed when the
reference count reaches zero. Our system will not implement any security at
this time.
Because of the abstraction of sending formatted messages to a server,
different types of disk drives (Commodore-DOS, custom-floppy, ramdisk) are
all dealt with in exactly the same way. As one slight extension, we have to
hack our devices (at least some of them) a little to be able to handle
"symbolic links" in order to integrate well with the Prefix Server which is
described next.
If an application is given an absolute pathname, it will consult the prefix
server to resolve it to the process-id of an actual server. For example, the
pathname "/fd1/bob/fred" would resolve to server "<2:floppy1571>", relative
pathname "bob/fred". Pathname "/" would resolve to server "<1:ramdisk>",
relative pathname "".
The user process would then contact the appropriate server with the relative
pathname. A user process can assume that the prefix table will not change
while the system is running, so some intelligent caching can be done. Also,
directory tokens are given out for executing a "change directory" operation,
and these server/token pairs can be used for quick relative pathname
searches. A symbolic link mechanism is needed to insure that these relative
searches always follow through correctly.
This is quite similar to the ACE-128/64 Programmer's Reference Guide, which
is available via anonymous FTP from "ccnga.uwaterloo.ca" in file
"/pub/cbm/os/ace/ace-r10-prg.doc". Release #10 of ACE was the most current
at the time of writing this article.
Implementation: someday, maybe.
Jim Brain
brain@mail.msen.com
2306 B Hartland Road
Hartland, MI 48353
(810) 737-7300 x8528
RS232 Converter
by Walter Wickersham (shadow@connected.com)
PARTS LIST:
7404 Hex Inverter IC ($0.99 at Radio Shack)
Wires, solder, etc.
Commodore User port connector (I used one off an old 1650)
Programming the Commodore RAM Expansion Units (REUs)
by Richard Hable (Richard.Hable@jk.uni-linz.ac.at)
Contents:
1) External RAM Access With REUs
The REUs provide additional RAM for the C64/128. Three types of REUs have
been produced by Commodore: the 1700, 1764 and 1750, with 128, 256 and 512
KBytes of built-in RAM, respectively. However, they can be extended up to
several MBytes.
2) RAM Expansion Controller (REC) Registers
The REC is programmed by accessing its registers. When a REU is connected
through the expansion port, these registers appear memory mapped in the
I/O-area between $DF00 and $DF0A. They can be read and written to like VIC-
and SID-registers.
3) How To Recognize The REU
Normally, the addresses between $DF02 and $DF05 are unused; values stored
there get lost. Therefore, if e.g. the values 1,2,3,4 are written to
$DF02..$DF05 and do not stay there, no REU is connected. However, even if the
values do stay there, it could be another kind of module connected that also
uses these addresses.
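A minimal sketch of that test in assembly (label names are arbitrary):

         ldx #$03
store    lda pattern,x
         sta $df02,x    ; write 1,2,3,4 to $DF02..$DF05
         dex
         bpl store
         lda #$00
         sta flag       ; assume no REU
         ldx #$03
check    lda $df02,x
         cmp pattern,x
         bne done       ; a value got lost: no REU is connected
         dex
         bpl check
         lda #$01
         sta flag       ; all four values stayed: an REU (or another module
                        ; that uses these addresses) is responding
done     rts

flag     .byte 0
pattern  .byte 1,2,3,4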
4) Simple RAM Transfer
Very few of the REU's options are necessary for the main purpose of RAM
expansion. Just set the base addresses and the transfer length, and then
write to the command register.
5) Additional Features
Swapping Memory
With the swap command, memory between the 17xx and the C64 can be exchanged.
The programming is the same as for simple RAM transfer.
Comparing Memory
No RAM is transferred. Instead, the number of bytes specified in the transfer
length register is compared. If there are differences, the FAULT bit of the
status register is set. In order to get valid information, this bit has to
be cleared before comparing. This is possible by reading the status
register.
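In code, a compare might look like this (base addresses and transfer length
are assumed to be set up exactly as for a normal transfer):

         lda $df00      ; reading the status register clears the FAULT bit
         lda #%10010011
         sta $df01      ; execute immediately: compare C64 memory with REU memory
         lda $df00
         and #%00100000 ; isolate the FAULT bit (bit 5 of the status register)
         rts            ; nonzero in A means the two areas differ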
Using All C64 Memory
Normally, C64 memory is accessed in the memory configuration selected during
writing to the command register. In order to be able to write to the command
register, the I/O-area has to be active. If RAM between $D000 and $DFFF or
character ROM is to be used, it is possible to delay the execution of the
command by using a command byte with bit 4 ("FF00") cleared. The command
will then be executed when an arbitrary value is written to address $FF00.
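A sketch of the trick (direction, addresses and length are assumed to be set
up already, as in the earlier example):

         sei            ; no IRQs while the KERNAL ROM is banked out
         lda #%10000000
         sta $df01      ; execute C64 -> REU, but only on a write to $FF00
         lda $01
         pha
         and #%11111000
         sta $01        ; select the all-RAM configuration
         sta $ff00      ; any write to $FF00 now starts the DMA
         pla
         sta $01        ; restore the old configuration
         cli
         rts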
6) Transfer Speed
During DMA the CPU is halted--the memory access cycles normally available to
the CPU are used to access one byte each cycle. Therefore, with the screen
and sprites switched off, one byte is transferred in every clock cycle
(985248 per second on PAL machines); moving 8 KBytes, for example, takes 8192
cycles, or roughly 8.3 milliseconds. If the screen is on or sprites are
enabled, transfer is a bit slower, as the VIC sometimes accesses RAM
exclusively. Comparing memory areas is as fast as transferring. (Comparison
is stopped once the first difference is found.) Swapping memory is only half
as fast, because two C64 memory accesses per byte (read & write) are
necessary.
7) Interrupts
By setting certain bits in the interrupt mask register, IRQs at the end of a
DMA can be selected. However, as the CPU is halted during DMA, a transfer or
comparison will always be finished after the store instruction into the
command register or $FF00. Therefore, there is no need to check for an "END
OF BLOCK" (bit 6 of status register) or to enable an interrupt.
8) Executing Code In Expanded Memory
Code in expanded memory has to be copied into C64 memory before execution.
This is a disadvantage compared with bank-switching systems. However, bank
switching can be simulated with the SWAP command. This is done, e.g., in
RAMDOS: only 256 bytes of C64 memory are occupied, and the 8 KByte RAM disk
driver is swapped in whenever it is needed. Too much swapping is one reason
why RAMDOS is relatively slow at sequential file access.
9) Other Useful Applications Of The REU
The REC is not only useful for RAM transfer and comparison.
10) Comparison Of Bank Switching And DMA
When comparing bank switching and DMA for memory expansion, I think DMA is
the more comfortable method to program. It is also faster in most cases.
The disadvantage that code cannot be executed directly in external memory can
be minimized by always copying only the necessary parts into C64 memory.
Executing the code will then take much more time than copying it, so the
copying overhead hardly matters.
A Different Perspective: Three-Dimensional Graphics on the C64
by Stephen Judd (judd@merle.acns.nwu.edu) and
George Taylor (yurik@io.org)
Introduction
We've all seen them: neat-looking three-dimensional graphics tumbling around
on a computer. But how is it done? In particular, how would you do it on a
Commodore-64? Nowadays the typical answer to the first question is "Just
use the functions in 3dgrphcs.lib" (or "Beats me."). The answer to the
second is either "Well an elite coder like me can't let secrets like that
out" or else "What, you mean people still use those things?"
First Things First
Before we begin, you are going to have to get a few ideas into your head.
First and foremost is the coordinate system we will be using: a right-handed
coordinate system. In our system, the x-axis is going to come out towards
you, the y-axis is going to go to your right, and the z-axis is going to go
"up".
Rotations in the Plane
One of the cool things to do with a three-dimensional object is of course to
rotate it. To understand how to do this, we need to first look at rotations
in the plane. A little later on, this article is going to assume you know
how to multiply two matrices together. We will also lean heavily on the
trigonometric angle-addition formulas:
cos(A+B) = cos(A)cos(B) - sin(A)sin(B)
sin(A+B) = cos(A)sin(B) + sin(A)cos(B)
Let us take a look at rotations in the plane; that is, in two dimensions.
Think of the typical x-y axis. Let's say that we have a point at [x1 y1]
and want to rotate it by an angle B, about the origin, so that we end up at
the rotated coordinate [x2 y2]. What are x2 and y2? The easiest way to
find them is to use polar coordinates: write the point as x1 = r*cos(t) and
y1 = r*sin(t), where r is its distance from the origin and t is its angle
from the x-axis. Rotating by B simply adds B to that angle, so
x2 = r*cos(t+B)
= r*(cos(t)cos(B) - sin(t)sin(B))
= x1*cos(B) - y1*sin(B).
Similarly,
y2 = r*sin(t+B) = x1*sin(B) + y1*cos(B).
In matrix form, this can be written as
[x2] = [cos(B) -sin(B)] [x1]
[y2]   [sin(B)  cos(B)] [y1]
How do we extend this to three dimensions? Easy. The key thing to realize
here is that, in three dimensions, the above rotations are really rotations
about the z-axis. At any point along the z-axis we could take a thin slice
of the three-dimensional space (so that our slice is parallel to the x-y
plane) and pretend that we are really in two-dimensional space. Therefore,
to rotate a point about the z-axis the x- and y-equations are the same as
above, and the z-coordinate stays fixed. In matrix form this is
[x2]   [cos(B) -sin(B)  0] [x1]
[y2] = [sin(B)  cos(B)  0] [y1]
[z2]   [  0       0     1] [z1]
Similarly, it is easy to see that
[x2]   [1    0       0   ] [x1]
[y2] = [0  cos(B) -sin(B)] [y1]
[z2]   [0  sin(B)  cos(B)] [z1]
is a rotation about the x-axis, and that
[x2]   [ cos(B)  0  sin(B)] [x1]
[y2] = [   0     1    0   ] [y1]
[z2]   [-sin(B)  0  cos(B)] [z1]
is a rotation about the y-axis. You may have noticed that the signs of
sin(B) have been reversed; this is because in our right-handed coordinate
system the z-x plane is "backwards": in the z-x plane x increases to the
left, while z increases "up".
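A quick worked example may help: rotating the point [1 0 0] by 90 degrees
about the z-axis should land it on the y-axis. In BASIC (a throwaway sketch,
angle in radians):
10 b=3.14159265/2: rem 90 degrees
20 x1=1: y1=0: z1=0
30 x2=cos(b)*x1-sin(b)*y1
40 y2=sin(b)*x1+cos(b)*y1
50 z2=z1
60 print x2;y2;z2: rem prints (very nearly) 0 1 0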
Projections
Now, we could just do a simple projection and set the z-coordinate equal to
zero, but in doing so we have eliminated some of the information, and it
won't look very three-dimensional to our eyes. So we need to think of a
better method.
        |
        |
       /|
 lens / |film
-----*--|------------ z-axis
    /   |
   /    |
  /     z=1
Somewhere out along the z-axis is an object, which will hopefully show up on
the film :-) (then again, maybe it won't!)
What does this object look like on the film?
Let's say one of the points of this something is [x y z]. Where does this
point come out on the film? Since the lens is at the origin, we want to
draw the line from [x y z] through the origin (since that's where our lens
is) and find the point [x1 y1 1] where it hits the film. The parametric
equation of this line is
t * [x y z]
so that we want to find the intersection of this line and the film:
t * [x y z] = [x1 y1 1].
The z-coordinate tells us that t*z=1, or t=1/z. If we then substitute this
in the above equation, we find that
x1 = x/z y1 = y/z
If, instead of placing the film at z=1 we place it at z=d, we get
x1 = d*x/z y1 = d*y/z
These then are the projection equations. Geometrically you can see that by
changing d all you will do is to "magnify" the object on the film. Anyone
who has watched an eclipse with a little pinhole camera has seen this.
One more detail: the object's coordinates generally need to be shifted so
that the whole object sits on one side of the lens, so we also translate it
along the z-axis before projecting. The equations become
x1 = d*x/(z-z0)     y1 = d*y/(z-z0)
where z0 is a translation amount that at the very least makes sure that
z-z0 < 0.
Implementation
Now that we've got the theory under our belt, we need to think about
implementing it on the computer. As a concession to all the programmers who
immediately skipped to this section, most of the discussion will be at a
reasonably high level.
Implementation: Rotations
We've got some more heavy-duty decision making ahead of us. We could
implement this in several ways. We could apply each rotation individually,
that is, we could rotate around the z-axis, then use these rotated points
and rotate them around the y-axis, etc. The other way is to notice that each
rotation is just a matrix multiplication, so rotating a point v around all
three axes in succession means
XYZv = v'
where v' is the new point after all the rotation transformations. (You might
call it a conflagration of rotation transformations). Now the magic of
linear algebra begins to work: matrix multiplication is associative, which
is a fancy way of saying that (AB)C = A(BC). For us this means that I can
multiply all three matrices X, Y, and Z together to get a single new matrix M:
M = XYZ
Mv= v'
"But," you may say, "we have to do the same number of multiplications to get
M as we do to apply each rotation separately! How is this supposed to
help?" This is how it is supposed to help:
The first advantage, less memory, should be clear. A custom character set
takes up 2k. A bitmap, on the other hand, takes up 8k.
sin(a+b) = sin(a)cos(b) + cos(a)sin(b)
cos(a+b) = cos(a)cos(b) - sin(a)sin(b)
We will also use the fact that cosine is even and sine is odd, that is
cos(-a) = cos(a)
sin(-a) = -sin(a)
Using the above identities it is easy to see that
sin(a)sin(b) = (cos(a-b) - cos(a+b))/2
cos(a)cos(b) = (cos(a+b) + cos(a-b))/2
sin(a)cos(b) = (sin(a+b) + sin(a-b))/2
We are going to rotate first around the z-axis by an amount sz, then the
y-axis by an amount sy, then the x-axis by an amount sx. Why rotate in that
order? Why not.
M = XYZ
If you multiply everything out (and I encourage you to do so, not only for
practice, but also as a double-check of my work), and use the above trig
identities, the result is:
    [A B C]
M = [D E F]
    [G H I]
Where
A = (cos(t1)+cos(t2))/2
B = (sin(t1)-sin(t2))/2
C = sin(sy)
D = (sin(t3)-sin(t4))/2 + (cos(t6)-cos(t5)+cos(t8)-cos(t7))/4
E = (cos(t3)+cos(t4))/2 + (sin(t5)-sin(t6)-sin(t7)-sin(t8))/4
F = (sin(t9)-sin(t10))/2
G = (cos(t4)-cos(t3))/2 + (sin(t6)-sin(t5)-sin(t8)-sin(t7))/4
H = (sin(t3)+sin(t4))/2 + (cos(t6)-cos(t5)+cos(t7)-cos(t8))/4
I = (cos(t9)+cos(t10))/2
with
t1 = sy-sz
t2 = sy+sz
t3 = sx+sz
t4 = sx-sz
t5 = sx+sy+sz = sx+t2
t6 = sx-sy+sz = sx-t1
t7 = sx+sy-sz = sx+t1
t8 = sy+sz-sx = t2-sx
t9 = sy-sx
t10= sy+sx
How is this supposed to be the "simplified" version? If you look closely,
there are no multiplies. We can calculate the entire rotation matrix M in
about the same time as it would take to do two multiplications. This also
means that the associated problem with multiplications, loss of accuracy, is
now gone.
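If you would like to convince yourself without grinding through all the
algebra, here is a throwaway BASIC check of the first row of M (the first
row never involves sx, since the x-rotation leaves x alone); the right-hand
values cos(sy)cos(sz), -cos(sy)sin(sz) and sin(sy) are what multiplying out
XYZ gives directly:
10 sy=.7: sz=1.1: rem arbitrary test angles in radians
20 t1=sy-sz: t2=sy+sz
30 a=(cos(t1)+cos(t2))/2: b=(sin(t1)-sin(t2))/2: c=sin(sy)
40 print a;cos(sy)*cos(sz)
50 print b;-cos(sy)*sin(sz)
60 print c;sin(sy)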
Implementation: Projections
Recall that the projection equation is
x' = d*x/(z-z0)
y' = d*y/(z-z0)
It looks as if we have gone from a bunch of sneaky additions to
multiplications and divisions! Yuck. The division, at least, is easy to get
rid of: since z can only take on 256 different values, we can precompute
d/(z-z0) for each of them and store the results in a 256-byte table. The
divide then turns into a simple indexed load,
LDX z
LDA table,x
and the accumulator ends up holding the absolute (i.e. non-offset) value
A=d/(z-z0). What if we want to
change d? You could put a little piece of code into your program which
multiplies by a number less than one, and let d represent the maximum value
for d which makes the code work. But for the moment we won't bother with
that -- one thing at a time!
10 bz=49152: rem table base; 49152 is just a suggestion, put it anywhere free
20 d=45:z0=3:z=-128:dz=1
30 for i=0 to 255
40 q%=64*d/(64*z0-z):if q%>127 then q%=127
50 poke bz+i,q%:z=z+dz
60 next
Note that the offset chosen forces q% to always be positive. This fact can
be made use of in the multiplication routine (but isn't in the source code).
Fast Signed 8-bit Multiply
A binary number, say P = %10011000, looks like the following:
P = 1*128 + 0*64 + 0*32 + 1*16 + 1*8 + 0*4 + 0*2 + 0*1
Therefore, if we want to multiply P by another number, 13 say, we find that
13*P = 13*128 + 0*64 + 0*32 + 13*16 + ...
that is to say, if there is a one in bit position N, then the new number
will have 13*2^N in it. So, to multiply two numbers we find out what bit
positions are high, and then add the other number*2^N to the result. This
doesn't seem too fast. Here is a trick: we can write 2^N as 256/2^(8-N).
Adding the other number into the high byte of a sixteen-bit result multiplies
it by 256, and shifting the whole result right 8-N times divides it by
2^(8-N); interleave the conditional adds with the shifts and the whole
multiply collapses into eight shift-and-maybe-add steps, with no explicit 2^N
anywhere.
Drawing a line
The geometric idea is: given an initial point [x1 y1], we want to draw a
line to the point [x2 y2]! Now we want to do this on a computer by taking
one step at a time, from point to point. The idea is to make it fast, and
since we're on a C64 there aren't any MUL or DIV instructions. The first
thing to figure out is whether the line is mostly horizontal or mostly
vertical, in other words which of the following is larger:
dx = |x2-x1|
dy = |y2-y1|
where | | denotes absolute value. Let's assume that the larger one is dx, and
that the
variable x is going to run from x1 to x2. Therefore, we want to increase x
by one at each step, and we want to increase y by some fractional amount (If
dy were larger we would want to take big steps in the y-direction). But we
don't want to calculate this fractional number. We do, however, want to
take a certain number of steps in the x-direction before taking a step in
the y-direction. Call that number of x-steps k; then we need
dx/k = dy
which gives
k = dx/dy
where dx and dy are as above, the total number of steps to be taken in the
x- and y-directions respectively. What is dx/dy? We don't care. Instead,
every time we step in x, we need to increase a counter by the amount dy. As
soon as this counter is larger than dx, we have successfully divided dy into
dx, and so simply reset the counter (in a special way, so that we keep any
remainder from the division) and take a step in y.
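Here is the counter idea as a little BASIC sketch; it assumes dx >= dy and
that both coordinates increase, it prints the points instead of calling a
plot routine, and it steps y as soon as the counter reaches dx (keeping the
remainder):
10 x1=0: y1=0: x2=20: y2=7: rem endpoints, chosen so dx >= dy
20 dx=x2-x1: dy=y2-y1
30 x=x1: y=y1: c=0
40 print x;y: rem this is where pplot would go
50 if x=x2 then end
60 x=x+1: c=c+dy: rem one step in x, counter grows by dy
70 if c>=dx then c=c-dx: y=y+1: rem counter full: step y, keep the remainder
80 goto 40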
Plotting a point
In the line routine presented earlier, the nebulous statement PPLOT was
written. Now we come to plotting a point in all its gory detail. First, a
word about the display: we draw into a custom character set, and we keep two
of them so that one can be shown while the other is being drawn; flipping
between the two is just a matter of toggling a bit of the VIC's memory
control register:
LDA VMCSB ;VMCSB=$D018
EOR #%00000010 ;Flip the bit
STA VMCSB
True, clearing the buffer each time is a bit slow, but for our purposes it
will do just fine. Now, why a character set rather than a hires bitmap? In a
hires bitmap the bytes are grouped eight to a character cell, side by side
across the screen:
00 08 ...
01 09
02 0A
... ...
07 0F
where the number represents the offset of the byte. This is fine for some
things, but calculating the position of a pixel is tricky. A custom character
map has two advantages here. The first advantage, less memory, should be
clear: a custom character set takes up 2k, while a bitmap, on the other hand,
takes up 8k. The second is that we can represent our data any way we want.
In particular, we can organize our display to look like the following:
00 80 ...
01 81
02 82
... ...
7D FD
7E FE
7F FF ...
Or, in graphic form
@P... etc.
AQ
BR
CS
..
O<- (the back-arrow)
What we have done is, instead of putting characters side-by-side like a
hires bitmap does, we put them on top of each other. The above represents a
16x16 character array, which is a 128x128 pixel array. Now the y-coordinate
is a direct index into the column of characters that the point falls in.
That is, for the leftmost column, base+Y = memory location of the point, and
each character column to the right simply adds another 128 bytes.
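In BASIC the address calculation looks like this (a sketch only: it assumes,
just for illustration, that the character set sits at 8192 = $2000, and it
does not set up the VIC or fill the screen with the right character codes):
10 ba=8192: rem start of the stacked character set (an assumption)
20 x=60: y=100: rem pixel to plot, both 0-127
30 ad=ba+128*int(x/8)+y: rem each 8-pixel-wide column is 128 bytes tall
40 m=x and 7: b=128
50 if m>0 then for i=1 to m: b=b/2: next: rem bit 7 is the leftmost pixel
60 poke ad,peek(ad) or b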
Post Script
That's all there is to it. Well, OK, there are a few details we left out,
but you can figure them out on your own. You can always look to the source
code to see how we overcame the same problem. The program is set up in a
way that you can experiment around with the projection parameters d and z0,
to see what changing them does to the math.
Da Code
DESIGN OF A 'REAL' OPERATING SYSTEM FOR THE 128: PART I
by Craig Bruce
0. PREFACE
I originally planned to write this entire article all in one go, but its
size, complexity, and scope of required design decisions have forced me to
split this article into two pieces. (Not to mention my own poor time
management in writing this article). This part gives an introduction to what
I am talking about and discusses, at an abstract level, how the system will
work. The next part will dive into all of the nuts and bolts of the
low-level design.
1. INTRODUCTION
The full title of this article should be "Design of a Multitasking
Distributed Microkernel Operating System for the Good Old '128". For
purposes of discussion, we will call the new operating system "BOS". A
"multitasking" operating system (OS) is one that is able to execute more than
one "process" "concurrently". A Process is an instance of a running program.
"Concurrently" means that the programs appear to be running at the same time,
although in reality they are not, because there is only one* processor in the
128 and it can only do one thing at a time.
2. GENERAL DESIGN OVERVIEW
There are a number of high-level design decisions that must be made before
going into a detailed design. This section discusses these decisions.
2.1. SPECIAL C128 FEATURES
The C-128 has a minimum set of special features that make it feasible to run
a multitasking operating system, as opposed to earlier machines like the
C-64. The simplest special feature that the C128 has is *enough memory*. The
64K of the C64 just isn't enough. The 128K of the C128 is just barely
enough. Expanded internal memory makes the proposition even easier.
2.2. NETWORK
The OS should be designed to run on a system of between 1 and N C-128's,
where N has a maximum of something like 8 or 16. We'll choose 16 for our
software design. The theory is that the style of operating system that we
are proposing makes the step between 1 and N C-128's a (relatively) easy one,
so why not go for it. Also, if N were to become some number like 256 or
65536, then we could start kicking some serious ass performance-wise, for
certain classes of computations. Also, I happen to own two C-128's and I
have already constructed a parallel-port network (a Jedi's weapon!), so I
might as well use it.
2.3. PROCESSES
A process is a user program that is in an active state of execution. In
uni-tasking operating systems like ACE or the Commodore Kernal, there is only
one process in the entire system. In a multi-tasking system, there are, duh,
multiple processes. Each process executes as an independently running
program, in isolation, logically as if it were the only process in the
system. Or, as if there were N 8502's available inside of the 128 and one of
them were used to run each program you have loaded.
2.4. APPLICATION PROGRAM INTERFACE
To take advantage of existing software, we would like our OS to provide an
application program interface (API) that is identical to that of the
ACE-128/64 operating system. In fact, this is the *real* reason why ACE was
developed -- as a stepping stone toward a real operating system. The ACE
Programmer's Reference Guide, which describes the API, is available from
"ccnga.uwaterloo.ca".
2.5. MEMORY MANAGEMENT
The memory management of BOS is analogous to that of ACE. There are two
different classes of memory: near and far. Near memory is on the same bank
as a program and can be accessed directly by processor instructions. Far
memory can only be accessed through the kernel by the special kernel calls
Fetch and Stash and must be specially allocated to a process by the operating
system. Note that near memory is considered a sub-class of far memory; the
far-memory primitives can be used to access near memory.
2.6. COMMUNICATION
In the type of system that is envisioned, processes are not strictly
independent and competitive; many must cooperate and communicate to get work
done. To facilitate this interprocess communication (IPC), a particular
organization is chosen: the Remote Procedure Call (RPC) paradigm. RPC is a
message-passing scheme that is used with the heavily hyped Client/Server
system architecture model. It reflects the implicit operations that take
place when you call a local procedure (a subroutine): the call, the entry,
the processing, and the return. The kernel provides three primitives for
RPC: one for a client to send a request and block until the reply comes back,
one for a server to receive the next pending request, and one for the server
to send its reply to the client.
2.7. SYSTEM SERVERS
Since all that a user program has for IPC and I/O is the RPC mechanism, a
number of system server processes must be set up to allow a user program to
do anything useful. These special servers execute as if they were regular
user programs but provide services that would normally be implemented
directly in the operating system kernel. There are a number of advantages and
disadvantages to organizing a system in this way. A big advantage is that it
is easier to build a modular system like this, and a big disadvantage is that
you lose some performance to the overhead of the IPC mechanism.
2.7.1. PROCESS SERVER
This server is responsible for starting and terminating user processes.
Because of the way that the procedure is organized, the process server is
actually quite responsive despite all of the work that must be done in order
to start up and terminate a user process.
2.7.2. MEMORY SERVER
The memory server handles the dynamic allocation and deallocation of far
memory. The client specifies in the request message the exact types of
memory that it can use, and the server gets the memory, sets the ownership to
the process, and returns a pointer. Deallocation of some of the memory owned
by a process is handled easily.
MORE PROCESS TERMINATION
Come to think of it, I should talk more about process termination. The best
idea would probably be for a user process to terminate itself, in the same
way that it bootstraps itself. A client process that wants to kill some other
process sends a termination message to the process server. It is a valid
situation for a process to commit suicide. The termination message includes
the process id to be terminated and the exit code for the termination.
2.7.3. FILE SERVERS
Each disk drive in the system has a special server that provides an interface
for executing Open, Read, Write, Close, and a number of other common file
operations. A big problem with distributed operating systems is resource
reclamation for processes that die. There are a few ways to provide this,
and each has implications about the overall design of a server.
2.7.4. PREFIX SERVER
The prefix server idea is stolen from the computer science literature about a
network operating system called "Sprite". The prefix server simply provides
a pathname lookup service for the pathnames of different disk file
and device servers. This is needed to provide a single, global, unified
pathname space on a system of multiple distributed file servers. It works a
lot like the "mount table" in Unix. Its prefix table looks something like
the following:
2.7.5. DEVICE SERVERS
Device servers are just another type of file server, except they control a
specific device other than a regular disk device, and they are likely to
support some custom operations and return error codes if some disk operations
are attempted. The interface is identical to a file server for convenience.
2.7.6. CONSOLE SERVER
Just a specific device server. It handles window management and console
calls, like WinClear, WinPut, GetKey, and ConWrite, that are used in ACE.
2.8. ASYNCHRONOUS EVENT HANDLING
As mentioned in the Process section above, there are many external events
that a process may have to wait for, including: modem characters, disk drive
operations (if they are custom-programmed correctly), mouse & joystick
movements, and real-time delays. There will be an AwaitEvent() kernel
primitive to allow a process to wait for one of these events to happen.
Normally, the only processes that wait for these events will be device
drivers. The kernel will also have to do some low-level processing for
some devices (like the modem and keyboard) to ensure that things don't become
unnecessarily inefficient.
3. KERNEL DESIGN
Next time.
4. SYSTEM SERVER DESIGN
Next time.
5. APPLICATION PROGRAM INTERFACE
Next time.
6. CONCLUSION
Next time.